B4J Tutorial Parsing huge text files

Erel

The JVM (Java Virtual Machine) performance is quite amazing.

This means that we can use B4J to build apps that handle huge files.
I downloaded a 3.7 GB log file from this server. Each line represents an HTTP request. I wanted to find the most frequent IP addresses.

We can find the IP address with this Regex pattern:
B4X:
Dim m As Matcher = Regex.Matcher("^(\d+\.\d+\.\d+\.\d+)", line)
The app goes over all the lines in the file and counts each IP address in a Map named ips:

B4X:
Dim count As Int
If ips.ContainsKey(m.Match) Then
   count = ips.Get(m.Match)
Else
   count = 0
End If
ips.Put(m.Match, count + 1)
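The same update can be written more compactly. This is only a sketch, assuming Map.GetDefault is available in your B4J version; it returns the supplied default value when the key is missing. The variable ip is introduced here for illustration:
B4X:
'Read the current count (0 if the address was not seen before) and store count + 1.
Dim ip As String = m.Match
Dim count As Int = ips.GetDefault(ip, 0)
ips.Put(ip, count + 1)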
To sort the data we copy it into a list of IpAndCount items (a custom type declared in Process_Globals) and call List.SortType, sorting by the count field in descending order:
B4X:
Dim list1 As List
list1.Initialize
For i = 0 To ips.Size - 1
   Dim ic As IpAndCount
   ic.Initialize
   ic.ip = ips.GetKeyAt(i)
   ic.count = ips.GetValueAt(i)
   list1.Add(ic)
Next
list1.SortType("count", False)
According to the logs, it took 30 seconds to parse 13 million lines (the most frequent IP address is a Google Bot IP).
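Since the list is sorted in descending order, the most frequent addresses are at the start. A minimal sketch for logging more than the single top entry; the limit of 10 and the variable name item are arbitrary choices made for illustration:
B4X:
'Log the ten most frequent addresses (fewer if the list is shorter).
For i = 0 To Min(10, list1.Size) - 1
   Dim item As IpAndCount = list1.Get(i)
   Log(item.ip & ": " & item.count)
Next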

Let's say that you want to do all kinds of things with this map. You can of course add it to a SQL database. However, there is a simpler solution: serialize the map with RandomAccessFile.WriteObject. This writes the whole map to a binary file.
B4X:
raf.Initialize(File.DirTemp, serializedMapFile, False)
raf.WriteObject(ips, True, 0)
raf.Close
In the next run you can load the map and work with it instead of going over the 3.7GB file.
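Loading is the mirror image of saving. A short sketch, assuming the file was written at position 0 as shown above and that the RandomAccessFile library is referenced in the project:
B4X:
'Read the serialized map back from the start of the file.
Dim raf As RandomAccessFile
raf.Initialize(File.DirTemp, serializedMapFile, False)
Dim ips As Map = raf.ReadObject(0)
raf.Close
Log("Loaded " & ips.Size & " addresses")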

The complete code:
B4X:
Sub Process_Globals
   Type IpAndCount (ip As String, count As Int)
   Private serializedMapFile As String = "ips.dat"
End Sub

Sub AppStart (Args() As String)
   Dim start As Long = DateTime.Now
   Dim ips As Map
   ips.Initialize
   Log("File size: " & NumberFormat(File.Size("", Args(0)) / (1024 * 1024), 0, 0) & " MB")
   Dim raf As RandomAccessFile
   If File.Exists(File.DirTemp, serializedMapFile) = False Then
     Log("Building map file...")
     Dim tr As TextReader
     tr.Initialize(File.OpenInput("", Args(0)))
     Dim lineCounter As Int
     Dim line As String = tr.ReadLine
     Do While line <> Null
       lineCounter = lineCounter + 1
       Dim m As Matcher = Regex.Matcher("^(\d+\.\d+\.\d+\.\d+)", line)
       If m.Find Then
         Dim count As Int
         If ips.ContainsKey(m.Match) Then
           count = ips.Get(m.Match)
         Else
           count = 0
         End If
         ips.Put(m.Match, count + 1)
       Else
         Log(line) 'not expected to happen
       End If
       If lineCounter Mod 1000000 = 0 Then
         Log("Line: " & NumberFormat(lineCounter,0,0) & ", Map size: " & ips.Size & _
           ", time: " & (DateTime.Now - start) & " ms")
       End If
       line = tr.ReadLine
     Loop
     tr.Close
     'save the map for future use
     raf.Initialize(File.DirTemp, serializedMapFile, False)
     raf.WriteObject(ips, True, 0)
   Else
     'read the map from the file
     Log("Loading map file...")
     raf.Initialize(File.DirTemp, serializedMapFile, False)
     ips = raf.ReadObject(0)
   End If
   raf.Close
   
   Dim list1 As List
   list1.Initialize
   For i = 0 To ips.Size - 1
     Dim ic As IpAndCount
     ic.Initialize
     ic.ip = ips.GetKeyAt(i)
     ic.count = ips.GetValueAt(i)
     list1.Add(ic)
   Next
   list1.SortType("count", False)
   ic = list1.Get(0) 'get the top item
   Log(ic.ip & ": " & ic.count)
   
   Log("Total time: " & (DateTime.Now - start) & " ms, number of lines: " & _
     NumberFormat(lineCounter, 0, 0))
   
End Sub
JVM Arguments

This program didn't need much memory. However, there are cases where you need to keep many objects in memory and the app runs out of available memory.
You can increase the memory available to the JVM with the following JVM arguments, which set the initial (-Xms) and maximum (-Xmx) heap size:
B4X:
#VirtualMachineArgs: -Xms1024m -Xmx1024m
Note that 32-bit JVMs are limited to 2 GB.
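To check which limit is actually in effect, the app can ask the JVM at runtime. This is a sketch rather than part of the original program; it assumes the JavaObject library is referenced and calls the standard java.lang.Runtime.maxMemory method:
B4X:
'Log the maximum heap size the JVM will try to use.
Dim rt As JavaObject
rt = rt.InitializeStatic("java.lang.Runtime").RunMethod("getRuntime", Null)
Dim maxBytes As Long = rt.RunMethod("maxMemory", Null)
Log("Max heap: " & NumberFormat(maxBytes / (1024 * 1024), 0, 0) & " MB")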
 

stevel05

Very impressive, I need to learn more about Regex.
 

stevel05

Using a Map that way is a fantastic idea, extremely useful for parsing and analyzing all kinds of data throughput for debugging. I've used it several times already. Another great time saver.
 

stevel05

The #VirtualMachineArgs line goes in the project attributes region at the top of the Main module.
 