The performance of the JVM (Java Virtual Machine) is quite amazing.
This means that we can use B4J to build apps that handle huge files.
I downloaded a 3.7 GB log file from this server. Each line represents an HTTP request. I wanted to find the most frequent IP addresses.
We can find the IP address with this Regex pattern:
B4X:
'The dots are escaped so they match literal dots, and every group accepts one or more digits
Dim m As Matcher = Regex.Matcher("^(\d+\.\d+\.\d+\.\d+)", line)
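Before running over the whole file, it can help to try the pattern on a single line. The following is a minimal sketch that uses a pattern of this shape, with the dots escaped and every group accepting one or more digits. The sample line is made up for illustration and is not taken from the real log file.
B4X:
'Hypothetical sample line (documentation IP range), not from the real log
Dim sample As String = "203.0.113.42 - - [01/Jan/2015:00:00:01 +0000] GET /index.html 200 512"
Dim m As Matcher = Regex.Matcher("^(\d+\.\d+\.\d+\.\d+)", sample)
If m.Find Then
    Log(m.Match)    'text matched by the whole pattern
    Log(m.Group(1)) 'first capture group; here it is the same text
Else
    Log("No IP address at the start of the line")
End If
Note that Find must be called before Match or Group is read.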
The app goes over all the lines in the file and counts each IP address in a Map named ips:
B4X:
Dim count As Int
If ips.ContainsKey(m.Match) Then
    count = ips.Get(m.Match)
Else
    count = 0
End If
ips.Put(m.Match, count + 1)
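The same counting step can be written more compactly with Map.GetDefault, which returns the given default value when the key is missing. This is only an alternative sketch; it assumes the same ips map and the matcher m from the surrounding code.
B4X:
'Returns 0 for an address that has not been seen yet
Dim count As Int = ips.GetDefault(m.Match, 0)
ips.Put(m.Match, count + 1)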
To sort the data, we copy each entry into a list of IpAndCount items and call List.SortType:
B4X:
Dim list1 As List
list1.Initialize
For i = 0 To ips.Size - 1
    Dim ic As IpAndCount
    ic.Initialize
    ic.ip = ips.GetKeyAt(i)
    ic.count = ips.GetValueAt(i)
    list1.Add(ic)
Next
list1.SortType("count", False) 'False = descending order
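Once the list is sorted in descending order, the first items are the most frequent addresses. A small sketch that prints the top entries; the limit of 10 is an arbitrary choice for illustration, and list1 and IpAndCount are the names used above.
B4X:
'Print up to ten entries, or fewer if the list is shorter
For i = 0 To Min(10, list1.Size) - 1
    Dim ic As IpAndCount = list1.Get(i)
    Log(ic.ip & ": " & ic.count)
Next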
The logs show that it took 30 seconds to parse 13 million lines (the most frequent IP address is a Google Bot IP).
Let's say that you want to do all kinds of things with this map. You can of course add it to a SQL database. However, there is a simpler solution: serialize the map with RandomAccessFile.WriteObject. This allows you to write the whole map to a binary file.
B4X:
raf.Initialize(File.DirTemp, serializedMapFile, False)
raf.WriteObject(ips, True, 0)
raf.Close
In the next run you can load the map and work with it instead of going over the 3.7GB file.
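Loading is the mirror image of saving. A minimal sketch, assuming the map was written at position 0 of the same file, as in the snippet above:
B4X:
Dim raf As RandomAccessFile
raf.Initialize(File.DirTemp, serializedMapFile, False)
Dim ips As Map = raf.ReadObject(0) 'read the object stored at position 0
raf.Close
Log("Loaded " & ips.Size & " addresses")
The file in File.DirTemp acts as a cache. Delete it to force the program to rebuild the map from the log file.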
The complete code:
B4X:
Sub Process_Globals
    Type IpAndCount (ip As String, count As Int)
    Private serializedMapFile As String = "ips.dat"
End Sub

Sub AppStart (Args() As String)
    Dim start As Long = DateTime.Now
    Dim ips As Map
    ips.Initialize
    'Args(0) is the full path to the log file
    Log("File size: " & NumberFormat(File.Size("", Args(0)) / (1024 * 1024), 0, 0) & " MB")
    Dim raf As RandomAccessFile
    Dim lineCounter As Int
    If File.Exists(File.DirTemp, serializedMapFile) = False Then
        Log("Building map file...")
        Dim tr As TextReader
        tr.Initialize(File.OpenInput("", Args(0)))
        Dim line As String = tr.ReadLine
        Do While line <> Null
            lineCounter = lineCounter + 1
            'dots are escaped and every group accepts one or more digits
            Dim m As Matcher = Regex.Matcher("^(\d+\.\d+\.\d+\.\d+)", line)
            If m.Find Then
                Dim count As Int
                If ips.ContainsKey(m.Match) Then
                    count = ips.Get(m.Match)
                Else
                    count = 0
                End If
                ips.Put(m.Match, count + 1)
            Else
                Log(line) 'not expected to happen
            End If
            If lineCounter Mod 1000000 = 0 Then
                Log("Line: " & NumberFormat(lineCounter, 0, 0) & ", Map size: " & ips.Size & _
                    ", time: " & (DateTime.Now - start) & " ms")
            End If
            line = tr.ReadLine
        Loop
        tr.Close
        'save the map for future use
        raf.Initialize(File.DirTemp, serializedMapFile, False)
        raf.WriteObject(ips, True, 0)
    Else
        'read the map from the file
        Log("Loading map file...")
        raf.Initialize(File.DirTemp, serializedMapFile, False)
        ips = raf.ReadObject(0)
    End If
    raf.Close
    'copy the map entries to a list so they can be sorted by count
    Dim list1 As List
    list1.Initialize
    For i = 0 To ips.Size - 1
        Dim ic As IpAndCount
        ic.Initialize
        ic.ip = ips.GetKeyAt(i)
        ic.count = ips.GetValueAt(i)
        list1.Add(ic)
    Next
    list1.SortType("count", False) 'descending
    ic = list1.Get(0) 'get the top item
    Log(ic.ip & ": " & ic.count)
    'lineCounter stays 0 when the map is loaded from the serialized file
    Log("Total time: " & (DateTime.Now - start) & " ms, number of lines: " & _
        NumberFormat(lineCounter, 0, 0))
End Sub
JVM Arguments
This program didn't need much memory. However, there are cases where you need to keep many objects in memory and the program runs out of available memory.
You can increase the memory available to the JVM with the following JVM arguments (-Xms sets the initial heap size and -Xmx sets the maximum heap size):
B4X:
#VirtualMachineArgs: -Xms1024m -Xmx1024m
Note that 32-bit JVMs are limited to about 2 GB.
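To check that the setting took effect, the program can ask the JVM for its maximum heap size. This is a sketch under one assumption: the jJavaObject library is referenced in the project. It calls the standard Java method Runtime.getRuntime().maxMemory().
B4X:
'Assumes the jJavaObject library is checked in the Libraries tab
Dim rt As JavaObject
rt = rt.InitializeStatic("java.lang.Runtime").RunMethod("getRuntime", Null)
Dim maxBytes As Long = rt.RunMethod("maxMemory", Null)
Log("Max heap: " & NumberFormat(maxBytes / (1024 * 1024), 0, 0) & " MB")
The reported value is usually close to the -Xmx value, though it can be slightly lower depending on the JVM and the garbage collector.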