B4J Tutorial Parsing huge text files

Discussion in 'B4J Tutorials' started by Erel, Nov 24, 2013.

  1. Erel

    Erel Administrator Staff Member Licensed User

    The JVM (Java Virtual Machine) performance is quite amazing.

    This means that we can use B4J to build apps that handle huge files.
    I downloaded a 3.7 GB log file from this server. Each line represents an HTTP request. I wanted to find the most frequent IP addresses.

    We can find the IP address with this Regex pattern:
    Code:
    Dim m As Matcher = Regex.Matcher("^(\d+\.\d+\.\d+\.\d+)", line)
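    To check the pattern on its own, here is a minimal sketch run against a made-up log line (the sample line and the 203.0.113.7 address are assumptions for illustration, not taken from the real log). The dots are escaped and the last octet uses \d+ so that the whole address is captured, and Find must be called before the match is read:
    Code:
    Dim sample As String = "203.0.113.7 - - [24/Nov/2013:10:00:00 +0000] GET /index.html 200"
    Dim m As Matcher = Regex.Matcher("^(\d+\.\d+\.\d+\.\d+)", sample)
    If m.Find Then
       Log(m.Group(1)) 'the captured IP address
    Else
       Log("no match")
    End If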
    The app goes over all the lines in the file and adds each IP address to a Map named ips, with the number of occurrences as the value:

    Code:
    Dim count As Int
    If ips.ContainsKey(m.Match) Then
       count = ips.Get(m.Match)
    Else
       count = 0
    End If
    ips.Put(m.Match, count + 1)
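    The same counting step can be written more compactly. A minimal sketch, assuming Map.GetDefault is available in your B4J version; it returns the given default (0 here) when the key is missing, so the If / Else branch is not needed:
    Code:
    Dim ip As String = m.Match
    Dim count As Int = ips.GetDefault(ip, 0) '0 when the address was not seen before
    ips.Put(ip, count + 1)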
    In order to sort the data we will add the data to a list and call List.SortType:
    Code:
    Dim list1 As List
    list1.Initialize
    For i = 0 To ips.Size - 1
       Dim ic As IpAndCount
       ic.Initialize
       ic.ip = ips.GetKeyAt(i)
       ic.count = ips.GetValueAt(i)
       list1.Add(ic)
    Next
    list1.SortType("count", False)
    According to the logs, it took 30 seconds to parse 13 million lines (the most frequent IP address is a Google Bot IP).
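
    Because the list is sorted by count in descending order, the first items are the most frequent addresses. A short sketch that logs the top entries (the limit of 10 and the variable name item are arbitrary choices for illustration):
    Code:
    For i = 0 To Min(10, list1.Size) - 1
       Dim item As IpAndCount = list1.Get(i)
       Log(item.ip & ": " & item.count)
    Next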

    Let's say that you want to do all kinds of things with this map. You can of course add it to a SQL database. However, there is a simpler solution: serialize the map with RandomAccessFile.WriteObject. This writes the whole map to a binary file.
    Code:
    raf.Initialize(File.DirTemp, serializedMapFile, False)
    raf.WriteObject(ips, True, 0)
    raf.Close
    In the next run you can load the map and work with it instead of going over the 3.7GB file.
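    A sketch of the loading side, assuming the file was written at position 0 as in the WriteObject call above. The third parameter of Initialize is the read-only flag, so True is enough when the file is only read:
    Code:
    Dim raf As RandomAccessFile
    raf.Initialize(File.DirTemp, serializedMapFile, True) 'True = open read only
    Dim ips As Map = raf.ReadObject(0) 'read the object stored at position 0
    raf.Close
    Log("Loaded " & ips.Size & " addresses")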

    The complete code:
    Code:
    Sub Process_Globals
       Type IpAndCount (ip As String, count As Int)
       Private serializedMapFile As String = "ips.dat"
    End Sub

    Sub AppStart (Args() As String)
       Dim start As Long = DateTime.Now
       Dim ips As Map
       ips.Initialize
       Log("File size: " & NumberFormat(File.Size("", Args(0)) / (1024 * 1024), 0, 0) & " MB")
       Dim raf As RandomAccessFile
       If File.Exists(File.DirTemp, serializedMapFile) = False Then
          Log("Building map file...")
          Dim tr As TextReader
          tr.Initialize(File.OpenInput("", Args(0)))
          Dim lineCounter As Int
          Dim line As String = tr.ReadLine
          Do While line <> Null
             lineCounter = lineCounter + 1
             Dim m As Matcher = Regex.Matcher("^(\d+\.\d+\.\d+\.\d+)", line)
             If m.Find Then
                Dim count As Int
                If ips.ContainsKey(m.Match) Then
                   count = ips.Get(m.Match)
                Else
                   count = 0
                End If
                ips.Put(m.Match, count + 1)
             Else
                Log(line) 'not expected to happen
             End If
             If lineCounter Mod 1000000 = 0 Then
                Log("Line: " & NumberFormat(lineCounter, 0, 0) & ", Map size: " & ips.Size & _
                   ", time: " & (DateTime.Now - start) & " ms")
             End If
             line = tr.ReadLine
          Loop
          tr.Close
          'save the map for future use
          raf.Initialize(File.DirTemp, serializedMapFile, False)
          raf.WriteObject(ips, True, 0)
       Else
          'read the map from the file
          Log("Loading map file...")
          raf.Initialize(File.DirTemp, serializedMapFile, False)
          ips = raf.ReadObject(0)
       End If
       raf.Close

       Dim list1 As List
       list1.Initialize
       For i = 0 To ips.Size - 1
          Dim ic As IpAndCount
          ic.Initialize
          ic.ip = ips.GetKeyAt(i)
          ic.count = ips.GetValueAt(i)
          list1.Add(ic)
       Next
       list1.SortType("count", False)
       ic = list1.Get(0) 'get the top item
       Log(ic.ip & ": " & ic.count)

       Log("Total time: " & (DateTime.Now - start) & " ms, number of lines: " & _
          NumberFormat(lineCounter, 0, 0))
    End Sub
    JVM Arguments

    This program didn't need much memory. However, there are cases where you need to keep many objects in memory and the JVM runs out of available memory.
    You can increase the memory available to the JVM with the following attribute (-Xms sets the initial heap size and -Xmx the maximum heap size):
    Code:
    #VirtualMachineArgs: -Xms1024m -Xmx1024m
    Note that 32-bit JVMs are limited to 2 GB.
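    To confirm that the setting took effect, you can log the maximum heap size that the JVM reports. A minimal sketch, assuming the JavaObject library (jJavaObject) is checked in the Libraries tab; it calls the standard java.lang.Runtime.maxMemory method, and the variable names are only illustrative:
    Code:
    Dim runtimeClass As JavaObject
    runtimeClass.InitializeStatic("java.lang.Runtime")
    Dim rt As JavaObject = runtimeClass.RunMethodJO("getRuntime", Null)
    Dim maxBytes As Long = rt.RunMethod("maxMemory", Null) 'maximum heap size in bytes
    Log("Max heap: " & NumberFormat(maxBytes / (1024 * 1024), 0, 0) & " MB")
    The reported value can be somewhat lower than the -Xmx setting, depending on the JVM.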
     
  2. stevel05

    stevel05 Expert Licensed User

    Very impressive. I need to learn more about Regex.
     
  3. stevel05

    stevel05 Expert Licensed User

    Using a Map that way is a fantastic idea, extremely useful for parsing and analyzing all kinds of data throughput for debugging. I've used it several times already. Another great time saver.
     
  4. positrom2

    positrom2 Active Member Licensed User

    Where should I put the #VirtualMachineArgs line?
     
  5. stevel05

    stevel05 Expert Licensed User

    In the project attributes region at the top of the Main module.
     
  6. Erel

    Erel Administrator Staff Member Licensed User

    Note that the #VirtualMachineArgs attribute only affects the program when you run it from the IDE.
     