Android Question search in huge CSV file

Zeev Goldstein

Well-Known Member
Licensed User
Longtime User
hi

i have a huge csv file (~14 MB)
it has one column only
what is the fastest way to search for a value in the file?
there will be only one elements as all items are unique
i just need to know if the value i search is in the csv or not
no need to load anything

any idea?
a short sample will be appreciated

thanks
 

Zeev Goldstein

Well-Known Member
Licensed User
Longtime User
Looping line by line is NOT a slow method.
As written multiple times it depends on
- how often will the csv will be searched?
- how often is the CSV changed (to decide wheter it is worth to parse the csv and write it to a database)? Again: using a databasesearch is a really fast task. No matter if the Database contains 100 or 100 million datasets.
as said - the file may be read a few times one day and nothing on another day
the file might be updated each 24 or 48 hours
 
Upvote 0

peacemaker

Expert
Licensed User
Longtime User
B4X:
Private Sub Button1_Click
    Dim a As FileChooser
    a.Initialize
    If SourceFileFolder <> "" Then a.InitialDirectory = SourceFileFolder
    a.Title = "Select huge file:"
    a.setExtensionFilter("Huge file", Array As String("*.*"))

    Dim fn As String = a.ShowOpen(Main.MainForm)
    If fn = "" Then Return
    SourceFileFolder = a.InitialDirectory
 
    Dim size As Long = File.Size("", fn)
    Dim rafInput As RandomAccessFile
    rafInput.Initialize(fn, "", True)
    Dim chunk_size As Long = Min (size / 100, 1000000)
    Dim buffer(chunk_size) As Byte
 
    Dim NeedleToSearch As String = "test1"
 
    Dim found As Boolean
    Do While rafInput.CurrentPosition < size
        Dim count As Int = rafInput.ReadBytes(buffer, 0, Min(buffer.Length, size - rafInput.CurrentPosition), rafInput.CurrentPosition)
        Dim b() As Byte
        If count = buffer.Length Then
            b = buffer
        Else
            Dim b(count) As Byte
            Bit.ArrayCopy(buffer, 0, b, 0, count)
        End If
        Dim text_chunk As String = BytesToString(b, 0, b.Length, "UTF-8")
        'Log(text_chunk)
        If text_chunk.Contains(NeedleToSearch) Then
            found = True
            Exit
        End If
        Dim percent As Int = rafInput.CurrentPosition/size * 100
        Log(percent)
        If percent Mod 3 = 0 Then Sleep(1)
    Loop
    rafInput.Close
    xui.MsgboxAsync(found, "found")
End Sub

about 30 seconds for 5 GB file
 

Attachments

  • bigfile.zip
    3.2 KB · Views: 67
Last edited:
Upvote 0

peacemaker

Expert
Licensed User
Longtime User
B4X:
Private Sub Button1_Click
    Dim a As FileChooser
    a.Initialize
    If SourceFileFolder <> "" Then a.InitialDirectory = SourceFileFolder
    a.Title = "Select huge file:"
    a.setExtensionFilter("Huge file", Array As String("*.*"))

    Dim fn As String = a.ShowOpen(Main.MainForm)
    If fn = "" Then Return
    SourceFileFolder = a.InitialDirectory
   
    Dim NeedleToSearch As String = "0Ua"
   
    Dim size As Long = File.Size("", fn)
    Dim rafInput As RandomAccessFile
    rafInput.Initialize(fn, "", True)
   
    Dim chunk_size As Long = Max(NeedleToSearch.Length * 2, Min(size / 100, 1000000))
    Dim buffer(chunk_size) As Byte
       
    Dim found As Boolean
    Do While rafInput.CurrentPosition < size
        Dim qtyToRead As Long =  Min(buffer.Length, size - rafInput.CurrentPosition)
        Dim positionToRead As Long = rafInput.CurrentPosition
        If qtyToRead > NeedleToSearch.Length Then
            positionToRead = Max(positionToRead - NeedleToSearch.Length, NeedleToSearch.Length)
        End If
       
        Dim count As Int = rafInput.ReadBytes(buffer, 0, qtyToRead, positionToRead)
        Dim b() As Byte
        If count = buffer.Length Then
            b = buffer
        Else
            Dim b(count) As Byte
            Bit.ArrayCopy(buffer, 0, b, 0, count)
        End If
        Dim text_chunk As String = BytesToString(b, 0, b.Length, "UTF-8")
        Log(text_chunk)
        If text_chunk.Contains(NeedleToSearch) Then
            found = True
            Exit
        End If
        Dim percent As Int = rafInput.CurrentPosition/size * 100
        Log(percent)
        If percent Mod 3 = 0 Then Sleep(1)
    Loop
    rafInput.Close
    xui.MsgboxAsync(found, "found")
End Sub

i guess, more correct is to overlap the reading noting the searching text length...
 
Last edited:
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
hi

i have a huge csv file (~14 MB)
it has one column only
what is the fastest way to search for a value in the file?
there will be only one elements as all items are unique
i just need to know if the value i search is in the csv or not
no need to load anything

any idea?
a short sample will be appreciated

thanks
14Mb I wouldn't call huge.
I did a test using RAF, looking at consecutive byte arrays (to avoid memory problems).
Ran this on a postcode file (ID, postcode, latitude, longitude) with a size of 63 Mb and it found
the last postcode in some 125 milli-seconds.

RBS
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
14Mb I wouldn't call huge.
I did a test using RAF, looking at consecutive byte arrays (to avoid memory problems).
Ran this on a postcode file (ID, postcode, latitude, longitude) with a size of 63 Mb and it found
the last postcode in some 125 milli-seconds.

RBS
If I understand you well and looking at the sample file then it seems it can be speeded up maybe 3 times by
adding some code to stop looking further in the row if it sees the | character.

RBS
 
Upvote 0

Zeev Goldstein

Well-Known Member
Licensed User
Longtime User
B4X:
Private Sub Button1_Click
    Dim a As FileChooser
    a.Initialize
    If SourceFileFolder <> "" Then a.InitialDirectory = SourceFileFolder
    a.Title = "Select huge file:"
    a.setExtensionFilter("Huge file", Array As String("*.*"))

    Dim fn As String = a.ShowOpen(Main.MainForm)
    If fn = "" Then Return
    SourceFileFolder = a.InitialDirectory
 
    Dim size As Long = File.Size("", fn)
    Dim rafInput As RandomAccessFile
    rafInput.Initialize(fn, "", True)
    Dim chunk_size As Long = Min (size / 100, 1000000)
    Dim buffer(chunk_size) As Byte
 
    Dim NeedleToSearch As String = "test1"
 
    Dim found As Boolean
    Do While rafInput.CurrentPosition < size
        Dim count As Int = rafInput.ReadBytes(buffer, 0, Min(buffer.Length, size - rafInput.CurrentPosition), rafInput.CurrentPosition)
        Dim b() As Byte
        If count = buffer.Length Then
            b = buffer
        Else
            Dim b(count) As Byte
            Bit.ArrayCopy(buffer, 0, b, 0, count)
        End If
        Dim text_chunk As String = BytesToString(b, 0, b.Length, "UTF-8")
        'Log(text_chunk)
        If text_chunk.Contains(NeedleToSearch) Then
            found = True
            Exit
        End If
        Dim percent As Int = rafInput.CurrentPosition/size * 100
        Log(percent)
        If percent Mod 3 = 0 Then Sleep(1)
    Loop
    rafInput.Close
    xui.MsgboxAsync(found, "found")
End Sub

about 30 seconds for 5 GB file
Thanks, i will dive into the sample and learn it
Thanks again
 
Upvote 0

d3vc

Member
hi

i have a huge csv file (~14 MB)
it has one column only
what is the fastest way to search for a value in the file?
there will be only one elements as all items are unique
i just need to know if the value i search is in the csv or not
no need to load anything

any idea?
a short sample will be appreciated

thanks
Try this .
B4X:
Sub Process_Globals
    Private FileName As String = "yourfile.csv"
End Sub

Sub Globals
    Private SearchValue As String
End Sub

Sub Activity_Create(FirstTime As Boolean)
    SearchValue = "value_to_search"
    If SearchInCSV(FileName, SearchValue) Then
        Log("Value found!")
    Else
        Log("Value not found.")
    End If
End Sub

Sub SearchInCSV(FileName As String, SearchValue As String) As Boolean
    Dim InputStream As InputStream
    Dim Reader As TextReader
    Dim Line As String

    Try
        InputStream = File.OpenInput(File.DirDefaultExternal, FileName)
        Reader = File.OpenTextReader(InputStream)

        Do While True
            Line = Reader.ReadLine
            If Line = "" Then Exit
            If Line = SearchValue Then
                Reader.Close
                Return True
            End If
        Loop
    Catch
        Log(LastException)
    Finally
        If Reader.IsInitialized Then Reader.Close
        If InputStream.IsInitialized Then InputStream.Close
    End Try

    Return False
End Sub
 
Upvote 0

d3vc

Member
Thanks, i will dive into the sample and learn it
Thanks again
B4X:
Private Sub Button1_Click
    Dim a As FileChooser
    a.Initialize
    If SourceFileFolder <> "" Then a.InitialDirectory = SourceFileFolder
    a.Title = "Select huge file:"
    a.setExtensionFilter("Huge file", Array As String("*.*"))

    Dim fn As String = a.ShowOpen(Main.MainForm)
    If fn = "" Then Return
    SourceFileFolder = a.InitialDirectory
 
    Dim size As Long = File.Size("", fn)
    Dim rafInput As RandomAccessFile
    rafInput.Initialize(fn, "", True)
    Dim chunk_size As Long = Min(size / 100, 1000000)
    Dim buffer(chunk_size) As Byte
 
    Dim NeedleToSearch As String = "test1"
 
    Dim found As Boolean
    Do While rafInput.CurrentPosition < size
        Dim count As Int = rafInput.ReadBytes(buffer, 0, Min(buffer.Length, size - rafInput.CurrentPosition), rafInput.CurrentPosition)
        Dim b() As Byte
        If count = buffer.Length Then
            b = buffer
        Else
            Dim b(count) As Byte
            Bit.ArrayCopy(buffer, 0, b, 0, count)
        End If
        Dim text_chunk As String = BytesToString(b, 0, b.Length, "UTF-8")
        If text_chunk.Contains(NeedleToSearch) Then
            found = True
            Exit
        End If
        Dim percent As Int = rafInput.CurrentPosition / size * 100
        Log(percent)
        If percent Mod 3 = 0 Then Sleep(1)
    Loop
    rafInput.Close
    xui.MsgboxAsync(found, "found")
End Sub
 
Upvote 0

Zeev Goldstein

Well-Known Member
Licensed User
Longtime User
Try this .
B4X:
Sub Process_Globals
    Private FileName As String = "yourfile.csv"
End Sub

Sub Globals
    Private SearchValue As String
End Sub

Sub Activity_Create(FirstTime As Boolean)
    SearchValue = "value_to_search"
    If SearchInCSV(FileName, SearchValue) Then
        Log("Value found!")
    Else
        Log("Value not found.")
    End If
End Sub

Sub SearchInCSV(FileName As String, SearchValue As String) As Boolean
    Dim InputStream As InputStream
    Dim Reader As TextReader
    Dim Line As String

    Try
        InputStream = File.OpenInput(File.DirDefaultExternal, FileName)
        Reader = File.OpenTextReader(InputStream)

        Do While True
            Line = Reader.ReadLine
            If Line = "" Then Exit
            If Line = SearchValue Then
                Reader.Close
                Return True
            End If
        Loop
    Catch
        Log(LastException)
    Finally
        If Reader.IsInitialized Then Reader.Close
        If InputStream.IsInitialized Then InputStream.Close
    End Try

    Return False
End Sub
wow thanks
that looks so very simple (once you know it)
i will check it
i'm afraid you resolved my problem (if it will work and fast)
i will revert and update
thanks again
 
Upvote 0

emexes

Expert
Licensed User
Longtime User
did you want to see all 600K lines? ~14GB?

Is it 600K lines or 14 GB?

Unlikely to be both.

all lines are the same in content and size

If all lines are 20 characters as per your sample (plus one or two bytes of EOL) then impossible to be both.

600K lines * 22 bytes per line = ~14 MB (not GB)

14 MB would be quite manageable in memory and can be made blindingly fast, eg sort elements, do binary search, or stick them in a Map or hand-crafted bithash.

14 GB is a bit harder - that's more like 600 million (not thousand) elements. Can still be made fast, using a bithash to do a preliminary check that an element *might* exist or definitely does not exist, and then a binary search on the full list in a sorted file.

Actually, if you're just searching by the first 8 digits, then:
(i) that is only 100 million unique possible key values, and
(ii) 100 million bits = 12.5 million bytes = an easy bitmap in memory to do preliminary does-key-exist-or-not check

If the CSV is only 14 MB, then zip it up and post it to pastebin or similar.
 
Upvote 0

emexes

Expert
Licensed User
Longtime User
more correct

I think you also dodged the bullet of the read blocks possibly only partially-reading multibyte UTF-8 characters at beginning and end of a block.

Happily, the CSV data seems to be plain old boring ASCII.

If the CSV data lines are all reliably the same length (perhaps 20 printable bytes plus 1 byte CRLF line terminator) then you could read them in blocks of say 1000 lines ie 21000 bytes, and it'd all nicely line up and work, without crossing across line boundaries.

Although I guess if there's a header line with a different length, you'd need to separately read or skip over that first.

What could possibly go rwong?
 
Last edited:
Upvote 0

Zeev Goldstein

Well-Known Member
Licensed User
Longtime User
correction - i was wrong - the file is 14MB

i tries both samples given and both are TOO slow
the device asked several times to keep waiting or close the app as it seemed as hung
so after a while i just killed it

so sequential read is probably not the answer
 
Upvote 0

peacemaker

Expert
Licensed User
Longtime User
Seems, not so slow... and this was tested in Debug-mode.
1748507328957.png

1748507491957.png
 

Attachments

  • bigfile.zip
    3.4 KB · Views: 68
Last edited:
Upvote 0
Top