Android Question What is the fastest way to split this string?

RB Smissaert

Well-Known Member
Licensed User
Longtime User
Moving data from a text file to SQLite database.
If I have a line in the text file like for example this: 123, "John,", Smith
Then what would be the fastest way to split this line in these 3 fields:

123
John,
Smith

As these files can be large, I need to process the file line by line, so I can't use for example StringUtils.

RBS
 

Quandalle

Member
Licensed User
Below is an optimised CSV line parsing algorithm which avoids intermediate copies in a StringBuilder by using start and end indexes in the original string, and which uses the Char type for the current character rather than String. ( chr(34) = QUOTE and chr(44) = COMMA)

Compared with the algorithm proposed in #14, this saves around 50% of the execution time.

B4X:
Sub ParseCSVLine(csvLine As String) As List
    Dim values As List
    values.Initialize
    Dim startIndex, endIndex As Int
    Dim insideQuotes As Boolean = False
    Dim c As Char
        
    For endIndex = 0 To csvLine.Length - 1
        c = csvLine.CharAt(endIndex)
        If c = Chr(34) Then
            insideQuotes = Not (insideQuotes)
        Else
            If (c = Chr(44)) And Not (insideQuotes) Then
                values.Add(csvLine.SubString2(startIndex, endIndex).trim)
                startIndex = endIndex + 1
            End If
        End If
    Next
    values.Add(csvLine.SubString2(startIndex, endIndex ).trim)
    Return values
End Sub
 
Upvote 0

William Lancee

Well-Known Member
Licensed User
Longtime User
@Quandalle
Nicely done.

The algorithm, yours and mine, requires tweaking to account for escaped quotes inside quotes.
Also, a more general approach should allow for multi-character separators.

Here is a test string to consider:
"Far and wide" and "Lynx, Cat, Tiger" and "A double quote -> \" and a single quote -> '" and and
Split by "and" to yield

Far and wide
Lynx, Cat, Tiger
A double quote -> " and a single quote -> '
-empty-
-empty-
 
Upvote 0

Quandalle

Member
Licensed User
Below is an optimised CSV line parsing algorithm which avoids intermediate copies in a StringBuilder by using start and end indexes in the original string, and which uses the Char type for the current character rather than String. ( chr(34) = QUOTE and chr(44) = COMMA)

Compared with the algorithm proposed in #14, this saves around 50% of the execution time.

B4X:
Sub ParseCSVLine(csvLine As String) As List
    Dim values As List
    values.Initialize
    Dim startIndex, endIndex As Int
    Dim insideQuotes As Boolean = False
    Dim c As Char
       
    For endIndex = 0 To csvLine.Length - 1
        c = csvLine.CharAt(endIndex)
        If c = Chr(34) Then
            insideQuotes = Not (insideQuotes)
        Else
            If (c = Chr(44)) And Not (insideQuotes) Then
                values.Add(csvLine.SubString2(startIndex, endIndex).trim)
                startIndex = endIndex + 1
            End If
        End If
    Next
    values.Add(csvLine.SubString2(startIndex, endIndex ).trim)
    Return values
End Sub

Fast CSV decoding by not using a StringBuilder and by limiting implicit type conversions.
Compared with the previous version, this version adds a few lines to handle escape quotes (a quote inside a string delimited by quotes must be doubled).

B4X:
Sub ParseCSVLine(csvLine As String) As List
    Dim const  qotecar As Char = Chr(34)
    Dim const  separator  As Char = Chr(44)
    Dim startIndex, endIndex As Int
    Dim insideQuotes, DoubleQuote  As Boolean
    Dim c As Char
    Dim values As List : values.Initialize
        
    For endIndex = 0 To csvLine.Length - 1
        c = csvLine.CharAt(endIndex)
                
        If c = qotecar Then
            If endIndex < csvLine.Length-1  And csvLine.CharAt(endIndex+1) = qotecar Then
                endIndex = endIndex +1         
                DoubleQuote = True
            Else
                insideQuotes = Not (insideQuotes)
            End If
        Else
            If (c = separator ) And Not (insideQuotes) Then
                If DoubleQuote Then
                    DoubleQuote = False
                    values.Add(csvLine.SubString2(startIndex, endIndex).trim.Replace(qotecar & qotecar ,qotecar ))
                Else
                    values.Add(csvLine.SubString2(startIndex, endIndex).trim)               
                End If
                startIndex = endIndex + 1
            End If
        End If
    Next
    values.Add(csvLine.SubString2(startIndex, endIndex ).trim)
    Return values
End Sub
 
Upvote 0
Top