Android Question Parsing .csv file

RB Smissaert

Well-Known Member
Licensed User
Longtime User
I am dealing with large and often misformed .csv files and StringUtils.LoadCSV is not flexible enough to cope with all these, so wrote a parsing function in
B4A, having the same output (list holding 1D string arrays). This seems to work all fine, but it is about 3-4x slower that StringUtils.LoadCSV. I suppose that this
is to be expected as I take it LoadCSV works directly with Java. Below the B4A code I have and 2 questions about this:
1. Are there any ways anybody can see to speed this up?
2. Would it be possible to translate this code with inline Java?

B4X:
Sub CSV2List(strFolder As String, _
             strFile As String, _
             btSeparatorByte As Byte, _
             btEncloserByte As Byte, _
             btEndOfLineByte As Byte, _
             lRows As Long) As ResumableSub

    Dim i As Long
    Dim c As Long
    Dim arrBytes() As Byte
    Dim lBytes As Long
    Dim lColumns As Long
    Dim lRow As Long
    Dim lUB As Long
    Dim lSeparatorPos As Long
    Dim lEncloserPos As Long
    Dim lChr10Pos As Long
    Dim arrFields() As Object
    Dim arrData() As Object
    Dim lstRows As List
    Dim lFirstDataByte As Long
    Dim bExit As Boolean
    Dim bCSVHasEncloserBytes As Boolean

    'this will be a zero-based UTF8 byte array
    '-----------------------------------------
    RAF.Initialize(strFolder, strFile, True)
    lBytes = RAF.Size
    Dim arrBytes(lBytes) As Byte
    RAF.ReadBytes(arrBytes, 0, lBytes, 0)
 
    lUB = arrBytes.Length - 1
    
    bCSVHasEncloserBytes = CSVHasEncloserBytes(arrBytes, btEncloserByte)

    arrFields = GetCSVHeaders(arrBytes, 44, 34, 10)
    lColumns = arrFields.Length
    Dim arrData(lColumns) As Object
    
    lstRows.Initialize
    lstRows.Add(arrFields) 'adding the column names
    
    lRow = lRow + 1

    'as we are skipping the the column names
    '---------------------------------------
    lFirstDataByte = GetCSVFirstDataByte(arrBytes, 10)
    lEncloserPos = -1
    lSeparatorPos = -1
    lChr10Pos = lFirstDataByte - 1    '<<<<<<!!
    
    'deal with the first byte, taking this out speeds up the next loop a bit
    '-----------------------------------------------------------------------
    If bCSVHasEncloserBytes Then '<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
        
        If lColumns > 1 Then '<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
            
            ''''''''''''''''''''''''''
            'SEPARATORS AND ENCLOSERS'
            ''''''''''''''''''''''''''
            Select Case arrBytes(lFirstDataByte)
                Case btSeparatorByte
                    arrData(c) = Null
                Case btEncloserByte
                    lEncloserPos = lFirstDataByte
            End Select

            For i = lFirstDataByte + 1 To lUB - 1
                Select Case arrBytes(i)
                    Case btSeparatorByte
                        If c < lColumns Then
                            If lEncloserPos = -1 Then
                                If lSeparatorPos = -1 Then
                                    If arrBytes(i - 1) <> btEncloserByte Then
                                        arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, (i - 1) - lChr10Pos, "UTF-8")
                                        c = c + 1
                                        lChr10Pos = -1
                                    End If    'If arrBytes(i - 1) <> btEncloserByte
                                Else    'If lSeparatorPos = -1
                                    If (i - 1) - (lSeparatorPos + 1) > 1 Then
                                        arrData(c) = BytesToString(arrBytes, lSeparatorPos + 1, (i - 1) - lSeparatorPos, "UTF-8")
                                    Else    'If (i - 1) - (lSeparatorPos + 1) > 1
                                        arrData(c) = Null
                                    End If    'If (i - 1) - (lSeparatorPos + 1) > 1
                                    c = c + 1
                                    lChr10Pos = -1
                                End If    'If lSeparatorPos = -1
                            End If    'If lEncloserPos = -1
                        End If    'If c < lColumns
                        lSeparatorPos = i
                    Case btEncloserByte
                        If lEncloserPos = -1 Then
                            lEncloserPos = i
                        Else
                            If c < lColumns Then
                                If i - lEncloserPos = 1 Then
                                    arrData(c) = Null
                                    c = c + 1
                                Else    'If i - lEncloserPos = 1
                                    arrData(c) = BytesToString(arrBytes, lEncloserPos + 1, (i - 1) - lEncloserPos, "UTF-8")
                                    c = c + 1
                                End If    'If i - lEncloserPos = 1
                            End If    'If c < lColumns
                            lEncloserPos = -1
                        End If
                        lSeparatorPos = -1
                        lChr10Pos = -1
                    Case btEndOfLineByte
                        If lEncloserPos = -1 Then
                            If c < lColumns Then
                                If arrBytes(i - 1) = 13 Then
                                    If (i - 2) - lSeparatorPos > 0 Then
                                        arrData(c) = BytesToString(arrBytes, lSeparatorPos + 1, (i - 2) - lSeparatorPos, "UTF-8")
                                    Else
                                        arrData(c) = Null
                                    End If
                                    c = c + 1
                                Else    'If arrBytes(i - 1) = 13
                                    If (i - 1) - lSeparatorPos > 0 Then
                                        arrData(c) = BytesToString(arrBytes, lSeparatorPos + 1, (i - 1) - lSeparatorPos, "UTF-8")
                                    Else
                                        arrData(c) = Null
                                    End If
                                    c = c + 1
                                End If    'If arrBytes(i - 1) = 13
                            End If
                    
                            lstRows.Add(arrData)
                            lRow = lRow + 1
                            lEncloserPos = -1
                            lSeparatorPos = -1
                            lChr10Pos = i
                            c = 0
                            Dim arrData(lColumns) As Object

                            If lRows > -1 Then
                                If lRow > lRows Then
                                    bExit = True
                                End If
                            End If
                        End If    'If lEncloserPos = -1
                End Select
                If bExit Then Exit
            Next
    
            'deal with the last byte, this is needed for if there is no final linebreak
            '--------------------------------------------------------------------------
            If lRows = -1 Then
                Select Case arrBytes(lUB)
                    Case btSeparatorByte
                        If c < lColumns Then
                            If lEncloserPos = -1 Then
                                If lSeparatorPos = -1 Then
                                    If lUB - lChr10Pos = 1 Then
                                        arrData(c) = Null
                                    Else    'If lUB - lChr10Pos = 1
                                        If arrBytes(lUB - 1) <> btEncloserByte Then
                                            arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, (lUB - 1) - lChr10Pos, "UTF-8")
                                        End If
                                    End If    'If lUB - lChr10Pos = 1
                                End If    'If lSeparatorPos = -1
                            Else    'If lEncloserPos = -1
                                If lUB - lSeparatorPos = 1 Then
                                    arrData(c) = Null
                                End If
                            End If    'If lEncloserPos = -1
                        End If    'If collStrings.Count < lColumns
                    Case btEncloserByte
                        If lEncloserPos > -1 Then
                            If c < lColumns Then
                                If lUB - lEncloserPos = 1 Then
                                    arrData(c) = Null
                                Else    'If lUB - lEncloserPos = 1
                                    arrData(c) = BytesToString(arrBytes, lEncloserPos + 1, (lUB - 1) - lEncloserPos, "UTF-8")
                                End If    'If lUB - lEncloserPos = 1
                            End If    'If c < lColumns
                        End If    'If lEncloserPos > -1
                    Case btEndOfLineByte
                        If lEncloserPos = -1 Then
                            If c < lColumns Then
                                If lSeparatorPos > -1 Then
                                    If arrBytes(lUB - 1) = 13 Then
                                        If lUB - lSeparatorPos > 2 Then
                                            arrData(c) = BytesToString(arrBytes, lSeparatorPos + 1, (lUB - 1) - lSeparatorPos, "UTF-8")
                                        Else
                                            arrData(c) = Null
                                        End If
                                    Else    'If arrBytes(lUB - 1) = 13
                                        If lUB - lSeparatorPos > 1 Then
                                            arrData(c) = BytesToString(arrBytes, lSeparatorPos + 1, (lUB - 2) - lSeparatorPos, "UTF-8")
                                        Else
                                            arrData(c) = Null
                                        End If
                                    End If    'If arrBytes(lUB - 1) = 13
                                Else    'If lSeparatorPos > -1
                                    If lChr10Pos > -1 Then
                                        If arrBytes(lUB - 1) = 13 Then
                                            arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, (lUB - 2) - lChr10Pos, "UTF-8")
                                        Else
                                            arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, (lUB - 1) - lChr10Pos, "UTF-8")
                                        End If
                                    Else    'If lChr10Pos > -1
                                        If arrBytes(lUB - 2) <> btEncloserByte Then
                                            If arrBytes(lUB - 1) = 13 Then
                                                arrData(c) = BytesToString(arrBytes, 0, lUB - 2, "UTF-8")
                                            Else
                                                arrData(c) = BytesToString(arrBytes, 0, lUB - 1, "UTF-8")
                                            End If
                                        End If    'If arrBytes(lUB - 2) <> btEncloserByte
                                    End If    'If lChr10Pos > -1
                                End If    'If lSeparatorPos > -1
                            End If    'If c < lColumns
                        End If    'If lEncloserPos = -1
                    Case Else
                        If lEncloserPos = -1 Then
                            If c < lColumns Then
                                If lSeparatorPos > -1 Then
                                    If lUB - lSeparatorPos > 1 Then
                                        arrData(c) = BytesToString(arrBytes, lSeparatorPos + 1, lUB - lSeparatorPos, "UTF-8")
                                    Else
                                        arrData(c) = Null
                                    End If
                                Else    'If lSeparatorPos > -1
                                    If lChr10Pos > -1 Then
                                        arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, lUB - lChr10Pos, "UTF-8")
                                    Else
                                        arrData(c) = BytesToString(arrBytes, lFirstDataByte, (lUB - lFirstDataByte) + 1, "UTF-8")
                                    End If
                                End If    'If lSeparatorPos > -1
                            End If    'If c < lColumns
                        End If    'If lEncloserPos = -1
                End Select
            End If
        
        Else 'If lColumns > 1 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
        
            '''''''''''''''''''''''''''''
            'NO SEPARATORS BUT ENCLOSERS'
            '''''''''''''''''''''''''''''
            Select Case arrBytes(lFirstDataByte)
                Case btEncloserByte
                    lEncloserPos = lFirstDataByte
            End Select

            For i = lFirstDataByte + 1 To lUB - 1
                Select Case arrBytes(i)
                    Case btEncloserByte
                        If lEncloserPos = -1 Then
                            lEncloserPos = i
                        Else
                            If c < lColumns Then
                                If i - lEncloserPos = 1 Then
                                    arrData(c) = Null
                                    c = c + 1
                                Else    'If i - lEncloserPos = 1
                                    arrData(c) = BytesToString(arrBytes, lEncloserPos + 1, (i - 1) - lEncloserPos, "UTF-8")
                                    c = c + 1
                                End If    'If i - lEncloserPos = 1
                            End If    'If c < lColumns
                            lEncloserPos = -1
                        End If
                        lChr10Pos = -1
                    Case btEndOfLineByte
                        If lEncloserPos = -1 Then
                            If c < lColumns Then
                                If arrBytes(i - 1) = 13 Then
                                    If (i - lChr10Pos) -1 > 1 Then
                                        arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, (i - lChr10Pos) -1 , "UTF-8")
                                    Else
                                        arrData(c) = Null
                                    End If
                                    c = c + 1
                                Else    'If arrBytes(i - 1) = 13
                                    If (i - lChr10Pos) > 1 Then
                                        arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, (i - lChr10Pos) , "UTF-8")
                                    Else
                                        arrData(c) = Null
                                    End If
                                    c = c + 1
                                End If    'If i - lSeparatorPos > 1
                            End If    'If c < lColumns
                    
                            lstRows.Add(arrData)
                            lRow = lRow + 1
                            lEncloserPos = -1
                            lChr10Pos = i
                            c = 0
                            Dim arrData(lColumns) As Object

                            If lRows > -1 Then
                                If lRow > lRows Then
                                    bExit = True
                                End If
                            End If
                        End If    'If lEncloserPos = -1
                End Select
                If bExit Then Exit
            Next
    
            'deal with the last byte, this is needed for if there is no final linebreak
            '--------------------------------------------------------------------------
            If lRows = -1 Then
                Select Case arrBytes(lUB)
                    Case btEncloserByte
                        If lEncloserPos > -1 Then
                            If c < lColumns Then
                                If lUB - lEncloserPos = 1 Then
                                    arrData(c) = Null
                                Else    'If lUB - lEncloserPos = 1
                                    arrData(c) = BytesToString(arrBytes, lEncloserPos + 1, (lUB - 1) - lEncloserPos, "UTF-8")
                                End If    'If lUB - lEncloserPos = 1
                            End If    'If c < lColumns
                        End If    'If lEncloserPos > -1
                    Case btEndOfLineByte
                        If lEncloserPos = -1 Then
                            If c < lColumns Then
                                If lChr10Pos > -1 Then
                                    If arrBytes(lUB - 1) = 13 Then
                                        If (lUB - 2) - lChr10Pos > 0 Then
                                            arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, (lUB - 2) - lChr10Pos, "UTF-8")
                                        Else
                                            arrData(c) = Null
                                        End If
                                    Else
                                        If (lUB - 1) - lChr10Pos > 0 Then
                                            arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, (lUB - 1) - lChr10Pos, "UTF-8")
                                        Else
                                            arrData(c) = Null
                                        End If
                                    End If
                                Else    'If lChr10Pos > -1
                                    If arrBytes(lUB - 2) <> btEncloserByte Then
                                        If arrBytes(lUB - 1) = 13 Then
                                            arrData(c) = BytesToString(arrBytes, 0, lUB - 2, "UTF-8")
                                        Else
                                            arrData(c) = BytesToString(arrBytes, 0, lUB - 1, "UTF-8")
                                        End If
                                    End If    'If arrBytes(lUB - 2) <> btEncloserByte
                                End If    'If lChr10Pos > -1
                            End If    'If c < lColumns
                        End If    'If lEncloserPos = -1
                    Case Else
                        If lEncloserPos = -1 Then
                            If c < lColumns Then
                                If lChr10Pos > -1 Then
                                    arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, lUB - lChr10Pos, "UTF-8")
                                Else
                                    arrData(c) = BytesToString(arrBytes, lFirstDataByte, (lUB - lFirstDataByte) + 1, "UTF-8")
                                End If
                            End If    'If c < lColumns
                        End If    'If lEncloserPos = -1
                End Select
            End If
        End If 'If lColumns > 1 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
        
    Else  'If bCSVHasEncloserBytes<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
        
        '''''''''''''''''''''''''''''
        'SEPARATORS BUT NO ENCLOSERS'
        '''''''''''''''''''''''''''''
        If lColumns > 1 Then '<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
            Select Case arrBytes(lFirstDataByte)
                Case btSeparatorByte
                    arrData(c) = Null
            End Select

            For i = lFirstDataByte + 1 To lUB - 1
                Select Case arrBytes(i)
                    Case btSeparatorByte
                        If c < lColumns Then
                            If lSeparatorPos = -1 Then
                                arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, (i - 1) - lChr10Pos, "UTF-8")
                                c = c + 1
                                lChr10Pos = -1
                            Else    'If lSeparatorPos = -1
                                If (i - 1) - (lSeparatorPos + 1) > 1 Then
                                    arrData(c) = BytesToString(arrBytes, lSeparatorPos + 1, (i - 1) - lSeparatorPos, "UTF-8")
                                Else    'If (i - 1) - (lSeparatorPos + 1) > 1
                                    arrData(c) = Null
                                End If    'If (i - 1) - (lSeparatorPos + 1) > 1
                                c = c + 1
                                lChr10Pos = -1
                            End If    'If lSeparatorPos = -1
                        End If    'If c < lColumns
                        lSeparatorPos = i
                    Case btEndOfLineByte
                        If c < lColumns Then
                            If lSeparatorPos > -1 Then
                                If arrBytes(i - 1) = 13 Then
                                    If (i - 2) - lSeparatorPos > 0 Then
                                        arrData(c) = BytesToString(arrBytes, lSeparatorPos + 1, (i - 2) - lSeparatorPos, "UTF-8")
                                    Else
                                        arrData(c) = Null
                                    End If
                                    c = c + 1
                                Else    'If arrBytes(i - 1) = 13
                                    If (i - 1) - lSeparatorPos > 0 Then
                                        arrData(c) = BytesToString(arrBytes, lSeparatorPos + 1, (i - 1) - lSeparatorPos, "UTF-8")
                                    Else
                                        arrData(c) = Null
                                    End If
                                    c = c + 1
                                End If    'If arrBytes(i - 1) = 13
                            Else    'If lSeparatorPos > -1
                                If arrBytes(i - 1) = 13 Then
                                    If (i - lChr10Pos) -1 > 1 Then
                                        arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, (i - lChr10Pos) -1 , "UTF-8")
                                    Else
                                        arrData(c) = Null
                                    End If
                                    c = c + 1
                                Else    'If arrBytes(i - 1) = 13
                                    If (i - lChr10Pos) > 1 Then
                                        arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, (i - lChr10Pos) , "UTF-8")
                                    Else
                                        arrData(c) = Null
                                    End If
                                    c = c + 1
                                End If    'If i - lSeparatorPos > 1
                            End If    'If lSeparatorPos > -1
                    
                            lstRows.Add(arrData)
                            lRow = lRow + 1
                            lSeparatorPos = -1
                            lChr10Pos = i
                            c = 0
                            Dim arrData(lColumns) As Object

                            If lRows > -1 Then
                                If lRow > lRows Then
                                    bExit = True
                                End If
                            End If
                        End If    'If lEncloserPos = -1
                End Select
                If bExit Then Exit
            Next
    
            'deal with the last byte, this is needed for if there is no final linebreak
            '--------------------------------------------------------------------------
            If lRows = -1 Then
                Select Case arrBytes(lUB)
                    Case btSeparatorByte
                        If c < lColumns Then
                            If lSeparatorPos = -1 Then
                                If lUB - lChr10Pos = 1 Then
                                    arrData(c) = Null
                                Else    'If lUB - lChr10Pos = 1
                                    'If arrBytes(lUB - 1) <> btEncloserByte Then
                                    arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, (lUB - 1) - lChr10Pos, "UTF-8")
                                End If    'If lUB - lChr10Pos = 1
                            End If    'If lSeparatorPos = -1
                        End If    'If collStrings.Count < lColumns
                    Case btEndOfLineByte
                        If lEncloserPos = -1 Then
                            If c < lColumns Then
                                If lSeparatorPos > -1 Then
                                    If arrBytes(lUB - 1) = 13 Then
                                        If lUB - lSeparatorPos > 2 Then
                                            arrData(c) = BytesToString(arrBytes, lSeparatorPos + 1, (lUB - 1) - lSeparatorPos, "UTF-8")
                                        Else
                                            arrData(c) = Null
                                        End If
                                    Else    'If arrBytes(lUB - 1) = 13
                                        If lUB - lSeparatorPos > 1 Then
                                            arrData(c) = BytesToString(arrBytes, lSeparatorPos + 1, (lUB - 2) - lSeparatorPos, "UTF-8")
                                        Else
                                            arrData(c) = Null
                                        End If
                                    End If    'If arrBytes(lUB - 1) = 13
                                Else    'If lSeparatorPos > -1
                                    If lChr10Pos > -1 Then
                                        If arrBytes(lUB - 1) = 13 Then
                                            If (lUB - 2) - lChr10Pos > 0 Then
                                                arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, (lUB - 2) - lChr10Pos, "UTF-8")
                                            Else
                                                arrData(c) = Null
                                            End If
                                        Else
                                            If (lUB - 1) - lChr10Pos > 0 Then
                                                arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, (lUB - 1) - lChr10Pos, "UTF-8")
                                            Else
                                                arrData(c) = Null
                                            End If
                                        End If
                                    Else    'If lChr10Pos > -1
                                        If arrBytes(lUB - 1) = 13 Then
                                            arrData(c) = BytesToString(arrBytes, 0, lUB - 2, "UTF-8")
                                        Else
                                            arrData(c) = BytesToString(arrBytes, 0, lUB - 1, "UTF-8")
                                        End If
                                    End If    'If lChr10Pos > -1
                                End If    'If lSeparatorPos > -1
                            End If    'If c < lColumns
                        End If    'If lEncloserPos = -1
                    Case Else
                        If c < lColumns Then
                            If lSeparatorPos > -1 Then
                                If lUB - lSeparatorPos > 1 Then
                                    arrData(c) = BytesToString(arrBytes, lSeparatorPos + 1, lUB - lSeparatorPos, "UTF-8")
                                Else
                                    arrData(c) = Null
                                End If
                            Else    'If lSeparatorPos > -1
                                If lChr10Pos > -1 Then
                                    arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, lUB - lChr10Pos, "UTF-8")
                                Else
                                    arrData(c) = BytesToString(arrBytes, lFirstDataByte, (lUB - lFirstDataByte) + 1, "UTF-8")
                                End If
                            End If    'If lSeparatorPos > -1
                        End If    'If c < lColumns
                End Select
            End If
        
        Else 'If lColumns > 1 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
        
            ''''''''''''''''''''''''''''''''
            'NO SEPARATORS AND NO ENCLOSERS'
            ''''''''''''''''''''''''''''''''
            For i = lFirstDataByte + 1 To lUB - 1
                Select Case arrBytes(i)
                    Case btEndOfLineByte
                        If c < lColumns Then
                            If arrBytes(i - 1) = 13 Then
                                If (i - lChr10Pos) -1 > 0 Then
                                    arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, (i - lChr10Pos) -1 , "UTF-8")
                                Else
                                    arrData(c) = Null
                                End If
                                c = c + 1
                            Else    'If arrBytes(i - 1) = 13
                                If (i - lChr10Pos) > 0 Then
                                    arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, (i - lChr10Pos) , "UTF-8")
                                Else
                                    arrData(c) = Null
                                End If
                                c = c + 1
                            End If    'If i - lSeparatorPos > 1
                        End If    'If c < lColumns
                    
                        lstRows.Add(arrData)
                        lRow = lRow + 1
                        lChr10Pos = i
                        c = 0
                        Dim arrData(lColumns) As Object

                        If lRows > -1 Then
                            If lRow > lRows Then
                                bExit = True
                            End If
                        End If
                End Select
                If bExit Then Exit
            Next
    
            'deal with the last byte, this is needed for if there is no final linebreak
            '--------------------------------------------------------------------------
            If lRows = -1 Then
                Select Case arrBytes(lUB)
                    Case btEndOfLineByte
                        If c < lColumns Then
                            If lChr10Pos > -1 Then
                                If arrBytes(lUB - 1) = 13 Then
                                    arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, (lUB - 2) - lChr10Pos, "UTF-8")
                                Else
                                    arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, (lUB - 1) - lChr10Pos, "UTF-8")
                                End If
                            Else    'If lChr10Pos > -1
                                If arrBytes(lUB - 1) = 13 Then
                                    arrData(c) = BytesToString(arrBytes, 0, lUB - 2, "UTF-8")
                                Else
                                    arrData(c) = BytesToString(arrBytes, 0, lUB - 1, "UTF-8")
                                End If
                            End If    'If lChr10Pos > -1
                        End If    'If c < lColumns
                    Case Else
                        If c < lColumns Then
                            If lChr10Pos > -1 Then
                                arrData(c) = BytesToString(arrBytes, lChr10Pos + 1, lUB - lChr10Pos, "UTF-8")
                            Else
                                arrData(c) = BytesToString(arrBytes, lFirstDataByte, (lUB - lFirstDataByte) + 1, "UTF-8")
                            End If
                        End If    'If c < lColumns
                End Select
            End If
        End If 'If lColumns > 1 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
        
    End If 'If bCSVHasEncloserBytes<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    
    'add final data array
    lstRows.Add(arrData)

    Return lstRows

End Sub

Sub GetCSVHeaders(arrBytes() As Byte, _
                  btSeparatorByte As Byte, _
                  btEncloserByte As Byte, _
                  btEndOfLineByte As Byte) As String()

    Dim i As Long
    Dim c As Long
    Dim lUB As Long
    Dim lSeparatorPos As Long
    Dim lEncloserPos As Long
    Dim lChr10Pos As Long
    Dim lstFields As List
    Dim bExitLoop As Boolean

    lUB = arrBytes.Length - 1

    lEncloserPos = -1
    lSeparatorPos = -1
    lChr10Pos = -1

    lstFields.Initialize

    For i = 0 To lUB
        Select Case arrBytes(i)
            Case btSeparatorByte
                If lEncloserPos = -1 Then
                    If lSeparatorPos = -1 Then
                        If i - lChr10Pos = 1 Then
                            bExitLoop = True
                        Else
                            If arrBytes(i - 1) <> btEncloserByte Then
                                lstFields.Add(BytesToString(arrBytes, lChr10Pos + 1, (i - 1) - lChr10Pos, "UTF-8"))
                                c = c + 1
                                lChr10Pos = -1
                            End If
                        End If
                    Else
                        If (i - 1) - (lSeparatorPos + 1) > 1 Then
                            lstFields.Add(BytesToString(arrBytes, lSeparatorPos + 1, (i - 1) - lSeparatorPos, "UTF-8"))
                        Else
                            bExitLoop = True
                        End If
                        c = c + 1
                        lChr10Pos = -1
                    End If
                Else
                    If i - lSeparatorPos = 1 Then
                        bExitLoop = True
                    End If
                End If
                lSeparatorPos = i
            Case btEncloserByte
                If lEncloserPos = -1 Then
                    lEncloserPos = i
                Else
                    If i - lEncloserPos = 1 Then
                        bExitLoop = True
                    Else
                        lstFields.Add(BytesToString(arrBytes, lEncloserPos + 1, (i - 1) - lEncloserPos, "UTF-8"))
                        c = c + 1
                    End If
                    lEncloserPos = -1
                End If
                lSeparatorPos = -1
                lChr10Pos = -1
            Case btEndOfLineByte
                If lEncloserPos = -1 Then
                    If lSeparatorPos > -1 Then
                        If arrBytes(i - 1) = 13 Then
                            If i - lSeparatorPos > 2 Then
                                lstFields.Add(BytesToString(arrBytes, lSeparatorPos + 1, (i - 2) - lSeparatorPos, "UTF-8"))
                                c = c + 1
                            Else
                                bExitLoop = True
                            End If
                        Else
                            If i - lSeparatorPos > 1 Then
                                lstFields.Add(BytesToString(arrBytes, lSeparatorPos + 1, (i - 1) - lSeparatorPos, "UTF-8"))
                                c = c + 1
                            Else
                                bExitLoop = True
                            End If
                        End If
                    Else    'If lSeparatorPos > -1
                        If lChr10Pos > -1 Then
                            If arrBytes(i - 1) = 13 Then
                                lstFields.Add(BytesToString(arrBytes, lChr10Pos + 1, (i - 2) - lChr10Pos, "UTF-8"))
                            Else
                                lstFields.Add(BytesToString(arrBytes, lChr10Pos + 1, (i - 1) - lChr10Pos, "UTF-8"))
                            End If
                            c = c + 1
                        Else
                            If arrBytes(i - 2) <> btEncloserByte Then
                                If arrBytes(i - 1) = 13 Then
                                    lstFields.Add(BytesToString(arrBytes, 0, i - 1, "UTF-8"))
                                Else
                                    lstFields.Add(BytesToString(arrBytes, 0, i, "UTF-8"))
                                End If
                                c = c + 1
                            End If
                        End If
                    End If    'If lSeparatorPos > -1
                    bExitLoop = True
                End If    'If lEncloserPos = -1
        End Select
        If bExitLoop Then Exit
    Next

    Dim arrFields(lstFields.Size) As String

    For i = 0 To lstFields.Size - 1
        arrFields(i) = lstFields.Get(i)
    Next

    Return arrFields

End Sub

Sub GetCSVFirstDataByte(arrBytes() As Byte, btEndOfLineByte As Byte) As Long

    Dim i As Long

    For i = 0 To arrBytes.Length - 1
        If arrBytes(i) = btEndOfLineByte Then
            Return i + 1
        End If
    Next
    
    Return -1

End Sub

Sub CSVHasEncloserBytes(arrBytes() As Byte, btEncloserByte As Byte) As Boolean
    
    Dim i As Long
    Dim lUB As Long
    
    lUB = arrBytes.Length - 1

    For i = 0 To lUB
        If arrBytes(i) = btEncloserByte Then
            Return True
        End If
    Next
    
    Return False
    
End Sub


RBS
 

Albert Kallal

Active Member
Licensed User
Well, the code is eventually converted into Java, so I don't see any speed advantages here.

However, you can use some built in commands to split out a row into columns - and it is MUCH less code. And with MUCH less code, then things are likely to run much faster.

You can use the regex.Split() command here.

So, if you have "a,b,c,d,e" as a string, then

dim OneRow() as string
OneRow = regex.Split(",",strOneRow)

The result is a oneRow will have a array of the values between the "," in this case.

So, a general csv parseing loop could/would look like this:

B4X:
   Dim strDataBuf As String
   strDataBuf = File.ReadString(File.DirAssets,"data.csv")

   Dim MyRows() As String

   MyRows = Regex.Split(CRLF,strDataBuf)

   Dim MyColumns() As String

   Dim bolFirstRow As Boolean = True
   Dim MyResultList As List

   MyResultList.Initialize

   For Each strOneRow As String In MyRows

      If bolFirstRow Then
          bolFirstRow = False
        MyColumns = Regex.Split(",",strOneRow)
      Else
          Dim OneRow() As String
        OneRow = Regex.Split(",",strOneRow)
        MyResultList.Add(OneRow)
     End If
   Next

So, our first split converts the text file into a array of "rows", since each line is terminated with a crlf (but, check, it is in some cases just LF).

Then, we loop each row, and then split that row by the field delim "," in this case.
And before you split, you could do one replace of " to nothing if you don't care about quotes around strings, and none around numbers. You not supplied any data, or in fact explained the defaults of the bad csv file.

But:
Reading rows as a full string in one chunk whiteout a whole lot of code = VERY fast.
Using split against the one strong row in one shot without a whole of code = VERY fast.

So, based on the above basic loop, I would suspect your results will be fast - very fast.

So, dealing with things at a byte level means you have a LOT OF code here, and a LOT of operations.

The less over all operations and lines of code to run for each row? Again, = much faster code.

So, you not much explained the details as to what you want to do with quoted strings vs values that are not quoted. But in a LOT of cases?

I just wipe out the quotes from the line - as I often don't care, and I happy to take all values as strings anyway. So, in place of complicated code to deal with the quotes in the one given row? I just remove them like this:

strOneRow.Replace(",","")

So, now the parsing code can freely do on ONE shot the split command based on the delimiter. And with the quoted value removed, then again this process runs a lot faster.

In my example the delimiter was ",", but you could pass/use say "/" or whatever is being used here.

Give the above basic "design" pattern a try - it will be oh so much less code, but on top of this with so much less code, it will run a whole lot faster.

And I assumed the first row had column names. And again, if they quoted like this:

"FirstName", "LastName", "City", "HotelName"

Then once again, I do the replace trick, and above thus becomes without quotes this:

FirstName, LastName, City, HotelName

And now to create a array of the column names from above, I can do this:

B4X:
    Dim MyColumns()   as string
   MyColmums = regex.Split(",",strOneLine)

And for those lines that are missing a few columns? You can still use split, and you can check the size (how many columns) with this:

B4X:
      Dim OneRow() as string
     OneRow = regex.split(",",strOneRow)
     NumberOfColumnsInRow =  OneRow.Length

So process this data at the row level chunks as a string - not at the byte level - the results will be multiple orders faster.

So, first split into array of rows.

Then each row as string - split that based on the column delimiter.

and if you don't care about quoted data? Then just blast those out with a replace and set them to a empty string (to remove the quoted value).

And now it is one operation to split based on delimiter.

Regards,
Albert D. Kallal
Edmonton, Alberta Canada
 
Last edited:
Upvote 0

Erel

B4X founder
Staff member
Licensed User
Longtime User
This seems to work all fine, but it is about 3-4x slower that StringUtils.LoadCSV. I suppose that this
is to be expected as I take it LoadCSV works directly with Java. Below the B4A code I have and 2 questions about this:
1. B4A code is not slower than Java code.
2. Make sure to test performance in release mode.
3. Try this implementation: [B4X] CSVParser - CSV parser and generator
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User

I am aware of the regex.Split function, but didn't think it would speed things up as I run through the bytes of the file in only one loop and I need the Ifs to check things
and the assignments to keep track of positions. There is a lot of code, but bear in mind that only about a quarter will run as I have 4 different loops, and the one that will run depends on if there are separators or not and enclosers or not.
There are a number of things I haven't tried eg, getting all the end of row positions first and jumping to the next one eg if c > lColumns.
Another thing would be working with one big string rather than a byte array, so you can get the values with SubsString rather than BytesToString. I just guessed it wouldn't be faster.
Will have a look at Erel's CSVParser.bas.

RBS
 
Upvote 0

Albert Kallal

Active Member
Licensed User

I suspect that using split, and dealing with a row as string? Multiple orders faster. You be using "more" built in library code this way, and doing fewer operations. Give the above basic layout loop a try. If the file is large (you don't mention number of rows), then you could adopt a readline approach for each row, but it depends on how many rows you talking about. Writing looping code vs a split? It's not even close - the split much faster. You dealing with chunks and strings - not things at a byte level. However, the first split? Yes, that's going to be a large string and that bottleneck issue can be addressed.

R
Albert
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
>>Writing looping code vs a split? It's not even close - the split much faster.

Only testing will tell, but how do you think split works? No loop there?

RBS
 
Upvote 0

Albert Kallal

Active Member
Licensed User
>>Writing looping code vs a split? It's not even close - the split much faster.

Only testing will tell, but how do you think split works? No loop there?

RBS

sure, but you doing less operations - and that routine is a native android library. this concept centers around what we often call supervisor code, or using OS API code vs writing your own. this is also why say for example a interpreted script running on windows vs that of a compiled program often does not run faster. The % of time spend in that OS in both cases can be similar. And we even see that say in desktop software that compiling that code often does not speed things up.

So, say we have some code to load a form and draw it (that amounts to a bunch of OS calls. So the time spent in that OS code between the interpreted script vs OS code is often not different. Say the command to load and display a form is available as ONE compiled code command vs one interpreted code command? Well, 98% of that code in BOTH cases is spent in that OS code. So compiling the code? It don't speed things up, since both are at the end of the day calling the SAME code routines in the OS or the development stack of tools. Thus compiling that script don't help much - both are spending MOST of their time in the same code base - a code base that been optimized in both cases.

In your case? You spending a lot of time in your code - not the API and OS calls. As a result, you have room and options to use MORE of that OS or api code library code - and as a result, in general when you do this, you enjoy significant performance improvements.

so this concept of OS or supervisor code applies here. You for example can't come close to writing your own character by character loop routine to reaplce some values in a string compared to the built in commands. They are OS code and API calls. And they been optomized far better down to machine level, and take advantge of OS built in funcitons to parse and deal with strings. This would not be any different then doing this:

dim strTest as string
strTest = "This is a test"
log(strTest).

Or now? Try filling up a byte array with 14 sepeate btyes (the length of the above string).
Now, to display those above 14 chars, you doing this:

for OneBye from the ByteArray
log(OneByte) ' displays T
then display "h" on next loop (so, you eventually get This is a test output.

You not only making 14 more OS calls, you also running 14 less commands. You get at least 14x improvement in one command to write out that string as compared to executing 14 commands to output 1 byte at a time.

So display or spit out a string vs some looping code to split out a string? The built in commands are not only faster, but you be using LESS of them. And as a general rule writing your own code vs that of OS or API code tends to be a lot slower. Since those built in commands are (often) at the OS level, and that OS level thus has very high optimized code at the machine level. It kind of like making BIOS calls on the desktop. If such code was not optimized, then the OS speed would be slow - and such code is not slow.

So, you wind up executing less instructions, but more important of those code and instructions are spend in OS or api code, or what was often called supervisor code (os level code). That stuff is fast - often hand coded assembler routines, or c code compiled routines at the OS level.

I mean, in place of the above log command?

To copy that string byte by byte into another array vs say a single assignment of a string to another string var? It is one operation - and a very fast operation. The time to assign a one byte value to another byte value not much different then the speed of the command to assign a 15 character string to another string var. And I don't execute 15 commands to do this when assigning a string, but I would be by using a loop to assign those 15 chars or move those 15 chars. So, I am going to wind up making 15 OS or so called system routine calls in one case, and ONE OS call + operation in the other. You be running less code, but more important each of those commands has a higher % of time spent in that OS code and you making less OS calls to such code.

The above rule is not ALWAYS the case, but is the case in most cases!

And the reverse is also true? I can write hand coded assembler, and call a OS routine to copy a file. Or you can write a interpreted script. Since in both cases you spending 99% of time using the built in OS file copy command routine? Then compiling such code or even had coded assembler will not speed things up, since in both cases, you are calling the SAME library OS code to copy that file. The copy file takes 99% of the time, not that the ONE command to call the file copy command that executes as fast as possible. You can say speed up code that calls that file copy command by a 1000x, but since such a tiny % of that code is the over all execution time, then you not really see any difference in speed.

However, in your case, your writing code that can be replaced by the built in OS level and library commands - so you have lots of room and possibilities for speed improvements.

Edit: a good example is how you noted that using the built in commands is about 4x faster - and the above narrative is why.

R
Albert
 
Last edited:
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User

OK, will bear in mind and do some testing.

RBS
 
Upvote 0

Albert Kallal

Active Member
Licensed User
OK, will bear in mind and do some testing.

RBS
just keep in mind, this "advice" used to be simple, and hard and fast advice. However, today, with these "super" optimizers that Android has? The rule above I suggest applies a LOT less these days - the new systems do such a good job now, that this "easy" rule is not all oh so much simple advice anymore!

R
Albert
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
1. B4A code is not slower than Java code.
2. Make sure to test performance in release mode.
3. Try this implementation: [B4X] CSVParser - CSV parser and generator

I have tested the code from CSVParser.bas and it appears a bit more than twice as slow as the post I posted.
And that is without making any alterations to cope with misformed files.
With misformed files I mean files like for example this:

ID,"Age",
10,20
,
"11",30,,,,
12,"40"

This should still produce:

ID,Age
10,20
,
11,30
12,40

RBS
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
Possible. In most cases parsing is fast enough.
I tested it in B4J and it takes ~5 milliseconds to parse a 140kb CSV file.

I tested on a 12 column 2Mb file.
The other thing I need to take care of is that I get Null values if there are
2 consecutive separators or a trailing separator. This is why I use object arrays.
Not sure now if Sutils.LoadCSV can do that.

RBS
 
Upvote 0

Erel

B4X founder
Staff member
Licensed User
Longtime User
Putting Nulls in strings is a bad idea. Better to put an empty string.
One way to make things faster is to make assumptions on the input data. For example you can make an assumption that there are no escaped quotes inside quotes. This makes it simpler to copy bytes directly from the input sources.

BTW, it looks like your code makes the wrong assumption that each byte is a different character. The fact that data(x) = separator doesn't mean that it is really a separator. This will not work in UTF-8 where code points can take between 1 to 4 bytes.
It can work in UTF-32 but will require more complex code.
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User

The nulls are going in an object array, and this seems to work well.
As this code is to import a .csv file into SQLite, I need the nulls as for example putting an empty string
in an integer column will cause problems. As far as I can see string arrays don't take nulls.

Will have a look at the UTF-8 problem where a code point can take more than one byte.

RBS
 
Upvote 0
Cookies are required to use this site. You must accept them to continue using the site. Learn more…