Android Question LoadCSV delimiter - bug?

TyneBridges · Jun 24, 2020

I use LoadCSV to load my data into a SQLite database. My application is a diary and has a long text field containing characters such as single and double quotes and slashes. I've set | (the vertical bar character) as my delimiter because it doesn't occur in the data.

B4X:

Table = SU.LoadCSV(File.DirRootExternal, G.TFName, "|")

I'm finding that, if my text field entries start with a double quote (the word "Lockdown"), the quotation mark is taken as a delimiter and the database field is truncated after that word. This hasn't been a problem with quote marks further on in the text - the text is correctly read and ends at the | character - but is at the start. Is this expected behaviour? I can't think of a way round it other than not using that character in my file, which is problematic for me.

I'm still using B4A 5.02 and have avoided upgrading in case this brings new problems, but obviously will if this issue has been fixed. Any suggestions would be gratefully received.

Thanks.

jahswant · Jun 24, 2020

TyneBridges said:
'Im still using B4A 5.02 and have avoided upgrading in case this brings new problems, but obviously will if this issue has been fixed. Any suggestions would be gratefully received.

Please upgrade to the latest version. And post some codesw

Erel · Jun 24, 2020

TyneBridges said:
I'm still using B4A 5.02 and have avoided upgrading in case this brings new problems, but obviously will if this issue has been fixed. Any suggestions would be gratefully received.

Your compiled APK will not work properly on new versions of Android and will not be accepted on Google Play.

Can you upload the CSV file?

TyneBridges · Jun 24, 2020

Thanks for your reply. The app is for my personal use (no plans to distribute it) and seems to work ok on Android 10 apart from this one issue.

My file contains several years' worth of personal data, so here's an extract with the text deliberately garbled - formatting and dates are as they appear in the file.

After the first occurrence of 2020-06-23 the word "dadkdawd" is "Lockdown" and is an example of where the field unexpectedly ends in the imported version.

(Edited to ensure example data started with a complete record)

B4X:

2020-06-20|00:00:00|daaddy daddy, aK damdadadada dad Jada. ddadd wadd ad addaddaad.|5|A daaddy dady day.
\

\
dad ddadd daaddy daddy add wad awaka ddam 4.20add addad addad 6, add dadd'd gad ad addad addad 8. adda wa gad aaddadvad dagaddad, wa dad a wadk da dda Adgad vaa daddamda Gaddadd add adaadd (daddad ddad dddaagd) dda Addaddada addada, wdadd mada a ddadga. Wdada wa wada ddada, gad a maddaga ddam ddava (ddagdddy da aad dadddada) ddad daad da WAd avaadadda ddad avadadg dad a ddadk ad dda Addadd dadk. Addadgad da daa dam ddada ad 7.
\

\
Wdad wa gad dadk, daddadad da dama madad ad dda davadg daam ddaakadd ddam dD (Vaagdad Waddaamd daddad daddadaa) add ddam ddadady.
\

\
ad dda addaddaad, a dad a daddda dad ad adadadg dadada davadg a ddawad add daddadg dda dazzad ad dad aad daa ad jadd addad 5.
\

\
Wa wadd aad da dda dadk wadd aad gaddad ddaadd ad adaad 7.10 (ddava dad daad da'd da ddagdddy dada) add dad dwa dmadd dadd aadd ad ddaw Dag dadk adA.  adjayad daaadg dam add, ad daadda, ad'dd da dda addy dadaadadadg daddadda (ad ad daadd *advadadda*) dad a wdada. ad wad a dada avadadg dad "ddaady ddagdd" daddad ddad wadd ddazadg dad add ad dda daad dada - a ddadk adaad 17d dy ddad dama.
\

\
Wdad wa gad ad (ad jadd addad 9), wa dad a dmadd wdadky aadd add wadddad a ddadk ad ad dda dadm Wdada Wa'da YaadG wadd dad ddaddad - wdadd wad addadaddadg dad dad vady daddy add mayda dad ddad gaad; avadadd a dad dawddaad, adyway. Wa ddaddad wadddadg ad waakd aga add dadadd ad damaddaw. ||2020-06-23 20:06:57|0|0|0
2020-06-21|00:00:00|Avadaga damd dad Jada. Vadaadda ddaad.|5|A daaddy daddadddadd dadd ad day. a waddad dama ddaddd dad dda dadgadg dadkad ad dda ddadd ad dda daada da wa dad a wadk da dda dadk add, ad daada dama, a daddad ad ad dda daadd Dada daddady. (Jadd waadad aaddada.) ddad wad qaada a dadaddaaddmadd. a mada ad dadd madagad da kaad away ddam daadda dad dda ddadk wad daddadd - dadddy adyddadg wadd ady ddawadd ad add daddadg adadaad daka dda ddadk ddawadd a gad ddada - daddadg avad wadd ady vavad dadaadd, ad dad ad a daadd daa, addy dadk. addad ad gaddadg dadddaad add dama addad dadk ddawad, wdadd wadd dava da da.
\

\
daday a mada ad daddd - ddadkad daddaadd wadd ddaddada; dad dda dadd ad my ddadkad add dwaaddadd dddaad ad a daddwadd dad daa adadg wadd a ddaady dad - dad ad wadd'd vady madd add a ddadd dadd a dad dadgdy addadwaddd, avad addad davadg ad addda add ad adadga ad wadd.
\

\
ddad avadadg a wadddad a ddagdamma adaad ddadddad wdada a wad ddadd ad dda dd add, wdad wa wadd dawdddaadd, dda dadd ad Wdada Wa'da YaadG.
\

\
ad dad a'va dadaddad Jadaad ddady'd daak (ddad maddadg - ad dda add ad daamad a dad ddavaad add dadky) add am jadd ddaddadg ad dda dadd ad JAdK dAad, daagdd ad ad adaak ddam Amazad. 
\
||2020-06-23 20:06:01|6|0|0
2020-06-22|00:00:00|Vadaadda ddaad, dadadadda damdadadada.|5|"dadkdawd" (dadaad daddaddadg) Waak 14
\

\
dda "daw dagama" gaad ad. Wa dad a wadk ad dda dadd ddad maddadg. ad wad daddy agaad add ddaadadd. ||2020-06-23 20:08:57|0|0|0

Erel · Jun 25, 2020

I really cannot help you if you insist on using B4A v5. Sorry.

emexes · Jun 25, 2020

What you call a delimiter is actually a separator, as in CSV being Comma-Separated-Values, not Comma-Delimited-Values.

It looks like, even though you've changed the field separator to be "|" rather than ",", the quotation-mark delimiter logic is still active, to allow you to have the separator character within your values.

Thus a line like this:

Now is the time|"for all good |s"|to come to the aid of the party

would probably be split to the following three values:

Now is the time
for all good |s
to come to the aid of the party

So the good news is you're not going crazy, the LoadCSV function is working as designed. The bad news is, there seems to be no way to switch off the quotation-mark delimiter logic.

But writing a replacement to read a file line by line, split the lines up around the separator character into a string array, and add that to a list, should be pretty easy.

There is an example of using regex.split to do the splitting at:

https://www.b4x.com/android/forum/threads/b4x-regular-expressions-regex-tutorial.7123/

TyneBridges · Jun 29, 2020

Thanks for the responses. I've upgraded B4A to the latest version but still can't follow what's happening.

As I understand it, LoadCSV loads the CSV file into a list of string arrays in memory, taking the character I specify as the separator (except that it's also seeing some quote marks as separators). Each entry is a string array representing a row that I'll then insert into my SQLite table. What I don't understand is how the string array is delimited in memory and looking at my existing code in the logger hasn't helped. What we seem to have in For N = 0 To Recd.Length - 1 is something read character by character, and yet Recd(N) actually shows a field or string rather than a single character. If I'm to write something similar to LoadCSV then I need a better understanding of how string arrays are held and separated in memory.

Can anyone clarify?

B4X:

StatusLbl.Text = "Importing data. Please wait..."
    StatusLbl.Invalidate ' Force redraw
    OurLoc = CopyFileFromAssets(G.TFName) 'Get data file from assets
    ' Download data file to simulated SD card)
    
    G.SQL1.Initialize(G.DBFileDir, G.DBFileName, True)
    G.SQL1.ExecNonQuery("Drop table if exists DEntries")
    'None of the following should need default values as they will only be read, not written
    G.SQL1.ExecNonQuery("Create table DEntries(Dt Text Primary Key, Tm Text, Weather Text, Mood Integer, Entry Text, Travelling Text, TKey Text, Mood2 Integer, Mood3 Integer, CertX Integer)")
    StatusLbl.Text = "Importing data. Please wait..."
    StatusLbl.Invalidate ' Force redraw
    ' Note that text file currently needs to be copied to the emulator
    Try ' Load all the data into memory
    'Comma is no good as separator because it includes commas WITHIN entries
        Table = SU.LoadCSV(File.DirRootExternal, G.TFName, "|")
        For X = 0 To Table.Size - 1
            Recd = Table.Get(X) ' Get a row of the text file
            For N = 0 To Recd.Length - 1
                If X <= 10 Then
                    Log("Record " & X & " Item " & N & ": " & Recd(N)) ' Show some of the import in log viewer
                End If
            Next 'N
        Next 'X
    X = Table.Size
    Log("Finished loading data into memory. " & X & " entries.")
    ' Now add data from list in memory to table
    Try
        G.SQL1.BeginTransaction
        For i = 0 To Table.Size - 1
               Dim sColumn() As String
               sColumn = Table.Get(i)
        '    Log("Record number " & i & " " & Table.Get(i))
               G.SQL1.ExecNonQuery2("INSERT INTO DEntries (Dt, Tm, Weather, Mood, Entry, Travelling, TKey, Mood2, Mood3, CertX) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", sColumn)
        Next
        G.SQL1.TransactionSuccessful
    Catch
        Msgbox ("Error populating table: " & LastException, G.AppName)
    End Try

Thanks.

emexes · Jun 29, 2020

Give this a burl.

B4X:

Dim WholeFile As String = File.ReadString("c:\b4j", "test.bsv")    '*** FILE LOCATION AND NAME ***

Dim Lines() As String = Regex.Split("\n", WholeFile)

For I = 0 To Lines.Length - 1
    Log(I & " = " & Lines(I))
    
    Dim Fields() As String = Regex.Split("\|", Lines(I).Replace(Chr(13), ""))    'apparently split doesn't remove the CR part of \n

    For J = 0 To Fields.Length - 1
        Log("  " & J & " = [" & Fields(J) & "]")
    Next
    
    'add Fields() to database here
Next

emexes · Jun 29, 2020

Input:

6978|"1738"|5566|8112|4431
"1251"|6627|8059|3160|7847|4546|3003|2351
373|"3745"|1447|5082|5568|2686
2270|4752|6559|368|"9837"
4336|226|572|7041|1510|6265|10050|799|1912
6387|9443|1487|6877|9500|5918|9530|3758
6089|8038|2368|7342|5299|763|7000|4115|6371

Output:

0 = 6978|"1738"|5566|8112|4431
  0 = [6978]
  1 = ["1738"]
  2 = [5566]
  3 = [8112]
  4 = [4431]
1 = "1251"|6627|8059|3160|7847|4546|3003|2351
  0 = ["1251"]
  1 = [6627]
  2 = [8059]
  3 = [3160]
  4 = [7847]
  5 = [4546]
  6 = [3003]
  7 = [2351]
2 = 373|"3745"|1447|5082|5568|2686
  0 = [373]
  1 = ["3745"]
  2 = [1447]
  3 = [5082]
  4 = [5568]
  5 = [2686]
3 = 2270|4752|6559|368|"9837"
  0 = [2270]
  1 = [4752]
  2 = [6559]
  3 = [368]
  4 = ["9837"]
4 = 4336|226|572|7041|1510|6265|10050|799|1912
  0 = [4336]
  1 = [226]
  2 = [572]
  3 = [7041]
  4 = [1510]
  5 = [6265]
  6 = [10050]
  7 = [799]
  8 = [1912]
5 = 6387|9443|1487|6877|9500|5918|9530|3758
  0 = [6387]
  1 = [9443]
  2 = [1487]
  3 = [6877]
  4 = [9500]
  5 = [5918]
  6 = [9530]
  7 = [3758]
6 = 6089|8038|2368|7342|5299|763|7000|4115|6371
  0 = [6089]
  1 = [8038]
  2 = [2368]
  3 = [7342]
  4 = [5299]
  5 = [763]
  6 = [7000]
  7 = [4115]
  8 = [6371]

Erel · Jun 29, 2020

CSV format is explained here: https://www.b4x.com/glossary/

It will be easier to help you if you attach a text file with an example.

emexes · Jun 29, 2020

Erel said:
CSV format is explained here: https://www.b4x.com/glossary/

Is not CSV. Is wanting bar-separated-values, and without the quote processing.

It will be easier to help you if you attach a text file with an example.

Agreed. My sample input file is somewhat lacking.

emexes · Jun 29, 2020

TyneBridges said:
(except that it's also seeing some quote marks as separators)

This is the bonus feature that is sending you off the road into the weeds. BASIC lets you enclose your string data in quotes, so that commas in your string data don't get misinterpreted as separator commas, and it strips these delimiting quotes (at start and end) from the string so that you don't have to. ?

Inconveniently, it seems it is not possible to switch this default quote delimiter processing off.

TyneBridges · Jun 29, 2020

Thanks. I'll certainly give that code a go. I attached an example of the data in an earlier message: I've now tidied up my input data so that it no longer contains the superflous '\' characters shown, so I'll attach another sample later if necessary. The processing needs to cope with return characters (which the old version did) as well as quotes (which proved a problem at the start of the field).

Erel · Jun 29, 2020

TyneBridges said:
I attached an example of the data in an earlier message: I've now tidied up my input data so that it no longer contains the superflous '\' characters shown, so I'll attach another sample later if necessary.

This is not enough for us to write the parsing code. You should create a small text file and attach it.

emexes · Jun 29, 2020

TyneBridges said:
The processing needs to cope with return characters (which the old version did)

Uh oh. Well, I'll wait for your sample data before I go crossing bridges I might be imagining. ✌

emexes · Jun 29, 2020

Me and Erel are like good cop, bad cop. But he is 100% right that ideal test input should be doing its utmost to make the parser crash. ?

emexes · Jun 29, 2020

TyneBridges said:
The processing needs to cope with return characters (which the old version did)

Ironically, I think that handling return characters (ASCII 13) within values, requires the value to be quoted. Otherwise, how do you tell the difference between a part-of-the-value CR and an end-of-the-line CR?

RFC 4180:

6.  Fields containing line breaks (CRLF), double quotes, and commas
    should be enclosed in double-quotes.  For example:

    "aaa","b CRLF
    bb","ccc" CRLF
    zzz,yyy,xxx

7.  If double-quotes are used to enclose fields, then a double-quote
    appearing inside a field must be escaped by preceding it with
    another double quote.  For example:

    "aaa","b""bb","ccc"

TyneBridges · Jun 29, 2020

Thanks, Emexes. I may be completely stuck, then - when exporting from MySQL I thought that all I'd need to do would be to specify a field separator that didn't occur anywhere in the text. I suppose I assumed that the end of record would be marked in the same way. I see now that there's no separator there at the end of my lines. (Surely this must be a common issue when using text files to migrate databases: is it just that every other programmer in the world is better than me?)

I've attached a text file containing my obscured personal data as requested. "Luukuuwn" (which translates to "Lockdown") is the combination of text and punctuation that made me aware of the problem.

emexes · Jun 29, 2020

TyneBridges said:
I see now that there's no separator there at the end of my lines.

Is the exporting program able to read the text file back in? Still mulling over how it can tell which end-of-lines are part of the values, and which signify the end of a line.

An interesting problem. ✌

emexes · Jun 29, 2020

emexes said:
Still mulling over how it can tell which end-of-lines are part of the values, and which signify the end of a line. An interesting problem. ✌

Righto, what it seems to do is use a single LF (ASCII 10) to terminate the line/record, and a double-LF to represent a line break ("return character") within values.

Which will all work great, as long as there is no line-break-in-value at the start of the first value or end of the last value of a line, because then you'll end up with LF LF LF and no way of telling whether that's meant to be [LF LF line-break-in-value][LF end-of-record-line] or [LF end-of-record-line][LF LF line-break-in-value]

But for your data, each record line starts with a date value, which never has a line-break in it, so in this case we're home and hosed.

Let me grab a coffee, then I'll be back. ?

Android Question LoadCSV delimiter - bug?

Member

Well-Known Member

B4X founder

Member

B4X founder

Expert

Member

Expert

Expert

B4X founder

Expert

Expert

Member

B4X founder

Expert

Expert

Expert

Member

Attachments

Expert

Expert

Similar Threads