B4J Question Loading CSV File with Qualified Text (")

Jorge M A

Well-Known Member
Licensed User
Hello, everybody,

I need to upload a CSV file. Texts are qualified by double quotation marks. Some columns may have an embedded comma (company name, product description, etc.).

I have tried StringUtils.LoadCSV2, and I don't find how to let you know about the qualified texts, and it breaks my columns, with every comma it finds.

I've also tried to do it with File.ReadList, and each row in the list contains the entire string including commas and quotation marks.
From here I have no idea how to make a correct Split.

The goal is to display this information in a TableView (w/multiple columns).

Can you guide me or tell me what is (the best) way to do this?

Thank you very much!

[EDIT] I found one column that cause a break, the content is: "MAIL BOXES "FAST POST", S.A. DE C.V." that contains quotes and comma.
I'm stuck in the problem.
 
Last edited:

Daestrum

Well-Known Member
Licensed User
Maybe not the best solution but...
You could read the file as a string, then split using regex on "," then rebuild the missing " that regex will take out.
Example
B4X:
 ' textline could be from reading the file as string
 Dim textline As String = $""MAIL BOXES "FAST POST", S.A. DE C.V.","MAIL BOXES "SLOW POST", S.A. DE C.V.""$
 Dim a() As String = Regex.Split($"",""$,textline) ' split on quote comma quote
 For l=0 To a.Length - 1
  If Not(a(l).EndsWith($"""$)) Then ' rebuild last quote
   a(l) = a(l) & $"""$ 
  End If
  If Not(a(l).startsWith($"""$)) Then ' rebuild first quote
   a(l) = $"""$ & a(l)
  End If
 Next
 ' a() should now contain the lines you want
 For Each j As String In a
  Log(j)
 Next
There are probably cases where this will break in the wrong place.
 

emexes

Well-Known Member
Licensed User
I found one column that cause a break, the content is: "MAIL BOXES "FAST POST", S.A. DE C.V." that contains quotes and comma.
Is that sample just one field (ie, not the entire line)?

If it is, then the unescaped quotes around FAST POST will make it difficult to decide whether the ", is part of the current field, or is the end of it.

But it can be done, for the given example and similar... but not all. What it won't cope with is a field like:

"MAIL BOXES "FAST POST","CENTRAL OFFICE" S.A. DE C.V."

where it is impossible to know whether the "," is part of the one field, or delimits and separates it into two fields eg:

"MAIL BOXES "FAST POST" and
"CENTRAL OFFICE" S.A. DE C.V."

Perhaps step back a little from the problem, and consider:

1/ Can you use a different separator, that does not appear within the data, eg, Excel text file imports default to tabs (ASCII 9)

2/ Can you escape the quotes? Outside of C-like languages, this is commonly done by doubling them, eg:

"MAIL BOXES ""FAST POST"",""CENTRAL OFFICE"" S.A. DE C.V."

I suspect what you really want is just some code that will read your data correctly. If you can't change the data per 1/ or 2/ then: if you post a reasonable sample of the data (hundreds of lines, not just two or three ;-) including lines that currently do not parse (as you expect them to) and any other edge cases that are not basic CSV, then I will probably write code that returns those lines split according to whatever rules you like.

:)
 

Erel

Administrator
Staff member
Licensed User
I found one column that cause a break, the content is: "MAIL BOXES "FAST POST", S.A. DE C.V." that contains quotes and comma.
I'm stuck in the problem.
Your CSV string is invalid. As @emexes wrote, double quotes should be used to escape quotes. If this issue happens with specific terms then read the complete file, replace "FAST POST" with ""FAST POST"" and then save it.
 

emexes

Well-Known Member
Licensed User
I need to upload a CSV file ... correct split ... display this information in a TableView (w/multiple columns)
Is this resolved? I'm mucking about with regex's (regices?) and things are progressing slowly - it is a race between success, or my head turning inside-out.

At the moment I am still powered by interest. If your specific problem is a once-off and you've already fixed it with some manual editing, then the pressure is less.

:)
 

Jorge M A

Well-Known Member
Licensed User
Thank you all so much for sharing your time, effort and talent.

The specific point, as @Erel says, is that the file becomes invalid because of the quality of the information it contains.

The case I described, is just an example, and can occur with various unpredictable terms and in multiple columns.

The files are produced by a third party (I don't know how they produce it), and they are files that can have up to 1 to 3 million records, on a daily basis.

Any pre-process treated as a string, I see it unsuccessful and risky.

Now I don't know how I can do a previous validation, to inform and reject the file...:(
 

Jorge M A

Well-Known Member
Licensed User
I achieved an important breakthrough by using the driver, which raises an exception when reading an erroneous register.
For the moment it is enough to implement the error handling routine.
Thanks to All!
 
Top