Android Question [RESOLVED] Different results from: string.GetBytes, File.ReadBytes and RandomAccessFile.ReadBytes

RB Smissaert

Well-Known Member
Licensed User
Longtime User
Experimenting with reading large files and writing data to SQLite.
With this I noticed there are differences with the above 3 methods to get a byte array.

For example testing with this string (either in a file or as a string):

Phone¬Email¬Website¬Fax
13043¬FA512¬Pharmacy¬Community Pharmacy¬Visible¬

Reading this as string I get bytes -62 and -84 for the ¬ character and and only byte 10 for the linebreak
Reading this string from a text file (.csv) I get only byte -84 for the ¬ character and bytes 13 and 10 for the linebreak.

These are the code fragments to get the byte array:

B4X:
'reading string:
    arrBytes = strString.GetBytes("UTF8")

'reading file, using File:
arrBytes = File.ReadBytes(tFF.strFolder, tFF.strFile)

'reading the file using RAF:
    RAF.Initialize(tFF.strFolder, tFF.strFile, True)
    lBytes = RAF.Size
    Dim arrBytes(lBytes) As Byte
    RAF.ReadBytes(arrBytes, 0, lBytes, 0)

I can work around this, but would like to understand why this happens and how I can predict what will happen.

RBS
 

drgottjr

Expert
Licensed User
Longtime User
bytes is bytes, strings are bytes that have been "massaged" to appear a certain way.
(just as an image is bytes that have been "massaged" to appear as an image. eg, a
byte with a value of 0x02 is a perfectly valid value in an image, but it's not printable
as text.)

text files saved under certain operating systems will store end of line as 2 bytes.
text files saved under unix-like systems will store end of line as 1 byte.
text files saved using a particular character set will look strange if handled using a
different character set.

in java, a byte is a signed entity, so when you reach 127, the values go negative.
in the context of a text file, when java sees the negative value,it knows how to
deal with it (there no negative chr() values). under different languages, a byte
can be unsigned and contain a value up to 255.

a text file may have been saved using ascii, utf8, unicode or any number of other sets.
in the case of, eg, ascii, its "alphabet" can only have 255 characters because
each character is represented by 1 byte and a byte's maximum value is 255
(unless you're programming in java, where you'll see negative values when you reach
127). as for utf8, each character can be represented by up to 4 bytes (thus allowing for
millions of characters, divided up into many different character sets).

when you read a text file as a string, the os reads the bytes (of course) and converts
them to a string of characters based on your system's default character set (known as
your locale). if you see garbage when displaying the string, it's because the file
was saved using a different set (either deliberately or by default). once you see the
garbage, you probably should not try to convert it by getting its bytes and using a
different set. once you see the garbage, it's too late for that. you have to go back
and re-read the file as bytes and then convert to a string using a different character
set until you get it right. that's why websites indicated the character set in the html
document. with a 3rd party text file, you don't know until you read it and see the
garbage, if any. then you go back and start over.
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
bytes is bytes, strings are bytes that have been "massaged" to appear a certain way.
(just as an image is bytes that have been "massaged" to appear as an image. eg, a
byte with a value of 0x02 is a perfectly valid value in an image, but it's not printable
as text.)

text files saved under certain operating systems will store end of line as 2 bytes.
text files saved under unix-like systems will store end of line as 1 byte.
text files saved using a particular character set will look strange if handled using a
different character set.

in java, a byte is a signed entity, so when you reach 127, the values go negative.
in the context of a text file, when java sees the negative value,it knows how to
deal with it (there no negative chr() values). under different languages, a byte
can be unsigned and contain a value up to 255.

a text file may have been saved using ascii, utf8, unicode or any number of other sets.
in the case of, eg, ascii, its "alphabet" can only have 255 characters because
each character is represented by 1 byte and a byte's maximum value is 255
(unless you're programming in java, where you'll see negative values when you reach
127). as for utf8, each character can be represented by up to 4 bytes (thus allowing for
millions of characters, divided up into many different character sets).

when you read a text file as a string, the os reads the bytes (of course) and converts
them to a string of characters based on your system's default character set (known as
your locale). if you see garbage when displaying the string, it's because the file
was saved using a different set (either deliberately or by default). once you see the
garbage, you probably should not try to convert it by getting its bytes and using a
different set. once you see the garbage, it's too late for that. you have to go back
and re-read the file as bytes and then convert to a string using a different character
set until you get it right. that's why websites indicated the character set in the html
document. with a 3rd party text file, you don't know until you read it and see the
garbage, if any. then you go back and start over.
Thanks for that and that was roughly what I expected the answer to this would be.
As I understand it one way to deal with this when parsing a file is to read a chunk
of it as bytes and analyse that and decide how to parse. The other way is just from looking
at the text and trial and error. In other words, there is no single, fool-proof solution.
Luckily, I am only dealing with UK English, medical and demographic data, eg files like this:

RBS
 
Upvote 0

MicroDrie

Well-Known Member
Licensed User
Luckily, I am only dealing with UK English, medical and demographic data, eg files like this:
You are referring to the NHS datasets. The given link gives access to files with an Excel extension. The contents of an Excel file contain text that has been written in a certain way. The method depends, among other things, like the Excel version, the cell formatting, character size, text color, background color and many more, in short, you can encounter an infinite number of possibilities.

You can also save an Excel file as "plain" text without all kinds of formatting codes. Almost 40 years ago I used Excel to read text files, solve errors and then save them as an Excel file or as a plate text file as Comma Separated Version (CSV).

Start with reading this post for Excel file reading in B4A and B4J
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
You are referring to the NHS datasets. The given link gives access to files with an Excel extension. The contents of an Excel file contain text that has been written in a certain way. The method depends, among other things, like the Excel version, the cell formatting, character size, text color, background color and many more, in short, you can encounter an infinite number of possibilities.

You can also save an Excel file as "plain" text without all kinds of formatting codes. Almost 40 years ago I used Excel to read text files, solve errors and then save them as an Excel file or as a plate text file as Comma Separated Version (CSV).

Start with reading this post for Excel file reading in B4A and B4J
I just opened the file as .txt or .csv and processed that, but you are right the files in the mentioned link are Excel files. I know how to read these files and in that case there
is no need for text parsing at all and hadn't thought of that, so thanks for giving me that idea!

RBS
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
I just opened the file as .txt or .csv and processed that, but you are right the files in the mentioned link are Excel files. I know how to read these files and in that case there
is no need for text parsing at all and hadn't thought of that, so thanks for giving me that idea!

RBS
Had a look at this and unfortunately, although they are called Excel files they have all the items in a row concatenated with the ¬ character in one column, so it can't be read as Excel file without parsing.

RBS
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
Had a look at this and unfortunately, although they are called Excel files they have all the items in a row concatenated with the ¬ character in one column, so it can't be read as Excel file without parsing.

RBS
Of course Windows Excel can import/open text files with different (not comma or tab) field delimiters, but not sure this can be done in B4A code with the Android Excel library.
Will have a look at that.

RBS
 
Upvote 0

MicroDrie

Well-Known Member
Licensed User
In line with the Microsof philosophy of making everything unnecessarily more complicated and therefore more user-unfriendly by hiding functions behind all kinds of extra buttons, you should proceed as follows:

  1. Load de file in Excel
  2. Select column A
  3. Select Data Tab
  4. Select Text to column
  5. Select delimited
  6. Select next
  7. Paste ¬ character into Other text field
  8. Give the columns the correct formating like text, number, date, ectr.
  9. Select Finish
That's all that is now hidden behind the data key

If you turn on the macro recorder during this operation, you will have recorded the correct actions per file for the future.
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
In line with the Microsof philosophy of making everything unnecessarily more complicated and therefore more user-unfriendly by hiding functions behind all kinds of extra buttons, you should proceed as follows:

  1. Load de file in Excel
  2. Select column A
  3. Select Data Tab
  4. Select Text to column
  5. Select delimited
  6. Select next
  7. Paste ¬ character into Other text field
  8. Give the columns the correct formating like text, number, date, ectr.
  9. Select Finish
That's all that is now hidden behind the data key

If you turn on the macro recorder during this operation, you will have recorded the correct actions per file for the future.
Yes, I am aware of that, but I was trying to let the user open (that is show) a file in the B4A app by just pointing to the file and doing OK.
When I need the data myself to import into SQLite I will probably use the above.

RBS
 
Upvote 0

emexes

Expert
Licensed User
why this happens and how I can predict what will happen.

I downloaded and looked directly at the clinics data file clinics.csv, and this is what it is:

1/ lines are terminated with the two byte sequence 13 10 ie CR LF
2a/ fields (columns) are separated by byte 172 (which is ASCII 44 comma with high bit set, presumably chosen to simplify parsing of field data that includes commas)
2b/ byte 172 is not valid UTF-8 (unless preceded by a byte 192 thru 255, which in clinics.csv it is not)

Things aren't made any clearer by:

3/ B4A bytes being signed, and thus raw unsigned byte bit patterns 128 thru 255 are interpreted as -128 thru -1
4/ the CSV file extension being somewhat misleading (unless maybe the C stands for Character instead of Comma 🤔 )

1702129033379.png
 
Upvote 0

emexes

Expert
Licensed User
This code works with B4J and will should get you half-way to the finishing line:

B4X:
Dim TR As TextReader
TR.Initialize(File.OpenInput("C:\users\emexe\downloads", "clinics.csv"))

Do While True
    Dim L As String = TR.ReadLine
    If L = Null Then
        Exit
    End If

    L = L.Replace(Chr(65533), Chr(9))    'invalid UTF-8 translates to Chr(65533); change those to tabs

    Log(L)
  
    'this is where you'd split the line up into fields separated by Chr(9)
Loop

TR.Close

Log output:
...
11025836    A4Q9X    Clinic    UNKNOWN    Visible    True    Newmedica Leicester    1 Barton Close    Grove Park    Enderby    Leicester        LE19 1SJ    52.595909118652344    -1.190454363822937    01162 163737    [email protected]    https://www.newmedica.co.uk/clinics/leicester 
11026133    B5K8U    Clinic    UNKNOWN    Visible    False    One Health - Optima Clinic    45 Thwaite Street            Cottingham        HU16 4QX    53.778984069824219    -0.40360575914382935             
11026165    Z4P7L    Clinic    UNKNOWN    Visible    False    Middlewood Clinic    Little Ashfield            Midhurst        GU29 9JP    50.986545562744141    -0.74226260185241699             
11026334    D0V4O    Clinic    UNKNOWN    Visible    False    Arthur Webster Clinic    35 Landguard Manor Road            Shanklin        PO37 7HZ    50.635200500488281    -1.1804198026657104             
11026343    Z0G0T    Clinic    UNKNOWN    Visible    False    Worcester Physiotherapy Clinic    Blueprint Training Unit    14C Weir Lane        Worcester        WR2 4AY    52.176029205322266    -2.2320890426635742             
11026389    G3L4H    Clinic    UNKNOWN    Visible    False    Paragon Clinic    Unit 5-6    Bank Farm Road        Shrewsbury        SY3 6DU    52.696418762207031    -2.7754943370819092             
Program terminated (StartMessageLoop was not called).
 
Upvote 0

emexes

Expert
Licensed User
B4X:
L = L.Replace(Chr(65533), Chr(9))    'invalid UTF-8 translates to Chr(65533); change those to tabs

https://stackoverflow.com/questions/3526965/unicode-issue-with-an-html-title-question-mark-65533

U+FFFD (decimal 65533) is the "replacement character". When a decoder encounters an invalid sequence of bytes, it may (depending on its configuration) substitute � for the corrupt sequence and continue.

One common reason for a "corrupt" sequence is that the wrong decoder has been applied. For example, the decoder might be UTF-8, but the page is actually encoded with ISO-8859-1 (the default if another is not specified in the content-type header or equivalent).
 
Last edited:
Upvote 0

emexes

Expert
Licensed User
B4X:
L = L.Replace(Chr(65533), Chr(9))    'invalid UTF-8 translates to Chr(65533); change those to tabs

Field separator doesn't have to be Chr(9) tab, and can even be multiple characters eg:

B4X:
L = L.Replace(Chr(65533), " | ")    'invalid UTF-8 translates to Chr(65533); change those to " | "

If L.Contains("nilwor") Then    'reduce example log output to postable number of lines
    Log(L)
End If

Log output:
Waiting for debugger to connect...
Program started.
63986 | RWP11 | Clinic | UNKNOWN | Visible | False | Crabbs Cross Clinic | Kenilworth Close |  |  | Redditch | Worcestershire | B97 5JX | 52.281394958496094 | -1.9436973333358765 |  |  |  |
77610 | NTD05 | Clinic | UNKNOWN | Visible | False | Interhealth At Abbey Sefton | Abbey Sefton Hospital | 1 Kenilworth Road | Crosby | Liverpool | Merseyside | L23 3AD | 53.486988067626953 | -3.0373263359069824 |  |  |  |
1521977 | RJC71 | Clinic | UNKNOWN | Visible | False | Kenilworth Clinic | 13 Smalley Place |  |  | Kenilworth | Warwickshire | CV8 1QG | 52.343219757080078 | -1.5806599855422974 |  |  |  |
2917642 | AJ401 | Clinic | UNKNOWN | Visible | False | Sw Healthcare | 38 Kenilworth Close |  |  | Redditch | Worcestershire | B97 5JX | 52.281394958496094 | -1.9436973333358765 |  |  |  |
7353530 | NXX1H | Clinic | UNKNOWN | Visible | False | Scrivens Hearing Care At Crabbs Cross Medical Centre - Redditch | 39 Kenilworth Close |  |  | Redditch | Worcestershire | B97 5JX | 52.281394958496094 | -1.9436973333358765 |  |  |  |
7353624 | NCN0E | Clinic | UNKNOWN | Visible | True | Dmc Community Dermatology Service At Sefton Suite Diagnostic Centre | 1 Kenilworth Road | Crosby |  | Liverpool | Merseyside | L23 3AD | 53.486988067626953 | -3.0373263359069824 | 0151 5416770 |  |  |
Program terminated (StartMessageLoop was not called).
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
https://stackoverflow.com/questions/3526965/unicode-issue-with-an-html-title-question-mark-65533

U+FFFD (decimal 65533) is the "replacement character". When a decoder encounters an invalid sequence of bytes, it may (depending on its configuration) substitute � for the corrupt sequence and continue.

One common reason for a "corrupt" sequence is that the wrong decoder has been applied. For example, the decoder might be UTF-8, but the page is actually encoded with ISO-8859-1 (the default if another is not specified in the content-type header or equivalent).
>> L = L.Replace(Chr(65533), Chr(9)) 'invalid UTF-8 translates to Chr(65533); change those to tabs

Thanks for that and wasn't aware of that and don't understand it fully, but will study that.

For me as developer there are no problems dealing with this file. Simplest is opening it with Excel with ¬ as the column delimiter.
I also have written a custom file parser that can split on any single byte in that case that would then be byte -84.
It is a lot of code, so attached that as a file. I have another version where CSV2List has start byte index and an end byte index arguments, so
it can deal with very large files running it on smaller chunks of that file.
To code for end users opening such files directly in the Android app (so not for example via Windows Excel) is more tricky, but I think
it can be done.

RBS
 

Attachments

  • ParserCode.zip
    3.5 KB · Views: 46
Upvote 0

Mahares

Expert
Licensed User
Longtime User
This code works with B4J and will should get you half-way to the finishing line
@emexes: Why don't you just use StringUtils to painlessly parse the file without having to do a replace or use the infamous textreader.. It will work also in B4A. Am I overlooking something special you are trying to do:
B4X:
Dim su As StringUtils  'stringutils lib
MyList =  su.LoadCSV2(File.dirassets, filename, Chr(65533), Header)
I am pretty sure I can then import that list you are referring to into a SQLite table without any issues.
 
Last edited:
Upvote 1

emexes

Expert
Licensed User
Why don't you just use StringUtils to painlessly parse the file

A good question 👍 which I think has a good answer:

because the files are potentially too large to load into memory, let alone having two copies of the data whilst it is transposed from a temporary List table to its final internal form.

The Android forum in particular seems to have many instances of .LoadCSV users non-painlessly running out of memory.

Reading line-by-line and filtering down to only the records and fields that are needed, and converting on-the-fly to internal format, reduces or eliminates that pain.

In that clinic file are longitudes/latitudes eg 53.486988067626953 -3.0373263359069824 which can quite reasonably be stored as 4-byte Floats rather than ~40 byte Strings, assuming that locating clinics to within a couple of metres is close enough.
 
Upvote 0

emexes

Expert
Licensed User
without having to do a replace

The .Replace doesn't have to be done. I included it as an example of how to get back to having a more-traditional field separator, because using an "error" character as a separator had a bad vibe about it. 🍻
 
Upvote 0
Top