Android Question Discard Unicode In Listview

bocker77

Active Member
Licensed User
Longtime User
I import CSV data downloaded from a website into a database. One of the fields is a string that can contain Unicode characters, and I add these strings to a Listview. Some of the Unicode characters are displayed in the Listview as either a box or the replacement character �. I can easily swap the replacement character for a single quote, but I can't seem to handle the others. The culprits are typically the left and right double quotes and a few others. When sending the strings (variables) to the B4A Log I noticed that the offending characters are discarded, and I am wondering what the Log command uses to do this. I was going to use the log so that I could see the hex codes in an editor, but none of those characters show up.

Thanks,
Greg
 

bocker77

Active Member
Licensed User
Longtime User
Here is what it looks like in the csv file before importing into a database.

Lincolnﾒs Farewell to Springfield

but it shows up in the database as a BLOB when viewed in DB Browser for SQLite.

When viewed in the Listview, the "ﾒ" is displayed as a box. All the ASCII characters are displayed correctly.

I use sqlite3.exe in a VBScript to import the data into a new database. That database file is then brought into my app by an SMB2 function. Maybe I can find an encoding that replaces those characters with �; I would then also have to find out how to make sqlite3.exe use that encoding. When importing with "DB Browser for SQLite" they all show up as the replacement character �. If I can get sqlite3.exe to do the same, I can handle it with a string replacement. I will contact the SQLite forum and ask this question.

Still this doesn't answer the question as how the B4A Log command discards the Unicode. However that is done would be nice to know.
 

agraham

Expert
Licensed User
Longtime User
Still this doesn't answer the question as how the B4A Log command discards the Unicode.
I don't think it does. I think it's converted before it gets to a string in B4A. When you convert Unicode to a 7- or 8-bit Windows code page, the out-of-page characters appear as question marks or boxes.
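
One way to check where the characters are lost is to log the numeric value of each Char rather than the text itself. This is only a minimal sketch: LogCodePoints is a hypothetical helper name, and it reports UTF-16 code units (which is what a B4A Char holds), not raw file bytes.

```b4x
' Hypothetical helper: logs the position and hex value of every Char in Text.
Sub LogCodePoints (Text As String)
    For i = 0 To Text.Length - 1
        Dim c As Char = Text.CharAt(i)
        ' Asc returns the numeric value of the Char; Bit.ToHexString formats it as hex.
        Log(i & ": U+" & Bit.ToHexString(Asc(c)))
    Next
End Sub
```

If the value logged for the suspect position is already 3f (a question mark) or fffd (the replacement character), the damage happened before the string reached B4A.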
 

teddybear

Well-Known Member
Licensed User
This is a text encoding issue. Manually download the CSV data to a local file and check its encoding using Notepad; I guess it might be ANSI or UNICODE. Save it with UTF-8 encoding, then load it into the database and see if it is OK.
 

bocker77

Active Member
Licensed User
Longtime User
Let me try this again and see if I can explain this a little better because I am so confused. Some of the data in my database looks like these lines below which are in utf8. These lines are originally coming from HTML downloaded in a csv file.

Lincolnﾒs Farewell to Springfield (the unprintable character is a right single quote, hex code "0xef 0xbe 0x92")
The K�enster Building (the unprintable character is an umlaut "u", hex code "0xef 0xbf 0xbd")

From the forum I use this code.

B4X:
bFix = strTitle.GetBytes("utf8")
strTitle = BytesToString(bFix, 0, bFix.Length, "windows-1252")

Then I get this. BTW, character set "Windows-1252" is the closest I have come to seeing some of the characters.

Lincolnï¾’s Farewell to Springfield (the right single quote is displayed along with garbage in front, hex code "0xc3 0xaf 0xc2 0xbe 0xe2 0x80 0x99")
The Kï¿½enster Building (the umlaut "u", and BTW all umlauts look like this, hex code "0xc3 0xaf 0xc2 0xbf 0xc2 0xbd")

For the first one I can do something like this, but it doesn't seem like I should need to.

B4X:
strTitle = strTitle.Replace("ï¾", "")

I have tried numerous encodings and am getting nowhere. As you can see, I am not proficient in encoding/decoding techniques. Also, I am not sure why this is so hard.
 

DonManfred

Expert
Licensed User
Longtime User
These lines are originally coming from HTML downloaded in a csv file.
Can you post the CSV file, please? With its encoding UNCHANGED.
 

teddybear

Well-Known Member
Licensed User
First you should check what encoding the downloaded CSV file uses; that is the key to converting the encoding. You can see its encoding using Notepad or Notepad++.
 

bocker77

Active Member
Licensed User
Longtime User
I actually think that it is how the csv file is saved. I will attach the csv file but feel I need to see if the editor of the website could save his HTML data in the Title field a different way.
 

bocker77

Active Member
Licensed User
Longtime User
I am not seeing the csv file being attached. Let me zip it and see if that makes a difference. That hopefully fixed it.
 

Attachments

  • Markers.zip (1.1 KB)

emexes

Expert
Licensed User
My first pass turned out to be a success, not a fail (turns out I can't count bits accurately :rolleyes: )

The what-seems-to-be-an-apostrophe in Lincoln's is a three-byte UTF-8 sequence:

hex: EF BE 92
binary: 1110 1111 : 1011 1110 : 1001 0010

which should be Unicode character:

binary: 1111 111110 010010 = 1111 1111 1001 0010
hex: FF92

which is:

ﾒ Halfwidth Katakana Letter Me
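
The bit arithmetic above can be sketched in B4X as follows. This is an illustration only: the three byte values are typed in by hand as Ints, and the variable names are made up for the example.

```b4x
' The three bytes of the sequence, held as Ints to avoid signed-byte issues.
Dim b0 As Int = 0xEF
Dim b1 As Int = 0xBE
Dim b2 As Int = 0x92
' Three-byte UTF-8 layout: 1110xxxx 10yyyyyy 10zzzzzz -> xxxx yyyyyy zzzzzz
Dim part1 As Int = Bit.ShiftLeft(Bit.And(b0, 0x0F), 12) ' top 4 bits
Dim part2 As Int = Bit.ShiftLeft(Bit.And(b1, 0x3F), 6)  ' middle 6 bits
Dim part3 As Int = Bit.And(b2, 0x3F)                    ' low 6 bits
Dim codePoint As Int = Bit.Or(Bit.Or(part1, part2), part3)
Log(Bit.ToHexString(codePoint)) ' logs ff92
```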
 

bocker77

Active Member
Licensed User
Longtime User
Here is the website that I download the csv files from. To view one of these markers you can use the number in the first column in an advanced search on the site.

 

emexes

Expert
Licensed User
When viewed in the Listview it displays the "ﾒ" as a box. All the other ascii characters are displayed.

My first guess would be that the font used in the Listview does not contain glyphs for all ~140,000 Unicode characters, and that "ﾒ" is one of the missing glyphs.

When I first load the file into Windows Notepad it shows as:

[screenshot: 1664492827373.png]

but when I change the font to something more comprehensive, like Arial:

[screenshot: 1664492876489.png]

then, instead of the "unknown" character displaying as a placeholder character, it now shows correctly(?):

[screenshot: 1664492907743.png]
 

bocker77

Active Member
Licensed User
Longtime User
emexes,

Yes, I saw what you have discovered, but getting to the "right quote" requires the string replace that I noted above. That seems to work for those characters (left quote, left double quote, etc.), but the umlauts are a different story.

The website's content is user contributed and stored in their database, then saved to a CSV file, so whoever adds a historical marker can enter Unicode characters in the titles. The problem seems to be how the CSV file is saved. I get whatever their CSV download function provides. As stated, the Listview in my app displays garbage for these characters. Not very professional, I might add.
 

emexes

Expert
Licensed User
Still this doesn't answer the question as how the B4A Log command discards the Unicode. However that is done would be nice to know.

ASCII 0x00 - 0x7F = Unicode 0x0000 - 0x007F = UTF-8 0x00 - 0x7F ie high bit is 0

Any character beyond ASCII (i.e. > 0x7F) encodes to a multibyte UTF-8 sequence in which every byte has its high bit set to 1.

So to filter out non-ASCII characters from UTF-8: discard all bytes in the string that have the high bit set.

Or convert to an array of Chars, and if any of the Chars are > 127, rebuild the String from the array, leaving out the Chars > 127.
(If none of the Chars are > 127, you can just use the original string, i.e. no need to rebuild it.)
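
A minimal sketch of the Char-based approach, assuming a plain loop over the string is acceptable. StripNonAscii is a hypothetical helper name, and it always rebuilds the string rather than first checking whether a rebuild is needed.

```b4x
' Hypothetical helper: returns Text with every Char above 127 removed.
Sub StripNonAscii (Text As String) As String
    Dim sb As StringBuilder
    sb.Initialize
    For i = 0 To Text.Length - 1
        Dim c As Char = Text.CharAt(i)
        ' Keep only 7-bit ASCII; everything else is dropped.
        If Asc(c) <= 127 Then sb.Append(c)
    Next
    Return sb.ToString
End Sub
```

For example, Log(StripNonAscii("Lincolnﾒs Farewell")) would log "Lincolns Farewell". Note that this drops the character entirely, so the apostrophe is lost rather than restored.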
 

bocker77

Active Member
Licensed User
Longtime User
emexes,

I believe that the Listview has no problem displaying any encoding. The hex codes for these characters in the DB I created from the downloaded CSV files are corrupted. Garbage in, garbage out.
 

emexes

Expert
Licensed User
but to get to the "right quote" requires that string replace that I noted above.
That seems to work for those characters, "Left Quote, Left double quote. etc."
but the umlauts are a different story.

Does this get the different story back on track? :

B4X:
Dim Historical As String = "4611,Lincolnﾒs Farewell to Springfield,39.79933,-89.64238"
Dim Filtered As String = Historical.Replace(Chr(0xFF92), "'")
Log(Filtered)
 