What happened to the ß?

JamesC

Member
Licensed User
I have a text file, x, containing some "ß" characters. When I use k=FileRead(x), k contains no "ß" characters. So if I loop through the file reading lines and writing them to a second file, the second file will be shorter than the first, rather than identical.

This is a big problem if the file is in German!

What on earth is going on??

Thanks

James
 

JamesC

Member
Licensed User
Problem unresolved

Using FileOpen(x,filename,cRead,,cAscii), when I FileRead(x), the "ß" characters appear as "?" characters. So writing all the lines to a second file results in a file of the same length.

Using FileOpen(x,filename,cRead), the "ß" characters disappear altogether as I described earlier.

This is rather strange :confused:
 

klaus

Expert
Licensed User
Longtime User
If the original file was saved with ANSI encoding, B4PPC cannot read the special characters.
If you open the text file in Notepad and save it back with UTF-8 encoding, you will get the special characters in B4PPC.
I ran a test with Notepad: I created a file containing special characters, saved it with ANSI encoding, and read it in B4PPC with FileOpen(c1,"Test.txt",cRead). The special characters don't appear.
After opening the same file in Notepad again and saving it with UTF-8 encoding, the special characters appear normally in B4PPC.
The strange thing is that Notepad displays the special characters correctly whatever the encoding is.
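The Notepad round trip described above can also be scripted. The following is a minimal sketch in Python (not B4PPC), assuming that Notepad's "ANSI" means Windows code page 1252, as it does on Western European systems. The function name and the file names are hypothetical.

```python
def reencode_to_utf8(src, dst, source_encoding="cp1252"):
    """Read src using source_encoding and write the same text to dst as UTF-8."""
    # newline="" keeps the original line endings untouched in both directions.
    with open(src, "r", encoding=source_encoding, newline="") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return text

# Hypothetical usage:
# reencode_to_utf8("Test.txt", "TestNew.txt")
```

The single byte 0xDF that stands for "ß" in the ANSI file becomes the two bytes 0xC3 0x9F in the UTF-8 file.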

Best regards
 

JamesC

Member
Licensed User
Quote: "FileOpen(x,filename,cRead) should be able to handle these characters.
Can you upload a text file with these characters?"

I created pretty much a random file in Pocket Word to test what was happening, like this:

Aßßß
Bßcßßß
C???ß
de

This is saved as a txt file, saying No to the dialog which asks if I want to save it as word instead. When it is reopened in Pocket Word, it appears unchanged.
 

JamesC

Member
Licensed User
Uploaded file

Attached txt file, although it looks different in Notepad on the PC, with "ß" characters changed to "฿"!! I guess this is something to do with pocket word.

A฿฿฿
B฿c฿฿฿
C???฿
dc


Also attached is the beginning of original file where I first noticed the problem.
 

klaus

Expert
Licensed User
Longtime User
Hi JamesC

Both of your files are OK in Notepad on my desktop.
I saved them back with UTF-8 encoding as
testNew.txt and origfileNew.txt

I have attached a small program with the 4 files so that you can test them.

On my desktop, when I read the original files the special characters are not displayed correctly, but with the New files they are. The only difference is the encoding.
Could the source of your files save or transmit them with UTF-8 encoding?

This gives only an explanation but I am afraid that it will not solve your problem.

The 'ß' has code 223, which is above 127, like many other characters used in European languages. The B4PPC IDE, on the desktop, recognizes these characters.

The question is why B4PPC does not recognize characters with codes higher than 127.

Erel, you were quicker than me, but nevertheless I still post it.

Best regards
 

JamesC

Member
Licensed User
PGN Files

Thanks, Klaus. Your program shows very well the disappearing 'ß'.

The original files I am dealing with are called .pgn files, and are apparently ASCII files. (Incidentally, how do you tell whether a file is an ASCII file or a Unicode file?) The PGN specification begins:

'PGN is "Portable Game Notation", a standard designed for the representation of chess game data using ASCII text files. PGN is structured for easy reading and writing by human users and for easy parsing and generation by computer programs.'

An example of an original file, complete with characters that appear as 'ß', can be downloaded from http://www.endgame.nl/MATCHPGN.ZIP

So I think my particular problem is solved by using FileOpen(x,filename,cRead,,cAscii).
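On the question of how to tell an ASCII file from a Unicode file: there is no certain test, but the raw bytes give strong hints. Below is a rough heuristic sketched in Python (not B4PPC). The function name and the returned labels are made up for this illustration, and the last case cannot say which 8-bit code page was used.

```python
import codecs

def guess_text_kind(data: bytes) -> str:
    """Rough guess at how a block of text bytes is encoded."""
    # A byte order mark at the start is the clearest sign of a Unicode file.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-bom"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Pure 7-bit ASCII never uses a byte value above 127.
    if all(b < 128 for b in data):
        return "ascii"
    # Bytes above 127 that form valid UTF-8 sequences are very likely UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Otherwise it is some 8-bit code page (Latin-1, Windows-1252, ...).
        return "8-bit"

print(guess_text_kind(b"1. e4 e5"))       # only bytes below 128
print(guess_text_kind(b"A\xdf\xdf\xdf"))  # 0xDF bytes that are not valid UTF-8
```

A file that consists only of bytes below 128 reads the same as ASCII, Latin-1 or UTF-8, so for such a file the distinction does not matter.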
 

klaus

Expert
Licensed User
Longtime User
Hi JamesC

The problem is the following:
PGN data is represented using a subset of the eight bit ISO 8859/1 (Latin 1)
character set that includes the special characters for our languages.
You can find this in
http://www.chessclub.com/help/PGN-spec

The ASCII character set is a 7 bit set and doesn't recognize these characters.
http://en.wikipedia.org/wiki/ASCII

In both character sets: 1 character = 1 byte

UTF-8, an encoding of Unicode, codes each character with 1 to 4 bytes, but leaves the ASCII character codes as they are.
http://en.wikipedia.org/wiki/UTF-8

Your problem is that B4PPC recognizes only pure ASCII (7-bit) or UTF-8, but in your files the special characters are coded with only 1 byte, with codes higher than 127.

For me, using FileOpen(x,filename,cRead,,cAscii) will not solve the problem, because the special characters will not be recognized anyway.
The only difference will be that the special characters are displayed as a '?' instead of a rectangle.
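The two symptoms reported in this thread can be reproduced with a few lines of Python (not B4PPC). This is only an illustration: Python's "ignore" and "replace" error handlers are used here to mimic the vanishing characters and the placeholder characters, which is an assumption about the mechanism and not a description of how B4PPC or .NET is implemented.

```python
# The first test line "Aßßß" as it is stored in a Latin-1 file:
# one byte per character, with 'ß' stored as the single byte 0xDF (223).
raw = b"A\xdf\xdf\xdf"

# Decoded with the right 8-bit character set, all four characters survive.
as_latin1 = raw.decode("latin-1")

# Read as UTF-8, a lone 0xDF is an invalid sequence. If invalid bytes are
# skipped, the 'ß' characters vanish and the string gets shorter.
dropped = raw.decode("utf-8", errors="ignore")

# Read as 7-bit ASCII with replacement, each byte above 127 becomes one
# placeholder character, so the length is preserved.
replaced = raw.decode("ascii", errors="replace")

print(len(raw), len(as_latin1), len(dropped), len(replaced))
```

Only the Latin-1 decoding recovers the text itself. The ASCII decoding with replacement keeps the length but loses the characters.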

A question to Erel:
Is this limitation due to B4PPC or to .NET?
If it is B4PPC, why not extend the 7-bit character set to 8-bit ISO 8859/1 (Latin 1)? This would be interesting for many users whose languages have these special characters, and it seems that many files use it.

The problem is not only the 'ß' character, as I saw in your origfile.txt you have also ö,ä and others, but of course it's the same subject.

Best regards
 

Erel

B4X founder
Staff member
Licensed User
Longtime User
There are many standards for the character values 128-255.
With the BinaryFile library you can use whichever code page is required (if such a code page exists).
You should first load the data as raw binary data and then convert it to a string:
B4X:
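Sub Globals
    'Byte array for the raw file contents; re-dimensioned in App_Start.
    Dim buffer(0) As byte
End Sub

Sub App_Start
    Form1.Show
    'bin is a BinaryFile object (BinaryFile library); c is the file connection.
    FileOpen(c,"test.txt",cRandom)
    '28605 is the code page number for ISO 8859-15 (Latin 9).
    bin.New2(c,28605)
    size = FileSize("test.txt")
    Dim buffer(size) As byte
    'Read the whole file as raw bytes, then decode them to a string.
    bin.ReadBytes(buffer(),size)
    textbox1.Text = bin.BytesToString(buffer(),0,size)
    FileClose(c)
End Sub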
You can find the code pages here: http://msdn2.microsoft.com/en-us/library/ms776446.aspx
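For comparison, the same two-step idea (load raw bytes, then decode them with a named code page) looks like this in Python. This is a sketch only. The function name is hypothetical, and Python's "iso8859_15" codec is used as the counterpart of code page 28605.

```python
def read_with_code_page(path, encoding="iso8859_15"):
    """Load a file as raw bytes, then decode the bytes with the given code page."""
    with open(path, "rb") as f:   # "rb": no decoding happens while reading
        raw = f.read()
    return raw.decode(encoding)   # every byte maps to exactly one character

# Hypothetical usage:
# text = read_with_code_page("test.txt")
```

Because every byte maps to exactly one character in an 8-bit code page, the decoded string has the same length as the file has bytes.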
 

klaus

Expert
Licensed User
Longtime User
Thank you Erel !
It works fine !
I had never looked in detail into the BinaryFile library, otherwise I would have found it.
Probably, even if I had read it with no practical problem behind it, I wouldn't have remembered it.

This shows once again the power of this forum !

JamesC, That's it !

Best regards
 

JamesC

Member
Licensed User
Thanks for your help Klaus and Erel! The subject is obviously more complicated than I imagined. I live and learn! :)

I'm not too worried about the foreign letters displaying as "?". The important thing for me was to note at which byte the data for each game started and finished, so that I could use FileOpen(x,filename,cRandom) and FileGet to quickly extract the required game data. To do this I used FileRead and then added up the lengths of the strings until I reached the marker for the next game. Using FileOpen(x,filename,cRead,,cAscii), I seem to get the right results, whereas using FileOpen(x,filename,cRead) the 'ß' would be ignored, resulting in a shorter string length and hence the wrong byte address for the FileGet. I still don't understand why the 'ß' disappeared (rather than being displayed as a rectangle, say) in Klaus's program, but I have still to get my head round these different ASCII standards etc.
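The offset bookkeeping described above does not depend on decoding at all if the line lengths are counted in bytes. Here is a sketch in Python (not B4PPC), assuming that each game starts with a line beginning with [Event, the first tag of a PGN game. The function name and the marker parameter are hypothetical.

```python
def game_offsets(path, marker=b"[Event "):
    """Return the byte offset at which each line starting with marker begins."""
    offsets = []
    position = 0
    with open(path, "rb") as f:
        for line in f:                 # each line is bytes, line ending included
            if line.startswith(marker):
                offsets.append(position)
            position += len(line)      # length in bytes, not in characters
    return offsets

# Hypothetical usage: jump straight to the second game in a file.
# offsets = game_offsets("games.pgn")
# with open("games.pgn", "rb") as f:
#     f.seek(offsets[1])
```

Counting bytes gives the right address however the special characters are later displayed, which is why a read that keeps one character per byte gave correct results while a read that dropped the 'ß' did not.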

Anyway, thanks for your patience!
 

JamesC

Member
Licensed User
Thanks again for the code, Klaus. :) Like I said, the FileOpen as ASCII is working for me (indeed, I hope to release a beta in the next week), but the approach using the binary library might prove better in the long run.

Cheers!

James
 