What happened to the ß?

Discussion in 'Questions (Windows Mobile)' started by JamesC, Apr 20, 2008.

  1. JamesC

    JamesC Member Licensed User

    I have a text file, x, containing some "ß" characters. When I use k = FileRead(x), k contains no "ß" characters. So if I loop through the file reading lines and writing them to a second file, the second file ends up shorter than the first rather than identical.

    This is a big problem if the file is in German!

    What on earth is going on?

    Thanks

    James
     
  2. agraham

    agraham Expert Licensed User

    You may have opened the file in ASCII mode rather than Unicode.
     
  3. JamesC

    JamesC Member Licensed User

    Ah ha!

    Thanks for the quick reply agraham - I guess that must be it!
     
  4. JamesC

    JamesC Member Licensed User

    Problem unresolved

    Using FileOpen(x,filename,cRead,,cAscii), when I FileRead(x), the "ß" characters appear as "?" characters. So writing all the lines to a second file results in a file of the same length.

    Using FileOpen(x,filename,cRead), the "ß" characters disappear altogether as I described earlier.

    This is rather strange.
     
  5. Erel

    Erel Administrator Staff Member Licensed User

    FileOpen(x,filename,cRead) should be able to handle these characters.
    Can you upload a text file with these characters?
     
  6. klaus

    klaus Expert Licensed User

    If the original file was saved with ANSI encoding, you cannot read the special characters.
    If you open the text file with Notepad and save it back with UTF-8 encoding,
    you will get the special characters in B4PPC.
    I ran a test with Notepad: I created a file containing special characters, saved it with ANSI encoding, and read it in B4PPC with FileOpen(c1,"Test.txt",cRead). The special characters don't appear.
    After reopening the same file in Notepad and saving it with UTF-8 encoding, the special characters appear normally in B4PPC.
    The strange thing is that Notepad displays the special characters normally whatever the encoding is.

    Best regards
     
  7. JamesC

    JamesC Member Licensed User

    I created pretty much a random file in Pocket Word to test what was happening, like this:

    Aßßß
    Bßcßßß
    C???ß
    de

    This is saved as a txt file, saying No to the dialog which asks if I want to save it as word instead. When it is reopened in Pocket Word, it appears unchanged.
     
  8. JamesC

    JamesC Member Licensed User

    Uploaded file

    The txt file is attached, although it looks different in Notepad on the PC: the "ß" characters have changed to "฿"! I guess this has something to do with Pocket Word.

    A฿฿฿
    B฿c฿฿฿
    C???฿
    dc


    Also attached is the beginning of original file where I first noticed the problem.
     
  9. JamesC

    JamesC Member Licensed User

    Original file

    attached.
     

    Attached Files:

  10. Erel

    Erel Administrator Staff Member Licensed User

    The last file attached was saved with ASCII encoding rather than UTF-8.
    It doesn't contain any "special" characters.

    You can use Notepad to create the file and then choose Save As - Encoding - UTF-8.
     
  11. klaus

    klaus Expert Licensed User

    Hi JamesC

    Both of your files are OK in Notepad on my desktop.
    I saved them back with UTF-8 encoding as
    testNew.txt and origfileNew.txt

    I have attached a small program with the 4 files so you can test them.

    On my desktop, when I read the original files the special characters are not displayed correctly, but with the New files they are. The only difference is the encoding.
    Could the source of your files save or transmit them with UTF-8 encoding?

    This only gives an explanation; I am afraid it will not solve your problem.

    The 'ß' has code 223, above 127 like many other characters used in European languages. The B4PPC IDE, on the desktop, recognizes these characters.

    The question is: why does B4PPC not recognize characters with codes higher than 127?

    Erel, you were quicker than me, but I am posting it nevertheless.

    Best regards
     
  12. JamesC

    JamesC Member Licensed User

    PGN Files

    Thanks, Klaus. Your program shows very well the disappearing 'ß'.

    The original files I am dealing with are called .pgn files, and are apparently ASCII files. (Incidentally, how do you tell whether a file is an ASCII file or a Unicode file?) The PGN specification begins:

    'PGN is "Portable Game Notation", a standard designed for the representation of chess game data using ASCII text files. PGN is structured for easy reading and writing by human users and for easy parsing and generation by computer programs.'

    An example of an original file, complete with characters that appear as 'ß', can be downloaded from http://www.endgame.nl/MATCHPGN.ZIP

    So I think my particular problem is solved by using FileOpen(x,filename,cRead,,cAscii).
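
On the question of how to tell an ASCII file from a Unicode one, here is a rough sketch. It is written in Python purely for illustration (it is not B4PPC code), and the function name `guess_encoding` and the sample strings are made up for this example. It only applies heuristics: a byte-order mark, whether every byte is below 128, and whether the bytes form valid UTF-8. An 8-bit code page cannot be identified this way, only suspected.

```python
import codecs

def guess_encoding(data: bytes) -> str:
    """Return a rough label for the encoding of data (heuristic only)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 (with BOM)"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if all(b < 128 for b in data):
        # Pure 7-bit data is valid ASCII (and also valid UTF-8 and Latin-1).
        return "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Bytes above 127 that are not valid UTF-8: some 8-bit code page.
        return "8-bit code page (unknown which)"

# Hypothetical sample data; \u00df is the letter sharp s.
samples = {
    "plain": b"1. e4 e5",
    "latin1": "Gro\u00dfmeister".encode("latin-1"),
    "utf8": "Gro\u00dfmeister".encode("utf-8"),
}
results = {name: guess_encoding(raw) for name, raw in samples.items()}
for name, label in results.items():
    print(name, "->", label)
```

A file that passes the "ascii" check reads the same under every encoding discussed in this thread, so only the last case needs special handling.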
     
  13. klaus

    klaus Expert Licensed User

    Hi JamesC

    The problem is the following:
    PGN data is represented using a subset of the eight bit ISO 8859/1 (Latin 1)
    character set that includes the special characters for our languages.
    You can find this in
    http://www.chessclub.com/help/PGN-spec

    The ASCII character set is a 7-bit set and does not include these characters.
    http://en.wikipedia.org/wiki/ASCII

    In both character sets: 1 character = 1 byte

    UTF-8, an encoding of Unicode, codes each character with 1 to 4 bytes, but leaves the ASCII character codes as they are.
    http://en.wikipedia.org/wiki/UTF-8

    Your problem is that B4PPC recognizes only pure ASCII (7 bit) code or UTF-8, but in your files the special characters are coded with only 1 byte, with codes higher than 127.

    In my opinion, using FileOpen(x,filename,cRead,,cAscii) will not solve the problem because the special characters will not be recognized anyway.
    The only difference will be that the special characters will be displayed as a '?' instead of a rectangle.
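
The byte-level difference described above can be seen in a small sketch. It is written in Python only for illustration (not B4PPC), and the variable names are invented for this example. It reproduces the two symptoms reported in this thread, a "?" per character and characters vanishing; whether B4PPC's runtime handles invalid bytes in exactly this way is an assumption.

```python
# The first test line from the thread: 'A' followed by three sharp s.
text = "A\u00df\u00df\u00df"

latin1_bytes = text.encode("latin-1")  # 1 byte per character
utf8_bytes = text.encode("utf-8")      # sharp s takes 2 bytes in UTF-8

print(len(latin1_bytes), latin1_bytes.hex())
print(len(utf8_bytes), utf8_bytes.hex())

# Reading Latin-1 bytes as 7-bit ASCII: each byte above 127 is unknown.
# Python substitutes U+FFFD; it is mapped to '?' here to mimic the display.
shown_as_question_marks = (
    latin1_bytes.decode("ascii", errors="replace").replace("\ufffd", "?")
)

# Reading Latin-1 bytes as UTF-8 while skipping invalid sequences:
# the lone 0xDF bytes are not valid UTF-8, so they are dropped.
dropped = latin1_bytes.decode("utf-8", errors="ignore")

# Decoding with the encoding the file was actually written in works.
correct = latin1_bytes.decode("latin-1")

print(shown_as_question_marks, dropped, correct)
```

The second case is the one that changes the string length, since the invalid bytes leave no trace in the decoded text.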

    A question to Erel:
    Is this limitation due to B4PPC or to .NET?
    If it is B4PPC, why not extend the 7-bit character set to 8-bit ISO 8859/1 (Latin 1)? This would be interesting for many users whose languages have these special characters, and it seems that many files use it.

    The problem is not only the 'ß' character: as I saw in your origfile.txt, you also have ö, ä and others, but of course it's the same subject.

    Best regards
     
  14. Erel

    Erel Administrator Staff Member Licensed User

    There are many standards for the character values between 128 and 255.
    With the BinaryFile library you can use whichever code page is required (provided such a code page exists).
    You should first load the data as raw binary data and then convert it to a string:
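    Code:
    Sub Globals
        Dim buffer(0) As Byte
    End Sub

    Sub App_Start
        Form1.Show
        ' c is the file connection; bin is a BinaryFile object from the BinaryFile library.
        FileOpen(c, "test.txt", cRandom)
        ' 28605 is the code page identifier for ISO 8859-15 (Latin 9); ISO 8859-1 is 28591.
        bin.New2(c, 28605)
        size = FileSize("test.txt")
        ' Resize the buffer to hold the whole file, then read it as raw bytes.
        Dim buffer(size) As Byte
        bin.ReadBytes(buffer(), size)
        ' Convert the raw bytes to a string using the code page given to New2.
        textbox1.Text = bin.BytesToString(buffer(), 0, size)
        FileClose(c)
    End Sub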
    You can find the code pages here: http://msdn2.microsoft.com/en-us/library/ms776446.aspx
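
For readers who want to try the same idea outside B4PPC, here is a minimal sketch in Python (used only for illustration). It follows the approach described above: load the raw bytes, then convert them to a string with an explicit code page. The helper name `read_text_with_codepage` and the temporary test file are assumptions made for this example; `iso8859_15` is Python's name for the ISO 8859-15 codec.

```python
import os
import tempfile

def read_text_with_codepage(path: str, encoding: str) -> str:
    """Read the file as raw bytes, then decode with the given code page."""
    with open(path, "rb") as f:
        raw = f.read()
    return raw.decode(encoding)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")
    # Write two lines similar to the thread's test file, one byte per character
    # (0xDF is the sharp s in ISO 8859-1 and ISO 8859-15).
    with open(path, "wb") as f:
        f.write(b"A\xdf\xdf\xdf\r\nB\xdfc\xdf\xdf\xdf\r\n")
    decoded = read_text_with_codepage(path, "iso8859_15")

print(decoded)
```

Because every byte maps to exactly one character in these 8-bit code pages, the decoded string has the same length as the file.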
     
  15. klaus

    klaus Expert Licensed User

    Thank you Erel!
    It works fine!
    I had never looked in detail into the BinaryFile library, otherwise I would have found it.
    Probably, even if I had read it with no practical problem behind it, I wouldn't have remembered it.

    This shows once again the power of this forum!

    JamesC, that's it!

    Best regards
     
    Last edited: Apr 22, 2008
  16. JamesC

    JamesC Member Licensed User

    Thanks for your help Klaus and Erel! The subject is obviously more complicated than I imagined. I live and learn! :)

    I'm not too worried about the foreign letters displaying as "?". The important thing for me was to note at which byte the data for each game started and finished, so that I could use FileOpen(x,filename,cRandom) and FileGet to quickly extract the required game data. To do this I used FileRead and then added up the lengths of the strings until I reached the marker for the next game. Using FileOpen(x,filename,cRead,,cAscii), I seem to get the right results, whereas using FileOpen(x,filename,cRead) the 'ß' would be ignored, resulting in a shorter string length and hence the wrong byte address for the FileGet. I still don't understand why the 'ß' disappeared (rather than being displayed as a rectangle, say) in Klaus's program, but I have yet to get my head round these different ASCII standards and so on.
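
The offset problem described above can be sketched as follows, in Python purely for illustration (not B4PPC). The idea is to search the raw bytes for the game marker instead of adding up decoded string lengths, so a dropped character cannot shift the addresses. The function name `game_offsets`, the sample data, and the choice of `[Event ` as the marker are assumptions for this example.

```python
def game_offsets(data: bytes, marker: bytes = b"[Event ") -> list:
    """Return the byte offset of every occurrence of marker in data."""
    offsets = []
    pos = data.find(marker)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(marker, pos + 1)
    return offsets

# Hypothetical two-game file; \xdf is a sharp s stored as a single byte.
data = (
    b'[Event "A"]\r\n[White "Gro\xdf"]\r\n\r\n1. e4 e5 *\r\n\r\n'
    b'[Event "B"]\r\n\r\n1. d4 d5 *\r\n'
)

offsets = game_offsets(data)
second_game = data[offsets[1]:]

# A lossy decode that drops the invalid byte yields a shorter string,
# which is why summed string lengths give the wrong byte address.
lossy_length = len(data.decode("utf-8", errors="ignore"))

print(offsets, len(data), lossy_length)
```

Working on bytes keeps the offsets valid whatever encoding is later used to display the text.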

    Anyway, thanks for your patience!
     
  17. klaus

    klaus Expert Licensed User

    Is the attached program something like what you are looking for?
    I used Erel's code and added a search function for the different events in the MATCH.PGN file.

    I hope this will help you.

    Best regards.
     
  18. JamesC

    JamesC Member Licensed User

    Thanks again for the code, Klaus. :) Like I said, opening the file with FileOpen as ASCII is working for me (indeed, I hope to release a beta in the next week), but the approach using the BinaryFile library might prove better in the long run.

    Cheers!

    James
     