Character encoding / code pages

Discussion in 'Questions (Windows Mobile)' started by TWELVE, May 27, 2008.

  1. TWELVE

    TWELVE Active Member Licensed User

    Hello,

    I've got a question regarding different / localized character encodings.

    In the app I'm working on, I receive a certain text from an internet server. I use two different transport methods to get this text: one is HTTP and the other is POP3 (mail). The text is the same regardless of which transport method I choose, and it always uses the same character encoding.

    The text is then printed in a textbox control.

    The text contains non-US-ASCII characters (German umlauts, for example). For the HTTP method I use the code-page style of the WebResponse object:

    Code:
    Response.New2(1252)

    which is working fine and as expected (Response.New1 leaves me with unprintable characters).

    The text is in the ISO 8859-1 encoding, which is an 8-bit extension of 7-bit ASCII - see ISO/IEC 8859 - Wikipedia, the free encyclopedia.

    Code page 1252, as used in the response, is pretty much the same as ISO 8859-1 except for some control characters, which do not matter in this case - see Windows-1252 - Wikipedia, the free encyclopedia
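
As a small illustration outside Basic4PPC, the relationship between the two encodings can be checked with Python's built-in codecs. This is only a sketch; the variable names are made up for the example and nothing here is Basic4PPC API.

```python
# Byte 0xF6 is "ö" in both ISO 8859-1 and Windows-1252.
raw = bytes([0xF6])
latin1_text = raw.decode("iso-8859-1")
cp1252_text = raw.decode("cp1252")

# The two encodings differ only in the range 0x80-0x9F:
# ISO 8859-1 maps it to control characters, Windows-1252 mostly to printable ones.
diff_byte = bytes([0x80])
diff_latin1 = diff_byte.decode("iso-8859-1")   # U+0080, a control character
diff_cp1252 = diff_byte.decode("cp1252")       # U+20AC, the euro sign

print(latin1_text == cp1252_text, repr(diff_latin1), diff_cp1252)
```

For text made up of letters and umlauts only, decoding with either encoding gives the same result, which is why code page 1252 works for ISO 8859-1 data here.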


    If I receive the same text from the POP3/mail server, I end up with unprintable characters (squares). Using a network sniffer I can see that the text is encoded in exactly the same way as when it is received through HTTP.

    So let's say the text contains a German "ö", which has the hex code F6 in ISO 8859-1. Due to the lack of any code page handling in B4P (except a few instructions such as the mentioned WebResponse.New2), my plan was just to substitute the ISO codes for umlauts using a


    but this does not work, probably because Chr() does not know about code page 1252 or ISO 8859-1.

    Questions regarding that matter:

    - How does Basic4PPC handle different code pages?

    - Does it handle them at all, or does it rely completely on UTF-8 encoding (Chr() does not appear to be able to cope with UTF-8 ???) or on ASCII encoding?

    - What about MIME/quoted-printable encoding?

    - How can I solve the problem outlined above? Manual character conversion is relatively complex and time-consuming.
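
On the quoted-printable question: it is a transfer encoding layered on top of the character set, so a mail body has to be un-quoted first and decoded with the charset second. A minimal sketch in Python (not Basic4PPC), assuming a made-up body line containing "schön" in ISO 8859-1:

```python
import quopri

# Hypothetical mail body line: "schön" in ISO 8859-1, sent as quoted-printable.
encoded_line = b"sch=F6n"

raw_bytes = quopri.decodestring(encoded_line)   # step 1: undo the transfer encoding
text = raw_bytes.decode("iso-8859-1")           # step 2: apply the character set

print(raw_bytes, text)
```

The "=F6" escape carries the same byte value F6 mentioned above; the charset itself still has to come from the mail headers.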


    Kind regards

    TWELVE
     
  2. TWELVE

    TWELVE Active Member Licensed User

    Meanwhile I found a solution for my particular problem:

    Since I use the network library to communicate with the POP3 server and a bitwise object to convert between strings and binary bytes, the following works much like the solution I use for the HTTP response:

    Code:
    bit.New2(1252)

    Although this is working OK for me now, I would still like answers to my questions above...

    cheers

    TWELVE
     
  3. agraham

    agraham Expert Licensed User

    It doesn't and has no need to! .NET uses UTF-16 (2 byte characters) encoding internally so there is no need for code pages. Wide characters are used for ease of string manipulation and indexing so that each character is a known fixed size.

    .NET streams normally convert from UTF-16 to UTF-8 and vice versa on input and output. They can also convert to and from a non UTF-8 single byte character stream. To do so they use an Encoding object associated with a code page. BinaryFile.New2 lets you initialise the Encoding object associated with the stream to the codepage you require.
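
The same idea, a reader that decodes a byte stream with a chosen code page, can be sketched in Python rather than .NET. The sample bytes below are invented for the example and stand in for whatever arrives on the wire:

```python
import io

# Hypothetical bytes as they might arrive from a socket: "Grüße" in code page 1252.
wire_bytes = b"Gr\xfc\xdfe"

# Wrap the byte stream in a text reader that decodes with the stated code page.
reader = io.TextIOWrapper(io.BytesIO(wire_bytes), encoding="cp1252")
text = reader.read()

print(text)
```

Once the reader has done its work, the program only ever sees ordinary Unicode text, which is the point being made about the stream boundary.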

    Chr() is UTF-16 based and only knows about wide characters. By the time characters are inside B4PPC they are UTF-16 characters.

    They will be treated as any other character stream.

    I'm not sure that I completely understand your problem, but any problems are caused at the interface from .NET to the OS or the outside world. Http should be UTF-8 based, which is why the character coding works correctly. Your details on the POP3 stream seem contradictory. From what you say it sounds like the POP3 stream is the same as the Http stream and so is also UTF-8, which would mean that umlauts are encoded as two bytes. But you also say that bit.New2(1252) works, which implies that the stream is actually single-byte characters coded to code page 1252. Bit.New2() is the correct solution in this sort of case, where you are dealing with a single-byte "code paged" character stream.
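
The one-byte versus two-byte distinction is easy to verify. A short Python sketch (illustrative only, not Basic4PPC) that encodes "ö" both ways and shows that a lone F6 byte is not valid UTF-8:

```python
umlaut = "ö"

as_utf8 = umlaut.encode("utf-8")      # two bytes
as_cp1252 = umlaut.encode("cp1252")   # one byte

# A single 0xF6 byte, as seen in a sniffer trace, cannot be decoded as UTF-8.
try:
    as_cp1252.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(as_utf8, as_cp1252, decodes_as_utf8)
```

So a sniffer trace showing a single F6 for "ö" points to a single-byte encoding such as ISO 8859-1 or code page 1252 rather than UTF-8.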
     
    Last edited: May 28, 2008
  4. klaus

    klaus Expert Licensed User

    Last edited: May 28, 2008
  5. Erel

    Erel Administrator Staff Member Licensed User

  6. klaus

    klaus Expert Licensed User

    It's strange, because when I start the help file from the 6.30 IDE and click on the link I get the message Contend not found.

    Best regards.
     
  7. Erel

    Erel Administrator Staff Member Licensed User

    On which topic?
     
  8. klaus

    klaus Expert Licensed User

    Binary file New2

    Attached a screenshot.

    Best regards
     
  9. TWELVE

    TWELVE Active Member Licensed User

    @agraham:


    Maybe internally, but the Chr() help talks about ASCII and a value range of 0 to 255:

    What does this mean..? :) If the compiler and/or the OS is dealing with UTF internally, some conversion might be needed if a character/stream comes in with an encoding different from UTF.


    That's not true. An HTTP stream can use UTF-8, but this is not obligatory. The encoding used/supported is determined by server and client and can be read from an HTTP header.

    I'm afraid this is called a wrong assumption...:D

    Both streams are in the same encoding, which is ISO 8859-1 and NOT UTF-8. So it's clear that, no matter which transport is used, a conversion has to take place.
    Because I cannot guess which encoding a text is in, I need some hint, and this is usually contained in a header.


    That's absolutely true.

    So for me the conclusions from this are as follows:


    - the programmer does not need to take care of character encodings as long as everything is kept in UTF

    - strings in Basic4PPC are in UTF

    - if a (foreign) character/stream from outside enters a Basic4PPC variable,
    a conversion needs to take place if the stream is not in UTF.

    - the conversion can only be done properly if the stream's code page is known and a conversion function supporting a code page is available

    - if no code page is specified, Basic4PPC seems to interpret the non-UTF stream as ASCII (this is why I could read most of the text, but the umlauts were replaced with squares), which is equal to the lower 7 bits of any ISO 8859 charset.
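
The symptom in the last point can be reproduced with a short Python sketch. It only shows how an ASCII decoder treats the byte F6; how Basic4PPC or .NET actually substitutes the character is an assumption and may differ.

```python
# Hypothetical wire data: "schön" in ISO 8859-1 / code page 1252.
wire_bytes = b"sch\xf6n"

# An ASCII decoder understands the 7-bit letters but not the byte 0xF6.
as_ascii = wire_bytes.decode("ascii", errors="replace")

# With the right code page the umlaut survives.
as_cp1252 = wire_bytes.decode("cp1252")

print(as_ascii, as_cp1252)
```

The plain letters come through in both cases; only the byte above 0x7F is lost, which matches text that is mostly readable with squares in place of the umlauts.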

    For an HTTP stream this can be achieved easily by interpreting the Content-Type header, which contains the charset used. The (ISO) charset name then still needs to be converted to a code page number, though.
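
A minimal sketch of that header-to-code-page step, written in Python rather than Basic4PPC. The helper name and the mapping table are hypothetical and cover only three charsets; 28591 is the Windows identifier for ISO 8859-1, and whether a given device supports it is not something this sketch can tell you (1252 is what worked in this thread).

```python
from email.message import Message

# Assumed mapping from charset labels to Windows code page numbers (small subset).
CHARSET_TO_CODE_PAGE = {
    "iso-8859-1": 28591,
    "windows-1252": 1252,
    "utf-8": 65001,
}

def code_page_from_content_type(header_value, default=1252):
    """Hypothetical helper: pick a code page from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    charset = msg.get_content_charset()   # lower-cased label, or None if absent
    return CHARSET_TO_CODE_PAGE.get(charset, default)

page = code_page_from_content_type("text/plain; charset=ISO-8859-1")
fallback = code_page_from_content_type("text/html")

print(page, fallback)
```

The fallback of 1252 for a missing or unknown charset is a choice made for this example, not a rule of HTTP or of the Http library.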

    cheers

    TWELVE
     
  10. agraham

    agraham Expert Licensed User

    It's wrong. Try this "For i = 1024 To 1124 :msg = msg & Chr(i) : Next : msgbox(msg)"
    It might but as I tried to explain any conversion is done by the stream at the boundary of the .NET world and you need to specify the conversion necessary.
    Right. Due to my utter lack of interest, and hence utter lack of knowledge, in all things Webby I made a false assumption. I now understand how the Http stream converted the characters properly without it being UTF-8.
    Correct.
    Correct, held in UTF-16 format each character occupying two bytes.
    Correct, achieved by attaching a .NET Encoding object to the stream and specifying to that object the conversion to be made.
    Correct
    To be pedantic (again :) ) basic4ppc doesn't interpret anything, it receives UTF-16 from a stream. It depends on the stream how the encoding is treated. How are you getting this ASCII default? I assume you are using a BinaryFile object as the stream which if opened by New1 gives you the choice of ASCII or UTF-8 or if opened by New2 requires a codepage to be specified. I see no default behaviour :confused:
    From your experience it looks like the WebResponse object in the Http library takes care of this as it is part of the Http protocol - hence my false assumption of UTF-8. The Network library, not knowing about higher level protocols doesn't and just provides a byte stream which, as you say, may need conversion.

    EDIT :- I'm wrong again about Webby stuff and the WebResponse handling things - I just saw your "Response.New2(1252)" in the first post. I suppose you need to New the WebRequest with the required code page and use the same code page for Newing the WebResponse!
     
    Last edited: May 29, 2008
  11. agraham

    agraham Expert Licensed User

    I've poked around inside the .NET HttpWebRequest object (used by the HTTP library's WebRequest object) and the HttpWebResponse object (also used by the HTTP library) but am a bit hampered by my lack of knowledge. It seems you probably have to set the charset you want by setting the ContentType property (incorrectly named ConnectionType in the HTTP help) of a WebRequest appropriately.

    A WebRequest seems to know nothing about encoding - possibly because web request headers are always ASCII - you will know better than me. Why therefore you specify an encoding for a WebRequest is a mystery to me - so I have asked Erel :) To send data to a server you use a BinaryFile object as a stream writer opened with the encoding you want specified and use it to write to the WebRequest stream.

    A WebResponse knows about encoding as it opens the stream used to return the data when you call WebResponse.GetString with the encoding you specified on New.
     