Character encoding / code pages

TWELVE · May 27, 2008

Hello,

I've got a question regarding different / localized character encodings.

In the app I'm working on, I receive some text from an internet server. I use two different transport methods to get this text: one is HTTP and the other is POP3 (mail). The text is the same regardless of which transport method I choose, and it always uses the same character encoding.

The text is then printed in a textbox control.

The text contains non-US-ASCII characters (German umlauts, for example). For the HTTP method I create the WebResponse object with a code page:

B4X:

Response.New2(1252)

which works fine and as expected (Response.New1 leaves me with unprintable characters).

The text is in the ISO 8859-1 encoding, which is an 8-bit extension of 7-bit ASCII (see the Wikipedia article on ISO/IEC 8859).

Code page 1252, used in the response, is almost the same as ISO 8859-1. The two differ only in the range 0x80 to 0x9F, where ISO 8859-1 has control characters and Windows-1252 has printable ones, which does not matter in this case (see the Wikipedia article on Windows-1252).
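As a small illustration of that relationship (in Python rather than Basic4PPC, purely as a sketch), the byte F6 decodes to the same character under both encodings, while a byte in the 0x80 to 0x9F range does not. The variable names are made up for the example.

```python
# Byte 0xF6 is "ö" in both ISO 8859-1 and Windows-1252.
umlaut_byte = b"\xf6"
as_latin1 = umlaut_byte.decode("iso-8859-1")
as_cp1252 = umlaut_byte.decode("cp1252")

# The two encodings only disagree in the range 0x80-0x9F.
euro_byte = b"\x80"
euro_cp1252 = euro_byte.decode("cp1252")      # printable euro sign
euro_latin1 = euro_byte.decode("iso-8859-1")  # C1 control character U+0080

print(as_latin1 == as_cp1252, repr(euro_cp1252), repr(euro_latin1))
```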

If I receive the same text from the POP3/mail server, I end up with unprintable characters (squares). Using a network sniffer I can see that the text is encoded in exactly the same way as when it is received through HTTP.

So let's say the text contains a German "ö", which has the hex code F6 in ISO 8859-1. Due to the lack of any code page handling in B4PPC (except for a few instructions such as the WebResponse.New2 mentioned above), my plan was simply to substitute the ISO codes for umlauts using

text = StrReplace(text,Chr(246),"ö")

but this does not work, probably due to the fact that Chr() does not know about Code Page 1252 or ISO8859-1.
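A rough sketch in Python of why a replace after the fact cannot work. It assumes, without knowing exactly what the library does, that the incoming bytes were already decoded with the wrong encoding, so the original byte F6 is gone by the time the string exists. Note also that character 246 already is "ö" in Unicode, so the replace would be a no-op even if it matched.

```python
raw = b"sch\xf6n"  # "schön" as sent by the server in ISO 8859-1

# Assumption: the receiving side decoded the bytes without the right code page.
wrong = raw.decode("ascii", errors="replace")

# Replacing code point 246 afterwards cannot help: the damaged string no
# longer contains it, and chr(246) is "ö" itself anyway.
patched = wrong.replace(chr(246), "ö")

# Decoding with the right code page up front gives the expected text.
right = raw.decode("cp1252")

print(repr(wrong), repr(patched), repr(right))
```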

Questions regarding that matter:

- How does Basic4PPC handle different code pages?

- Does it handle them at all, or does it rely completely on UTF-8 encoding (Chr() does not appear to be able to cope with UTF-8) or on ASCII encoding?

- What about MIME/quoted-printable encoding?

- How can I solve the problem outlined above? Manual character conversion is relatively complex and time-consuming.

Kind regards

TWELVE

TWELVE · May 27, 2008

Meanwhile I found a solution for my particular problem:

Since I use the Network library to communicate with the POP3 server and a Bitwise object to convert between strings and binary bytes, the following works much like the solution I use for the HTTP response:

B4X:

 bit.New2(1252)

Although this is working OK for me now, I would still like my questions above answered...

cheers

TWELVE

agraham · May 28, 2008

TWELVE said:
how does Basic4PPC handle different code pages ?

It doesn't and has no need to! .NET uses UTF-16 (2 byte characters) encoding internally so there is no need for code pages. Wide characters are used for ease of string manipulation and indexing so that each character is a known fixed size.

does it at all or does it completely rely on UTF-8 encoding ?

.NET streams normally convert from UTF-16 to UTF-8 and vice versa on input and output. They can also convert to and from a non UTF-8 single byte character stream. To do so they use an Encoding object associated with a code page. BinaryFile.New2 lets you initialise the Encoding object associated with the stream to the codepage you require.
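The idea of an encoding object attached to a stream can be sketched in Python (an analogy only, not the .NET or Basic4PPC API): the wrapper converts bytes to Unicode text as they are read, using whatever code page it was given. The payload is invented for the example.

```python
import io

# Hypothetical payload: "Grüße" as single-byte code page 1252 data.
payload = "Grüße".encode("cp1252")

# The wrapper plays the role of the encoding object attached to the stream:
# bytes go in, Unicode text comes out.
reader = io.TextIOWrapper(io.BytesIO(payload), encoding="cp1252")
text = reader.read()

print(payload, text)
```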

( Chr() does not appear to be able to cope with UTF-8 ???) or on ASCII encoding ?

Chr() is UTF-16 based and only knows about wide characters. By the time characters are inside B4PPC they are UTF-16 characters.

what about MIME/quoted-printable encoding ?

They will be treated as any other character stream.

how can i solve my problem outlined above ?

I'm not sure that I completely understand your problem, but any problems arise at the interface between .NET and the OS or the outside world. HTTP should be UTF-8 based, which is why the character coding works correctly. Your details on the POP3 stream seem contradictory. From what you say it sounds like the POP3 stream is the same as the HTTP stream and so is also UTF-8, which would mean that umlauts are encoded as two bytes. But you also say that bit.New2(1252) works, which implies that the stream is actually single-byte characters coded to code page 1252. Bit.New2() is the correct solution in this sort of case, where you are dealing with a single-byte "code paged" character stream.
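The difference between the two cases is easy to see at the byte level. The Python sketch below shows "ö" as two bytes in UTF-8 and one byte in ISO 8859-1, plus a hypothetical helper, looks_like_utf8, that could be applied to bytes copied from a sniffer trace to tell which case applies.

```python
utf8_bytes = "ö".encode("utf-8")          # two bytes: C3 B6
latin1_bytes = "ö".encode("iso-8859-1")   # one byte: F6


def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_bytes, latin1_bytes)
print(looks_like_utf8(b"sch\xc3\xb6n"), looks_like_utf8(b"sch\xf6n"))
```

A lone F6 byte is not valid UTF-8, so a stream containing it must be in some single-byte code page.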

klaus · May 28, 2008

Hi TWELVE,

JamesC had a similar problem with german characters coded in a single byte.
http://www.b4x.com/forum/questions-help-needed/2168-what-happened-ss.html

Erel's solution was the same with the binary file and bin.New2(c,Code Page number).

Hi Erel,
The link in the help file for the Code Page numbers doesn't work anymore; it says "Content not found".

Best regards.

Erel · May 28, 2008

Code page link: Code Page Identifiers
It was updated in version 6.30.

klaus · May 28, 2008

It's strange, because when I start the help file from the 6.30 IDE and click on the link I get the message "Content not found".

Best regards.

Erel · May 28, 2008

On which topic?

klaus · May 28, 2008

Binary file New2

Attached a screenshot.

Best regards

TWELVE · May 29, 2008

@agraham:

Chr() is UTF-16 based and only knows about wide characters. By the time characters are inside B4PPC they are UTF-16 characters

Maybe internally, but the Chr() help talks about ASCII and a value range of 0 to 255:

Returns the ASCII character represented by the given number.
Syntax: Chr (Integer)
Integer ranges from 0 to 255.

what about MIME/quoted-printable encoding ?

They will be treated as any other character stream.

What does this mean..?

If the compiler and/or the OS deals with UTF internally, some conversion might be needed if a character/stream comes in with an encoding other than UTF.
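For quoted-printable in particular, there are two layers to undo: first the =XX escapes back to raw bytes, then the bytes to text using the charset named in the mail header. A minimal Python sketch, assuming the charset is ISO 8859-1 and using an invented sample string:

```python
import quopri

encoded = b"sch=F6n"  # "schön" in quoted-printable, charset ISO 8859-1

# Step 1: undo the transfer encoding, giving the raw single-byte data.
raw = quopri.decodestring(encoded)

# Step 2: convert the bytes to text with the charset from the header.
text = raw.decode("iso-8859-1")

print(raw, text)
```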

Http should be UTF-8 based which is why the character coding works correctly.

That's not true. An HTTP stream can use UTF-8, but this is not obligatory. The encoding used is determined by server and client and can be read from an HTTP header.

Your details on the POP3 stream seem contradictory. From what you say it sounds like the POP3 stream is the same as the Http stream and so is also UTF-8 which would mean that umlauts are encoded as two bytes.

I'm afraid this is called a wrong assumption...

Both streams are in the same encoding, which is ISO 8859-1 and NOT UTF-8. So it's clear that, no matter which transport is used, a conversion has to take place.
Because I cannot guess which encoding a text is in, I need some hint, and this is usually contained in a header.

But you also say that bit.New2(1252) works which implies that the stream is actually single byte characters coded to code page 1252.

That's absolutely true.

So for me the conclusions from this are as follows:

- the programmer does not need to take care about character encodings as long as everything is kept in UTF

- strings in basic4ppc are in UTF

- if a (foreign) character/stream from outside enters a basic4ppc variable,
a conversion needs to take place, if the stream is not in UTF.

- the conversion can only be done properly, if the stream's code page is known and a conversion function supporting a code page is available

- if no code page is specified, basic4ppc seems to interpret the non-UTF stream as ASCII (this is why I could read most of the text, but the umlauts were replaced with squares), which equals the lower 7 bits of any ISO 8859 charset.

For an HTTP stream this can be achieved easily by interpreting the Content-Type header, which contains the charset used. The (ISO) charset name still needs to be converted to a code page, though.
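A sketch of that header-to-code-page step, in Python. The function name code_page_for, the small lookup table, and the fallback to 1252 are assumptions made for the example; the numbers themselves are the standard Windows code page identifiers for those charsets.

```python
from email.message import Message

# Windows code page identifiers for a few common charset names.
CODE_PAGES = {
    "us-ascii": 20127,
    "iso-8859-1": 28591,
    "windows-1252": 1252,
    "utf-8": 65001,
}


def code_page_for(content_type: str, default: int = 1252) -> int:
    """Map a Content-Type header value to a code page (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lower-cased name, or None if absent
    return CODE_PAGES.get(charset, default)


print(code_page_for("text/html; charset=ISO-8859-1"))
```

Since code page 1252 decodes the umlauts the same way as ISO 8859-1, either 28591 or 1252 would work for the text discussed in this thread.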

cheers

TWELVE

agraham · May 29, 2008

TWELVE said:
Maybe internal, but Chr() help is talking about ASCII and a value range of 0 to 255:

It's wrong. Try this "For i = 1024 To 1124 :msg = msg & Chr(i) : Next : msgbox(msg)"
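The same experiment can be reproduced in Python, whose chr() also works on Unicode code points rather than on a 0 to 255 range. Code points 1024 to 1124 fall in the Cyrillic block, so the result is a run of Cyrillic letters.

```python
# Build a string from code points 1024..1124 inclusive, as in the loop above.
msg = "".join(chr(i) for i in range(1024, 1125))

print(len(msg))
print(msg)
```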

What does this mean..? If the Compiler and or the OS is dealing with UTF internally, some conversion might be needed if a character/stream is coming in with an encoding different from UTF.

It might but as I tried to explain any conversion is done by the stream at the boundary of the .NET world and you need to specify the conversion necessary.

That's not true ... encoding is determined by server and client and can be read from a http header.

Right. Due to my utter lack of interest, and hence utter lack of knowledge, in all things Webby I made a false assumption. I now understand how the Http stream converted the characters properly without it being UTF-8.

the programmer does not need to take care about character encodings as long as everything is kept in UTF

Correct.

strings in basic4ppc are in UTF

Correct, held in UTF-16 format each character occupying two bytes.

if a (foreign) character/stream from outside enters a basic4ppc variable, a conversion needs to take place, if the stream is not in UTF.

Correct, achieved by attaching a .NET Encoding object to the stream and specifying to that object the conversion to be made.

the conversion can only be done properly, if the stream's code page is known and a conversion function supporting a code page is available

Correct

if no code page is specified, basic4ppc seems to interpret the non-UTF stream as ASCII ( this is why i could read most of the text, but the umlauts were replaced with the squares), which equals to the lower 7 Bit of any ISO8859 charset.

To be pedantic (again), basic4ppc doesn't interpret anything; it receives UTF-16 from a stream. How the encoding is treated depends on the stream. How are you getting this ASCII default? I assume you are using a BinaryFile object as the stream, which if opened by New1 gives you the choice of ASCII or UTF-8, or if opened by New2 requires a code page to be specified. I see no default behaviour.

For a http stream this can be achieved easily by interpreting the content-type header, which contains the used charset. But the (ISO-)Charset number needs to be converted to a code page, though.

From your experience it looks like the WebResponse object in the Http library takes care of this as it is part of the Http protocol - hence my false assumption of UTF-8. The Network library, not knowing about higher level protocols doesn't and just provides a byte stream which, as you say, may need conversion.

EDIT: I'm wrong again about Webby stuff and the WebResponse handling things. I just saw your "Response.New2(1252)" in the first post. I suppose you need to New the WebRequest with the required code page and use the same code page for Newing the WebResponse!

agraham · May 29, 2008

But the (ISO-)Charset number needs to be converted to a code page, though.

I've poked around inside the .NET HttpWebRequest object (used by the HTTP WebRequest object) and the HttpWebResponse object (also used by the HTTP library), but am a bit hampered by my lack of knowledge. It seems you probably have to set the charset you want by setting the ContentType property (incorrectly named ConnectionType in the HTTP help) of a WebRequest appropriately.

A WebRequest seems to know nothing about encoding, possibly because web request headers are always ASCII; you will know better than me. Why you therefore specify an encoding for a WebRequest is a mystery to me, so I have asked Erel.

To send data to a server you use a BinaryFile object as a stream writer opened with the encoding you want specified and use it to write to the WebRequest stream.

A WebResponse knows about encoding as it opens the stream used to return the data when you call WebResponse.GetString with the encoding you specified on New.
