Other I thought UTF8 was always 8 bits

Starchild

Active Member
Licensed User
Longtime User
I was writing my own file downloader but when I converted the byte array to string it changed length.
I was using the "UTF8" charset in the byte converter functions.

As a simple test I tried this...

B4X:
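Sub AppStart (Args() As String)
    Dim conv As ByteConverter 'from the ByteConverter library

    Dim A As String = "Hello World" & Chr(255) 'last character is U+00FF
    Dim B() As Byte = conv.StringToBytes(A,"UTF8") 'encode the string as UTF-8 bytes
    Dim C As String = conv.StringFromBytes(B,"UTF8") 'decode the bytes back into a string

    'Compare the character counts with the byte count
    Log("A String = " & A.Length)
    Log("Byte array = " & B.Length)
    Log("C String = " & C.Length)
End Sub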

It seems that "UTF8" charset requires 2 bytes when representing ordinal values (characters) greater than ordinal 127. I always thought UTF8 would always be within 1 byte per character.

To correct this I am now using "ASCII" as the charset and it all seems fine.

Does anyone have insight into UTF8 to help me understand why this is so?
 

emexes

Expert
Licensed User
It seems that "UTF8" charset requires 2 bytes when representing ordinal values (characters) greater than ordinal 127.
Does anyone have insight into UTF8 to help me understand why this is so?
Because UTF-8 is a mapping to represent the entire Unicode set (tens-of-thousands of characters) using 8-bit bytes (256 possible values). If the first 256 (ie: all) possible byte values were used up representing Unicode characters 0..255, then there would be no unused byte values available to represent any of the other tens-of-thousands of Unicode characters. Thus it is necessary for some of the byte values to be reserved as a flag that they are part of a multibyte representation of a character that does not have its own single-byte representation.

Somewhat similar to telephone numbers, eg: if a dialled number here begins with an 8 or a 9, then I know it is an 8-digit local number. If it begins with a 0 then I know it is a non-local number, and if it begins with 04 then it is a 10-digit mobile number.
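
To make the "flag" idea concrete, here is a minimal Java sketch (not B4X; B4J and B4A compile to Java, so I would expect the same charset behaviour there, but treat that as an assumption). The class and method names are made up for illustration. It prints the UTF-8 bytes of a string in binary so the marker bits are visible: a single-byte character starts with 0, the first byte of a 2-byte character starts with 110, and a continuation byte starts with 10.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; the class and method names are hypothetical.
final class Utf8FlagBits {
    // Returns the UTF-8 bytes of s as space-separated 8-bit binary strings.
    static String utf8Bits(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // b & 0xFF converts the signed Java byte to 0..255 before formatting
            String bits = Integer.toBinaryString(b & 0xFF);
            // left-pad with zeros to a full 8 bits
            sb.append(String.format("%8s", bits).replace(' ', '0'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Bits("A"));      // 01000001
        System.out.println(utf8Bits("\u00FF")); // 11000011 10111111
    }
}
```

So Chr(255) becomes the two bytes 0xC3 0xBF, which is where the extra byte in the original test comes from.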
 

emexes

Expert
Licensed User
To correct this I am now using "ASCII" as the charset and it all seems fine.
This may be correct. But ASCII is a 7-bit code, so it would be worth checking what happens with:
i) Unicode characters 128..255 (sounds like they translate to bytes 128..255)
ii) Unicode characters >= 256 (sounds like they get dropped altogether)
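
Both cases can be checked directly on the JVM. This is a minimal Java sketch under the assumption that the B4X "ASCII" charset name ends up as Java's US-ASCII charset; the class and method names are hypothetical. In plain Java, a character that US-ASCII cannot represent is replaced with '?' when encoding, and a byte above 127 decodes to the replacement character U+FFFD, so the length is preserved but the data is not.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; the class and method names are hypothetical.
final class AsciiRoundTrip {
    // Encodes s as US-ASCII bytes and decodes those bytes back into a string.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // Decodes a single raw byte as US-ASCII.
    static String decodeByte(int value) {
        return new String(new byte[] {(byte) value}, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String a = "Hello World\u00FF";
        String c = roundTrip(a);
        System.out.println(a.length() + " -> " + c.length()); // 12 -> 12
        System.out.println(c);                                // Hello World?
        System.out.println(roundTrip("\u20AC"));              // ?
        System.out.println(decodeByte(0xFF).equals("\uFFFD")); // true
    }
}
```

If B4X behaves the same way, "ASCII" only looks fine because the lengths match; any byte above 127 in a downloaded file would be altered.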
 

Erel

B4X founder
Staff member
Licensed User
Longtime User

Starchild

Active Member
Licensed User
Longtime User
Thanks for all the information. I completely misunderstood UTF8 meaning.
I assumed it would always be a fixed byte length per character. Totally NOT true.

In fact what I actually needed was an 8-bit ORD() ordinal function like the one in Pascal/Delphi.
In Java they seem to use a type cast to achieve this,
i.e. int my_ord = (int)my_chr;

It makes me also wonder what is the implied charset for the B4X functions ASC() and CHR() once you get above ordinal 127, as these seem to misbehave for me too.

I have in fact rewritten my file downloader to use byte arrays.
Works fine now.

I agree, stay away from strings unless dealing with plain text.
 

emexes

Expert
Licensed User
Thanks for all the information. I completely misunderstood UTF8 meaning.
It is easy to mix up. The Wikipedia UTF-8 page has this table:

[Attachment: byte-layout table from the Wikipedia UTF-8 page]


which happily also demonstrates that presentation of information can indeed be worth a thousand words :)
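
The same information can be reproduced with a few lines of Java. This is only a sketch (hypothetical class and method names, and Java rather than B4X) that asks the JVM how many UTF-8 bytes a given Unicode codepoint needs, probing the boundaries between the 1-, 2-, 3- and 4-byte ranges.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; the class and method names are hypothetical.
final class Utf8Lengths {
    // Number of bytes UTF-8 uses for one Unicode codepoint.
    // Surrogate codepoints (0xD800..0xDFFF) are not valid input here.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(0x7F));     // 1
        System.out.println(utf8Length(0x80));     // 2
        System.out.println(utf8Length(0x7FF));    // 2
        System.out.println(utf8Length(0x800));    // 3
        System.out.println(utf8Length(0xFFFF));   // 3
        System.out.println(utf8Length(0x10000));  // 4
        System.out.println(utf8Length(0x10FFFF)); // 4
    }
}
```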

B4X functions ASC() and CHR() once you get above ordinal 127, as these seem to misbehave for me too.
Say what? That shouldn't happen. Asc(Chr(X)) should always return X, where X is a valid Unicode character number. Maybe X is a non-integer Float, perhaps, or a negative number, eg, B4X Bytes are -128 to 127 not 0..255.

I agree, stay away from strings unless dealing with plain text.
+1 :)
 

MarkusR

Well-Known Member
Licensed User
Longtime User
Asc is more UTF8Value("A")
I believe ASC takes its name from the ASCII table :) whereas CHR means Char
 

emexes

Expert
Licensed User
I believe ASC takes its name from the ASCII table :)
Agreed (and the history of ASCII is an interesting and useful read),

Asc is more UTF8Value("A")
but I am not sure where this extrapolation came from. Perhaps I misunderstand.

Cannot test this at the moment, but can say with 100% confidence that:
- strictly speaking, ASCII is 7-bit, thus 0..127
- I have observed ASC(CHR(X)) = X for 0..255 in a B4X language (probably B4J)
which means that ASC is returning, for that range at least, the Unicode character number ("codepoint"), and not a "more UTF8Value".

Having written that, I think I have spotted the misunderstanding. Replace "more UTF8Value" with "more UnicodeValue" and we are back on the same page (pun intended ;-)
 

emexes

Expert
Licensed User
ps my understanding is that the reason it is called ASC() rather than ASCII() is that it was a pragmatic/realpolitik extension by an early (perhaps the original) Gates & Allen BASIC interpreter, in which all function names were 3 characters, eg also CHR, RND, SGN.

Although now you've got me intrigued about LEFT$ and RIGHT$... when did they hit the traditional BASIC vocabulary?
 

MarkusR

Well-Known Member
Licensed User
Longtime User
Replace "more UTF8Value" with "more UnicodeValue" and we are back on the same page (pun intended ;-)

See the link in #2.
There are different methods to store Unicode, so I think just UnicodeValue would be imprecise.
Value("Ä") would be OK if everything is in the same encoding.
 

Starchild

Active Member
Licensed User
Longtime User
I was incorrectly assuming that a byte array of, say, 6 bytes would convert to a string of 6 characters. But this is only true if the character ordinal values are less than 128.

From my early assembly micro-processor coding I have always assumed a character to be a byte of a different type.
As ASCII is 7 bit, I just assumed that "UTF8" meant 8 bit, and UTF-16 meant 16 bit to accommodate many more characters. I need to read more instead of guessing. This is a problem of growing up in an English-only environment and the fact I'm old enough to have started with ASCII as the ONLY character index system.

I must now change my thinking and treat buffer size in memory versus displayed characters on screen as a dynamic relationship that depends on the actual character content.

Thanks for all the web links and info.

Technically a BYTE value has no sign, it's just 8 bits, or in decimal 0 to 255. A byte holding a value -128 to 127 would be a short int. Once you consider the most significant bit as a sign bit the variable becomes an integer. B4X doesn't really support BYTE, WORD or DWORD types. You can only hold these bit values in an integer variable being aware of the 2s complement representation of the bit values as positive and negative numbers.
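
The two's complement point can be sketched in Java, where the byte type is also signed (-128 to 127). Masking with 0xFF is the usual way to read the same 8 bits as 0..255. The class and method names are hypothetical, and whether B4X needs exactly the same masking (for example via its bitwise And) is an assumption I have not tested here.

```java
// Illustrative sketch only; the class and method names are hypothetical.
final class UnsignedByte {
    // Reads the 8 bits of a signed byte as an unsigned value 0..255.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xFF;              // bit pattern 11111111
        System.out.println(b);             // -1
        System.out.println(toUnsigned(b)); // 255
        System.out.println(toUnsigned((byte) 0x80)); // 128
    }
}
```

Java 8 and later also provide Byte.toUnsignedInt, which does the same thing.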
 

emexes

Expert
Licensed User
But this is only true if the character ordinal values are less than 128.
or if you use an encoding that maps Unicode codepoints 0..255 to bytes 0x00..0xFF. I don't have a B4X environment on this computer, but I did at home do a sweep through all the available mappings, and there was one where a 256 character string of Chr(0) to Chr(255) would translate to a 256 byte array of 0x00..0xFF.
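
My guess (an assumption, since the post does not name it) is that the mapping in question is ISO-8859-1, also known as Latin-1. In plain Java that charset does map codepoints 0..255 to bytes 0x00..0xFF one to one, which the sketch below checks by pushing all 256 byte values through a decode and re-encode. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only; the class and method names are hypothetical.
final class Latin1RoundTrip {
    // Builds a 256-byte array holding every byte value 0x00..0xFF once.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    // True if bytes -> string -> bytes gives back the identical array.
    static boolean survives(Charset cs) {
        byte[] original = allByteValues();
        byte[] back = new String(original, cs).getBytes(cs);
        return Arrays.equals(original, back);
    }

    public static void main(String[] args) {
        System.out.println(survives(StandardCharsets.ISO_8859_1)); // true
        System.out.println(survives(StandardCharsets.UTF_8));      // false
        System.out.println(survives(StandardCharsets.US_ASCII));   // false
    }
}
```

Even so, keeping binary data in byte arrays avoids the question altogether.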
I must now change my thinking regarding buffer size in memory versus displayed characters on screen as a dynamic relationship depending on actual character content.
I don't believe it is a fixed ratio in B4X either, although I haven't tested it. B4A and B4J strings are Java strings, which count their length in 16-bit UTF-16 code units: most characters take one unit (2 bytes), but I have used some formatting characters that are > 16-bit, and those take two units (4 bytes). Unicode conceptually began as a 32-bit code but has settled down into a 21-bit code (17 planes of 65,536 codepoints each = a bit over a million available codepoints). It is possible that some programming environments store Unicode as 3-byte codes, but I'd guess that those wanting a fixed size would use 4-byte codes (UTF-32) because that is a more natural size for binary computers.
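
A rough way to check the character-to-byte relationship on the JVM, again as a Java sketch rather than B4X, with hypothetical class and method names. Java's String length counts UTF-16 code units, so a character above U+FFFF (here U+1F600, a smiley) counts as two units and four bytes of UTF-16, while "A" counts as one unit and two bytes. How a particular runtime stores strings internally is a separate matter that this does not show.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; the class and method names are hypothetical.
final class Utf16Units {
    // Number of 16-bit UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode codepoints (what a person would call characters, roughly).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Bytes needed to write the string as big-endian UTF-16 without a byte order mark.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String letter = "A";
        String smiley = new String(Character.toChars(0x1F600));
        System.out.println(codeUnits(letter) + " " + codePoints(letter) + " " + utf16Bytes(letter)); // 1 1 2
        System.out.println(codeUnits(smiley) + " " + codePoints(smiley) + " " + utf16Bytes(smiley)); // 2 1 4
    }
}
```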
Technically a BYTE value has no sign, it's just 8 bits
I would agree with this
or in decimal 0 to 255.
but in some more esoteric edge cases, you might get some pushback on this - there are many and varied ways of interpreting bits, and base-2 is just one of them, albeit the most common.
A byte holding a value -128 to 127 would be a short int.
B4R would disagree with you on that. I feel like we are overlapping byte (collection of 8 bits, aka octet) and Byte (data type, interpretation of those bits).
B4X doesn't really support BYTE, WORD or DWORD types.
I certainly miss having the option of unsigned versions of them. And compatibility between B4R types and the rest of the B4X dialects. But that's the way it is, and unlikely to change, so... here we are :)
 

Starchild

Active Member
Licensed User
Longtime User
I think there is an encoding type "ASCII" that returns a 1:1 relationship when using B4X ByteConverter class.
I tried this a few days ago, but it still had issues later when I was applying ASC() and CHR() functions to the results of elements returned by StringToArray and StringFromArray functions.

Anyway, I'm past all that for now as I am staying with byte arrays ONLY to handle my data. No confusions.

Just a note on short int.
Short Int itself is just a type name. Depending on language it can vary in range.
In the BASIC micro compilers I use, a Short Int fits into a single byte of memory (8 bits), while the C-based compilers allocate it 2 bytes (16 bits). It catches me out sometimes, especially when coding for limited-memory applications.

Just things to be aware of I guess. :)

"... we need more than 1 byte for smiley!! As ASCII doesn't really "cut it" anymore in today's world."
 

emexes

Expert
Licensed User
staying with byte arrays ONLY to handle my data. No confusions.
+1

Short Int itself is just a type name. Depending on language it can vary in range.
Long live stdint.h {thumbs-up}

Would have been nice to have such a regular scheme as part of B4X, but... I guess that's the price of evolution.

we need more than 1 byte for smiley!!
I'll {drink} to that!

edit: well, that was disappointing - Unicode emojis looked great in editor, vanished when posted. Spewin!
 