Android Question SubString to string

arvin

Member
Hi, I need help converting 2 bytes from a string to a string and showing the result in my app, e.g.:

Dim str As String
Dim ii As Int
str = 00 00 00 A9 -> byte0=00 byte1=00 byte2=00 byte3=A9 (hexadecimal)

I need to add 2 to A9 and show the result in an EditText.

How do I do this with B4A?

Something like this:
ii=str.SubString2(3,3)+2
 

agraham

Expert
Licensed User
Longtime User
Strings are not a sequence of Bytes but are a sequence of Chars. Each Char is actually a 16bit value representing the Unicode code point of a character.
You can use String.CharAt(pos) to get a single character from a String, and the Asc(char) keyword to get the code point of that character.

ii = Asc(str.CharAt(3))
 

agraham

Expert
Licensed User
Longtime User
I think it might (now?) be larger than 16 bits
No. The internal coding for strings in Java (and Windows, C#, Javascript etc.) is UTF-16, which comprises a sequence of 16 bit values. Do not confuse a Unicode code point with how it is encoded. There are 1,112,064 valid Unicode code points (excluding the surrogate range), and they can be encoded in various ways. UTF-8 uses sequences of 8 bit values, UTF-16 uses 16 bit values and UTF-32 uses 32 bit values. Where the value of a code point is too large for a single unit of the encoding, it is encoded as a sequence of several units. Wikipedia has several very good articles on Unicode itself and the various encodings.
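To put numbers on that, here is a small sketch in plain Java (not B4X) that counts how many bytes a few sample characters take in each encoding. The class and method names are made up for illustration, and the UTF-32 figure is computed from the code point count rather than by actually encoding.

```java
import java.nio.charset.StandardCharsets;

final class EncodingSizes {
    // Bytes needed in UTF-8 (1 to 4 per code point).
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed in UTF-16. UTF_16BE is used so no byte-order mark is prepended.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // UTF-32 always spends 4 bytes per code point, so this is simply computed.
    static int utf32Bytes(String s) {
        return 4 * s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+0041, U+00E9, U+20AC and U+1F6A9 (the last written as a surrogate pair)
        String[] samples = { "A", "\u00E9", "\u20AC", "\uD83D\uDEA9" };
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8=%d UTF-16=%d UTF-32=%d bytes%n",
                    s.codePointAt(0), utf8Bytes(s), utf16Bytes(s), utf32Bytes(s));
        }
    }
}
```

Only the last sample needs more than one 16 bit unit in UTF-16, while UTF-8 already needs multiple units from U+0080 upwards.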
 

emexes

Expert
Licensed User
Hi, I need help to convert 2 bytes from a string to string and show in my app ex:

Dim str As String
dim ii as int
str= 00 00 00 A9 -> byte0=00 byte1=00 byte2=00 byte3=A9 heximal

I need to add number 2 to A9 and show in a EditText.
B4X:
Dim Str As String
Dim ii as int
str = Chr(0x00) & Chr(0x00) & Chr(0x00) & Chr(0xA9)

'just to make sure we're talking about the same thing
Log( "Length = " & Str.Length )    '4 characters
Log( "Last character code = " & Asc(Str.CharAt(3)) )    'A9 hex = 169 decimal

'if so, then this be what ye want
ii = Asc(Str.CharAt(3)) + 2
MyEditText.Text = ii    'Int value 171 will be cast to a 3-character string "171"
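If the string instead holds the hexadecimal text "00 00 00 A9" (digits and spaces rather than four raw character codes), the last byte has to be parsed as base 16 before adding. A minimal sketch of that reading in plain Java, the language B4A compiles to; the class and method names are hypothetical:

```java
final class HexByteDemo {
    // Takes text such as "00 00 00 A9", parses the last space-separated
    // item as a hexadecimal number and adds delta to it.
    static int addToLastHexByte(String hexText, int delta) {
        String[] parts = hexText.trim().split("\\s+");
        int value = Integer.parseInt(parts[parts.length - 1], 16);
        return value + delta;
    }

    public static void main(String[] args) {
        int result = addToLastHexByte("00 00 00 A9", 2);
        System.out.println(result);                                    // 171
        System.out.println(Integer.toHexString(result).toUpperCase()); // AB
    }
}
```

In B4X, Bit.ParseInt(text, 16) should perform the same base 16 parse, and Bit.ToHexString converts the result back to hex text for display.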
 

agraham

Expert
Licensed User
Longtime User
there are many things about Java that go against my expectation
This is not just Java. UTF-16 is the internal representation of strings in Windows, Java (and hence Android), .NET, JavaScript and macOS's Cocoa strings, and so in most languages running on them. Linux is the main exception, where UTF-8 is the norm.

Windows also supports 8 bit characters and code page definitions for backward compatibility. In fact just about every string API call in Win32 exists in both narrow and wide form, the difference being whether the character values are 8 bit code page values or 16 bit Unicode values. It also has conversion APIs to convert 8 bit characters to and from Unicode; these need a code page identifier to say which glyphs the character values between 128 and 255 are meant to represent.
 

emexes

Expert
Licensed User
Each Char is actually a 16bit value
I think it might (now?) be larger than 16 bits.
No. The internal coding for strings in Java (and Windows, C#, Javascript etc.) is UTF-16 which comprises a sequence of 16 bit values.
Hmm. You are right. Ran this code in B4J:
B4X:
Dim S As String = Char(0x1F6A9)
Log(S.Length)    'where S is comprised of Chars, and .Length is the number of Chars in the string
Dim C As Char = S.CharAt(0)
Log(ASC(C))
got this log:
B4X:
Program started.
1
63145
which is completely contrary to my recollection, but compatible with your explanation. Although I did think that UTF-16 was also supposed to be able to encode the entire Unicode range, by using more than one 16-bit component (like UTF-8 does with 8-bit components). And it is a bit disappointing that a Char doesn't hold Unicode characters > 0xFFFF. I'll pull on those threads a bit more, see what comes out.
 

agraham

Expert
Licensed User
Longtime User
I did think that UTF-16 was also supposed to be able to encode the entire Unicode range, by using more than one 16-bit component (like UTF-8 does with 8-bit components)
It does, in a similar manner using surrogate pairs of values. I don't know for sure but String.Length may return the number of Unicode code points in the string, not the number of individual 16 bit coding elements. I tend to ignore the Unicode complexities until it hits me in the face and I have never needed to use any characters that require more than a single 16 bit value when coded in UTF-16.
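For what it's worth, in plain Java String.length() counts 16 bit code units, and the number of code points has to be asked for separately with codePointCount. The sketch below shows the difference for U+1F6A9. It is Java rather than B4X, the class name is made up, and whether B4X's String.Length maps straight onto length() is an assumption here.

```java
final class LengthDemo {
    // U+1F6A9 built from its code point, so it is stored as a surrogate pair.
    static final String FLAG = new String(Character.toChars(0x1F6A9));

    // Number of 16 bit UTF-16 code units in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(FLAG));   // 2
        System.out.println(codePoints(FLAG));  // 1
        // codePointAt reassembles the pair into the original code point.
        System.out.println(Integer.toHexString(FLAG.codePointAt(0))); // 1f6a9
    }
}
```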
 

agraham

Expert
Licensed User
Longtime User
got this log: 63145
Note that 63145 is the decimal equivalent to 0xF6A9 so it looks like CharAt has returned the correct Unicode code point but by assigning it to a Char which is a 16 bit value it has truncated it. Try assigning Asc(S.CharAt(0)) to a Long.
 

emexes

Expert
Licensed User
I tend to ignore the Unicode complexities
I am so sad at the moment
until it hits me in the face
but this made me laugh anyway, so thank you for that :)

I am sad because I thought this Unicode guff had been finally and properly sorted out, and now I find it is not.

This code:

upload_2019-10-14_22-45-52.png


produces this log:

upload_2019-10-14_22-47-47.png


A little piece of my love for programming died when I saw that.
 

agraham

Expert
Licensed User
Longtime User
55357 is 0xD83D which is a valid UTF-16 high surrogate pair value and 57001 is 0xDEA9 which is a valid UTF-16 low surrogate pair value so it looks correct to me but I can't be bothered to calculate them myself. The reason for the "?" is that CharAt is returning the individual surrogate pair values and the font used to log the values does not have a glyph defined for those codes.
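For reference, the calculation is short: subtract 0x10000 from the code point, put the top 10 bits of the result on 0xD800 and the bottom 10 bits on 0xDC00. A sketch in plain Java (not B4X; the class and method names are made up) that does this by hand and compares it with the JDK's own Character.highSurrogate and Character.lowSurrogate:

```java
final class SurrogateMath {
    // High (first) surrogate for a code point in the range 0x10000..0x10FFFF.
    static char high(int codePoint) {
        int offset = codePoint - 0x10000;           // 20 bit value
        return (char) (0xD800 + (offset >> 10));    // top 10 bits
    }

    // Low (second) surrogate for a code point in the range 0x10000..0x10FFFF.
    static char low(int codePoint) {
        int offset = codePoint - 0x10000;
        return (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
    }

    public static void main(String[] args) {
        int cp = 0x1F6A9;
        System.out.println((int) high(cp)); // 55357 (0xD83D)
        System.out.println((int) low(cp));  // 57001 (0xDEA9)
        // Cross-check against the JDK helpers.
        System.out.println(high(cp) == Character.highSurrogate(cp)); // true
        System.out.println(low(cp) == Character.lowSurrogate(cp));   // true
    }
}
```

For 0x1F6A9 the offset is 0xF6A9, whose top 10 bits are 0x3D and bottom 10 bits are 0x2A9, giving 0xD83D and 0xDEA9 as logged.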
 

agraham

Expert
Licensed User
Longtime User
Note that 63145 is the decimal equivalent to 0xF6A9 so it looks like CharAt has returned the correct Unicode code point but by assigning it to a Char which is a 16 bit value it has truncated it.
In retrospect this is most likely wrong. The real reason is probably that Char() masks any value passed to it to a 16 bit value which fits in a single Char variable. In your second case embedding the extended character in a string bypasses this limitation by generating a correctly encoded UTF-16 literal string.
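The masking effect is easy to reproduce in plain Java, where narrowing an int to a 16 bit char keeps only the low 16 bits. The sketch below contrasts that with Character.toChars, which produces the proper surrogate pair. It only illustrates the suspected mechanism: that the B4X keyword does exactly this cast internally is an assumption, and the class and method names are made up.

```java
final class TruncationDemo {
    // Narrowing to char keeps only the low 16 bits of the value.
    static int truncated(int codePoint) {
        return (char) codePoint;
    }

    // Character.toChars encodes the code point properly, as 1 or 2 chars.
    static String encoded(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int cp = 0x1F6A9;
        System.out.println(truncated(cp));                  // 63145 (0xF6A9)
        System.out.println(encoded(cp).length());           // 2
        System.out.println((int) encoded(cp).charAt(0));    // 55357
        System.out.println((int) encoded(cp).charAt(1));    // 57001
    }
}
```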
 

emexes

Expert
Licensed User
I have never needed to use any characters that require more than a single 16 bit value when coded in UTF-16
I haven't either. The flag character was encountered whilst on a quest for somebody else, and my recollection there is that the new Unicode representation of flags is too new to have percolated down to being available for common usage, so we let go of that solution for the time being.

For us non-flag-waving English speakers, the 65536 codepoints of Unicode plane 0 may well be enough. But there are 16 other planes in Unicode too. About a week ago, there was a forum query along the lines of "my Arabic text strings display ok in the IDE, but go wonky when my program manipulates them" and now I am thinking: perhaps that issue was related to this half-baked handling of Unicode.
 


agraham

Expert
Licensed User
Longtime User
For us non-flag-waving English speakers, the 65536 codepoints of Unicode plane 0 may well be enough
Note that even some plane 0 codepoints are going to be encoded as surrogate pairs in UTF-16 as some 0xD??? values are reserved for surrogate pair usage and so Unicode codepoints in this range need to be encoded as a surrogate pair.

EDIT: I'm wrong here. They are reserved for surrogate use only with no character allocations :(
 

emexes

Expert
Licensed User
55357 is 0xD83D which is a valid UTF-16 high surrogate pair value and 57001 is 0xDEA9 which is a valid UTF-16 low surrogate pair value so it looks correct to me but I can't be bothered to calculate them myself. The reason for the "?" is that CharAt is returning the individual surrogate pair values and the font used to log the values does not have a glyph defined for those codes.
Understood. And thank you for freeing me from a bad assumption. Or maybe: a good assumption, implemented badly. I am letting go of this topic, per your "until it hits me in the face" approach ;-)

https://www.compart.com/en/unicode/U+1F6A9

upload_2019-10-14_23-38-56.png
 

emexes

Expert
Licensed User
Note that even some plane 0 codepoints are going to be encoded as surrogate pairs in UTF-16 as some 0xD??? values are reserved for surrogate pair usage and so Unicode codepoints in this range need to be encoded as a surrogate pair.
Thanks again. You are a right font of good news at the moment. But if you are doing it to cheer me up - it's not working ;-)
 