Android Question SubString to string

arvin · Oct 13, 2019

Hi, I need help converting a byte taken from a string into a value I can show in my app. Example:

Dim str As String
Dim ii As Int
str = 00 00 00 A9 -> byte0=00 byte1=00 byte2=00 byte3=A9 (hexadecimal)

I need to add 2 to A9 and show the result in an EditText.

How can I do this in B4A?

Something like this:
ii = str.SubString2(3,3) + 2

agraham · Oct 13, 2019

Strings are not a sequence of Bytes but are a sequence of Chars. Each Char is actually a 16bit value representing the Unicode code point of a character.
You can use String.CharAt(pos) to get a single character from a String and use the Asc(char) keyword to get the code point of the character.

ii = Asc(str.CharAt(3))

Erel · Oct 14, 2019

If the string is actually "000000A9" then you can use ByteConverter.HexToBytes to parse it.
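
A minimal sketch of that approach, assuming the ByteConverter library is ticked in the Libraries tab and that the layout contains an EditText named EditText1 (a hypothetical name). Bytes are signed in B4A, so the value is masked with 0xFF before adding:

```b4x
Dim bc As ByteConverter
Dim b() As Byte = bc.HexToBytes("000000A9")    '4 bytes: 00 00 00 A9
'Byte is signed (-128..127), so A9 reads back as -87; mask it to get 169
Dim ii As Int = Bit.And(b(3), 0xFF) + 2        '169 + 2 = 171
EditText1.Text = ii                            'EditText1 is an assumed view name; shows "171"
'To show the result in hex instead of decimal:
EditText1.Text = Bit.ToHexString(ii).ToUpperCase    'shows "AB"
```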

emexes · Oct 14, 2019

agraham said:
Each Char is actually a 16bit value representing the Unicode code point of a character.

I think it might (now?) be larger than 16 bits. Pretty sure I have used something like U+1F6A9 (triangular flag on post), but not sure in which of B4A/I/J.

agraham · Oct 14, 2019

emexes said:
I think it might (now?) be larger than 16 bits

No. The internal coding for strings in Java (and Windows, C#, Javascript etc.) is UTF-16 which comprises a sequence of 16 bit values. Do not confuse a Unicode code point with how it is encoded. There are 1,112,064 valid code points in Unicode (1,114,112 in total, minus the 2,048 reserved for surrogates), and they can be encoded in various ways. UTF-8 uses sequences of 8 bit values, UTF-16 uses 16 bit values and UTF-32 uses 32 bit values. Where the value of a code point exceeds what a single coding unit can hold, it is encoded as a sequence of several coding units. Wikipedia has several very good articles on Unicode itself and the various encodings.
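
To make the difference between a code point and its encoding concrete, here is a small sketch that logs how many bytes the same characters occupy in UTF-8 and in UTF-16. It assumes the standard Java charset names "UTF-8" and "UTF-16BE" are accepted by String.GetBytes; the variable names are illustrative only. U+1F6A9 is built from its two UTF-16 code units, 0xD83D and 0xDEA9.

```b4x
'U+1F6A9 does not fit in one 16 bit unit, so it is built from two
Dim flag As String = Chr(0xD83D) & Chr(0xDEA9)
Dim samples() As String = Array As String("A", "€", flag)
For Each s As String In samples
    Dim utf8() As Byte = s.GetBytes("UTF-8")
    Dim utf16() As Byte = s.GetBytes("UTF-16BE")    'BE variant avoids a byte order mark
    Log(utf8.Length & " bytes in UTF-8, " & utf16.Length & " bytes in UTF-16")
Next
'Expected sizes: "A" = 1 and 2, "€" = 3 and 2, U+1F6A9 = 4 and 4
```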

emexes · Oct 14, 2019

arvin said:
Hi, I need help converting a byte taken from a string into a value I can show in my app. Example:

Dim str As String
Dim ii As Int
str = 00 00 00 A9 -> byte0=00 byte1=00 byte2=00 byte3=A9 (hexadecimal)

I need to add 2 to A9 and show the result in an EditText.

B4X:

Dim Str As String
Dim ii As Int
Str = Chr(0x00) & Chr(0x00) & Chr(0x00) & Chr(0xA9)

'just to make sure we're talking about the same thing
Log( "Length = " & Str.Length )    '4 characters
Log( "Last character code = " & Asc(Str.CharAt(3)) )    'A9 hex = 169 decimal

'if so, then this be what ye want
ii = Asc(Str.CharAt(3)) + 2
MyEditText.Text = ii    'Int value 171 will be cast to a 3-character string "171"

agraham · Oct 14, 2019

emexes said:
there are many things about Java that go against my expectation

This is not just Java. UTF-16 encoding is almost universally standard for the internal representation of strings in Windows, Linux, Android and MacOS and hence in most languages running on them.

Windows also supports 8 bit byte characters and code page definitions for backward compatibility. In fact just about every string API call in Win32 exists in both narrow and wide form, the difference being whether the code values are 8 bit code page values or 16 bit Unicode values. It also has conversion APIs to convert 8 bit characters to and from Unicode that need a code page identifier to identify what glyphs the character values between 128 and 255 are meant to represent.

emexes · Oct 14, 2019

agraham said:
Each Char is actually a 16bit value

emexes said:
I think it might (now?) be larger than 16 bits.

agraham said:
No. The internal coding for strings in Java (and Windows, C#, Javascript etc.) is UTF-16 which comprises a sequence of 16 bit values.

Hmm. You are right. Ran this code in B4J:

B4X:

Dim S As String = Chr(0x1F6A9)
Log(S.Length)    'where S is comprised of Chars, and .Length is the number of Chars in the string
Dim C As Char = S.CharAt(0)
Log(Asc(C))

got this log:

B4X:

Program started.
1
63145

which is completely contrary to my recollection, but compatible with your explanation. Although I did think that UTF-16 was also supposed to be able to encode the entire Unicode range, by using more than one 16-bit component (like UTF-8 does with 8-bit components). And it is a bit disappointing that a Char doesn't hold Unicode characters > 0xFFFF. I'll pull on those threads a bit more, see what comes out.

agraham · Oct 14, 2019

emexes said:
I did think that UTF-16 was also supposed to be able to encode the entire Unicode range, by using more than one 16-bit component (like UTF-8 does with 8-bit components)

It does, in a similar manner using surrogate pairs of values. I don't know for sure but String.Length may return the number of Unicode code points in the string, not the number of individual 16 bit coding elements. I tend to ignore the Unicode complexities until it hits me in the face and I have never needed to use any characters that require more than a single 16 bit value when coded in UTF-16.
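
On the String.Length question: B4A strings are Java strings, and Java's String.length() returns the number of 16 bit code units, not the number of code points. A rough sketch of counting code points by hand follows. It assumes the string is well formed (every low surrogate follows a high surrogate), and CodePointCount is a hypothetical helper name, not a built-in.

```b4x
'Counts Unicode code points by skipping the second half of each surrogate pair
Sub CodePointCount(s As String) As Int
    Dim n As Int = 0
    For i = 0 To s.Length - 1
        Dim u As Int = Asc(s.CharAt(i))
        'Low surrogates (0xDC00..0xDFFF) only ever appear as the tail of a pair
        If u < 0xDC00 Or u > 0xDFFF Then n = n + 1
    Next
    Return n
End Sub
```

For a string built as Chr(0xD83D) & Chr(0xDEA9), this should return 1 while .Length returns 2.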

agraham · Oct 14, 2019

emexes said:
got this log: 63145

Note that 63145 is the decimal equivalent to 0xF6A9 so it looks like CharAt has returned the correct Unicode code point but by assigning it to a Char which is a 16 bit value it has truncated it. Try assigning Asc(S.CharAt(0)) to a Long.

emexes · Oct 14, 2019

agraham said:
I tend to ignore the Unicode complexities

I am so sad at the moment

until it hits me in the face

but this made me laugh anyway, so thank you for that

I am sad because I thought this Unicode guff had been finally and properly sorted out, and now I find it is not.

This code:

produces this log:

A little piece of my love for programming died when I saw that.

agraham · Oct 14, 2019

55357 is 0xD83D which is a valid UTF-16 high surrogate pair value and 57001 is 0xDEA9 which is a valid UTF-16 low surrogate pair value so it looks correct to me but I can't be bothered to calculate them myself. The reason for the "?" is that CharAt is returning the individual surrogate pair values and the font used to log the values does not have a glyph defined for those codes.
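
For anyone who does want to check the calculation, the standard UTF-16 rule is: subtract 0x10000 from the code point, put the top 10 bits of the result on 0xD800 and the bottom 10 bits on 0xDC00. A short sketch for U+1F6A9, with illustrative variable names:

```b4x
Dim cp As Int = 0x1F6A9
Dim v As Int = cp - 0x10000                        '0xF6A9, a 20 bit value
Dim hi As Int = 0xD800 + Bit.ShiftRight(v, 10)     'top 10 bits    -> 0xD83D
Dim lo As Int = 0xDC00 + Bit.And(v, 0x3FF)         'bottom 10 bits -> 0xDEA9
Log(hi)    '55357
Log(lo)    '57001
'Joining the two code units gives a string holding the single character U+1F6A9
Dim S As String = Chr(hi) & Chr(lo)
```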

agraham · Oct 14, 2019

agraham said:
Note that 63145 is the decimal equivalent to 0xF6A9 so it looks like CharAt has returned the correct Unicode code point but by assigning it to a Char which is a 16 bit value it has truncated it.

In retrospect this is most likely wrong. The real reason is probably that Chr() masks any value passed to it to a 16 bit value which fits in a single Char variable. In your second case embedding the extended character in a string bypasses this limitation by generating a correctly encoded UTF-16 literal string.
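
The masking explanation is at least consistent with the arithmetic: keeping only the low 16 bits of 0x1F6A9 gives exactly the value that was logged. A quick check using only Bit.And:

```b4x
Dim cp As Int = 0x1F6A9
Dim masked As Int = Bit.And(cp, 0xFFFF)    'drop everything above bit 15
Log(masked)                                '63145
Log(Bit.ToHexString(masked))               'f6a9
```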

emexes · Oct 14, 2019

agraham said:
I have never needed to use any characters that require more than a single 16 bit value when coded in UTF-16

I haven't either. The flag character was encountered whilst on a quest for somebody else, and my recollection there is that the new Unicode representation of flags is too new to have percolated down to being available for common usage, so we let go of that solution for the time being.

For us non-flag-waving English speakers, the 65536 codepoints of Unicode plane 0 may well be enough. But there are 16 other planes in Unicode too. About a week ago, there was a forum query along the lines of "my Arabic text strings display ok in the IDE, but go wonky when my program manipulates them" and now I am thinking: perhaps that issue was related to this half-baked handling of Unicode.

agraham · Oct 14, 2019

emexes said:
For us non-flag-waving English speakers, the 65536 codepoints of Unicode plane 0 may well be enough

Note that even some plane 0 codepoints are going to be encoded as surrogate pairs in UTF-16 as some 0xD??? values are reserved for surrogate pair usage and so Unicode codepoints in this range need to be encoded as a surrogate pair.

EDIT: I'm wrong here. They are reserved for surrogate use only with no character allocations

emexes · Oct 14, 2019

agraham said:
55357 is 0xD83D which is a valid UTF-16 high surrogate pair value and 57001 is 0xDEA9 which is a valid UTF-16 low surrogate pair value so it looks correct to me but I can't be bothered to calculate them myself. The reason for the "?" is that CharAt is returning the individual surrogate pair values and the font used to log the values does not have a glyph defined for those codes.

Understood. And thank you for freeing me from a bad assumption. Or maybe: a good assumption, implemented badly. I am letting go of this topic, per your "until it hits me in the face" approach ;-)

https://www.compart.com/en/unicode/U+1F6A9

emexes · Oct 14, 2019

agraham said:
Note that even some plane 0 codepoints are going to be encoded as surrogate pairs in UTF-16 as some 0xD??? values are reserved for surrogate pair usage and so Unicode codepoints in this range need to be encoded as a surrogate pair.

Thanks again. You are a right font of good news at the moment. But if you are doing it to cheer me up - it's not working ;-)

agraham · Oct 14, 2019

Note my correction above! There, that's cheered you up a bit
