Android Question RegEx on byte arrays

Didier9 · Nov 26, 2019

I am parsing byte arrays that contain a number of binary encoded messages (the whole 0-255 range) and I need to find start and end of messages and eliminate some byte stuffing. The data comes through WiFi and I have to process it as it comes. At the other end of the WiFi link is a serial device that sends data at 9600 bauds, so the data rate is not very high (probably 300-600 bytes/sec average, but the longer the messages, the less time I have to process them)

Doing it character by character with a for() loop is actually not fast enough to run on an inexpensive tablet (runs well on a Moto-X 4 and very well on a Pixel 2 but not well at all on a 5 year old XT1096 phone or the Fire tablet).

I have tried to cast the data to a string and use string matching and IndexOf() but that does not work, I suspect that the binary data containing zeros messes with strings that are normally null-terminated.

I should mention that after the messages are split, they still need to be decoded and displayed and at the moment, that takes about as long as the message splitting itself even though I am just putting text into labels (no fancy graphics).

Interestingly, the same algorithm runs very well in C on a 12 MHz 8 bit microcontroller driving an LCD

While for this app I could live with the requirement for a relatively modern device, I would like to explore a little bit for better ways to do it and
I thought regular expressions would probably help but apparently the RegEx.Matcher does not like binary data.

Any suggestion appreciated.

Didier9 · Nov 27, 2019

This was my original char-by-char routine:

B4X:

Sub ReceiveMessage( tchar As Byte )
    ' Note: with the TSIP protocol, messages start with DLE-0x8F-0xAB or DLE-Ox8F-0xAC
    ' we cannot be sure to identify the beginning of a message because these sequences are legal inside a message
    ' However we can identify the end of a message because it is (odd number of DLE)-ETX
    ' So we start collecting a message when we get DLE-0x8F, but we have to check for
    ' early DLE-ETX and scrap the message if length is not good

    Select Case RxState
        Case WAIT_FOR_START:
            If tchar = DLE Then
                ' this is a tentative start
                RxBuf = ""
                RxBuf = RxBuf & Chr(tchar)
                RxState = WAIT_FOR_ID
                bDLE = True ' will be used to identify the end of a message
            Else
                ' do nothing, just keep waiting
                bDLE = False
            End If

        Case WAIT_FOR_ID: ' make sure it's a clean DLE (followed by not-ETX)
            ' there are many more message IDs than 0x8E and 0x8F...
            ' even though we can have DLE-0x8F inside a message
            If tchar = 0x8E Or tchar = 0x8F Then
                ' still not 100% sure it's a good start, it just looks like it might
                RxState = COLLECT_MSG
                RxBuf = RxBuf & Chr(tchar)
            Else
                ' pretty sure that was not a start
                RxState = WAIT_FOR_START
                RxBuf = ""
                bDLE = False
            End If
            bDLE = False

        Case COLLECT_MSG:
            If tchar = ETX And bDLE = True Then
                ' ETX preceded by odd number of DLEs: unequivocaly end of message
                '    RxPending = True ' not sure RxPending is useful in this context
                ' Check length versus third character of the message:
                ' Primary packet is 0xAB and length is 19,
                ' Secondary packet is 0xAC and length is 70
                'Debug.Print "Message type: 0x" & Hex$(Asc(Mid$(RxBuffer, 2, 1))) & Hex$(Asc(Mid$(RxBuffer, 3, 1))) & ", length: " & Len(RxBuffer)
                'Debug.Print "Received message"
                'ProcessMessage (RxBuf & Chr(tchar))
                RxBuf = ""
                RxState = WAIT_FOR_START ' get ready for next message
                bDLE = False
            Else If tchar = DLE And bDLE = False Then
                ' we just received a DLE
                RxBuf = RxBuf & Chr(tchar)
                bDLE = True
            Else If tchar = DLE And bDLE = True Then
                ' duplicate DLE, ignore it
                bDLE = False
            Else
                ' any other situation, accumulate character
                RxBuf = RxBuf & Chr(tchar)
                bDLE = False
            End If
                    
        Case Else:
            RxState = WAIT_FOR_START
            RxBuf = ""
            bDLE = False
            '    RxPending = False
    
    End Select

End Sub ' ReceiveMessage()

I did not use it because I felt it would not mesh very well with the packets coming from the WiFi module.

Didier9 · Nov 27, 2019

Actually that one came from a piece of VB 6.0 code I wrote a while back (that was working ok but I never ran it very long). It was running directly from the serial port.

emexes · Nov 27, 2019

Didier9 said:
This was my original char-by-char routine

Seeing that induced some strong déjà vu... great minds think alike!

Although I used a fixed static buffer, which will be h-e-a-p-s quicker than string operations and/or dynamic arrays and their associated memory (re)allocations.

While I'm here:

A little trap of my code that might not be obvious is that PacketSize can be larger than Packet.Length. Or to put a more-positive spin on it: the routine will correctly parse packets larger than the buffer, and returns as much data as will fit into the buffer.[/QUOTE]

emexes · Nov 27, 2019

Didier9 said:
Messages start with the start of text delimiter DLE followed by the data payload.

Hey, when you simplify it down to that, it feels like it could eliminate one of the HandleOneByte parsing states and thus probably reduce the routine's code size by about a corresponding quarter.

emexes · Nov 27, 2019

Worked out that my uneasiness is because if:
- the opening sequence is DLE DLE or DLE ETX, or
- a DLE within the data is followed by something other than DLE or ETX
the results are not clearly defined.

So if we take the liberty of handling those cases however we like, then it simplifies down to:

B4X:

Private PacketState As Int = 0
    '0 = idle, waiting for initial DLE
    '2 = in packet, waiting for data byte or DLE
    '3 = in packet, got DLE, waiting for DLE or ETX

B4X:

Sub HandleOneByte(B As Byte)
  
    Select Case PacketState
        Case 0:    '*** idle, waiting for initial DLE ***
            If B = DLE Then
                PacketSize = 0
                PacketState = 2    'not 3, otherwise DLE is misinterpreted as escaping prefix
            Else
                'noise on line
                Log("wtf 58")
            End If
      
        Case 2:    '*** in packet, waiting for data byte or DLE ***
            If PacketSize < Packet.Length Then
                Packet(PacketSize) = B
            End If
            PacketSize = PacketSize + 1
          
            If B = DLE Then
                PacketState = 3
            End If
          
        Case 3:    '*** in packet, got DLE, waiting for DLE or ETX ***
            If B = DLE Then
                'DLE is already in packet from state 2
                PacketState = 2
            Else
                PacketSize = PacketSize - 1    'remove DLE from packet buffer
              
                If B = ETX Then
                    HandlePacket(PacketSize, Packet)
                    PacketState = 0    'job done
                End If
            End If
          
    End Select
  
End Sub

Didier9 · Nov 27, 2019

The end of message is more easily (and uniquely) identified so the routine should be laid out to make sure you cut the packet right there. Then, what follows, in most cases, is the beginning of a new message, until the next end of message.

You need another state in your routine. After receiving the first DLE, you must make sure the next character (the first character of the data payload) is not ETX. Then after that you can start looking for DLE-ETX while converting the DLE-DLE into a single DLE.

Didier9 · Nov 27, 2019

Thanks for the B4J code, makes it really easy to check things out.

emexes · Nov 27, 2019

Didier9 said:
You need another state in your routine. After receiving the first DLE, you must make sure the next character (the first character of the data payload) is not ETX.

Agreed. That's what I originally thought too, and implemented. But when I read the Trimble document, it says:

<ID> is a packet ID byte, which can have any value with the exception of <ETX> and <DLE>

but says nothing about what should happen if that exception happened anyway. In light of that, I figure best stick to the robustness principle of "conservative in what you send, liberal in what you accept", which in this case has the happy consequence of enabling the data payload to begin with <ETX> or <DLE> which would otherwise not be possible.

It is easy enough to add a check that the first character of the payload is not <ETX> (or <DLE>) if some clinching argument makes it obviously necessary. But until that happens, I'm thinking: if it works as is, leave it be.

At the same time: it's your program, perhaps there are other factors to consider. In which case... go for it.

Didier9 · Nov 27, 2019

Yes, makes little difference in the end. Either you make an extra function call that's going to return right away or you screen the data first. I have done it both ways, there is no right and wrong...
Thanks a lot for the exchange and the ideas, and the B4J test code

Didier9 · Nov 28, 2019

Well, the HandleOneByte approach works considerably better!
My original approach worked well enough on the local network but did not work so well when polling a device across the country from me (I am in Florida, the device is in Seattle). The HandleOneByte() approach works equally well on both devices! From what I saw, the problem was that my routine, aside from being overly complex, did not deal well with fragmentation. Your routine is simply a lot more elegant (I had to add the "fetching ID byte" state just because

Thanks again!

emexes · Nov 28, 2019

Didier9 said:
I had to add the "fetching ID byte" state just because

No worries... "just because" is a valid reason for perfectionists

If efficiency ever becomes a problem, the first thing to do would be to move the HandleOneByte code to inside the byte-by-byte loop of the HandleMultipleBytes routine, which will eliminate the overhead of calling HandleOneByte so many times. The overhead is probably small, but... there's still a fair amount happening: pushing parameters and return address, generating a new stack frame, initialize variables, and then undoing it all to return to the caller. Plus I suspect there is a dummy string return value in there somewhere too. It is low-hanging fruit ripe for picking if you ever get to the point where every clock cycle matters.

On a related note: my car stopped today, so I'll be applying the OBD reader to it tomorrow. Probably half of my packet protocol experience comes from interfacing with automotive test equipment like exhaust gas analysers, opacimeters, weighing scales, pedal force tranducers, and of course the ubiquitous ELM OBD interface. I still remember when that came out, and how great it was compared to previous devices. Not perfect, but close. Still warms my heart remembering when I first used it and it actually did what it said it would ;-)

Android Question RegEx on byte arrays

Didier9

Well-Known Member

Didier9

Well-Known Member

Didier9

Well-Known Member

emexes

Expert

emexes

Expert

emexes

Expert

Didier9

Well-Known Member

Didier9

Well-Known Member

emexes

Expert

Didier9

Well-Known Member

Didier9

Well-Known Member

emexes

Expert

Similar Threads