Share My Creation Talk To The Hand (Redux)

drgottjr · Oct 15, 2022

So, here is Mi Dica II. The initial version https://www.b4x.com/android/forum/threads/talk-to-the-hand.143453/
produced a half-duplex speech recognition/translation app using Google's Voice Recognition engine, MLKit's
on-device translator and Android's TTS engine. Basically, I speak in my language, and the person I'm talking to
hears what I said translated into her language. She responds (or not!) in her language, and I hear a translation
in my language. There was a YouTube video demonstrating.

This release uses @Biswajit's Vosk wrapper to handle speech recognition duties. Immediately we
have some compromises (having nothing to do with the library as far as I can tell).

Whether the 2 recognition engines offer similar results when compared under like conditions, I can't say.
Google's Voice Recognition services require you to be online, that is, connected to Google. Vosk's
engine for Android works offline (but with a smaller recognition model than its full, online one).

I had wanted an offline model in the first place since a fast Internet connection in some mountain village
might not be available, and I like my devices to be self-sufficient. Vosk is the only offline engine I came
across. Forum member @Biswajit beat me to it, so congratulations (and a donation) to him. His example
was easily refactored to fit my "speak lang 1 ---> translate ---> hear lang 2" model.
YouTube video found here:
Unfortunately, "speak lang 2 ---> translate ---> hear lang 1" will require some looking into. At the very least,
it would require a much larger application since the language models reside onboard. With each model at
around 50MB, you're not going to want too many. And each combination you build will require its own activity.
If you were an English speaker traveling only to France, as an example, you could build a 2-language model
in advance. But if you then went to, eg, Italy, you would have to have another 2-language model app to handle
English-Italian translation. With Google's engine, it didn't matter how many language combinations you needed
or when you needed them. But you did have to be online.
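
To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in Python (not B4A). The 50MB figure is the approximate per-model size mentioned above; the function names are made up for illustration, and neither the translation models nor the app itself is counted.

```python
from itertools import combinations

MODEL_MB = 50  # approximate size of one offline recognition model (figure from the post)

def bundled_size_mb(languages):
    """Rough size of the recognition models if every listed language ships in the app."""
    return len(set(languages)) * MODEL_MB

def language_pairs(languages):
    """Every 2-language combination; per the post, each would need its own activity."""
    return list(combinations(sorted(set(languages)), 2))

langs = ["en", "fr", "it"]
print(bundled_size_mb(langs))  # 150
print(language_pairs(langs))   # [('en', 'fr'), ('en', 'it'), ('fr', 'it')]
```

So covering English, French and Italian in a single app would mean roughly 150MB of recognition models and three language pairs to wire up.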

Google's recognizer is very good (to say the least), Vosk's offline recognizer less so. But since Vosk's offline model
is about 5% the size of its full model, I was amazed that anything was captured, even if it occasionally took several attempts.

With Google's system, in addition to its being much simpler to change languages on the fly, any model
updates and improvements are automatically inherited. With Vosk, a less accurate offline recognition model
does not really lend itself to vocal translation. Furthermore, you would have to check with Vosk periodically
to see if there were any updated language models and then make them available for downloading within the
app.

Google recognizes a certain amount of silence as a signal you're done speaking. It stops listening at that point.
If you want to talk, you have to start the engine up again. Vosk features a continuous speech recognition
mode: it listens until you turn it off, capturing your output as it goes along. This is convenient and cool if you were,
eg, dictating a letter. It's not so cool if, in addition to your voice, it also suddenly hears the voice of the TTS engine
speaking in a different language. Let the endless loop begin! It was difficult to get it to stop.

When Vosk "hears" a certain amount of silence, it outputs a "final" version of what you've said until that point. But
it continues to listen in case you (or anyone else) begin to speak. So the continuous recognition mode is a 2-edged
sword. I chose to make it stop listening when I go silent to avoid the vicious cycle referred to above. This is the way
Google's recognizer behaves. And it suits the app's main purpose, which is to provide simple spoken help in a foreign
language. In theory, it would be possible to cobble a continuous listening/speaking model together since the TTS engine
has a queueing feature, but I think there are some issues with voice recording picking up ambient noise, not to mention
output from the speaker.
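
For illustration only, the stop-on-silence behaviour described above can be modelled as a tiny state machine. This is a Python sketch, not the app's B4A code and not the Vosk or Google API: the class, the callback names and the stand-in translator are all hypothetical. The point it shows is that once a "final" result arrives the recognizer is off, so anything the microphone picks up from the TTS engine gets dropped instead of being translated again.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    SPEAKING = auto()

class HalfDuplexSession:
    """Toy model of 'listen, stop on the final result, speak, stay off'."""

    def __init__(self, translate):
        self.translate = translate  # hypothetical callable: str -> str
        self.state = State.IDLE
        self.spoken = []            # what the pretend TTS engine has said

    def start_listening(self):
        # Only the user starts the recognizer; nothing restarts it automatically.
        if self.state is State.IDLE:
            self.state = State.LISTENING

    def on_audio(self, text, final):
        # Anything heard while not listening (e.g. the TTS voice) is dropped.
        if self.state is not State.LISTENING:
            return False
        if final:
            # Silence detected: stop listening first, then speak the translation.
            self.state = State.SPEAKING
            self.spoken.append(self.translate(text))
        return True

    def on_tts_done(self):
        # Speech finished; wait for the user to start the next turn.
        if self.state is State.SPEAKING:
            self.state = State.IDLE

session = HalfDuplexSession(translate=lambda s: s.upper())  # stand-in translator
session.start_listening()
session.on_audio("where is the station", final=True)
echoed = session.on_audio("WHERE IS THE STATION", final=True)  # mic picks up the TTS
session.on_tts_done()
print(session.spoken, echoed, session.state)
```

In a continuous-listening design the second `on_audio` call would be accepted and translated back, which is the endless loop described earlier. Here it is ignored because the session has already left the LISTENING state.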
