That is unlikely to be a simple program. At the very minimum, you would need to build an index for every text file that records the time in the audio at which each piece of text needs to be highlighted.
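
To give an idea of what such an index could look like, here is a minimal sketch. It is written in Python only to keep it short, it is not B4J code, and the cue times, the text and the names `CUES` and `active_cue` are all made up for illustration. It assumes you have already worked out, by hand or otherwise, the start time of each piece of text.

```python
from bisect import bisect_right

# Hypothetical timing index for one text file: (start time in seconds, text).
# These values are invented; real ones would have to be measured against the audio.
CUES = [
    (0.0, "Once upon a time"),
    (2.5, "there was a small village"),
    (5.2, "at the foot of a mountain."),
]

# Start times on their own, kept sorted so they can be binary-searched.
START_TIMES = [start for start, _ in CUES]


def active_cue(position_seconds):
    """Return the index of the cue to highlight at the given playback
    position, or None if playback has not reached the first cue yet."""
    i = bisect_right(START_TIMES, position_seconds) - 1
    return i if i >= 0 else None
```

In an actual app you would probably call something like `active_cue` from a timer, passing in the player's current position, and highlight the matching text whenever the returned index changes. The lookup is the easy part; producing the times for every file is where the work is.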
The only way to make it simpler would be to use some form of speech-to-text processing. I haven't seen a library for that in B4J, and even if you create or find one, you would have to test whether it gives anywhere near good enough performance.