B4A Library ICUB4A - detecting character-encoding formats

ICUB4A

When experimenting with subtitles, I noticed that characters were very often displayed incorrectly because the wrong character encoding was used when loading the subtitle file. You have surely all seen names such as UTF-8, ISO-8859-1 and so forth. These names denote various character-encoding formats.
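To make the problem concrete, here is a minimal Java sketch (Java because ICU4J is a Java library; it uses only the standard library, not ICUB4A). The class and method names are made up for illustration. It encodes a character as UTF-8 and then decodes the same bytes as ISO-8859-1, which is what happens when a loader assumes the wrong encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of ICUB4A or ICU4J.
class MojibakeDemo {
    // Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1,
    // imitating a loader that assumes the wrong character encoding.
    static String decodeWrongly(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is two bytes in UTF-8: 0xC3 0xA9.
        String garbled = decodeWrongly("\u00e9");
        // Each byte becomes its own character: U+00C3 followed by U+00A9.
        System.out.println(garbled.length()); // prints 2
    }
}
```

One character has turned into two unrelated ones. With Chinese subtitles, where most characters take three bytes in UTF-8, the effect is even more visible.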

Unfortunately, there is no 100% reliable method for a program/application to know which character-encoding format to use when loading a file, unless the file format itself declares it. Therefore, most software uses a detection algorithm to guess the character-encoding format in use.
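As a rough idea of what "guessing" means, here is a deliberately simplified Java sketch using only the standard library. It is not the algorithm ICU4J uses, and the class name, method names and the ISO-8859-1 fallback are my own assumptions for illustration. It first looks for a byte-order mark, then checks whether the bytes are valid UTF-8, and otherwise falls back to a single-byte encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ICUB4A or ICU4J.
class EncodingGuess {
    // Returns true if data begins with the given unsigned byte values.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    // Very rough guess: BOM first, then strict UTF-8 validation, then a fallback.
    static String guess(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE"; // simplification: ignores UTF-32LE
        try {
            // REPORT makes the decoder throw on invalid byte sequences
            // instead of silently replacing them.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            // Assumption: anything that is not valid UTF-8 is treated as ISO-8859-1.
            return "ISO-8859-1";
        }
    }
}
```

A sketch like this cannot tell ISO-8859-1 apart from other single-byte or legacy multi-byte encodings (for example those used for Chinese subtitles), which is exactly why a dedicated detection library is worth using.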

I tried several detection algorithms, but in the end I decided to use ICU4J, which is continuously updated and maintained and which, according to my tests, produced the best guesses. The original ICU4J library is very large (it provides other functionality as well), so I made a wrapper for B4A and B4J using a subset of the available APIs (total size is approximately 75 KB).

Anyway - what is ICU4J?

ICU4J is an open-source, widely used set of Java libraries providing Unicode and globalization support for software applications.

Java provides a very strong foundation for global programs, and IBM and the ICU team played a key role in providing globalization technology into Sun's Java. But because of its long release schedule, Java cannot always keep up-to-date with evolving standards. The ICU team continues to extend Java's Unicode and internationalization support, focusing on improving performance, keeping current with the Unicode standard, and providing richer APIs, while remaining as compatible as possible with the original Java text and internationalization API design.

Organizations that use ICU4J include, for instance, Google, Apple, Adobe, Amazon (Kindle) and Debian. Google has announced that Android N will also include a subset of ICU4J's APIs.

Usage in B4A:

B4X:
Sub Process_Globals
    'These global variables will be declared once when the application starts.
    'These variables can be accessed from all modules.

End Sub

Sub Globals
    'These global variables will be redeclared each time the activity is created.
    'These variables can only be accessed from this module.
    Private guessEncoding As IcuB4A

End Sub

Sub Activity_Create(FirstTime As Boolean)
    'Do not forget to load the layout file created with the visual designer. For example:
    'Activity.LoadLayout("Layout1")

    Try
        Dim fileName As String
        Dim fileLocation As String
        Dim detectionResult As String
    
        fileName = "bbchinese.srt"
        fileLocation = File.Combine(File.DirRootExternal,fileName)
        detectionResult = guessEncoding.readFileAsStringGuessEncoding(fileLocation)
        Log(detectionResult)
    
            
    Catch
        Log(LastException)
    End Try

End Sub

I'm attaching a sample project, test files and the library. You can use this library with any text file.

I have also posted a wrapper for B4J which you can find here.

I hope it may be useful for someone.
 

Attachments

  • ICUB4Asample.zip (6.8 KB)
  • libs.zip (62.9 KB)
  • testfiles.zip (45.5 KB)

awakenblueheart

I'm using the sample, but I put bbchinese.srt into my assets directory.

B4X:
fileLocation = File.Combine(File.DirAssets,fileName)

I got this result:

(ErrnoException) android.system.ErrnoException: open failed: ENOENT (No such file or directory)
 

moster67

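The error you mentioned is most likely caused by the fact that files in the Assets directory are read-only and packaged inside the APK, so they are not regular files that the library can open by path. You will need to copy the file to a folder where you have write access and then run the detection on the copy.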

You could create a sub like this (not tested) if you know that the file is in the Assets directory:
B4X:
Sub another
    Dim fileName As String = "bbchinese.srt"
    Dim detectionResult As String = GetDetectionResult(fileName)
    Log(detectionResult)
End Sub

'Copies the asset to external storage, runs the detection on the copy,
'deletes the copy and returns the result ("" if the asset was not found).
Sub GetDetectionResult(Filename As String) As String
    Dim filelist As List
    filelist = File.ListFiles(File.DirAssets)
    Dim i As Int
    For i = 0 To filelist.Size - 1
        If Filename = filelist.Get(i) Then
            File.Copy(File.DirAssets, Filename, File.DirDefaultExternal, Filename)
            Dim fileLocation As String = File.Combine(File.DirDefaultExternal, Filename)
            Dim detectionResult As String = guessEncoding.readFileAsStringGuessEncoding(fileLocation)
            'delete the copy after checking
            If File.Exists(File.DirDefaultExternal, Filename) Then
                File.Delete(File.DirDefaultExternal, Filename)
            End If
            Return detectionResult
        End If
    Next
    'the file was not found in the Assets directory
    Return ""
End Sub
 