B4J Library ICUB4J - detect character-encoding formats

Discussion in 'B4J Libraries & Classes' started by moster67, Apr 4, 2016.

  1. moster67

    moster67 Expert Licensed User

    ICUB4J

    While experimenting with subtitles, I noticed that characters were often displayed incorrectly because the wrong character encoding was used when loading the subtitle file. You have surely all seen names such as UTF-8 and ISO-8859-1. These names denote different character-encoding formats.
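
    To illustrate the problem, here is a minimal sketch (the Sub name CompareDecodings is made up for this example). It reads the same file twice: once with File.ReadString, which decodes the bytes as UTF-8, and once with a TextReader set to ISO-8859-1. If the file contains non-ASCII characters, the two logged strings will differ, and at most one of them is right.

    Code:
    Sub CompareDecodings (Dir As String, FileName As String)
        'File.ReadString always decodes the bytes as UTF-8.
        Dim asUtf8 As String = File.ReadString(Dir, FileName)

        'TextReader.Initialize2 lets the caller choose the encoding.
        Dim reader As TextReader
        reader.Initialize2(File.OpenInput(Dir, FileName), "ISO-8859-1")
        Dim asLatin1 As String = reader.ReadAll
        reader.Close

        Log("UTF-8:      " & asUtf8)
        Log("ISO-8859-1: " & asLatin1)
    End Sub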

    Unfortunately, there is no fully reliable way for an application to know which character encoding to use when loading a file, unless the file format itself declares it. Therefore, most software uses a detection algorithm to guess the encoding.
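
    One case where a file does declare its encoding is a byte order mark (BOM) at the very start. Below is a rough sketch, assuming a hypothetical helper named EncodingFromBom that only recognises the UTF-8 BOM (bytes EF BB BF). Files without a BOM still need a detection algorithm.

    Code:
    'Returns "UTF-8" when the file starts with the UTF-8 byte order mark,
    'otherwise an empty string meaning "unknown".
    Sub EncodingFromBom (Dir As String, FileName As String) As String
        Dim data() As Byte = Bit.InputStreamToBytes(File.OpenInput(Dir, FileName))
        If data.Length >= 3 Then
            'Bytes are signed in B4X, so mask them before comparing.
            If Bit.And(data(0), 0xFF) = 0xEF And Bit.And(data(1), 0xFF) = 0xBB And Bit.And(data(2), 0xFF) = 0xBF Then
                Return "UTF-8"
            End If
        End If
        Return ""
    End Sub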

    I tried several detection algorithms, but in the end I decided to use ICU4J, which is continuously updated and maintained and which, according to my tests, gave the best guesses. The original ICU4J library is very large (it provides other functionality as well), so I made a wrapper for B4A and B4J using a subset of the available APIs (total size is approximately 75 KB).

    Anyway - what is ICU4J?

    ICU4J is an open-source, widely used set of Java libraries providing Unicode and globalization support for software applications.

    Java provides a very strong foundation for global programs, and IBM and the ICU team played a key role in providing globalization technology into Sun's Java. But because of its long release schedule, Java cannot always keep up-to-date with evolving standards. The ICU team continues to extend Java's Unicode and internationalization support, focusing on improving performance, keeping current with the Unicode standard, and providing richer APIs, while remaining as compatible as possible with the original Java text and internationalization API design.

    Companies and projects that use ICU4J include Google, Apple, Adobe, Amazon (Kindle) and Debian. Google has announced that Android N will also include a subset of ICU4J's APIs.

    Usage in B4J:

    Code:
    Sub Process_Globals
        'Private fx As JFX
        Private MainForm As Form
        Private guessEncoding As IcuB4J
    End Sub

    Sub AppStart (Form1 As Form, Args() As String)
        'MainForm = Form1
        'MainForm.SetFormStyle("UNIFIED")
        'MainForm.RootPane.LoadLayout("Layout1") 'Load the layout file.
        'MainForm.Show

        Try
            Dim fileName As String
            Dim fileLocation As String
            Dim detectionResult As String

            fileName = "bbchinese.srt"
            fileLocation = File.Combine(File.DirApp, fileName)
            detectionResult = guessEncoding.readFileAsStringGuessEncoding(fileLocation)
            Log(detectionResult) ' --> UTF-8
        Catch
            Log(LastException)
        End Try
    End Sub
    I'm attaching a sample project, test files and the library. You can use this library with any text file.

    I have also posted a wrapper for B4A which you can find here.

    I hope it may be useful for someone.
     

    Attached Files:

    Last edited: Apr 15, 2016
  2. xulihang

    xulihang Member Licensed User

    Code:
    Sub convert(dir As String, filename As String)
        Dim charsetDetector As JavaObject
        charsetDetector.InitializeNewInstance("com.ibm.icu.text.CharsetDetector", Null)
        charsetDetector.RunMethodJO("setText", Array(File.OpenInput(dir, filename)))

        Dim charsetMatch As JavaObject
        charsetMatch = charsetDetector.RunMethodJO("detect", Null)

        'getName returns the detected charset name; getLanguage would return a language code instead.
        If charsetMatch.RunMethod("getName", Null) <> "UTF-8" Then
            'File.WriteString writes UTF-8, so this re-encodes the decoded text.
            File.WriteString(dir, filename, charsetMatch.RunMethod("getString", Null))
        End If
    End Sub
    The jar-lite is great. I find that encoding conversion can also be done using charsetMatch.getString().
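
    Along the same lines, here is a small sketch that only logs the guess instead of rewriting the file. It assumes the jar on the classpath contains the full com.ibm.icu.text.CharsetMatch class, and LogDetection is just an illustrative name. In ICU4J, getName returns the charset name and getConfidence a value from 0 to 100.

    Code:
    Sub LogDetection (Dir As String, FileName As String)
        Dim detector As JavaObject
        detector.InitializeNewInstance("com.ibm.icu.text.CharsetDetector", Null)

        Dim stream As InputStream = File.OpenInput(Dir, FileName)
        detector.RunMethod("setText", Array(stream))

        'detect returns the best match, or null when nothing matches.
        Dim match As JavaObject = detector.RunMethodJO("detect", Null)
        stream.Close

        If match.IsInitialized Then
            Log(match.RunMethod("getName", Null) & " (confidence " & match.RunMethod("getConfidence", Null) & ")")
        Else
            Log("No charset match found")
        End If
    End Sub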
     