B4J Library [NLP] Apache Tika - Text extraction

Apache license 2.0


As I wrote here, I intend to focus on text analysis / natural language processing in the near future. The first step in analyzing text is being able to extract it.
Apache Tika is an open source project that allows extracting text and metadata from many formats: https://tika.apache.org/1.27/formats.html

It is based on many other open source projects and provides a simple and consistent API.

Instructions

1. Download the dependency jars: www.b4x.com/b4j/files/Tika.zip (70 MB).
2. Copy the Tika folder to the additional libraries folder. Keep the Tika folder itself; do not copy only its contents.
3. Download Tika.b4xlib and put it in the additional libraries folder.

Parsing is a matter of calling:
B4X:
Wait For (Tik.Parse(File.OpenInput(dir, filename))) Complete (Res As TikaParseResult)
See the attached example. The example depends on DragAndDrop2.b4xlib: https://www.b4x.com/android/forum/threads/jdraganddrop2-drag-and-drop.76168/post-636391

Notes

1. Develop in debug mode. If you want to develop in release mode, set #MergeLibraries: False; otherwise compilation slows down, because building the single jar can take a while.
It is still possible to build a single jar; it just takes some time.
2. Tika will not work with the standalone package (which is the same as B4JPackager11).
3. You will see a red warning in the logs about a missing dependency (jai-image-io). Ignore it.
4. This library requires Java 11+: https://www.b4x.com/b4j.html
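
For note 1, the attribute goes with the other project attributes in the Main module. A minimal fragment, assuming the default B4J project template (the region name comes from that template; only the #MergeLibraries line is taken from the note above):

```b4x
#Region Project Attributes
	'Skip building the single merged jar so that release builds compile faster.
	#MergeLibraries: False
#End Region
```

Remove the line again (or set it to True) when you do want the single merged jar.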
 

Attachments

  • Tika.b4xlib (3 KB)
  • Example.zip (4.8 KB)

paragkini

I'm getting the error below.


B4X:
Compiling generated Java code.    Error
javac 1.8.0_291
src\b4j\example\tika.java:387: error: local variable password is accessed from within inner class; needs to be declared final
                    if (password.length() > 0)
                        ^
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
1 error

I also copied the Tika folder.


Am I missing something or doing something wrong?
 

paragkini

I tried putting it in a different folder on drive D and referencing it in Configure Paths, but that gave a different error. I then also put it in the internal libraries folder (just to see whether it works), but it still gave the same error. I will try rearranging all the folders again later and keep you posted.
 

Erel

B4X founder
Staff member
While @DonManfred is correct that you should configure the additional libraries folder to be outside of Program Files, the error is related to the Java version.
Switch to OpenJDK 11.

I will add a message about it in the first post.
 

hanyelmehy

I get this error for a large file. How can I fix it?
B4X:
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
Also, you say that Tika will not work with the standalone package (which is the same as B4JPackager11).
Is there any other way to build a standalone app?
 