Android Question [Request] Apache Tika Lib/Wrap, Extract structured text from different file types

fredo

Well-Known Member
Licensed User
Longtime User
For an upcoming Android project, we need to extract structured text content from documents in any format.

Several parser solutions already exist here in the forum, but most of them are designed for one specific file format. What we are looking for is a single solution that covers as many formats as possible.
Our search for the most suitable candidate led to this product:

Apache Tika - a content analysis toolkit
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

The data is extracted from the source document using a parser API.
The Parser interface is the key concept of Apache Tika. It hides the complexity of different file formats and parsing libraries while providing a simple and powerful mechanism for client applications to extract structured text content and metadata from all sorts of documents. All this is achieved with a single method:
Java:
void parse(
   InputStream stream, ContentHandler handler, Metadata metadata,
   ParseContext context) throws IOException, SAXException, TikaException;
The parse method takes the document to be parsed and related metadata as input and outputs the results as XHTML SAX events and extra metadata. The parse context argument is used to specify context information (like the current locale) that is not related to any individual document.
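To make the idea of "one method, XHTML SAX events out" more concrete, here is a minimal sketch that uses only JDK classes. It is not Tika code: `DocParser`, `PlainTextParser`, `TextCollector` and `extractText` are hypothetical names invented for this illustration, and the metadata is simplified to a plain `Map`. In real Tika, classes such as `AutoDetectParser` and `BodyContentHandler` would presumably play the corresponding roles.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class ParserSketch {

    // Hypothetical stand-in for a Tika-style parser interface:
    // every format-specific parser exposes the same single method.
    public interface DocParser {
        void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException;
    }

    // Toy parser for plain text: reports each non-empty line as an XHTML <p> element.
    public static class PlainTextParser implements DocParser {
        @Override
        public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException {
            metadata.put("Content-Type", "text/plain");
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
            handler.startDocument();
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue; // skip blank lines
                }
                handler.startElement("", "p", "p", new AttributesImpl());
                char[] chars = trimmed.toCharArray();
                handler.characters(chars, 0, chars.length);
                handler.endElement("", "p", "p");
            }
            handler.endDocument();
        }
    }

    // SAX handler that collects character events; each closed <p> ends a line.
    public static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("p".equals(qName)) {
                text.append('\n');
            }
        }
    }

    // Client code only sees the interface, never the format-specific details.
    public static String extractText(DocParser parser, byte[] document,
                                     Map<String, String> metadata) {
        TextCollector collector = new TextCollector();
        try (InputStream in = new ByteArrayInputStream(document)) {
            parser.parse(in, collector, metadata);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return collector.text.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        byte[] doc = "Hello\n\nWorld\n".getBytes(StandardCharsets.UTF_8);
        System.out.print(extractText(new PlainTextParser(), doc, metadata));
        System.out.println(metadata);
    }
}
```

A wrapper for B4X would probably only need to expose something like `extractText`, since the calling app cares about the resulting text and metadata rather than the individual SAX events.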
An initial search for feasibility on the Android platform showed that there is a lot of interest and several solutions are being discussed.


Since our team lacks the expertise to implement this ourselves, we have the following questions for the community:

a) Am I right that Apache Tika can, in principle, be used in Android apps?

b) Would it be possible to create a B4X wrapper for such a seemingly large package?

 

fredo

It is a huge SDK

Thank you, Erel.

Since quite a few people seem to be involved in porting it to Android, I was hoping that at least the parser interface could be made available.

https://www.tutorialspoint.com/tika/tika_architecture.htm
2018-12-31_09-17-31.jpg

 