B4J Question B4J PDF To Text?

RichardN

Active Member
Licensed User
I would like to extract raw text from a PDF.

I have looked briefly at the jPDFjet library but it seems like a heavyweight option aimed primarily at the finer points of PDF authoring rather than engineering in the opposite direction. Is there a simpler option I have missed?

Any suggestions?
 

RichardN

Active Member
Licensed User
Thanks Erel.

Just a word of warning to save anybody else some time: the jPDFbox library wrapped by fixit30 in the thread above requires the file 'pdfbox-app-1.8.10.jar' to be present in the additional libraries folder, NOT 'pdfbox 1.8.10.jar' as quoted in the thread.

The current version of PDFBox available on the Apache site is 2.0.16, and that .jar does NOT work with the B4J library jPDFbox v1.0. The older version (1.8.10) can be downloaded from the archive at https://archive.apache.org/dist/pdfbox/1.8.10/pdfbox-app-1.8.10.jar
 

Gabino A. de la Gala

Active Member
Licensed User
I can already extract the text from a PDF document.

My "problem" now is to somehow separate the contents of that file into columns so that I can then export it to .csv.
The file is a typical client list that shows the data for each client in separate columns.
Fields such as code, telephone, etc. consist of a single word, but names, addresses, etc. can contain one or several words.

Is there any way to detect column boundaries other than treating blanks as separators? Blanks can also occur inside a field, so they are not reliable on their own.

Thank you.
 

RichardN

Active Member
Licensed User
Hola Gabino,

I have undertaken several projects in the past that were aimed at extracting data from PDFs and parsing it into a consistent text format to be imported into a database. The biggest problem is guaranteeing that the format of the extracted text is consistent from document to document. If it is not, then parsing data from the text becomes an absolute nightmare.

There is no guarantee that the encoding of the original data into the PDF remains consistent from one document to the next, even if the author uses the same PDF generator method/software. All you need is the slightest ASCII inconsistency and your parsing of the text will crash or produce incorrect results.
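As a rough illustration of guarding against that kind of character inconsistency, the sketch below normalizes extracted text before any parsing is attempted. It is plain Java using only the standard library; the class name ExtractedTextCleaner and the method clean are hypothetical, and the particular set of characters it maps is an assumption about what tends to turn up in extracted PDF text, not a complete list.

```java
import java.text.Normalizer;

/** Hypothetical helper: cleans up text extracted from a PDF before parsing it. */
final class ExtractedTextCleaner {

    private ExtractedTextCleaner() { }

    static String clean(String raw) {
        // NFKC folds compatibility characters: the "fi" ligature becomes the two
        // letters f and i, and the no-break space becomes an ordinary space.
        String s = Normalizer.normalize(raw, Normalizer.Form.NFKC);

        // Typographic quotes and dashes have no compatibility mapping,
        // so they are mapped to ASCII by hand (assumed set, extend as needed).
        s = s.replace('\u2018', '\'').replace('\u2019', '\'')
             .replace('\u201C', '"').replace('\u201D', '"')
             .replace('\u2013', '-').replace('\u2014', '-');

        // Settle on a single line-ending convention.
        s = s.replace("\r\n", "\n").replace('\r', '\n');

        // Any remaining Unicode space separator becomes a plain space.
        s = s.replaceAll("\\p{Zs}", " ");

        // Drop control characters other than newline and tab.
        s = s.replaceAll("[\\p{Cntrl}&&[^\\n\\t]]", "");
        return s;
    }
}
```

Running every document through a step like this first means the parser only ever sees one kind of space, quote and line ending. It does not help with layout differences between documents, only with character-level ones.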

It is something I avoid unless there is no other source of that information.
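For the column question above, one possible approach is sketched below in plain Java. It assumes the extracted text keeps the columns aligned with space padding, which is not guaranteed: some extractors collapse the padding to single spaces, in which case this will not work. The idea is to find character positions that are blank in every line and treat a run of at least minGap such positions as a column boundary, so single blanks inside names and addresses are ignored. The class name ColumnSplitter and all method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: splits space-aligned listing lines into fields. */
final class ColumnSplitter {

    private ColumnSplitter() { }

    /** Finds the start offset of each column from positions that are blank in all lines. */
    static List<Integer> findColumnStarts(List<String> lines, int minGap) {
        int width = 0;
        for (String line : lines) {
            width = Math.max(width, line.length());
        }
        // blank[i] stays true only if no line has a non-space character at offset i.
        boolean[] blank = new boolean[width];
        Arrays.fill(blank, true);
        for (String line : lines) {
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) != ' ') {
                    blank[i] = false;
                }
            }
        }
        List<Integer> starts = new ArrayList<>();
        int gap = minGap; // the start of the line counts as a boundary
        for (int i = 0; i < width; i++) {
            if (blank[i]) {
                gap++;
            } else {
                if (gap >= minGap) {
                    starts.add(i);
                }
                gap = 0;
            }
        }
        return starts;
    }

    /** Cuts one line at the given column starts and trims each field. */
    static List<String> splitLine(String line, List<Integer> starts) {
        List<String> fields = new ArrayList<>();
        for (int c = 0; c < starts.size(); c++) {
            int from = Math.min(starts.get(c), line.length());
            int to = c + 1 < starts.size()
                    ? Math.min(starts.get(c + 1), line.length())
                    : line.length();
            fields.add(line.substring(from, to).trim());
        }
        return fields;
    }

    /** Joins fields into one CSV row, quoting every field and doubling embedded quotes. */
    static String toCsv(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append('"').append(fields.get(i).replace("\"", "\"\"")).append('"');
        }
        return sb.toString();
    }
}
```

Typical use would be to call findColumnStarts once on all data lines of a page with minGap set to 2, then splitLine and toCsv on each line. Column boundaries are computed per document rather than hard-coded, but header and footer lines should be filtered out first, since any text that spans a gap will hide that boundary.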
 