Android Code Snippet PDF2Text

Robert Valentino

Well-Known Member
Licensed User
I found this article on CodeProject about reading a PDF and converting it to text: http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file

I have done my best to translate the code into B4A Basic (see attached file).

The program will read the first page of a PDF (not sure why it does not process the other pages) and output the text.

Maybe someone with a stronger knowledge of PDFs can figure out why it does not process the additional pages.

Just an FYI: I am not the strongest Basic programmer (I am sure there are better ways to make this program work), but this was more of a proof of concept.

Anyone want to jump in and help?

BobVal
 


Robert Valentino

Well-Known Member
Licensed User
The user who wrote the CodeProject code I converted assumed that every BT (begin text) should start on a new line.

After reading the PDF spec a little (http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf) I realized that there is a positioning matrix after the BT.

I now process this matrix: if the Ty (matrix parameter f) is not the same as the last one, I put the text on a new line; otherwise I put a TAB between the fields.
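The new-line versus TAB decision could be sketched roughly as follows. This is only an illustration of the idea, not the code in cPDF2Text.bas; the Sub name SeparatorFor and its parameters are made up for the example.

```b4x
' Hypothetical sketch; names are illustrative and not taken from cPDF2Text.bas.
' Returns the separator to emit before a text block whose baseline is Ty.
Sub SeparatorFor(Ty As Double, LastTy As Double, IsFirst As Boolean) As String
    If IsFirst Then Return ""            ' nothing goes before the very first block
    If Ty <> LastTy Then Return CRLF     ' different baseline: start a new line
    Return TAB                           ' same baseline: next field on the same line
End Sub

' Usage: two fields on one baseline, then a field on the next baseline.
Sub DemoSeparator As String
    Dim sb As StringBuilder
    sb.Initialize
    sb.Append(SeparatorFor(700, 0, True)).Append("Date")
    sb.Append(SeparatorFor(700, 700, False)).Append("Amount")
    sb.Append(SeparatorFor(688, 700, False)).Append("01/02")
    Return sb.ToString
End Sub
```

Comparing Ty values for exact equality is the simplest choice; a small tolerance may be needed if a PDF places text on the same visual line with slightly different baselines.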

Now my data looks like what I used to get when I copied and pasted into a text file, so my parser can handle it.

Now if only I could figure out what is wrong with getting to the other pages.

To be continued.....

BobVal
 


Robert Valentino

Well-Known Member
Licensed User
Figured it ALL out (or at least for now).

This program will take Test.pdf and produce a Test.txt file.

Test.txt is something I can parse easily enough.

I am sure there is more work to be done, but for now this is usable.

NOTE: I have a lot of debug code and it is slow in Debug mode, so I use the old Legacy debugger when debugging.

It takes a while in release mode too, but I am sure there will be ways to speed this code up.

Enjoy.

BobVal
 


Robert Valentino

Well-Known Member
Licensed User
In my Main module I call cPDF2Text.bas like this:

B4X:
    '------------------------------------------------------------------------------------------------------------------------
    '  Process a PDF file and return a Text string that can either be saved or parsed
    '------------------------------------------------------------------------------------------------------------------------
    Dim TextFile As String = cPDF2Text.ProcessPDFFile("/BOBs/PDF2Text/", "test.pdf")

    File.WriteString(File.DirRootExternal & "/BOBs/PDF2Text/", "test.txt", TextFile)

    Msgbox("File Converted", "File Converted")
 

Robert Valentino

Well-Known Member
Licensed User
Complete project attached.

You will need to change where the PDF file is located, what it is called, and where you want the output written and what it is called.

In my case everything is in /BOBs/PDF2Text

Hope this helps

BobVal
 


Robert Valentino

Well-Known Member
Licensed User
Found that Tm (text matrix) and Td or TD (text positioning) have the x and y in the same position (the last two operands).

Modified the code to handle any of the three.

Also, I am now tracking the last line output and comparing it to the current line I am about to output (if they are the same, I discard the current one).
It seems that when text is highlighted, changes font, or any number of other things, there can be a BT / ET with the same text 6 or 8 times.

I now suppress them.
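A minimal sketch of that kind of duplicate suppression is shown below. The Sub name AppendIfNew and the LastLineOut variable are assumptions made for the example and are not the names used in cPDF2Text.bas.

```b4x
' Hypothetical sketch; names are illustrative and not taken from cPDF2Text.bas.
' Appends Line to Output unless it repeats the previous line.
' Returns the value the caller should keep as the new "last line out".
Sub AppendIfNew(Output As StringBuilder, Line As String, LastLineOut As String) As String
    If Line = LastLineOut Then Return LastLineOut   ' same text again: discard it
    Output.Append(Line).Append(CRLF)
    Return Line
End Sub

' Usage: the repeated "Total 12.50" is written only once.
Sub DemoSuppress As String
    Dim sb As StringBuilder
    sb.Initialize
    Dim Last As String = ""
    Last = AppendIfNew(sb, "Total 12.50", Last)
    Last = AppendIfNew(sb, "Total 12.50", Last)
    Last = AppendIfNew(sb, "Paid 12.50", Last)
    Return sb.ToString
End Sub
```

Only consecutive repeats are dropped here, so a line that legitimately appears again later in the document is still kept.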

The attached PDF files have completely different layouts and come from different websites. In either case PDF2Text will process them and give an output file that can be parsed for the data.

These are just two examples I am working on. If someone tries it on another file, let me know how it does.

BobVal

UPDATE: 4/11/2017

I noticed that I could not convert some PDF files to text because they were using a different internal format.

So I added a flag to ProcessPDFFile called AlternateWay. It makes the parser use the alternate format. All the PDFs use the normal way.

I haven't figured out yet how to determine on the fly which way to parse the file. When I do, I will upload an update. If one way doesn't work, try again with the AlternateWay flag.
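Until the format can be detected automatically, the "try again" step could be scripted along these lines. This is a sketch under two assumptions that are not confirmed by the post: that AlternateWay is a third, Boolean parameter of ProcessPDFFile, and that a parse which does not work returns an empty string.

```b4x
' Assumptions (check against your copy of cPDF2Text.bas):
'  - AlternateWay is the third parameter and is a Boolean
'  - a parse that fails returns an empty string
Dim Folder As String = "/BOBs/PDF2Text/"
Dim TextFile As String = cPDF2Text.ProcessPDFFile(Folder, "test.pdf", False)
If TextFile.Trim.Length = 0 Then
    ' The normal way produced nothing: retry with the alternate format.
    TextFile = cPDF2Text.ProcessPDFFile(Folder, "test.pdf", True)
End If
```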
 


geola

Member
Licensed User
Hi Robert,

I'm working on converting some PDF files to text. Now, with a new document, the following line in your code gives me an error:


B4X:
Dim DecompressedBytes()           As Byte = CompressDecompress.DecompressBytes(Data2Decompress, "zlib")
cpdf2text_decompression (B4A line: 109)
Dim DecompressedBytes() As Byte = Compr
java.io.IOException
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:178)
at java.io.InputStream.read(InputStream.java:163)
at anywheresoftware.b4a.objects.streams.File.Copy2(File.java:356)
at anywheresoftware.b4a.randomaccessfile.CompressedStreams.DecompressBytes(CompressedStreams.java:45)
at com.BOBs.PDF2Text.cpdf2text._decompression(cpdf2text.java:98)
at com.BOBs.PDF2Text.cpdf2text._processpdffile(cpdf2text.java:996)
at com.BOBs.PDF2Text.main._jobdone(main.java:493)
at java.lang.reflect.Method.invokeNative(Native Method)
at java.lang.reflect.Method.invoke(Method.java:511)
at anywheresoftware.b4a.BA.raiseEvent2(BA.java:187)
at anywheresoftware.b4a.keywords.Common$5.run(Common.java:981)
at android.os.Handler.handleCallback(Handler.java:725)
at android.os.Handler.dispatchMessage(Handler.java:92)
at android.os.Looper.loop(Looper.java:137)
at android.app.ActivityThread.main(ActivityThread.java:5328)
at java.lang.reflect.Method.invokeNative(Native Method)
at java.lang.reflect.Method.invoke(Method.java:511)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:1102)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:869)
at dalvik.system.NativeStart.main(Native Method)
Caused by: java.util.zip.DataFormatException: data error
at java.util.zip.Inflater.inflateImpl(Native Method)
at java.util.zip.Inflater.inflate(Inflater.java:228)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:159)
... 19 more

Any idea or update here?

Thanks,
Michael
 

Robert Valentino

Well-Known Member
Licensed User
Michael:

Sorry not to get back to you sooner. Our family pet (dog) has gotten cancer and this has kept me running around from vet to vet for chemo treatments.

I ran the code on my machine (both in release and debug mode; NOTE: debug mode is really slow, release mode is pretty fast) and had no problems.

I guess the next questions are:
What version of B4A are you running?
Library versions:
What version of Core? (I have version 4.01)
What version of ByteConverter? (I have version 1.10)
What version of RandomAccessFile? (I have version 2.00; this is where the zlib decompress is)
What version of Java? (I do not believe this is a Java problem, but it could be. I just upgraded to 8.0 but was on 7.x when I wrote the code)

Let's make sure everything is in sync
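While comparing versions, it may also help to see which stream is failing instead of letting the exception stop the whole conversion. The sketch below wraps the decompress call in Try/Catch; the Sub name SafeDecompress is made up for this example, and it assumes the RandomAccessFile library (which provides CompressedStreams) is referenced.

```b4x
' Hypothetical helper; not part of cPDF2Text.bas.
' Returns the inflated bytes, or an empty array if the data is not valid zlib.
Sub SafeDecompress(Data() As Byte) As Byte()
    Dim cs As CompressedStreams
    Try
        Return cs.DecompressBytes(Data, "zlib")
    Catch
        ' Log the size of the failing stream so it can be located in the PDF.
        Log("Decompress failed (" & Data.Length & " bytes): " & LastException.Message)
        Dim Empty(0) As Byte
        Return Empty
    End Try
End Sub
```

A "data error" on one stream does not necessarily mean the whole file is unreadable; the bytes handed to the decompressor may simply not be a zlib stream, for example if the stream boundaries were cut in the wrong place or the stream uses a different filter.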

BobVal
 

NeoTechni

Well-Known Member
Licensed User
Change line 85 in the bas file to:

Return Bit.InputStreamToBytes(File.OpenInput(FilePath, FileName))

That way it also works with File.DirAssets.
 

NeoTechni

Well-Known Member
Licensed User
Is there a way to extract images too? Or maybe the X/Y locations of text? Or perhaps HTML tags like bold/underline/etc?
 

MarcoRome

Expert
Licensed User
Cleaned up ALL the compiler warnings (should have done this before the last upload)

Sorry about the warnings - there are none now
Hi Valentino. I have attached this file. When I try to convert it, I get the following error:

cpdf2text_v7 (java line: 247)
java.lang.ArrayIndexOutOfBoundsException: length=10; index=10
at com.BOBs.PDF2Text.cpdf2text._v7(cpdf2text.java:247)
at com.BOBs.PDF2Text.cpdf2text._vv1(cpdf2text.java:556)
at com.BOBs.PDF2Text.cpdf2text._vv2(cpdf2text.java:683)
at com.BOBs.PDF2Text.main._activity_create(main.java:334)
at java.lang.reflect.Method.invoke(Native Method)
at anywheresoftware.b4a.BA.raiseEvent2(BA.java:186)
at com.BOBs.PDF2Text.main.afterFirstLayout(main.java:102)
at com.BOBs.PDF2Text.main.access$000(main.java:17)
at com.BOBs.PDF2Text.main$WaitForLayout.run(main.java:80)
at android.os.Handler.handleCallback(Handler.java:751)
at android.os.Handler.dispatchMessage(Handler.java:95)
at android.os.Looper.loop(Looper.java:154)
at android.app.ActivityThread.main(ActivityThread.java:6692)
at java.lang.reflect.Method.invoke(Native Method)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:1468)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1358)
java.lang.ArrayIndexOutOfBoundsException: length=10; index=10
The code is:
B4X:
Dim TextFile As String = cPDF2Text.ProcessPDFFile("/Download/", "08072017_085148_977_BOB_STATEMENT.pdf")

File.WriteString(File.DirRootExternal & "/Download/", "test.txt", TextFile)

Msgbox("File Converted", "File Converted")

ExitApplication
Any suggestions?
Thank you
Marco
 

MarcoRome

Expert
Licensed User
If I change this line:

B4X:
NumberBytes = Array As Byte(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
to

B4X:
NumberBytes = Array As Byte(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
I get no error, but the resulting txt file is empty, as in the attached file.
 


Robert Valentino

Well-Known Member
Licensed User
Try using the Alternate method (you are using an old version of cPDF2Text that does not have this method).

I have found things (not sure if they are errors) that my program does not handle.

I wrote this program to parse known PDF files that I needed to parse.
So my program parses based not on what is really right but on what it thinks it should be getting.

Let me explain.

My program looks for BT and ET commands to find blocks of data that need to be parsed.
In ALL the examples I have seen, it is BT followed by a CR (hex 0x0D) and ET followed by a CR.
In your example it is BT followed by a line feed (0x0A) and ET followed by an LF.

In all the examples I have seen there is a command Jc followed by a CR.
In YOUR example the Jc is uppercase JC and followed by an LF.

Now, I am not saying that there is anything wrong with your PDF; in fact it displays just fine.

But all the examples I have come across follow the rules in my program, so these things need to be handled to make the program work with your PDF.

I didn't read the Adobe manual, just used brute force to pull the data out.
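One possible way to handle both line endings is to normalize them before looking for BT and ET. The sketch below is only an illustration and is not the code in cPDF2Text.bas; the Sub names NormalizeEol, IsOperator and CountTextBlocks are made up for the example, and it assumes the content stream has already been decompressed into a String.

```b4x
' Hypothetical helpers; names are illustrative and not taken from cPDF2Text.bas.
' Converts CR LF and bare CR to LF so one comparison covers all three cases.
Sub NormalizeEol(Content As String) As String
    Dim CR As String = Chr(13)
    Dim LF As String = Chr(10)
    Dim s As String = Content.Replace(CR & LF, LF)
    Return s.Replace(CR, LF)
End Sub

' True when the line is exactly the given operator, e.g. "BT" or "ET".
Sub IsOperator(Line As String, Op As String) As Boolean
    Return Line.Trim = Op
End Sub

' Usage: count the BT operators in a content stream, whatever its line endings.
Sub CountTextBlocks(Content As String) As Int
    Dim LF As String = Chr(10)
    Dim Count As Int = 0
    For Each Line As String In Regex.Split(LF, NormalizeEol(Content))
        If IsOperator(Line, "BT") Then Count = Count + 1
    Next
    Return Count
End Sub
```

This only covers operators that sit on a line of their own. A PDF is also allowed to separate operators with spaces on a single line, which would need a tokenizer rather than a line split.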
 

MarcoRome

Expert
Licensed User
Thank you for your time :)
Here is the wrapper: PdfToText
 