Android Code Snippet PDF2Text

Robert Valentino · Apr 26, 2015

The author of the CodeProject code I converted assumed that every BT (begin text) should start on a new line.

Reading the PDF spec a little (http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf), I realized that there is a positioning matrix after the BT.

I now process this matrix, and if the Ty (matrix parameter f) is not the same as the last one, I put the text on a new line; otherwise I put a TAB between the fields.
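A minimal sketch of that rule, written in Python rather than B4A so it can be run on its own. The function name `layout_text` and the `(ty, text)` pair format are assumptions made for illustration; they are not taken from cPDF2Text.

```python
def layout_text(blocks):
    """blocks: (ty, text) pairs in the order they appear in the page stream."""
    out = []
    last_ty = None
    for ty, text in blocks:
        if last_ty is None:
            out.append(text)            # first field: no separator needed
        elif ty == last_ty:
            out.append("\t" + text)     # same Ty: TAB between the fields
        else:
            out.append("\n" + text)     # Ty changed: start a new line
        last_ty = ty
    return "".join(out)


# Made-up sample data: two fields on one baseline, two on the next
sample = [(700.0, "Date"), (700.0, "Amount"), (688.0, "04/26/2015"), (688.0, "12.50")]
print(layout_text(sample))
```

Comparing Ty for exact equality mirrors the description above; a real file may need a small tolerance if baselines differ by a fraction of a point.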

Now my data looks like what I used to get when I copied and pasted into a text file, so my parser can handle it.

Now if I can just figure out what is wrong with getting to the other pages.

To be continued.....

BobVal

Robert Valentino · Apr 26, 2015

Figured it ALL out (or at least for now)

This program takes Test.pdf and produces a Test.txt file.

Test.txt is something I can parse easily enough.

I am sure there is more work to be done.

But for now this is usable.

NOTE: I have a lot of debug code and it is slow in Debug mode, so I use the old legacy debugger when running in debug mode.

It takes a while in release mode too, but I am sure there are ways to speed this code up.

Enjoy.

BobVal

Robert Valentino · Apr 26, 2015

Changed &CRLF to &Chr(13) &Chr(10); otherwise only line feeds were being written out and the text did not display well in Notepad.

BobVal

Robert Valentino · Apr 27, 2015

In my main I call cPDF2Text.bas:

B4X:

    '------------------------------------------------------------------------------------------------------------------------
    '  Process a PDF file and return a Text string that can either be saved or parsed
    '------------------------------------------------------------------------------------------------------------------------
    Dim TextFile As String = cPDF2Text.ProcessPDFFile("/BOBs/PDF2Text/", "test.pdf")

    File.WriteString(File.DirRootExternal & "/BOBs/PDF2Text/", "test.txt", TextFile)

    Msgbox("File Converted", "File Converted")

Isac · Apr 27, 2015

Hi Robert can you post the complete sample to b4a?

Thanks

Robert Valentino · Apr 27, 2015

Complete project attached.

You will need to change where the PDF file is located and what it is called, and where you want the output written and what it is called.

In my case everything is in /BOBs/PDF2Text

Hope this helps

BobVal

Robert Valentino · Apr 28, 2015

Found that Tm (text matrix) and Td or TD (text positioning) all carry an x and y in the same way.

Modified the code to handle any of the three.
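A rough sketch of pulling the x and y out of any of the three operators, in Python rather than B4A. It relies on the PDF operand order (`a b c d e f Tm`, `tx ty Td`, `tx ty TD`), where x and y are the last two numbers before the operator. The names `POSITION_RE` and `text_positions` are made up for this example, and it assumes an already decompressed content stream held as a string.

```python
import re

# One or more numeric operands followed by a Tm, Td or TD operator
POSITION_RE = re.compile(
    r"((?:[-+]?(?:\d+\.?\d*|\.\d+)\s+)+)(Tm|Td|TD)(?![A-Za-z])"
)


def text_positions(content):
    """Return (operator, x, y) for each positioning operator found."""
    positions = []
    for match in POSITION_RE.finditer(content):
        operands = [float(n) for n in match.group(1).split()]
        op = match.group(2)
        needed = 6 if op == "Tm" else 2      # Tm takes six operands, Td/TD two
        if len(operands) >= needed:
            # In all three cases x and y are the last two operands
            positions.append((op, operands[-2], operands[-1]))
    return positions


# Made-up content stream for demonstration
stream = "BT\n1 0 0 1 72 700 Tm\n(Hello) Tj\n0 -12 Td\n(World) Tj\n72.5 640 TD\nET"
print(text_positions(stream))
```

Note that the sketch returns the raw operands: Tm sets the position outright, while Td and TD give an offset from the start of the current line, so a caller that needs absolute positions has to accumulate them.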

Also, I am now tracking the last line written out and comparing it to the current line I am about to output; if they are the same, I discard the current one.
It seems that when the text is highlighted, changes font, or any number of other things, there can be a BT / ET block with the same text 6 or 8 times.

I now suppress them.
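The suppression can be sketched like this (Python rather than B4A; `drop_repeats` is a made-up name, and the real code may compare lines differently):

```python
def drop_repeats(lines):
    """Keep a line only if it differs from the line just before it."""
    kept = []
    last = None
    for line in lines:
        if line != last:
            kept.append(line)
        last = line
    return kept


print(drop_repeats(["Total", "Total", "Total", "12.50", "Total"]))
```

Only back-to-back repeats are dropped, so the same text appearing later in the document is still kept.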

The attached PDF files look completely different and come from different websites. In either case PDF2Text will process them and give an output file that can be parsed for the data.

These are just two examples I am working on. If someone tries it on another file let me know how it does.

BobVal

UPDATE: 4/11/2017

I noticed that I could not convert some PDF files to text because they were using a different internal format.

So I added a flag to ProcessPDFFile called AlternateWay, which uses the alternate format when parsing PDFs. All the other PDFs use the normal way.

I haven't figured out yet how to determine on the fly which way to parse the file; when I do, I will upload an update. If one way doesn't work, try again with the AlternateWay flag.
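Until that detection exists, the "try one way, then the other" advice can be wrapped in a small helper. This is only a sketch in Python: `convert_with_fallback` and `fake_convert` are hypothetical names, and treating an empty result or an exception as "didn't work" is an assumption, not something cPDF2Text defines.

```python
def convert_with_fallback(convert, path):
    """convert is any callable taking (path, alternate_way) and returning text."""
    for alternate_way in (False, True):
        try:
            text = convert(path, alternate_way)
        except Exception:
            text = ""                  # treat a crash like "no text found"
        if text.strip():
            return text, alternate_way
    return "", None                    # neither way produced any text


# Stand-in converter for demonstration only; it pretends that one
# file can only be read with the alternate format.
def fake_convert(path, alternate_way):
    if path == "odd.pdf" and not alternate_way:
        raise ValueError("unrecognised layout")
    return "text from " + path


print(convert_with_fallback(fake_convert, "odd.pdf"))
```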

Robert Valentino · Apr 28, 2015

Cleaned up ALL the compiler warnings (should have done this before the last upload)

Sorry about the warnings - there are none now

geola · Sep 23, 2015

Hi Robert,

I'm converting some PDF files to text. With a new document, the following line in your code gives me an error:

B4X:

Dim DecompressedBytes()           As Byte = CompressDecompress.DecompressBytes(Data2Decompress, "zlib")

cpdf2text_decompression (B4A line: 109)
Dim DecompressedBytes() As Byte = Compr
java.io.IOException
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:178)
at java.io.InputStream.read(InputStream.java:163)
at anywheresoftware.b4a.objects.streams.File.Copy2(File.java:356)
at anywheresoftware.b4a.randomaccessfile.CompressedStreams.DecompressBytes(CompressedStreams.java:45)
at com.BOBs.PDF2Text.cpdf2text._decompression(cpdf2text.java:98)
at com.BOBs.PDF2Text.cpdf2text._processpdffile(cpdf2text.java:996)
at com.BOBs.PDF2Text.main._jobdone(main.java:493)
at java.lang.reflect.Method.invokeNative(Native Method)
at java.lang.reflect.Method.invoke(Method.java:511)
at anywheresoftware.b4a.BA.raiseEvent2(BA.java:187)
at anywheresoftware.b4a.keywords.Common$5.run(Common.java:981)
at android.os.Handler.handleCallback(Handler.java:725)
at android.os.Handler.dispatchMessage(Handler.java:92)
at android.os.Looper.loop(Looper.java:137)
at android.app.ActivityThread.main(ActivityThread.java:5328)
at java.lang.reflect.Method.invokeNative(Native Method)
at java.lang.reflect.Method.invoke(Method.java:511)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:1102)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:869)
at dalvik.system.NativeStart.main(Native Method)
Caused by: java.util.zip.DataFormatException: data error
at java.util.zip.Inflater.inflateImpl(Native Method)
at java.util.zip.Inflater.inflate(Inflater.java:228)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:159)
... 19 more

any idea or update here?

Thanks,
Michael

Robert Valentino · Sep 27, 2015

Michael:

Sorry not to get back to you sooner. Our family pet (dog) has gotten cancer and this has kept me running around from vet to vet for chemo treatments.

I ran the code on my machine (both in release and debug mode [NOTE: Debug mode is real slow - release mode pretty fast]) and have no problems.

I guess the next questions are what version of B4A are you running?
Library versions:
What version of Core (I have version 4.01)
What version of ByteConverter (I have version 1.10)
What version of RandomAccessFile (I have version 2.00 - this is where the zlib decompress is)
What version of Java (do not believe this is a java problem but could be - I just upgraded to 8.0 but was on 7.x when I wrote the code)

Let's make sure everything is in sync

BobVal
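The cause of the "data error" above was not established in this thread. As a general guard, the decompression step can be wrapped so that one bad stream does not abort the whole conversion. The sketch below is Python, not B4A, and uses Python's standard `zlib` module; `try_inflate` and `inflate_stream` are made-up names, and the idea that stray end-of-line bytes in front of the data might be the culprit is only a guess.

```python
import zlib


def try_inflate(data):
    """Return the decompressed bytes, or None if data is not valid zlib."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        return None


def inflate_stream(raw):
    # Try the bytes as given, then again without leading CR/LF bytes.
    # A zlib stream cannot start with 0x0D or 0x0A, so stripping them is safe.
    for candidate in (raw, raw.lstrip(b"\r\n")):
        result = try_inflate(candidate)
        if result is not None:
            return result
    return None                        # caller can skip or log this stream


packed = zlib.compress(b"BT (Hello) Tj ET")
print(inflate_stream(b"\r\n" + packed))
```

A stream that still returns None may simply use a filter other than zlib/Flate, in which case it needs a different decoder altogether.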

NeoTechni · Aug 29, 2016

Change line 85 in the bas file to:

Return Bit.InputStreamToBytes(File.OpenInput(FilePath, FileName))

That way it can work with the assetdir

NeoTechni · Aug 29, 2016

Is there a way to extract images too? Or maybe the X/Y locations of text? Or perhaps HTML tags like bold/underline/etc?

Robert Valentino · Apr 11, 2017

In answer to NeoTechni's question: I would say yes. But it is not something I need, so it is not likely I will do it.

MarcoRome · Jul 8, 2017

Hi Valentino. I have attached this file. When I try to convert it, I get the following error:

cpdf2text_v7 (java line: 247)
java.lang.ArrayIndexOutOfBoundsException: length=10; index=10
at com.BOBs.PDF2Text.cpdf2text._v7(cpdf2text.java:247)
at com.BOBs.PDF2Text.cpdf2text._vv1(cpdf2text.java:556)
at com.BOBs.PDF2Text.cpdf2text._vv2(cpdf2text.java:683)
at com.BOBs.PDF2Text.main._activity_create(main.java:334)
at java.lang.reflect.Method.invoke(Native Method)
at anywheresoftware.b4a.BA.raiseEvent2(BA.java:186)
at com.BOBs.PDF2Text.main.afterFirstLayout(main.java:102)
at com.BOBs.PDF2Text.main.access$000(main.java:17)
at com.BOBs.PDF2Text.main$WaitForLayout.run(main.java:80)
at android.os.Handler.handleCallback(Handler.java:751)
at android.os.Handler.dispatchMessage(Handler.java:95)
at android.os.Looper.loop(Looper.java:154)
at android.app.ActivityThread.main(ActivityThread.java:6692)
at java.lang.reflect.Method.invoke(Native Method)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:1468)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1358)
java.lang.ArrayIndexOutOfBoundsException: length=10; index=10

The code is:

B4X:

Dim TextFile As String = cPDF2Text.ProcessPDFFile("/Download/", "08072017_085148_977_BOB_STATEMENT.pdf")

File.WriteString(File.DirRootExternal & "/Download/", "test.txt", TextFile)

Msgbox("File Converted", "File Converted")

ExitApplication

Any suggestions?
Thank you
Marco

MarcoRome · Jul 8, 2017

If I change this line:

B4X:

NumberBytes = Array As Byte(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

to

B4X:

NumberBytes = Array As Byte(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

I get no error, but the resulting text file is empty (see the attached file).

Robert Valentino · Jul 8, 2017

Just got your posting.

Will look at this over the weekend. I currently have some workers at the house doing some alterations for me.

Robert Valentino · Jul 8, 2017

I used the alternate method (you are using an old version of cPDF2Text that does not have this method).

I have found things my program does not handle (I am not sure if they are errors).

I wrote this program to parse known PDF files that I needed to parse.
So my program parses based not on what is really right but on what it thinks it should be getting.

Let me explain.

My program looks for BT and ET commands to find blocks of data that need to be parsed.
In ALL the examples I have seen, it is BT followed by a CR (hex 0x0D) and ET followed by a CR.
In your example it is BT followed by a Line Feed (0x0A) and ET followed by a LF.

In all the examples I have seen there is a command Jc followed by a CR.
In YOUR example the Jc is uppercase JC and followed by a LF.

NOW I am not saying that there is anything wrong with your PDF; in fact it displays just fine.

But all the examples I have come across follow the rules in my program, so these things need to be handled to make the program work with your PDF.

I didn't read the Adobe manual, I just used brute force to pull the data out.
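One way to stop depending on which end-of-line byte follows BT and ET is to accept any PDF white-space byte around them. A sketch in Python (not B4A), assuming an already decompressed content stream; `WS`, `BLOCK_RE` and `text_blocks` are names invented for this example:

```python
import re

# The six white-space bytes defined by the PDF spec: NUL, TAB, LF, FF, CR, space
WS = rb"[\x00\t\n\x0c\r ]"

# BT ... ET, each delimited by any white-space byte (or the start/end of data)
BLOCK_RE = re.compile(
    rb"(?:^|(?<=" + WS + rb"))BT(?=" + WS + rb")"
    rb"(.*?)"
    rb"(?<=" + WS + rb")ET(?=" + WS + rb"|$)",
    re.DOTALL,
)


def text_blocks(content):
    """Return the bytes between each BT/ET pair, whatever the line ending."""
    return [match.group(1).strip() for match in BLOCK_RE.finditer(content)]


# The same block with CR, LF and CRLF line endings
cr = b"BT\r/F1 12 Tf\r(Hello) Tj\rET\r"
lf = b"BT\n/F1 12 Tf\n(Hello) Tj\nET\n"
crlf = b"q\r\nBT\r\n(Hi) Tj\r\nET\r\nQ"
for data in (cr, lf, crlf):
    print(text_blocks(data))
```

The sketch does not look inside string literals, so text that itself contains a stand-alone ET would end a block early; a full tokenizer would be needed to rule that out.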

MarcoRome · Jul 9, 2017

Thank you for your time

Here is the wrapper PdfToText

Robert Valentino · Jul 9, 2017

Love your wrapper. Wish I had it long ago.
