Android Code Snippet PDF2Text

Discussion in 'Code Snippets' started by Robert Valentino, Apr 26, 2015.

  1. Robert Valentino

    Robert Valentino Well-Known Member Licensed User

    I found this article on CodeProject : http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file

    About reading a PDF and converting it to text.

    I have done my best to translate the code into B4A Basic (see attached file).

    The program will read the first page of the PDF and output the text (not sure why it does not process the other pages).

    Maybe someone with a stronger knowledge of PDFs can figure out why it does not process the additional pages.

    NOW just an FYI - I am not the strongest Basic programmer (I am sure there are better ways to make this program work) but it was more a proof of concept.

    Anyone want to jump in and help?

    BobVal
     

    Attached Files:

  2. Robert Valentino

    Robert Valentino Well-Known Member Licensed User

    The user that wrote the code I converted from CodeProject assumed that every time there was a BT (begin text) that it should start on a New Line.

    Reading the PDF doc a little (http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf) I realized that there is a positioning matrix after the BT.

    I now process this matrix and if the Ty (matrix parm F) is not the same as the last one I put the text on a New Line otherwise I put a TAB in between the fields.
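
    A minimal sketch of that line-breaking rule, written in plain Java rather than B4A. The class and method names (TextLineJoiner, Fragment, join) are made up for illustration and are not taken from the attached cPDF2Text code; it assumes each text fragment has already been paired with the Ty value that positioned it.

```java
import java.util.Arrays;
import java.util.List;

final class TextLineJoiner {
    /** One piece of text plus the vertical position (Ty) it was placed at. Hypothetical type. */
    static final class Fragment {
        final double ty;
        final String text;
        Fragment(double ty, String text) {
            this.ty = ty;
            this.text = text;
        }
    }

    /** Same Ty as the previous fragment: separate with a TAB. Different Ty: start a new line. */
    static String join(List<Fragment> fragments) {
        StringBuilder out = new StringBuilder();
        Double lastTy = null;
        for (Fragment f : fragments) {
            if (lastTy != null) {
                out.append(f.ty == lastTy ? '\t' : '\n');
            }
            out.append(f.text);
            lastTy = f.ty;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList(
                new Fragment(700, "Name"), new Fragment(700, "Amount"),
                new Fragment(686, "Bob"), new Fragment(686, "10"))));
    }
}
```

    The comparison here is exact; a real document may need a small tolerance if fragments on the same visual line have slightly different Ty values.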

    Now my data looks like what I used to get when I copied and pasted from the PDF into a text file, so my parser can handle it.

    Now if only I could figure out what is wrong with getting to the other pages.

    To be continued.....

    BobVal
     

    Attached Files:

  3. Robert Valentino

    Robert Valentino Well-Known Member Licensed User

    Figured it ALL out (or at least for now)

    This program will take Test.pdf and produce a Test.txt file.

    Test.txt is something I can parse easily enough.

    I am sure there is more work to be done.

    But for now this is useable.

    NOTE: I have a lot of debug code and it is slow in Debug mode. So I use the old Legacy debugger when using in debug mode.

    Takes a while in release mode but I am sure there will be ways to speed this code up.

    Enjoy.

    BobVal
     

    Attached Files:

  4. Robert Valentino

    Robert Valentino Well-Known Member Licensed User

    Changed &CRLF to &Chr(13) &Chr(10) otherwise only Line Feeds were being written out and the text did not display well in Notepad.
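
    The same line-ending fix, sketched in plain Java for anyone post-processing the output outside B4A. LineEndings and toWindows are hypothetical names; the sketch assumes the text may already contain a mix of LF and CR+LF, so it normalises first to avoid doubling the CR.

```java
final class LineEndings {
    /** Converts every line break to CR+LF (Chr(13) & Chr(10)) so Notepad shows separate lines. */
    static String toWindows(String text) {
        // Collapse existing CR+LF to LF first, then expand every LF to CR+LF.
        return text.replace("\r\n", "\n").replace("\n", "\r\n");
    }

    public static void main(String[] args) {
        String fixed = toWindows("line one\nline two\r\nline three");
        System.out.println(fixed.length());
    }
}
```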

    BobVal
     

    Attached Files:

  5. Robert Valentino

    Robert Valentino Well-Known Member Licensed User

    in my main I called the cPDF2Text.bas

    Code:
    '------------------------------------------------------------------------------------------------------------------------
    '  Process a PDF file and return a Text string that can either be saved or parsed
    '------------------------------------------------------------------------------------------------------------------------
    Dim TextFile As String = cPDF2Text.ProcessPDFFile("/BOBs/PDF2Text/", "test.pdf")

    File.WriteString(File.DirRootExternal & "/BOBs/PDF2Text/", "test.txt", TextFile)

    Msgbox("File Converted", "File Converted")
     
  6. Isac

    Isac Active Member Licensed User

    Hi Robert can you post the complete sample to b4a?

    Thanks
     
  7. Robert Valentino

    Robert Valentino Well-Known Member Licensed User

    Complete project attached.

    You will need to change where the PDF file is located and what it is called, and where you want the output written and what it is called.

    In my case everything is in /BOBs/PDF2Text

    Hope this helps

    BobVal
     

    Attached Files:

  8. Robert Valentino

    Robert Valentino Well-Known Member Licensed User

    Found that the Tm (text matrix) and Td or TD (text positioning) have the x and y the same.

    Modified the code to handle any of the three.
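
    A rough sketch of pulling x and y from any of the three operators, in plain Java. It assumes each operator sits on its own line with whitespace-separated operands ("a b c d e f Tm", "tx ty Td", "tx ty TD"), which is an assumption about the input and not something the PDF format guarantees. TextPosition and parse are hypothetical names, not part of cPDF2Text. Note that in the PDF spec Td and TD give offsets relative to the start of the current line while Tm sets an absolute matrix, so treating them alike is a simplification.

```java
final class TextPosition {
    /** Returns {x, y} from a Tm, Td or TD line, or null if the line is not one of those. */
    static double[] parse(String line) {
        String[] tok = line.trim().split("\\s+");
        if (tok.length < 3) {
            return null;
        }
        String op = tok[tok.length - 1];
        int operands = tok.length - 1;
        boolean matrix = op.equals("Tm") && operands == 6;
        boolean move = (op.equals("Td") || op.equals("TD")) && operands == 2;
        if (!matrix && !move) {
            return null;
        }
        try {
            // In all three operators x and y are the last two operands.
            double x = Double.parseDouble(tok[tok.length - 3]);
            double y = Double.parseDouble(tok[tok.length - 2]);
            return new double[] {x, y};
        } catch (NumberFormatException e) {
            return null; // operands were not plain numbers
        }
    }

    public static void main(String[] args) {
        double[] p = parse("1 0 0 1 72 700 Tm");
        System.out.println(p[0] + ", " + p[1]);
    }
}
```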

    Also I am now tracking the last line output and comparing it to the current line I am about to output (if they are the same then I discard it).
    It seems that when the text is highlighted, or the font changes, or any number of other things, there can be a BT / ET with the same text 6 or 8 times.

    I now suppress them.
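
    The suppression idea can be sketched like this in plain Java (DuplicateLineFilter and dropConsecutiveRepeats are illustrative names, not the names used in cPDF2Text). Only a line identical to the one immediately before it is dropped; the same text appearing again later in the document is kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DuplicateLineFilter {
    /** Keeps a line only if it differs from the line just before it. */
    static List<String> dropConsecutiveRepeats(List<String> lines) {
        List<String> out = new ArrayList<>();
        String last = null;
        for (String line : lines) {
            if (!line.equals(last)) {
                out.add(line);
            }
            last = line;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(dropConsecutiveRepeats(
                Arrays.asList("Total", "Total", "Total", "42.00", "Total")));
    }
}
```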

    The attached PDF files look completely different and come from different websites. In either case PDF2Text will process them and give an output file that can be parsed for the data.

    These are just two examples I am working on. If someone tries it on another file let me know how it does.

    BobVal

    UPDATE: 4/11/2017

    I noticed that I could not convert some PDF files to text because they were using a different internal format.

    So I added a flag to ProcessPDFFile called AlternateWay - this will use the alternate format when parsing PDFs. All the PDFs use the normal way.

    I haven't figured out yet how to determine on the fly which way to parse the file; when I do I will upload an update. If one way doesn't work, try again with the AlternateWay flag.
     

    Attached Files:

    Last edited: Apr 11, 2017
  9. Robert Valentino

    Robert Valentino Well-Known Member Licensed User

    Cleaned up ALL the compiler warnings (should have done this before the last upload)

    Sorry about the warnings - there are none now
     

    Attached Files:

  10. geola

    geola Member Licensed User

    Hi Robert,

    I'm working on converting some PDF files to text. Now, with a new document, the following line in your code gives me an error:


    Code:
    Dim DecompressedBytes()           As Byte = CompressDecompress.DecompressBytes(Data2Decompress, "zlib")
    cpdf2text_decompression (B4A line: 109)
    Dim DecompressedBytes() As Byte = Compr
    java.io.IOException
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:178)
    at java.io.InputStream.read(InputStream.java:163)
    at anywheresoftware.b4a.objects.streams.File.Copy2(File.java:356)
    at anywheresoftware.b4a.randomaccessfile.CompressedStreams.DecompressBytes(CompressedStreams.java:45)
    at com.BOBs.PDF2Text.cpdf2text._decompression(cpdf2text.java:98)
    at com.BOBs.PDF2Text.cpdf2text._processpdffile(cpdf2text.java:996)
    at com.BOBs.PDF2Text.main._jobdone(main.java:493)
    at java.lang.reflect.Method.invokeNative(Native Method)
    at java.lang.reflect.Method.invoke(Method.java:511)
    at anywheresoftware.b4a.BA.raiseEvent2(BA.java:187)
    at anywheresoftware.b4a.keywords.Common$5.run(Common.java:981)
    at android.os.Handler.handleCallback(Handler.java:725)
    at android.os.Handler.dispatchMessage(Handler.java:92)
    at android.os.Looper.loop(Looper.java:137)
    at android.app.ActivityThread.main(ActivityThread.java:5328)
    at java.lang.reflect.Method.invokeNative(Native Method)
    at java.lang.reflect.Method.invoke(Method.java:511)
    at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:1102)
    at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:869)
    at dalvik.system.NativeStart.main(Native Method)
    Caused by: java.util.zip.DataFormatException: data error
    at java.util.zip.Inflater.inflateImpl(Native Method)
    at java.util.zip.Inflater.inflate(Inflater.java:228)
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:159)
    ... 19 more

    any idea or update here?

    Thanks,
    Michael
     
  11. Robert Valentino

    Robert Valentino Well-Known Member Licensed User

    Michael:

    Sorry not to get back to you sooner. Our family pet (dog) has gotten cancer and this has kept me running around from vet to vet for chemo treatments.

    I ran the code on my machine (both in release and debug mode [NOTE: Debug mode is real slow - release mode pretty fast]) and have no problems.

    I guess the next questions are what version of B4A are you running?
    Library versions:
    What version of Core (I have version 4.01)
    What version of ByteConverter (I have version 1.10)
    What version of RandomAccessFile (I have version 2.00 - this is where the zlib decompress is)
    What version of Java (do not believe this is a java problem but could be - I just upgraded to 8.0 but was on 7.x when I wrote the code)

    Let's make sure everything is in sync
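
    For narrowing this down: the trace ends in java.util.zip.DataFormatException ("data error"), which means the inflater was handed bytes it does not accept as a valid zlib stream. Below is a small plain-Java sketch for checking a byte slice on its own. ZlibCheck, inflateOrNull and deflate are hypothetical helper names, and this uses java.util.zip directly rather than the RandomAccessFile library's CompressedStreams. Why the slice is invalid in this particular PDF is not known; a stream that is not zlib-compressed, or a slice that includes extra bytes, are only guesses.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class ZlibCheck {
    /** Inflates zlib data; returns null when the bytes are not a complete, valid zlib stream. */
    static byte[] inflateOrNull(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return null; // truncated stream, or a preset dictionary is required
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            return null; // same failure as the "data error" in the trace
        } finally {
            inflater.end();
        }
    }

    /** Compresses data with zlib; only here so the sketch has something valid to inflate. */
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] packed = deflate("BT (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII));
        System.out.println(inflateOrNull(packed) != null);
        System.out.println(inflateOrNull(new byte[] {1, 2, 3, 4}) != null);
    }
}
```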

    BobVal
     
  12. NeoTechni

    NeoTechni Well-Known Member Licensed User

    Change line 85 in the bas file to:

    Return Bit.InputStreamToBytes(File.OpenInput(FilePath, FileName))

    That way it can work with the assetdir
     
  13. NeoTechni

    NeoTechni Well-Known Member Licensed User

    Is there a way to extract images too? Or maybe the X/Y locations of text? Or perhaps HTML tags like bold/underline/etc?
     
    Last edited: Aug 29, 2016
  14. Robert Valentino

    Robert Valentino Well-Known Member Licensed User

    In answer to NeoTechni's question: I would say yes. But it is not something I need, so it is not likely I will do it.
     
  15. MarcoRome

    MarcoRome Expert Licensed User

    Hi Valentino. I have attached this file. When I try to convert it, I get the following error:

    The code is:
    Code:
    Dim TextFile As String = cPDF2Text.ProcessPDFFile("/Download/", "08072017_085148_977_BOB_STATEMENT.pdf")

    File.WriteString(File.DirRootExternal & "/Download/", "test.txt", TextFile)

    Msgbox("File Converted", "File Converted")

    ExitApplication
    Any suggestions?
    Thank you
    Marco
     
    Last edited: Jul 9, 2017
  16. MarcoRome

    MarcoRome Expert Licensed User

    If I change this line:

    Code:
    NumberBytes = Array As Byte(0000000000)
    to

    Code:
    NumberBytes = Array As Byte(0000000000000000,0000000000000000,00000000000000000000000000000000)
    I get no error, but the resulting txt is empty, as in the attached file.
     

    Attached Files:

    • test.txt (93 bytes)
  17. Robert Valentino

    Robert Valentino Well-Known Member Licensed User

    Just got your Posting.

    Will look at this over the weekend. Currently have some workers over at the house doing some alterations for me.
     
  18. Robert Valentino

    Robert Valentino Well-Known Member Licensed User

    I tried it using the Alternate method (you are using an old version of cPDF2Text that does not have this method).

    I have found things (not sure if they are errors) that my program does not handle.

    I wrote this program to parse known PDF files that I needed to parse.
    So my program parses not based on what is really right but on what it thinks it should be getting.

    Let me explain.

    My program is looking for BT and ET commands to find blocks of data that need to be parsed.
    IN ALL the examples I have seen it is BT followed by a CR (hex 0x0D) and ET followed by a CR
    IN your example it is BT followed by a Line Feed (0x0A) and ET followed by a LF

    In all the examples I have seen there is a command Jc followed by a CR
    In YOUR example the Jc is uppercase JC and followed by a LF

    NOW I am not saying that there is anything wrong with your PDF; in fact it displays just fine.

    But all the examples I have come across follow the rules in my program so these things need to be handled to make the program work with your PDF
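
    One way to accept both styles is to treat CR, LF, or CR+LF as the same line break when looking for BT and ET. The sketch below is plain Java with hypothetical names (TextBlocks, find); it assumes the decompressed content stream has already been turned into a String and that BT and ET each sit on their own line. It is not how cPDF2Text does its scanning, and a BT immediately followed by ET with nothing in between is not matched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TextBlocks {
    // BT on its own line, then the block body, then ET on its own line.
    // Each line break may be CR+LF, a lone CR, or a lone LF.
    private static final Pattern BLOCK = Pattern.compile(
            "(?s)(?:^|[\\r\\n])BT(?:\\r\\n|\\r|\\n)(.*?)(?:\\r\\n|\\r|\\n)ET(?=[\\r\\n]|$)");

    /** Returns the body of every BT ... ET block, without the BT and ET lines themselves. */
    static List<String> find(String contentStream) {
        List<String> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(contentStream);
        while (m.find()) {
            blocks.add(m.group(1));
        }
        return blocks;
    }

    public static void main(String[] args) {
        System.out.println(find("BT\r(Hello) Tj\rET\r").size());
        System.out.println(find("BT\n(Hello) Tj\nET\n").size());
    }
}
```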

    I didn't read the Adobe manual, I just used brute force to pull the data out.
     
  19. MarcoRome

    MarcoRome Expert Licensed User

    Thank you for your time :)
    HERE the wrapper PdfToText
     
  20. Robert Valentino

    Robert Valentino Well-Known Member Licensed User

    Love your wrapper. Wish I had it long ago.
     