B4A Library Pdf To Text

Discussion in 'Additional libraries, classes and official updates' started by MarcoRome, Jul 9, 2017.

  1. MarcoRome

    MarcoRome Expert Licensed User

    Hi all.
    Pdf To Text
    This library converts pdf files to txt.
    I was looking for a library that could convert a pdf file to txt. Following a tip from @Johan Schoeman (thank you, dear), I am sharing this wrapper for itextpdf-5-5-6.jar ( https://sourceforge.net/projects/itext/ )


    pdftotext
    Author:
    DevilApp
    Version: 1
    • PdftToText
      Events:
      • onMessage (Success As String)
      Methods:
      • Initialize (EventName As String)
      • ParsePdf (filepdf As String, filetxt As String)


    You need the file itextpdf-5-5-6.jar and the pdftotext wrapper (attached).
    So you have 3 files:
    pdftotext.xml
    pdftotext.jar
    itextpdf-5-5-6.jar
    Copy all of these files to your libraries folder.

    Here is some example code (the same code is in the attachment):
    Code:
    Sub Activity_Create(FirstTime As Boolean)
        'Do not forget to load the layout file created with the visual designer. For example:
        'Activity.LoadLayout("Layout1")
        'Copy the sample PDF from the assets to external storage
        File.Copy(File.DirAssets, "test-armen.pdf", File.DirRootExternal, "test-armen.pdf")
        Dim filepdf As String = File.DirRootExternal & "/test-armen.pdf"
        Dim filetxt As String = File.DirRootExternal & "/test-armen.txt"

        Dim pdf As PdftToText
        pdf.Initialize("pdf")
        pdf.ParsePdf(filepdf, filetxt)
    End Sub

    Sub pdf_onMessage(Success As String)
        Log("Status conversion: " & Success)
    End Sub

    Last edited: Jul 9, 2017
    Watchkido1, Don Oso, stanks and 14 others like this.
  2. Johan Schoeman

    Johan Schoeman Expert Licensed User

    100+ !
     
    johndb and MarcoRome like this.
  3. johndb

    johndb Active Member Licensed User

    This is fantastic @MarcoRome. Thank you very much for your work! I hope I don't sound ungrateful but the iText library has many useful features related to PDF:
    • PDF generation
    • PDF manipulation (stamping, watermarks, merging/splitting PDFs, ...)
    • PDF form filling
    • XML functionality
    • Digital signatures
    Yes, I know that other developers have already published PDF libraries that partially include similar features, but the iText library/libraries appear to include many more functions. Would there be a possibility for these to be included in the B4X library in addition to the PDF-Text conversion? I know that this is a lot of work and I should start looking into "how" to create libraries myself. :confused:

    Thanks again for your much appreciated work :)
     
    Johan Schoeman likes this.
  4. MarcoRome

    MarcoRome Expert Licensed User

    This is the right way :).
    Anyway, if it is urgent, there are excellent wrapper masters in this community (for example @Johan Schoeman , @DonManfred ) whom you can contact. With a reasonable donation (depending on the wrapper you ask for) you can get excellent results. If they have time, they will certainly help.
     
    Johan Schoeman and johndb like this.
  5. Star-Dust

    Star-Dust Expert Licensed User

    Great Work ;)
     
    MarcoRome and Johan Schoeman like this.
  6. johndb

    johndb Active Member Licensed User

    You are absolutely right .... downloading Eclipse .... wish me luck!
     
    MarcoRome and Johan Schoeman like this.
  7. Star-Dust

    Star-Dust Expert Licensed User

    I cannot download itextpdf-5-5-6.jar
    The link leads me to download itext7-7.0.2, which does not contain the itextpdf-5-5-6.jar file
     
  8. MarcoRome

    MarcoRome Expert Licensed User

    In post #1, click on "itextpdf-5-5-6.jar" or go to https://sourceforge.net/projects/itext/files/?source=navbar
    There you will find the full list of jars (including 5-5-6).
     
    Johan Schoeman and Star-Dust like this.
  9. Robert Valentino

    Robert Valentino Well-Known Member Licensed User

    Found one problem - not with your interface but with the conversion of text.

    If you try this file: https://www.b4x.com/android/forum/attachments/test-armen-pdf.33756/
    you will see that lines in BOLD are repeated multiple times. In the above example, "Organization League" appears 9 times and "Team Standings" appears 9 times, each on multiple lines.

    This is something to watch out for when processing the text.
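
    One possible way to handle this after the conversion, as a minimal sketch in plain Java: collapse each run of identical consecutive lines into a single line. The class and method names below are hypothetical and the sample lines are made up for illustration; none of this is part of the pdftotext wrapper or of iText. Keep in mind that it would also drop lines that are legitimately repeated one after another in the PDF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the pdftotext wrapper or iText.
public class DuplicateLineFilter {

    /**
     * Returns a copy of the given lines in which every run of identical
     * consecutive non-blank lines (compared after trimming) is reduced
     * to its first line. Blank lines are always kept.
     */
    public static List<String> collapseConsecutiveDuplicates(List<String> lines) {
        List<String> result = new ArrayList<>();
        String previousKey = null;
        for (String line : lines) {
            String key = line.trim();
            boolean repeated = !key.isEmpty() && key.equals(previousKey);
            if (!repeated) {
                result.add(line);
            }
            previousKey = key;
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample resembling the repeated bold headings described above.
        List<String> extracted = Arrays.asList(
            "Organization League", "Organization League", "Organization League",
            "Team Standings", "Team Standings",
            "Alpha 3 1");
        for (String line : collapseConsecutiveDuplicates(extracted)) {
            System.out.println(line);
        }
    }
}
```

    The lines of the generated txt file could be read into a list, passed through this method, and written back out before any further parsing.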


    ALSO: Notice at the GitHub site that you may need to buy a license.

    iText is licensed as AGPL software.


    AGPL is a free / open source software license.


    This doesn't mean the software is gratis!


    Buying a license is mandatory as soon as you develop commercial activities distributing the iText software inside your product or deploying it on a network without disclosing the source code of your own applications under the AGPL license. These activities include:


    • offering paid services to customers as an ASP
    • serving PDFs on the fly in the cloud or in a web application
    • shipping iText with a closed source product

    Contact sales for more info: http://itextpdf.com/sales



    Does anyone know what the license might cost?
     
    Last edited: Jul 9, 2017
  10. Johan Schoeman

    Johan Schoeman Expert Licensed User

    Star-Dust likes this.
  11. Johan Schoeman

    Johan Schoeman Expert Licensed User

    Hi Roberto

    Browse the web to see if you can find solutions for the problems that you mentioned. It is very simple to make use of this JAR via inline Java code...

    @MarcoRome 's project is based on this:
    Code:
    #Region  Project Attributes
        #ApplicationLabel: b4aReadPDF
        #VersionCode: 1
        #VersionName:
        'SupportedOrientations possible values: unspecified, landscape or portrait.
        #SupportedOrientations: unspecified
        #CanInstallToExternalStorage: False
    #End Region

    #AdditionalJar: itextpdf-5.5.6

    #Region  Activity Attributes
        #FullScreen: False
        #IncludeTitle: True
    #End Region

    Sub Process_Globals
        'These global variables will be declared once when the application starts.
        'These variables can be accessed from all modules.
        Dim nativeMe As JavaObject
    End Sub

    Sub Globals
        'These global variables will be redeclared each time the activity is created.
        'These variables can only be accessed from this module.
    End Sub

    Sub Activity_Create(FirstTime As Boolean)
        'Do not forget to load the layout file created with the visual designer. For example:
        'Activity.LoadLayout("Layout1")
        Log(File.DirAssets)
        Log(File.DirRootExternal)

        'Run the inline Java method parsePdf (defined below), which takes no arguments
        nativeMe.InitializeContext
        nativeMe.RunMethod("parsePdf", Null)
    End Sub

    Sub Activity_Resume

    End Sub

    Sub Activity_Pause (UserClosed As Boolean)

    End Sub

    #If Java

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.PrintWriter;

    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
    import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.TextExtractionStrategy;


    /** The original PDF that will be parsed. */
    public static final String PREFACE = "/storage/emulated/0/preface.pdf";
    /** The resulting text file. */
    public static final String RESULT = "/storage/emulated/0/preface.txt";


        public void parsePdf() throws IOException {
                String pdf = PREFACE;
                String txt = RESULT;
                PdfReader reader = new PdfReader(pdf);
                PdfReaderContentParser parser = new PdfReaderContentParser(reader);
                PrintWriter out = new PrintWriter(new FileOutputStream(txt));
                TextExtractionStrategy strategy;
                BA.Log("number of pages = " + reader.getNumberOfPages());
                for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                    strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
                    out.println(strategy.getResultantText());
                }
                reader.close();
                out.flush();
                out.close();
        }




    #End If
    The other changes needed to turn it into a library you will have to make on your own, as @MarcoRome did.
     
    Last edited: Jul 9, 2017
    MarcoRome likes this.
  12. Robert Valentino

    Robert Valentino Well-Known Member Licensed User

    Thanks for the attachment.

    For the last few hours I have been trying a lot of online converters, and I downloaded and installed some as well.

    Most do something wrong when converting the PDF.
    Some output the bold line multiple times. Others output it once but string multiple lines together on one line, which makes parsing harder.

    I'll keep looking and will work on making the one I am using (cPDF2Text) more data friendly

    I am not making enough money from my app to allow me to pay for licensing other products - maybe someday LOL
     
    MarcoRome and Johan Schoeman like this.
  13. MarcoRome

    MarcoRome Expert Licensed User

    No, sorry, I don't know.
     
  14. MarcoRome

    MarcoRome Expert Licensed User

    In #1 you also have the source, so you can modify it as you want. If you do modify it, don't forget to share the new wrapper with the whole community.
     
  15. DonManfred

    DonManfred Expert Licensed User

    Nice one ;-)
     
    Johan Schoeman likes this.
  16. Mashiane

    Mashiane Expert Licensed User

    This is just perfect, thanks guys.
     
    Johan Schoeman likes this.
  17. Robert Valentino

    Robert Valentino Well-Known Member Licensed User

    Always - but I'm not doing much coding right now. It's summer, with too many golf rounds to play. In the fall I will start coding again.
     
  18. Robert Valentino

    Robert Valentino Well-Known Member Licensed User

    MarcoRome, how do I give credit to you (for your wrapper) and this library?

    BobVal
     
  19. Robert Valentino

    Robert Valentino Well-Known Member Licensed User

    I was having some problems with not all the text being processed, so I searched for a newer library and found:
    https://mvnrepository.com/artifact/com.itextpdf/itextpdf/5.5.13

    This newer library fixed the problems I was having. Might help someone here.

    The problem I was having was that I was NOT always getting headers for certain columns.
     