B4J Question Image extraction from a PDF file.

rspitzer

Active Member
A PDF file can contain both text and images. Could someone point me to information on extracting just the images from a PDF file, individually (the file may contain more than one image)? I am not interested in text extraction. I am using B4J, in case that matters.
 

xulihang

Active Member
Licensed User
Longtime User
You can use PDFBox, which has a command-line tool. It can render PDF pages to images as well as extract the embedded images. Use the jShell library to run it from B4J.
 

rspitzer

Active Member
Thank you for the quick response, I appreciate it. Unfortunately, I was looking for a more programmatic approach. The PDFs in this process may contain as many as 100 images, which need to be extracted in real time and then processed individually, so a command-line utility is not going to work in this case. Since I am not that familiar with the PDF format, I may post this to the job forum and hire someone from these forums to do it as a project. I first need to produce a more detailed requirements sheet. Thank you again for the good suggestion.
 

inakigarm

Well-Known Member
Licensed User
Longtime User
Maybe you can achieve this with Inline Java (not tested):
https://stackoverflow.com/questions...ed-images-from-a-single-pdf-page-using-pdfbox
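A rough, untested sketch of what the Inline Java route might look like. It assumes a PDFBox 2.x app jar is referenced with #AdditionalJar (PDFBox 3.x loads documents differently), that the JavaObject library is enabled, and that the code sits in the Main module so the Java method can be static. The names ExtractPdfImages, extractImages and the "img-N.png" output naming are made up for illustration. It only looks at images referenced directly by each page, not images nested inside form XObjects.

```b4x
' In the Main module's Project Attributes (placeholder jar name, adjust to yours):
' #AdditionalJar: pdfbox-app-2.y.z

' Hypothetical wrapper: returns the number of images written to outDir.
Sub ExtractPdfImages(pdfPath As String, outDir As String) As Int
    Dim nativeMe As JavaObject = Me
    Return nativeMe.RunMethod("extractImages", Array(pdfPath, outDir))
End Sub

#If JAVA
import java.io.File;
import javax.imageio.ImageIO;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

// Walks every page and saves each directly referenced image XObject as a PNG.
public static int extractImages(String pdfPath, String outDir) throws Exception {
    int count = 0;
    try (PDDocument doc = PDDocument.load(new File(pdfPath))) {
        for (PDPage page : doc.getPages()) {
            PDResources res = page.getResources();
            if (res == null) continue;
            for (COSName name : res.getXObjectNames()) {
                PDXObject xo = res.getXObject(name);
                if (xo instanceof PDImageXObject) {
                    count++;
                    File out = new File(outDir, "img-" + count + ".png");
                    ImageIO.write(((PDImageXObject) xo).getImage(), "png", out);
                }
            }
        }
    }
    return count;
}
#End If
```

Writing to files keeps the sketch short. The BufferedImage returned by getImage() could in principle be handed back to B4J instead, but that conversion is not shown here.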
 

Quandalle

Member
Licensed User
I did a lot of experimentation with the two main Java libraries that handle PDF, PDFBox and iText, to extract text and images from a PDF. The implementation is often quite complex if you want to avoid intermediate files and process the content incrementally. In the end I fell back on an external program launched from the B4J application, which then uses the extracted files. It's almost as fast, and above all much easier to debug and maintain.
There are various command-line utilities that can extract images from a PDF: the PDFBox command-line app, pdfimages from the XpdfReader suite, or the poppler-utils variant of pdfimages. Each of these extractors has advantages and disadvantages depending on what you need.
Example of using the PDFBox extraction from B4J:
PDFBox command-line syntax:
java -jar pdfbox-app-2.y.z.jar ExtractImages [OPTIONS] <inputfile>
We can wrap the call to this command line in a B4J function that extracts the images.
Example of a function calling PDFBox from the command line (requires the jShell library):
' Runs the PDFBox ExtractImages command on fileName.
' Returns 0 on success, 1 on failure.
Sub pdfToImages(fileName As String) As ResumableSub
    Dim shl As Shell
    ' "-jar" must precede the jar name; "java.exe" assumes Windows with Java on the PATH
    shl.Initialize("shl", "java.exe", Array As String("-jar", "pdfbox-app-2.y.z.jar", "ExtractImages", fileName))
    shl.WorkingDirectory = File.DirApp
    shl.Run(-1) ' -1 = no timeout
    Wait For (shl) shl_ProcessCompleted (Success As Boolean, ExitCode As Int, StdOut As String, StdErr As String)
    If Success And ExitCode = 0 Then
        Return 0
    Else
        Log("Error: " & StdErr)
        Return 1
    End If
End Sub
Finally, the call to this function looks something like this:
B4X:
Dim rs As ResumableSub = pdfToImages("test.pdf")
Wait For (rs) Complete (ret As Int)
If ret <> 0 Then
    Log("Error extracting images")
Else
    ' ...process extracted image files (test-xxx.jpg)
End If
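The "process extracted image files" step could start by collecting the files the extractor produced. A minimal sketch, assuming the images end up in the same folder as the PDF with the PDF's base name as prefix (as the test-xxx.jpg pattern above suggests). The helper name ListExtractedImages and the list of extensions are illustrative, so check what your PDFBox version actually writes.

```b4x
' Hypothetical helper: returns the file names in dir that start with prefix
' and look like image files written by the extractor.
Sub ListExtractedImages(dir As String, prefix As String) As List
    Dim result As List
    result.Initialize
    Dim lowerPrefix As String = prefix.ToLowerCase
    For Each name As String In File.ListFiles(dir)
        Dim lower As String = name.ToLowerCase
        ' Assumed output extensions; adjust to match the real output
        Dim isImage As Boolean = lower.EndsWith(".jpg") Or lower.EndsWith(".png") Or lower.EndsWith(".tiff")
        If lower.StartsWith(lowerPrefix) And isImage Then
            result.Add(name)
        End If
    Next
    Return result
End Sub
```

It would be called in the Else branch above, for example:

```b4x
For Each imgName As String In ListExtractedImages(File.DirApp, "test-")
    Log("Extracted: " & imgName)
Next
```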
 

emexes

Expert
Licensed User
rspitzer said: "The images need to be extracted in real time"
Realtime? PDF images? Or do you mean: as fast as possible?

Are the PDFs all coming from a single source? Are the images all stored in the same format? Are they generally the same size, as in all being scans of paper pages, or captures from a camera?
 

rspitzer

Active Member
"Real time" is a relative term, sorry about that. What I mean is that when a PDF is picked in the program, the extraction takes place and the PDF is processed. A user will be picking the PDFs from a server or workstation directory.
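As a rough sketch of that flow, assuming a B4J UI app with a main form named MainForm, a button named btnPickPdf (both hypothetical), and the Shell-based pdfToImages Sub posted earlier in this thread:

```b4x
' Sketch only: lets the user pick a PDF, then runs the extraction on it.
Sub btnPickPdf_Click
    Dim fc As FileChooser
    fc.Initialize
    fc.SetExtensionFilter("PDF files", Array As String("*.pdf"))
    Dim path As String = fc.ShowOpen(MainForm)
    If path = "" Then Return ' user cancelled the dialog
    Wait For (pdfToImages(path)) Complete (ret As Int)
    If ret = 0 Then
        Log("Images extracted for " & path)
    Else
        Log("Extraction failed for " & path)
    End If
End Sub
```

Because pdfToImages is a ResumableSub, the UI stays responsive while the external process runs.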
 