Android Question Extract text from PDF file

enemotrop

Member
Licensed User
Longtime User
Hello,

I'm trying to retrieve all the text inside a PDF file and store it in a variable so I can parse it. After an unsuccessful search for a library, I'm trying to code it myself... with only partial results.

Below is the routine I'm using to extract the text. At the moment it only recovers plain text; formatted text remains unreadable.

B4X:
Sub Activity_Create(FirstTime As Boolean)
  Dim textpdf As String
   textpdf = ExtractTextFromPDF("1.PDF", "iso-8859-15")
   Msgbox(textpdf,"")
End Sub

Sub ExtractTextFromPDF(FileName As String, Charset As String ) As String
    Try
        'Open pdf file
        ProgressDialogShow2("Parsing PDF. Please wait...", False)
        DoEvents
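        Dim compress As CompressedStreams
        Dim pdf_length As Int
        pdf_length = File.Size(File.DirRootExternal,FileName)
        Dim bPDF(pdf_length) As Byte    'Byte array that stores the whole PDF file
        Dim In As InputStream
        In = File.OpenInput(File.DirRootExternal,FileName)
        'ReadBytes may return fewer bytes than requested, so loop until the buffer is full
        Dim total As Int, count As Int
        total = 0
        Do While total < bPDF.Length
            count = In.ReadBytes(bPDF, total, bPDF.Length - total)
            If count <= 0 Then Exit
            total = total + count
        Loop
        In.Close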
        Dim bc As ByteConverter
        'We search all stream containing objects inside the PDF file
        Dim lstBytStream As List, bytBuffer() As Byte 
        Dim lstStrDeco As List
        lstBytStream.Initialize
        lstStrDeco.Initialize
        Dim seekEndstream As Int, seekStream As Int
        Dim pos1, pos2 As Int
        pos1=0
        pos2=0
        Dim whole_deco As String
        whole_deco = ""
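        For a = pos1+8 To bPDF.length-3    'Stop early enough that bPDF(a+2) stays inside the array
            'We search for "stream" (not "endstream") followed, directly or after a one-byte
            'line break, by "x" (0x78), the usual first byte of zlib-compressed data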
            If bPDF(a) = 109 AND bPDF(a-1) = 97 AND bPDF(a-2) = 101 AND bPDF(a-3) = 114 AND _
                bPDF(a-4) = 116 AND bPDF(a-5) = 115 AND bPDF(a-6) <> 100 AND (bPDF(a+2) = 120 OR bPDF(a+1) = 120) Then
                If bPDF(a+2) = 120 Then
                    pos1 = a+2
                Else
                    pos1 = a+1
                End If
                'We search for "endstream"
                For b = pos1 + 6 To bPDF.length - 1
                    If bPDF(b) = 109 AND bPDF(b-1) = 97 AND bPDF(b-2) = 101 AND bPDF(b-3) = 114 AND _
                        bPDF(b-4) = 116 AND bPDF(b-5) = 115 AND bPDF(b-6) = 100 AND bPDF(b-7) = 110 AND _
                        bPDF(b-8) = 101 Then
                        pos2 = b-9
                        Dim select_length As Int
                        select_length = pos2 - pos1
                        Dim bPDF_stream(select_length) As Byte
                        For c = 0 To select_length - 1
                            bPDF_stream(c) = bPDF(pos1 + c)
                        Next
                        Log("bPDF_stream size: " & bPDF_stream.length)
                        Dim deco() As Byte
                        deco = compress.DecompressBytes(bPDF_stream, "zlib")
                        Dim strDeco As String, strDeco2 As String
                        strDeco = BytesToString(deco,0, deco.length, charset)
                        'We unescape and discard all text outside parentheses
                        Dim delete As Boolean
                        delete = True
                        For c = 1 To strDeco.length - 2
                            Log("c:" & c & " length:" & strDeco.Length)
                            If strDeco.CharAt(c) = ")" AND strDeco.CharAt(c - 1) <> "\" Then
                                delete = True
                            Else If strDeco.CharAt(c) = "(" AND strDeco.CharAt(c - 1) <> "\" Then
                                delete = False
                            Else
                                If delete = False Then
                                    Dim nextpar As Int
                                    nextpar = strDeco.IndexOf2(")",c)
                                    'Skip escaped closing parentheses "\)"; stop if none is left
                                    Do While nextpar > 0 AND strDeco.CharAt(nextpar-1) = "\"
                                        nextpar = strDeco.IndexOf2(")",nextpar+1)
                                    Loop
                                    If nextpar <> -1 Then
                                        strDeco2 = strDeco2 & strDeco.SubString2(c,nextpar)
                                        c = nextpar - 1
                                        Log("nextpar:" & nextpar)
                                    Else
                                        'No closing parenthesis left: stop scanning this stream
                                        Exit
                                    End If
                                End If
                            End If           
                            DoEvents
                        Next   
                        If strDeco2.Length > 0 Then
                            strDeco2 = strDeco2.Replace("\(", "<<open_par>>")
                            strDeco2 = strDeco2.Replace("\)", "<<close_par>>")
                            strDeco2 = strDeco2.Replace("(", "")
                            strDeco2 = strDeco2.Replace(")", "")
                            strDeco2 = strDeco2.Replace("<<open_par>>", "(")
                            strDeco2 = strDeco2.Replace("<<close_par>>", ")")
                        End If
                        Log("Decoded and unescaped stream:" & CRLF & strDeco2)       
                        whole_deco = whole_deco & CRLF & strDeco2
                        Exit
                    End If               
                Next
            End If
        Next       
        Log("Whole decoded and unescaped stream:" & CRLF & whole_deco)
        ProgressDialogHide
        Return whole_deco
    Catch
        ProgressDialogHide
        Log(LastException)
        Return ""    'Return an empty string if anything fails
    End Try
End Sub

I attach a couple of test PDF files: 1.PDF with rich (formatted) text, and 2.PDF with plain text. Any suggestions?
 

Attachments

  • 1.PDF
    43 KB · Views: 527
  • 2.PDF
    6.1 KB · Views: 442

enemotrop

I think I have an idea of where the problem is...

If we apply the routine to the file attached to this post, it only recovers part of the text: the unformatted part.
Apart from the plain text, the decompressed stream also contains the formatted text, coded with a simple substitution code.

This is the beginning of the output, dumping strDeco before unescaping and discarding the garbage:

q 0.24 0 0 0.24 0 0 cm
/R7 gs
0 0 0 RG
0 0 0 rg
q
4.16667 0 0 4.16667 0 0 cm BT
/R9 9.36 Tf
0.998898 0 0 1 42.9591 811.521 Tm
[(F)-5.06368(A)3.2588(R)3.2588(M)37.2506(A)3.2588(C)3.2588(I)21.3068(A)3.2588(S)-0.40265( )-4.36256(D)3.25958(E)-0.40265( )-4.36256(S)-0.40265(E)-0.40265(R)3.2588(V)-0.40265(I)21.306(C)3.2588(I)21.306(O)7.91983(S)-0.401083( )-4.36335(D)3.2588(E)-0.40265( )-4.36335(G)7.91983(U)3.2588(A)3.2588(R)3.2588(D)3.2588(I)21.306(A)3.2588(S)-0.40265( )-4.36335(P)-0.40265(A)3.2588(R)3.2588(A)3.2588( )-4.36178(E)-0.401083(L)-5.06368( )-4.36178(M)37.2506(E)-0.39795(S)-0.401083( )-4.36178(D)3.2588(E)-0.401083( )-4.36178(D)]TJ
283.272 0 Td
[(I)21.3075(C)3.2588(I)21.3075(E)-0.401083(M)37.2506(B)3.2588(R)3.2588(E)-0.401083( )-4.36178(2)-8.72669(0)-8.72669(1)-8.72669(3)555.999]TJ
/R9 11.04 Tf
0.998891 0 0 1 190.079 782.721 Tm
[(Z)1.62853(o)1.62853(n)1.62853(a)-9.84421(:)6.55129(A)3.81229(D)3.81229(E)-7.66044(J)-9.84554(E)-7.66177( )-4.92277(Y)-7.66177( )-4.92277(A)3.81229(R)3.81229(O)-5.47801(N)3.81229(A)722]TJ
ET
Q
1 i
791.996 3251.34 503.996 4.99609 re
f
q
4.16667 0 0 4.16667 0 0 cm BT
/R11 7.68 Tf
0.998906 0 0 1 189.119 764.961 Tm
[(Z)-14.6836(o)16.6006(n)16.6006(a)-7.11647( )-3.55728(F)-14.6836(a)-7.11647(r)-17.6948(m)13.0415(a)-7.11647(c)-7.11647(é)-7.11647(u)16.6006(t)-11.1282(i)-34.8434(c)-7.11456(a)-7.11456(:)20.156( )-3.55919(T)-14.6836(F)-14.6836( )-3.55919(2)-7.11456(1)-7.11456( )-3.55919(y)24.1697( )-3.55919(T)-14.6836(F)-14.6836(-)-11.1282(2)-7.11456(2)556.001]TJ
/R11 8.88 Tf
0.998946 0 0 1 198.479 754.401 Tm
[( )7.44424(G)-6.61031(U)-8.49945(A)-8.49945(R)-8.49945(D)-8.49945(I)34.4998(A)-8.49945( )7.44424(D)-8.5011(E)-9.38952( )7.44424(2)-12.1671(4)-12.1671( )7.44424(H)-8.5011(O)-6.61196(R)-8.5011(A)-8.5011(S)667]TJ
ET
Q
1 0 0 RG
1 0 0 rg
q
4.16667 0 0 4.16667 0 0 cm BT
/R13 7.68 Tf
0.998906 0 0 1 35.5191 729.201 Tm
[(1)-382.527(D)2.46317(o)24.1682(m)-11.6738(i)3.01071(n)-7.11552(g)24.1687(o)-5982.4(L)-7.11647(c)-0.548009(d)24.1677(a)24.1677(s)-0.548009(.)-3.55728(D)2.46317(ª)-5.41135(.)-3.55728(R)2.46317(.)-3.55728(H)2.46317(e)24.1677(r)-11.1263(n)-7.11647(a)24.1677(n)-7.11647(d)24.1677(e)24.1677(z)30.7362( )-3.55728(R)2.46317(o)24.1677(d)24.1677(r)-11.1263(i)3.01118(g)24.1677(u)-7.11647(e)24.1677(z)30.7362(-)-11.1263(M)-11.6743(ª)-5.40944(J)-0.549918(.)-3.55919(H)2.46317(e)24.1697(r)-11.1282(n)-7.11456(a)24.1697(n)-7.11456(d)24.1697(e)]TJ
248.432 0 Td
[(z)30.7381( )277.999]TJ
/R15 7.68 Tf
111.001 0 Td

(..............)Tj

The interesting data is contained between parentheses. So, by parsing that string, you can get something like "FARMACIAS DE SERVICIOS DE..."
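
As an illustration of that parsing step, here is a minimal sketch in Python (not B4A; it should translate to a character loop like the one in the first post). The function name `literal_strings` is made up for this example. It collects the contents of every parenthesised string, honours backslash escapes only in the simplest way (the character after the backslash is kept as-is, so `\n` or octal escapes are not interpreted), and ignores the kerning numbers between the strings.

```python
def literal_strings(content: str) -> list[str]:
    """Return the text of every (...) literal string in a content stream."""
    out, buf, depth, i = [], [], 0, 0
    while i < len(content):
        ch = content[i]
        if depth and ch == "\\" and i + 1 < len(content):
            # Backslash escape inside a string: keep the next character literally.
            buf.append(content[i + 1])
            i += 2
            continue
        if ch == "(":
            if depth:               # nested, balanced parenthesis belongs to the text
                buf.append(ch)
            depth += 1
        elif ch == ")" and depth:
            depth -= 1
            if depth == 0:          # end of the outermost string
                out.append("".join(buf))
                buf = []
            else:
                buf.append(ch)
        elif depth:
            buf.append(ch)          # ordinary character inside a string
        i += 1
    return out


sample = "[(F)-5.06368(A)3.2588(R)3.2588(M)37.2506(A)3.2588(C)3.2588(I)21.3068(A)3.2588(S)-0.40265]TJ"
print("".join(literal_strings(sample)))  # FARMACIAS
```

Everything outside the parentheses (the numbers and the TJ operator) is simply skipped, which is all that is needed for uncoded text like this.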

I've noticed that, for instance, the string shown as 14 unreadable characters on the last line of the dump above is, in hex, 01 02 03 03 04 05 03 04 06 03 04 07 06 08. Opening the PDF with a reader, it corresponds to "(922-72-02-80)" (in hex 28 39 32 32 2D 37 32 2D 30 32 2D 38 30 29). Both strings have the same length, so it looks like a simple character substitution.

So there must be a place in the PDF, encoded or not, where the values needed to decode the formatted data are stored. Call it a dictionary, a map, a table...

In fact, I've found two encoded CMap streams in the PDF, which look as follows. Note that in one of the CMaps the byte 01 maps to Unicode 0046, and in the other to Unicode 0020, whereas I was expecting a table that maps 01 to 0028.

Where can that table be found? I'm googling with no results, the PDF specification could finish all the coffee on the block, and I guess it would be an easy question for somebody with experience handling PDF files.

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
/CMapName/R21 def
1 begincodespacerange
<00><ff>
endcodespacerange
64 beginbfrange

<01><01><0046>
<02><02><0041>
<03><03><0052>
<04><04><004d>
<05><05><0043>
<06><06><0049>
<07><07><0053>
<08><08><0020>
<09><0a><0044>
<0b><0b><0047>
<0c><0c><0055>
<0d><0d><005a>
<0e><0e><004f>
<0f><0f><004e>
<10><10><003a>
<11><11><004a>
<12><12><0079>
<13><13><0028>
<14><14><006f>
<15><15><006e>
<16><16><0061>
<17><17><0072>
<18><18><006d>
<19><19><0063>
<1a><1a><00e9>
<1b><1b><0075>
<1c><1c><0074>
<1d><1d><0069>
<1e><1e><0073>
<1f><1f><0054>
<20><20><0032>
<21><21><0031>
<22><22><002d>
<23><23><0029>
<24><24><0062>
<25><25><0065>
<26><26><0068>
<27><27><006c>
<28><28><0064>
<29><29><0067>
<2a><2a><004c>
<2b><2b><002e>
<2c><2c><0042>
<2d><2d><0050>
<2e><2e><007a>
<2f><2f><002f>
<30><30><002c>
<31><32><0035>
<33><33><0076>
<34><34><006a>
<35><35><0066>
<36><36><0039>
<37><37><0037>
<38><38><0030>
<39><39><0034>
<3a><3a><0038>
<3b><3b><0048>
<3c><3c><00f1>
<3d><3d><00aa>
<3e><3e><00ed>
<3f><3f><00e1>
<40><40><0071>
<41><41><0033>
<42><42><00f3>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end


/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
/CMapName/R19 def
1 begincodespacerange
<00><ff>
endcodespacerange
41 beginbfrange

<01><01><0020>
<02><02><0043>
<03><03><006f>
<04><04><006c>
<05><05><0065>
<06><06><0067>
<07><07><0069>
<08><08><004f>
<09><09><0066>
<0a><0a><0063>
<0b><0b><0061>
<0c><0c><0064>
<0d><0d><0046>
<0e><0e><0072>
<0f><0f><006d>
<10><10><00e9>
<11><11><0075>
<12><12><0074>
<13><13><0073>
<14><14><0050>
<15><15><0076>
<16><16><006e>
<17><17><0053>
<18><18><007a>
<19><19><0054>
<1a><1a><0048>
<1b><1b><003a>
<1c><1c><0044>
<1d><1d><0039>
<1e><1e><002e>
<1f><20><0030>
<21><21><00ed>
<22><22><00f1>
<23><23><0068>
<24><24><0038>
<25><25><0033>
<26><26><004c>
<27><27><00e1>
<28><28><0062>
<29><29><002c>
<2a><2a><0034>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
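
For reference, this is a minimal Python sketch (not B4A) of how the bfrange sections above can be turned into a lookup table. The function name `parse_bfrange` is made up for this example. Each `<lo><hi><dst>` line maps the codes lo..hi to consecutive Unicode values starting at dst, so `<09><0a><0044>` covers two codes. The sketch assumes one-byte codes and the simple three-field form seen here; it ignores bfchar sections and the array form of bfrange.

```python
import re


def parse_bfrange(cmap_text: str) -> dict[int, str]:
    """Map each source code to its Unicode character from <lo><hi><dst> lines."""
    mapping = {}
    triple = re.compile(r"<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>")
    # Only look between beginbfrange and endbfrange, so <00><ff> is never picked up.
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in triple.findall(section):
            start, end, base = int(lo, 16), int(hi, 16), int(dst, 16)
            for offset in range(end - start + 1):
                mapping[start + offset] = chr(base + offset)
    return mapping


sample = """
1 begincodespacerange
<00><ff>
endcodespacerange
3 beginbfrange
<01><01><0046>
<09><0a><0044>
<13><13><0028>
endbfrange
"""
table = parse_bfrange(sample)
print(table[0x01], table[0x09], table[0x0A], table[0x13])  # F D E (
```

Judging by the font dictionaries quoted later in the thread, these two CMaps seem to belong to the two Calibri fonts (their last codes, 0x42 and 0x2a, match LastChar 66 and 42), which would explain why code 01 does not map to "(" in either of them.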
 

Attachments

  • ADEJE_DICIEMBRE_2013_MENSUAL.PDF
    145.5 KB · Views: 632

enemotrop

One part of the extracted data (before filtering and unescaping) is this:

4 0 3 26 6 40 12 75 11 352 15 718 14 1145 18 1849 19 2137 17 2324 22 2716 21 2996 10 3601 1 3649 27 3813 34 3862 36 4245 33 4425 39 4832 41 5101 38 5279 32 5590 25 5615 24 5766 <</OPM 1/Type/ExtGState>> <</R7 4 0 R>> <</R16 7 0 R/R17 8 0 R/R18 9 0 R>> <</FontBBox[0 -23 931 741]/CapHeight 741/MissingWidth 278/CharSet(/two/L/A/n/three/M/B/o/Y/N/C/Z/O/D/P/E/F/R/G/S/colon/I/U/J/a/V/space/zero/one)/Type/FontDescriptor/FontFile3 13 0 R/Descent -23/StemV 139/Flags 4/FontName/VJIFOY+Helvetica-BoldOblique/Ascent 741/ItalicAngle 0>> <</LastChar 111/BaseFont/VJIFOY+Helvetica-BoldOblique/Type/Font/Encoding/WinAnsiEncoding/Subtype/Type1/FontDescriptor 12 0 R/FirstChar 32/Widths[278 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 556 556 556 556 0 0 0 0 0 0 333 0 0 0 0 0 0 722 722 722 722 667 611 778 0 278 556 0 611 833 722 778 667 0 722 667 0 722 667 0 0 667 611 0 0 0 0 0 0 556 0 0 0 0 0 0 0 0 0 0 0 0 611 611]>> <</FontBBox[-18 -218 762 742]/CapHeight 742/MissingWidth 278/CharSet(/two/L/A/iacute/aacute/y/n/c/three/M/ordfeminine/B/uacute/z/o/d/Y/four/N/C/p/e/five/O/D/q/f/six/P/E/ntilde/r/g/seven/F/s/eight/R/G/oacute/t/i/nine/S/H/eacute/u/j/T/v/J/l/a/V/K/m/b/parenleft/parenright/space/comma/hyphen/period/zero/one)/Type/FontDescriptor/FontFile3 16 0 R/Descent -218/StemV 114/Flags 4/FontName/ULWQET+Helvetica/Ascent 742/ItalicAngle 0>> <</LastChar 250/BaseFont/ULWQET+Helvetica/Type/Font/Encoding/WinAnsiEncoding/Subtype/Type1/FontDescriptor 15 0 R/FirstChar 32/Widths[278 0 0 0 0 0 0 0 333 333 0 0 278 333 278 0 556 556 556 556 556 556 556 556 556 556 0 0 0 0 0 0 0 667 667 722 722 667 611 778 722 0 500 667 556 833 722 778 667 0 722 667 611 0 667 0 0 667 0 0 0 0 0 0 0 556 556 500 556 556 278 556 0 222 222 0 222 833 556 556 556 556 333 500 278 556 500 0 0 500 500 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 370 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 556 0 0 0 0 0 0 0 556 0 0 0 278 0 0 0 
556 0 556 0 0 0 0 0 0 556]>> <</Type/Encoding/BaseEncoding/WinAnsiEncoding/Differences[1/parenleft/nine/two/hyphen/seven/zero/eight/parenright/one/four/E/d/i/f/period/L/a/s/space/T/e/r/z/comma/P/l/y/A/m/c/C/slash/n/t/o/S/R/five/six/j/three/G/B/u/v/J/I/x/H/g/M/ntilde/h/F/D/N/p/U/b/eacute/q/ordmasculine/oacute/V/O]>> <</FontBBox[-161 -215 974 681]/CapHeight 681/MissingWidth 777/Type/FontDescriptor/Descent -215/StemV 146/FontFile2 20 0 R/Flags 4/FontName/KIGHPU+TTE19F6F90t00/Ascent 681/ItalicAngle 0>> <</LastChar 65/BaseFont/KIGHPU+TTE19F6F90t00/Type/Font/Encoding 18 0 R/Subtype/TrueType/FontDescriptor 19 0 R/FirstChar 1/Widths[333 500 500 333 500 500 500 333 500 500 667 500 278 333 250 611 500 389 250 611 444 389 389 250 611 278 444 667 778 444 667 278 556 278 500 556 667 500 500 278 500 722 667 556 444 500 389 500 778 500 889 556 556 667 722 722 500 722 500 444 500 300 500 667 722]>> <</FontBBox[0 -219 824 757]/CapHeight 757/MissingWidth 278/CharSet(/two/A/y/n/c/o/four/Z/O/D/E/r/F/R/G/t/i/S/H/eacute/u/colon/T/I/U/a/m/space/hyphen/one)/Type/FontDescriptor/FontFile3 23 0 R/Descent -219/StemV 123/Flags 4/FontName/XMEXXU+Helvetica-Bold/Ascent 757/ItalicAngle 0>> <</LastChar 233/BaseFont/XMEXXU+Helvetica-Bold/Type/Font/Encoding/WinAnsiEncoding/Subtype/Type1/FontDescriptor 22 0 R/FirstChar 32/Widths[278 0 0 0 0 0 0 0 0 0 0 0 0 333 0 0 0 556 556 0 556 0 0 0 0 0 333 0 0 0 0 0 0 722 0 0 722 667 611 778 722 278 0 0 0 0 0 778 0 0 722 667 611 722 0 0 0 0 611 0 0 0 0 0 0 556 0 556 0 0 0 0 0 278 0 0 0 889 611 611 0 0 389 0 333 611 0 0 0 556 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 556]>> <</R13 14 0 R/R9 11 0 R/R15 17 0 R/R11 21 0 R>> <</Parent 24 0 R/Type/Page/Contents 2 0 R/Resources<</ExtGState 3 0 R/ProcSet[/PDF/ImageB/ImageC/Text]/XObject 6 0 R/Font 10 0 R>>/MediaBox[0 0 595 842]/Rotate 0>> <</R12 
28 0 R/R13 29 0 R/R14 30 0 R/R11 31 0 R>> <</Type/Encoding/BaseEncoding/WinAnsiEncoding/Differences[1/g38/g4/g90/g68/g18/g47/g94/g3/g24/g28/g39/g104/g127/g75/g69/g855/g58/g455/g894/g381/g374/g258/g396/g373/g272/g288/g437/g410/g349/g400/g100/g1006/g1005/g882/g895/g271/g286/g346/g367/g282/g336/g62/g856/g17/g87/g460/g876/g853/g1009/g1010/g448/g361/g296/g1013/g1011/g1004/g1008/g1012/g44/g377/g464/g351/g260/g395/g1007/g383]>> <</FontBBox[-18 -178 805 726]/CapHeight 726/MissingWidth 506/Type/FontDescriptor/Descent -178/StemV 120/FontFile2 37 0 R/Flags 4/FontName/EOZSTF+Calibri/Ascent 726/ItalicAngle 0>> <</LastChar 66/BaseFont/EOZSTF+Calibri/Type/Font/Encoding 34 0 R/Subtype/TrueType/FontDescriptor 36 0 R/ToUnicode 35 0 R/FirstChar 1/Widths[459 606 563 874 529 267 473 226 630 488 637 653 478 676 659 276 331 474 312 538 537 494 355 813 418 503 537 347 246 399 495 507 507 306 312 537 503 537 246 537 474 423 267 561 532 397 430 258 507 507 473 255 316 507 507 507 507 507 631 537 416 246 494 537 507 538]>> <</Type/Encoding/BaseEncoding/WinAnsiEncoding/Differences[1/g3/g18/g381/g367/g286/g336/g349/g75/g296/g272/g258/g282/g38/g396/g373/g288/g437/g410/g400/g87/g448/g374/g94/g460/g100/g44/g855/g24/g1013/g856/g1004/g1005/g351/g377/g346/g1012/g1007/g62/g260/g271/g853/g1008]>> <</FontBBox[0 -178 726 684]/CapHeight 683/MissingWidth 506/Type/FontDescriptor/Descent -178/StemV 108/FontFile2 42 0 R/Flags 4/FontName/VGBWEV+Calibri/Ascent 683/ItalicAngle 0>> <</LastChar 42/BaseFont/VGBWEV+Calibri/Type/Font/Encoding 39 0 R/Subtype/TrueType/FontDescriptor 41 0 R/ToUnicode 40 0 R/FirstChar 1/Widths[226 533 527 229 498 471 229 662 305 423 479 525 459 349 799 498 525 335 391 517 452 525 459 395 487 623 268 615 507 252 507 507 229 525 525 507 507 420 479 525 250 507]>> <</R9 33 0 R/R7 38 0 R>> <</Parent 24 0 R/Type/Page/Contents 26 0 R/Resources<</ProcSet[/PDF/ImageB/ImageC/Text]/XObject 27 0 R/Font 32 0 R>>/MediaBox[0 0 595 842]/Rotate 90>> <</ITXT(2.1.6)/Type/Pages/Count 2/Kids[1 0 R 25 0 R]>>

This looks like our encoding table...

If we return to the unfiltered page of text that I posted above, and specifically to the portion of text that we want to decode, we see

/R15 7.68 Tf
111.001 0 Td
(..............)

The text is between the parentheses, and it corresponds to the hex values 01 02 03 03 04 05 03 04 06 03 04 07 06 08, as I said.

We have to find, in the encoding table, which characters those hex values represent.

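Our portion of text is set in the font /R15 (selected by the "/R15 7.68 Tf" line). In the dump, the font resource dictionary <</R13 14 0 R/R9 11 0 R/R15 17 0 R/R11 21 0 R>> maps /R15 to object 17, the TrueType font KIGHPU+TTE19F6F90t00, which carries /Encoding 18 0 R. The dump has several font descriptors:

FontDescriptor 12 0 R
FontDescriptor 15 0 R
FontDescriptor 19 0 R
FontDescriptor 22 0 R
FontDescriptor 36 0 R
FontDescriptor 41 0 R


The Encoding object we need is the dictionary with the Differences array that directly follows the font using FontDescriptor 15 0 R, so that is the section to look at: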

/Type/Font/Encoding/WinAnsiEncoding/Subtype/Type1/FontDescriptor 15 0 R/FirstChar 32/Widths[278 0 0 0 0 0 0 0 333 333 0 0 278 333 278 0 556 556 556 556 556 556 556 556 556 556 0 0 0 0 0 0 0 667 667 722 722 667 611 778 722 0 500 667 556 833 722 778 667 0 722 667 611 0 667 0 0 667 0 0 0 0 0 0 0 556 556 500 556 556 278 556 0 222 222 0 222 833 556 556 556 556 333 500 278 556 500 0 0 500 500 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 370 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 556 0 0 0 0 0 0 0 556 0 0 0 278 0 0 0 556 0 556 0 0 0 0 0 0 556]>> <</Type/Encoding/BaseEncoding/WinAnsiEncoding/Differences[1/parenleft/nine/two/hyphen/seven/zero/eight/parenright/one/four/E/d/i/f/period/L/a/s/space/T/e/r/z/comma/P/l/y/A/m/c/C/slash/n/t/o/S/R/five/six/j/three/G/B/u/v/J/I/x/H/g/M/ntilde/h/F/D/N/p/U/b/eacute/q/ordmasculine/oacute/V/O]>>

The interesting part is the Differences array at the end; it is the substitution table for the byte codes used in the text. Inside the brackets, the first field is a number giving the code of the first glyph name, and each following slash-separated name takes the next code in sequence. So, we have that

Byte Alias Char
01 parenleft (
02 nine 9
03 two 2
04 hyphen -
05 seven 7
06 zero 0
07 eight 8
08 parenright )
09 one 1
0A four 4
0B E E
0C d d
0D i i
0E f f
0F period .
10 L L
11 a a
12 s s
13 space (space)
14 T T
15 e e
16 r r
17 z z
18 comma ,
19 P P
1A l l
1B y y
1C A A
1D m m
1E c c
1F C C
20 slash /
21 n n
22 t t
23 o o
24 S S
25 R R
26 five 5
27 six 6
28 j j
29 three 3
2A G G
2B B B
2C u u
2D v v
2E J J
2F I I
30 x x
31 H H
32 g g
33 M M
34 ntilde ñ
35 h h
36 F F
37 D D
38 N N
39 p p
3A U U
3B b b
3C eacute é
3D q q
3E ordmasculine º
3F oacute ó
40 V V
41 O O

Opening the PDF, it is easy to see that the encoding isn't arbitrary; it is built during the PDF creation process, assigning a value to each character according to its order of appearance in the original file.

Applying the encoding table to our coded array of bytes 01 02 03 03 04 05 03 04 06 03 04 07 06 08, we get
the string "(922-72-02-80)", which is, indeed, the formatted/coded text that we were looking for.
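
The lookup above can be sketched in a few lines of Python (not B4A). The names `GLYPH_TO_CHAR`, `parse_differences`, `glyph_char` and `decode` are made up for this example, and the glyph-name table only holds the names needed here; a real decoder would need the full list of glyph names. The Differences array is assumed to be already split into integers and names.

```python
# Hypothetical, deliberately incomplete glyph-name table for this example.
GLYPH_TO_CHAR = {
    "parenleft": "(", "parenright": ")", "hyphen": "-",
    "zero": "0", "one": "1", "two": "2", "seven": "7", "eight": "8", "nine": "9",
}


def parse_differences(tokens: list) -> dict[int, str]:
    """Turn [code, name, name, ..., code, name, ...] into {code: glyph name}."""
    table, code = {}, 0
    for tok in tokens:
        if isinstance(tok, int):
            code = tok          # an integer restarts the numbering
        else:
            table[code] = tok   # each name takes the current code...
            code += 1           # ...and the next name gets the following one
    return table


def glyph_char(name: str) -> str:
    """Single-letter glyph names stand for themselves; others go through the table."""
    if len(name) == 1:
        return name
    return GLYPH_TO_CHAR.get(name, "?")


def decode(data: bytes, table: dict[int, str]) -> str:
    return "".join(glyph_char(table.get(b, "")) for b in data)


differences = [1, "parenleft", "nine", "two", "hyphen", "seven",
               "zero", "eight", "parenright", "one", "four", "E", "d"]
coded = bytes.fromhex("0102030304050304060304070608")
print(decode(coded, parse_differences(differences)))  # (922-72-02-80)
```

Codes that are missing from the table come out as "?", which makes gaps in the glyph-name list easy to spot while testing.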

So, apparently, the app needs to take the font name (such as /R15) that the Tf operator selects before a block of coded text between parentheses, find that font's object through the page's font resources, follow its Encoding entry, and, if the encoding has a Differences array, match each byte value with the character named at that position in the array.

I'll try when I have some free time, and keep you informed...
 

enemotrop

I've written a sub that works in my case; it decompresses the stream objects, searches for text streams, checks whether a text is coded or not, finds the correct ToUnicode map to decode it, guesses where there is a line feed, and finally displays the text.

Don't be surprised if it doesn't work with some PDF files; I've only checked it with the files from one specific web page whose text I need to parse. But I'm open to any suggestions, hints, collaborations... and if someone has the knowledge and the motivation to port an open source Java library to B4A, that would be perfect.

I attach the code of the sub. It requires that we first open the PDF file and store its content in a string.
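
For readers who just want to see the first step (decompressing the stream objects) in isolation, here is a minimal Python sketch; it is not the attached sub and not B4A. The function name `inflate_streams` is made up for this example. It assumes that every stream of interest is Flate (zlib) compressed and finds streams by searching for the stream/endstream keywords, as in the first post; a robust parser would read each object's /Filter and /Length entries instead.

```python
import re
import zlib


def inflate_streams(pdf: bytes) -> list[bytes]:
    """Return the inflated payload of every stream that zlib can decompress."""
    results = []
    # "stream" + line break, then the payload up to the next "endstream".
    for match in re.finditer(rb"stream\r?\n(.*?)endstream", pdf, re.S):
        try:
            # decompressobj tolerates the line break left before "endstream".
            results.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            pass  # not zlib data (image, other filter, ...): skip it
    return results


# Tiny fake PDF fragment, built here only to exercise the function.
payload = zlib.compress(b"BT (Hello) Tj ET")
fake = b"1 0 obj\n<</Filter/FlateDecode>>\nstream\n" + payload + b"\nendstream\nendobj\n"
print(inflate_streams(fake))
```

Each inflated payload can then be checked for text operators (BT, Tf, Tj, TJ) or for font and encoding dictionaries before decoding.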
 

Attachments

  • ExtractTextFromPDF.sub.txt
    55.9 KB · Views: 550

Robert Valentino

Well-Known Member
Licensed User
Longtime User
Do you have a version of this source that is formatted properly so I can give it a try on a PDF I need to read?

The text file you posted needs to be completely reformatted to work with B4A.

Could you just post the B4A basic file?
 

Isac

Active Member
Licensed User
Longtime User
Has anyone managed to extract text from a PDF?
Is there any library?
Thank you.
 

Peter Simpson

Expert
Licensed User
Longtime User
Hmm, interesting...
 