B4A Library jSoup HTML Parser

This is my first attempt at a wrapper, so it's a work in progress. Consider it a beta: it isn't feature complete, and I'm adding features as I require them :)

Not all functions and documentation are implemented or fully tested yet.

Library is compatible with B4A and B4J.

1. Download the jsoup 1.8.1 jar from http://jsoup.org/packages/jsoup-1.8.1.jar
2. Copy the downloaded jar file to the B4A or B4J libraries folder
3. Download the attached jSoup library, unzip it, and copy the jar and xml files to the libraries folder
4. Have a read of http://jsoup.org

Latest version: v0.15

B4X:
    ' Parse a string
    html = "<html><head><title>First parse</title></head><body><p>Parsed HTML into a doc.</html>"
    Log(js.parse_HTML(html))

    ' Parse a body fragment
    html = "<div><p>Lorem ipsum.</p>"
    Log(js.parse_BodyFragment(html))

    ' Load from URL
    url = "https://www.b4x.com/"
    Log(js.connect(url))
    Log(js.connectXtra(url, "Mozilla", 0))

    ' Load from file
    Log(js.parse_InputStream(File.OpenInput(File.DirAssets, "test.html"), "UTF-8", url))

    ' DOM methods
    local_html = File.ReadString(File.DirAssets, "test.html")

    Log(js.getElementByID(local_html, "name"))

    DOM1 = js.getElementsByTag(local_html, "a", "")
    DOM2 = js.getElementsByTag(local_html, "a", "href")

    For i = 0 To DOM1.Size -1
        Log(DOM1.Get(i))
        Log(DOM2.Get(i))
    Next

    DOM3 = js.selectorElementText(local_html, "span")

    For i = 0 To DOM3.Size -1
        Log(DOM3.Get(i))
    Next

    ' Selector Syntax - http://jsoup.org/cookbook/extracting-data/selector-syntax
    Selector1 = js.selector(local_html, "img[src$=.png]")

    For i = 0 To Selector1.Size -1
        Log(Selector1.Get(i))
    Next

    ' Extract attributes, text & HTML
    html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>"
    Extract1 = js.selectorElementText(html, "a")
    Log(Extract1.Get(0))
    Extract2 = js.selectorElementAttr(html, "a", "href")
    Log(Extract2.Get(0))
    Extract3 = js.selectorElementAttr(html, "a", "innerhtml")
    Log(Extract3.Get(0))
    Extract4 = js.selectorElementAttr(html, "a", "outerhtml")
    Log(Extract4.Get(0))
 

Attachments

  • jsoup_example_B4A.zip (6.7 KB)
  • jsoup_example_B4J.zip (4.1 KB)
  • jSoup_v0.15.zip (6.3 KB)

Martin Larsen

First, thanks TheJinJ for making this library available! It works fine so far, although the syntax is very different from the original version.

It would be great if this lib could be used in this way:

I second that! I have used jsoup for several Java-based Android projects and I really like the chainable, jQuery-like syntax.

I would very much like to have something like that implemented in this wrapper library.

Erel, if you read this: Is it possible to use method chaining in B4A?
 

DonManfred

Using a correctly written Java library you can use chaining, but not with plain B4A as far as I know.
 

Martin Larsen

Do you mean that it is possible to make a library wrapper in B4A that uses chaining if the Java library is written correctly?
 

BowTieNeck

The latest version of jsoup is 1.8.3. I'm getting an error because the current code expects version 1.8.1, and I couldn't find anywhere to download the older version of jsoup. Would it be possible for you to get the code to just pick up whatever version is in the libraries folder?
Thanks,
Chris

Edit:
I've changed your xml file so that it now depends on jsoup-1.8.3, and that works OK. However, it's not really a long-term solution.
 

mr23

Update: after a reboot of the PC it now works, go figure.

I pulled down 1.8.1 from the first post, along with the B4A example, and placed jSoup.jar, jSoup.xml and jsoup-1.8.1.jar into an additional libraries folder.
Using B4A v4.3, just trying to compile the project fails on line 56 with missing parameter(s).
56 Log(js.connectXtra(url, "Mozilla", 0))
IntelliSense shows a number of additional required parameters.

With that line commented out, it gets to
66 DOM1 = js.getElementsByTag(local_html, "a", "")
with IntelliSense showing only 2 parameters for getElementsByTag.
Have I made a mistake, or is the B4A sample out of date with the supplied library files?

I was looking to try this because JTidy has no tolerance for unrecognized tags or malformed HTML (I haven't dug in yet). For example, JTidy doesn't work with 'http://google.com' or with 'https://www.b4x.com/android/forum/forums/share-your-creations.33/page-1?order=view_count'.

Update: I found this enhancement that may help, but I need to wrap it to test: https://github.com/nanndoj/jtidy

-Chris
 

Martin Larsen

How do you work with a js doc read from a file, as in your example:

B4X:
js.parse_InputStream(File.OpenInput(File.DirAssets, "test.html"), "UTF-8", url)

How do you, e.g., select an element:

B4X:
js.getElementByID(local_html, "name")

These methods work on a local HTML string, as in the snippet above. What if you need to select the element from the file just read?

PS. I know you can of course read the local HTML with File.ReadString, but since the parse_InputStream method (and likewise the connect() method) exists, there surely must be a way to work with them.
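
One possible approach, offered as an untested sketch: the first post passes the result of parse_InputStream straight to Log, which suggests (but does not confirm) that it returns the parsed document as an HTML string. If so, that string could be handed to the DOM methods in place of local_html. Here js and url are the same variables as in the first post, parsed_html is a name made up for illustration, and the String return type is an assumption:

B4X:
    ' Assumption: parse_InputStream returns the parsed document as a String
    Dim parsed_html As String = js.parse_InputStream(File.OpenInput(File.DirAssets, "test.html"), "UTF-8", url)

    ' Select from the parsed document instead of a separately read string
    Log(js.getElementByID(parsed_html, "name"))

The same pattern would presumably apply to the result of connect(), under the same assumption. Method signatures may also differ between library versions, so check what the IDE shows for the installed version.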
 

Rusty

I could not get your sample code to compile.
It looks like there are many parameters missing using the latest jsoup.jar.
Is there any updated sample anywhere?
Thanks
Rusty
 