B4A Library jSoup HTML Parser

This is my first attempt at a wrapper, so it's a work in progress. Consider it a beta: it isn't feature complete, and I'm adding features as I need them :)

Not all functions and documentation are implemented or fully tested yet.

The library is compatible with B4A and B4J.

1. Download the jsoup 1.8.1 jar from http://jsoup.org/packages/jsoup-1.8.1.jar
2. Copy the downloaded jar file to the B4A or B4J libraries folder
3. Download the attached jSoup library, unzip it, and copy the jar and xml to the libraries folder
4. Have a read of http://jsoup.org

Latest version : v0.15

B4X:
    ' Parse a string
    html = "<html><head><title>First parse</title></head><body><p>Parsed HTML into a doc.</html>"
    Log(js.parse_HTML(html))

    ' Parse a body fragment
    html = "<div><p>Lorem ipsum.</p>"
    Log(js.parse_BodyFragment(html))

    ' Load from URL
    url = "https://www.b4x.com/"
    Log(js.connect(url))
    Log(js.connectXtra(url, "Mozilla", 0))

    ' Load from file
    Log(js.parse_InputStream(File.OpenInput(File.DirAssets, "test.html"), "UTF-8", url))

    ' DOM methods
    local_html = File.ReadString(File.DirAssets, "test.html")

    Log(js.getElementByID(local_html, "name"))

    DOM1 = js.getElementsByTag(local_html, "a", "")
    DOM2 = js.getElementsByTag(local_html, "a", "href")

    For i = 0 To DOM1.Size -1
        Log(DOM1.Get(i))
        Log(DOM2.Get(i))
    Next

    DOM3 = js.selectorElementText(local_html, "span")

    For i = 0 To DOM3.Size -1
        Log(DOM3.Get(i))
    Next

    ' Selector Syntax - http://jsoup.org/cookbook/extracting-data/selector-syntax
    Selector1 = js.selector(local_html, "img[src$=.png]")

    For i = 0 To Selector1.Size -1
        Log(Selector1.Get(i))
    Next

    ' Extract attributes, text & HTML
    html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>"
    Extract1 = js.selectorElementText(html, "a")
    Log(Extract1.Get(0))
    Extract2 = js.selectorElementAttr(html, "a", "href")
    Log(Extract2.Get(0))
    Extract3 = js.selectorElementAttr(html, "a", "innerhtml")
    Log(Extract3.Get(0))
    Extract4 = js.selectorElementAttr(html, "a", "outerhtml")
    Log(Extract4.Get(0))
 

Attachments

  • jsoup_example_B4A.zip
  • jsoup_example_B4J.zip
  • jSoup_v0.15.zip

Jaames

Great work, keep it up, and thanks for sharing this! I made a jsoup wrapper too, but only the portion I needed. This is a great lib that you are sharing! :)
 

TheJinJ

Updated to v0.13, with a few small changes.

Added some extra options to connect, and a selector that returns a single string:

B4X:
    ' Return the first matching tag as a string
    html = js.selectorFirst(html, "a", "")

    ' Or return the contents of the attribute as a string
    html = js.selectorFirst(html, "a", "href")
 

paragkini

Hi, how do I use the cleanhtml method? I tried reading jsoup.org but couldn't understand the usage.
 

Inman

Every single project I have done with B4A so far involved some level of HTML parsing. I always had to parse it by hand, until now. This library is a godsend for me. Thank you so much.
 

Jaames

"Note that you could have used the jTidy library to convert the HTML to XML and then parse the XML."
But with jsoup you can parse unformatted (messed-up) HTML without a problem, and it works great. It's really the best library for HTML parsing that I know of.
 

forestd

"You need to put the jsoup jar in your additional libraries :)"

Thanks.

But I have a new question:

url = "https://www.b4x.com/"
Log(js.connect(url))
Log(js.connectXtra(url, "Mozilla", 0))

The app is unable to run; an error occurred.
The log said:
" at dalvik.system.NativeStart.main(Native Method)"
"android.os.NetworkOnMainThreadException"

How can I solve this error?
Thank you
 

Jaames

"The real solution is to avoid using this library feature as it is not implemented correctly. It will cause your app to hang and after 5 seconds Android may kill it.

Download whatever you need to download with HttpUtils2."
What do you mean: not to use the library at all, or just not to use it for downloading the HTML?
Is it safe to download the site with HttpUtils2 and then use it with jsoup (the OP's library)?
 

Jaames

"I'm not familiar with this library so I do not know whether it allows you to set the html from a string instead of a url"
Yes, it does.

"Sending an HTTP request on the main thread is a bad solution."
Aha... Thanks for clearing this up. I hope the author of the lib will find a solution... It's a great lib...
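
For anyone hitting the NetworkOnMainThreadException above, here is a minimal sketch of that approach: download the page with HttpUtils2, then pass the resulting string to the wrapper. It rests on a few assumptions: the (Ok)HttpUtils2 library is referenced alongside jSoup, the wrapper's type is named jSoup and needs no Initialize call, and getElementsByTag returns a List, as the example in the first post suggests. DownloadAndParse is only an illustrative name.

B4X:
    ' Download off the main thread with HttpUtils2, then parse the string with the jSoup wrapper.
    Sub DownloadAndParse(url As String)
        Dim job As HttpJob
        job.Initialize("", Me)
        job.Download(url)
        Wait For (job) JobDone(job As HttpJob)
        If job.Success Then
            Dim html As String = job.GetString
            Dim js As jSoup ' assumption: the wrapper's type name
            ' Same call as in the first post, but fed from the downloaded string
            Dim links As List = js.getElementsByTag(html, "a", "href")
            For i = 0 To links.Size - 1
                Log(links.Get(i))
            Next
        Else
            Log(job.ErrorMessage)
        End If
        job.Release
    End Sub

Called as DownloadAndParse("https://www.b4x.com/"), this never touches js.connect, so the network request stays off the main thread.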
 

Jaames

It would be great if this lib could be used in this way (the first block is plain jsoup in Java, the second is how it might look in B4A):

B4X:
  Document doc = Jsoup.connect("http://www.example.com/view.jsp")
              .data("Field1", Integer.toString(Field1Mode.getValue()))
              .data("Field2", Field2Name)
              .header("Accept-Language", "en")
              .post();

B4X:
Dim doc as jSoupDocument = jSoup.connect("www.example.com/view.jsp?") _
                 .data1("Field1", Field1Mode) _
                 .data2("Field2",Field2Name) _
                 .header("Accept-Language", "en") _
                 .post

I know it's possible, I've seen some libs done this way in B4A, but how? :)
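
As far as I understand it, chaining works whenever each method returns the object it was called on, so the next call can be applied to the result. Below is a minimal sketch of the idea in plain B4X. It is not the actual jSoup wrapper: RequestBuilder is a hypothetical class module, and Data, Header and Describe are made-up names. It only collects the values and sends nothing.

B4X:
    ' Class module "RequestBuilder" (hypothetical, for illustration only)
    Sub Class_Globals
        Private mUrl As String
        Private mData As Map
        Private mHeaders As Map
    End Sub

    Public Sub Initialize(Url As String)
        mUrl = Url
        mData.Initialize
        mHeaders.Initialize
    End Sub

    ' Each setter returns Me, which is what makes chaining possible.
    Public Sub Data(Key As String, Value As String) As RequestBuilder
        mData.Put(Key, Value)
        Return Me
    End Sub

    Public Sub Header(Name As String, Value As String) As RequestBuilder
        mHeaders.Put(Name, Value)
        Return Me
    End Sub

    ' Ends the chain; a real wrapper would send the request here.
    Public Sub Describe As String
        Return mUrl & " (" & mData.Size & " fields, " & mHeaders.Size & " headers)"
    End Sub

Usage from another module could then look like this:

B4X:
    Dim rb As RequestBuilder
    rb.Initialize("http://www.example.com/view.jsp")
    Log(rb.Data("Field1", "1") _
        .Data("Field2", "abc") _
        .Header("Accept-Language", "en") _
        .Describe)

A Java wrapper library would presumably follow the same pattern, with each wrapped method returning the wrapper object itself.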
 