B4J Question Scraping a webpage

paddy12309 · Oct 21, 2022

Hi everyone,

I am trying to scrape the web page served by a 4G USB dongle (I want to get the IMEI etc.). I have tried the HTTP utils, cURL, and a command line call to Chromium, yet each method I have tried returns the HTML but none of the JavaScript-generated content. Is there a way I can get this?
I believe what I am after is called the outerHTML, as in the whole page after JavaScript execution?

Any help is greatly appreciated, I've been tearing my hair out on this for 2 days!

ciao!

Mashiane · Oct 21, 2022

Check this article on how I did my scraping, perhaps it can help.

Running a web-scraper on my TailwindCSS Website

Well, I can safely confirm that one of the things on my bucket list has been ticked off. Ha ha ha.

mbanga-anele.medium.com

paddy12309 · Oct 21, 2022

Mashiane said:
Check this article on how I did my scraping, perhaps it can help.

Running a web-scraper on my TailwindCSS Website

Well, I can safely confirm that one of the things on my bucket list has been ticked off. Ha ha ha.

mbanga-anele.medium.com

thanks I'll have a read!

paddy12309 · Oct 21, 2022

To add to this: I have a lot of remote devices on which I can update the B4J program remotely, so ideally I need to be able to do this through B4J code or the command line without installing anything else on the Pi. Any thoughts on how to gather the HTML of the whole page, including the JavaScript-generated view?

MicroDrie · Oct 21, 2022

paddy12309 said:
I am trying to scrape the webpage from a USB dongle for 4g (I want to get the IMEI etc).

to quote Erel:

You cannot read the IMEI number on new devices.

You could start with a B4X solution:

MiniHtmlParser - simple html parser implemented with B4X
MiniHtmlParser is a cross platform class that parses html strings and creates a tree with the various elements.

Or with a specific B4A solution like:

jSoup HTML Parser
This solution is specific to B4A and allows you to search for a specific piece of code in the HTML page

paddy12309 · Oct 21, 2022

MicroDrie said:
to quote Erel:

You could start with a B4X solution

Or with a specific B4A solution like:

So, I am trying to scrape a website for the IMEI, which is in plain text in the HTML of the loaded page, not read it from an Android device.
Thank you for pointing me towards MiniHtmlParser, that's a useful tool! I think I will implement that in the program!

Sadly, my issue is that the HTML I download is incomplete: it does not contain the contents of one of the divs (presumably generated on page load by JavaScript?).

So I have tried a command line too,

chromium command:

chromium-browser --headless --dump-dom --virtual-time-budget=10000 192.168.8.1/html/content.html#device-information

This returns the page without populating the div I need populated. The same happens with the OkHttpUtils download function.
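For anyone who wants to run the same Chromium command from B4J code rather than a terminal, here is a minimal sketch using the jShell library in a non-UI project. It only wraps the command quoted above, so it will return the same (unpopulated) DOM if the command line does. The sub name DumpDom and the output file name DonglePageDom.html are assumptions for illustration.

```b4x
'Non-UI B4J project. Libraries: jCore, jShell.
Sub Process_Globals
End Sub

Sub AppStart (Args() As String)
    DumpDom("http://192.168.8.1/html/content.html#device-information")
    StartMessageLoop 'keeps the non-UI app alive while the process runs
End Sub

'DumpDom is a hypothetical helper name, not part of any library.
Private Sub DumpDom (Url As String)
    Dim shl As Shell
    'Same flags as the command line above, passed as separate arguments.
    shl.Initialize("shl", "chromium-browser", Array As String( _
        "--headless", "--dump-dom", "--virtual-time-budget=10000", Url))
    shl.Run(30000) 'timeout in milliseconds
    Wait For shl_ProcessCompleted (Success As Boolean, ExitCode As Int, StdOut As String, StdErr As String)
    If Success And ExitCode = 0 Then
        'StdOut holds whatever DOM Chromium printed; the file name is an assumption.
        File.WriteString(File.DirApp, "DonglePageDom.html", StdOut)
        Log("IMEI label present: " & StdOut.Contains("IMEI"))
    Else
        Log("chromium failed, exit code " & ExitCode)
        Log(StdErr)
    End If
    ExitApplication
End Sub
```

Writing StdOut to a file makes it easy to compare this output with what the browser shows.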

stevel05 · Oct 21, 2022

paddy12309 said:
this returns the page without populating the div I need populated

Might be worth checking if it's a security feature of the dongle. An attempt to prevent hacking maybe?

paddy12309 · Oct 21, 2022

stevel05 said:
Might be worth checking if it's a security feature of the dongle. An attempt to prevent hacking maybe?

that's a good thought, I'll see what I can find out!

MicroDrie · Oct 21, 2022

this returns the page without populating the div I need populated. the same happens with OKHttpUtils download function.

1. Open Firefox or Chrome browser
2. Open web page
3. Open new tab page with this url: https://formatteronline.com/html
4. Open new tab page and load your page
5. Right click loaded web page
6. Show source web page
7. Select all page source with Ctrl-A
8. Copy all page source with Ctrl-C
9. Go back to https://formatteronline.com/html
10. Paste your web page source
11. Click format button

If there is an error on your dongle web page then a red line is shown; if not, your page source is OK.

It is possible that the web page is malformed. In that case go to your loaded dongle web page, right click on it and click Inspect to show the DOM source. Go to the first line and choose Edit as HTML. Then repeat steps 7 to 11.

Be aware that it is very important to use the same text string as shown in the original (DOM) page source, because parsing takes place on a one-to-one character match between your source string and the loaded web page source string.

paddy12309 · Oct 21, 2022

MicroDrie said:
1. Open Firefox or Chrome browser
2. Open web page
3. Open new tab page with this url: https://formatteronline.com/html
4. Open new tab page and load your page
5. Right click loaded web page
6. Show source web page
7. Select all page source with Ctrl-A
8. Copy all page source with Ctrl-C
9. Go back to https://formatteronline.com/html
10. Paste your web page source
11. Click format button

If there is an error on your dongle web page then a red line is shown; if not, your page source is OK.

It is possible that the web page is malformed. In that case go to your loaded dongle web page, right click on it and click Inspect to show the DOM source. Go to the first line and choose Edit as HTML. Then repeat steps 7 to 11.

Be aware that it is very important to use the same text string as shown in the original (DOM) page source, because parsing takes place on a one-to-one character match between your source string and the loaded web page source string.

so I have tried this, the page seems to be correct HTML, with or without the div I am trying to get.

I am trying using the following code:

GetPage:

    Dim myResponseString1 As String = ""
    Dim JobScrapeDonglePageHTML1 As HttpJob
    JobScrapeDonglePageHTML1.Initialize("JobScrapeDonglePageHTML1", Me)
    'The part after # is a URL fragment: it is handled by the browser and is not sent to the server.
    JobScrapeDonglePageHTML1.Download("http://192.168.8.1/html/content.html#deviceinformation")
    'Wait for this specific job rather than any JobDone event.
    Wait For (JobScrapeDonglePageHTML1) JobDone (job As HttpJob)
    If job.Success Then
        myResponseString1 = job.GetString2("UTF8").Trim
        'If job.Response.StatusCode = 200 Then
        'Log("the response from download dongle page information" & myResponseString1)
        'Keep a copy of the raw response so it can be compared with the browser's version.
        File.WriteString("/home/pi/CollatorStartup", "DonglePageinfo.html", myResponseString1)
        If myResponseString1.Contains("IMEI") Then
            Log("IMEI found!")
            'File.WriteString("/home/pi/CollatorStartup","DonglePage.txt", myResponseString)
        End If
        Log("scraped info page")
        'End If
    Else
        Log("job failed")
        Log(job.ErrorMessage) 'stderr is not defined here; HttpJob reports errors via ErrorMessage
    End If
    job.Release

Essentially, if I visit the page in a browser it loads fully; if it's fetched through B4J, the command line, etc., then it is missing the div with the information I am after!

MicroDrie · Oct 21, 2022

What happens if you save the correctly displayed web page and then run it through your program? If that is correct then your program is good. If that goes wrong, you have to look into the processing of the response, which you could also write to a file.
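A minimal sketch of that check, assuming the fully rendered page was saved from the browser to a file. It reads the file and collects every standalone 15-digit number as an IMEI candidate (an IMEI is 15 digits long). The sub name FindImeiCandidates and the file name SavedDonglePage.html are assumptions for illustration.

```b4x
'Returns every standalone 15-digit number found in the saved page.
'FindImeiCandidates is a hypothetical helper name.
Private Sub FindImeiCandidates (Dir As String, FileName As String) As List
    Dim result As List
    result.Initialize
    Dim html As String = File.ReadString(Dir, FileName)
    '\b keeps longer digit runs from matching; B4X strings need no backslash escaping.
    Dim m As Matcher = Regex.Matcher("\b\d{15}\b", html)
    Do While m.Find
        result.Add(m.Match)
    Loop
    Return result
End Sub
```

Usage, with an assumed file name: `Log(FindImeiCandidates("/home/pi/CollatorStartup", "SavedDonglePage.html"))`. If the saved page yields a candidate and the downloaded response does not, the parsing is fine and the difference lies in how the page is fetched.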

RodM · Oct 22, 2022

I think that you want to get content that is dynamically generated by JavaScript.

Some time ago I was looking for the same and YES, you can do it with B4J, but it's not so easy to do (at least for me).

You will need to use a WebView with WebView extras and activate JavaScript rendering on it.

There is a lot of content in the forum about this.

Good luck!

paddy12309 · Oct 22, 2022

RMMIRON said:
I think that you want to get content that is dynamically generated by JavaScript.

Some time ago I was looking for the same and YES, you can do it with B4J, but it's not so easy to do (at least for me).

You will need to use a WebView with WebView extras and activate JavaScript rendering on it.

There is a lot of content in the forum about this.

Good luck!

I think you're right, thank you!

paddy12309 · Oct 22, 2022

so processing the saved page works fine, I think I need to try to capture post js rendering

stevel05 · Oct 22, 2022

Does it show correctly if you display it in a webview? Directly from the URL rather than downloading the html first.

If it does, there may be an easier way to get the information you want.

paddy12309 · Oct 25, 2022

stevel05 said:
Does it show correctly if you display it in a webview? Directly from the URL rather than downloading the html first.

If it does, there may be an easier way to get the information you want.

the program I've written is a non UI one, my understanding is that I can't use a web view with this?

stevel05 · Oct 25, 2022

paddy12309 said:
my understanding is that I can't use a web view with this?

That is true

bdunkleysmith · Oct 25, 2022

paddy12309 said:
the program I've written is a non UI one, my understanding is that I can't use a web view with this?

Just an idea, why not make it a UI app and just not display the form holding the webview so you can take the path suggested by @RMMIRON?

stevel05 · Oct 25, 2022

bdunkleysmith said:
why not make it a UI app and just not display the form holding the webview

If you decide to do that, let me know as I was thinking of using the WebEngine directly, probably no need for a webview. But it still needs javafx

drgottjr · Oct 26, 2022

to use webview in a b4j app, you need javafx. the other choices (eg, webkit.org) are less appealing.
there is 1 way to do what you're trying to do, and within that 1 way, a couple of options.

first, a thank you to member @TILogistic for his very excellent routine at https://www.b4x.com/android/forum/t...tml-elements-from-b4j-app.136957/#post-866548
starting at post #10.

assuming the information you are looking for is visible in the web page without your having to trigger some action (eg, by clicking on a button or following an <a> link), you have to wait until the web page has loaded. the webview will raise an event (_PageFinished) when that has occurred. at that point you can execute some javascript. (technically, you could simply count to 10 from the time you load the page...)

the 2 options referred to are:
1) obtain the (full) html text. (in your case, probably not useful).
2) obtain the value of a particular element.

there are a number of theories relating to capturing the html text, among them what you refer to as "outerHTML". each method has its adherents and detractors. for every method proposed, someone has been able to show where it falls short. the version i prefer appears in the attached example. feel free to use a different version. but if you've already tried a headless chrome without success, then you know this approach is not going to work.

as to capturing the value of a particular element, you have to know its name or id, of course.

in the attached example, i've taken @TILogistic's routine and modified it slightly to show both options. first, it downloads the full html document (once loaded), then it sets the value of a known element (to make things interesting), then it captures that element and stores it in a variable (to show you how to capture the value you're looking for).

note: it is still possible the element you are looking for is not actually available after the page has loaded (eg, there could be some kind of callback involved). in that case you might have to sleep for a few seconds. there is no way to know. the point is: in such a case, if the variable is populated by an asynch call, you might have to request it more than once. but since you say it's visible on the screen (just not in the html text), waiting for the page to load should be enough.
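For readers without the attachment, here is a minimal sketch of the two options described above. It is not the attached example or @TILogistic's routine, just an illustration under stated assumptions: a B4J UI project with JavaFX available, the element id "imei" is hypothetical (check the real id with the browser's Inspect tool), and the 3 second pause and output file name are arbitrary. The form is shown here; whether the page still renders with the form hidden is untested.

```b4x
'B4J UI project (needs JavaFX). Libraries: jFX, JavaObject.
Sub Process_Globals
    Private fx As JFX
    Private MainForm As Form
    Private WebView1 As WebView
End Sub

Sub AppStart (Form1 As Form, Args() As String)
    MainForm = Form1
    WebView1.Initialize("WebView1")
    MainForm.RootPane.AddNode(WebView1, 0, 0, 600, 400)
    MainForm.Show
    WebView1.LoadUrl("http://192.168.8.1/html/content.html#device-information")
End Sub

'Raised by the WebView once the page has loaded.
Sub WebView1_PageFinished (Url As String)
    Sleep(3000) 'arbitrary pause in case the div is filled by an async call
    Dim jo As JavaObject = WebView1
    Dim engine As JavaObject = jo.RunMethod("getEngine", Null) 'javafx.scene.web.WebEngine
    'Option 1: the full html text after javascript has run.
    Dim html As String = engine.RunMethod("executeScript", Array("document.documentElement.outerHTML"))
    File.WriteString(File.DirApp, "RenderedDonglePage.html", html)
    'Option 2: the text of one element; 'imei' is a hypothetical id.
    Dim js As String = "(function(){var e=document.getElementById('imei');return e?e.textContent:'';})()"
    Dim value As String = engine.RunMethod("executeScript", Array(js))
    Log("element text: " & value)
End Sub
```

The script for option 2 returns an empty string when the element does not exist yet, so it can be run again after another short Sleep if the first attempt comes back empty.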
