B4J Question Scraping a webpage

paddy12309

Member
Licensed User
Hi everyone,

I am trying to scrape the web page served by a 4G USB dongle (I want to get the IMEI, etc.). I have tried HttpUtils, cURL, and Chromium from the command line, yet every method I have tried returns the HTML but none of the JavaScript-generated content. Is there a way I can get this?
I believe what I am after is called the outerHTML, as in the whole page after JavaScript execution?

Any help is greatly appreciated; I've been tearing my hair out on this for 2 days!

ciao!
 

paddy12309

Member
Licensed User
To add to this: I have a lot of remote devices on which I can update the B4J program remotely, so ideally I need to be able to do this through B4J code or the command line without installing anything else on the Pi. Any thoughts on how to gather the HTML of the whole page, including the JavaScript-generated content?
 

MicroDrie

Well-Known Member
Licensed User
I am trying to scrape the webpage from a USB dongle for 4g (I want to get the IMEI etc).
To quote Erel:
You could start with a B4X solution:
MiniHtmlParser - simple html parser implemented with B4X
MiniHtmlParser is a cross platform class that parses html strings and creates a tree with the various elements.
Or with a B4A-specific solution like:
jSoup HTML Parser
This solution is specific to B4A and lets you search the HTML page for a specific piece of code.
 

paddy12309

Member
Licensed User
So, I am trying to scrape a web page for the IMEI, which is in plain text in the HTML of the loaded page, not through an Android device.
Thank you for pointing me towards MiniHtmlParser, that's a useful tool! I think I will implement it in the program!

Sadly, my issue is that when I download the HTML it is incomplete: it does not contain the contents of one of the divs (presumably generated on page load in JS?).

So I have tried the command line too:

chromium command:
chromium-browser --headless --dump-dom --virtual-time-budget=10000 192.168.8.1/html/content.html#device-information

This returns the page without populating the div I need populated. The same happens with the OkHttpUtils Download function.
 

stevel05

Expert
Licensed User
Longtime User
this returns the page without populating the div I need populated
Might be worth checking if it's a security feature of the dongle. An attempt to prevent hacking maybe?
 

MicroDrie

Well-Known Member
Licensed User
this returns the page without populating the div I need populated. the same happens with OKHttpUtils download function.

  1. Open Firefox or Chrome browser
  2. Open web page
  3. Open new tab page with this url: https://formatteronline.com/html
  4. Open new tab page and load your page
  5. Right click loaded web page
  6. Show source web page
  7. Select all page source with Ctrl-A
  8. Copy all page source with Ctrl-C
  9. Go back to https://formatteronline.com/html
  10. Paste your web page source
  11. Click format button
If there is an error on your dongle web page then a red line is shown; if not, your page source is OK.

It is possible that the web page is malformed. In that case go to your loaded dongle web page, right-click on it, and click Inspect to show the DOM source. Go to the first line and choose Edit as HTML, then repeat steps 7 to 11.

Be aware that it is very important to use the same text string as shown in the original (DOM) page source, because parsing relies on a one-to-one character match between your source string and the loaded web page source string.
 

paddy12309

Member
Licensed User
So I have tried this; the page seems to be valid HTML, with or without the div I am trying to get.

I am trying using the following code:

GetPage:
    Dim myResponseString1 As String = ""
    Dim JobScrapeDonglePageHTML1 As HttpJob
    JobScrapeDonglePageHTML1.Initialize("JobScrapeDonglePageHTML1", Me)
    'The #fragment is never sent to the server, so this fetches only the static content.html.
    JobScrapeDonglePageHTML1.Download("http://192.168.8.1/html/content.html#deviceinformation")
    'Wait for this specific job rather than for any JobDone event.
    Wait For (JobScrapeDonglePageHTML1) JobDone (job As HttpJob)
    If job.Success Then
        myResponseString1 = job.GetString2("UTF8").Trim
        'Keep a copy of the raw response for inspection.
        File.WriteString("/home/pi/CollatorStartup", "DonglePageinfo.html", myResponseString1)
        If myResponseString1.Contains("IMEI") Then
            Log("IMEI found!")
        End If
        Log("scraped info page")
    Else
        Log("job failed")
        'Log the error reported by the job itself.
        Log(job.ErrorMessage)
    End If
    job.Release

Essentially, if I visit the page in a browser it loads fully; if it is fetched through B4J, the command line, etc., then it is missing the div with the information I am after!
 

MicroDrie

Well-Known Member
Licensed User
What happens if you save the correctly displayed web page and then run it through your program? If that is correct then your program is good. If that goes wrong, you have to look into the processing of the response, which you could also write to a file.
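
A minimal B4J sketch of that check, assuming the fully rendered page was saved from the browser as DonglePageSaved.html (a hypothetical file name) in the same folder the program already writes to:

```b4x
'Assumption: DonglePageSaved.html is a hypothetical file saved by hand from the browser.
Dim saved As String = File.ReadString("/home/pi/CollatorStartup", "DonglePageSaved.html")
If saved.Contains("IMEI") Then
    Log("IMEI found in the saved page: the search itself works.")
Else
    Log("IMEI not found even in the saved page: check the search string.")
End If
```

If the saved copy passes and the downloaded copy does not, the difference lies in how the page is fetched, not in how it is searched.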
 

RodM

Member
Licensed User
I think that you want to get content that is dynamically generated by JavaScript.

Some time ago I was looking for the same and YES, you can do it with B4J, but it's not so easy to do (at least for me).

You will need to use a WebView with WebView extras and activate JavaScript rendering on it.

There is a lot of content in the forum about this.

Good luck 🤞
 

paddy12309

Member
Licensed User
I think you're right, thank you!
 

stevel05

Expert
Licensed User
Longtime User
Does it show correctly if you display it in a webview? Directly from the URL rather than downloading the html first.

If it does, there may be an easier way to get the information you want.
 

paddy12309

Member
Licensed User
The program I've written is a non-UI one; my understanding is that I can't use a WebView with this?
 

stevel05

Expert
Licensed User
Longtime User
Why not make it a UI app and just not display the form holding the WebView?
If you decide to do that, let me know, as I was thinking of using the WebEngine directly, so there is probably no need for a WebView. But it still needs JavaFX.
 

drgottjr

Expert
Licensed User
Longtime User
to use webview in a b4j app, you need javafx. the other choices (eg, webkit.org) are less appealing.
there is 1 way to do what you're trying to do, and within that 1 way, a couple of options.

first, a thank you to member @TILogistic for his very excellent routine at https://www.b4x.com/android/forum/t...tml-elements-from-b4j-app.136957/#post-866548
starting at post #10.

assuming the information you are looking for is visible in the web page without your having to trigger some action (eg, by clicking on a button or following an
<a> link), you have to wait until the web page has loaded. the webview will raise an event (_PageFinished) when that has occurred. at that point you can
execute some javascript. (technically, you could simply count to 10 from the time you load the page...)

the 2 options referred to are:
1) obtain the (full) html text. (in your case, probably not useful).
2) obtain the value of a particular element.

there are a number of theories relating to capturing the html text, among them what you refer to as "outerHTML". each method has its adherents and
detractors. for every method proposed, someone has been able to show where it falls short. the version i prefer appears in the attached example.
feel free to use a different version. but if you've already tried a headless chrome without success, then you know this approach is not going to work.

as to capturing the value of a particular element, you have to know its name or id, of course.

in the attached example, i've taken @TILogistic's routine and modified it slightly to show both options. first, it downloads the full html document (once
loaded), then it sets the value of a known element (to make things interesting), then it captures that element and stores it in a variable (to show you how
to capture the value you're looking for).

note: it is still possible the element you are looking for is not actually available after the page has loaded (eg, there could be some kind of callback involved).
in that case you might have to sleep for a few seconds. there is no way to know. the point is: in such a case, if the variable is populated by an asynch call,
you might have to request it more than once. but since you say it's visible on the screen (just not in the html text), waiting for the page to load should be
enough.
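
A minimal sketch of the flow described above, not the attached example itself. It assumes a B4J UI project with the jFX and JavaObject libraries, JavaFX available on the device, and a hypothetical element id of imei; the real id has to be read from the page with the browser's Inspect tool. The form is never shown, and the retry loop covers the case where the element is filled in by an asynchronous call after the page has loaded. Whether the page populates fully while the form stays hidden is an assumption to verify on the device.

```b4x
'B4J UI project. Libraries: jFX, JavaObject.
Sub Process_Globals
    Private fx As JFX
    Private MainForm As Form
    Private WebView1 As WebView
End Sub

Sub AppStart (Form1 As Form, Args() As String)
    MainForm = Form1
    'MainForm.Show is deliberately not called, so no window appears.
    WebView1.Initialize("WebView1")
    WebView1.LoadUrl("http://192.168.8.1/html/content.html#deviceinformation")
End Sub

Private Sub WebView1_PageFinished (Url As String)
    'Reach the underlying javafx.scene.web.WebEngine through JavaObject.
    Dim jo As JavaObject = WebView1
    Dim engine As JavaObject = jo.RunMethodJO("getEngine", Null)

    'Option 1: the full document as it stands after the scripts have run.
    Dim html As String = engine.RunMethod("executeScript", Array("document.documentElement.outerHTML"))
    File.WriteString(File.DirApp, "DonglePageinfo.html", html)

    'Option 2: the text of one element. 'imei' is a hypothetical id.
    Dim js As String = "(function(){var e=document.getElementById('imei'); return e ? e.textContent : '';})()"
    Dim value As String = ""
    For i = 1 To 10
        value = engine.RunMethod("executeScript", Array(js))
        If value.Trim.Length > 0 Then Exit
        Sleep(1000) 'Give an asynchronous call time to fill the element, then ask again.
    Next
    Log("IMEI element text: " & value)
End Sub
```

If the element is still empty after the retries, the value is probably not in the DOM under that id, and the id or the waiting strategy needs another look.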
 

Attachments

  • paddy12309.zip
    2.4 KB · Views: 94