Android Question Can we get the webpage content (text only) from a Webview?

wonder · Sep 23, 2016

Is it possible to extract the text being displayed by a webpage loaded in webview?

If not, any ideas for such an algorithm?
(The equivalent of opening a webpage in a browser, then Ctrl+A, Ctrl+C, and Ctrl+V into Notepad.)

susu · Sep 23, 2016

I used HttpUtils with Job.GetString to get the webpage content directly.

If you still want to get it through the WebView, check this thread:
https://b4x.com/android/forum/threa...webpage-with-webview-and-webviewextras.34418/

wonder · Sep 23, 2016

susu said:
I used HttpUtils with Job.GetString to get the webpage content directly.

Does it grab the entire HTML code or only the rendered text? I'm only after the webpage output, meaning the human-readable content itself.

susu · Sep 23, 2016

wonder said:
Does it grab the entire HTML code or only the rendered text?

Only the HTML code, just as when you open the webpage with view-source. But you can save the HTML to a file and then load it into a WebView (however, links to images, CSS, etc. may be broken).

wonder · Sep 23, 2016

I'm looking for the easiest way to get the content, discarding HTML, scripts, forms, etc.

sorex · Sep 23, 2016

I've never used that library, but maybe it lets you inject JavaScript and receive data back from it.

Then you could use data=document.body.textContent; and pull the data value back into your app.

Note that textContent may include unwanted stuff such as inline JavaScript, so you might need to filter that out.

If the text you want is in a block with an id, you could use data=document.getElementById('id').textContent; instead. Then you avoid the script-filtering misery.

moster67 · Sep 23, 2016

HttpUtils with Job.GetString, as @susu said, plus regex is common practice for scraping webpages and their contents. Many Python web-scraping scripts are written this way. Of course, you need to examine the page source first so you can set up the regex properly.

An alternative could be the JTidy library:
https://www.b4x.com/android/forum/threads/jtidy-library-convert-html-pages-to-xml.27038/
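
To make the fetch-then-regex idea above concrete, here is a minimal sketch in Python (the language mentioned above for scraping scripts), not B4A. It assumes the page source has already been downloaded into a string, which is what Job.GetString would give you. The helper name strip_tags and the sample HTML are made up for illustration, and regex stripping like this is only a rough approach that can trip over unusual markup.

```python
import html
import re


def strip_tags(page_source: str) -> str:
    """Rough regex-based text extraction (hypothetical helper, not a B4A API)."""
    # Remove <script> and <style> blocks together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", page_source)
    # Remove HTML comments.
    text = re.sub(r"(?s)<!--.*?-->", " ", text)
    # Remove every remaining tag, keeping the text between tags.
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Decode entities such as &amp; into plain characters.
    text = html.unescape(text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample standing in for a downloaded page.
sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello &amp; welcome</p>"
    "<script>var x = 1;</script></body></html>"
)

if __name__ == "__main__":
    print(strip_tags(sample))
```

The script and style blocks are removed first because their contents would otherwise survive the generic tag removal and end up in the output.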

inakigarm · Sep 23, 2016

I think you will find the JSoup library's methods and properties very helpful; you can directly access elements, tags, etc. from the original HTML page.
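
As a rough illustration of the parser-based approach (as opposed to regex), here is a sketch using Python's standard html.parser module. This is a stand-in to show the idea, not JSoup and not its API; the class name TextExtractor and the function visible_text are hypothetical. It assumes the HTML is already in a string and simply skips anything inside script, style, or noscript tags.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring script/style/noscript (hypothetical helper)."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def visible_text(page_source):
    parser = TextExtractor()
    parser.feed(page_source)
    parser.close()  # flush any buffered trailing text
    return parser.text()


if __name__ == "__main__":
    print(visible_text("<h1>Title</h1><p>Hello <b>world</b></p><script>var x = 1;</script>"))
```

A real parser copes with nested and malformed markup better than regex does, which is the main reason to reach for a library like JSoup on the Android side.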

sorex · Sep 24, 2016

I would just go for the okHTTP method as well, but it's not clear why he wants to grab it from the WebView. There must be some reason.

wonder · Sep 24, 2016

I'm ok with okHTTP...

WebView was just the first thing that came to my mind. I want to experiment with web crawling and data mining. As a starting point, I'll try to download the entire Wikipedia...

No, but seriously, getting some content (text) from Wikipedia would be a great starting point.
