Android Question scraped web

PABLO2013

Well-Known Member
Licensed User
Longtime User
Greetings. I have some questions about scraping web pages.
I can see the information (article prices) on my screen, but when I save the page and then inspect its source code, the data (the prices) is not there. How could I scrape this? The pages I want to scrape are price lists, but they seem to have some protection against scraping. If you know how such a page could be scraped, please let me know. Thanks.
 

Brian Dean

Well-Known Member
Licensed User
Longtime User
PABLO2013 said: "... but I see that it has some protection system against scraping"

I don't think that this has much to do with protecting the data. Prices and related data will appear on several web pages and will change frequently. It makes sense to keep this data in one place - a database - and to use server code to build it into different web pages on demand. Sometimes the pages are delivered as complete HTML, but on commercial sites things are usually more sophisticated: the data is fetched separately after the page has loaded. In that case the data does not appear in the page source, so it cannot be scraped from the saved page directly.

If you are smart enough you can discover how the data is retrieved - I think that web sniffing can help but that is another complex topic. Unfortunately I am not that smart but I would like to learn how it is done. Maybe someone on the forum can point us in the right direction. Of course, this has nothing to do with B4X.
 

PABLO2013

Thanks to both (aeric / Brian).
I use jsoup (thanks, TheJinJ), and yes, Brian Dean, this does not depend on B4X. I can read data from the main page, but I cannot access the data that appears on the screen, and I do not know how it could be scraped. Basically, I would like some guidance on what is really happening, so that I can then try to scrape it.

aeric, do you know how to do it, or what is going on with the dynamic content using JavaScript that you mentioned?
Thanks.
 

aeric

Expert
Licensed User
Longtime User
Have you tried downloading the page using OkHttpUtils2?
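
A quick way to try this is to download the raw page and check whether text that is visible on screen is present in it. This is only a minimal sketch, assuming the OkHttpUtils2 library is referenced in the project; CheckRawHtml is a hypothetical Sub name, the URL is the search page discussed later in this thread, and "Cemento gris" is only an example of on-screen text.

```b4x
' Hypothetical helper: downloads the raw page source, without running any JavaScript
Sub CheckRawHtml
    Dim job As HttpJob
    job.Initialize("", Me)
    job.Download("https://www.construplaza.com/Construplaza/Pedidos?busqueda=cemento")
    Wait For (job) JobDone(job As HttpJob)
    If job.Success Then
        Dim html As String = job.GetString
        Log("Raw HTML length: " & html.Length)
        ' Example text only: replace with something you can see on the rendered page
        Log("Contains on-screen text: " & html.Contains("Cemento gris"))
    Else
        Log("Download failed: " & job.ErrorMessage)
    End If
    job.Release
End Sub
```

If the second log line prints false while the text is visible in a browser, the data is most likely added by JavaScript after the page loads, and a plain download will not be enough.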
 

Hasan Ali

Member
Licensed User
Longtime User
Try this -
B4X:
Sub Process_Globals
End Sub

Sub Globals
   Dim WebViewExtras1 As WebViewExtras
   Dim WebView1 As WebView
End Sub

Sub Activity_Create(FirstTime As Boolean)
   Activity.LoadLayout("layoutMain")
 
   '   add the B4A javascript interface to the WebView
   WebViewExtras1.addJavascriptInterface(WebView1, "B4A")
 
   '   adding a WebChromeClient will log all browser console message to the android log
   '   so any webpage or javascript errors will be logged
   WebViewExtras1.addWebChromeClient(WebView1, "")
 
   '   now load a web page
   WebView1.LoadUrl("https://www.construplaza.com/")
End Sub

Sub Activity_Resume
End Sub

Sub Activity_Pause (UserClosed As Boolean)
End Sub

Sub WebView1_PageFinished (Url As String)
   '   Now that the web page has loaded we can get the page content as a String
   '   see the documentation http://www.b4x.com/forum/additional-libraries-classes-official-updates/12453-webviewextras.html#post70053 for details of the second parameter callUIThread

   '   wait 2 seconds for page content to fully load
   Sleep(2000)
   Dim Javascript As String
   Javascript="B4A.CallSub('Process_HTML', false, document.documentElement.outerHTML)"
 
   WebViewExtras1.executeJavascript(WebView1, Javascript)
 
   Log("page loaded")
End Sub

Sub Process_HTML(Html As String)
   '   This is the Sub that the web page sends its HTML content to
   '   Log may truncate a large page, so you will not see all of the HTML in the log, but the Html String should still contain the whole page

  '    Here you can use jsoup to extract data from the Html string

   Log("Process_HTML: "&Html)
End Sub

Code copied from: https://www.b4x.com/android/forum/t...h-webview-and-webviewextras.34418/post-202096
 

PABLO2013

Thanks, Hasan. If I put "Cemento" in the search, the page shows the matching items on screen, but when the WebView returns the page code, the matching items are not in it. That is the point. Thank you.

WebView1.LoadUrl("https://.....")

Do you know what is happening here, or how it could be solved? I mean, how to get the WebView to include the matching data in the code it returns.
 

Quandalle

Member
Licensed User
PABLO2013 said: "... but when the WebView returns the page code, the matching items are not shown."
If you view the HTML content through the B4A log, remember that the log displays only the first 4000 characters; the rest is truncated on display, even if the actual string is much longer than 4000 characters.
On the page https://www.construplaza.com/Construplaza/Pedidos?busqueda=cemento the search results appear well beyond the first 4000 characters.
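
One way to confirm that the string itself is complete is to log it in pieces. This is a minimal sketch that assumes the 4000-character display limit mentioned above; LogLong is a hypothetical helper name, not part of B4A.

```b4x
' Hypothetical helper: logs a long string in chunks of at most 4000 characters
Sub LogLong(Text As String)
    Dim chunkSize As Int = 4000
    Dim pos As Int = 0
    Do While pos < Text.Length
        ' End index of this chunk, capped at the end of the string
        Dim endPos As Int = Min(pos + chunkSize, Text.Length)
        Log(Text.SubString2(pos, endPos))
        pos = endPos
    Loop
End Sub
```

Calling LogLong(Html) from Process_HTML would then print the whole page across several log entries, so the search results can be found further down.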

Alternatively, instead of capturing the whole HTML page, the JavaScript can return only the HTML of the result area, for example by replacing
B4X:
  Javascript="B4A.CallSub('Process_HTML', false, document.documentElement.outerHTML)"
with
B4X:
      Javascript="B4A.CallSub('Process_HTML', false, document.getElementById('SearchResult').innerHTML)"
 

Hasan Ali

@PABLO2013, Use the following code to extract information from https://www.construplaza.com/Construplaza/Pedidos?busqueda=cemento

B4X:
Sub WebView1_PageFinished (Url As String)
    ' Now that the web page has loaded we can get the page content as a String
    ' see the documentation http://www.b4x.com/forum/additional-libraries-classes-official-updates/12453-webviewextras.html#post70053 for details of the second parameter callUIThread

    ' wait 2-3 seconds for page content to fully load
    Sleep(2000)
    
'    Dim Javascript As String
'    Javascript="B4A.CallSub('Process_HTML', false, document.documentElement.outerHTML)"
'    WebViewExtras1.executeJavascript(WebView1, Javascript)

    WebViewExtras1.executeJavascript(WebView1, $"
    var resultArr = [];
    
    var obj = document.querySelectorAll(".Producto");
    if (obj.length !== 0) {
        //console.log(obj);
        obj.forEach(function myFunction(element, index) {
            var Foto = element.querySelector(".Foto img").getAttribute("src");
            var Marca = element.querySelector(".Descripcion .Marca").innerText;
            var Desc = element.querySelector(".Descripcion a").getAttribute("title");
            Desc = Desc.split(" - ", 2)[1];
            var Price = element.querySelector(".Precio").innerText;
            
            var jsonObj = {"foto":Foto, "marca":Marca, "desc":Desc, "price":Price};
            resultArr.push(jsonObj);
            
            //console.log("Index: ", index);
            //console.log("Foto: ", "["+Foto+"]");
            //console.log("Marca: ", "["+Marca+"]");
            //console.log("Desc: ", "["+Desc+"]");
            //console.log("Price: ", "["+Price+"]");
            //console.log("--------------------------------------");
        });
    } else {
        // no data found
        console.log("No data found");
    }
    var result = JSON.stringify(resultArr);
    //console.log(result);

    B4A.CallSub('productList', false, result);
    "$)
 
    Log("page loaded")
End Sub

Sub productList(jsonStr As String) 'ignore
'    Log(jsonStr)
    Dim js As JSONParser
    js.Initialize(jsonStr)
    Dim data As List = js.NextArray
    For Each m As Map In data
        LogColor($"Foto: ${m.Get("foto")}"$, Colors.Blue)
        LogColor($"Marca: ${m.Get("marca")}"$, Colors.Magenta)
        LogColor($"Desc: ${m.Get("desc")}"$, Colors.Black)
        LogColor($"Price: ${m.Get("price")}"$, Colors.Blue)
        Log("--------------------------------------")
    Next
End Sub
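
The fixed Sleep(2000) above may be too short on a slow connection and longer than needed on a fast one. As an alternative, the page can be polled until the product elements exist. This is only a sketch under assumptions: it replaces the PageFinished Sub above, ProductsReady is a hypothetical Sub name, and the 250 ms interval and 40 attempts (about 10 seconds) are arbitrary values.

```b4x
Sub WebView1_PageFinished (Url As String)
    ' Poll every 250 ms until at least one .Producto element exists, or give up after 40 tries
    WebViewExtras1.executeJavascript(WebView1, $"
    (function () {
        var tries = 0;
        var timer = setInterval(function () {
            var items = document.querySelectorAll(".Producto");
            tries++;
            if (items.length > 0 || tries >= 40) {
                clearInterval(timer);
                B4A.CallSub('ProductsReady', false, String(items.length));
            }
        }, 250);
    })();
    "$)
End Sub

' Hypothetical callback: receives the number of product elements found (as a String)
Sub ProductsReady(Count As String) 'ignore
    Log("Products found in the rendered page: " & Count)
End Sub
```

Once ProductsReady reports a count above zero, the extraction script from the post above can be executed without an extra delay.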
 

Quandalle

Just for fun, a more condensed version of the JavaScript part of Hasan Ali's post (the keys are kept in lowercase so that the productList Sub above can still read them):
B4X:
WebViewExtras1.executeJavascript(WebView1, $"
    let resultArr = [];
    document.querySelectorAll(".Producto").forEach(element => resultArr.push({
        foto : element.querySelector(".Foto img").src,
        marca : element.querySelector(".Descripcion .Marca").innerText,
        desc : element.querySelector(".Descripcion a").title.split(" - ", 2)[1],
        price : element.querySelector(".Precio").innerText,}));
    B4A.CallSub('productList', false, JSON.stringify(resultArr));
"$)
 

Hasan Ali

Thank you. Actually, I'm a beginner in JavaScript.
 

Quandalle

No problem, your code is correct and well written.
Just one point, for information: in modern JavaScript it is better to declare variables inside a function with const or let, because they are scoped to the enclosing block. A variable declared with var is scoped to the whole function, and this sometimes causes errors.
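
The difference can be seen by running a small test script in the WebView. This is a minimal sketch, assuming WebView1 and WebViewExtras1 are set up as in the earlier posts, with the WebChromeClient added so that console.log output reaches the B4A log.

```b4x
WebViewExtras1.executeJavascript(WebView1, $"
    (function () {
        for (var i = 0; i < 3; i++) { }
        console.log("var: i is still visible after the loop, i = " + i);
        for (let j = 0; j < 3; j++) { }
        console.log("let: typeof j after the loop is " + typeof j);
    })();
"$)
```

The first message reports i = 3, because var is scoped to the whole function. The second reports undefined, because j exists only inside its loop block.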
 

PABLO2013

Many thanks to the experts Hasan Ali and Quandalle.
The code works very well.
There is another aspect: if I search for articles that begin with the letter "A" (Aceite, Alambre ... = 6198 results), the website splits them across several pages (1, 2, 3 ...). How can I automate the scraping so that it covers page 1 and then the following pages (2, 3 ...)? Thanks.
 

Hasan Ali

@PABLO2013 Use the code below to get the data more easily. It calls the Algolia search API that the website itself uses, so no WebView is needed:
B4X:
' Query - Text query to search
' Limit - Results limit
' Sort - Available sort options are: Products, Products_Precio_desc, Products_Precio_asc, Products_Descrip_asc
' Filter - Check the website for available marca for filter
Sub ApiCall(Query As String, Limit As Int, Sort As String, MarcaFilter As List) As ResumableSub
    Dim reqUrl As String = "https://mucjnsqczh-3.algolianet.com/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(4.5.1)%3B%20Browser%20(lite)%3B%20instantsearch.js%20(4.21.0)%3B%20JS%20Helper%20(3.4.4)&x-algolia-api-key=548b5dedba445fcdb9435d2dd720562a&x-algolia-application-id=MUCJNSQCZH"
    '-------------------------------------
    Dim su As StringUtils
    Dim Marcas As String
    If MarcaFilter.IsInitialized And MarcaFilter.Size > 0 Then
        ' Build a prefixed copy so the caller's list is not modified (safe for repeated calls)
        Dim prefixed As List
        prefixed.Initialize
        For Each marca As String In MarcaFilter
            prefixed.Add("Marca:" & marca)
        Next
        Dim js As JSONGenerator
        js.Initialize2(prefixed)
        Marcas = su.EncodeUrl($"[${js.ToString}]"$, "UTF8")
    End If
    Log($"Marcas: ${Marcas}"$)
    '-------------------------------------
    ' URL-encode the query so that spaces, quotes and "&" cannot break the params string
    Dim encodedQuery As String = su.EncodeUrl(Query, "UTF8")
    Dim payload As String = $"{"requests":[{"indexName":"${Sort}","params":"clickAnalytics=true&query=${encodedQuery}&hitsPerPage=${Limit}&maxValuesPerFacet=10000&highlightPreTag=__ais-highlight__&highlightPostTag=__%2Fais-highlight__&page=0&userToken=anonymous-84f57e4d-6e70-47c8-befa-7e2ed4a6cab0&facets=%5B%22Marca%22%2C%22Unidad%22%2C%22Precio%22%2C%22Departamento%22%5D&tagFilters=&facetFilters=${Marcas}"}]}"$
    Log("payload: " & payload)
    
    Dim response As Map
    Dim job As HttpJob
    job.Initialize("", Me)
    job.PostString(reqUrl, payload)
    
'    Try
    Wait For (job) JobDone(job As HttpJob)
    If job.Success Then
'        Log("From server: " & job.GetString)
        Dim jp As JSONParser
        Try
            jp.Initialize(job.GetString)
            response = jp.NextObject
            
            Dim results As List = response.Get("results")
'            Dim hits As List = results.Get(0)
            response = results.Get(0)
        Catch
            Log("#JSONParser Error: " & LastException)
        End Try
    Else
        Log("#Job Error: " & job.ErrorMessage)
    End If
'    Catch
'        Log(LastException)
'    End Try
    job.Release
    Return response
End Sub

How to use:
B4X:
    Dim marcaFilter As List
'    marcaFilter = Array As String("España", "Lanco", "Holcim") ' uncomment this line if you want to filter results by Marca
    
    Wait For (ApiCall("cemento", 10, "Products_Descrip_asc", marcaFilter)) complete (Result As Map)
    Dim hits As List = Result.Get("hits")
    If hits.Size > 0 Then
        For Each m As Map In hits
            
            ' You can extract as much information as you need.
            ' Sample data:
'            {
'               "Descripcion":"Cemento gris Fuerte saco 50 kg Holcim",
'               "Articulo":"04421",
'               "CodBarras":"7441086600006",
'               "Departamento":"Obra Gris",
'               "Categoria":"Obra Gris > Cemento",
'               "Subcategoria":"Cemento Gris",
'               "Marca":"Holcim",
'               "Otros":"Gris",
'               "Image":"https://www.construplaza.com/Content/Thumbnails/04421.png",
'               "Precio":6450.00000026,
'               "PrecioDescuento":6450.0,
'               "Descuento":0.0,
'               "TieneDescuento":False,
'               "Unidad":"unidad",
'               "MultiploVenta":1.0,
'               "OrdenMinima":1.0,
'               "Frecuencia":3461,
'               "MontoRank":602901370.678194,
'               "Ribbon":Null,
'               "Origen":0,
'               "objectID":"04421",
'               "_highlightResult":{
'                  "Descripcion":{
'                     "value":"__ais-highlight__Cemento__/ais-highlight__ gris Fuerte saco 50 kg Holcim",
'                     "matchLevel":"full",
'                     "fullyHighlighted":False,
'                     "matchedWords":[
'                        "cemento"
'                     ]
'                  },
'                  "Articulo":{
'                     "value":"04421",
'                     "matchLevel":"none",
'                     "matchedWords":[
'                        
'                     ]
'                  },
'                  "CodBarras":{
'                     "value":"7441086600006",
'                     "matchLevel":"none",
'                     "matchedWords":[
'                        
'                     ]
'                  },
'                  "Marca":{
'                     "value":"Holcim",
'                     "matchLevel":"none",
'                     "matchedWords":[
'                        
'                     ]
'                  }
'               }
'            }
            
            LogColor($"Foto: ${m.Get("Image")}"$, Colors.Blue)
            LogColor($"Marca: ${m.Get("Marca")}"$, Colors.Magenta)
            LogColor($"Desc: ${m.Get("Descripcion")}"$, Colors.Black)
            LogColor($"Price: ${m.Get("Precio")}"$, Colors.Blue)
            LogColor($"Price (without fractions): $1.0{m.Get("Precio")}"$, Colors.Blue)
            Log("--------------------------------------")
        Next
    Else
        Log("No data found")
    End If
 

PABLO2013

I know I ask a lot (sorry).

If you look closely, there is a field "Articulo":"04421". How could I write the query so that this value (04421) is passed in as a variable, so that I can run the queries inside a For i = 0 To ... loop?

The query "cemento" works well; it gives me about 1000 articles (more or less).

Let's say each "cemento" query returns at most 1000 articles (out of 1200). How can I access the remaining articles (1001, 1002 ... 1200) with the same query? Thank you.
 

Hasan Ali

@PABLO2013 Updated. Now you can change the results page.
B4X:
' Query - Text query to search
' Limit - Results limit (Max. limit: 1000)
' Sort - Available sort options are: Products, Products_Precio_desc, Products_Precio_asc, Products_Descrip_asc
' Filter - Check the website for available marca for filter
' Page - Specify the page number of the result (Page start from: 0)
Sub ApiCall(Query As String, Limit As Int, Sort As String, MarcaFilter As List, Page As Int) As ResumableSub
    Dim reqUrl As String = "https://mucjnsqczh-1.algolianet.com/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(4.5.1)%3B%20Browser%20(lite)%3B%20instantsearch.js%20(4.21.0)%3B%20JS%20Helper%20(3.4.4)&x-algolia-api-key=548b5dedba445fcdb9435d2dd720562a&x-algolia-application-id=MUCJNSQCZH"
    '-------------------------------------
    Dim su As StringUtils
    Dim Marcas As String
    If MarcaFilter.IsInitialized And MarcaFilter.Size > 0 Then
        ' Build a prefixed copy so the caller's list is not modified (safe when calling once per page)
        Dim prefixed As List
        prefixed.Initialize
        For Each marca As String In MarcaFilter
            prefixed.Add("Marca:" & marca)
        Next
        Dim js As JSONGenerator
        js.Initialize2(prefixed)
        Marcas = su.EncodeUrl($"[${js.ToString}]"$, "UTF8")
    End If
    Log($"Marcas: ${Marcas}"$)
    '-------------------------------------
    ' URL-encode the query so that spaces, quotes and "&" cannot break the params string
    Dim encodedQuery As String = su.EncodeUrl(Query, "UTF8")
    Dim payload As String = $"{"requests":[{"indexName":"${Sort}","params":"clickAnalytics=true&query=${encodedQuery}&hitsPerPage=${Limit}&maxValuesPerFacet=10000&highlightPreTag=__ais-highlight__&highlightPostTag=__%2Fais-highlight__&page=${Page}&userToken=anonymous-84f57e4d-6e70-47c8-befa-7e2ed4a6cab0&facets=%5B%22Marca%22%2C%22Unidad%22%2C%22Precio%22%2C%22Departamento%22%5D&tagFilters=&facetFilters=${Marcas}"}]}"$
    Log("payload: " & payload)
    
    Dim response As Map
    Dim job As HttpJob
    job.Initialize("", Me)
    job.PostString(reqUrl, payload)
    
'    Try
    Wait For (job) JobDone(job As HttpJob)
    If job.Success Then
'        Log("From server: " & job.GetString)
        Dim jp As JSONParser
        Try
            jp.Initialize(job.GetString)
            response = jp.NextObject
            
            Dim results As List = response.Get("results")
'            Dim hits As List = results.Get(0)
            response = results.Get(0)
        Catch
            Log("#JSONParser Error: " & LastException)
        End Try
    Else
        Log("#Job Error: " & job.ErrorMessage)
    End If
'    Catch
'        Log(LastException)
'    End Try
    job.Release
    Return response
End Sub

To check out how many products and pages there are:
B4X:
Wait For (ApiCall("", 1000, "Products", marcaFilter, 0)) complete (Result As Map)
Dim totalProducts As Int = Result.Get("nbHits")
Dim totalPages As Int = Result.Get("nbPages")

Log($"Total products: ${totalProducts}"$)
Log($"Total pages: ${totalPages}"$)

Note: Use "Products" in the sorting option. Other options do not show complete results.
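
The page count can then drive a loop that collects the hits from every page. This is a minimal sketch, assuming the five-parameter ApiCall Sub above is in the same module; FetchAllPages is a hypothetical Sub name. Whether the service actually returns pages beyond the first 1000 hits is not confirmed in this thread, so the loop simply stops when a response contains no "hits" key.

```b4x
' Hypothetical helper: walks the result pages of a query and returns all hits in one List
Sub FetchAllPages(Query As String) As ResumableSub
    Dim allHits As List
    allHits.Initialize
    Dim noFilter As List          ' left uninitialized on purpose: no Marca filter
    Dim page As Int = 0
    Dim totalPages As Int = 1     ' replaced by nbPages after the first response
    Do While page < totalPages
        Wait For (ApiCall(Query, 1000, "Products", noFilter, page)) Complete (Result As Map)
        ' ApiCall returns an uninitialized Map when the request or the parsing fails
        If Result.IsInitialized = False Or Result.ContainsKey("hits") = False Then Exit
        totalPages = Result.Get("nbPages")
        Dim hits As List = Result.Get("hits")
        allHits.AddAll(hits)
        page = page + 1
    Loop
    Return allHits
End Sub
```

It could be called with Wait For (FetchAllPages("cemento")) Complete (AllHits As List), after which each item in AllHits is a Map with the fields shown in the sample data above.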
 