Android Question RegEx Matcher with HTML Code

Discussion in 'Android Questions' started by hasexxl1988, Nov 12, 2017.

  1. hasexxl1988

    I have the following problem:

    I have HTML code from a website:

    <span itemprop='name'>
                    Ferrari TestCar
    I need the name of the car.

    I tried the following code:

    If Job.JobName = "PageJob" Then
        ' Note: this pattern expects literal double quotes around the name.
        Dim mAutoName As Matcher = Regex.Matcher("<span itemprop='name'>""([^""]+)""</span>", Job.GetString)
        Do While mAutoName.Find
            Log(mAutoName.Group(1))
        Loop
    End If
    The result is only: []

    The download and the Job function work perfectly with my ImageDownloader.

    Extracting the image URLs with this code works:
    Dim m As Matcher = Regex.Matcher("src=\""https://mywebsite/mmo([^""]+)""", Job.GetString)
    I have found the RegEx pattern list:

    Unfortunately, I do not know how to put together a pattern that returns the value with the HTML code removed.
  2. sorex

    First replace the line feeds, tabs and double spaces (it makes things a lot easier), and then try:

    Regex.Matcher("<span itemprop='name'>(.*?)</span>", Job.GetString)
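The suggestion above can be sketched roughly as follows. This is plain Java rather than B4A, on the assumption that B4A's Regex wraps java.util.regex so the same pattern should behave the same way. The class name, the helper names and the sample HTML (including its closing </span>) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NormalizedSpanMatch {
    // Same pattern as in the post: lazy capture between the two tags.
    private static final Pattern NAME =
            Pattern.compile("<span itemprop='name'>(.*?)</span>");

    /** Collapses every run of whitespace (line feeds, tabs, spaces) into one space. */
    static String collapseWhitespace(String html) {
        return html.replaceAll("\\s+", " ");
    }

    /** Returns the trimmed text of the first matching span, or null if there is none. */
    static String firstName(String html) {
        Matcher m = NAME.matcher(collapseWhitespace(html));
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample shaped like the snippet from post #1.
        String html = "<span itemprop='name'>\n\t\t    Ferrari TestCar\n</span>";
        System.out.println(firstName(html));
    }
}
```

Collapsing the whitespace first means the pattern never has to cross a line break, which is why the simple (.*?) is enough here.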
  3. hasexxl1988

    Not working :/

    I tried:
    If Job.JobName = "PageJob" Then
        Dim xtemp As String
        xtemp = Job.GetString
        Log("IndexOf: " & xtemp.IndexOf("<span itemprop='name'>"))
        Dim m As Matcher = Regex.Matcher("<span itemprop='name'>(.*?)</span>", xtemp)
        Do While m.Find
            Log(m.Group(1))
        Loop
    End If
    Log result: IndexOf: 112851

    IndexOf finds <span itemprop='name'> in the string, but the Matcher finds nothing.
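A likely cause of this difference: in Java's regex engine, "." does not match line terminators by default, so (.*?) cannot get from the opening tag to </span> when the name sits on its own line. The sketch below reproduces that in plain Java (assuming B4A's Regex behaves like java.util.regex) and shows the inline (?s) flag as one possible fix. The class name, method names and sample HTML are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DotAllDemo {
    static final String PLAIN = "<span itemprop='name'>(.*?)</span>";
    // (?s) turns on DOTALL mode, so "." also matches line breaks.
    static final String DOTALL = "(?s)" + PLAIN;

    /** Returns the trimmed first capture, or null when the pattern does not match. */
    static String capture(String regex, String html) {
        Matcher m = Pattern.compile(regex).matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Hypothetical page fragment with the name on its own line.
        String html = "<html>\n<span itemprop='name'>\n        Ferrari TestCar\n</span>\n</html>";
        System.out.println(html.indexOf("<span itemprop='name'>") >= 0); // the tag itself is present
        System.out.println(capture(PLAIN, html));   // no match across the line break
        System.out.println(capture(DOTALL, html));  // match once "." may cross lines
    }
}
```

With (?s) the capture includes the surrounding line breaks and indentation, hence the trim() before returning.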
  4. inakigarm

  5. udg

    I tried the following in an online regex tool and it works, although I don't think it is an elegant solution; it simply works with the data from post #1.
    <span itemprop='name'>\s*(.*)\s*<\/span>
    In Group 1 you read Ferrari TestCar.
    Fundamentally, it matches any amount of whitespace after "'name'>", followed by the group containing the car model, followed again by any amount of whitespace, and finally followed by </span>.
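As a rough illustration of the whitespace-tolerant pattern described above, here is a plain Java sketch (again assuming B4A's Regex follows java.util.regex). It uses a lazy (.*?) instead of (.*), a small variation so that trailing spaces and further spans on the same line are not swallowed into the group. The class name, method name and the second sample name are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WhitespaceTolerantMatch {
    // \s* absorbs the line break and indentation on both sides of the name;
    // the lazy (.*?) stops at the first closing tag.
    private static final Pattern NAME =
            Pattern.compile("<span itemprop='name'>\\s*(.*?)\\s*</span>");

    /** Collects group 1 of every match, like a Do While m.Find loop would. */
    static List<String> allNames(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = NAME.matcher(html);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical input: one multi-line span and one single-line span.
        String html = "<span itemprop='name'>\n        Ferrari TestCar  \n</span>"
                + "<span itemprop='name'>Second TestCar</span>";
        System.out.println(allNames(html));
    }
}
```

Because "." still does not cross line breaks here, a name that itself spans several lines would not match; this only handles whitespace around a single-line name.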
  6. sorex

    You didn't do it right. I told you to remove line breaks, tabs and extra spacing. Leaving them in breaks the regex lookup unless you extend the pattern to account for them.
  7. Erel

    You should use jSoup or jTidy to parse html.
