Get all links from an html file using regex

Discussion in 'Questions (Windows Mobile)' started by MM2forever, Oct 12, 2007.

  1. MM2forever

    MM2forever Active Member Licensed User

    Hi guys,
    I am trying to extract the links from an HTML file. The relevant parts of my code look like this:
    regex.new1("href=(.*?)[\s>]")

    match.value=regex.match(htmtemp)
    Do While match.success=True
    list.add(SubString(htmtemp,match.index,match.length))
    match.value=match.nextmatch
    Loop

    I'm not getting any results. What's wrong? Is it the regular expression itself?

    Thank you for your help
    Christian
     
  2. Erel

    Erel Administrator Staff Member Licensed User

    The pattern is taken from this site: http://sastools.com/b2/post/79393902
    You should add a Regex object and a Match object.
    Code:
    Sub Globals
        'Declare the global variables here.
    End Sub

    Sub App_Start
        Form1.Show
        If OpenDialog1.Show = cCancel Then AppClose
        'Build the pattern piece by piece; q holds the double-quote characters.
        q = Chr(34) & Chr(34)
        r = "(?:[hH][rR][eE][fF]\s*=)"
        r = r & "(?:[\s"&q&"(']*)"
        r = r & "(?!#|[Mm]ailto|[lL]ocation.|[jJ]avascript|.*css|.*this\.)"
        r = r & "(.*?)(?:[\s>)"&q&"'])"
        Regex.New2(r,true,true)
        'Read the whole file into s.
        FileOpen(c1,OpenDialog1.File,cRead)
        s = FileReadToEnd(c1)
        FileClose(c1)
        Match.New1
        Match.Value = regex.Match(s)
        'Add capture group 1 (the link target) of every match to the list.
        Do While Match.Success
            lstLinks.Add(Match.GetGroup(1))
            Match.Value = Match.NextMatch
        Loop
    End Sub
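
To check what the assembled pattern captures, it can be run outside Basic4ppc. The sketch below uses Python's `re` module as a stand-in, on the assumption that the constructs involved (non-capturing groups, negative lookahead, lazy quantifier) behave the same way in both engines. The sample HTML and the `extract_links` helper are made up for illustration, and `re.IGNORECASE` is only a guess at what the first `true` passed to `Regex.New2` does.

```python
import re

# The four pieces of the pattern from the post, joined into one string.
PATTERN = (
    r"(?:[hH][rR][eE][fF]\s*=)"
    r"""(?:[\s"(']*)"""
    r"(?!#|[Mm]ailto|[lL]ocation.|[jJ]avascript|.*css|.*this\.)"
    r"""(.*?)(?:[\s>)"'])"""
)
# Assumption: case-insensitive matching mirrors Regex.New2(r, true, true).
LINK_RE = re.compile(PATTERN, re.IGNORECASE)

# Hypothetical input, one link per line.
SAMPLE_HTML = """\
<a href="http://example.com/page.html">Page</a>
<a href='docs/readme.txt'>Readme</a>
<a href="#top">Top</a>
<a href="mailto:someone@example.com">Mail</a>
<link href="style.css" rel="stylesheet">
"""


def extract_links(html):
    """Return capture group 1 of every match, dropping empty captures."""
    captures = LINK_RE.findall(html)
    return [c for c in captures if c]


if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML):
        print(link)
```

In this Python run the quoted `#top` and `mailto:` links are not skipped outright: the engine backtracks so that the opening quote is left unconsumed, the lookahead then passes, and group 1 comes back empty. That is why the helper filters out empty strings. The stylesheet link produces no match at all.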
     
  3. MM2forever

    MM2forever Active Member Licensed User

    Thank you, the regex works great. I removed the "bracket exception" (or whatever I should call it), though, because it gave me trouble with links like "gnfgn (1)".
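
The effect of the parentheses in the two character classes can be reproduced with a small test. The sketch below uses Python's `re` module as a stand-in for the Basic4ppc Regex object (an assumption, not something tested on the device), and it guesses that the "bracket exception" means the `(` in the leading class and the `)` in the trailing class. The file name `report(1).pdf` is a made-up example.

```python
import re

HREF = r"(?:[hH][rR][eE][fF]\s*=)"
LOOKAHEAD = r"(?!#|[Mm]ailto|[lL]ocation.|[jJ]avascript|.*css|.*this\.)"

# Pattern as posted: "(" may precede the link and ")" ends the capture.
WITH_PARENS = re.compile(
    HREF + r"""(?:[\s"(']*)""" + LOOKAHEAD + r"""(.*?)(?:[\s>)"'])"""
)

# Assumed modification: both parentheses removed from the character classes.
WITHOUT_PARENS = re.compile(
    HREF + r"""(?:[\s"']*)""" + LOOKAHEAD + r"""(.*?)(?:[\s>"'])"""
)

html = '<a href="report(1).pdf">Report</a>'  # hypothetical link

print(WITH_PARENS.findall(html))     # ['report(1']
print(WITHOUT_PARENS.findall(html))  # ['report(1).pdf']
```

Whitespace is still a terminator in both versions, so in this Python run a target that contains a space, such as `gnfgn (1)`, is cut at the space either way.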
     