How to capture URLs in HTML file

  • hdbbdh
    New Member
    • Dec 2008
    • 24

    Hello everyone,

    I am trying to capture URLs in an HTML file, where they appear in tags like this:

    Code:
    <a[string]href[space(s) or nothing]=[space(s) or nothing]["][URL]["][string]>
    I found this code, but it does not work well.

    Code:
    Imports System.Text.RegularExpressions
    Public Class Form1
        Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
            Dim rx As New Regex("[<]a[\s][\w\W]*[href=](?<word>\S*)[\s\W\w]*[>]", _
                RegexOptions.Compiled Or RegexOptions.IgnoreCase)
            Dim text As String = "<a href=http:// name=as>"
            Dim matches As MatchCollection = rx.Matches(text)
            For Each m As Match In matches
                MsgBox(m.Groups("word").Value)
            Next
        End Sub
    End Class
    Thank You
  • Plater
    Recognized Expert Expert
    • Apr 2007
    • 7872

    #2
    Assuming the regex matches your given pattern, it will probably work.
    But remember, not all websites put " around the href value (they should, but they don't).
    You might find a single quote ' or no quoting at all.
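
The three quoting styles mentioned above can be handled with one alternation per style. The following is only a minimal sketch, not code from this thread: the module name HrefQuotingSketch, the helper ExtractHrefs, and the sample string are hypothetical, and the pattern assumes the href sits inside a single <a ...> tag with no > before it.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module HrefQuotingSketch
    ' Hypothetical helper: returns the href value of every <a> tag found in html.
    Function ExtractHrefs(ByVal html As String) As List(Of String)
        ' Three alternatives for the value: "double quoted", 'single quoted', or unquoted.
        ' .NET allows the same group name (url) to be reused across alternatives.
        Dim pattern As String = "<a\s[^>]*?href\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>""']+))"
        Dim urls As New List(Of String)
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            urls.Add(m.Groups("url").Value)
        Next
        Return urls
    End Function

    Sub Main()
        ' Assumed sample input covering all three quoting styles.
        Dim sample As String = "<a href=""a.htm"">A</a> <a class='x' href='b.htm'>B</a> <A HREF=c.htm>C</A>"
        For Each url As String In ExtractHrefs(sample)
            Console.WriteLine(url)
        Next
    End Sub
End Module
```

For the sample string this lists a.htm, b.htm, and c.htm. A quoted value may contain spaces, which is why each quoted alternative stops only at its own closing quote.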

    • hdbbdh
      New Member
      • Dec 2008
      • 24

      #3
      Hello everyone,
      Thank you, Plater, for your answer and your important tip.
      I made some modifications to the code and it should work, but in the end I get only the last URL. Can you help me with this?

      Code:
      Dim sr As New StreamReader("c:\cas.html")
      Dim text As String = sr.ReadToEnd()
      sr.Close()
      text = text.Replace(Chr(13), "")
      text = text.Replace("  ", " ")
      Dim spattern As String = "<\s*a\s+[\w\W]*href\s*=[\s'" & Chr(34) & "]*(?<word>[^" & Chr(34) & "'\s]+)[\s\S\W\w]*[>]"
      Dim rx As New Regex(spattern, _
      RegexOptions.Compiled Or RegexOptions.IgnoreCase)
      Dim matches As MatchCollection = rx.Matches(text)
      For Each m As Match In matches
          ListBox1.Items.Add(m.Groups("word").Value)
      Next
      The HTML page that I use is attached; if you run this code on that page, you get only contact.htm.

      Thank you

      • hdbbdh
        New Member
        • Dec 2008
        • 24

        #4
        I think I found the solution:
        Code:
        Dim sr As New StreamReader("c:\cas.html")
        Dim text As String = sr.ReadToEnd()
        sr.Close()
        text = text.Replace(Chr(13), "")
        Do While text.Contains("  ") ' keep collapsing until no double spaces remain
            text = text.Replace("  ", " ")
        Loop
        Dim spattern As String = "<\s*a\s+[^>]*href\s*=[\s'" & Chr(34) & "]*(?<word>[^" & Chr(34) & "'\s]+)[^>]*>"
        For Each m As Match In Regex.Matches(text, spattern, RegexOptions.Compiled Or RegexOptions.IgnoreCase)
            ListBox1.Items.Add(m.Groups("word").Value)
        Next
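        The problem was that [\w\W]* also matches the > symbol, so a single match ran past the end of the first tag and on to the last href in the file. Using [^>]* keeps each match inside one tag.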
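
To make the difference concrete, here is a small sketch that runs the pattern from post #3 and the pattern from post #4 over the same two-link string. The module name, the helper ExtractWords, and the sample HTML are hypothetical; the patterns are the ones posted above, written with "" in place of Chr(34).

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module GreedyVsBoundedSketch
    ' Pattern from post #3: [\w\W]* can run across > and swallow whole tags.
    Public Const GreedyPattern As String = "<\s*a\s+[\w\W]*href\s*=[\s'""]*(?<word>[^""'\s]+)[\s\S\W\w]*[>]"
    ' Pattern from post #4: [^>]* cannot leave the current tag.
    Public Const BoundedPattern As String = "<\s*a\s+[^>]*href\s*=[\s'""]*(?<word>[^""'\s]+)[^>]*>"

    ' Hypothetical helper: collects the "word" group of every match.
    Function ExtractWords(ByVal html As String, ByVal pattern As String) As List(Of String)
        Dim words As New List(Of String)
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            words.Add(m.Groups("word").Value)
        Next
        Return words
    End Function

    Sub Main()
        ' Assumed sample input with two links.
        Dim html As String = "<a href=""index.htm"">Home</a> <a href=""contact.htm"">Contact</a>"
        Console.WriteLine("greedy:  " & String.Join(", ", ExtractWords(html, GreedyPattern).ToArray()))
        Console.WriteLine("bounded: " & String.Join(", ", ExtractWords(html, BoundedPattern).ToArray()))
    End Sub
End Module
```

On this sample the greedy pattern produces one match whose word group is contact.htm, while the bounded pattern produces index.htm and contact.htm.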

        Thank you

        • raids51
          New Member
          • Nov 2007
          • 59

          #5
          Wouldn't it be easier to use the mshtml class?
