Parsing content for links

  • Tony

    Parsing content for links

    I have a content management system that has links within the content
    field in the database and I need to verify if those links are correct.
    What I need is a PHP script that queries the database and then parses
    the content field to find all the <a href> tags and get the href
    attribute value and the link text.

    Does anyone have a way of doing this or a regex to do this?

    Thanks,
    Tony

  • Arjen

    #2
    Re: Parsing content for links

    Tony wrote:
    > I have a content management system that has links within the content
    > field in the database and I need to verify if those links are correct.
    > What I need is a PHP script that queries the database and then parses
    > the content field to find all the <a href> tags and get the href
    > attribute value and the link text.
    >
    > Does anyone have a way of doing this or a regex to do this?

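    // No & before $matches: call-time pass-by-reference is a fatal error
    // since PHP 5.4, and preg_match_all() already takes it by reference.
    preg_match_all ("/a[\s]+[^>]*?href[\s]?=[\s\"\']+".
    "(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/",
    $html, $matches);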
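
To show how the result of that call can be read, here is a minimal sketch. The sample $html string and the $links array are made up for illustration; the pattern is the same one as above, written as a single single-quoted string. With the default flag (PREG_PATTERN_ORDER), $matches[1] should hold all the href values and $matches[2] all the link texts, so the two can be paired by index.

```php
<?php
// Sample content; in practice $html would come from the database row.
$html = '<p>See <a href="http://example.com/">Example</a> and '
      . "<a class='ext' href='http://example.org/page'>the page</a>.</p>";

// Same pattern as in the post, as one single-quoted string.
$pattern = '/a[\s]+[^>]*?href[\s]?=[\s"\']+(.*?)["\']+.*?>([^<]+|.*?)?<\/a>/';

preg_match_all($pattern, $html, $matches);

// Default PREG_PATTERN_ORDER: $matches[1] = all hrefs, $matches[2] = all texts.
$links = array();
foreach ($matches[1] as $i => $href) {
    $links[] = array('href' => $href, 'text' => $matches[2][$i]);
}

print_r($links);
```

Each entry in $links can then be checked however "correct" is defined for the site, for example by requesting the URL.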


    --
    Arjen
    http://www.hondenpage.com - Mijn site over honden


    • Curtis

      #3
      Re: Parsing content for links

      Tony wrote:
      > I have a content management system that has links within the content
      > field in the database and I need to verify if those links are correct.
      > What I need is a PHP script that queries the database and then parses
      > the content field to find all the <a href> tags and get the href
      > attribute value and the link text.
      >
      > Does anyone have a way of doing this or a regex to do this?
      >
      > Thanks,
      > Tony

      Yeah, regex would be easiest, and there should be plenty out there,
      but I might do something like this:

      $re = '%
      <a\s (?: [^<>]*? \s )?    # href may or may not come first
      href \s*=\s* ([\'"])      # capture single/double quote

      # match an absolute URI (relative links will not match)
      (
      [\w.+-]+:(?://)?          # scheme
      [^?#\'"<>]+               # authority and path
      (?: \\? [^#\'"<>]* )?     # possible query string
      (?: \\# [^\'"<>]* )?      # possible fragment
      )

      \1                        # captured quote from above
      [^<>]*                    # possible remaining attributes
      >( .*? )                  # allow for nested tags
      </a>                      # closing </a> tag
      %xis';

      The match for the URI would be in $match[2] and the text for the <a>
      tag is in $match[3].

      Just use this $re var in the preg_* functions.
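
As a rough sketch of what that could look like with preg_match_all(): the $re below is a shortened stand-in for the pattern above (an assumption for the sake of a short example, not the full pattern) that keeps the same group layout, and $content, $found and $links are made-up names. Passing PREG_SET_ORDER gives one array per link, so $match[2] and $match[3] line up as described.

```php
<?php
// Stand-in for the $re in the post, shortened for the example but with the
// same group layout: 1 = quote character, 2 = URI, 3 = link text.
$re = '%<a\s[^<>]*?href=([\'"])([\w.-]+:[^\'"<>]+)\1[^<>]*>(.*?)</a>%is';

// Sample content; in practice this would be the content field from the database.
$content = '<p><a href="http://example.com/?q=1#top">Home</a> or '
         . "<a id='m' href='mailto:tony@example.com'><b>mail</b> me</a></p>";

// PREG_SET_ORDER yields one array per matched link.
preg_match_all($re, $content, $found, PREG_SET_ORDER);

$links = array();
foreach ($found as $match) {
    $links[] = array(
        'href' => $match[2],
        'text' => trim(strip_tags($match[3])), // drop nested tags from the text
    );
}

print_r($links);
```

strip_tags() is only there because the pattern allows nested tags inside the link; leave it out if the raw markup of the link text is wanted.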

      Hope this helps,
      Curtis
