Best way to extract URL from random string?

  • deko

    Best way to extract URL from random string?

    If I have random and unpredictable user agent strings containing URLs, what is
    the best way to extract the URL?

    For example, let's say the string looks like this:

    registered NYSE 943 <a href="http://netforex.net">Forex Trading Network
    Organization</a> info@netforex.org

    What's the best way to extract http://netforex.net ?

    I have code that checks for identifiable browsers and bots, but when the agent
    string has no identifiable information other than a URL, I want to grab the URL.

    Here's a first crack at it:
    ..
    ..
    ..
    [code omitted]
    ..
    ..
    ..
    elseif (eregi("http://", $agent))
    {
        $agent = stristr($agent, "http://");
        $agent = parse_url($agent);
        $agent = $agent['host'];
        //check for subdomains
        $agent_a = explode(".", $agent);
        $agent_r = array_reverse($agent_a);
        $sub = count($agent_r) - 1;
        $tld3 = substr($agent_r[0], 0, 3);
        if (eregi("^(com|net|org|edu|biz|gov)$", $tld3)) //common tld's
        {
            while ($sub > 0)
            {
                $domain = $domain.$agent_r[$sub].".";
                $sub--;
            }
            $refurl = $domain.$tld3;
        }
        $referrer = "<a href='".$refurl."'>".$refurl."</a>";
    }
    else
    {
        $referrer = "unknown";
    }

    Are there any PHP functions that will help here? How to handle sub domains?
    International domains?

    Thanks in advance.

  • deciacco

    #2
    Re: Best way to extract URL from random string?

    How about:

    if (preg_match('/\\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|]/i', $subject, $result)) {
        $url = $result[0];
    } else {
        $url = "";
    }
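A minimal sketch of that pattern applied to the sample string from the question. It assumes the stripped `>` characters in the sample are restored; the variable names are only for illustration.

```php
<?php
// Sample string from the question, with the missing ">" restored (an assumption).
$subject = 'registered NYSE 943 <a href="http://netforex.net">Forex Trading Network Organization</a> info@netforex.org';

// Same pattern as in the post above.
$pattern = '/\\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|]/i';

// $result[0] holds the whole match; fall back to an empty string when nothing matches.
$url = preg_match($pattern, $subject, $result) ? $result[0] : '';

echo $url, "\n"; // http://netforex.net
```

The match stops at the closing double quote because `"` is not in either character class.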

    -----Original Message-----
    From: deko [mailto:deko@nospam.com]
    Posted At: Friday, February 09, 2007 2:15 PM
    Posted To: comp.lang.php
    Conversation: Best way to extract URL from random string?
    Subject: Best way to extract URL from random string?

    If I have random and unpredictable user agent strings containing URLs, what is
    the best way to extract the URL?

    For example, let's say the string looks like this:

    registered NYSE 943 <a href="http://netforex.net">Forex Trading Network
    Organization</a> info@netforex.org

    What's the best way to extract http://netforex.net ?

    I have code that checks for identifiable browsers and bots, but when the agent
    string has no identifiable information other than a URL, I want to grab the URL.

    Here's a first crack at it:
    ..
    ..
    ..
    [code omitted]
    ..
    ..
    ..
    elseif (eregi("http://", $agent))
    {
        $agent = stristr($agent, "http://");
        $agent = parse_url($agent);
        $agent = $agent['host'];
        //check for subdomains
        $agent_a = explode(".", $agent);
        $agent_r = array_reverse($agent_a);
        $sub = count($agent_r) - 1;
        $tld3 = substr($agent_r[0], 0, 3);
        if (eregi("^(com|net|org|edu|biz|gov)$", $tld3)) //common tld's
        {
            while ($sub > 0)
            {
                $domain = $domain.$agent_r[$sub].".";
                $sub--;
            }
            $refurl = $domain.$tld3;
        }
        $referrer = "<a href='".$refurl."'>".$refurl."</a>";
    }
    else
    {
        $referrer = "unknown";
    }

    Are there any PHP functions that will help here? How to handle sub domains?
    International domains?

    Thanks in advance.


    • BKDotCom

      #3
      Re: Best way to extract URL from random string?

      On Feb 9, 2:15 pm, "deko" <d...@nospam.com> wrote:
      > Are there any PHP functions that will help here? How to handle sub domains?
      > International domains?
      >
      > Thanks in advance.

      well, you found parse_url
      you might want to use regular expressions as well

      $long_string = 'A HREF="http://something.else.example.com/blah/?joe=bob"';
      if ( preg_match('|([^\s"\']*://[^\s"\']*)|',$long_string,$matches) )
      {
          $url = $matches[1]; // http://something.else.example.com/blah/?joe=bob
          $parts = parse_url($url);
          if ( preg_match('/(.+)\.\w+\.\w+/',$parts['host'],$matches) )
              echo $matches[1]; // something.else
      }
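On the "International domains?" part of the question: the subdomain pattern above assumes the registered domain is always the last two labels. A small sketch of what that means in practice; `subdomain_part` is a hypothetical wrapper, not something from the thread.

```php
<?php
// Hypothetical helper wrapping the subdomain pattern from the post above.
// Returns everything in front of the last two labels, or null if there is nothing there.
function subdomain_part($host)
{
    return preg_match('/(.+)\.\w+\.\w+/', $host, $m) ? $m[1] : null;
}

var_dump(subdomain_part('something.else.example.com')); // string "something.else"
var_dump(subdomain_part('example.com'));                // NULL: only two labels
var_dump(subdomain_part('www.example.co.uk'));          // string "www.example"
```

The last call shows the limitation: with a two-level suffix such as `co.uk`, the registered name `example` ends up counted as part of the subdomain, so a suffix list is needed to handle those hosts correctly.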


      • Rik

        #4
        Re: Best way to extract URL from random string?

        On Fri, 09 Feb 2007 22:02:18 +0100, BKDotCom <bkfake-google@yahoo.com> wrote:
        > On Feb 9, 2:15 pm, "deko" <d...@nospam.com> wrote:
        >> Are there any PHP functions that will help here? How to handle sub
        >> domains?
        >> International domains?
        >>
        >> Thanks in advance.
        >
        > well, you found parse_url
        > you might want to use regular expressions as well
        >
        > $long_string = 'A HREF="http://something.else.example.com/blah/?joe=bob"';
        > if ( preg_match('|([^\s"\']*://[^\s"\']*)|',$long_string,$matches) )

        Afaik protocols can only be a-z+, you don't have to capture the entire
        match, and the url should have at least one character, so a little
        optimised it would be:

        '|[a-z]+://[^\s"\']+|i'

        > {
        > $url = $matches[1]; // http://something.else.example.com/blah/?joe=bob

        $url = $matches[0];
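A short side-by-side sketch of the two patterns, to show where they agree and where the tightened one is stricter. The `$stray` input is a made-up example.

```php
<?php
$long_string = 'A HREF="http://something.else.example.com/blah/?joe=bob"';

$original  = '|([^\s"\']*://[^\s"\']*)|'; // pattern from the quoted post
$tightened = '|[a-z]+://[^\s"\']+|i';     // pattern suggested in this post

preg_match($original, $long_string, $a);
preg_match($tightened, $long_string, $b);

// The first capture group of the original equals the whole match of the tightened one.
var_dump($a[1] === $b[0]); // bool(true)

// A bare "://" satisfies the original, because both sides may be empty.
$stray = 'see :// here';
var_dump(preg_match($original, $stray));  // int(1)
var_dump(preg_match($tightened, $stray)); // int(0)
```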

        --
        Rik Wasmus


        • deko

          #5
          Re: Best way to extract URL from random string?

          "BKDotCom" <bkfake-google@yahoo.com> wrote in message
          news:1171054938.335019.14900@k78g2000cwa.googlegroups.com...
          > On Feb 9, 2:15 pm, "deko" <d...@nospam.com> wrote:
          >> Are there any PHP functions that will help here? How to handle sub domains?
          >> International domains?
          >>
          >> Thanks in advance.
          >
          > well, you found parse_url
          > you might want to use regular expressions as well
          >
          > $long_string = 'A HREF="http://something.else.example.com/blah/?joe=bob"';
          > if ( preg_match('|([^\s"\']*://[^\s"\']*)|',$long_string,$matches) )
          > {
          > $url = $matches[1]; // http://something.else.example.com/blah/?joe=bob
          > $parts = parse_url($url);
          > if ( preg_match('/(.+)\.\w+\.\w+/',$parts['host'],$matches) )
          > echo $matches[1]; // something.else
          > }

          use regex to handle subdomains... I see!

          but wouldn't the first few lines of my original code be a more efficient
          starting point?

          elseif (eregi("http://", $agent))
          {
              $agent = stristr($agent, "http://");
              $agent = parse_url($agent);
              //now use preg_match() to return everything beginning with a '.' up to
              //the next word boundary (?)
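One possible way to keep the `stristr()` / `parse_url()` starting point is to trim the trailing junk before parsing, so that `parse_url()` only sees the URL itself. This is a sketch under assumptions: `first_http_host` is a hypothetical helper, and the set of delimiter characters is a guess at what can follow a URL in an agent string.

```php
<?php
// Hypothetical helper: host of the first http:// URL in $agent, or null if none is found.
function first_http_host($agent)
{
    // Everything from the first "http://" onward (case-insensitive search).
    $tail = stristr($agent, 'http://');
    if ($tail === false) {
        return null;
    }
    // Cut at the first whitespace, quote or angle bracket (assumed delimiters).
    $candidate = substr($tail, 0, strcspn($tail, " \t\r\n\"'<>"));
    $host = parse_url($candidate, PHP_URL_HOST);
    return is_string($host) ? $host : null;
}

$agent = 'registered NYSE 943 <a href="http://netforex.net">Forex Trading Network Organization</a> info@netforex.org';
echo first_http_host($agent), "\n"; // netforex.net
```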

          still testing...


          • deko

            #6
            Re: Best way to extract URL from random string?

            "Rik" <luiheidsgoeroe@hotmail.com> wrote in message
            news:op.tnh2l8sgqnv3q9@misant...
            > On Fri, 09 Feb 2007 22:02:18 +0100, BKDotCom <bkfake-google@yahoo.com>
            > wrote:
            >> On Feb 9, 2:15 pm, "deko" <d...@nospam.com> wrote:
            >>> Are there any PHP functions that will help here? How to handle sub domains?
            >>> International domains?
            >>>
            >>> Thanks in advance.
            >>
            >> well, you found parse_url
            >> you might want to use regular expressions as well
            >>
            >> $long_string = 'A HREF="http://something.else.example.com/blah/?joe=bob"';
            >> if ( preg_match('|([^\s"\']*://[^\s"\']*)|',$long_string,$matches) )
            >
            > Afaik protocols can only be a-z+, you don't have to capture the entire
            > match, and the url should have at least one character, so a little
            > optimised it would be:
            >
            > '|[a-z]+://[^\s"\']+|i'
            >
            >> {
            >> $url = $matches[1]; // http://something.else.example.com/blah/?joe=bob
            >
            > $url = $matches[0];

            ============================================

            I've been thinking about this... see http://www.liarsscourge.com

            I need to decide:

            1) what TLDs I will accept
            2) what protocols I will accept

            so...

            1 = common TLDs, including international TLDs
            2 = http only

            next...

            -- assemble array of common/international TLDs
            -- construct regex to search for TLDs in this array

            developing...
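The two steps in that plan (an array of accepted TLDs, then a regex built from it, http only) could look roughly like this. The TLD list is illustrative only, and `host_with_known_tld` is a hypothetical name; this is a sketch, not the code the poster ended up with.

```php
<?php
// Hypothetical helper: host of the first http:// URL whose ending is in $tlds, else null.
function host_with_known_tld($text, array $tlds)
{
    // Escape each entry so the dot in "co.uk" is matched literally.
    $quoted = array();
    foreach ($tlds as $tld) {
        $quoted[] = preg_quote($tld, '~');
    }
    // One or more labels, then an accepted ending that is not followed by more host characters.
    $pattern = '~http://((?:[a-z0-9-]+\.)+(?:' . implode('|', $quoted) . '))(?![a-z0-9.-])~i';
    return preg_match($pattern, $text, $m) ? strtolower($m[1]) : null;
}

// Illustrative list only; a real one would be much longer.
$tlds = array('com', 'net', 'org', 'edu', 'biz', 'gov', 'co.uk', 'de');

var_dump(host_with_known_tld('<a href="http://netforex.net">Forex', $tlds)); // string "netforex.net"
var_dump(host_with_known_tld('http://www.example.co.uk/x', $tlds));          // string "www.example.co.uk"
var_dump(host_with_known_tld('http://example.network', $tlds));              // NULL
```

The negative lookahead is what keeps `example.network` from being accepted on the strength of its `net` prefix.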
