preg_match() regex to validate URL

  • deko

    preg_match() regex to validate URL

    As I understand it, the characters that make up an Internet domain name can
    consist of only alpha-numeric characters and a hyphen
    (http://tools.ietf.org/html/rfc3696)

    So I'm trying to write a regex that will provide basic URL format validation:

    starts with http or https (the only 2 protocols I'm interested in), is followed by
    '://', then ([any alpha-numeric or hyphen] followed by a '.' appearing 1 or more
    times), then followed by anything *, and is case-insensitive.

    I tried this:

    if (preg_match('/^(http|https):\/\/([a-z0-9-]\.+)*/i', $urlString))
    {
        $valid = true;
    }
    else
    {
        $valid = false;
    }

    but no luck.

    Any suggestions welcome...

    Thanks in advance.

  • Rik

    #2
    Re: preg_match() regex to validate URL

    Deko, while your enthusiasm is appreciated, please stay in the same thread
    when making a post about the same subject. Starting several threads not
    only creates confusion about answers already given and their context, it
    also comes across as very pushy.

    deko wrote:
    >As I understand it, the characters that make up an Internet domain name
    >can consist of only alpha-numeric characters and a hyphen
    >(http://tools.ietf.org/html/rfc3696)

    ..."Any characters, or combination of bits (as octets), are permitted in
    DNS names. However, there is a preferred form that is required by most
    applications."...

    >So I'm trying to write regex that will provide a basic url format
    >validation:
    >
    >starts with http or https (the only 2 prots I'm interested in), is
    >followed by '://', then ([any alpha-numeric or hyphen] followed by a '.'
    >appearing 1 or more times), then followed by anything *, and is
    >case-insensitive.
    >
    >I tried this:
    >
    >if (preg_match('/^(http|https):\/\/([a-z0-9-]\.+)*/i', $urlString))

    This bit "([a-z0-9-]\.+)" does not do what you think it does: it matches
    _one_ single character in the [a-z0-9-] range, followed by one or more
    literal dots. And that group is repeated zero or more times. So
    'http://a.b.c.d......d..e....a......' would match.

    Furthermore, you seem to have anchored this with ^, so it will only
    match if http(s):// is at the very beginning of the string. Is that what
    you want?

    '/^https?:\/\/[a-z0-9-]+(\.[a-z0-9-]+)+/i'
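
    A minimal sketch of the difference between the two patterns. The sample
    URLs and the variable names $original, $suggested and $results are made up
    for illustration and are not from the thread. Neither pattern is anchored
    at the end, so both accept anything after the part they match:

    ```php
    <?php
    // The pattern from the original question and the suggested replacement.
    $original  = '/^(http|https):\/\/([a-z0-9-]\.+)*/i';
    $suggested = '/^https?:\/\/[a-z0-9-]+(\.[a-z0-9-]+)+/i';

    // Hypothetical sample inputs, chosen only to show where the two differ.
    $samples = array(
        'http://www.example.com/index.php',
        'https://example.com',
        'http://',
        'http://localhost',
        'ftp://example.com',
    );

    $results = array();
    foreach ($samples as $url) {
        $results[$url] = array(
            'original'  => preg_match($original, $url) === 1,
            'suggested' => preg_match($suggested, $url) === 1,
        );
        // Booleans print as 1 (match) or 0 (no match) with %d.
        printf("%-35s original=%d suggested=%d\n", $url,
            $results[$url]['original'], $results[$url]['suggested']);
    }
    ```

    The original pattern accepts a bare 'http://' because its group may repeat
    zero times. The suggested one requires at least two dot-separated labels,
    which also means single-label hosts such as 'localhost' are rejected.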

    --
    Rik Wasmus

    • deko

      #3
      Re: preg_match() regex to validate URL

      >>As I understand it, the characters that make up an Internet domain name can
      >>consist of only alpha-numeric characters and a hyphen
      >>(http://tools.ietf.org/html/rfc3696)
      >
      >..."Any characters, or combination of bits (as octets), are permitted in DNS
      >names. However, there is a preferred form that is required by most
      >applications."...

      I just tried registering various domain names with an underscore. The
      registrar's system rejected it. While this may not be the best verification, I
      have yet to see a valid Internet domain with an underscore or any other
      non-alphanumeric character (other than a hyphen).

      >'/^https?:\/\/[a-z0-9-]+(\.[a-z0-9-]+)+/i'

      Thanks, Rik

      • Rik

        #4
        Re: preg_match() regex to validate URL

        On Mon, 12 Feb 2007 10:29:26 +0100, deko <deko@nospam.com> wrote:
        >>>As I understand it, the characters that make up an Internet domain
        >>>name can consist of only alpha-numeric characters and a hyphen
        >>>(http://tools.ietf.org/html/rfc3696)
        >>..."Any characters, or combination of bits (as octets), are permitted
        >>in DNS names. However, there is a preferred form that is required by
        >>most applications."...
        >
        >I just tried registering various domain names with an underscore. The
        >registrar's system rejected it. While this may not be the best
        >verification, I have yet to see a valid Internet domain with an
        >underscore or any other non-alphanumeric character (other than a hyphen).

        There are efforts to fully internationalise DNS entries, so even non-roman
        based character sets are allowed. See for instance
        <http://www.ietf.org/rfc/rfc4185.txt>. We're not there yet by a long shot,
        but there's no doubt it will happen.

        --
        Rik Wasmus

        • deko

          #5
          Re: preg_match() regex to validate URL

          >>>>As I understand it, the characters that make up an Internet domain name
          >>>>can consist of only alpha-numeric characters and a hyphen
          >>>>(http://tools.ietf.org/html/rfc3696)
          >>>..."Any characters, or combination of bits (as octets), are permitted in
          >>>DNS names. However, there is a preferred form that is required by most
          >>>applications."...
          >>
          >>I just tried registering various domain names with an underscore. The
          >>registrar's system rejected it. While this may not be the best
          >>verification, I have yet to see a valid Internet domain with an underscore
          >>or any other non-alphanumeric character (other than a hyphen).
          >
          >There are efforts to fully internationalise DNS entries, so even non-roman
          >based character sets are allowed. See for instance
          ><http://www.ietf.org/rfc/rfc4185.txt>. We're not there yet by a long shot,
          >but there's no doubt it will happen.

          Eventually, I'm sure.

          Getting back to my regex question, I wonder if it would be better to check for
          illegal characters:

          if (preg_match('/[`~!@#$%^&*()_+=\[\]{}|;:\'"<>? ]/', $url_a['host']))
          {
              // the host contains an illegal character
          }

          I'm not having much luck catching invalid hostnames otherwise...
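
          One way to sketch the opposite approach, allowing only known-good
          characters instead of listing bad ones: split the URL with parse_url()
          and check each dot-separated label of the host. The function name
          is_valid_http_url() and the label rules used here (letters, digits and
          hyphens, no hyphen at either end, at most 63 characters, at least two
          labels) are assumptions for illustration, not something settled in
          this thread:

          ```php
          <?php
          // Hypothetical helper: returns true if $url is an http(s) URL whose
          // host consists only of letter/digit/hyphen labels separated by dots.
          function is_valid_http_url($url)
          {
              $url_a = parse_url($url);
              if ($url_a === false || !isset($url_a['scheme'], $url_a['host'])) {
                  return false;
              }
              if (!in_array(strtolower($url_a['scheme']), array('http', 'https'), true)) {
                  return false;
              }
              // Require at least two labels, e.g. "example.com".
              $labels = explode('.', $url_a['host']);
              if (count($labels) < 2) {
                  return false;
              }
              foreach ($labels as $label) {
                  // 1 to 63 characters, no hyphen at either end.
                  if (!preg_match('/^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/iD', $label)) {
                      return false;
                  }
              }
              return true;
          }

          var_dump(is_valid_http_url('http://www.example.com/path?x=1'));
          var_dump(is_valid_http_url('http://my_host.example.com/'));
          var_dump(is_valid_http_url('http://example..com/'));
          ```

          The first call passes; the second fails on the underscore and the
          third on the empty label between the two dots. A whitelist like this
          never needs updating when someone finds a character the blacklist
          forgot.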

          • Toby A Inkster

            #6
            Re: preg_match() regex to validate URL

            Rik wrote:
            >There are efforts to fully internationalise DNS entries, so even non-roman
            >based character sets are allowed. See for instance
            ><http://www.ietf.org/rfc/rfc4185.txt>. We're not there yet by a long shot,
            >but there's no doubt it will happen.

            Not there yet?!

            Try telling that to "www.한글.kr" !

            --
            Toby A Inkster BSc (Hons) ARCS
            Contact Me ~ http://tobyinkster.co.uk/contact
            Geek of ~ HTML/SQL/Perl/PHP/Python*/Apache/Linux

            * = I'm getting there!

            • Michael Fesser

              #7
              Re: preg_match() regex to validate URL

              ..oO(Rik)
              >'/^https?:\/\/[a-z0-9-]+(\.[a-z0-9-]+)+/i'
              With another delimiter you could avoid the escaping of slashes and make
              the regexp a bit more readable (IMHO):

              '#^https?://[a-z0-9-]+(\.[a-z0-9-]+)+#i'
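
              A quick way to check that changing the delimiter does not alter
              what is matched. The sample URLs and variable names are made up
              for the purpose:

              ```php
              <?php
              // Same expression, once with "/" and once with "#" as the delimiter.
              $with_slashes = '/^https?:\/\/[a-z0-9-]+(\.[a-z0-9-]+)+/i';
              $with_hashes  = '#^https?://[a-z0-9-]+(\.[a-z0-9-]+)+#i';

              // Hypothetical inputs, not taken from the thread.
              $samples = array(
                  'http://example.com',
                  'HTTPS://www.example.org/a/b',
                  'http://nodots',
              );

              $same = true;
              foreach ($samples as $url) {
                  $a = preg_match($with_slashes, $url);
                  $b = preg_match($with_hashes, $url);
                  if ($a !== $b) {
                      $same = false;
                  }
                  echo $url, ' => ', $a, ' / ', $b, "\n";
              }
              ```

              The trade-off is that a literal # inside the pattern would now
              need escaping instead of the slash.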

              Just my 2 cents.

              Micha

              • Rik

                #8
                Re: preg_match() regex to validate URL

                On Mon, 12 Feb 2007 16:39:24 +0100, Toby A Inkster
                <usenet200702@tobyinkster.co.uk> wrote:
                >Rik wrote:
                >>There are efforts to fully internationalise DNS entries, so even
                >>non-roman based character sets are allowed. See for instance
                >><http://www.ietf.org/rfc/rfc4185.txt>. We're not there yet by a long
                >>shot, but there's no doubt it will happen.
                >
                >Not there yet?!
                >
                >Try telling that to "www.한글.kr" !

                Yup, works. It isn't understood by a lot of programs, though. Most
                browsers will handle it just fine, but browsing is not the only thing
                we want to use it for.

                Simple example: I cannot ping this with ease in my Windows version...
                --
                Rik Wasmus

                • deko

                  #9
                  Re: preg_match() regex to validate URL

                  >>'/^https?:\/\/[a-z0-9-]+(\.[a-z0-9-]+)+/i'
                  >
                  >With another delimiter you could avoid the escaping of slashes and make
                  >the regexp a bit more readable (IMHO):
                  >
                  >'#^https?://[a-z0-9-]+(\.[a-z0-9-]+)+#i'

                  Thanks for the tip.

                  I recently found this: http://baseclass.modulweb.dk/urlvali...viewsource.php

                  which looks interesting, if not overkill.

                  • Toby A Inkster

                    #10
                    Re: preg_match() regex to validate URL

                    Rik wrote:
                    >Yup, works. Isn't understood by a lot of programs though, most browsers
                    >will handle it just fine, but browsing is not the only thing we want to
                    >use it for.
                    >
                    >Simple example: I cannot ping this with ease in my Windows version...

                    There are IDN-enabled versions of ping available, but few if any operating
                    systems ship with them as standard yet.

                    Though the hope is that operating systems will integrate IDN support
                    directly into their own gethostbyname() type functions, so there is no
                    need to explicitly compile IDN support into all software that uses domain
                    names.

                    (On the other hand, a lot of software has to parse URLs too, in which
                    case it would probably need its URL-parsing code updated to cope with IDN.)

                    libidn exists, which makes it really easy to drop support for
                    internationalised domain names into existing network apps. It's LGPL too,
                    which even makes it available for use by closed-source software.
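
                    For the PHP side of this thread, a rough sketch of how an
                    internationalised host could be converted to its ASCII
                    ("xn--...") form before an ASCII-only regex is applied. It
                    assumes PHP's intl extension is loaded, which provides
                    idn_to_ascii() (backed by ICU rather than libidn). The
                    helper name host_to_ascii() is made up for illustration:

                    ```php
                    <?php
                    // Hypothetical helper: convert an internationalised host name to
                    // its ASCII form. Returns null when the intl extension is missing
                    // or the conversion fails.
                    function host_to_ascii($host)
                    {
                        if (!function_exists('idn_to_ascii')) {
                            return null; // intl extension not loaded
                        }
                        $ascii = idn_to_ascii($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
                        return $ascii === false ? null : $ascii;
                    }

                    $ascii = host_to_ascii('www.한글.kr');
                    if ($ascii !== null) {
                        // After conversion the ASCII-only host pattern can be applied.
                        var_dump(preg_match('#^[a-z0-9-]+(\.[a-z0-9-]+)+$#iD', $ascii) === 1);
                    }
                    ```

                    With this in front of the check, the existing
                    letters-digits-hyphens pattern does not have to learn about
                    non-ASCII characters at all.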

                    --
                    Toby A Inkster BSc (Hons) ARCS
                    Contact Me ~ http://tobyinkster.co.uk/contact
                    Geek of ~ HTML/SQL/Perl/PHP/Python*/Apache/Linux

                    * = I'm getting there!
