urlencode vs rawurlencode

  • Joshua Beall

    urlencode vs rawurlencode

    Hi All,

    I can see from the manual that the difference between urlencode and
    rawurlencode is that urlencode translates spaces to '+' characters, whereas
    rawurlencode translates them into their hex code.

    My question is, is there any real world difference between these two
    functions? Or perhaps another way of asking the question: *why* are there
    two different functions? In what situation would you need one, and not be
    able to use the other?

    Thanks!

    -Josh


  • John Dunlop

    #2
    Re: urlencode vs rawurlencode

    Joshua Beall wrote:
    > I can see from the manual that the difference between urlencode and
    > rawurlencode is that urlencode translates spaces to '+' characters, whereas
    > rawurlencode translates them into their hex code.
    >
    > My question is, is there any real world difference between these two
    > functions?

    I don't know.
    > Or perhaps another way of asking the question: *why* are there two
    > different functions?

    A good question. I don't know the answer to that either.

    A plus sign is reserved in the query component. A reserved character
    may be used for its reserved purpose or, if it doesn't conflict with
    the reserved purpose, as data.

    Encoding spaces as plus signs is specific to form encoding. The
    HTML4.01 specification describes the encoding process: "[i]f the
    method is 'get' and the action is an HTTP URI, the user agent takes
    the value of action, appends a `?' to it, then appends the form data
    set, encoded using the 'application/x-www-form-urlencoded' content
    type" (HTML4.01, sec. 17.13.3). So, here, spaces are encoded as plus
    signs; elsewhere, spaces are encoded as "%20", as explained in
    RFC2396, section 2.4.
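
A minimal sketch of the difference described above, assuming a plain PHP script; the variable names are illustrative only. It shows that the two functions differ in how they encode a space, while a literal plus sign is percent-encoded by both.

```php
<?php
// Hypothetical sample value containing a space.
$value = 'foo bar';

// Form-style encoding (application/x-www-form-urlencoded): space becomes '+'.
$formEncoded = urlencode($value);

// Percent-encoding: space becomes '%20'.
$rawEncoded = rawurlencode($value);

// A literal plus sign is data, so both functions escape it as '%2B'.
$plusForm = urlencode('foo+bar');
$plusRaw  = rawurlencode('foo+bar');

echo $formEncoded, "\n"; // foo+bar
echo $rawEncoded, "\n";  // foo%20bar
echo $plusForm, "\n";    // foo%2Bbar
echo $plusRaw, "\n";     // foo%2Bbar
```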

    Consider:

    1. <http://domain.example/?baz=foo+bar>
    2. <http://domain.example/?baz=foo%20bar>
    3. <http://domain.example/?baz=foo%2Bbar>

    All three are syntactically valid URIs. The first could be a URI
    generated from an HTML form, where the action specified was
    <http://domain.example/>, the method GET and the form data set
    consisting of a control named "baz" with current value "foo bar". The
    space in the current value is replaced with a plus sign.
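
As a rough illustration of how a URI generator might produce the first or the second form, here is a sketch using PHP's http_build_query. The PHP_QUERY_RFC1738 and PHP_QUERY_RFC3986 constants are not mentioned in this thread and only exist in later PHP versions (5.4 and up), so treat this as an assumption about the reader's environment rather than part of the original discussion.

```php
<?php
// Form data set with one control named "baz" whose value is "foo bar".
$data = ['baz' => 'foo bar'];

// Form-style query string: spaces become '+', as a browser would submit a GET form.
$formStyle = http_build_query($data, '', '&', PHP_QUERY_RFC1738);

// Percent-encoded query string: spaces become '%20'.
$percentStyle = http_build_query($data, '', '&', PHP_QUERY_RFC3986);

echo 'http://domain.example/?' . $formStyle, "\n";    // http://domain.example/?baz=foo+bar
echo 'http://domain.example/?' . $percentStyle, "\n"; // http://domain.example/?baz=foo%20bar
```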

    Reading Björn Höhrmann's explanation of reserved characters in

    "Re: Good/Bad - URI encoding in HTML editor",


    we see that numbers one and two are *not* equivalent.

    Also related is Terje Bless' request for clarification

    "Ambiguity of Allowed/Recommended URI Syntax and Escaping",

    > In what situation would you need one, and not be able to use the other?

    That depends on the URI generator, I think.

    The documentation for urlencode says "[t]his function is convenient
    when encoding a string to be used in a query part of a URL" [1]. I
    don't see any reason to favour it over rawurlencode, however, which
    encodes as per section 2.4 of RFC2396 (modulo the fact it always
    encodes certain unreserved characters [2]).

    Refs.:

    "Uniform Resource Identifiers (URI): Generic Syntax", 1998,


    "Uniform Resource Locators (URL)", 1994,



    [1] "PHP: urlencode - Manual",


    [2] Section 2.3 of RFC2396 says:

    | Unreserved characters can be escaped without changing the semantics
    | of the URI, but this should not be done unless the URI is being used
    | in a context that does not allow the unescaped character to appear.

    --
    Jock


    • John Dunlop

      #3
      Re: urlencode vs rawurlencode

      If I sound confused, that's because I am.

      John Dunlop wrote:
      > Consider:
      >
      > 1. <http://domain.example/?baz=foo+bar>
      > 2. <http://domain.example/?baz=foo%20bar>
      > 3. <http://domain.example/?baz=foo%2Bbar>

      [ ... ]
      > Reading Björn Höhrmann's explanation of reserved characters in
      >
      > "Re: Good/Bad - URI encoding in HTML editor",
      > http://lists.w3.org/Archives/Public/...2May/0032.html
      >
      > we see that numbers one and two are *not* equivalent.

      Actually, I think, numbers one and two are equivalent. Hopefully I've
      got this straight in my head now. :-)

      RFC1630, which I hadn't read before, sums up Tim BL's original intent:

      | Within the query string, the plus sign is reserved as shorthand
      | notation for a space. Therefore, real plus signs must be encoded.
      | This method was used to make query URIs easier to pass in systems
      | which did not allow spaces.

      According to RFC1738, sec. 3.3, however, plus signs weren't reserved
      in the query component ("searchpart") of an HTTP URL. That means they
      had no reserved purpose, so a plus sign meant a plus sign, not a
      space, and they didn't need to be encoded.

      Then came along RFC2396 and the plus sign became reserved in the query
      component again. Real plus signs must now be encoded. It doesn't say
      what the reserved purpose is for plus signs. I guess, then, plus
      signs are shorthand for spaces.
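
Under the "plus means space" reading, the three example URIs decode as follows. This is only a sketch using PHP's urldecode, which follows the form-encoding convention; the sample value '1+1=2' is made up for illustration.

```php
<?php
// Query values from the three example URIs, decoded with the form convention.
$one   = urldecode('foo+bar');   // '+' is shorthand for a space
$two   = urldecode('foo%20bar'); // '%20' is an escaped space
$three = urldecode('foo%2Bbar'); // '%2B' is an escaped, real plus sign

// A real plus sign in data has to be escaped when generating the URI.
$encoded   = urlencode('1+1=2');
$roundTrip = urldecode($encoded);

echo $one, "\n";       // foo bar
echo $two, "\n";       // foo bar
echo $three, "\n";     // foo+bar
echo $encoded, "\n";   // 1%2B1%3D2
echo $roundTrip, "\n"; // 1+1=2
```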

      Previously, I was under the impression that a question mark mustn't
      appear in query components. It seems I was wrong. A URI may contain
      more than one question mark, although URI generators are discouraged
      from generating such URIs. The second "?" should always be treated as
      data by parsers. See

      Roy T. Fielding, 2002-11-17, "Re: Ambiguity of Allowed/Recommended URI
      Syntax and Escaping",


      Refs.:

      RFC1630 (informational), 1994-06, "Universal Resource Identifiers in
      WWW: A Unifying Syntax for the Expression of Names and Addresses of
      Objects on the Network as used in the World-Wide Web",


      RFC1738 (proposed standard), 1994-12, "Uniform Resource Locators
      (URL)",


      RFC2396 (draft standard), 1998-08, "Uniform Resource Identifiers
      (URI): Generic Syntax",


      --
      Jock


      • Adriaan

        #4
        Re: urlencode vs rawurlencode

        "John Dunlop" wrote:
        > 1. <http://domain.example/?baz=foo+bar>
        > 2. <http://domain.example/?baz=foo%20bar>

        Note that both of these yield the same value in $_GET['baz'] (which by design
        holds the decoded values). But when explicitly (manually) decoding
        $_SERVER['QUERY_STRING'], keep in mind that http://php.net/rawurldecode does
        *not* convert + characters into spaces!
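
A small sketch of that pitfall, assuming the raw query string is 'baz=foo+bar'; the $queryString variable stands in for $_SERVER['QUERY_STRING'] so the example runs outside a web server.

```php
<?php
// Stand-in for $_SERVER['QUERY_STRING'] (hypothetical value for illustration).
$queryString = 'baz=foo+bar';

// Split the single name=value pair by hand.
list($name, $encoded) = explode('=', $queryString, 2);

// rawurldecode leaves '+' untouched; urldecode turns it into a space.
$rawDecoded  = rawurldecode($encoded);
$formDecoded = urldecode($encoded);

// parse_str decodes the same way PHP fills $_GET, so '+' becomes a space.
parse_str($queryString, $params);

echo $rawDecoded, "\n";    // foo+bar
echo $formDecoded, "\n";   // foo bar
echo $params['baz'], "\n"; // foo bar
```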

        Adriaan

