htmlentities & charencoding

  • Taras_96

    htmlentities & charencoding

    Hi all,

    I was hoping to get some clarification on a couple of questions I have:

    1) When should htmlspecialchars be used? As a general rule, should
    it be used for text that may contain special characters and is going
    to be rendered in the browser (i.e. text that isn't in tags)? I've got a
    JavaScript onclick handler whose code includes an ampersand, and the
    HTML validator complains. I don't know if I should escape the
    ampersand, or even if it's possible (seeing that the text is inside an
    HTML attribute).

    Why would you ever use htmlentities as opposed to htmlspecialchars? The
    only reason I can think of is if your page's charset doesn't support
    the special character you're trying to render (for example, the euro
    using Latin1), but then why wouldn't you just change the page's charset
    to UTF-8 (unless your editor can't save in UTF-8, which might
    indicate it's time to get another editor)? The comment on the PHP manual
    entry for htmlentities, 'Please, don't use htmlentities to avoid XSS!
    Htmlspecialchars is enough!' seems to suggest that the uses for
    htmlentities are limited, since it needn't be used to avoid XSS.

    2) A comment in the PHP manual entry for htmlentities states that their
    function can be used to 'replace any characters in a string that could
    be 'dangerous' to put in an HTML/XML file with their numeric entities
    (e.g. &#233 for [e acute])'. Why would it be dangerous!?

    3) What are some typical uses of specifying HTTP input/output character
    encoding? If it is used to convert output, why wouldn't you just change
    the output page's char encoding? If it's used to convert input from say
    UTF-8 to Latin1, couldn't you just use a function to do this?

    That's about it!

    Thanks in advance

    Taras

  • flamer die.spam@hotmail.com

    #2
    Re: htmlentities & charencoding


    Taras_96 wrote:
    >Hi all,
    >
    >I was hoping to get some clarification on a couple of questions I have:
    >
    >1) When should htmlspecialchars be used? As a general rule, should
    >it be used for text that may contain special characters and is going
    >to be rendered in the browser (i.e. text that isn't in tags)? I've got a
    >JavaScript onclick handler whose code includes an ampersand, and the
    >HTML validator complains. I don't know if I should escape the
    >ampersand, or even if it's possible (seeing that the text is inside an
    >HTML attribute).

    Well, basically you're either showing the user the literal text
    "copyrightsymbol" or giving the browser an instruction to display a
    copyright symbol. I think the "dangerous" comment comes from the fact
    that characters will often simply come out blank on MS platforms when
    they display correctly on *nix, and that when an undefined notation is
    used in a page, it is not known what the effect will be on some
    platforms or how it will be displayed.

    Flamer.


    • Geoff Berrow

      #3
      Re: htmlentities & charencoding

      Message-ID: <1152576115.197347.115450@35g2000cwc.googlegroups.com> from
      Taras_96 contained the following:
      >1) When should htmlspecialchars be used? As a general rule, should
      >it be used for text that may contain special characters and is going
      >to be rendered in the browser (i.e. text that isn't in tags)? I've got a
      >JavaScript onclick handler whose code includes an ampersand, and the
      >HTML validator complains.
      The people without javascript will complain too, when they can't
      navigate your site.

      Just change the ampersand for &amp;
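
As a rough sketch of that substitution (JavaScript, for illustration only; the function names and the sample handler are hypothetical and not from this thread): the handler code is escaped before it is written into the onclick attribute, and the HTML parser turns &amp; back into & before the script engine ever sees it.

```javascript
// Escape the characters that are significant inside a double-quoted HTML
// attribute. The ampersand is replaced first so that the references
// produced by the later replacements are not escaped a second time.
function escapeAttribute(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Simplified model of what an HTML parser does to the attribute value
// before handing it to the script engine. Only the four references above
// are handled, and &amp; is decoded last to avoid double decoding.
function decodeAttribute(value) {
  return value
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, "&");
}

const handler = "if (a && b) { go(); }"; // hypothetical handler code
const escaped = escapeAttribute(handler);
const markup = '<a href="#" onclick="' + escaped + '">Go</a>';

console.log(markup);
console.log(decodeAttribute(escaped) === handler);
```

The attribute in the generated markup contains &amp;&amp;, which is what a validator expects, while the script that actually runs is the original handler text.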
      --
      Geoff Berrow (put thecat out to email)
      It's only Usenet, no one dies.
      My opinions, not the committee's, mine.
      Simple RFDs http://www.ckdog.co.uk/rfdmaker/


      • Jerry Stuckle

        #4
        Re: htmlentities & charencoding

        Taras_96 wrote:
        >Hi all,
        >
        >I was hoping to get some clarification on a couple of questions I have:
        >
        >1) When should htmlspecialchars be used? As a general rule, should
        >it be used for text that may contain special characters and is going
        >to be rendered in the browser (i.e. text that isn't in tags)? I've got a
        >JavaScript onclick handler whose code includes an ampersand, and the
        >HTML validator complains. I don't know if I should escape the
        >ampersand, or even if it's possible (seeing that the text is inside an
        >HTML attribute).

        Well, I haven't looked at the code, but I suspect htmlspecialchars()
        would be faster, since it converts fewer characters and has fewer
        options.

        The HTML validator on w3.org is decent, but it doesn't handle javascript
        very well. I just ignore the errors in javascript; for instance,
        something like:

        j=4&i;

        The "&i" is not a valid html entity - but it's valid javascript code.
        And this javascript wouldn't work:

        j = 4&amp;i;

        >Why would you ever use htmlentities as opposed to htmlspecialchars? The
        >only reason I can think of is if your page's charset doesn't support
        >the special character you're trying to render (for example, the euro
        >using Latin1), but then why wouldn't you just change the page's charset
        >to UTF-8 (unless your editor can't save in UTF-8, which might
        >indicate it's time to get another editor)? The comment on the PHP manual
        >entry for htmlentities, 'Please, don't use htmlentities to avoid XSS!
        >Htmlspecialchars is enough!' seems to suggest that the uses for
        >htmlentities are limited, since it needn't be used to avoid XSS.

        Just changing the page charset doesn't change what PHP uses. You can
        pass a charset to either function, but if you need more than the five
        chars handled by htmlspecialchars() you need to use htmlentities().
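
To make the difference concrete, here is a minimal JavaScript sketch of the two behaviours. It is an illustration, not PHP's implementation: the function names are hypothetical, and where htmlentities() emits named entities such as &eacute;, the sketch uses numeric character references (as in the manual comment quoted above) to stay short.

```javascript
// The five markup-significant characters referred to above.
const SPECIAL = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#039;",
};

// Escape only those five characters; everything else passes through,
// so the result is only safe if the page charset can represent it.
function escapeSpecial(text) {
  return text.replace(/[&<>"']/g, (ch) => SPECIAL[ch]);
}

// Additionally replace every non-ASCII character with a numeric character
// reference, so the output is plain ASCII whatever charset the page declares.
function escapeAll(text) {
  let out = "";
  for (const ch of escapeSpecial(text)) {
    const code = ch.codePointAt(0);
    out += code > 127 ? "&#" + code + ";" : ch;
  }
  return out;
}

// Sample text: "café" followed by a bold, quoted price with a euro sign.
const sample = 'caf\u00e9 <b>"5 \u20ac"</b>';
console.log(escapeSpecial(sample));
console.log(escapeAll(sample));
```

With the first function the accented letter and the euro sign are sent as they are, so the declared charset has to match the bytes; with the second only ASCII leaves the server.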

        And the notes are comments - from users, not the PHP developers. I give
        it some credence, but not as much as the "official" word from the PHP
        developers. And if you look through them enough, you'll find errors and
        other people who get in and correct the errors. Not that much different
        than what you find here on usenet.

        >2) A comment in the PHP manual entry for htmlentities states that their
        >function can be used to 'replace any characters in a string that could
        >be 'dangerous' to put in an HTML/XML file with their numeric entities
        >(e.g. &#233 for [e acute])'. Why would it be dangerous!?

        Don't know here, but I suspect browsers may act differently in different
        languages. But I have enough trouble with my native language, so I
        really haven't worried about it. But again that's a user comment.

        >3) What are some typical uses of specifying HTTP input/output character
        >encoding? If it is used to convert output, why wouldn't you just change
        >the output page's char encoding? If it's used to convert input from say
        >UTF-8 to Latin1, couldn't you just use a function to do this?

        I use it anytime I'm displaying data input by the user, read from a
        database, etc. You never know when the data might contain a '<', a '"',
        etc.

        Changing the char encoding for the page doesn't convert any characters.
        All it does is tell the browser how to handle the characters. It's up
        to you, the programmer, to ensure the character encoding you use matches
        that of the page.
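
The point that the declared charset is only a label can be sketched like this (Node.js, for illustration; the sample string and variable names are assumptions): the same bytes decode to different text depending on the label, and converting the data is a separate, explicit step.

```javascript
// Sample text "café", written with an escape so the source file's own
// encoding does not matter.
const text = "caf\u00e9";

// What a script working in UTF-8 would send over the wire.
const utf8Bytes = Buffer.from(text, "utf8");

// Correct label: decoding the bytes as UTF-8 gives the original text back.
const asUtf8 = utf8Bytes.toString("utf8");

// Wrong label: the same bytes decoded as ISO-8859-1 turn the two-byte
// sequence for the accented letter into two separate characters.
const asLatin1 = utf8Bytes.toString("latin1");

// Actually converting the data to ISO-8859-1 produces different bytes.
const latin1Bytes = Buffer.from(text, "latin1");

console.log(utf8Bytes.length, latin1Bytes.length);
console.log(asUtf8 === text, asLatin1 === text);
```

Relabelling the page changes only which of the two decodings the browser performs; the bytes stay the same until the program converts them.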


        --
        ==================
        Remove the "x" from my email address
        Jerry Stuckle
        JDS Computer Training Corp.
        jstucklex@attglobal.net
        ==================


        • Mel

          #5
          Re: htmlentities & charencoding

          On 2006-07-11 21:52:53 +1000, Jerry Stuckle <jstucklex@attglobal.net> said:
          >The HTML validator on w3.org is decent, but it doesn't handle
          >javascript very well. I just ignore the errors in javascript; for
          >instance, something like:
          >
          >j=4&i;
          >
          >The "&i" is not a valid html entity - but it's valid javascript code.
          >And this javascript wouldn't work:
          >
          >j = 4&amp;i;

          No, it wouldn't, but valid XHTML _requires_ you to precede the
          embedded JavaScript with the appropriate CDATA marker. The character
          '&' is reserved by the markup just like '>' and '<'. Not adhering to
          the outlined standards simply encourages bad markup and makes
          cross-browser compatibility more difficult. It's a big stretch to
          equate cross-browser issues with unencoded ampersands, but it's not
          that difficult to deal with. Javascript has some functional string
          methods for encoding HTML entities.


          • Jerry Stuckle

            #6
            Re: htmlentities & charencoding

            Mel wrote:
            >No, it wouldn't, but valid XHTML _requires_ you to precede the embedded
            >JavaScript with the appropriate CDATA marker. The character '&' is
            >reserved by the markup just like '>' and '<'. Not adhering to the
            >outlined standards simply encourages bad markup and makes cross-browser
            >compatibility more difficult. It's a big stretch to equate cross-browser
            >issues with unencoded ampersands, but it's not that difficult to deal
            >with. Javascript has some functional string methods for encoding HTML
            >entities.

            Who said anything about XHTML? This is straight html.

            And the point is - this is valid javascript, but the validator on w3.org
            doesn't recognize it as such. Therefore it spits out errors where there
            are none.



            --
            ==================
            Remove the "x" from my email address
            Jerry Stuckle
            JDS Computer Training Corp.
            jstucklex@attglobal.net
            ==================


            • Andy Hassall

              #7
              Re: htmlentities & charencoding

              On Tue, 11 Jul 2006 17:36:20 -0400, Jerry Stuckle <jstucklex@attglobal.net>
              wrote:
              >Who said anything about XHTML? This is straight html.
              >
              >And the point is - this is valid javascript, but the validator on w3.org
              >doesn't recognize it as such. Therefore it spits out errors where there
              >are none.

              Yes, this seems to be backed up by HTML 4.01 appendix B.3.2, which even has an
              example of the contents of a <script> element in VBScript using & as a string
              concatenation operator.

              It discusses how to avoid accidentally closing the <script> element, but seems
              to indicate that & doesn't start a character reference inside <script>, as
              that's automatically CDATA. So validators producing errors in this case would
              appear to be wrong.

              However, validator.w3.org currently handles the example given without error. I
              uploaded the following:

              <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Strict //EN"
              "http://www.w3.org/TR/html4/strict.dtd">
              <html>
              <head>
              <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15">
              <title>Page</title>
              </head>
              <body>

              <script type="text/javascript">
              j=4&i;
              </script>

              </body>
              </html>

              It responded:

              This Page Is Valid -//W3C//DTD HTML 4.01 Strict //EN!

              (It also validates as Transitional, unsurprisingly.) Has its behaviour changed
              recently? Did it use to produce errors in this case?

              The "HTML Tidy" validator as used in the HTML Validator Firefox extension also
              accepts & within <script> without complaint, and correctly complains about "</"
              appearing in the script source.
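
As a small sketch of the "</" rule (JavaScript, for illustration; the function names and the sample script are hypothetical): in HTML 4.01 the first "</" inside a <script> element ends its content, whereas & needs no treatment there. Appendix B.3.2 suggests writing "<\/" inside JavaScript string literals instead.

```javascript
// Return the index of every "</" sequence in a script body. Any hit before
// the real </script> end tag would close the element early in HTML 4.01.
function findEndTagOpeners(scriptSource) {
  const hits = [];
  let index = scriptSource.indexOf("</");
  while (index !== -1) {
    hits.push(index);
    index = scriptSource.indexOf("</", index + 2);
  }
  return hits;
}

// Escape the slash so the markup parser no longer sees "</". Inside a
// JavaScript string literal "<\/" still evaluates to "</". This blanket
// replacement assumes every "</" in the source sits in a string literal.
function protectScriptSource(scriptSource) {
  return scriptSource.replace(/<\//g, "<\\/");
}

// Sample script body: the & is harmless, the closing em tag is not.
const risky = 'j = 4 & i; document.write("<em>hi</em>");';
const hits = findEndTagOpeners(risky);
const safe = protectScriptSource(risky);

console.log(hits.length);
console.log(safe);
```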

              --
              Andy Hassall :: andy@andyh.co.uk :: http://www.andyh.co.uk
              http://www.andyhsoftware.co.uk/space :: disk and FTP usage analysis tool


              • Jerry Stuckle

                #8
                Re: htmlentities & charencoding

                Andy,

                They might have fixed it. I hope so. I've had problems with it before.
                I just ignore any errors within <script> elements.


                --
                ==================
                Remove the "x" from my email address
                Jerry Stuckle
                JDS Computer Training Corp.
                jstucklex@attglobal.net
                ==================
