Input Character Set Handling

  • Kulgan

    Input Character Set Handling

    Hi

    I am struggling to find definitive information on how IE 5.5, 6 and 7
    handle character input (I am happy with the display of text).


    I have two main questions:


    1. Does IE automatically convert text input in HTML forms from the
    native character set (e.g. SJIS, 8859-1, etc.) to UTF-8 prior to
    sending the input back to the server?

    2. Does IE Javascript do the same? So if I write a Javascript function
    that compares a UTF-8 string to a string that a user has inputted into
    a text box, will IE convert the user's string into UTF-8 before doing
    the comparison?


    I think that the answer to question 1 is probably "YES", but I cannot
    find any information on question 2!


    Many thanks for your help


    Kulgan.

  • Bart Van der Donck

    #2
    Re: Input Character Set Handling

    Kulgan wrote:
    1. Does IE automatically convert text input in HTML forms from the
    native character set (e.g. SJIS, 8859-1, etc.) to UTF-8 prior to
    sending the input back to the server?
    With <form method="get">, the browser tries to pass the characters
    to the server in the character set of the page, but it will only
    succeed if the characters in question can be represented in that
    character set. If not, browsers calculate "their best bet" based on
    what's available (old style) or use a Unicode set (new style).

    Example: western browsers send 'é' as '%E9' by default (URL encoding).
    But when the page is in UTF-8, the browser will first look up the
    UTF-8 multibyte encoding of 'é'. In this case, it is 2 bytes, because
    'é' lies in the code point range 128-2047 that UTF-8 encodes as two
    bytes. Those two bytes, 0xC3 and 0xA9, correspond to 'Ã' and '©' in
    ISO-8859-1, and will result in '%C3%A9' (URL encoding) in the
    eventual query string.
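
    You can see the same difference from script (a minimal sketch;
    encodeURIComponent always produces the UTF-8 octets, while the older
    escape() uses the single Latin-1 byte for this range):

    // '\u00E9' is 'é' written as a Unicode escape sequence
    alert(escape('\u00E9')); // "%E9" - old-style, one Latin-1 byte
    alert(encodeURIComponent('\u00E9')); // "%C3%A9" - two UTF-8 octets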

    <form method="post" enctype="application/x-www-form-urlencoded"> is
    the same as <form method="post"> and uses the same general principle
    as GET.

    In <form method="post" enctype="multipart/form-data"> there is no
    default encoding at all, because this encoding type needs to be able to
    transfer non-base64-ed binaries. 'é' will be passed as 'é' and that's
    it.
    2. Does IE Javascript do the same? So if I write a Javascript function
    that compares a UTF-8 string to a string that a user has inputted into
    a text box, will IE convert the user's string into UTF-8 before doing
    the comparison?
    Browsers only encode form values between the moment that the user
    submits the form and the moment that the new POST/GET request is made.
    You should have no problem using any Unicode characters in javascript
    as long as you haven't sent the form.
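
    For example (a minimal sketch; the field name 'city' is hypothetical,
    and the comparison happens on plain Unicode strings in memory, before
    any form encoding takes place):

    // compare the user's input against a string with non-ASCII characters
    var v = document.forms[0].elements['city'].value;
    if (v === 'Li\u00E8ge') { // 'Liège', with è written as a \u escape
      alert('match');
    }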

    Hope this helps,

    --
    Bart


    • Kulgan

      #3
      Re: Input Character Set Handling

      Browsers only encode form values between the moment that the user
      submits the form and the moment that the new POST/GET request is made.
      You should have no problem using any Unicode characters in javascript
      as long as you haven't sent the form.
      >
      Thanks for the helpful info.

      On the Javascript subject, if the user's input character set is not
      UTF-8 (e.g. it is the Japanese SJIS set), but the page character set is
      UTF-8, how does Javascript see the characters? Does the browser do an
      SJIS to UTF-8 conversion on the characters before they are used (e.g.
      to find the length of the string?)

      Thanks,

      Kulgan.


      • VK

        #4
        Re: Input Character Set Handling

        Kulgan wrote:
        2. Does IE Javascript do the same? So if I write a Javascript function
        that compares a UTF-8 string to a string that a user has inputted into
        a text box, will IE convert the user's string into UTF-8 before doing
        the comparison?
        That is confusion inspired by Unicode, Inc. and W3C (I'm wondering
        rather often if they have any clue at all about Unicode).

        Unicode is a *charset* : a set of characters where each character unit
        is represented by two bytes (taking the original Unicode 16-bit
        encoding). At the same time the TCP/IP protocol is an 8-bit medium: its
        atomic unit is one byte. This way one cannot directly send Unicode
        entities over the Internet: the same way as you cannot place a 3D box
        on a sheet of paper, you can only emulate it (making its 2D
        projection). So it is necessary to use some 8-bit *encoding* algorithm
        to split Unicode characters into sequences of bytes, send them over
        the Internet and glue them back together on the other end. Here UTF-8
        *encoding* (not *charset*) comes into play. By some special algorithm
        it encodes Unicode characters into base ASCII sequences and sends them
        to the recipient. The recipient - informed in advance by the
        Content-Type header of what is coming - uses a UTF-8 decoder to get
        back the original Unicode characters.
        The Fact Number One unknown to the majority of specialists, including
        the absolute majority of W3C volunteers - so feel yourself a chosen
        one :-) -
        Pragma <?xml version="1.0" encoding="utf-8"?> which one sees left and
        right in XML and pseudo-XHTML documents *does not* mean that this
        document is in UTF-8 encoding. It means that the document is in Unicode
        charset and it must be transmitted (if needed) over an 8-bit medium
        using the UTF-8 encoding algorithm. Respectively, if the document is
        not using the Unicode charset then you are making a false statement,
        with numerous nasty outcomes pending if it is ever used on the
        Internet.
        Here is even more secret knowledge, shared between myself and Sir
        Berners-Lee only :-) -
        <meta http-equiv="content-type" content="text/html; charset=UTF-8">
        *does not* mean that the characters you see on your screen are in the
        "UTF-8 charset" (there is no such thing). It means: "The input stream
        was declared as Unicode charset characters encoded using the UTF-8
        transport encoding. The result you are seeing (if seeing anything) is
        the result of decoding the input stream using a UTF-8 decoder".
        "charset" term here is totally misleading one - it remained from the
        old times with charsets of 256 entities maximum thus encoding matching
        charset and vice versa. The proper header W3C should insist on is
        ....content="te xt/html; charset=Unicode ; encoding=UTF-8"
        As I said before, very few people on Earth know the truth, and the Web
        did not collapse so far for two main reasons:
        1) The Content-Type header sent by the server takes precedence over
        the META tag on the page. This HTTP standard is one of the most
        valuable ones left to us by the fathers. They saw the ignorance ruling
        in advance, so they left the chance to server admins to save the
        world :-)
        2) All modern UA's have special heuristics built in to sort out real
        UTF-8 input streams from authors' mistakes. A note for the
        "Content-Type in my heart" adepts: it means that over the last years a
        great amount of viewer-dependent XML/XHTML documents was produced.

        Sorry for such an extremely long preface, but I considered it
        dangerous to just keep giving "short fix" advice: it is fighting the
        symptoms instead of the sickness. And the sickness is growing
        worldwide: our helpdesk is flooded with requests like "my document is
        in UTF-8 encoding, why..." etc.

        Coming back to your original question: the page will be either Unicode
        or ISO-8859-1 or something else: but it *never* will be UTF-8: UTF-8
        exists only during the transmission and parsing stages. The maximum one
        can do is to have UTF-8 encoded characters right in the document like
        %D0%82... But in such a case it is just raw UTF-8 source represented
        using the ASCII charset.
        From the other side, JavaScript operates with Unicode only and it sees
        the page content "through the window of Unicode" no matter what the
        actual charset is. So to reliably compare user input / node values with
        JavaScript strings you have to:
        1) The most reliable one for an average-small amount of non-ASCII
        characters:
        Use \u Unicode escape sequences (see the sketch after these options).

        2) Less reliable, as it can be easily smashed once opened in a
        non-Unicode editor:
        Have the entire .js file in Unicode with non-ASCII characters typed as
        they are, and your server sending the file in the UTF-8 encoding.
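
        A minimal sketch of option 1 (the escaped and the literal string are
        identical to JavaScript; only the first survives any file encoding):

        // 'été' written with \u escape sequences - safe in an ASCII-only file
        var s1 = '\u00E9t\u00E9';
        // the same string typed literally - only safe if the .js file is
        // saved and served in the right encoding
        var s2 = 'été';
        alert(s1 === s2); // true when the file encoding survived intact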

        P.S. There is a whole other issue which could be named "How do I
        handle Unicode 32-bit characters, or How did Unicode, Inc. screw the
        whole world". But your primary question is answered, and it's beer
        time anyway. :-)


        • Bart Van der Donck

          #5
          Re: Input Character Set Handling

          Kulgan wrote:
          [...]
          On the Javascript subject, if the user's input character set is not
          UTF-8 (e.g. it is the Japanese SJIS set), but the page character set is
          UTF-8, how does Javascript see the characters?
          Always the same, as their Unicode code points.
          Does the browser do an SJIS to UTF-8 conversion on the characters
          before they are used (e.g. to find the length of the string?)
          No conversion/encoding is possible on that level. I think you're not
          fully aware of the distinction between
          (1) the user's (available) charsets
          (2) the charset of the web page
          (3) how javascript handles characters internally

          Only (3) is of importance in your case:

          Paste into input field:<br>
          ヤツカ
          <hr>
          <form>
          <input name="i">
          <input type="button" value="check" onClick="
          if (document.forms[0].i.value == '\uFF94\uFF82\uFF76') {
          alert('equal') }
          else {
          alert('not equal')
          }
          ">
          </form>

          Note that it doesn't matter whether the user has SJIS installed. It
          also doesn't matter what the charset of the page is.

          --
          Bart


          • Bart Van der Donck

            #6
            Re: Input Character Set Handling

            VK wrote:
            [...]
            Unicode is a *charset* : a set of characters where each character unit
            is represented by two bytes (taking the original Unicode 16-bit
            encoding).
            [...]
            I wouldn't put it that way. Some Unicode characters consist of 2 bytes,
            yes, but Unicode's primary idea is the multi-byte concept; characters
            can also consist of 1 byte, or more than 2.

            --
            Bart


            • VK

              #7
              Re: Input Character Set Handling


              Bart Van der Donck wrote:
              [...]
              Unicode is a *charset* : a set of characters where each character unit
              is represented by two bytes (taking the original Unicode 16-bit
              encoding).
              [...]
              I wouldn't put it that way. Some Unicode characters consist of 2 bytes,
              yes, but Unicode's primary idea is the multi-byte concept; characters
              can also consist of 1 byte, or more than 2.
              I humbly disagree: the very original Unicode idea is that 8 bits
              cannot accommodate all character codes for all characters
              currently used in the world. This way it was an obvious idea to
              use a two-byte encoding with 65,536 possible character units: to
              represent all *currently used* systems of writing. While some
              Far East systems (Hangul, Traditional Chinese) would be a space
              challenge - the majority of other systems are based on the
              Phoenician phonetic alphabet (> Greek > Latin > others) so they
              are relatively very compact. This way 65,536 storage units were
              more than generous for the task.
              From the other end, at the moment the project started, US
              English (base ASCII) texts were absolutely prevailing in
              transmissions, so the task was not to double the HTTP traffic
              with useless 0x00 bytes. To avoid that it was decided that the
              bytes 0-127 would be treated literally as base ASCII characters
              and anything 128-255 would be treated as the beginning of a
              double-byte Unicode sequence. Alas, it meant that 0x8000 -
              0xFFFF (a good half of the table) would be unusable. Luckily
              Pike and Thompson found a way of economic, unambiguous
              transmission of any characters in the 0-65535 range meeting the
              core requirement: do not double the traffic for Unicode-encoded
              base-ASCII characters. This algorithm - later called UTF-8 -
              went into wide production. It doesn't mean that the English "A"
              is represented with a single byte in Unicode: it means that the
              Unicode double-byte character 0x0041 (Basic Latin LATIN CAPITAL
              LETTER A) has a universally recognized single-byte shortcut 0x41.
              That would be a happy ending, but unfortunately Unicode, Inc.
              treated the 65,536 storage places as a teenager would treat his
              first credit card - thus blowing it on the first occasion
              without thinking of the consequences. Any shyster coming with
              any kind of crap tables was immediately welcomed and accounted
              for. This way Unicode, Inc. started to work on the "first came -
              first got" basis and the original idea of "all currently used
              charsets" was seamlessly transformed into "all symbolic systems
              ever used for any purposes by the human civilization". Well,
              predictably for language specialists - but surprisingly for the
              Unicode, Inc. amateurs - it appeared that humanity has produced
              a countless amount of systems to denote sounds, syllables,
              words, ideas, musical sounds, chemical elements and an endless
              amount of other material and spiritual entities. This way they
              spent all the available storage space on rarely used crap before
              even fixing the place for such "minor" issues as Chinese or
              Japanese. As the result they had to go from a 2-byte system to a
              3-byte system, and now they seem to be exploring the storage
              space of a 4-byte system. And this is even without yet touching
              Egyptian hieratic/demotic and all variants of cuneiform. And
              there is no one so far to come, send the fn amateurs to hell and
              bring the Unicode system to order.

              Go and say "Unicode" to any Java team guy (unlike with
              "Candyman", one time will suffice :-) and then run away quickly
              before he starts beating you.

              Yes, I am biased on the matter: I hate "volunteers" assured that
              whatever they are doing is right just because they are doing it
              for free (and seemingly for free).


              • Michael Winter

                #8
                Re: Input Character Set Handling

                VK wrote:
                Kulgan wrote:
                >2. Does IE Javascript do the same? So if I write a Javascript
                >function that compares a UTF-8 string to a string that a user has
                >inputted into a text box, will IE convert the user's string into
                >UTF-8 before doing the comparison?
                >
                That is confusion inspired by Unicode, Inc. and W3C (I'm wondering
                rather often if they have any clue at all about Unicode).
                Oh, here we go.
                Unicode is a *charset* ...
                It's a character encoding: characters are encoded as an
                integer within a certain "codespace", namely the range
                0..10FFFF. There are then "encoding forms" that transform
                values in this range to "code units", specifically the three
                Unicode Transformation Formats, UTF-8, -16, and -32. These
                code units can be used to store or transport sequences of
                "encoded characters". The "encoding scheme" (which includes
                big- and little-endian forms for UTF-16 and -32) defines
                precisely how each form is serialised into octets.

                [snip]
                Here UTF-8 *encoding* (not *charset*) comes into play. By some
                special algorithm it encodes Unicode characters into base
                ASCII sequences and sends them to the recipient.
                Whilst some encoded characters will map directly to ASCII (specifically
                the Unicode code points, 0..7F), most won't. For a start, ASCII is a
                7-bit encoding (128 characters in the range 0..7F), whereas UTF-8 is an
                8-bit, variable-width format.

                The word you are looking for is "octet".

                [snip]
                Pragma <?xml version="1.0" encoding="utf-8"?>
                It is the XML declaration and takes the form of a processing instruction.
                ... *does not* mean that this document is in UTF-8 encoding.
                That depends on what you mean by "in UTF-8 encoding". If you meant
                "serialised using the UTF-8 encoding scheme", then that's precisely what
                it means. However, it is unnecessary to include an XML declaration for
                documents that use either the UTF-8 or -16 encoding form (see 4.3.3
                Character Encoding in Entities).
                It means that the document is in Unicode charset ...
                All XML documents (and HTML, for that matter) use the Unicode
                repertoire. The issue is the form in which the document is transported.
                Should a higher protocol not signal the encoding form in use (UTF-8,
                ISO-8859-1, etc.) then the XML declaration serves that purpose.

                [snip]
                Coming back to your original question: the page will be either Unicode
                or ISO-8859-1 or something else: but it *never* will be UTF-8: UTF-8
                exists only during the transmission and parsing stages.
                UTF-8 can be used any time the document needs to be serialised
                into a sequence of octets. Therefore, a document might be
                stored on disk using UTF-8, and then transmitted verbatim
                across a network.

                [snip]

                Mike


                • Jim Land

                  #9
                  Re: Input Character Set Handling

                  "Bart Van der Donck" <bart@nijlen.co mwrote in
                  news:1163177593 .704196.278080@ h48g2000cwc.goo glegroups.com:
                  Paste into input field:<br>
                  ヤツカ
                  <hr>
                  <form>
                  <input name="i">
                  <input type="button" value="check" onClick="
                  if (document.forms[0].i.value == '\uFF94\uFF82\uFF76') {
                  alert('equal') }
                  else {
                  alert('not equal')
                  }
                  ">
                  </form>
                  Not equal.

                  2 Paste ヤ
                  if (document.forms[0].i.value == '\uFF94;')
                  Not equal

                  3 Paste ヤ
                  if (document.forms[0].i.value == 'ヤ')
                  Not equal

                  4 Paste &amp;
                  if (document.forms[0].i.value == '&amp;')
                  Not equal

                  5 Paste abc
                  if (document.forms[0].i.value == 'abc')
                  Equal

                  6 Paste &
                  if (document.forms[0].i.value == '&')
                  Equal

                  7 Paste &
                  if (document.forms[0].i.value == '&#38;') //ascii decimal
                  Equal

                  8 Paste &
                  if (document.forms[0].i.value == '\x26') //ascii hex
                  Equal

                  9 Paste &
                  if (document.forms[0].i.value == '\46') //ascii octal
                  Equal

                  10 Paste &
                  if (document.forms[0].i.value == '\u0026') //unicode
                  Equal

                  11 Paste &
                  if (document.forms[0].i.value == '&amp;') //html character entity
                  Equal

                  Are the following conclusions correct?

                  1. When a single character is typed in an input box, Javascript can
                  correctly recognize it as itself, as its ascii code (decimal, hex, or
                  octal), as its unicode, or as its html character entity.

                  2. However, Javascript does *not* correctly recognize a character entered
                  by typing its ascii code, unicode, or html character entity into a text
                  box.


                  • Kulgan

                    #10
                    Re: Input Character Set Handling

                    On the Javascript subject, if the user's input character set is not
                    UTF-8 (e.g. it is the Japanese SJIS set), but the page character set is
                    UTF-8, how does Javascript see the characters?
                    >
                    Always the same, as their Unicode code points.
                    >
                    Many thanks for the advice. I am starting to get an understanding of
                    what is going on now!! Are you saying that if the user's Windows
                    character set is not Unicode that Javascript sees characters inputted
                    into text boxes as Unicode? Or are modern Windows (XP) installations
                    always Unicode for data input anyway??

                    Can of worms...!

                    Kulgan.


                    • Bart Van der Donck

                      #11
                      Re: Input Character Set Handling

                      Jim Land (NO SPAM) wrote:
                      "Bart Van der Donck" <bart@nijlen.co mwrote in
                      news:1163177593 .704196.278080@ h48g2000cwc.goo glegroups.com:
                      Posts like yours are dangerous; Google Groups displays
                      html char/num entities where you haven't typed them and
                      vice versa. I can imagine that most news readers will
                      have trouble with it too; that's why I've put some work
                      into restricting my previous post to ISO-8859-1 so
                      everybody sees it correctly.
                      Paste into input field:<br>
                      ヤツカ
                      <hr>
                      <form>
                      <input name="i">
                      <input type="button" value="check" onClick="
                      if (document.forms[0].i.value == '\uFF94\uFF82\uFF76') {
                      alert('equal') }
                      else {
                      alert('not equal')
                      }
                      ">
                      </form>
                      Not equal.
                      >
                      2 Paste ヤ
                      if (document.forms[0].i.value == '\uFF94;')
                      Not equal
                      >
                      3 Paste ヤ
                      if (document.forms[0].i.value == 'ヤ')
                      Not equal
                      >
                      4 Paste &amp;
                      if (document.forms[0].i.value == '&amp;')
                      Not equal
                      >
                      5 Paste abc
                      if (document.forms[0].i.value == 'abc')
                      Equal
                      >
                      6 Paste &
                      if (document.forms[0].i.value == '&')
                      Equal
                      >
                      7 Paste &
                      if (document.forms[0].i.value == '&#38;') //ascii decimal
                      Equal
                      >
                      8 Paste &
                      if (document.forms[0].i.value == '\x26') //ascii hex
                      Equal
                      >
                      9 Paste &
                      if (document.forms[0].i.value == '\46') //ascii octal
                      Equal
                      >
                      10 Paste &
                      if (document.forms[0].i.value == '\u0026') //unicode
                      Equal
                      >
                      11 Paste &
                      if (document.forms[0].i.value == '&amp;') //html character entity
                      Equal
                      I suppose your testing results should be fine, two thoughts:
                      - beware of leading/trailing spaces when you copy/paste
                      - (document.forms[0].i.value == '\uFF94;') doesn't equal because the
                      semicolon shouldn't be there
                      Are the following conclusions correct?
                      >
                      1. When a single character is typed in an input box, Javascript can
                      correctly recognize it as itself,
                      Yes.
                      as its ascii code (decimal, hex, or octal),
                      Yes, but only when it's an ASCII character (which is nowadays too
                      narrow to work with).
                      as its unicode,
                      Yes.
                      or as its html character entity.
                      I'd say this is a bridge too far; there might be browser
                      dependencies when it comes to num/char entity handling
                      in forms. I would tend not to rely too much on this kind
                      of stuff.
                      2. However, Javascript does *not* correctly recognize a character entered
                      by typing its ascii code, unicode, or html character entity into a text
                      box.
                      Correct by definition; e.g. when you type "\x41", it
                      will be treated as "\x4" and not as "A", because you
                      typed "\x4" and not "A" :-) But it's possible to write a
                      script to modify such behaviour.
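
                      A minimal sketch of such a script (unescapeInput is a
                      hypothetical helper; it only converts \uXXXX sequences
                      typed literally by the user):

                      // turn a typed "\uXXXX" sequence into the real character
                      function unescapeInput(s) {
                        return s.replace(/\\u([0-9A-Fa-f]{4})/g, function (m, hex) {
                          return String.fromCharCode(parseInt(hex, 16));
                        });
                      }
                      alert(unescapeInput('\\u0041')); // "A" - six typed characters become one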

                      --
                      Bart


                      • Bart Van der Donck

                        #12
                        Re: Input Character Set Handling

                        Kulgan wrote:
                        Many thanks for the advice. I am starting to get an understanding of
                        what is going on now!! Are you saying that if the user's Windows
                        character set is not Unicode that Javascript sees characters inputted
                        into text boxes as Unicode?
                        Yes, always.
                        Or are modern Windows (XP) installations always Unicode for data
                        input anyway??
                        I'm not sure of that, but it doesn't matter here. You
                        can input whatever you want from any charset on any OS
                        using any decent browser. Javascript will always
                        handle it internally as Unicode code points; each
                        javascript implementation is built that way.
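
                        For instance (a minimal sketch, reusing the form from
                        the earlier posts; charCodeAt reports the Unicode
                        value of each character no matter what the page
                        charset is):

                        // inspect what javascript actually sees in the field
                        var v = document.forms[0].i.value;
                        for (var i = 0; i < v.length; i++) {
                          alert(v.charCodeAt(i).toString(16)); // e.g. "ff94" for ヤ
                        }
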
                        Can of worms...!
                        True, but with some basic rules and a lot of common sense, most
                        situations can be dealt with.

                        --
                        Bart


                        • VK

                          #13
                          Re: Input Character Set Handling

                          Oh, here we go.

                          Oh, here we go :-): someone is gonna teach me about
                          Unicode. For some reasons - which I'll decline to
                          disclose - it is funny to me, but go ahead anyway.
                          It's a character encoding: characters are encoded as
                          an integer within a certain "codespace", namely the
                          range 0..10FFFF.
                          Unicode is a charset (set of characters) with each
                          character unit represented by words (in the
                          programming sense) with the smallest word consisting
                          of 2 bytes (16 bits). This way the range doesn't go
                          from 0: there is no such character in Unicode.
                          Unicode starts from the character 0x0000. Again you
                          are thinking and talking about character entities,
                          bytes, Unicode and UTF-8 at once: which is not
                          helpful if one tries to understand the matter.
                          There are then
                          "encoding forms" that transform values in this range to "code units",
                          specifically the three Unicode Transformation Formats, UTF-8, -16, and
                          -32. These code units can be used to store or transport sequences of
                          "encoded characters". The "encoding scheme" (which includes big- and
                          little-endian forms for UTF-16 and -32) defines precisely how each form
                          is serialised into octets.
                          That is correct.

                          <snip>
                          For a start, ASCII is a
                          7-bit encoding (128 characters in the range 0..7F)
                          I prefer to use the old term lower-ASCII to refer to
                          the 0-127 part, where the 128-255 part is variable,
                          used for extra entities, and differs from one
                          charset to another. This way a more academically
                          correct term could be "IBM tables" and respectively
                          "lower part of IBM tables", but who remembers this
                          term now? "lower-ASCII" in the sense "0-127
                          characters" or "US ASCII" is good enough for the
                          matter.
                          whereas UTF-8 is an
                          8-bit, variable-width format.
                          Again you are mixing charsets and bytes. UTF-8 is a transport encoding
                          representing Unicode characters using "US ASCII" only character
                          sequences.
                          a document might be stored on disk using UTF-8, and
                          then transmitted verbatim across a network.
                          Technically well possible but for what reason?
                          (besides making a copy in another storage place).
                          Such a document is not viewable without a specially
                          written parser and not directly usable for the
                          Internet. So what would be the purpose of such a
                          document?


                          • Jim Land

                            #14
                            Re: Input Character Set Handling

                            "Bart Van der Donck" <bart@nijlen.co mwrote in
                            news:1163240155 .050373.25790@f 16g2000cwb.goog legroups.com:
                            Jim Land (NO SPAM) wrote:
                            >
                            >"Bart Van der Donck" <bart@nijlen.co mwrote in
                            >news:116317759 3.704196.278080 @h48g2000cwc.go oglegroups.com:
                            >
                            Posts like yours are dangerous; Google Groups
                            displays html char/num entities where you haven't
                            typed them and vice versa. I can imagine that most
                            news readers will have trouble with it too; that's
                            why I've put some work into restricting my
                            previous post to ISO-8859-1 so everybody sees it
                            correctly.
                            >
                            Thank you for pointing this out. For those reading
                            posts in a reader that mangles them, I have
                            clarified below by inserting spaces so the strings
                            cannot be rendered as special characters.
                            Paste into input field:<br>
                            ヤツカ \\ & # 65428; & # 65410; & # 65398;
                            <hr>
                            <form>
                            <input name="i">
                            <input type="button" value="check" onClick="
                            if (document.forms[0].i.value == '\uFF94\uFF82\uFF76') {
                            \\ \ u FF94 \ u FF82 \ u FF76
                            alert('equal') }
                            else {
                            alert('not equal')
                            }
                            ">
                            </form>
                            >Not equal.
                            >>
                            >2 Paste ヤ \\ & # 65428 ;
                            >if (document.forms[0].i.value == '\uFF94;') \\ \ u FF94 ;
                            >Not equal
                            >>
                            >3 Paste ヤ \\ & # 65428 ;
                            >if (document.forms[0].i.value == 'ヤ') \\ & # 65428 ;
                            >Not equal
                            >>
                            >4 Paste &amp; \\ & amp ;
                            >if (document.forms[0].i.value == '&amp;') \\ & amp ;
                            >Not equal
                            >>
                            >5 Paste abc \\ abc
                            >if (document.forms[0].i.value == 'abc') \\ abc
                            >Equal
                            >>
                            >6 Paste & \\ single character
                            >if (document.forms[0].i.value == '&') \\ single character
                            >Equal
                            >>
                            >7 Paste & \\ single character
                            >if (document.forms[0].i.value == '&#38;') // & # 38; ascii decimal
                            >Equal
                            >>
                            >8 Paste & \\ single character
                            >if (document.forms[0].i.value == '\x26') // \ x 26 ascii hex
                            >Equal
                            >>
                            >9 Paste & \\ single character
                            >if (document.forms[0].i.value == '\46') // \ 46 ascii octal
                            >Equal
                            >>
                            >10 Paste & \\ single character
                            >if (document.forms[0].i.value == '\u0026') // \ u 0026 unicode
                            >Equal
                            >>
                            >11 Paste & \\ single character
                            >if (document.forms[0].i.value == '&amp;')
                            // & amp ; html character entity
                            >Equal
                            >
                            I suppose your testing results should be fine, two thoughts:
                            - beware of leading/trailing spaces when you copy/paste
                            - (document.forms[0].i.value == '\uFF94;') doesn't equal because the
                            semicolon shouldn't be there
                            Thanks, my typo. But still not equal when semicolon is removed.
                            >
                            >Are the following conclusions correct?
                            >>
                            >1. When a single character is typed in an input box, Javascript can
                            >correctly recognize it as itself,
                            >
                            Yes.
                            >
                            >as its ascii code (decimal, hex, or octal),
                            >
                            Yes, but only when it's an ASCII character (which is nowadays too
                            narrow to work with).
                            >
                            >as its unicode,
                            >
                            Yes.
                            >
                            >or as its html character entity.
                            >
                            I'd say this is a bridge too far; there might be
                            browser dependencies when it comes to num/char
                            entity handling in forms. I would tend not to rely
                            too much on this kind of stuff.
                            >
                            >2. However, Javascript does *not* correctly recognize a character
                            >entered by typing its ascii code, unicode, or html character entity
                            >into a text box.
                            >
                            Correct by definition; e.g. when you type "\x41",
                            it will be treated as "\x4" and not as "A",
                            because you typed "\x4" and not "A" :-) But it's
                            possible to write a script to modify such
                            behaviour.
                            >
                            I believe you meant, 'when you type "\x41" (\ x 41),
                            it will be treated as "\x41" (\ x 41) and not as "A",
                            because you typed "\x41" (\ x 41) and not "A"'.


                            • Michael Winter

                              #15
                              Re: Input Character Set Handling

                              VK wrote:

                              [snip]
                              >It's a character encoding: characters are encoded as an integer
                              >within a certain "codespace", namely the range 0..10FFFF.
                              >
                              Unicode is a charset (set of characters)
                              Character set and character encoding are
                              synonymous; however, Unicode is not defined using
                              the former.
                              with each character unit represented by words (in the programming
                              sense) with the smallest word consisting of 2 bytes (16 bits).
                              If by "character unit" you mean code point, that's nonsense. A code
                              point is an integer, simple as that. How it is represented varies.
                              This way the range doesn't go from 0: there is no
                              such character in Unicode.
                              In the Unicode Standard, the codespace consists of the integers
                              from 0 to 10FFFF [base 16], comprising 1,114,112 code points
                              available for assigning the repertoire of abstract characters.
                              -- 2.4 Code Points and Characters,
                              The Unicode Standard, Version 4.1.0
                              Unicode starts from the character 0x0000.
                              The Unicode codespace starts from the integer 0. The first assigned
                              character exists at code point 0.
                              Again you are thinking and talking about character entities, bytes,
                              Unicode and UTF-8 at once:
                              No, I'm not. I used terms that are distinctly abstract.

                              It seems to me that you are mistaking a
                              notational convention - referring to characters
                              with the form U+xxxx - for some sort of
                              definition.
                              which is not helpful if one tries to understand the matter.
                              Quite. Why then do you try so hard to misrepresent technical issues?

                              [snip]
                              "lower-ASCII" in the sense "0-127 characters" or "US ASCII" is good
                              enough for the matter.
                              I'm not really going to debate the issue, so long as you understand what
                              I mean when I refer to ASCII.
                              >whereas UTF-8 is an 8-bit, variable-width format.
                              >
                              Again you are mixing charsets and bytes.
                              No, I'm not.
                              UTF-8 is a transport encoding representing Unicode characters using
                              "US ASCII" only character sequences.
                              My point was that, given your own definition of
                              (US-)ASCII above, this sort of statement is
                              absurd. The most significant bit is important in
                              the octets generated when using the UTF-8
                              encoding scheme - all scalar values greater than
                              7F are serialised to two or more octets, each of
                              which has the MSB set - yet you are describing it
                              in terms of something where only the lowest 7
                              bits are used to represent characters.

                              For example, U+0430 is represented by the octets D0 and B0. In binary,
                              these octets are 11010000 and 10110000, respectively. If UTF-8 uses "US
                              ASCII only character sequences", and you agree that US-ASCII is strictly
                              7-bit, do you care to explain that evident contradiction?
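
                              This is easy to verify from javascript (a quick
                              sketch; encodeURIComponent percent-encodes the
                              UTF-8 octets of its argument):

                              var a = '\u0430'; // CYRILLIC SMALL LETTER A, U+0430
                              alert(encodeURIComponent(a)); // "%D0%B0" - MSB set in both octets
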
                              >a document might be stored on disk using UTF-8,
                              >and then transmitted verbatim across a network.
                              >
                              Technically well possible but for what reason? ...
                              Efficiency. Most Western texts will be smaller
                              when the UTF-8 encoding scheme is employed, as
                              the 0..7F code points are the most common,
                              encompassing common letters, digits, and
                              punctuation.
                              Such a document is not viewable without a
                              specially written parser and not directly usable
                              for the Internet.
                              Oh dear. Of all of the documents that use one of
                              the Unicode encoding schemes on the Web, I should
                              think that the /vast/ majority of them use UTF-8.
                              As for "specially written parser", XML processors
                              are required to accept UTF-8 input and browsers
                              at least as far back as NN4 also do so.

                              [snip]

                              Mike
