Simple high-ascii character encoding

  • chandy@totalise.co.uk

    Simple high-ascii character encoding

    Hi,

    I have an Html document that declares that it uses the utf-8 character
    set. As this document is editable via a web interface I need to make
    sure that high-ascii characters that may be accidentally entered are
    properly represented when the document is served. My programming
    language allows me to get the ascii value for any individual character
    so what I am doing when a change is saved is to look at each character
    in the content and if the ascii value for a character > 127 then I
    replace 'character' with '&#AsciiValue;' .

    I am not very well up on character sets and document encoding
    mechanisms so I would like to know, is this a sensible idea?

    TIA

    Chandy

  • Jukka K. Korpela

    #2
    Re: Simple high-ascii character encoding

    chandy@totalise.co.uk wrote:

    > I have an Html document that declares that it uses the utf-8 character
    > set.

    Does it do that properly? Prove it, show us the URL! :-)

    > As this document is editable via a web interface I need to make
    > sure that high-ascii characters that may be accidentally entered are
    > properly represented when the document is served.

    There are no high-ascii characters. Ascii stops at 127, has always
    stopped, and will always stop.

    If your document is adequately UTF-8 encoded, then form data sent via a
    form on the page will appear as UTF-8 encoded, too, though naturally it
    will _also_ be encoded as specified for form data encoding in general.
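To make the two layers concrete (UTF-8 first, then the general form-data encoding on top), here is a minimal sketch in Python. The thread never names the language in use, and the field name `comment` and the sample text are invented for illustration only.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical form field and value; "é" is U+00E9 and "®" is U+00AE.
fields = {"comment": "café ®"}

# A form on a UTF-8 page sends each character as UTF-8 bytes, and those
# bytes are then percent-encoded (application/x-www-form-urlencoded).
wire = urlencode(fields, encoding="utf-8")
print(wire)

# The form handler undoes both layers and gets the same characters back.
decoded = parse_qs(wire, encoding="utf-8")
print(decoded["comment"][0])
```

Nothing in this round trip needs character references: the handler ends up with the same characters the user typed, and they can be written back into the UTF-8 page as they are.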

    > My programming
    > language allows me to get the ascii value for any individual character
    > so what I am doing when a change is saved is to look at each character
    > in the content and if the ascii value for a character > 127 then I
    > replace 'character' with '&#AsciiValue;' .

    Why would you do that, given the fact that there are no Ascii values
    greater than 127 and the fact that your form data handler gets the data
    in UTF-8 encoding? What would be the point in replacing it by a
    character reference, when the page itself is UTF-8 encoded?


    • Alan J. Flavell

      #3
      Re: Simple high-ascii character encoding

      On Thu, 25 Aug 2005 chandy@totalise.co.uk wrote under the
      heading:

      > Simple high-ascii character encoding

      Hmmm. What's that supposed to mean in an HTML context?

      > I have an Html document that declares that it uses the utf-8 character
      > set.

      Terminology again! utf-8 is not a "character set", but a character
      encoding scheme of unicode. I can't help it that, way back, MIME chose
      the attribute name of "charset=" for this, which in current terminology
      is very misleading, but utf-8 still isn't a "character set".

      > As this document is editable via a web interface I need to make
      > sure that high-ascii characters that may be accidentally entered

      I think you'd benefit from getting rid of this obsolete term
      "high-ascii". ASCII is a 7-bit code, containing a mere 95 displayable
      characters, whereas the document character set of HTML is Unicode,
      containing vastly more characters than ASCII.
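The figures above are easy to check. A small sketch in Python (chosen only for illustration; the thread does not say which language the original poster uses):

```python
# ASCII assigns code positions 0 through 127 only (7 bits).
ascii_chars = [chr(i) for i in range(128)]

# Positions 0-31 and 127 are control codes; the rest are displayable.
# This count includes the space character at position 32.
displayable = [c for c in ascii_chars if c.isprintable()]
print(len(ascii_chars), len(displayable))

# Any other character is simply a non-ASCII character.
print("e".isascii(), "é".isascii())
```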

      Modern OSes often define input methods for wide ranges of these
      non-ASCII characters...

      > are properly represented when the document is served.

      Details depend on your OS and editing application, but modern OSes don't
      mind storing utf-8, and serving it out as such.

      > My programming language allows me to get the ascii value for any
      > individual character

      But most of the characters aren't in ASCII, so how could they have
      an "ascii value"? Character representation in HTML isn't hard, but
      you *do* have to use the terms with some care, if you want to make
      sense.

      > so what I am doing when a change is saved is to look at each character
      > in the content and if the ascii value for a character > 127

      There ARE no ASCII characters with a value above 127!

      > then I replace 'character' with '&#AsciiValue;' .

      There *are* no ASCII values greater than 127.

      Representing non-ASCII characters as &#number; , using their character
      number in Unicode, is a feasible approach - but rather voluminous if you
      have many of them.
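For what it's worth, the &#number; approach can be sketched in a few lines of Python. This assumes the text is already held as a Unicode string, so the number written out really is the Unicode character number; the function name `to_numeric_references` is made up for this example.

```python
def to_numeric_references(text: str) -> str:
    """Replace each non-ASCII character with &#N;, N being its Unicode code point."""
    # "xmlcharrefreplace" is a standard codec error handler: any character
    # that cannot be encoded as ASCII is written as a decimal reference.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


sample = "Acme® café"
converted = to_numeric_references(sample)
print(converted)

# The "voluminous" part: every replaced character grows to several bytes.
print(len(sample), len(converted))
```

If the numbers come from an 8-bit code page instead of Unicode, the references will be wrong for any character whose position differs between the two.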

      I have a checklist that's been quite widely peer-reviewed: I'd
      recommend that you work your way down the scenarios, and pick one that
      seems to fit your needs.



      Hope this helps a bit.


      • Chandy

        #4
        Re: Simple high-ascii character encoding

        Yep, clearly I make no sense to people who understand this better than
        I do :) Okay, the language returns integer values for the standard as
        well as 'extended' ascii characters (as detailed, for example, on
        http://www.asciitable.com/). My document is not public but starts
        with:

        <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
        <html lang="en">
        <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

        The system is publishing content in english to the web but is
        potentially for world-wide consumption. Generally the extra characters
        I have to represent will be items like &reg;, &copy; and &trade; and
        some accented letters, but I was wanting to avoid having to have a
        lookup of ascii value->Html Entity by just changing the character for
        &#Value; when it seemed to have a value that put it outwith the
        standard ascii range. I'll re-ask the question 'is this sensible'
        while I read through the document you referred to.
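A quick way to see how the named entities relate to numeric references, sketched in Python (an assumption; the poster's language is not stated). It uses the standard `html` module, which already carries the name-to-number table, so no hand-written lookup is needed.

```python
import html
from html.entities import name2codepoint

# The named entities mentioned in the post and their Unicode code points.
for name in ("reg", "copy", "trade"):
    code_point = name2codepoint[name]
    named = html.unescape(f"&{name};")
    numeric = html.unescape(f"&#{code_point};")
    # Both spellings refer to the same character.
    print(name, code_point, named == numeric)
```

Note that the number for &trade; is 8482, far outside 0-255, so a value taken from a single-byte "extended" table cannot be the right number to put in a reference.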

        Thanks!

        Chandy


        • Andreas Prilop

          #5
          Re: Simple high-ascii character encoding

          On 25 Aug 2005, Chandy wrote:

          > (as detailed, for example, on http://www.asciitable.com/).

          | Not Found
          | The requested URL /). was not found on this server.

          Did you mean http://www.asciitable.com/ ? This is just bullshit!
          Please refer to



          for reliable information.


          • Harlan Messinger

            #6
            Re: Simple high-ascii character encoding

            Chandy wrote:
            > Yep, clearly I make no sense to people who understand this better than
            > I do :) Okay, the language returns integer values for the standard as
            > well as 'extended' ascii characters (as detailed, for example, on
            > http://www.asciitable.com/).

            As that page itself says, "it took a while to get a single standard for
            these extra characters and hence there are few varying 'extended' sets.
            The most popular is presented below." This is all self-contradictory.
            The point is there is no character set correctly called "extended
            ASCII". Anyone using that term to refer to *a* mapping of a collection
            of characters to codes 128-255 is using it because either:

            (a) He thinks that "ASCII" itself refers to the numeric range 0-127, and
            that "extended ASCII" therefore means, unambiguously, the range 128-255.
            This is incorrect, because "ASCII" doesn't in the first place refer to
            the range of numbers, it refers to a very specific set of characters (or
            control codes) and its *assignment* to those numbers.

            (b) He got the impression somewhere that the particular set of
            characters he's seen assigned to the range 128-255 is *the* set of
            characters so assigned, and that that particular set is known as
            "extended ASCII". So what was introduced to MS-DOS users as "extended
            ASCII" was the set of Microsoft line draw characters. Many other users
            think it refers to the Western Europe (Latin-1) extension. And so forth.
            The term is used by different people to refer to what they think it
            means. In other words, it doesn't technically mean anything. It's a
            misconception.


            • Harlan Messinger

              #7
              Re: Simple high-ascii character encoding

              Harlan Messinger wrote:
              > As that page itself says, "it took a while to get a single standard for
              > these extra characters and hence there are few varying 'extended' sets.
              > The most popular is presented below." This is all self-contradictory.
              > The point is there is no character set correctly called "extended
              > ASCII". Anyone using that term to refer to *a* mapping of a collection
              > of characters to codes 128-255 is using it because either:

              [snip]

              To be fair, *any* of the character sets of which ASCII is a subset can
              legitimately be called *an* "extension of ASCII". Latin-1 is an ASCII
              extension, as is Unicode. But still, it makes no sense to speak of
              "extended ASCII characters".

              First, a given character may appear in one or more of these schemes and
              *not* appear in one or more others. Would that character be an "extended
              ASCII" character or not? The answer is that it's a character in some of
              those character sets or that's represented in some of those encodings,
              and not in others. The question of whether it's an "extended ASCII"
              character is meaningless.

              Second, a given character may appear in two different character sets but
              mapped to different codes. What's the "extended ASCII code" for an em
              dash? Well, under the standard Windows character set, an em-dash is
              character 151; if you're using Unicode, it's character 8212; and if
              you're using ISO-8859-1, it isn't anything at all because the em dash
              isn't part of that character set. In other words, again, it's
              meaningless to talk about a character's extended ASCII code.
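The em dash numbers above can be checked directly. A short Python sketch, assuming that "the standard Windows character set" means Windows-1252 (Python codec name `cp1252`):

```python
em_dash = "\u2014"

# Unicode character number.
print(ord(em_dash))

# Windows-1252 stores the em dash as one byte; print that byte's value.
print(em_dash.encode("cp1252")[0])

# ISO-8859-1 does not contain the em dash, so encoding fails.
try:
    em_dash.encode("iso-8859-1")
    in_iso_8859_1 = True
except UnicodeEncodeError:
    in_iso_8859_1 = False
print(in_iso_8859_1)
```

One character, two different numbers, and one coding in which it has no number at all.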


              • Guy Macon

                #8
                Re: Simple high-ascii character encoding




                Harlan Messinger wrote:

                > To be fair, *any* of the character sets of which ASCII is a subset can
                > legitimately be called *an* "extension of ASCII". Latin-1 is an ASCII
                > extension, as is Unicode. But still, it makes no sense to speak of
                > "extended ASCII characters".

                I was about to ask if anyone had bothered to list all the
                different character sets that are identical to ASCII in the
                first 127 characters, but perhaps it is easier to simply ask
                if there are any character sets that are *not* identical to
                ASCII in the first 127 characters...



                • Andreas Prilop

                  #9
                  Re: Simple high-ascii character encoding

                  On Thu, 25 Aug 2005, it was written:

                  > but perhaps it is easier to simply ask
                  > if there are any character sets that are *not* identical to
                  > ASCII in the first 127 characters...
                                        ^
                  (Characters 0 to 127 are the first 128 characters.)

                  All of these




                  • Alan J. Flavell

                    #10
                    Re: Simple high-ascii character encoding

                    On Thu, 25 Aug 2005, Harlan Messinger wrote:

                    > To be fair, *any* of the character sets of which ASCII is a subset
                    > can legitimately be called *an* "extension of ASCII".

                    It could - but it's not a particularly informative statement, as I
                    hope you'd agree.

                    > Latin-1 is an ASCII extension,

                    To be pedantic, "Latin-1" defines a repertoire of characters:
                    CP-1047 is the "EBCDIC Latin-1 character encoding". When you
                    said Latin-1, I suspect you really meant iso-8859-1, which indeed
                    has ASCII as its lower half.

                    > as is Unicode.

                    Indeed.

                    > But still, it makes no sense to speak of "extended ASCII
                    > characters".

                    Right!

                    > Second, a given character may appear in two different character sets
                    > but mapped to different codes. What's the "extended ASCII code" for
                    > an em dash? Well, under the standard Windows character set, an
                    > em-dash is character 151; if you're using Unicode, it's character
                    > 8212; and if you're using ISO-8859-1, it isn't anything at all
                    > because the em dash isn't part of that character set. In other
                    > words, again, it's meaningless to talk about a character's extended
                    > ASCII code.

                    Right!!

                    And even in MS-DOS land, which is where this unfortunate phrase
                    "extended ASCII" seems to have grown, there's a bushel of different
                    encodings: CP-437 for the USans, CP-850 for "multinational" use (which
                    contains approximately an MS-DOS encoding of the Latin-1 repertoire,
                    but organised completely differently than iso-8859-1), plus loads of
                    national-specific code pages too. I've got an MS-DOS version 6 manual
                    somewhere which lists page after page of the wretched things.
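To illustrate how much these code pages disagree, here is a small Python sketch that decodes one and the same byte under several of them. The byte value 233 is an arbitrary pick for the example, and the codec names are Python's spellings of the code pages named above, plus two Windows/ISO codings for comparison.

```python
raw = bytes([0xE9])  # a single byte with the value 233

# Decode the same byte under two MS-DOS code pages and two later codings.
for codec in ("cp437", "cp850", "cp1252", "iso-8859-1"):
    char = raw.decode(codec)
    print(f"{codec:>10}: {char!r} U+{ord(char):04X}")
```

Which character "233" means depends entirely on the code page assumed, which is why the number on its own is not a usable character reference.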

                    Thank goodness we rarely have to go there these days (except where
                    some user has blundered and converted DOS to Windows where they ought
                    not, or failed to do so when they should've).

                    best


                    • Alan J. Flavell

                      #11
                      Re: Simple high-ascii character encoding

                      On Thu, 25 Aug 2005, Chandy wrote:

                      > http://www.asciitable.com/

                      Bleagh.

                      On cursory inspection, this appears to be the US-National MS-DOS code
                      page, CP-437. Utterly useless in the modern world: it's absolute
                      nonsense for them to claim that it's the "most popular", as indeed is
                      their claim that "it took a while to get a single standard", since
                      there never *has* been a "single" standard of the kind that they are
                      talking about. Possibly in the distant future, when this babel of
                      8-bit character codes has been forgotten, Unicode *will* be that
                      "single standard". Possibly.

                      Ho hum


                      • Harlan Messinger

                        #12
                        Re: Simple high-ascii character encoding

                        Guy Macon wrote:
                        > Harlan Messinger wrote:
                        >
                        >> To be fair, *any* of the character sets of which ASCII is a subset can
                        >> legitimately be called *an* "extension of ASCII". Latin-1 is an ASCII
                        >> extension, as is Unicode. But still, it makes no sense to speak of
                        >> "extended ASCII characters".
                        >
                        > I was about to ask if anyone had bothered to list all the
                        > different character sets that are identical to ASCII in the
                        > first 127 characters, but perhaps it is easier to simply ask
                        > if there are any character sets that are *not* identical to
                        > ASCII in the first 127 characters...

                        EBCDIC, for starters.

                        Then there are all the non-standard arrangements that font designers
                        used in the past to map alphabets and symbol sets other than the basic
                        English one to the sub-128 positions so that foreign text and special
                        symbols could be rendered before more sophisticated means became
                        available. For example, the various Symbols and Wingdings fonts.


                        • Harlan Messinger

                          #13
                          Re: Simple high-ascii character encoding

                          Alan J. Flavell wrote:
                          > On Thu, 25 Aug 2005, Harlan Messinger wrote:
                          >
                          >> To be fair, *any* of the character sets of which ASCII is a subset
                          >> can legitimately be called *an* "extension of ASCII".
                          >
                          > It could - but it's not a particularly informative statement, as I
                          > hope you'd agree.

                          Yes. Still, it's been convenient that for purposes of composing in
                          English most people (pre-Unicode) haven't had to worry about whether
                          their editor supported a particular encoding because it hasn't mattered
                          with respect to the common ASCII subset.

                          >> Latin-1 is an ASCII extension,
                          >
                          > To be pedantic, "Latin-1" defines a repertoire of characters:
                          > CP-1047 is the "EBCDIC Latin-1 character encoding". When you
                          > said Latin-1, I suspect you really meant iso-8859-1, which indeed
                          > has ASCII as its lower half.

                          I did, and thanks for the adjustment. I'm trying really hard to stop
                          mixing up character sets and encodings. (By the way--is a "repertoire"
                          different from a "set"?)


                          • Alan J. Flavell

                            #14
                            Re: Simple high-ascii character encoding

                            On Thu, 25 Aug 2005, Harlan Messinger wrote:

                            >> CP-1047 is the "EBCDIC Latin-1 character encoding". When you said
                            >> Latin-1, I suspect you really meant iso-8859-1, which indeed has
                            >> ASCII as its lower half.
                            >
                            > I did, and thanks for the adjustment. I'm trying really hard to stop
                            > mixing up character sets and encodings.

                            "character sets" versus "encodings" is yet another layer! - although
                            that's hardly noticeable with the old 8-bit codings, it gets quite
                            critical with encodings of Unicode.

                            > (By the way--is a "repertoire" different from a "set"?)

                            Well, the term "character set" is usually understood to define not
                            only a particular repertoire of characters, but also the assignment of
                            each character to a "small" integer number. This assignment is of
                            course different in EBCDIC-based codings from what it is in
                            ASCII-based codings, to take the obvious example.

                            As such, I'd tend to avoid the use of the term "set" to refer to a
                            character repertoire, if I'm trying to avoid implying a particular
                            ordering of the characters or their assignment to "small" integers.
                            The "repertoire" is the unordered selection of characters, without
                            reference to one or other "character sets" which might be defined
                            comprising that repertoire.
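The distinction can be shown with a tiny Python sketch. The three-character repertoire is invented for the example, and `cp037` (an EBCDIC code page that ships with Python) stands in for the EBCDIC-based codings mentioned above; it is not the CP-1047 coding named earlier in the thread.

```python
# A small repertoire: just a collection of characters, no order, no numbers.
repertoire = {"A", "a", "é"}

# Two coded character sets that both cover this repertoire but assign
# different numbers to its members.
for char in sorted(repertoire):
    ascii_based = char.encode("iso-8859-1")[0]
    ebcdic_based = char.encode("cp037")[0]
    print(f"{char!r}: iso-8859-1={ascii_based} cp037={ebcdic_based}")
```

The repertoire stays the same in both cases; only the assignment of characters to small integers changes.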

                            hope that helps.

                            Btw, recall that after a certain point, the Latin-x repertoire is
                            encoded by the iso-8859-y character code, where x is no longer equal
                            to y. This is because some of the intervening codes weren't for Latin
                            at all, but for Greek, Arabic, Cyrillic, Hebrew etc. So, for example,
                            iso-8859-15 is the ISO encoding for Latin-9.
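The Latin-x versus iso-8859-y naming can be seen in Python's codec registry, which accepts both spellings as aliases. A minimal sketch, relying only on the standard `codecs` module:

```python
import codecs

# Each pair names the same codec: the Latin-x alias and the iso-8859-y name.
for latin_name, iso_name in (("latin1", "iso-8859-1"), ("latin9", "iso-8859-15")):
    same = codecs.lookup(latin_name).name == codecs.lookup(iso_name).name
    print(latin_name, iso_name, same)

# One visible difference between the two codings: the euro sign is part of
# iso-8859-15 (Latin-9) but not of iso-8859-1 (Latin-1).
euro_byte = "€".encode("iso-8859-15")
print(euro_byte)
```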


                            • RobG

                              #15
                              Re: Simple high-ascii character encoding

                              Harlan Messinger wrote:
                              [...]

                              > Then there are all the non-standard arrangements that font designers
                              > used in the past to map alphabets and symbol sets other than the basic
                              > English one to the sub-128 positions so that foreign text

                              While we're being pedantic about words, should the phrase 'foreign text'
                              be 'non-English text'? Or in the context of ASCII, are the two terms
                              identical?

                              [...]

                              --
                              Rob
