Named vs. numerical entities

  • Andreas Prilop

    #16
    Re: Named vs. numerical entities

    On Fri, 16 Jul 2004, Pierre Goiffon wrote:
    >> And in any case, probably the best choice (if no other constraints
    >> apply) of Unicode encoding scheme for HTML used in a WWW context is
    >> utf-8, not utf-16LE/BE.
    >
    > Do you mean, when using a vast majority of Latin characters?
    > If not, wouldn't the file get very large?

    Not bigger than a simple image.

    > Wouldn't it be better to use UTF-16?

    Only if you prefer not to be indexed by Google correctly.
    <http://www.google.com/search?q=%22UTF-1+6%22>

    --
    Top-posting.
    What's the most irritating thing on Usenet?


    • Nick Kew

      #17
      Re: Named vs. numerical entities

      In article <Pine.LNX.4.53.0407160944450.7114@ppepc56.ph.gla.ac.uk>,
      "Alan J. Flavell" <flavell@ph.gla.ac.uk> writes:

      > There are third-party Apache modules which take care of this "on the
      > fly",

      mod_deflate is standard. No need for third-party modules.

      --
      Nick Kew

      • Alan J. Flavell

        #18
        Re: Named vs. numerical entities

        On Fri, 16 Jul 2004, Pierre Goiffon wrote:
        > "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in message
        > news:Pine.LNX.4.53.0407161334000.7123@ppepc56.ph.gla.ac.uk
        > > And in any case, probably the best choice (if no other constraints
        > > apply) of Unicode encoding scheme for HTML used in a WWW context is
        > > utf-8, not utf-16LE/BE.
        >
        > Do you mean, when using a vast majority of Latin characters?

        Not necessarily: Greek, Cyrillic, Arabic, Hebrew are all represented
        by 2 octets in utf-8. Armenian, Syriac and Coptic too, hmmm. The
        cutoff (IINM) is U+07FF.
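
A minimal sketch of the octet counts described above, assuming Python 3 and its built-in `str.encode`. The sample characters and the names `samples` and `utf8_lengths` are illustrative choices, not something from the thread.

```python
# Octets needed per character in UTF-8 for a few sample code points.
samples = {
    "Latin A (U+0041)": "\u0041",
    "Greek alpha (U+03B1)": "\u03b1",
    "Cyrillic zhe (U+0436)": "\u0436",
    "Armenian ayb (U+0531)": "\u0531",
    "Hebrew alef (U+05D0)": "\u05d0",
    "Arabic alef (U+0627)": "\u0627",
    "Last 2-octet code point (U+07FF)": "\u07ff",
    "First 3-octet code point (U+0800)": "\u0800",
    "Devanagari ka (U+0915)": "\u0915",
}

# Encode each character on its own and count the resulting bytes.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, octets in utf8_lengths.items():
    print(f"{name}: {octets} octet(s)")
```

Everything from U+0080 through U+07FF comes out as 2 octets, and the count moves to 3 at U+0800, which is consistent with the cutoff quoted above.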

        CJK scripts are a different matter, but AFAICS they are still usually
        represented in one of their traditional encodings, rather than in a
        Unicode-based scheme.

        Indic scripts will also need 3 octets per character in utf-8 (and in
        this case AIUI the use of unicode-based encodings is very beneficial,
        since there /was/ no widely accepted pre-unicode scheme: I'm told that
        in order to read Indian newspapers on the web, pretty much each
        newspaper needed a different "font", i.e. in effect was using its own
        private character encoding. But I'm no expert in that field, so the
        information is only second-hand).

        > If not, wouldn't the file get very large? Wouldn't it be
        > better to use UTF-16?

        I haven't widely tested browser compatibility for utf-16 encodings, so
        I can't comment on that aspect. But keep in mind that the markup,
        styles, etc. etc. are expressed by ASCII characters, and by using
        utf-16 you're going to double the size of *those* as compared with
        utf-8.

        But yes, if your material is such that most of the data characters
        need 3 octets in utf-8, and you've decided to use a unicode scheme,
        then utf-16 could well be more compact, you're right.
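
The trade-off between markup and data characters can be checked with a short sketch, assuming Python 3. The markup fragment, the Devanagari sample string and the helper `encoded_sizes` are made up for illustration.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8 and in UTF-16 (no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le")}

# Pure ASCII markup: 1 octet per character in UTF-8, 2 in UTF-16.
markup = '<p class="body"></p>'

# Devanagari data characters (all in U+0900..U+097F): 3 octets each in
# UTF-8, 2 each in UTF-16. The word is repeated to imitate running text.
data = "\u0928\u092e\u0938\u094d\u0924\u0947" * 50

markup_sizes = encoded_sizes(markup)
data_sizes = encoded_sizes(data)

print("markup:", markup_sizes)
print("data:  ", data_sizes)
```

Which encoding wins for a whole page then depends on the ratio of markup to data characters, as the post says.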

        • Alan J. Flavell

          #19
          Re: Named vs. numerical entities

          On Fri, 16 Jul 2004, Nick Kew wrote:
          > "Alan J. Flavell" <flavell@ph.gla.ac.uk> writes:
          >
          > > There are third-party Apache modules which take care of this "on the
          > > fly",
          >
          > mod_deflate is standard. No need for third-party modules.

          Thanks for the information!

          • Andreas Prilop

            #20
            Re: Named vs. numerical entities

            On Fri, 16 Jul 2004, Alan J. Flavell wrote:
            > Indic scripts will also need 3 octets per character in utf-8 (and in
            > this case AIUI the use of unicode-based encodings is very beneficial,
            > since there /was/ no widely accepted pre-unicode scheme: I'm told that
            > in order to read Indian newspapers on the web, pretty much each
            > newspaper needed a different "font", i.e. in effect was using its own
            > private character encoding.

            But there's also <http://www.bbc.co.uk/hindi/>
            and <http://www.bbc.co.uk/tamil/> .

            --
            Top-posting.
            What's the most irritating thing on Usenet?

            • Brian

              #21
              Re: Named vs. numerical entities

              Jonas Smithson wrote:
              > However, some of my pages have numerous character entities on
              > them... let's say up to fifty on a page, perhaps; if they each
              > entailed an extra six bytes (for example) over some alternate
              > method, then that might add up to an extra 300 bytes. What does
              > that equal in download time? How many bytes of difference do *you*
              > think would make a "noticeable difference" between two documents...
              > say, to a user on a 56K modem?

              Well, do the math. 300/56000 is not very significant. I suppose,
              300/~33000 is more accurate a comparison, but even there, it's nothing
              to worry about. Spending time tuning one image on a page will likely
              have a greater impact than encoding will.

              You should only worry about encoding if it causes rendering problems.
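
The arithmetic can be made explicit with a rough sketch, assuming Python 3. Modem speeds are quoted in bits per second while the 300 is in bytes, so the sketch multiplies by 8 first. It ignores framing overhead, latency and modem compression, and `transfer_seconds` is a hypothetical helper.

```python
def transfer_seconds(extra_bytes, bits_per_second):
    """Time to move extra_bytes over a link of the given raw bit rate."""
    return extra_bytes * 8 / bits_per_second

nominal = transfer_seconds(300, 56_000)    # nominal 56K rate
realistic = transfer_seconds(300, 33_000)  # the ~33000 figure from the post

print(f"at 56000 bit/s: {nominal:.3f} s")    # about 0.043 s
print(f"at 33000 bit/s: {realistic:.3f} s")  # about 0.073 s
```

Either way the extra 300 bytes cost well under a tenth of a second under these assumptions.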

              --
              Brian (remove ".invalid" to email me)

              • Brian

                #22
                Re: Named vs. numerical entities

                Alan J. Flavell wrote:
                > On Fri, 16 Jul 2004, Brian wrote:
                >
                >> UTF-8 is an 8-bit character set
                >
                > No, utf-8 isn't a "character set" at all (that MIME "charset"
                > parameter denotes what we nowadays call a "character encoding
                > scheme").

                Cripes, I cannot keep the terminology straight. I wish they had called
                that thing by its name, charenc or something. Yes, utf-8 is an encoding.

                --
                Brian (remove ".invalid" to email me)

                • Harlan Messinger

                  #23
                  Re: Named vs. numerical entities


                  "Andreas Prilop" <nhtcapri@rrz n-user.uni-hannover.de> wrote in message
                  news:Pine.GSO.4.44.0407161425010.9642-100000@s5b003...
                  > On Fri, 16 Jul 2004, Harlan Messinger wrote:
                  >
                  > > Why can't a document be encoded (and transmitted) in Unicode?
                  >
                  > It cannot be "in Unicode" but UTF-8, UTF-16, or UTF-32;
                  > and in addition in different byte order for UTF-16 and UTF-32.
                  > <http://www.unicode.org/unicode/faq/utf_bom.html>
                  >
                  > > If
                  > > Windows Notepad lets you save a text file as Unicode (big- or
                  > > little-endian), isn't that the same thing?
                  >
                  > "Big- or little-endian" rules out UTF-8, so probably it's UTF-16.
                  > UTF-32 isn't used in MS Windows AFAIK.

                  I'm really interested in what the distinction is. I admit I don't know what
                  UTF-16 is or why it's different from what I would call "Unicode encoding", but
                  why wouldn't a fixed 16-bit encoding scheme where "A" is encoded as 0041, an
                  em-dash is encoded as 2014, a katakana "pu" is encoded as 30D7, and so forth
                  not be "Unicode encoding"?

                  Is it that this encoding scheme already existed and had the name "UTF-16"
                  before the term "Unicode" was coined? So that the reason we don't call it
                  "Unicode encoding" is simply that it already has another name?

                  • Andreas Prilop

                    #24
                    Re: Named vs. numerical entities

                    On Fri, 16 Jul 2004, Harlan Messinger wrote:
                    >> <http://www.unicode.org/unicode/faq/utf_bom.html>
                    >
                    > I'm really interested in what the distinction is. I admit I don't know what
                    > UTF-16 is or why it's different from what I would call "Unicode encoding", [...]

                    Err, did you read the page above, which I cited with reason?

                    --
                    Top-posting.
                    What's the most irritating thing on Usenet?

                    • Harlan Messinger

                      #25
                      Re: Named vs. numerical entities


                      "Andreas Prilop" <nhtcapri@rrz n-user.uni-hannover.de> wrote in message
                      news:Pine.GSO.4.44.0407161705220.11169-100000@s5b003...
                      > On Fri, 16 Jul 2004, Harlan Messinger wrote:
                      >
                      > >> <http://www.unicode.org/unicode/faq/utf_bom.html>
                      > >
                      > > I'm really interested in what the distinction is. I admit I don't know what
                      > > UTF-16 is or why it's different from what I would call "Unicode encoding", [...]
                      >
                      > Err, did you read the page above, which I cited with reason?
                      >

                      Sorry, I missed it somehow. I intend to read it later, but from glancing at
                      it, I have the following thoughts:

                      1. There's nothing any more nonsensical about the concept of a Unicode
                      encoding for the Unicode character set than there is about ASCII encoding
                      for the ASCII character set, but for whatever reasons (I assume efficiency
                      has something to do with it) it's not *used*.

                      2. EBCDIC and ASCII define the same characters, IIRC; but as character sets
                      they just number them differently. A document could be encoded in EBCDIC
                      just as easily as in ASCII. It wouldn't make any sense to speak of an EBCDIC
                      encoding of an ASCII document or an ASCII encoding of an EBCDIC document:
                      each is a separate encoding of a document based on the representations of
                      the document's characters in the respective character sets.

                      So why are the UTF-* encodings "encodings of the Unicode character set"? Is
                      it because they are closely related to the Unicode character set by virtue
                      of the fact that there is a mapping from UCS to UTF-* produced by applying a
                      small set of simple functions?




                      • Andreas Prilop

                        #26
                        Re: Named vs. numerical entities

                        On Fri, 16 Jul 2004, Harlan Messinger wrote:
                        > 1. There's nothing any more nonsensical about the concept of a Unicode
                        > encoding for the Unicode character set than there is about ASCII encoding
                        > for the ASCII character set,

                        Maybe I could understand this sentence with fewer negatives :-)

                        > 2. EBCDIC and ASCII define the same characters, IIRC;

                        ASCII is a coded character set of 128 characters defined in ANSI X3.4
                        and ISO 646.
                        EBCDIC is a generic term for several (many?) coded character sets of
                        256 characters defined by IBM. Just four of them are listed here:
                        <http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/>
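
A small sketch of the difference, assuming Python 3, whose standard codecs include `ascii` and the EBCDIC code pages `cp037` and `cp500` (two of the code pages in the directory linked above). The variable names are illustrative.

```python
letter = "A"

# The same abstract character gets a different number in each coded set.
ascii_byte = letter.encode("ascii")   # 0x41 in ASCII
ebcdic_byte = letter.encode("cp037")  # 0xC1 in EBCDIC code page 037

print(ascii_byte.hex(), ebcdic_byte.hex())

# ASCII stops at 127: byte values 0..127 decode, 128 does not.
all_ascii = bytes(range(128)).decode("ascii")
try:
    bytes([128]).decode("ascii")
    ascii_accepts_128 = True
except UnicodeDecodeError:
    ascii_accepts_128 = False

# "EBCDIC" is a family: code pages 037 and 500 place "[" differently.
bracket_differs = "[".encode("cp037") != "[".encode("cp500")
print(len(all_ascii), ascii_accepts_128, bracket_differs)
```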

                        > So why are the UTF-* encodings "encodings of the Unicode character set"?

                        Think of Unicode as assigning characters to natural numbers - currently
                        from 0 to x10FFFF = 1114111. For example, number 945 = x3B1 means
                        the Greek small letter alpha.

                        The UTFs define how these numbers are represented by _byte_ sequences
                        (in a computer or on the Internet).
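
The two layers can be shown side by side with a minimal sketch, assuming Python 3. The big-endian codec variants are picked only so that the bytes read in the same order as the number; `byte_forms` is an illustrative name.

```python
# Layer 1: Unicode maps a character to a number (its code point).
alpha = chr(945)
assert alpha == "\u03b1"          # Greek small letter alpha
assert ord(alpha) == 0x3B1
assert 0x10FFFF == 1114111        # upper end of the range quoted above

# Layer 2: each UTF turns that one number into a byte sequence.
byte_forms = {
    enc: alpha.encode(enc) for enc in ("utf-8", "utf-16-be", "utf-32-be")
}

for enc, raw in byte_forms.items():
    print(f"{enc}: {raw.hex(' ')}")
```

The code point stays 945 throughout; only the byte sequence changes with the encoding scheme.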

                        --
                        Top-posting.
                        What's the most irritating thing on Usenet?

                        Comment

                        • C A Upsdell

                          #27
                          Re: Named vs. numerical entities

                          > ASCII is a coded character set of 128 characters defined in ANSI X3.4
                          > and ISO 646.

                          Not quite. You are thinking of US-ASCII. There are a variety of national
                          ASCII character sets.




                          • Alan J. Flavell

                            #28
                            Re: Named vs. numerical entities

                            On Fri, 16 Jul 2004, Harlan Messinger wrote:
                            > Sorry, I missed it somehow. I intend to read it later,

                            Call back here when you have done?

                            > 1. There's nothing any more nonsensical about the concept of a Unicode
                            > encoding for the Unicode character set than there is about ASCII encoding
                            > for the ASCII character set,

                            Actually there are substantial differences. And you see this also
                            with that MIME parameter which is (mis)named "charset" - but specifies
                            what we now would call a "character encoding scheme".

                            Back when 7 or 8 bits were sufficient to represent all of the
                            characters of a repertoire, it was quasi-obvious that the "coded
                            character set" was defined by assigning numbers (0-127 or 0-255 as the
                            case may be) to the characters of the repertoire, and then to lay out
                            the fonts according to that scheme, and to transmit the characters by
                            means of bytes having that value.

                            Consequently, back then it looked as if the things that we now call
                            "coded character set", "character encoding" and "font arrangement"
                            were just different names for the same thing. Of course, you needed a
                            different font for each "charset" (i.e. character encoding), which got
                            to be a considerable drag.

                            Nowadays these concepts have to be disambiguated. Unicode characters
                            are designated by a code point which can, in principle, go up to 2**31
                            (it hasn't got that far yet). Those numbers then have to be
                            represented in a way which is convenient for transmission and/or
                            storage (different design criteria apply for different purposes).

                            > 2. EBCDIC and ASCII define the same characters, IIRC;

                            Actually not. But discussing that would be a pointless digression, so
                            let's move on.

                            > So why are the UTF-* encodings "encodings of the Unicode character set"?

                            It's not practical, for various reasons, to transmit characters as
                            32-bit units. For one thing, it's very wasteful. For another,
                            there's no unique byte-ordering, hence all this fuss about endian-ness
                            when units of 16 or 32 bits are involved.
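
Both points can be sketched briefly, assuming Python 3 and its standard `codecs` module. The variable names are illustrative.

```python
import codecs

# Wasteful: a plain ASCII letter takes 4 octets as a 32-bit unit.
ascii_as_utf32 = "A".encode("utf-32-be")

# No unique byte order: one 16-bit unit, two possible serialisations.
alpha = "\u03b1"
little = alpha.encode("utf-16-le")
big = alpha.encode("utf-16-be")

# A byte order mark in front tells a decoder which order was used.
# Python's generic "utf-16" codec reads the mark and strips it.
from_little = (codecs.BOM_UTF16_LE + little).decode("utf-16")
from_big = (codecs.BOM_UTF16_BE + big).decode("utf-16")

print(len(ascii_as_utf32), little.hex(), big.hex())
print(from_little == from_big == alpha)
```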

                            There's also the question of representing unicode characters in a
                            mail-safe context (hence utf-7). That will fade with time, but even
                            8-bit-safe mail formats ban null bytes, which means that utf-16 or
                            utf-32/ucs-4 representations cannot be used without a further layer of
                            encoding.
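
A short sketch of the null-byte issue, assuming Python 3, whose standard codecs include `utf-7`. The sample text and the names `has_null` and `utf7` are illustrative.

```python
text = "A\u03b1"  # a Latin letter followed by a Greek alpha

# Which encodings put a zero octet on the wire for this text?
has_null = {
    enc: b"\x00" in text.encode(enc)
    for enc in ("utf-8", "utf-16-le", "utf-32-le")
}
print(has_null)

# UTF-7 keeps every octet in the 7-bit range and round-trips the text.
utf7 = text.encode("utf-7")
seven_bit_clean = all(octet < 128 for octet in utf7)
print(utf7, seven_bit_clean, utf7.decode("utf-7") == text)
```

For this sample, only the UTF-8 and UTF-7 forms avoid zero octets, which matches the point about needing a further layer of encoding for utf-16 or utf-32 in mail.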

                            > Is it because they are closely related to the Unicode character set

                            Is it because you won't read the tutorial before asking further
                            questions?

                            ttfn

                            • Lars Eighner

                              #29
                              Re: Named vs. numerical entities

                              In our last episode,
                              <BjTJc.185939$rCA1.116992@news01.bloor.is.net.cable.rogers.com>,
                              the lovely and talented C A Upsdell
                              broadcast on comp.infosystems.www.authoring.html:

                              >> ASCII is a coded character set of 128 characters defined in ANSI X3.4
                              >> and ISO 646.

                              > Not quite. You are thinking of US-ASCII. There are a variety of national
                              > ASCII character sets.


                              No. There is only one ASCII. It is a 7-bit code with 128 characters.
                              Think: what does the A in ASCII stand for?

                              --
                              Lars Eighner -finger for geek code- eighner@io.com http://www.io.com/~eighner/
                              If it wasn't for muscle spasms, I wouldn't get any exercise at all.

                              • Andreas Prilop

                                #30
                                Re: Named vs. numerical entities

                                On Fri, 16 Jul 2004, C A Upsdell wrote:
                                >> ASCII is a coded character set of 128 characters defined in ANSI X3.4
                                >> and ISO 646.
                                >
                                > Not quite. You are thinking of US-ASCII.

                                ASCII and US-ASCII are synonyms.
                                <http://www.iana.org/assignments/character-sets>

                                > There are a variety of national ASCII character sets.

                                No, they are called "7-bit codes" or "7-bit coded character sets"
                                as defined in ISO 646. <http://www.itscj.ipsj.or.jp/ISO-IR/>
