Named vs. numerical entities

  • Jonas Smithson

    Named vs. numerical entities

    I recently read the claim somewhere that numerical entities (such as
    &#8212;) have a speed advantage over the equivalent named entities
    (such as &mdash;) because the numerical entity requires just a single
    byte to be downloaded to the browser, while the named entity requires
    one byte for each letter. (So in this case, it would presumably be one
    byte vs. seven bytes.) I found this claim a little surprising -- I
    would have thought *each* numeral in the numerical entity would require
    one byte. Does the Web server really send the entire numerical entity
    as a single... character or whatever... I don't even know how to phrase
    this question correctly!

    Also, which form of the entity enjoys wider browser support? They both
    seem to work with modern browsers... but what about older or very buggy
    browsers?
  • Brian

    #2
    Re: Named vs. numerical entities

    Jonas Smithson wrote:
    > I recently read the claim somewhere that numerical entities (such
    > as &#8212;) have a speed advantage over the equivalent named
    > entities (such as &mdash;) because the numerical entity requires
    > just a single byte to be downloaded to the browser, while the named
    > entity requires one byte for each letter.

    My, that was a load of poppycock you were told.

    > I found this claim a little surprising

    That's being too kind.

    > I would have thought *each* numeral in the numerical entity would
    > require one byte.

    That depends on the encoding. You'd best consult the guides if you
    want to know more. I wish I understood it all better. I don't, despite
    reading **numerous** posts from folks here who are quite well-versed.
    If you're interested, Google the group for "Alan Flavell encoding" or
    "Andreas Prilop charset". That'll turn up lots of posts. I'd suggest
    you read what they say carefully; read those who argue with them, at
    least on character encoding issues, with a grain of salt.

    > Also, which form of the entity enjoys wider browser support? They
    > both seem to work with modern browsers... but what about older or
    > very buggy browsers?

    Again, A. Flavell is your man. Brace yourself for some heavy reading:



    --
    Brian (remove ".invalid" to email me)

    • Stan Brown

      #3
      Re: Named vs. numerical entities

      "Jonas Smithson" <smithsonNOSPAM@REMOVETHISboardermail.com> wrote in
      comp.infosystems.www.authoring.html:
      >I recently read the claim somewhere that numerical entities (such as
      >&#8212;) have a speed advantage over the equivalent named entities
      >(such as &mdash;) because the numerical entity requires just a single
      >byte to be downloaded to the browser, while the named entity requires
      >one byte for each letter. (So in this case, it would presumably be one
      >byte vs. seven bytes.) I found this claim a little surprising -- I
      >would have thought *each* numeral in the numerical entity would require
      >one byte.

      It does.

      Where the difference arises is if you actually create your document
      in Unicode instead of an 8-bit character set. If the document is
      actually composed in Unicode, and transmitted in Unicode, then there
      is an advantage of the actual 8212 character because it needs only
      two bytes whereas &mdash; is 7 characters. (I can't remember whether
      that's 7*2=14 bytes or some compression goes on, but it's certainly
      more than 2 bytes.)

      --
      Stan Brown, Oak Road Systems, Tompkins County, New York, USA

      HTML 4.01 spec: http://www.w3.org/TR/html401/
      validator: http://validator.w3.org/
      CSS 2 spec: http://www.w3.org/TR/REC-CSS2/
      2.1 changes: http://www.w3.org/TR/CSS21/changes.html
      validator: http://jigsaw.w3.org/css-validator/
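
The byte counts under discussion are easy to check directly. The following is a minimal sketch, not part of the original posts; the helper name `encoded_size` is made up for illustration, and UTF-8 and UTF-16LE are picked only as two representative encoding schemes.

```python
# Byte counts for the em dash (U+2014) written three ways.
EM_DASH = "\u2014"


def encoded_size(text: str, encoding: str) -> int:
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))


forms = {
    "numeric reference": "&#8212;",
    "named entity": "&mdash;",
    "literal character": EM_DASH,
}

for label, text in forms.items():
    # Each reference is seven ASCII characters; only the literal
    # character depends strongly on the encoding scheme chosen.
    sizes = {enc: encoded_size(text, enc) for enc in ("utf-8", "utf-16-le")}
    print(f"{label:18} {sizes}")
```

Both references come out at the same length in either encoding, so neither form has a size advantage over the other. Only the directly encoded character is shorter, and a direct encoding is possible only if the document's encoding can represent that character at all.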

      • Brian

        #4
        Re: Named vs. numerical entities

        Jonas Smithson wrote:
        > I recently read the claim somewhere that numerical entities (such
        > as &#8212;) have a speed advantage over the equivalent named
        > entities (such as &mdash;) because the numerical entity requires
        > just a single byte to be downloaded to the browser, while the named
        > entity requires one byte for each letter. (So in this case, it
        > would presumably be one byte vs. seven bytes.)

        BTW, did the person whose work you read actually claim that there
        would be a noticeable difference in 2 documents, where document (a)
        had 6 (or 12, or, heck, even 60) bytes more than document (b)?

        --
        Brian (remove ".invalid" to email me)

        • Brian

          #5
          Re: Named vs. numerical entities

          Stan Brown wrote:
          > Where the difference arises is if you actually create your document
          > in Unicode

          I'm not sure what you mean by this. Unicode is a character set, not an
          encoding. AIUI, all HTML documents are presumed to be written in
          Unicode, although that's an awkward thing to say.

          > instead of an 8-bit character set. If the document is actually
          > composed in Unicode, and transmitted in Unicode,

          There's no such thing as "transmitted in Unicode". You mean
          encoded in UTF-8? But UTF-8 is an 8-bit character set (hence the name).

          > then there is an advantage of the actual 8212 character because it
          > needs only two bytes whereas &mdash; is 7 characters.

          The only sense I can make of this is that if you use an encoding that
          permits a direct representation of a character instead of requiring an
          entity you'll save a few bytes. So, in UTF-8, the letter A requires 1
          byte where &#65; would require 5. Is that what you meant?

          --
          Brian (remove ".invalid" to email me)

          • Jonas Smithson

            #6
            Re: Named vs. numerical entities

            Brian wrote:
            > BTW, did the person whose work you read actually claim that there
            > would be a noticeable difference in 2 documents, where document (a)
            > had 6 (or 12, or, heck, even 60) bytes more than document (b)?

            No, he didn't put the remark in context, as I recall... although I
            don't even remember whether I read it online or in some computer book,
            and the whole subject of encodings is totally confusing to me so I
            probably misunderstood whatever context there may have been.

            However, some of my pages have numerous character entities on them...
            let's say up to fifty on a page, perhaps; if they each entailed an
            extra six bytes (for example) over some alternate method, then that
            might add up to an extra 300 bytes. What does that equal in download
            time? How many bytes of difference do *you* think would make a
            "noticeable difference" between two documents... say, to a user on a
            56K modem?

            • Alan J. Flavell

              #7
              Re: Named vs. numerical entities

              On Fri, 16 Jul 2004, Brian wrote:
              > There's no such thing as "transmitted in Unicode".

              Agreed.

              > You mean encoded in UTF-8? But UTF-8 is an 8-bit character set

              No, utf-8 isn't a "character set" at all (that MIME "charset"
              parameter denotes what we nowadays call a "character encoding
              scheme").

              > (hence the name).

              The utf-8 scheme is built with 8-bit units, indeed, but characters are
              represented by variable numbers of those units. (As you obviously
              know).

              cheers
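
To make the "variable numbers of 8-bit units" point concrete, here is a small sketch (an editorial illustration, not from the original post). The function name `utf8_length` and the four sample characters are arbitrary choices.

```python
import unicodedata


def utf8_length(ch: str) -> int:
    """Number of 8-bit units UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


# One sample each from the 1-, 2-, 3- and 4-byte ranges of UTF-8.
for ch in ("A", "\u00e9", "\u2014", "\U0001F600"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"{len(encoded)} byte(s): {encoded.hex(' ')}")
```

The unit is always eight bits wide, but a character takes between one and four of them, which is why "8-bit" describes the unit and not the size of a character.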

              • Alan J. Flavell

                #8
                Re: Named vs. numerical entities

                On Thu, 15 Jul 2004, Brian wrote:
                > > I found this claim a little surprising
                >
                > That's being too kind.

                ;-)

                If the hon Usenaut is worried about the size of their HTML documents,
                it may be worth noting that most current browsers are happy to accept
                gzip-compressed HTML. At least for documents which are in a Latin
                base-language, this can make far more difference to total size than
                worrying about the difference between a few &-notations and utf-8
                encoding.

                But it's probably not worth doing this until the individual HTML items
                are significantly larger than the amount of HTTP red-tape involved in
                retrieving the item. More than a few kBytes each, let's say.

                For extra brownie points, the server can be set to honour the
                browser's Accept-encoding header, sending gzip-compressed format to
                those who say they accept it, and straight HTML to any who don't.

                There are third-party Apache modules which take care of this "on the
                fly", but it can be done more simply (i.e. with MultiViews) if one is
                willing to store both versions on the server. Disk space is cheap
                nowadays, after all.

                good luck
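
A rough sketch of the negotiation described above, written in Python rather than as Apache configuration. The function names `accepts_gzip` and `choose_body` and the sample page are hypothetical, and the header parsing is deliberately simplified: it honours `gzip;q=0` as a refusal but ignores the `*` wildcard and any ranking between codings.

```python
import gzip
from typing import Optional, Tuple


def accepts_gzip(accept_encoding: str) -> bool:
    """Simplified check of an Accept-Encoding header value for gzip."""
    for item in accept_encoding.split(","):
        coding, _, params = item.partition(";")
        if coding.strip().lower() != "gzip":
            continue
        name, _, value = params.strip().partition("=")
        if name.strip().lower() == "q":
            try:
                return float(value) > 0  # q=0 means "not acceptable"
            except ValueError:
                return False
        return True
    return False


def choose_body(html: bytes, accept_encoding: str) -> Tuple[bytes, Optional[str]]:
    """Return (body, Content-Encoding value or None) for a response."""
    if accepts_gzip(accept_encoding):
        return gzip.compress(html), "gzip"
    return html, None


# Made-up page: fifty short paragraphs, each with one named entity.
page = ("<p>Some ordinary text &mdash; and a little more of it.</p>\n" * 50).encode("ascii")

body, content_encoding = choose_body(page, "gzip, deflate")
print(len(page), len(body), content_encoding)
```

The sample page is highly repetitive, so it compresses far better than real prose would; the point is only that compression works on the whole document, whereas the choice of entity notation affects a few bytes per occurrence.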

                • Alan J. Flavell

                  #9
                  Re: Named vs. numerical entities

                  On Fri, 16 Jul 2004, Jonas Smithson wrote:
                  > I recently read the claim somewhere that numerical entities (such as
                  > &#8212;) have a speed advantage over the equivalent named entities

                  Others have rightly explained what nonsense that is...

                  > Also, which form of the entity enjoys wider browser support?

                  You've been given the URL of my checklist for the wider picture, but
                  to summarise the relevant points:

                  - utf-8 encoding is widely supported and a compact representation; its
                  problem is more the possibility of mishandling in the hands of authors
                  who are not yet familiar with it.

                  - The Latin-1 named entities (those proposed in the appendix to
                  RFC1866/HTML2.0) are very well supported.

                  - Generally speaking the entities introduced in HTML4 are now
                  supported, but there are still browsers around (e.g. NN4.*) that don't
                  understand them. For almost all of these characters, I'd still say
                  that the &#number; representation is somewhat more widely supported.

                  It's best, of course, if your HTML authoring software takes care of
                  the details for you, according to some options which you can set.

                  &euro; is widely recognised, and at least still comprehensible in
                  browsers which don't implement it (since browsers usually display
                  character entities literally if they don't understand them).

                  > They both seem to work with modern browsers... but what about older
                  > or very buggy browsers?

                  The checklist does its best to take that into account and choose best
                  compromises depending on the character repertoire which you need.

                  WebTV seemed to be hopeless with anything outside of a subset of
                  Windows-1252 repertoire. If you have anything more challenging as
                  your content, then you'd basically have to write it off. I hear that
                  they're working on it.
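
For readers who want to see that the named and numeric notations really denote the same character, here is a minimal sketch using Python's standard `html` module (an editorial illustration, not the checklist's method). The helper `named_to_decimal` is a hypothetical name; `name2codepoint` covers the HTML 4 entity names only.

```python
import html
from html.entities import name2codepoint

# Named, decimal and hexadecimal notations for the same character.
references = ["&mdash;", "&#8212;", "&#x2014;"]
decoded = [html.unescape(ref) for ref in references]
print(decoded)


def named_to_decimal(name: str) -> str:
    """Return the decimal reference for an HTML 4 entity name such as 'mdash'."""
    return f"&#{name2codepoint[name]};"


print(named_to_decimal("mdash"))
print(named_to_decimal("euro"))
```

A conforming parser treats all three notations identically, so any difference between them comes down to which ones a particular old browser happens to recognise.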

                  • Harlan Messinger

                    #10
                    Re: Named vs. numerical entities

                    "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:
                    >On Fri, 16 Jul 2004, Brian wrote:
                    >
                    >> There's no such thing as "transmitted in Unicode".
                    >
                    >Agreed.

                    Why can't a document be encoded (and transmitted) in Unicode? If
                    Windows Notepad lets you save a text file as Unicode (big- or
                    little-endian), isn't that the same thing?


                    --
                    Harlan Messinger
                    Remove the first dot from my e-mail address.
                    Veuillez ôter le premier point de mon adresse de courriel.

                    • Andreas Prilop

                      #11
                      Re: Named vs. numerical entities

                      On Fri, 16 Jul 2004, Harlan Messinger wrote:
                      > Why can't a document be encoded (and transmitted) in Unicode?

                      It cannot be "in Unicode"; it has to be in UTF-8, UTF-16, or UTF-32,
                      and UTF-16 and UTF-32 additionally come in different byte orders.
                      <http://www.unicode.org/unicode/faq/utf_bom.html>

                      > If
                      > Windows Notepad lets you save a text file as Unicode (big- or
                      > little-endian), isn't that the same thing?

                      "Big- or little-endian" rules out UTF-8, so probably it's UTF-16.
                      UTF-32 isn't used in MS Windows AFAIK.

                      --
                      Top-posting.
                      What's the most irritating thing on Usenet?
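
The byte-order point can be illustrated with Python's built-in codecs. This is only a sketch showing what the two UTF-16 byte orders look like for one character; it makes no claim about what Notepad itself writes.

```python
import codecs

text = "\u2014"  # em dash, code point U+2014

little = text.encode("utf-16-le")  # low-order byte first, no byte order mark
big = text.encode("utf-16-be")     # high-order byte first, no byte order mark
with_bom = text.encode("utf-16")   # byte order mark, then platform byte order

print(little.hex(" "), "|", big.hex(" "), "|", with_bom.hex(" "))

# The leading two bytes tell a reader which byte order follows.
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```

The same code point produces different byte sequences in each scheme, which is why "Unicode" alone does not say what bytes end up in the file.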

                      • Andreas Prilop

                        #12
                        Re: Named vs. numerical entities

                        On Fri, 16 Jul 2004, Jonas Smithson wrote:
                        > I recently read the claim somewhere that numerical entities (such as
                        > &#8212;) have a speed advantage over the equivalent named entities
                        > (such as &mdash;) because the numerical entity requires just a single
                        > byte to be downloaded to the browser, while the named entity requires
                        > one byte for each letter.

                        Others told you already that isn't true. But even if it were true,
                        a single image is usually bigger than your source text. So length
                        doesn't really matter. [ Oops, what did I write :-) ]

                        But as <http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist.html#s6>
                        explains, decimal references are somewhat better supported among
                        (older) browsers than hexadecimal references or entities.

                        --
                        Top-posting.
                        What's the most irritating thing on Usenet?
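
If decimal references are the safest notation for old browsers, text can be converted to them mechanically. One possible sketch uses Python's `xmlcharrefreplace` error handler; the function name `to_decimal_references` is made up for illustration.

```python
def to_decimal_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal &#number; reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_decimal_references("caf\u00e9 \u2014 5 \u20ac"))
```

This only rewrites characters outside ASCII. It does not escape markup characters such as `&` or `<`, which would need separate handling.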

                        • Neal

                          #13
                          Re: Named vs. numerical entities

                          On Fri, 16 Jul 2004 05:34:11 GMT, Jonas Smithson
                          <smithsonNOSPAM@REMOVETHISboardermail.com> wrote:
                          > Brian wrote:
                          >
                          >> BTW, did the person whose work you read actually claim that there
                          >> would be a noticeable difference in 2 documents, where document (a)
                          >> had 6 (or 12, or, heck, even 60) bytes more than document (b)?
                          >
                          > No, he didn't put the remark in context, as I recall... although I
                          > don't even remember whether I read it online or in some computer book,
                          > and the whole subject of encodings is totally confusing to me so I
                          > probably misunderstood whatever context there may have been.
                          >
                          > However, some of my pages have numerous character entities on them...
                          > let's say up to fifty on a page, perhaps; if they each entailed an
                          > extra six bytes (for example) over some alternate method, then that
                          > might add up to an extra 300 bytes. What does that equal in download
                          > time? How many bytes of difference do *you* think would make a
                          > "noticeable difference" between two documents... say, to a user on a
                          > 56K modem?

                          Negligible. Probably most pages have that much deletable/editable crap in
                          them plus some...
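
To put a rough number on "negligible", here is a back-of-the-envelope sketch. It assumes the modem's nominal line rate and ignores modem compression, protocol overhead and latency, so the results are ballpark figures only; the function name `transfer_seconds` is made up for illustration.

```python
def transfer_seconds(extra_bytes: int, bits_per_second: float) -> float:
    """Time to move `extra_bytes` at a given nominal line rate."""
    return extra_bytes * 8 / bits_per_second


# 300 extra bytes at the nominal 56K rate, and at a slower 33.6K connection.
for rate in (56_000, 33_600):
    print(f"{rate} bit/s: {transfer_seconds(300, rate) * 1000:.0f} ms")
```

Under these assumptions the extra 300 bytes cost a few hundredths of a second, well below what a user on a dial-up connection would notice.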

                          • Alan J. Flavell

                            #14
                            Re: Named vs. numerical entities

                            On Fri, 16 Jul 2004, Harlan Messinger wrote:
                            > Why can't a document be encoded (and transmitted) in Unicode?

                            Because "Unicode" is not the name of an encoding scheme.

                            > If Windows Notepad lets you save a text file as Unicode (big- or
                            > little-endian), isn't that the same thing?

                            You're talking about just two of the possible encoding schemes for
                            Unicode. MS using baby-talk is maybe "good enough for government
                            work", but this here is a technical forum. What MS's terms are
                            denoting are utf-16LE and utf-16BE encoding schemes.

                            And in any case, probably the best choice (if no other constraints
                            apply) of Unicode encoding scheme for HTML used in a WWW context is
                            utf-8, not utf-16LE/BE.

                            • Pierre Goiffon

                              #15
                              Re: Named vs. numerical entities

                              "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in message
                              news:Pine.LNX.4.53.0407161334000.7123@ppepc56.ph.gla.ac.uk
                              > And in any case, probably the best choice (if no other constraints
                              > apply) of Unicode encoding scheme for HTML used in a WWW context is
                              > utf-8, not utf-16LE/BE.

                              Do you mean when the text consists overwhelmingly of Latin characters?
                              If not, wouldn't the file get very large? Wouldn't it be better to use
                              UTF-16?
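
The size trade-off raised here can be measured rather than guessed. The sketch below is an editorial illustration with made-up sample strings, not an answer given in the thread: it compares the two encodings for a mostly Latin fragment and for a fragment whose text is Japanese. The helper name `sizes` is arbitrary.

```python
def sizes(text: str) -> dict:
    """Encoded length of `text` in UTF-8 and in UTF-16 (no byte order mark)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le")}


markup = '<p lang="{lang}">{body}</p>'

# Mostly ASCII text with a few accented letters and an em dash.
latin = markup.format(lang="fr", body="D\u00e9j\u00e0 vu \u2014 encore une fois.")

# Seventy characters that each need three bytes in UTF-8.
cjk = markup.format(lang="ja", body="\u3053\u3093\u306b\u3061\u306f\u4e16\u754c" * 10)

print("latin:", sizes(latin))
print("cjk:  ", sizes(cjk))
```

In this sample UTF-8 is smaller for the Latin fragment and UTF-16 is smaller for the Japanese one. The ASCII markup always favours UTF-8, though, so the outcome for a real page depends on its ratio of markup to text.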
