Named vs. numerical entities

  • C A Upsdell

    #46
    Re: Named vs. numerical entities

    "Alan J. Flavell" <flavell@ph.gla .ac.uk> wrote in message
    news:Pine.LNX.4.53.0407161959060.7123@ppepc56.ph.gla.ac.uk...
    > On Fri, 16 Jul 2004, C A Upsdell wrote:
    >
    > > Standards written later appear to have disassociated the term ASCII
    > > from the national variants
    >
    > Uh-uh, it's an international conspiracy to hide the origin of these
    > codes, is it? You don't seriously believe that the US American
    > national standards body would go making national character codes for
    > other countries, do you?

    I generally respect what you say, even when I disagree with you. But a
    paragraph like this is unworthy of you. International conspiracy? ISO an
    American standards body? Standards being set by one national standards body
    without consulting with other nations? You speak as if the US were the only
    legitimate country in the world! Surely you are not (gasp!) a US
    Republican!

    > > and extended sets,
    >
    > At this point nobody's arguing about "extended sets". It's about national
    > variants based on the 7-bit code called ASCII.

    And as I said before, there were 8-bit ASCII sets, sometimes called extended
    ASCII: 7 bits are not adequate to code characters for most European
    languages, or for specialized character sets.

    I do wish I had never discarded the manuals I used 3 decades ago. And I
    wish that people would refuse to believe that information does not exist if
    it does not make its way to the Internet. I have used computers, languages,
    operating systems, tools, and manuals that have long been extinct. E.g.,
    how many remember 8080 assembly programming using Intel MDS Development
    Systems running the ISIS-II operating system? Or my favourite programmer's
    editor, the Sage Professional Editor for Windows and OS/2? Or how to
    program Intel's 8259A UART for either 7- or 8-bit serial communications?
    Sigh.




    • Stan Brown

      #47
      Re: Named vs. numerical entities

      "Jonas Smithson" <smithsonNOSPAM @REMOVETHISboar dermail.com> wrote in
      comp.infosystems.www.authoring.html:
      >But I got the core information I needed: there's no
      >speed advantage of &#8212; over &mdash;.

      It's true that there's no speed advantage.

      There is another advantage, however, one that I have not seen
      mentioned in this thread: Netscape 4 understands &#8212; but does
      not understand &mdash;. That might weigh in your decision.

      --
      Stan Brown, Oak Road Systems, Tompkins County, New York, USA

      HTML 4.01 spec: http://www.w3.org/TR/html401/
      validator: http://validator.w3.org/
      CSS 2 spec: http://www.w3.org/TR/REC-CSS2/
      2.1 changes: http://www.w3.org/TR/CSS21/changes.html
      validator: http://jigsaw.w3.org/css-validator/


      • Tim

        #48
        Re: Named vs. numerical entities

        On Fri, 16 Jul 2004, Pierre Goiffon wrote:
        >> If not, wouldn't the file get very large? Wouldn't it be
        >> better to use UTF-16?


        "Alan J. Flavell" <flavell@ph.gla .ac.uk> posted:
        > I haven't widely tested browser compatibility for utf-16 encodings, so
        > I can't comment on that aspect.

        Not that long ago I tried utf-16 on several different (and *current*
        versions of) web browsers. Only some could use it.

        I know that's vague, and I'm not inclined to run all the tests before I
        post this response. But it was enough to convince *me* that it was a bad
        idea.

        --
        If you insist on e-mailing me, use the reply-to address (it's real but
        temporary). But please reply to the group, like you're supposed to.

        This message was sent without a virus, please delete some files yourself.


        • Leif K-Brooks

          #49
          Re: Named vs. numerical entities

          Jonas Smithson wrote:
          > I can't find anything in the editor's preferences or
          > dialogs about "utf-8". When they say "Save as Unicode", is it likely
          > they mean the same thing you mean by "save in utf-8 format"?

          I have way too much time on my hands, so I think I'll write a
          (hopefully) easy to understand explanation of this stuff. I'm not an
          expert, and I'm sure one will correct me on some of the finer points,
          but I should at least be able to give you a good enough idea of this stuff.

          Computers store things in bytes, which are numbers between 0 and 255.
          This system works great for numbers, since you can use multiple bytes to
          store numbers larger than 255, but text is a bit problematic when all
          you have to work with is numbers.

          Enter character sets and encodings. A character set is just that: a set
          of characters. An encoding is a way to convert characters in a character
          set into a series of bytes. Some simple character sets which define 256
          characters or fewer can also be considered encodings, since nothing
          special is required to convert them into bytes.

          The first character set, which was also an encoding because it defined
          only 128 characters, was called ASCII. It was fine for early computers,
          but there was a problem: it only defined the Latin alphabet, digits, and
          a few simple symbols. Countries which needed accented letters had
          trouble, and countries which had entirely different alphabets couldn't
          use ASCII at all.

          In an attempt to fix all of those problems, the International
          Organization for Standardization and others defined encodings which kept
          the 128 ASCII characters, but also used the other 128 integers in a byte
          for other characters. Unfortunately, there were more than 128 characters
          needed for other alphabets, so several incompatible encodings defining
          different characters were created instead of just one. That worked for a
          while, but the incompatibility of the different encodings stopped
          characters from different alphabets from being used in the same
          document, which some people needed to do.

          The most important character set today is called Unicode. It currently
          defines 96000 characters, and reserves the right to define a total of
          1114112 characters in the future. It has Latin, Greek, Chinese, and
          everything in between; hopefully enough for anyone.

          Note that I said Unicode is a character set, not an encoding. It has
          three different encodings: UTF-8, UTF-16, and UTF-32. UTF-8 is probably
          the most used; it uses a different number of bytes (between 1 and 4) for
          different characters, and all ASCII text is also valid UTF-8 text.
          UTF-16 also uses a variable number of bytes; 2 or 4 in this case. UTF-32 is
          the simplest for programs to process; it uses 4 bytes for every character.
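
A minimal Python sketch of the point above: the same character occupies a different number of bytes in each Unicode encoding. The sample characters and the helper name `byte_lengths` are arbitrary choices for illustration, and the explicit little-endian codec names are used only to keep a byte-order mark out of the counts.

```python
# Byte counts for a few characters in the three Unicode encodings.
samples = {
    "A": "LATIN CAPITAL LETTER A (also ASCII)",
    "\u00e9": "LATIN SMALL LETTER E WITH ACUTE",
    "\u2014": "EM DASH",
    "\U0001d11e": "MUSICAL SYMBOL G CLEF (outside the 16-bit range)",
}

def byte_lengths(ch):
    """Return (utf-8, utf-16, utf-32) byte counts for one character."""
    codecs = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(ch.encode(codec)) for codec in codecs)

for ch, name in samples.items():
    print(f"U+{ord(ch):04X} {name}: {byte_lengths(ch)}")
```

The ASCII letter is a single byte only in UTF-8, and only the character outside the 16-bit range needs four bytes in UTF-16.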

          As to whether your editor means UTF-8 by Unicode, I'm not sure. It
          doesn't really mean Unicode, but whether it means UTF-8, UTF-16, or
          UTF-32 is difficult to say.

          > If I were working and saving in unicode, would that mean (for example)
          > that I could type an emdash the way we Mac users do it
          > (command-option-hyphen) and that would actually work in the HTML
          > document on other platforms, without my using any character entity (or
          > character reference or whatever it's called)?

          Yes. I believe Mac OS X handles these things very nicely, so you
          shouldn't have any trouble.

          > And would the emdash character then be more
          > "compact" (smaller download) than the character reference (&#8212;)
          > I've been using?

          Yes. &#8212; is 7 bytes in UTF-8, but the emdash encoded in UTF-8 is
          only three bytes.
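
The sizes can be checked directly. This is a small Python sketch, assuming the reference in question is the decimal form `&#8212;`; the variable names are illustrative only.

```python
import html

reference = "&#8212;"                # numeric character reference for the em dash
literal = html.unescape(reference)   # the character itself, U+2014

ref_size = len(reference.encode("utf-8"))     # seven ASCII characters
literal_size = len(literal.encode("utf-8"))   # bytes for the literal character

print(ref_size, literal_size, literal.encode("utf-8").hex())
```

Under these assumptions the reference costs 7 bytes and the literal em dash costs 3 bytes in UTF-8, so the literal is still the smaller of the two.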

          > But...um... didn't I read somewhere that unicode
          > documents are much larger than... the other kind... (what's a
          > 'non-Unicode' document called?) and so should only be used if you need
          > support for large character sets like Chinese etc...?

          Yes and no. UTF-8 documents are the same size as iso-8859-1 documents,
          but UTF-16 and UTF-32 documents are larger.

          > And then, of course, there's the whole other issue that my FTP program
          > automatically converts code to iso-8859-1 charset when you upload it,
          > unless you tell it not to, and when BBEdit talks directly to the FTP
          > server I don't know what it does.

          My advice would be to replace your FTP client if it's that broken, but
          you might be able to fix it by uploading in binary mode instead of text.
          As for what BBEdit does, my guess would be that it does the right
          thing if it has an option for Unicode when saving.

          > And if I did save a text file as unicode, when I opened it later in a
          > text editor (perhaps even a different one), would I be able to tell
          > what it was saved as?

          Not unless your text editor told you, which it might.


          • Alan J. Flavell

            #50
            Re: Named vs. numerical entities

            On Sat, 17 Jul 2004, Jonas Smithson wrote:
            > The editor (an old version of BBEdit) gives me two options for the
            > document while I'm working on it: "Encode as Unicode" and, if that's
            > enabled, the option to "Swap Bytes".

            Feel free to play around with this stuff and see what happens. E.g.,
            put some interesting characters into a file, save it with the various
            options, open the file in a unicode-capable web browser and play with
            its view> character encoding options (whatever it calls them) till the
            result makes sense. Then you'll have a better idea of what you've
            got. View the source to make sure you're getting coded characters
            instead of &-notations.

            My hunch is that your editor is talking about the forerunner of utf-16
            which was called ucs-2, back when the Unicode range could all be
            represented in two bytes. For this subset of characters, you may be
            able to treat utf-16 and ucs-2 as effectively synonymous.
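
A short Python sketch of what a two-byte encoding and a byte-swapping option might amount to. What BBEdit's "Swap Bytes" actually does is an assumption here, and `swap_pairs` is a hypothetical helper written only for this illustration.

```python
text = "A\u2014"  # a letter and an em dash, both inside the 16-bit range

big = text.encode("utf-16-be")     # most significant byte first
little = text.encode("utf-16-le")  # least significant byte first

def swap_pairs(data):
    """Swap the two bytes of every 16-bit unit (data must have even length)."""
    out = bytearray(data)
    out[0::2], out[1::2] = data[1::2], data[0::2]
    return bytes(out)

print(big.hex(), little.hex(), swap_pairs(big).hex())
```

For characters in this range each one is exactly two bytes, and swapping the byte pairs of one byte order produces the other.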

            My reading of Alan Wood's pages on editors (please consult them) is
            that current versions of BBEdit support utf-8:

            Text editors, HTML editors and word processors with Unicode, UTF-8 or multilingual support that run under Mac OS 9. Part of Alan Wood's Unicode Resources.

            > (It also gives me a choice of Macintosh, Unix, or DOS line breaks,
            > which I assume wouldn't affect the HTML display.)

            Agreed

            > If I were working and saving in unicode, would that mean (for example)
            > that I could type an emdash the way we Mac users do it
            > (command-option-hyphen) and that would actually work in the HTML
            > document on other platforms, without my using any character entity

            Right

            > And would the emdash character then be more "compact" (smaller
            > download) than the character reference (&#8212;) I've been using?

            Yes

            > But...um... didn't I read somewhere that unicode
            > documents are much larger than... the other kind...

            utf-8 is a good compromise for western writing systems. We've
            discussed some of the issues elsewhere on this thread.

            > And then, of course, there's the whole other issue that my FTP program
            > automatically converts code to iso-8859-1 charset when you upload it,
            > unless you tell it not to, and when BBEdit talks directly to the FTP
            > server I don't know what it does.

            This is a detail which you'd need to get a grasp on, right.

            But play around a bit, and read around a bit, so that competences and
            understanding stay reasonably in step. In the end, it's all much
            simpler and more straightforward than it might have seemed at the outset.
            But if your software doesn't properly support what you're trying to do,
            then you're confronted with extra difficulties. So do take a look at
            Alan Wood's overview as it relates to your particular platform(s) and
            pick something that appeals to you, at least for the first steps.
            Then you'd be able to assess whether the software that you're already
            using is actually capable of what you need.


            • Alan J. Flavell

              #51
              Re: Named vs. numerical entities

              On Sat, 17 Jul 2004, Leif K-Brooks wrote:
              > Yes and no. UTF-8 documents are the same size as iso-8859-1 documents,

              Er, no. The characters in the upper half of iso-8859-1 need two bytes
              per character in utf-8; only one in iso-8859-1.
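
The size difference is easy to measure. This is a minimal Python sketch; the sample strings and the helper name `sizes` are made up for illustration.

```python
def sizes(text):
    """Return (iso-8859-1 bytes, utf-8 bytes) for text that fits in iso-8859-1."""
    return len(text.encode("iso-8859-1")), len(text.encode("utf-8"))

print(sizes("plain ASCII text"))       # ASCII only: both encodings agree
print(sizes("caf\u00e9 cr\u00e8me"))   # two accented letters from the upper half
```

Pure ASCII text comes out the same size either way; each upper-half character adds one extra byte in utf-8.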

              > My advice would be to replace your FTP client if it's that broken,

              Cue A.Prilop and the anti-Pirard league (that's an in-joke, don't
              worry about it). The FTP software is not "broken", it's got extra
              functionality, for mapping between traditional MacRoman encoding and
              iso-8859-1. That function needs to be off when the material isn't
              encoded in MacRoman.
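
To see why the mapping has to be switched off, here is a hedged Python sketch of what such a conversion would do to a file that is really utf-8. The exact behaviour of any particular FTP client is an assumption; the sketch simply treats the bytes as MacRoman and re-encodes them as iso-8859-1, substituting `?` where no mapping exists.

```python
original = "\u2014".encode("utf-8")  # em dash as utf-8 bytes: e2 80 94

# Hypothetical text-mode transfer: interpret the bytes as MacRoman, then
# re-encode as iso-8859-1, replacing anything that has no equivalent.
as_macroman = original.decode("mac_roman")
mangled = as_macroman.encode("iso-8859-1", errors="replace")

print(original.hex(), "->", mangled.hex())
```

The bytes that arrive on the server no longer form the original utf-8 sequence, so the em dash is lost.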


              • Andy Dingley

                #52
                Re: Named vs. numerical entities

                On Fri, 16 Jul 2004 17:50:35 GMT, "C A Upsdell"
                <cupsdell0311XX X@-@-@XXXrogers.com> wrote:
                >I routinely worked with
                >various 7- and 8-bit ASCII character sets (in addition to EBCDIC and Gray
                >codes).

                Gray codes are a red herring here. They've nothing to do with
                character encodings.


                • Brian

                  #53
                  Re: Named vs. numerical entities

                  C A Upsdell wrote:
                  > "Alan J. Flavell" wrote...
                  >
                  >> C A Upsdell wrote:
                  >>
                  >>> Standards written later appear to have disassociated the term
                  >>> ASCII from the national variants
                  >>
                  >> Uh-uh, it's an international conspiracy to hide the origin of
                  >> these codes, is it? You don't seriously believe that the US
                  >> American national standards body would go making national
                  >> character codes for other countries, do you?
                  >
                  > a paragraph like this is unworthy of you. International
                  > conspiracy?

                  "Don't you know sarcasm when you hear it?!" [1]

                  > ISO an American standards body?

                  Not *quite* what he was saying. ;-)

                  > Standards being set by one national standards body without
                  > consulting with other nations? You speak as if the US were the
                  > only legitimate country in the world!

                  Hard to imagine how you could have misread that post more than you did.

                  --
                  Brian (remove ".invalid" to email me)


                  • Alan J. Flavell

                    #54
                    Re: Named vs. numerical entities

                    On Sat, 17 Jul 2004, Brian wrote [to C A Upsdell ]:
                    > Hard to imagine how you could have misread that post more than you did.

                    It's comforting to know that someone could perceive the
                    discrepancy ;-)

                    I don't think it's worth my while to even start on responding to the
                    various non-sequiturs. Suffice it to say that I'm well near the front
                    in the crabby old b*gger stakes, I met my first computer in 1958 and
                    some of my early programs are for converting between different
                    character encodings. I've had an interest in character
                    representation, specifications, standards, usage and terminology in
                    this field ever since.

                    Oh, and ASCII is a 7-bit code.

                    all the best.


                    • Brian

                      #55
                      Re: Named vs. numerical entities

                      Brian wrote:
                      > "Don't you know sarcasm when you hear it?!" [1]

                      That note marker was meant to be followed by a citation. Here it is: I
                      lifted that from one Charles Brown.

                      --
                      Brian (remove ".invalid" to email me)


                      • Leif K-Brooks

                        #56
                        Re: Named vs. numerical entities

                        Alan J. Flavell wrote:
                        > On Sat, 17 Jul 2004, Leif K-Brooks wrote:
                        >
                        >> Yes and no. UTF-8 documents are the same size as iso-8859-1 documents,
                        >
                        > Er, no. The characters in the upper half of iso-8859-1 need two bytes
                        > per character in utf-8; only one in iso-8859-1.

                        Darn, you're right. Could've sworn I read that somewhere, even though it
                        doesn't make any sense; I guess this is why one shouldn't make Usenet
                        posts after midnight.


                        • Andy Dingley

                          #57
                          Re: Named vs. numerical entities

                          On Fri, 16 Jul 2004 19:57:52 GMT, Jonas Smithson
                          <smithsonNOSPAM @REMOVETHISboar dermail.com> wrote:
                          >But I got the core information I needed: there's no
                          >speed advantage of &#8212; over &mdash;.

                          ...I've never understood encodings or entities either....


                          How portable is &#8212;, as a very general thing, relative to say,
                          &#160; ?

                          I'd always assumed that both were effectively portable, but just this
                          week I've been having trouble with a system (Vodafone's PartnerML)
                          that can't handle apostrophes from M$oft Word, that appear as ’


                          • Jonas Smithson

                            #58
                            Re: Named vs. numerical entities

                            Well, thanks again to all of you; you've given me a good starting point
                            for figuring out at least the basics of this stuff, and you've been
                            very patient with me (although not, I think, with each other!).

                            I realize now that some of what I've been reading in books has been
                            misleading or simply wrong; it's odd that a Usenet newsgroup could be
                            more reliable than some books from reputable publishers, but that seems
                            to be the case... which makes it hard to know how to "filter"
                            information as I go forward. In fact, much as I dislike the combative
                            or sneering tone that many Usenetters adopt (unnecessarily, I think), I
                            see that the contentiousness does serve one useful purpose -- when I'm
                            reading a book that contains misinformation, it would be useful if a
                            critic could be there to step in with a demurral!

                            Jonas


                            • Alan J. Flavell

                              #59
                              Re: Named vs. numerical entities

                              On Sat, 17 Jul 2004, Andy Dingley wrote:
                              > How portable is &#8212;, as a very general thing,

                              Portable? Utterly: it's a string of seven ASCII characters, after
                              all; they're unlikely to come to any harm in transit. Compatible with
                              all browsers and client agents? No, but it's been clear enough since
                              RFC1866/HTML2.0 that this was where HTML would be heading; RFC2070
                              actually codified it, and HTML4.0 put it into a W3C version of HTML.
                              That's quite a little while back now, as you may recall.

                              > relative to say, &#160; ?

                              That notation is technically meaningless (in HTML) and AFAIK illegal
                              in XHTML. So by definition it's not compatible with anything. Sure,
                              it happens to pick out the displayable characters of the Windows-1252
                              code on a rather popular majority platform; and other browser makers
                              may have considered that they couldn't afford to not copy that
                              behaviour, no matter what the specifications said. So it gives the
                              visual result that the author intended; but to call that "working"
                              would be stretching things.

                              > I'd always assumed that both were effectively portable,

                              But what do you really mean by "portable"? They are notations
                              constructed of strings of ASCII characters. They will certainly
                              -reach- every client agent in that form. If you really mean "will
                              client agents render them?" why not ask that question? Most will;
                              some won't. At least if you use &#8217; then by definition any client
                              agent which doesn't render them, doesn't support HTML4. If you use
                              numbers between 128 and 159 respectively, then you're not really
                              writing HTML, but some kind of quasi-MSHTML which even MS are weaning
                              themselves off now.

                              > but just this week I've been having trouble with a system
                              > (Vodafone's PartnerML) that can't handle apostrophes from M$oft
                              > Word, that appear as ’

                              AFAIK, neither does WebTV. Works great in Lynx, of course.

                              If you would at least code them as 8-bit characters, instead of
                              &#number; notations, and send them as charset=windows-1252, then you
                              would at least be both (a) honest and (b) protocol-conforming. It's
                              not my top recommendation - far from it, but see the discussion:


                              hope that helps


                              • Alan J. Flavell

                                #60
                                Re: Named vs. numerical entities

                                On Sat, 17 Jul 2004, Alan J. Flavell wrote:
                                > On Sat, 17 Jul 2004, Andy Dingley wrote:
                                >
                                > > How portable is &#8212;, as a very general thing,
                                [...]
                                > > relative to say, &#160; ?
                                >
                                > That notation is technically meaningless (in HTML) and AFAIK illegal
                                > in XHTML.

                                Hah! You caught me out well and truly there!!

                                There's nothing wrong with 160, it's a no-break space.

                                The windows-1252 code for your em dash would be 151. And there I was,
                                posting on autopilot, assuming that's what you had typed. Well, hit
                                me down with a clue by four...

                                But the rest of what I wrote was, at least, what I intended. Sorry
                                about that.
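
The notations discussed in this exchange can be compared side by side. This is a small Python sketch using the standard library's `html.unescape`, which follows HTML5 parsing rules; those rules postdate this thread and map references in the 128 to 159 range through windows-1252, much like the browser behaviour described earlier. The variable names are illustrative only.

```python
import html

# What each notation resolves to under HTML5 parsing rules.
resolved = {ref: html.unescape(ref) for ref in ("&mdash;", "&#8212;", "&#151;", "&#160;")}
for ref, ch in resolved.items():
    print(f"{ref:8} -> U+{ord(ch):04X}")

# Byte 151 is the em dash in windows-1252 but a C1 control in iso-8859-1.
cp1252_char = bytes([151]).decode("windows-1252")
latin1_char = bytes([151]).decode("iso-8859-1")
print(f"U+{ord(cp1252_char):04X} U+{ord(latin1_char):04X}")
```

Here 160 resolves to the no-break space on its own merits, while 151 only yields an em dash because of the windows-1252 remapping.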
