Named vs. numerical entities

  • Harlan Messinger

    #31
    Re: Named vs. numerical entities


    "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in message
    news:Pine.LNX.4.53.0407161707500.7333@ppepc56.ph.gla.ac.uk...
    > On Fri, 16 Jul 2004, Harlan Messinger wrote:
    >
    > > Sorry, I missed it somehow. I intend to read it later,
    >
    > Call back here when you have done?
    >
    > > 1. There's nothing any more nonsensical about the concept of a Unicode
    > > encoding for the Unicode character set than there is about ASCII encoding
    > > for the ASCII character set,
    >
    > Actually there are substantial differences. And you see this also
    > with that MIME parameter which is (mis)named "charset" - but specifies
    > what we now would call a "character encoding scheme".
    >
    > Back when 7 or 8 bits were sufficient to represent all of the
    > characters of a repertoire, it was quasi-obvious that the "coded
    > character set" was defined by assigning numbers (0-127 or 0-255 as the
    > case may be) to the characters of the repertoire, and then to lay out
    > the fonts according to that scheme, and to transmit the characters by
    > means of bytes having that value.
    >
    > Consequently, back then it looked as if the things that we now call
    > "coded character set", "character encoding" and "font arrangement"
    > were just different names for the same thing. Of course, you needed a
    > different font for each "charset" (i.e. character encoding), which got
    > to be a considerable drag.
    >
    > Nowadays these concepts have to be disambiguated. Unicode characters
    > are designated by a code point which can, in principle, go up to 2**31
    > (it hasn't got that far yet). Those numbers then have to be
    > represented in a way which is convenient for transmission and/or
    > storage (different design criteria apply for different purposes).
    >
    > > 2. EBCDIC and ASCII define the same characters, IIRC;
    >
    > Actually not. But discussing that would be a pointless digression, so
    > let's move on.
    >
    > > So why are the UTF-* encodings "encodings of the Unicode character set"?
    >
    > It's not practical, for various reasons, to transmit characters as
    > 32-bit units. For one thing, it's very wasteful. For another,
    > there's no unique byte-ordering, hence all this fuss about endian-ness
    > when units of 16 or 32 bits are involved.
    >
    > There's also the question of representing unicode characters in a
    > mail-safe context (hence utf-7). That will fade with time, but even
    > 8-bit-safe mail formats ban null bytes, which means that utf-16 or
    > utf-32/ucs-4 representations cannot be used without a further layer of
    > encoding.
    >
    > > Is it because they are closely related to the Unicode character set
    >
    > Is it because you won't read the tutorial before asking further
    > questions?

    No, it's because sometimes questions can be satisfied by relatively simple
    answers without requiring one to read a whole tutorial (though sometimes
    not). Sometimes a tutorial or textbook will tell you the way things are
    without explaining why they aren't some other way (though sometimes not).
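
The distinction drawn in the quoted text above, between a character's code point and the bytes used to transmit it, can be sketched in a few lines of Python. This is only an illustration: the em dash is an arbitrary example character, and the variable names are made up for the sketch rather than taken from the thread.

```python
# A sketch: one Unicode code point, several byte-level encoding schemes.
ch = "\u2014"                         # EM DASH, chosen only as an example

code_point = ord(ch)                  # the abstract number, 0x2014
utf8 = ch.encode("utf-8")             # three bytes: e2 80 94
utf16_be = ch.encode("utf-16-be")     # one 16-bit unit, big-endian: 20 14
utf16_le = ch.encode("utf-16-le")     # same unit, little-endian: 14 20
utf32_be = ch.encode("utf-32-be")     # one 32-bit unit: 00 00 20 14

# Plain ASCII text picks up null bytes in UTF-16 and UTF-32, which is the
# reason those forms need a further layer of encoding in mail.
a_utf8 = "A".encode("utf-8")
a_utf16 = "A".encode("utf-16-be")
a_utf32 = "A".encode("utf-32-be")

print(f"U+{code_point:04X}")
for label, data in [("utf-8", utf8), ("utf-16-be", utf16_be),
                    ("utf-16-le", utf16_le), ("utf-32-be", utf32_be)]:
    print(f"{label:10} {data.hex()}")
print(b"\x00" in a_utf8, b"\x00" in a_utf16, b"\x00" in a_utf32)
```

The code point stays the same throughout; only the byte sequence changes with the encoding scheme and, for 16- and 32-bit units, with the byte order.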


    • Alan J. Flavell

      #32
      Re: Named vs. numerical entities

      On Fri, 16 Jul 2004, C A Upsdell wrote:

      > Not quite. You are thinking of US-ASCII. There are a variety of
      > national ASCII character sets.

      That's sloppy terminology. There's a variety of 7-bit national
      character sets which are patterned on ASCII (US-ASCII is a more
      accurate name, since - contrary to widespread belief amongst some
      parties - America doesn't consist solely of the United States).

      But those national character sets were mostly codified under ISO-646.

      I give you my old page


      and particularly the links to
      http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/CJK.html and


      This was a relevant topic in the early days of the WWW, since the code
      positions which were set aside for national variations in iso-646 were
      for example the basis for some of the "unsafe character" exclusions in
      URLs.

      Btw, I see there's a lovely comment in that Terena web page:

      It will be clear that so-called "de facto standards" are related to
      those discussed above as Monopoly banknotes to real money, valuable
      as long as the game goes on.
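
The point in this post about ISO 646 national positions and "unsafe" URL characters can be made concrete with a small sketch. It assumes the German variant (DIN 66003) as the example; the mapping table is hand-written for illustration, and `decode_646` and `url_path` are hypothetical names, not anything from the thread or from a library.

```python
from typing import Mapping

# Code positions that ISO 646 left open for national use, filled in here
# with the German variant (DIN 66003). Hand-written table, illustration only.
GERMAN_646 = {
    0x40: "§", 0x5B: "Ä", 0x5C: "Ö", 0x5D: "Ü",
    0x7B: "ä", 0x7C: "ö", 0x7D: "ü", 0x7E: "ß",
}

def decode_646(data: bytes, variant: Mapping[int, str]) -> str:
    """Decode 7-bit bytes, substituting the national-variant characters."""
    return "".join(variant.get(b, chr(b)) for b in data)

# The same bytes, read on a US-ASCII terminal and on a German 7-bit one.
url_path = b"/~user/{a|b}[1]"
print(url_path.decode("ascii"))
print(decode_646(url_path, GERMAN_646))
```

A path that is unambiguous in US-ASCII reads quite differently where the national variant is in use, which is one way to see why those positions were treated with suspicion in URLs.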


      • C A Upsdell

        #33
        Re: Named vs. numerical entities

        "Andreas Prilop" <nhtcapri@rrzn-user.uni-hannover.de> wrote in message
        news:Pine.GSO.4.44.0407161833140.11334-100000@s5b003...
        > On Fri, 16 Jul 2004, C A Upsdell wrote:
        >
        > >> ASCII is a coded character set of 128 characters defined in ANSI X3.4
        > >> and ISO 646.
        > >
        > > Not quite. You are thinking of US-ASCII.
        >
        > ASCII and US-ASCII are synonyms.
        > <http://www.iana.org/assignments/character-sets>

        NOT TRUE!!!! Read the IANA page: "These names are expressed in
        ANSI_X3.4-1968 which is commonly called US-ASCII or simply ASCII. The
        character set most commonly use in the Internet and used especially in
        protocol standards is US-ASCII, this is strongly encouraged. The use of the
        name US-ASCII is also encouraged." This says that US-ASCII is commonly
        called ASCII. It does not say that US-ASCII is ASCII.

        Also, when I started developing software in the early 1970's -- before the
        Internet, before PCs, before microprocessors -- I routinely worked with
        various 7- and 8-bit ASCII character sets (in addition to EBCDIC and Gray
        codes). I find many Internet references denying the existence of 8-bit
        ASCII, but I can attest that, in the early 1970s, multiple 7- and 8-bit sets
        were alive and well.




        • Harlan Messinger

          #34
          Re: Named vs. numerical entities


          "C A Upsdell" <cupsdell0311XXX@-@-@XXXrogers.com> wrote in message
          news:LzUJc.1$CmC1.0@news04.bloor.is.net.cable.rogers.com...
          > "Andreas Prilop" <nhtcapri@rrzn-user.uni-hannover.de> wrote in message
          > news:Pine.GSO.4.44.0407161833140.11334-100000@s5b003...
          > > On Fri, 16 Jul 2004, C A Upsdell wrote:
          > >
          > > >> ASCII is a coded character set of 128 characters defined in ANSI X3.4
          > > >> and ISO 646.
          > > >
          > > > Not quite. You are thinking of US-ASCII.
          > >
          > > ASCII and US-ASCII are synonyms.
          > > <http://www.iana.org/assignments/character-sets>
          >
          > NOT TRUE!!!! Read the IANA page: "These names are expressed in
          > ANSI_X3.4-1968 which is commonly called US-ASCII or simply ASCII. The
          > character set most commonly use in the Internet and used especially in
          > protocol standards is US-ASCII, this is strongly encouraged. The use of the
          > name US-ASCII is also encouraged." This says that US-ASCII is commonly
          > called ASCII. It does not say that US-ASCII is ASCII.

          Uh, yeah, it does, unless the implication is along the lines of "... called
          US-ASCII, or often simply ASCII, although this is technically incorrect
          because ASCII properly refers to a different character set". But that isn't
          the implication and the statement is saying that US-ASCII, ASCII, and
          ANSI_X3.4-1968 are all names for the same thing--which is the same as saying
          that each of them is also each of the others.

          > Also, when I started developing software in the early 1970's -- before the
          > Internet, before PCs, before microprocessors -- I routinely worked with
          > various 7- and 8-bit ASCII character sets (in addition to EBCDIC and Gray
          > codes). I find many Internet references denying the existence of 8-bit
          > ASCII, but I can attest that, in the early 1970s, multiple 7- and 8-bit sets
          > were alive and well.

          Multiple 7- and 8-bit sets were alive and well. But they were not ASCII.
          They may have been ASCII extensions, but they were not ASCII.


          • C A Upsdell

            #35
            Re: Named vs. numerical entities

            "Harlan Messinger" <h.messinger@comcast.net> wrote in message
            news:2lqjqsFfb5rcU1@uni-berlin.de...
            > > Also, when I started developing software in the early 1970's -- before the
            > > Internet, before PCs, before microprocessors -- I routinely worked with
            > > various 7- and 8-bit ASCII character sets (in addition to EBCDIC and Gray
            > > codes). I find many Internet references denying the existence of 8-bit
            > > ASCII, but I can attest that, in the early 1970s, multiple 7- and 8-bit sets
            > > were alive and well.
            >
            > Multiple 7- and 8-bit sets were alive and well. But they were not ASCII.
            > They may have been ASCII extensions, but they were not ASCII.

            Indeed they were ASCII. Standards written later appear to have
            disassociated the term ASCII from the national variants and extended sets,
            preferring to give them numbered ANSI designations, but in the early 1970s
            they were ASCII. National variants which I personally worked with included
            French, German, and Italian ASCII sets, and one of my co-workers worked with
            the Portuguese set. US-ASCII is the preferred term now to avoid confusion
            with the other ASCII sets.




            • Alan J. Flavell

              #36
              Re: Named vs. numerical entities

              On Fri, 16 Jul 2004, Harlan Messinger wrote:

              [after a bout of over-enthusiastic quoting]

              > > Is it because you won't read the tutorial before asking further
              > > questions?
              >
              > No, it's because sometimes questions can be satisfied by relatively
              > simple answers without requiring one to read a whole tutorial

              And it's because often, the relatively simple answers don't make any
              sense until you've done the groundwork first so that you can
              understand the answers (or even better - ask the right questions).

              Your attention was directed to the tutorial for a constructive reason:
              someone who knew the subject believed that it would be of genuine
              benefit to you, it would position you better for the subsequent
              discussion. As it happens, that is also my own opinion.

              > Sometimes a tutorial or textbook will tell you the way things are
              > without explaining why they aren't some other way (though sometimes not).

              You'll be able to tell us how it was when you've tried it, OK? That
              is, if I haven't lost patience by then and put you back into the
              killfile...


              • Harlan Messinger

                #37
                Re: Named vs. numerical entities


                "C A Upsdell" <cupsdell0311XXX@-@-@XXXrogers.com> wrote in message
                news:Z7VJc.1$DgD1.0@news04.bloor.is.net.cable.rogers.com...
                > "Harlan Messinger" <h.messinger@comcast.net> wrote in message
                > news:2lqjqsFfb5rcU1@uni-berlin.de...
                > > > Also, when I started developing software in the early 1970's -- before the
                > > > Internet, before PCs, before microprocessors -- I routinely worked with
                > > > various 7- and 8-bit ASCII character sets (in addition to EBCDIC and Gray
                > > > codes). I find many Internet references denying the existence of 8-bit
                > > > ASCII, but I can attest that, in the early 1970s, multiple 7- and 8-bit sets
                > > > were alive and well.
                > >
                > > Multiple 7- and 8-bit sets were alive and well. But they were not ASCII.
                > > They may have been ASCII extensions, but they were not ASCII.
                >
                > Indeed they were ASCII. Standards written later appear to have
                > disassociated the term ASCII from the national variants and extended sets,
                > preferring to give them numbered ANSI designations, but in the early 1970s
                > they were ASCII. National variants which I personally worked with included
                > French, German, and Italian ASCII sets, and one of my co-workers worked with
                > the Portuguese set.

                You and they called them "ASCII" informally, or do you have a citation to
                show that ASCII was officially regarded as the proper name for these sets?

                > US-ASCII is the preferred term now to avoid confusion
                > with the other ASCII sets.


                • Harlan Messinger

                  #38
                  Re: Named vs. numerical entities


                  "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in message
                  news:Pine.LNX.4.53.0407161933330.7123@ppepc56.ph.gla.ac.uk...
                  > On Fri, 16 Jul 2004, Harlan Messinger wrote:
                  >
                  > [after a bout of over-enthusiastic quoting]
                  >
                  > > > Is it because you won't read the tutorial before asking further
                  > > > questions?
                  > >
                  > > No, it's because sometimes questions can be satisfied by relatively
                  > > simple answers without requiring one to read a whole tutorial
                  >
                  > And it's because often, the relatively simple answers don't make any
                  > sense until you've done the groundwork first so that you can
                  > understand the answers (or even better - ask the right questions).
                  >
                  > Your attention was directed to the tutorial for a constructive reason:
                  > someone who knew the subject believed that it would be of genuine
                  > benefit to you, it would position you better for the subsequent
                  > discussion. As it happens, that is also my own opinion.
                  >
                  > > Sometimes a tutorial or textbook will tell you the way things are
                  > > without explaining why they aren't some other way (though sometimes not).
                  >
                  > You'll be able to tell us how it was when you've tried it, OK? That
                  > is, if I haven't lost patience by then and put you back into the
                  > killfile...

                  Oh, good grief, go ahead and get it over with. One would think I'd said
                  something simply terrible to you, instead of just asking questions and then
                  saying why I thought it was reasonable to do so.


                  • C A Upsdell

                    #39
                    Re: Named vs. numerical entities

                    > > Indeed they were ASCII. Standards written later appear to have
                    > > disassociated the term ASCII from the national variants and extended sets,
                    > > preferring to give them numbered ANSI designations, but in the early 1970s
                    > > they were ASCII. National variants which I personally worked with included
                    > > French, German, and Italian ASCII sets, and one of my co-workers worked with
                    > > the Portuguese set.
                    >
                    > You and they called them "ASCII" informally, or do you have a citation to
                    > show that ASCII was officially regarded as the proper name for these sets?

                    I do not have any of the manuals etc. that I used 30+ years ago. Otherwise
                    I could show you.



                    • Alan J. Flavell

                      #40
                      Re: Named vs. numerical entities

                      On Fri, 16 Jul 2004, C A Upsdell wrote:

                      > Indeed they were ASCII.

                      No, they may have been *based* on ASCII, they may have been informally
                      referred to as "national ASCII", but they were not literally the
                      "American Standard Code for Information Interchange".

                      > Standards written later appear to have disassociated the term ASCII
                      > from the national variants

                      Uh-uh, it's an international conspiracy to hide the origin of these
                      codes, is it? You don't seriously believe that the US American
                      national standards body would go making national character codes for
                      other countries, do you?

                      > and extended sets,

                      At this point nobody's arguing about "extended sets". It's about
                      national variants based on the 7-bit code called ASCII.

                      > preferring to give them numbered ANSI designations,

                      There you go again. ANSI (the later name of the US American national
                      standards body) had no jurisdiction over other national variants; only
                      over the (US-)American one. The British national variant based on
                      ASCII was a British Standard designation, BS4370; other national
                      variants would have had designations under their respective standards
                      bodies (DIN in Germany, and so on).

                      Later these 7-bit codes were codified into ISO-646 under the auspices
                      of the international standards body.

                      > but in the early 1970s they were ASCII.

                      I've been interested in character coding issues since before then, and
                      I say you are mistaken, or confusing loose everyday terms and formal
                      specifications. Not that any of this is relevant to authoring HTML for
                      the WWW, so I shan't keep this sub-thread going.


                      • Jonas Smithson

                        #41
                        Re: Named vs. numerical entities

                        My thanks to all the respondents. I've been sitting here reading this
                        thread with my jaw dropped open -- people not only discussing the
                        arcane nuances of encoding methods, but flaming each other over it!
                        This thread was so far over my head that (for my purposes) it might as
                        well have been written in ancient Greek (say, is that a possible
                        encoding method?). But I got the core information I needed: there's no
                        speed advantage of &#8212; over &mdash;. I wish I could remember where
                        I read that nonsense so that (if it was in a book, which I suspect it
                        was) I could warn people about the title.

                        I guess now my decision comes down to this: named entities are more
                        intuitive (I can remember them while I type without looking at a
                        chart), but Netscape 4 doesn't understand them, and makes the text look
                        like junk -- but it does understand numerical entities, which I can't
                        remember. So which do I care more about, my convenience in writing code
                        or the <0.5% of NS4 users? (That's a subjective question to myself, of
                        course; I don't expect an answer here.) Or maybe I'll type the named
                        entities and then do a bulk search & replace to numeric ones before
                        uploading the pages...
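
The bulk search-and-replace idea in the paragraph above could be sketched roughly as follows. This is an assumption-laden illustration rather than a recommended tool: `named_to_numeric` and `KEEP` are hypothetical names, the regular expression only recognises simple `&name;` references, and the lookup table is Python's standard `html.entities.name2codepoint`, which covers the HTML 4 names.

```python
import re
from html.entities import name2codepoint

# Assumption: these four names are left as typed, since they are safe
# everywhere and are needed to escape markup characters.
KEEP = {"amp", "lt", "gt", "quot"}

_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def named_to_numeric(markup):
    """Rewrite named references such as &mdash; as decimal ones (&#8212;)."""
    def repl(match):
        name = match.group(1)
        if name in KEEP or name not in name2codepoint:
            return match.group(0)      # unknown or deliberately kept
        return "&#%d;" % name2codepoint[name]
    return _NAMED_REF.sub(repl, markup)

print(named_to_numeric("Wait&mdash;what? Fish &amp; chips, &copy; 2004"))
```

Authoring would then happen with the memorable names, and the conversion would run once over each page before upload.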

                        Alan Flavell wrote:
                        > It's best, of course, if your HTML authoring software takes
                        > care of the details for you, according to some options which
                        > you can set.

                        My "HTML authoring software" is a simple text editor; I don't care for
                        the so-called WYSIWYG editors so I have to make decisions like this for
                        myself.

                        > utf-8 encoding is widely supported and a compact representation; its
                        > problem is more the possibility of mishandling in the hands of
                        > authors who are not yet familiar with it.

                        How would I, for example, type an emdash in utf-8 code? (I'm pretty
                        sure I just asked something totally clueless, like "which hand does a
                        cow use to play the accordion?" Oh, well... in for a dime, in for a
                        dollar, as they say...)

                        By the way, I occasionally see garbage characters even on the big news
                        sites -- where it looks like they meant to insert some kind of
                        punctuation mark but instead I see something that looks like a Chinese
                        character. I'm pretty sure they're not seeing that on their end, or
                        they would have fixed it; and I've searched through my preference
                        settings (in Windows Explorer 6) but couldn't find anything that seemed
                        relevant in terms of character encodings. Any guess as to why I'm
                        seeing scattered Chinese characters (it happens fairly rarely,
                        actually) and the site coders (presumably) aren't?


                        • Matt

                          #42
                          Re: Named vs. numerical entities

                          Jonas Smithson wrote:

                          > My thanks to all the respondents. I've been sitting here reading this
                          > thread with my jaw dropped open -- people not only discussing the
                          > arcane nuances of encoding methods, but flaming each other over it!

                          I prefer the term "heated discussion" :)

                          > This thread was so far over my head that (for my purposes) it might as
                          > well have been written in ancient Greek (say, is that a possible
                          > encoding method?).

                          Use a Greek encoding or Unicode :). AIUI, and I never took Ancient Greek
                          very far at school, it uses letters all found in modern Greek.

                          >> utf-8 encoding is widely supported and a compact representation; its
                          >> problem is more the possibility of mishandling in the hands of
                          >> authors who are not yet familiar with it.
                          >
                          > How would I, for example, type an emdash in utf-8 code? (I'm pretty
                          > sure I just asked something totally clueless, like "which hand does a
                          > cow use to play the accordion?" Oh, well... in for a dime, in for a
                          > dollar, as they say...)

                          Set your text editor to UTF-8 encoding, and input the character. You can
                          copy/paste it from anywhere (e.g. character map, a handy web page) or use
                          your keyboard -- I edited my keyboard layout to give me lots of useful
                          symbols. For instance, ndash – and mdash — are AltGr+hyphen and
                          Shift+AltGr+hyphen now.[1]

                          > By the way, I occasionally see garbage characters even on the big news
                          > sites -- where it looks like they meant to insert some kind of
                          > punctuation mark but instead I see something that looks like a Chinese
                          > character. I'm pretty sure they're not seeing that on their end, or
                          > they would have fixed it; and I've searched through my preference
                          > settings (in Windows Explorer 6) but couldn't find anything that seemed
                          > relevant in terms of character encodings. Any guess as to why I'm
                          > seeing scattered Chinese characters (it happens fairly rarely,
                          > actually) and the site coders (presumably) aren't?

                          Someone's character encoding is not set correctly. If you've set the
                          encoding selection in IE (View, Encoding) to auto-select, maybe theirs is
                          set wrongly.
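
A mismatch of this kind is easy to reproduce. The sketch below shows two possible mechanisms, both assumptions chosen for illustration rather than a diagnosis of any particular news site: UTF-8 bytes read as windows-1252, and a single windows-1252 byte swallowed by a double-byte CJK decoder (GBK is picked arbitrarily as the example).

```python
dash = "\u2014"                        # EM DASH

# Case 1: the page is sent as UTF-8 but read as windows-1252.
sent = dash.encode("utf-8")            # bytes e2 80 94
as_cp1252 = sent.decode("cp1252")      # three junk characters instead of one dash

# Case 2: the page stores the dash as the single windows-1252 byte 0x97,
# and the reader's browser guesses a double-byte CJK encoding. The byte
# pairs up with the letter after it and one CJK ideograph appears.
legacy = (dash + "a").encode("cp1252") # bytes 97 61
as_gbk = legacy.decode("gbk")

print(repr(as_cp1252), repr(as_gbk))
```

In both cases the bytes on the wire are fine for the author's own setup; only the reader's assumed encoding differs, which would explain why the site's coders never see the problem.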

                          [1] US layouts don't have AltGr, so I'd have to use Ctrl+Shift. You can
                          make a keyboard layout for Windows 2k,XP,2003 with this tool:
                          <http://www.microsoft.com/downloads/details.aspx?FamilyID=fb7b3dcd-d4c1-4943-9c74-d8df57ef19d7&displaylang=en>
                          Much, much faster for typing things like ‘ ’ “ ” · ¼ ½ ¾ ©

                          --
                          Matt




                          • Eric B. Bednarz

                            #43
                            Re: Named vs. numerical entities

                            Jonas Smithson <smithsonNOSPAM@REMOVETHISboardermail.com> writes:

                            > [...] people not only discussing [...] but flaming each other over it!

                            > [...] numerical entities, [...]

                            If you call *character references* 'numerical entities' one more time,
                            you ain't seen nothing yet! ;-)

                            Entity references are an entirely different syntactical construct.
                            You are excused because the WWW is cluttered with disinformation, but
                            before you go to sleep you really gotta write down 100 times:

                            '&#' is _*/NOT/*_ an ERO delimiter

                            Append exclamation marks in amounts you see fit.


                            --
                            | ) 111010111011 | http://bednarz.nl/
                            -(
                            | ) Distribute me: http://binaries.bednarz.nl/mp3/aicha


                            • Alan J. Flavell

                              #44
                              Re: Named vs. numerical entities

                              On Fri, 16 Jul 2004, Jonas Smithson wrote:

                              > My thanks to all the respondents. I've been sitting here reading this
                              > thread with my jaw dropped open -- people not only discussing the
                              > arcane nuances of encoding methods, but flaming each other over it!

                              Welcome to usenet. Gene Spafford had already said it in 1992
                              (google for usenet and "herd of performing elephants").

                              > My "HTML authoring software" is a simple text editor;

                              But /how/ simple? Come back to that in a moment...

                              > I don't care for the so-called WYSIWYG editors

                              I'm right with you there. But it isn't a binary choice between
                              type-every-character-by-hand or point-and-drool-and-never-see-any-HTML.

                              > How would I, for example, type an emdash in utf-8 code?

                              That's a non-sequitur: your keyboard doesn't generate "in" us-ascii or
                              iso-8859-1 or utf-8 code, it generates keyboard codes: it's the job of
                              input methods to turn keypresses into actual stored characters.

                              If your editor is sufficiently unicode-aware, then you can type-in an
                              emdash character (by some combination of keypressings), and when
                              you're done authoring, you can say save-As and tell the dialog to save
                              in utf-8 format. Or you can copy/paste characters from a menu, or use
                              a character picker utility or whatever. The key issue is that the
                              editor can store and work with these characters, and save them to file
                              in an encoding that you like (probably utf-8).
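
To make the character-versus-encoding distinction concrete, here is a minimal Python sketch (not part of the original post). The helper name `encoded_bytes` is made up for illustration. It shows that the emdash is a single character, U+2014, and that the bytes only come into existence when you pick an encoding at save time.

```python
# The em dash is one character (code point U+2014), whatever the encoding.
EM_DASH = "\u2014"


def encoded_bytes(text, encoding):
    """Return the bytes a save in `encoding` would produce, or None if
    the encoding has no way to represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


for enc in ("utf-8", "utf-16-be", "iso-8859-1", "us-ascii"):
    data = encoded_bytes(EM_DASH, enc)
    # Print the stored bytes as hex, or note that the save would fail.
    print(enc, data.hex() if data is not None else "not representable")
```

The same character comes out as three bytes in utf-8 and two in utf-16, and cannot be stored at all in iso-8859-1 or us-ascii, which is why those encodings need an &-notation for it.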

                              Recent versions of even such a "simple" editor as Notepad can do this
                              (in win2k, xp). Older ones can't, so you'd need to look for a
                              unicode-capable editor.

                              You could use the source-view mode of Mozilla Composer, for that
                              matter. A good choice, as it offers an immediate preview and various
                              other conveniences, such as translating &-notation to and from coded
                              characters.

                              > (I'm pretty sure I just asked something totally clueless, like
                              > "which hand does a cow use to play the accordian?" Oh, well... in
                              > for a dime, in for a dollar, as they say...)

                              You recognise the problem, and that's well over half way to a
                              solution. Believe me, it's much harder to explain anything to people
                              who are convinced they already understand 90% of it (just that what
                              they think they understand is wrong!).

                              You could try Alan Wood's overview at
                              Text editors, HTML editors and word processors with Unicode, UTF-8 or multilingual support that run under Microsoft Windows. Part of Alan Wood’s Unicode Resources.

                              although it's a bit of a mix of text editors, word processors and
                              web-page extruders all in the same bucket, so be selective.

                              Or google for unicode editors (and related terms) and see if you care
                              for anything you get.

                              > By the way, I occasionally see garbage characters even on the big news
                              > sites -- where it looks like they meant to insert some kind of
                              > punctuation mark but instead I see something that looks like a Chinese
                              > character. I'm pretty sure they're not seeing that on their end,

                              This can happen if they fail to specify a character encoding, and the
                              browser is set to auto-guess the encoding. Or various related errors.
                              I don't think there's a single right answer to your question. Given a
                              specific instance, it might be possible to deduce what had gone wrong.
                              Sometimes they got a news feed in one encoding, and accidentally
                              incorporated it into a page in a different encoding (news sites are
                              done from content management systems, the pages aren't produced
                              individually by hand).
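
A rough Python sketch of that kind of mismatch (added for illustration, not from the original post): the hypothetical helper `misread` stores text in one encoding and then decodes the bytes under a different, wrongly assumed one. The choice of utf-8 and windows-1252 (`cp1252`) is only an example; which garbage you actually see depends on what the page declares and what the browser guesses.

```python
def misread(text, actual, assumed):
    """Encode `text` as `actual`, then decode the bytes as if they were
    `assumed`. Undecodable bytes become U+FFFD, the replacement character."""
    return text.encode(actual).decode(assumed, errors="replace")


headline = "Markets rally \u2014 again"

# A utf-8 feed pasted into a page that is read as windows-1252:
# the three bytes of the dash turn into three unrelated characters.
print(misread(headline, "utf-8", "cp1252"))

# The opposite mistake: a single windows-1252 byte is not valid utf-8,
# so a decoder substitutes the replacement character for it.
print(misread(headline, "cp1252", "utf-8"))
```

In both cases the author's own system, which uses the right encoding, shows the punctuation correctly, so the problem is invisible on their end.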

                              hope this helps.


                              • Jonas Smithson

                                #45
                                Re: Named vs. numerical entities

                                Alan J. Flavell wrote:

                                > If your editor is sufficiently unicode-aware, then you can type-in an
                                > emdash character (by some combination of keypressings), and when
                                > you're done authoring, you can say save-As and tell the dialog to save
                                > in utf-8 format....

                                The editor (an old version of BBEdit) gives me two options for the
                                document while I'm working on it: "Encode as Unicode" and, if that's
                                enabled, the option to "Swap Bytes". Whether or not I chose those
                                options, when I go to save the document, I have the further options to
                                "Save as Unicode" and, if that's enabled, to "Swap Bytes". (It also
                                gives me a choice of Macintosh, Unix, or DOS line breaks, which I
                                assume wouldn't affect the HTML display.) The "unicode/swap bytes"
                                choices, of course, mean nothing to me, and I've always left them off
                                (the default). I can't find anything in the editor's preferences or
                                dialogs about "utf-8". When they say "Save as Unicode", is it likely
                                they mean the same thing you mean by "save in utf-8 format"?

                                If I were working and saving in unicode, would that mean (for example)
                                that I could type an emdash the way we Mac users do it
                                (command-option-hyphen) and that would actually work in the HTML
                                document on other platforms, without my using any character entity (or
                                character reference or whatever it's called)? (I have a PC too so I
                                guess I could test that.) And would the emdash character then be more
                                "compact" (smaller download) than the character reference (—)
                                I've been using? But...um... didn't I read somewhere that unicode
                                documents are much larger than... the other kind... (what's a
                                'non-Unicode' document called?) and so should only be used if you need
                                support for large character sets like Chinese etc...? Or maybe they
                                were referring to something else... wait, I think it was called
                                "double-byte encoding" or something. Excuse me, my brain is exploding.
                                :)
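
The size question can be checked with a few lines of Python (an illustrative sketch added here, not something from the thread). The helper `size` and the sample string are made up. It assumes that "double-byte encoding" refers to something like utf-16, which is a guess about what the remembered article meant.

```python
EM_DASH = "\u2014"


def size(text, encoding):
    """Number of bytes `text` occupies when stored in `encoding`."""
    return len(text.encode(encoding))


# The literal character versus the two common &-notations for it.
print(size(EM_DASH, "utf-8"))    # the character itself, stored as utf-8
print(size("&#8212;", "utf-8"))  # numeric character reference
print(size("&mdash;", "utf-8"))  # named entity

# A mostly-ASCII page: utf-8 keeps ASCII at one byte per character,
# while utf-16 spends two bytes on every one of these characters.
sample = "<p>Plain English text " + EM_DASH + " mostly ASCII.</p>"
print(len(sample), size(sample, "utf-8"), size(sample, "utf-16-le"))
```

So for English-language HTML the utf-8 file is barely larger than a plain ASCII one, and the doubling only applies to two-byte encodings such as utf-16.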

                                And then, of course, there's the whole other issue that my FTP program
                                automatically converts code to iso-8859-1 charset when you upload it,
                                unless you tell it not to, and when BBEdit talks directly to the FTP
                                server I don't know what it does.

                                And if I did save a text file as unicode, when I opened it later in a
                                text editor (perhaps even a different one), would I be able to tell
                                what it was saved as?

                                (That's a lot of questions, and I'm sure I phrased this all wrong, but
                                maybe you can guess what I mean or what the stuff I've been reading
                                meant?)
