Lang attribute values

  • Andreas Prilop

    #16
    Re: Lang attribute values

    "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:

    > If those characters were Arabic, then it would be useful to choose,
    > say, a Persian font if it were known that the language is Farsi.

    It might be a good idea to extend the euro-centric list
    serif, sans-serif, cursive, fantasy
    by
    naskhi, nastaliq, thuluth
    etc.
    <http://images.google.com/images?q=naskhi>
    <http://images.google.com/images?q=nastaliq>
    <http://images.google.com/images?q=thuluth>

    Serif and sans-serif have no meaning with the Arabic script.
    BTW: Have you ever noticed that the Arabic glyphs in Arial and
    Times New Roman are identical?


    • Neal

      #17
      Re: Lang attribute values

      On Thu, 22 Jan 2004 17:38:17 +0000 (UTC), Jukka K. Korpela
      <jkorpela@cs.tut.fi> wrote:

      > Andreas Prilop <nhtcapri@rrzn-user.uni-hannover.de> wrote:
      >
      >> Mozilla/Netscape uses the value of the LANG attribute to determine
      >> the typeface in which the corresponding text is displayed.
      >
      > That's an example of what I meant by _wrong_ use.
      >
      > If I write about <span lang="ru">Dostoyevsky</span>, I don't want the
      > name to appear in a fancy font just because a browser makes foolish
      > guesses.

      Are you guys saying that if I set a transliterated name with a language
      markup, it might change the characters? That'd be so so wrong. Simply
      comparing the number of characters in the Latin-transliterated Tchaikovsky
      to the Russian Cyrillic spelling - that would become gibberish!

      Please tell me I have it wrong.


      • Jukka K. Korpela

        #18
        Re: Lang attribute values

        Neal <neal413@spamrcn.com> wrote:

        > Are you guys saying that if I set a transliterated name with a
        > language markup, it might change the characters?

        No, we are saying that it actually changes the _glyphs_ on some
        browsers. That is, if you have a letter "D", it will appear in some
        visual form, as some glyph, but it may be of a typeface/font different
        from the surrounding text. For example, you might see an ordinary word,
        written in Latin letters, in the midst of normal text written in Latin
        letters, in e.g. Arial font while the text around it is Times New
        Roman. Just because you marked it up as what it is, such as a Russian
        word.

        --
        Yucca, http://www.cs.tut.fi/~jkorpela/
        Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html


        • Jukka K. Korpela

          #19
          Re: Lang attribute values

          Henri Sivonen <hsivonen@iki.fi> wrote:

          > Tim Bray mentions "Things you can't do properly in a
          > language-oblivious way include: Render it on a screen or on paper

          Oh, what have Web browsers been doing then? They surely have problems
          in presenting text _well_, but you seem to be saying that the selection
          of a font is among the worst problems. I disagree.

          >> If I write about <span lang="ru">Dostoyevsky</span>, I don't want
          >> the name to appear in a fancy font just because a browser makes
          >> foolish guesses.
          >
          > In the absence of *script* identification, is Mozilla's behavior
          > really that foolish?

          Yes. It should see immediately that Latin script is used. But in
          addition to this, what's the big idea in selecting fonts according to
          language? It might make sense for some scripts, like CJK, but only in
          cases where the language actually affects the generally preferred
          choice of fonts.

          > How do you suggest the font heuristics should
          > work with UTF-8 (that is, when the dominant script can't be guessed
          > from the encoding)?

          I don't suggest any font heuristics. There's enough confusion in the
          current font settings in browsers, which hopelessly mix up languages,
          countries, scripts, character repertoires, fonts and whatever into a
          dessert for tag soup. _Documenting_ the behavior would be the best
          move. Well, next to making things simple: specify some coherent
          sequence of fonts to be tried in succession when trying to display a
          character, and let the user change it. And naturally the author can
          make his own suggestions. There's no need for a browser to play in that
          game with its guesswork (aka heuristics).

          --
          Yucca, http://www.cs.tut.fi/~jkorpela/
          Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html
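
The "coherent sequence of fonts to be tried in succession" described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any browser implements it: the font names and the code point ranges they cover are hypothetical. The point of the sketch is that the choice depends on the character and the user's ordered list alone, never on language markup.

```python
from typing import List, Optional

# Hypothetical fonts with the code point ranges they are assumed to cover.
FONT_COVERAGE = {
    "SerifLatin": [(0x0020, 0x024F)],    # Basic Latin through Latin Extended-B
    "SansCyrillic": [(0x0400, 0x04FF)],  # Cyrillic block
}


def covers(font: str, char: str) -> bool:
    """Return True if the (hypothetical) font has a glyph for char."""
    cp = ord(char)
    return any(lo <= cp <= hi for lo, hi in FONT_COVERAGE.get(font, []))


def pick_font(char: str, preferred: List[str]) -> Optional[str]:
    """Try the user's fonts in order; return the first one that covers char."""
    for font in preferred:
        if covers(font, char):
            return font
    return None  # nothing in the list can display this character


user_order = ["SerifLatin", "SansCyrillic"]
for ch in ["D", "\u0414", "\u4e2d"]:  # Latin D, Cyrillic De, a CJK ideograph
    print(repr(ch), pick_font(ch, user_order))
```

With such a scheme a Latin-letter word stays in the first font whatever its lang attribute says, and the user changes the result simply by reordering the list.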


          • Andreas Prilop

            #20
            Re: Lang attribute values

            Neal <neal413@spamrcn.com> wrote:

            > Are you guys saying that if I set a transliterated name with a language
            > markup, it might change the characters?

            No.
            Let's say you've defined Futura as your preferred typeface for West
            European Latin and Verdana for Cyrillic. Then Mozilla will display
            your document with "charset=ISO-8859-1" or "charset=UTF-8" in Futura
            but will display <span lang="ru">Dostoevskij</span> in Verdana.

            If you have "charset=ISO-8859-5", everything is displayed in
            Verdana - except of course <span lang="en">Bront&euml;</span>,
            which is in Futura.
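
The behaviour described above can be restated as a toy model in Python. This is a sketch of the description only, not Mozilla's code: the two lookup tables are hypothetical and deliberately tiny, and treating UTF-8 as "western" simply mirrors the example in this post.

```python
from typing import Optional

# The user's preferences from the example: one typeface per language group.
PREFERRED_FONT = {"western": "Futura", "cyrillic": "Verdana"}

# Illustrative, incomplete mappings (assumptions, not a browser's real tables).
CHARSET_GROUP = {
    "iso-8859-1": "western",
    "utf-8": "western",
    "iso-8859-5": "cyrillic",
}
LANG_GROUP = {"en": "western", "fr": "western", "ru": "cyrillic"}


def font_for(charset: str, lang: Optional[str] = None) -> str:
    """A known lang value wins; otherwise the document charset decides."""
    group = LANG_GROUP.get((lang or "").lower()) or CHARSET_GROUP[charset.lower()]
    return PREFERRED_FONT[group]


print(font_for("ISO-8859-1"))        # body text of a Latin-1 document
print(font_for("UTF-8", "ru"))       # <span lang="ru"> inside it
print(font_for("ISO-8859-5"))        # body text of a Cyrillic document
print(font_for("ISO-8859-5", "en"))  # <span lang="en"> inside it
```

Note that the characters themselves never enter the decision, which is why a transliterated name in Latin letters can end up in the "Cyrillic" typeface.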


            • Neal

              #21
              Re: Lang attribute values

              On Thu, 22 Jan 2004 09:41:18 +1030, Tim <Tim@mail.localhost> wrote:

              > Well, unless you're inventing something new, they're a country code
              > (e.g. en-us for U.S.A. English, en-au for Australian English, etc.).

              Apologies if this has been answered elsewhere, but is there a list of
              these codes anywhere? And how necessary are they?

              My specific application is a website for an orchestra using many foreign
              titles and names. I'm imagining a speech reader will need the language
              code to be able to pronounce the word correctly, but perhaps I am off here
              as well. At any rate, a country subtag appears to be unimportant, as our
              primary market is our US-based audience.

              I guess the question distills down to this - what's the proper markup for
              the French title "L'arlessienne" or the Czech name "Dvorák" in an
              otherwise English document?


              • Andreas Prilop

                #22
                Re: Lang attribute values

                Neal <neal413@spamrcn.com> wrote:

                > I guess the question distills down to this - what's the proper markup for
                > the French title "L'arlessienne" or the Czech name "Dvorák" in an
                > otherwise English document?

                <span lang="cs">Dvo&#345;&#225;k</span>
                <span lang="cs">Dvořák</span>
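
The two spellings above are the same text: the first uses decimal numeric character references, the second the characters themselves. A minimal Python sketch of the relationship follows; the helper name to_numeric_refs is made up for this illustration, while html.unescape and unicodedata.name are standard library functions.

```python
import html
import unicodedata


def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)


name = "Dvo\u0159\u00e1k"         # "Dvořák", written with escapes to keep this ASCII
encoded = to_numeric_refs(name)
print(encoded)                     # Dvo&#345;&#225;k
print(html.unescape(encoded) == name)  # True
print(unicodedata.name("\u0159"))  # LATIN SMALL LETTER R WITH CARON
```

The number in a reference is just the character's Unicode code point in decimal, so any character in Unicode can be written this way regardless of the document's encoding.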


                • Neal

                  #23
                  Re: Lang attribute values

                  On Thu, 22 Jan 2004 22:53:07 +0000 (UTC), Jukka K. Korpela
                  <jkorpela@cs.tut.fi> wrote:

                  > Neal <neal413@spamrcn.com> wrote:
                  >
                  >> Are you guys saying that if I set a transliterated name with a
                  >> language markup, it might change the characters?
                  >
                  > No, we are saying that it actually changes the _glyphs_ on some
                  > browsers. That is, if you have a letter "D", it will appear in some
                  > visual form, as some glyph, but it may be of a typeface/font different
                  > from the surrounding text. For example, you might see an ordinary word,
                  > written in Latin letters, in the midst of normal text written in Latin
                  > letters, in e.g. Arial font while the text around it is Times New
                  > Roman. Just because you marked it up as what it is, such as a Russian
                  > word.

                  Ok, just so long as it doesn't make it illegible, I can deal with that!


                  • Neal

                    #24
                    Re: Lang attribute values

                    On Fri, 23 Jan 2004 00:35:25 +0100, Andreas Prilop
                    <nhtcapri@rrzn-user.uni-hannover.de> wrote:

                    > Neal <neal413@spamrcn.com> wrote:
                    >
                    >> I guess the question distills down to this - what's the proper markup
                    >> for the French title "L'arlessienne" or the Czech name "Dvorák" in an
                    >> otherwise English document?
                    >
                    > <span lang="cs">Dvo&#345;&#225;k</span>
                    > <span lang="cs">Dvořák</span>

                    Holy crap, I have been looking for YEARS for &#345...

                    Where do I find a COMPLETE list of such characters that aren't just the
                    typical set?


                    • Jukka K. Korpela

                      #25
                      Re: Lang attribute values

                      Neal <neal413@spamrcn.com> wrote:

                      > Holy crap, I have been looking for YEARS for &#345...

                      You'll have fun in future - there are literally myriads of other
                      character references to be found.

                      > Where do I find a COMPLETE list of such characters that aren't just
                      > the typical set?

                      The Unicode standard, or the equivalent ISO 10646 standard. The tricky
                      part is to find the information you need, and make sure you have
                      understood it correctly, but see


                      --
                      Yucca, http://www.cs.tut.fi/~jkorpela/
                      Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html


                      • Alan Wood

                        #26
                        Re: Lang attribute values

                        Neal <neal413@spamrcn.com> wrote in message news:<opr17grvbodvhyks@news.rcn.com>...

                        > >
                        > >> I guess the question distills down to this - what's the proper markup
                        > >> for the French title "L'arlessienne" or the Czech name "Dvorák" in an
                        > >> otherwise English document?
                        > >
                        > > <span lang="cs">Dvo&#345;&#225;k</span>
                        > > <span lang="cs">Dvořák</span>
                        >
                        > Where do I find a COMPLETE list of such characters that aren't just the
                        > typical set?

                        Try the Unicode site (http://www.unicode.org) or buy their book.

                        Or look for Unicode with a search engine, which will find lots of
                        useful sites, including mine.

                        --
                        Alan Wood
                        http://www.alanwood.net (Unicode, special characters, pesticide names)


                        • Henri Sivonen

                          #27
                          Re: Lang attribute values

                          In article <Xns9479A6AF2F1Ejkorpelacstutfi@193.229.0.31>,
                          "Jukka K. Korpela" <jkorpela@cs.tut.fi> wrote:

                          > Henri Sivonen <hsivonen@iki.fi> wrote:
                          >
                          > > Tim Bray mentions "Things you can't do properly in a
                          > > language-oblivious way include: Render it on a screen or on paper
                          >
                          > Oh, what have Web browsers been doing then? They surely have problems
                          > in presenting text _well_, but you seem to be saying that the selection
                          > of a font is among the worst problems. I disagree.

                          Choosing a font is only one problem. There are others, including
                          line breaking. (And I don't mean just "complex" line breaking for
                          languages such as Thai, but also dynamic hyphenation for European
                          languages.)

                          > >> If I write about <span lang="ru">Dostoyevsky</span>, I don't want
                          > >> the name to appear in a fancy font just because a browser makes
                          > >> foolish guesses.
                          > >
                          > > In the absence of *script* identification, is Mozilla's behavior
                          > > really that foolish?
                          >
                          > Yes. It should see immediately that Latin script is used. But in
                          > addition to this, what's the big idea in selecting fonts according to
                          > language?

                          I can't find a politically correct way of saying this, but there are
                          pecking orders of language groups within scripts in terms of font
                          availability and quality. It's unfortunate.

                          For example Polish looks ugly if some glyphs come from a "Western" font
                          and others come from a "Central European" font.

                          > It might make sense for some scripts, like CJK, but only in
                          > cases where the language actually affects the generally preferred
                          > choice of fonts.

                          Chinese text looks ugly if the ideographs that are also used for Japanese
                          come from a Kanji font while the rest come from a Chinese font.

                          When you write <span lang="ru">Dostoyevsky</span>, what would you want
                          recipients to do with the language data? That is, is it actually useful
                          for transliterated text to come with language data in any existing or
                          realistic client implementation for any of the purposes you list in
                          http://www.cs.tut.fi/~jkorpela/kielimerkkaus/1.html ? Is it there just
                          in case the user is curious and invokes "Properties" in Mozilla in order
                          to find out that Dostoyevsky is a Russian name?

                          > > How do you suggest the font heuristics should
                          > > work with UTF-8 (that is, when the dominant script can't be guessed
                          > > from the encoding)?
                          >
                          > I don't suggest any font heuristics.
                          [...]
                          > And naturally the author can
                          > make his own suggestions. There's no need for a browser to play in that
                          > game with its guesswork (aka heuristics).

                          Let's suppose I'm writing a content management system and I choose to
                          use UTF-8 for all output because
                          1) Prior to serialization the data is in UTF-16 anyway, because
                          I use Java, so producing UTF-8 or UTF-16 is easier than producing
                          something else.
                          2) I want every character that a user might enter in a form to arrive
                          to the server intact (be representable in the encoding used).
                          Therefore, I have to use UTF-*.

                          What advice should I provide authors who want to use the system for
                          publishing Polish or Chinese text? How should they make their
                          suggestions?
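
Point 2 above (every character must be representable in the encoding used) can be checked with a short Python sketch. The helper name can_encode and the sample strings are chosen for this illustration; the codec names are standard Python codecs. It shows why a single-byte charset is enough for some languages but only a UTF encoding handles arbitrary form input.

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text is representable in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Sample input a user might type into a form (escapes keep this source ASCII).
samples = {
    "Czech": "Dvo\u0159\u00e1k",         # Dvořák
    "Polish": "\u0141\u00f3d\u017a",     # Łódź
    "Chinese": "\u4e2d\u6587",           # 中文
}

for label, text in samples.items():
    results = {cs: can_encode(text, cs) for cs in ("iso-8859-1", "iso-8859-2", "utf-8")}
    print(label, results)
```

Only the UTF-8 column is true for every sample, which is the situation where the encoding no longer hints at the dominant script.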

                          --
                          Henri Sivonen
                          hsivonen@iki.fi

                          Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html


                          • Alan J. Flavell

                            #28
                            Re: Lang attribute values

                            On Fri, 23 Jan 2004, Henri Sivonen wrote:

                            [addressing Jukka, but I shall offer an answer anyway ;-) ]

                            > When you write <span lang="ru">Dostoyevsky</span>, what would you want
                            > recipients to do with the language data?

                            If they are browsers, my answer would be "probably nothing". If they
                            are indexers, summarisers etc. then the answer would be different.

                            > That is, is it actually useful
                            > for transliterated text to come with language data in any existing or
                            > realistic client implementation [...]

                            In theory, the /markup/ depends on the structure and attributes of the
                            content - it isn't *supposed* to be done with the intention of
                            producing a particular result on a particular client agent (that job
                            is delegated to stylesheet/s).

                            In theory, of course, theory and practice are the same, but in
                            practice....

                            So when you are raising issues of this kind, it might be useful if you
                            would make clear whether you have in mind the theoretical ideal, or
                            rather some particular practical issue related to current browsers and
                            other kinds of client agent.

                            Remark: IBM HPR will use different pronunciations depending on the
                            language markup, to take just one example (which is actually
                            irrelevant here, since it didn't offer Russian as an option, and I've
                            no idea what it would do with Russian-transliterated-into-Roman-
                            letters even if it did). But nevertheless, it's an interesting
                            what-if question, isn't it?


                            • Safalra

                              #29
                              Re: Lang attribute values

                              "Jukka K. Korpela" <jkorpela@cs.tut.fi> wrote in message news:<Xns9478C780DD053jkorpelacstutfi@193.229.0.31>...

                              > Andreas Prilop <nhtcapri@rrzn-user.uni-hannover.de> wrote:
                              > > Mozilla/Netscape uses the value of the LANG attribute to determine
                              > > the typeface in which the corresponding text is displayed.
                              >
                              > That's an example of what I meant by _wrong_ use.
                              >
                              > If I write about <span lang="ru">Dostoyevsky</span>,

                              I don't mean to sound ignorant, but what's the logic behind using
                              language mark-up for proper nouns?

                              > I don't want the
                              > name to appear in a fancy font just because a browser makes foolish
                              > guesses. That's why I recommend that lang markup be not used for
                              > transliterated texts.

                              Presumably in an ideal mark-up language, language and script would be
                              independent attributes (and that way I'd have some sort of mark-up to
                              put around my IPA sections...)?

                              --- Safalra (Stephen Morley) ---


                              • Andreas Prilop

                                #30
                                Re: Lang attribute values

                                On Thu, 22 Jan 2004, Neal wrote:

                                > Holy crap, I have been looking for YEARS for &#345...
                                > Where do I find a COMPLETE list of such characters that aren't just the
                                > typical set?

                                You probably don't need a complete list. For a start, look at
                                <http://www.unics.uni-hannover.de/nhtcapri/multilingual2.html>

                                Set the encoding to "charset=UTF-8".
                                <http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist.html#s6>
                                to suit Netscape 4 and perhaps other older browsers.
