Lang attribute values

  • Jukka K. Korpela

    #46
    Re: Lang attribute values

    Bertilo Wennergren <bertilow@gmx.net> wrote:

    > In the same way "Dostoyevsky" (written exactly like that) is
    > written in Latin script. There is no need (or should be no need)
    > telling the browser what it already knows.

    It is written in Latin letters, but the word "script" is somewhat
    confusing here. There are many different systems of transliterating
    Russian names, even in one country, and this is a constant source of
    confusion. So the information needed for correct analysis of the word
    would include information about the particular transliteration method.

    --
    Yucca, http://www.cs.tut.fi/~jkorpela/
    Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html


    • Jukka K. Korpela

      #47
      Re: Lang attribute values

      Andreas Prilop <nhtcapri@rrzn-user.uni-hannover.de> wrote:

      > Hmm, let's take <span lang="ru">vodka</span>, da?

      An interesting proposal. :-) In fact, the word "vodka" could be
      regarded as a Russian word, or as a loanword of Russian origin used in
      English or some other language. Thus, the markup above could be
      construed as an author's expression for the intent of reading it as a
      genuinely Russian word, pronounced the Russian way (reading its "d" as
      unvoiced, "t", etc.), as far as possible. Needless to say, it is
      overoptimistic to expect user agents to understand such finer points
      very soon.

      --
      Yucca, http://www.cs.tut.fi/~jkorpela/
      Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html


      • Alan J. Flavell

        #48
        Re: Lang attribute values

        On Sat, 24 Jan 2004, Jukka K. Korpela wrote:

        > I just realized that there's similar absurdity in IE, though at a
        > different level. Maybe it could be described just as documentation
        > error: If you go to Internet settings and select Fonts, IE lets you
        > specify the font used for various "character sets". These sets are
        > named as Latin, Greek, Cyrillic, etc. This seems to make sense, until
        > you realize that it's the _encoding_ that matters.

        It seems you may have observed part of the problem, and I've observed
        a different part of the problem. Could I persuade you to take a look
        at my observations in
        http://ppewww.ph.gla.ac.uk/~flavell/...ers-fonts.html , in
        the part that relates to Win IE, and see how well it fits your own
        observations?

        > That is, if you have e.g. charset=iso-8859-5, IE classifies the whole
        > page content as "Cyrillic", no matter what characters and what language
        > it actually contains.

        The language attribute in HTML also has an influence: some examples
        are shown on my page.

        As I say, it could be that each of us is only seeing part of the
        picture. With hindsight, some of my observations might only be
        accurate in relation to pages that are advertised as utf-8.

        > Similarly, if I specify a particular font for "Cyrillic character
        > set" and access a UTF-8 encoded page, IE does _not_ use that font
        > for Cyrillic letters on the page.

        That depends...

        > It seems to treat the page content as "Latin based".

        That will not happen if you choose a Latin font which contains no
        Cyrillic characters (use the MS font properties extension to view the
        relevant properties of the font).

        As I recall, I can make it use for Cyrillic the font that I configured
        for Greek, if I choose a Latin font which has no Cyrillic.

        > It's an interesting guessing game.

        I've set out my guess on the above page. The writing systems are set
        out in an ordered list, and my guess was that it works its way down
        this list until it finds a font which contains support for the desired
        writing system (even if the chosen font's support is incomplete
        relative to the one which was configured for that writing system!).
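
The guess above can be sketched in Python. This is a minimal illustration of the guessed search order only, not IE's actual algorithm: the function name pick_font, the font names, and the coverage sets are all hypothetical.

```python
# Hypothetical sketch of the guessed fallback: walk an ordered list of
# writing systems and return the first configured font that claims any
# support for the script being rendered. None of this data comes from IE.

def pick_font(needed_script, ordered_systems, configured_font, font_scripts):
    """Return the first configured font, in list order, covering needed_script."""
    for system in ordered_systems:
        font = configured_font.get(system)
        if font is not None and needed_script in font_scripts.get(font, set()):
            return font
    return None  # no configured font supports the script


# Illustrative configuration (all names and coverage are assumptions).
ordered_systems = ["Latin", "Greek", "Cyrillic"]
configured_font = {
    "Latin": "LatinOnlyFont",
    "Greek": "GreekPlusFont",
    "Cyrillic": "CyrillicFont",
}
font_scripts = {
    "LatinOnlyFont": {"Latin"},
    "GreekPlusFont": {"Latin", "Greek", "Cyrillic"},
    "CyrillicFont": {"Latin", "Cyrillic"},
}

print(pick_font("Cyrillic", ordered_systems, configured_font, font_scripts))
```

With a Latin font that lacks Cyrillic, the search falls through to the font configured for Greek, which matches the behaviour recalled above. The sketch says nothing about how complete that font's Cyrillic support is.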

        > It indirectly affects authoring in the sense that the choice of an
        > encoding has implications on fonts, though only on pages that do not
        > set font family (except when the user overrides such settings),

        Well, sort-of. The primary guideline is surely to mark up the
        document accurately, and leave the client agent to do the best job
        that its authors were capable of? But yes, sometimes it's opportune
        for document authors to make some allowances for known browser
        shortcomings.

        However, here the most usual proposal is that authors should offer a
        font, or rather a selection of fonts, that the author found to be
        viable. Unfortunately, in every case where this has been
        investigated, while the suggestion of a font can improve the results
        for some subset of browsers, it can make matters worse, sometimes a
        lot worse, for some other subset of browsers. So much so that in this
        kind of multi-script situation, I would recommend readers who are
        having difficulties with the default settings, to try reconfiguring
        their browser to ignore any author-specified fonts and work with their
        own font defaults for best results.


        • Alan J. Flavell

          #49
          Re: Lang attribute values

          On Sat, 24 Jan 2004, Alan J. Flavell wrote:

          > That will not happen if you choose a Latin font which contains no
          > Cyrillic characters (use the MS font properties extension to view the
          > relevant properties of the font).

          Oh, perhaps an easier way to do this is to visit IE's font defaults
          menu (Tools > Internet Options > General > Fonts). When you try to
          select a particular language script (i.e. writing system), IE will
          present a menu of the available fonts for that language script. By a
          process of elimination, the fonts which are not included in that list
          do not support the script in question.

          And immediately we see the trap! When I carried out my tests in
          Win/NT4, the Book Antiqua font provided there supported neither Greek
          nor Cyrillic. But now that I repeat the test in Win2K, well, you
          guessed it: this font, with the same name, also supports Greek and
          Cyrillic. Ho hum.


          • Alan J. Flavell

            #50
            Re: Lang attribute values

            On Sat, 24 Jan 2004, Alan J. Flavell wrote:

            [Jukka wrote:]
            > > That is, if you have e.g. charset=iso-8859-5, IE classifies the whole
            > > page content as "Cyrillic", no matter what characters and what language
            > > it actually contains.
            >
            > The language attribute in HTML also has an influence: some examples
            > are shown on my page.

            Please accept my apologies on this particular point. I now realise I
            was misremembering _that_ specific behaviour: it was in fact seen in
            Mozilla, not MSIE.


            • Bertilo Wennergren

              #51
              Re: Lang attribute values

              Jukka K. Korpela:

              > Bertilo Wennergren <bertilow@gmx.net> wrote:
              >
              >> In the same way "Dostoyevsky" (written exactly like that) is
              >> written in Latin script. There is no need (or should be no need)
              >> telling the browser what it already knows.
              >
              > It is written in Latin letters, but the word "script" is somewhat
              > confusing here. There are many different systems of transliterating
              > Russian names, even in one country, and this is a constant source of
              > confusion. So the information needed for correct analysis of the word
              > would include information about the particular transliteration method.

              Indeed "script" is a vague term, but I don't think we should mix it with
              "transcription system". There are several systems of Latin transcription
              of Japanese. They all use Latin script.

              But if there were a script attribute, its value could of course consist
              of things like "la" (Latin), "la-hep" (Latin script, Hepburn
              transcription of Japanese), and also "ipa", "ipa-wide", "ipa-narrow"
              etc. Or there could be another attribute for transcription systems.

              That would all probably be a bit too much for HTML though.

              --
              Bertilo Wennergren <bertilow@gmx.net> <http://www.bertilow.com>


              • Jukka K. Korpela

                #52
                Re: Lang attribute values

                Bertilo Wennergren <bertilow@gmx.net> wrote:

                > Indeed "script" is a vague term, but I don't think we should mix it
                > with "transcription system".

                My point was that "script" in the vague sense has really no relevance
                to markup whereas writing system has. When Russian is written in Latin
                letters (using transliteration, basically, and not transcription), it
                is a system of writing Russian. It can be viewed as a composition of
                the normal writing system of Russian and a transliteration method,
                but that's a different aspect.

                > But if there were a script attribute, its value could of course
                > consist
                > of things like "la" (Latin) "la-hep" (Latin script, Hepburn
                > transcription of Japanese), and also "ipa", "ipa-wide",
                > "ipa-narrow" etc.

                No, "la" would not identify a writing system - it would refer to a
                family of character repertoires, more or less, which is at a completely
                different conceptual level. I can understand the idea of using "Latin",
                "Cyrillic" etc., because there are languages that have or have had
                writing systems that basically differ in the use of the base system of
                letters (e.g., Latin, Cyrillic, or Arabic). But that's just one
                possibility, and - as mentioned in this thread - it is relatively
                obvious even without such metainformation whether e.g. some fragment of
                Russian is written in Latin or Cyrillic letters. What is _not_ so
                obvious, in many cases, is the specific writing system (e.g., "old" and
                "new" Russian orthography, or the choice of a particular
                transliteration method).
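
A minimal Python sketch of why the base system of letters is the easy part: it guesses the script of each letter from the first word of its Unicode character name. The helper name scripts_used is hypothetical and the heuristic is only a rough one.

```python
import unicodedata


def scripts_used(text):
    """Return the set of script names guessed from Unicode character names."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # For Latin, Greek and Cyrillic letters the character name begins
            # with the script, e.g. "CYRILLIC SMALL LETTER SHORT I".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


print(scripts_used("Dostoyevsky"))
print(scripts_used("Достоевский"))
```

The result distinguishes Latin from Cyrillic letters, but nothing in the characters themselves reveals which transliteration method or which orthography was used.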

                > That would all probably be a bit too much for HTML though.

                Some of the IANA registered "language subcodes" actually identify
                writing systems. This indicates at least some subjective need for
                specifying the writing system. But it's a wrong approach.

                The situation is somewhat complex, though, since an orthography reform
                is often coupled with some change of language, or could be _viewed_ as
                creating a version of a language. But logically orthography is
                orthogonal to dialect, jargon, and other variation reflected in a
                language subcode.

                Does someone really think that a new version of the German language has
                been or is being created by the orthography reform that was officially
                started in 1998? I don't think so. For adequate use of language
                information, e.g. in spelling checking, orthography is relevant, but it
                should be specified separately.
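
To make the point concrete, here is a small Python sketch that splits a language tag into its primary tag and subtags. The tags "de-1901" and "de-1996" are used as examples of registered tags that identify a German orthography rather than a dialect; the helper name split_language_tag is hypothetical.

```python
def split_language_tag(tag):
    """Split a hyphen-separated language tag into primary tag and subtags."""
    primary, *subtags = tag.lower().split("-")
    return primary, subtags


for tag in ("de", "de-1901", "de-1996"):
    primary, subtags = split_language_tag(tag)
    print(tag, "->", primary, subtags)
```

In such a tag the orthography rides in the same value as the language, so a consumer cannot tell from the syntax alone whether a subtag names a dialect or a spelling convention.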

                --
                Yucca, http://www.cs.tut.fi/~jkorpela/
                Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html


                • Jukka K. Korpela

                  #53
                  Re: Lang attribute values

                  "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:

                  > It seems you may have observed part of the problem, and I've
                  > observed a different part of the problem. Could I persuade you to
                  > take a look at my observations in
                  > http://ppewww.ph.gla.ac.uk/~flavell/...ers-fonts.html ,
                  > in the part that relates to Win IE, and see how well it fits your
                  > own observations?

                  Now that I looked at that page again, I realized that it describes
                  (among other things) the problem I tried to explain. I had read it but
                  probably forgotten it, since it had not really caused me trouble. But
                  now it had.

                  > However, here the most usual proposal is that authors should offer
                  > a font, or rather a selection of fonts, that the author found to be
                  > viable. Unfortunately, in every case where this has been
                  > investigated, while the suggestion of a font can improve the
                  > results for some subset of browsers, it can make matters worse,
                  > sometimes a lot worse, for some other subset of browsers.

                  In situations where the author knows that some font(s) that are
                  relatively commonly installed contain the characters he uses in a
                  document, I think it is reasonable to write a font-family suggestion
                  for body if the font is qualitatively acceptable. I'm naturally
                  referring to situations where a rich character repertoire is used, so
                  that we know that common browsers with common default settings will
                  fail to render all the characters. As a rough rule of thumb, if you use
                  characters that are not present in Times New Roman, consider suggesting
                  body { font-family: "Arial Unicode MS"; }
                  maybe with some other fonts too, if you have checked that each of them
                  has all the characters you're using.

                  The sure gain is that a large number of IE users will be able to read
                  the page without difficulty. The potential loss is that users who
                  actually have a qualitatively better font in their system and a browser
                  configured to use it will need an extra action to override the page
                  settings. I don't like the loss, but I think it's acceptable.

                  But I recently encountered a problem where Arial Unicode MS is not
                  sufficient. Not knowing what to do, I decided to make no font
                  suggestions for the text, since anything I considered would have sure
                  and considerable drawbacks as well. (This is one of the cases where
                  creating a PDF alternative is almost a must.)

                  It's unfortunate that Code2000 is qualitatively so awful. I could
                  accept it as the fallback font to be used for those characters that are
                  not present in any other font, but copy text looks horrendous in
                  Code2000. But using font-family: "Arial Unicode MS", "Code2000" does
                  not work the defined way on IE, and it makes things worse when a
                  browser implements it correctly and has both Code2000 and some better
                  very-large-repertoire font installed.

                  --
                  Yucca, http://www.cs.tut.fi/~jkorpela/
                  Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html


                  • Bertilo Wennergren

                    #54
                    Re: Lang attribute values

                    Jukka K. Korpela:

                    > Bertilo Wennergren <bertilow@gmx.net> wrote:
                    >
                    >> But if there were a script attribute, its value could of course
                    >> consist
                    >> of things like "la" (Latin) "la-hep" (Latin script, Hepburn
                    >> transcription of Japanese), and also "ipa", "ipa-wide",
                    >> "ipa-narrow" etc.
                    >
                    > No, "la" would not identify a writing system - it would refer to a
                    > family of character repertoires, more or less, which is at a completely
                    > different conceptual level.

                    I think we're agreeing here.

                    > I can understand the idea of using "Latin",
                    > "Cyrillic" etc., because there are languages that have or have had
                    > writing systems that basically differ in the use of the base system of
                    > letters (e.g., Latin, Cyrillic, or Arabic). But that's just one
                    > possibility, and - as mentioned in this thread - it is relatively
                    > obvious even without such metainformation whether e.g. some fragment of
                    > Russian is written in Latin or Cyrillic letters. What is _not_ so
                    > obvious, in many cases, is the specific writing system (e.g., "old" and
                    > "new" Russian orthography, or the choice of a particular
                    > transliteration method).

                    True.

                    >> That would all probably be a bit too much for HTML though.
                    >
                    > Some of the IANA registered "language subcodes" actually identify
                    > writing systems. This indicates at least some subjective need for
                    > specifying the writing system. But it's a wrong approach.
                    >
                    > The situation is somewhat complex, though, since an orthography reform
                    > is often coupled with some change of language, or could be _viewed_ as
                    > creating a version of a language. But logically orthography is
                    > orthogonal to dialect, jargon, and other variation reflected in a
                    > language subcode.

                    That would seem to mean that a separate attribute "orthography" with a
                    value from a wide range of codes for various writing systems used for
                    various languages, would make sense.

                    > Does someone really think that a new version of the German language has
                    > been or is being created by the orthography reform that was officially
                    > started in 1998? I don't think so. For adequate use of language
                    > information, e.g. in spelling checking, orthography is relevant, but it
                    > should be specified separately.

                    So "<span lang='de' orthography='de-neu'>Schloss</span>" would in
                    principle be OK then? (Supposing that "de-neu" - or whatever - has been
                    officially registered as the code for the new German orthography.)

                    --
                    Bertilo Wennergren <bertilow@gmx.net> <http://www.bertilow.com>


                    • Bertilo Wennergren

                      #55
                      Re: Lang attribute values

                      Jukka K. Korpela:

                      > As a rough rule of thumb, if you use
                      > characters that are not present in Times New Roman, consider suggesting
                      > body { font-family: "Arial Unicode MS"; }
                      > maybe with some other fonts too, if you have checked that each of them
                      > has all the characters you're using.

                      You should be aware that "Arial Unicode MS" can be installed on Linux
                      systems, but that on many such systems it will fail to render any
                      italics. So suggesting that font might disable italics for some users.

                      If italics are used for emphasized text or citations (or something else)
                      that could be a problem on pages where emphasis, citation etc. convey
                      important pieces of information.

                      --
                      Bertilo Wennergren <bertilow@gmx.net> <http://www.bertilow.com>


                      • Alan J. Flavell

                        #56
                        Re: Lang attribute values

                        On Sun, 25 Jan 2004, Jukka K. Korpela wrote:

                        > As a rough rule of thumb, if you use
                        > characters that are not present in Times New Roman, consider suggesting
                        > body { font-family: "Arial Unicode MS"; }
                        > maybe with some other fonts too, if you have checked that each of them
                        > has all the characters you're using.

                        Well, at least if they have Arial Unicode MS, you know that the font
                        has the rich character repertoire. Whereas many font family names
                        denote fonts which come in more than one version, having widely
                        different repertoires - previous discussion has shown numerous
                        examples.

                        It's a dilemma. Arial Unicode MS typeface has only one font, whereas
                        (for example) the Palatino Linotype typeface has also italic, bold and
                        bold italic fonts. Lucida Sans Unicode typeface also has a fairly
                        wide repertoire but only one font. When italic, bold etc. have to be
                        derived from the regular font, the results are suboptimal.

                        > The sure gain is that a large number of IE users will be able to read
                        > the page without difficulty. The potential loss is that users who
                        > actually have a qualitatively better font in their system and a browser
                        > configured to use it will need an extra action to override the page
                        > settings. I don't like the loss, but I think it's acceptable.

                        It's a value judgement call, which could very well come out different
                        for each situation. I really don't have a final view on it.

                        Fortunately, if one uses a central stylesheet then a change of
                        opinion can be easily implemented!

                        > It's unfortunate that Code2000 is qualitatively so awful.

                        It's a reasonable choice when repertoire is the overwhelming
                        consideration, and cosmetics can take a back place.

                        (Then there's the problem of monospace.)

                        cheers


                        • Mad Bad Rabbit

                          #57
                          Re: Lang attribute values

                          "Jukka K. Korpela" <jkorpela@cs.tut.fi> wrote:

                          > In situations where the author knows that some font(s) that are
                          > relatively commonly installed contain the characters he uses in a
                          > document, I think it is reasonable to write a font-family suggestion
                          > [...] As a rough rule of thumb, if you use characters that are not
                          > present in Times New Roman, consider suggesting
                          >
                          > body { font-family: "Arial Unicode MS"; }

                          Wouldn't it be safer to leave <body> alone, and only suggest
                          an alternate font-family for parts of the document known to
                          contain the problematic characters?

                          For example, if I'm composing a Bible-study page that has a
                          few scattered Greek words, oughtn't it just use:

                          span.polytonic { font-family: "Palatino Linotype" }


                          >;K


                          • Philip Newton

                            #58
                            Re: Lang attribute values

                            On Thu, 22 Jan 2004 22:04:46 +0100, Andreas Prilop
                            <nhtcapri@rrzn-user.uni-hannover.de> wrote:

                            > It might be a good idea to extend the euro-centric list
                            > serif, sans-serif, cursive, fantasy
                            > by
                            > naskhi, nastaliq, thuluth
                            > etc.

                            Sounds reasonable to me. Is "thuluth" what is sometimes called "sülüs"?

                            Cheers,
                            Philip
                            --
                            Philip Newton <nospam.newton@gmx.li>
                            That really is my address; no need to remove anything to reply.
                            If you're not part of the solution, you're part of the precipitate.


                            • Philip Newton

                              #59
                              Re: Lang attribute values

                              On Fri, 23 Jan 2004 21:17:41 +0100, Andreas Prilop
                              <nhtcapri@rrzn-user.uni-hannover.de> wrote:

                              > Philip Newton <pne-news-200401@newton.digitalspace.net> wrote:
                              >
                              > > "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:
                              > >
                              > >> If those characters were Arabic, then it would be useful to choose,
                              > >> say, a Persian font if it were known that the language is Farsi.
                              > >
                              > > Or, for a possibly better example, to choose a nastaliq font (the kind
                              > > that slopes) for Urdu vs a default naskhi (horizontal) font for Arabic.
                              >
                              > That ain't a better example - it's the same example. Both Persian and
                              > Urdu would prefer a nast'aliq typeface.

                              Ah, I did not know that Persian also preferred nastaliq. Thanks.

                              Cheers,
                              Philip
                              --
                              Philip Newton <nospam.newton@gmx.li>
                              That really is my address; no need to remove anything to reply.
                              If you're not part of the solution, you're part of the precipitate.


                              • Philip Newton

                                #60
                                Re: Lang attribute values

                                On Fri, 23 Jan 2004 16:32:02 +0200, Henri Sivonen <hsivonen@iki.fi>
                                wrote:

                                > In article <Xns9479A6AF2F1Ejkorpelacstutfi@193.229.0.31>, "Jukka K.
                                > Korpela" <jkorpela@cs.tut.fi> wrote:
                                >
                                > > Yes. It should see immediately that Latin script is used. But in
                                > > addition to this, what's the big idea in selecting fonts according
                                > > to language?
                                >
                                > I can't find a politically correct way of saying this, but there
                                > are pecking orders of language groups within scripts in terms of
                                > font availability and quality. It's unfortunate.
                                >
                                > For example Polish looks ugly if some glyphs come from a "Western"
                                > font and others come from a "Central European" font.

                                Mmm. Or if you want to have d-with-caron; you often can't use U+010F
                                LATIN SMALL LETTER D WITH CARON since this will typically have a glyph
                                with apostrophe after rather than caron above due to Czech and Slovak
                                typesetting habits (if I interpret the comment in the Unicode standard
                                correctly). But what if I'm not typesetting Czech or Slovak, but a
                                language which uses d-with-caron? (This is a real example, though the
                                language in question is not a natlang.)
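
For reference, a short Python sketch that inspects the character in question with the standard unicodedata module. It shows only the character's name and canonical decomposition; which glyph shape a given font draws for it is not something the character data can express, so this does not solve the problem described above.

```python
import unicodedata

d_caron = "\u010f"

# The formal character name, regardless of how a font chooses to draw it.
print(unicodedata.name(d_caron))

# Canonical decomposition: base letter "d" followed by a combining caron.
decomposed = unicodedata.normalize("NFD", d_caron)
print([f"U+{ord(c):04X}" for c in decomposed])
```

Because the precomposed and decomposed forms are canonically equivalent, a renderer is free to draw both the same way, apostrophe-like caron included.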

                                Cheers,
                                Philip
                                --
                                Philip Newton <nospam.newton@gmx.li>
                                That really is my address; no need to remove anything to reply.
                                If you're not part of the solution, you're part of the precipitate.
