Lang attribute values

  • Bertilo Wennergren

    #31
    Re: Lang attribute values

    Safalra:
    >> If I write about <span lang="ru">Dostoyevsky</span>,

    > I don't mean to sound ignorant, but what's the logic behind using
    > language mark-up for proper nouns?

    In this case the only need for the markup is the need to indicate the
    language of that proper noun. That's why the otherwise meaningless
    element "span" has been used. It's just there in order to make it
    possible to add the attribute "lang" that conveys the information that
    the language in question is Russian.

    If that name had at the same time constituted a citation (of some
    work by Dostoyevsky), then the following would have been appropriate:

    <cite lang="ru">Dostoyevsky</cite>

    > Presumably in an ideal mark-up language, language and script would be
    > independent attributes (and that way I'd have some sort of mark-up to
    > put around my IPA sections...)?

    Indicating the script of a piece of text would be just as meaningful as
    the following (using the fictitious attribute "text"):

    <span text="book">book</span>

    The text is already there as content, so there is of course absolutely
    no need to indicate it with an attribute as well.

    This

    <span script="latin">book</span>

    would be just as stupid. The text string "book" can't be anything else
    but Latin script. If it wasn't Latin script, then it wouldn't consist of
    the four Latin script characters "b", "o", "o" and "k", would it?

    In the same way "Dostoyevsky" (written exactly like that) is written in
    Latin script. There is no need (or should be no need) to tell the
    browser what it already knows.

    --
    Bertilo Wennergren <bertilow@gmx.net> <http://www.bertilow.com>
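
A minimal sketch of the point made in this post, that the script of a string is already implied by its characters. It is written in Python, which the thread itself does not use. The helper name `scripts_used` is made up for illustration, and taking the first word of each Unicode character name is only a rough heuristic, not a proper script-property lookup.

```python
import unicodedata


def scripts_used(text):
    """Return a rough script label for every letter in text.

    Assumption: the first word of the Unicode character name
    (LATIN, CYRILLIC, GREEK, ...) is a good enough stand-in for the script.
    """
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}


# The transliterated name and the original spelling differ in script,
# and that difference is visible from the characters alone.
print(scripts_used("Dostoyevsky"))
print(scripts_used("Достоевский"))
```

The check says nothing about language: it reports the same label for English text, transliterated Russian, and IPA written with plain Latin letters.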


    • Andreas Prilop

      #32
      Re: Lang attribute values

      On Fri, 23 Jan 2004, Henri Sivonen wrote:
      > For example Polish looks ugly if some glyphs come from a "Western" font
      > and others come from a "Central European" font.

      This is especially true for Macintosh and Unix.
      MS Windows users probably never encounter this problem - don't even know
      that it exists.

      I remind you of
      <http://www.unics.uni-hannover.de/nhtcapri/temp/face-arial.gif>


      It just comes into my mind that
      <p lang="en"> ... <span lang="zh">Mao Zedong</span> ...
      may give funny-looking results in Mozilla/Netscape.
      So you better use LANG markup only with the original script.


      • Bertilo Wennergren

        #33
        Re: Lang attribute values

        Andreas Prilop:
        > It just comes into my mind that
        > <p lang="en"> ... <span lang="zh">Mao Zedong</span> ...
        > may give funny-looking results in Mozilla/Netscape.
        > So you better use LANG markup only with the original script.

        Funny-looking results are the least of your problems if you use such
        mark-up.

        Windows users (Explorer or Mozilla) might get a prompt to download a
        Chinese language pack in order to read that text - although there are no
        Chinese characters in it. Some will probably suppose that the computer
        has a virus (maybe from your web page).

        --
        Bertilo Wennergren <bertilow@gmx.net> <http://www.bertilow.com>

        • Alan J. Flavell

          #34
          Re: Lang attribute values

          On Fri, 23 Jan 2004, Safalra wrote:
          > > If I write about <span lang="ru">Dostoyevsky</span>,
          >
          > I don't mean to sound ignorant, but what's the logic behind using
          > language mark-up for proper nouns?

          It's a fair question! Would you care to debate the topic as if
          the example had been e.g. <span lang="ru">glasnost</span> instead?

          > Presumably in an ideal mark-up language, language and script would be
          > independent attributes

          Well, they are defined to be independent in HTML (begging the question
          whether HTML is an "ideal" mark-up language ;-)

          > (and that way I'd have some sort of mark-up to
          > put around my IPA sections...)?

          In what sense do you not have it? Such a markup would be entirely proper
          in HTML.

          Any language dependence re-enters only indirectly via Unicode, but as
          far as HTML is concerned, writing system (script) and language are
          independent properties.

          Some browsers, as we've discussed, use language as a hint for font
          selection, but that's an issue of cosmetics; it is NOT allowed to
          cause any change in the actual characters displayed: the notorious
          <font face="Dingbats"> etc. is a bogosity of the first water, as far
          as HTML4 is concerned (exceeded only by the corresponding bogosity in
          CSS), and I'm glad to see Mozilla resisting misguided demands to "make
          it work" (i.e. to break it so that it appears to do what the misguided
          author intended).


          • Alan J. Flavell

            #35
            Re: Lang attribute values

            On Fri, 23 Jan 2004, Andreas Prilop wrote:
            > Set the encoding to "charset=UTF-8".
            > <http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist.html#s6>
            > to suit Netscape 4 and perhaps other older browsers.

            Perhaps we should say "old-ish browsers".

            There have been browsers which would understand e.g. iso-8859-7 Greek
            mixed with Latin-1 entities such as &uuml;, but would not understand
            utf-8 - that was true of 16-bit IE3.01 if my memory serves me right.
            They would need the approach described in #s5 in order to display such
            material correctly.

            Then, as you say, there would be NN4.* browsers, which in general
            don't understand #s5, but do understand #s6.

            Browsers which are even older, might not understand either. Indeed
            there's one "browser" in use today that doesn't seem to understand
            either: WebTV treats all encodings as a somewhat crippled form of
            Windows-1252, if its developer simulation is accurate!

            Since none of the affected browsers sends a meaningful Accept-charset,
            I would rule out the idea of using content negotiation to choose the
            right option. Since I'm fundamentally opposed to negotiating on the
            basis of client agent strings, that leaves only a manual selection, if
            you really have such challenging content -and- you care about such
            elderly browsers.

            My recommendation at the present time would be to use utf-8 (as per
            #s6 or #s7 whichever is convenient to the author) for such material
            (thus covering not only any RFC2070-conforming browser but also the
            remaining NN4.* stragglers), and forget the remaining antique browser
            versions. They're just too old to lose sleep over, by now.

            Not that I would deliberately repel them if the material was
            accessible to them; but sometimes the material by its very nature
            requires a rich character repertoire, and then I think such an action
            is justifiable.
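
The two approaches contrasted in this post (an 8-bit Greek encoding with character references for anything outside its repertoire, versus plain utf-8) can be sketched as follows. This is an illustration in Python under stated assumptions, not a reconstruction of the checklist page: the sample string and the helper `encode_or_none` are made up, and Python emits numeric references such as `&#252;` where an author might write the named entity `&uuml;`.

```python
# Sample text mixing Greek with a Latin-1 letter (hypothetical content).
text = "καλημέρα, Müller"


def encode_or_none(s, charset):
    """Return s encoded in charset, or None if some character does not fit."""
    try:
        return s.encode(charset)
    except UnicodeEncodeError:
        return None


# iso-8859-7 covers the Greek letters but has no slot for "ü".
greek_only = encode_or_none(text, "iso-8859-7")

# Same encoding, but characters outside the repertoire become
# numeric character references that the browser must resolve.
with_refs = text.encode("iso-8859-7", errors="xmlcharrefreplace")

# utf-8 can carry every character directly.
utf8 = text.encode("utf-8")

print(greek_only)
print(with_refs)
print(utf8.decode("utf-8") == text)
```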


            • Andreas Prilop

              #36
              Re: Lang attribute values

              "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:

              > Would you care to debate the topic as if
              > the example had been e.g <span lang="ru">glasnost</span> instead ?

              Hmm, let's take <span lang="ru">vodka</span>, da?
              And that makes me ponder whether 'tis nobler in the mind to write
              <span lang="en-SC">whisky</span>
              <span lang="en-IE">whiskey</span>
              ;-)


              • Philip Newton

                #37
                Re: Lang attribute values

                On Thu, 22 Jan 2004 19:45:53 +0000, "Alan J. Flavell"
                <flavell@ph.gla.ac.uk> wrote:

                > If those characters were Arabic, then it would be useful to choose,
                > say, a Persian font if it were known that the language is Farsi.

                Or, for a possibly better example, to choose a nastaliq font (the kind
                that slopes) for Urdu vs a default naskhi (horizontal) font for Arabic.

                Cheers,
                Philip
                --
                Philip Newton <nospam.newton@gmx.li>
                That really is my address; no need to remove anything to reply.
                If you're not part of the solution, you're part of the precipitate.

                • Andreas Prilop

                  #38
                  Re: Lang attribute values

                  Philip Newton <pne-news-200401@newton.digitalspace.net> wrote:

                  > "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:
                  >
                  >> If those characters were Arabic, then it would be useful to choose,
                  >> say, a Persian font if it were known that the language is Farsi.
                  >
                  > Or, for a possibly better example, to choose a nastaliq font (the kind
                  > that slopes) for Urdu vs a default naskhi (horizontal) font for Arabic.

                  That ain't a better example - it's the same example. Both Persian and
                  Urdu would prefer a nast'aliq typeface.


                  • Henri Sivonen

                    #39
                    Re: Lang attribute values

                    In article <Pine.LNX.4.53.0401231508350.18603@ppepc56.ph.gla.ac.uk>,
                    "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:

                    > On Fri, 23 Jan 2004, Henri Sivonen wrote:
                    >
                    > [addressing Jukka, but I shall offer an answer anyway ;-) ]
                    > > When you write <span lang="ru">Dostoyevsky</span>, what would you want
                    > > recipients to do with the language data?
                    >
                    > If they are browsers, my answer would be "probably nothing". If they
                    > are indexers, summarisers etc. then the answer would be different.

                    What would your answer be in that case?

                    > > That is, is it actually useful
                    > > for transliterated text to come with language data in any existing or
                    > > realistic client implementation [...]
                    >
                    > In theory, the /markup/ depends on the structure and attributes of the
                    > content - it isn't *supposed* to be done with the intention of
                    > producing a particular result on a particular client agent (that job
                    > is delegated to stylesheet/s).
                    >
                    > So when you are raising issues of this kind, it might be useful if you
                    > would make clear whether you have in mind the theoretical ideal, or
                    > rather some particular practical issue related to current browsers and
                    > other kinds of client agent.

                    I'm interested in realistic and practical use cases (for which software
                    support exists or realistically could exist in a useful way).

                    Having been involved in a couple of metadata-related projects myself,
                    I've observed that there's a tendency towards developing metadata fields
                    that seem nice to have but would either require more labor to fill
                    than the supposed benefit is worth or require the processing
                    software to pass the Turing test as a side effect. That's why I like to
                    call for realistic use cases when metadata is discussed.

                    > Remark: IBM HPR will use different pronunciations depending on the
                    > language markup, to take just one example (which is actually
                    > irrelevant here, since it didn't offer Russian as an option, and I've
                    > no idea what it would do with Russian-transliterated-into-Roman-
                    > letters even if it did). But nevertheless, it's an interesting
                    > what-if question, isn't it?

                    The question gets even more interesting if the surrounding language
                    causes the foreign name to look different due to flexion. Does it get so
                    interesting that we are sliding towards the Turing test?

                    --
                    Henri Sivonen
                    hsivonen@iki.fi

                    Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html


                    • Tim

                      #40
                      Re: Lang attribute values

                      Tim <Tim@mail.localhost> wrote:

                      >> Well, unless you're inventing something new, they're a country code
                      >> (e.g. en-us for U.S.A. English, en-au for Australian English, etc.).

                      Neal <neal413@spamrcn.com> wrote:

                      > Apologies if this has been answered elsewhere, but is there a list of
                      > these codes anywhere?

                      Yes.

                      I don't know it off hand, or I'd mention it. Try searching for "country
                      codes."

                      > And how necessary are they?

                      Generally, they're not (e.g. it doesn't make any difference to
                      understanding this text whether it's Australian, British, or American
                      English, though it can help with a spell checker). And the RFC that's
                      previously been mentioned in this thread goes as far as to comment that
                      sometimes they may cause more problems.

                      > My specific application is a website for an orchestra using many foreign
                      > titles and names. I'm imagining a speech reader will need the language
                      > code to be able to pronounce the word correctly, but perhaps I am off here
                      > as well. At any rate, a country subtag appears to be unimportant, as our
                      > primary market is our US-based audience.

                      I'd make a hazardous guess that the speech synthesiser will still get
                      things wrong. English ones certainly do; although many other languages
                      do play by the rules a lot better than English does, you're never quite
                      sure how to pronounce someone's name.
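
The tag structure discussed in this post (a language code, optionally followed by a country subtag such as "us" or "au") can be sketched as below. This is a deliberately simplified Python illustration, not an implementation of the full language-tag grammar from the RFC; the function names `split_lang_tag` and `pick_voice` and the list of available voices are hypothetical.

```python
def split_lang_tag(tag):
    """Split a tag such as "en-US" into its primary code and remaining subtags.

    Tags are compared case-insensitively, so everything is lowercased.
    """
    primary, *subtags = tag.lower().split("-")
    return primary, subtags


def pick_voice(tag, available):
    """Pick a voice for tag, falling back from "en-au" to plain "en".

    Returns None when neither the full tag nor its primary code is available.
    """
    full = tag.lower()
    if full in available:
        return full
    primary, _ = split_lang_tag(tag)
    return primary if primary in available else None


voices = ["en", "de", "ru"]  # assumed set of installed voices
print(split_lang_tag("en-US"))
print(pick_voice("en-AU", voices))
print(pick_voice("zh", voices))
```

The fallback in `pick_voice` mirrors the observation above that the country subtag rarely matters: a consumer that only knows "en" can still do something sensible with "en-au".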

                      --
                      My "from" address is totally fake. The reply-to address is real, but
                      may be only temporary. Reply to usenet postings in the same place as
                      you read the message you're replying to.

                      This message was sent without a virus, please delete some files yourself.


                      • Safalra

                        #41
                        Re: Lang attribute values

                        Bertilo Wennergren <bertilow@gmx.net> wrote in message news:<buriqh$er3$02$1@news.t-online.com>...
                        > Safalra:
                        > >> If I write about <span lang="ru">Dostoyevsky</span>,
                        >
                        > > I don't mean to sound ignorant, but what's the logic behind using
                        > > language mark-up for proper nouns?
                        >
                        > In this case the only need for the markup is the need to indicate the
                        > language of that proper noun. That's why the otherwise meaningless
                        > element "span" has been used. It's just there in order to make it
                        > possible to add the attribute "lang" that conveys the information that
                        > the language in question is Russian.

                        But what if the proper noun had been 'Natasha'? That's a Russian name,
                        but should I mark it up as such if the Natasha in question is not
                        Russian?

                        > > Presumably in an ideal mark-up language, language and script would be
                        > > independent attributes (and that way I'd have some sort of mark-up to
                        > > put around my IPA sections...)?
                        >
                        > [snip]
                        > <span script="latin">book</span>
                        > would be just as stupid. The text string "book" can't be anything else
                        > but Latin script. If it wasn't Latin script, then it wouldn't consist of
                        > the four Latin script characters "b", "o", "o" and "k", would it?

                        What if it's IPA? Most Latin characters are present in IPA, but many
                        (vowels in particular) represent different sounds from what they would
                        in English, for example. A speech browser would need to know to
                        pronounce the word using IPA phonemes rather than English. Given some
                        time, I'm sure I could find an example of an English word that when
                        written in IPA uses the same characters as another English word. In
                        that case, script would need to be indicated.

                        --- Safalra (Stephen Morley) ---


                        • Bertilo Wennergren

                          #42
                          Re: Lang attribute values

                          Safalra:
                          > Bertilo Wennergren

                          >> In this case the only need for the markup is the need to indicate the
                          >> language of that proper noun. That's why the otherwise meaningless
                          >> element "span" has been used. It's just there in order to make it
                          >> possible to add the attribute "lang" that conveys the information that
                          >> the language in question is Russian.

                          > But what if the proper noun had been 'Natasha'? That's a Russian name,
                          > but should I mark it up as such if the Natasha in question is not
                          > Russian?

                          You decide what language the text is in. There are difficult cases. You
                          as the author have to make a decision.

                          >> <span script="latin">book</span>
                          >> would be just as stupid. The text string "book" can't be anything else
                          >> but Latin script. If it wasn't Latin script, then it wouldn't consist of
                          >> the four Latin script characters "b", "o", "o" and "k", would it?

                          > What if it's IPA? Most Latin characters are present in IPA, but many
                          > (vowels in particular) represent different sounds from what they would
                          > in English, for example. A speech browser would need to know to
                          > pronounce the word using IPA phonemes rather than English. Given some
                          > time, I'm sure I could find an example of an English word that when
                          > written in IPA uses the same characters as another English word. In
                          > that case, script would need to be indicated.

                          True. There are exceptions.

                          --
                          Bertilo Wennergren <bertilow@gmx.net> <http://www.bertilow.com>

                          • Neal

                            #43
                            Re: Lang attribute values

                            On 24 Jan 2004 03:06:06 -0800, Safalra <usenet@safalra.com> wrote:
                            > Given some
                            > time, I'm sure I could find an example of an English word that when
                            > written in IPA uses the same characters as another English word. In
                            > that case, script would need to be indicated.


                            IPA \bit\ is pronounced "beet." \robot\ is "rowboat," though with a
                            European r. The unadorned IPA vowels are pronounced in a Latin fashion,
                            unlike common English pronunciation where many such vowels are short.

                            I recall something from the recommendations saying that authors should in
                            some cases provide pronunciation help to a speech reader. Apologies for
                            not remembering the exact context, perhaps someone else recalls it as
                            well. Has W3C adopted any manner to do this?


                            • Jukka K. Korpela

                              #44
                              Re: Lang attribute values

                              Andreas Prilop <nhtcapri@rrzn-user.uni-hannover.de> wrote:

                              > Let's say you've defined Futura as your preferred typeface for West
                              > European Latin and Verdana for Cyrillic. Then Mozilla will display
                              > your document with "charset=ISO-8859-1" or "charset=UTF-8" in Futura
                              > but will display <span lang="ru">Dostoevskij</span> in Verdana.

                              I just realized that there's similar absurdity in IE, though at a
                              different level. Maybe it could be described just as documentation
                              error: If you go to Internet settings and select Fonts, IE lets you
                              specify the font used for various "character sets". These sets are
                              named as Latin, Greek, Cyrillic, etc. This seems to make sense, until
                              you realize that it's the _encoding_ that matters.

                              That is, if you have e.g. charset=iso-8859-5, IE classifies the whole
                              page content as "Cyrillic", no matter what characters and what language
                              it actually contains. Similarly, if I specify a particular font for
                              "Cyrillic character set" and access a UTF-8 encoded page, IE does _not_
                              use that font for Cyrillic letters on the page. It seems to treat the
                              page content as "Latin based".

                              It's an interesting guessing game. It indirectly affects authoring in
                              the sense that the choice of an encoding has implications on fonts,
                              though only on pages that do not set font family (except when the user
                              overrides such settings), and in a rather unpredictable situation - the
                              defaults for the font settings in browsers for different "character
                              sets" presumably vary, and if users change them, they probably do so in
                              the dark, more or less, since few people know what's going on in those
                              settings.

                              --
                              Yucca, http://www.cs.tut.fi/~jkorpela/
                              Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html


                              • Jukka K. Korpela

                                #45
                                Re: Lang attribute values

                                Henri Sivonen <hsivonen@iki.fi> wrote:

                                > Choosing a font is only one problem. There are others including
                                > line breaking.

                                Of course the _quality_ of rendering on screen or paper can be affected
                                by such processes. My point was that browsers have been able to present
                                documents without knowing the language, and they keep doing so (even
                                now, when they could in principle get the language information from
                                some pages, and they always had the option of recognizing language from
                                actual content - something that Google does with rather good rate of
                                success, no matter what we think about the idea in principle).

                                (Line breaking makes my head ache. The Unicode line breaking rules are
                                very complex and largely absurd, and browsers are now competing in
                                implementing some of the worst parts in a wrong way. But I digress.)

                                > When you write <span lang="ru">Dostoyevsky</span>, what would you
                                > want recipients to do with the language data?

                                Nothing particular. I'm just giving (meta)information. In a sense, here
                                I'm intentionally more papal than the pope - I am applying an
                                unconditional Priority 1 WAI guideline that the WAI itself violates.

                                And as I wrote, I don't recommend doing that in practice - but not
                                because the idea would be wrong. It's the Mozilla misbehavior that
                                makes it currently impractical.

                                > That is, is it
                                > actually useful for transliterated text to come with language data
                                > in any existing or realistic client implementation for any of the
                                > purposes you list in
                                > http://www.cs.tut.fi/~jkorpela/kielimerkkaus/1.html ?

                                (What I list there is basically the reasons given in HTML 4
                                specification and in WCAG 1.0, with some explanations of mine.)

                                In any existing implementation, most probably not. As we know, there
                                are very few existing implementations that make use of lang attributes,
                                and there are implementations that draw wrong conclusions from them.

                                In a realistic implementation, why not? Of course they would need to
                                know or guess the transliteration method, but there's nothing that
                                prevents them from making educated guesses, except that it means quite
                                some work. And the metainformation about transliteration could even be
                                transmitted in an HTTP header. Of course this is hypothetical, but so
                                is most talk about utilizing lang attributes.

                                > Is it there
                                > just in case the user is curious and invokes "Properties" in
                                > Mozilla in order to find out that Dostoyevsky is a Russian name?

                                Well, that's one actual usage of the information. And nothing to be
                                frowned upon, since when users find the right-click info features,
                                they will start using them. If you don't use lang markup for a name, it
                                will naturally report the language according to the lang attribute of
                                the enclosing element, i.e. give wrong information. In fact, on such
                                grounds, an extremist (?) could say that if lang markup is used at all,
                                it should be comprehensive. If you say nothing about language, you are
                                not giving wrong information. But if you say e.g. <html lang="en">,
                                then you _are_ claiming that each and every word in the document is in
                                English, unless stated otherwise in lang attributes for inner elements.
                                (Quite a job, isn't it? Often you don't even know the language of a
                                name. I guess we should use lang="und" then.)

                                > Let's suppose I'm writing a content management system and I choose
                                > to use UTF-8 for all output - -
                                > What advice should I provide authors who want to use the system for
                                > publishing Polish or Chinese text? How should they make their
                                > suggestions?

                                You mean for fonts? By using font properties in CSS. As far as I can
                                see, this would be sufficient for defeating Mozilla's misbehavior.

                                I don't see how lang attributes would help in practice, though it would
                                be OK to declare the language as a preparation for the future.

                                --
                                Yucca, http://www.cs.tut.fi/~jkorpela/
                                Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html
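
The inheritance rule described in this post (text takes the language of the nearest enclosing element that carries a lang attribute) can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: the class `LangTracker`, the function `effective_languages`, and the sample markup are made up, and the sketch assumes well-nested markup without unclosed void elements such as `<br>`.

```python
from html.parser import HTMLParser


class LangTracker(HTMLParser):
    """Record each run of text together with its effective language."""

    def __init__(self):
        super().__init__()
        self.stack = [None]  # effective language of each open element
        self.runs = []       # (language, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        # An element without its own lang inherits from its parent.
        lang = dict(attrs).get("lang") or self.stack[-1]
        self.stack.append(lang)

    def handle_endtag(self, tag):
        # Assumes every start tag has a matching end tag.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.runs.append((self.stack[-1], data.strip()))


def effective_languages(markup):
    tracker = LangTracker()
    tracker.feed(markup)
    tracker.close()
    return tracker.runs


sample = (
    '<html lang="en"><body><p>I write about '
    '<span lang="ru">Dostoyevsky</span> and Natasha.</p></body></html>'
)
for lang, text in effective_languages(sample):
    print(lang, repr(text))
```

With the sample above, "Natasha" is reported as English simply because nothing says otherwise, which is the practical consequence of declaring a language on the root element.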
