Lang attribute values

**Bertilo Wennergren** · Jul 20 '05, 05:25 PM

Re: Lang attribute values

Safalra:

>> If I write about <span lang="ru">Dostoyevsky</span>,

> I don't mean to sound ignorant, but what's the logic behind using
> language mark-up for proper nouns?

In this case the only need for the markup is the need to indicate the
language of that proper noun. That's why the otherwise meaningless
element "span" has been used. It's just there in order to make it
possible to add the attribute "lang" that conveys the information that
the language in question is Russian.

If at the same time that name would have constituted a citation (to some
work by Dostoyevsky) then the following would have been appropriate:

<cite lang="ru">Dostoyevsky</cite>

> Presumably in an ideal mark-up language, language and script would be
> independent attributes (and that way I'd have some sort of mark-up to
> put around my IPA sections...)?

Indicating the script of a piece of text would be just as meaningful as
the following (using the fictitious attribute "text"):

<span text="book">book</span>

The text is already there as content, so there is of course absolutely
no need to indicate it with an attribute as well.

This

<span script="Latin">book</span>

would be just as stupid. The text string "book" can't be anything else
but Latin script. If it wasn't Latin script, then it wouldn't consist of
the four Latin script characters "b", "o", "o" and "k", would it?

In the same way "Dostoyevsky" (written exactly like that) is written in
Latin script. There is no need (or should be no need) telling the
browser what it already knows.

--
Bertilo Wennergren <bertilow@gmx.net> <http://www.bertilow.com>

**Andreas Prilop** · Jul 20 '05, 05:25 PM

Re: Lang attribute values

On Fri, 23 Jan 2004, Henri Sivonen wrote:

> For example Polish looks ugly if some glyphs come from a "Western" font
> and others come from a "Central European" font.

This is especially true for Macintosh and Unix.
MS Windows users probably never encounter this problem - don't even know
that it exists.

I remind you of
<http://www.unics.uni-hannover.de/nhtcapri/temp/face-arial.gif>

It just comes into my mind that
 ... Mao Zedong ...
may give funny-looking results in Mozilla/Netscape.
So you better use LANG markup only with the original script.

**Bertilo Wennergren** · Jul 20 '05, 05:25 PM

Re: Lang attribute values

Andreas Prilop:

> It just comes into my mind that
> ... Mao Zedong ...
> may give funny-looking results in Mozilla/Netscape.
> So you better use LANG markup only with the original script.

Funny-looking results are the least of your problems if you use such
mark-up.

Windows users (Explorer or Mozilla) might get a prompt to download a
Chinese language pack in order to read that text - although there are no
Chinese characters in it. Some will probably suppose that the computer
has a virus (maybe from your web page).

--
Bertilo Wennergren <bertilow@gmx.net> <http://www.bertilow.com>

**Alan J. Flavell** · Jul 20 '05, 05:25 PM

Re: Lang attribute values

On Fri, 23 Jan 2004, Safalra wrote:

> > If I write about <span lang="ru">Dostoyevsky</span>,
>
> I don't mean to sound ignorant, but what's the logic behind using
> language mark-up for proper nouns?

It's a fair question! Would you care to debate the topic as if
the example had been e.g. glasnost instead?

> Presumably in an ideal mark-up language, language and script would be
> independent attributes

Well, they are defined to be independent in HTML (begging the question
whether HTML is an "ideal" mark-up language ;-)

> (and that way I'd have some sort of mark-up to
> put around my IPA sections...)?

In what sense do you not have? Such a markup would be entirely proper
in HTML.

Any language dependence re-enters only indirectly via Unicode, but as
far as HTML is concerned, writing system (script) and language are
independent properties.
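
As a hedged illustration of that independence: language tags as later standardized in BCP 47 (which postdates this thread) can carry a script subtag such as `Latn`, or the registered variant subtag `fonipa` for IPA transcription, alongside the language. The sketch below assumes a client that understands such tags; the sentence itself is made up for the example.

```html
<p lang="en">
  <!-- Russian name transliterated into Latin script -->
  The novelist <span lang="ru-Latn">Dostoyevsky</span>
  <!-- The same name in its original Cyrillic script -->
  (<span lang="ru">Достоевский</span>) wrote more than one
  <!-- English word given in IPA transcription -->
  book (<span lang="en-fonipa">bʊk</span>).
</p>
```

Nothing here changes which characters are displayed. The attribute only describes the text, which matches the point that script and language are separate properties.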

Some browsers, as we've discussed, use language as a hint for font
selection, but that's an issue of cosmetics, it is NOT allowed to
cause any change in the actual characters displayed: the notorious
 etc. is a bogosity of the first water, as far
as HTML4 is concerned (exceeded only by the corresponding bogosity in
CSS), and I'm glad to see Mozilla resisting misguided demands to "make
it work" (i.e to break it so that it appears to do what the misguided
author intended).

**Alan J. Flavell** · Jul 20 '05, 05:25 PM

Re: Lang attribute values

On Fri, 23 Jan 2004, Andreas Prilop wrote:

> Set the encoding to "charset=UTF-8".
> <http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist.html#s6>
> to suit Netscape 4 and perhaps other older browsers.

Perhaps we should say "old-ish browsers".

There have been browsers which would understand e.g. iso-8859-7 Greek
mixed with Latin-1 entities such as &uuml;, but would not understand
utf-8 - that was true of 16-bit IE3.01 if my memory serves me right.
They would need the approach described in #s5 in order to display such
material correctly.

Then, as you say, there would be NN4.* browsers, which in general
don't understand #s5, but do understand #s6.

Browsers which are even older, might not understand either. Indeed
there's one "browser" in use today that doesn't seem to understand
either: WebTV treats all encodings as a somewhat crippled form of
Windows-1252, if its developer simulation is accurate!

Since none of the affected browsers sends a meaningful Accept-charset,
I would rule out the idea of using content negotiation to choose the
right option. Since I'm fundamentally opposed to negotiating on the
basis of client agent strings, that leaves only a manual selection, if
you really have such challenging content -and- you care about such
elderly browsers.

My recommendation at the present time would be to use utf-8 (as per
#s6 or #s7 whichever is convenient to the author) for such material
(thus covering not only any RFC2070-conforming browser but also the
remaining NN4.* stragglers), and forget the remaining antique browser
versions. They're just too old to lose sleep over, by now.

Not that I would deliberately repel them if the material was
accessible to them; but sometimes the material by its very nature
requires a rich character repertoire, and then I think such an action
is justifiable.
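
A minimal sketch of a page served as utf-8, for readers who want something concrete. It does not reproduce the checklist's #s6 or #s7 sections. It only assumes an HTML 4.01 page whose encoding is declared both in the HTTP header and in a meta element, and the sample text is invented for the illustration.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
  <!-- The HTTP header "Content-Type: text/html; charset=utf-8" takes
       precedence; this meta element is only a fallback. -->
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <title>Mixed Greek and Latin text</title>
</head>
<body>
  <!-- Greek written as literal utf-8 encoded characters -->
  <p lang="el">Καλημέρα</p>
  <!-- The same u-umlaut as a named entity and as a numeric reference -->
  <p lang="de">M&uuml;nchen, M&#252;nchen</p>
</body>
</html>
```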

**Andreas Prilop** · Jul 20 '05, 05:25 PM

Re: Lang attribute values

"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:

> Would you care to debate the topic as if
> the example had been e.g. glasnost instead?

Hmm, let's take vodka, da?
And that makes me ponder whether 'tis nobler in the mind to write
whisky
whiskey
;-)

**Philip Newton** · Jul 20 '05, 05:25 PM

Re: Lang attribute values

On Thu, 22 Jan 2004 19:45:53 +0000, "Alan J. Flavell"
<flavell@ph.gla.ac.uk> wrote:

> If those characters were Arabic, then it would be useful to choose,
> say, a Persian font if it were known that the language is Farsi.

Or, for a possibly better example, to choose a nastaliq font (the kind
that slopes) for Urdu vs a default naskhi (horizontal) font for Arabic.

Cheers,
Philip
--
Philip Newton <nospam.newton@gmx.li>
That really is my address; no need to remove anything to reply.
If you're not part of the solution, you're part of the precipitate.

**Andreas Prilop** · Jul 20 '05, 05:25 PM

Re: Lang attribute values

Philip Newton <pne-news-200401@newton.digitalspace.net> wrote:

> "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:
>
>> If those characters were Arabic, then it would be useful to choose,
>> say, a Persian font if it were known that the language is Farsi.
>
> Or, for a possibly better example, to choose a nastaliq font (the kind
> that slopes) for Urdu vs a default naskhi (horizontal) font for Arabic.

That ain't a better example - it's the same example. Both Persian and
Urdu would prefer a nast'aliq typeface.

**Henri Sivonen** · Jul 20 '05, 05:25 PM

Re: Lang attribute values

In article <Pine.LNX.4.53.0401231508350.18603@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:

> On Fri, 23 Jan 2004, Henri Sivonen wrote:
>
> [addressing Jukka, but I shall offer an answer anyway ;-) ]
> > When you write <span lang="ru">Dostoyevsky</span>, what would you want
> > recipients to do with the language data?
>
> If they are browsers, my answer would be "probably nothing". If they
> are indexers, summarisers etc. then the answer would be different.

What would your answer be in that case?

> > That is, is it actually useful
> > for transliterated text to come with language data in any existing or
> > realistic client implementation [...]
>
> In theory, the /markup/ depends on the structure and attributes of the
> content - it isn't *supposed* to be done with the intention of
> producing a particular result on a particular client agent (that job
> is delegated to stylesheet/s).
>
> So when you are raising issues of this kind, it might be useful if you
> would make clear whether you have in mind the theoretical ideal, or
> rather some particular practical issue related to current browsers and
> other kinds of client agent.

I'm interested in realistic and practical use cases (for which software
support exists or realistically could exist in a useful way).

Having been involved in a couple of metadata-related projects myself,
I've observed that there's a tendency towards developing metadata fields
that seem nice to have but would either require more labor to fill in
than the supposed benefit is worth or require the processing
software to pass the Turing test as a side effect. That's why I like to
call for realistic use cases when metadata is discussed.

> Remark: IBM HPR will use different pronunciations depending on the
> language markup, to take just one example (which is actually
> irrelevant here, since it didn't offer Russian as an option, and I've
> no idea what it would do with Russian-transliterated-into-Roman-
> letters even if it did). But nevertheless, it's an interesting
> what-if question, isn't it?

The question gets even more interesting if the surrounding language
causes the foreign name to look different due to flexion. Does it get so
interesting that we are sliding towards the Turing test?

--
Henri Sivonen
hsivonen@iki.fi

http://iki.fi/hsivonen/

Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html

**Tim** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

Tim <Tim@mail.localhost> wrote:

>> Well, unless you're inventing something new, they're a country code
>> (e.g. en-us for U.S.A. English, en-au for Australian English, etc.).

Neal <neal413@spamrcn.com> wrote:

> Apologies if this has been answered elsewhere, but is there a list of
> these codes anywhere?

Yes.

I don't know it off hand, or I'd mention it. Try searching for "country
codes."

> And how necessary are they?

Generally, they're not (e.g. it doesn't make any difference to
understanding this text whether it's Australian, British, or American
English, though it can help with a spell checker). And the RFC that's
previously been mentioned in this thread goes as far as to comment that
sometimes they may cause more problems.

> My specific application is a website for an orchestra using many foreign
> titles and names. I'm imagining a speech reader will need the language
> code to be able to pronounce the word correctly, but perhaps I am off here
> as well. At any rate, a country subtag appears to be unimportant, as our
> primary market is our US-based audience.

I'd hazard a guess that the speech synthesiser will still get
things wrong. English ones certainly do; although many other languages
do play by the rules a lot better than English does, you're never quite
sure how to pronounce someone's name.
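
For the orchestra case, a hedged sketch of what such markup could look like. A lang value is a language code optionally followed by a country subtag, so "en" alone is valid and "en-US" merely narrows it. The concert listing below is invented for the illustration, and whether any given speech synthesiser acts on the attributes is not something this example can promise.

```html
<!-- The page language, with an optional country subtag -->
<p lang="en-US">
  Tonight's program:
  <!-- Each foreign title carries only a language code; no country needed -->
  Mozart's <cite lang="de">Die Zauberflöte</cite>,
  Debussy's <cite lang="fr">La Mer</cite>, and
  Dvořák's <cite lang="cs">Rusalka</cite>.
</p>
```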

--
My "from" address is totally fake. The reply-to address is real, but
may be only temporary. Reply to usenet postings in the same place as
you read the message you're replying to.

This message was sent without a virus, please delete some files yourself.

**Safalra** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

Bertilo Wennergren <bertilow@gmx.net> wrote in message news:<buriqh$er3$02$1@news.t-online.com>...
> Safalra:
> >> If I write about <span lang="ru">Dostoyevsky</span>,
>
> > I don't mean to sound ignorant, but what's the logic behind using
> > language mark-up for proper nouns?
>
> In this case the only need for the markup is the need to indicate the
> language of that proper noun. That's why the otherwise meaningless
> element "span" has been used. It's just there in order to make it
> possible to add the attribute "lang" that conveys the information that
> the language in question is Russian.

But what if the proper noun had been 'Natasha'? That's a Russian name,
but should I mark it up as such if the Natasha in question is not
Russian?

> > Presumably in an ideal mark-up language, language and script would be
> > independent attributes (and that way I'd have some sort of mark-up to
> > put around my IPA sections...)?
>
> [snip]
> book
> would be just as stupid. The text string "book" can't be anything else
> but Latin script. If it wasn't Latin script, then it wouldn't consist of
> the four Latin script characters "b", "o", "o" and "k", would it?

What if it's IPA? Most Latin characters are present in IPA, but many
(vowels in particular) represent different sounds from what they would
in English, for example. A speech browser would need to know to
pronounce the word using IPA phonemes rather than English. Given some
time, I'm sure I could find an example of an English word that when
written in IPA uses the same characters as another English word. In
that case, script would need to be indicated.

--- Safalra (Stephen Morley) ---

http://www.safalra.com/hypertext

**Bertilo Wennergren** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

Safalra:

> Bertilo Wennergren

>> In this case the only need for the markup is the need to indicate the
>> language of that proper noun. That's why the otherwise meaningless
>> element "span" has been used. It's just there in order to make it
>> possible to add the attribute "lang" that conveys the information that
>> the language in question is Russian.

> But what if the proper noun had been 'Natasha'? That's a Russian name,
> but should I mark it up as such if the Natasha in question is not
> Russian?

You decide what language the text is in. There are difficult cases. You
as the author have to make a decision.

>> book
>> would be just as stupid. The text string "book" can't be anything else
>> but Latin script. If it wasn't Latin script, then it wouldn't consist of
>> the four Latin script characters "b", "o", "o" and "k", would it?

> What if it's IPA? Most Latin characters are present in IPA, but many
> (vowels in particular) represent different sounds from what they would
> in English, for example. A speech browser would need to know to
> pronounce the word using IPA phonemes rather than English. Given some
> time, I'm sure I could find an example of an English word that when
> written in IPA uses the same characters as another English word. In
> that case, script would need to be indicated.

True. There are exceptions.

--
Bertilo Wennergren <bertilow@gmx.net> <http://www.bertilow.com>

**Neal** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

On 24 Jan 2004 03:06:06 -0800, Safalra <usenet@safalra.com> wrote:
> Given some
> time, I'm sure I could find an example of an English word that when
> written in IPA uses the same characters as another English word. In
> that case, script would need to be indicated.

IPA \bit\ is pronounced "beet." \robot\ is "rowboat," though with a
European r. The unadorned IPA vowels are pronounced in a Latin fashion,
unlike common English pronunciation where many such vowels are short.

I recall something from the recommendations saying that authors should in
some cases provide pronunciation help to a speech reader. Apologies for
not remembering the exact context, perhaps someone else recalls it as
well. Has W3C adopted any manner to do this?

**Jukka K. Korpela** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

Andreas Prilop <nhtcapri@rrzn-user.uni-hannover.de> wrote:

> Let's say you've defined Futura as your preferred typeface for West
> European Latin and Verdana for Cyrillic. Then Mozilla will display
> your document with "charset=ISO-8859-1" or "charset=UTF-8" in Futura
> but will display <span lang="ru">Dostoevskij</span> in Verdana.

I just realized that there's similar absurdity in IE, though at a
different level. Maybe it could be described just as documentation
error: If you go to Internet settings and select Fonts, IE lets you
specify the font used for various "character sets". These sets are
named as Latin, Greek, Cyrillic, etc. This seems to make sense, until
you realize that it's the _encoding_ that matters.

That is, if you have e.g. charset=iso-8859-5, IE classifies the whole
page content as "Cyrillic", no matter what characters and what language
it actually contains. Similarly, if I specify a particular font for
"Cyrillic character set" and access a UTF-8 encoded page, IE does _not_
use that font for Cyrillic letters on the page. It seems to treat the
page content as "Latin based".

It's an interesting guessing game. It indirectly affects authoring in
the sense that the choice of an encoding has implications on fonts,
though only on pages that do not set font family (except when the user
overrides such settings), and in a rather unpredictable situation - the
defaults for the font settings in browsers for different "character
sets" presumably vary, and if users change them, they probably do so in
the dark, more or less, since few people know what's going on in those
settings.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

**Jukka K. Korpela** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

Henri Sivonen <hsivonen@iki.fi> wrote:

> Choosing a font is only one problem. There are others including
> line breaking.

Of course the _quality_ of rendering on screen or paper can be affected
by such processes. My point was that browsers have been able to present
documents without knowing the language, and they keep doing so (even
now, when they could in principle get the language information from
some pages, and they always had the option of recognizing language from
actual content - something that Google does with rather good rate of
success, no matter what we think about the idea in principle).

(Line breaking makes my head ache. The Unicode line breaking rules are
very complex and largely absurd, and browsers are now competing in
implementing some of the worst parts in a wrong way. But I digress.)

> When you write <span lang="ru">Dostoyevsky</span>, what would you
> want recipients to do with the language data?

Nothing particular. I'm just giving (meta)information. In a sense, here
I'm intentionally more papal than the pope - I am applying an
unconditional Priority 1 WAI guideline that the WAI itself violates.

And as I wrote, I don't recommend doing that in practice - but not
because the idea would be wrong. It's the Mozilla misbehavior that
makes it currently impractical.

> That is, is it
> actually useful for transliterated text to come with language data
> in any existing or realistic client implementation for any of the
> purposes you list in
> http://www.cs.tut.fi/~jkorpela/kielimerkkaus/1.html ?

(What I list there is basically the reasons given in HTML 4
specification and in WCAG 1.0, with some explanations of mine.)

In any existing implementation, most probably not. As we know, there
are very few existing implementations that make use of lang attributes,
and there are implementations that draw wrong conclusions from them.

In a realistic implementation, why not? Of course they would need to
know or guess the transliteration method, but there's nothing that
prevents them from making educated guesses, except that it means quite
some work. And the metainformation about transliteration could even be
transmitted in an HTTP header. Of course this is hypothetical, but so
is most talk about the utilization of lang attributes.

> Is it there
> just in case the user is curious and invokes "Properties" in
> Mozilla in order to find out that Dostoyevsky is a Russian name?

Well, that's one actual usage of the information. And nothing to be
frowned upon, since when users find the right-click info features,
they will start using them. If you don't use lang markup for a name, it
will naturally report the language according to the lang attribute of
the enclosing element, i.e. give wrong information. In fact, on such
grounds, an extremist (?) could say that if lang markup is used at all,
it should be comprehensive. If you say nothing about language, you are
not giving wrong information. But if you say e.g. <html lang="en">,
then you _are_ claiming that each and every word in the document is in
English, unless stated otherwise in lang attributes for inner elements.
(Quite a job, isn't it? Often you don't even know the language of a
name. I guess we should use lang="und" then.)

> Let's suppose I'm writing a content management system and I choose
> to use UTF-8 for all output - -
> What advice should I provide authors who want to use the system for
> publishing Polish or Chinese text? How should they make their
> suggestions?

You mean for fonts? By using font properties in CSS. As far as I can
see, this would be sufficient for defeating Mozilla's misbehavior.

I don't see how lang attributes would help in practice, though it would
be OK to declare the language as a preparation for the future.
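
A rough sketch of that advice, under the assumption that the page is UTF-8 and the author simply names the fonts in CSS. The font names are placeholders chosen for illustration, not recommendations, and the `:lang()` rule is an optional extra that only helps in browsers implementing that CSS2 selector.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">
<html lang="pl">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <title>Przykład</title>
  <style type="text/css">
    /* One explicit font list for the whole page, so glyphs are drawn
       from the same family wherever that font covers them. */
    body { font-family: "Times New Roman", Times, serif; }
    /* Optional per-language suggestion via the CSS2 :lang() selector. */
    :lang(zh) { font-family: SimSun, serif; }
  </style>
</head>
<body>
  <p>Zażółć gęślą jaźń.</p>
  <p lang="zh">你好</p>
</body>
</html>
```

The lang attributes here are declared for the record. The font choice itself comes from the stylesheet, not from the language markup.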

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html
