Lang attribute values

**Jukka K. Korpela** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

Bertilo Wennergren <bertilow@gmx.net> wrote:

> In the same way "Dostoyevsky" (written exactly like that) is
> written in Latin script. There is no need (or should be no need)
> telling the browser what it already knows.

It is written in Latin letters, but the word "script" is somewhat
confusing here. There are many different systems of transliterating
Russian names, even in one country, and this is a constant source of
confusion. So the information needed for correct analysis of the word
would include information about the particular transliteration method.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

**Jukka K. Korpela** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

Andreas Prilop <nhtcapri@rrzn-user.uni-hannover.de> wrote:

> Hmm, let's take <span lang="ru">vodka</span>, da?

An interesting proposal. :-) In fact, the word "vodka" could be
regarded as a Russian word, or as a loanword of Russian origin used in
English or some other language. Thus, the markup above could be
construed as an author's expression for the intent of reading it as a
genuinely Russian word, pronounced the Russian way (reading its "d" as
unvoiced, "t", etc.), as far as possible. Needless to say, it is
overoptimistic to expect user agents to understand such finer points
very soon.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

**Alan J. Flavell** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

On Sat, 24 Jan 2004, Jukka K. Korpela wrote:

> I just realized that there's similar absurdity in IE, though at a
> different level. Maybe it could be described just as documentation
> error: If you go to Internet settings and select Fonts, IE lets you
> specify the font used for various "character sets". These sets are
> named as Latin, Greek, Cyrillic, etc. This seems to make sense, until
> you realize that it's the _encoding_ that matters.

It seems you may have observed part of the problem, and I've observed
a different part of the problem. Could I persuade you to take a look
at my observations in
http://ppewww.ph.gla.ac.uk/~flavell/...ers-fonts.html , in
the part that relates to Win IE, and see how well it fits your own
observations?

> That is, if you have e.g. charset=iso-8859-5, IE classifies the whole
> page content as "Cyrillic", no matter what characters and what language
> it actually contains.

The language attribute in HTML also has an influence: some examples
are shown on my page.

As I say, it could be that each of us is only seeing part of the
picture. With hindsight, some of my observations might only be
accurate in relation to pages that are advertised as utf-8.

> Similarly, if I specify a particular font for "Cyrillic character
> set" and access a UTF-8 encoded page, IE does _not_ use that font
> for Cyrillic letters on the page.

That depends...

> It seems to treat the page content as "Latin based".

That will not happen if you choose a Latin font which contains no
Cyrillic characters (use the MS font properties extension to view the
relevant properties of the font).

As I recall, I can make it use for Cyrillic the font that I configured
for Greek, if I choose a Latin font which has no Cyrillic.

> It's an interesting guessing game.

I've set out my guess on the above page. The writing systems are set
out in an ordered list, and my guess was that it works its way down
this list until it finds a font which contains support for the desired
writing system (even if the chosen font's support is incomplete
relative to the one which was configured for that writing system!).
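
A minimal sketch of that guess in Python, not a description of what IE actually does. The function name, the order of the writing-system list, and the font names and coverage sets are all hypothetical, chosen only to reproduce the "Greek font used for Cyrillic" effect mentioned above.

```python
from typing import Dict, List, Optional, Set

# Hypothetical data: the real order of IE's list and the real coverage of
# any installed font are NOT known here; these values are illustrative only.
ORDERED_SYSTEMS: List[str] = ["Latin", "Greek", "Cyrillic"]
CONFIGURED: Dict[str, str] = {
    "Latin": "LatinOnlyFont",
    "Greek": "GreekPlusFont",
    "Cyrillic": "CyrillicFont",
}
COVERAGE: Dict[str, Set[str]] = {
    "LatinOnlyFont": {"Latin"},
    "GreekPlusFont": {"Latin", "Greek", "Cyrillic"},
    "CyrillicFont": {"Latin", "Cyrillic"},
}


def pick_font(wanted: str,
              ordered_systems: List[str],
              configured: Dict[str, str],
              coverage: Dict[str, Set[str]]) -> Optional[str]:
    """Return the first configured font, in list order, that covers `wanted`."""
    for system in ordered_systems:
        font = configured.get(system)
        if font is not None and wanted in coverage.get(font, set()):
            return font
    return None  # no configured font claims support for `wanted`


print(pick_font("Cyrillic", ORDERED_SYSTEMS, CONFIGURED, COVERAGE))
```

With this made-up data the call picks GreekPlusFont for Cyrillic text even though CyrillicFont was configured for that writing system, which is the kind of surprise the guess would predict.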

> It indirectly affects authoring in the sense that the choice of an
> encoding has implications on fonts, though only on pages that do not
> set font family (except when the user overrides such settings),

Well, sort-of. The primary guideline is surely to mark up the
document accurately, and leave the client agent to do the best job
that its authors were capable of? But yes, sometimes it's opportune
for document authors to make some allowances for known browser
shortcomings.

However, here the most usual proposal is that authors should offer a
font, or rather a selection of fonts, that the author found to be
viable. Unfortunately, in every case where this has been
investigated, while the suggestion of a font can improve the results
for some subset of browsers, it can make matters worse, sometimes a
lot worse, for some other subset of browsers. So much so that in this
kind of multi-script situation, I would recommend readers who are
having difficulties with the default settings, to try reconfiguring
their browser to ignore any author-specified fonts and work with their
own font defaults for best results.

**Alan J. Flavell** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

On Sat, 24 Jan 2004, Alan J. Flavell wrote:

> That will not happen if you choose a Latin font which contains no
> Cyrillic characters (use the MS font properties extension to view the
> relevant properties of the font).

Oh, perhaps an easier way to do this is to visit IE's font defaults
menu (Tools > Internet Options > General > Fonts). When you select a
particular language script (i.e. writing system), IE presents a menu
of the available fonts for that language script. By a process of
elimination, the fonts which are not included in that list do not
support the script in question.

And immediately we see the trap! When I carried out my tests in
Win/NT4, the Book Antiqua font provided there supported neither Greek
nor Cyrillic. But now that I repeat the test in Win2K, well, you
guessed it: this font, with the same name, also supports Greek and
Cyrillic. Ho hum.

**Alan J. Flavell** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

On Sat, 24 Jan 2004, Alan J. Flavell wrote:

[Jukka wrote:]

> > That is, if you have e.g. charset=iso-8859-5, IE classifies the whole
> > page content as "Cyrillic", no matter what characters and what language
> > it actually contains.
>
> The language attribute in HTML also has an influence: some examples
> are shown on my page.

Please accept my apologies on this particular point. I now realise I
was misremembering _that_ specific behaviour: it was in fact seen in
Mozilla, not MSIE.

**Bertilo Wennergren** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

Jukka K. Korpela:

> Bertilo Wennergren <bertilow@gmx.net> wrote:

>> In the same way "Dostoyevsky" (written exactly like that) is
>> written in Latin script. There is no need (or should be no need)
>> telling the browser what it already knows.

> It is written in Latin letters, but the word "script" is somewhat
> confusing here. There are many different systems of transliterating
> Russian names, even in one country, and this is a constant source of
> confusion. So the information needed for correct analysis of the word
> would include information about the particular transliteration method.

Indeed "script" is a vague term, but I don't think we should mix it with
"transcription system". There are several systems of Latin transcription
of Japanese. They all use Latin script.

But if there were a script attribute, its value could of course consist
of things like "la" (Latin), "la-hep" (Latin script, Hepburn
transcription of Japanese), and also "ipa", "ipa-wide", "ipa-narrow"
etc. Or there could be another attribute for transcription systems.

That would all probably be a bit too much for HTML though.

--
Bertilo Wennergren <bertilow@gmx.net> <http://www.bertilow.com>

**Jukka K. Korpela** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

Bertilo Wennergren <bertilow@gmx.net> wrote:

> Indeed "script" is a vague term, but I don't think we should mix it
> with "transcription system".

My point was that "script" in the vague sense has really no relevance
to markup whereas writing system has. When Russian is written in Latin
letters (using transliteration, basically, and not transcription), it
is a system of writing Russian. It can be viewed as consisting of a
composition of the normal writing system of Russian and a
transliteration method, but that's a different aspect.

> But if there were a script attribute, its value could of course
> consist
> of things like "la" (Latin) "la-hep" (Latin script, Hepburn
> transcription of Japanese), and also "ipa", "ipa-wide",
> "ipa-narrow" etc.

No, "la" would not identify a writing system - it would refer to a
family of character repertoires, more or less, which is at a completely
different conceptual level. I can understand the idea of using "Latin",
"Cyrillic" etc., because there are languages that have or have had
writing systems that basically differ in the use of the base system of
letters (e.g., Latin, Cyrillic, or Arabic). But that's just one
possibility, and - as mentioned in this thread - it is relatively
obvious even without such metainformation whether e.g. some fragment of
Russian is written in Latin or Cyrillic letters. What is _not_ so
obvious, in many cases, is the specific writing system (e.g., "old" and
"new" Russian orthography, or the choice of a particular
transliteration method).
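
To illustrate why the base system of letters is the easy part, here is a rough Python sketch that tallies the letters of a string by the first word of their Unicode character name (LATIN, CYRILLIC, and so on). That first-word shortcut is only a heuristic, not a proper script lookup, and the function name is made up for this example.

```python
import unicodedata
from collections import Counter


def letter_scripts(text: str) -> Counter:
    """Count letters by the first word of their Unicode character name.

    This is a heuristic stand-in for a real script property lookup.
    """
    counts: Counter = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            counts[name.split(" ")[0] if name else "UNKNOWN"] += 1
    return counts


print(letter_scripts("Dostoyevsky"))
print(letter_scripts("Достоевский"))
```

The tally separates Latin from Cyrillic letters, but nothing in it says which transliteration method or which orthography produced the text, and that is the information that is not obvious.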

> That would all probably be a bit too much for HTML though.

Some of the IANA registered "language subcodes" actually identify
writing systems. This indicates at least some subjective need for
specifying the writing system. But it's a wrong approach.

The situation is somewhat complex, though, since an orthography reform
is often coupled with some change of language, or could be _viewed_ as
creating a version of a language. But logically orthography is
orthogonal to dialect, jargon, and other variation reflected in a
language subcode.

Does someone really think that a new version of the German language has
been or is being created by the orthography reform that was officially
started in 1998? I don't think so. For adequate use of language
information, e.g. in spelling checking, orthography is relevant, but it
should be specified separately.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

**Jukka K. Korpela** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:

> It seems you may have observed part of the problem, and I've
> observed a different part of the problem. Could I persuade you to
> take a look at my observations in
> http://ppewww.ph.gla.ac.uk/~flavell/...ers-fonts.html ,
> in the part that relates to Win IE, and see how well it fits your
> own observations?

Now that I looked at that page again, I realized that it describes
(among other things) the problem I tried to explain. I had read it but
probably forgotten it, since it had not really caused me trouble. But
now it had.

> However, here the most usual proposal is that authors should offer
> a font, or rather a selection of fonts, that the author found to be
> viable. Unfortunately, in every case where this has been
> investigated, while the suggestion of a font can improve the
> results for some subset of browsers, it can make matters worse,
> sometimes a lot worse, for some other subset of browsers.

In situations where the author knows that some font(s) that are
relatively commonly installed contain the characters he uses in a
document, I think it is reasonable to write a font-family suggestion
for body if the font is qualitatively acceptable. I'm naturally
referring to situations where a rich character repertoire is used, so
that we know that common browsers with common default settings will
fail to render all the characters. As a rough rule of thumb, if you use
characters that are not present in Times New Roman, consider suggesting
body { font-family: "Arial Unicode MS"; }
maybe with some other fonts too, if you have checked that each of them
has all the characters you're using.
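
One way to do that check, sketched in Python under the assumption that you already have each font's character repertoire as a set of characters (how you obtain it, for instance from a font inspection tool, is outside this sketch). The function names and the two tiny repertoires are hypothetical.

```python
from typing import Dict, List, Set


def missing_characters(text: str, repertoire: Set[str]) -> Set[str]:
    """Characters used in `text` (ignoring whitespace) that the font lacks."""
    return {ch for ch in text if not ch.isspace() and ch not in repertoire}


def usable_fonts(text: str, fonts: Dict[str, Set[str]]) -> List[str]:
    """Names of the fonts whose repertoire covers every character in `text`."""
    return [name for name, repertoire in fonts.items()
            if not missing_characters(text, repertoire)]


# Made-up repertoires for illustration; real fonts contain far more.
FONTS: Dict[str, Set[str]] = {
    "SmallFont": set("abcdefghijklmnopqrstuvwxyz"),
    "BigFont": set("abcdefghijklmnopqrstuvwxyz") | set("αβγδ"),
}

sample = "a b γ"
print(sorted(missing_characters(sample, FONTS["SmallFont"])))
print(usable_fonts(sample, FONTS))
```

Only the fonts that come back from usable_fonts would be candidates for the font-family suggestion.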

The sure gain is that a large number of IE users will be able to read
the page without difficulty. The potential loss is that users who
actually have a qualitatively better font in their system and a browser
configured to use it will need an extra action to override the page
settings. I don't like the loss, but I think it's acceptable.

But I recently encountered a problem where Arial Unicode MS is not
sufficient. Not knowing what to do, I decided to make no font
suggestions for the text, since anything I considered would have sure
and considerable drawbacks as well. (This is one of the cases where
creating a PDF alternative is almost a must.)

It's unfortunate that Code2000 is qualitatively so awful. I could
accept it as the fallback font to be used for those characters that are
not present in any other font, but copy text looks horrendous in
Code2000. But using font-family: "Arial Unicode MS", "Code2000" does
not work the defined way on IE, and it makes things worse when a
browser implements it correctly and has both Code2000 and some better
very-large-repertoire font installed.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

**Bertilo Wennergren** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

Jukka K. Korpela:

> Bertilo Wennergren <bertilow@gmx.net> wrote:

>> But if there were a script attribute, its value could of course
>> consist
>> of things like "la" (Latin) "la-hep" (Latin script, Hepburn
>> transcription of Japanese), and also "ipa", "ipa-wide",
>> "ipa-narrow" etc.

> No, "la" would not identify a writing system - it would refer to a
> family of character repertoires, more or less, which is at a completely
> different conceptual level.

I think we're agreeing here.

> I can understand the idea of using "Latin",
> "Cyrillic" etc., because there are languages that have or have had
> writing systems that basically differ in the use of the base system of
> letters (e.g., Latin, Cyrillic, or Arabic). But that's just one
> possibility, and - as mentioned in this thread - it is relatively
> obvious even without such metainformation whether e.g. some fragment of
> Russian is written in Latin or Cyrillic letters. What is _not_ so
> obvious, in many cases, is the specific writing system (e.g., "old" and
> "new" Russian orthography, or the choice of a particular
> transliteration method).

True.

>> That would all probably be a bit too much for HTML though.

> Some of the IANA registered "language subcodes" actually identify
> writing systems. This indicates at least some subjective need for
> specifying the writing system. But it's a wrong approach.

> The situation is somewhat complex, though, since an orthography reform
> is often coupled with some change of language, or could be _viewed_ as
> creating a version of a language. But logically orthography is
> orthogonal to dialect, jargon, and other variation reflected in a
> language subcode.

That would seem to mean that a separate attribute "orthography" with a
value from a wide range of codes for various writing systems used for
various languages would make sense.

> Does someone really think that a new version of the German language has
> been or is being created by the orthography reform that was officially
> started in 1998? I don't think so. For adequate use of language
> information, e.g. in spelling checking, orthography is relevant, but it
> should be specified separately.

So "<span lang='de' orthography='de-neu'>Schloss</span>" would in
principle be OK then? (Supposing that "de-neu" - or whatever - has been
officially registered as the code for the new German orthography.)

--
Bertilo Wennergren <bertilow@gmx.net> <http://www.bertilow.com>

**Bertilo Wennergren** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

Jukka K. Korpela:

> As a rough rule of thumb, if you use
> characters that are not present in Times New Roman, consider suggesting
> body { font-family: "Arial Unicode MS"; }
> maybe with some other fonts too, if you have checked that each of them
> has all the characters you're using.

You should be aware that "Arial Unicode MS" can be installed on Linux
systems, but that on many such systems it will fail to render any
italics. So suggesting that font might disable italics for some users.

If italics are used for emphasized text or citations (or something else)
that could be a problem on pages where emphasis, citation etc. convey
important pieces of information.

--
Bertilo Wennergren <bertilow@gmx.net> <http://www.bertilow.com>

**Alan J. Flavell** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

On Sun, 25 Jan 2004, Jukka K. Korpela wrote:

> As a rough rule of thumb, if you use
> characters that are not present in Times New Roman, consider suggesting
> body { font-family: "Arial Unicode MS"; }
> maybe with some other fonts too, if you have checked that each of them
> has all the characters you're using.

Well, at least if they have Arial Unicode MS, you know that the font
has the rich character repertoire. Whereas many font family names
denote fonts which come in more than one version, having widely
different repertoires - previous discussion has shown numerous
examples.

It's a dilemma. Arial Unicode MS typeface has only one font, whereas
(for example) the Palatino Linotype typeface has also italic, bold and
bold italic fonts. Lucida Sans Unicode typeface also has a fairly
wide repertoire but only one font. When italic, bold etc. have to be
derived from the regular font, the results are suboptimal.

> The sure gain is that a large number of IE users will be able to read
> the page without difficulty. The potential loss is that users who
> actually have a qualitatively better font in their system and a browser
> configured to use it will need an extra action to override the page
> settings. I don't like the loss, but I think it's acceptable.

It's a value judgement call, which could very well come out different
for each situation. I really don't have a final view on it.

Fortunately, if one uses a central stylesheet then a change of
opinion can be easily implemented!

> It's unfortunate that Code2000 is qualitatively so awful.

It's a reasonable choice when repertoire is the overwhelming
consideration, and cosmetics can take a back seat.

(Then there's the problem of monospace.)

cheers

**Mad Bad Rabbit** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

"Jukka K. Korpela" <jkorpela@cs.tut.fi> wrote:

> In situations where the author knows that some font(s) that are
> relatively commonly installed contain the characters he uses in a
> document, I think it is reasonable to write a font-family suggestion
> [...] As a rough rule of thumb, if you use characters that are not
> present in Times New Roman, consider suggesting
>
> body { font-family: "Arial Unicode MS"; }

Wouldn't it be safer to leave <body> alone, and only suggest
an alternate font-family for parts of the document known to
contain the problematic characters?

For example, if I'm composing a Bible-study page that has a
few scattered Greek words, oughtn't it just use:

span.polytonic { font-family: "Palatino Linotype" }

>;K

**Philip Newton** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

On Thu, 22 Jan 2004 22:04:46 +0100, Andreas Prilop
<nhtcapri@rrzn-user.uni-hannover.de> wrote:

> It might be a good idea to extend the euro-centric list
> serif, sans-serif, cursive, fantasy
> by
> naskhi, nastaliq, thuluth
> etc.

Sounds reasonable to me. Is "thuluth" what is sometimes called "sülüs"?

Cheers,
Philip
--
Philip Newton <nospam.newton@gmx.li>
That really is my address; no need to remove anything to reply.
If you're not part of the solution, you're part of the precipitate.

**Philip Newton** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

On Fri, 23 Jan 2004 21:17:41 +0100, Andreas Prilop
<nhtcapri@rrzn-user.uni-hannover.de> wrote:

> Philip Newton <pne-news-200401@newton.digitalspace.net> wrote:
>
> > "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:
> >
> >> If those characters were Arabic, then it would be useful to choose,
> >> say, a Persian font if it were known that the language is Farsi.
> >
> > Or, for a possibly better example, to choose a nastaliq font (the kind
> > that slopes) for Urdu vs a default naskhi (horizontal) font for Arabic.
>
> That ain't a better example - it's the same example. Both Persian and
> Urdu would prefer a nast'aliq typeface.

Ah, I did not know that Persian also preferred nastaliq. Thanks.

Cheers,
Philip
--
Philip Newton <nospam.newton@gmx.li>
That really is my address; no need to remove anything to reply.
If you're not part of the solution, you're part of the precipitate.

**Philip Newton** · Jul 20 '05, 05:26 PM

Re: Lang attribute values

On Fri, 23 Jan 2004 16:32:02 +0200, Henri Sivonen <hsivonen@iki.fi>
wrote:

> In article <Xns9479A6AF2F1Ejkorpelacstutfi@193.229.0.31>, "Jukka K.
> Korpela" <jkorpela@cs.tut.fi> wrote:
>
> > Yes. It should see immediately that Latin script is used. But in
> > addition to this, what's the big idea in selecting fonts according
> > to language?
>
> I can't find a politically correct way of saying this, but there
> are pecking orders of language groups within scripts in terms of
> font availability and quality. It's unfortunate.
>
> For example Polish looks ugly if some glyphs come from a "Western"
> font and others come from a "Central European" font.

Mmm. Or if you want to have d-with-caron; you often can't use U+010F
LATIN SMALL LETTER D WITH CARON since this will typically have a glyph
with apostrophe after rather than caron above due to Czech and Slovak
typesetting habits (if I interpret the comment in the Unicode standard
correctly). But what if I'm not typesetting Czech or Slovak, but a
language which uses d-with-caron? (This is a real example, though the
language in question is not a natlang.)

Cheers,
Philip
--
Philip Newton <nospam.newton@gmx.li>
That really is my address; no need to remove anything to reply.
If you're not part of the solution, you're part of the precipitate.
