Search engines continue to ignore LANG markup

  • Andreas Prilop

    Search engines continue to ignore LANG markup

    I have three test pages that are marked as Italian, Spanish,
    Portuguese, resp. by

    Content-Language: it
    <html lang="it">
    <body lang="it">

    and the same for "es" and "pt".
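
    As a rough sketch of what a crawler would have to read in order to
    honour these declarations, the three places can be collected as below.
    This is Python; the names LangCollector and declared_languages are made
    up for illustration, the headers are assumed to arrive as a plain dict,
    and nothing here describes how any real search engine works.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Collects the lang attribute of the <html> and <body> start tags."""

    def __init__(self):
        super().__init__()
        self.langs = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag in ("html", "body"):
            lang = dict(attrs).get("lang")
            if lang:
                self.langs[tag] = lang


def declared_languages(headers, markup):
    """Return every language declaration found in the HTTP headers and markup."""
    collector = LangCollector()
    collector.feed(markup)
    declared = dict(collector.langs)
    # Assumption: headers is a plain dict keyed by the exact header name.
    if "Content-Language" in headers:
        declared["http"] = headers["Content-Language"]
    return declared
```

    For a page set up like the Italian one above, all three entries
    ("html", "body" and "http") would come back as "it".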

    Yahoo regards all three pages as Italian:


    Google regards one as English (What??) and two as Spanish:



    :-(

    --
    In memoriam Alan J. Flavell

  • Spartanicus

    #2
    Re: Search engines continue to ignore LANG markup

    Andreas Prilop <AndreasPrilop2007@trashmail.net> wrote:
    >I have three test pages that are marked as Italian, Spanish,
    >Portuguese, resp. by
    >
    >Content-Language: it
    ><html lang="it">
    ><body lang="it">
    >
    >and the same for "es" and "pt".
    >
    >Yahoo regards all three pages as Italian:
    >http://search.yahoo.com/search?p=%22...l=1&vl=lang_it
    >
    >Google regards one as English (What??) and two as Spanish:
    >http://www.google.com/search?q=%22id...%22&lr=lang_en
    >http://www.google.com/search?q=%22id...%22&lr=lang_es

    I'd be surprised if author provided meta data like language info on the
    web was broadly reliable. I'd expect better results from using
    heuristics to determine a document's language. So I'd expect SEs to use
    heuristics, it serves their users better.

    I don't speak any of the test languages, but comparing two of the test
    pages it seems to me that they do not contain words that are
    characteristic for each language, in fact the content appears to be
    chosen to confuse heuristic guessing.

    The choice of using a list of words instead of natural language probably
    also hinders heuristic guessing since it makes it impossible to use
    context for similar words in the various languages.
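
    To illustrate the kind of guessing meant here, a toy Python sketch
    follows. It is only an assumption about how such a heuristic could look:
    the marker word lists are tiny and hand-picked, and the function name
    guess_language is invented. Real detectors use much larger statistical
    profiles.

```python
from collections import Counter

# Hypothetical, deliberately tiny lists of common function words.
MARKERS = {
    "it": {"il", "gli", "che", "è", "non", "della", "sono"},
    "es": {"el", "los", "las", "y", "es", "muy", "pero"},
    "pt": {"não", "uma", "são", "em", "do", "você", "também"},
}


def guess_language(text):
    """Return the best-scoring language code, or None without a clear winner."""
    words = text.lower().split()
    scores = Counter()
    for lang, markers in MARKERS.items():
        scores[lang] = sum(1 for word in words if word in markers)
    ranked = scores.most_common()
    best_lang, best_score = ranked[0]
    # No marker words at all, or a tie for first place: refuse to guess.
    if best_score == 0 or best_score == ranked[1][1]:
        return None
    return best_lang
```

    A natural sentence such as "il gatto non è qui" scores only for "it",
    whereas a bare list of nouns with no function words (say "casa arte
    costa") gives the heuristic nothing to work with and yields None.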

    --
    Spartanicus

    • Andreas Prilop

      #3
      Re: Search engines continue to ignore LANG markup

      On Wed, 28 Feb 2007, Spartanicus wrote:
      >I'd be surprised if author provided meta data like language info on the
      >web was broadly reliable.

      Mostly, LANG markup is *missing* from documents. However, if the author
      supplies LANG markup, it should be taken as ... well ... authoritative.
      The author knows best in which language he writes.

      >I'd expect better results from using
      >heuristics to determine a document's language. So I'd expect SEs to use
      >heuristics, it serves their users better.

      That's the same argument used by Internet Explorer 6:

      | The server sends "text/plain" but I take "text/html"
      | because it seems to make more sense to me.

      They can still guess when LANG markup is *missing*.

      >in fact the content appears to be chosen to confuse heuristic guessing.

      Exactly.

      >The choice of using a list of words instead of natural language probably
      >also hinders heuristic guessing since it makes it impossible to use
      >context for similar words in the various languages.

      But only with such a list of words, you can take different LANG
      parameters. All the words exist in Italian, Spanish, Portuguese.
      Each page could be IT or ES or PT.
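
      The precedence argued for in this post (declaration first, guessing
      only as a fallback) can be written down in a few lines of Python. The
      function name resolve_language and its two parameters are hypothetical;
      this is a sketch of the proposed policy, not of any existing engine.

```python
def resolve_language(declared, guessed):
    """Prefer the author's declaration; fall back to a heuristic guess.

    declared: language code taken from LANG markup or Content-Language,
              or None / empty string when the author supplied nothing.
    guessed:  language code produced by some heuristic, or None.
    """
    if declared and declared.strip():
        return declared.strip().lower()
    return guessed
```

      Under this policy a page declared as "pt" stays Portuguese even if a
      heuristic would have said "es"; the guess is consulted only when the
      declaration is absent.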

      --
      In memoriam Alan J. Flavell

      • António Marques

        #4
        Re: Search engines continue to ignore LANG markup

        Andreas Prilop wrote:
        >I have three test pages that are marked as Italian, Spanish,
        >Portuguese, resp. by
        >
        >Content-Language: it
        ><html lang="it">
        ><body lang="it">
        >
        >and the same for "es" and "pt".
        >
        >Yahoo regards all three pages as Italian:
        >
        >Google regards one as English (What??) and two as Spanish:
        >
        >:-(

        Shouldn't you have <meta lang="it"/> in the head rather than specifying
        the language of elements?
        --
        am

        laurus : rhodophyta : brethoneg : smalltalk : stargate

        --
        Posted via a free Usenet account from http://www.teranews.com

        • Spartanicus

          #5
          Re: Search engines continue to ignore LANG markup

          Andreas Prilop <AndreasPrilop2007@trashmail.net> wrote:
          >>I'd be surprised if author provided meta data like language info on the
          >>web was broadly reliable.
          >
          >Mostly, LANG markup is *missing* from documents. However, if the author
          >supplies LANG markup, it should be taken as ... well ... authoritative.
          >The author knows best in which language he writes.

          I don't have any statistics, but I'd expect that many documents on the
          web are produced by authoring tools whose templates may contain false
          language info. I've done it myself, even as a hand coder: my default
          document template contains lang="en", and on more than one occasion
          I have published "Lorem ipsum" demo pages with the default lang="en"
          still in there.

          >>I'd expect better results from using
          >>heuristics to determine a document's language. So I'd expect SEs to use
          >>heuristics, it serves their users better.
          >
          >That's the same argument used by Internet Explorer 6:
          >
          >| The server sends "text/plain" but I take "text/html"
          >| because it seems to make more sense to me.

          That is a spec violation (must). There is no spec requirement on a UA to
          use language meta data:
          "Language information specified via the lang attribute may be used by a
          user agent"

          >>in fact the content appears to be chosen to confuse heuristic guessing.
          >
          >Exactly.
          >
          >>The choice of using a list of words instead of natural language probably
          >>also hinders heuristic guessing since it makes it impossible to use
          >>context for similar words in the various languages.
          >
          >But only with such a list of words, you can take different LANG
          >parameters. All the words exist in Italian, Spanish, Portuguese.
          >Each page could be IT or ES or PT.

          I don't think that it is realistic to expect SEs to use language meta
          data if they cannot determine the language via heuristics. And as I've
          noted before I find their decision to use heuristics logical.

          --
          Spartanicus

          • Jukka K. Korpela

            #6
            Re: Search engines continue to ignore LANG markup

            Scripsit Andreas Prilop:

            [ Search engines ignore lang attributes and Content-Language headers,
            apparently using some guesswork instead. ]

            Sadly enough, this will probably not improve. The problem is that there are
            too many phoney lang attributes on web pages, typically resulting from
            authoring software that spits them out, though clueless authors write them,
            too. There are also wrong lang attributes due to simple carelessness. What
            would you do, then, if you were a search engine that tried to be useful?

            For example, http://www.kko.fi/29566.htm is a page by the Supreme Court of
            Finland, actually in Sámi language, but with lang="sa", i.e. claiming to be
            in Sanskrit, despite the detailed explanation of the mistake that I sent
            months ago. Someone took the trouble of actually typing in the lang
            attribute but didn't get it right, and apparently it is impossible to fix
            it.

            --
            Jukka K. Korpela ("Yucca")


            • Jukka K. Korpela

              #7
              Re: Search engines continue to ignore LANG markup

              Scripsit António Marques:
              >Shouldn't you have <meta lang="it"/> in the head rather than
              >specifying the language of elements?

              No. By definition, the lang attribute specifies the language of the text in
              the element and its attributes. The <meta> element never has any content, so
              the above element is completely pointless. Using lang with other attributes
              could make sense in odd cases, if you have metainformation in a language
              other than the document's overall language, but that would normally be
              keyword spamming and could be treated as such.

              Followups trimmed; there was no reason for a silent addition of sci.lang.

              --
              Jukka K. Korpela ("Yucca")


              • mb

                #8
                Re: Search engines continue to ignore LANG markup

                On Feb 28, 8:54 am, Andreas Prilop <AndreasPrilop2...@trashmail.net>
                wrote:
                >I have three test pages that are marked as Italian, Spanish,
                >Portuguese, resp. by
                >
                >Content-Language: it
                ><html lang="it">
                ><body lang="it">
                >
                >and the same for "es" and "pt".

                Can you inform a total ignorant? All these, and many other languages
                too, can be typed on an international keyboard. Where is the sense of
                arbitrarily assigning "language" to texts then?

                • António Marques

                  #9
                  Re: Search engines continue to ignore LANG markup

                  Jukka K. Korpela wrote:
                  >Scripsit António Marques:
                  >
                  >>Shouldn't you have <meta lang="it"/> in the head rather than
                  >>specifying the language of elements?
                  >
                  >No. By definition, the lang attribute specifies the language of the text
                  >in the element and its attributes.

                  Yes, my fault. I intended to write <meta http-equiv="Content-Language"
                  content="it"/>, as I suspect that's what search engines look at.

                  >The <meta> element never has any
                  >content, so the above element is completely pointless. Using lang with
                  >other attributes could make sense in odd cases, if you have
                  >metainformation in a language other than the document's overall
                  >language, but that would normally be keyword spamming and could be
                  >treated as such.
                  >
                  >Followups trimmed; there was no reason for a silent addition of sci.lang.

                  There was no reason for the original posting to sci.lang either.
                  --
                  am

                  laurus : rhodophyta : brethoneg : smalltalk : stargate

                  --
                  Posted via a free Usenet account from http://www.teranews.com

                  • David Dorward

                    #10
                    Re: Search engines continue to ignore LANG markup

                    mb wrote:
                    >>Content-Language: it
                    >><html lang="it">
                    >
                    >Can you inform a total ignorant? All these, and many other languages
                    >too, can be typed on an international keyboard. Where is the sense of
                    >arbitrarily assigning "language" to texts then?

                    It isn't arbitrary, it describes the language the content is written in.
                    This has implications for (among other things) what pronunciation
                    dictionary a screen reader should use and what an automated system could do
                    given an instruction to get some data if it knows what languages the user
                    can understand.

                    --
                    David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
                    Home is where the ~/.bashrc is

                    • mb

                      #11
                      Re: Search engines continue to ignore LANG markup

                      On Feb 28, 1:20 pm, David Dorward <dorw...@yahoo.com> wrote:
                      >mb wrote:
                      >>>Content-Language: it
                      >>><html lang="it">
                      >>Can you inform a total ignorant? All these, and many other languages
                      >>too, can be typed on an international keyboard. Where is the sense of
                      >>arbitrarily assigning "language" to texts then?
                      >
                      >It isn't arbitrary, it describes the language the content is written in.
                      >This has implications for (among other things) what pronunciation
                      >dictionary a screen reader should use and what an automated system could do
                      >given an instruction to get some data if it knows what languages the user
                      >can understand.

                      Thank you, wasn't thinking of that.
                      Meaning that if I somehow could get a tag on Word documents I could
                      stop that @##! Word spellchecker from "automatically" deciding what
                      dictionary to use for each %$@@! word?

                      • Osmo Saarikumpu

                        #12
                        Re: Search engines continue to ignore LANG markup

                        mb wrote:
                        >Meaning that if I somehow could get a tag on Word documents I could
                        >stop that @##! Word spellchecker from "automatically" deciding what
                        >dictionary to use for each %$@@! word?

                        Yes. And you can. The easiest way is to use Word's GUI for the task. My
                        Word is in Finnish so you have to press F1 for help :)

                        Osmo




                        • Andreas Prilop

                          #13
                          Re: Search engines continue to ignore LANG markup

                          On Wed, 28 Feb 2007, mb wrote:
                          >Can you inform a total ignorant? All these, and many other languages
                          >too, can be typed on an international keyboard. Where is the sense of
                          >arbitrarily assigning "language" to texts then?

                          Search engines allow you to restrict your search to certain languages.
                          For example, you might want to restrict your search to English or
                          to French or to German when looking for the word "elf".
                          This will go wrong of course when the search engine is unable
                          to detect the language correctly.
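
                          A small Python sketch of such a language-restricted
                          search, and of how it goes wrong, might look as follows.
                          Everything in it is invented for illustration: the
                          example.org URLs, the sample texts, the language tags
                          and the search function itself.

```python
# Hypothetical mini-index; the third entry is German text that the
# (imaginary) engine has wrongly classified as English.
INDEX = [
    {"url": "http://example.org/a", "lang": "en", "text": "The elf lived in the forest"},
    {"url": "http://example.org/b", "lang": "de", "text": "Um elf Uhr beginnt das Spiel"},
    {"url": "http://example.org/c", "lang": "en", "text": "Elf Freunde sollt ihr sein"},
]


def search(index, term, lang=None):
    """Return URLs of documents containing term, optionally limited to one language."""
    term = term.lower()
    return [
        doc["url"]
        for doc in index
        if term in doc["text"].lower().split()
        and (lang is None or doc["lang"] == lang)
    ]
```

                          Restricting the search for "elf" to German returns only
                          page b; page c is lost to German searchers and instead
                          pollutes the English results, because its language was
                          detected wrongly.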

                          --
                          In memoriam Alan J. Flavell

                          • Andreas Prilop

                            #14
                            Re: Search engines continue to ignore LANG markup

                            On Wed, 28 Feb 2007, António Marques wrote:
                            >I intended to write <meta http-equiv="Content-Language"
                            >content="it"/>, as I suspect that's what search engines look at.

                            First, the slash is wrong in HTML.

                            Second, *everything* called <meta http-equiv> is only a poor ersatz,
                            a cheapo surrogate, a plastic imitation from China. What you should
                            have instead, is the *real* HTTP header

                            Content-Language: it

                            And that's exactly what I wrote in my original posting.
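
                            How the real header gets sent depends entirely on the
                            server. As one possible illustration, here is a minimal
                            WSGI application in Python that sets it; the factory
                            name make_app is made up, and the original test pages
                            were not necessarily served this way.

```python
def make_app(body, language):
    """Build a WSGI application that serves body with a Content-Language header."""
    payload = body.encode("utf-8")

    def app(environ, start_response):
        headers = [
            ("Content-Type", "text/html; charset=utf-8"),
            # The real HTTP header, as opposed to a <meta http-equiv> stand-in.
            ("Content-Language", language),
            ("Content-Length", str(len(payload))),
        ]
        start_response("200 OK", headers)
        return [payload]

    return app
```

                            For a quick local check, the application can be passed
                            to make_server from the standard wsgiref.simple_server
                            module and the response headers inspected with any
                            HTTP client.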

                            --
                            In memoriam Alan J. Flavell

                            • mb

                              #15
                              Re: Search engines continue to ignore LANG markup

                              On Mar 1, 1:34 am, Osmo Saarikumpu <o...@weppipakki.com> wrote:
                              >mb wrote:
                              >>Meaning that if I somehow could get a tag on Word documents I could
                              >>stop that @##! Word spellchecker from "automatically" deciding what
                              >>dictionary to use for each %$@@! word?
                              >
                              >Yes. And you can. The easiest way is to use Word's GUI for the task. My
                              >Word is in Finnish so you have to press F1 for help :)

                              Nah. When you have multiple languages installed the damn thing only
                              does "automatic" recognition, because "tools-language" selects them
                              all by default.
