langauges, locales, regex, LIKE

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Dennis Gearon

    langauges, locales, regex, LIKE

    If I've read everything right, in order to get:

    multiple languages on a site

    with the functionality of ALL of:

    REGEX
    LIKE
    Correctly sorted text

    A site would have to:

    create a cluster for every language needed
    run a separate database instance for every language
    and have the database instances each have their own port
    and use 8 bit encoding for that specific language

    because:

    Sorting is fixed at cluster/directory creation per single
    database instance
    And LIKE only works on C Locale with an eight bit encoding
    and sorting (MAYBE?) works only on 8 bit encoding
    when using C Locale.

    If anyone can correct me on this, I'd love to hear it.

    Boy, the old LOCALE system has really got to go someday.

    ---------------------------(end of broadcast)---------------------------
    TIP 8: explain analyze is your friend

  • Joel Matthew

    #2
    Re: langauges, locales, regex, LIKE

    > If I've read everything right, in order to get:[color=blue]
    >
    > multiple languages on a site
    >
    > with the functionality of ALL of:
    >
    > REGEX
    > LIKE
    > Correctly sorted text
    >
    > ...[/color]

    You might want to look at this:



    to get an idea of where things stand at the Unicode Consortium.

    But, for sorting mixed content, what sort order should take effect
    between "now" and $B!V8=;~E@!W( B (the latter being Japanese)?

    A little more down-to-earth, would you want "genjiten" and the hiragana
    equivalent, $B!V$2$s$8$F$s !W(B to fold together in the collation? You
    definitely would need a Japanese locale for that, if it ever could work.
    (I'm not sure it could work, unless you had some way to specify the
    method of romanization in a sublocale or something.)

    And what happens when you mix "genjiten" with "genealogy" ? Or the US
    English name "Moench" with the German name using the correct character
    for the "oe", with the romanized Japanese "moeru", and the same ($B!V$b$((B
    $B$k!W(B) in hiragana?

    --
    Joel <rees@ddcom.co. jp>
    Just ranting, ignore me.


    ---------------------------(end of broadcast)---------------------------
    TIP 6: Have you searched our list archives?



    Comment

    • Richard Huxton

      #3
      Re: langauges, locales, regex, LIKE

      Dennis Gearon wrote:[color=blue]
      > If I've read everything right, in order to get:
      >
      > multiple languages on a site
      >
      > with the functionality of ALL of:
      >
      > REGEX
      > LIKE
      > Correctly sorted text
      >
      > A site would have to:
      >
      > create a cluster for every language needed
      > run a separate database instance for every language
      > and have the database instances each have their own port
      > and use 8 bit encoding for that specific language[/color]

      You'd need a separate database, not a separate cluster. Each database
      can then have their own encoding and locale.
      [color=blue]
      > because:
      >
      > Sorting is fixed at cluster/directory creation per single
      > database instance[/color]

      To clarify, a cluster is a group of databases that share user logins and
      can all be accessed via the same server.
      [color=blue]
      > And LIKE only works on C Locale with an eight bit encoding
      > and sorting (MAYBE?) works only on 8 bit encoding
      > when using C Locale.[/color]

      You can sort, and I believe use LIKE on UTF etc. However, index use is a
      different matter.
      [color=blue]
      > If anyone can correct me on this, I'd love to hear it.
      >
      > Boy, the old LOCALE system has really got to go someday.[/color]

      The issue isn't so much the difficulty of supporting multiple locales
      (AFAIK). I believe it's more to do with interactions. If you have a
      table containing multiple languages in the same column, what does it
      mean to sort that table? Do you sort by language-name then by languages?
      If you don't, what rules do you follow?

      What happens if we compare different languages?
      Does fr/fr:"a" == en/gb:"a"?
      Does en/gb:"hello" == en/us:"hello"?

      Messy, isn't it?

      --
      Richard Huxton
      Archonet Ltd

      ---------------------------(end of broadcast)---------------------------
      TIP 4: Don't 'kill -9' the postmaster

      Comment

      • John Sidney-Woollett

        #4
        Re: langauges, locales, regex, LIKE

        For what it's worth, we have a unicode 7.4.1 database which gives us the
        sorting and searching behaviour that we expect (with the exception of
        the upper and lower functions). We access the data via jdbc so we don't
        have to deal with encoding issues per se as the driver does any
        translation for us.

        Currently we don't use any LIKE statements, but if we did, and wanted
        them optimized then we'd use the appropriate OP Class when defining the
        index. We also don't use any REGEX expressions. And we'll shortly be
        experimenting with tsearch2...

        List of databases
        Name | Owner | Encoding
        ---------------+----------+----------
        test | postgres | UNICODE

        Setting the psql client encoding to Latin1 and inserting the following
        data...

        # select * from johntest;
        id | value
        ----+-------
        1 | test
        2 | tést
        3 | tèst
        4 | taste
        5 | TEST
        6 | TÉST
        7 | TÈST
        8 | TASTE
        (8 rows)

        and then extracting the data in sorted order works as we would expect

        # select * from johntest order by value (no index on the value field)
        id | value
        ----+-------
        8 | TASTE
        5 | TEST
        7 | TÈST
        6 | TÉST
        4 | taste
        1 | test
        3 | tèst
        2 | tést
        (8 rows)

        however, applying the UPPER function to the data does not work as
        expected, problem with ids 6,7,3,2 - should be ordered (3,7 or 7,3) ,
        (6,2 or 2,6)

        # select * from johntest order by upper(value);
        id | value
        ----+-------
        4 | taste
        8 | TASTE
        1 | test
        5 | TEST
        7 | TÈST
        6 | TÉST
        3 | tèst
        2 | tést
        (8 rows)

        using a LIKE operation also works as expected (again no index on value
        field)

        # select * from johntest where value like 't%';
        id | value
        ----+-------
        1 | test
        2 | tést
        3 | tèst
        4 | taste
        (4 rows)

        Here's our pg_controldata output:
        version number: 72
        Catalog version number: 200310211
        Database cluster state: in production
        pg_control last modified: Thu 24 Jun 2004 07:18:56 GMT
        Current log file ID: 0
        Next log file segment: 29
        Latest checkpoint location: 0/1CA5F8D8
        Prior checkpoint location: 0/1C8F2074
        Latest checkpoint's REDO location: 0/1CA5F8D8
        Latest checkpoint's UNDO location: 0/0
        Latest checkpoint's StartUpID: 17
        Latest checkpoint's NextXID: 42355483
        Latest checkpoint's NextOID: 29814
        Time of latest checkpoint: Thu 24 Jun 2004 07:18:54 GMT
        Database block size: 8192
        Blocks per segment of large relation: 131072
        Maximum length of identifiers: 64
        Maximum number of function arguments: 32
        Date/time type storage: floating-point numbers
        Maximum length of locale name: 128
        LC_COLLATE: C
        LC_CTYPE: C

        and our locale is

        locale
        LANG=en_GB.UTF-8
        LC_CTYPE="en_GB .UTF-8"
        LC_NUMERIC="en_ GB.UTF-8"
        LC_TIME="en_GB. UTF-8"
        LC_COLLATE="en_ GB.UTF-8"
        LC_MONETARY="en _GB.UTF-8"
        LC_MESSAGES="en _GB.UTF-8"
        LC_PAPER="en_GB .UTF-8"
        LC_NAME="en_GB. UTF-8"
        LC_ADDRESS="en_ GB.UTF-8"
        LC_TELEPHONE="e n_GB.UTF-8"
        LC_MEASUREMENT= "en_GB.UTF-8"
        LC_IDENTIFICATI ON="en_GB.UTF-8"
        LC_ALL=

        We are intending to support various European languages on our website,
        and so far the unicode seems to be working for us.

        But maybe we' re just happy in our ignorance!

        John Sidney-Woollett

        Dennis Gearon wrote:
        [color=blue]
        > If I've read everything right, in order to get:
        >
        > multiple languages on a site
        >
        > with the functionality of ALL of:
        >
        > REGEX
        > LIKE
        > Correctly sorted text
        >
        > A site would have to:
        >
        > create a cluster for every language needed
        > run a separate database instance for every language
        > and have the database instances each have their own port
        > and use 8 bit encoding for that specific language
        >
        > because:
        >
        > Sorting is fixed at cluster/directory creation per single
        > database instance
        > And LIKE only works on C Locale with an eight bit encoding
        > and sorting (MAYBE?) works only on 8 bit encoding
        > when using C Locale.
        >
        > If anyone can correct me on this, I'd love to hear it.
        >
        > Boy, the old LOCALE system has really got to go someday.
        >
        > ---------------------------(end of broadcast)---------------------------
        > TIP 8: explain analyze is your friend[/color]


        ---------------------------(end of broadcast)---------------------------
        TIP 2: you can get off all lists at once with the unregister command
        (send "unregister YourEmailAddres sHere" to majordomo@postg resql.org)

        Comment

        • Tom Lane

          #5
          Re: langauges, locales, regex, LIKE

          Richard Huxton <dev@archonet.c om> writes:[color=blue]
          > You'd need a separate database, not a separate cluster. Each database
          > can then have their own encoding and locale.[/color]

          Not so, unfortunately. You get one locale per cluster, at least for
          LC_COLLATE and LC_CTYPE which are the significant settings for this
          discussion.

          regards, tom lane

          ---------------------------(end of broadcast)---------------------------
          TIP 3: if posting/reading through Usenet, please send an appropriate
          subscribe-nomail command to majordomo@postg resql.org so that your
          message can get through to the mailing list cleanly

          Comment

          • Dennis Gearon

            #6
            Re: langauges, locales, regex, LIKE

            Richard Huxton wrote:
            [color=blue]
            > Dennis Gearon wrote:
            >[color=green]
            >> If I've read everything right, in order to get:
            >>
            >> multiple languages on a site
            >>
            >> with the functionality of ALL of:
            >> REGEX
            >> LIKE
            >> Correctly sorted text
            >>
            >> A site would have to:
            >>
            >> create a cluster for every language needed
            >> run a separate database instance for every language
            >> and have the database instances each have their own port
            >> and use 8 bit encoding for that specific language[/color]
            >
            >
            > You'd need a separate database, not a separate cluster. Each database
            > can then have their own encoding and locale.[/color]

            If I wanted all the languages to be running concurently, I can't switch clusters that the database is connected to on the fly, right? The database stays in the cluster it was started in, right? So, if that's true, then I need separate database instances if I want truly accurate sorting.
            [color=blue]
            >[color=green]
            >> because:
            >>
            >> Sorting is fixed at cluster/directory creation per single
            >> database instance[/color]
            >
            >
            > To clarify, a cluster is a group of databases that share user logins and
            > can all be accessed via the same server.
            >[color=green]
            >> And LIKE only works on C Locale with an eight bit encoding
            >> and sorting (MAYBE?) works only on 8 bit encoding
            >> when using C Locale.[/color]
            >
            >
            > You can sort, and I believe use LIKE on UTF etc. However, index use is a
            > different matter.[/color]

            Yup, there is no facility to declare character sets for indexes.
            [color=blue]
            >[color=green]
            >> If anyone can correct me on this, I'd love to hear it.
            >>
            >> Boy, the old LOCALE system has really got to go someday.[/color]
            >
            >
            > The issue isn't so much the difficulty of supporting multiple locales
            > (AFAIK). I believe it's more to do with interactions. If you have a
            > table containing multiple languages in the same column, what does it
            > mean to sort that table? Do you sort by language-name then by languages?
            > If you don't, what rules do you follow?
            >
            > What happens if we compare different languages?
            > Does fr/fr:"a" == en/gb:"a"?
            > Does en/gb:"hello" == en/us:"hello"?
            >
            > Messy, isn't it?
            >[/color]
            Without languge specific characters, they will sort exactly the same.

            ---------------------------(end of broadcast)---------------------------
            TIP 5: Have you checked our extensive FAQ?



            Comment

            • John Sidney-Woollett

              #7
              Re: langauges, locales, regex, LIKE



              Dennis Gearon wrote:
              [color=blue]
              > John Sidney-Woollett wrote:
              >[color=green]
              >> For what it's worth, we have a unicode 7.4.1 database which gives us
              >> the sorting and searching behaviour that we expect (with the
              >> exception of the upper and lower functions). We access the data via
              >> jdbc so we don't have to deal with encoding issues per se as the
              >> driver does any translation for us.
              >>
              >> Currently we don't use any LIKE statements, but if we did, and wanted
              >> them optimized then we'd use the appropriate OP Class when defining
              >> the index. We also don't use any REGEX expressions. And we'll shortly
              >> be experimenting with tsearch2...
              >>
              >> List of databases
              >> Name | Owner | Encoding
              >> ---------------+----------+----------
              >> test | postgres | UNICODE
              >>
              >> Setting the psql client encoding to Latin1 and inserting the
              >> following data...
              >>
              >> # select * from johntest;
              >> id | value
              >> ----+-------
              >> 1 | test
              >> 2 | tést
              >> 3 | tèst
              >> 4 | taste
              >> 5 | TEST
              >> 6 | TÉST
              >> 7 | TÈST
              >> 8 | TASTE
              >> (8 rows)
              >>
              >> [snip]
              >>
              >> using a LIKE operation also works as expected (again no index on
              >> value field)
              >>
              >> # select * from johntest where value like 't%';
              >> id | value
              >> ----+-------
              >> 1 | test
              >> 2 | tést
              >> 3 | tèst
              >> 4 | taste
              >> (4 rows)
              >>[/color]
              > Like works, but it can't use an index, and so would have horibble
              > performance vs. the situation where it CAN use an index. I believe
              > this is how Postgres is working now.[/color]


              If you use one of the OPCLASSes then LIKE operations using indexes
              should work, I believe.

              See http://www.postgresql.org/docs/7.4/s...s-opclass.html

              John Sidney-Woollett

              ---------------------------(end of broadcast)---------------------------
              TIP 3: if posting/reading through Usenet, please send an appropriate
              subscribe-nomail command to majordomo@postg resql.org so that your
              message can get through to the mailing list cleanly

              Comment

              • Dennis Gearon

                #8
                Re: langauges, locales, regex, LIKE

                John Sidney-Woollett wrote:
                [color=blue]
                > For what it's worth, we have a unicode 7.4.1 database which gives us the
                > sorting and searching behaviour that we expect (with the exception of
                > the upper and lower functions). We access the data via jdbc so we don't
                > have to deal with encoding issues per se as the driver does any
                > translation for us.
                >
                > Currently we don't use any LIKE statements, but if we did, and wanted
                > them optimized then we'd use the appropriate OP Class when defining the
                > index. We also don't use any REGEX expressions. And we'll shortly be
                > experimenting with tsearch2...
                >
                > List of databases
                > Name | Owner | Encoding
                > ---------------+----------+----------
                > test | postgres | UNICODE
                >
                > Setting the psql client encoding to Latin1 and inserting the following
                > data...
                >
                > # select * from johntest;
                > id | value
                > ----+-------
                > 1 | test
                > 2 | tést
                > 3 | tèst
                > 4 | taste
                > 5 | TEST
                > 6 | TÉST
                > 7 | TÈST
                > 8 | TASTE
                > (8 rows)
                >
                > and then extracting the data in sorted order works as we would expect
                >
                > # select * from johntest order by value (no index on the value field)
                > id | value
                > ----+-------
                > 8 | TASTE
                > 5 | TEST
                > 7 | TÈST
                > 6 | TÉST
                > 4 | taste
                > 1 | test
                > 3 | tèst
                > 2 | tést
                > (8 rows)
                >
                > however, applying the UPPER function to the data does not work as
                > expected, problem with ids 6,7,3,2 - should be ordered (3,7 or 7,3) ,
                > (6,2 or 2,6)
                >
                > # select * from johntest order by upper(value);
                > id | value
                > ----+-------
                > 4 | taste
                > 8 | TASTE
                > 1 | test
                > 5 | TEST
                > 7 | TÈST
                > 6 | TÉST
                > 3 | tèst
                > 2 | tést
                > (8 rows)
                >
                > using a LIKE operation also works as expected (again no index on value
                > field)
                >
                > # select * from johntest where value like 't%';
                > id | value
                > ----+-------
                > 1 | test
                > 2 | tést
                > 3 | tèst
                > 4 | taste
                > (4 rows)
                >[/color]
                Like works, but it can't use an index, and so would have horibble performance vs. the situation where it CAN use an index. I believe this is how Postgres is working now.

                ---------------------------(end of broadcast)---------------------------
                TIP 5: Have you checked our extensive FAQ?



                Comment

                Working...