Trouble sorting lists (unicode/locale related?)

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Erlend Fuglum

    Trouble sorting lists (unicode/locale related?)

    Hi everyone,

    I'm having some trouble sorting lists. I suspect this might have
    something to do with locale settings and/or character
    encoding/unicode.

    Consider the following example, text containing norwegian special
    characters æ, ø and å.
    [color=blue][color=green][color=darkred]
    >>> liste = ["ola", "erlend", "trygve", "Ærlige anders", "Lars", "Øksemorder en", "Åsne", "Akrobatisk e Anna", "leidulf"]
    >>> liste.sort()
    >>> liste[/color][/color][/color]
    ['Akrobatiske Anna', 'Lars', 'erlend', 'leidulf', 'ola', 'trygve',
    '\xc5sne', '\xc6rlige anders', '\xd8ksemordere n']

    There are a couple of issues for me here:
    * The sorting method apparently places strings starting with uppercase
    characters before strings staring with lowercase. I would like to
    treat them them equally when sorting. OK, this could probably be fixed
    by hacking with .toupper() or something, but isn't it possible to
    achieve this in a more elegant way?

    * The norwegian special characters are sorted in a wrong way.
    According to our alphabet the correct order is (...) x, y, z, æ, ø å.
    Python does it this way: (...) x, y, z, å, æ, ø ?

    I would really appreciate any help and suggestions - I have been
    fiddling with this mess for quite some time now :-)
  • Peter Otten

    #2
    Re: Trouble sorting lists (unicode/locale related?)

    Erlend Fuglum wrote:
    [color=blue]
    > Hi everyone,
    >
    > I'm having some trouble sorting lists. I suspect this might have
    > something to do with locale settings and/or character
    > encoding/unicode.
    >
    > Consider the following example, text containing norwegian special
    > characters æ, ø and å.
    >[color=green][color=darkred]
    >>>> liste = ["ola", "erlend", "trygve", "Ærlige anders", "Lars",
    >>>> "Øksemorder en", "Åsne", "Akrobatisk e Anna", "leidulf"] liste.sort()
    >>>> liste[/color][/color]
    > ['Akrobatiske Anna', 'Lars', 'erlend', 'leidulf', 'ola', 'trygve',
    > '\xc5sne', '\xc6rlige anders', '\xd8ksemordere n']
    >
    > There are a couple of issues for me here:
    > * The sorting method apparently places strings starting with uppercase
    > characters before strings staring with lowercase. I would like to
    > treat them them equally when sorting. OK, this could probably be fixed
    > by hacking with .toupper() or something, but isn't it possible to
    > achieve this in a more elegant way?
    >
    > * The norwegian special characters are sorted in a wrong way.
    > According to our alphabet the correct order is (...) x, y, z, æ, ø å.
    > Python does it this way: (...) x, y, z, å, æ, ø ?
    >
    > I would really appreciate any help and suggestions - I have been
    > fiddling with this mess for quite some time now :-)[/color]

    Try setting the appropriate locale first:

    import locale
    locale.setlocal e(locale.LC_ALL , ("no", None))

    Then for a case-insensitive sort:

    wordlist.sort(l ocale.strcoll)

    should do (disclaimer: all untested).

    Peter


    Comment

    • Erlend Fuglum

      #3
      Re: Trouble sorting lists (unicode/locale related?)

      On Sun, 21 Sep 2003 14:10:16 +0200, Peter Otten <__peter__@web. de>
      wrote:
      [color=blue]
      >Erlend Fuglum wrote:
      >[color=green]
      >> Hi everyone,
      >>
      >> I'm having some trouble sorting lists. I suspect this might have
      >> something to do with locale settings and/or character
      >> encoding/unicode.
      >> (...)[/color]
      >
      >Try setting the appropriate locale first:
      >
      >import locale
      >locale.setloca le(locale.LC_AL L, ("no", None))
      >
      >Then for a case-insensitive sort:
      >
      >wordlist.sort( locale.strcoll)
      >
      >should do (disclaimer: all untested).[/color]

      Grrrrrreat, that did it! Thank you very much!

      Erlend

      Comment

      • Jeff Epler

        #4
        Re: Trouble sorting lists (unicode/locale related?)

        On Sun, Sep 21, 2003 at 02:10:16PM +0200, Peter Otten wrote:[color=blue]
        > Try setting the appropriate locale first:
        >
        > import locale
        > locale.setlocal e(locale.LC_ALL , ("no", None))
        >
        > Then for a case-insensitive sort:
        >
        > wordlist.sort(l ocale.strcoll)
        >
        > should do (disclaimer: all untested).[/color]

        If this did work, then you can use the DSU pattern like so:

        decorated_wordl ist = [(locale.strxfrm (w), w) for w in wordlist]
        decorated_wordl ist.sort()
        wordlist[:] = [i[1] for i in decorated_wordl ist]

        from "man strxfrm"
        DESCRIPTION
        The strxfrm() function transforms the src string into a form suchthat
        the result of strcmp() on two strings that have been transformed with
        strxfrm() is the same as the result of strcoll() on the two strings
        before their transformation. The first n characters of the transformed
        string are placed in dest. The transformation is based on thepro-
        gram’s current locale for category LC_COLLATE. (See setlocale(3)).

        Jeff

        Comment

        • John J. Lee

          #5
          Re: Trouble sorting lists (unicode/locale related?)

          Jeff Epler <jepler@unpytho nic.net> writes:[color=blue]
          > On Sun, Sep 21, 2003 at 02:10:16PM +0200, Peter Otten wrote:[/color]
          [...][color=blue][color=green]
          > > locale.setlocal e(locale.LC_ALL , ("no", None))[/color][/color]
          [...][color=blue]
          > If this did work, then you can use the DSU pattern like so:[/color]
          [...]

          The advantage of which is that you don't have to mess with the locale.


          John

          Comment

          • Martin v. Löwis

            #6
            Re: Trouble sorting lists (unicode/locale related?)

            John J. Lee wrote:
            [color=blue][color=green][color=darkred]
            >>>locale.setlo cale(locale.LC_ ALL, ("no", None))[/color][/color]
            >
            > [...]
            >[color=green]
            >>If this did work, then you can use the DSU pattern like so:[/color]
            >
            > [...strxfrm...]
            >
            > The advantage of which is that you don't have to mess with the locale.[/color]

            No, it doesn't. strxfrm requires that the locale is adjusted to the
            desired LC_COLLATE facet. Otherwise, it will collate according to the
            C locale (in which case strxfrm is the identity).

            Regards,
            Martin

            Comment

            • John J. Lee

              #7
              Re: Trouble sorting lists (unicode/locale related?)

              "Martin v. Löwis" <martin@v.loewi s.de> writes:
              [color=blue]
              > John J. Lee wrote:
              >[color=green][color=darkred]
              > >>>locale.setlo cale(locale.LC_ ALL, ("no", None))[/color]
              > > [...]
              > >[color=darkred]
              > >>If this did work, then you can use the DSU pattern like so:[/color]
              > > [...strxfrm...]
              > > The advantage of which is that you don't have to mess with the
              > > locale.[/color]
              >
              > No, it doesn't.[/color]

              It does set the locale, you mean? So I guess there's no advantage
              there at all?

              [...][color=blue]
              > strxfrm requires that the locale is adjusted to the
              > desired LC_COLLATE facet.[/color]
              [...]


              John

              Comment

              • Martin v. Löwis

                #8
                Re: Trouble sorting lists (unicode/locale related?)

                jjl@pobox.com (John J. Lee) writes:
                [color=blue][color=green][color=darkred]
                > > >>If this did work, then you can use the DSU pattern like so:
                > > > [...strxfrm...]
                > > > The advantage of which is that you don't have to mess with the
                > > > locale.[/color]
                > >
                > > No, it doesn't.[/color]
                >
                > It does set the locale, you mean?[/color]

                Calling locale.strxfrm does not cause locale.setlocal e to be called,
                i.e. it does not set the locale. Likewise, calling locale.strcoll
                does not cause setlocale to be called.

                Instead, the application needs to call setlocale *explicitly* for
                proper operation of either function.
                [color=blue]
                > So I guess there's no advantage there at all?[/color]

                Using strxfrm is an advantage if you have to collate the same string
                multiple times, e.g. when sorting a list of strings. It is an
                advantage because sorting will complete faster.

                Regards,
                Martin

                Comment

                • Peter Otten

                  #9
                  DSU pattern (was Re: Trouble sorting lists (unicode/locale related?))

                  Jeff Epler wrote:
                  [color=blue]
                  > If this did work, then you can use the DSU pattern like so:
                  >
                  > decorated_wordl ist = [(locale.strxfrm (w), w) for w in wordlist]
                  > decorated_wordl ist.sort()
                  > wordlist[:] = [i[1] for i in decorated_wordl ist][/color]

                  Everytime that someone posts a naive list.sort(compa re), the DSU pattern is
                  proposed to improve execution speed.

                  So maybe it's about time to change the sort() method to support a second
                  argument

                  list.sort(compa re=None, mapping=None)

                  that, if provided, would perform the DSU magic. Or was that already proposed
                  and rejected?

                  Peter

                  Comment

                  • Alex Martelli

                    #10
                    Re: DSU pattern (was Re: Trouble sorting lists (unicode/locale related?))

                    Peter Otten wrote:
                    [color=blue]
                    > Jeff Epler wrote:
                    >[color=green]
                    >> If this did work, then you can use the DSU pattern like so:
                    >>
                    >> decorated_wordl ist = [(locale.strxfrm (w), w) for w in wordlist]
                    >> decorated_wordl ist.sort()
                    >> wordlist[:] = [i[1] for i in decorated_wordl ist][/color]
                    >
                    > Everytime that someone posts a naive list.sort(compa re), the DSU pattern
                    > is proposed to improve execution speed.
                    >
                    > So maybe it's about time to change the sort() method to support a second
                    > argument
                    >
                    > list.sort(compa re=None, mapping=None)
                    >
                    > that, if provided, would perform the DSU magic. Or was that already
                    > proposed and rejected?[/color]

                    I have not seen this proposed before, and I'm not very clear on what
                    the "compare" and "mapping" arguments are supposed to be in order to
                    let you specify any DSU. Basically it seems you would need two
                    callables, "decorate" to be called with each item in the list (to
                    return for each item the decorated tuple) and "undecorate " to be
                    called with each decorated tuple after the sort (to return the item
                    for the result). How do you turn that into "compare" and "mapping"?


                    Alex

                    Comment

                    • Duncan Booth

                      #11
                      Re: DSU pattern (was Re: Trouble sorting lists (unicode/locale related?))

                      Alex Martelli <aleax@aleax.it > wrote in
                      news:DUCbb.1008 26$hE5.3550910@ news1.tin.it:
                      [color=blue][color=green]
                      >> So maybe it's about time to change the sort() method to support a
                      >> second argument
                      >>
                      >> list.sort(compa re=None, mapping=None)
                      >>
                      >> that, if provided, would perform the DSU magic. Or was that already
                      >> proposed and rejected?[/color]
                      >
                      > I have not seen this proposed before, and I'm not very clear on what
                      > the "compare" and "mapping" arguments are supposed to be in order to
                      > let you specify any DSU. Basically it seems you would need two
                      > callables, "decorate" to be called with each item in the list (to
                      > return for each item the decorated tuple) and "undecorate " to be
                      > called with each decorated tuple after the sort (to return the item
                      > for the result). How do you turn that into "compare" and "mapping"?
                      >
                      >[/color]
                      You don't need two callables because the sort function would be doing
                      the decorating, so it knows also how to undecorate. I think the
                      suggestion is that the mapping argument returns something that can be
                      compared.

                      For example, here is a DSU function that does a not-in-place sort and
                      takes a suitable mapping argument. Changing it to in-place sort is,
                      of course, trivial.
                      [color=blue][color=green][color=darkred]
                      >>> def DSU(aList, aMapping):[/color][/color][/color]
                      newList = [ (aMapping(item) , index, item) for (index, item)
                      in enumerate(aList ) ]
                      newList.sort()
                      return [ item[2] for item in newList ]
                      [color=blue][color=green][color=darkred]
                      >>> print DSU(['Alex Martelli', 'Duncan Booth', 'Peter Otten', 'Jeff Epler'],[/color][/color][/color]
                      str.lower)
                      ['Alex Martelli', 'Duncan Booth', 'Jeff Epler', 'Peter Otten'][color=blue][color=green][color=darkred]
                      >>> def lastFirst(name) :[/color][/color][/color]
                      names = name.split()
                      names = names[-1:] + names[:-1]
                      return [name.lower() for name in names]
                      [color=blue][color=green][color=darkred]
                      >>> print DSU(['Alex Martelli', 'Duncan Booth', 'Peter Otten', 'Jeff Epler'],[/color][/color][/color]
                      lastFirst)
                      ['Duncan Booth', 'Jeff Epler', 'Alex Martelli', 'Peter Otten']


                      --
                      Duncan Booth
                      duncan@rcp.co.u k int month(char
                      *p){return(1248 64/((p[0]+p[1]-p[2]&0x1f)+1)%12 )["\5\x8\3"
                      "\6\7\xb\1\x9\x a\2\0\4"];} // Who said my code was obscure?

                      Comment

                      • Peter Otten

                        #12
                        Re: DSU pattern (was Re: Trouble sorting lists (unicode/locale related?))

                        Alex Martelli wrote:
                        [color=blue][color=green]
                        >> Everytime that someone posts a naive list.sort(compa re), the DSU pattern
                        >> is proposed to improve execution speed.
                        >>
                        >> So maybe it's about time to change the sort() method to support a second
                        >> argument
                        >>
                        >> list.sort(compa re=None, mapping=None)
                        >>
                        >> that, if provided, would perform the DSU magic. Or was that already
                        >> proposed and rejected?[/color]
                        >
                        > I have not seen this proposed before, and I'm not very clear on what
                        > the "compare" and "mapping" arguments are supposed to be in order to
                        > let you specify any DSU. Basically it seems you would need two
                        > callables, "decorate" to be called with each item in the list (to
                        > return for each item the decorated tuple) and "undecorate " to be
                        > called with each decorated tuple after the sort (to return the item
                        > for the result). How do you turn that into "compare" and "mapping"?[/color]

                        My first idea was to add a .dsu(mapping) where the tuples in the decoration
                        phase would be generated as (mapping(item), item).
                        But I would rather enhance the already-existing sort() to provide for both
                        the traditional and the dsu sorting. Then you have to detect if the
                        function passed to sort(fun) takes one or two arguments, which is not
                        reliable when default values come into play. So I resorted to named
                        arguments. Didn't make it any clearer?

                        So here is a pure python prototype to shed some light on the matter:

                        class dsu(list):
                        def sort(self, compare=None, mapping=None):
                        if mapping:
                        decorated = [(mapping(i), i) for i in self]
                        decorated.sort( )
                        self[:] = [i[1] for i in decorated]
                        else:
                        list.sort(self, compare)

                        sample = dsu("ab Aa AC d e f".split())

                        # "traditiona l" sort
                        sample.sort(lam bda s, t: -cmp(s, t))
                        print sample

                        # decorate-sort-undecorate
                        sample.sort(map ping=lambda s: s.lower())
                        print sample


                        Peter

                        Comment

                        • Alex Martelli

                          #13
                          Re: DSU pattern (was Re: Trouble sorting lists (unicode/locale related?))

                          Duncan Booth wrote:
                          ...[color=blue][color=green][color=darkred]
                          >>> So maybe it's about time to change the sort() method to support a
                          >>> second argument
                          >>>
                          >>> list.sort(compa re=None, mapping=None)
                          >>>
                          >>> that, if provided, would perform the DSU magic. Or was that already
                          >>> proposed and rejected?[/color]
                          >>
                          >> I have not seen this proposed before, and I'm not very clear on what
                          >> the "compare" and "mapping" arguments are supposed to be in order to
                          >> let you specify any DSU. Basically it seems you would need two
                          >> callables, "decorate" to be called with each item in the list (to
                          >> return for each item the decorated tuple) and "undecorate " to be
                          >> called with each decorated tuple after the sort (to return the item
                          >> for the result). How do you turn that into "compare" and "mapping"?
                          >>
                          >>[/color]
                          > You don't need two callables because the sort function would be doing
                          > the decorating, so it knows also how to undecorate. I think the
                          > suggestion is that the mapping argument returns something that can be
                          > compared.
                          >
                          > For example, here is a DSU function that does a not-in-place sort and
                          > takes a suitable mapping argument. Changing it to in-place sort is,
                          > of course, trivial.
                          >[color=green][color=darkred]
                          >>>> def DSU(aList, aMapping):[/color][/color]
                          > newList = [ (aMapping(item) , index, item) for (index, item)[/color]

                          Ah, I see -- the alleged "mapping" is in fact meant to be a
                          *CALLABLE*, NOT a mapping in the Python sense. Yes, you could
                          get away with just one callable in this way, of course. It
                          also seems to me that shoe-horning that second argument (in
                          practice mutually exclusive with the first one) to the sort
                          method, when there seem to be no real advantages to keeping
                          it a method, is not a great idea. Rather, a new built-in
                          function appears to me to be best here -- something like:

                          def sort(iterable, decorate=None):
                          if decorate is None:
                          aux = [ (item, index, item)
                          for index, item in enumerate(itera ble) ]
                          else:
                          aux = [ (decorate(item) , index, item)
                          for index, item in enumerate(itera ble) ]
                          aux.sort()
                          return [ item for __, __, item in aux ]

                          so as to have no arbitrary constraints on the iterable being
                          a list, or the sort being necessarily in-place, when there
                          is no performance advantage in so doing -- and also serve the
                          frequent need of a non-in-place sort returning the sorted
                          list without any actual need for decoration. E.g., I often
                          use the equivalent of:

                          for key in sort(somesmalld ict):
                          print key, somesmalldict[key]

                          and it would be handy to have that little 'sort' function
                          built-in for all such uses.

                          The only downside I can see to this proposal is that Python
                          isn't really all that good at passing the callable decorate
                          argument -- one would end up with a lot of little ad hoc
                          functions, or lambdas, and neither is a great solution. It's
                          the kind of situation where Smalltalk/Ruby shine by letting
                          an arbitrary block of code (albeit one only) be passed into
                          any method (in Ruby, this DSU-sort would probably become a
                          method of the Enumerable "module" [mix-in] -- a rather elegant
                          solution overall, "architecturall y" nicer-looking than Python's,
                          although "pragmatica lly" we're probably abot even).


                          Alex

                          Comment

                          • Duncan Booth

                            #14
                            Re: DSU pattern (was Re: Trouble sorting lists (unicode/locale related?))

                            Peter Otten <__peter__@web. de> wrote in
                            news:bkn12v$tns $06$1@news.t-online.com:
                            [color=blue]
                            > My first idea was to add a .dsu(mapping) where the tuples in the
                            > decoration phase would be generated as (mapping(item), item).[/color]

                            Note that if anyone proposes this seriously, it should generate a 3-tuple
                            (mapping(item), index, item) rather than the 2-tuple you suggest.

                            This is because the mapping function could reasonably be used to impose an
                            ordering on objects that have no natural order, so you need to be sure that
                            the comparison never falls through the the original object even where the
                            mapping compares equal. It also has the side effect of making the sort
                            stable (although if stability is a goal you might want another option to
                            reverse the sort which would use '-index' as the second element and call
                            ..reverse() on the result).

                            FWIW, I think something like this belongs in the standard library rather
                            than as a method on lists or a new builtin.

                            --
                            Duncan Booth duncan@rcp.co.u k
                            int month(char *p){return(1248 64/((p[0]+p[1]-p[2]&0x1f)+1)%12 )["\5\x8\3"
                            "\6\7\xb\1\x9\x a\2\0\4"];} // Who said my code was obscure?

                            Comment

                            • Alex Martelli

                              #15
                              Re: DSU pattern (was Re: Trouble sorting lists (unicode/locale related?))

                              Duncan Booth wrote:
                              ...[color=blue]
                              > FWIW, I think something like this belongs in the standard library rather
                              > than as a method on lists or a new builtin.[/color]

                              Definitely not a method on lists, we agree on that. Problem is, there
                              is no module in the standard library to collect "functions that are often
                              useful but not QUITE often enough to warrant being built-ins" (and if
                              there was, there are quite a few current built-ins I'd like to relegate
                              into such an auxiliary/utilities module), so that finding a good place,
                              other than built-ins, for such utility functions, is often a bother.


                              Alex

                              Comment

                              Working...