delete duplicates in list

  • christof hoeke

    delete duplicates in list

    hello,
    this must have come up before, so i am already sorry for asking but a
    quick googling did not give me any answer.

    i have a list from which i want a simpler list without the duplicates
    an easy but somehow contrived solution would be
    >>> a = [1, 2, 2, 3]
    >>> d = {}.fromkeys(a)
    >>> b = d.keys()
    >>> print b
    [1, 2, 3]

    there should be an easier or more intuitive solution, maybe with a list
    comprehension?

    something like
    >>> b = [x for x in a if x not in b]
    >>> print b
    []

    does not work though.

    thanks for any help
    chris


  • Diez B. Roggisch

    #2
    Re: delete duplicates in list

    Hi,
    > this must have come up before, so i am already sorry for asking but a
    > quick googling did not give me any answer.
    >
    > i have a list from which i want a simpler list without the duplicates
    > an easy but somehow contrived solution would be

    In python 2.3, this should work:

    import sets
    l = [1,2,2,3]
    s = sets.Set(l)

    A set is a container for which the invariant that no object is in it twice
    holds. So it suits your needs.
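
    If a plain list is wanted back (and the original order does not matter), a
    minimal sketch along those lines would be:

    import sets
    l = [1, 2, 2, 3]
    uniq = list(sets.Set(l))    # convert the Set back into a plain list
    print uniq                  # the unique elements, order not guaranteed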

    Regards,

    Diez


    • Bernard Delmée

      #3
      Re: delete duplicates in list

      > >>> a = [1, 2, 2, 3]
      > >>> d = {}.fromkeys(a)
      > >>> b = d.keys()
      > >>> print b
      > [1, 2, 3]

      That, or using a Set (python 2.3+). Actually I seem to recall that
      "The Python CookBook" still advises building a dict as the fastest
      solution - if your elements can be hashed. See the details at



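      For reference, a rough sketch of that dict-based idea (the helper name
      "unique" is ours, not taken from the recipe):

      def unique(seq):
          d = {}
          for x in seq:
              d[x] = 1        # requires hashable elements
          return d.keys()     # duplicates gone, original order not kept
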
      Cheers,

      Bernard.


      • Alex Martelli

        #4
        Re: delete duplicates in list

        christof hoeke wrote:
        ...
        > i have a list from which i want a simpler list without the duplicates

        Canonical is:

        import sets
        simplerlist = list(sets.Set(thelist))

        if you're all right with destroying order, as your example solution suggests.
        But dict.fromkeys(a).keys() is probably faster. Your assertion:
        > there should be an easier or more intuitive solution, maybe with a list
        > comprehension?

        doesn't seem self-evident to me. A list-comprehension might be, e.g:

        [ x for i, x in enumerate(a) if i==a.index(x) ]

        and it does have the advantages of (a) keeping order AND (b) not
        requiring hashable (nor even inequality-comparable!) elements -- BUT
        it has the non-indifferent cost of being O(N*N) while the others
        are about O(N). If you really want something similar to your approach:
        > >>> b = [x for x in a if x not in b]

        you'll have, o horrors!-), to do a loop, so name b is always bound to
        "the result list so far" (in the LC, name b is only bound at the end):

        b = []
        for x in a:
            if x not in b:
                b.append(x)

        However, this is O(N*N) too. In terms of "easier or more intuitive",
        I suspect only this latter solution might qualify.
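
        A small usage sketch of the order trade-off (the example list here is ours):

        a = [3, 1, 2, 1, 3]
        print [ x for i, x in enumerate(a) if i==a.index(x) ]   # [3, 1, 2] -- order kept
        print dict.fromkeys(a).keys()   # same unique elements, in whatever order the dict yields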


        Alex


        • Alex Martelli

          #5
          Re: delete duplicates in list

          Bernard Delmée wrote:
          >> >>> a = [1, 2, 2, 3]
          >> >>> d = {}.fromkeys(a)
          >> >>> b = d.keys()
          >> >>> print b
          >> [1, 2, 3]
          >
          > That, or using a Set (python 2.3+). Actually I seem to recall that
          > "The Python CookBook" still advises building a dict as the fastest
          > solution - if your elements can be hashed. See the details at
          >
          > http://aspn.activestate.com/ASPN/Coo...n/Recipe/52560

          Yep, but a Set is roughly as fast as a dict -- it has a small
          penalty wrt a dict, but, emphasis on small. Still, if you're
          trying to squeeze every last cycle out of a bottleneck, it's
          worth measuring, and perhaps moving from Set to dict.
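
          One hedged way to measure, using the timeit module (also new in 2.3; the
          numbers will of course vary by machine):

          import timeit
          print timeit.Timer('500 in s', 'import sets; s = sets.Set(range(1000))').timeit()
          print timeit.Timer('500 in d', 'd = dict.fromkeys(range(1000))').timeit()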


          Alex


          • Alex Martelli

            #6
            Re: delete duplicates in list

            Diez B. Roggisch wrote:

            >> this must have come up before, so i am already sorry for asking but a
            >> quick googling did not give me any answer.
            >>
            >> i have a list from which i want a simpler list without the duplicates
            >> an easy but somehow contrived solution would be
            >
            > In python 2.3, this should work:
            >
            > import sets
            > l = [1,2,2,3]
            > s = sets.Set(l)
            >
            > A set is a container for which the invariant that no object is in it twice
            > holds. So it suits your needs.

            ...except it's not a LIST, which was part of the specifications given
            by the original poster. It may, of course, be that you've read his
            mind correctly and that, despite his words, he doesn't really care
            whether he gets a list or a very different container:-).


            Alex


            • Diez B. Roggisch

              #7
              Re: delete duplicates in list

              Hi,
              > ...except it's not a LIST, which was part of the specifications given
              > by the original poster. It may, of course, be that you've read his
              > mind correctly and that, despite his words, he doesn't really care
              > whether he gets a list or a very different container:-).

              You're right - mathematically. However, there is no such thing as a set in
              computers - you always end up with some sort of list :)

              So - he'll have a list anyway. But if it respects the order of the list
              parameter... <just checking, standby>

              ... nope:

              >>> l = [1,2,3,2]
              >>> import sets
              >>> sets.Set(l)
              Set([1, 2, 3])
              >>> l = [2,1,2,3,2]
              >>> sets.Set(l)
              Set([1, 2, 3])

              Which is of course reasonable, as the check for existence in the set might
              be performed in O(ln n) instead of O(n).
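
              If the original order matters, one sketch that keeps it while still using a
              Set for the fast membership test (elements must be hashable) would be:

              import sets
              seen = sets.Set()
              result = []
              for x in l:
                  if x not in seen:
                      seen.add(x)          # remember we have seen this element
                      result.append(x)     # keep first occurrences, in order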

              Regards,

              Diez


              • Hannu Kankaanpää

                #8
                Re: delete duplicates in list

                "Diez B. Roggisch" <deets_noospaam@web.de> wrote in message news:<bnplva$g1s$07$1@news.t-online.com>...
                > Hi,
                >
                > > ...except it's not a LIST, which was part of the specifications given
                > > by the original poster. It may, of course, be that you've read his
                > > mind correctly and that, despite his words, he doesn't really care
                > > whether he gets a list or a very different container:-).
                >
                > You're right - mathematically. However, there is no such thing as a set in
                > computers - you always end up with some sort of list :)

                Those are just implementation details. There could be a group of monkeys
                emulating Python under the hood and their implementation of a set
                would be a neural network instead of any kind of sequence, but you
                still wouldn't care as a programmer. The only thing that matters is
                whether the interface stays the same. And it doesn't: the items in a set
                are no longer accessible by their index (among other differences).
                > So - he'll have a list anyway.

                So I can't agree with this. You don't know if his Python virtual machine
                is a group of monkeys. Python is supposed to be a high level language.
                > But if it respects the order of the list
                > parameter... <just checking, standby>
                >
                > ... nope:
                >
                > >>> l = [1,2,3,2]
                > >>> import sets
                > >>> sets.Set(l)
                > Set([1, 2, 3])
                > >>> l = [2,1,2,3,2]
                > >>> sets.Set(l)
                > Set([1, 2, 3])
                >
                > Which is of course reasonable, as the check for existence in the set might
                > be performed in O(ln n) instead of O(n).

                Actually the Set in sets module has an average lookup of O(1), worst
                case O(n) (not 100% sure of worst case, but 99% sure). It's been
                implemented with dictionaries, which in turn are hash tables.


                • Duncan Booth

                  #9
                  Re: delete duplicates in list

                  Alex Martelli <aleax@aleax.it> wrote in
                  news:L7Wnb.65547$e5.2408695@news1.tin.it:
                  > Your assertion:
                  >
                  >> there should be an easier or more intuitive solution, maybe with a list
                  >> comprehension?
                  >
                  > doesn't seem self-evident to me. A list-comprehension might be, e.g:
                  >
                  > [ x for i, x in enumerate(a) if i==a.index(x) ]
                  >
                  > and it does have the advantages of (a) keeping order AND (b) not
                  > requiring hashable (nor even inequality-comparable!) elements -- BUT
                  > it has the non-indifferent cost of being O(N*N) while the others
                  > are about O(N). If you really want something similar to your approach:
                  >
                  >> >>> b = [x for x in a if x not in b]
                  >
                  > you'll have, o horrors!-), to do a loop, so name b is always bound to
                  > "the result list so far" (in the LC, name b is only bound at the end):
                  >
                  > b = []
                  > for x in a:
                  >     if x not in b:
                  >         b.append(x)
                  >
                  > However, this is O(N*N) too. In terms of "easier or more intuitive",
                  > I suspect only this latter solution might qualify.

                  I dunno about more intuitive, but here's a fairly simple list comprehension
                  solution which is O(N) and preserves the order. Of course it's back to
                  requiring hashable elements again:
                  >>> startList = [5,1,2,1,3,4,2,5,3,4]
                  >>> d = {}
                  >>> [ d.setdefault(x, x) for x in startList if x not in d ]
                  [5, 1, 2, 3, 4]

                  And for the 'I must do it on one line' freaks, here's the single expression
                  variant of the above: :^)
                  >>> [ d.setdefault(x, x) for d in [{}] for x in startList if x not in d ]
                  [5, 1, 2, 3, 4]
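
                  Wrapped up as a function for reuse (the name unique_in_order is ours, not
                  from the thread):

                  def unique_in_order(seq):
                      d = {}
                      return [ d.setdefault(x, x) for x in seq if x not in d ]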

                  --
                  Duncan Booth                                       duncan@rcp.co.uk
                  int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3"
                  "\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?


                  • Alex Martelli

                    #10
                    Re: delete duplicates in list

                    Hannu Kankaanpää wrote:
                    ...
                    >> So - he'll have a list anyway.
                    >
                    > So I can't agree with this. You don't know if his Python virtual machine
                    > is a group of monkeys. Python is supposed to be a high level language.

                    So, in that case those monkeys would surely be perched up on tall trees.

                    >> Which is of course reasonable, as the check for existence in the set
                    >> might be performed in O(ln n) instead of O(n).
                    >
                    > Actually the Set in sets module has an average lookup of O(1), worst
                    > case O(n) (not 100% sure of worst case, but 99% sure). It's been

                    Hmmm -- could you give an example of that worstcase...? a _full_
                    hashtable would give such behavior, but Python's dicts always ensure
                    the underlying hashtables aren't too full...
                    > implemented with dictionaries, which in turn are hash tables.

                    Yep - check it out...:

                    [alex@lancelot bo]$ timeit.py -c -s'import sets' -s'x=sets.Set(xrange(100))' '7 in x'
                    100000 loops, best of 3: 2.3 usec per loop
                    [alex@lancelot bo]$ timeit.py -c -s'import sets' -s'x=sets.Set(xrange(1000))' '7 in x'
                    100000 loops, best of 3: 2.3 usec per loop
                    [alex@lancelot bo]$ timeit.py -c -s'import sets' -s'x=sets.Set(xrange(10000))' '7 in x'
                    100000 loops, best of 3: 2.2 usec per loop
                    [alex@lancelot bo]$ timeit.py -c -s'import sets' -s'x=sets.Set(xrange(100000))' '7 in x'
                    100000 loops, best of 3: 2.2 usec per loop
                    [alex@lancelot bo]$ timeit.py -c -s'import sets' -s'x=sets.Set(xrange(1000000))' '7 in x'
                    100000 loops, best of 3: 2.3 usec per loop

                    see the pattern...?-)


                    Same when something is NOT there, BTW:

                    [alex@lancelot bo]$ timeit.py -c -s'import sets' -s'x=sets.Set(xrange(1000000))' 'None in x'
                    100000 loops, best of 3: 2.3 usec per loop

                    etc, etc.



                    • Michael Hudson

                      #11
                      Re: delete duplicates in list

                      Alex Martelli <aleax@aleax.it> writes:

                      > > Actually the Set in sets module has an average lookup of O(1), worst
                      > > case O(n) (not 100% sure of worst case, but 99% sure). It's been
                      >
                      > Hmmm -- could you give an example of that worstcase...? a _full_
                      > hashtable would give such behavior, but Python's dicts always ensure
                      > the underlying hashtables aren't too full...

                      Try inserting a bunch of instances of

                      class C:
                          def __hash__(self): return 0

                      into a dictionary.
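
                      A rough sketch of what that looks like in practice (the timing harness is
                      ours, purely illustrative):

                      import time

                      class C:
                          def __hash__(self): return 0    # every instance hashes to the same bucket

                      d = {}
                      start = time.time()
                      for i in xrange(2000):
                          d[C()] = None     # each insert walks the whole collision chain
                      print time.time() - start           # total time grows roughly quadratically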

                      Cheers,
                      mwh

                      --
                      Whaaat? That is the most retarded thing I have seen since, oh,
                      yesterday -- Kaz Kylheku, comp.lang.lisp


                      • Thomas Heller

                        #12
                        hashing mutable instances (was Re: delete duplicates in list)

                        Michael Hudson <mwh@python.net> writes:

                        > Alex Martelli <aleax@aleax.it> writes:
                        >
                        >> > Actually the Set in sets module has an average lookup of O(1), worst
                        >> > case O(n) (not 100% sure of worst case, but 99% sure). It's been
                        >>
                        >> Hmmm -- could you give an example of that worstcase...? a _full_
                        >> hashtable would give such behavior, but Python's dicts always ensure
                        >> the underlying hashtables aren't too full...
                        >
                        > Try inserting a bunch of instances of
                        >
                        > class C:
                        >     def __hash__(self): return 0
                        >
                        > into a dictionary.

                        I've thought about using something like this in production code
                        to be able to store mutable instances in a dict.
                        Performance problems aside (since there are only a couple of key/value
                        pairs in the dict), is it such a bad idea?

                        Thomas


                        • Michael Hudson

                          #13
                          Re: hashing mutable instances (was Re: delete duplicates in list)

                          Thomas Heller <theller@python.net> writes:

                          > Michael Hudson <mwh@python.net> writes:
                          >
                          > > Alex Martelli <aleax@aleax.it> writes:
                          > >
                          > >> > Actually the Set in sets module has an average lookup of O(1), worst
                          > >> > case O(n) (not 100% sure of worst case, but 99% sure). It's been
                          > >>
                          > >> Hmmm -- could you give an example of that worstcase...? a _full_
                          > >> hashtable would give such behavior, but Python's dicts always ensure
                          > >> the underlying hashtables aren't too full...
                          > >
                          > > Try inserting a bunch of instances of
                          > >
                          > > class C:
                          > >     def __hash__(self): return 0
                          > >
                          > > into a dictionary.
                          >
                          > I've thought about using something like this in production code
                          > to be able to store mutable instances in a dict.

                          Eek!
                          > Performance problems aside (since there are only a couple of key/value
                          > pairs in the dict), is it such a bad idea?

                          Um, I don't know actually. It pretty much defeats the point of using
                          a hashtable. If your class has some non-mutable parts it'd be an
                          idea to hash them, of course.
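
                          A rough sketch of that idea (class and attribute names are ours, purely
                          illustrative):

                          class Record:
                              def __init__(self, key, data):
                                  self.key = key      # treated as immutable after construction
                                  self.data = data    # freely mutable, never hashed
                              def __hash__(self):
                                  return hash(self.key)
                              def __eq__(self, other):
                                  return isinstance(other, Record) and self.key == other.key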

                          And the performance problems are severe -- inserting just a thousand
                          instances of the class C() into a dictionary takes noticeable time.

                          (What *is* a bad idea is giving such classes an __eq__ method that
                          mutates the dict being inserted into -- I found a bunch of such bugs a
                          couple years back and I think it's still possible to core Python (via
                          stack overflow) by being even more devious).

                          Cheers,
                          mwh

                          --
                          Our Constitution never promised us a good or efficient government,
                          just a representative one. And that's what we got.
                          -- http://www.advogato.org/person/mrorg...html?start=109


                          • Corrado Gioannini

                            #14
                            Re: delete duplicates in list

                            On Wed, Oct 29, 2003 at 10:02:08PM +0100, christof hoeke wrote:
                            > there should be an easier or more intuitive solution, maybe with a list
                            > comprehension?
                            >
                            > something like
                            >
                            > >>> b = [x for x in a if x not in b]
                            > >>> print b
                            > []
                            >
                            > does not work though.
                            if you want a list comp solution, similar to the one you proposed and valid
                            also with python < 2.3 you could do:
                            >>> a = [1, 2, 2, 3]
                            >>> b = []
                            >>> [b.append(x) for x in a if x not in b]
                            [None, None, None]
                            >>> b
                            [1, 2, 3]

                            or even:
                            >>> a = [1, 2, 2, 3]
                            >>> [b.append(x) for b in [[]] for x in a if x not in b]
                            [None, None, None]
                            >>> b
                            [1, 2, 3]

                            Corrado.
                            --
                            Thought is only a flash between two long nights,
                            but this flash is everything.
                            (H. Poincaré)



                            • Bernhard Herzog

                              #15
                              Re: hashing mutable instances

                              Thomas Heller <theller@python.net> writes:

                              > Michael Hudson <mwh@python.net> writes:
                              >
                              >> Try inserting a bunch of instances of
                              >>
                              >> class C:
                              >>     def __hash__(self): return 0
                              >>
                              >> into a dictionary.
                              >
                              > I've thought about using something like this in production code
                              > to be able to store mutable instances in a dict.
                              > Performance problems aside (since there are only a couple of key/value
                              > pairs in the dict), is it such a bad idea?

                              IMO it depends on what equality means for instances. E.g. if two
                              instances are only equal if they're identical, i.e. a == b is equivalent
                              to a is b, then defining __hash__ can be very useful, because then you
                              can use them in dictionaries and mutability doesn't really matter
                              because no change to one instance can make it equal to another
                              instance.

                              I'd define __hash__ to return id(self), though, so that the hash values
                              are different for different instances to reduce collisions.
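
                              A minimal sketch of that variant (the class name is ours):

                              class Mutable:
                                  # default __eq__ is identity comparison, so mutating an
                                  # instance never changes which dict key it matches
                                  def __hash__(self):
                                      return id(self)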

                              This also seems to be what class objects in Python do:
                              >>> class C(object):
                              ...     pass
                              ...
                              >>> hash(C)
                              135622420
                              >>> id(C)
                              135622420
                              >>>

                              Bernhard

                              --
                              Intevation GmbH http://intevation.de/
                              Sketch http://sketch.sourceforge.net/
                              Thuban http://thuban.intevation.org/
