Slow comparison between two lists

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Jani Tiainen

    Slow comparison between two lists

    I have rather simple 'Address' object that contains streetname,
    number, my own status and x,y coordinates for it. I have two lists
    both containing approximately 30000 addresses.

    I've defined __eq__ method in my class like this:

    def __eq__(self, other):
    return self.xcoord == other.xcoord and \
    self.ycoord == other.ycoord and \
    self.streetname == other.streetnam e and \
    self.streetno == other.streetno

    But it turns out to be very, very slow.

    Then I setup two lists:

    list_external = getexternal()
    list_internal = getinternal()

    Now I need get all all addresses from 'list_external' that are not in
    'list_internal' , and mark them as "new".

    I did it like this:

    for addr in list_external:
    if addr not in list_internal:
    addr.status = 1 # New address

    But in my case running that loop takes about 10 minutes. What I am
    doing wrong?

    --

    Jani Tiainen
  • Hrvoje Niksic

    #2
    Re: Slow comparison between two lists

    Jani Tiainen <redetin@gmail. comwrites:
    for addr in list_external:
    if addr not in list_internal:
    addr.status = 1 # New address
    >
    But in my case running that loop takes about 10 minutes. What I am
    doing wrong?
    The nested loop takes time proportional to the product of the number
    of elements in the loops. To avoid it, convert the inner loop to a
    set, which can be looked up in constant time:

    internal = set(list_intern al)
    for addr in list_external:
    if addr in internal:
    addr.status = 1

    Comment

    • Peter Otten

      #3
      Re: Slow comparison between two lists

      Jani Tiainen wrote:
      I have rather simple 'Address' object that contains streetname,
      number, my own status and x,y coordinates for it. I have two lists
      both containing approximately 30000 addresses.
      >
      I've defined __eq__ method in my class like this:
      >
      def __eq__(self, other):
      return self.xcoord == other.xcoord and \
      self.ycoord == other.ycoord and \
      self.streetname == other.streetnam e and \
      self.streetno == other.streetno
      >
      But it turns out to be very, very slow.
      >
      Then I setup two lists:
      >
      list_external = getexternal()
      list_internal = getinternal()
      >
      Now I need get all all addresses from 'list_external' that are not in
      'list_internal' , and mark them as "new".
      >
      I did it like this:
      >
      for addr in list_external:
      if addr not in list_internal:
      addr.status = 1 # New address
      >
      But in my case running that loop takes about 10 minutes. What I am
      doing wrong?
      Even if list_external and list_internal contain the same items you need
      about len(list_extern al)*(len(list_i nternal)/2), or 45 million comparisons.
      You can bring that down to 30000*some_smal l_factor if you make your
      addresses hashable and put the internal ones into a dict or set

      def __hash__(self):
      return hash((self.xcoo rd, self.yccord, self.streetname , self.streetno))
      def __eq__(self, other):
      # as above

      Then

      set_internal = set(list_intern al)
      for addr in list_external:
      if addr not in set_internal:
      addr.status = 1

      Note that the attributes relevant for hash and equality must not change
      during this process.

      Peter

      Comment

      • bearophileHUGS@lycos.com

        #4
        Re: Slow comparison between two lists

        Hrvoje Niksic:
        internal = set(list_intern al)
        ....

        To do that the original poster may have to define a __hash__ and
        __eq__ methods in his/her class.

        Bye,
        bearophile

        Comment

        • Bruno Desthuilliers

          #5
          Re: Slow comparison between two lists

          Jani Tiainen a écrit :
          I have rather simple 'Address' object that contains streetname,
          number, my own status and x,y coordinates for it. I have two lists
          both containing approximately 30000 addresses.
          >
          I've defined __eq__ method in my class like this:
          >
          def __eq__(self, other):
          return self.xcoord == other.xcoord and \
          self.ycoord == other.ycoord and \
          self.streetname == other.streetnam e and \
          self.streetno == other.streetno
          >
          But it turns out to be very, very slow.
          >
          Then I setup two lists:
          >
          list_external = getexternal()
          list_internal = getinternal()
          >
          Now I need get all all addresses from 'list_external' that are not in
          'list_internal' , and mark them as "new".
          >
          I did it like this:
          >
          for addr in list_external:
          if addr not in list_internal:
          addr.status = 1 # New address
          >
          But in my case running that loop takes about 10 minutes. What I am
          doing wrong?
          mmm... not using sets ?

          Comment

          • Jani Tiainen

            #6
            Re: Slow comparison between two lists

            On 23 loka, 15:24, Peter Otten <__pete...@web. dewrote:
            Jani Tiainen wrote:
            I have rather simple 'Address' object that contains streetname,
            number, my own status and x,y coordinates for it. I have two lists
            both containing approximately 30000 addresses.
            >
            I've defined __eq__ method in my class like this:
            >
                def __eq__(self, other):
                    return self.xcoord == other.xcoord and \
                        self.ycoord == other.ycoord and \
                        self.streetname == other.streetnam e and \
                        self.streetno == other.streetno
            >
            But it turns out to be very, very slow.
            >
            Then I setup two lists:
            >
            list_external = getexternal()
            list_internal = getinternal()
            >
            Now I need get all all addresses from 'list_external' that are not in
            'list_internal' , and mark them as "new".
            >
            I did it like this:
            >
            for addr in list_external:
                if addr not in list_internal:
                    addr.status = 1 # New address
            >
            But in my case running that loop takes about 10 minutes. What I am
            doing wrong?
            >
            Even if list_external and list_internal contain the same items you need
            about len(list_extern al)*(len(list_i nternal)/2), or 45 million comparisons.
            You can bring that down to 30000*some_smal l_factor if you make your
            addresses hashable and put the internal ones into a dict or set
            >
            def __hash__(self):
               return hash((self.xcoo rd, self.yccord, self.streetname , self.streetno))
            def __eq__(self, other):
               # as above
            >
            Then
            >
            set_internal = set(list_intern al)
            for addr in list_external:
                if addr not in set_internal:
                    addr.status = 1
            >
            Note that the attributes relevant for hash and equality must not change
            during this process.
            Very complete answer, thank you very much.

            I tried that hash approach and sets but it seemed to get wrong results
            first time and it was all due my hash method.

            Now it takes like 2-3 seconds to do all that stuff and result seem to
            be correct. Apparently I have lot to learn about Python... :)

            --

            Jani Tiainen

            Comment

            • Hrvoje Niksic

              #7
              Re: Slow comparison between two lists

              bearophileHUGS@ lycos.com writes:
              >internal = set(list_intern al)
              ...
              >
              To do that the original poster may have to define a __hash__ and
              __eq__ methods in his/her class.
              You're right. The OP states he implements __eq__, so he also needs a
              matching __hash__, such as:

              def __hash__(self, other):
              return (hash(self.xcoo rd) ^ hash(self.ycoor d) ^
              hash(self.stree tname) ^ hash(self.stree tno))

              Comment

              • Steven D'Aprano

                #8
                Re: Slow comparison between two lists

                On Thu, 23 Oct 2008 05:03:26 -0700, Jani Tiainen wrote:
                But in my case running that loop takes about 10 minutes. What I am doing
                wrong?
                Others have already suggested you have a O(N**2) algorithm. Here's an
                excellent article that explains more about them:

                We spend a lot of time on this site talking about exciting Big Picture Stuff like .NET versus Java, XML strategy, Lock-In, competitive strategy, software design, architecture, and so forth. All thi…



                The article is biased towards low-level languages like C. In Python the
                way to avoid them is to use sets or dicts, or to keep the list sorted and
                use the bisect module.


                --
                Steven

                Comment

                Working...