improving a huge double-for cycle

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Alexzive

    improving a huge double-for cycle

    Hello there :) ,

    I am a python newbie and need to run following code for a task in an
    external simulation programm called "Abaqus" which makes use of python
    to access the mesh (ensamble of nodes with xy coordinates) of a
    certain geometrical model.

    [IN is the starting input containing the nodes to be check, there are
    some double nodes with the same x and y coordinates which need to be
    removed. SN is the output containing such double nodes]

    Code: Select all
    for i in range(len(IN)): #scan all elements of the list IN
    for j in range(len(IN)):
    if i <j:
    if IN[i].coordinates[0] == IN[j].coordinates[0]:
    if IN[i].coordinates[1] == IN[j].coordinates[1]:
    SN.append(IN[i].label)



    Unfortunately my len(IN) is about 100.000 and the running time about
    15h !!!! :(

    Any idea to improve it?

    I have already tried to group the "if statements" in a single one:

    Code: Select all
    if i <j and if IN[i].coordinates[0] == IN[j].coordinates[0] and
    if IN[i].coordinates[1] == IN[j].coordinates[1]:


    but no improvements.

    Many thanks, Alex
  • skip@pobox.com

    #2
    Re: improving a huge double-for cycle


    AlexUnfortunate ly my len(IN) is about 100.000 and the running time
    Alexabout 15h !!!! :(

    AlexAny idea to improve it?

    numpy?




    More immediately, note that you are building a list of len(IN) ints every
    time through the inner loop. A quick hit might be this simple change:

    indexes = range(len(IN))
    for i in indexes: #scan all elements of the list IN
    for j in indexes:
    if i != j:
    if (IN[i].coordinates[0] == IN[j].coordinates[0] and
    IN[i].coordinates[1] == IN[j].coordinates[1]):
    SN.append(IN[i].label)

    Skip

    Comment

    • Peter Otten

      #3
      Re: improving a huge double-for cycle

      Alexzive wrote:
      Hello there :) ,
      >
      I am a python newbie and need to run following code for a task in an
      external simulation programm called "Abaqus" which makes use of python
      to access the mesh (ensamble of nodes with xy coordinates) of a
      certain geometrical model.
      >
      [IN is the starting input containing the nodes to be check, there are
      some double nodes with the same x and y coordinates which need to be
      removed. SN is the output containing such double nodes]
      >
      Code: Select all
      for i in range(len(IN)): #scan all elements of the list IN
      for j in range(len(IN)):
      if i <j:
      if IN[i].coordinates[0] == IN[j].coordinates[0]:
      if IN[i].coordinates[1] == IN[j].coordinates[1]:
      SN.append(IN[i].label)
      >
      >
      >
      Unfortunately my len(IN) is about 100.000 and the running time about
      15h !!!! :(
      Any idea to improve it?
      >
      I have already tried to group the "if statements" in a single one:
      >
      Code: Select all
      if i <j and if IN[i].coordinates[0] == IN[j].coordinates[0] and
      if IN[i].coordinates[1] == IN[j].coordinates[1]:
      >
      >
      but no improvements.
      >
      Many thanks, Alex
      When you're looking for duplicates an efficient solution is likely to be
      based on a set or dict object.

      # untested
      from collections import defaultdict

      groups = defaultdict(lis t)
      for item in IN:
      c = item.coordinate s
      groups[c[0], c[1]].append(item.la bel)
      SN = []
      for labels in groups.itervalu es():
      if len(labels) 1:
      SN.extend(label s) # or labels[1:] if you want to keep one item

      Peter

      Comment

      • Tim Chase

        #4
        Re: improving a huge double-for cycle

        Code: Select all
        for i in range(len(IN)): #scan all elements of the list IN
        for j in range(len(IN)):
        if i <j:
        if IN[i].coordinates[0] == IN[j].coordinates[0]:
        if IN[i].coordinates[1] == IN[j].coordinates[1]:
        SN.append(IN[i].label)
        >
        >
        Unfortunately my len(IN) is about 100.000 and the running time about
        15h !!!! :(
        >
        Any idea to improve it?
        [snip]
        I have already tried to group the "if statements" in a single one:
        >
        Code: Select all
        if i <j and if IN[i].coordinates[0] == IN[j].coordinates[0] and
        if IN[i].coordinates[1] == IN[j].coordinates[1]:
        >
        but no improvements.
        It's like rearranging deck-chairs on the Titanic :) Yes, it may
        give a speed up, but what's 3 seconds when you're waiting 15hr :)

        Not knowing the len(IN[x].coordinates) or their structure, if
        it's a list of len==2, you should be able to just do

        if i <j and IN[i].coordinates == IN[j].coordinates

        or

        if i <j and IN[i].coordinates[:2] == IN[j].coordinates[:2]

        However, again, this is just polish. The big problem is that you
        have an O(N^2) algorithm that's killing you.

        1) use xrange instead of range to save eating memory with a huge
        unneeded array.

        2) unless you need to append duplicate labels, you know that when
        I and J are swapped, you'll reach the same condition again, so it
        might be worth writing the outer loops to eliminate this
        scenario, and in the process, but just starting at i+1 rather
        than i, you can forgo the check if "i<>j".

        Such changes might look something like

        for i in xrange(len(IN)) :
        for j in xrange(i+1, len(IN)):
        if IN[i].coordinates == IN[j].coordinates:
        SN.append(IN[i].label)

        If my college algorithms memory serves me sufficiently, this
        reduces your O(N^2) to O(N log N) which will garner you some
        decent time savings.

        -tkc


        Comment

        • bearophileHUGS@lycos.com

          #5
          Re: improving a huge double-for cycle

          Skip:
          indexes = range(len(IN))
          for i in indexes: #scan all elements of the list IN
          for j in indexes:
          Nope, use xrange in both situations, and save a list.


          Tim Chase:
          for i in xrange(len(IN)) :
          for j in xrange(i+1, len(IN)):
          if IN[i].coordinates == IN[j].coordinates:
          SN.append(IN[i].label)
          >
          If my college algorithms memory serves me sufficiently, this
          reduces your O(N^2) to O(N log N) which will garner you some
          decent time savings.
          That's O(n^2) still, it's just half matrix, a triangle.

          Bye,
          bearophile

          Comment

          • Steven D'Aprano

            #6
            Re: improving a huge double-for cycle

            On Thu, 18 Sep 2008 05:25:02 -0700, Alexzive wrote:
            Hello there :) ,
            >
            I am a python newbie and need to run following code for a task in an
            external simulation programm called "Abaqus" which makes use of python
            to access the mesh (ensamble of nodes with xy coordinates) of a certain
            geometrical model.
            >
            [IN is the starting input containing the nodes to be check, there are
            some double nodes with the same x and y coordinates which need to be
            removed. SN is the output containing such double nodes]
            >
            Code: Select all
            for i in range(len(IN)): #scan all elements of the list IN
            for j in range(len(IN)):
            if i <j:
            if IN[i].coordinates[0] == IN[j].coordinates[0]:
            if IN[i].coordinates[1] == IN[j].coordinates[1]:
            SN.append(IN[i].label)

            Here's a better version of your algorithm, one which avoids the minor
            inefficiencies but keeps the huge inefficiency:

            for node1 in IN:
            for node2 in IN:
            if node1 is not node2:
            if node1.coordinat es == node2.coordinat es:
            SN.append(node1 .label)


            This assumes that node.coordinate s is a type where equality is defined.
            If they are a tuple or list, that should work fine.

            But the huge inefficiency is that you are checking each node not once,
            not twice, but 100,000 times! So you have to iterate 10,000,000,000
            times, which is going to be slow no matter what you do. Especially in
            pure Python.

            Here's a better idea: iterate over the list once only:

            seen = set()
            for node in IN:
            coords = tuple(node.coor dinates)
            if coords in seen:
            SN.append(node. label)
            else:
            seen.add(coords )




            Hope this helps.



            --
            Steven

            Comment

            • Tim Chase

              #7
              Re: improving a huge double-for cycle

              Tim Chase:
              > for i in xrange(len(IN)) :
              > for j in xrange(i+1, len(IN)):
              > if IN[i].coordinates == IN[j].coordinates:
              > SN.append(IN[i].label)
              >>
              >If my college algorithms memory serves me sufficiently, this
              >reduces your O(N^2) to O(N log N) which will garner you some
              >decent time savings.
              >
              That's O(n^2) still, it's just half matrix, a triangle.
              Ah, good catch as I'm thinking about it more, you're right...it's
              O((N^2)/2) which is just O(N^2). Sigh. I'm not awake enough
              yet. However, dividing the time by 2 (from 15hr to 7.5hr) is
              still a significant savings in my book :)

              However, if list-comprehensions are faster as well, you might be
              able to do something like

              SN = [d.label
              for (i,d) in enumerate(IN)
              for j in xrange(i+1, len(IN))
              if d.coordinates == IN[j].coordinates
              ]

              or the slightly more verbose (but closer to my original code) version

              SN = [IN[i].label
              for i in xrange(len(IN))
              for j in xrange(i+1, len(IN))
              if IN[i].coordinates == IN[j].coordinates
              ]

              To the OP: As always, throw some timing code on the various
              samples you get back from the list and see what works for you.

              -tkc




              Comment

              • J. Cliff Dyer

                #8
                Re: improving a huge double-for cycle


                On Thu, 2008-09-18 at 07:57 -0500, Tim Chase wrote:
                Code: Select all
                for i in range(len(IN)): #scan all elements of the list IN
                for j in range(len(IN)):
                if i <j:
                if IN[i].coordinates[0] == IN[j].coordinates[0]:
                if IN[i].coordinates[1] == IN[j].coordinates[1]:
                SN.append(IN[i].label)
                Such changes might look something like
                >
                for i in xrange(len(IN)) :
                for j in xrange(i+1, len(IN)):
                if IN[i].coordinates == IN[j].coordinates:
                SN.append(IN[i].label)
                If you aren't checking j values less than i, you might want to do both

                SN.append(IN[i].label)
                SN.append(IN[j].label)

                on the same pass.

                If my college algorithms memory serves me sufficiently, this
                reduces your O(N^2) to O(N log N) which will garner you some
                decent time savings.
                >
                -tkc
                >
                >
                --

                >

                Comment

                • pruebauno@latinmail.com

                  #9
                  Re: improving a huge double-for cycle

                  On Sep 18, 8:25 am, Alexzive <zasaconsult... @gmail.comwrote :
                  Hello there :) ,
                  >
                  I am a python newbie and need to run following code for a task in an
                  external simulation programm called "Abaqus" which makes use of python
                  to access the mesh (ensamble of nodes with xy coordinates) of a
                  certain geometrical model.
                  >
                  [IN is the starting input containing the nodes to be check, there are
                  some double nodes with the same x and y coordinates which need to be
                  removed. SN is the output containing such double nodes]
                  >
                  Code: Select all
                  for i in range(len(IN)): #scan all elements of the list IN
                  for j in range(len(IN)):
                  if i <j:
                  if IN[i].coordinates[0] == IN[j].coordinates[0]:
                  if IN[i].coordinates[1] == IN[j].coordinates[1]:
                  SN.append(IN[i].label)
                  >
                  Unfortunately my len(IN) is about 100.000 and the running time about
                  15h !!!! :(
                  >
                  Any idea to improve it?
                  >
                  I have already tried to group the "if statements" in a single one:
                  >
                  Code: Select all
                  if i <j and if IN[i].coordinates[0] == IN[j].coordinates[0] and
                  if IN[i].coordinates[1] == IN[j].coordinates[1]:
                  >
                  but no improvements.
                  >
                  Many thanks, Alex

                  dup=set()
                  SN=[]
                  for item in IN:
                  c=item.coordina tes[0], item.coordinate s[1]
                  if c in dup:
                  SN.append(item. label)
                  else:
                  dup.add(c)

                  Comment

                  • Jake Anderson

                    #10
                    Re: improving a huge double-for cycle

                    psyco might help a fair bit (10x-40x) here ;->
                    perhaps look at dumping the data into sqlite then pulling it back out.
                    It (or the other databases) are designed for tossing around large lumps
                    of data.
                    Alexzive wrote:
                    Hello there :) ,
                    >
                    I am a python newbie and need to run following code for a task in an
                    external simulation programm called "Abaqus" which makes use of python
                    to access the mesh (ensamble of nodes with xy coordinates) of a
                    certain geometrical model.
                    >
                    [IN is the starting input containing the nodes to be check, there are
                    some double nodes with the same x and y coordinates which need to be
                    removed. SN is the output containing such double nodes]
                    >
                    Code: Select all
                    for i in range(len(IN)): #scan all elements of the list IN
                    for j in range(len(IN)):
                    if i <j:
                    if IN[i].coordinates[0] == IN[j].coordinates[0]:
                    if IN[i].coordinates[1] == IN[j].coordinates[1]:
                    SN.append(IN[i].label)
                    >
                    >
                    >
                    Unfortunately my len(IN) is about 100.000 and the running time about
                    15h !!!! :(
                    >
                    Any idea to improve it?
                    >
                    I have already tried to group the "if statements" in a single one:
                    >
                    Code: Select all
                    if i <j and if IN[i].coordinates[0] == IN[j].coordinates[0] and
                    if IN[i].coordinates[1] == IN[j].coordinates[1]:
                    >
                    >
                    but no improvements.
                    >
                    Many thanks, Alex
                    --

                    >

                    --

                    Vapour Forge

                    Jake Anderson

                    Project Manager

                    Mobile: 0412 897 125

                    Email: jake@vapourforg e.com

                    Web Page: www.vapourforge.com

                    Your source for custom IT services


                    Comment

                    • giltay@gmail.com

                      #11
                      Re: improving a huge double-for cycle

                      On Sep 18, 11:18 am, prueba...@latin mail.com wrote:
                      dup=set()
                      SN=[]
                      for item in IN:
                         c=item.coordina tes[0], item.coordinate s[1]
                         if c in dup:
                            SN.append(item. label)
                         else:
                            dup.add(c)
                      +1 for O(N)

                      If item.coordinate s is just an (x, y) pair, you can skip building c
                      and save a little memory:

                      seen_coords = set()

                      for node in IN:
                      if node.coordinate s in seen_coords:
                      SN.append(node. label)
                      else:
                      seen_coords.add (node.coordinat es)

                      seen_coords gets populated with references to the existing
                      node.coordinate s objects, instead of new tuples.

                      Geoff G-T

                      Comment

                      • Harald Luessen

                        #12
                        Re: improving a huge double-for cycle

                        On Thu, 18 Sep 2008 Alexzive wrote:
                        >I am a python newbie and need to run following code for a task in an
                        >external simulation programm called "Abaqus" which makes use of python
                        >to access the mesh (ensamble of nodes with xy coordinates) of a
                        >certain geometrical model.
                        >
                        >[IN is the starting input containing the nodes to be check, there are
                        >some double nodes with the same x and y coordinates which need to be
                        >removed. SN is the output containing such double nodes]
                        >
                        >Code: Select all
                        for i in range(len(IN)): #scan all elements of the list IN
                        for j in range(len(IN)):
                        if i <j:
                        if IN[i].coordinates[0] == IN[j].coordinates[0]:
                        if IN[i].coordinates[1] == IN[j].coordinates[1]:
                        SN.append(IN[i].label)
                        I did not test the syntax, but here is an idea with sorted lists.
                        It should be O(NlogN).

                        def sk(x):
                        return x.coordinates[0]

                        IN.sort(key=sk)
                        for i in xrange(len(IN)) :
                        for j in xrange(i+1, len(IN)):
                        if IN[i].coordinates[0] == IN[j].coordinates[0]:
                        if IN[i].coordinates[1] == IN[j].coordinates[1]:
                        SN.append(IN[i].label)
                        else:
                        break

                        Harald

                        Comment

                        • Paul Hankin

                          #13
                          Re: improving a huge double-for cycle

                          On Sep 18, 2:25 pm, Alexzive <zasaconsult... @gmail.comwrote :
                          Hello there :) ,
                          >
                          I am a python newbie and need to run following code for a task in an
                          external simulation programm called "Abaqus" which makes use of python
                          to access the mesh (ensamble of nodes with xy coordinates) of a
                          certain geometrical model.
                          >
                          [IN is the starting input containing the nodes to be check, there are
                          some double nodes with the same x and y coordinates which need to be
                          removed. SN is the output containing such double nodes]
                          >
                          Code: Select all
                              for i in range(len(IN)): #scan all elements of the list IN
                                for j in range(len(IN)):
                                  if i <j:
                                   if IN[i].coordinates[0] == IN[j].coordinates[0]:
                                     if IN[i].coordinates[1] == IN[j].coordinates[1]:
                                        SN.append(IN[i].label)
                          >
                          Unfortunately my len(IN) is about 100.000 and the running time about
                          15h !!!! :(
                          >
                          Any idea to improve it?
                          >
                          I have already tried to group the "if statements" in a single one:
                          >
                          Code: Select all
                              if i <j and if IN[i].coordinates[0] == IN[j].coordinates[0]and
                          if IN[i].coordinates[1] == IN[j].coordinates[1]:
                          A simple O(N) algorithm:

                          from collections import defaultdict

                          d = defaultdict(int )
                          for x in IN:
                          d[x] += 1

                          SN = [x for (x, c) in d.iteritems() if c 1]

                          --
                          Paul Hankin

                          Comment

                          • Paul Hankin

                            #14
                            Re: improving a huge double-for cycle

                            On Sep 18, 2:25 pm, Alexzive <zasaconsult... @gmail.comwrote :
                            Hello there :) ,
                            >
                            I am a python newbie and need to run following code for a task in an
                            external simulation programm called "Abaqus" which makes use of python
                            to access the mesh (ensamble of nodes with xy coordinates) of a
                            certain geometrical model.
                            >
                            [IN is the starting input containing the nodes to be check, there are
                            some double nodes with the same x and y coordinates which need to be
                            removed. SN is the output containing such double nodes]
                            >
                            Code: Select all
                                for i in range(len(IN)): #scan all elements of the list IN
                                  for j in range(len(IN)):
                                    if i <j:
                                     if IN[i].coordinates[0] == IN[j].coordinates[0]:
                                       if IN[i].coordinates[1] == IN[j].coordinates[1]:
                                          SN.append(IN[i].label)
                            Using only a little extra storage to compute IN (but O(N log N)
                            time complexity):

                            import itertools

                            IN.sort()
                            SN = [k for k, v in itertools.group by(IN) if len(list(v)) 1]

                            --
                            Paul Hankin

                            Comment

                            • bearophileHUGS@lycos.com

                              #15
                              Re: improving a huge double-for cycle

                              Bruno Desthuilliers:
                              def doubles9():
                              ...
                              SN = []
                              sn_append = SN.append
                              Few more:

                              import psyco

                              def doubles10():
                              dup = set()
                              SN = []
                              for item in IN:
                              c = item.coordinate s
                              if c in dup:
                              SN.append(item)
                              else:
                              dup.add(c)
                              return SN

                              psyco.bind(doub les10)


                              from collections import deque

                              def doubles11():
                              dup = set()
                              dup_add = dup.add
                              SN = deque()
                              sn_append = SN.append

                              for item in IN:
                              c = item.coordinate s
                              if c in dup:
                              sn_append(item)
                              else:
                              dup_add(c)
                              return SN


                              def doubles12():
                              dup = set()
                              SN = deque()
                              for item in IN:
                              c = item.coordinate s
                              if c in dup:
                              SN.append(item)
                              else:
                              dup.add(c)
                              return SN

                              psyco.bind(doub les12)


                              Timings:

                              doubles00 : 0.522365288653
                              doubles01 : 0.247219812198
                              doubles02 : 0.237889823898
                              doubles03 : 0.238638921389
                              doubles04 : 0.23821698217
                              doubles05 : 0.177042495425
                              doubles06 : 0.13166199162
                              doubles08 : 0.0056972519725 2
                              doubles09 : 0.0041856668566 7
                              doubles10 : 0.0019208692086 9
                              doubles11 : 0.0040332453324 5
                              doubles12 : 0.0018402684026 8

                              Hopefully that's fast enough :-)

                              Bye,
                              bearophile

                              Comment

                              Working...