Execution speed question

  • Suresh Pillai

    Execution speed question

    I am performing simulations on networks (graphs). I have a question on
    speed of execution (assuming very ample memory for now). I simplify the
    details of my simulation below, as the question I ask applies more
    generally than my specific case. I would greatly appreciate general
    feedback in terms of computing and of course considerations specific to
    implementation in Python.

    The nodes in my network may be ON or OFF. The network starts off with
    all nodes in the OFF state. I loop through the nodes. For each node
    that is OFF, I consider some probability of it turning ON based on the
    states of its neighbours. I MUST GO THROUGH ALL NODES BEFORE DECIDING
    WHICH ONES TO TURN ON.

    So my question is whether it is faster to

    1. loop through a list of ALL nodes and check for OFF nodes using ifs

    or to

    2. loop through a container of OFF nodes and remove from this when they
    turn ON

    The second would result in looping through fewer nodes, especially as the
    simulation progresses, but how does the cost of removal compare with
    cheap ifs, and would the extra memory usage affect performance?

    I can appreciate that the cost of the if check, the number of nodes, and
    the type of container I use will all come into the answer.

    In my case, the ifs are cheap boolean queries (whether the node is ON or
    OFF). The number of nodes is very large: millions for sure, maybe tens
    of millions. If considering (2), take note of my BOLD text above, which
    means I can't remove nodes as I iterate through them in the main loop.

    I naturally started coding with (2), but couldn't decide on the best data
    structure for Python. A set seemed ideal for speedy removal, but then I
    can't remove elements from it while iterating through it. An ordered
    list? Some creative solution with numpy arrays?
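
    For concreteness, the two-phase update I have in mind with a set and
    deferred removal would look roughly like this (turn_on_probability is
    just a stand-in for my real neighbour-based calculation):

        import random

        def turn_on_probability(node):
            # stand-in for the real neighbour-based probability
            return 0.1

        off_nodes = set(range(100000))   # node ids; all start OFF
        to_switch = set()
        for node in off_nodes:           # plain iteration, no mutation yet
            if random.random() < turn_on_probability(node):
                to_switch.add(node)      # record the decision only
        off_nodes -= to_switch           # one batched removal afterwards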

    There is also the complication that, since we are in interpreted Python,
    what is theoretically the best data structure may not in practice be
    optimal unless it is a built-in object or coded externally in a
    compiled module.

    Of course, I will start experimenting to see what the execution
    difference is, but I would appreciate some suggestions from others on
    which approach is best, and on the best data structure for (2).

    I'm not a newbie, so you can get technical with me Python-wise and
    algorithm-wise. I realise it is a 'basic' question, but it is something
    I have always wondered about (cheap ifs versus extra structure), and
    with the number of nodes I am considering, it actually becomes an issue.

    Many Thanks,
    Suresh
  • alex23

    #2
    Re: Execution speed question

    On Jul 25, 7:57 pm, Suresh Pillai <stochash...@yahoo.ca> wrote:
    > The nodes in my network may be ON or OFF. The network starts off with
    > all nodes in the OFF state. I loop through the nodes. For each node
    > that is OFF, I consider some probability of it turning ON based on the
    > states of its neighbours. I MUST GO THROUGH ALL NODES BEFORE DECIDING
    > WHICH ONES TO TURN ON.
    >
    > So my question is whether it is faster to
    >
    > 1. loop through a list of ALL nodes and check for OFF nodes using ifs
    I'd recommend using 'filter' and list comprehensions.
    >>> import random
    >>> class Node:
    ...     def __init__(self):
    ...         self.on = False
    ...     def toggle(self):
    ...         self.on = random.choice([True, False])
    ...
    >>> nodes = [Node() for i in range(0, 10000)]
    >>> for node in nodes:
    ...     node.toggle()
    ...
    >>> off_nodes = filter(lambda x: not x.on, nodes)
    >>> len(off_nodes)
    5050



    • Jeff

      #3
      Re: Execution speed question

      > I'd recommend using 'filter' and list comprehensions.

      Look at using reduce(). You can collect information about all of the
      nodes without necessarily building a large, intermediate list in the
      process.

      You might get some ideas from here
      [http://en.wikipedia.org/wiki/Antiobjects].
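
      For example, a fold can gather a summary about the nodes without
      materialising a second list (a toy sketch only; the dict-based nodes
      are invented stand-ins):

          from functools import reduce   # builtin in Python 2; explicit import on 3

          nodes = [{'on': i % 3 == 0} for i in range(10)]   # toy stand-in nodes
          off_count = reduce(lambda acc, node: acc + (not node['on']), nodes, 0)
          print(off_count)   # -> 6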


      • Iain King

        #4
        Re: Execution speed question

        On Jul 25, 10:57 am, Suresh Pillai <stochash...@yahoo.ca> wrote:
        > I am performing simulations on networks (graphs). I have a question on
        > speed of execution (assuming very ample memory for now). I simplify the
        > details of my simulation below, as the question I ask applies more
        > generally than my specific case. I would greatly appreciate general
        > feedback in terms of computing and of course considerations specific to
        > implementation in Python.
        >
        > The nodes in my network may be ON or OFF. The network starts off with
        > all nodes in the OFF state. I loop through the nodes. For each node
        > that is OFF, I consider some probability of it turning ON based on the
        > states of its neighbours. I MUST GO THROUGH ALL NODES BEFORE DECIDING
        > WHICH ONES TO TURN ON.
        >
        > So my question is whether it is faster to
        >
        > 1. loop through a list of ALL nodes and check for OFF nodes using ifs
        >
        > or to
        >
        > 2. loop through a container of OFF nodes and remove from this when they
        > turn ON
        or 3. build a new list every iteration instead of deleting from the old
        one:

        while processing:
            new_off_list = []
            for x in off_list:
                if goes_on(x):
                    on_list.append(x)
                else:
                    new_off_list.append(x)
            off_list = new_off_list
            generation += 1

        Iain


        • Iain King

          #5
          Re: Execution speed question

          On Jul 25, 1:46 pm, Iain King <iaink...@gmail.com> wrote:
          > On Jul 25, 10:57 am, Suresh Pillai <stochash...@yahoo.ca> wrote:
          >> I am performing simulations on networks (graphs). I have a question on
          >> speed of execution (assuming very ample memory for now). I simplify the
          >> details of my simulation below, as the question I ask applies more
          >> generally than my specific case. I would greatly appreciate general
          >> feedback in terms of computing and of course considerations specific to
          >> implementation in Python.
          >>
          >> The nodes in my network may be ON or OFF. The network starts off with
          >> all nodes in the OFF state. I loop through the nodes. For each node
          >> that is OFF, I consider some probability of it turning ON based on the
          >> states of its neighbours. I MUST GO THROUGH ALL NODES BEFORE DECIDING
          >> WHICH ONES TO TURN ON.
          >>
          >> So my question is whether it is faster to
          >>
          >> 1. loop through a list of ALL nodes and check for OFF nodes using ifs
          >>
          >> or to
          >>
          >> 2. loop through a container of OFF nodes and remove from this when they
          >> turn ON
          >
          > or 3. build a new list every iteration instead of deleting from the old
          > one:
          >
          > while processing:
          >     new_off_list = []
          >     for x in off_list:
          >         if goes_on(x):
          >             on_list.append(x)
          >         else:
          >             new_off_list.append(x)
          >     off_list = new_off_list
          >     generation += 1
          >
          > Iain
          I was curious to what extent the different methods varied in time, so
          I checked it out. There are three procedures here: test_every, which
          matches your (1); destructive, which matches your (2); and
          constructive, which is (3) as I've outlined above.

          On varying the size of the dataset I get this (probability a node goes
          on = 50%):

          Length of initial list: 100000
          Test every: 1.16085492357
          Destructive: 2.592310272
          Constructive: 0.850312458886

          Length of initial list: 200000
          Test every: 2.48013843287
          Destructive: 9.20894689718
          Constructive: 1.73562198439

          Length of initial list: 400000
          Test every: 5.00652267447
          Destructive: 44.9696004134
          Constructive: 3.51687329373

          Length of initial list: 800000
          Test every: 9.67657648655
          Destructive: 220.57583941
          Constructive: 7.06614485537


          and changing the probability that a nodes goes on (dataset size =
          200000):


          Probability goes on: 1/2
          Test every: 2.24765364513
          Destructive: 9.28801971614
          Constructive: 1.62770773816

          Probability goes on: 1/4
          Test every: 4.77387350904
          Destructive: 13.4432467571
          Constructive: 3.45467140006

          Probability goes on: 1/8
          Test every: 11.0514899721
          Destructive: 18.4026878278
          Constructive: 6.86778036177

          Probability goes on: 1/16
          Test every: 22.5896021593
          Destructive: 25.7784044083
          Constructive: 13.8631404605

          Probability goes on: 1/32
          Test every: 49.7667941179
          Destructive: 39.3652502735
          Constructive: 27.2527219598

          Probability goes on: 1/64
          Test every: 91.0523955153
          Destructive: 65.7747103963
          Constructive: 54.4087322936

          Code:

          import random
          from timeit import Timer

          SIZE = 100000
          MAX = 2

          def goes_on(x):
              global MAX
              return random.randint(1, MAX) == 1

          def test_every():
              global SIZE
              print "Test every:",
              nodes = range(SIZE)
              is_on = [False for x in xrange(SIZE)]
              count = SIZE
              while count:
                  for i, x in enumerate(nodes):
                      if not is_on[i] and goes_on(x):
                          is_on[i] = True
                          count -= 1

          def destructive():
              global SIZE
              print "Destructive:",
              off_list = range(SIZE)
              on_list = []
              count = SIZE
              while count:
                  for i in xrange(len(off_list) - 1, -1, -1):
                      x = off_list[i]
                      if goes_on(x):
                          on_list.append(x)
                          del(off_list[i])
                          count -= 1

          def constructive():
              global SIZE
              print "Constructive:",
              off_list = range(SIZE)
              on_list = []
              count = SIZE
              while count:
                  new_off_list = []
                  for x in off_list:
                      if goes_on(x):
                          on_list.append(x)
                          count -= 1
                      else:
                          new_off_list.append(x)
                  off_list = new_off_list

          #SIZE = 200000
          while True:
              print "Length of initial list:", SIZE
              #print "Probability goes on: 1/%d" % MAX
              print Timer("test_every()", "from __main__ import test_every").timeit(1)
              print Timer("destructive()", "from __main__ import destructive").timeit(1)
              print Timer("constructive()", "from __main__ import constructive").timeit(1)
              print
              SIZE *= 2
              #MAX *= 2



          Conclusions:

          On size, (2) really doesn't like bigger datasets, taking roughly
          quadratically longer as size increases (the timings grow about
          fourfold with each doubling), while (1) and (3) happily increase
          linearly. (3) is faster.

          On probability it's (1) that's the loser, while (2) and (3) are
          happy. (3) is once again faster.

          I think (2)'s poor performance is being amplified by how Python
          handles lists and list deletions; the effect may be mitigated in
          other languages, or by using other data structures in Python (like a
          dictionary or a user-made list class). If you were short on memory
          then (2) would have an advantage, but as it is, (3) is the clear
          winner.

          I'm a fan of list comprehensions, and it feels like they could be
          nice here, but since we are making two lists at once I don't see how
          to... anyone see how to use them (or 'map' if you want to be old
          school)?

          Iain
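
          One hedged answer to the comprehension question above, reusing
          goes_on, off_list and on_list from the posted code: record each
          (node, decision) pair once, then build both lists from the record
          (at the cost of traversing the pairs twice):

              decisions = [(x, goes_on(x)) for x in off_list]   # one RNG call per node
              on_list.extend(x for x, d in decisions if d)      # nodes switching ON
              off_list = [x for x, d in decisions if not d]     # nodes staying OFF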


          • alex23

            #6
            Re: Execution speed question

            On Jul 25, 9:54 pm, Jeff <jeffo...@gmail.com> wrote:
            > Look at using reduce(). You can collect information about all of the
            > nodes without necessarily building a large, intermediate list in the
            > process.
            From the OP's description, I assumed there'd be a list of all nodes,
            from which he wishes to derive a 2nd list of specific nodes. reduce()
            applies "a function of two arguments cumulatively to the items of a
            sequence, from left to right, so as to reduce the sequence to a single
            value", which doesn't seem to me to be what the OP was asking for. I
            could understand using map() across the filter'd list, or embedding
            the conditional check within the map'd function and ignoring filter(),
            but at no point does the OP ask to perform any kind of function based
            on two nodes...

            I may have misunderstood your point, though :) Could you provide a
            quick code sample to clarify?



            • Suresh Pillai

              #7
              Re: Execution speed question

              That's a good comparison for the general question I posed. Thanks.
              Although I do believe lists are less than ideal here and a different data
              structure should be used.

              To be more specific to my case:
              As mentioned in my original post, I also have the specific condition
              that one does not know which nodes to turn ON until after all the
              probabilities are calculated (let's say we take the top m, for
              example). In this case, the second and third will perform worse, as
              the second will require a remove from the list after the fact and the
              third will require another loop through the nodes to build the new
              list.
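
              A rough sketch of that two-pass 'top m' shape (prob_on is a made-up
              placeholder for the neighbour-based calculation, and nodes are
              assumed hashable):

                  import heapq
                  import random

                  def prob_on(node):
                      # placeholder for the real neighbour-based probability
                      return random.random()

                  def simulation_step(off_nodes, on_nodes, m):
                      # pass 1: score every OFF node before deciding anything
                      chosen = set(heapq.nlargest(m, off_nodes, key=prob_on))
                      # pass 2: commit the top m (constructive rebuild of OFF list)
                      on_nodes.extend(chosen)
                      return [n for n in off_nodes if n not in chosen], on_nodes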


              • Fredrik Lundh

                #8
                Re: Execution speed question

                Iain King wrote:
                > I think (2)'s poor performance is being amplified by how Python
                > handles lists and list deletions; the effect may be mitigated in
                > other languages
                Delete is O(n) (or "O(n/2) on average", if you prefer), while append is
                amortized O(1).
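
                A quick way to see the asymmetry for yourself (illustrative only;
                absolute numbers depend on the machine):

                    from timeit import timeit

                    setup = "lst = list(range(100000))"
                    # append/pop at the tail: amortized O(1)
                    print(timeit("lst.append(0); lst.pop()", setup, number=10000))
                    # delete at the front: O(n), every remaining element shifts left
                    print(timeit("del lst[0]", setup, number=10000))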

                Unless I'm missing something, your example keeps going until it's
                flagged *all* nodes as "on", which, obviously, kills performance for the
                first version as the probability goes down. The OP's question was about
                a single pass (but he did mention "as the simulation progresses", so I
                guess it's fair to test a complete simulation.)

                Btw, if the nodes can be enumerated, I'd probably do something like:

                node_list = ... get list of nodes ...
                random.shuffle(node_list)

                start = 0
                end = len(node_list)
                step = end / MAX

                while start < end:

                    for i in xrange(start, start + step):
                        ... switch on node_list[i] ...

                    ... do whatever you want to do after a step ...

                    # prepare for next simulation step
                    start += step
                    step = max((len(node_list) - start) / MAX, 1)

                which is near O(n) overall, and mostly constant wrt. the probability for
                each pass (where the probability is 1:MAX).

                Might need some tuning; tweak as necessary.

                </F>


                • Suresh Pillai

                  #9
                  Re: Execution speed question

                  On Fri, 25 Jul 2008 16:51:42 +0200, Fredrik Lundh wrote:
                  > Unless I'm missing something, your example keeps going until it's
                  > flagged *all* nodes as "on", which, obviously, kills performance
                  > for the first version as the probability goes down. The OP's
                  > question was about a single pass (but he did mention "as the
                  > simulation progresses", so I guess it's fair to test a complete
                  > simulation.)
                  I was referring to multiple passes, as in Iain's test cases.
                  Although not necessarily until all nodes are ON; let's say to a
                  large proportion at least.


                  • Iain King

                    #10
                    Re: Execution speed question

                    On Jul 25, 3:39 pm, Suresh Pillai <stochash...@yahoo.ca> wrote:
                    > That's a good comparison for the general question I posed.
                    > Thanks. Although I do believe lists are less than ideal here
                    > and a different data structure should be used.
                    >
                    > To be more specific to my case:
                    > As mentioned in my original post, I also have the specific
                    > condition that one does not know which nodes to turn ON until
                    > after all the probabilities are calculated (let's say we take
                    > the top m, for example). In this case, the second and third
                    > will perform worse, as the second will require a remove from
                    > the list after the fact and the third will require another
                    > loop through the nodes to build the new list.
                    So you need to loop through twice regardless? i.e. loop once to
                    gather data on the off nodes, do some calculation to work out
                    what to turn on, then loop again to turn on the relevant nodes?
                    If so, then I think the functions above remain the same,
                    becoming the 2nd loop. Every iteration you do a first loop over
                    the off_nodes (or all of them for (1)) to gather the data on
                    them, perform your calculation, and then perform one of the
                    above functions (minus the setup code at the beginning;
                    basically starting at the 'for') as a second loop, with the
                    goes_on function now returning a value based on the calculation
                    (rather than doing the calculation itself as I had it).
                    Performance should be similar.

                    Iain
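
                    A minimal sketch of that two-loop shape (compute_decisions is
                    an invented placeholder for the gather-and-calculate pass; it
                    returns the set of nodes to switch):

                        def run_iteration(off_list, on_list, compute_decisions):
                            to_turn_on = compute_decisions(off_list)   # loop 1: gather, decide
                            new_off_list = []
                            for x in off_list:                         # loop 2: constructive pass
                                if x in to_turn_on:
                                    on_list.append(x)
                                else:
                                    new_off_list.append(x)
                            return new_off_list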


                    • Matthew Fitzgibbons

                      #11
                      Re: Execution speed question

                      Suresh Pillai wrote:
                      > That's a good comparison for the general question I posed.
                      > Thanks. Although I do believe lists are less than ideal here
                      > and a different data structure should be used.
                      >
                      > To be more specific to my case:
                      > As mentioned in my original post, I also have the specific
                      > condition that one does not know which nodes to turn ON until
                      > after all the probabilities are calculated (let's say we take
                      > the top m, for example). In this case, the second and third
                      > will perform worse, as the second will require a remove from
                      > the list after the fact and the third will require another
                      > loop through the nodes to build the new list.
                      It seems like the probability calculation applies to all three equally,
                      and can therefore be ignored for the simulations. You said that your
                      algorithm must be a two-stage process: (A) calculate the probabilities
                      then (B) turn on some nodes. Iain's simulations assume (A) is already
                      done. He just addressed the performance of (B). Part (A) is invariant
                      for all his simulations, because your requirement forces it to be.

                      As for different data structures, it largely depends on how you
                      need to access the data. If you don't need to index the data,
                      just loop through it, you might try a linked list. The
                      performance hit in (2) is coming from the list del; using a
                      linked list would make the removal constant-time rather than
                      O(n), and it may even execute faster than (3) as well.
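
                      A bare-bones singly linked list along these lines (a sketch,
                      not a tuned implementation); the unlink during traversal is
                      O(1) because only one pointer gets re-aimed:

                          class LLNode(object):
                              __slots__ = ('value', 'next')
                              def __init__(self, value, next=None):
                                  self.value = value
                                  self.next = next

                          def remove_if(head, pred):
                              # unlink every node whose value satisfies pred;
                              # return the (possibly new) head
                              dummy = LLNode(None, head)
                              prev = dummy
                              while prev.next is not None:
                                  if pred(prev.next.value):
                                      prev.next = prev.next.next   # O(1) unlink, no shifting
                                  else:
                                      prev = prev.next
                              return dummy.next

                      Whether this beats (3) in CPython is an open question, though,
                      since pure-Python pointer chasing has its own per-node overhead.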

                      -Matt


                      • Iain King

                        #12
                        Re: Execution speed question

                        On Jul 25, 4:22 pm, Matthew Fitzgibbons <eles...@nienna.org> wrote:
                        > It seems like the probability calculation applies to all
                        > three equally, and can therefore be ignored for the
                        > simulations.
                        The probability affects (1) more. My reasoning: as the
                        probability gets lower, the number of times you have to loop
                        over the list increases. (1) always loops over the full list,
                        but with each successive iteration (2) and (3) are looping
                        over smaller and smaller lists. In the end this adds up, with
                        (1) becoming slower than (2), even though it starts out
                        quicker.

                        Iain


                        • Matthew Fitzgibbons

                          #13
                          Re: Execution speed question

                          Iain King wrote:
                          > On Jul 25, 4:22 pm, Matthew Fitzgibbons <eles...@nienna.org> wrote:
                          >> It seems like the probability calculation applies to all
                          >> three equally, and can therefore be ignored for the
                          >> simulations.
                          >
                          > The probability affects (1) more. My reasoning: as the
                          > probability gets lower, the number of times you have to
                          > loop over the list increases. (1) always loops over the
                          > full list, but with each successive iteration (2) and (3)
                          > are looping over smaller and smaller lists. In the end
                          > this adds up, with (1) becoming slower than (2), even
                          > though it starts out quicker.
                          >
                          > Iain
                          I meant that the _calculation_ of the probability affects
                          all three equally, not the value itself. As your simulations
                          show, different probability values affect the algorithms
                          differently; I'm talking about the algorithm used to arrive
                          at the probability value.

                          -Matt


                          • Eric Wertman

                            #14
                            Re: Execution speed question

                            > The number of nodes is very large: millions for sure,
                            > maybe tens of millions. If considering (2), take note of
                            > my BOLD text above, which means I can't remove nodes as
                            > I iterate through them in the main loop.
                            Since your use of 'node' is pretty vague and I don't have a good sense
                            of what tests you are running and how long they would take, I'm only
                            speculating, but a single loop might be the wrong way to go about
                            that.

                            If you are going to be frequently running tests and
                            switching nodes on/off, have you considered a separate set
                            of processes to do both? For example:

                            A set of some number of "tester" threads that loop through
                            and test, recording their results (somewhere).

                            You could then have a separate loop that runs every so
                            often, checks all the current test values, and runs
                            through the nodes once, switching them on or off.


                            I know it's not exactly what you asked, but depending on
                            what your nodes are exactly, you can avoid a lot of other
                            problems down the road. What if your single loop dies or
                            gets hung on a test? With a separate approach, you'll have
                            a lot more resilience: if there's some problem with a
                            single tester or node, it won't keep the rest of the
                            program from continuing to function.
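
                            A very rough sketch of that split, with invented names
                            throughout (run_test stands in for the real per-node test):

                                import random
                                import threading
                                try:
                                    import queue            # Python 3
                                except ImportError:
                                    import Queue as queue   # Python 2

                                results = queue.Queue()

                                def run_test(node_id):
                                    # invented stand-in for the real per-node test
                                    return random.random() < 0.5

                                def tester(node_ids):
                                    # "tester" thread: test its chunk, record outcomes
                                    for n in node_ids:
                                        results.put((n, run_test(n)))

                                def switcher(state, total):
                                    # separate pass: drain results, flip nodes on
                                    for _ in range(total):
                                        n, outcome = results.get()
                                        if outcome:
                                            state[n] = True

                                state = [False] * 1000
                                testers = [threading.Thread(target=tester,
                                                            args=(range(i, 1000, 4),))
                                           for i in range(4)]
                                for t in testers:
                                    t.start()
                                switcher(state, 1000)
                                for t in testers:
                                    t.join()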


                            • Suresh Pillai

                              #15
                              Re: Execution speed question

                              On Fri, 25 Jul 2008 08:08:57 -0700, Iain King wrote:
                              > On Jul 25, 3:39 pm, Suresh Pillai <stochash...@yahoo.ca> wrote:
                              >> That's a good comparison for the general question I
                              >> posed. Thanks. Although I do believe lists are less
                              >> than ideal here and a different data structure should
                              >> be used.
                              >>
                              >> To be more specific to my case:
                              >> As mentioned in my original post, I also have the
                              >> specific condition that one does not know which nodes
                              >> to turn ON until after all the probabilities are
                              >> calculated (let's say we take the top m, for example).
                              >> In this case, the second and third will perform worse,
                              >> as the second will require a remove from the list
                              >> after the fact and the third will require another
                              >> loop through the nodes to build the new list.
                              >
                              > So you need to loop through twice regardless? i.e. loop
                              > once to gather data on the off nodes, do some
                              > calculation to work out what to turn on, then loop
                              > again to turn on the relevant nodes? If so, then I
                              > think the functions above remain the same, becoming the
                              > 2nd loop. Every iteration you do a first loop over the
                              > off_nodes (or all of them for (1)) to gather the data
                              > on them, perform your calculation, and then perform one
                              > of the above functions (minus the setup code at the
                              > beginning; basically starting at the 'for') as a second
                              > loop, with the goes_on function now returning a value
                              > based on the calculation (rather than doing the
                              > calculation itself as I had it). Performance should be
                              > similar.
                              >
                              > Iain
                              If I do settle on an explicit loop to remove the nodes
                              turned ON, I realised this weekend that I could do this
                              in the next iteration of the simulation (the first loop
                              above) and save some iteration overhead (the if checking
                              will still be there, of course).

                              And thanks for pointing out that constructing a new list, for long lists,
                              is faster than simple removal. It's obvious but I never really thought
                              of it; good tip.
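
                              Roughly, folding the cleanup into the next pass might
                              look like this (prob_on and choose_nodes are invented
                              placeholders for the probability calculation and the
                              'pick the winners' step):

                                  def next_step(off_list, flagged_on, on_list,
                                                prob_on, choose_nodes):
                                      still_off, scored = [], []
                                      for x in off_list:
                                          if x in flagged_on:   # turned ON last pass: drop now
                                              on_list.append(x)
                                              continue
                                          still_off.append(x)   # the residual 'if checking'
                                          scored.append((x, prob_on(x)))
                                      # return the new OFF list and the next pass's flags
                                      return still_off, choose_nodes(scored)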

