Large Data Sets: Use base variables or classes? And some binding questions

  • Patrick  Sullivan

    Large Data Sets: Use base variables or classes? And some binding questions

    Hello.

    I will be using some large data sets ("points" from 2 to 12 variables)
    and would like to use one class for each point rather than a list or
    dictionary. I imagine this is terribly inefficient, but how much?

    What is the cost of creating a new class?

    What is the cost of referencing a class variable?

    What is the cost of calling a class method to just return a variable?

    Key point: The point objects, once created, are essentially
    non-mutable. Static. Is there a way to "bind" a variable to an object
    method in a way that is more efficient than the function calling
    self.variable_name?

    I'll run some profile tests later today, but if anyone has any
    information on the cost/efficiency of object creation in Python, or
    on any other idioms related to variable creation, I'd greatly
    appreciate some links.

    Thanks!

    Patrick
  • malkarouri

    #2
    Re: Large Data Sets: Use base variables or classes? And some binding questions

    On 26 Sep, 16:39, Patrick Sullivan <psu...@gmail.com> wrote:
    > Hello.
    >
    > I will be using some large data sets ("points" from 2 to 12 variables)
    > and would like to use one class for each point rather than a list or
    > dictionary. I imagine this is terribly inefficient, but how much?

    I can't really get into details here, but I would suggest that you go
    ahead and try first. As you know, premature optimization is the root
    of all evil.

    General points I would suggest:

    - Use NumPy/SciPy (http://www.scipy.org). You will get more
    efficiency with less effort than if you use plain Python lists, and
    it is much easier to optimize later.
    - Your questions of referencing classes and variables tell me that
    perhaps you are starting from a C background, or Java maybe? Anyway,
    as far as I know, it is not standard practice to write a class method
    (you meant a normal bound method, right?) just to access a variable.
    Use a plain attribute and, if you later need it to be computed by a
    method, turn it into a property.
    - Is the efficiency you are looking for in terms of time or memory?
    That difference sometimes leads to different optimization tricks.
    - By using Numpy there is probably another advantage to you: some
    efficiency in the data representation, as the NumPy array stores data,
    say integers, without memory overhead per member (point). Just an
    array of integers. Of course there is additional constant memory per
    array which is independent of the number of elements (points) you are
    storing.
    - Generally try to think in terms of arrays of data rather than single
    points. If it helps, think in terms of matrices. That is more or less
    the design of Matlab, and Numpy is more or less similar.
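A minimal sketch of the "arrays rather than single points" idea, assuming NumPy is installed. The data set size, the three-variable layout, and all names are made up for illustration:

```python
import numpy as np

# Hypothetical data set: 1000 points, each with 3 float variables.
n_points, n_vars = 1000, 3
points = np.arange(n_points * n_vars, dtype=np.float64).reshape(n_points, n_vars)

# One point is a row; one variable across all points is a column.
first_point = points[0]
all_first_vars = points[:, 0]

# Whole-array operations replace per-point Python loops.
column_means = points.mean(axis=0)
norms = np.sqrt((points ** 2).sum(axis=1))

# The data buffer holds 8 bytes per float64 value, with no per-point object.
print(points.shape, points.nbytes)
```

Here each point costs exactly `n_vars * 8` bytes in the data buffer; the fixed per-array overhead is paid once, not once per point.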


    Now if you specify your problem further, I am sure that you will get
    better advice from the community here. Don't focus on the details;
    the bigger picture will probably help. Are you working in graphics?
    Image processing? Machine learning, statistics, data mining, etc.?

    --
    Muhammad Alkarouri


    • Terry Reedy

      #3
      Re: Large Data Sets: Use base variables or classes? And some binding questions

      Patrick Sullivan wrote:
      > Hello.
      >
      > I will be using some large data sets ("points" from 2 to 12 variables)
      > and would like to use one class for each point rather than a list or
      > dictionary. I imagine this is terribly inefficient, but how much?

      I strongly suspect that you should use one class and a class instance
      for each 'point'. You can make instances 'fixed' after initialization
      by customizing appropriate methods, but I would not bother for private code.
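One possible way to make instances "fixed" after initialization is to override `__setattr__` and `__delattr__`. This is only a sketch; the `Point` class and its two fields are hypothetical and not taken from the thread:

```python
class Point:
    """Hypothetical point whose attributes cannot be rebound after __init__."""

    def __init__(self, x, y):
        # Bypass our own __setattr__ while the instance is being built.
        object.__setattr__(self, "x", x)
        object.__setattr__(self, "y", y)

    def __setattr__(self, name, value):
        raise AttributeError("Point instances are read-only")

    def __delattr__(self, name):
        raise AttributeError("Point instances are read-only")


p = Point(1.0, 2.0)
try:
    p.x = 5.0
except AttributeError as exc:
    print(exc)
```

The protection is a convention rather than a guarantee, since `object.__setattr__` can still be called from outside the class.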


      • Carl Banks

        #4
        Re: Large Data Sets: Use base variables or classes? And some binding questions

        On Sep 26, 11:39 am, Patrick Sullivan <psu...@gmail.com> wrote:
        > Hello.

        Hi, I have a couple suggestions.

        > I will be using some large data sets ("points" from 2 to 12 variables)
        > and would like to use one class for each point rather than a list or
        > dictionary.

        Ok, point of terminology. It's not really a nit-pick, either, since
        it affects some of your questions below. When you say you want to use
        one class for each point, you apparently mean you would like to use
        one class instance, or one object, for each point.

        One class for each point would be terribly inefficient; one instance,
        perhaps not.

        > I imagine this is terribly inefficient, but how much?

        You say large data sets, which suggests that the __slots__ mechanism
        could be useful to you.

        class A(object):
            __slots__ = ['var1', 'var2', 'var3']

        Normally, each class instance has an associated dict which stores the
        attributes, but if you define __slots__ then the variables will be
        stored in fixed memory locations and no dict will be created.
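A small sketch of the difference described above. The class names `Plain` and `Slotted` are illustrative; the attribute names follow the `var1`..`var3` example:

```python
class Plain:
    def __init__(self, var1, var2, var3):
        self.var1, self.var2, self.var3 = var1, var2, var3


class Slotted:
    __slots__ = ("var1", "var2", "var3")

    def __init__(self, var1, var2, var3):
        self.var1, self.var2, self.var3 = var1, var2, var3


plain = Plain(1, 2, 3)
slotted = Slotted(1, 2, 3)

# The ordinary instance carries a per-instance dict; the slotted one does not.
print(hasattr(plain, "__dict__"))
print(hasattr(slotted, "__dict__"))

# Without a dict, attributes not listed in __slots__ cannot be added.
try:
    slotted.var4 = 4
except AttributeError:
    print("var4 is not a declared slot")
```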

        However, it seems from the rest of your comments that speed is your
        main concern. Last time someone reported __slots__ didn't make a big
        difference in access time, but it probably would speed up creating
        objects a bit. Of course, you should profile it to make sure.

        > What is the cost of creating a new class?

        I'm assuming you want to know the cost of creating a class instance.
        Generally speaking, the main cost of this is that you'd be executing
        Python code (whereas list and dict are written in C).
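One way to see this cost on your own machine is a rough `timeit` comparison. The `Point` class and the three-variable layout are assumptions for illustration, and the numbers will vary by interpreter and hardware:

```python
import timeit


class Point:
    # __init__ is Python code that runs on every instantiation.
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z


namespace = {"Point": Point, "a": 1.0, "b": 2.0, "c": 3.0}

statements = {
    "tuple": "(a, b, c)",
    "dict": "{'x': a, 'y': b, 'z': c}",
    "instance": "Point(a, b, c)",
}

timings = {}
for label, stmt in statements.items():
    # Total seconds for 200,000 creations of each kind of container.
    timings[label] = timeit.timeit(stmt, globals=namespace, number=200_000)
    print(f"{label:10s}{timings[label]:.4f} s")
```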

        > What is the cost of referencing a class variable?

        I assume you mean an instance variable.

        > What is the cost of calling a class method to just return a variable?

        Significant penalty.

        This is because even if the method call is faster (and I doubt very
        highly that it is), the method still has to access the variable, which
        is going to take the same amount of time as accessing the variable
        directly. I.e., you're getting the overhead of a method call to do
        the same thing you could have done directly.
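A rough way to measure that overhead yourself, using a hypothetical `Point` class with a trivial getter; the results depend on the interpreter and machine:

```python
import timeit


class Point:
    def __init__(self, x):
        self.x = x

    def get_x(self):
        # The getter still performs the same attribute lookup internally.
        return self.x


p = Point(1.0)

direct = timeit.timeit("p.x", globals={"p": p}, number=500_000)
via_method = timeit.timeit("p.get_x()", globals={"p": p}, number=500_000)

print(f"direct attribute: {direct:.4f} s")
print(f"getter method:    {via_method:.4f} s")
```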

        I highly recommend against doing this, not only because it's less
        efficient, but also because it's considered bad style in Python.
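The usual alternative, suggested earlier in the thread, is to expose plain attributes and introduce a property only when a value actually needs to be computed. A minimal sketch with a hypothetical `Point` class and `radius` property:

```python
import math


class Point:
    def __init__(self, x, y):
        # Plain attributes: callers read p.x and p.y directly.
        self.x = x
        self.y = y

    @property
    def radius(self):
        # Computed on access, but callers still use attribute syntax.
        return math.hypot(self.x, self.y)


p = Point(3.0, 4.0)
print(p.x, p.radius)
```

Callers never need to change from `p.radius` to `p.radius()`, so a stored attribute can become a computed one later without breaking them.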

        > Key point: The point objects, once created, are essentially
        > non-mutable. Static. Is there a way to "bind" a variable to an object
        > method in a way that is more efficient than the function calling
        > self.variable_name?

        Python 2.6 has a new object type called namedtuple in the collections
        module. (Actually it's a type factory that creates a subclass of
        tuple with attribute names mapped to the indices.) This might be a
        perfect fit for your needs. You have to upgrade to 2.6, though, which
        won't be released for a few days.
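A short sketch of how `namedtuple` might fit the immutable-point use case. The type name `Point` and the field names are illustrative choices:

```python
from collections import namedtuple

# Hypothetical 3-variable point type.
Point = namedtuple("Point", ["x", "y", "z"])

p = Point(1.0, 2.0, 3.0)

# Attribute access and index access reach the same value.
print(p.x, p[0])

# Instances are immutable, like any tuple.
try:
    p.x = 10.0
except AttributeError:
    print("cannot assign to a namedtuple field")

# _replace builds a new point instead of mutating the old one.
q = p._replace(z=0.0)
print(q)
```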


        Carl Banks


        • Steven D'Aprano

          #5
          Re: Large Data Sets: Use base variables or classes? And some binding questions

          On Fri, 26 Sep 2008 14:54:36 -0700, Carl Banks wrote:
          > However, it seems from the rest of your comments that speed is your main
          > concern. Last time someone reported __slots__ didn't make a big
          > difference in access time, but it probably would speed up creating
          > objects a bit.

          Carl probably knows this already, but for the benefit of the Original
          Poster:

          __slots__ is intended as a memory optimization, not speed optimization.
          If it speeds up creation, that's a serendipitous side-effect of using
          less memory.
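The memory effect can be checked with the standard `tracemalloc` module. This is a sketch under assumptions: the class names, the three attributes, and the instance count are arbitrary, and the exact byte counts depend on the Python version:

```python
import tracemalloc


class Plain:
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c


class Slotted:
    __slots__ = ("a", "b", "c")

    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c


def bytes_for(cls, count=50_000):
    """Return the bytes allocated while building `count` instances of cls."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    # Keep the instances alive in a list until the second measurement.
    instances = [cls(1, 2, 3) for _ in range(count)]
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


plain_bytes = bytes_for(Plain)
slotted_bytes = bytes_for(Slotted)
print(f"plain:   {plain_bytes} bytes")
print(f"slotted: {slotted_bytes} bytes")
```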

          > Of course, you should profile it to make sure.

          Absolutely.

          Can I ask the OP how large is "large" in the Large Data Sets? What seems
          large to people is often not large at all to a modern computer.



          --
          Steven


          • Carl Banks

            #6
            Re: Large Data Sets: Use base variables or classes? And some binding questions

            On Sep 26, 7:43 pm, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
            > On Fri, 26 Sep 2008 14:54:36 -0700, Carl Banks wrote:
            >> However, it seems from the rest of your comments that speed is your main
            >> concern. Last time someone reported __slots__ didn't make a big
            >> difference in access time, but it probably would speed up creating
            >> objects a bit.
            >
            > Carl probably knows this already, but for the benefit of the Original
            > Poster:
            >
            > __slots__ is intended as a memory optimization, not speed optimization.
            > If it speeds up creation, that's a serendipitous side-effect of using
            > less memory.

            No, it'd be a serendipitous side-effect of not having to take the time
            to create a dict object, which is quite a bit more of a direct cause.

            It might still end up being slower (creating slot descriptors might
            take more time for all I know) but it's more than just an effect of
            less memory.

            Carl Banks


            • Carl Banks

              #7
              Re: Large Data Sets: Use base variables or classes? And some binding questions

              On Sep 26, 8:53 pm, Carl Banks <pavlovevide...@gmail.com> wrote:
              > It might still end up being slower (creating slot descriptors might
              > take more time for all I know) but it's more than just an effect of
              > less memory.

              Actually scratch that. Descriptors are only created when the type
              object is created. I can't think of anything that would need to be
              done in an instance only if no dict is present, so using slots
              almost certainly makes object creation faster. Still, the
              last word is the profiler.
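A quick way to get that last word for instance creation is `timeit`. The `Plain` and `Slotted` classes below are hypothetical stand-ins for a real point class, and the outcome may differ between Python versions:

```python
import timeit


class Plain:
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c


class Slotted:
    __slots__ = ("a", "b", "c")

    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c


creation_times = {}
for cls in (Plain, Slotted):
    # Total seconds to create 200,000 instances of each class.
    creation_times[cls.__name__] = timeit.timeit(
        "cls(1, 2, 3)", globals={"cls": cls}, number=200_000
    )
    print(f"{cls.__name__:8s}{creation_times[cls.__name__]:.4f} s")
```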


              Carl Banks
