Fastest way to store ints and floats on disk

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Laszlo Nagy

    Fastest way to store ints and floats on disk


    Hi,

    I'm working on a pivot table. I would like to write it in Python. I
    know, I should be doing that in C, but I would like to create a cross
    platform version which can deal with smaller databases (not more than a
    million facts).

    The data is first imported from a csv file: the user selects which
    columns contain dimension and measure data (and which columns to
    ignore). In the next step I would like to build up a database that is
    efficient enough to be used for making pivot tables. Here is my idea for
    the database:

    Original CSV file with column header and values:

    "Color","Year", "Make","Price", "VMax"
    Yellow,2000,Fer rari,100000,254
    Blue,2003,Volvo ,50000,210

    Using the GUI, it is converted to this:

    dimensions = [
    { 'name':'Color', 'colindex:0, 'values':[ 'Red', 'Blue', 'Green',
    'Yellow' ], },
    { 'name':'Year', colindex:1, 'values':[
    1995,1999,2000, 2001,2002,2003, 2007 ], },
    { 'name':'Make', colindex:2, 'value':[ 'Ferrari', 'Volvo', 'Ford',
    'Lamborgini' ], },
    ]
    measures = [
    { 'name', 'Price', 'colindex':3 },
    { 'name', 'Vmax', 'colindex':4 },
    ]
    facts = [
    ( (3,2,0),(100000 .0,254.0) ), # ( dimension_value _indexes,
    measure_values )
    ( (1,5,1),(50000. 0,210.0) ),
    .... # Some million rows or less
    ]


    The core of the idea is that, when using a relatively small number of
    possible values for each dimension, the facts table becomes
    significantly smaller and easier to process. (Processing the facts would
    be: iterate over facts, filter out some of them, create statistical
    values of the measures, grouped by dimensions.)

    The facts table cannot be kept in memory because it is too big. I need
    to store it on disk, be able to read incrementally, and make statistics.
    In most cases, the "statistic" will be simple sum of the measures, and
    counting the number of facts affected. To be effective, reading the
    facts from disk should not involve complex conversions. For this reason,
    storing in CSV or XML or any textual format would be bad. I'm thinking
    about a binary format, but how can I interface that with Python?

    I already looked at:

    - xdrlib, which throws me DeprecationWarn ing when I store some integers
    - struct which uses format string for each read operation, I'm concerned
    about its speed

    What else can I use?

    Thanks,

    Laszlo



  • castironpi

    #2
    Re: Fastest way to store ints and floats on disk

    On Aug 7, 1:41 pm, Laszlo Nagy <gand...@shopze us.comwrote:
      Hi,
    >
    I'm working on a pivot table. I would like to write it in Python. I
    know, I should be doing that in C, but I would like to create a cross
    platform version which can deal with smaller databases (not more than a
    million facts).
    >
    The data is first imported from a csv file: the user selects which
    columns contain dimension and measure data (and which columns to
    ignore). In the next step I would like to build up a database that is
    efficient enough to be used for making pivot tables. Here is my idea for
    the database:
    >
    Original CSV file with column header and values:
    >
    "Color","Year", "Make","Price", "VMax"
    Yellow,2000,Fer rari,100000,254
    Blue,2003,Volvo ,50000,210
    >
    Using the GUI, it is converted to this:
    >
    dimensions = [
        { 'name':'Color', 'colindex:0, 'values':[ 'Red', 'Blue', 'Green',
    'Yellow' ], },
        { 'name':'Year', colindex:1, 'values':[
    1995,1999,2000, 2001,2002,2003, 2007 ], },
        { 'name':'Make', colindex:2, 'value':[ 'Ferrari', 'Volvo', 'Ford',
    'Lamborgini' ], },
    ]
    measures = [
        { 'name', 'Price', 'colindex':3 },
        { 'name', 'Vmax', 'colindex':4 },
    ]
    facts = [
        ( (3,2,0),(100000 .0,254.0)  ), # ( dimension_value _indexes,
    measure_values )
        ( (1,5,1),(50000. 0,210.0) ),
       .... # Some million rows or less
    ]
    >
    The core of the idea is that, when using a relatively small number of
    possible values for each dimension, the facts table becomes
    significantly smaller and easier to process. (Processing the facts would
    be: iterate over facts, filter out some of them, create statistical
    values of the measures, grouped by dimensions.)
    >
    The facts table cannot be kept in memory because it is too big. I need
    to store it on disk, be able to read incrementally, and make statistics.
    In most cases, the "statistic" will be simple sum of the measures, and
    counting the number of facts affected. To be effective, reading the
    facts from disk should not involve complex conversions. For this reason,
    storing in CSV or XML or any textual format would be bad. I'm thinking
    about a binary format, but how can I interface that with Python?
    >
    I already looked at:
    >
    - xdrlib, which throws me DeprecationWarn ing when I store some integers
    - struct which uses format string for each read operation, I'm concerned
    about its speed
    >
    What else can I use?
    >
    Thanks,
    >
       Laszlo
    Take a look at the mmap module. You get direct memory access, backed
    by the file system. struct + mmap, if you keep your strings small?

    Comment

    • Matthew Woodcraft

      #3
      Re: Fastest way to store ints and floats on disk

      Laszlo Nagy <gandalf@shopze us.comwrites:
      The facts table cannot be kept in memory because it is too big. I need to
      store it on disk, be able to read incrementally, and make statistics. In most
      cases, the "statistic" will be simple sum of the measures, and counting the
      number of facts affected. To be effective, reading the facts from disk should
      not involve complex conversions. For this reason, storing in CSV or XML or any
      textual format would be bad. I'm thinking about a binary format, but how can I
      interface that with Python?
      >
      I already looked at:
      >
      - xdrlib, which throws me DeprecationWarn ing when I store some integers
      - struct which uses format string for each read operation, I'm concerned about
      its speed
      >
      What else can I use?
      pytables (<http://www.pytables.or g/>) looks like the right kind of
      thing.

      -M-

      Comment

      Working...