Numpy array to gzip file

  • Sean Davis

    Numpy array to gzip file

    I have a set of numpy arrays which I would like to save to a gzip
    file. Here is an example without gzip:

    import numpy

    b = numpy.ones(1000000, dtype=numpy.uint8)
    a = numpy.zeros(1000000, dtype=numpy.uint8)
    fd = file('test.dat', 'wb')
    a.tofile(fd)
    b.tofile(fd)
    fd.close()

    This works fine. However, this does not:

    import gzip

    fd = gzip.open('test.dat', 'wb')
    a.tofile(fd)

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IOError: first argument must be a string or open file

    In the bigger picture, I want to be able to write multiple numpy
    arrays with some metadata to a binary file for very fast reading, and
    these arrays are pretty compressible (strings of small integers), so I
    can probably benefit in speed and file size by gzipping.

    Thanks,
    Sean
  • drobinow@gmail.com

    #2
    Re: Numpy array to gzip file

    On Jun 11, 9:17 am, Sean Davis <seand...@gmail.com> wrote:
    > I have a set of numpy arrays which I would like to save to a gzip
    > file. Here is an example without gzip:
    >
    > b = numpy.ones(1000000, dtype=numpy.uint8)
    > a = numpy.zeros(1000000, dtype=numpy.uint8)
    > fd = file('test.dat', 'wb')
    > a.tofile(fd)
    > b.tofile(fd)
    > fd.close()
    >
    > This works fine. However, this does not:
    >
    > fd = gzip.open('test.dat', 'wb')
    > a.tofile(fd)
    >
    > Traceback (most recent call last):
    >   File "<stdin>", line 1, in <module>
    > IOError: first argument must be a string or open file
    >
    > In the bigger picture, I want to be able to write multiple numpy
    > arrays with some metadata to a binary file for very fast reading, and
    > these arrays are pretty compressible (strings of small integers), so I
    > can probably benefit in speed and file size by gzipping.
    >
    > Thanks,
    > Sean
    Use

    fd.write(a)

    The documentation says that gzip simulates most of the methods of a
    file object. Apparently that means it does not subclass it;
    numpy.tofile wants a real file object. Or something like that.


    • Sean Davis

      #3
      Re: Numpy array to gzip file

      On Jun 11, 12:42 pm, "drobi...@gmail.com" <drobi...@gmail.com> wrote:
      > On Jun 11, 9:17 am, Sean Davis <seand...@gmail.com> wrote:
      > > [original question snipped]
      >
      > Use
      >
      > fd.write(a)
      That seems to work fine. Just to add to the answer a bit, one can
      then use:

      b = numpy.frombuffer(fd.read(), dtype=numpy.uint8)

      to get the array back as a numpy uint8 array.

      Thanks for the help.

      Sean
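For reference, a self-contained round-trip sketch of the approach settled on in this thread, written for modern Python 3 (where file() and buffer() no longer exist; tobytes() supplies the raw bytes, and an in-memory buffer stands in for the file on disk purely to keep the sketch self-contained):

```python
import gzip
import io

import numpy

a = numpy.zeros(1000000, dtype=numpy.uint8)
b = numpy.ones(1000000, dtype=numpy.uint8)

# Write: a gzip stream accepts bytes, so hand it each array's raw
# buffer instead of calling a.tofile(fd).
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as fd:
    fd.write(a.tobytes())
    fd.write(b.tobytes())

# Read back: decompress everything, reinterpret as uint8, and slice
# the two arrays apart again.
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode='rb') as fd:
    data = numpy.frombuffer(fd.read(), dtype=numpy.uint8)

a2, b2 = data[:a.size], data[a.size:]
```

Note that slicing the decompressed stream back apart assumes you know each array's length up front; the metadata Sean mentions would have to record those lengths.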


      • Robert Kern

        #4
        Re: Numpy array to gzip file

        Sean Davis wrote:
        > [snip]
        >
        > fd = gzip.open('test.dat', 'wb')
        > a.tofile(fd)
        >
        > Traceback (most recent call last):
        >   File "<stdin>", line 1, in <module>
        > IOError: first argument must be a string or open file
        As drobinow says, the .tofile() method needs an actual file object with a real
        FILE* pointer underneath it. You will need to call fd.write() on strings (or
        buffers) made from the arrays instead. If your arrays are large (as they must be
        if compression helps), then you will probably want to split them up. Use
        numpy.array_split() to do this. For example:

        In [13]: import numpy

        In [14]: a = numpy.zeros(1000000, dtype=numpy.uint8)

        In [15]: chunk_size = 256*1024

        In [17]: import gzip

        In [18]: fd = gzip.open('foo.gz', 'wb')

        In [19]: for chunk in numpy.array_split(a, len(a) // chunk_size):
           ....:     fd.write(buffer(chunk))
           ....:
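On Python 3 the buffer() builtin is gone; chunk.tobytes() (or memoryview(chunk)) serves the same purpose. A sketch of the same chunked write, directed at an in-memory buffer purely for illustration:

```python
import gzip
import io

import numpy

a = numpy.zeros(1000000, dtype=numpy.uint8)
chunk_size = 256 * 1024

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as fd:
    # array_split (unlike split) tolerates a section count that does
    # not divide len(a) evenly.
    for chunk in numpy.array_split(a, len(a) // chunk_size):
        fd.write(chunk.tobytes())  # tobytes() replaces Python 2's buffer()

compressed_size = buf.tell()
```

Since the array here is a megabyte of zeros, compressed_size comes out far smaller than the raw data, which is exactly the situation where gzipping pays off.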
        > In the bigger picture, I want to be able to write multiple numpy
        > arrays with some metadata to a binary file for very fast reading, and
        > these arrays are pretty compressible (strings of small integers), so I
        > can probably benefit in speed and file size by gzipping.
        File size perhaps, but I suspect the speed gains you get will be swamped by the
        Python-level manipulation you will have to do to reconstruct the array. You will
        have to read in (partial!) strings and then put the data into an array. If you
        think compression will really help, look into PyTables. It uses the HDF5 library
        which includes the ability to compress arrays with gzip and other compression
        schemes. All of the decompression happens in C, so you don't have to do all of
        the manipulations at the Python level. If you stand to gain anything from
        compression, this is the best way to find out and probably the best way to
        implement it, too.



        If you have more numpy questions, you will probably want to ask on the numpy
        mailing list:



        --
        Robert Kern

        "I have come to believe that the whole world is an enigma, a harmless enigma
        that is made terrible by our own mad attempt to interpret it as though it had
        an underlying truth."
        -- Umberto Eco
