Problem with zipfile and newlines

  • Neil Crighton

    Problem with zipfile and newlines

    I'm using the zipfile library to read a zip file in Windows, and it
    seems to be adding too many newlines to extracted files. I've found
    that for extracted text-encoded files, removing all instances of '\r'
    in the extracted file seems to fix the problem, but I can't find an
    easy solution for binary files.

    The code I'm using is something like:

    from zipfile import ZipFile
    z = ZipFile(open('zippedfile.zip'))
    extractedfile = z.read('filename_in_zippedfile')

    I'm using Python version 2.5. Has anyone else had this problem
    before, or know how to fix it?

    Thanks,

    Neil
  • John Machin

    #2
    Re: Problem with zipfile and newlines

    On Mar 10, 8:31 pm, "Neil Crighton" <neilcrigh...@gmail.com> wrote:
    > I'm using the zipfile library to read a zip file in Windows, and it
    > seems to be adding too many newlines to extracted files. I've found
    > that for extracted text-encoded files, removing all instances of '\r'
    > in the extracted file seems to fix the problem, but I can't find an
    > easy solution for binary files.
    >
    > The code I'm using is something like:
    >
    > from zipfile import ZipFile
    > z = ZipFile(open('zippedfile.zip'))
    > extractedfile = z.read('filename_in_zippedfile')

    "Too many newlines" is fixed by removing all instances of '\r'. What
    are you calling a newline? '\r'??

    How do you know there are too many thingies? What operating system
    were the original files created on?

    When you do:
    # using a more meaningful name :-)
    extractedfilecontents = z.read('filename_in_zippedfile')
    then:
    print repr(extractedfilecontents)
    what do you see at the end of what you regard as each line:
    (1) \n
    (2) \r\n
    (3) \r
    (4) something else
    ?
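
    A minimal sketch of the check John is asking for, written for Python 3
    (an assumption; the thread itself uses 2.5). The helper name
    count_line_endings and the sample bytes are made up for illustration.
    It counts each style of line ending in a byte string so that options
    (1) to (3) can be told apart without reading the repr() by eye:

```python
import re


def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone LFs and lone CRs in a byte string."""
    crlf = len(re.findall(rb"\r\n", data))
    # A lone LF is one that is not preceded by a CR.
    lone_lf = len(re.findall(rb"(?<!\r)\n", data))
    # A lone CR is one that is not followed by an LF.
    lone_cr = len(re.findall(rb"\r(?!\n)", data))
    return {"crlf": crlf, "lf": lone_lf, "cr": lone_cr}


# Hypothetical sample: one CRLF, one doubled-up CR before a CRLF, one bare LF.
sample = b"first\r\nsecond\r\r\nthird\n"
print(repr(sample))
counts = count_line_endings(sample)
print(counts)
```

    A non-zero "cr" count alongside "crlf" is the kind of pattern that
    points at a stray carriage return being added somewhere.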

    Do you fiddle with extractedfilecontents (other than trying to fix it)
    before writing it to the file?

    When you write out a text file, do you do:
    open('foo.txt', 'w').write(extractedfilecontents)
    or
    open('foo.txt', 'wb').write(extractedfilecontents)
    ?

    When you write out a binary file, do you do:
    open('foo.txt', 'w').write(extractedfilecontents)
    or
    open('foo.txt', 'wb').write(extractedfilecontents)
    ?


    • Duncan Booth

      #3
      Re: Problem with zipfile and newlines

      "Neil Crighton" <neilcrighton@gmail.com> wrote:
      > I'm using the zipfile library to read a zip file in Windows, and it
      > seems to be adding too many newlines to extracted files. I've found
      > that for extracted text-encoded files, removing all instances of '\r'
      > in the extracted file seems to fix the problem, but I can't find an
      > easy solution for binary files.
      >
      > The code I'm using is something like:
      >
      > from zipfile import ZipFile
      > z = ZipFile(open('zippedfile.zip'))
      > extractedfile = z.read('filename_in_zippedfile')
      >
      > I'm using Python version 2.5. Has anyone else had this problem
      > before, or know how to fix it?
      >
      > Thanks,

      Zip files aren't text. Try opening the zipfile file in binary mode:

      open('zippedfile.zip', 'rb')
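
      As a rough illustration of that advice, here is a sketch in Python 3
      (an assumption; the thread uses 2.5). It builds a throwaway archive
      first so it can run anywhere; the file and member names are
      placeholders. It then reads the same member two ways: through a file
      object opened with 'rb', and by handing ZipFile the path directly.

```python
import os
import tempfile
import zipfile

# Build a small throwaway archive so the example is self-contained.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "zippedfile.zip")
payload = b"line one\r\nline two\r\n"
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("member.txt", payload)

# Option 1: pass ZipFile a file object that was opened in binary mode.
with open(zip_path, "rb") as fh:
    with zipfile.ZipFile(fh) as zf:
        via_file_object = zf.read("member.txt")

# Option 2: pass the path and let ZipFile open the file itself.
with zipfile.ZipFile(zip_path) as zf:
    via_path = zf.read("member.txt")

print(via_file_object == via_path == payload)
```

      Passing the path is usually the simpler choice, since there is then
      no mode argument to get wrong.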


      • John Machin

        #4
        Re: Problem with zipfile and newlines

        On Mar 10, 11:14 pm, Duncan Booth <duncan.bo...@invalid.invalid>
        wrote:
        > "Neil Crighton" <neilcrigh...@gmail.com> wrote:
        >> I'm using the zipfile library to read a zip file in Windows, and it
        >> seems to be adding too many newlines to extracted files. I've found
        >> that for extracted text-encoded files, removing all instances of '\r'
        >> in the extracted file seems to fix the problem, but I can't find an
        >> easy solution for binary files.
        >>
        >> The code I'm using is something like:
        >>
        >> from zipfile import ZipFile
        >> z = ZipFile(open('zippedfile.zip'))
        >> extractedfile = z.read('filename_in_zippedfile')
        >>
        >> I'm using Python version 2.5. Has anyone else had this problem
        >> before, or know how to fix it?
        >>
        >> Thanks,
        >
        > Zip files aren't text. Try opening the zipfile file in binary mode:
        >
        > open('zippedfile.zip', 'rb')

        Good pickup, but that indicates that the OP may have *TWO* problems,
        the first of which is not posting the code that was actually executed.

        If the OP actually executed the code that he posted, it is highly
        likely to have died in a hole long before it got to the z.read()
        stage, e.g.
        >>> import zipfile
        >>> z = zipfile.ZipFile(open('foo.zip'))
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
          File "C:\python25\lib\zipfile.py", line 346, in __init__
            self._GetContents()
          File "C:\python25\lib\zipfile.py", line 366, in _GetContents
            self._RealGetContents()
          File "C:\python25\lib\zipfile.py", line 404, in _RealGetContents
            centdir = struct.unpack(structCentralDir, centdir)
          File "C:\python25\lib\struct.py", line 87, in unpack
            return o.unpack(s)
        struct.error: unpack requires a string argument of length 46
        >>> z = zipfile.ZipFile(open('foo.zip', 'rb')) # OK
        >>> z = zipfile.ZipFile('foo.zip', 'r') # OK

        If it somehow made it through the open stage, it surely would have
        blown up at the read stage, when trying to decompress a contained
        file.
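
        The same experiment can be sketched in Python 3 (an assumption; the
        session above is Python 2.5 on Windows). There, a text-mode file
        object is also rejected, although the exception is not the
        struct.error shown above and its exact type is deliberately not
        assumed here. try_open is a hypothetical helper written for this
        sketch:

```python
import os
import tempfile
import zipfile

# Build a throwaway archive to experiment on.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "foo.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("member.txt", b"hello\r\n")


def try_open(mode):
    """Return None on success, or the exception raised while reading."""
    try:
        with open(zip_path, mode) as fh:
            with zipfile.ZipFile(fh) as zf:
                zf.read("member.txt")
        return None
    except Exception as exc:  # deliberately broad: the type is not assumed
        return exc


text_mode_error = try_open("r")
binary_mode_error = try_open("rb")
print("text mode failed with:", type(text_mode_error).__name__)
print("binary mode error:", binary_mode_error)
```

        Only the 'rb' attempt gets as far as returning the member's bytes.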

        Cheers,
        John


        • neilcrighton@gmail.com

          #5
          Re: Problem with zipfile and newlines

          Sorry my initial post was muddled. Let me try again.

          I've got a zipped archive that I can extract files from with my
          standard archive unzipping program, 7-zip. I'd like to extract the
          files in python via the zipfile module. However, when I extract the
          file from the archive with ZipFile.read(), it isn't the same as the 7-
          zip-extracted file. For text files, the zipfile-extracted version has
          '\r\n' everywhere the 7-zip-extracted file only has '\n'. I haven't
          tried comparing binary files via the two extraction methods yet.
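
          One way to confirm that two extractions of a text file differ
          only in their line endings is sketched below (Python 3 assumed).
          Both helper names are hypothetical, and the byte strings are
          stand-ins for the real files, which would be read with
          open(path, 'rb'):

```python
def normalize_newlines(data: bytes) -> bytes:
    """Map CRLF and lone CR to LF so only the real content is compared."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def differ_only_in_line_endings(a: bytes, b: bytes) -> bool:
    """True if a and b are unequal but match once newlines are normalized."""
    return a != b and normalize_newlines(a) == normalize_newlines(b)


# Stand-ins for the two versions of the same extracted text file.
from_zipfile = b"alpha\r\nbeta\r\n"
from_7zip = b"alpha\nbeta\n"

only_newlines = differ_only_in_line_endings(from_zipfile, from_7zip)
print(only_newlines)
```

          This is only meaningful for text files; in a binary file a
          0x0D or 0x0A byte is ordinary data and must not be normalized.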

          Regarding the code I posted: I was writing it from memory, and made a
          mistake. I didn't use:

          z = zipfile.ZipFile(open('foo.zip', 'r'))

          I used this:

          z = zipfile.ZipFile('foo.zip')

          But Duncan's comment was useful, as I generally only ever work with
          text files, and I didn't realise you have to use 'rb' or 'wb' options
          when reading and writing binary files.

          To answer John's questions - I was calling '\r' a newline. I should
          have said carriage return. I'm not sure what operating system the
          original zip file was created on. I didn't fiddle with the extracted
          file contents, other than replacing '\r' with ''. I wrote out all the
          files with open('outputfile','w') - it seems that I should have been
          using 'wb' when writing out the binary files.

          Thanks for the quick responses - any ideas why the zipfile-extracted
          files and 7-zip-extracted files are different?

          On Mar 10, 9:37 pm, John Machin <sjmac...@lexicon.net> wrote:
          > Good pickup, but that indicates that the OP may have *TWO* problems,
          > the first of which is not posting the code that was actually executed.


          • neilcrighton@gmail.com

            #6
            Re: Problem with zipfile and newlines

            I think I've worked it out after reading the 'Binary mode for files'
            section of http://zephyrfalcon.org/labs/python_pitfalls.html

            zipfile extracts a file as a binary series of characters, and I'm
            writing out this binary data as a text file with open('foo','w').
            Normally Python converts a '\n' in a text file to whatever the
            platform-dependent indication of a new line is ('\n' on Unix, '\r\n'
            on Windows, '\r' on Macs). So it sees '\r\n' in the binary file and
            converts it to '\r\r\n' for the text file.
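
            The doubling can be reproduced on any platform with a small
            Python 3 sketch (an assumption; the thread uses 2.5). Python 3
            will not write bytes to a text-mode file, so the bytes are
            decoded first, and newline="\r\n" is passed explicitly to
            stand in for what text mode does by default on Windows. The
            file names are placeholders.

```python
import os
import tempfile

# Bytes as they might come back from ZipFile.read() for a Windows text file.
extracted = b"one\r\ntwo\r\n"
tmpdir = tempfile.mkdtemp()

# Text mode: every "\n" written is translated to "\r\n", including the
# ones that already follow a "\r".
text_path = os.path.join(tmpdir, "text_mode.txt")
with open(text_path, "w", newline="\r\n", encoding="latin-1") as fh:
    fh.write(extracted.decode("latin-1"))

# Binary mode: the bytes are written exactly as given.
binary_path = os.path.join(tmpdir, "binary_mode.txt")
with open(binary_path, "wb") as fh:
    fh.write(extracted)

# Read both files back as raw bytes to see what actually landed on disk.
with open(text_path, "rb") as fh:
    text_mode_result = fh.read()
with open(binary_path, "rb") as fh:
    binary_mode_result = fh.read()

print(repr(text_mode_result))
print(repr(binary_mode_result))
```

            The text-mode file gains one extra '\r' per line, while the
            binary-mode file is byte-for-byte identical to the input.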

            The upshot of this is that writing out the zipfile-extracted files
            with open('foo','wb') instead of open('foo','w') solves my problem.
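
            Putting that together, a minimal extraction loop might look
            like the sketch below (Python 3 assumed). extract_all_binary
            is a hypothetical helper, not part of the zipfile module, and
            it flattens any directory structure inside the archive to keep
            the example short.

```python
import os
import tempfile
import zipfile


def extract_all_binary(zip_path, dest_dir):
    """Write every file member of the archive to dest_dir in binary mode."""
    written = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # basename() drops any directory part of the member name.
            target = os.path.join(dest_dir, os.path.basename(info.filename))
            # 'wb' writes the bytes untouched, for text and binary alike.
            with open(target, "wb") as out:
                out.write(zf.read(info))
            written.append(target)
    return written


# Demonstration on a throwaway archive.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "foo.zip")
payload = b"one\r\ntwo\r\n"
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("foo.txt", payload)

out_dir = os.path.join(tmpdir, "out")
os.makedirs(out_dir)
paths = extract_all_binary(zip_path, out_dir)
with open(paths[0], "rb") as fh:
    round_trip = fh.read()
print(round_trip == payload)
```

            Python 2.6 and later also provide ZipFile.extract() and
            ZipFile.extractall(), which write members to disk without a
            hand-written loop.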

            On Mar 11, 8:43 pm, neilcrigh...@gmail.com wrote:
            > Thanks for the quick responses - any ideas why the zipfile-extracted
            > files and 7-zip-extracted files are different?
