Marshal Obj is String or Binary?

This topic is closed.
  • Mike

    Marshal Obj is String or Binary?

    Hi,

    The example below shows that the result of marshaling a data structure is
    nothing but a string:

    >>> data = {2:'two', 3:'three'}
    >>> import marshal
    >>> bytes = marshal.dumps(data)
    >>> type(bytes)
    <type 'str'>
    >>> bytes
    '{i\x02\x00\x00\x00t\x03\x00\x00\x00twoi\x03\x00\x00\x00t\x05\x00\x00\x00three0'

    Now, I need to store this data safely in my database as CLEAR TEXT, not
    BLOB. It seems to me that it should work just fine since it is a string
    anyway. So why does O'Reilly's Python Cookbook insist on saving
    it as a binary file and BLOB type?

    Am I missing something?

    Thanks,
    Mike

  • Marc 'BlackJack' Rintsch

    #2
    Re: Marshal Obj is String or Binary?

    In <1137183910.659813.210550@g47g2000cwa.googlegroups.com>, Mike wrote:

    > The example below shows that the result of marshaling a data structure is
    > nothing but a string:
    >
    > >>> data = {2:'two', 3:'three'}
    > >>> import marshal
    > >>> bytes = marshal.dumps(data)
    > >>> type(bytes)
    > <type 'str'>
    > >>> bytes
    > '{i\x02\x00\x00\x00t\x03\x00\x00\x00twoi\x03\x00\x00\x00t\x05\x00\x00\x00three0'
    >
    > Now, I need to store this data safely in my database as CLEAR TEXT, not
    > BLOB. It seems to me that it should work just fine since it is a string
    > anyway. So why does O'Reilly's Python Cookbook insist on saving
    > it as a binary file and BLOB type?
    >
    > Am I missing something?

    Yes, that a string is *binary* data. But only a subset of strings is safe
    to use as `TEXT` in databases. Do you see all those '\x??' escapes?
    '\x00' is *one* byte! A byte with the value zero. Something your DB
    doesn't allow in a `TEXT` type.
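
A minimal sketch of this point, written for Python 3 as an assumption (the thread itself uses Python 2, where `marshal.dumps` returns a `str`; in Python 3 it returns `bytes`). It only checks that the marshaled form of the dictionary from the first post contains real zero-valued bytes:

```python
import marshal

data = {2: 'two', 3: 'three'}
blob = marshal.dumps(data)

# In Python 3 the marshaled result is a bytes object, not text.
print(type(blob))

# The integer keys are written in binary form, so the output contains
# real NUL bytes, not the four characters backslash, x, 0, 0.
print(b'\x00' in blob)

# The round trip only works if every byte survives storage unchanged.
print(marshal.loads(blob) == data)
```

If a database column silently drops or truncates at a zero byte, `marshal.loads` would no longer see the same input.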

    Ciao,
    Marc 'BlackJack' Rintsch


    • Mike

      #3
      Re: Marshal Obj is String or Binary?

      Wait a sec. \x00 may represent a byte when unmarshaled, but as long as
      marshal likes it as \x00, I think my db is capable of storing \ x 0 0
      characters. What is the problem? Is it that \? I could escape that...
      actually I think my django framework already does that for me.

      Thanks,
      Mike

        • casevh@comcast.net

          #5
          Re: Marshal Obj is String or Binary?

          Try...

          >>> for i in bytes: print ord(i)

          or

          >>> len(bytes)

          What you see isn't always what you have. Your database is capable of
          storing \ x 0 0 characters, but your string contains a single byte of
          value zero. When Python displays the string representation to you, it
          escapes the values so they can be displayed.
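
The same distinction can be sketched in Python 3 (an assumption; the session above is Python 2). The names `raw` and `shown` are made up for the illustration: one holds the actual byte, the other holds the four characters that appear on screen.

```python
raw = b'\x00'     # one byte whose value is zero
shown = r'\x00'   # four characters: backslash, x, 0, 0

# Iterating over bytes in Python 3 yields integers, so ord() is not needed.
print(len(raw), list(raw))

# The escaped form is ordinary text made of four separate characters.
print(len(shown), list(shown))
```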

          casevh


          • Giovanni Bajo

            #6
            Re: Marshal Obj is String or Binary?

            casevh@comcast.net wrote:

            > Try...
            >
            > >>> for i in bytes: print ord(i)
            >
            > or
            >
            > >>> len(bytes)
            >
            > What you see isn't always what you have. Your database is capable of
            > storing \ x 0 0 characters, but your string contains a single byte of
            > value zero. When Python displays the string representation to you, it
            > escapes the values so they can be displayed.

            He can still store the repr of the string into the database, and then
            reconstruct it with eval:

            >>> bytes = "\x00\x01\x02"
            >>> bytes
            '\x00\x01\x02'
            >>> len(bytes)
            3
            >>> ord(bytes[0])
            0
            >>> rb = repr(bytes)
            >>> rb
            "'\\x00\\x01\\x02'"
            >>> len(rb)
            14
            >>> rb[0]
            "'"
            >>> rb[1]
            '\\'
            >>> rb[2]
            'x'
            >>> rb[3]
            '0'
            >>> rb[4]
            '0'
            >>> bytes2 = eval(rb)
            >>> bytes == bytes2
            True

            --
            Giovanni Bajo



            • Mike

              #7
              Re: Marshal Obj is String or Binary?

              Thanks everyone. It seems broken to store complex structures as escaped
              strings, but I think I'll take my chances.

              Thanks,
              Mike


              • Steven D'Aprano

                #8
                Re: Marshal Obj is String or Binary?

                On Fri, 13 Jan 2006 22:20:27 -0800, Mike wrote:

                > Thanks everyone. It seems broken to store complex structures as escaped
                > strings, but I think I'll take my chances.


                Have you read the marshal reference?

                This module contains functions that can read and write Python values in a binary format. The format is specific to Python, but independent of machine architecture issues (e.g., you can write a Pyth...


                marshal doesn't store data as escaped strings, it stores them as binary
                strings. When you print the binary string to the console, unprintable
                characters are shown escaped.

                I'm guessing you probably want to use pickle instead of marshal. marshal
                is intended only for dealing with .pyc files, and has some important
                limitations. pickle is intended to be a general purpose serializer.


                --
                Steve.


                • Max

                  #9
                  Re: Marshal Obj is String or Binary?

                  Giovanni Bajo wrote:

                  >> What you see isn't always what you have. Your database is capable of
                  >> storing \ x 0 0 characters, but your string contains a single byte of
                  >> value zero. When Python displays the string representation to you, it
                  >> escapes the values so they can be displayed.
                  >
                  > He can still store the repr of the string into the database, and then
                  > reconstruct it with eval:

                  Yes, but len(repr('\x00')) is 4, while len('\x00') is 1. So if he uses
                  BLOB his data will take almost a quarter of the space, compared to your
                  method (stored as TEXT).

                  --Max


                  • Steven D'Aprano

                    #10
                    Re: Marshal Obj is String or Binary?

                    On Sat, 14 Jan 2006 12:36:59 +0200, Max wrote:

                    >> He can still store the repr of the string into the database, and then
                    >> reconstruct it with eval:
                    >
                    > Yes, but len(repr('\x00')) is 4, while len('\x00') is 1.

                    Incorrect:

                    >>> len(repr('\x00'))
                    6
                    >>> repr('\x00')
                    "'\\x00'"

                    > So if he uses
                    > BLOB his data will take almost a quarter of the space, compared to your
                    > method (stored as TEXT).

                    Also incorrect. That depends utterly on which particular characters end up
                    in the serialised data. You may or may not be able to predict what that
                    mix may be.

                    # nothing but printable data
                    >>> s = ''.join(['a' for i in range(256)])
                    >>> len(s)
                    256
                    >>> len(repr(s))
                    258

                    # nothing but unprintable data
                    >>> s = ''.join(['\0' for i in range(256)])
                    >>> len(s)
                    256
                    >>> len(repr(s))
                    1026

                    # one particular mix of both printable and unprintable data
                    >>> s = ''.join([chr(i) for i in range(256)])
                    >>> len(s)
                    256
                    >>> len(repr(s))
                    737

                    # a different mix of both printable and unprintable data
                    >>> s = '+'.join([chr(i) for i in range(128)])
                    >>> len(s)
                    255
                    >>> len(repr(s))
                    352
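
The measurements above can be wrapped in a small helper. This is a sketch for Python 3 (an assumption; the session above is Python 2), and `repr_overhead` is a made-up name. Note that in Python 3 the repr of a bytes object also carries a leading `b`, so the absolute lengths differ by one from the numbers above.

```python
def repr_overhead(data: bytes) -> float:
    """Return len(repr(data)) / len(data) for a non-empty bytes object."""
    return len(repr(data)) / len(data)


samples = {
    'printable only': b'a' * 256,
    'NUL bytes only': b'\x00' * 256,
    'every byte value once': bytes(range(256)),
}

# Print how much larger the escaped form is than the raw bytes.
for label, data in samples.items():
    print(f'{label}: {repr_overhead(data):.2f}x')
```

The ratio ranges from just over 1 to just over 4, depending entirely on which byte values occur.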





                    --
                    Steven.


                    • Giovanni Bajo

                      #11
                      Re: Marshal Obj is String or Binary?

                      Max wrote:

                      >>> What you see isn't always what you have. Your database is capable of
                      >>> storing \ x 0 0 characters, but your string contains a single byte
                      >>> of value zero. When Python displays the string representation to
                      >>> you, it escapes the values so they can be displayed.
                      >>
                      >> He can still store the repr of the string into the database, and then
                      >> reconstruct it with eval:
                      >
                      > Yes, but len(repr('\x00')) is 4, while len('\x00') is 1. So if he uses
                      > BLOB his data will take almost a quarter of the space, compared to
                      > your method (stored as TEXT).

                      Sure, but he didn't ask for the best strategy to store the data into the
                      database, he specified very clearly that he *can't* use BLOB, and asked how to
                      use TEXT.
                      --
                      Giovanni Bajo



                      • Mike

                        #12
                        Re: Marshal Obj is String or Binary?

                        Thanks everyone.

                        Why Marshal & not Pickle: Well, Marshal is supposed to be faster. But
                        then, if I wanted to do the whole repr()-eval() hack, I am already
                        defeating the purpose by refusing to save bytes as bytes in terms of
                        both size and speed.

                        At this point, I am considering one of the following:
                        - Save my structure as binary data, and reference the file from my db
                        - Find a clean method of saving bytes into my db
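
For the second option, one possible shape is sketched below in Python 3. Everything specific here is an assumption: the thread never names the database, so the standard-library `sqlite3` module stands in for it, the table and column names are invented, and `pickle` is used as suggested earlier in the thread rather than `marshal`.

```python
import pickle
import sqlite3

data = {2: 'two', 3: 'three'}

# In-memory database as a stand-in; table and column names are made up.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE structures (id INTEGER PRIMARY KEY, payload BLOB)')

# Passing a bytes object as a query parameter stores it as a BLOB,
# so no escaping of the serialized data is needed.
serialized = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
conn.execute('INSERT INTO structures (payload) VALUES (?)', (serialized,))

# Read the bytes back and rebuild the original structure.
(payload,) = conn.execute('SELECT payload FROM structures').fetchone()
restored = pickle.loads(payload)
print(restored == data)
conn.close()
```

Other database drivers have their own way of binding binary parameters, so the details would need checking against the driver actually in use.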

                        Thanks again,
                        Mike


                        • Mike Meyer

                          #13
                          Re: Marshal Obj is String or Binary?

                          "Giovanni Bajo" <noway@sorry.com> writes:
                          > casevh@comcast.net wrote:
                          >> Try...
                          >> >>> for i in bytes: print ord(i)
                          >> or
                          >> >>> len(bytes)
                          >> What you see isn't always what you have. Your database is capable of
                          >> storing \ x 0 0 characters, but your string contains a single byte of
                          >> value zero. When Python displays the string representation to you, it
                          >> escapes the values so they can be displayed.
                          > He can still store the repr of the string into the database, and then
                          > reconstruct it with eval:

                          repr and eval are overkill for this, and as a result create a
                          security hole. Using encode('string-escape') and
                          decode('string-escape') will do the same job without the security
                          hole:

                          >>> bytes = '\x00\x01\x02'
                          >>> bytes
                          '\x00\x01\x02'
                          >>> ord(bytes[0])
                          0
                          >>> rb = bytes.encode('string-escape')
                          >>> rb
                          '\\x00\\x01\\x02'
                          >>> len(rb)
                          12
                          >>> rb[0]
                          '\\'
                          >>> bytes2 = rb.decode('string-escape')
                          >>> bytes == bytes2
                          True
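
The `string-escape` codec is a Python 2 feature. As a sketch of a comparable round trip in Python 3, the example below swaps in a different technique: `repr` to produce printable text and `ast.literal_eval` to read it back. `literal_eval` only accepts Python literals, so it avoids the arbitrary code execution that makes plain `eval` a security hole.

```python
import ast

raw = b'\x00\x01\x02'

# repr() yields printable text; in Python 3 it carries a leading b.
text = repr(raw)

# literal_eval parses literals only and refuses calls or other expressions.
restored = ast.literal_eval(text)

print(text, len(text), restored == raw)
```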

                          <mike
                          --
                          Mike Meyer <mwm@mired.org> http://www.mired.org/home/mwm/
                          Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.


                          • Steven D'Aprano

                            #14
                            Re: Marshal Obj is String or Binary?

                            On Sat, 14 Jan 2006 13:50:24 -0800, Mike wrote:

                            > Thanks everyone.
                            >
                            > Why Marshal & not Pickle: Well, Marshal is supposed to be faster.

                            Faster than cPickle?

                            Even faster would be to write your code in assembly, and dump that
                            ridiculously bloated database and just write everything to raw bytes on
                            an unformatted disk. Of course, it might take the programmer a thousand
                            times longer to actually write the program, and there will probably be
                            hundreds of bugs in it, but the important thing is that you'll save three
                            or four milliseconds at runtime.

                            Right?

                            Unless you've actually done proper measurements of the time taken, with
                            realistic sample data, worrying about saving a byte here and a
                            millisecond there is just wasting your time, and is often
                            counter-productive. Optimization without measurement is as likely to
                            result in slower, fatter performance as it is faster and leaner.

                            marshal is not designed to be portable across versions. Do you *really*
                            think it is a good idea to tie the data in your database to one specific
                            version of Python?


                            > But
                            > then, if I wanted to do the whole repr()-eval() hack, I am already
                            > defeating the purpose by refusing to save bytes as bytes in terms of
                            > both size and speed.
                            >
                            > At this point, I am considering one of the following:
                            > - Save my structure as binary data, and reference the file from my db
                            > - Find a clean method of saving bytes into my db

                            Your database either can handle binary data, or it can't.

                            If it can, then just use pickle with a binary protocol and be done with it.

                            If it can't, then just use pickle with a plain text protocol and be done
                            with it.
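
A rough Python 3 sketch of the two choices (an assumption; the thread predates Python 3). Protocol 0 is pickle's original human-readable protocol. For this particular sample its output happens to be plain ASCII, which is not guaranteed for arbitrary data such as non-ASCII strings.

```python
import pickle

data = {2: 'two', 3: 'three'}

# Protocol 0: for this sample the result decodes cleanly as ASCII text
# and contains no NUL bytes, so it can go into a text column.
as_text = pickle.dumps(data, protocol=0).decode('ascii')

# Higher protocols are binary and may contain any byte value.
as_binary = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

print(repr(as_text))
print(pickle.loads(as_text.encode('ascii')) == data)
print(pickle.loads(as_binary) == data)
```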

                            Either way, you have to find a way to translate your Python data
                            structures into something that you can feed to the database. Your database
                            can't automatically suck data structures out of Python's working memory!
                            So why re-invent the wheel? marshal is not recommended, but if you can
                            live with the limitations of marshal then it might do the job. But trying
                            to optimise code that hasn't even been written yet is a sure way to
                            trouble.


                            --
                            Steven.


                            • Steve Holden

                              #15
                              Re: Marshal Obj is String or Binary?

                              Mike wrote:
                              > Hi,
                              >
                              > The example below shows that the result of marshaling a data structure is
                              > nothing but a string:
                              >
                              > >>> data = {2:'two', 3:'three'}
                              > >>> import marshal
                              > >>> bytes = marshal.dumps(data)
                              > >>> type(bytes)
                              > <type 'str'>
                              > >>> bytes
                              > '{i\x02\x00\x00\x00t\x03\x00\x00\x00twoi\x03\x00\x00\x00t\x05\x00\x00\x00three0'
                              >
                              > Now, I need to store this data safely in my database as CLEAR TEXT, not
                              > BLOB. It seems to me that it should work just fine since it is a string
                              > anyway. So why does O'Reilly's Python Cookbook insist on saving
                              > it as a binary file and BLOB type?

                              Well, the Cookbook isn't an exhaustive list of everything you can do
                              with Python, it's just a record of some of the things people *have* done.

                              I presume your database has no datatype that will store binary data of
                              indeterminate length? Clearly that would be the most satisfactory solution.

                              regards
                              Steve
                              --
                              Steve Holden +44 150 684 7255 +1 800 494 3119
                              Holden Web LLC www.holdenweb.com
                              PyCon TX 2006 www.python.org/pycon/
