PEP 358 and operations on bytes

  • Gerrit Holl

    PEP 358 and operations on bytes

    Hi,

    In Python 3, reading from a file gives bytes rather than characters.
    Some operations currently performed on strings also make sense when
    performed on bytes, either if it's binary data or if it's text of
    unknown or mixed encoding. Those include of course slicing and other
    operators that exist in lists, but also other operations that aren't
    currently defined in PEP 358, like:

    - str methods endswith, find, partition, replace, split(lines),
    startswith,
    - Regular expressions

    I think those can be useful on a bytes type. Perhaps bytes and str could
    share a common parent class? They certainly share a lot of properties
    and possible operations one might want to perform.

    kind regards,
    Gerrit Holl.

    --
    My first English-language post ever was made to this newsgroup:
    http://groups.google.com/group/comp....57acf785ddfb71 :)
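For readers on a current Python 3: the str-style methods listed above did end up on the bytes type. A minimal sketch, where the sample record and the variable names are made up for illustration:

```python
# Illustrative data only: a text-like header line followed by binary payload.
record = b"Content-Type: text/plain\r\n\x00\x01payload"

# partition splits at the first occurrence of the separator.
header, sep, body = record.partition(b"\r\n")

print(header.startswith(b"Content-Type"))
print(header.find(b": "))
print(header.split(b": "))
print(body.endswith(b"payload"))

# replace and splitlines also work directly on bytes.
first_line = record.replace(b"\r\n", b"\n").splitlines()[0]
print(first_line)
```

All arguments must themselves be bytes; passing a str to these methods raises TypeError.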
  • John Machin

    #2
    Re: PEP 358 and operations on bytes


    Gerrit Holl wrote:
    > Hi,
    >
    > In Python 3, reading from a file gives bytes rather than characters.
    > Some operations currently performed on strings also make sense when
    > performed on bytes, either if it's binary data or if it's text of
    > unknown or mixed encoding. Those include of course slicing and other
    > operators that exist in lists, but also other operations that aren't
    > currently defined in PEP 358, like:
    >
    > - str methods endswith, find, partition, replace, split(lines),
    > startswith,
    > - Regular expressions
    >
    > I think those can be useful on a bytes type. Perhaps bytes and str could
    > share a common parent class? They certainly share a lot of properties
    > and possible operations one might want to perform.

    I look at it this way::
    Processing text? Use unicode.
    Binary structures and file I/O, interfacing to 8-bit-wide channels? Use
    bytes.
    Nostalgic for confused mixed-use? Don't upgrade.

    IMHO, core dev time would be better used on:

    * making /relevant/ modules (e.g. struct) work with bytes -- this topic
    is not mentioned in the PEP.
    * ensuring it covers everything that array.array('B', ...) does.
    * being able to initialise a bytes array to (typically) all zeroes
    without having to instantiate an initialiser, e.g. record =
    bytes(size=996, fill=0) instead of record = bytes(996 * [0])

    than on starts(ends)with etc., and regexes.

    Cheers,
    John
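A note for readers on current Python 3, where the struct and zero-fill wishes above are both covered: passing an integer to `bytes` or `bytearray` gives a zero-filled buffer of that length, and `struct` packs into and unpacks from bytes-like objects. A minimal sketch; the 996-byte size comes from the post, while the field layout is an assumption invented for illustration:

```python
import struct

RECORD_SIZE = 996                      # size taken from the post above

record = bytearray(RECORD_SIZE)        # mutable, initialised to all zeroes
frozen = bytes(RECORD_SIZE)            # immutable equivalent

# Hypothetical layout: little-endian unsigned 16-bit id, then unsigned 32-bit length.
LAYOUT = "<HI"
struct.pack_into(LAYOUT, record, 0, 7, 1024)
rec_id, length = struct.unpack_from(LAYOUT, record, 0)

print(rec_id, length, len(record))
```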

    • Gerrit Holl

      #3
      Re: PEP 358 and operations on bytes

      On 2006-10-04 05:10:32 +0200, John Machin wrote:
      >> - str methods endswith, find, partition, replace, split(lines),
      >> startswith,
      >> - Regular expressions
      >>
      >> I think those can be useful on a bytes type. Perhaps bytes and str could
      >> share a common parent class? They certainly share a lot of properties
      >> and possible operations one might want to perform.
      >
      > I look at it this way::
      > Processing text? Use unicode.
      > Binary structures and file I/O, interfacing to 8-bit-wide channels? Use
      > bytes.

      But can I use regular expressions on bytes?
      Regular expressions are not limited to text.

      Gerrit.
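For readers on a current Python 3, the answer turned out to be yes: the `re` module accepts a bytes pattern against bytes-like data, provided pattern and subject are the same kind (str with str, bytes with bytes; mixing them raises TypeError). A small sketch with made-up data:

```python
import re

# Illustrative data: a run of 0x01 bytes inside otherwise arbitrary binary.
data = b"\x00\x01\x01\x02\xff"

# Bytes pattern against a bytes subject.
m = re.search(rb"\x01+", data)
print(m.span(), m.group())

# A bytearray works as the subject too; here we look for any byte >= 0x80.
m2 = re.search(rb"[\x80-\xff]", bytearray(data))
print(m2.start())
```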

      • John Machin

        #4
        Re: PEP 358 and operations on bytes


        Gerrit Holl wrote:
        > On 2006-10-04 05:10:32 +0200, John Machin wrote:
        >>> - str methods endswith, find, partition, replace, split(lines),
        >>> startswith,
        >>> - Regular expressions
        >>>
        >>> I think those can be useful on a bytes type. Perhaps bytes and str could
        >>> share a common parent class? They certainly share a lot of properties
        >>> and possible operations one might want to perform.
        >>
        >> I look at it this way::
        >> Processing text? Use unicode.
        >> Binary structures and file I/O, interfacing to 8-bit-wide channels? Use
        >> bytes.
        >
        > But can I use regular expressions on bytes?
        > Regular expressions are not limited to text.

        So why haven't you been campaigning for regular expression support for
        sequences of int, and for various array.array subtypes?

        • Paul Rubin

          #5
          Re: PEP 358 and operations on bytes

          "John Machin" <sjmachin@lexicon.net> writes:
          > So why haven't you been campaigning for regular expression support for
          > sequences of int, and for various array.array subtypes?

          regexps work on byte arrays.

          • John Machin

            #6
            Re: PEP 358 and operations on bytes


            Paul Rubin wrote:
            > "John Machin" <sjmachin@lexicon.net> writes:
            >> So why haven't you been campaigning for regular expression support for
            >> sequences of int, and for various array.array subtypes?
            >
            > regexps work on byte arrays.

            But not on other integer subtypes. If regexps should not be restricted
            to text, they should work on domains whose number of symbols is greater
            than 256, shouldn't they?

            • Paul Rubin

              #7
              Re: PEP 358 and operations on bytes

              "John Machin" <sjmachin@lexicon.net> writes:
              > But not on other integer subtypes. If regexps should not be restricted
              > to text, they should work on domains whose number of symbols is greater
              > than 256, shouldn't they?

              I think the underlying regexp C library isn't written that way. I can
              see reasons to want a higher-level regexp library that works on
              arbitrary sequences, calling a user-supplied function to classify
              sequence elements, the way current regexps use the character code to
              classify characters.
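The idea in the last paragraph can be sketched in a few lines of Python. This is only a toy under stated assumptions, not an existing library: the names `search` and `_match_here`, and the pattern format (a list of `(predicate, quantifier)` pairs with quantifiers "1", "?", "*", "+"), are invented here for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

Predicate = Callable[[object], bool]
Step = Tuple[Predicate, str]   # quantifier is one of "1", "?", "*", "+"


def _match_here(pattern: Sequence[Step], seq: Sequence, pos: int) -> Optional[int]:
    """Return the end index of a match of pattern starting at pos, or None."""
    if not pattern:
        return pos
    (pred, quant), rest = pattern[0], pattern[1:]
    if quant == "1":
        if pos < len(seq) and pred(seq[pos]):
            return _match_here(rest, seq, pos + 1)
        return None
    # Count how many consecutive elements the predicate accepts.
    limit = 1 if quant == "?" else len(seq) - pos
    count = 0
    while count < limit and pos + count < len(seq) and pred(seq[pos + count]):
        count += 1
    minimum = 1 if quant == "+" else 0
    # Greedy: try the longest run first, then back off one element at a time.
    for n in range(count, minimum - 1, -1):
        end = _match_here(rest, seq, pos + n)
        if end is not None:
            return end
    return None


def search(pattern: Sequence[Step], seq: Sequence) -> Optional[Tuple[int, int]]:
    """Return the (start, end) span of the leftmost match, or None."""
    for start in range(len(seq) + 1):
        end = _match_here(pattern, seq, start)
        if end is not None:
            return (start, end)
    return None


data = [0, 1, 1, 2, 257, 257, 258]
print(search([(lambda x: x == 1, "+")], data))
print(search([(lambda x: x == 257, "+"), (lambda x: x > 257, "1")], data))
```

The predicates play the role that character codes and classes play in an ordinary regexp, so the elements can be integers of any size, tuples, or tokens. The backtracking is naive and can be slow on adversarial patterns.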

              • bearophileHUGS@lycos.com

                #8
                Re: PEP 358 and operations on bytes

                Paul Rubin:
                > I think the underlying regexp C library isn't written that way. I can
                > see reasons to want a higher-level regexp library that works on
                > arbitrary sequences, calling a user-supplied function to classify
                > sequence elements, the way current regexps use the character code to
                > classify characters.

                To begin with something concrete: some days ago I started writing
                a simple RE engine that works on lists/tuples/arrays and makes good
                use of Psyco (but then I stopped developing it). Once, and only once,
                some good uses have been found, someone can later translate the code
                to C, if necessary.
                It seems an interesting thing, but can you find some uses for it?

                Bye,
                bearophile

                • Paul Rubin

                  #9
                  Re: PEP 358 and operations on bytes

                  bearophileHUGS@lycos.com writes:
                  >> I think the underlying regexp C library isn't written that way. I can
                  >> see reasons to want a higher-level regexp library that works on
                  >> arbitrary sequences, calling a user-supplied function to classify
                  >> sequence elements, the way current regexps use the character code to
                  >> classify characters.
                  >
                  > ...It seems an interesting thing, but can you find some uses for it?

                  Yes, I want something like that all the time for file scanning without
                  having to resort to parser modules or hand coded automata.

                  • Fredrik Lundh

                    #10
                    Re: PEP 358 and operations on bytes

                    John Machin wrote:
                    > But not on other integer subtypes. If regexps should not be restricted
                    > to text, they should work on domains whose number of symbols is greater
                    > than 256, shouldn't they?

                    they do:

                    import re, array

                    data = [0, 1, 1, 2]

                    # pick the array typecode that matches the regex engine's code size
                    array_type = "IH"[re.sre_compile.MAXCODE == 0xffff]

                    a = array.array(array_type, data)

                    m = re.search(r"\x01+", a)

                    if m:
                        print m.span()
                        print m.group()

                    </F>

                    • bearophileHUGS@lycos.com

                      #11
                      Re: PEP 358 and operations on bytes

                      A simple RE engine written in Python can be short, this is a toy:

                      If you can't live without the usual syntax:


                      Paul Rubin:
                      > Yes, I want something like that all the time for file scanning without
                      > having to resort to parser modules or hand coded automata.

                      Once read, a file is a string or unicode, and on those you can use
                      normal REs. If you need list-REs you have probably split the data into
                      parts. Can you show one or more examples where you think simple
                      list-REs can be useful?

                      Bye,
                      bearophile

                      • John Machin

                        #12
                        Re: PEP 358 and operations on bytes


                        Fredrik Lundh wrote:
                        > John Machin wrote:
                        >
                        >> But not on other integer subtypes. If regexps should not be restricted
                        >> to text, they should work on domains whose number of symbols is greater
                        >> than 256, shouldn't they?
                        >
                        > they do:
                        >
                        > import re, array
                        >
                        > data = [0, 1, 1, 2]
                        >
                        > array_type = "IH"[re.sre_compile.MAXCODE == 0xffff]
                        >
                        > a = array.array(array_type, data)
                        >
                        > m = re.search(r"\x01+", a)
                        >
                        > if m:
                        >     print m.span()
                        >     print m.group()

                        Very minor nit: re.sre_compile doesn't exist before Python 2.5.
                        Presumably sys.maxunicode can substitute for re.sre_compile.MAXCODE.

                        That aside, I'd like to nominate myself as UGPOTM (utterly gobsmacked
                        poster of the month). Not only does that work, but so does this, all
                        the way back to 2.1 at least:

                        import re, array
                        data = [0, 1, 1, 2, 257, 257, 258]
                        # array_type = "IH"[re.sre_compile.MAXCODE == 0xffff] # Python 2.5
                        array_type = "H"
                        a = array.array(array_type, data)
                        for q in (r"\x01+", ur"\u0101+"):
                            m = re.search(q, a)
                            if m:
                                print m.span()
                                print m.group()

                        produces:

                        (1, 3)
                        array('H', [1, 1])
                        (4, 6)
                        array('H', [257, 257])

                        Now, scurrying back towards Gerrit's original point: this feature is
                        not documented, even for array.array('B', ...). Should it be left as a
                        happy accident of duck-typing, accessible only to those who stumble
                        over it, or should it be supported? Should it be included in Python 3?

                        Cheers,
                        John
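On the closing question, a hedged note for readers on current Python 3: there a str pattern is rejected for any non-str subject, so the array('H') trick above cannot be carried over as written. One workaround, sketched here as an assumption rather than a supported feature, is to map each integer to the code point with the same value, search the resulting string, and read the span back as indices into the original list. It only works for values below 0x110000, and the helper name `search_ints` is made up for illustration.

```python
import re


def search_ints(pattern, values):
    """Search a list of small non-negative ints with an ordinary str regex.

    Each integer becomes one code point, so string offsets equal list indices.
    """
    text = "".join(map(chr, values))
    m = re.search(pattern, text)
    return m.span() if m else None


data = [0, 1, 1, 2, 257, 257, 258]      # same data as in the post above

for pattern in ("\x01+", "\u0101+"):
    span = search_ints(pattern, data)
    if span:
        start, end = span
        print(span, data[start:end])
```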
