Filtering out non-readable characters

This topic is closed.
  • MKoool

    Filtering out non-readable characters

    I have a file with binary and ascii characters in it. I massage the
    data and convert it to a more readable format, however it still comes
    up with some binary characters mixed in. I'd like to write something
    to just replace all non-printable characters with '' (I want to delete
    non-printable characters).

    I am having trouble figuring out an easy python way to do this... is
    the easiest way to just write some regular expression that does
    something like replace [^\p] with ''?

    Or is it better to go through every character and do ord(character),
    check the ascii values?

    What's the easiest way to do something like this?

    thanks
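Both ideas in the question can be sketched in a few lines. Python's `re` module has no `\p` class, so the sketch below spells out the printable ASCII range by hand. It is written for Python 3 (which postdates this thread), and the names `strip_with_regex` and `strip_with_loop` are made up for illustration.

```python
import re
import string

# Printable ASCII is 0x20-0x7e plus tab, newline, carriage return,
# vertical tab and form feed: the same characters as string.printable.
_NON_PRINTABLE = re.compile(r'[^\x20-\x7e\t\n\r\x0b\x0c]')

def strip_with_regex(text):
    """Delete every character outside the printable set using a regex."""
    return _NON_PRINTABLE.sub('', text)

_PRINTABLE = frozenset(string.printable)

def strip_with_loop(text):
    """Same result, checking one character at a time."""
    return ''.join(ch for ch in text if ch in _PRINTABLE)

sample = 'abc\x00\x01def\x7f\xff\n'
print(repr(strip_with_regex(sample)))
print(repr(strip_with_loop(sample)))
```

Both functions should agree on any input; which one reads better is mostly a matter of taste.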

  • Bengt Richter

    #2
    Re: Filtering out non-readable characters

    On 15 Jul 2005 17:33:39 -0700, "MKoool" <mohan@terabolic.com> wrote:
    >I have a file with binary and ascii characters in it. I massage the
    >data and convert it to a more readable format, however it still comes
    >up with some binary characters mixed in. I'd like to write something
    >to just replace all non-printable characters with '' (I want to delete
    >non-printable characters).
    >
    >I am having trouble figuring out an easy python way to do this... is
    >the easiest way to just write some regular expression that does
    >something like replace [^\p] with ''?
    >
    >Or is it better to go through every character and do ord(character),
    >check the ascii values?
    >
    >What's the easiest way to do something like this?
    >
    >>> import string
    >>> string.printable
    '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
    >>> identity = ''.join([chr(i) for i in xrange(256)])
    >>> unprintable = ''.join([c for c in identity if c not in string.printable])
    >>>
    >>> def remove_unprintable(s):
    ...     return s.translate(identity, unprintable)
    ...
    >>> set(remove_unprintable(identity)) - set(string.printable)
    set([])
    >>> set(remove_unprintable(identity))
    set(['\x0c', ' ', '$', '(', ',', '0', '4', '8', '<', '@', 'D', 'H', 'L', 'P', 'T', 'X', '\\', '`',
    'd', 'h', 'l', 'p', 't', 'x', '|', '\x0b', '#', "'", '+', '/', '3', '7', ';', '?', 'C', 'G',
    'K', 'O', 'S', 'W', '[', '_', 'c', 'g', 'k', 'o', 's', 'w', '{', '\n', '"', '&', '*', '.', '2',
    '6', ':', '>', 'B', 'F', 'J', 'N', 'R', 'V', 'Z', '^', 'b', 'f', 'j', 'n', 'r', 'v', 'z', '~',
    '\t', '\r', '!', '%', ')', '-', '1', '5', '9', '=', 'A', 'E', 'I', 'M', 'Q', 'U', 'Y', ']', 'a',
    'e', 'i', 'm', 'q', 'u', 'y', '}'])
    >>> sorted(set(remove_unprintable(identity))) == sorted(set(string.printable))
    True
    >>> sorted((remove_unprintable(identity))) == sorted((string.printable))
    True

    After that, to get clean file text, something like

    cleantext = remove_unprintable(file('unclean.txt').read())

    should do it. Or you should be able to iterate by lines something like (untested)

    for uncleanline in file('unclean.txt'):
        cleanline = remove_unprintable(uncleanline)
        # ... do whatever with clean line
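For anyone reading this on Python 3, where the `file()` builtin no longer exists, the same line-by-line idea might look roughly like the sketch below. It assumes the data is read as bytes (the file is described as part binary), and `clean_lines` is a hypothetical helper name, not something from the post above.

```python
import string

# Bytes to delete: every byte value that is not in string.printable.
PRINTABLE_BYTES = string.printable.encode('ascii')
UNPRINTABLE_BYTES = bytes(b for b in range(256) if b not in PRINTABLE_BYTES)

def remove_unprintable(data):
    """Return a copy of a bytes object with unprintable bytes deleted."""
    # Passing None as the table means "no remapping, only deletion".
    return data.translate(None, UNPRINTABLE_BYTES)

def clean_lines(path):
    """Yield cleaned lines from a file opened in binary mode."""
    with open(path, 'rb') as unclean:
        for uncleanline in unclean:
            yield remove_unprintable(uncleanline)
```

Reading in binary mode sidesteps decoding errors on the stray bytes; decode the cleaned result as ASCII afterwards if text is needed.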

    If there is something in string.printable that you don't want included, just use your own
    string of printables. BTW,

    >>> help(str.translate)
    Help on method_descriptor:

    translate(...)
        S.translate(table [,deletechars]) -> string

        Return a copy of the string S, where all characters occurring
        in the optional argument deletechars are removed, and the
        remaining characters have been mapped through the given
        translation table, which must be a string of length 256.
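Note that the help text above describes the Python 2 signature. In Python 3, `str.translate` takes a single mapping and has no `deletechars` argument; deletion is expressed by mapping characters to `None`, which the three-argument form of `str.maketrans` does for you. A rough Python 3 counterpart of the session above, under that assumption:

```python
import string

# Same construction as in the session, with range() instead of xrange().
identity = ''.join(chr(i) for i in range(256))
unprintable = ''.join(c for c in identity if c not in string.printable)

# The third argument of str.maketrans lists characters to map to None,
# i.e. characters that translate() will delete.
delete_table = str.maketrans('', '', unprintable)

def remove_unprintable(s):
    """Delete the unprintable characters among code points 0-255."""
    return s.translate(delete_table)

print(repr(remove_unprintable('a\x00b\x80c')))
```

One caveat: this table only covers code points below 256, so characters above that range pass through untouched.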

    Regards,
    Bengt Richter

    • Raymond Hettinger

      #3
      Re: Filtering out non-readable characters

      Wow, that was the most thorough answer to a comp.lang.python question
      since the Martellibot got busy in the search business.

      • Peter Hansen

        #4
        Re: Filtering out non-readable characters

        Bengt Richter wrote:
        > >>> identity = ''.join([chr(i) for i in xrange(256)])
        > >>> unprintable = ''.join([c for c in identity if c not in string.printable])

        And note that with Python 2.4, in each case the above square brackets
        are unnecessary (though harmless), because of the arrival of "generator
        expressions" in the language. (Bengt knows this already, of course, but
        his brain is probably resisting the reprogramming. :-) )

        -Peter

        • Steven D'Aprano

          #5
          Re: Filtering out non-readable characters

          On Sat, 16 Jul 2005 10:25:29 -0400, Peter Hansen wrote:
          > Bengt Richter wrote:
          >> >>> identity = ''.join([chr(i) for i in xrange(256)])
          >> >>> unprintable = ''.join([c for c in identity if c not in string.printable])
          >
          > And note that with Python 2.4, in each case the above square brackets
          > are unnecessary (though harmless), because of the arrival of "generator
          > expressions" in the language.

          But to use generator expressions, wouldn't you need an extra pair of round
          brackets?

          e.g. identity = ''.join( ( chr(i) for i in xrange(256) ) )

          with the extra spaces added for clarity.

          That is, the brackets after join make the function call, and the nested
          brackets make the generator. That, at least, is my understanding.
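A short sketch of the rule in question, written for Python 3 (so `range` stands in for `xrange`): when a generator expression is the only argument of a call, the call's own parentheses are enough; with any further argument it needs a pair of its own. The `compiles` helper is just an illustration device.

```python
# Sole argument: the call's own parentheses are enough.
identity = ''.join(chr(i) for i in range(256))

# An extra pair is allowed and means the same thing.
identity_explicit = ''.join((chr(i) for i in range(256)))

# With a second argument, the generator expression needs its own pair.
descending = sorted((n * n for n in range(5)), reverse=True)

def compiles(source):
    """Report whether the compiler accepts the given expression text."""
    try:
        compile(source, '<example>', 'eval')
        return True
    except SyntaxError:
        return False

print(compiles("sorted(n * n for n in range(5), reverse=True)"))    # False
print(compiles("sorted((n * n for n in range(5)), reverse=True)"))  # True
```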



          --
          Steven
          who is still using Python 2.3, and probably will be for quite some time


          • Peter Hansen

            #6
            Re: Filtering out non-readable characters

            Steven D'Aprano wrote:
            > On Sat, 16 Jul 2005 10:25:29 -0400, Peter Hansen wrote:
            >>Bengt Richter wrote:
            >>
            >>> >>> identity = ''.join([chr(i) for i in xrange(256)])
            >>
            >>And note that with Python 2.4, in each case the above square brackets
            >>are unnecessary (though harmless), because of the arrival of "generator
            >>expressions" in the language.
            >
            > But to use generator expressions, wouldn't you need an extra pair of round
            > brackets?
            >
            > eg identity = ''.join( ( chr(i) for i in xrange(256) ) )

            Come on, Steven. Don't tell us you didn't have access to a Python
            interpreter to check before you posted:

            c:\>python
            Python 2.4 (#60, Nov 30 2004, 11:49:19) [MSC v.1310 32 bit (Intel)] on win32
            >>> ''.join(chr(c) for c in range(65, 91))
            'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

            -Peter

            • Bengt Richter

              #7
              Re: Filtering out non-readable characters

              On Sat, 16 Jul 2005 10:25:29 -0400, Peter Hansen <peter@engcorp.com> wrote:

              >Bengt Richter wrote:
              >> >>> identity = ''.join([chr(i) for i in xrange(256)])
              >> >>> unprintable = ''.join([c for c in identity if c not in string.printable])
              >
              >And note that with Python 2.4, in each case the above square brackets
              >are unnecessary (though harmless), because of the arrival of "generator
              >expressions" in the language. (Bengt knows this already, of course, but
              >his brain is probably resisting the reprogramming. :-) )
              >
              Thanks for the nudge. Actually, I know about generator expressions, but
              at some point I must have misinterpreted some bug in my code to mean
              that join in particular didn't like generator expression arguments,
              and wanted lists. Actually it seems to like anything at all that can
              be iterated to produce a sequence of strings. So I'm glad to find that
              join is fine after all, and to get that misap[com?:-)]prehension
              out of my mind ;-)
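To illustrate the point about `join`: it accepts any iterable whose items are strings, not just lists. A small Python 3 sketch (the variable names are only for illustration):

```python
letters = ['a', 'b', 'c']

from_list = ''.join(letters)
from_tuple = ''.join(tuple(letters))
from_generator = ''.join(ch.upper() for ch in letters)
from_map = ''.join(map(str, range(3)))
from_dict = ''.join({'x': 1, 'y': 2})  # iterating a dict yields its keys

# The items themselves must be strings, whatever the container.
try:
    ''.join([1, 2, 3])
except TypeError:
    items_must_be_strings = True
```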

              Regards,
              Bengt Richter

              • George Sakkis

                #8
                Re: Filtering out non-readable characters

                "Bengt Richter" <bokr@oz.net> wrote:
                > >>> identity = ''.join([chr(i) for i in xrange(256)])
                > >>> unprintable = ''.join([c for c in identity if c not in string.printable])

                Or equivalently:

                >>> identity = string.maketrans('','')
                >>> unprintable = identity.translate(identity, string.printable)

                George
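In Python 3, `string.maketrans` is gone, but as far as I know `bytes.maketrans` behaves the same way for this purpose. A sketch of the two lines above under that assumption, working on bytes rather than `str`:

```python
import string

# Two empty arguments give the identity table: byte i maps to byte i.
identity = bytes.maketrans(b'', b'')
assert identity == bytes(range(256))

# Delete the printable bytes from the identity table;
# what survives is exactly the unprintable ones.
printable = string.printable.encode('ascii')
unprintable = identity.translate(identity, printable)

print(len(unprintable))  # 256 byte values minus 100 printable ones
```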


                • Peter Hansen

                  #9
                  Re: Filtering out non-readable characters

                  George Sakkis wrote:
                  > "Bengt Richter" <bokr@oz.net> wrote:
                  >> >>> identity = ''.join([chr(i) for i in xrange(256)])
                  >
                  > Or equivalently:
                  > >>> identity = string.maketrans('','')

                  Wow! That's handy, not to mention undocumented. (At least in the
                  string module docs.) Where did you learn that, George?

                  -Peter

                  • Steven D'Aprano

                    #10
                    Re: Filtering out non-readable characters

                    On Sat, 16 Jul 2005 16:42:58 -0400, Peter Hansen wrote:
                    > Steven D'Aprano wrote:
                    >> On Sat, 16 Jul 2005 10:25:29 -0400, Peter Hansen wrote:
                    >>>Bengt Richter wrote:
                    >>>
                    >>>> >>> identity = ''.join([chr(i) for i in xrange(256)])
                    >>>
                    >>>And note that with Python 2.4, in each case the above square brackets
                    >>>are unnecessary (though harmless), because of the arrival of "generator
                    >>>expressions" in the language.
                    >>
                    >> But to use generator expressions, wouldn't you need an extra pair of round
                    >> brackets?
                    >>
                    >> eg identity = ''.join( ( chr(i) for i in xrange(256) ) )
                    >
                    > Come on, Steven. Don't tell us you didn't have access to a Python
                    > interpreter to check before you posted:

                    Er, as I wrote in my post:

                    "Steven
                    who is still using Python 2.3, and probably will be for quite some time"

                    So, no, I didn't have access to a Python interpreter running version 2.4.

                    I take it then that generator expressions work quite differently
                    than list comprehensions? The equivalent "implied delimiters" for a list
                    comprehension would be something like this:
                    >>> L = [1, 2, 3]
                    >>> L[ i for i in range(2) ]
                      File "<stdin>", line 1
                        L[ i for i in range(2) ]
                             ^
                    SyntaxError: invalid syntax

                    which is a very different result from:
                    >>> L[ [i for i in range(2)] ]
                    Traceback (most recent call last):
                      File "<stdin>", line 1, in ?
                    TypeError: list indices must be integers

                    In other words, a list comprehension must have the [ ] delimiters to be
                    recognised as a list comprehension, EVEN IF the square brackets are there
                    from some other element. But a generator expression doesn't care where the
                    round brackets come from, so long as they are there: they can be part of
                    the function call.
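The contrast described here can still be reproduced on Python 3. The sketch below checks each form with `compile()` so the syntax errors can be caught rather than aborting the script; `syntax_ok` is a made-up helper name, and the exact error wording differs between versions, so none is asserted.

```python
def syntax_ok(source):
    """True if the expression text compiles, False on SyntaxError."""
    try:
        compile(source, '<example>', 'eval')
    except SyntaxError:
        return False
    return True

L = [1, 2, 3]

# Subscript brackets do not double as list-comprehension brackets.
bare_in_subscript = syntax_ok("L[i for i in range(2)]")

# With its own brackets the comprehension compiles, but indexing a list
# with a list then fails at run time with a TypeError.
nested_in_subscript = syntax_ok("L[[i for i in range(2)]]")
try:
    L[[i for i in range(2)]]
    runtime_error = None
except TypeError as exc:
    runtime_error = type(exc).__name__

# Call parentheses, by contrast, do double as generator-expression ones.
bare_in_call = syntax_ok("sum(i for i in range(2))")

print(bare_in_subscript, nested_in_subscript, runtime_error, bare_in_call)
```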

                    I hope that makes sense to you.


                    --
                    Steven

                    • Steven D'Aprano

                      #11
                      Re: Filtering out non-readable characters

                      On Sat, 16 Jul 2005 19:01:50 -0400, Peter Hansen wrote:
                      > George Sakkis wrote:
                      >> "Bengt Richter" <bokr@oz.net> wrote:
                      >>> >>> identity = ''.join([chr(i) for i in xrange(256)])
                      >>
                      >> Or equivalently:
                      >> >>> identity = string.maketrans('','')
                      >
                      > Wow! That's handy, not to mention undocumented. (At least in the
                      > string module docs.) Where did you learn that, George?

                      I can't answer for George, but I also noticed that behaviour. I discovered
                      it by trial and error. I thought, oh what a nuisance that the arguments
                      for maketrans had to include all 256 characters, then I wondered what
                      error you would get if you left some out, and discovered that you didn't
                      get an error at all.

                      That actually disappointed me at the time, because I was looking for
                      behaviour where the missing characters weren't filled in, but I've come to
                      appreciate it since.
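The fill-in behaviour described above is easy to see with a partial mapping. This sketch uses Python 3's `bytes.maketrans`, on the assumption that it mirrors the old `string.maketrans` in this respect: only the listed bytes are remapped, and every other position in the 256-entry table maps to itself.

```python
# Remap just three bytes; the other 253 entries are filled in as identity.
table = bytes.maketrans(b'abc', b'xyz')

print(len(table))                      # always a full 256-entry table
print(b'aabbcc-123'.translate(table))  # only a, b and c are changed

# Every byte not named in the first argument maps to itself.
untouched = all(table[i] == i for i in range(256) if i not in b'abc')
```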


                      --
                      Steven


                      • Steven D'Aprano

                        #12
                        Re: Filtering out non-readable characters

                        Replying to myself... this is getting to be a habit.

                        On Sun, 17 Jul 2005 15:08:12 +1000, Steven D'Aprano wrote:
                        > I hope that makes sense to you.

                        That wasn't meant as a snide little dig at Peter, and I'm sorry if anyone
                        reads it that way. I found myself struggling to explain simply the
                        different behaviour between list comps and generator expressions, and
                        couldn't be sure I was explaining myself as clearly as I wanted. It might
                        have been better if I had left off the "to you".



                        --
                        Steven

                        • Peter Hansen

                          #13
                          Re: Filtering out non-readable characters

                          Steven D'Aprano wrote:
                          > On Sat, 16 Jul 2005 16:42:58 -0400, Peter Hansen wrote:
                          >>Come on, Steven. Don't tell us you didn't have access to a Python
                          >>interpreter to check before you posted:
                          >
                          > Er, as I wrote in my post:
                          >
                          > "Steven
                          > who is still using Python 2.3, and probably will be for quite some time"

                          Sorry, missed that! I don't generally notice signatures much, partly
                          because Thunderbird is smart enough to "grey them out" (the main text is
                          displayed as black, quoted material in blue, and signatures in a light
                          gray.)

                          I don't have a firm answer (though I suspect the language reference
                          does) about when "dedicated" parentheses are required around a generator
                          expression. I just know that, so far, they just work when I want them
                          to. Like most of Python. :-)

                          -Peter

                          • Steven Bethard

                            #14
                            Re: Filtering out non-readable characters

                            Bengt Richter wrote:
                            > Thanks for the nudge. Actually, I know about generator expressions, but
                            > at some point I must have misinterpreted some bug in my code to mean
                            > that join in particular didn't like generator expression arguments,
                            > and wanted lists.

                            I suspect this is bug 905389 [1]:
                            >>> def gen():
                            ...     yield 1
                            ...     raise TypeError('from gen()')
                            ...
                            >>> ''.join([x for x in gen()])
                            Traceback (most recent call last):
                              File "<interactive input>", line 1, in ?
                              File "<interactive input>", line 3, in gen
                            TypeError: from gen()
                            >>> ''.join(x for x in gen())
                            Traceback (most recent call last):
                              File "<interactive input>", line 1, in ?
                            TypeError: sequence expected, generator found

                            I run into this every month or so, and have to remind myself that it
                            means that my generator is raising a TypeError, not that join doesn't
                            accept generator expressions...
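The session above is from Python 2.4. On the CPython 3 releases I am aware of, the masking no longer seems to happen: the generator's own `TypeError` comes through whether `join` is given a list comprehension or a generator expression. A sketch to check this on your own interpreter (the generator yields a string here so that `join`'s item-type check stays out of the picture):

```python
def gen():
    yield 'a'
    raise TypeError('from gen()')

# Build each argument lazily so the exception is raised inside the try.
make_args = [
    lambda: [x for x in gen()],   # list comprehension
    lambda: (x for x in gen()),   # generator expression
]

messages = []
for make_arg in make_args:
    try:
        ''.join(make_arg())
    except TypeError as exc:
        messages.append(str(exc))

print(messages)
```

If both entries read `from gen()`, the original error is being reported rather than replaced.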

                            STeVe

                            [1] http://www.python.org/sf/905389

                            • Michael Ströder

                              #15
                              Re: Filtering out non-readable characters

                              Peter Hansen wrote:
                              > >>> ''.join(chr(c) for c in range(65, 91))
                              > 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

                              Wouldn't this be a candidate for making the Python language stricter?

                              Do you remember old Python versions treating l.append(n1,n2) the same
                              way as l.append((n1,n2))? I'm glad this is forbidden now.

                              Ciao, Michael.
