A 'raw' codec for binary "strings" in Python?

  • Bill Janssen

    A 'raw' codec for binary "strings" in Python?

    I've encountered an issue dealing with strings read from files. I
    read a line from a file, then try to print it out as an ASCII string:

    line = fp.readline()
    print line.encode('US-ASCII', 'replace')

    and of course I get an error like:

    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xd5 in position 1: ordinal not in range(128)

    because the file contained some binary character. You'll notice that
    the problem is in *decoding* the string, not in re-encoding it,
    because I'm using the default "C" locale, and "US-ASCII" is presumed
    for strings. But these strings are *not* US-ASCII, they are raw
    bytes. How do I format a string of raw bytes for conversion to a
    recognized charset encoding for printing?

    There seems to be no 'raw' codec that would capture this. There's no
    way of setting an attribute on a file to express this. It looks like
    the best I can do is

    print string.join([(((ord(x) > 0 and ord(x) < 0x7F) and x) or (r"\x%02x" % ord(x))) for x in line], '')

    which seems extremely inefficient.

    Bill
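One way to get the effect of the 'raw' codec asked about above, sketched in Python 3 rather than the Python 2 used in the post. Decoding with 'latin-1' maps each byte to the code point of the same value, and the 'backslashreplace' error handler (usable for decoding since Python 3.5) escapes every byte that is not ASCII. The function names and sample bytes are illustrative assumptions, not something from the thread.

```python
def raw_to_text(data):
    """Treat bytes as 'raw': byte N becomes code point N, so this never fails."""
    return data.decode("latin-1")


def ascii_with_escapes(data):
    """Decode as ASCII, replacing each non-ASCII byte with a \\xNN escape."""
    return data.decode("ascii", errors="backslashreplace")


sample = b"a\xd5b"  # hypothetical line containing one non-ASCII byte

text = raw_to_text(sample)
escaped = ascii_with_escapes(sample)

# The latin-1 mapping is lossless: encoding again gives back the same bytes.
assert text.encode("latin-1") == sample

print(escaped)
```

Both calls run in C inside the codec machinery, which avoids the per-character Python loop the post calls inefficient.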

  • Erik Max Francis

    #2
    Re: A 'raw' codec for binary "strings" in Python?

    Bill Janssen wrote:
    > You'll notice that
    > the problem is in *decoding* the string, not in re-encoding it,
    > because I'm using the default "C" locale, and "US-ASCII" is presumed
    > for strings. But these strings are *not* US-ASCII, they are raw
    > bytes. How do I format a string of raw bytes for conversion to a
    > recognized charset encoding for printing?

    Since the default encoding is ASCII, those 8-bit octets have no meaning
    unless you do an explicit conversion. Trying to print them _should_
    raise an error, because you're trying to do something that doesn't make
    sense.

    As Gerrit pointed out, it sounds like what you want is repr.

    --
    __ Erik Max Francis && max@alcyone.com && http://www.alcyone.com/max/
    / \ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
    \__/ Liberty is the right to do whatever the law permits.
    -- Charles Louis Montesquieu
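A small sketch of the repr suggestion above, written for Python 3, where a byte string is a `bytes` object. `repr` produces a pure-ASCII description with `\xNN` escapes for non-ASCII bytes and escapes for control characters. The helper name and sample data are hypothetical.

```python
def printable_form(data):
    """Return an ASCII-only rendering of raw bytes, without the b'...' wrapper."""
    # repr(b"...") yields text like b'a\xd5b\n' with the escapes spelled out;
    # slicing drops the leading b' and the trailing quote.
    return repr(data)[2:-1]


sample = b"a\xd5b\n"  # hypothetical line read from a binary file

print(repr(sample))
print(printable_form(sample))
```

Note that the result is meant for display: the escapes describe the bytes but do not assign them a character set.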


    • Michael Hudson

      #3
      Re: A 'raw' codec for binary "strings" in Python?

      Bill Janssen <janssen@parc.com> writes:
      > I've encountered an issue dealing with strings read from files. I
      > read a line from a file, then try to print it out as an ASCII string:
      >
      > line = fp.readline()
      > print line.encode('US-ASCII', 'replace')
      >
      > and of course I get an error like:
      >
      > Traceback (most recent call last):
      > File "<stdin>", line 1, in ?
      > UnicodeDecodeError: 'ascii' codec can't decode byte 0xd5 in position 1: ordinal not in range(128)
      >
      > because the file contained some binary character. You'll notice that
      > the problem is in *decoding* the string, not in re-encoding it,
      > because I'm using the default "C" locale, and "US-ASCII" is presumed
      > for strings.

      Actually, the "C" locale has precisely nothing to do with it.

      > But these strings are *not* US-ASCII, they are raw bytes. How do I
      > format a string of raw bytes for conversion to a recognized charset
      > encoding for printing?

      You don't?

      Wouldn't

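      import string

      def m(c):
          # keep printable ASCII characters, mask everything else
          if c in string.printable:
              return c
          else:
              return '?'

      # 256-character translation table, one entry per possible byte value
      t = ''.join([m(chr(o)) for o in range(256)])

      line.translate(t)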

      make more sense?

      Cheers,
      mwh

      --
      I like silliness in a MP skit, but not in my APIs. :-)
      -- Guido van Rossum, python-dev
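The translation-table idea above carries over to Python 3, where `bytes.translate` takes a 256-byte table. The following is a rough sketch under that assumption; the names `TABLE` and `mask_unprintable` are made up for illustration.

```python
import string

# Byte values that count as printable ASCII (includes space, tab and newline).
PRINTABLE = set(string.printable.encode("ascii"))

# One table entry per byte value: printable bytes map to themselves,
# everything else maps to '?'.
TABLE = bytes(b if b in PRINTABLE else ord("?") for b in range(256))


def mask_unprintable(data):
    """Replace every non-printable byte in data with b'?'."""
    return data.translate(TABLE)


print(mask_unprintable(b"a\xd5b\x00\n"))
```

The table is built once and the per-byte work happens inside `translate`, so it should scale better than a Python-level loop over each character.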
