converting to and from octal escaped UTF-8

  • Michael Goerz

    converting to and from octal escaped UTF-8

    Hi,

    I am writing unicode strings into a special text file that requires
    non-ASCII characters to be written as octal-escaped UTF-8 codes.

    For example, the letter "Í" (latin capital I with acute, code point 205)
    would come out as "\303\215".

    I will also have to read back from the file later on and convert the
    escaped characters back into a unicode string.

    Does anyone have any suggestions on how to go from "Í" to "\303\215" and
    vice versa?

    I know I can get the code point by doing
    >>> "Í".decode('utf-8').encode('unicode_escape')
    but there doesn't seem to be any similar method for getting the octal
    escaped version.

    Thanks,
    Michael
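
As a quick check of the example in the question, the following sketch walks "Í" through the steps: code point, UTF-8 bytes, octal escapes. It assumes Python 3 (the thread itself is written for Python 2), and the variable names are purely illustrative.

```python
# Assumes Python 3; the thread's own snippets use Python 2 syntax.
ch = "\u00cd"                     # "Í", LATIN CAPITAL LETTER I WITH ACUTE
assert ord(ch) == 205             # the code point quoted in the question

utf8 = ch.encode("utf-8")         # two bytes: 0xC3 0x8D
# Format each byte as a backslash followed by three octal digits.
escaped = "".join("\\%03o" % b for b in utf8)

print(escaped)                    # prints \303\215
```

Iterating over a bytes object in Python 3 yields integers, so no ord() call is needed here, unlike in the Python 2 snippets in this thread.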
  • Michael Goerz

    #2
    Re: converting to and from octal escaped UTF-8

    I've come up with the following solution. It's not very pretty, but it
    works (no bugs, I hope). Can anyone think of a better way to do it?

    Michael
    _________

    import binascii

    def escape(s):
        hexstring = binascii.b2a_hex(s)
        result = ""
        while len(hexstring) > 0:
            (hexbyte, hexstring) = (hexstring[:2], hexstring[2:])
            octbyte = oct(int(hexbyte, 16)).zfill(3)
            result += "\\" + octbyte[-3:]
        return result

    def unescape(s):
        result = ""
        while len(s) > 0:
            if s[0] == "\\":
                (octbyte, s) = (s[1:4], s[4:])
                try:
                    result += chr(int(octbyte, 8))
                except ValueError:
                    # not an octal escape: keep the backslash, rescan the rest
                    result += "\\"
                    s = octbyte + s
            else:
                result += s[0]
                s = s[1:]
        return result

    print escape("\303\215")
    print unescape('adf\\303\\215adf')


    • MonkeeSage

      #3
      Re: converting to and from octal escaped UTF-8

      On Dec 2, 8:38 pm, Michael Goerz <answer...@8439.e4ward.com> wrote:
      > I've come up with the following solution. It's not very pretty, but it
      > works (no bugs, I hope). Can anyone think of a better way to do it?
      Looks like escape() can be a bit simpler...

      def escape(s):
          result = []
          for char in s:
              result.append("\%o" % ord(char))
          return ''.join(result)

      Regards,
      Jordan


      • Michael Goerz

        #4
        Re: converting to and from octal escaped UTF-8

        Very neat! Thanks a lot...
        Michael


        • Michael Spencer

          #5
          Re: converting to and from octal escaped UTF-8

          Perhaps something along the lines of:
          >>> def encode(source):
          ...     return "".join("\%o" % ord(c) for c in source.encode('utf8'))
          ...
          >>> def decode(encoded):
          ...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
          ...     return bytes.decode('utf8')
          ...
          >>> encode(u"Í")
          '\\303\\215'
          >>> print decode(_)
          Í
          >>>
          HTH
          Michael


          • MonkeeSage

            #6
            Re: converting to and from octal escaped UTF-8

            On Dec 2, 11:46 pm, Michael Spencer <m...@telcopartners.com> wrote:
            > >>> def decode(encoded):
            > ...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
            > ...     return bytes.decode('utf8')
            Nice one. :) If I might suggest a slight variation to handle cases
            where the "encoded" string contains plain text as well as octal
            escapes...

            import re

            def decode(encoded):
                for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
                    encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
                return encoded.decode('utf8')

            This way it can handle both "\\141\\144\\146\\303\\215\\141\\144\\146"
            as well as "adf\\303\\215adf".

            Regards,
            Jordan


            • MonkeeSage

              #7
              Re: converting to and from octal escaped UTF-8

              On Dec 3, 1:31 am, MonkeeSage <MonkeeS...@gmail.com> wrote:
              > def decode(encoded):
              >     for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
              err...

              def decode(encoded):
                  for octc in re.findall(r'\\(\d{3})', encoded):
                      encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
                  return encoded.decode('utf8')
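
For comparison, here is a minimal single-pass sketch of the same idea, using re.sub with a callback instead of repeated replace calls. It assumes Python 3, where the escaped text is first turned into bytes. The name decode_octal and the stricter pattern (first digit 0-3, all digits octal, so the value fits in one byte) are assumptions and were not posted in the thread.

```python
import re

# Backslash followed by three octal digits whose value is at most 0o377.
_OCTAL = re.compile(rb"\\([0-3][0-7]{2})")

def decode_octal(text):
    """Turn 3-digit octal escapes in `text` back into characters (one pass)."""
    raw = text.encode("utf-8")
    # Each match is replaced by the single byte it stands for.
    raw = _OCTAL.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    return raw.decode("utf-8")

print(decode_octal("adf\\303\\215adf"))   # prints adfÍadf
```

Because each escape is substituted exactly where it was matched, text produced by one replacement is never scanned again. That matters for input such as `\134101\101`, where a decoded backslash ends up in front of digits.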


              • Michael Goerz

                #8
                Re: converting to and from octal escaped UTF-8

                Great suggestions from both of you! I came up with my "final" solution
                based on them. It encodes only non-ASCII and non-printable characters,
                and stays in unicode strings for both input and output. Low ASCII values
                now also encode into a 3-digit octal sequence, so that decode can catch
                them properly.

                Thanks a lot,
                Michael

                ____________

                import re

                def encode(source):
                    encoded = ""
                    for character in source:
                        if (ord(character) < 32) or (ord(character) > 128):
                            for byte in character.encode('utf8'):
                                encoded += ("\%03o" % ord(byte))
                        else:
                            encoded += character
                    return encoded.decode('utf-8')

                def decode(encoded):
                    decoded = encoded.encode('utf-8')
                    for octc in re.findall(r'\\(\d{3})', decoded):
                        decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
                    return decoded.decode('utf8')


                orig = u"blaÍblub" + chr(10)
                enc = encode(orig)
                dec = decode(enc)
                print orig
                print enc
                print dec
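
The remark about 3-digit sequences can be illustrated with a small sketch: without zero padding, a low ASCII escape followed by a digit becomes indistinguishable from a different escape. This assumes Python 3, and decode3 is a hypothetical helper that only handles single-byte (ASCII) escapes, which is enough for the demonstration.

```python
import re

def decode3(text):
    # Replace every backslash followed by exactly three octal digits.
    return re.sub(r"\\([0-3][0-7]{2})", lambda m: chr(int(m.group(1), 8)), text)

newline_then_one = "\n1"   # a newline (code 10) followed by the digit "1"

# Escape only control characters, once without and once with zero padding.
unpadded = "".join("\\%o" % ord(c) if ord(c) < 32 else c for c in newline_then_one)
padded = "".join("\\%03o" % ord(c) if ord(c) < 32 else c for c in newline_then_one)

print(unpadded, repr(decode3(unpadded)))   # \121 'Q'
print(padded, repr(decode3(padded)))       # \0121 '\n1'
```

With "%o" the newline becomes `\12`, and the following "1" turns it into `\121`, which decodes to "Q". With "%03o" the escape is always three digits long, so the decoder knows where it ends.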


                • Piet van Oostrum

                  #9
                  Re: converting to and from octal escaped UTF-8

                  >>>>> Michael Goerz <answer654@8439.e4ward.com> (MG) wrote:
                  >MG> if (ord(character) < 32) or (ord(character) > 128):
                  If you encode chars < 32, it seems more appropriate to also encode 127.

                  Moreover, your code is quadratic in the size of the string, so for
                  long strings it would be better to use join.
                  --
                  Piet van Oostrum <piet@cs.uu.nl>
                  URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
                  Private email: piet@vanoostrum.org
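
Both suggestions can be sketched together as follows: the pieces are collected in a list and joined once at the end, and the test is widened so that 127 (DEL) and 128 are escaped too. This assumes Python 3; the name encode_octal and the exact condition (code < 32 or code >= 127) are one possible reading of the advice, not code from the thread.

```python
def encode_octal(source):
    """Escape control characters, DEL and non-ASCII characters as \\ooo."""
    parts = []
    for ch in source:
        code = ord(ch)
        if code < 32 or code >= 127:
            # One 3-digit octal escape per UTF-8 byte of the character.
            parts.extend("\\%03o" % b for b in ch.encode("utf-8"))
        else:
            parts.append(ch)
    # A single join avoids rebuilding the string on every iteration.
    return "".join(parts)

print(encode_octal("bla\u00cdblub\n"))   # prints bla\303\215blub\012
```

Note that a literal backslash in the input is passed through unchanged, so this sketch shares the thread's assumption that the source text does not itself contain sequences that look like octal escapes.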


                  • MonkeeSage

                    #10
                    Re: converting to and from octal escaped UTF-8

                    On Dec 3, 8:10 am, Michael Goerz <answer...@8439.e4ward.com> wrote:
                    > def decode(encoded):
                    >     decoded = encoded.encode('utf-8')
                    >     for octc in re.findall(r'\\(\d{3})', decoded):
                    >         decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
                    >     return decoded.decode('utf8')
                    An optimization... in decode() store matches as keys in a dict, so you
                    only do the string replacement once for each unique character...

                    def decode(encoded):
                        decoded = encoded.encode('utf-8')
                        matches = {}
                        for octc in re.findall(r'\\(\d{3})', decoded):
                            matches[octc] = None
                        for octc in matches:
                            decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
                        return decoded.decode('utf8')

                    Untested...

                    Regards,
                    Jordan
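
A rough Python 3 rendering of the same optimization might use a set in place of the dict with dummy values. The name decode_unique is hypothetical, and the sketch assumes the escapes only stand for non-printable or non-ASCII bytes, as produced by the encode() posted earlier in the thread.

```python
import re

def decode_unique(encoded):
    """Decode 3-digit octal escapes, replacing each distinct escape once."""
    raw = encoded.encode("utf-8")
    # set() keeps one entry per distinct escape, like the dict keys above.
    for octc in set(re.findall(rb"\\([0-3][0-7]{2})", raw)):
        raw = raw.replace(b"\\" + octc, bytes([int(octc, 8)]))
    return raw.decode("utf-8")

print(decode_unique("\\303\\215 and \\303\\215"))   # prints Í and Í
```

Each bytes.replace call already handles every occurrence of its escape, so the loop runs once per distinct escape rather than once per match.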
