UnicodeEncodeError while reading xml file (newbie question)

  • nikosk

    UnicodeEncodeError while reading xml file (newbie question)

    I just spent a whole day trying to read an xml file and I got stuck
    with the following error:

    Exception Type: UnicodeEncodeError
    Exception Value: 'charmap' codec can't encode characters in position
    164-167: character maps to <undefined>
    Exception Location: C:\Python25\lib\encodings\cp1252.py in encode,
    line 12

    The string that could not be encoded/decoded was: H_C="ÊÉÍÁ" A_C

    After some tests I can say with confidence that the error comes up
    when python finds those greek characters after H_C="

    The code that reads the file goes like this :

    from xml.etree import ElementTree as ET

    def read_xml(request):
        data = open('live.xml', 'r').read()
        data = data.decode('utf-8', 'replace')
        data = ET.XML(data)

    I've tried all the combinations of str.decode str.encode I could
    think of but nothing.

    Can someone please help ?
  • John Machin

    #2
    Re: UnicodeEncodeError while reading xml file (newbie question)

    On Jun 8, 10:12 am, nikosk <nikos.nikos.nikos.ni...@gmail.com> wrote:
    > I just spent a whole day trying to read an xml file and I got stuck
    > with the following error:
    >
    > Exception Type: UnicodeEncodeError
    > Exception Value: 'charmap' codec can't encode characters in position
    > 164-167: character maps to <undefined>
    > Exception Location: C:\Python25\lib\encodings\cp1252.py in encode,
    > line 12
    >
    > The string that could not be encoded/decoded was: H_C="�� ��" A_C
    >
    > After some tests I can say with confidence that the error comes up
    > when python finds those greek characters after H_C="
    >
    > The code that reads the file goes like this :
    >
    > from xml.etree import ElementTree as ET
    >
    > def read_xml(request):
    >     data = open('live.xml', 'r').read()
    >     data = data.decode('utf-8', 'replace')
    >     data = ET.XML(data)
    >
    > I've tried all the combinations of str.decode str.encode I could
    > think of but nothing.
    >
    > Can someone please help ?

    Perhaps, with some more information:
    (1) the *full* traceback
    (2) what encoding is mentioned up the front of the XML file
    (3) why you think you need to have "data.decode(.....)" at all
    (4) why you think your input file is encoded in utf8 [which may be
    answered by (2)]
    (5) why you are using 'replace' (which would cover up (for a while)
    any non-utf8 characters in your file)
    (6) what "those greek characters" *really* are -- after fiddling with
    encodings in my browser the best I can make of that is four capital
    gamma characters each followed by a garbage byte or a '?'. Do
    something like:

    print repr(open('yourfile.xml', 'rb').read()[before_pos:after_pos])

    (7) are you expecting non-ASCII characters after H_C= ? what
    characters? when you open your xml file in a browser, what do you see
    there?
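
    Points (2) and (6) can both be checked on the raw bytes, without decoding anything first. What follows is a minimal sketch for Python 3 rather than the Python 2.5 used in this thread. The helper names `declared_encoding` and `peek_bytes` are made up for illustration, and the sample bytes stand in for the real file, whose contents are not shown here.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration, e.g.
# <?xml version="1.0" encoding="utf8" ?>
_DECL_RE = re.compile(rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')


def declared_encoding(raw):
    """Return the encoding named in the XML declaration, or None."""
    match = _DECL_RE.match(raw)
    return match.group(1).decode("ascii") if match else None


def peek_bytes(raw, before_pos, after_pos):
    """Return repr() of a byte slice so non-ASCII bytes show up as escapes."""
    return repr(raw[before_pos:after_pos])


# Hypothetical stand-in for open('yourfile.xml', 'rb').read();
# the Greek text is stored as UTF-8 here.
sample = '<?xml version="1.0" encoding="utf8" ?><m H_C="ΚΙΝΑ" A_C="x"/>'.encode("utf-8")

print(declared_encoding(sample))  # utf8
start = sample.index(b'H_C="')
print(peek_bytes(sample, start, start + 14))
# b'H_C="\xce\x9a\xce\x99\xce\x9d\xce\x91"'
```

    Reading in 'rb' mode matters: the escapes show exactly which bytes are in the file, whatever the terminal or browser would have made of them.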


    • nikosk

      #3
      Re: UnicodeEncodeError while reading xml file (newbie question)

      You won't believe how helpful your reply was. I was looking for a
      problem that did not exist.

      You wrote:
      > (3) why you think you need to have "data.decode(.....)" at all
      and after that:
      > (7) are you expecting non-ASCII characters after H_C= ? what
      > characters? when you open your xml file in a browser, what do you see
      > there?

      So I went back to see why I was doing this in the first place (I
      couldn't remember after struggling for so many hours) and opened the
      file in Internet Explorer. The browser wouldn't open it because it
      didn't like the encoding declared in the <xml> tag:

      "System does not support the specified encoding. Error processing
      resource 'http://scores24live.com/xml/live.xml'. Line 1, ..."

      (IE was the only program that complained; FF and some other tools
      opened it without hassle.)

      Then I went back and looked for the original message that got me
      struggling and it was this :
      xml.parsers.expat.ExpatError: unknown encoding: line 1, column 30

      From then on it was easy to see that it was the xml encoding that was
      wrong :
      <?xml version="1.0" encoding="utf8" ?>

      when I switched that to :
      <?xml version="1.0" encoding="utf-8"?>

      everything just worked.
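
      When the file cannot be corrected at the source, the same fix could in principle be applied in code before parsing. This is a minimal sketch for Python 3, not the code used in this thread. `normalize_declaration` is a made-up helper, and it only rewrites names that Python's codec registry resolves to UTF-8.

```python
import codecs
import re
from xml.etree import ElementTree as ET

_DECL_RE = re.compile(rb'^(<\?xml[^>]*?encoding\s*=\s*["\'])([A-Za-z0-9._-]+)(["\'])')


def normalize_declaration(raw):
    """Rewrite UTF-8 aliases such as 'utf8' to the standard name 'utf-8'."""
    match = _DECL_RE.match(raw)
    if not match:
        return raw
    name = match.group(2).decode("ascii")
    try:
        canonical = codecs.lookup(name).name
    except LookupError:
        return raw  # unknown to Python as well; let the parser report it
    if canonical != "utf-8":
        return raw  # only UTF-8 aliases are handled in this sketch
    return match.group(1) + b"utf-8" + match.group(3) + raw[match.end():]


# Hypothetical stand-in for the bytes of live.xml.
raw = '<?xml version="1.0" encoding="utf8" ?><m H_C="ΚΙΝΑ" A_C="x"/>'.encode("utf-8")
fixed = normalize_declaration(raw)

# Pass the bytes straight to the parser: it decodes them itself according
# to the declaration, so no str.decode() call is needed beforehand.
root = ET.fromstring(fixed)
print(root.get("H_C"))  # ΚΙΝΑ
```

      Handing the parser bytes instead of an already decoded string also answers point (3): the decode step was never needed.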

      I can't thank you enough for opening my eyes...

      PS.: The UnicodeEncodeError must have something to do with Java's
      UTF-8 implementation (the xml is produced by Dom4j on a J2EE server).
      Those characters I posted in the original message should have read
      "ΚΙΝΑ" (China in Greek), but after I copy-pasted them into the post
      they came up like this: H_C="�� ��" A_C, which is weird because this
      page is UTF encoded, which means that characters should be 1 or 2
      bytes long. From the message you can see that instead of 4 characters
      it reads 8, which means there was extra information in the string.
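
      For what it is worth, 8 bytes for 4 Greek capitals is what UTF-8 normally produces, since each of these letters takes 2 bytes, and cp1252 has no Greek letters at all. The following Python 3 check is offered as a sketch of those two facts, not as a diagnosis of this particular server or page.

```python
text = "ΚΙΝΑ"

utf8_bytes = text.encode("utf-8")
print(len(text), len(utf8_bytes))  # 4 8

# cp1252 (the Windows Western European code page) has no Greek capitals,
# so encoding to it fails with a 'charmap' UnicodeEncodeError.
try:
    text.encode("cp1252")
    cp1252_error = None
except UnicodeEncodeError as exc:
    cp1252_error = exc
print(cp1252_error)

# The UTF-8 bytes read with a single-byte Western codec turn into
# twice as many characters as the original text.
mojibake = utf8_bytes.decode("latin-1")
print(len(mojibake))  # 8

# The "ÊÉÍÁ" shown in the first post is consistent with ISO-8859-7 bytes
# for the same word being displayed as cp1252.
print(text.encode("iso8859_7").decode("cp1252"))  # ÊÉÍÁ
```

      So a doubled character count points to bytes being displayed with the wrong codec, not to extra data in the string.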

      If the above is true then it might be something for python developers
      to address in the language. If someone wishes to investigate further,
      here is the link for info on Java's UTF-8 and the file that caused the
      UnicodeEncodeError:
      http://en.wikipedia.org/wiki/UTF-8 (the Java section)

      the xml file: http://dsigned.gr/live.xml

