ElementTree cannot parse UTF-8 Unicode?

  • Erik  Bethke

    ElementTree cannot parse UTF-8 Unicode?

    Hello All,

    I am getting a "not well-formed" error at the beginning of the Korean
    text in the second example. Am I doing something wrong with how I am
    encoding my Korean? Do I need more of a wrapper around it than simple
    quotes? Is there some sort of XML syntax for indicating a Unicode
    string, or does the ElementTree library just not support reading
    Unicode?

    here is my test snippet:

    from elementtree import ElementTree
    vocabXML = ElementTree.parse('test2.xml').getroot()

    where I have two data files:

    this one works:
    <?xml version="1.0" encoding="UTF-8"?>
    <Vocab>
    <Word L1='Hahha'></Word>
    </Vocab>

    this one fails:
    <?xml version="1.0" encoding="UTF-8"?>
    <Vocab>
    <Word L1="어녕하세요!"></Word>
    </Vocab>

  • Fredrik Lundh

    #2
    Re: ElementTree cannot parse UTF-8 Unicode?

    Erik Bethke wrote:
    > I am getting a "not well-formed" error at the beginning of the Korean
    > text in the second example. Am I doing something wrong with how I am
    > encoding my Korean? Do I need more of a wrapper around it than simple
    > quotes? Is there some sort of XML syntax for indicating a Unicode
    > string, or does the ElementTree library just not support reading
    > Unicode?

    XML is Unicode, and ElementTree supports all common encodings just
    fine (including UTF-8).
    > this one fails:
    > <?xml version="1.0" encoding="UTF-8"?>
    > <Vocab>
    > <Word L1="?????!"></Word>
    > </Vocab>

    this works just fine on my machine.

    what's the exact error message?

    what does

    print repr(open("test2.xml").read())

    print on your machine?

    what happens if you attempt to parse

    <Vocab>
    <Word L1="&#xC5B4;&#xB155;&#xD558;&#xC138;&#xC694;!" />
    </Vocab>

    ?

    </F>
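
For readers following along on a current interpreter, here is a minimal sketch of the two checks suggested above. It assumes Python 3 and the standard-library xml.etree.ElementTree module rather than the standalone elementtree package used in this thread. describe_bytes is a hypothetical helper, and writing the Korean text as numeric character references is an assumption about what the pure-ASCII test document looked like.

```python
import xml.etree.ElementTree as ET


def describe_bytes(data: bytes) -> str:
    """Report whether raw file bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return "not valid UTF-8 at byte offset %d" % exc.start
    return "valid UTF-8"


# Assumption: a pure-ASCII document that spells the Korean text as
# numeric character references, so no file encoding issue can interfere.
ascii_doc = (
    b"<Vocab>\n"
    b'<Word L1="&#xC5B4;&#xB155;&#xD558;&#xC138;&#xC694;!" />\n'
    b"</Vocab>"
)

root = ET.fromstring(ascii_doc)
value = root.find("Word").get("L1")

print(describe_bytes(ascii_doc))  # the bytes are plain ASCII, hence valid UTF-8
print(ascii(value))               # escaped view of the parsed attribute
```

If the character-reference document parses but the real file does not, the parser is probably fine and the bytes on disk are the thing to inspect, for example with describe_bytes(open("test2.xml", "rb").read()).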




    • Erik  Bethke

      #3
      Re: ElementTree cannot parse UTF-8 Unicode?

      Hello Fredrik,

      1) The exact error is in line 1160 of self._parser.Parse(data, 0):
      xml.parsers.expat.ExpatError: not well-formed (invalid token): line 3,
      column 16

      2) You are right in that the print of the file read works just fine.

      3) You are also right in that the digitally encoded unicode also works
      fine. However, this solution has two new problems:

      1) The xml file is now not human readable
      2) After ElementTree gets done parsing it, I am feeding the text to a
      wx.TextCtrl via .SetValue() but that is now giving me an error message
      of being unable to convert that style of string

      So it seems to me, that ElementTree is just not expecting to run into
      the Korean characters for it is at column 16 that these begin. Am I
      formatting the XML properly?

      Thank you,
      -Erik
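
One plausible way to reproduce this kind of "not well-formed (invalid token)" error is a file whose declaration says UTF-8 while its bytes are in some other encoding. The sketch below assumes Python 3 and the standard-library xml.etree.ElementTree. cp949 is only an illustrative guess at a legacy Korean code page, since the thread never establishes which encoding the failing file was actually in, and try_parse is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

TEMPLATE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<Vocab>\n"
    '<Word L1="%s"></Word>\n'
    "</Vocab>"
)
text = "\uc5b4\ub155\ud558\uc138\uc694!"

# Same document text, two different byte encodings on "disk".
good = (TEMPLATE % text).encode("utf-8")
bad = (TEMPLATE % text).encode("cp949")  # assumption: legacy code page bytes


def try_parse(data: bytes) -> str:
    """Return the L1 attribute, or a description of the parse failure."""
    try:
        return ET.fromstring(data).find("Word").get("L1")
    except ET.ParseError as exc:
        return "ParseError: %s" % exc


print(ascii(try_parse(good)))  # parses, because bytes match the declaration
print(try_parse(bad))          # fails, because the bytes are not valid UTF-8
```

The point of the comparison is that the XML text is identical in both cases. Only the byte encoding differs, which is why the file can look correct in an editor and still be rejected by the parser.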


      • Jeremy Bowers

        #4
        Re: ElementTree cannot parse UTF-8 Unicode?

        On Wed, 19 Jan 2005 16:35:23 -0800, Erik Bethke wrote:
        > So it seems to me, that ElementTree is just not expecting to run into the
        > Korean characters for it is at column 16 that these begin. Am I
        > formatting the XML properly?

        You should post the file somewhere on the web. (I wouldn't expect Usenet
        to transmit it properly.)

        (Just jumping in to possibly save you a reply cycle.)


        • Fredrik Lundh

          #5
          Re: ElementTree cannot parse UTF-8 Unicode?

          Erik Bethke wrote:
          > 2) You are right in that the print of the file read works just fine.

          but what does it look like? I saved a raw copy of your original mail,
          fixed the quoted-printable encoding, and got a UTF-8 encoded file
          that works just fine. the thing you've been parsing, and that you've
          cut and pasted into your mail, must be different, in some way.
          > 3) You are also right in that the digitally encoded unicode also works
          > fine. However, this solution has two new problems:

          that was just a test to make sure that your version of elementtree could
          handle Unicode characters on your platform.
          > 1) The xml file is now not human readable
          > 2) After ElementTree gets done parsing it, I am feeding the text to a
          > wx.TextCtrl via .SetValue() but that is now giving me an error message
          > of being unable to convert that style of string

          on my machine, the L1 attribute contains a Unicode string:
          >>> print repr(root.find("Word").get("L1"))
          u'\uc5b4\ub155\ud558\uc138\uc694!'

          what does it give you on your machine? (looks like wxPython cannot handle
          Unicode strings, but can that really be true?)
          > So it seems to me, that ElementTree is just not expecting to run into
          > the Korean characters for it is at column 16 that these begin. Am I
          > formatting the XML properly?

          nobody knows...

          </F>




          • Do Re Mi chel La Si Do

            #6
            Re: ElementTree cannot parse UTF-8 Unicode?

            Hi !
            >>> ...Usenet to transmit it properly

            newsgroups (NNTP) : yes, it does it
            usenet : perhaps (that depends on the newsgroups)
            clp : no





            Michel Claveau



            • Jorge Luiz Godoy Filho

              #7
              Re: ElementTree cannot parse UTF-8 Unicode?

              Fredrik Lundh, Quinta 20 Janeiro 2005 05:17, wrote:
              > what does it give you on your machine? (looks like wxPython cannot handle
              > Unicode strings, but can that really be true?)

              It does support Unicode if it was built to do so...

              --
              Godoy. <godoy@ieee.org>


              • Fredrik Lundh

                #8
                Re: ElementTree cannot parse UTF-8 Unicode?

                Jorge Luiz Godoy Filho wrote:
                >> what does it give you on your machine? (looks like wxPython cannot handle
                >> Unicode strings, but can that really be true?)
                >
                > It does support Unicode if it was built to do so...

                Python has supported Unicode in release 1.6, 2.0, 2.1, 2.2, 2.3 and 2.4, so
                you might think that Unicode should be enabled by default in a UI toolkit for
                Python...

                </F>




                • Erik  Bethke

                  #9
                  Re: ElementTree cannot parse UTF-8 Unicode?

                  There is something wrong with the physical file... I downloaded a
                  trial version of XML Spy Home Edition, built an equivalent of the
                  Korean test file, and tried it. It got past the ElementTree error,
                  and now I am stuck with the wxEditCtrl error.

                  To build the xml file in the first place I had code that looked like
                  this:

                  d = wxFileDialog(self, message="Choose a file",
                                   defaultDir=os.getcwd(), defaultFile="",
                                   wildcard="*.xml",
                                   style=wx.SAVE | wxOVERWRITE_PROMPT | wx.CHANGE_DIR)
                  if d.ShowModal() == wx.ID_OK:
                      # This returns a Python list of files that were selected.
                      paths = d.GetPaths()
                      layout = '<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n'
                      L1Word = self.t1.GetValue()
                      L2Word = 'undefined'

                      layout += '<Vocab>\n'
                      layout += ' <Word L1=\'' + L1Word + '\'></Word>\n'
                      layout += '</Vocab>'
                      open(paths[0], 'w').write(layout)
                  d.Destroy()

                  So apparently there is something wrong with physically constructing the
                  file in this manner?

                  Thank you,
                  -Erik


                  • Fredrik Lundh

                    #10
                    Re: ElementTree cannot parse UTF-8 Unicode?

                    Erik Bethke wrote:
                    > layout += '<Vocab>\n'
                    > layout += ' <Word L1=\'' + L1Word + '\'></Word>\n'

                    what does "print repr(L1Word)" print (that is, what does wxPython return?).
                    it should be a Unicode string, but that would give you an error when you write
                    it out:
                    >>> f = open("file.txt", "w")
                    >>> f.write(u'\uc5b4\ub155\ud558\uc138\uc694!')
                    Traceback (most recent call last):
                      File "<stdin>", line 1, in ?
                    UnicodeEncodeError: 'ascii' codec can't encode characters
                    in position 0-4: ordinal not in range(128)

                    have you hacked the default encoding in site/sitecustomize?

                    what happens if you replace the L1Word term with L1Word.encode("utf-8")?

                    can you post the repr() (either of what's in your file or of the thing, whatever
                    it is, that wxPython returns...)

                    </F>
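
The difference between the failing implicit conversion and an explicit encode can be sketched on Python 3 as follows. This is an illustration under assumptions, not the code from the thread: Python 3 has no implicit ASCII conversion on write, so the sketch requests the ASCII codec explicitly to imitate what Python 2 did behind the scenes, and the temporary file path is made up for the example.

```python
import os
import tempfile

text = "\uc5b4\ub155\ud558\uc138\uc694!"
path = os.path.join(tempfile.mkdtemp(), "file.txt")  # hypothetical scratch file

# Imitate the implicit ASCII conversion: the Korean characters have no
# ASCII representation, so the write fails before anything useful is saved.
ascii_error = None
try:
    with open(path, "w", encoding="ascii") as f:
        f.write(text)
except UnicodeEncodeError as exc:
    ascii_error = exc

# Encode explicitly and write the resulting bytes unchanged.
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

with open(path, "rb") as f:
    raw = f.read()

print(type(ascii_error).__name__)
print(repr(raw))
```

Writing bytes that were encoded on purpose keeps the file contents in step with the encoding named in the XML declaration, which is what the parser relies on.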




                    • Erik  Bethke

                      #11
                      Re: ElementTree cannot parse UTF-8 Unicode?

                      That was a great clue. I am an idiot and tapped on the wrong download
                      link... now I can read and parse the xml file fine - as long as I
                      create it in XML spy - if I create it by this method:

                      d = wxFileDialog(self, message="Choose a file",
                                       defaultDir=os.getcwd(), defaultFile="",
                                       wildcard="*.xml",
                                       style=wx.SAVE | wxOVERWRITE_PROMPT | wx.CHANGE_DIR)
                      if d.ShowModal() == wx.ID_OK:
                          # This returns a Python list of files that were selected.
                          paths = d.GetPaths()
                          layout = '<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n'
                          L1Word = self.t1.GetValue()
                          L2Word = 'undefined'

                          layout += '<Vocab>\n'
                          layout += ' <Word L1=\'' + L1Word + '\'></Word>\n'
                          layout += '</Vocab>'
                          open(paths[0], 'w').write(layout)

                      I get hung up on the write statement. I am off to look for a
                      Unicode-capable file write, I think...

                      -Erik


                      • Erik  Bethke

                        #12
                        Re: ElementTree cannot parse UTF-8 Unicode?

                        Woo-hoo! Everything is working now!

                        Thank you everyone!

                        The TWO problems I had:

                        1) I needed to save my XML file in the first place with this code:
                        f = codecs.open(paths[0], 'w', 'utf8')
                        2) I needed to download the UNICODE version of wxPython, duh.

                        So why are there non-UNICODE versions of wxPython??? To save memory or
                        something???

                        Thank you all!

                        Best!
                        -Erik
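
A rough Python 3 counterpart of the first fix, for anyone landing here later. It swaps codecs.open for the built-in open with an encoding argument, which plays the same role on Python 3, and it adds xml.sax.saxutils.quoteattr so that quotes or ampersands in the word cannot break the attribute. That escaping step is an addition not discussed in the thread. save_vocab and load_l1 are hypothetical names, and the wx dialog is left out so the sketch runs on its own.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def save_vocab(path, l1_word):
    """Write a one-word vocab file as UTF-8 (hypothetical helper)."""
    layout = '<?xml version="1.0" encoding="UTF-8"?>\n'
    layout += "<Vocab>\n"
    # quoteattr adds the surrounding quotes and escapes &, < and >.
    layout += " <Word L1=%s></Word>\n" % quoteattr(l1_word)
    layout += "</Vocab>"
    # The encoding argument does the job codecs.open did in the fix above.
    with open(path, "w", encoding="utf-8") as f:
        f.write(layout)


def load_l1(path):
    """Parse the file back and return the L1 attribute of the first Word."""
    return ET.parse(path).getroot().find("Word").get("L1")


path = os.path.join(tempfile.mkdtemp(), "test2.xml")  # scratch location
save_vocab(path, "\uc5b4\ub155\ud558\uc138\uc694!")
print(ascii(load_l1(path)))
```

Round-tripping through load_l1 is a cheap way to confirm that the bytes written really match the encoding named in the declaration.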


                        • Jarek Zgoda

                          #13
                          Re: ElementTree cannot parse UTF-8 Unicode?

                          Erik Bethke wrote:
                          > So why are there non-UNICODE versions of wxPython??? To save memory or
                          > something???

                          Win95, Win98, WinME have problems with unicode. GTK1 does not support
                          unicode at all.

                          --
                          Jarek Zgoda
                          http://jpa.berlios.de/ | http://www.zgodowie.org/


                          • Martin v. Löwis

                            #14
                            Re: ElementTree cannot parse UTF-8 Unicode?

                            Jarek Zgoda wrote:
                            >> So why are there non-UNICODE versions of wxPython??? To save memory or
                            >> something???
                            >
                            >
                            > Win95, Win98, WinME have problems with unicode.

                            This problem can be solved - on W9x, wxPython would have to
                            pass all Unicode strings to WideCharToMultiByte, using
                            CP_ACP, and then pass the result to the API function.

                            Regards,
                            Martin
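
To give a feel for what converting Unicode text to an ANSI code page implies, here is a small Python 3 sketch. It does not call WideCharToMultiByte. Python's str.encode stands in for it, and the two code pages (cp949 for a Korean system, cp1252 for a Western European one) are assumptions chosen for illustration.

```python
text = "\uc5b4\ub155\ud558\uc138\uc694!"

# On a system whose ANSI code page covers Hangul, the conversion is lossless.
korean_acp = text.encode("cp949")
round_trip = korean_acp.decode("cp949")

# On a system whose ANSI code page has no Hangul, every unsupported
# character has to be substituted, so the original text is lost.
western_acp = text.encode("cp1252", errors="replace")

print(round_trip == text)
print(western_acp)
```

So a conversion of this kind only preserves text that the active code page can represent, which is the practical limit of an ANSI build.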


                            • Stephen Waterbury

                              #15
                              wxPython unicode/ansi builds [was Re: ElementTree cannot parse UTF-8 Unicode?]

                              Martin v. Löwis wrote:
                              > Jarek Zgoda wrote:
                              >>> So why are there non-UNICODE versions of wxPython??? To save memory or
                              >>> something???

                              Robin Dunn has an explanation here:



                              .... which is the first hit from a Google search on
                              "wxpython unicode build".

                              Also, from the wxPython downloads page:

                              "There are two versions of wxPython for each of the supported
                              Python versions on Win32. They are nearly identical, except one
                              of them has been compiled with support for the Unicode version of
                              the platform APIs. If you don't know what that means then you
                              probably don't need the Unicode version, get the ANSI version
                              instead. The Unicode version works best on Windows NT/2000/XP. It
                              will also mostly work on Windows 95/98/Me systems, but it is
                              based on a Microsoft hack called MSLU (or unicows.dll) that
                              translates unicode API calls to ansi API calls, but the coverage
                              of the API is not complete so there are some difficult bugs
                              lurking in there."

                              Steve
