not quite 1252

  • Anton Vredegoor

    not quite 1252

    I'm trying to import text from an open office document (save as .sxw and
    read the data from content.xml inside the sxw-archive using
    elementtree and such tools).

    The encoding that gives me the least problems seems to be cp1252,
    however it's not completely perfect because there are still characters
    in it like \93 or \94. Has anyone handled this before? I'd rather not
    reinvent the wheel and start translating strings 'by hand'.

    Anton
  • Fredrik Lundh

    #2
    Re: not quite 1252

    Anton Vredegoor wrote:
    > I'm trying to import text from an open office document (save as .sxw and
    > read the data from content.xml inside the sxw-archive using
    > elementtree and such tools).
    >
    > The encoding that gives me the least problems seems to be cp1252,
    > however it's not completely perfect because there are still characters
    > in it like \93 or \94. Has anyone handled this before?

    this might help:

    http://effbot.org/zone/unicode-gremlins.htm

    </F>






    • Anton Vredegoor

      #3
      Re: not quite 1252

      Fredrik Lundh wrote:
      > Anton Vredegoor wrote:
      >
      >> I'm trying to import text from an open office document (save as .sxw and
      >> read the data from content.xml inside the sxw-archive using
      >> elementtree and such tools).
      >>
      >> The encoding that gives me the least problems seems to be cp1252,
      >> however it's not completely perfect because there are still characters
      >> in it like \93 or \94. Has anyone handled this before?
      >
      > this might help:
      >
      > http://effbot.org/zone/unicode-gremlins.htm

      Thanks a lot! The code below not only made the strange chars go away,
      but it also fixed the xml-parsing errors ... Maybe it's useful to
      someone else too, use at own risk though.

      Anton

      from gremlins import kill_gremlins
      from zipfile import ZipFile, ZIP_DEFLATED

      def repair(infn, outfn):
          zin = ZipFile(infn, 'r', ZIP_DEFLATED)
          zout = ZipFile(outfn, 'w', ZIP_DEFLATED)
          for x in zin.namelist():
              data = zin.read(x)
              if x == 'contents.xml':
                  zout.writestr(x, kill_gremlins(data).encode('cp1252'))
              else:
                  zout.writestr(x, data)
          zout.close()

      def test():
          infn = "xxxx.sxw"
          outfn = 'dg.sxw'
          repair(infn, outfn)

      if __name__ == '__main__':
          test()


      • Martin v. Löwis

        #4
        Re: not quite 1252

        Anton Vredegoor wrote:
        > The encoding that gives me the least problems seems to be cp1252,
        > however it's not completely perfect because there are still characters
        > in it like \93 or \94. Has anyone handled this before? I'd rather not
        > reinvent the wheel and start translating strings 'by hand'.

        Not sure I understand the question. If you process data in cp1252,
        then \x93 and \x94 are legal characters, and the Python codec should
        support them just fine.

        Regards,
        Martin


        • Anton Vredegoor

          #5
          Re: not quite 1252

          Martin v. Löwis wrote:
          > Not sure I understand the question. If you process data in cp1252,
          > then \x93 and \x94 are legal characters, and the Python codec should
          > support them just fine.

          Tell that to the guys from open-office.

          Anton


          • Serge Orlov

            #6
            Re: not quite 1252


            Anton Vredegoor wrote:
            > I'm trying to import text from an open office document (save as .sxw and
            > read the data from content.xml inside the sxw-archive using
            > elementtree and such tools).
            >
            > The encoding that gives me the least problems seems to be cp1252,
            > however it's not completely perfect because there are still characters
            > in it like \93 or \94. Has anyone handled this before? I'd rather not
            > reinvent the wheel and start translating strings 'by hand'.

            I extracted content.xml from a test file and the header is:
            <?xml version="1.0" encoding="UTF-8"?>

            So any xml library should handle it just fine, without you trying to
            guess the encoding.
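
As a minimal sketch of the point above (modern Python 3 with the standard-library `xml.etree.ElementTree`, not the 2006 `elementtree` package used in this thread): handing the raw bytes of content.xml to the parser lets it honour the declared UTF-8 encoding, so no manual decoding is needed. The helper name `parse_content` and the in-memory sample archive are assumptions made for illustration only.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def parse_content(archive):
    """Parse content.xml from an .sxw archive (path or file-like object)."""
    with zipfile.ZipFile(archive) as zin:
        data = zin.read("content.xml")  # raw bytes, still UTF-8 encoded
    # Passing bytes lets the parser honour the encoding in the XML declaration.
    return ET.fromstring(data)

# Tiny in-memory stand-in for a real .sxw file (hypothetical sample data).
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w", zipfile.ZIP_DEFLATED) as zout:
    xml = '<?xml version="1.0" encoding="UTF-8"?><doc><p>\u201chello\u201d</p></doc>'
    zout.writestr("content.xml", xml.encode("utf-8"))

root = parse_content(sample)
text = root.find("p").text
print(ascii(text))  # ASCII-safe view of the parsed string
```

The parsed text already contains real Unicode quotation marks; nothing has to be guessed or converted by hand.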


            • Martin v. Löwis

              #7
              Re: not quite 1252

              Anton Vredegoor wrote:
              >> Not sure I understand the question. If you process data in cp1252,
              >> then \x93 and \x94 are legal characters, and the Python codec should
              >> support them just fine.
              >
              > Tell that to the guys from open-office.

              Ok, I'll rephrase: Can you please explain your problem again, in
              different words?

              I thought you are trying to export data *from* open-office, and
              your message seems to suggest (without actually saying so) that the
              document contains \x93 and \x94 (you said "there are still characters in
              it like \93 or \94").

              So if that is the case: What is the problem then? If you interpret
              the document as cp1252, and it contains \x93 and \x94, what is
              it that you don't like about that? In yet other words: what actions
              are you performing, what are the results you expect to get, and
              what are the results that you actually get?

              Regards,
              Martin


              • John Machin

                #8
                Re: not quite 1252

                On 27/04/2006 12:49 AM, Anton Vredegoor wrote:
                > Fredrik Lundh wrote:
                >
                >> Anton Vredegoor wrote:
                >>
                >>> I'm trying to import text from an open office document (save as .sxw and
                >>> read the data from content.xml inside the sxw-archive using
                >>> elementtree and such tools).
                >>>
                >>> The encoding that gives me the least problems seems to be cp1252,
                >>> however it's not completely perfect because there are still characters
                >>> in it like \93 or \94. Has anyone handled this before?
                >>
                >> this might help:
                >>
                >> http://effbot.org/zone/unicode-gremlins.htm
                >
                > Thanks a lot! The code below not only made the strange chars go away,
                > but it also fixed the xml-parsing errors

                What xml-parsing errors were they??
                > ... Maybe it's useful to
                > someone else too, use at own risk though.
                >
                > Anton
                >
                > from gremlins import kill_gremlins
                > from zipfile import ZipFile, ZIP_DEFLATED
                >
                > def repair(infn, outfn):
                >     zin = ZipFile(infn, 'r', ZIP_DEFLATED)
                >     zout = ZipFile(outfn, 'w', ZIP_DEFLATED)
                >     for x in zin.namelist():
                >         data = zin.read(x)
                >         if x == 'contents.xml':

                Firstly, this should be 'content.xml', not 'contents.xml'.

                Secondly, as pointed out by Sergei, the data is encoded by OOo as UTF-8
                e.g. what is '\x94' in cp1252 is \u201d which is '\xe2\x80\x9d' in
                UTF-8. The kill_gremlins function is intended to fix Unicode strings
                that have been obtained by decoding 8-bit strings using 'latin1' instead
                of 'cp1252'. When you pump '\xe2\x80\x9d' through the kill_gremlins
                function, it changes the \x80 to a Euro symbol, and leaves the other two
                alone. Because the \x9d is not defined in cp1252, it then causes your
                code to die in a hole when you attempt to encode it as cp1252:
                UnicodeEncodeError: 'charmap' codec can't encode character u'\x9d' in
                position 1761: character maps to <undefined>

                I don't see how this code repairs anything (quite the contrary!), unless
                there's some side effect of just read/writestr. Enlightenment, please.
                >             zout.writestr(x, kill_gremlins(data).encode('cp1252'))
                >         else:
                >             zout.writestr(x, data)
                >     zout.close()
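
The byte-level argument in this post can be checked with a short sketch in modern Python 3 (the thread itself predates it). The `replace` call below is only a stand-in for the single substitution the post attributes to kill_gremlins (\x80 becomes the Euro sign); it is not the real function, and the variable names are made up for illustration.

```python
right_quote = "\u201d"  # RIGHT DOUBLE QUOTATION MARK

as_utf8 = right_quote.encode("utf-8")      # three bytes in UTF-8
as_cp1252 = right_quote.encode("cp1252")   # a single byte in cp1252
print(as_utf8, as_cp1252)

# Misreading the UTF-8 bytes as Latin-1 yields three separate characters.
misread = as_utf8.decode("latin-1")

# Stand-in for the one substitution described above: \x80 -> Euro sign.
patched = misread.replace("\x80", "\u20ac")

try:
    patched.encode("cp1252")
    bad_char = None
except UnicodeEncodeError as err:
    # U+009D has no slot in cp1252, so encoding fails on that character.
    bad_char = err.object[err.start]

print(ascii(bad_char))
```

Decoding the bytes as UTF-8 in the first place avoids the whole chain of repairs.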


                • Anton Vredegoor

                  #9
                  Re: not quite 1252

                  John Machin wrote:
                  > Firstly, this should be 'content.xml', not 'contents.xml'.

                  Right, the code doesn't do *anything* :-( Thanks for pointing that out.
                  At least it doesn't do much harm either :-|
                  > Secondly, as pointed out by Sergei, the data is encoded by OOo as UTF-8
                  > e.g. what is '\x94' in cp1252 is \u201d which is '\xe2\x80\x9d' in
                  > UTF-8. The kill_gremlins function is intended to fix Unicode strings
                  > that have been obtained by decoding 8-bit strings using 'latin1' instead
                  > of 'cp1252'. When you pump '\xe2\x80\x9d' through the kill_gremlins
                  > function, it changes the \x80 to a Euro symbol, and leaves the other two
                  > alone. Because the \x9d is not defined in cp1252, it then causes your
                  > code to die in a hole when you attempt to encode it as cp1252:
                  > UnicodeEncodeError: 'charmap' codec can't encode character u'\x9d' in
                  > position 1761: character maps to <undefined>

                  Yeah, converting to cp1252 was all that was necessary, like Sergei wrote.
                  > I don't see how this code repairs anything (quite the contrary!), unless
                  > there's some side effect of just read/writestr. Enlightenment, please.

                  You're quite right. I'm extremely embarrassed now. What's left for me is
                  just to explain how it got this bad.

                  First I noticed that by extracting from content.xml using OOopy's
                  getiterator function, some \x94 codes were left inside the document.

                  But that was an *artifact*, because if one prints something using
                  s.__repr__() as is used for example when printing a list of strings
                  (duh) the output is not the same as when one prints with 'print s'. I
                  guess what is called then is str(s).

                  Ok, now we have that out of the way, I hope.

                  So I immediately posted a message about conversion errors, assuming
                  something in the open office xml file was not quite 1252. In fact it
                  wasn't, it was UTF-8 like Sergei wrote, but it was easy to convert it to
                  cp1252, no problem.

                  Then I also noticed that not all xml-tags were printed if I just
                  iterated the xml-tree and filtered out only those elements with a text
                  attribute, like 'if x.text: print x'

                  In fact there are a lot of printable things that haven't got a text
                  attribute, for example some items with tag (xxxx)s.

                  When F pointed me to gremlins there was on this page the following text:

                  <quote>

                  Some applications add CP1252 (Windows, Western Europe) characters to
                  documents marked up as ISO 8859-1 (Latin 1) or other encodings. These
                  characters are not valid ISO-8859-1 characters, and may cause all sorts
                  of problems in processing and display applications.

                  </quote>

                  I concluded that these \x94 codes (which I didn't know about them being
                  a figment of my representation yet) were responsible for my iterator
                  skipping over some text elements, but in fact the iterator skipped them
                  because they had no text attribute even though they were somehow
                  containing text.

                  Now add my natural tendency to see that what I think is the case rather
                  than neutrally observing the world as it is into the mix and of course I
                  saw the \x94 disappear (but that was because I now was printing them
                  straight and not indirectly as elements of a list) and also I thought
                  that now the xml-parsing 'errors' had disappeared but that was just
                  because I saw some text element appear that I thought I hadn't seen
                  before (but in fact it was there all the time).

                  One man's enlightenment sometimes is another's embarrassment, or so it
                  seems. Thanks to you all clearing up my perceptions, and sorry about all
                  the confusion I created.

                  What I want to know next is how to access and print the elements that
                  contain text but have no text attribute, that is, if it's not too taxing
                  on my badly damaged ego.

                  Anton








                  • Anton Vredegoor

                    #10
                    Re: not quite 1252

                    Serge Orlov wrote:
                    > I extracted content.xml from a test file and the header is:
                    > <?xml version="1.0" encoding="UTF-8"?>
                    >
                    > So any xml library should handle it just fine, without you trying to
                    > guess the encoding.

                    Yes my header also says UTF-8. However some kind person sent me an
                    e-mail stating that since I am getting \x94 and such output when using
                    repr (even if str is giving correct output) there could be some problem
                    with the XML-file not being completely UTF-8. Or is there some other
                    reason I'm getting these \x94 codes? Or maybe this is just as it should
                    be and there's no problem at all? Again?

                    Anton

                    'octopussies respond only off-list'


                    • Richard Brodie

                      #11
                      Re: not quite 1252


                      "Anton Vredegoor" <anton.vredegoor@gmail.com> wrote in message
                      news:4451f9e9$1@usenet.zapto.org...

                      > Yes my header also says UTF-8. However some kind person sent me an e-mail stating that
                      > since I am getting \x94 and such output when using repr (even if str is giving correct
                      > output) there could be some problem with the XML-file not being completely UTF-8. Or is
                      > there some other reason I'm getting these \x94 codes?

                      Well that rather depends on what you are doing. If you take utf-8, decode
                      it to Unicode, then re-encode it as cp1252 you'll possibly get \x94. OTOH,
                      if you see '\x94' in a Unicode string, something is wrong somewhere.



                      • John Machin

                        #12
                        Re: not quite 1252

                        On 28/04/2006 9:21 PM, Anton Vredegoor wrote:
                        > Serge Orlov wrote:
                        >
                        >> I extracted content.xml from a test file and the header is:
                        >> <?xml version="1.0" encoding="UTF-8"?>
                        >>
                        >> So any xml library should handle it just fine, without you trying to
                        >> guess the encoding.
                        >
                        > Yes my header also says UTF-8. However some kind person sent me an
                        > e-mail stating that since I am getting \x94 and such output when using
                        > repr (even if str is giving correct output)
                        > there could be some problem
                        > with the XML-file not being completely UTF-8.

                        I deduce that I am the allegedly kind person.

                        Firstly you have a problem with the "even if" part of "I am getting \x94
                        and such output when using repr (even if str is giving correct output)".

                        Let txt = "\x93hello\x94". So you print repr(txt) and the result appears
                        as '\x93hello\x94'. That is absolutely correct. It is an unambiguous
                        REPRresentation of the string. In IDLE (or similar) on a Windows box
                        (where the default encoding is cp1252) if you print str(txt) [or merely
                        print txt] the display shows hello preceded/followed by the LEFT/RIGHT
                        DOUBLE QUOTATION MARK (U+201C/U+201D) -- or some other pair of
                        left/right thingies. That is also correct enough.
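
The repr/str distinction described here can be reproduced with a small sketch in modern Python 3, where the 8-bit string of the thread corresponds to a `bytes` object. The assumption that the data is cp1252 is taken from the discussion, not verified against any real file, and the variable names are illustrative.

```python
raw = b"\x93hello\x94"        # 8-bit data, assumed to be cp1252
shown = repr(raw)             # repr escapes every non-ASCII byte
text = raw.decode("cp1252")   # real characters: U+201C ... U+201D

print(shown)        # the unambiguous representation, escapes included
print([raw])        # printing a list calls repr() on each item
print(ascii(text))  # ASCII-safe view of the decoded quotation marks
```

Seeing `\x93` or `\x94` in repr output therefore says nothing about the data being damaged; it only shows which bytes are present.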

                        Secondly, I stated nothing such about the XML-file. We were discussing
                        "extracting from content.xml using OOopy's getiterator function". My
                        point was that if you were seeing \x94 anywhere, THE OUTPUT FROM THAT
                        FUNCTION must be encoded as cp1252.

                        Here is the relevant part:
                        ==========================
                        AV>> First I noticed that by extracting from content.xml using OOopy's
                        getiterator function, some \x94 codes were left inside the document.

                        AV>> But that was an *artifact*, because if one prints something using
                        s.__repr__() as is used for example when printing a list of strings
                        (duh) the output is not the same as when one prints with 'print s'. I
                        guess what is called then is str(s).

                        JM> Don't *guess*!!!

                        AV>> Ok, now we have that out of the way, I hope.

                        JM>> No, not quite. If you saw \x94 in the repr() output, but it looked
                        "OK" when displayed using str(), then the only reasonable hypotheses are
                        (a) the data was in an 8-bit string, presumably encoded as cp1252
                        (definitely NOT UTF-8), rather than a Unicode string (b) you displayed
                        it via a file whose encoding was 'cp1252'.

                        JM>> "... assuming something in the open office xml file was not quite
                        1252. In fact it wasn't, it was UTF-8 ..." --- another problem was
                        assuming that the encoding used for the output of the OOopy interface
                        (apparently cp1252; is there no documentation?) would be the same as in
                        the .sxw file (UTF-8).

                        === end of extract ===
                        > Or is there some other
                        > reason I'm getting these \x94 codes?

                        You are getting "these \x94 codes" when you do *WHAT* exactly?

                        I refer you to Martin's unanswered questions:

                        """What is the problem then? If you interpret
                        the document as cp1252, and it contains \x93 and \x94, what is
                        it that you don't like about that? In yet other words: what actions
                        are you performing, what are the results you expect to get, and
                        what are the results that you actually get?"""


                        • Serge Orlov

                          #13
                          Re: not quite 1252


                          Anton Vredegoor wrote:
                          > In fact there are a lot of printable things that haven't got a text
                          > attribute, for example some items with tag (xxxx)s.

                          In my sample file I see <text:s text:c="2"/>, is that what you're talking
                          about? Since my file is small I can say for sure this tag represents
                          two space characters.
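
One possible way to collect all the text of a paragraph, including text that sits in an element's tail and spaces encoded as `<text:s>`, is sketched below in modern Python 3. The namespace URI is assumed to be the one used by .sxw files, the function name `flatten` is made up for this example, and treating a missing `text:c` as a single space is likewise an assumption.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "{http://openoffice.org/2000/text}"  # assumed .sxw text namespace

def flatten(elem):
    """Return all text under elem, expanding <text:s> into spaces."""
    parts = []
    if elem.tag == TEXT_NS + "s":
        # text:c holds the repeat count; assume one space if it is absent.
        parts.append(" " * int(elem.get(TEXT_NS + "c", "1")))
    elif elem.text:
        parts.append(elem.text)
    for child in elem:
        parts.append(flatten(child))
        if child.tail:  # text that follows the child's closing tag
            parts.append(child.tail)
    return "".join(parts)

sample = (
    '<text:p xmlns:text="http://openoffice.org/2000/text">'
    'one<text:s text:c="2"/>two <text:span>three</text:span> four</text:p>'
)
paragraph = ET.fromstring(sample)
result = flatten(paragraph)
print(repr(result))
```

Filtering on `x.text` alone misses the "two " and " four" pieces here, because ElementTree stores them as the tail of the preceding child element.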


                          • Anton Vredegoor

                            #14
                            Re: not quite 1252

                            Richard Brodie wrote:
                            > "Anton Vredegoor" <anton.vredegoor@gmail.com> wrote in message
                            > news:4451f9e9$1@usenet.zapto.org...
                            >
                            >> Yes my header also says UTF-8. However some kind person sent me an e-mail stating that
                            >> since I am getting \x94 and such output when using repr (even if str is giving correct
                            >> output) there could be some problem with the XML-file not being completely UTF-8. Or is
                            >> there some other reason I'm getting these \x94 codes?
                            >
                            > Well that rather depends on what you are doing. If you take utf-8, decode
                            > it to Unicode, then re-encode it as cp1252 you'll possibly get \x94. OTOH,
                            > if you see '\x94' in a Unicode string, something is wrong somewhere.

                            Well, I mailed the content.xml to someone as a text attachment and it
                            was damaged at the other end, whereas sending it as a file resulted in
                            flawless transfer. So I guess there is something not quite UTF-8 in it.

                            However Firefox has no problem opening it either here or at the other
                            person's computer (the undamaged file of course).

                            By the way, I also sent an MS Word document (not as text) that I edited
                            using OO back to the same person who is using MS Word and he is at the
                            moment still recovering from an MS Word crash. Could it have something to
                            do with the OO document being half as big as the MS Word doc :-)

                            Anton


                            • Serge Orlov

                              #15
                              Re: not quite 1252

                              Anton Vredegoor wrote:
                              > Serge Orlov wrote:
                              >
                              > > I extracted content.xml from a test file and the header is:
                              > > <?xml version="1.0" encoding="UTF-8"?>
                              > >
                              > > So any xml library should handle it just fine, without you trying to
                              > > guess the encoding.
                              >
                              > Yes my header also says UTF-8. However some kind person sent me an
                              > e-mail stating that since I am getting \x94 and such output when using
                              > repr (even if str is giving correct output) there could be some problem
                              > with the XML-file not being completely UTF-8. Or is there some other
                              > reason I'm getting these \x94 codes? Or maybe this is just as it should
                              > be and there's no problem at all?

                              Indeed, just load the file into ElementTree. Extending the example you
                              posted before:

                              import elementtree.ElementTree as ET

                              data = zin.read(x)
                              doc = ET.fromstring(data)
                              officetag = "{http://openoffice.org/2000/office}"
                              body = doc.find(".//" + officetag + "body")
                              for fragment in body.getchildren():
                                  pass  # ... process one fragment of document's body ...
