How to ask sax for the file encoding

  • Edward K. Ream

    How to ask sax for the file encoding

    Following the usual cookbook examples, my app parses an open file as
    follows::



    parser = xml.sax.make_parser()

    parser.setFeature(xml.sax.handler.feature_external_ges, 1)

    # Hopefully the content handler can figure out the encoding from the
    # <?xml> element.

    handler = saxContentHandler(c, inputFileName, silent)

    parser.setContentHandler(handler)

    parser.parse(theFile)



    Can anyone tell me how the content handler can determine the encoding of the
    file? Can sax provide this info?



    Thanks!



    Edward
    --------------------------------------------------------------------
    Edward K. Ream email: edreamleo@charter.net
    Leo: http://webpages.charter.net/edreamleo/front.html
    --------------------------------------------------------------------



  • Fredrik Lundh

    #2
    Re: How to ask sax for the file encoding

    Edward K. Ream wrote:
    >Can anyone tell me how the content handler can determine the encoding of the file? Can sax
    >provide this info?

    there is no encoding on the "inside" of an XML document; it's all Unicode.

    </F>




    • Edward K. Ream

      #3
      Re: How to ask sax for the file encoding

      >>Can anyone tell me how the content handler can determine the encoding of
      >>the file? Can sax provide this info?
      >
      >there is no encoding on the "inside" of an XML document; it's all Unicode.

      True, but sax is reading the file, so sax is producing the unicode, so it
      should (must) be able to determine the encoding. Furthermore, xml files
      start with lines like:

      <?xml version="1.0" encoding="utf-8"?>

      so it would seem reasonable for sax to be able to return 'utf-8' somehow.
      Am I missing something?

      Edward

      • Diez B. Roggisch

        #4
        Re: How to ask sax for the file encoding

        Edward K. Ream wrote:
        >>>Can anyone tell me how the content handler can determine the encoding of
        >>>the file? Can sax provide this info?
        >>
        >>there is no encoding on the "inside" of an XML document; it's all
        >>Unicode.
        >
        >True, but sax is reading the file, so sax is producing the unicode, so it
        >should (must) be able to determine the encoding.

        It is, by reading the xml header.

        >Furthermore, xml files
        >start with lines like:
        >
        ><?xml version="1.0" encoding="utf-8"?>
        >
        >so it would seem reasonable for sax to be able to return 'utf-8' somehow.
        >Am I missing something?

        That sax outputs unicode, which has no encoding associated anymore. And thus
        it is pretty much irrelevant information. It _could_ be retained, but for
        what purpose?

        Diez
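
To make the point about Unicode output concrete, here is a minimal sketch in Python 3. The handler class `TextCollector`, the helper `parse_text` and the sample document are made up for this illustration. It feeds the same document to xml.sax in two different byte encodings and compares the text that comes back.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that just accumulates character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # content arrives as a str (Unicode), whatever the bytes on disk were
        self.chunks.append(content)


def parse_text(data):
    """Parse XML bytes and return all character data as one string."""
    handler = TextCollector()
    xml.sax.parseString(data, handler)
    return "".join(handler.chunks)


doc = "<?xml version='1.0' encoding='{enc}'?><title>Caf\u00e9</title>"
latin1 = doc.format(enc="iso-8859-1").encode("iso-8859-1")
utf8 = doc.format(enc="utf-8").encode("utf-8")

text_latin1 = parse_text(latin1)
text_utf8 = parse_text(utf8)
print(text_latin1 == text_utf8)
```

The two byte strings differ, but the handler sees the same characters in both cases, so nothing in the SAX events themselves tells the two files apart.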


        • Fredrik Lundh

          #5
          Re: How to ask sax for the file encoding

          Edward K. Ream wrote:
          ><?xml version="1.0" encoding="utf-8"?>
          >
          >so it would seem reasonable for sax to be able to return 'utf-8' somehow.

          why? that's an encoding detail, and should be completely irrelevant for
          your application.

          >Am I missing something?

          you're confusing artifacts of an external serialization format with the actual
          data model. don't do that, if you can avoid it.

          what's your use case?

          </F>




          • Edward K. Ream

            #6
            Re: How to ask sax for the file encoding

            >[The value of the encoding field] _could_ be retained, but for what
            >purpose?

            I'm asking this question because my app needs it :-) Imo, there is *no*
            information in any xml file that can be considered irrelevant. My app will
            want to know the original encoding when writing the file.

            Edward

            • Diez B. Roggisch

              #7
              Re: How to ask sax for the file encoding

              Edward K. Ream wrote:
              >>[The value of the encoding field] _could_ be retained, but for what
              >>purpose?
              >
              >I'm asking this question because my app needs it :-)
              >Imo, there is *no*
              >information in any xml file that can be considered irrelevant.

              There sure is! The encoding _is_ irrelevant, in the very moment you get unicode
              strings. The order of attributes is irrelevant. There is plenty of
              irrelevant whitespace. And so on...

              >My app will
              >want to know the original encoding when writing the file.

              When your app needs it, what does it need it for? If you write out xml again,
              use whatever encoding suits you best. If you don't, use the encoding that
              the subsequent application or processing step needs.

              Diez
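
As an illustration of choosing the output encoding at write time, here is a minimal Python 3 sketch using `xml.sax.saxutils.XMLGenerator`. The helper `write_title` and the one-element document are invented for the example and are not taken from anyone's application.

```python
import io
from xml.sax.saxutils import XMLGenerator


def write_title(title, encoding):
    """Serialize a one-element document in the requested encoding."""
    buf = io.BytesIO()
    gen = XMLGenerator(buf, encoding=encoding)
    gen.startDocument()  # emits the <?xml ... encoding="..."?> line
    gen.startElement("title", {})
    gen.characters(title)
    gen.endElement("title")
    gen.endDocument()
    return buf.getvalue()


as_utf8 = write_title("Caf\u00e9", "utf-8")
as_latin1 = write_title("Caf\u00e9", "iso-8859-1")
print(as_utf8)
print(as_latin1)
```

The same Unicode string goes in both times; only the `encoding` argument decides which bytes and which declaration come out.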


              • Edward K. Ream

                #8
                Re: How to ask sax for the file encoding

                >The encoding _is_ irrelevant, in the very moment you get unicode strings.

                We shall have to disagree about this. My use case is perfectly reasonable,
                imo.

                >If you write out xml again, use whatever encoding suits you best.

                What suits me best is what the *user* specified, and that got put in the
                first xml line.
                I'm going to have to parse this line myself.

                Edward
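
Parsing the first line by hand could look roughly like the sketch below (Python 3). The function name `declared_encoding` and the `utf-8` default are assumptions made for the example. It only handles files whose declaration is readable as ASCII bytes, so UTF-16 input or a leading byte order mark would need extra handling.

```python
import re

# Matches encoding="..." or encoding='...' inside a leading XML declaration.
_DECL_RE = re.compile(
    rb"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def declared_encoding(first_bytes, default="utf-8"):
    """Return the encoding named in the XML declaration, or `default`."""
    match = _DECL_RE.match(first_bytes)
    if match is None:
        return default
    return match.group(1).decode("ascii")


sample = b"<?xml version='1.0' encoding='iso-8859-1'?>\n<book/>"
print(declared_encoding(sample))
```

Reading the first few hundred bytes of the file in binary mode and passing them to this function is enough, since the declaration has to come first in the document.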

                • Rob Wolfe

                  #9
                  Re: How to ask sax for the file encoding

                  "Edward K. Ream" <edreamleo@charter.net> writes:
                  >Can anyone tell me how the content handler can determine the encoding of the
                  >file? Can sax provide this info?

                  Try this:

                  <code>
                  from xml.parsers import expat

                  s = """<?xml version='1.0' encoding='iso-8859-1'?>
                  <book>
                  <title>Title</title>
                  <chapter>Chapter 1</chapter>
                  </book>
                  """

                  class MyParser(object):
                      # Called by expat once it has read the <?xml ...?> declaration.
                      def XmlDecl(self, version, encoding, standalone):
                          print("XmlDecl", version, encoding, standalone)

                      def Parse(self, data):
                          Parser = expat.ParserCreate()
                          Parser.XmlDeclHandler = self.XmlDecl
                          Parser.Parse(data, 1)

                  parser = MyParser()
                  parser.Parse(s)
                  </code>

                  --
                  HTH,
                  Rob
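
The expat declaration handler can also be combined with an ordinary SAX parse of an open file. The following is a sketch only, in Python 3: the function `sniff_declared_encoding`, its `chunk_size` parameter and the in-memory sample file are made up for the illustration, and it assumes the file is opened in binary mode and is seekable.

```python
import io
import xml.sax
from xml.parsers import expat


def sniff_declared_encoding(fileobj, chunk_size=1024):
    """Return the encoding from the XML declaration, or None if absent.

    Assumes fileobj is a seekable binary file; its position is restored.
    """
    found = []

    def on_xml_decl(version, encoding, standalone):
        found.append(encoding)

    start = fileobj.tell()
    sniffer = expat.ParserCreate()
    sniffer.XmlDeclHandler = on_xml_decl
    try:
        # isfinal=False: only the head of the file is fed, not the whole document
        sniffer.Parse(fileobj.read(chunk_size), False)
    except expat.ExpatError:
        pass  # malformed input; report whatever was seen before the error
    finally:
        fileobj.seek(start)
    return found[0] if found else None


the_file = io.BytesIO(
    b"<?xml version='1.0' encoding='iso-8859-1'?>\n"
    b"<book><title>Title</title></book>\n"
)
encoding = sniff_declared_encoding(the_file)
# The file has been rewound, so the normal SAX parse can follow.
xml.sax.parse(the_file, xml.sax.ContentHandler())
print(encoding)
```

The sniffing parser is thrown away after the first chunk, so the content handler used for the real parse does not need to change.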


                  • Irmen de Jong

                    #10
                    Re: How to ask sax for the file encoding

                    Edward K. Ream wrote:
                    >What suits me best is what the *user* specified, and that got put in the
                    >first xml line.
                    >I'm going to have to parse this line myself.

                    Please consider adding some elements to the document itself that
                    describe the desired output format, such as:

                    ....
                    <output>
                    <encoding>utf-8</encoding>
                    </output>
                    ....

                    This allows the client to specify the encoding it wants to receive
                    the document in, even if it's different than the encoding it used
                    to make the first document. More flexibility. Less fooling around.

                    --Irmen
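
Reading such an element back is straightforward. Below is a small Python 3 sketch with `xml.etree.ElementTree`; the `<request>` and `<reply>` element names and the `utf-8` fallback are assumptions for the example, and only the `<output><encoding>` structure comes from the suggestion above.

```python
import xml.etree.ElementTree as ET

# The request itself is stored as iso-8859-1 but asks for a utf-8 reply.
request = b"""<?xml version='1.0' encoding='iso-8859-1'?>
<request>
  <output>
    <encoding>utf-8</encoding>
  </output>
</request>
"""

root = ET.fromstring(request)
# findtext returns the default when the element is missing
wanted = root.findtext("output/encoding", default="utf-8")
reply = ET.tostring(ET.Element("reply"), encoding=wanted)
print(wanted)
```

Because the desired encoding is ordinary document content here, any parser will hand it to the application without special support.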


                    • Fredrik Lundh

                      #11
                      Re: How to ask sax for the file encoding

                      Edward K. Ream wrote:
                      >I'm asking this question because my app needs it :-) Imo, there is *no*
                      >information in any xml file that can be considered irrelevant.

                      the encoding isn't *in* the XML file, it's an artifact of the
                      serialization model used for a specific XML infoset. the XML
                      data is pure Unicode.

                      </F>


                      • Fredrik Lundh

                        #12
                        Re: How to ask sax for the file encoding

                        Edward K. Ream wrote:
                        >What suits me best is what the *user* specified, and that got put in the
                        >first xml line.

                        are you expecting your users to write XML by hand? ouch.

                        </F>


                        • Edward K. Ream

                          #13
                          Re: How to ask sax for the file encoding

                          >are you expecting your users to write XML by hand?

                          Of course not. Leo has the following option:

                          @string new_leo_file_encoding = utf-8

                          Edward

                          • Edward K. Ream

                            #14
                            Re: How to ask sax for the file encoding

                            >Please consider adding some elements to the document itself that
                            >describe the desired output format,

                            Well, that's what the encoding field in the xml line was supposed to do.
                            Not a bad idea though, except it changes the file format, and I would really
                            rather not do that.

                            Edward

                            • Edward K. Ream

                              #15
                              Re: How to ask sax for the file encoding

                              >the encoding isn't *in* the XML file, it's an artifact of the
                              >serialization model used for a specific XML infoset. the XML
                              >data is pure Unicode.

                              Sorry, but no. The *file* is what I am talking about, and the way it is
                              encoded does, in fact, really make a difference to some users. They have a
                              right, I think, to expect that the original encoding gets preserved when the
                              file is rewritten.

                              Edward
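
For the rewrite case, a round trip that keeps the declared encoding might look like the sketch below (Python 3). The function `rewrite_preserving_encoding` and its `fallback` parameter are hypothetical, and ElementTree is used for the output only to keep the example short; it is not how Leo writes its files.

```python
import io
import xml.etree.ElementTree as ET
from xml.parsers import expat


def rewrite_preserving_encoding(data, fallback="utf-8"):
    """Parse XML bytes and serialize them again in the declared encoding."""
    declared = []

    def on_xml_decl(version, encoding, standalone):
        declared.append(encoding)

    sniffer = expat.ParserCreate()
    sniffer.XmlDeclHandler = on_xml_decl
    sniffer.Parse(data, True)
    # No declaration, or a declaration without encoding=..., uses the fallback.
    encoding = (declared[0] if declared else None) or fallback

    root = ET.fromstring(data)
    out = io.BytesIO()
    ET.ElementTree(root).write(out, encoding=encoding, xml_declaration=True)
    return encoding, out.getvalue()


original = (
    "<?xml version='1.0' encoding='iso-8859-1'?><title>Caf\u00e9</title>"
).encode("iso-8859-1")
encoding, rewritten = rewrite_preserving_encoding(original)
print(encoding)
```

The rewritten bytes carry the same encoding name in their declaration as the input, which is the behaviour asked for here, although whitespace and quoting in the declaration may differ from the original file.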