How do xml parsers handle encoding?

  • billsahiker@yahoo.com

    How do xml parsers handle encoding?


    If an XML file specifies an encoding, e.g., UTF-16, do XML browsers and
    XML editors read and verify each character in the file to make sure it
    is UTF-16, and throw an error if it is not? Or do they automatically
    filter/convert to UTF-16, or do they do something else?

    Do they default to UTF-8 if the XML file does not specify an encoding?

    Bill
  • Martin Honnen

    #2
    Re: How do xml parsers handle encoding?

    XML parsers are required to check that documents are properly
    encoded. However, browsers like Firefox or Opera might not, I think,
    report any such violation. For instance, I saved an XML document as
    UTF-8 but with an XML declaration saying encoding="UTF-16", then loaded
    it in Firefox 2.0 and Opera 9. Neither reported an error; both
    treated the document as UTF-8. IE 6 reported an error.

    For Mozilla, the FAQ says:
    "Most well-formedness constraints are enforced. (Currently Mozilla
    does not catch character encoding errors, because the document is
    re-encoded using a lenient encoding converter before the document
    reaches the XML parser. This is a bug.)"
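
    The same experiment can be tried outside a browser. Below is a minimal
    sketch in Python using the standard library's xml.etree.ElementTree,
    which wraps the Expat parser. The helper name try_parse and the sample
    documents are made up for illustration, and the exact error text depends
    on the Expat version, so the code only reports whether parsing succeeded.

```python
import xml.etree.ElementTree as ET


def try_parse(raw: bytes) -> str:
    """Parse raw bytes and report whether the parser accepted them."""
    try:
        root = ET.fromstring(raw)
    except ET.ParseError as exc:
        return f"rejected: {exc}"
    return f"accepted: <{root.tag}>"


# Bytes are really UTF-8, and the declaration says so.
consistent = '<?xml version="1.0" encoding="UTF-8"?><doc>caf\u00e9</doc>'.encode("utf-8")

# Bytes are really UTF-8, but the declaration claims UTF-16.
mismatched = '<?xml version="1.0" encoding="UTF-16"?><doc>caf\u00e9</doc>'.encode("utf-8")

if __name__ == "__main__":
    print(try_parse(consistent))
    print(try_parse(mismatched))
```

    With a typical CPython build the mismatched document should be rejected,
    which is the IE 6 behavior described above rather than the Firefox 2.0 one.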



    --

    Martin Honnen


    • Joseph J. Kesselman

      #3
      Re: How do xml parsers handle encoding?

      The rules for how they're *supposed* to handle it are spelled out in the
      XML Recommendation. Not all parsers are in strict compliance with all
      parts of the recommendation, alas. Bugs happen.

      If you're asking whether you can get away with cheating: the brief
      answer is that it's extremely bad practice to try. If you're asking
      whether you can be certain a particular parser will or won't let
      something through, you can ask its development/user community... but be
      aware that the next release may fix this, and it's a very bad idea to
      write code that depends on bugs in specific versions.


      • billsahiker@yahoo.com

        #4
        Re: How do xml parsers handle encoding?

        On Apr 30, 8:20 am, Martin Honnen <mahotr...@yahoo.de> wrote:
        > And XML parsers are required to check that documents are properly
        > encoded.

        So how do they do that? Do they check every character, or do they just
        convert? If the encoding attribute is UTF-8 and the file has a
        character that is not UTF-8, does the browser raise an error, convert
        it, or what? For example, if a Korean character is in a file that says
        it is UTF-8.

        Bill


        • Richard Tobin

          #5
          Re: How do xml parsers handle encoding?

          In article <e96ae004-e602-4b72-a7b5-608f11ef2073@t12g2000prg.googlegroups.com>,
          <billsahiker@yahoo.com> wrote:
          >> And XML parsers are required to check that documents are properly
          >> encoded.
          >So how do they do that? do they check every character?

          Yes.

          >Like if a Korean character is in a file that says it is utf8.

          UTF-8 covers all of Unicode, so it includes Korean characters.

          A parser has to check two things: that the data is legal for the
          encoding (for example, some sequences of bytes are not legal in
          UTF-8), and that the character it encodes is allowed in XML.
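
          Those two checks can be sketched roughly as follows in Python. This
          is an illustration, not how any particular parser implements them:
          the function names are hypothetical, and the character ranges are
          the Char production from XML 1.0. A real parser interleaves these
          checks with parsing rather than scanning the whole input first.

```python
def is_xml_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def check_xml_bytes(data: bytes, encoding: str = "utf-8") -> str:
    """Check 1: bytes are legal for the encoding. Check 2: chars are legal in XML."""
    try:
        text = data.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        return f"illegal byte sequence at offset {exc.start}"
    for ch in text:
        if not is_xml_char(ord(ch)):
            return f"character U+{ord(ch):04X} not allowed in XML"
    return "ok"


if __name__ == "__main__":
    print(check_xml_bytes("한국어".encode("utf-8")))  # Korean text is fine in UTF-8
    print(check_xml_bytes(b"\xc0\x80"))               # overlong form, not legal UTF-8
    print(check_xml_bytes(b"a\x00b"))                 # legal UTF-8, but NUL is not an XML char
```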

          -- Richard
          --
          :wq


          • billsahiker@yahoo.com

            #6
            Re: How do xml parsers handle encoding?

            On Apr 30, 9:49 am, rich...@cogsci.ed.ac.uk (Richard Tobin) wrote:
            > In article <e96ae004-e602-4b72-a7b5-608f11ef2...@t12g2000prg.googlegroups.com>,
            > <billsahi...@yahoo.com> wrote:
            >>> And XML parsers are required to check that documents are properly
            >>> encoded.
            >> So how do they do that? do they check every character?
            >
            > Yes.
            >
            >> Like if a Korean character is in a file that says it is utf8.
            >
            > utf-8 covers all of Unicode, so it includes Korean characters.
            >
            > A parser has to check two things: that the data is legal for the
            > encoding (for example, some sequences of bytes are not legal in
            > UTF-8), and that the character it encodes is allowed in XML.
            >
            > -- Richard

            OK. I don't know if you are a .NET programmer or not (Martin is, so
            maybe he can respond to this too), but if I use StreamReader to read
            an XML file with encoding specified as UTF-8 and I set the
            streamreader.encoding property to UTF-8, will StreamReader throw an
            exception if a character is not UTF-8, or do I have to parse every
            character and check its value to see if it is in the UTF-8 range?

            Bill


            • Martin Honnen

              #7
              Re: How do xml parsers handle encoding?

              billsahiker@yahoo.com wrote:
              > OK. I don't know if you are a .NET programmer or not (Martin is, so
              > maybe he can respond to this too), but if I use StreamReader to read
              > an XML file with encoding specified as UTF-8 and I set the
              > streamreader.encoding property to UTF-8, will StreamReader throw an
              > exception if a character is not UTF-8, or do I have to parse every
              > character and check its value to see if it is in the UTF-8 range?

              As far as I know StreamReader does not throw an exception.
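
              The difference between a lenient and a strict stream reader can be
              shown with a small Python analogue. This is not .NET code: it uses
              Python's io.TextIOWrapper, and read_text is a hypothetical helper.
              In .NET the comparable switch is, as far as I know, constructing a
              UTF8Encoding with throwOnInvalidBytes set to true instead of using
              the default Encoding.UTF8, which substitutes U+FFFD for bad bytes.

```python
import io


def read_text(data: bytes, errors: str) -> str:
    """Decode a byte stream as UTF-8 through a text stream wrapper."""
    with io.TextIOWrapper(io.BytesIO(data), encoding="utf-8", errors=errors) as reader:
        return reader.read()


# 0xE9 is "é" in Latin-1 but is not a valid UTF-8 sequence on its own.
bad = b"<doc>caf\xe9</doc>"

# Lenient mode: the invalid byte is silently replaced with U+FFFD.
lenient = read_text(bad, errors="replace")

# Strict mode: the same input raises instead of being repaired.
try:
    read_text(bad, errors="strict")
    strict_result = "no error"
except UnicodeDecodeError as exc:
    strict_result = f"error: {exc.reason}"

if __name__ == "__main__":
    print(repr(lenient))
    print(strict_result)
```

              A lenient reader hides exactly the kind of encoding error that an
              XML parser is supposed to report, so the strict mode is the one to
              use when the goal is validation.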


              --

              Martin Honnen


              • Joseph J. Kesselman

                #8
                Re: How do xml parsers handle encoding?

                billsahiker@yahoo.com wrote:
                > So how do they do that? do they check every character? or do they just
                > convert?

                Most hand it off to an appropriate encoding-aware stream reader library
                and let that code do the work. Why build a wheel when you can buy one?
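
                As an illustration of what such a library takes care of, here is
                a sketch using Python's codecs module. The function decode_stream
                is hypothetical. It shows one detail a hand-written check could
                easily get wrong: a multi-byte character split across two read
                buffers must be carried over, not reported as an error, while
                input that really ends mid-character must still be rejected.

```python
import codecs


def decode_stream(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks, raising on malformed input."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    parts = []
    for chunk in chunks:
        # Incomplete trailing bytes are buffered until the next chunk arrives.
        parts.append(decoder.decode(chunk))
    # final=True makes the decoder raise if bytes are still pending.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


# The Korean character is three bytes in UTF-8; split it across two chunks.
encoded = "<doc>한</doc>".encode("utf-8")
chunks = [encoded[:6], encoded[6:]]

if __name__ == "__main__":
    print(decode_stream(chunks))
```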
