LoadXML and UTF-8 encoding

  • jmgonet

    LoadXML and UTF-8 encoding

    Hello everybody,
    I'm having trouble loading an XML string encoded in UTF-8.

    If I try this code:
    ------------------------------
    XmlDocument doc = new XmlDocument();
    String s = "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?><a>SchÃ¶nbÃ¼hl</a>";
    doc.LoadXml(s);
    doc.Save("d:\\temp\\test.xml");
    ------------------------------

    What I get in the test.xml file is:
    ------------------------------
    ï»¿<?xml version="1.0" encoding="utf-8" standalone="yes"?>
    <a>SchÃƒÂ¶nbÃƒÂ¼hl</a>
    ------------------------------

    I'm puzzled about two points in the test.xml file:
    - What is the "ï»¿" at the beginning?
    - Why are the special chars double-encoded?

    Am I missing something? Is there a workaround?

    Thanks in advance,

    jmgonet.
  • Oleg Tkachenko [MVP]

    #2
    Re: LoadXML and UTF-8 encoding

    jmgonet wrote:

    > I'm having trouble loading an XML string encoded in UTF-8.
    >
    > If I try this code:
    > ------------------------------
    > XmlDocument doc = new XmlDocument();
    > String s = "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?><a>SchÃ¶nbÃ¼hl</a>";
    > doc.LoadXml(s);
    > doc.Save("d:\\temp\\test.xml");
    > ------------------------------
    >
    > What I get in the test.xml file is:
    > ------------------------------
    > ï»¿<?xml version="1.0" encoding="utf-8" standalone="yes"?>
    > <a>SchÃƒÂ¶nbÃƒÂ¼hl</a>
    > ------------------------------
    >
    > I'm puzzled about two points in the test.xml file:
    > - What is the "ï»¿" at the beginning?

    It's the Unicode byte-order mark (BOM). It's harmless, but in UTF-8
    it's optional and you can get rid of it:

    XmlTextWriter w = new XmlTextWriter("d:\\temp\\test.xml", new UTF8Encoding(false));
    doc.Save(w);
    w.Close();

    > - Why are the special chars double-encoded?

    What do you mean?
    --
    Oleg Tkachenko [XML MVP, MCP]
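
A minimal sketch of the BOM point above, assuming nothing beyond the standard System.Text classes. The class and method names (BomDemo, PreambleHex, AsLatin1) are made up for illustration. It shows that the preamble of a BOM-emitting UTF-8 encoding is the three bytes EF BB BF, that UTF8Encoding(false) has no preamble, and that those three bytes read as "ï»¿" when a viewer treats them as ISO-8859-1.

```csharp
using System;
using System.Text;

public static class BomDemo
{
    // Returns the encoding's preamble (BOM) as hex, or "" when there is none.
    public static string PreambleHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    // Renders raw bytes the way an ISO-8859-1 viewer would: one char per byte.
    public static string AsLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }

    public static void Main()
    {
        Console.WriteLine(PreambleHex(new UTF8Encoding(true)));   // EF-BB-BF
        Console.WriteLine(PreambleHex(new UTF8Encoding(false)));  // empty line

        // The BOM bytes, misread as ISO-8859-1, are the characters U+00EF U+00BB U+00BF.
        string misread = AsLatin1(new UTF8Encoding(true).GetPreamble());
        Console.WriteLine(misread == "\u00EF\u00BB\u00BF");       // True
    }
}
```

Passing new UTF8Encoding(false) to the writer, as in the reply above, is what suppresses those three bytes in the saved file.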


    • jmgonet

      #3
      Re: LoadXML and UTF-8 encoding

      Thanks, Oleg, for your reply.

      OK for the "ï»¿". That's interesting.

      By double-encoded I mean that the chars are encoded twice to UTF-8.
      Originally the string contained in the XML was "Schönbühl".
      To put it into UTF-8, I transformed it to "SchÃ¶nbÃ¼hl", where
      ö="Ã¶" and ü="Ã¼". This is the string I'm trying to load:

      -------------------------------------------------
      XmlDocument doc = new XmlDocument();
      String s = "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?><a>SchÃ¶nbÃ¼hl</a>";
      doc.LoadXml(s);
      doc.Save("d:\\temp\\test.xml");
      --------------------------------------------------

      But when I open "test.xml" in a text editor, I get:
      ------------------------------
      <?xml version="1.0" encoding="utf-8" standalone="yes"?>
      <a>SchÃƒÂ¶nbÃƒÂ¼hl</a>
      ------------------------------

      The string is converted to "SchÃƒÂ¶nbÃƒÂ¼hl", where Ã="Ãƒ",
      ¶="Â¶"...

      • Oleg Tkachenko [MVP]

        #4
        Re: LoadXML and UTF-8 encoding

        jmgonet wrote:
        > By double-encoded I mean that the chars are encoded twice to UTF-8.
        > Originally the string contained in the XML was "Schönbühl".
        > To put it into UTF-8, I transformed it to "SchÃ¶nbÃ¼hl", where
        > ö="Ã¶" and ü="Ã¼".

        I think that's a bad idea. UTF-8 defines how Unicode characters are
        represented as bytes. If you write each of those bytes out as a
        character of its own, you just get two characters.

        What's wrong with "Schönbühl"? Just use it as is.

        --
        Oleg Tkachenko [XML MVP, MCP]


        • jmgonet

          #5
          Re: LoadXML and UTF-8 encoding

          > What's wrong with "Schönbühl"?

          Well, I see I was oversimplifying my question in an attempt to avoid
          discussing other issues. The point is that the XML file is provided by
          people from another company. They seem to be very fond of UTF-8, so they
          encode everything in UTF-8. They create UTF-8 files and leave them on an
          FTP server.

          I'm writing an application that logs into the FTP server and gets the XML
          files. I then end up with the file contents in a string. I don't have any
          control over the content of the file or its format, so I have to accept it
          "as is".

          Now I have a very long string containing lots of data (about 20 KB). Its
          header is
          <?xml version="1.0" encoding="utf-8" standalone="yes"?>

          And there are plenty of elements, some of them containing special chars like
          those in the "SchÃ¶nbÃ¼hl" city name.

          So I want to load this XML file into a document. I thought that
          the easiest way to do this was to load it from memory:
          [... Lot of code here ...]
          [... A lot of code here too ...]
          [... And a lot more, just to ...]
          [... obtain a string with ...]
          [... the content of my Xml file in it ...]
          XmlDocument doc = new XmlDocument();
          doc.LoadXml(s);

          At this point I had some issues with badly formatted strings, so I started to
          investigate. I investigated the file on the FTP server, the TCP
          communication between the server and my application, and the
          Encoding in my application. In the end I found that everything seemed to be
          correct, BUT the special chars were still broken.

          So, as I didn't have any clue about what was wrong, I started reducing my
          application in order to obtain the simplest and shortest code reproducing
          my error. And I ended up with this:

          ----------------------------------------------
          XmlDocument doc = new XmlDocument();
          String s = "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?><a>SchÃ¶nbÃ¼hl</a>";
          doc.LoadXml(s);
          doc.Save("d:\\temp\\test.xml");
          ----------------------------------------------

          My objective is not to have a file called "test.xml" in a folder called
          "temp" containing the name of a bizarre city. It is just that this
          small piece of code happens to behave in a way I'm not sure I understand.
          In my mind, if you load the string s as an XML document and then save it to
          an XML file, the file and the string should have identical content,
          or at least equivalent content. I still believe that.

          But instead, if I use a text editor to read the "test.xml" file, I obtain
          this:

          ----------------------------------------------
          <?xml version="1.0" encoding="utf-8" standalone="yes"?>
          <a>SchÃƒÂ¶nbÃƒÂ¼hl</a>
          ----------------------------------------------

          where I expected to see
          ----------------------------------------------
          <?xml version="1.0" encoding="utf-8" standalone="yes"?>
          <a>SchÃ¶nbÃ¼hl</a>
          ----------------------------------------------

          Thanks to you, now I understand the meaning of the "ï»¿" at the beginning.

          But I'm still puzzled about the double encoding of the special chars
          in "SchÃ¶nbÃ¼hl".

          Regards,
          Jean-Michel Gonet.


          • Oleg Tkachenko [MVP]

            #6
            Re: LoadXML and UTF-8 encoding

            jmgonet wrote:

            > Well, I see I was oversimplifying my question in an attempt to avoid
            > discussing other issues. The point is that the XML file is provided by
            > people from another company. They seem to be very fond of UTF-8, so they
            > encode everything in UTF-8. They create UTF-8 files and leave them on an
            > FTP server.
            >
            > I'm writing an application that logs into the FTP server and gets the XML
            > files. I then end up with the file contents in a string. I don't have any
            > control over the content of the file or its format, so I have to accept it
            > "as is".
            >
            > Now I have a very long string containing lots of data (about 20 KB). Its
            > header is
            > <?xml version="1.0" encoding="utf-8" standalone="yes"?>
            >
            > And there are plenty of elements, some of them containing special chars like
            > those in the "SchÃ¶nbÃ¼hl" city name.
            >
            > So I want to load this XML file into a document. I thought that
            > the easiest way to do this was to load it from memory:
            > [... Lot of code here ...]
            > [... A lot of code here too ...]
            > [... And a lot more, just to ...]
            > [... obtain a string with ...]
            > [... the content of my Xml file in it ...]
            > XmlDocument doc = new XmlDocument();
            > doc.LoadXml(s);
            >
            > At this point I had some issues with badly formatted strings, so I started to
            > investigate. I investigated the file on the FTP server, the TCP
            > communication between the server and my application, and the
            > Encoding in my application. In the end I found that everything seemed to be
            > correct, BUT the special chars were still broken.

            Hmm, actually loading UTF-8 XML from a string should work. Strings
            in .NET are always UTF-16 encoded, and XmlTextReader is able to
            recognize that case and switch to UTF-16, whatever the declaration says.
            Usually the problems arise when you read the XML into a string: at that
            point the bytes have to be decoded as UTF-8.
            For instance, take this XML file:

            <?xml version="1.0" encoding="utf-8" standalone="yes"?>
            <books>Schönbühl</books>

            It's stored in UTF-8 on Windows as
            EF BB BF 3C 3F 78 6D 6C │ 20 76 65 72 73 69 6F 6E   ...<?xml version
            3D 22 31 2E 30 22 20 65 │ 6E 63 6F 64 69 6E 67 3D   ="1.0" encoding=
            22 75 74 66 2D 38 22 20 │ 73 74 61 6E 64 61 6C 6F   "utf-8" standalo
            6E 65 3D 22 79 65 73 22 │ 3F 3E 0D 0A 3C 62 6F 6F   ne="yes"?>..<boo
            6B 73 3E 53 63 68 C3 B6 │ 6E 62 C3 BC 68 6C 3C 2F   ks>Sch..nb..hl</
            62 6F 6F 6B 73 3E 0D 0A │ 0D 0A                     books>....

            Note how the letter ö is encoded in UTF-8 as two bytes, C3 B6, and ü as C3 BC.

            Then the following code correctly reads the file into a string and then
            loads it into the DOM.

            StreamReader sr = new StreamReader("foo.xml", Encoding.UTF8);
            string xml = sr.ReadToEnd();
            sr.Close();
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(xml);
            doc.Save("foo2.xml");

            The result is the same data.

            --
            Oleg Tkachenko [XML MVP, MCP]
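
To make the byte-level point concrete, here is a small sketch under stated assumptions: the file bytes are valid UTF-8, and the faulty step is a byte-to-string conversion that treats them as ISO-8859-1. That is a guess at the failure mode, not something the thread confirms. The names MojibakeDemo, Misdecode and Decode are invented for the example.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Wrong for UTF-8 input: every byte becomes one character.
    public static string Misdecode(byte[] utf8Bytes)
    {
        return Latin1.GetString(utf8Bytes);
    }

    // Right for UTF-8 input: multi-byte sequences become single characters.
    public static string Decode(byte[] utf8Bytes)
    {
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        // "Schönbühl" written with escapes so the source file's own encoding cannot interfere.
        byte[] fileBytes = Encoding.UTF8.GetBytes("Sch\u00F6nb\u00FChl");
        Console.WriteLine(BitConverter.ToString(fileBytes));
        // 53-63-68-C3-B6-6E-62-C3-BC-68-6C

        string wrong = Misdecode(fileBytes);
        Console.WriteLine(wrong.Length);            // 11: each umlaut turned into two chars

        // Saving the misdecoded string as UTF-8 encodes those chars a second time.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(wrong)));
        // 53-63-68-C3-83-C2-B6-6E-62-C3-83-C2-BC-68-6C

        Console.WriteLine(Decode(fileBytes).Length); // 9
    }
}
```

The doubled byte sequence C3 83 C2 B6 is what a text editor would show as four junk characters in place of a single ö.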


            • jmgonet

              #7
              Re: LoadXML and UTF-8 encoding

              Yes, your example is nice, but note that

              ------------------------------------
              <?xml version="1.0" encoding="utf-8" standalone="yes"?>
              <books>Schönbühl</books>
              ------------------------------------

              is not correctly formatted for a UTF-8 file. Anyway, it shows the strange
              behavior of the "LoadXml" method. Look at this: take your sample, copy it into
              a text editor, and save the file as "test.xml". Then try to open it with
              Internet Explorer. You get an error:

              -----------------------
              An invalid character was found in text content. Error
              processing resource 'file:///D:/TEMP/test.xml'.
              Line 2, Position 11

              <books>Sch
              --------------------------------------

              So, LoadXml can read it, but Internet Explorer can't.

              You can also try this other example. Create the following file with a
              text editor:

              --------------------------------------
              <?xml version="1.0" encoding="UTF-8" standalone="yes"?><a>SchÃ¶nbÃ¼hl</a>
              --------------------------------------

              Save it as "test.xml", and then run the following code:

              --------------------------------------
              XmlDocument doc = new XmlDocument();
              doc.Load("d:\\temp\\test.xml");
              doc.Save("d:\\temp\\test2.xml");
              --------------------------------------

              If you open "test2.xml" with the text editor, you can see that it is
              identical to test.xml.

              So the "Load" method doesn't behave like "LoadXml":
              - LoadXml seems to load any XML string as ISO-8859-1, regardless of the
              header. Afterwards, Save uses the encoding information to save the file. But
              that is another story.
              - Load seems to check the header for encoding information.

              But try this:
              --------------------------------------
              XmlDocument doc = new XmlDocument();
              String s = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><a>SchÃ¶nbÃ¼hl</a>";
              StringReader sr = new StringReader(s);
              doc.Load(sr);
              doc.Save("d:\\temp\\test2.xml");
              --------------------------------------

              It is the same example as at the beginning, but using Load instead of
              LoadXml. The result is

              --------------------------------------
              <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
              <a>SchÃƒÂ¶nbÃƒÂ¼hl</a>
              --------------------------------------

              So the "Load" method behaves differently when it reads from a file than when
              it reads from a StringReader!


              By now I've worked around my UTF-8 problem, but I'm still convinced
              that the Load and LoadXml methods behave bizarrely.

              Regards,
              Jean-Michel Gonet.
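
One way to read the difference described above, sketched under assumptions rather than taken from the thread: Load on a file or stream receives bytes and decodes them itself, while LoadXml and Load(TextReader) receive characters that some earlier step already decoded, so the encoding declaration can no longer change them. The helper names (LoadDemo, TextFromBytes, TextFromString) are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class LoadDemo
{
    // Byte input: the parser picks the decoder from the BOM or the XML declaration.
    public static string TextFromBytes(byte[] xmlBytes)
    {
        XmlDocument doc = new XmlDocument();
        using (MemoryStream stream = new MemoryStream(xmlBytes))
        {
            doc.Load(stream);
        }
        return doc.DocumentElement.InnerText;
    }

    // Character input: the string is taken as it is, already decoded.
    public static string TextFromString(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerText;
    }

    public static void Main()
    {
        byte[] bytes = Encoding.UTF8.GetBytes(
            "<?xml version=\"1.0\" encoding=\"utf-8\"?><a>Sch\u00F6nb\u00FChl</a>");

        // Simulates an earlier step that decoded the download as ISO-8859-1.
        string misdecoded = Encoding.GetEncoding("iso-8859-1").GetString(bytes);

        Console.WriteLine(TextFromBytes(bytes).Length);        // 9
        Console.WriteLine(TextFromString(misdecoded).Length);  // 11
    }
}
```

Under that reading, handing the downloaded bytes straight to Load, or decoding them with Encoding.UTF8 before calling LoadXml, would avoid the doubled characters.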
