org.apache.xml.serialize.XMLSerializer problem with UTF-8

  • Jim Cobban

    org.apache.xml.serialize.XMLSerializer problem with UTF-8

    I must be missing something.

    I am using org.apache.xml.serialize.XMLSerializer to save a DOM but I am not
    getting non-basic characters converted to UTF-8.

    I create Text nodes in the DOM by, for example:

    Document doc;
    JTextArea textPrompt;
    Text newTextNode;
    Element descElt;
    ....
    newTextNode = doc.createTextNode(textPrompt.getText());
    descElt.appendChild(newTextNode);

    The code to serialize the DOM is:

    private void saveXml(Document document)
    {
        // rename the existing layout file
        new File(fileName).renameTo(new File(fileName + "~"));
        // write the document out
        OutputFormat format = new OutputFormat(document);
        format.setIndenting(true);
        format.setLineWidth(0);
        format.setPreserveSpace(true);
        try {
            XMLSerializer serializer;
            serializer = new XMLSerializer(
                new FileWriter(fileName),
                format);
            serializer.asDOMSerializer();
            serializer.serialize(document);
        }
        catch (IOException ioe)
        {
            ....
        }
    }

    If I enter a character such as é (e with acute accent) into the JTextArea
    and I look at the XML file using a non-UTF-8-aware editor I see that the é
    has been written as a single byte, not as the two-byte UTF-8 sequence.
    If I subsequently try to read the XML file using Xerces it blows up
    because of the invalid byte sequence.

    How do I get a valid serialization of this DOM into XML using UTF-8?


    --
    Jim Cobban jcobban@magma.ca
    34 Palomino Dr.
    Kanata, ON, CANADA
    K2M 1M1
    +1-613-592-9438


  • Soren Kuula

    #2
    Re: org.apache.xml.serialize.XMLSerializer problem with UTF-8

    Jim Cobban wrote:
    > I must be missing something.

    > XMLSerializer serializer;
    > serializer = new XMLSerializer(
    > new FileWriter(fileName),
    > format);
    > serializer.asDOMSerializer();
    > If I enter a character such as e' (e with acute accent) into the JTextArea
    > and I look at the XML file using a non-UTF-8-aware editor I see that the e'
    > has been inserted as a single byte, not as the 2 character UTF-8 escaped
    > value. If I subsequently try to read the XML file using XERCES it blows up
    > because of the invalid escape sequence.
    >
    > How do I get a valid serialization of this DOM into XML using UTF-8?

    As far as I know, it is the Writer that is responsible for the encoding.

    From FileWriter API doc:

    public class FileWriter
    extends OutputStreamWriter

    Convenience class for writing character files. The constructors of this
    class assume that the default character encoding and the default
    byte-buffer size are acceptable. To specify these values yourself,
    construct an OutputStreamWriter on a FileOutputStream.


    - try that.
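
    A minimal sketch of that suggestion, using only the standard library: an
    OutputStreamWriter wrapped around a FileOutputStream with the charset named
    explicitly. The class name Utf8WriterDemo and the helper writeUtf8 are made
    up for illustration; in the code from the question, a Writer built this way
    would presumably take the place of the FileWriter.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8WriterDemo {
    // Hypothetical helper: writes text to the file, always encoded as UTF-8,
    // regardless of the platform's default character encoding.
    static void writeUtf8(Path file, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(file.toFile()), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("utf8-demo", ".txt");
        writeUtf8(file, "\u00e9"); // e with acute accent
        byte[] bytes = Files.readAllBytes(file);
        // Print the bytes in hex to see how the character was encoded.
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // prints "C3 A9"
        Files.delete(file);
    }
}
```

    The accented character comes out as the two bytes C3 A9, which is what a
    UTF-8 parser expects to find.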

    Soren

    --
    Remove the 4 letters in my mail address that are inserted to prevent s...
    Remove the 4 letter word meaning "junk mail" in my mail address.


    • Jim Cobban

      #3
      Re: org.apache.xml.serialize.XMLSerializer problem with UTF-8


      "Soren Kuula" <dongfangspam@bitplanet.net> wrote in message
      news:5K7Db.59147$jf4.3408968@news000.worldonline.dk...
      >
      > As far as I know it is the Writer responsible for the encoding.
      >
      > From FileWriter API doc:
      >
      > public class FileWriter
      > extends OutputStreamWriter
      >
      > Convenience class for writing character files. The constructors of this
      > class assume that the default character encoding and the default
      > byte-buffer size are acceptable. To specify these values yourself,
      > construct an OutputStreamWriter on a FileOutputStream.

      Thank you.

      The problem was that I copied the code from one of the examples that came
      with Xerces, and that example constructed a default FileWriter. Since
      there is a version of the XMLSerializer constructor which takes an
      OutputStream and internally constructs a Writer with the correct "utf-8"
      encoding, that is the form of the constructor which I needed to use. I
      should have read the documentation in more detail rather than trusting that
      the example had been written correctly.
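
      The same principle (hand the serializer a byte stream and let it choose
      the encoding) can be sketched without Xerces on the classpath. The example
      below swaps in the JDK's javax.xml.transform API for XMLSerializer, so it
      illustrates the idea rather than the exact fix described above; the class
      name DomToUtf8Demo, its helpers, and the element name "desc" are
      assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomToUtf8Demo {
    // Serializes the DOM to a byte stream. Because the target is an
    // OutputStream, the serializer itself encodes the characters.
    static void save(Document doc, OutputStream out) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        t.transform(new DOMSource(doc), new StreamResult(out));
    }

    // Builds a one-element document (hypothetical element name "desc")
    // whose only child is a Text node holding the given string.
    static Document sample(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element desc = doc.createElement("desc");
        desc.appendChild(doc.createTextNode(text));
        doc.appendChild(desc);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        save(sample("caf\u00e9"), bytes);
        // Decode as UTF-8 only to display the serialized document.
        System.out.println(bytes.toString("UTF-8"));
    }
}
```

      With Xerces itself, the corresponding change to the code in the question
      would presumably be to pass new FileOutputStream(fileName) to the
      XMLSerializer constructor instead of new FileWriter(fileName).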

