Unicode problem with Java Xerces DOM

  • Dale Gerdemann

    Unicode problem with Java Xerces DOM

    I'm having trouble with Unicode encoding in DOM. As a simple example,
    I read in a UTF-8 encoded xml file such as:

    <?xml version="1.0" encoding="UTF-8" standalone="no" ?>

    <aText>letter 'a' with umlaut: ä</aText>

    And when I serialize it, it comes out encoded as ISO-8859-1. But I
    don't think the problem is with serialization. In processing my XML
    files, I'm matching bits and pieces of text and attributes with some
    Unicode/UTF-8 text read in from another source. When the strings in my
    XML file contain non-ASCII characters, then I have problems.

    Hopefully, I've explained the problem well enough that someone can help.
    In case it's necessary, I've attached a bit of code at the end for
    reading in and serializing a DOM.

    Dale Gerdemann
    ----------------
    import org.xml.sax.InputSource;
    import java.io.FileInputStream;
    import java.io.File;
    import java.io.FileWriter;
    import org.w3c.dom.Document;
    import org.apache.xerces.parsers.DOMParser;
    import org.apache.xerces.dom.DOMImplementationImpl;
    import org.xml.sax.SAXException;
    import org.w3c.dom.DOMException;
    import java.io.IOException;
    import org.w3c.dom.Element;
    import org.apache.xml.serialize.OutputFormat;
    import org.apache.xml.serialize.XMLSerializer;
    import org.apache.xml.serialize.LineSeparator;


    public class AProblem {

        public static void main(String[] args)
                throws DOMException, IOException, SAXException {

            DOMParser parser = new DOMParser();
            InputSource is = new InputSource(new FileInputStream(new File("foo.xml")));
            is.setEncoding("UTF-8");
            parser.parse(is);
            Document doc = parser.getDocument();
            Element root = doc.getDocumentElement();
            System.out.println(root.getChildNodes().item(0));

            OutputFormat format = new OutputFormat(doc);
            format.setLineSeparator(LineSeparator.Unix);
            format.setIndenting(true);
            format.setLineWidth(0);
            format.setPreserveSpace(true);
            format.setEncoding("UTF-8");
            FileWriter fw = new FileWriter("bar.xml");

            XMLSerializer serializer = new XMLSerializer(fw, format);
            serializer.serialize(doc);
        }
    }
  • Kenneth Stephen

    #2
    Re: Unicode problem with Java Xerces DOM

    Dale Gerdemann wrote:
    > I'm having trouble with Unicode encoding in DOM. As a simple example,
    > I read in a UTF-8 encoded xml file such as:
    >
    > <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
    >
    > <aText>letter 'a' with umlaut: ä</aText>
    >
    > And when I serialize it, it comes out encoded as ISO-8859-1. But I
    > don't think the problem is with serialization. In processing my XML
    > files, I'm matching bits and pieces of text and attributes with some
    > Unicode/UTF-8 text read in from another source. When the strings in my
    > XML file contain non-ASCII characters, then I have problems.
    Dale,

    What JDK are you using and under which env?

    Regards,
    Kenneth


    • Steve W. Jackson

      #3
      Re: Unicode problem with Java Xerces DOM

      In article <9313e3af.0409280604.4b681ee6@posting.google.com>,
      dg@sfs.nphil.uni-tuebingen.de (Dale Gerdemann) wrote:

      >:I'm having trouble with Unicode encoding in DOM. As a simple example,
      >:I read in a UTF-8 encoded xml file such as:
      >:
      >:<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
      >:
      >:<aText>letter 'a' with umlaut: ä</aText>
      >:
      >:And when I serialize it, it comes out encoded as ISO-8859-1. But I
      >:don't think the problem is with serialization. In processing my XML
      >:files, I'm matching bits and pieces of text and attributes with some
      >:Unicode/UTF-8 text read in from another source. When the strings in my
      >:XML file contain non-ASCII characters, then I have problems.
      >:
      >:Hopefully, I've explained the problem enough so that someone can help.
      >:In case it's necessary, I attach at the end, a bit of code for reading
      >:in and serializing a DOM.
      >:
      >:Dale Gerdemann
      >:----------------
      >:import org.xml.sax.InputSource;
      >:import java.io.FileInputStream;
      >:import java.io.File;
      >:import java.io.FileWriter;
      >:import org.w3c.dom.Document;
      >:import org.apache.xerces.parsers.DOMParser;
      >:import org.apache.xerces.dom.DOMImplementationImpl;
      >:import org.xml.sax.SAXException;
      >:import org.w3c.dom.DOMException;
      >:import java.io.IOException;
      >:import org.w3c.dom.Element;
      >:import org.apache.xml.serialize.OutputFormat;
      >:import org.apache.xml.serialize.XMLSerializer;
      >:import org.apache.xml.serialize.LineSeparator;
      >:
      >:
      >:public class AProblem {
      >:
      >: public static void main(String[] args)
      >: throws DOMException, IOException, SAXException {
      >:
      >: DOMParser parser = new DOMParser();
      >: InputSource is = new InputSource(new FileInputStream(new
      >:File("foo.xml")));
      >: is.setEncoding("UTF-8");
      >: parser.parse(is);
      >: Document doc = parser.getDocument();
      >: Element root = doc.getDocumentElement();
      >: System.out.println(root.getChildNodes().item(0));
      >:
      >:
      >:
      >: OutputFormat format = new OutputFormat(doc);
      >: format.setLineSeparator(LineSeparator.Unix);
      >:
      >: format.setIndenting(true);
      >: format.setLineWidth(0);
      >: format.setPreserveSpace(true);
      >: format.setEncoding("UTF-8");
      >: FileWriter fw = new FileWriter("bar.xml");
      >:
      >: XMLSerializer serializer = new XMLSerializer(fw, format);
      >: serializer.serialize(doc);
      >:
      >:
      >: }
      >:}

      I've encountered this problem myself. The solution was to use something
      other than a FileWriter to output the new XML document, because the
      encoding declared in the XML and the encoding of the bytes actually
      written to the file have to agree.

      Your OutputFormat object specifies that the XML gets UTF-8 encoding, but
      the FileWriter will use your system's default encoding. What I use now
      is an OutputStreamWriter with its constructor taking an OutputStream (I
      use a FileOutputStream) and a String naming the encoding. That solved
      the problem for me.
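
      As a minimal sketch of the writer change described above, using only
      JDK classes: the class name Utf8WriterSketch and the helper writeUtf8
      are made up for illustration, and the Xerces serializer is left out so
      the example runs on its own. In the posted program, a Writer built this
      way would be passed to the XMLSerializer constructor in place of the
      FileWriter.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical class name, for illustration only.
public class Utf8WriterSketch {

    // Writes text to a file using an explicitly named encoding,
    // rather than the platform default that FileWriter would pick.
    public static void writeUtf8(String fileName, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(new FileOutputStream(fileName), "UTF-8")) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<aText>letter 'a' with umlaut: \u00e4</aText>\n";
        writeUtf8("bar.xml", xml);

        // Read the raw bytes back and compare them with the UTF-8 encoding of the text.
        byte[] written = Files.readAllBytes(Paths.get("bar.xml"));
        System.out.println(Arrays.equals(written, xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

      With this writer, the bytes on disk match the encoding named in the XML
      declaration no matter what the system default is.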

      I also note that you're specifying UTF-8 on input. While I doubt it
      does any harm, it shouldn't be necessary.
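
      To illustrate why the input side should not need an explicit encoding,
      here is a small sketch. It uses the JDK's JAXP DocumentBuilder rather
      than the Xerces DOMParser from the original post, so treat it as an
      analogy and not as the poster's setup; the class and method names are
      invented. When the parser is handed a byte stream, it takes the
      encoding from the XML declaration, and the resulting Java strings are
      plain Unicode either way.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical class name, for illustration only.
public class EncodingDetectionSketch {

    // Parses raw XML bytes without naming an encoding and returns the root element's text.
    public static String parseText(byte[] xmlBytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String body = "<aText>letter 'a' with umlaut: \u00e4</aText>";

        // Same document, stored in two different byte encodings.
        byte[] utf8 = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + body)
                .getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" + body)
                .getBytes(StandardCharsets.ISO_8859_1);

        // Both parse to the same in-memory string.
        System.out.println(parseText(utf8).equals(parseText(latin1)));
    }
}
```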

      Hope this helps.

      = Steve =
      --
      Steve W. Jackson
      Montgomery, Alabama
