I'm having trouble with Unicode encoding in DOM. As a simple example,
I read in a UTF-8 encoded xml file such as:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<aText>letter 'a' with umlaut: ä</aText>
When I serialize it, it comes out encoded as ISO-8859-1. But I
don't think the problem is with serialization. In processing my XML
files, I'm matching bits and pieces of text and attributes against some
Unicode/UTF-8 text read in from another source. Whenever the strings in
my XML file contain non-ASCII characters, the matching goes wrong.
Hopefully I've explained the problem well enough that someone can help.
In case it's needed, I've attached a bit of code at the end for reading
in and serializing a DOM.
Dale Gerdemann
----------------
import org.xml.sax.InputSource;
import java.io.FileInputStream;
import java.io.File;
import java.io.FileWriter;
import org.w3c.dom.Document;
import org.apache.xerces.parsers.DOMParser;
import org.apache.xerces.dom.DOMImplementationImpl;
import org.xml.sax.SAXException;
import org.w3c.dom.DOMException;
import java.io.IOException;
import org.w3c.dom.Element;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;
import org.apache.xml.serialize.LineSeparator;

public class AProblem {
    public static void main(String[] args)
            throws DOMException, IOException, SAXException {
        // Read foo.xml into a DOM, declaring the input as UTF-8.
        DOMParser parser = new DOMParser();
        InputSource is = new InputSource(
                new FileInputStream(new File("foo.xml")));
        is.setEncoding("UTF-8");
        parser.parse(is);
        Document doc = parser.getDocument();
        Element root = doc.getDocumentElement();
        System.out.println(root.getChildNodes().item(0));

        // Serialize the DOM to bar.xml, asking for UTF-8 output.
        OutputFormat format = new OutputFormat(doc);
        format.setLineSeparator(LineSeparator.Unix);
        format.setIndenting(true);
        format.setLineWidth(0);
        format.setPreserveSpace(true);
        format.setEncoding("UTF-8");
        FileWriter fw = new FileWriter("bar.xml");
        XMLSerializer serializer = new XMLSerializer(fw, format);
        serializer.serialize(doc);
    }
}
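
One likely cause worth checking: FileWriter(String) encodes characters with the platform default charset, whatever encoding the serializer's output format declares. On a system whose default is ISO-8859-1, that alone would produce the output described above. The sketch below is not the Xerces code from the post. It swaps in the standard JDK JAXP parser and Transformer so that it runs without extra libraries, and the class name Utf8RoundTrip and its two helper methods are hypothetical names chosen for illustration. The point it tries to show is that the writer is given UTF-8 explicitly instead of inheriting the platform default.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of the original post.
public class Utf8RoundTrip {

    /** Parses raw XML bytes; the parser picks up the encoding from the XML declaration. */
    public static Document parse(byte[] xmlBytes) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
    }

    /** Serializes a DOM to bytes through a writer whose charset is explicitly UTF-8. */
    public static byte[] serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Controls the declared encoding in the XML declaration.
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // The writer, not the output property, decides the bytes that get written.
        try (Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<aText>letter 'a' with umlaut: \u00e4</aText>";
        Document doc = parse(xml.getBytes(StandardCharsets.UTF_8));
        Document again = parse(serialize(doc));
        // Compare the text before and after the round trip.
        System.out.println(doc.getDocumentElement().getTextContent()
                .equals(again.getDocumentElement().getTextContent()));
    }
}
```

For the file-based code in the post, the analogous change would presumably be to replace the FileWriter with an OutputStreamWriter wrapped around a FileOutputStream and constructed with "UTF-8". The same reasoning applies to the text read from the other source: if it is decoded with the platform default rather than UTF-8, non-ASCII strings will not match the ones in the DOM.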