[Crossposted as the questions to each group might sound a little
strange without context; trim groups if necessary]
The idea here is relatively simple: a Java program (I'm using JDK 1.4
if that makes a difference) that loads an HTML file, removes invalid
characters (or replaces them in the case of common ones like
Microsoft's 'smartquotes'), and outputs the file.
The problem is these files will be on disk, so the program won't have
the character encoding information from the server.
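To make the cleaning step concrete, here is a rough sketch of what I have in mind. It assumes the text has already been decoded into a String, and that Windows-1252 smartquotes were decoded as ISO-8859-1 and so show up as the control characters U+0091 to U+0094. The class name SmartQuoteCleaner and the decision to simply drop every other control character are my own assumptions for illustration, not a settled design.

```java
public class SmartQuoteCleaner {

    /**
     * Hypothetical helper: replaces mis-decoded smartquotes with plain
     * ASCII quotes and drops other control characters.
     * Uses StringBuffer so it should also compile on JDK 1.4.
     */
    public static String clean(String in) {
        StringBuffer out = new StringBuffer(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c == 0x91 || c == 0x92) {
                // Windows-1252 single smartquotes read as ISO-8859-1
                out.append('\'');
            } else if (c == 0x93 || c == 0x94) {
                // Windows-1252 double smartquotes read as ISO-8859-1
                out.append('"');
            } else if (c >= 0x80 && c <= 0x9F) {
                // Remaining C1 controls: assumed invalid, so dropped
            } else if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                // C0 controls other than tab, LF and CR: dropped
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

If the file were decoded as Windows-1252 instead, the smartquotes would arrive as U+2018, U+2019, U+201C and U+201D, and the mapping above would need adjusting.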
Questions:
1) I presume Java will correctly identify UTF-16BE and UTF-16LE from
the byte order marks. How does it identify other encodings? Will it
just assume the system default encoding until it finds bytes that
imply UTF-8? The program will mainly deal with UTF-16, UTF-8,
ISO-8859-1 and US-ASCII, but others may occur.
2) I'm slightly confused by the HTML specification - are the valid
characters precisely those that are defined in Unicode? (Java
internally works with 16-bit characters.) (I'm ignoring at this point
characters that in HTML need escaping.)
3) If it fails on esoteric character encodings, how badly is it likely
to fail? Will it totally trash the HTML?
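In case Java doesn't do any of this for me, this is roughly the fallback I would write myself: check the first bytes for a byte order mark and otherwise decode with a caller-supplied default. It is only a sketch under my own assumptions. The class name BomSniffer and the fallback parameter are made up for illustration, it only knows the UTF-8 and UTF-16 marks, and it makes no attempt to guess an encoding when there is no mark at all.

```java
import java.io.UnsupportedEncodingException;

public class BomSniffer {

    /** Returns the charset name implied by a leading BOM, or null if none. */
    public static String sniff(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2
                && (head[0] & 0xFF) == 0xFE
                && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2
                && (head[0] & 0xFF) == 0xFF
                && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    /**
     * Decodes the whole file content, skipping the BOM if one is present.
     * The fallback charset (an assumption supplied by the caller) is used
     * when no BOM is found.
     */
    public static String decode(byte[] data, String fallback) {
        String charset = sniff(data);
        int skip = 0;
        if (charset == null) {
            charset = fallback;
        } else {
            skip = charset.equals("UTF-8") ? 3 : 2;
        }
        try {
            return new String(data, skip, data.length - skip, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charset);
        }
    }
}
```

For example, BomSniffer.decode(bytes, "ISO-8859-1") would treat any file without a mark as ISO-8859-1, which is exactly the kind of guess I'm unsure about.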
--
Safalra (Stephen Morley)