resolving an entity

  • Dean A. Hoover

    resolving an entity

    I am writing a parser for xml that will not have
    an associated DTD. I want to be able to handle
    certain character references (e.g., &copy;) in
    the program.

    When I run the following against a chunk of xml
    containing &copy;, I get the following:

    org.xml.sax.SAXParseException: Reference to undefined entity "&copy;".
    at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3182)
    at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3176)
    at org.apache.crimson.parser.Parser2.expandEntityInContent(Parser2.java:2513)
    at org.apache.crimson.parser.Parser2.maybeReferenceInContent(Parser2.java:2422)
    at org.apache.crimson.parser.Parser2.content(Parser2.java:1833)
    at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1507)
    at org.apache.crimson.parser.Parser2.content(Parser2.java:1779)
    at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1507)
    at org.apache.crimson.parser.Parser2.content(Parser2.java:1779)
    at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1507)
    at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:500)
    at org.apache.crimson.parser.Parser2.parse(Parser2.java:305)
    at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:442)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:345)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:281)
    at Article.main(Article.java:18)

    What can I do to catch these references in my code and output replacement
    text for them?

    Thanks.
    Dean Hoover

    Here are the two Java files:
    ---
    import java.io.*;
    import javax.xml.parsers.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    public class Article
    {
        public static void main(String argv[])
        {
            String file = argv[0];
            PrintWriter pw = new PrintWriter(System.out);
            DefaultHandler handler = new LoadXML(pw, LoadXML.TYPE_HTML);
            SAXParserFactory factory = SAXParserFactory.newInstance();

            try
            {
                SAXParser reader = factory.newSAXParser();
                reader.parse(new File(file), handler);
            }
            catch (Exception e)
            {
                e.printStackTrace();
                return;
            }

            pw.flush();
        }
    }
    ---
    import java.io.*;
    import java.util.*;
    import javax.xml.parsers.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    public class LoadXML extends DefaultHandler
    {
        public static final int TYPE_HTML = 1;
        public static final int TYPE_TEXT = 2;

        public LoadXML
        (
            java.io.Writer writer,
            int type
        )
        {
            elements_ = new Stack();
            writer_ = writer;
            type_ = type;
        }

        public InputSource resolveEntity
        (
            String publicId,
            String systemId
        ) throws SAXException
        {
            String s = "stuff";
            return new InputSource(new CharArrayReader(s.toCharArray()));
        }

        public void startDocument() throws SAXException
        {
        }

        public void endDocument() throws SAXException
        {
        }

        public void startElement
        (
            String uri,
            String localName,
            String qName,
            Attributes attributes
        ) throws SAXException
        {
            String elementName = qName;
            elements_.push(elementName);

            try
            {
                if (elementName.equals("p"))
                {
                    if (type_ == TYPE_HTML)
                        writer_.write("<p class=\"article-text\">");
                }
                else if (elementName.equals("title"))
                {
                    if (type_ == TYPE_HTML)
                        writer_.write("<p class=\"article-title\">");
                }
                else if (elementName.equals("by"))
                {
                    if (type_ == TYPE_HTML)
                        writer_.write("<p class=\"article-by\">");
                }
                else if (elementName.equals("copyright"))
                {
                    if (type_ == TYPE_HTML)
                        writer_.write("<p class=\"article-copyright\">");
                }
            }
            catch (IOException e)
            {
                throw new SAXException(e);
            }
        }

        public void endElement
        (
            String uri,
            String localName,
            String qName
        ) throws SAXException
        {
            String elementName = qName;
            elements_.pop();

            try
            {
                if (type_ == TYPE_HTML)
                {
                    if (elementName.equals("p") || elementName.equals("title") ||
                        elementName.equals("by") || elementName.equals("copyright"))
                    {
                        writer_.write("</p>\n");
                    }
                    else if (elementName.equals("br"))
                    {
                        writer_.write("<br/>\n");
                    }
                }
            }
            catch (IOException e)
            {
                throw new SAXException(e);
            }
        }

        public void characters
        (
            char[] ch,
            int start,
            int length
        ) throws SAXException
        {
            try
            {
                String content = new String(ch, start, length);
                String top = (String)elements_.peek();
                String text =
                    content.replaceAll("\n", " ").replaceAll(" +", " ").trim();

                if (text.length() == 0)
                    return;

                if (type_ == TYPE_HTML)
                {
                    if (top.equals("p") || top.equals("title") ||
                        top.equals("by") || top.equals("copyright"))
                        writer_.write(text);
                }
            }
            catch (IOException e)
            {
                throw new SAXException(e);
            }
        }

        private Stack elements_;
        private java.io.Writer writer_;
        private int type_;
    }



  • Maarten Wiltink

    #2
    Re: resolving an entity

    "Dean A. Hoover" <dhxyz2010@yahoo.com> wrote in message
    news:4qqAb.189389$ZC4.25966@twister.nyroc.rr.com...
    > I am writing a parser for xml that will not have
    > an associated DTD. I want to be able to handle
    > certain character references (e.g., &copy;) in
    > the program.

    As I understand it, that's quite impossible. The case is defined
    in the spec, and without a DTD you don't get to choose what
    entities are defined or not.

    But DTD may not mean what you think it does. Would it be permissible
    for this document to have an internal DTD subset?

    <?xml version="1.0"?>
    <!DOCTYPE root [ <!ENTITY copy 'copy'> ]>
    <root>&copy;</root>

    A quick reading of the XML spec suggests (but I may have missed
    something) that this is a correct construction in XML.
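
One way to check this without reading the spec is to feed both variants to a non-validating SAX parser. The following is only a sketch: the class name `SubsetCheck` and the helper `parses` are made up for illustration, and it assumes the JAXP parser bundled with the JDK rather than Crimson.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the code posted in this thread.
final class SubsetCheck {
    /** Returns true if the default (non-validating) SAX parser accepts the text. */
    static boolean parses(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Fatal errors, such as a reference to an undeclared entity, end up here.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String bare = "<root>&copy;</root>";
        String withSubset = "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE root [ <!ENTITY copy 'copy'> ]>\n"
            + bare;
        System.out.println("without DOCTYPE:      " + parses(bare));
        System.out.println("with internal subset: " + parses(withSubset));
    }
}
```

The document with the internal subset should be accepted, while the bare one should be rejected with the same kind of error as in the original stack trace.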

    Groetjes,
    Maarten Wiltink



    • Dean A. Hoover

      #3
      Re: resolving an entity

      Maarten Wiltink wrote:
      > "Dean A. Hoover" <dhxyz2010@yahoo.com> wrote in message
      > news:4qqAb.189389$ZC4.25966@twister.nyroc.rr.com...
      >
      >> I am writing a parser for xml that will not have
      >> an associated DTD. I want to be able to handle
      >> certain character references (e.g., &copy;) in
      >> the program.
      >
      > As I understand it, that's quite impossible. The case is defined
      > in the spec, and without a DTD you don't get to choose what
      > entities are defined or not.
      >
      > But DTD may not mean what you think it does. Would it be permissible
      > for this document to have an internal DTD subset?
      >
      > <?xml version="1.0"?>
      > <!DOCTYPE root [ <!ENTITY copy 'copy'> ]>
      > <root>&copy;</root>
      >
      > A quick reading of the XML spec suggests (but I may have missed
      > something) that this is a correct construction in XML.

      I really don't want any DTD in the document at all. I am writing
      some code that will parse an xml document and output either html
      or plain text depending on a parameter. In the case of HTML it
      would output "&copy;", in the case of plain text it would output
      "(c)". I have other similar context based entities to handle as
      well.

      Dean


      • Martin Honnen

        #4
        Re: resolving an entity



        Dean A. Hoover wrote:
        > Maarten Wiltink wrote:
        >
        >> "Dean A. Hoover" <dhxyz2010@yahoo.com> wrote in message
        >> news:4qqAb.189389$ZC4.25966@twister.nyroc.rr.com...
        >>
        >>> I am writing a parser for xml that will not have
        >>> an associated DTD. I want to be able to handle
        >>> certain character references (e.g., &copy;) in
        >>> the program.
        >>
        >> As I understand it, that's quite impossible. The case is defined
        >> in the spec, and without a DTD you don't get to choose what
        >> entities are defined or not.
        >>
        >> But DTD may not mean what you think it does. Would it be permissible
        >> for this document to have an internal DTD subset?
        >>
        >> <?xml version="1.0"?>
        >> <!DOCTYPE root [ <!ENTITY copy 'copy'> ]>
        >> <root>&copy;</root>
        >>
        >> A quick reading of the XML spec suggests (but I may have missed
        >> something) that this is a correct construction in XML.
        >
        > I really don't want any DTD in the document at all. I am writing
        > some code that will parse an xml document and output either html
        > or plain text depending on a parameter. In the case of HTML it
        > would output "&copy;", in the case of plain text it would output
        > "(c)". I have other similar context based entities to handle as
        > well.

        Well, if you write your own parser then you can of course parse
        something like XML but with references to undefined entities. But then
        don't attempt to parse it with an XML parser, which expects entities to
        be defined.
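
One reading of this advice, which goes beyond what is stated in the thread, is to treat the input as XML-like text and rewrite the undeclared references before any real XML parser sees it. The sketch below does that with a regular expression. The class `EntityPreprocessor`, its `expand` method and the replacement table are all hypothetical, and the pattern is deliberately naive: it does not know about CDATA sections or comments.

```java
import java.util.Collections;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processing step, run on the raw text before XML parsing.
final class EntityPreprocessor {
    // Matches named references such as &copy; but not numeric ones such as &#169;.
    private static final Pattern ENTITY_REF =
        Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String expand(String text, Map<String, String> replacements) {
        Matcher m = ENTITY_REF.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement = replacements.get(m.group(1));
            // Names missing from the table (amp, lt, gt, ...) are left untouched.
            String value = (replacement != null) ? replacement : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String source = "<copyright>&copy; 2003 A &amp; B</copyright>";
        // Plain-text mode: substitute "(c)".
        System.out.println(expand(source, Collections.singletonMap("copy", "(c)")));
        // HTML mode: escape the ampersand so a parser later reports "&copy;" as text.
        System.out.println(expand(source, Collections.singletonMap("copy", "&amp;copy;")));
    }
}
```

The result is ordinary well-formed XML that can then be passed to a standard SAX parser.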

        --

        Martin Honnen



        • Maarten Wiltink

          #5
          Re: resolving an entity

          "Dean A. Hoover" <dhxyz2010@yahoo.com> wrote in message
          news:uLvAb.190449$ZC4.95913@twister.nyroc.rr.com...
          > Maarten Wiltink wrote:
          >> "Dean A. Hoover" <dhxyz2010@yahoo.com> wrote in message
          >> news:4qqAb.189389$ZC4.25966@twister.nyroc.rr.com...
          >>> I am writing a parser for xml that will not have
          >>> an associated DTD. I want to be able to handle
          >>> certain character references (e.g., &copy;) in
          >>> the program.
          [...]
          > I really don't want any DTD in the document at all. I am writing
          > some code that will parse an xml document and output either html
          > or plain text depending on a parameter. In the case of HTML it
          > would output "&copy;", in the case of plain text it would output
          > "(c)". I have other similar context based entities to handle as
          > well.

          That's reasonable, but entities simply aren't the solution.
          Would using processing instructions instead be acceptable?

          In XSLT, you could even source in the transformation itself
          with document('') and switch treatment of <?copy?> based on
          the output method.

          I'm working under the assumption that you want the source to
          be well-formed XML, valid if possible.
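
In SAX terms, the processing-instruction idea could look roughly like the sketch below. It is not taken from the posted `LoadXML` class: `PiRenderer` and `render` are illustrative names, and the boolean flag stands in for the `TYPE_HTML` / `TYPE_TEXT` switch. The handler overrides `processingInstruction`, which SAX calls for `<?copy?>`, and writes the replacement that fits the output type.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: maps the <?copy?> processing instruction to output text.
final class PiRenderer extends DefaultHandler {
    private final boolean html;
    private final StringBuilder out = new StringBuilder();

    PiRenderer(boolean html) {
        this.html = html;
    }

    @Override
    public void processingInstruction(String target, String data) {
        if ("copy".equals(target)) {
            out.append(html ? "&copy;" : "(c)");
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
    }

    /** Parses the text and returns the collected character data plus replacements. */
    static String render(String xml, boolean html) {
        PiRenderer handler = new PiRenderer(html);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.out.toString();
    }

    public static void main(String[] args) {
        String source = "<copyright><?copy?> 2003 Example Author</copyright>";
        System.out.println(render(source, true));   // HTML flavour
        System.out.println(render(source, false));  // plain-text flavour
    }
}
```

Because the source needs no DOCTYPE at all, this keeps the document well-formed without declaring anything.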

          Groetjes,
          Maarten Wiltink



          • Richard Tobin

            #6
            Re: resolving an entity

            In article <4qqAb.189389$ZC4.25966@twister.nyroc.rr.com>,
            Dean A. Hoover <dhxyz2010@yahoo.com> wrote:
            >I am writing a parser for xml that will not have
            >an associated DTD. I want to be able to handle
            >certain character references (e.g., &copy;) in
            >the program.

            Well, this is not *real* XML.

            The simplest thing to do would be to read the file into a string and
            prepend an internal subset that declares the entities in question.
            This will be easy if you know that there isn't an XML declaration or
            DOCTYPE declaration in the file and you know the file's encoding.
            Otherwise it will be more tedious.
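
A minimal sketch of the prepend approach, under exactly the easy-case assumptions above: the text is already in a string and contains neither an XML declaration nor a DOCTYPE. The names `PrependSubset`, `withSubset` and `parseToText` are made up for illustration, and the root element name must be known in advance. Note the double escaping in the HTML variant: `&#38;#38;copy;` becomes the replacement text `&#38;copy;`, which the parser then reports as the literal characters `&copy;`.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the code posted in this thread.
final class PrependSubset {
    /**
     * Prepends a DOCTYPE whose internal subset declares the "copy" entity.
     * Assumes copyValue contains no double quote or percent sign.
     */
    static String withSubset(String xml, String rootName, String copyValue) {
        return "<!DOCTYPE " + rootName
            + " [ <!ENTITY copy \"" + copyValue + "\"> ]>\n" + xml;
    }

    /** Parses the text and returns all character data, entities expanded. */
    static String parseToText(String xml) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "<root>&copy; 2003 Example</root>";
        // Plain-text output: the entity expands to "(c)".
        System.out.println(parseToText(withSubset(source, "root", "(c)")));
        // HTML output: the entity expands to the literal text "&copy;".
        System.out.println(parseToText(withSubset(source, "root", "&#38;#38;copy;")));
    }
}
```

Choosing the entity value per output type means the handler itself does not need to know about `&copy;` at all.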

            -- Richard
            --
            Spam filter: to mail me from a .com/.net site, put my surname in the headers.

            FreeBSD rules!
