Finding and replacing Invalid Tokens in an XML document

  • Ben Holness

    Hi all,

    I have a system which allows users to enter a message on a (PHP) website.
    This message is then put into a (MySQL) Database.

    A Perl script then picks up the message and creates an XML document.

    The webpages, database and XML are all UTF-8; however, every now and then
    the XML parser reports an error telling me I have an invalid token. This
    occurs when the message contains particular characters, although I don't
    know which characters: all I can see in the logs is the ANSI
    representation (e.g. @^C). If I copy and paste it into Word, I get a
    square box after the @ that takes two right cursor presses to go past.

    My script catches the invalid token, but rather than fail the message
    completely, I would like to replace the bad characters with a space.
    Is there a simple way to find these characters, or do I have to write a
    function that looks at the contents of $@ after the eval and works out
    where the character is from the line/column/byte information in order to
    fix it?
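
One way to see which characters are involved, without parsing $@, is to scan the message for anything outside the character ranges that XML 1.0 allows. This is only a sketch: the helper name find_invalid_xml_chars is made up for illustration, and it assumes the problem is a disallowed character such as a control code (the ^C in the log suggests U+0003) rather than a malformed UTF-8 byte sequence, which this check would not detect.

```perl
use strict;
use warnings;

# Characters that XML 1.0 does NOT allow: anything outside
# #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF.
my $INVALID_XML_CHAR =
    qr/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/;

# Hypothetical helper (not part of XML::Simple): returns a list of
# [offset, code point] pairs, one per disallowed character in $text.
sub find_invalid_xml_chars {
    my ($text) = @_;
    my @hits;
    while ($text =~ /($INVALID_XML_CHAR)/g) {
        push @hits, [ $-[1], ord $1 ];   # $-[1] is where the match started
    }
    return @hits;
}

# Example input: "@" followed by Ctrl-C (U+0003), as in the log excerpt.
for my $hit (find_invalid_xml_chars("hello @\x03 world")) {
    printf "offset %d: U+%04X\n", @$hit;
}
```

For the example string this reports one hit, at offset 7 with code point U+0003. Logging the code points this way shows exactly which characters the users are managing to enter.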

    FYI, the XML is created and parsed with XML::Simple and UTF-8 encoded with
    encode. I have included a simplified snippet (written into this post, so
    it may contain typos) at the end of this post.

    Cheers,

    Ben

    -- Snippet of Code --

    use XML::Simple;
    use Encode qw(encode);

    # $MessageText is pulled from the database and may contain bad
    # characters.

    # Build a hash of the elements
    my %arr;
    $arr{'Message'} = encode("UTF-8", $MessageText);

    # Convert the hash into an XML document with XMLout
    my $tempxml = XML::Simple->new(NoAttr => 1, RootName => 'WebMessage');
    my $xmldoc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
    $xmldoc .= $tempxml->XMLout(\%arr);

    # Parse the XML document
    my $tempxml2 = XML::Simple->new(ForceArray => 1);
    eval { $tempxml2->XMLin($xmldoc); };
    if ($@)
    {
        # An error occurred. Usually an invalid token due to a bad character
        # in $MessageText
    }
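
A possible way to do the replacement asked about above is to clean the message before it ever reaches XMLout, so the parse step has nothing to reject. The sketch below is an assumption-laden example, not a tested fix for this system: sanitize_for_xml is a hypothetical helper name, and it only handles characters that XML 1.0 forbids outright (mostly control codes such as ^C). It does not repair malformed UTF-8 byte sequences.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character outside the XML 1.0
# allowed ranges (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD,
# #x10000-#x10FFFF) with a single space and return the cleaned copy.
sub sanitize_for_xml {
    my ($text) = @_;
    $text =~ s/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/ /g;
    return $text;
}

# Example: the Ctrl-C after the "@" becomes a space; everything else,
# including tabs and newlines, is left alone.
my $clean = sanitize_for_xml("hello @\x03 world");
print "$clean\n";
```

In the snippet above, this would presumably be called on $MessageText just before the encode step, e.g. encode("UTF-8", sanitize_for_xml($MessageText)). If $MessageText holds raw UTF-8 bytes rather than decoded characters, the control codes are still caught because they are single bytes below 0x20, and bytes 0x80 to 0xFF fall inside the allowed range, so multi-byte sequences are left untouched.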
