xmlrpclib and decoding entity references

  • Chris Curvey

    xmlrpclib and decoding entity references

    I'm writing an XMLRPC server, which is receiving a request (from a
    non-Python client) that looks like this (formatted for legibility):

    <?xml version="1.0"?>
    <methodCall>
    <methodName>echo</methodName>
    <params>
    <param>
    <value>
    <string>Le Martyre de Saint Andr&#xe9; &lt;BR&gt; avec inscription
    &apos;Le Dominiquain.&apos; et &apos;Le tableau fait par le dominicain,
    d&apos;apr&#xe8;s son dessein &#xe0;... est &#xe0; Rome, &#xe0;
    l&apos;&#xe9;glise Saint Andr&#xe9; della Valle&apos; sur le
    cadre&lt;BR&gt; craie noire, plume et encre brune, lavis brun
    rehauss&#xe9; de blanc sur papier brun&lt;BR&gt; 190 x 228 mm. (7 1/2 x
    9 in.)</string>
    </value>
    </param>
    </params>
    </methodCall>

    But when my "echo" method is invoked, the value of the string is:

    Le Martyre de Saint Andr; <BR> avec inscription 'Le Dominiquain.' et
    'Le tableau fait par le dominicain, d'apr:s son dessein 2... est 2
    Rome, 2 l';glise Saint Andr; della Valle' sur le cadre<BR> craie noire,
    plume et encre brune, lavis brun rehauss; de blanc sur papier brun<BR>
    190 x 228 mm. (7 1/2 x 9 in.)

    Can anyone give me a lead on how to convert the entity references into
    something that will make it through to my method call?

  • Chris Curvey

    #2
    Re: xmlrpclib and decoding entity references

    Yep, I'm using SimpleXMLRPCServer, but something is getting messed up
    between the receipt of the XML stream and the delivery to my function.
    The "normal" entity references (like &lt; and &amp;) are handled OK,
    but the character references are not working. For instance,

    "Andr&#xe9;" is received by the server, but it's delivered to the
    function as "Andr;"

    I've figured out how to parse through the string to find all the
    character references and convert them back, but that seems to be
    causing a ProtocolError.

    Hopefully someone can lend me a clue; I really don't want to have to
    switch over to SOAP and end up in WSDL hell.

    • Chris Curvey

      #3
      Re: xmlrpclib and decoding entity references

      Here is the solution. Incidentally, the client is Cold Fusion.

      import re
      import logging
      import logging.config
      import os
      import SimpleXMLRPCServer

      logging.config.fileConfig("logging.ini")

      ############################################################################
      class LoggingXMLRPCRequestHandler(SimpleXMLRPCServer.CGIXMLRPCRequestHandler):
          def __dereference(self, request_text):
              entityRe = re.compile("((?P<er>&#x)(?P<code>..)(?P<semi>;))")
              for m in re.finditer(entityRe, request_text):
                  hexref = int(m.group(3), 16)
                  charref = chr(hexref)
                  request_text = request_text.replace(m.group(1), charref)

              return request_text


          #-------------------------------------------------------------------
          def handle_xmlrpc(self, request_text):
              logger = logging.getLogger()
              #logger.debug("************************************")
              #logger.debug(request_text)
              try:
                  #logger.debug("-------------------------------------")
                  request_text = self.__dereference(request_text)
                  #logger.debug(request_text)
                  request_text = request_text.decode("latin-1").encode('utf-8')
                  #logger.debug("************************************")
              except Exception, e:
                  logger.error(request_text)
                  logger.error("had a problem dereferencing")
                  logger.error(e)

              SimpleXMLRPCServer.CGIXMLRPCRequestHandler.handle_xmlrpc(self, request_text)
      ############################################################################
      class Foo:
          def settings(self):
              return os.environ
          def echo(self, something):
              logger = logging.getLogger()
              logger.debug(something)
              return something
          def greeting(self, name):
              return "hello, " + name

      # these are used to run as a CGI
      handler = LoggingXMLRPCRequestHandler()
      handler.register_instance(Foo())
      handler.handle_request()

      • Bengt Richter

        #4
        Re: xmlrpclib and decoding entity references

        On 3 May 2005 08:07:06 -0700, "Chris Curvey" <ccurvey@gmail.com> wrote:

        >I'm writing an XMLRPC server, which is receiving a request (from a
        >non-Python client) that looks like this (formatted for legibility):
        >
        ><?xml version="1.0"?>
        ><methodCall>
        ><methodName>echo</methodName>
        ><params>
        ><param>
        ><value>
        ><string>Le Martyre de Saint Andr&#xe9; &lt;BR&gt; avec inscription
        >&apos;Le Dominiquain.&apos; et &apos;Le tableau fait par le dominicain,
        >d&apos;apr&#xe8;s son dessein &#xe0;... est &#xe0; Rome, &#xe0;
        >l&apos;&#xe9;glise Saint Andr&#xe9; della Valle&apos; sur le
        >cadre&lt;BR&gt; craie noire, plume et encre brune, lavis brun
        >rehauss&#xe9; de blanc sur papier brun&lt;BR&gt; 190 x 228 mm. (7 1/2 x
        >9 in.)</string>
        ></value>
        ></param>
        ></params>
        ></methodCall>
        >
        >But when my "echo" method is invoked, the value of the string is:
        >
        >Le Martyre de Saint Andr; <BR> avec inscription 'Le Dominiquain.' et
        >'Le tableau fait par le dominicain, d'apr:s son dessein 2... est 2
        >Rome, 2 l';glise Saint Andr; della Valle' sur le cadre<BR> craie noire,
        >plume et encre brune, lavis brun rehauss; de blanc sur papier brun<BR>
        >190 x 228 mm. (7 1/2 x 9 in.)
        >
        >Can anyone give me a lead on how to convert the entity references into
        >something that will make it through to my method call?
        >
        I haven't used XMLRPC but superficially this looks like a quoting and/or encoding
        problem. IOW, your "request" is XML, and the <string>...</string> part is also XML
        which is part of the whole, not encapsulated in e.g. <![CDATA[...stuff...]]>
        (which would tell an XML parser to suspend markup interpretation of ...stuff...).

        So IWT you would at least need the <string>...</string> content to be converted to
        unicode to preserve all the represented characters. It wouldn't surprise me if the
        whole request is routinely converted to unicode, and the "value" you are showing
        above is a result of converting from unicode to an encoding that can't represent
        everything, and maybe just drops conversion errors. What do you
        get if you print repr(value)? (assuming value is passed to your echo method)

        If it is a unicode string, you will just have to choose an appropriate value.encode('appropriate')
        from available codecs. If it looks like e.g., a utf-8 encoding of unicode, you could try
        value.decode('utf-8').encode('appropriate')
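
To make the encode/decode suggestion concrete, here is a minimal sketch in Python 3 syntax (where the thread's `unicode` type is simply `str`). The sample value and the target codecs are assumptions chosen for illustration, not data from the thread.

```python
# Python 3 sketch: encoding text to codecs that can or cannot represent it.
# The sample value is an assumption, not taken from the real request.
value = "Saint Andr\u00e9"           # the last character is U+00E9

print(repr(value))                    # inspect the real contents first

as_latin1 = value.encode("latin-1")   # latin-1 can represent U+00E9
as_utf8 = value.encode("utf-8")       # utf-8 needs two bytes for it

# ASCII cannot represent U+00E9: errors="ignore" silently drops it,
# errors="replace" substitutes a question mark.
dropped = value.encode("ascii", errors="ignore")
replaced = value.encode("ascii", errors="replace")

# Round trip from utf-8 bytes to another encoding, as suggested above.
recoded = as_utf8.decode("utf-8").encode("latin-1")
```

The `errors="ignore"` case is the kind of silent loss guessed at above: the accented character simply disappears from the result.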

        I'm just guessing here. But something is interpreting the basic XML, since
        &lt;BR&gt; is being converted to <BR>. Seems not unlikely that the rest are
        also being converted, and to unicode. You just wouldn't notice a glitch when
        unicode <BR> is converted to any usual western text encoding.
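
As a rough check of that guess, the sketch below feeds a cut-down version of the request to the parser in Python 3's `xmlrpc.client` (the successor of `xmlrpclib`). It only shows what a current parser does with entity and character references; it does not reproduce the 2005 setup or explain where the characters were lost there.

```python
# Python 3 sketch: xmlrpclib is called xmlrpc.client in Python 3.
import xmlrpc.client

# A shortened version of the request from the first post.
request = (
    b'<?xml version="1.0"?>'
    b"<methodCall><methodName>echo</methodName><params><param><value>"
    b"<string>Andr&#xe9; &lt;BR&gt; &apos;Le Dominiquain.&apos;</string>"
    b"</value></param></params></methodCall>"
)

# loads() returns the parameter tuple and the method name.
params, method = xmlrpc.client.loads(request)
print(method, repr(params[0]))
```

Here both the predefined entities and the numeric character reference come out of the parser already decoded, as a single text string.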

        OTOH, if the intent (which I doubt) of the non-python client were to pass through
        a block of pre-formatted XML as such (possibly for direct pasting into e.g. web page XHTML?)
        then a way to avoid escaping every & and < would be to use CDATA to encapsulate it. That
        would have to be fixed on that end.

        Regards,
        Bengt Richter

        • Bengt Richter

          #5
          Re: xmlrpclib and decoding entity references

          On 4 May 2005 08:17:07 -0700, "Chris Curvey" <ccurvey@gmail.com> wrote:

          >Here is the solution. Incidentally, the client is Cold Fusion.
          >
          I suspect your solution may not be general, though it would seem to
          satisfy your use case. It seems to be true for Python's latin-1 codec that
          all of the first 256 character codes are acceptable and match unicode 1:1,
          even though the Windows character map for the Lucida Sans Unicode font
          with latin-1 codes shows undefined-char boxes for codes 0x7f-0x9f.

          >>> sum(chr(i).decode('latin-1') == unichr(i) for i in xrange(256))
          256
          >>> sum(unichr(i).encode('latin-1') == chr(i) for i in xrange(256))
          256

          Not sure what to make of that. E.g. should unichr(0x7f).encode('latin-1')
          really be legal, or is it just expedient to have latin-1 serve as a kind of
          compressed utf_16_le? E.g., there are 256 Trues in these:

          >>> sum(unichr(i).encode('utf_16_le')[0] == chr(i) for i in xrange(256))
          256
          >>> sum(unichr(i).encode('utf_16_le')[1] == '\x00' for i in xrange(256))
          256

          Maybe we could have a 'u_as_str' or 'utf_16_le_lsbyte' codec for that, so the above would be spelled

          >>> sum(unichr(i).encode('u_as_str') == chr(i) for i in xrange(256)) # XXX faked, not implemented
          256

          Utf-8 only goes half way:

          >>> sum(unichr(i).encode('utf-8') == chr(i) for i in xrange(256))
          128
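
For readers on Python 3, the same counts can be reproduced with the sketch below. It assumes the usual translation of the Python 2 names: `bytes([i])` stands in for `chr(i)`, `chr(i)` for `unichr(i)`, and `range` for `xrange`.

```python
# Python 3 equivalents of the interactive checks above.
latin1_decode = sum(bytes([i]).decode("latin-1") == chr(i) for i in range(256))
latin1_encode = sum(chr(i).encode("latin-1") == bytes([i]) for i in range(256))

# Each of the first 256 code points is the byte pair (i, 0) in utf_16_le.
utf16_low = sum(chr(i).encode("utf_16_le") == bytes([i, 0]) for i in range(256))

# utf-8 agrees with the single byte only for the ASCII range.
utf8_encode = sum(chr(i).encode("utf-8") == bytes([i]) for i in range(256))

print(latin1_decode, latin1_encode, utf16_low, utf8_encode)
```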


          <aside>
          What do you think, Martin? ;-)
          Maybe 'ubyte' or 'u256' would be a user-friendlier codec name? Or 'ustr'?
          </aside>

          >import re
          >import logging
          >import logging.config
          >import os
          >import SimpleXMLRPCServer
          >
          >logging.config.fileConfig("logging.ini")
          >
          >############################################################################
          >class LoggingXMLRPCRequestHandler(SimpleXMLRPCServer.CGIXMLRPCRequestHandler):
          >    def __dereference(self, request_text):
          >        entityRe = re.compile("((?P<er>&#x)(?P<code>..)(?P<semi>;))")

          What about entity &#x263a;? Or the same in decimal: &#9786;
          :)

          >        for m in re.finditer(entityRe, request_text):
          >            hexref = int(m.group(3), 16)
          >            charref = chr(hexref)

          unichr(hexref) would handle >= 256, if you used unicode.

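A more general version of the dereferencing step might look like the sketch below, written in Python 3 syntax, where `chr` covers the full Unicode range. The names `CHARREF_RE` and `dereference` are hypothetical, and this is not the fix the original poster used. It accepts hexadecimal and decimal references of any length.

```python
import re

# Matches hex (&#x263a;) and decimal (&#9786;) character references.
# XML only allows a lowercase "x" in the hex form.
CHARREF_RE = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")

def dereference(text):
    """Replace numeric character references in text with the characters they name."""
    def _sub(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if code > 0x10FFFF:
            # Not a valid code point; leave the reference untouched.
            return match.group(0)
        return chr(code)
    return CHARREF_RE.sub(_sub, text)

print(ascii(dereference("Andr&#xe9; &#x263a; &#9786;")))
```

One caveat: applying this to raw request text before XML parsing would also turn a reference such as `&#x3c;` into a literal `<`, which can change the markup. A conforming XML parser normally resolves these references itself.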
          >            request_text = request_text.replace(m.group(1), charref)
          >
          >        return request_text
          >
          >
          >    #-------------------------------------------------------------------
          >    def handle_xmlrpc(self, request_text):
          >        logger = logging.getLogger()
          >        #logger.debug("************************************")
          >        #logger.debug(request_text)
                                 ^^^^^^^^^^^^
          I would suggest repr(request_text) for debugging, unless you know that
          your logger is going to do that for you. Otherwise a '%s' format may
          hide things that you'd like to know.
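
A small sketch of the difference, in Python 3 syntax; the sample text and the logger name are assumptions. A `%r` format shows characters that `%s` would print invisibly, such as a no-break space, and `ascii()` escapes every non-ASCII character.

```python
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("xmlrpc-debug")   # hypothetical logger name

# Sample text containing U+00E9 and an invisible no-break space (U+00A0).
text = "Andr\u00e9\u00a0x"

plain = "%s" % text      # the no-break space looks like an ordinary space
shown = "%r" % text      # repr() escapes the no-break space as \xa0
escaped = ascii(text)    # ascii() escapes every non-ASCII character

# Passing the value as an argument lets logging apply the %r itself.
logger.debug("request_text=%r", text)
```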

          >        try:
          >            #logger.debug("-------------------------------------")
          >            request_text = self.__dereference(request_text)
          >            #logger.debug(request_text)
          >            request_text = request_text.decode("latin-1").encode('utf-8')

          AFAIK, XML can be encoded with many encodings other than latin-1, so you are essentially
          saying here that you somehow know it's latin-1. Theoretically, your XML could
          start with something like <?xml version='1.0' encoding='UTF-8'?>, and then
          .decode("latin-1") is only going to "work" when the source is plain ASCII. I wouldn't be
          surprised if that's what's happening up to the point where you __dereference, but
          str.replace doesn't care that you are potentially making a utf-8 encoding invalid by
          replacing the references with 8-bit characters that are only legal latin-1.
          After that, you are decoding your utf-8_clobbered_with_latin-1 as latin-1 anyway, so it "works".
          At least I think this is a consistent theory. See if you can get the client to send something
          with characters above 127 that aren't represented as &#x..; to see if it's actually sending utf-8.
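
The theory can be illustrated with the sketch below, in Python 3 syntax. The sample string is an assumption, not data from the client. Decoding as latin-1 never raises an error, so it cannot tell you whether the bytes were really latin-1, while decoding as utf-8 fails on a bare latin-1 byte.

```python
# Python 3 sketch; the sample string is an assumption.
text = "Andr\u00e9"

utf8_bytes = text.encode("utf-8")       # b'Andr\xc3\xa9'
latin1_bytes = text.encode("latin-1")   # b'Andr\xe9'

# Decoding utf-8 bytes as latin-1 succeeds but yields the wrong text.
mojibake = utf8_bytes.decode("latin-1")

# Decoding latin-1 bytes as utf-8 fails here, which is a useful signal.
try:
    latin1_bytes.decode("utf-8")
    looks_like_utf8 = True
except UnicodeDecodeError:
    looks_like_utf8 = False

# Pure-ASCII input, such as text that still carries character references,
# decodes identically under both codecs.
raw = b"Andr&#xe9;"
ascii_same = raw.decode("latin-1") == raw.decode("utf-8")
```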


          >            #logger.debug("************************************")
          >        except Exception, e:
          >            logger.error(request_text)

          Again, I suggest repr(request_text).

          >            logger.error("had a problem dereferencing")
          >            logger.error(e)
          >
          >        SimpleXMLRPCServer.CGIXMLRPCRequestHandler.handle_xmlrpc(self, request_text)
          >############################################################################
          >class Foo:
          >    def settings(self):
          >        return os.environ
          >    def echo(self, something):
          >        logger = logging.getLogger()
          >        logger.debug(something)

          repr it, unless you know ;-)

          >        return something
          >    def greeting(self, name):
          >        return "hello, " + name
          >
          ># these are used to run as a CGI
          >handler = LoggingXMLRPCRequestHandler()
          >handler.register_instance(Foo())
          >handler.handle_request()
          >

          Regards,
          Bengt Richter
