Charset decoding problem

  • Dormilich
    Recognized Expert Expert
    • Aug 2008
    • 8694

    Charset decoding problem

    Hi,

    I've got a very strange problem with UTF-8 encoded data outside the ASCII range.

    While everything works on localhost, the same pages on the server show � for Latin-1 characters (ä, ö, ü, ß, ...) and ? for characters above the Latin-1 range (typographic punctuation). Even the hosting support doesn't have a clue that could help me.

    reference: http://test.kulturbeutel-leipzig.net/main.php?f=presse

    JavaScript on – everything works fine (the data is fetched directly from a MySQL DB via AJAX)
    JavaScript off – (a bit more complicated) the data is fetched from the DB (stored there as WDDX-serialized data) and deserialized into an object, which in turn is responsible for the output.

    Maybe there's a problem with the deserialization.

    Does anyone have an idea how I can find the source of the problem?

    thanks

    PS: the DB should contain the same data, because I used a SQL dump of one to build the other.

    PPS: if you need class definitions, just ask (it would be too much to list all incorporated classes at once)

    local system: Darwin Melchior 9.6.0 Darwin Kernel Version 9.6.0: Mon Nov 24 17:37:00 PST 2008; root:xnu-1228.9.59~1/RELEASE_I386 i386 / PHP 5.2.8.
    (= Mac OS 10.5)

    public system: Linux Custom Build 64 Bit prohost.de XEON SMP x86_64 (Red Hat Enterprise Linux) / PHP 5.2.6.
  • Atli
    Recognized Expert Expert
    • Nov 2006
    • 5062

    #2
    Hi.

    I don't really know much about WDDX, but as I understand it, it is basically XML?
    I had similar problems when passing XML files around a while ago, where the server was sending stuff as Unicode, the browser was rendering using Unicode, but the output was all mangled.

    Turned out all I had to do to fix this was add:
    Code:
    <?xml version="1.0" encoding="UTF-8" ?>
    And everybody suddenly started understanding each other.

    My mistake was to assume that the XML file would adopt the charset passed with a Content-Type header like HTML pages do.

    Perhaps you left this out as well?
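
The point about the XML declaration can be shown with a small sketch. It is written in Python rather than PHP purely for illustration, and the helper name `parse_text` and the sample word are assumptions, not part of the thread. An XML parser decodes the bytes according to the encoding named in the declaration, so a declaration that does not match the actual bytes produces mangled text even when everything else is set up correctly.

```python
import xml.etree.ElementTree as ET


def parse_text(doc: bytes) -> str:
    """Parse a one-element XML document and return its text content."""
    return ET.fromstring(doc).text


body = "<string>Götter</string>"

# Declaration and bytes agree: both documents decode to the same text.
utf8_doc = b'<?xml version="1.0" encoding="UTF-8"?>' + body.encode("utf-8")
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body.encode("iso-8859-1")

# Declaration and bytes disagree: UTF-8 bytes labelled as ISO-8859-1.
mislabelled_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body.encode("utf-8")

print(parse_text(utf8_doc))         # Götter
print(parse_text(latin1_doc))       # Götter
print(parse_text(mislabelled_doc))  # GÃ¶tter
```

The third case is the classic symptom of a wrong label: each two-byte UTF-8 sequence is read as two separate Latin-1 characters.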


    • Dormilich
      Recognized Expert Expert
      • Aug 2008
      • 8694

      #3
      Yep, WDDX is XML (useful if you have your configuration stored as XML).

      But the XML declaration was there from the start, and JavaScript obviously has no problems with it at all.

      sample WDDX:
      Code:
      <?xml version="1.0" encoding="UTF-8" ?>
      <wddxPacket version='1.0'>
        <header>
          <comment>Zeitungsausschnitte (Text)</comment>
        </header>
        <data>
          <array length='4'>
            <string>Helena – von Äpfeln, Göttern und anderen Helden</string>
            <struct>
              <var name='php_class_name'>
                <string>wddx_presse</string>
              </var>
              <var name='name'>
                <string>p</string>
              </var>
              <var name='content'>
                <string>Auch 2004 erfreut die Schau*spiel*gruppe „Kultur*beutel“ wieder […]</string>
              </var>
      […]
            </struct>
          </array>
        </data>
      </wddxPacket>
      Note: * = soft hyphen (escaped by the Bytes editor)


      • Dormilich
        Recognized Expert Expert
        • Aug 2008
        • 8694

        #4
        There seems to be something wrong with the deserializer: after some testing I can say that the problems appear right after deserialization.

        Does anyone know how I can determine the encoding/charset of a variable's content? (That would be interesting to know.)

        thanks


        • Atli
          Recognized Expert Expert
          • Nov 2006
          • 5062

          #5
          PHP strings have no native support for Unicode, or any other charset for that matter (native Unicode strings were planned for PHP 6, which was never released).
          A character in a string is essentially the same as a byte.

          Try running the variable content through utf8_encode. See if that helps any.
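
To make the byte-versus-character point concrete, here is a rough sketch in Python (used only for illustration; the helper name `latin1_to_utf8` is made up). PHP's utf8_encode treats every input byte as an ISO-8859-1 character and re-encodes it as UTF-8, and the helper below mimics that behaviour. It also shows what happens when the input was already UTF-8.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Read each byte as an ISO-8859-1 character and re-encode the result as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


word = "Göttern"
latin1_bytes = word.encode("iso-8859-1")  # one byte per character
utf8_bytes = word.encode("utf-8")         # "ö" takes two bytes

print(len(word), len(latin1_bytes), len(utf8_bytes))  # 7 7 8

# Latin-1 input is converted correctly.
print(latin1_to_utf8(latin1_bytes) == utf8_bytes)  # True

# Input that is already UTF-8 gets encoded a second time.
double_encoded = latin1_to_utf8(utf8_bytes).decode("utf-8")
print(double_encoded)  # GÃ¶ttern
```

So the conversion only helps if the bytes really are Latin-1 at that point; applied to data that is already UTF-8 it makes things worse.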


          • Dormilich
            Recognized Expert Expert
            • Aug 2008
            • 8694

            #6
            Originally posted by Atli
            Try running the variable content through utf8_encode. See if that helps any.
            Though it converts the Latin-1 characters, it's no help with the characters that initially showed up as '?' („ “ – ’ … and the like).
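
One plausible explanation for this result, assuming (not confirmed in the thread) that the deserializer converts the UTF-8 input to ISO-8859-1 internally: characters that exist in Latin-1 survive as single bytes, while anything outside that range is replaced by a literal question mark. The sketch below reproduces that pattern in Python for illustration only; the sample string is adapted from the WDDX packet above.

```python
original = "„Kulturbeutel“ – Äpfel"

# Assumed lossy step: convert to ISO-8859-1, replacing unrepresentable characters with "?".
lossy = original.encode("iso-8859-1", errors="replace")
print(lossy)  # b'?Kulturbeutel? ? \xc4pfel'

# Converting back from Latin-1 to UTF-8 restores "Ä" but cannot restore the quotes or the dash.
repaired = lossy.decode("iso-8859-1").encode("utf-8").decode("utf-8")
print(repaired)  # ?Kulturbeutel? ? Äpfel
```

If that is what happens, the question marks are real 0x3F bytes by the time the string reaches the page, so no later re-encoding can bring the original characters back.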


            • Dormilich
              Recognized Expert Expert
              • Aug 2008
              • 8694

              #7
              Finally got the problem sorted, more or less, by converting all non-ASCII characters to Unicode entities using this little function: http://de2.php.net/manual/de/functio...code.php#75941
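
The general idea of this workaround can be sketched as follows. This is not the linked function, only a Python illustration of the same technique, and the helper name `to_numeric_entities` is made up. Every non-ASCII character is replaced by a numeric character reference, so the stored data is pure ASCII and passes through a charset conversion unchanged.

```python
import html


def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


escaped = to_numeric_entities("„Äpfel“ – Götter")
print(escaped)  # &#8222;&#196;pfel&#8220; &#8211; G&#246;tter

# A browser resolves the references when rendering; html.unescape does the same here.
print(html.unescape(escaped))  # „Äpfel“ – Götter
```

The trade-off is that the stored text is harder to read and search, since the special characters are no longer present as themselves.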


              • xaxis
                New Member
                • Feb 2009
                • 15

                #8
                Originally posted by Dormilich
                does anyone know, how I can determine the encoding/charset of a variable content? (that would be interesting to know)
                Very interesting indeed. Interesting enough that I scoured the net, and I believe this resource, http://www.mozilla.org/projects/intl...Detection.html, is the most detailed and the closest any person or group has yet come to solving this extremely challenging problem.
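
Full charset detection is a matter of statistics and guessing, but one narrower question can be answered reliably: whether a byte string is valid UTF-8 at all. A minimal sketch in Python (illustration only; the function name `looks_like_utf8` is made up):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


print(looks_like_utf8("Göttern".encode("utf-8")))       # True
print(looks_like_utf8("Göttern".encode("iso-8859-1")))  # False
print(looks_like_utf8(b"plain ASCII"))                  # True
```

The check only works in one direction. Every byte sequence is valid ISO-8859-1, so a test like this can rule UTF-8 in or out but can never confirm that the data is Latin-1, and pure ASCII passes as both.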

