UTF-8 not decoding

  • steve

    UTF-8 not decoding

    Hi,
    I am opening a stream that is UTF-8 encoded. I use fgetc to read the
    stream, which is binary safe, and I append every character read to a string.

    But when I look at the stream, I see some characters shown as a bunch of
    "?" question marks, and utf8_decode has no effect on them either.

    How do you go about decoding UTF-8? Does adding the characters to the
    string somehow mess it up? Please help. Running PHP 4.3.4 on Windows.

    --
    http://www.dbForumz.com/ This article was posted by author's request
    Articles individually checked for conformance to usenet standards
    Topic URL: http://www.dbForumz.com/PHP-UTF-deco...ict138860.html
    Visit Topic URL to contact author (reg. req'd). Report abuse: http://www.dbForumz.com/eform.php?p=464220
  • Chung Leong

    #2
    Re: UTF-8 not decoding

    "steve" <UseLinkToEmail@dbForumz.com> wrote in message
    news:411ab511$1_7@news.athenanews.com...
    > Hi,
    > I am opening a stream that is UTF encoded. I use fgetc to read the
    > stream- which is binary safe. I add every character read to a string.
    >
    > But when I look at the stream, I see some characters with a bunch of
    > "?" question marks, and then utf8_decode has no effect on it
    > either.

    Question marks mean that the text contains Unicode characters that aren't
    found within the current codepage. Basically the characters are there;
    they're just displayed as ?s.

    utf8_decode() does have an effect: it replaces characters outside of
    ISO-8859-1 with question marks.
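
The replacement behaviour described above can be seen in a small sketch. The two sample strings are made up for illustration, and utf8_decode() is deprecated as of PHP 8.2, so this only applies to PHP versions where the function is still available.

```php
<?php
// "é" (U+00E9) exists in ISO-8859-1, so it maps to the single byte 0xE9.
$latin = utf8_decode("\xC3\xA9");

// "€" (U+20AC) has no ISO-8859-1 equivalent, so it is replaced with "?".
$euro = utf8_decode("\xE2\x82\xAC");

echo bin2hex($latin), "\n"; // e9
echo $euro, "\n";           // ?
```

So a "?" in the output of utf8_decode() does not mean the input was damaged, only that the character cannot be represented in ISO-8859-1.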

    > How do you go about decoding utf. Does adding the characters to the
    > string somehow mess it up. Please help. Running 4.3.4 PHP on Win.

    The question is, what do you mean by decoding UTF-8? Reading UTF-8 text with
    fgetc is not a good idea, since fgetc returns one byte at a time and a
    single Unicode character can span multiple bytes.
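
As a rough illustration of the byte-versus-character point, the sketch below reads a UTF-8 string byte by byte. It assumes a PHP version that supports the php://memory stream (used here only as a stand-in for the real stream), and the sample text is made up.

```php
<?php
// Sample text "naïve €" written out as raw UTF-8 bytes (illustrative value).
$text = "na\xC3\xAFve \xE2\x82\xAC";

// Put the bytes into an in-memory stream to stand in for the real one.
$fh = fopen('php://memory', 'w+');
fwrite($fh, $text);
rewind($fh);

// fgetc() returns one byte per call, not one Unicode character.
$buffer = '';
$reads = 0;
while (($byte = fgetc($fh)) !== false) {
    $buffer .= $byte;
    $reads++;
}
fclose($fh);

// Count characters by skipping UTF-8 continuation bytes (10xxxxxx).
$chars = 0;
for ($i = 0; $i < strlen($buffer); $i++) {
    if ((ord($buffer[$i]) & 0xC0) !== 0x80) {
        $chars++;
    }
}

var_dump($buffer === $text); // bool(true): appending bytes loses nothing
echo "$reads bytes, $chars characters\n"; // 10 bytes, 7 characters
```

Appending the bytes to a string does not corrupt anything; the trouble only starts when each byte is treated or displayed as if it were a whole character.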



    • steve

      #3
      Re: Re: UTF-8 not decoding

      Thanks, Chung. I am interested in decoding Usenet message headers that
      look like this:
      "=?Utf-8?B?YmVsZGVyYXo=?="


      • steve

        #4
        Re: Re: UTF-8 not decoding

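        OK, figured it out. Take a string like this:
        $instr = "=?Utf-8?B?YmVsZGVyYXo=?=";

        and feed it as the argument to this function:
        function decode_subject( $instr ) {
            $enstr = $instr;
            // Decode the first encoded word on each pass until none remain.
            while( preg_match(
                '/^([^?]+)?=\?[^?]+\?(B|Q)\?([^?]+)=?=?\?=(.+)?$/i', $enstr,
                $match ) ) {
                // Text following the encoded word, if any.
                $tail = isset( $match[4] ) ? $match[4] : '';
                if( $match[2] == 'b' || $match[2] == 'B' )
                    $enstr = $match[1] . base64_decode( $match[3] ) . $tail;
                else
                    // In Q encoding, "_" stands for a space.
                    $enstr = $match[1] . quoted_printable_decode(
                        str_replace( '_', ' ', $match[3] ) ) . $tail;
            }
            return( $enstr );
        }

        and it will return the decoded text ("belderaz" for this example). The
        bytes come back in the charset named in the header, UTF-8 here, so the
        result is not necessarily plain ASCII.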
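
For comparison, here is a minimal sketch of the same idea in a single pass, assuming PHP 5.3 or later for the anonymous function. The name decode_mime_words is hypothetical and is not part of the newsreader mentioned in this post.

```php
<?php
// Hypothetical helper: decode every "=?charset?B|Q?text?=" word in a header.
function decode_mime_words($header)
{
    return preg_replace_callback(
        '/=\?([^?]+)\?([BQ])\?([^?]*)\?=/i',
        function ($m) {
            if (strtoupper($m[2]) === 'B') {
                // B encoding is plain base64.
                return base64_decode($m[3]);
            }
            // Q encoding: "_" stands for a space, "=XX" is a hex-encoded byte.
            return quoted_printable_decode(str_replace('_', ' ', $m[3]));
        },
        $header
    );
}

echo decode_mime_words('=?Utf-8?B?YmVsZGVyYXo=?='), "\n"; // belderaz
echo bin2hex(decode_mime_words('=?ISO-8859-1?Q?caf=E9?=')), "\n"; // 636166e9
```

The sketch returns the bytes in whatever charset the header names and does not convert them, and it does not drop the whitespace between adjacent encoded words as RFC 2047 asks. Where the iconv extension is available, iconv_mime_decode() may be a more complete option.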

        The function is included in: PHP Newsreader

