how to tell server from PHP that charset is UTF-8??

  • lawrence

    how to tell server from PHP that charset is UTF-8??

    How do I get PHP to tell the server that when I echo text to the
    screen, I need for the text to be sent as UTF-8? How does Apache know
    the right encoding when all the text is being generated by PHP? If I
    build a content management system (I have) and I make sure that all
    input is encoded as UTF-8, how will the
    server know that the text in the MySql database is UTF-8?

    I'm taking all user input and using this function on the input:

    http://us4.php.net/manual/en/function.utf8-encode.php

    I'm doing this so I can output to XML without getting errors about
    "You should not sent plain text".

    But how will the server know how to serve these pages? How do I tell
    it from PHP? I realize I can send a http equiv tag, but that's rather
    weak, right?

    Is this enough? Any conflicts with Apache?

    $sent = headers_sent();
    if (!$sent) header("Content-type:text/html;charset:UTF-8");
  • Andy Hassall

    #2
    Re: how to tell server from PHP that charset is UTF-8??

    On 4 Sep 2004 09:08:41 -0700, lkrubner@geocities.com (lawrence) wrote:

    >How do I get PHP to tell the server that when I echo text to the
    >screen, I need for the text to be sent as UTF-8?

    Send a content-type header with a charset attribute.

    > How does Apache know
    >the right encoding when all the text is being generated by PHP?

    It doesn't, nor does it need to - that information's just for the end user.

    > If I
    >build a content management system (I have) and I make sure that all
    >input is encoded as UTF-8, how will the
    >server know that the text in the MySql database is UTF-8?
    >
    >I'm taking all user input and using this function on the input:
    >
    >http://us4.php.net/manual/en/function.utf8-encode.php
    >
    >I'm doing this so I can output to XML without getting errors about
    >"You should not sent plain text".

    Don't know what you mean here. XML content doesn't have to be UTF-8 encoded,
    just properly escaped and the encoding set correctly.
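
    To illustrate "properly escaped and the encoding set correctly", here is a
    minimal sketch. It assumes PHP 5.4 or later (for ENT_XML1) and that the
    input string is already valid UTF-8; rss_title_element is a made-up helper
    name, not part of PHP.

```php
<?php
// Hypothetical helper: escape user text for use inside an XML element.
function rss_title_element(string $title): string
{
    // ENT_XML1 escapes &, <, > and quotes using XML-safe entities only.
    $escaped = htmlspecialchars($title, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    return "<title>{$escaped}</title>";
}

// The XML declaration states the encoding of the bytes that follow.
echo '<?xml version="1.0" encoding="UTF-8"?>', "\n";
echo rss_title_element('Tom & Jerry <live>'), "\n";
```

    The declared encoding and the actual bytes have to agree; escaping alone
    does not fix text that was stored in some other encoding.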

    >But how will the server know how to serve these pages? How do I tell
    >it from PHP? I realize I can send a http equiv tag, but that's rather
    >weak, right?

    Yep.

    >Is this enough? Any conflicts with Apache?
    >
    > $sent = headers_sent();
    > if (!$sent) header("Content-type:text/html;charset:UTF-8");

    Shouldn't the : after charset be an = sign? i.e.

    Content-type: text/html; charset=utf-8

    That would be enough, provided it's actually sent (i.e. $sent is false).
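
    Put together, the corrected call might look like the sketch below. It
    assumes PHP 7 or later (for the type hints and the \u{...} escape), and
    content_type_header is a hypothetical helper name used only for
    illustration.

```php
<?php
// Hypothetical helper so the "charset=value" syntax lives in one place.
function content_type_header(string $mediaType = 'text/html', string $charset = 'UTF-8'): string
{
    return sprintf('Content-Type: %s; charset=%s', $mediaType, $charset);
}

// header() only has an effect before any output is sent, hence the guard.
if (!headers_sent()) {
    header(content_type_header());
}

// The bytes echoed here are UTF-8, matching what the header declares.
echo "<p>caf\u{00E9}</p>\n";
```

    The header only labels the output; the script still has to emit bytes
    that really are UTF-8.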

    --
    Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
    <http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool


    • lawrence

      #3
      Re: how to tell server that charset is UTF-8??

      Andy Hassall <andy@andyh.co.uk> wrote in message news:<4iqjj0taek5o0ni8ck050tam046bq6tn8o@4ax.com>...
      > On 4 Sep 2004 09:08:41 -0700, lkrubner@geocities.com (lawrence) wrote:
      >
      > >How do I get PHP to tell the server that when I echo text to the
      > >screen, I need for the text to be sent as UTF-8?
      >
      > Send a content-type header with a charset attribute.
      >
      > > How does Apache know
      > >the right encoding when all the text is being generated by PHP?
      >
      > It doesn't, nor does it need to - that information's just for the end user.

      I'm not sure if I follow you here. Yes, the information is for the end
      user, or rather, the web browser (or other UA) that the end user is
      using. But something has to send that information out from the
      webserver. Normally Apache has some idea what it is dealing with, and
      sends some kind of info, yes? A weaker solution is to send a meta
      http-equiv tag specifying the charset. But something somewhere has to
      send that info. If the web server has no way to know the charset
      because all the characters are being generated by PHP, then PHP should
      send a charset header, yes?

      By the way, in general, when you use echo or print in PHP, what is the
      charset of the text being generated? Raw ASCII?





      > >I'm doing this so I can output to XML without getting errors about
      > >"You should not sent plain text".
      >
      > Don't know what you mean here. XML content doesn't have to be UTF-8 encoded,
      > just properly escaped and the encoding set correctly.

      Let's put it this way. Right now users can input whatever the hell
      they want. Sometimes they write an essay in Microsoft Word and then
      copy and paste the text to the input form, and input that as a weblog
      entry. That post then gets added to the RSS feed for that weblog. At
      first I tried to write my RSS output using Plain Text, but most
      validators throw an error at that (all but radioland's). So I need to
      give it a charset. So I decided to give all outgoing XML the charset
      of UTF-8. Then I immediately started getting errors because lots of
      users had input stuff that was not UTF-8. So what I need to do is take
      all input and cast it to UTF-8. If that happens to change some
      characters to garbage characters, that is fine - that throws the
      problem back at the user, which is where I want it. I merely need to
      let them see that they are being idiots. I'll tell them they need to
      save any text from Microsoft Word as plain text. Once they start doing
      that, then they won't get garbage characters and the software will
      output valid XML and RSS.





      > >Is this enough? Any conflicts with Apache?
      > >
      > > $sent = headers_sent();
      > > if (!$sent) header("Content-type:text/html;charset:UTF-8");
      >
      > Shouldn't the : after charset be an = sign? i.e.
      >
      > Content-type: text/html; charset=utf-8
      >
      > That would be enough, provided it's actually sent (i.e. $sent is false).

      Thanks for catching the bit about the equal sign.


      • Tony Marston

        #4
        Re: how to tell server that charset is UTF-8??

        try header('content-type:text/html; charset=UTF-8');

        --
        Tony Marston


        • lawrence

          #5
          Re: how to tell server that charset is UTF-8??

          "Tony Marston" <tony@NOSPAM.demon.co.uk> wrote in message news:<ci279j$b7s$1$830fa795@news.demon.co.uk>...
          > try header('content-type:text/html; charset=UTF-8');

          The only difference I see in what you wrote is that "content" starts
          with a lower case "c". Are you saying these headers are case
          sensitive?


          • Tony Marston

            #6
            Re: how to tell server that charset is UTF-8??

            "lawrence" <lkrubner@geocities.com> wrote in message
            news:da7e68e8.0409171649.3486795a@posting.google.com...
            > "Tony Marston" <tony@NOSPAM.demon.co.uk> wrote in message
            > news:<ci279j$b7s$1$830fa795@news.demon.co.uk>...
            >> try header('content-type:text/html; charset=UTF-8');
            >
            > The only difference I see in what you wrote is that "content" starts
            > with a lower case "c". Are you saying these headers are case
            > sensitive?

            No, but that is what I use and it works.

            --
            Tony Marston


            • Andy Hassall

              #7
              Re: how to tell server that charset is UTF-8??

              On 12 Sep 2004 11:14:10 -0700, lkrubner@geocities.com (lawrence) wrote:

              >Andy Hassall <andy@andyh.co.uk> wrote in message news:<4iqjj0taek5o0ni8ck050tam046bq6tn8o@4ax.com>...
              >> On 4 Sep 2004 09:08:41 -0700, lkrubner@geocities.com (lawrence) wrote:
              >>
              >> >How do I get PHP to tell the server that when I echo text to the
              >> >screen, I need for the text to be sent as UTF-8?
              >>
              >> Send a content-type header with a charset attribute.
              >>
              >>> How does Apache know
              >>>the right encoding when all the text is being generated by PHP?
              >>
              >> It doesn't, nor does it need to - that information's just for the end user.
              >
              >I'm not sure if I follow you here. Yes, the information is for the end
              >user, or rather, the web browser (or other UA) that the end user is
              >using. But something has to send that information out from the
              >webserver. Normally Apache has some idea what it is dealing with, and
              >sends some kind of info, yes?

              It may send Content-type determined by the MIME type for the extension, or
              looked up through mime-magic, but it generally doesn't know character set, and
              to my knowledge Apache itself won't send the character set part of the header
              itself - it just sends 'data' in a character-set agnostic way.

              You can set it up so that Apache sends a character set header with content
              negotiation settings, though, but you need to provide the server with more
              information in that case.

              >A weaker solution is to send a meta
              >http-equiv tag specifying the charset. But something somewhere has to
              >send that info. If the web server has no way to know the charset
              >because all the characters are being generated by PHP, then PHP should
              >send a charset header, yes?

              Yes. There's an option in php.ini as to which character set to default to - I
              think the default default is iso8859-1. (Although really ought to be iso8859-15
              due to the Euro).
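
              For reference, the php.ini option in question is default_charset.
              A small sketch of setting it at runtime follows. One assumption
              to flag: as far as I know, PHP 5.6 and later ship with
              default_charset set to "UTF-8", so the ISO-8859-1 guess above
              would only apply to older installs.

```php
<?php
// Runtime equivalent of the php.ini line: default_charset = "UTF-8"
ini_set('default_charset', 'UTF-8');

// When the script sends no explicit Content-Type header, PHP's default
// text/html header is sent with this charset appended to it.
$effectiveCharset = ini_get('default_charset');
echo "default_charset is now {$effectiveCharset}\n";
```

              An explicit header() call with its own charset still takes
              precedence over this default.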

              >By the way, in general, when you use echo or print in PHP, what is the
              >charset of the text being generated? Raw ASCII?

              (ASCII only goes up to 127)

              Depends what Content-type header has been sent as to how the output is
              interpreted. PHP won't do any conversion from the binary representation of
              anything output, it's just sent as-is. (It might be image data, for example, if
              you've sent an image/jpeg content-type header.)

              >> >I'm doing this so I can output to XML without getting errors about
              >> >"You should not sent plain text".
              >>
              >> Don't know what you mean here. XML content doesn't have to be UTF-8 encoded,
              >> just properly escaped and the encoding set correctly.
              >
              >Let's put it this way. Right now users can input whatever the hell
              >they want. Sometimes they write an essay in Microsoft Word and then
              >copy and paste the text to the input form, and input that as a weblog
              >entry. That post then gets added to the RSS feed for that weblog. At
              >first I tried to write my RSS output using Plain Text, but most
              >validators throw an error at that (all but radioland's). So I need to
              >give it a charset. So I decided to give all outgoing XML the charset
              >of UTF-8. Then I immediately started getting errors because lots of
              >users had input stuff that was not UTF-8. So what I need to do is take
              >all input and cast it to UTF-8. If that happens to change some
              >characters to garbage characters, that is fine - that throws the
              >problem back at the user, which is where I want it. I merely need to
              >let them see that they are being idiots. I'll tell them they need to
              >save any text from Microsoft Word as plain text. Once they start doing
              >that, then they won't get garbage characters and the software will
              >output valid XML and RSS.

              OK, but you might have a piece of the puzzle missing here - you need to
              determine what character set the user posted in in the first place, since it's
              impossible to convert from an encoding of one character set to an encoding of
              another one without knowing what the first character set encoding was.

              I *think* form data is always in the character set of the page containing the
              original form. I haven't got a reference to back that up, though.

              I also seem to recall that some browsers (e.g. IE) will send HTML entity
              encoded versions of characters pasted into a form whose character set does not
              support them; e.g. Chinese characters into an iso8859-15 form turn up in their
              &#xxxx; representation in the data.

              Once you know that, then the mbstring extension has a function for converting
              between encodings.
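
              A rough sketch of that conversion step, assuming the mbstring
              extension is loaded and assuming (this is a guess, not something
              established in this thread) that non-UTF-8 input pasted from
              Word is Windows-1252. to_utf8 is a made-up helper name.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, convert everything else
// from an assumed source encoding.
function to_utf8(string $input, string $assumedSource = 'Windows-1252'): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already valid UTF-8 (plain ASCII also passes)
    }
    return mb_convert_encoding($input, 'UTF-8', $assumedSource);
}

// 0x93 and 0x94 are the curly double quotes in Windows-1252.
$fromWord = "\x93quoted\x94";
$utf8 = to_utf8($fromWord);
echo bin2hex($utf8), "\n";
```

              The validity check is only a heuristic: a short Windows-1252
              string can occasionally also be valid UTF-8, so knowing the
              form's character set is still the reliable route.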

              --
              Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
              <http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool


              • Chris

                #8
                Re: how to tell server that charset is UTF-8??

                lawrence wrote:

                > "Tony Marston" <tony@NOSPAM.demon.co.uk> wrote in message
                > news:<ci279j$b7s$1$830fa795@news.demon.co.uk>...
                >> try header('content-type:text/html; charset=UTF-8');
                >
                > The only difference I see in what you wrote is that "content" starts
                > with a lower case "c". Are you saying these headers are case
                > sensitive?

                Hi,
                No, the difference between your code and Mr. Marston's is that yours
                uses a colon after the word "charset" and his uses an equals sign.
                The equals sign is correct.

                Chris

                • lawrence

                  #9
                  Re: how to tell server that charset is UTF-8??

                  Andy Hassall <andy@andyh.co.uk> wrote in message news:<p4eok059t03jj84ssu1n6tkgped5dfijhv@4ax.com>...
                  > It may send Content-type determined by the MIME type for the extension, or
                  > looked up through mime-magic, but it generally doesn't know character set, and
                  > to my knowledge Apache itself won't send the character set part of the header
                  > itself - it just sends 'data' in a character-set agnostic way.
                  >
                  > You can set it up so that Apache sends a character set header with content
                  > negotiation settings, though, but you need to provide the server with more
                  > information in that case.
                  >
                  > >A weaker solution is to send a meta
                  > >http-equiv tag specifying the charset. But something somewhere has to
                  > >send that info. If the web server has no way to know the charset
                  > >because all the characters are being generated by PHP, then PHP should
                  > >send a charset header, yes?
                  >
                  > Yes. There's an option in php.ini as to which character set to default to - I
                  > think the default default is iso8859-1. (Although really ought to be iso8859-15
                  > due to the Euro).

                  Okay, I don't get this at all. What sends the character encoding
                  information? If you have a set of static HTML files sitting on a
                  server, what is responsible for sending the character encoding? If I,
                  as a web-designer, am not supposed to use http-equiv meta tags,
                  because they are weak, then the information is not inside of the HTML
                  file. So the information needs to be outside of the HTML file. And
                  what is outside of the HTML file? If Apache remains agnostic about
                  character encoding, then at what point does character encoding get
                  sent? Where is the information stored, and how is it sent out to web
                  browsers?

                  Every character has an encoding by default, right? If no encoding is
                  given, then there are a series of possible defaults, right? An Apache
                  server may have a default, or PHP may have a default encoding set in
                  the php.ini file, right? If no default is set anywhere then the
                  characters are basically raw text, right? In other words, ASCII? Or do
                  I have it all wrong?









                  > >> >I'm doing this so I can output to XML without getting errors about
                  > >> >"You should not sent plain text".
                  > >>
                  > >> Don't know what you mean here. XML content doesn't have to be UTF-8 encoded,
                  > >> just properly escaped and the encoding set correctly.

                  Sorry, I meant RSS. Most RSS validators throw an error if you try to
                  set up an RSS feed using plain text.





                  > >Let's put it this way. Right now users can input whatever the hell
                  > >they want. Sometimes they write an essay in Microsoft Word and then
                  > >copy and paste the text to the input form, and input that as a weblog
                  > >entry. That post then gets added to the RSS feed for that weblog. At
                  > >first I tried to write my RSS output using Plain Text, but most
                  > >validators throw an error at that (all but radioland's). So I need to
                  > >give it a charset. So I decided to give all outgoing XML the charset
                  > >of UTF-8. Then I immediately started getting errors because lots of
                  > >users had input stuff that was not UTF-8. So what I need to do is take
                  > >all input and cast it to UTF-8. If that happens to change some
                  > >characters to garbage characters, that is fine - that throws the
                  > >problem back at the user, which is where I want it. I merely need to
                  > >let them see that they are being idiots. I'll tell them they need to
                  > >save any text from Microsoft Word as plain text. Once they start doing
                  > >that, then they won't get garbage characters and the software will
                  > >output valid XML and RSS.
                  >
                  > OK, but you might have a piece of the puzzle missing here - you need to determine
                  > what character set the user posted in in the first place, since it's impossible
                  > to convert from an encoding of one character set to an encoding of another one
                  > without knowing what the first character set encoding was.
                  >
                  > I *think* form data is always in the character set of the page containing the
                  > original form. I haven't got a reference to back that up, though.

                  Yes, we had quite a conversation about that over on another newsgroup.
                  It was quite informative. You can read it here, if you've any
                  interest:



                  • Andy Hassall

                    #10
                    Re: how to tell server that charset is UTF-8??

                    On 21 Sep 2004 11:30:45 -0700, lkrubner@geocities.com (lawrence) wrote:

                    >Andy Hassall <andy@andyh.co.uk> wrote in message news:<p4eok059t03jj84ssu1n6tkgped5dfijhv@4ax.com>...
                    >> It may send Content-type determined by the MIME type for the extension, or
                    >> looked up through mime-magic, but it generally doesn't know character set, and
                    >> to my knowledge Apache itself won't send the character set part of the header
                    >> itself - it just sends 'data' in a character-set agnostic way.
                    >>
                    >> You can set it up so that Apache sends a character set header with content
                    >> negotiation settings, though, but you need to provide the server with more
                    >> information in that case.
                    >>
                    >> >A weaker solution is to send a meta
                    >> >http-equiv tag specifying the charset. But something somewhere has to
                    >> >send that info. If the web server has no way to know the charset
                    >> >because all the characters are being generated by PHP, then PHP should
                    >> >send a charset header, yes?
                    >>
                    >> Yes. There's an option in php.ini as to which character set to default to - I
                    >> think the default default is iso8859-1. (Although really ought to be iso8859-15
                    >> due to the Euro).
                    >
                    >Okay, I don't get this at all. What sends the character encoding
                    >information? If you have a set of static HTML files sitting on a
                    >server, what is responsible for sending the character encoding?

                    Done a bit more digging, and there's this in my httpd.conf:

                    #
                    # Specify a default charset for all pages sent out. This is
                    # always a good idea and opens the door for future internationalisation
                    # of your web site, should you ever want it. Specifying it as
                    # a default does little harm; as the standard dictates that a page
                    # is in iso-8859-1 (latin1) unless specified otherwise i.e. you
                    # are merely stating the obvious. There are also some security
                    # reasons in browsers, related to javascript and URL parsing
                    # which encourage you to always set a default char set.
                    #
                    AddDefaultCharset ISO-8859-1


                    OK, so Apache sends out a character set header under the recommended
                    configuration - although it's effectively hardcoded; it doesn't 'detect' the
                    encoding of the file since that's basically impossible in isolation.

                    To get Apache to send out a different character set for a particular file,
                    you'd then need to use Apache content negotiation - either with a type-map
                    or, I believe, it can base it off suffixes of the filename
                    (index.html.iso8859-p15 and so on).

                    Consider the following response from Apache:

                    andyh@server:~/public_html$ touch utf8.html.utf8
                    andyh@server:~/public_html$ telnet localhost 80
                    Trying 127.0.0.1...
                    Connected to localhost.
                    Escape character is '^]'.
                    HEAD /~andyh/utf8.html HTTP/1.0

                    HTTP/1.1 200 OK
                    Date: Tue, 21 Sep 2004 19:19:03 GMT
                    Server: Apache/2.0.51 (Unix) PHP/5.0.1 DAV/2 SVN/1.0.6
                    Content-Location: utf8.html.utf8
                    Vary: negotiate
                    TCN: choice
                    Last-Modified: Tue, 21 Sep 2004 19:18:47 GMT
                    ETag: "3811f-0-7f9b93c0;7f9b93c0"
                    Accept-Ranges: bytes
                    Connection: close
                    Content-Type: text/html; charset=utf-8

                    Connection closed by foreign host.

                    OK - so a filename of utf8.html.utf8 means that a request for utf8.html comes
                    out in utf8 encoding. (I've got content negotiation enabled on my server).

                    Presumably in the case of multiple encodings for the same URI then the
                    browser's Accept-charset header comes into play for Apache to pick which to
                    serve.

                    > If I,
                    >as a web-designer, am not supposed to use http-equiv meta tags,
                    >because they are weak, then the information is not inside of the HTML
                    >file. So the information needs to be outside of the HTML file. And
                    >what is outside of the HTML file? If Apache remains agnostic about
                    >character encoding, then at what point does character encoding get
                    >sent? Where is the information stored, and how is it sent out to web
                    >browsers?

                    Either a type map, or encoded in the filename. (can't speak for other servers
                    apart from Apache).

                    >Every character has an encoding by default, right? If no encoding is
                    >given, then there are a series of possible defaults, right? An Apache
                    >server may have a default, or PHP may have a default encoding set in
                    >the php.ini file, right?

                    Right.

                    > If no default is set anywhere then the
                    >characters are basically raw text, right? In other words, ASCII?

                    Ah, but even ASCII isn't raw text, depending on your definition of raw - it's
                    the ASCII encoding of a small-ish character set.

                    'Binary' is the usual definition of completely raw data - it's just a stream
                    of bytes with no defined correspondence to characters.

                    As to what the default in HTTP is - time to dig out the HTTP standards.

                    RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1
                    <ftp://ftp.isi.edu/in-notes/rfc2616.txt>

                    "
                    3.4.1 Missing Charset

                    Some HTTP/1.0 software has interpreted a Content-Type header without
                    charset parameter incorrectly to mean "recipient should guess."
                    Senders wishing to defeat this behavior MAY include a charset
                    parameter even when the charset is ISO-8859-1 and SHOULD do so when
                    it is known that it will not confuse the recipient.

                    Unfortunately, some older HTTP/1.0 clients did not deal properly with
                    an explicit charset parameter. HTTP/1.1 recipients MUST respect the
                    charset label provided by the sender; and those user agents that have
                    a provision to "guess" a charset MUST use the charset from the
                    content-type field if they support that charset, rather than the
                    recipient's preference, when initially displaying a document. See
                    section 3.7.1.
                    "

                    "
                    3.7.1 Canonicalization and Text Defaults

                    [...]

                    The "charset" parameter is used with some media types to define the
                    character set (section 3.4) of the data. When no explicit charset
                    parameter is provided by the sender, media subtypes of the "text"
                    type are defined to have a default charset value of "ISO-8859-1" when
                    received via HTTP. Data in character sets other than "ISO-8859-1" or
                    its subsets MUST be labeled with an appropriate charset value. See
                    section 3.4.1 for compatibility problems.
                    "

                    OK - so we officially default to ISO-8859-1, at least for text/* content
                    types, which is a superset of ASCII, but definitely a well-defined character
                    set and not just a raw stream of bytes. Makes sense.
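
                    As a rough sketch of that rule from the recipient's side, the following
                    Python uses the standard library's email.message.Message purely as a
                    convenient header parser. The function name effective_charset is
                    hypothetical, and the fallback encodes only the RFC 2616 section 3.7.1
                    default quoted above, not what any particular browser really does.

```python
from email.message import Message


def effective_charset(content_type):
    """Charset a strict RFC 2616 recipient would assume for a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type

    # An explicit charset parameter always wins (returned lowercased).
    explicit = msg.get_content_charset()
    if explicit is not None:
        return explicit

    # RFC 2616 section 3.7.1: text/* without a charset defaults to ISO-8859-1.
    if msg.get_content_maintype() == "text":
        return "iso-8859-1"

    # Other media types have no charset default under this rule.
    return None
```

                    So effective_charset("text/html; charset=UTF-8") gives "utf-8", while a bare
                    "text/html" falls back to "iso-8859-1".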
                    [color=blue]
                    >Or do I have it all wrong?[/color]

                    Definitely sounds like you've got the idea.
                    [color=blue][color=green][color=darkred]
                    >> >> >I'm doing this so I can output to XML without getting errors about
                    >> >> >"You should not sent plain text".
                    >> >>
                    >> >> Don't know what you mean here. XML content doesn't have to be UTF-8 encoded,
                    >> >> just properly escaped and the encoding set correctly.[/color][/color]
                    >
                    >Sorry, I meant RSS. Most RSS validators throw an error if you try to
                    >set up an RSS feed using plain text.[/color]

                    Oh, is this just a case of the wrong Content-type though - text/plain or
                    text/html vs. text/xml or whatever it is?

                    [snip]
                    [color=blue][color=green]
                    >> I *think* form data is always in the character set of the page containing the
                    >> original form. I haven't got a reference to back that up, though.[/color]
                    >
                    >Yes, we had quite a conversation about that over on another newsgroup.
                    >It was quite informative. You can read it here, if you've any
                    >interest:
                    >
                    >http://groups.google.com/groups?hl=e...%3D10%26sa%3DN[/color]

                    Hm - Netscape 4 as ever is a complete mess then! Does anyone actually use NN4
                    any more? It's well past time it was blasted out of existence - does it do
                    _anything_ right?

                    --
                    Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
                    <http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool


                    • Daniel Tryba

                      #11
                      Re: how to tell server that charset is UTF-8??

                      Andy Hassall <andy@andyh.co.uk> wrote:[color=blue]
                      > OK - so we officially default to ISO-8859-1, at least for text/* content
                      > types, which is a superset of ASCII, but definitely a well-defined character
                      > set and not just a raw stream of bytes. Makes sense.[/color]

                      Completely true... almost. text/html has unicode as its characterset
                      according to w3c[1]; the charset header is nothing more than the encoding
                      used to transport the data. iso-8859-1 is the best choice if you need
                      only the first 256 characters in unicode. If one needs more characters,
                      the utf-x encodings should be used.

                      [1] http://www.w3.org/TR/html401/charset.html
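
                      A quick Python check of the "first 256 characters" point, offered as an
                      illustration only (the euro sign is just a sample character outside that
                      range): ISO-8859-1 maps each of the first 256 Unicode code points to the
                      single byte of the same value, and cannot represent anything beyond them.

```python
# Every code point 0..255 encodes to exactly one byte with the same value.
all_match = all(
    chr(cp).encode("iso-8859-1") == bytes([cp]) for cp in range(256)
)

# U+20AC (euro sign) lies outside that range, so ISO-8859-1 cannot encode it.
euro = "\u20ac"
try:
    euro.encode("iso-8859-1")
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False

# UTF-8 handles it, at the cost of three bytes for this character.
euro_utf8 = euro.encode("utf-8")
```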

                      --

                      Daniel Tryba


                      • John Dunlop

                        #12
                        Re: how to tell server that charset is UTF-8??

                        Daniel Tryba wrote:
                        [color=blue]
                        > Andy Hassall <andy@andyh.co.uk> wrote:[/color]
                        [color=blue][color=green]
                        > > OK - so we officially default to ISO-8859-1, at least for text/* content
                        > > types, which is a superset of ASCII, but definitely a well-defined character
                        > > set and not just a raw stream of bytes. Makes sense.[/color]
                        >
                        > Completely true... almost. text/html has unicode as its characterset
                        > according to w3c[1],[/color]

                        'Character set', with or without a space, breeds confusion.



                        If by 'characterset' you meant HTML4.01's document character
                        set, you're right. But HTML's document character set is
                        unrelated to this discussion. If however you meant
                        character encoding, you're wrong, because any encoding is
                        allowed. Did you mean something else?

                        RFC2854 sec. 6 lists sources that specify the default when a
                        text/html document is served without explicitly declaring
                        its character encoding. Despite RFC2616 defining text/*'s
                        default character encoding as ISO-8859-1, HTML4.01
                        conforming user-agents mustn't assume any default value:

                        'The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-
                        8859-1 as a default character encoding when the "charset"
                        parameter is absent from the "Content-Type" header field. In
                        practice, this recommendation has proved useless because
                        some servers don't allow a "charset" parameter to be sent,
                        and others may not be configured to send the parameter.
                        Therefore, user agents must not assume any default value for
                        the "charset" parameter.' (HTML4.01 sec. 5.2.2.)

                        So it'd be absurd to heed the advice given in RFC2616 sec.
                        19.3, which says that 'not labelling the entity is preferred
                        over labelling the entity with the labels US-ASCII or ISO-
                        8859-1'. The usual ciwa* recommendation stands, discord
                        notwithstanding: send a charset parameter.
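
                        On the sending side, that recommendation amounts to keeping two things in
                        step: the charset named in the header and the encoding actually used for
                        the bytes. A minimal Python sketch, assuming a hypothetical helper called
                        build_response that is not tied to any framework or server:

```python
def build_response(text, charset="utf-8", media_type="text/html"):
    """Encode text and return (headers, body) with a matching charset label."""
    # Encode with the same charset that the header will declare.
    body = text.encode(charset)
    headers = [
        ("Content-Type", f"{media_type}; charset={charset}"),
        # Content-Length counts bytes, not characters.
        ("Content-Length", str(len(body))),
    ]
    return headers, body
```

                        Deriving both the label and the bytes from one charset argument makes it
                        hard for the two to drift apart, which is the usual cause of mojibake.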

                        [ ... ]

                        Roll on the weekend!

                        --
                        Jock


                        • lawrence

                          #13
                          Re: how to tell server that charset is UTF-8??

                          Andy Hassall <andy@andyh.co.uk> wrote in message news:<36v0l0d9sm2t2f1e0n9s51f3ajc692boda@4ax.com>...[color=blue]
                          > OK, so Apache sends out a character set heading under the recommended
                          > configuration - although it's effectively hardcoded; it doesn't 'detect' the
                          > encoding of the file since that's basically impossible in isolation.
                          >
                          > To get Apache to send out a character set header for a specific file, you'd
                          > then need to use Apache content negotiation if you wanted to select a different
                          > character set for a particular file - either with a type-map or I believe it
                          > can base it off suffixes of the filename (index.html.iso8859-p15 and so on).
                          >
                          > Consider the following response from Apache:
                          >
                          > andyh@server:~/public_html$ touch utf8.html.utf8
                          > andyh@server:~/public_html$ telnet localhost 80
                          > Trying 127.0.0.1...
                          > Connected to localhost.
                          > Escape character is '^]'.
                          > HEAD /~andyh/utf8.html HTTP/1.0
                          >
                          > HTTP/1.1 200 OK
                          > Date: Tue, 21 Sep 2004 19:19:03 GMT
                          > Server: Apache/2.0.51 (Unix) PHP/5.0.1 DAV/2 SVN/1.0.6
                          > Content-Location: utf8.html.utf8
                          > Vary: negotiate
                          > TCN: choice
                          > Last-Modified: Tue, 21 Sep 2004 19:18:47 GMT
                          > ETag: "3811f-0-7f9b93c0;7f9b93 c0"
                          > Accept-Ranges: bytes
                          > Connection: close
                          > Content-Type: text/html; charset=utf-8
                          >
                          > Connection closed by foreign host.
                          >
                          > OK - so a filename of utf8.html.utf8 means that a request for utf8.html comes
                          > out in utf8 encoding. (I've got content negotiation enabled on my server).
                          >
                          > Presumably in the case of multiple encodings for the same URI then the
                          > browser's Accept-charset header comes into play for Apache to pick which to
                          > serve.[/color]

                          That's very interesting. Thanks for doing that bit of digging.

                          I'm sorry to say I've temporarily been handed responsibility for
                          keeping an Apache server going, though I don't know much about Apache.
                          We're hosting about 30 different domains on this machine. Most of
                          those domains have individuals who are handling all the web design for
                          that domain. If I set a default charset for Apache, how do the
                          individual web designers override the decision, if they need to? An
                          .htaccess file? http-equiv meta tags?

                          Just curious.

