Is this HttpWebRequest correct?

  • Nightcrawler

    Is this HttpWebRequest correct?

    I am currently using the HttpWebRequest and HttpWebResponse to pull
    webpages down from a few urls.

    string url = "some url";
    HttpWebRequest httpWebRequest =
        (HttpWebRequest)WebRequest.Create(url);

    using (HttpWebResponse httpWebResponse =
        (HttpWebResponse)httpWebRequest.GetResponse())
    {
        string html = string.Empty;

        StreamReader responseReader = new
            StreamReader(httpWebResponse.GetResponseStream(), Encoding.UTF7);
        html = responseReader.ReadToEnd();
    }

    My code works, but my question is: am I doing it the right way
    (especially the encoding part)? Some of the websites I pull content
    from have characters in them that do not exist in the English
    alphabet, and currently the only way for these to be read correctly by
    my StreamReader is if I use UTF7 encoding. Is this really the
    only way?

    Before I move forward in the project I would like to understand if
    this indeed is the way to do this or if I am missing anything?

    Any help is appreciated.

    Thanks
  • Martin Honnen

    #2
    Re: Is this HttpWebRequest correct?

    Nightcrawler wrote:
    > I am currently using the HttpWebRequest and HttpWebResponse to pull
    > webpages down from a few urls.
    >
    > string url = "some url";
    > HttpWebRequest httpWebRequest =
    >     (HttpWebRequest)WebRequest.Create(url);
    >
    > using (HttpWebResponse httpWebResponse =
    >     (HttpWebResponse)httpWebRequest.GetResponse())
    > {
    >     string html = string.Empty;
    >
    >     StreamReader responseReader = new
    >         StreamReader(httpWebResponse.GetResponseStream(), Encoding.UTF7);
    >     html = responseReader.ReadToEnd();
    > }
    >
    > My code works, but my question is: am I doing it the right way
    > (especially the encoding part)? Some of the websites I pull content
    > from have characters in them that do not exist in the English
    > alphabet, and currently the only way for these to be read correctly by
    > my StreamReader is if I use UTF7 encoding. Is this really the
    > only way?

    You should check the HTTP response header Content-Type for a charset
    parameter and use that to create the stream reader. So, for instance, if
    the server sends the header
        Content-Type: text/html; charset=Windows-1252
    then you would use
        new StreamReader(httpWebResponse.GetResponseStream(),
            Encoding.GetEncoding("Windows-1252"))
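
    A minimal sketch of that header check, assuming the Content-Type value
    is already available as a string (for example from
    httpWebResponse.ContentType). The names CharsetFromHeader and Resolve,
    and the fallback parameter, are made up for illustration; ContentType
    is the class from System.Net.Mime. On .NET Core and later, names such
    as Windows-1252 are only recognised after registering
    CodePagesEncodingProvider, so the sample sticks to ISO-8859-1 and UTF-8.

    ```csharp
    using System;
    using System.Net.Mime;
    using System.Text;

    public static class CharsetFromHeader
    {
        // Hypothetical helper: picks an Encoding from a Content-Type header
        // value, returning the caller's fallback when no usable charset is given.
        public static Encoding Resolve(string contentTypeHeader, Encoding fallback)
        {
            if (string.IsNullOrEmpty(contentTypeHeader))
                return fallback;
            try
            {
                // ContentType parses "type/subtype; charset=..." for us.
                string charset = new ContentType(contentTypeHeader).CharSet;
                return string.IsNullOrEmpty(charset)
                    ? fallback
                    : Encoding.GetEncoding(charset);
            }
            catch (FormatException)
            {
                return fallback; // header could not be parsed
            }
            catch (ArgumentException)
            {
                return fallback; // charset name not known to this runtime
            }
        }

        public static void Main()
        {
            Console.WriteLine(
                Resolve("text/html; charset=ISO-8859-1", Encoding.UTF8).WebName);
            Console.WriteLine(
                Resolve("text/html", Encoding.UTF8).WebName);
        }
    }
    ```

    The resolved Encoding would then be passed to the StreamReader
    constructor in place of a hard-coded one.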

    On the other hand on the wild wild web the server often does not send a
    charset parameter and the author of the HTML document only includes the
    charset in a meta element, e.g.
        <meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
    Therefore user agents like browsers put in a lot of effort to try to
    read enough of the document to find and parse that meta element to then
    be able to decode the rest of the document.


    --

    Martin Honnen --- MVP XML

    • Nightcrawler

      #3
      Re: Is this HttpWebRequest correct?

      So what you are basically saying is that my best bet is to look for
      the meta tags in the page to determine the encoding to use, and not
      rely on the HTTP response header.

      Most of the sites I read using the StreamReader say:
      <meta http-equiv="Content-type" content="text/html; charset=UTF-8" />
      but there are a few that do not have that meta tag included in their
      code. How should I approach those? Is there a way for the StreamReader
      to detect what encoding the page is using?

      Thanks for your help!

      • Nightcrawler

        #4
        Re: Is this HttpWebRequest correct?

        What is even more annoying is that one of the websites I read
        states it's using UTF-8 and my StreamReader still does not translate
        the characters correctly. I get little square boxes instead of the
        characters.

        • Peter Duniho

          #5
          Re: Is this HttpWebRequest correct?

          On Fri, 03 Oct 2008 10:28:21 -0700, Nightcrawler
          <thomas.zaleski@gmail.com> wrote:
          > What is even more annoying is that one of the websites I read
          > states it's using UTF-8 and my StreamReader still does not translate
          > the characters correctly. I get little square boxes instead of the
          > characters.

          "little square boxes" might, but does not necessarily, mean that the
          characters are being decoded incorrectly. It may simply be that the
          characters are not displaying with whatever font you're using to show them.

          How are you determining that the StreamReader doesn't correctly decode the
          characters? How are you specifying, if at all, that the encoding used by
          the StreamReader is UTF-8?
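
          One way to take the font out of the picture is to look at the
          code points in the decoded string rather than at how they are
          drawn. The sketch below is only an illustration (CodePointDump
          and Describe are made-up names): it prints each UTF-16 code unit
          as U+XXXX. If a character you expected, such as U+00E9, shows up
          as U+FFFD (the Unicode replacement character), the decoder
          rejected the bytes; if the expected value is there, the decoding
          is fine and the boxes are a display problem.

          ```csharp
          using System;
          using System.Text;

          public static class CodePointDump
          {
              // Formats every UTF-16 code unit of the string as U+XXXX,
              // separated by single spaces.
              public static string Describe(string text)
              {
                  var sb = new StringBuilder();
                  foreach (char c in text)
                  {
                      if (sb.Length > 0) sb.Append(' ');
                      sb.Append("U+").Append(((int)c).ToString("X4"));
                  }
                  return sb.ToString();
              }

              public static void Main()
              {
                  // "café" written with an escape so the source file's own
                  // encoding cannot interfere with the demonstration.
                  Console.WriteLine(Describe("caf\u00E9"));
                  // prints: U+0063 U+0061 U+0066 U+00E9
              }
          }
          ```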

          Pete

          • Nightcrawler

            #6
            Re: Is this HttpWebRequest correct?

            If I view the very same page in my browser it shows up correctly.

            The meta tag states it's using UTF-8 but when I use:

            StreamReader responseReader = new
                StreamReader(httpWebResponse.GetResponseStream(), Encoding.UTF8);

            the characters are still unreadable. However, if I use UTF7 instead,
            the characters show up correctly, BUT when I try to convert the page
            to XML I get an error saying "hexadecimal value 0xD85E, is an invalid
            character". I am very confused by all this. It seems a little like the
            wild wild west.

            Any further help is highly appreciated.

            Thanks

            • Nightcrawler

              #7
              Re: Is this HttpWebRequest correct?

              I guess another interesting point is that when I change the code to
              use "ISO-8859-1" instead of UTF-8, which the website claims to use, it
              seems that it actually reads the characters correctly AND the
              string translates into XML without any issues. Why? I have no idea, and
              I wish I understood it better. Again, any insight into this problem is
              appreciated.

              Thanks

              • Peter Duniho

                #8
                Re: Is this HttpWebRequest correct?

                On Fri, 03 Oct 2008 10:43:19 -0700, Nightcrawler
                <thomas.zaleski@gmail.com> wrote:
                > If I view the very same page in my browser it shows up correctly.

                Unless your own code is using the same fonts to display the text that the
                browser uses, that's not a relevant test.

                As for the other behaviors you've noticed, it does sound to me as though
                it's possible that the page is not encoded in UTF-8, but rather
                ISO-8859-1. But it's hard to know for sure, since we don't have the
                actual data to look at.

                Pete

                • Nightcrawler

                  #9
                  Re: Is this HttpWebRequest correct?

                  Pete,

                  You can see the page if you go to the link below. It's the iTunes
                  linkmaker page:



                  As you can see, they claim to use UTF-8, but when you read it using a
                  StreamReader with that encoding, it does not read "foreign"
                  characters correctly. However, when I tried the ISO-8859-1 encoding
                  it seemed to work.

                  Thanks

                  • Peter Duniho

                    #10
                    Re: Is this HttpWebRequest correct?

                    On Fri, 03 Oct 2008 13:47:43 -0700, Nightcrawler
                    <thomas.zaleski@gmail.com> wrote:
                    > Pete,
                    >
                    > You can see the page if you go to the link below. It's the iTunes
                    > linkmaker page:
                    >
                    > As you can see, they claim to use UTF-8, but when you read it using a
                    > StreamReader with that encoding, it does not read "foreign"
                    > characters correctly. However, when I tried the ISO-8859-1 encoding
                    > it seemed to work.

                    What data in the page are you having trouble with? Can you be more
                    specific about what's not being shown correctly?

                    I haven't spent a huge amount of time with the file. But a cursory look
                    at it shows that it appears, at least to me, to have ISO-8859-1 data
                    embedded within the page itself, in certain URLs.

                    It seems possible to me that the page encoding is technically UTF-8, but
                    using only the subset of UTF-8 that is the same as ISO-8859-1, and that
                    the page also has data that's not supposed to be interpreted as text
                    within the HTML, but rather should be decoded as ISO-8859-1.

                    That would explain why the page claims to be encoded as UTF-8 but there
                    are still characters that don't display correctly unless you read the HTML
                    as ISO-8859-1.

                    Or maybe the meta tag really is wrong. I'm not completely sure. :)
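
                    To make the mismatch concrete, here is a small sketch with a
                    made-up byte sequence (not taken from the actual page): a word
                    containing "ù" as an ISO-8859-1 server would send it, where
                    "ù" is the single byte 0xF9. That byte can never start a valid
                    UTF-8 sequence, so a UTF-8 decoder substitutes U+FFFD, while an
                    ISO-8859-1 decoder maps it straight to U+00F9. The names
                    MislabelledBytes and Decode are illustrative only.

                    ```csharp
                    using System;
                    using System.Text;

                    public static class MislabelledBytes
                    {
                        // Decodes the same bytes with whichever encoding name is given.
                        public static string Decode(byte[] bytes, string encodingName)
                        {
                            return Encoding.GetEncoding(encodingName).GetString(bytes);
                        }

                        public static void Main()
                        {
                            // 'O', 'c', 'h', 0xF9 ('ù' in ISO-8859-1), 'n'
                            byte[] bytes = { 0x4F, 0x63, 0x68, 0xF9, 0x6E };

                            string asLatin1 = Decode(bytes, "ISO-8859-1");
                            string asUtf8 = Decode(bytes, "utf-8");

                            // True: 0xF9 became U+00F9
                            Console.WriteLine(asLatin1 == "Och\u00F9n");
                            // True: 0xF9 is invalid in UTF-8 and became U+FFFD
                            Console.WriteLine(asUtf8 == "Och\uFFFDn");
                        }
                    }
                    ```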

                    Pete

                    • Arne Vajhøj

                      #11
                      Re: Is this HttpWebRequest correct?

                      Nightcrawler wrote:
                      > I am currently using the HttpWebRequest and HttpWebResponse to pull
                      > webpages down from a few urls.
                      >
                      > string url = "some url";
                      > HttpWebRequest httpWebRequest =
                      >     (HttpWebRequest)WebRequest.Create(url);
                      >
                      > using (HttpWebResponse httpWebResponse =
                      >     (HttpWebResponse)httpWebRequest.GetResponse())
                      > {
                      >     string html = string.Empty;
                      >
                      >     StreamReader responseReader = new
                      >         StreamReader(httpWebResponse.GetResponseStream(), Encoding.UTF7);
                      >     html = responseReader.ReadToEnd();
                      > }
                      >
                      > My code works, but my question is: am I doing it the right way
                      > (especially the encoding part)? Some of the websites I pull content
                      > from have characters in them that do not exist in the English
                      > alphabet, and currently the only way for these to be read correctly by
                      > my StreamReader is if I use UTF7 encoding. Is this really the
                      > only way?

                      I am a bit surprised by the UTF-7; that is a rare encoding, at least
                      where I surf.

                      Otherwise Martin Honnen is correct: you need to look at both the HTTP
                      header and the HTML META tag.

                      See the code attached below for a starting point.

                      Arne

                      ===========================================================

                      using System;
                      using System.IO;
                      using System.Net;
                      using System.Text;
                      using System.Text.RegularExpressions;

                      public class HttpDownloadCharset
                      {
                          // Matches the charset parameter of a Content-Type value.
                          private static Regex encpat = new
                              Regex("charset=([A-Za-z0-9-]+)", RegexOptions.IgnoreCase |
                              RegexOptions.Compiled);
                          // Returns the charset named in a Content-Type value,
                          // or ISO-8859-1 if none is given.
                          private static string ParseContentType(string contenttype)
                          {
                              Match m = encpat.Match(contenttype);
                              if(m.Success)
                              {
                                  return m.Groups[1].Value;
                              }
                              else
                              {
                                  return "ISO-8859-1";
                              }
                          }
                          // Matches <meta http-equiv="Content-Type" content="...">,
                          // with or without an XHTML-style "/>" ending.
                          private static Regex metaencpat = new
                              Regex("<META\\s+HTTP-EQUIV\\s*=\\s*[\"']Content-Type[\"']\\s+CONTENT\\s*=\\s*[\"']([^\"']*)[\"']\\s*/?>",
                              RegexOptions.IgnoreCase | RegexOptions.Compiled);
                          // Returns the charset from the META tag if present,
                          // otherwise the default passed in (from the HTTP header).
                          private static string ParseMetaContentType(String html, String
                              defenc)
                          {
                              Match m = metaencpat.Match(html);
                              if(m.Success)
                              {
                                  return ParseContentType(m.Groups[1].Value);
                              } else {
                                  return defenc;
                              }
                          }
                          // Buffer size used when the server sends no Content-Length;
                          // longer responses are truncated at this size.
                          private const int DEFAULT_BUFSIZ = 1000000;
                          public static string Download(string urlstr)
                          {
                              HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlstr);
                              using(HttpWebResponse resp =
                                  (HttpWebResponse)req.GetResponse())
                              {
                                  if (resp.StatusCode == HttpStatusCode.OK)
                                  {
                                      string enc = ParseContentType(resp.ContentType);
                                      int bufsiz = (int)resp.ContentLength;
                                      if(bufsiz < 0) {
                                          bufsiz = DEFAULT_BUFSIZ;
                                      }
                                      byte[] buf = new byte[bufsiz];
                                      Stream stm = resp.GetResponseStream();
                                      int ix = 0;
                                      int n;
                                      // Read until end of stream or until the buffer is full.
                                      while((n = stm.Read(buf, ix, buf.Length - ix)) > 0) {
                                          ix += n;
                                      }
                                      stm.Close();
                                      // Decode as ASCII first, just to find the META tag;
                                      // only the ix bytes actually read are decoded.
                                      string temp = Encoding.ASCII.GetString(buf, 0, ix);
                                      enc = ParseMetaContentType(temp, enc);
                                      return Encoding.GetEncoding(enc).GetString(buf, 0, ix);
                                  }
                                  else
                                  {
                                      throw new ArgumentException("URL " + urlstr +
                                          " returned " + resp.StatusDescription);
                                  }
                              }
                          }
                      }

                      • Nightcrawler

                        #12
                        Re: Is this HttpWebRequest correct?

                        Peter,

                        Thanks for your feedback. One example of data that I was having
                        trouble with would be the 6th row from the bottom (Love & Happiness
                        (Yemaya y Ochùn) [12' Club Mix]). The special "u" character in the
                        word Ochun was coming out wrong when I used UTF-8 encoding. Once I
                        changed it to ISO-8859-1 I was able to parse it out correctly.

                        I really would like to understand encodings and why I was running into
                        this problem. Are there any articles or websites you can recommend
                        that will allow me to learn a bit more about this? I hate "solving" a
                        problem and moving on without really knowing why it works.

                        Thanks again.

                        • Nightcrawler

                          #13
                          Re: Is this HttpWebRequest correct?

                          Arne,

                          Thanks for the code. I will give this a try.

                          • Jon Skeet [C# MVP]

                            #14
                            Re: Is this HttpWebRequest correct?

                            On Oct 6, 3:15 pm, Nightcrawler <thomas.zale...@gmail.com> wrote:
                            > Thanks for your feedback. One example of data that I was having
                            > trouble with would be the 6th row from the bottom (Love & Happiness
                            > (Yemaya y Ochùn) [12' Club Mix]). The special "u" character in the
                            > word Ochun was coming out wrong when I used UTF-8 encoding. Once I
                            > changed it to ISO-8859-1 I was able to parse it out correctly.
                            >
                            > I really would like to understand encodings and why I was running into
                            > this problem. Are there any articles or websites you can recommend
                            > that will allow me to learn a bit more about this? I hate "solving" a
                            > problem and moving on without really knowing why it works.

                            I have an article on Unicode at http://pobox.com/~skeet/csharp/unicode.html
                            Whether it contains anything you don't already know is a different
                            matter...

                            Jon

                            • Nightcrawler

                              #15
                              Re: Is this HttpWebRequest correct?

                              Jon,

                              Thanks, I will check this out.
