parse http header

  • dschu012
    New Member
    • Jul 2008
    • 39

    parse http header

    I am having trouble figuring out how to parse an HTTP header into a map<string,string>.
    Code:
    POST /blah HTTP/1.1
    Host: example.com
    Accept-Language: en-us,en;q=0.5
    Accept-Encoding: gzip,deflate
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
    Content-Type: application/x-www-form-urlencoded
    Content-Length: 25
    My idea was to find the first index of ':' on each line and use the text before it as the key, then use everything after it, up to the \r\n, as the value. The problem is that the spec (http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html) says values can span multiple lines. Any ideas on how I would handle those? Can the value include a colon (the spec doesn't seem to say)?
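On the colon question: values can contain colons (a Host value with a port, or a URL, are common cases), so splitting at the first ':' only is the safe reading of the idea above. A minimal sketch, assuming one already-unfolded header line as input; `trim` and `split_header_line` are hypothetical helper names, not part of any library:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: strips spaces, tabs, CR and LF from both ends.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    const std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Splits "Name: value" at the FIRST ':' only, so any later colons stay in
// the value. Returns false if the line contains no colon at all.
bool split_header_line(const std::string& line,
                       std::string& name, std::string& value) {
    const std::size_t colon = line.find(':');
    if (colon == std::string::npos) return false;
    name = trim(line.substr(0, colon));
    value = trim(line.substr(colon + 1));
    return true;
}
```

Multi-line values are not handled here; the continuation lines have to be joined before a line is passed in.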
  • johny10151981
    Top Contributor
    • Jan 2010
    • 1059

    #2
    First separate all the lines. Every line ends with "\r\n".
    The first line has a fixed format. It always starts with the method: GET, POST, DELETE, HEAD, or one of a few others I can't recall right now.
    From the second line on, you can split each line at ":".

    If the request is a POST request, then after the "\r\n\r\n" you get the request body (the URL-encoded form data) in one line.


    • johny10151981
      Top Contributor
      • Jan 2010
      • 1059

      #3
      Your posted header is missing one line.

      It says it is a POST request with a content length of 25, but I can't see the 25 bytes of URL-encoded data that should follow it. I guess your program stopped after getting "\r\n\r\n", which is not right.


      • Oralloy
        Recognized Expert Contributor
        • Jun 2010
        • 988

        #4
        Parsing is always non-trivial. The line break information you need is described in section 2.2 of the spec, however.

        Why are you writing this parser, rather than using one that's already been implemented?


        • dschu012
          New Member
          • Jul 2008
          • 39

          #5
          @Oralloy
          I was using a small http client sample code. In the sample they were just ignoring the header data and going straight to the \r\n\r\n to get the response data. I am only doing GETs and one of the URLs has a redirect which is why I needed the header data.

          @johny10151981
          It was only an example, not real data, and I am only doing GETs. Your suggestion doesn't help with multi-line values.

          Thanks for referring me to section 2.2. It answered my question.
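For later readers: the rule in section 2.2 is that a line starting with a space or tab continues the previous header's value. A rough C++ sketch of that, assuming the header block has already been separated from the start line; `parse_headers` is a hypothetical name, and this version keeps names case-sensitive and lets a repeated header overwrite the earlier one, which a real client may not want:

```cpp
#include <cstddef>
#include <map>
#include <string>

// Parses a CRLF-separated header block (start line already removed) into a
// map. A line beginning with space or tab is appended to the previous
// header's value, separated by a single space.
std::map<std::string, std::string> parse_headers(const std::string& block) {
    std::map<std::string, std::string> headers;
    std::string last;  // name of the header a continuation line belongs to
    std::size_t pos = 0;
    while (pos < block.size()) {
        std::size_t eol = block.find("\r\n", pos);
        if (eol == std::string::npos) eol = block.size();
        const std::string line = block.substr(pos, eol - pos);
        pos = eol + 2;
        if (line.empty()) break;  // blank line: end of headers
        if (line[0] == ' ' || line[0] == '\t') {
            // Continuation line: join it onto the previous value.
            const std::size_t b = line.find_first_not_of(" \t");
            if (!last.empty() && b != std::string::npos)
                headers[last] += " " + line.substr(b);
            continue;
        }
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;  // skip malformed line
        last = line.substr(0, colon);
        const std::size_t b = line.find_first_not_of(" \t", colon + 1);
        headers[last] = (b == std::string::npos) ? "" : line.substr(b);
    }
    return headers;
}
```

Trailing whitespace in values is left alone here, and quoted strings and comments are not interpreted at all.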


          • Oralloy
            Recognized Expert Contributor
            • Jun 2010
            • 988

            #6
            dschu012,

            If you're using a very light-weight sample as your starting point, then you're going to have to do some work to parse the headers. Since you're only interested in one of the headers, you can cheat by reading the headers one line at a time and processing them. If you find the redirect, you're done; if you don't, you have the content. (In HTTP the redirect target arrives in the Location header.)

            Pseudo code:
            Code:
            looking = inHeaders = true
            while (looking && inHeaders)
              read line
              if (eof)
                inHeaders = false
              else if (line == "")
                inHeaders = false // blank line ends the headers
              else if (strncasecmp(line, "Location:", 9) != 0)
                ; // noOp - not the header we want (names are case-insensitive)
              else
                looking = false // found the header we want
            end while
            
            if (!looking)
              process redirect
            else
              process result
            If you read the specification, then you have a good idea of how messy parsing the headers can be.

            Rather than re-inventing the wheel to parse headers, it might be worth some time to go find a little better example to start with.

            If you're not stuck with C++, there is a really good Perl module for accessing web servers.

            On the other hand, if you're stuck with C++, I'd say use a combination of Lex and Yacc to really simplify the work.

            Another good option would be to read the entire mess into a single buffer and parse it using regex. I'm pretty sure that it'll be fairly easy to write regular expressions to parse the headers. Start by dividing the headers from the content at the first occurrence of "\r\n\r\n". Then tear the headers off of the header block one at a time, applying the same regex repeatedly.
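A sketch of that regex approach using `std::regex` from C++11, assuming the whole message is already in one string. The function name `parse_with_regex` and the pattern itself are my own illustration, not something taken from the spec, and the pattern ignores quoting and comments:

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Splits `message` at the first "\r\n\r\n", stores the content in `body`,
// and pulls headers off the header block one match at a time.
std::map<std::string, std::string> parse_with_regex(const std::string& message,
                                                    std::string& body) {
    std::map<std::string, std::string> headers;
    const std::size_t split = message.find("\r\n\r\n");
    // Keep one trailing CRLF so every header line is CRLF-terminated.
    std::string head = (split == std::string::npos)
                           ? message : message.substr(0, split + 2);
    body = (split == std::string::npos) ? std::string()
                                        : message.substr(split + 4);

    // Drop the start line (request line or status line).
    const std::size_t first_eol = head.find("\r\n");
    if (first_eol == std::string::npos) return headers;
    head.erase(0, first_eol + 2);

    // Group 1: name. Group 2: value, including any continuation lines
    // (a CRLF followed by at least one space or tab).
    static const std::regex header_re(
        "([^:\\r\\n]+):[ \\t]*([^\\r\\n]*(?:\\r\\n[ \\t]+[^\\r\\n]*)*)\\r\\n");
    static const std::regex fold_re("\\r\\n[ \\t]+");

    for (std::sregex_iterator it(head.begin(), head.end(), header_re), end;
         it != end; ++it) {
        // Collapse each line fold inside the value to a single space.
        headers[(*it)[1].str()] =
            std::regex_replace((*it)[2].str(), fold_re, " ");
    }
    return headers;
}
```

If the message has no "\r\n\r\n" at all, a final header without a trailing CRLF is not matched, so this is only suitable once the full header block has arrived.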

            Failing that, I'd write a simple state machine/recognizer for general headers. The problem is that by the time you're done implementing all the quoting forms and comments, you're going to have a rather complex bit of software.

            See section 2 of the document you sent...
