parsing internet page using C

  • manontheedge
    New Member
    • Oct 2006
    • 175

    parsing internet page using C

    I've been trying to parse data on a web page in C, but after several hours of searching the internet for help with writing this code I'm still lost.

    Can anyone give me any tips, code, or any other sort of help with this? Nothing I try seems to work and I'm just extremely frustrated and would appreciate any help.

    What I'm trying to do is go to a web site, look for various keywords, and based on what it finds, pull that data out and either put it in variables in the code or put it in a text document.
  • RedSon
    Recognized Expert Expert
    • Jan 2007
    • 4980

    #2
    Try searching for web robots or web spiders.

    • Jan Spatina
      New Member
      • Mar 2007
      • 9

      #3
      Hi,
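      you need to establish an HTTP connection using sockets, then send something like "GET /web/index.html HTTP/1.0\r\nHost: www.sapik.cz\r\n\r\n", read the HTML that comes back in the reply, and parse it. PHP pages are fine too, since the server sends their output as ordinary HTML; what you won't get is anything a page fills in later with scripts in the browser. Maybe I can write an example if this is what you need.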

      -jan-
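
A minimal sketch of building the request string described above, in plain C. The function name `build_get_request` is made up for illustration, and rejecting spaces and line breaks in the arguments is an added precaution rather than something from the post. It only formats the text; sending it over a socket is a separate step.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format an HTTP/1.0 GET request for `path` on `host` into `out`.
   Returns the request length, or -1 if an argument contains a space
   or line break, or if the request does not fit in `cap` bytes.
   (Hypothetical helper, not part of any library.) */
int build_get_request(char *out, size_t cap, const char *host, const char *path)
{
    int n;

    assert(out != NULL && host != NULL && path != NULL);

    /* a stray CR, LF or space would corrupt the request line or headers */
    if (strpbrk(host, "\r\n ") != NULL || strpbrk(path, "\r\n ") != NULL)
        return -1;

    /* every header line ends in CRLF; an empty line ends the header */
    n = snprintf(out, cap,
                 "GET %s HTTP/1.0\r\n"
                 "Host: %s\r\n"
                 "\r\n",
                 path, host);

    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

The returned length is what you would pass to send(), so the terminating NUL of the C string never goes out on the wire.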

      • manontheedge
        New Member
        • Oct 2006
        • 175

        #4
        That is what I need... and they are HTML pages. If you can post some code that would be a great help; I've gotten pretty much nowhere with this. I was about to settle for opening the page, copying it to Excel and sorting through it there out of frustration, so yes, I would appreciate the help very much.

        • Jan Spatina
          New Member
          • Mar 2007
          • 9

          #5
          Hi, this is the code:
          Code:
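          #include <winsock2.h>   // sockets API; must come before <windows.h>
          #include <windows.h>
          #include <cstring>      // memcpy
          #include <iostream>
          #include <iterator>     // ostream_iterator
          #include <string>
          #include <algorithm>
          #include <fstream>
          
          #pragma comment(lib, "ws2_32.lib")   // link against Winsock
          
          using namespace std;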
          
          int GetWeb()   // was a member of a t_sockets class that isn't shown; a free function compiles on its own
          {
              #define BUFSIZE 8192                    // small enough to live on the stack
              WORD wVersionRequested = MAKEWORD(1,1); 
              WSADATA data;                           // library
              // HTTP header lines end in \r\n
              string text("GET /forum/thread624279.html HTTP/1.0\r\nHost: www.thescripts.com\r\n\r\n");
              hostent *host;                          // remote machine
              sockaddr_in serverSock;                 // remote socket
              int mySocket;                           // socket
              char buf[BUFSIZE];                      // input buffer
              int size, totalSize = 0;                // number of received and sent bytes
              ofstream output(".\\download.html", ios::binary);    // write HTML data unmodified
              // get sockets ready
              if (WSAStartup(wVersionRequested, &data) != 0)
              {
                  cerr << "error initializing sockets" << endl;
                  
                  return -1;
              }    
              // get info about remote machine
              if ((host = gethostbyname("www.thescripts.com")) == NULL)   // same server the request is addressed to
              {
                  cerr << "Wrong address" << endl;
                  WSACleanup();
                  return -1;
              }
              // Creation of a socket
              if ((mySocket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) == -1)
              {
                  cerr << "Can't create socket" << endl;
                  WSACleanup();
                  return -1;
              }
              // fill in sockaddr_in
              // 1) protocol family
              serverSock.sin_family = AF_INET;
              // 2) port to connect (http)
              serverSock.sin_port = htons(80);
              // 3) IP address to connect to
              memcpy(&(serverSock.sin_addr), host->h_addr, host->h_length);
              // connect the socket
              if (connect(mySocket, (sockaddr *)&serverSock, sizeof(serverSock)) == -1)
              {
                  cerr << "can't connect" << endl;
                  closesocket(mySocket);              // release the socket before bailing out
                  WSACleanup();
                  return -1;
              }
              // send the request (without the string's terminating NUL)
              if ((size = send(mySocket, text.c_str(), (int)text.size(), 0)) == -1)
              {
                  cerr << "Can't send data" << endl;
                  closesocket(mySocket);
                  WSACleanup();
                  return -1;
              }
              cout << "sent " << size << endl;
              // receive data until the server closes the connection
              text = "";
              while ((size = recv(mySocket, buf, BUFSIZE, 0)) > 0)
              {
                  totalSize += size;
                  text.append(buf, size);             // keeps embedded NUL bytes intact
              }
              if (size == -1)
              {
                  cerr << "Can't receive data" << endl;
              }    
              // close connection
              closesocket(mySocket);
              WSACleanup();
              cout << "Accepted: " << totalSize << " bytes" << endl << "HTTP Header:" << endl << endl;
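              // split the reply at the blank line that ends the HTTP header
              string::size_type offset = text.find("\r\n\r\n");
              if (offset == string::npos)
              {
                  cerr << "No HTTP header found in reply" << endl;
                  return -1;
              }
              copy(text.begin(), text.begin() + offset, ostream_iterator<char>(cout,""));
              cout << endl;
              // the body starts after the four separator bytes
              copy(text.begin() + offset + 4, text.end(), ostream_iterator<char>(output,""));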
              return 0;
          }
          This was written for VC++ 6.
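
For anyone who wants the last step of that code in plain C: a rough sketch of separating the HTTP header from the page body once the whole reply is in one NUL-terminated buffer. The names `http_body` and `http_status` are invented for this example, and it assumes a text reply with no NUL bytes in it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return a pointer to the first byte of the body in a NUL-terminated
   HTTP reply, or NULL if the blank line ending the header is missing.
   (Hypothetical helper for illustration.) */
const char *http_body(const char *reply)
{
    const char *sep;

    assert(reply != NULL);
    sep = strstr(reply, "\r\n\r\n");    /* header and body are separated by an empty line */
    return sep != NULL ? sep + 4 : NULL;
}

/* Parse the numeric status out of a status line such as "HTTP/1.0 200 OK".
   Returns -1 if the reply does not start with a status line. */
int http_status(const char *reply)
{
    int code;

    assert(reply != NULL);
    if (sscanf(reply, "HTTP/%*d.%*d %d", &code) != 1)
        return -1;
    return code;
}
```

Checking the status first is worthwhile: a 404 or a redirect also comes back with a body, just not the page you asked for.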

          • nmadct
            Recognized Expert New Member
            • Jan 2007
            • 83

            #6
            First, you're way better off doing this in Perl than in C, as there's a huge amount of ready-made code that will do most of the work for you. I think Python also has good libraries for this.

            If you want a robust way to access web pages, you might try libwww, although I've found it's not the easiest thing to use.

            The easiest way to grab web pages from a C program is probably to invoke the wget or curl program to fetch the page as a file, then open that file.
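
A sketch of that approach, assuming curl is installed and on the PATH and that the command runs through a Unix-like shell (on Windows cmd.exe the single quotes would need to become double quotes). The function names are made up for this example; the curl options used are -s (quiet), -f (fail on HTTP errors), -L (follow redirects) and -o (output file).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build a curl command line that saves `url` to `path`.
   Returns 0 on success, -1 if an argument contains a single quote
   (which would break the quoting) or the command does not fit. */
int build_curl_command(char *cmd, size_t cap, const char *url, const char *path)
{
    int n;

    assert(cmd != NULL && url != NULL && path != NULL);
    if (strchr(url, '\'') != NULL || strchr(path, '\'') != NULL)
        return -1;
    n = snprintf(cmd, cap, "curl -s -f -L -o '%s' '%s'", path, url);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}

/* Read a whole file into a NUL-terminated heap buffer; the caller frees it. */
char *read_file(const char *path)
{
    FILE *fp = fopen(path, "rb");
    char *data = NULL;
    size_t len = 0, cap = 0, got;
    char chunk[4096];

    if (fp == NULL)
        return NULL;
    while ((got = fread(chunk, 1, sizeof chunk, fp)) > 0) {
        if (len + got + 1 > cap) {              /* grow, leaving room for the NUL */
            char *grown;
            cap = (len + got + 1) * 2;
            grown = realloc(data, cap);
            if (grown == NULL) { free(data); fclose(fp); return NULL; }
            data = grown;
        }
        memcpy(data + len, chunk, got);
        len += got;
    }
    fclose(fp);
    if (data == NULL)
        data = malloc(1);                       /* empty file: return "" */
    if (data != NULL)
        data[len] = '\0';
    return data;
}

/* Fetch `url` with curl, then load the saved file. NULL on any failure. */
char *fetch_page(const char *url, const char *path)
{
    char cmd[2048];

    if (build_curl_command(cmd, sizeof cmd, url, path) != 0)
        return NULL;
    if (system(cmd) != 0)                       /* nonzero means curl failed or is missing */
        return NULL;
    return read_file(path);
}
```

Something like fetch_page("http://www.example.com/", "page.html") would then hand back the page as one string, ready for searching.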

            As for parsing the file, that's an entirely different question. There is plenty of literature out there on writing parsers, you can search Google for it. If you're not interested in fully parsing HTML, but rather just getting some info out of the page, your job might be considerably easier.
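
For the simple case of pulling a value out from behind a known keyword, plain string searching may be enough. A minimal sketch, assuming the page is already in memory as a NUL-terminated string; `extract_after` and the sample keywords in the note below are made up for illustration, and this is not a real HTML parser.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Copy the text that follows the first occurrence of `keyword` in `page`,
   up to (not including) the next `stop` character, into `out`.
   Leading whitespace after the keyword is skipped and the result is
   truncated to fit `cap` bytes. Returns 1 if the keyword was found,
   0 otherwise (in which case `out` is set to ""). */
int extract_after(const char *page, const char *keyword, char stop,
                  char *out, size_t cap)
{
    const char *p;
    size_t n = 0;

    assert(page != NULL && keyword != NULL && out != NULL && cap > 0);

    p = strstr(page, keyword);
    if (p == NULL) {
        out[0] = '\0';
        return 0;
    }
    p += strlen(keyword);
    while (*p != '\0' && isspace((unsigned char)*p))
        p++;                                    /* skip blanks after the keyword */
    while (*p != '\0' && *p != stop && n + 1 < cap)
        out[n++] = *p++;                        /* copy up to the stop character */
    out[n] = '\0';
    return 1;
}
```

With a page containing "<td>Price: 42.50</td>", calling extract_after(page, "Price:", '<', buf, sizeof buf) leaves "42.50" in buf, which can then be converted with atof() or written to a text file with fprintf().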

            • manontheedge
              New Member
              • Oct 2006
              • 175

              #7
              Code:
              #include<stdio.h>
              #include<stdlib.h>    /* exit, atoi */
              #include<string.h>    /* strcpy, strcat, strlen, memcpy */
              #include<winsock2.h>
              
              #pragma comment(lib, "ws2_32.lib")
              #define STRING_MAX 65536
              #define MAX 8388608
              
              char *get_http(const char *targetip, int port, const char *file)
               {
                    WSADATA wsaData;
                   WORD wVersionRequested;
                   struct hostent*          target_ptr;
                   struct sockaddr_in      sock;
                   SOCKET MySock;
              
              
                   wVersionRequested = MAKEWORD(2, 2);
                   if (WSAStartup(wVersionRequested, &wsaData) != 0)   /* returns nonzero on failure */
                   {
                           printf("################# ERROR! ###################\n");
                           printf("Your ws2_32.dll is too old to use this application.    \n");
                           printf("Go to microsofts web site to download the most recent \n");
                           printf("version of ws2_32.dll.\n");
              
              
                           WSACleanup();
                           exit(1);
                   }
                   MySock = socket(AF_INET, SOCK_STREAM, 0);
                   if(MySock==INVALID_SOCKET)
                   {
                           printf("Socket error!\r\n");
              
                           closesocket(MySock);
                           WSACleanup();
                           exit(1);
                   }
                   if ((target_ptr = gethostbyname(targetip)) == NULL)
                   {
                           printf("Resolve of %s failed, please try again.\n", targetip);
              
                           closesocket(MySock);
                           WSACleanup();
                           exit(1);
                   }
                   memset(&sock, 0, sizeof(sock));      /* clear the unused fields first */
                   memcpy(&sock.sin_addr.s_addr, target_ptr->h_addr, target_ptr->h_length);
                   sock.sin_family = AF_INET;
                   sock.sin_port = htons((USHORT)port);
              
                   if ( (connect(MySock, (struct sockaddr *)&sock, sizeof (sock) )))
                   {
                           printf("Couldn't connect to host.\n");
              
                           closesocket(MySock);
                           WSACleanup();
                           exit(1);
                   }
                   char sendfile[STRING_MAX];
                   /* HTTP/1.0, so the server closes the connection when the page is done */
                   strcpy(sendfile, "GET ");
                   strcat(sendfile, file);
                   strcat(sendfile, " HTTP/1.0\r\n");
                   strcat(sendfile, "Host: ");
                   strcat(sendfile, targetip);          /* must name the server, not localhost */
                   strcat(sendfile, "\r\n\r\n");
                   /* send only the request text, once, not the whole 64 KB buffer */
                   if (send(MySock, sendfile, (int)strlen(sendfile), 0) == SOCKET_ERROR)
                   {
                           printf("Error sending Packet\r\n");
                           closesocket(MySock);
                           WSACleanup();
                           exit(1);
                   }

                   /* read until the server closes the connection; a single recv()
                      usually returns only the first part of the page */
                   char *output = new char[MAX + 1];    /* +1 for the terminating NUL */
                   int total = 0;
                   int nret = 0;
                   while (total < MAX &&
                          (nret = recv(MySock, output + total, MAX - total, 0)) > 0)
                   {
                           total += nret;
                   }
                   if (nret == SOCKET_ERROR)
                   {
                           printf("Attempt to receive data FAILED. \n");
                   }
                   output[total] = '\0';
                   closesocket(MySock);
                   WSACleanup();
                   return (output);                     /* caller must delete [] the buffer */
               }
              
              int main(int argc, char *argv[])
              {
                  int port = 80;
                  char* targetip;
                  
                  if (argc < 2)
                  {
                     printf("WebGrab usage:\r\n");
                     printf("%s <TargetIP> [port] [file]\r\n", argv[0]);
                     return(0);
                  }
                  
                  targetip = argv[1];
                  char* output;
                  
                  if(argc >= 3)
                  {
                     port = atoi(argv[2]);
                  }
                  
                  if(argc >= 4)
                  {
                     output = get_http(targetip, port, argv[3]);
                  }
                  
                  else
                  {
                     output = get_http(targetip, port, "/");
                  }
                  
                  printf("%s", output);
                  delete [] output;    /* get_http() allocated this with new[] */
                  
                  return(0);
              }
              I looked into everything you guys recommended, and I started researching sockets. In theory it makes sense. I have some code here that is supposed to get a web page and print the data to the console. But it keeps stopping at the "Couldn't connect to host" part, and it's got me confused.

              I'm running the program from the command prompt. It takes three arguments, one of them mandatory: the IP address or a fully qualified domain name. Each time I run the program, about 5-10 seconds later I get that error. I'd like to know why this is happening.

              I'm actually trying to learn this stuff and how it works so I can use it more in the future, so any help or guidance with this would help. Thanks for the help so far as well.

              • nmadct
                Recognized Expert New Member
                • Jan 2007
                • 83

                #8
                I noticed that your "memcpy" line doesn't match the one in Jan's code. I wonder if that's causing a problem. After you've tried to connect and encountered an error, you can call WSAGetLastError() to get the error code, which might be helpful. It might also help to print out the values of the arguments to connect() just before making the call, to verify what it's trying to do.

                I don't know what kind of reference you're using, but this page is helpful: http://www.sockets.com/winsock.htm
                (It's old, but I don't think the API has changed much since then. Things have been added but the original functions should work the same.)

                • EDevMachine
                  New Member
                  • Apr 2007
                  • 1

                  #9
                  Another option, if a C++/MFC app works for you...

                  #include <afxinet.h>   // CInternetSession, CInternetFile

                  // Just returning the whole web page as a string for parsing.
                  CString GetSourceHtml(CString theUrl)
                  {
                      // this first block does the actual work
                      CString szOutput;
                      CInternetSession session;
                      CInternetFile* file = NULL;
                      try
                      {
                          // try to connect to the URL
                          file = (CInternetFile*) session.OpenURL(theUrl);
                      }
                      catch (CInternetException* m_pException)
                      {
                          // set file to NULL if there's an error
                          file = NULL;
                          m_pException->Delete();
                      }

                      if (file)
                      {
                          CString somecode;

                          // continue fetching code until there is no more
                          while (file->ReadString(somecode))
                          {
                              szOutput = szOutput + somecode;
                          }

                          file->Close();
                          delete file;
                      }

                      return szOutput.Trim();
                  }
