Retrieving web page data with the Microsoft Winsock Control

  • AaronL
    New Member
    • Jan 2007
    • 99

    Hello,

    I am currently working on a project that has me in a bit of a bind. I want to retrieve web pages from the internet and strip them down to plain text. I'll be using regular expressions to strip out the HTML itself; the problem is actually getting the web pages from the internet.

    I tried using the Microsoft Internet Transfer Control (ITC), but my client had problems with some web pages not downloading completely: one particular page would get as far as the <body> tag and then stop. My client told me that there are known issues with the ITC, so we decided to find an alternative.

    Before I tried the ITC, I used the Microsoft Winsock Control, but with that control web pages were being truncated, which caused my HTML-stripping routine to malfunction. At my client's request, we also don't want to use the Microsoft Internet Controls.

    I feel the Microsoft Winsock Control is the best way to go because, to my knowledge, it is the low-level way of communicating with servers on the internet. I suspect that my lack of understanding of how the Winsock Control works is what is causing my problems. Can someone look at this code and tell me what I'm doing wrong, or tell me how I can use the Winsock Control for this project? My job is kind of on the line at this point.

    Here is code that I put in a module:
    Code:
    'Data variables
    Public DataIn As String
    Public DataOut As String
    Public ErrMsg As String
    
    'Network variables
    Public URL As String
    Public Port As Integer
    
    'Functional Variables
    Public NetExecuting As Boolean
    Public NetTimer As Long
    
    Public Function GetHTML(URL As String, Port As Integer)
    
    NetTimer = 0
    
    'If winsock is still executing, wait
    If NetExecuting = True Then
        Do
            DoEvents
        Loop Until NetExecuting = False
    End If
    
    DataIn = ""
    
    If frmWinSock.Winsock.State <> sckClosed Then
        frmWinSock.Winsock.Close
    End If
    
    'If port is 0 then set it to 80
    If Port <> 0 Then
        frmWinSock.Winsock.RemotePort = Port
    Else
        frmWinSock.Winsock.RemotePort = 80
    End If
    
    'Send connection string compatible with Internet Explorer
    DataOut = "GET / HTTP/1.0" _
    & vbCrLf & "Accept: text/html" _
    & vbCrLf & "Host: " & URL _
    & vbCrLf & "Connection: open " _
    & vbCrLf & "User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)" _
    & vbCrLf & "Referer: " _
    & vbCrLf & "Cookie: " _
    & vbCrLf & vbCrLf
    
    frmWinSock.Winsock.RemoteHost = URL
    
    frmWinSock.Winsock.Connect
    Do
        DoEvents
    Loop Until frmWinSock.Winsock.State = sckClosed And DataIn <> ""
    
    GetHTML = DataIn
    
    End Function
    Here is my form code that handles the Winsock events:
    Code:
    Private Sub timNetTimeOut_Timer()
    Winsock.Close
    End Sub
    
    Private Sub Winsock_Close()
    NetExecuting = False
    End Sub
    
    Private Sub Winsock_Connect()
    timNetTimeOut.Enabled = True
    NetExecuting = True
    ErrMsg = ""
    Winsock.SendData DataOut
    End Sub
    
    Private Sub Winsock_DataArrival(ByVal bytesTotal As Long)
    Dim DataArrived As String
    On Error Resume Next
    Winsock.GetData DataArrived
    DataIn = DataIn & DataArrived
    End Sub
    
    Private Sub Winsock_Error(ByVal Number As Integer, Description As String, ByVal Scode As Long, ByVal Source As String, ByVal HelpFile As String, ByVal HelpContext As Long, CancelDisplay As Boolean)
    NetExecuting = False
    ErrMsg = "A network error has occurred: " & Number & " " & Description
    End Sub
    I set my timeout timer to 10 seconds, which should be plenty of time to retrieve a web page, but http://news.yahoo.com, for example, still gets truncated.
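
One thing worth checking in the request string above: "Connection: open" is not a commonly recognized value (servers generally expect "close" or "keep-alive"), the Referer and Cookie headers are sent empty, and the path is always "/". A minimal sketch of a request builder follows. BuildRequest and its Path parameter are hypothetical names that do not appear in the original code, and Host is assumed to be a bare host name such as news.yahoo.com, without the http:// prefix.

```vb
' Hypothetical helper (not part of the original post): builds an HTTP/1.0 GET request.
' Host is assumed to be a bare host name; Path is the page to fetch on that host.
Public Function BuildRequest(ByVal Host As String, _
                             Optional ByVal Path As String = "/") As String
    BuildRequest = "GET " & Path & " HTTP/1.0" & vbCrLf & _
                   "Host: " & Host & vbCrLf & _
                   "Accept: text/html" & vbCrLf & _
                   "Connection: close" & vbCrLf & _
                   vbCrLf   ' the blank line ends the header block
End Function
```

With this, DataOut = BuildRequest(URL) could stand in for the hand-built string. Asking for "Connection: close" means the server closing the connection marks the end of the page.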
  • AaronL
    New Member
    • Jan 2007
    • 99

    #2
    I found that Winsock seems to have a problem receiving large data streams. If anyone knows a way around this, I could use the help. Thanks!
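
One possible way around this, sketched under some assumptions: a large page arrives as many DataArrival events, so the data is appended chunk by chunk and the download is treated as complete only when the server closes the connection. In the posted code the timer is never switched off, so an interval started by an earlier request can fire during a later one, and the Close event never closes the local side of the socket. The sketch reuses the names from the first post (Winsock, timNetTimeOut, DataIn, DataOut, NetExecuting). The 10000 ms interval mirrors the 10 seconds mentioned there, and the extra read in the Close event is a precaution, not something the control is known to require.

```vb
' Form code. Assumes the Winsock control, the timNetTimeOut timer, and the public
' DataIn / DataOut / NetExecuting variables from the first post.

Private Sub Winsock_Connect()
    ' Restart the timeout for every request so an old interval cannot cut in
    timNetTimeOut.Enabled = False
    timNetTimeOut.Interval = 10000
    timNetTimeOut.Enabled = True
    NetExecuting = True
    Winsock.SendData DataOut
End Sub

Private Sub Winsock_DataArrival(ByVal bytesTotal As Long)
    Dim Chunk As String
    ' Each event delivers only part of a large page, so keep appending
    Winsock.GetData Chunk, vbString
    DataIn = DataIn & Chunk
End Sub

Private Sub Winsock_Close()
    Dim Chunk As String
    ' Precaution: pick up anything still waiting in the receive buffer
    If Winsock.BytesReceived > 0 Then
        Winsock.GetData Chunk, vbString
        DataIn = DataIn & Chunk
    End If
    Winsock.Close                    ' close our side so State returns to sckClosed
    timNetTimeOut.Enabled = False
    NetExecuting = False
End Sub

Private Sub timNetTimeOut_Timer()
    ' Calling Close ourselves does not raise the Close event, so clean up here too
    timNetTimeOut.Enabled = False
    Winsock.Close
    NetExecuting = False
End Sub
```

The calling loop can then wait until the socket state is back to sckClosed, and whatever is in DataIn at that point is the whole response the server sent before the timeout.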

    • use an buffer

      #3
      I found your code clean and I liked it, thanks.

      What works for me is to accumulate the incoming data in a buffer and save it periodically while the data is being received.

      • USE BUFFER

        #4
        Code:
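        ' Buffer (String) and i (file counter) are assumed to be declared at module level
        Private Sub Wsck_DataArrival(ByVal bytesTotal As Long)
            Dim Response As String
            Dim FileNum As Integer
        
            ' Append each chunk as it arrives; one page usually spans several events
            Wsck.GetData Response, vbString
            Buffer = Buffer & Response
        
            ' Once the closing tag has arrived, save the whole page to a file
            If InStr(1, Buffer, "</HTML>", vbTextCompare) > 0 Then
                FileNum = FreeFile
                Open App.Path & "\folder\" & i & ".txt" For Output As #FileNum
                Print #FileNum, Buffer
                Close #FileNum
                Buffer = ""   ' start fresh for the next page
            End If
        End Sub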
        Last edited by MMcCarthy; Oct 17 '10, 10:42 PM. Reason: added code tags
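
Because Winsock hands over everything the server sends, the buffer above also holds the status line and the response headers, which the HTML-stripping step from the first post probably should not see. A minimal sketch of separating the two follows. ExtractBody is a hypothetical helper name, and it assumes the buffer holds a plain HTTP response in which a blank line ends the header block.

```vb
' Hypothetical helper: returns the part of a raw HTTP response after the headers.
' Returns an empty string if the blank line ending the headers has not arrived yet.
Public Function ExtractBody(ByVal Response As String) As String
    Dim HeaderEnd As Long

    ' The header block ends at the first empty line (CRLF followed by CRLF)
    HeaderEnd = InStr(1, Response, vbCrLf & vbCrLf, vbBinaryCompare)

    If HeaderEnd > 0 Then
        ExtractBody = Mid$(Response, HeaderEnd + 4)   ' skip the four CRLF CRLF characters
    Else
        ExtractBody = ""
    End If
End Function
```

Calling ExtractBody(Buffer) once the page is complete gives just the HTML, which can then go to the regular expression step.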
