Hello,
I am currently working on a project that has me in sort of a bind. What I want to do is retrieve web pages from the internet and strip them down to just text. I'll be using regular expressions to strip out the HTML itself; the problem is actually getting the web pages from the internet.
I tried using the Microsoft Internet Transfer Control (ITC), but my client had problems with some web pages not downloading; one particular page would just get to the <body> tag and then stop. My client told me that there are reported issues with the ITC, so we decided to find an alternative.
Before I tried the ITC, I used the Microsoft Winsock Control, but with that control web pages were being truncated, which caused my HTML-stripping routine to malfunction. At my client's request, we don't want to use the Microsoft Internet Controls either.
I feel the Microsoft Winsock Control is the best way to go since, to my knowledge, it is the low-level way of communicating with servers on the internet. I suspect that my lack of understanding of how the Winsock Control works is what is causing my problems. Can someone look at this code and tell me what I'm doing wrong, or tell me how I can use the Winsock Control for this project? My job is kind of on the line at this point.
I set the timeout timer I made to 10 seconds, which should be plenty of time to retrieve a web page, but http://news.yahoo.com, for example, still gets truncated.
My code is in two parts, both posted below: first the code that I put in a module, then the code that runs the Winsock functions.
Code:
'Data variables
Public DataIn As String
Public DataOut As String
Public ErrMsg As String

'Network variables
Public URL As String
Public Port As Integer

'Functional Variables
Public NetExecuting As Boolean
Public NetTimer As Long

Public Function GetHTML(URL As String, Port As Integer)
    NetTimer = 0

    'If winsock is still executing, wait
    If NetExecuting = True Then
        Do
            DoEvents
        Loop Until NetExecuting = False
    End If

    DataIn = ""

    If frmWinSock.Winsock.State <> sckClosed Then
        frmWinSock.Winsock.Close
    End If

    'If port is 0 then set it to 80
    If Port <> 0 Then
        frmWinSock.Winsock.RemotePort = Port
    Else
        frmWinSock.Winsock.RemotePort = 80
    End If

    'Send connection string compatible with Internet Explorer
    DataOut = "GET / HTTP/1.0" _
        & vbCrLf & "Accept: text/html" _
        & vbCrLf & "Host: " & URL _
        & vbCrLf & "Connection: open " _
        & vbCrLf & "User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)" _
        & vbCrLf & "Referer: " _
        & vbCrLf & "Cookie: " _
        & vbCrLf & vbCrLf

    frmWinSock.Winsock.RemoteHost = URL
    frmWinSock.Winsock.Connect

    Do
        DoEvents
    Loop Until frmWinSock.Winsock.State = sckClosed And DataIn <> ""

    GetHTML = DataIn
End Function
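For comparison, here is a minimal sketch of how the request string could be built for an arbitrary path instead of always "/". BuildRequest is a hypothetical helper, not part of the project above, and it assumes Host is a bare host name such as news.yahoo.com with no http:// prefix. It sends "Connection: close", which is the token HTTP defines for asking the server to close the connection once the response has been sent ("open" is not a defined value), and it leaves out the empty Referer and Cookie headers.

```vb
'Hypothetical helper, for illustration only.
'Assumes Host is a bare host name (no "http://") and Path starts with "/".
Public Function BuildRequest(ByVal Host As String, ByVal Path As String) As String
    'Fall back to the site root when no path is given
    If Len(Path) = 0 Then Path = "/"

    'Request line, headers, then the blank line that ends the header block
    BuildRequest = "GET " & Path & " HTTP/1.0" & vbCrLf _
        & "Host: " & Host & vbCrLf _
        & "Accept: text/html" & vbCrLf _
        & "Connection: close" & vbCrLf _
        & vbCrLf
End Function
```

It could be used in place of the hand-built string, for example DataOut = BuildRequest("news.yahoo.com", "/"). Whether this has any bearing on the truncation is not something the sketch establishes.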
Code:
Private Sub timNetTimeOut_Timer()
    Winsock.Close
End Sub

Private Sub Winsock_Close()
    NetExecuting = False
End Sub

Private Sub Winsock_Connect()
    timNetTimeOut.Enabled = True
    NetExecuting = True
    ErrMsg = ""
    Winsock.SendData DataOut
End Sub

Private Sub Winsock_DataArrival(ByVal bytesTotal As Long)
    Dim DataArrived As String
    On Error Resume Next
    Winsock.GetData DataArrived
    DataIn = DataIn & DataArrived
End Sub

Private Sub Winsock_Error(ByVal Number As Integer, Description As String, ByVal Scode As Long, ByVal Source As String, ByVal HelpFile As String, ByVal HelpContext As Long, CancelDisplay As Boolean)
    NetExecuting = False
    ErrMsg = "A network error has occurred: " & Number & " " & Description
End Sub
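One way to confirm that a response really was cut short, rather than just being a short page, is to compare the body length against the Content-Length header when the server sends one. The sketch below is a hypothetical helper (IsTruncated is not part of the code above). It assumes the whole raw response, headers included, is in one string such as DataIn, and that each received byte became one character, which holds on single-byte code pages but not necessarily on DBCS systems. If the server sends no Content-Length, the function cannot tell and returns False.

```vb
'Hypothetical diagnostic helper, for illustration only.
'Returns True when the header block is incomplete or the body is shorter
'than the length declared in the Content-Length header.
Public Function IsTruncated(ByVal Response As String) As Boolean
    Const HEADER_NAME As String = "Content-Length:"
    Dim HeaderEnd As Long
    Dim Headers As String
    Dim Pos As Long
    Dim LineEnd As Long
    Dim Declared As Long
    Dim BodyLength As Long

    'The header block ends at the first blank line (CRLF CRLF)
    HeaderEnd = InStr(1, Response, vbCrLf & vbCrLf)
    If HeaderEnd = 0 Then
        'The headers themselves never finished arriving
        IsTruncated = True
        Exit Function
    End If

    'Keep the headers with one trailing CRLF so every header line is terminated
    Headers = Left$(Response, HeaderEnd + 1)
    'Everything after the four-character CRLF CRLF separator is body
    BodyLength = Len(Response) - (HeaderEnd + 3)

    'Header names are case-insensitive, so search with vbTextCompare
    Pos = InStr(1, Headers, vbCrLf & HEADER_NAME, vbTextCompare)
    If Pos = 0 Then Exit Function 'no declared length, result stays False

    Pos = Pos + Len(vbCrLf & HEADER_NAME)
    LineEnd = InStr(Pos, Headers, vbCrLf)
    'Val skips leading spaces and stops at the first non-numeric character
    Declared = Val(Mid$(Headers, Pos, LineEnd - Pos))

    IsTruncated = (BodyLength < Declared)
End Function
```

After a call to GetHTML, a quick check could be: If IsTruncated(DataIn) Then Debug.Print "Response looks truncated".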