A screen scraper is a program that extracts only the text from a web site.
I pinched this one from the web:
Public Class Form1
    Private Sub Form1_Load(ByVal sender As System.Object, _
            ByVal e As System.EventArgs) Handles MyBase.Load
        Me.TextBox1.Multiline = True
        Me.TextBox1.ScrollBars = ScrollBars.Both
        'above only for showing the sample
        Dim Doc As mshtml.IHTMLDocument2
        Doc = New mshtml.HTMLDocumentClass
        Dim wbReq As Net.HttpWebRequest = _
            DirectCast(Net.WebRequest.Create( _
            "http://start.csail.mit.edu/startfarm.cgi?query=USA"), _
            Net.HttpWebRequest)
        Dim wbResp As Net.HttpWebResponse = _
            DirectCast(wbReq.GetResponse(), Net.HttpWebResponse)
        Dim wbHCol As Net.WebHeaderCollection = wbResp.Headers
        Dim myStream As IO.Stream = wbResp.GetResponseStream()
        Dim myreader As New IO.StreamReader(myStream)
        Doc.write(myreader.ReadToEnd())
        Doc.close()
        wbResp.Close()
        'the part below is not completely done for all tags.
        'it can (will) be necessary to tailor that to your needs.
        Dim sb As New System.Text.StringBuilder
        For i As Integer = 0 To Doc.all.length - 1
            Dim hElm As mshtml.IHTMLElement = _
                DirectCast(Doc.all.item(i), mshtml.IHTMLElement)
            Select Case hElm.tagName.ToLower
                Case "body" '"html" ' "head" ' "form"
                Case Else
                    If hElm.innerText <> "" Then
                        sb.Append(hElm.innerText & vbCrLf)
                    End If
            End Select
        Next
        TextBox1.Text = sb.ToString
    End Sub
End Class
The trouble is that the output is duplicated: the same information
appears on multiple lines.
I explored this in a separate thread, where I tried to fix it by
writing the output to a text file and looking for duplicates. However,
it would be far easier to fix the scraper itself.
I am unfamiliar with mshtml coding, but essentially it is looking at
tags such as body, html, head, etc. Any suggestions as to why it
duplicates would be great.
K.
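
One likely cause, offered as a guess and not a confirmed diagnosis: Doc.all lists every element in the page, nested ones included, and an element's innerText already contains the text of everything inside it. The loop above therefore appends the same text once for each enclosing element (html, table, tr, td and so on). A minimal sketch that reads the text a single time from the body instead is below. It assumes the same Microsoft.mshtml project reference as the code above; GetPageText and ScraperSketch are hypothetical names chosen for illustration, not part of mshtml.

```vbnet
' Sketch only: assumes a project reference to Microsoft.mshtml,
' exactly as the original Form1 code does.
' GetPageText is a hypothetical helper name, not an mshtml API.
Public Module ScraperSketch
    Public Function GetPageText(ByVal html As String) As String
        Dim doc As mshtml.IHTMLDocument2 = New mshtml.HTMLDocumentClass
        doc.write(html)
        doc.close()

        ' A parent's innerText includes the text of all its descendants,
        ' so reading it once from the body avoids repeating the same
        ' text for every nested element.
        If doc.body Is Nothing Then Return String.Empty

        Dim pageText As String = doc.body.innerText
        If pageText Is Nothing Then Return String.Empty
        Return pageText
    End Function
End Module
```

In the form above, this would replace the whole For loop: pass myreader.ReadToEnd() to GetPageText and assign the result to TextBox1.Text. If per-element filtering is still needed, the loop would have to skip elements whose text has already been emitted through a parent.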