Screen Scraper

**Cor Ligthert[MVP]** · Aug 12 '08, 04:45 AM

Re: Screen Scraper

Kronecker,

The HttpWebRequest gives you back only the HTML source of the document at
that URL; that is not the page as you see it in the browser.

If I understand what you want to do, you need to use the DOM (Document Object
Model), represented by MSHTML, and learn what MSHTML is (in fact it has all
the elements from DHTML).
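
A minimal sketch of what the DOM gives you here, under stated assumptions: in MSHTML the innerText of an element already includes the text of every element nested inside it, so a loop over Doc.all that appends innerText for each element repeats the same text once per nesting level. Reading innerText from the body alone returns that text once. The module name and the ExtractBodyText helper are hypothetical, and a project reference to Microsoft.mshtml is assumed.

```vbnet
Imports System
Imports mshtml

Module BodyTextSketch
    ' Hypothetical helper: parses an HTML string with MSHTML and returns
    ' the text of the whole page once, without per-element duplicates.
    Function ExtractBodyText(ByVal html As String) As String
        Dim doc As IHTMLDocument2 = New HTMLDocumentClass()
        doc.write(html)
        doc.close()

        ' body.innerText already contains the text of all nested elements,
        ' so there is no need to walk doc.all and append each element.
        If doc.body Is Nothing OrElse doc.body.innerText Is Nothing Then
            Return String.Empty
        End If
        Return doc.body.innerText
    End Function
End Module
```

If per-element output is really needed, the loop would have to skip elements whose text is already covered by a parent; which elements to keep depends on the page.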

Once you know that, you can use the Document property of the WebBrowser
to get that HTML. Be aware that one page can be made up of several frames and
so-called IFrames. Because of that, you have to evaluate all the documents
(every frame contains a document). That is why the AxWebBrowser has a
DocumentComplete event and a DownloadComplete event (for the WebBrowser
there is another way).
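
For the Windows Forms WebBrowser, one possible way (a sketch, not necessarily the approach meant above) is to handle its DocumentCompleted event and walk the frames recursively. The form, the field names and the CollectText helper are hypothetical; the URL is the one from the quoted post. Frames from another domain can refuse access, which is assumed here to surface as an UnauthorizedAccessException.

```vbnet
Imports System
Imports System.Text
Imports System.Windows.Forms

Public Class FrameTextForm
    Inherits Form

    Private WithEvents Browser As New WebBrowser()
    Private ReadOnly Output As New TextBox()

    Public Sub New()
        Output.Multiline = True
        Output.ScrollBars = ScrollBars.Both
        Output.Dock = DockStyle.Fill
        Controls.Add(Output)

        Browser.Dock = DockStyle.Top
        Browser.Height = 200
        Controls.Add(Browser)

        Browser.Navigate("http://start.csail.mit.edu/startfarm.cgi?query=USA")
    End Sub

    ' DocumentCompleted can fire more than once on a page with frames,
    ' so the text is rebuilt from the top-level document each time.
    Private Sub Browser_DocumentCompleted(ByVal sender As Object, _
            ByVal e As WebBrowserDocumentCompletedEventArgs) _
            Handles Browser.DocumentCompleted
        Dim sb As New StringBuilder()
        CollectText(Browser.Document, sb)
        Output.Text = sb.ToString()
    End Sub

    ' Appends the body text of a document, then of every frame inside it.
    Private Sub CollectText(ByVal doc As HtmlDocument, ByVal sb As StringBuilder)
        If doc Is Nothing Then Return

        If doc.Body IsNot Nothing AndAlso doc.Body.InnerText IsNot Nothing Then
            sb.AppendLine(doc.Body.InnerText)
        End If

        If doc.Window Is Nothing OrElse doc.Window.Frames Is Nothing Then Return

        For Each frame As HtmlWindow In doc.Window.Frames
            Try
                CollectText(frame.Document, sb)
            Catch ex As UnauthorizedAccessException
                ' Assumed: a frame from another domain denies access; skip it.
            End Try
        Next
    End Sub
End Class

Module Program
    <STAThread()> _
    Sub Main()
        Application.EnableVisualStyles()
        Application.Run(New FrameTextForm())
    End Sub
End Module
```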

If you look at the bottom of IE, you can see that downloading is still going
on, because images and other things such as Flash are downloaded separately.

Working with MSHTML is not easy, because its classes often have to be
cast, sometimes several levels deep, because the members of the class you
cast to have to be cast as well.
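
To illustrate the casting, here is a small sketch that lists the links in a document that has already been parsed. The module name and the ListLinks helper are hypothetical, and a project reference to Microsoft.mshtml is assumed. The document is cast to IHTMLDocument3 to reach getElementsByTagName, and each item, which comes back as Object, is cast again to the interface that exposes href.

```vbnet
Imports System
Imports System.Collections.Generic
Imports mshtml

Module CastingSketch
    ' Hypothetical helper: returns the href of every <a> element.
    Function ListLinks(ByVal doc As IHTMLDocument2) As List(Of String)
        Dim links As New List(Of String)()

        ' First cast: getElementsByTagName lives on IHTMLDocument3.
        Dim doc3 As IHTMLDocument3 = DirectCast(doc, IHTMLDocument3)
        Dim anchors As IHTMLElementCollection = doc3.getElementsByTagName("a")

        For i As Integer = 0 To anchors.length - 1
            ' Second cast: item returns Object, href is on IHTMLAnchorElement.
            Dim anchor As IHTMLAnchorElement = _
                TryCast(anchors.item(i), IHTMLAnchorElement)
            If anchor IsNot Nothing AndAlso anchor.href IsNot Nothing Then
                links.Add(anchor.href)
            End If
        Next

        Return links
    End Function
End Module
```

TryCast is used for the items so that an element which does not support the interface is skipped instead of raising an exception.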

The last thing is that most web authors are not always as correct as they
should be, and many pages, including those from very professional
companies, contain many errors. Often they are created along the lines of:
"If it works on my screen, then it is correct".

Cor

<kronecker@yahoo.co.uk> wrote in message
news:028fcd78-a619-4135-8f4f-b29504d4d305@k36g2000pri.googlegroups.com...

>A screen scraper is a program that extracts only the text from a web site.
I pinched this one from the web:
>
Public Class Form1
    Private Sub Form1_Load(ByVal sender As System.Object, _
            ByVal e As System.EventArgs) Handles MyBase.Load
        Me.TextBox1.Multiline = True
        Me.TextBox1.ScrollBars = ScrollBars.Both
        'above only for showing the sample
        Dim Doc As mshtml.IHTMLDocument2
        Doc = New mshtml.HTMLDocumentClass
        Dim wbReq As Net.HttpWebRequest = _
            DirectCast(Net.WebRequest.Create( _
            "http://start.csail.mit.edu/startfarm.cgi?query=USA"), _
            Net.HttpWebRequest)
        Dim wbResp As Net.HttpWebResponse = _
            DirectCast(wbReq.GetResponse(), Net.HttpWebResponse)
        Dim wbHCol As Net.WebHeaderCollection = wbResp.Headers
        Dim myStream As IO.Stream = wbResp.GetResponseStream()
        Dim myreader As New IO.StreamReader(myStream)
        Doc.write(myreader.ReadToEnd())
        Doc.close()
        wbResp.Close()

        'the part below is not completely done for all tags.
        'it can (will) be necessary to tailor that to your needs.

        Dim sb As New System.Text.StringBuilder
        For i As Integer = 0 To Doc.all.length - 1
            Dim hElm As mshtml.IHTMLElement = _
                DirectCast(Doc.all.item(i), mshtml.IHTMLElement)
            Select Case hElm.tagName.ToLower
                Case "body" '"html" ' "head" ' "form"
                Case Else
                    If hElm.innerText <> "" Then
                        sb.Append(hElm.innerText & vbCrLf)
                    End If
            End Select
        Next
        TextBox1.Text = sb.ToString
    End Sub
End Class
>
The trouble is that the output contains duplicated text: multiple
lines with the same info.
I explored this in a separate thread, where I tried to fix it by
writing it to a text file and looking for duplicates. However, it
would be far easier to fix the scraper itself.
I am unfamiliar with mshtml coding, but essentially it is looking for
tags such as "body", "html", "head" etc. Any suggestions as to why it
duplicates would be great.
>
K.
