Screen Scraper

  • kronecker@yahoo.co.uk

    Screen Scraper

    A screen scraper is a program that extracts only the text from a web site.
    I pinched this one from the web:

    Public Class Form1
        Private Sub Form1_Load(ByVal sender As System.Object, _
                ByVal e As System.EventArgs) Handles MyBase.Load
            Me.TextBox1.Multiline = True
            Me.TextBox1.ScrollBars = ScrollBars.Both
            'above only for showing the sample
            Dim Doc As mshtml.IHTMLDocument2
            Doc = New mshtml.HTMLDocumentClass
            Dim wbReq As Net.HttpWebRequest = _
                DirectCast(Net.WebRequest.Create( _
                "http://start.csail.mit.edu/startfarm.cgi?query=USA"), _
                Net.HttpWebRequest)
            Dim wbResp As Net.HttpWebResponse = _
                DirectCast(wbReq.GetResponse(), Net.HttpWebResponse)
            Dim wbHCol As Net.WebHeaderCollection = wbResp.Headers
            Dim myStream As IO.Stream = wbResp.GetResponseStream()
            Dim myreader As New IO.StreamReader(myStream)
            Doc.write(myreader.ReadToEnd())
            Doc.close()
            wbResp.Close()

            'the part below is not completely done for all tags.
            'it can (will) be necessary to tailor that to your needs.

            Dim sb As New System.Text.StringBuilder
            For i As Integer = 0 To Doc.all.length - 1
                Dim hElm As mshtml.IHTMLElement = _
                    DirectCast(Doc.all.item(i), mshtml.IHTMLElement)
                Select Case hElm.tagName.ToLower
                    Case "body" '"html" ' "head" ' "form"
                    Case Else
                        If hElm.innerText <> "" Then
                            sb.Append(hElm.innerText & vbCrLf)
                        End If
                End Select
            Next
            TextBox1.Text = sb.ToString
        End Sub
    End Class

    The trouble is that the output is duplicated: the same information
    appears on multiple lines.
    I explored this in a separate thread, where I tried to fix it by
    writing the output to a text file and looking for duplicates. However,
    it would be far easier to fix the scraper itself.
    I am unfamiliar with mshtml coding, but essentially it is looking for
    tags such as "body", "html", "head", etc. Any suggestions as to why it
    duplicates would be great.

    K.
  • Cor Ligthert[MVP]

    #2
    Re: Screen Scraper

    Kronecker,

    The HttpWebRequest only gives you back the raw HTML of the document at
    that URL, which is not the page as you see it in a browser.

    If I understand what you want to do, you need to use the DOM (Document
    Object Model), which MSHTML represents, and learn what MSHTML is (in fact
    it has all the elements from DHTML).
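
    On the duplication itself, one likely cause (a reading of the posted loop,
    not something tested against that page): Doc.all returns every element at
    every nesting depth, and innerText on an element already includes the text
    of all its descendants. A sentence inside a table cell is therefore appended
    once for the cell, again for the row, again for the table, and so on. A
    minimal sketch of one way around it, assuming a project reference to
    Microsoft.mshtml and a document that has already been filled as in the
    original post; GetPageText is a hypothetical helper name:

    ```vb
    Imports mshtml

    Module ScraperSketch
        ' Hypothetical helper: returns the text of an already loaded document.
        ' body.innerText already contains the text of every nested element,
        ' so it is read once instead of once per element in Doc.all.
        Function GetPageText(ByVal doc As IHTMLDocument2) As String
            Dim bodyElm As IHTMLElement = doc.body
            If bodyElm Is Nothing OrElse bodyElm.innerText Is Nothing Then
                Return String.Empty
            End If
            Return bodyElm.innerText
        End Function
    End Module
    ```

    In the original Form1_Load this would replace the whole For loop with
    TextBox1.Text = GetPageText(Doc). If text from particular tags is wanted
    instead, the loop would need to pick only elements that do not contain each
    other, which depends on the page.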

    Once you know that, you can use the Document property of the WebBrowser
    to get at that HTML. Be aware that one page can be made up of several
    frames and so-called IFrames. In that case you have to evaluate all the
    documents (every frame contains a document). That is why the AxWebBrowser
    has a DocumentComplete event and a DownloadComplete event (for the
    WebBrowser there is another way).
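
    A rough sketch of the frame handling described above, using the managed
    System.Windows.Forms.WebBrowser control rather than the AxWebBrowser. It
    assumes a form with a WebBrowser named WebBrowser1 and a multiline TextBox
    named TextBox1 already placed by the designer; AppendDocumentText is a
    hypothetical helper, and the URL is the one from the original post. Treat it
    as a starting point, not a finished scraper:

    ```vb
    Imports System.Text
    Imports System.Windows.Forms

    Public Class Form1
        Private Sub Form1_Load(ByVal sender As Object, _
                ByVal e As EventArgs) Handles MyBase.Load
            WebBrowser1.Navigate("http://start.csail.mit.edu/startfarm.cgi?query=USA")
        End Sub

        Private Sub WebBrowser1_DocumentCompleted(ByVal sender As Object, _
                ByVal e As WebBrowserDocumentCompletedEventArgs) _
                Handles WebBrowser1.DocumentCompleted
            ' The event can also fire for individual frames;
            ' only act when the top-level page reports completion.
            If Not e.Url.Equals(WebBrowser1.Url) Then Return

            Dim sb As New StringBuilder()
            AppendDocumentText(WebBrowser1.Document, sb)
            TextBox1.Text = sb.ToString()
        End Sub

        ' Hypothetical helper: appends the body text of a document,
        ' then does the same for the document inside each of its frames.
        Private Sub AppendDocumentText(ByVal doc As HtmlDocument, _
                ByVal sb As StringBuilder)
            If doc Is Nothing Then Return
            If doc.Body IsNot Nothing AndAlso _
                    Not String.IsNullOrEmpty(doc.Body.InnerText) Then
                sb.AppendLine(doc.Body.InnerText)
            End If
            If doc.Window Is Nothing Then Return
            For Each frame As HtmlWindow In doc.Window.Frames
                Try
                    AppendDocumentText(frame.Document, sb)
                Catch ex As UnauthorizedAccessException
                    ' Frames served from another domain cannot be read
                    ' through this API, so they are skipped here.
                End Try
            Next
        End Sub
    End Class
    ```

    Comparing e.Url with WebBrowser1.Url is a common way to spot the top-level
    completion, but it can miss pages that redirect, so it may need adjusting.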

    If you look at the bottom of IE, you can see the downloading happening,
    because images and other things such as Flash are downloaded separately.

    Working with MSHTML is not easy, because its classes often have to be
    cast, sometimes several levels deep, since the members of the cast class
    must themselves be cast.

    The last thing is that web page authors are not always as correct as they
    should be, and many pages, including those from very professional
    companies, contain many errors. Often they are created on the principle:
    "If it works on my screen, then it is correct".

    Cor

