Scraping Just Images in C#

  • artlean
    New Member
    • Apr 2009
    • 2

    Scraping Just Images in C#

    I'm trying to create a page where users can enter a URL into a textbox, click search, and see a list of all the images on that exact page.

    So far I've managed to get it to scrape the entire destination page and, using a regular expression, more or less extract the image tags. The problem is that the image paths are sometimes relative, which means the images won't display on my page.

    Unless I'm missing something very obvious, does anyone know of any solutions for this kind of thing? I'm also hoping to do the same with embedded tags, so I can list things like Flash videos, etc.

    I do realise that a lot of this code needs tidying up, but if anyone has anything I'd be very grateful.

    For example, if I searched for www.google.com, I would want it to list the main google image, but instead of getting the path like this:
    http://www.google.co.uk/intl/en_uk/images/logo.gif

    I get the path like this:
    /intl/en_uk/images/logo.gif

    Code:
        public void doSearch(object sender, EventArgs e)
        {
            results_tbl.Rows.Clear();
            string reqURL = url_searchBox_txt.Text;
    
            if (!reqURL.StartsWith("http://"))
            {
                reqURL = "http://" + reqURL;
            }
       
            WebRequest req = WebRequest.Create(reqURL);
            WebResponse resp = req.GetResponse();
    
            Stream s = resp.GetResponseStream();
            StreamReader sr = new StreamReader(s,Encoding.ASCII);
    
            string st = sr.ReadToEnd();
    
            Regex r = new Regex(@"<img([^>]+)>",RegexOptions.IgnoreCase | RegexOptions.Compiled);
            Match m = r.Match(st);
            while (m.Success)
            {
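                TableRow tr = new TableRow();
                TableCell tc1 = new TableCell();    // Item
    
                // Re-emit the matched tag; its src is copied verbatim, so relative paths stay relative.
                tc1.Text = "<img " + m.Groups[1].Value + "/>";
    
                // Only tc1 is added: the original also added an undeclared tc2, which does not compile.
                tr.Cells.Add(tc1);
    
                results_tbl.Rows.Add(tr);
                m = m.NextMatch();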
            }
        }
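
A note on the extraction step: the regex in the code above captures the whole attribute list of each <img> tag, so the path itself is never isolated. Below is a minimal sketch of pulling out just the src values. The names ImgSrcExtractor and Extract are made up for illustration, and a regex is only an approximation of real HTML parsing, so treat this as a starting point rather than a complete solution.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the original post): collects the src value of every <img> tag.
public static class ImgSrcExtractor
{
    // Accepts src="...", src='...' or an unquoted src=value.
    // The \s before "src" requires whitespace, so attributes such as data-src are not matched.
    private static readonly Regex ImgSrc = new Regex(
        @"<img\b[^>]*?\ssrc\s*=\s*(?:""(?<src>[^""]*)""|'(?<src>[^']*)'|(?<src>[^\s>]+))",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static List<string> Extract(string html)
    {
        var result = new List<string>();
        foreach (Match m in ImgSrc.Matches(html))
        {
            // All three alternatives write into the same named group.
            result.Add(m.Groups["src"].Value);
        }
        return result;
    }
}
```

Calling ImgSrcExtractor.Extract(st) on the downloaded page text would give a list of raw paths, which may still be relative at this point.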
  • Bassem
    Contributor
    • Dec 2008
    • 344

    #2
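    You have solved one case out of three, not one out of two!

    Pay attention to this:
    The src attribute of the img element holds a link to a URL, so its content is one of these types:
    1. Fully qualified URL (scheme and host included, e.g. "http://...").
    2. Absolute path (begins with "/", resolved from the site root).
    3. Relative path (resolved from the directory of the current page).

    You have solved the first type; two more remain.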

    Anyway, consider this method:
    1. You have "url_searchBox_txt.Text". It holds the URL of the page, and whatever form that URL takes, it always contains the domain name (host name), which you can split out.
    2. Extract the img's src attribute and check its value: if it begins with a scheme and domain name (e.g. "http://www..."), it is type #1.
    Else if it begins with a "/" slash, it is type #2.
    Else, it is type #3.
    3. For type #1: leave it as it is.
    For type #2: insert the scheme and domain name at the start of the value. That's it, very simple.
    For type #3: Oh, now you've got a problem. You will need to work out the right directory on the website, and I have no idea how to solve this.

    Thanks,
    Bassem
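
The three cases listed in post #2 do not have to be told apart by hand. As a hedged sketch: the .NET System.Uri class can resolve a src value against the URL of the page it came from, which covers fully qualified URLs, paths starting with "/", and paths relative to the page's directory. ImageUrlResolver and Resolve are hypothetical names, and the sketch assumes the page URL already includes its scheme (as reqURL does in the original code) and ignores any <base> tag in the page.

```csharp
using System;

// Hypothetical helper (not from the thread): turns any img src into an absolute URL.
public static class ImageUrlResolver
{
    // pageUrl: the address the HTML was downloaded from, scheme included.
    // src:     the raw value of the img tag's src attribute.
    // Returns the absolute URL, or null if src cannot be resolved.
    public static string Resolve(string pageUrl, string src)
    {
        var pageUri = new Uri(pageUrl, UriKind.Absolute);

        // Uri.TryCreate(baseUri, relative, out result) applies the standard
        // resolution rules: a fully qualified src is returned unchanged,
        // "/x" is resolved from the site root, "x" from the page's directory.
        Uri resolved;
        return Uri.TryCreate(pageUri, src, out resolved) ? resolved.AbsoluteUri : null;
    }
}
```

For the example in the first post, Resolve("http://www.google.co.uk/", "/intl/en_uk/images/logo.gif") gives http://www.google.co.uk/intl/en_uk/images/logo.gif. For type #3, Resolve("http://example.com/gallery/index.html", "pics/a.png") gives http://example.com/gallery/pics/a.png, so no searching through directories is needed: the directory comes from the page's own URL.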


    • swapan das

      #3
      The problem is quite simple. A web page can load an image or media file from its own server or from a remote server. When the page loads an image from an external server, the image tag looks like:
      <img src="http://www.domain.com/01.jpg">
      But when the page loads an image from its own server, the image reference looks like:
      <img src="/images/01.jpg">
      So to fix the problem, just add the site's http address at the beginning, like this: "http://www.google.com" + img_result
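
The prefix in post #3 does not need to be hard-coded. A small sketch, assuming the page URL typed by the user is available with its scheme: Uri.GetLeftPart(UriPartial.Authority) returns the scheme and host without a trailing slash, so it can be joined directly to a path that starts with "/". SiteRootPrefix and Prepend are hypothetical names. Note that this handles only paths beginning with "/", not paths relative to the page's directory.

```csharp
using System;

// Hypothetical helper (not from the thread): prefixes a root-relative path with the site root.
public static class SiteRootPrefix
{
    public static string Prepend(string pageUrl, string rootRelativePath)
    {
        // Scheme plus host, e.g. "http://www.google.com", with no trailing slash.
        string root = new Uri(pageUrl).GetLeftPart(UriPartial.Authority);

        // rootRelativePath is expected to start with "/", so plain concatenation
        // does not produce a doubled slash.
        return root + rootRelativePath;
    }
}
```

With these assumptions, Prepend("http://www.google.com/search?q=x", "/images/01.jpg") gives http://www.google.com/images/01.jpg.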
