Instrumented web proxy

This topic is closed.
  • Andrew McLean

    Instrumented web proxy

    I would like to write a web (http) proxy which I can instrument to
    automatically extract information from certain web sites as I browse
    them. Specifically, I would want to process URLs that match a particular
    regexp. For those URLs I would have code that parsed the content and
    logged some of it.
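
    The matching-and-logging step described above can be sketched independently of any particular proxy. This is only an illustration: the URL pattern, the logger name, and the functions `extract_title` and `process` are hypothetical, and a real scraper would likely use a proper HTML parser rather than a regexp on the body.

```python
import logging
import re

# Hypothetical pattern: only URLs matching this regexp get processed.
URL_PATTERN = re.compile(r"^http://example\.com/items/\d+$")

log = logging.getLogger("proxy.scrape")


def extract_title(body):
    """Return the text of the first <title> element, or None if absent."""
    match = re.search(r"<title>(.*?)</title>", body, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def process(url, body):
    """Log the page title for URLs matching URL_PATTERN; ignore the rest."""
    if not URL_PATTERN.search(url):
        return None
    title = extract_title(body)
    log.info("%s -> %s", url, title)
    return title
```

    The proxy would call something like `process(url, body)` once it has both the request URL and the decoded response body; everything else passes through untouched.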

    Think of it as web scraping under manual control.

    I found this list of Python web proxies



    Tiny HTTP Proxy in Python looks promising as it's nominally simple (not
    many lines of code)



    It does what it's supposed to, but I'm a bit at a loss as to where to
    intercept the traffic. I suspect it should be quite straightforward, but
    I'm finding the code a bit opaque.

    Any suggestions?

    Andrew
  • Miki

    #2
    Re: Instrumented web proxy

    Hello Andrew,
    > Tiny HTTP Proxy in Python looks promising as it's nominally simple (not
    > many lines of code)
    >
    > It does what it's supposed to, but I'm a bit at a loss as where to
    > intercept the traffic. I suspect it should be quite straightforward, but
    > I'm finding the code a bit opaque.
    >
    > Any suggestions?

    From a quick look at the code, you need to hook into do_GET, where
    you have the URL (see the urlunparse line).
    If you want the actual content of the page, you'll need to hook into
    _read_write (data = i.recv(8192)).
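
    As a rough illustration of the second hook (a generic sketch, not code taken from Tiny HTTP Proxy): the relay loop reads a chunk from one socket and sends it to the other, so the instrumentation point is between the `recv` and the `send`. The function name `relay_chunk` and the `on_data` callback are hypothetical.

```python
import socket


def relay_chunk(source, dest, on_data, bufsize=8192):
    """Copy one chunk from source to dest, showing it to on_data first.

    Returns the number of bytes relayed (0 once the source has closed).
    """
    data = source.recv(bufsize)
    if data:
        on_data(data)       # instrumentation hook: observe the raw bytes
        dest.sendall(data)  # forward the chunk unchanged
    return len(data)
```

    Note that what arrives at this level is the raw HTTP stream: the first chunk of a response includes the status line and headers, and a page body will usually span several chunks, so the callback has to accumulate data before parsing it.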

    HTH,
    --
    Miki <miki.tebeka@gmail.com>
    If it won't be simple, it simply won't be. [Hire me, source code]



    • Paul Rubin

      #3
      Re: Instrumented web proxy

      Andrew McLean <andrew-news@andros.org.uk> writes:
      > I would like to write a web (http) proxy which I can instrument to
      > automatically extract information from certain web sites as I browse
      > them. Specifically, I would want to process URLs that match a
      > particular regexp. For those URLs I would have code that parsed the
      > content and logged some of it.
      >
      > Think of it as web scraping under manual control.

      I've used Proxy 3 for this, a very cool program with powerful
      capabilities for on the fly html rewriting.



      • Andrew McLean

        #4
        Re: Instrumented web proxy

        Paul Rubin wrote:
        > Andrew McLean <andrew-news@andros.org.uk> writes:
        >> I would like to write a web (http) proxy which I can instrument to
        >> automatically extract information from certain web sites as I browse
        >> them. Specifically, I would want to process URLs that match a
        >> particular regexp. For those URLs I would have code that parsed the
        >> content and logged some of it.
        >>
        >> Think of it as web scraping under manual control.
        >
        > I've used Proxy 3 for this, a very cool program with powerful
        > capabilities for on the fly html rewriting.
        >
        > http://theory.stanford.edu/~amitp/proxy.html

        This looks very useful. Unfortunately I can't seem to get it to run
        under Windows (specifically Vista) using Python 1.5.2, 2.2.3 or 2.5.2.
        I'll try Linux if I get a chance.

