I would like to write a web (http) proxy which I can instrument to
automatically extract information from certain web sites as I browse
them. Specifically, I would want to process URLs that match a particular
regexp. For those URLs I would have code that parsed the content and
logged some of it.
Think of it as web scraping under manual control.
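To make the goal concrete, here is a minimal sketch of the per-response processing step described above. It is not taken from Tiny HTTP Proxy: the function name `process_response`, the URL pattern, and the choice of extracting the page title are all assumptions for illustration. The idea is that whatever proxy is used would call this once it holds the complete response body for a URL.

```python
import logging
import re

# Hypothetical pattern: only URLs matching this are scraped.
URL_PATTERN = re.compile(r"^https?://(www\.)?example\.com/items/\d+")

# Illustrative extraction: pull the <title> out of the page.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

log = logging.getLogger("scrape")


def process_response(url, body):
    """Log and return the page title if `url` matches the pattern.

    `body` is the raw response body as bytes. Returns None when the URL
    does not match or the page has no <title> element.
    """
    if not URL_PATTERN.search(url):
        return None  # not a URL we care about; pass through untouched
    text = body.decode("utf-8", errors="replace")
    match = TITLE_RE.search(text)
    if match is None:
        return None
    title = " ".join(match.group(1).split())  # collapse whitespace
    log.info("%s -> %s", url, title)
    return title
```

The hook only reads the body and returns a value, so the proxy can still forward the original bytes to the browser unchanged.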
I found this list of Python web proxies.
Tiny HTTP Proxy in Python looks promising, as it's nominally simple (not
many lines of code).
It does what it's supposed to, but I'm a bit at a loss as to where to
intercept the traffic. I suspect it should be quite straightforward, but
I'm finding the code a bit opaque.
Any suggestions?
Andrew