Python Google Server

  • fuzzyman@gmail.com

    Python Google Server

    I've hacked together a 'GoogleCacheServer'. It is based on
    SimpleHTTPServer. Run the following script (hopefully google groups
    won't mangle the indentation) and set your browser proxy settings to
    'localhost:8000'. It will let you browse the internet using google's
    cache. Obviously you'll miss images, javascript, css files, etc.

    See the world as google sees it !

    (This is actually an 'inventive' short term measure to get round a
    restrictive internet policy at work :-) I'll probably put it in the
    Python Cookbook as it's quite fun (so if line lengths or indentation are
    mangled here, try there). Tested on Windows XP, with Python 2.3 and IE.



    # Copyright Michael Foord, 2004 & 2005.
    # Released subject to the BSD License
    # Please see http://www.voidspace.org.uk/documents/BSD-LICENSE.txt

    # For information about bugfixes, updates and support, please join the
    # Pythonutils mailing list.
    # http://voidspace.org.uk/mailman/list...idspace.org.uk
    # Comments, suggestions and bug reports welcome.
    # Scripts maintained at http://www.voidspace.org.uk/python/index.shtml
    # E-mail fuzzyman@voidspace.org.uk

    import google
    import BaseHTTPServer
    import shutil
    from StringIO import StringIO
    import urlparse

    __version__ = '0.1.0'


    """
    This is a simple implementation of a server that fetches web pages
    from the google cache.

    It lets you explore the internet from your browser, using the google
    cache.

    Run this script and then set your browser proxy settings to
    localhost:8000

    Needs google.py (and a google license key).
    See http://pygoogle.sourceforge.net/
    and http://www.google.com/apis/
    """

    cached_types = ['txt', 'html', 'htm', 'shtml', 'shtm', 'cgi', 'pl',
                    'py']
    google.setLicense(google.getLicense())
    googlemarker = ('<i>Google is not affiliated with the authors of this '
                    'page nor responsible for its content.</i></font></center>'
                    '</td></tr></table></td></tr></table>\n<hr>\n')
    markerlen = len(googlemarker)

    class googleCacheHandler(BaseHTTPServer.BaseHTTPRequestHandler):
        server_version = "googleCache/" + __version__
        cached_types = cached_types
        googlemarker = googlemarker
        markerlen = markerlen

        def do_GET(self):
            f = self.send_head()
            if f:
                self.copyfile(f, self.wfile)
                f.close()

        def send_head(self):
            """Common code for GET and HEAD commands.

            This sends the response code and MIME headers.

            Return value is either a file object (which has to be copied
            to the outputfile by the caller unless the command was HEAD,
            and must be closed by the caller under all circumstances), or
            None, in which case the caller has nothing further to do.

            """
            print self.path
            url = urlparse.urlparse(self.path)[2]
            dotloc = url.find('.') + 1
            if dotloc and url[dotloc:] not in self.cached_types:
                return None # not a cached type - don't even try

            thepage = google.doGetCachedPage(self.path)
            headerpos = thepage.find(self.googlemarker)
            if headerpos != -1: # remove the google header
                pos = self.markerlen + headerpos
                thepage = thepage[pos:]

            f = StringIO(thepage)

            self.send_response(200)
            self.send_header("Content-type", 'text/html')
            self.send_header("Content-Length", str(len(thepage)))
            self.end_headers()
            return f

        def copyfile(self, source, outputfile):
            shutil.copyfileobj(source, outputfile)


    def test(HandlerClass = googleCacheHandler,
             ServerClass = BaseHTTPServer.HTTPServer):
        BaseHTTPServer.test(HandlerClass, ServerClass)


    if __name__ == '__main__':
        test()
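
The header-stripping step in send_head can be tried on its own, without a google license key. What follows is only a minimal sketch in Python 3 (the script above targets Python 2). The function name strip_cache_header and the sample page are made up for illustration; the marker text is copied from the script, assuming the line breaks inside it were introduced by line wrapping.

```python
# Marker text copied from the script above. Everything up to and including
# it is treated as the banner that precedes the cached page.
GOOGLE_MARKER = (
    '<i>Google is not affiliated with the authors of this '
    'page nor responsible for its content.</i></font></center>'
    '</td></tr></table></td></tr></table>\n<hr>\n'
)


def strip_cache_header(page, marker=GOOGLE_MARKER):
    """Return page with everything up to and including marker removed.

    If the marker is not found, the page is returned unchanged.
    """
    pos = page.find(marker)
    if pos == -1:
        return page
    return page[pos + len(marker):]


if __name__ == '__main__':
    # Hypothetical cached page: a banner, the marker, then the real content.
    sample = '<table>banner ' + GOOGLE_MARKER + '<html>real page</html>'
    print(strip_cache_header(sample))
    print(strip_cache_header('<html>no banner</html>'))
```

Returning the page untouched when the marker is missing mirrors the `headerpos != -1` test in the script.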

  • vegetax

    #2
    Re: Python Google Server

    fuzzyman@gmail.com wrote:

    lol, cool hack!! Make a slashdot article about it!!
    > I've hacked together a 'GoogleCacheServer'. It is based on
    > SimpleHTTPServer.
    [snip..]



    • vegetax

      #3
      Re: Python Google Server

      It works on Opera and Firefox on Linux, but you can't search in the cached
      google! It would be more useful if you could somehow search "only" in the
      cache instead of putting in the straight link. Maybe you could add a magic
      URL to search in the cache, like search:"search terms"

      fuzzyman@gmail.com wrote:
      > I've hacked together a 'GoogleCacheServer'. It is based on
      > SimpleHTTPServer.
      [snip..]



      • Fuzzyman

        #4
        Re: Python Google Server


        vegetax wrote:
        > it works on opera and firefox on linux, but you cant search in the
        > cached google! it would be more usefull if you could somehow search
        > "only" in the cache instead of putting the straight link. maybe you
        > could put a magic url to search in the cache, like
        > search:"search terms"

        Thanks for the report. I've also tried it with firefox on windows.

        Yeah - google search results aren't cached !! Perhaps anything in a
        google domain ought to pass straight through. That could be done by
        testing the domain and using urllib2 to fetch the page.

        I have just tested the following, which works.

        Add the following two lines to the start of the code :

        import urllib2
        txheaders = { 'User-agent' : 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)' }

        Then change the start of the send_head method to this :

        def send_head(self):
            """Only GET implemented for this.
            This sends the response code and MIME headers.
            Return value is a file object, or None.
            """
            print 'Request :', self.path # traceback to sys.stdout
            url_tuple = urlparse.urlparse(self.path)
            url = url_tuple[2]
            domain = url_tuple[1]
            if domain.find('.google.') != -1: # bypass the cache for google domains
                req = urllib2.Request(self.path, None, txheaders)
                return urllib2.urlopen(req)
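
The domain test and the request construction can be checked without touching the network. This is a rough sketch in Python 3, where urllib2 and urlparse have become urllib.request and urllib.parse; the function names is_google_domain and build_passthrough_request are invented for illustration, and nothing is actually fetched.

```python
from urllib.parse import urlparse
from urllib.request import Request

# Same browser-like User-agent string as in the post above.
TXHEADERS = {
    'User-agent': 'Mozilla/4.0 (compatible; MSIE 6.0; '
                  'Windows NT 5.1; SV1; .NET CLR 1.1.4322)'
}


def is_google_domain(url):
    """Return True if the host part of url contains '.google.'."""
    host = urlparse(url).netloc
    return '.google.' in host


def build_passthrough_request(url):
    """Build (but do not send) a request carrying the custom headers."""
    return Request(url, None, TXHEADERS)


if __name__ == '__main__':
    for url in ('http://www.google.com/search?q=python',
                'http://www.voidspace.org.uk/python/index.shtml',
                'http://google.com/'):
        print(url, is_google_domain(url))
```

One thing the sketch makes visible: a bare host such as google.com has no leading dot before 'google', so the '.google.' test does not match it.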

        > fuzzyman@gmail.com wrote:
        >
        > > I've hacked together a 'GoogleCacheServer'. It is based on
        > > SimpleHTTPServer. Run the following script (hopefully google groups
        > > won't mangle the indentation) and set your browser proxy settings to
        > > 'localhost:8000'. It will let you browse the internet using google's
        > > cache. Obviously you'll miss images, javascript, css files, etc.
        > >
        > > See the world as google sees it !
        [snip..]


        • Fuzzyman

          #5
          Re: Python Google Server

          Another change - change the line `dotloc = url.find('.') + 1` to
          `dotloc = url.rfind('.') + 1`

          This makes it find the last '.' in the url
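
The difference shows up on any path with a dot before the file name. A small sketch in Python 3, with hypothetical helper names and a made-up example path:

```python
def extension_first_dot(path):
    """Text after the FIRST '.' in path (the original behaviour)."""
    dotloc = path.find('.') + 1
    return path[dotloc:] if dotloc else ''


def extension_last_dot(path):
    """Text after the LAST '.' in path (the behaviour after the change)."""
    dotloc = path.rfind('.') + 1
    return path[dotloc:] if dotloc else ''


if __name__ == '__main__':
    path = '/docs/v1.2/index.html'   # hypothetical URL path
    print(extension_first_dot(path))  # 2/index.html
    print(extension_last_dot(path))   # html
```

With find, '2/index.html' is not in cached_types, so the page would be skipped; with rfind the extension is 'html' and the page is fetched.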

          Best Regards,

          Fuzzy



          • vegetax

            #6
            Re: Python Google Server

            Fuzzyman wrote:

            > Add the following two lines to the start of the code :
            >
            > import urllib2
            > txheaders = { 'User-agent' : 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)' }
            >
            > Then change the start of the send_head method to this :
            >
            > def send_head(self):
            >     """Only GET implemented for this.
            >     This sends the response code and MIME headers.
            >     Return value is a file object, or None.
            >     """
            >     print 'Request :', self.path # traceback to sys.stdout
            >     url_tuple = urlparse.urlparse(self.path)
            >     url = url_tuple[2]
            >     domain = url_tuple[1]
            >     if domain.find('.google.') != -1: # bypass the cache for google domains
            >         req = urllib2.Request(self.path, None, txheaders)
            >         return urllib2.urlopen(req)


            Doesn't work, the browser keeps asking me to save the page.

            This one works =)
            def send_head(self):
                print 'Request :', self.path # traceback to sys.stdout
                url_tuple = urlparse.urlparse(self.path)
                url = url_tuple[2]
                domain = url_tuple[1]
                if domain.find('.google.') != -1: # bypass the cache for google domains
                    req = urllib2.Request(self.path, None, txheaders)
                    self.send_response(200)
                    self.send_header("Content-type", 'text/html')
                    self.end_headers()
                    return urllib2.urlopen(req)
                dotloc = url.rfind('.') + 1 # rest of the method continues as before
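
The fix works because the status line and the Content-type header now go out before the body. A way to see this without a network is sketched below in Python 3, where http.server replaces BaseHTTPServer. The class InMemoryHandler, the helper respond and the canned body are all hypothetical: the handler reads its request from bytes in memory instead of a socket, and the body stands in for the page that urllib2 would have fetched.

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO


class InMemoryHandler(BaseHTTPRequestHandler):
    """Handler fed from bytes instead of a socket (illustration only)."""

    body = b'<html>hello</html>'  # stand-in for the fetched page

    def __init__(self, raw_request):
        # Skip the socket-based constructor: read the request from memory
        # and collect the response in memory.
        self.rfile = BytesIO(raw_request)
        self.wfile = BytesIO()
        self.client_address = ('127.0.0.1', 0)
        self.handle_one_request()

    def log_message(self, format, *args):
        pass  # keep the demonstration quiet

    def do_GET(self):
        # Status line and headers first, then the body.
        self.send_response(200)
        self.send_header('Content-type', 'text/html')
        self.send_header('Content-Length', str(len(self.body)))
        self.end_headers()
        self.wfile.write(self.body)


def respond(raw_request):
    """Return the raw bytes the handler would send for raw_request."""
    return InMemoryHandler(raw_request).wfile.getvalue()


if __name__ == '__main__':
    reply = respond(b'GET http://www.google.com/ HTTP/1.0\r\n\r\n')
    print(reply.decode('latin-1'))
```

Dropping the three header calls from do_GET leaves a reply with no status line or content type, which fits the "save the page" prompt reported above.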




            • Fuzzyman

              #7
              Re: Python Google Server

              Of course - sorry. Thanks for the fix. Out of interest - why are you
              using this... just for curiosity, or is it helpful ?

              Regards,


              Fuzzy



              • Paul Rubin

                #8
                Re: Python Google Server

                fuzzyman@gmail.com writes:
                > (This is actually an 'inventive' short term measure to get round a
                > restrictive internet policy at work :-)

                If that means what I think, you're better off setting up a
                url-rewriting proxy server on some other machine, that uses SSL on the
                browser side. There's one written in perl at:



                Presumably you're surfing through some oppressive firewall, and the
                SSL going into the proxy prevents the firewall from logging all the
                destination URLs going past it (and the content too, for that matter).


                • Fuzzyman

                  #9
                  Re: Python Google Server

                  Note - there are a couple of *minor* changes to this. See the online
                  Python Cookbook, the thread on comp.lang.python or



                  • Fuzzyman

                    #10
                    Re: Python Google Server

                    The difficulty is 'on some other machine'... there's a fantastic python
                    CGI proxy called approx -


                    The trouble is the current policy is 'whitelist only'... so I need the
                    proxy installed on a server that is *on the whitelist*... which will
                    take a little time to arrange.

                    Best Regards,

                    Fuzzy



                    • vegetax

                      #11
                      Re: Python Google Server

                      Fuzzyman wrote:
                      > Of course - sorry. Thanks for the fix. Out of interest - why are you
                      > using this... just for curiosity, or is it helpful ?

                      Because it's fun to surf on the google cache, =)



                      • Fuzzyman

                        #12
                        Re: Python Google Server


                        vegetax wrote:
                        > Fuzzyman wrote:
                        >
                        > > Of course - sorry. Thanks for the fix. Out of interest - why are
                        > > you using this... just for curiosity, or is it helpful ?
                        >
                        > because is fun to surf on the google cache, =)

                        Ha - cool ! The bizarre thing is, that for me it's actually useful. I
                        doubt anyone else will be in the same situation though.

                        Best Regards,

                        Fuzzy



                        • Benji York

                          #13
                          Re: Python Google Server

                          Fuzzyman wrote:
                          > The trouble is the current policy is 'whitelist only'... so I need the
                          > proxy installed on a server that is *on the whitelist*... which will
                          > take a little time to arrange.

                          If you construct a noop translation (English to English for example)
                          Google becomes a (HTML only) proxy. Here's an example:


                          --
                          Benji York


                          • Fuzzyman

                            #14
                            Re: Python Google Server

                            Thanks Benji,

                            It returns the results using an ip address - not the google domain.
                            This means IPCop bans it :-(

                            Thanks for the suggestion though. In actual fact the googleCacheServer
                            works quite well.

                            Best Regards,

                            Fuzzy

