Using mechanize to do website authentication

  • trihaitran
    New Member
    • Feb 2008
    • 7

    I am trying to write a web scraper and am having trouble accessing pages that require authentication. I am trying to use the mechanize library but cannot get it to work. The site I am trying to log in to is http://www.princetonreview.com/Login3.aspx?uidbadge=

    user: bugmenot2008@yahoo.com
    pass: letmeinalready

    Previously I did something similar for another site, schoolfinder.com. Here is my code for that:

    Code:
    import cookielib
    import sys  # needed for the sys.exit() call in the error handler below
    import urllib
    import urllib2
    
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    resp = opener.open('http://schoolfinder.com') # save a cookie
    
    theurl = 'http://schoolfinder.com/login/login.asp' # an example url that sets a cookie, try different urls here and see the cookie collection you can make !
    body={'usr':'greenman','pwd':'greenman'}
    txdata = urllib.urlencode(body) # if we were making a POST type request, we could encode a dictionary of values here - using urllib.urlencode
    txheaders =  {'User-agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'} # fake a user agent, some websites (like google) don't like automated exploration
    
    
    try:
        req = urllib2.Request(theurl, txdata, txheaders) # create a request object
        handle = opener.open(req) # and open it to return a handle on the url
        HTMLSource = handle.read()
        f = file('test.html', 'w')
        f.write(HTMLSource)
        f.close()
    
    except IOError, e:
        print 'We failed to open "%s".' % theurl
        if hasattr(e, 'code'):
            print 'We failed with error code - %s.' % e.code
        elif hasattr(e, 'reason'):
            print "The error object has the following 'reason' attribute :", e.reason
            print "This usually means the server doesn't exist, is down, or we don't have an internet connection."
            sys.exit()
    
    else:
        print 'Here are the headers of the page :'
        print handle.info() # handle.read() returns the page, handle.geturl() returns the true url of the page fetched (in case urlopen has followed any redirects, which it sometimes does)
    This method does not work on the Princeton Review site, however. Interestingly, I cannot even get mechanize to access the schoolfinder.com site. Here is the code I am using:

    Code:
    #!/usr/bin/env python
    # -*- coding: UTF-8 -*-
    import mechanize
    
    theurl = 'http://www.princetonreview.com/Login3.aspx?uidbadge='
    mech = mechanize.Browser()
    mech.open(theurl)
    
    mech.select_form(nr=0)
    mech["ctl00$MasterMainBodyContent$txtUsername"] = "bugmenot2008@yahoo.com"
    mech["ctl00$MasterMainBodyContent$txtPassword"] = "letmeinalready"
    results = mech.submit().read()
    
    f = file('test.html', 'w')
    f.write(results) # write to a test file
    f.close()
    This code is so short, and I just cannot figure out what I am doing wrong. What is incorrect about it? Thank you in advance.
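
One thing that may be worth checking, offered as a sketch rather than a diagnosis: the `.aspx` URL and the `ctl00$...` field names suggest an ASP.NET WebForms page, and such pages normally carry hidden inputs such as `__VIEWSTATE` that have to be posted back together with the username and password. A body built by hand from only the two credential fields, as in the schoolfinder.com script, would leave them out. The snippet below assumes that is what happens here, which has not been verified against the real page. It is written for Python 3's standard library rather than the Python 2 used above, and `SAMPLE_PAGE`, its hidden values, `FormFieldCollector`, `build_login_body`, and the credentials are all hypothetical stand-ins.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs from every named <input> tag in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <input ... /> tags.
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if name:
            # Inputs without a value attribute are sent as empty strings.
            self.fields[name] = attr.get("value") or ""


def build_login_body(page_html, overrides):
    """Return a URL-encoded POST body: the page's own inputs plus overrides."""
    collector = FormFieldCollector()
    collector.feed(page_html)
    fields = dict(collector.fields)
    fields.update(overrides)  # fill in the credentials over the empty inputs
    return urlencode(fields)


# Hypothetical stand-in for the HTML of the login page (values are made up).
SAMPLE_PAGE = """
<form method="post" action="Login3.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTVALIDATION" value="abc123" />
  <input type="text" name="ctl00$MasterMainBodyContent$txtUsername" />
  <input type="password" name="ctl00$MasterMainBodyContent$txtPassword" />
</form>
"""

body = build_login_body(
    SAMPLE_PAGE,
    {
        "ctl00$MasterMainBodyContent$txtUsername": "user@example.com",  # placeholder
        "ctl00$MasterMainBodyContent$txtPassword": "secret",  # placeholder
    },
)
print(body)
```

Against the real site, `page_html` would be the HTML returned by the first GET through the cookie-aware opener, and the encoded body would be sent in the POST request. mechanize normally includes a form's hidden fields by itself when the selected form is submitted, so this applies mainly to the hand-built request.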