I am trying to write a web scraper and am having trouble accessing pages that require authentication. I am attempting to utilise the mechanize library, but am having difficulties. The site I am trying to login is http://www.princetonre view.com/Login3.aspx?uid badge=
user: bugmenot2008@ya hoo.com
pass: letmeinalready
Previously I did something similar to another site: schoolfinder.co m. Here is my code for that:
This method does not work on the Princeton Review site however. Interestingly I cannot even get mechanize to access the schoolfinder.co m site. Here is the code I am using:
This code is so short and I just cannot figure out what I am doing wrong. What is incorrect about this? Thank you in advance.
user: bugmenot2008@ya hoo.com
pass: letmeinalready
Previously I did something similar to another site: schoolfinder.co m. Here is my code for that:
Code:
import cookielib
import urllib
import urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
resp = opener.open('http://schoolfinder.com') # save a cookie
theurl = 'http://schoolfinder.com/login/login.asp' # an example url that sets a cookie, try different urls here and see the cookie collection you can make !
body={'usr':'greenman','pwd':'greenman'}
txdata = urllib.urlencode(body) # if we were making a POST type request, we could encode a dictionary of values here - using urllib.urlencode
txheaders = {'User-agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'} # fake a user agent, some websites (like google) don't like automated exploration
try:
req = urllib2.Request(theurl, txdata, txheaders) # create a request object
handle = opener.open(req) # and open it to return a handle on the url
HTMLSource = handle.read()
f = file('test.html', 'w')
f.write(HTMLSource)
f.close()
except IOError, e:
print 'We failed to open "%s".' % theurl
if hasattr(e, 'code'):
print 'We failed with error code - %s.' % e.code
elif hasattr(e, 'reason'):
print "The error object has the following 'reason' attribute :", e.reason
print "This usually means the server doesn't exist, is down, or we don't have an internet connection."
sys.exit()
else:
print 'Here are the headers of the page :'
print handle.info() # handle.read() returns the page, handle.geturl() returns the true url of the page fetched (in case urlopen has followed any redirects, which it sometimes does)
Code:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import mechanize
theurl = 'http://www.princetonreview.com/Login3.aspx?uidbadge='
mech = mechanize.Browser()
mech.open(theurl)
mech.select_form(nr=0)
mech["ctl00$MasterMainBodyContent$txtUsername"] = "bugmenot2008@yahoo.com"
mech["ctl00$MasterMainBodyContent$txtPassword"] = "letmeinalready"
results = mech.submit().read()
f = file('test.html', 'w')
f.write(results) # write to a test file
f.close()