I'm importing a script that I made and it literally takes 10+ minutes to run or import into PythonWin.
I've put the script at the bottom, but I'm also having a problem with it.
What I'm trying to do:
1.) Go to the SEC's website and look for recently filed 10-Qs (that's a financial report)
2.) Collect all the links for these new 10-Qs
3.) Add each link to the end of what I call pageroot (which is www.sec.gov)
4.) On each newly formed full web address, go one page at a time and look for a piece of the source code that is "<td nowrap="nowrap"><a href=", which will lead me to the next linked address I need (to navigate to the actual 10-Q, it's 2 or 3 links away from the original search)
5.) Also write these second linked addresses to a file, so that I can check that it is working the intended way
6.) Clean up the linked addresses with a bunch of regexes (a rough sketch of steps 1-3 is just below)
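To be concrete, here's roughly what I mean by steps 1-3. This is just a minimal sketch, assuming the EDGAR search URL and the bgcolor="#E6E6E6" table-cell marker from my script below are still what the page serves; the capture-group regex is only an illustration and isn't what my script currently does.

Code:
import urllib
import re

# Step 1: the EDGAR "current events" search for recently filed 10-Qs.
search_url = ('http://www.sec.gov/cgi-bin/browse-edgar?company=&CIK=&type=10-Q'
              '&owner=include&count=100&action=getcurrent')
pageroot = 'http://www.sec.gov'

filing_pages = []
for line in urllib.urlopen(search_url):
    # Step 2: keep only the lines that carry a link to a newly filed 10-Q.
    if '<td bgcolor="#E6E6E6" valign="top" align="left"><a href="' in line:
        m = re.search(r'href="(/[^"]+)"', line)  # grab just the href value
        if m:
            # Step 3: glue the relative link onto the site root.
            filing_pages.append(pageroot + m.group(1))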
Now, once I get that working I'll add more, but my problem is this...
it seems to be doing this (purely as an example):
"Google, Apple, eBay, and IBM filed 10-Qs; now let's collect a history of 10-Qs filed for just Google"
when it should instead be doing this:
"Google, Apple, eBay, and IBM filed 10-Qs; now let's collect the link for each of them so that I can redirect my scrape to the actual 10-Q"
If anyone could help I'd be very appreciative.
Here's the code.
Code:
import urllib
import re

# Step 1: the EDGAR "current events" search page listing recently filed 10-Qs.
page = 'http://www.sec.gov/cgi-bin/browse-edgar?company=&CIK=&type=10-Q&owner=include&count=100&action=getcurrent'

# Step 2: keep every line of the search results that contains a filing link.
raw = []
for line in urllib.urlopen(page):
    if '<td bgcolor="#E6E6E6" valign="top" align="left"><a href="' in line:
        raw.append(line)
codestring = ' '.join(raw)

# Pull the relative links out of the collected lines.
pattern = re.compile(r'/\S+')
results = re.findall(pattern, codestring)

# Step 3: prepend the site root to each relative link.
pageroot = 'http://www.sec.gov'
count = len(results)

# Steps 4 and 5: visit each filing page, keep the rows that point at the
# documents, and also log them to a file so I can check the intermediate output.
fn = open('c:/Python25/tmp.txt', 'w')
line10q = []
number = 0
while number < count:
    newpage = pageroot + results[number]
    for line in urllib.urlopen(newpage):
        if '<td nowrap="nowrap"><a href="' in line:
            line10q.append(line)
            fn.write(line)
    number += 1
fn.close()

# Step 6: clean up the collected lines with a series of regexes.
line10qstring = ' '.join(line10q)
pattern2 = re.compile(r'="/\S+">')
results10q = re.findall(pattern2, line10qstring)
newstring = ' '.join(results10q)

pattern3 = re.compile(r'/\S+\.htm')
linkresults = re.findall(pattern3, newstring)

pattern4 = re.compile(r'/\S+\.[a-z]{3}"')
linktest2 = ' '.join(linkresults)
link2 = re.findall(pattern4, linktest2)
link2string = ' '.join(link2)

pattern5 = re.compile(r'/\S+\.htm')
link4 = re.findall(pattern5, link2string)
link4string = ' '.join(link4)
linkNumber = len(link4)
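One more thing I noticed while writing this up: everything in the script runs at module level, so just importing it into PythonWin executes every page fetch, which is probably why the import alone takes 10+ minutes. A sketch of the usual guard (assuming I move the work above into a function) would be:

Code:
def main():
    pass  # the scraping code above would move in here

if __name__ == '__main__':
    main()  # runs when executed as a script, not when merely imported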