slow import

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • Patrick C
    New Member
    • Apr 2007
    • 54

    slow import

    I'm importing a script that I made and it's literally take 10+mins to to run or import into PythonWin.

    I've put the script at the bottom. But i'm also having a problem with it.
    What i'm trying to do:
    1.) Go the the SEC's website and look for recently filed 10-q's (that's a financial report)
    2.) collect all the links for these new 10-q's
    3.) add the link to the end of what i call pageroot (which is www.sec.gov)
    4.) on the newly formed full web address go one page at a time and look for a piece in the source code that is " <td nowrap="nowrap" ><a href= " which will lead me to the next linked addres i need. (to navigate to the actual 10-q its 2 or 3 links away from the original search)
    5.) also write these 2nd linked addresses to a file, so that i can check to make sure that it is working the intended way
    6.) clean up the linked addresses with a bunch of regex

    now once I get that working i'll add more, but my problem is this...
    it seems to be reading to do this: (purely for example)
    "google, apple, ebay, and IBM filed 10-q's, now lets collect a history of 10-qs filed for just google"
    and again it should be
    "google, apple, ebay and IBM filed 10-q's, now lets collect the link for each of them so that I can redirect my scrape to the actual 10-q"

    if anyone could help i'd be very appreciative.

    here's the code.
    Code:
    import urllib
    import re
    page = 'http://www.sec.gov/cgi-bin/browse-edgar?company=&CIK=&type=10-Q&owner=include&count=100&action=getcurrent'
    raw = []
    for line in urllib.urlopen(page):
        if '<td bgcolor="#E6E6E6" valign="top" align="left"><a href="' in line:
            raw.append(line)
    
    codestring = ' '.join(raw)
    pattern = re.compile('/\S+') 
    results = re.findall(pattern, codestring)
    
    pageroot = 'http://www.sec.gov' 
    count= len(results) 
    
    fn = open("c://Python25/tmp.txt", 'w')
    
    line10q = []
    number = 0
    while number < count:
        newpage = pageroot + results[number]
        for line in urllib.urlopen(newpage):
            if '<td nowrap="nowrap"><a href="' in line:
                line10q.append(line)
    	    fn.write(line)
        number += 1
    
    fn.close()
    
    line10qstring = ' '.join(line10q)
    pattern2 = re.compile('="/\S+">')
    results10q = re.findall(pattern, line10qstring) 
    
    newstring = ' '.join(results10q)
    pattern3 = re.compile('/\S+.htm')
    linkresults = re.findall(pattern3, newstring) 
    
    pattern4 = re.compile('/\S+.[a-z]{3}"')
    linktest2 = ' '.join(linkresults)
    link2 = re.findall(pattern4, linktest2) 
    
    link2string = ' '.join(link2)
    pattern5 = re.compile('/\S+.htm')
    link4 = re.findall(pattern5, link2string) 
    link4string = ' '.join(link4)
    
    linkNumber = len(link4)
  • William Manley
    New Member
    • Mar 2007
    • 56

    #2
    It's because your code is executed everytime it is imported. enclosing it in a function would fix the problem.

    so change

    [CODE=python]
    import urllib
    import re
    page = 'http://www.sec.gov/cgi-bin/browse-edgar?company=& CIK=&type=10-Q&owner=include &count=100&acti on=getcurrent'
    raw = []
    for line in urllib.urlopen( page):
    if '<td bgcolor="#E6E6E 6" valign="top" align="left"><a href="' in line:
    raw.append(line )

    codestring = ' '.join(raw)
    pattern = re.compile('/\S+')
    results = re.findall(patt ern, codestring)

    pageroot = 'http://www.sec.gov'
    count= len(results)

    fn = open("c://Python25/tmp.txt", 'w')

    line10q = []
    number = 0
    while number < count:
    newpage = pageroot + results[number]
    for line in urllib.urlopen( newpage):
    if '<td nowrap="nowrap" ><a href="' in line:
    line10q.append( line)
    fn.write(line)
    number += 1

    fn.close()

    line10qstring = ' '.join(line10q)
    pattern2 = re.compile('="/\S+">')
    results10q = re.findall(patt ern, line10qstring)

    newstring = ' '.join(results1 0q)
    pattern3 = re.compile('/\S+.htm')
    linkresults = re.findall(patt ern3, newstring)

    pattern4 = re.compile('/\S+.[a-z]{3}"')
    linktest2 = ' '.join(linkresu lts)
    link2 = re.findall(patt ern4, linktest2)

    link2string = ' '.join(link2)
    pattern5 = re.compile('/\S+.htm')
    link4 = re.findall(patt ern5, link2string)
    link4string = ' '.join(link4)

    linkNumber = len(link4)

    [/CODE]

    to

    [CODE=python]
    import urlib
    import re

    def myfunc():
    page = 'http://www.sec.gov/cgi-bin/browse-edgar?company=& CIK=&type=10-Q&owner=include &count=100&acti on=getcurrent'
    # rest of code....
    [/CODE]

    That way you just do:
    [CODE=python]
    import myscript
    myscript.myfunc ()
    [/code]
    and your done!
    Last edited by William Manley; Sep 1 '07, 01:16 AM. Reason: adding a few words to make it clearer

    Comment

    Working...