how would you...?

  • Sanoski

    how would you...?

    I'm pretty new to programming. I've just been studying a few weeks off
    and on. I know a little, and I'm learning as I go. Programming is so
    much fun! I really wish I had gotten into it years ago, but here's my
    question. I have a long-term project in mind, and I want to know if
    it's feasible and how difficult it will be.

    There's an XML feed for my school that some other class designed. It's
    just a simple idea that lists each class, its room number, and the
    student with the highest GPA. The feed is set up like this (each of
    the following lines would also be a link to more information about
    the class, etc.):

    Economics, Room 216, James Faker, 3.4
    Social Studies, Room 231, Brian Fictitious, 3.5

    etc, etc

    The student also has a picture reference that depicts his GPA; the
    picture is basically just a graph based on the number. I just want to
    write a program that uses the information on this feed.

    I want it to reach out to this XML feed, record each instance of the
    above format along with the picture reference of the highest-GPA
    student, download it locally, and then be able to use that information
    in various ways. I figured I'll start by counting each instance. For
    example, the above would be 2 instances.

    Eventually, I want it to be able to cross-reference data you've
    already downloaded, and be able to compare GPAs, etc. It would have a
    GUI and everything too, but I am trying to keep it simple right now,
    and just build onto it as I learn.

    So let's just say this: how do you grab information from the web, in
    this case a feed, and then use that in calculations? How would you
    implement such a project? Would you save the information into a text
    file, or would you use something else? Should I study up on SQLite?
    Maybe I should study classes. I'm just not sure. What would be the
    most effective technique?
  • Sanoski

    #2
    Re: how would you...?

    The reason I ask about text files is the need to save the data
    locally and store it in a way where backups can easily be made.
    Then if your computer crashes and you lose everything, but you have
    the data files it uses backed up, you can just download the program
    again, extract the backed-up data to a specific directory, and it
    works exactly the way it did before you lost it. I suppose a SQLite
    database might solve this, but I'm not sure. I'm just getting
    started, and I don't know too much about it yet.

    I'm also still not sure how to download and associate the pictures
    that each entry has for it. The main thing for me now is getting
    started. It needs to get information from the web. In this case, it's
    a simple XML feed. The one thing that seems like it would make this
    easier is that every post to the feed is very consistent. Each header
    starts with the letter A, which stands for Alpike Tech, followed by
    the name of the class, the room number, the leading student, and his
    GPA. All that is one line of text, but it's also a link to more
    information. For example:

    A Economics, 312, John Carbroil, 4.0

    That's one whole post to the feed. Like I say, it's very simple and
    consistent, which should make this easier.

    Eventually I want it to follow that link and grab information from
    there too, but I'll worry about that later. Technically, if I figure
    this first part out, that problem should take care of itself.





    On May 17, 1:08 am, Mensanator <mensana...@aol.com> wrote:
    > On May 16, 11:43 pm, Sanoski <Joshuajr...@gmail.com> wrote:
    > <snip>
    > So let's just say this. How do you grab information from the web,

    Depends on the web page.

    > in this case a feed,

    Haven't tried that, just a simple CGI.

    > and then use that in calculations?

    The key is some type of structure, be it database records,
    a list of lists, or whatever: something that you can iterate
    through, sort, find the max element of, etc.
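
    For instance, a minimal sketch of that idea for the two feed lines
    in your post (assuming each entry really is one comma-separated
    line; adjust the parsing to the feed's real format):

    feed_lines = [
        'Economics, Room 216, James Faker, 3.4',
        'Social Studies, Room 231, Brian Fictitious, 3.5',
    ]

    records = []
    for line in feed_lines:
      subject, room, student, gpa = [s.strip() for s in line.split(',')]
      records.append([subject, room, student, float(gpa)])

    print len(records)                      # 2 instances
    print max(records, key=lambda r: r[3])  # entry with the highest GPA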
    > How would you
    > implement such a project?

    The example below uses BeautifulSoup. I'm posting it not
    because it matches your problem, but to give you an idea of
    the techniques involved.

    > Would you save the information into a text file?

    Possibly, but generally no. Text files aren't very useful
    except as a data exchange medium.

    > Or would you use something else?

    Your application lends itself to a database approach.
    Note that in my example the database part of the code is disabled;
    not everyone has MS-Access on Windows.

    > Should I study up on SQLite?

    Yes. The MS-Access code I have can be easily changed to SQLite.
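
    If you do go with SQLite, the sqlite3 module that ships with
    Python 2.5 and later keeps the whole database in one ordinary
    file, which also answers your backup question. A minimal sketch
    (the filename and table layout are made up; assumes the records
    list from the sketch above):

    import sqlite3

    con = sqlite3.connect('school.db')   # a single file, easy to back up
    cur = con.cursor()
    cur.execute("""CREATE TABLE IF NOT EXISTS classes
                   (subject TEXT, room TEXT, student TEXT, gpa REAL)""")
    cur.executemany("INSERT INTO classes VALUES (?, ?, ?, ?)", records)
    con.commit()
    print cur.execute("SELECT subject, MAX(gpa) FROM classes").fetchone()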
    > Maybe I should study classes.

    I don't know, but I've always gotten along without them.

    > I'm just not sure. What would be the most effective technique?

    Don't know that either, as I've only done it once, as follows:

    ##  I was looking in my database of movie grosses I regularly copy
    ##  from the Internet Movie Database and noticed I was _only_ 120
    ##  weeks behind in my updates.
    ##
    ##  Ouch.
    ##
    ##  Copying a web page, pasting into a text file, running a perl
    ##  script to convert it into a csv file and manually importing it
    ##  into Access isn't so bad when you only have a couple to do at
    ##  a time. Still, it's a labor intensive process and 120 isn't
    ##  anything to look forward to.
    ##
    ##  But I abandoned perl years ago when I took up Python, so I
    ##  can use Python to completely automate the process now.
    ##
    ##  Just have to figure out how.
    ##
    ##  There's 3 main tasks: capture the web page, parse the web page
    ##  to extract the data and insert the data into the database.
    ##
    ##  But I only know how to do the last step, using the odbc tools
    ##  from win32,

    ####import dbi
    ####import odbc
    import re

    ##  so I snoop around comp.lang.python to pick up some
    ##  hints and keywords on how to do the other two tasks.
    ##
    ##  Documentation on urllib2 was a bit vague, but got the web page
    ##  after only a couple mis-steps.

    import urllib2

    ##  Unfortunately, HTMLParser remained beyond my grasp (is it
    ##  my imagination or is the quality of the examples in the
    ##  documentation inversely proportional to the subject
    ##  difficulty?)
    ##
    ##  Luckily, my bag of hints had a reference to Beautiful Soup,
    ##  whose web site proclaims:
    ##      Beautiful Soup is a Python HTML/XML parser
    ##      designed for quick turnaround projects like
    ##      screen-scraping.
    ##  Looks like just what I need, maybe I can figure it out after all.

    from BeautifulSoup import BeautifulSoup

    target_dates = [['4','6','2008', 'April']]

    ####con = odbc.odbc("IMDB")  # connect to MS-Access database
    ####cursor = con.cursor()

    for d in target_dates:
      #
      # build url (with CGI parameters) from list of dates needing updating
      #
      the_year = d[2]
      the_date = '/'.join([d[0],d[1],d[2]])
      print '%10s scraping IMDB:' % (the_date),
      the_url = ''.join([r'http://www.imdb.com/BusinessThisDay?day=',
                         d[1], '&month=', d[3]])
      req = urllib2.Request(url=the_url)
      f = urllib2.urlopen(req)
      www = f.read()
      #
      # ok, page captured. now make a BeautifulSoup object from it
      #
      soup = BeautifulSoup(www)
      #
      # that was easy, much more so than HTMLParser
      #
      # now, _all_ I have to do is figure out how to parse it
      #
      # ouch again. this is a lot harder than it looks in the
      # documentation. I need to get the data from cells of a
      # table nested inside another table and that's hard to
      # extrapolate from the examples showing how to find all
      # the comments on a web page.
      #
      # but this looks promising. if I grab all the table rows
      # (tr tags), each complete nested table is inside a cell
      # of the outer table (whose table tags are lost, but aren't
      # needed and whose absence makes extracting the nested
      # tables easier (when you do it the stupid way, but hey,
      # it works, so I'm sticking with it))
      #
      tr = soup.tr                          # table rows
      tr.extract()
      #
      # now, I only want the third nested table. how do I get it?
      # can't seem to get past the first one, should I be using
      # NextSibling or something? <scratches head...>
      #
      # but wait...I don't need the first two tables, so I can
      # simply extract and discard them. and since .extract()
      # CUTS the tables, after two extractions the table I want
      # IS the first one.
      #
      the_table = tr.find('table')          # discard
      the_table.extract()
      the_table = tr.find('table')          # discard
      the_table.extract()
      the_table = tr.find('table')          # weekly gross
      the_table.extract()
      #
      # of course, the data doesn't start in the first row,
      # there's formatting, header rows, etc. looks like it starts
      # in tr number [3]
      #
      ##  >>> the_table.contents[3].td
      ##  <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a></td>
      #
      # and since tags always imply the first one, the above
      # is equivalent to
      #
      ##  >>> the_table.contents[3].contents[0]
      ##  <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a></td>
      #
      # and since the title is the first of three cells, the
      # reporting year is
      #
      ##  >>> the_table.contents[3].contents[1]
      ##  <td><a href="/Sections/Years/2001">2001</a></td>
      #
      # finally, the 3rd cell must contain the gross
      #
      ##  >>> the_table.contents[3].contents[2]
      ##  <td align="RIGHT">259,674,120</td>
      #
      # but the contents of the first two cells are anchor tags.
      # to get the actual title string, I need the contents of the
      # contents. but that's not exactly what I want either,
      # I don't want a list, I need a string. and the string isn't
      # always in the same place in the list
      #
      # summarizing, what I need is
      #
      ##  print the_table.contents[3].contents[0].contents[0].contents,
      ##  print the_table.contents[3].contents[1].contents[1].contents,
      ##  print the_table.contents[3].contents[2].contents
      #
      # and that almost works, just a couple more tweaks and I can
      # shove it into the database

      parsed = []

      for rec in the_table.contents[3:]:
        the_rec_type = type(rec)       # some recs are NavigableStrings, skip
        if str(the_rec_type) == "<type 'instance'>":
          #
          # ok, got a real data row
          #
          TITLE_DATE = rec.contents[0].contents[0].contents  # a list inside a tuple
          #
          # and that means we still have to index the contents
          # of the contents of the contents of the contents by
          # adding [0][0] to TITLE_DATE
          #
          YEAR = rec.contents[1].contents[1].contents        # ditto
          #
          # this won't go into the database, just used as a filter to grab
          # the records associated with the posting date and discard
          # the others (which should already be in the database)
          #
          GROSS = rec.contents[2].contents                   # just a list
          #
          # one other minor glitch: the film date is part of the title
          # (which is of no use in the database), so it has to be pulled
          # out and put in a separate field
          #
    #      temp_title = re.search('(.*?)( \()([0-9]{4}.*)(\))(.*)', str(TITLE_DATE[0][0]))
          temp_title = re.search('(.*?)( \()([0-9]{4}.*)(\))(.*)', str(TITLE_DATE))
          #
          # which works 99% of the time. unfortunately, the IMDB's
          # consistency is somewhat dubious. the date is _supposed_
          # to be at the end of the string, but sometimes it's not.
          # so, usually, there are only 5 groups, but you have to
          # allow for the fact that there may be 6
          #
          try:
            the_title = temp_title.group(1) + temp_title.group(5)
          except:
            the_title = temp_title.group(1)
          the_gross = str(GROSS[0])
          #
          # and for some unexplained reason, dates will occasionally
          # be 2001/I instead of 2001, so we want to discard the trailing
          # crap, if any
          #
          the_film_year = temp_title.group(3)[:4]
    #      if str(YEAR[0][0])==the_year:
          if str(YEAR[0])==the_year:
            parsed.append([the_date,the_title,the_film_year,the_gross])

      print '%3d records found ' % (len(parsed))
      #
      # wow, now just have to insert all the update...

    • Mensanator

      #3
      Re: how would you...?

      On May 17, 4:02 am, Sanoski <Joshuajr...@gmail.com> wrote:
      > The reason I ask about text files is the need to save the data
      > locally, and have it stored in a way where backups can easily
      > be made.

      Sure, you can always do that if you want. But if your target
      is SQLite or MS-Access, those are files also, so they can be
      backed up as easily as text files.
      > Then if your computer crashes and you lose everything, but
      > you have the data files it uses backed up, you can just
      > download the program, extract the backed-up data to a
      > specific directory, and then it works exactly the way it
      > did before you lost it. I suppose a SQLite database might
      > solve this, but I'm not sure.

      It will. Remember, once in a database, you have value-added
      features like filtering, sorting, etc. that you would have
      to do yourself if you simply read in text files.
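
      For example, once the rows are in a table, the filtering and
      sorting are one-liners. A sketch, reusing the hypothetical
      classes table from the earlier sqlite3 example:

      cur.execute("""SELECT student, gpa FROM classes
                     WHERE gpa >= 3.5 ORDER BY gpa DESC""")
      for row in cur.fetchall():
        print row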
      > I'm just getting started, and I
      > don't know too much about it yet.

      Trust me, a database is the way to go.
      My preference is MS-Access, because I need it for work.
      It is a great tool for learning databases because its
      visual interface can make you productive BEFORE you learn
      SQL.
      > I'm also still not sure how to download and associate the pictures
      > that each entry has for it.

      See example at end of post.
      > The main thing for me now is getting
      > started. It needs to get information from the web. In this case,
      > it's a simple XML feed.

      BeautifulSoup also has an XML parser. Go to their
      web page and read the documentation.
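
      A rough sketch of the XML flavor, BeautifulStoneSoup (the tag
      and attribute names here are invented, since I don't know your
      feed's actual structure):

      from BeautifulSoup import BeautifulStoneSoup

      xml = '''<feed>
      <entry href="http://example.edu/econ">A Economics, 312, John Carbroil, 4.0</entry>
      </feed>'''
      soup = BeautifulStoneSoup(xml)
      for entry in soup.findAll('entry'):
        print entry['href'], entry.string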
      > The one thing that seems that would
      > make it easier is every post to the feed is very consistent.
      > Each header starts with the letter A, which stands for Alpike
      > Tech, followed by the name of the class, the room number, the
      > leading student, and his GPA. All that is one line of text.
      > But it's also a link to more information. For example:
      >
      > A Economics, 312, John Carbroil, 4.0
      >
      > That's one whole post to the feed. Like I say, it's very
      > simple and consistent. Which should make this easier.

      That's what you want for parsing: how to separate
      a composite set of data. Simple can sometimes be
      done with split(), complex with regular expressions.
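
      For a line as regular as yours, split() is probably enough.
      A sketch (assuming commas never appear inside a field):

      line = 'A Economics, 312, John Carbroil, 4.0'
      parts = [p.strip() for p in line.split(',')]
      school, subject = parts[0].split(' ', 1)   # 'A', 'Economics'
      room, student, gpa = parts[1], parts[2], float(parts[3])
      print school, subject, room, student, gpa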
      > Eventually I want it to follow that link and grab information
      > from there too, but I'll worry about that later. Technically,
      > if I figure this first part out, that problem should take
      > care of itself.

      A sample picture scraper:

      from BeautifulSoup import BeautifulSoup
      import urllib2
      import urllib

      #
      # start by scraping the web page
      #
      the_url="http://members.aol.com/mensanator/OHIO/TheCobs.htm"
      req = urllib2.Request(url=the_url)
      f = urllib2.urlopen(req)
      www = f.read()
      soup = BeautifulSoup(www)
      print soup.prettify()

      #
      # a simple page with pictures
      #
      ##<html>
      ## <head>
      ## <title>
      ## Ohio - The Cobs!
      ## </title>
      ## </head>
      ## <body>
      ## <h1>
      ## Ohio Vacation Pictures - The Cobs!
      ## </h1>
      ## <hr />
      ## <img src="AUT_2784.J PG" />
      ## <br />
      ## WTF?
      ## <p>
      ## <img src="AUT_2764.J PG" />
      ## <br />
      ## This is surreal.
      ## </p>
      ## <p>
      ## <img src="AUT_2765.J PG" />
      ## <br />
      ## Six foot tall corn cobs made of concrete.
      ## </p>
      ## <p>
      ## <img src="AUT_2766.J PG" />
      ## <br />
      ## 109 of them, laid out like a modern Stonehenge.
      ## </p>
      ## <p>
      ## <img src="AUT_2769.J PG" />
      ## <br />
      ## With it's own Druid worshippers.
      ## </p>
      ## <p>
      ## <img src="AUT_2781.J PG" />
      ## <br />
      ## Cue the
      ## <i>
      ## Also Sprach Zarathustra
      ## </i>
      ## soundtrack.
      ## </p>
      ## <p>
      ## <img src="100_0887.J PG" />
      ## <br />
      ## Air & Space Museums are a dime a dozen.
      ## <br />
      ## But there's only
      ## <b>
      ## one
      ## </b>
      ## Cobs!
      ## </p>
      ## <p>
      ## </p>
      ## </body>
      ##</html>

      #
      # parse the page to find all the pictures (image tags)
      #
      the_pics = soup.findAll('img')

      for i in the_pics:
        print i

      ##<img src="AUT_2784.J PG" />
      ##<img src="AUT_2764.J PG" />
      ##<img src="AUT_2765.J PG" />
      ##<img src="AUT_2766.J PG" />
      ##<img src="AUT_2769.J PG" />
      ##<img src="AUT_2781.J PG" />
      ##<img src="100_0887.J PG" />

      #
      # the pictures have no path, so they must be in the
      # same directory as the web page
      #
      the_jpg_path = "http://members.aol.com/mensanator/OHIO/"

      #
      # now with urllib, copy the picture files to the local
      # hard drive renaming with sequence id at the same time
      #
      for i,j in enumerate(the_pics):
        p = the_jpg_path + j['src']
        q = 'C:\\scrape\\' + 'pic' + str(i).zfill(4) + '.jpg'
        urllib.urlretrieve(p,q)

      #
      # and here's the captured files
      #
      ## C:\>dir scrape
      ## Volume in drive C has no label.
      ## Volume Serial Number is D019-C60D
      ##
      ## Directory of C:\scrape
      ##
      ## 05/17/2008 07:06 PM <DIR> .
      ## 05/17/2008 07:06 PM <DIR> ..
      ## 05/17/2008 07:05 PM 69,877 pic0000.jpg
      ## 05/17/2008 07:05 PM 71,776 pic0001.jpg
      ## 05/17/2008 07:05 PM 70,958 pic0002.jpg
      ## 05/17/2008 07:05 PM 69,261 pic0003.jpg
      ## 05/17/2008 07:05 PM 70,653 pic0004.jpg
      ## 05/17/2008 07:05 PM 70,564 pic0005.jpg
      ## 05/17/2008 07:05 PM 113,356 pic0006.jpg
      ## 7 File(s) 536,445 bytes
      ## 2 Dir(s) 27,823,570,944 bytes free
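
      To associate each picture with its feed entry, one option is to
      store the local filename in the same row as the entry it belongs
      to. A sketch, reusing the hypothetical classes table from earlier
      and assuming picture i goes with entry i:

      cur.execute("ALTER TABLE classes ADD COLUMN pic TEXT")
      for i in range(len(the_pics)):
        local_name = 'C:\\scrape\\pic' + str(i).zfill(4) + '.jpg'
        cur.execute("UPDATE classes SET pic = ? WHERE rowid = ?",
                    (local_name, i + 1))
      con.commit()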


      • inhahe

        #4
        Re: how would you...?


        "Sanoski" <Joshuajruss@gm ail.comwrote in message
        news:1449c36e-10ce-42f4-bded-99d53a0a2569@a1 g2000hsb.google groups.com...
        > <snip>

        People usually say BeautifulSoup for getting stuff from the web. I
        think I tried it once and had some problem and gave up. But since
        this is XML, I think all you'd need is xml.dom.minidom or
        xml.etree.ElementTree; I'm not sure which is easier. See
        doc\python.chm in the Python directory to study up on those. To
        grab the webpage to begin with, you'd use urllib2. That takes
        around one line of code.

        I wouldn't save it to a text file, because text files aren't that
        good for random access, or for storing images. I'd save it in a
        database. There are other database modules than SQLite, but SQLite
        is a good one for simple projects like this, where you're just
        going to be running one instance of the program at a time. SQLite
        is fast, and it's the only one that doesn't require a separate
        database engine to be installed and running.

        Classes are just a way of organizing code (and a little more, but
        they don't have a lot to do with saving stuff to a file).

        I'm not clear on whether the GPA is available as text and an
        image, or just an image. If it's just available as an image,
        you're going to want to use PIL (the Python Imaging Library). Btw,
        use float() to convert a textual GPA to a number.

        You'll have to learn some basics of the SQL language (that applies
        to any database). Or maybe not with SQLObject or SQLAlchemy, but I
        don't know how easy those are to learn. Or, if you don't want to
        learn SQL, you could use a text file with fixed-length fields and
        perhaps references to the individual filenames that store the
        pictures, and I could tell you how to do that. But a database is a
        lot more flexible, and you wouldn't need to learn much SQL for the
        same purposes.

        Btw, I used SQLite version 2 before, and it didn't allow me to
        return query results as dictionaries (i.e., indexable by field
        name), just lists of values, except by using some strange code I
        found somewhere. But a list is also easy to use. And if version 3
        doesn't do it either and you want that code, I have it.
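
        Here's a rough sketch of the ElementTree route, by the way (the
        URL and tag names are invented; the real feed will differ):

        import urllib2
        from xml.etree import ElementTree

        feed = urllib2.urlopen('http://example.edu/gpa-feed.xml').read()
        root = ElementTree.fromstring(feed)
        for entry in root.findall('entry'):
            # each entry might carry its link as an attribute and the
            # class/room/student/GPA line as its text
            print entry.get('href'), entry.text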



        • Martin Sand Christensen

          #5
          Re: how would you...?

          >>>>"inhahe" == inhahe <inhahe@gmail.c omwrites:
          inhaheBtw, use float() to convert a textual GPA to a number.

          It would be much better to use Decimal() instead of float(). A
          GPA of 3.6000000000000001 probably doesn't make much sense; this
          problem doesn't arise when using the Decimal type.
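
          A quick illustration (the exact float repr depends on your
          Python build):

          from decimal import Decimal

          print repr(float('3.6'))  # e.g. '3.6000000000000001'
          print Decimal('3.6')      # 3.6, stored exactly as written
          print Decimal('3.6') + Decimal('0.1')  # 3.7, no rounding noise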

          Martin
