how would you...?

  • Sanoski

    how would you...?

    I'm pretty new to programming. I've just been studying a few weeks off
    and on. I know a little, and I'm learning as I go. Programming is so
    much fun! I really wish I had gotten into it years ago, but here's my
    question. I have a long-term project in mind, and I want to know if
    it's feasible and how difficult it will be.

    There's an XML feed for my school that some other class designed. It's
    just a simple idea that lists each class, its room number, and the
    student with the highest GPA. The feed is set up like this (each of
    the following lines would also be a link to more information about
    the class, etc.):

    Economics, Room 216, James Faker, 3.4
    Social Studies, Room 231, Brian Fictitious, 3.5

    etc, etc

    The student also has a picture reference that depicts his GPA; the
    picture is basically just a graph based on the number. I just want to
    write a program that uses the information on this feed.

    I want it to reach out to this XML feed, record each instance of the
    above format along with the picture reference of the highest-GPA
    student, download it locally, and then be able to use that information
    in various ways. I figured I'll start by counting each instance. For
    example, the above would be 2 instances.

    Eventually, I want it to be able to cross-reference data you've
    already downloaded, and be able to compare GPAs, etc. It would have a
    GUI and everything too, but I am trying to keep it simple right now,
    and just build onto it as I learn.

    So let's just say this: how do you grab information from the web, in
    this case a feed, and then use that in calculations? How would you
    implement such a project? Would you save the information into a text
    file, or would you use something else? Should I study up on SQLite?
    Maybe I should study classes. I'm just not sure. What would be the
    most effective technique?
  • Sanoski

    #2
    Re: how would you...?

    The reason I ask about text files is the need to save the data
    locally and store it in a way where backups can easily be made.
    Then if your computer crashes and you lose everything, but you have
    the data files it uses backed up, you can just download the program
    again, extract the backed-up data to a specific directory, and it
    works exactly the way it did before you lost it. I suppose a SQLite
    database might solve this, but I'm not sure. I'm just getting
    started, and I don't know too much about it yet.

    I'm also still not sure how to download and associate the pictures
    that each entry has for it. The main thing for me now is getting
    started. It needs to get information from the web. In this case, it's
    a simple XML feed. The one thing that seems like it would make this
    easier is that every post to the feed is very consistent. Each header
    starts with the letter A, which stands for Alpike Tech, followed by
    the name of the class, the room number, the leading student, and his
    GPA. All that is one line of text, but it's also a link to more
    information. For example:

    A Economics, 312, John Carbroil, 4.0

    That's one whole post to the feed. Like I say, it's very simple and
    consistent, which should make this easier.

    Eventually I want it to follow that link and grab information from
    there too, but I'll worry about that later. Technically, if I figure
    this first part out, that problem should take care of itself.





    On May 17, 1:08 am, Mensanator <mensana...@aol.com> wrote:
    > On May 16, 11:43 pm, Sanoski <Joshuajr...@gmail.com> wrote:
    > <snip>
    > So let's just say this. How do you grab information from the web,

    Depends on the web page.

    > in this case a feed,

    Haven't tried that, just a simple CGI.

    > and then use that in calculations?

    The key is some type of structure, be it database records,
    a list of lists, or whatever: something that you can iterate
    through, sort, find the max element of, etc.
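
    For instance, a minimal sketch of that idea for the two feed lines
    in your post (assuming each entry really is one comma-separated
    line; adjust the parsing to the feed's real format):

    feed_lines = [
        'Economics, Room 216, James Faker, 3.4',
        'Social Studies, Room 231, Brian Fictitious, 3.5',
    ]

    records = []
    for line in feed_lines:
      subject, room, student, gpa = [s.strip() for s in line.split(',')]
      records.append([subject, room, student, float(gpa)])

    print len(records)                      # 2 instances
    print max(records, key=lambda r: r[3])  # entry with the highest GPA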
    > How would you
    > implement such a project?

    The example below uses BeautifulSoup. I'm posting it not
    because it matches your problem, but to give you an idea of
    the techniques involved.

    > Would you save the information into a text file?

    Possibly, but generally no. Text files aren't very useful
    except as a data exchange medium.

    > Or would you use something else?

    Your application lends itself to a database approach.
    Note that in my example the database part of the code is disabled;
    not everyone has MS-Access on Windows.

    > Should I study up on SQLite?

    Yes. The MS-Access code I have can be easily changed to SQLite.
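
    If you do go with SQLite, the sqlite3 module that ships with
    Python 2.5 and later keeps the whole database in one ordinary
    file, which also answers your backup question. A minimal sketch
    (the filename and table layout are made up; assumes the records
    list from the sketch above):

    import sqlite3

    con = sqlite3.connect('school.db')   # a single file, easy to back up
    cur = con.cursor()
    cur.execute("""CREATE TABLE IF NOT EXISTS classes
                   (subject TEXT, room TEXT, student TEXT, gpa REAL)""")
    cur.executemany("INSERT INTO classes VALUES (?, ?, ?, ?)", records)
    con.commit()
    print cur.execute("SELECT subject, MAX(gpa) FROM classes").fetchone()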
    > Maybe I should study classes.

    I don't know, but I've always gotten along without them.

    > I'm just not sure. What would be the most effective technique?

    Don't know that either, as I've only done it once, as follows:

    ##  I was looking in my database of movie grosses I regularly copy
    ##  from the Internet Movie Database and noticed I was _only_ 120
    ##  weeks behind in my updates.
    ##
    ##  Ouch.
    ##
    ##  Copying a web page, pasting into a text file, running a perl
    ##  script to convert it into a csv file and manually importing it
    ##  into Access isn't so bad when you only have a couple to do at
    ##  a time. Still, it's a labor intensive process and 120 isn't
    ##  anything to look forward to.
    ##
    ##  But I abandoned perl years ago when I took up Python, so I
    ##  can use Python to completely automate the process now.
    ##
    ##  Just have to figure out how.
    ##
    ##  There's 3 main tasks: capture the web page, parse the web page
    ##  to extract the data and insert the data into the database.
    ##
    ##  But I only know how to do the last step, using the odbc tools
    ##  from win32,

    ####import dbi
    ####import odbc
    import re

    ##  so I snoop around comp.lang.python to pick up some
    ##  hints and keywords on how to do the other two tasks.
    ##
    ##  Documentation on urllib2 was a bit vague, but got the web page
    ##  after only a couple mis-steps.

    import urllib2

    ##  Unfortunately, HTMLParser remained beyond my grasp (is it
    ##  my imagination or is the quality of the examples in the
    ##  documentation inversely proportional to the subject
    ##  difficulty?)
    ##
    ##  Luckily, my bag of hints had a reference to Beautiful Soup,
    ##  whose web site proclaims:
    ##      Beautiful Soup is a Python HTML/XML parser
    ##      designed for quick turnaround projects like
    ##      screen-scraping.
    ##  Looks like just what I need, maybe I can figure it out after all.

    from BeautifulSoup import BeautifulSoup

    target_dates = [['4','6','2008', 'April']]

    ####con = odbc.odbc("IMDB")  # connect to MS-Access database
    ####cursor = con.cursor()

    for d in target_dates:
      #
      # build url (with CGI parameters) from list of dates needing updating
      #
      the_year = d[2]
      the_date = '/'.join([d[0],d[1],d[2]])
      print '%10s scraping IMDB:' % (the_date),
      the_url = ''.join([r'http://www.imdb.com/BusinessThisDay?day=',
                         d[1], '&month=', d[3]])
      req = urllib2.Request(url=the_url)
      f = urllib2.urlopen(req)
      www = f.read()
      #
      # ok, page captured. now make a BeautifulSoup object from it
      #
      soup = BeautifulSoup(www)
      #
      # that was easy, much more so than HTMLParser
      #
      # now, _all_ I have to do is figure out how to parse it
      #
      # ouch again. this is a lot harder than it looks in the
      # documentation. I need to get the data from cells of a
      # table nested inside another table and that's hard to
      # extrapolate from the examples showing how to find all
      # the comments on a web page.
      #
      # but this looks promising. if I grab all the table rows
      # (tr tags), each complete nested table is inside a cell
      # of the outer table (whose table tags are lost, but aren't
      # needed and whose absence makes extracting the nested
      # tables easier (when you do it the stupid way, but hey,
      # it works, so I'm sticking with it))
      #
      tr = soup.tr                          # table rows
      tr.extract()
      #
      # now, I only want the third nested table. how do I get it?
      # can't seem to get past the first one, should I be using
      # NextSibling or something? <scratches head...>
      #
      # but wait...I don't need the first two tables, so I can
      # simply extract and discard them. and since .extract()
      # CUTS the tables, after two extractions the table I want
      # IS the first one.
      #
      the_table = tr.find('table')          # discard
      the_table.extract()
      the_table = tr.find('table')          # discard
      the_table.extract()
      the_table = tr.find('table')          # weekly gross
      the_table.extract()
      #
      # of course, the data doesn't start in the first row,
      # there's formatting, header rows, etc. looks like it starts
      # in tr number [3]
      #
      ##  >>> the_table.contents[3].td
      ##  <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a></td>
      #
      # and since tags always imply the first one, the above
      # is equivalent to
      #
      ##  >>> the_table.contents[3].contents[0]
      ##  <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a></td>
      #
      # and since the title is the first of three cells, the
      # reporting year is
      #
      ##  >>> the_table.contents[3].contents[1]
      ##  <td><a href="/Sections/Years/2001">2001</a></td>
      #
      # finally, the 3rd cell must contain the gross
      #
      ##  >>> the_table.contents[3].contents[2]
      ##  <td align="RIGHT">259,674,120</td>
      #
      # but the contents of the first two cells are anchor tags.
      # to get the actual title string, I need the contents of the
      # contents. but that's not exactly what I want either,
      # I don't want a list, I need a string. and the string isn't
      # always in the same place in the list
      #
      # summarizing, what I need is
      #
      ##  print the_table.contents[3].contents[0].contents[0].contents,
      ##  print the_table.contents[3].contents[1].contents[1].contents,
      ##  print the_table.contents[3].contents[2].contents
      #
      # and that almost works, just a couple more tweaks and I can
      # shove it into the database

      parsed = []

      for rec in the_table.contents[3:]:
        the_rec_type = type(rec)       # some recs are NavigableStrings, skip
        if str(the_rec_type) == "<type 'instance'>":
          #
          # ok, got a real data row
          #
          TITLE_DATE = rec.contents[0].contents[0].contents  # a list inside a tuple
          #
          # and that means we still have to index the contents
          # of the contents of the contents of the contents by
          # adding [0][0] to TITLE_DATE
          #
          YEAR = rec.contents[1].contents[1].contents        # ditto
          #
          # this won't go into the database, just used as a filter to grab
          # the records associated with the posting date and discard
          # the others (which should already be in the database)
          #
          GROSS = rec.contents[2].contents                   # just a list
          #
          # one other minor glitch: the film date is part of the title
          # (which is of no use in the database), so it has to be pulled
          # out and put in a separate field
          #
    #      temp_title = re.search('(.*?)( \()([0-9]{4}.*)(\))(.*)', str(TITLE_DATE[0][0]))
          temp_title = re.search('(.*?)( \()([0-9]{4}.*)(\))(.*)', str(TITLE_DATE))
          #
          # which works 99% of the time. unfortunately, the IMDB's
          # consistency is somewhat dubious. the date is _supposed_
          # to be at the end of the string, but sometimes it's not.
          # so, usually, there are only 5 groups, but you have to
          # allow for the fact that there may be 6
          #
          try:
            the_title = temp_title.group(1) + temp_title.group(5)
          except:
            the_title = temp_title.group(1)
          the_gross = str(GROSS[0])
          #
          # and for some unexplained reason, dates will occasionally
          # be 2001/I instead of 2001, so we want to discard the trailing
          # crap, if any
          #
          the_film_year = temp_title.group(3)[:4]
    #      if str(YEAR[0][0])==the_year:
          if str(YEAR[0])==the_year:
            parsed.append([the_date,the_title,the_film_year,the_gross])

      print '%3d records found ' % (len(parsed))
      #
      # wow, now just have to insert all the update...

    • Mensanator

      #3
      Re: how would you...?

      On May 17, 4:02 am, Sanoski <Joshuajr...@gmail.com> wrote:
      > The reason I ask about text files is the need to save the data
      > locally, and have it stored in a way where backups can easily
      > be made.

      Sure, you can always do that if you want. But if your target
      is SQLite or MS-Access, those are files also, so they can be
      backed up as easily as text files.
      > Then if your computer crashes and you lose everything, but
      > you have the data files it uses backed up, you can just
      > download the program, extract the backed-up data to a
      > specific directory, and then it works exactly the way it
      > did before you lost it. I suppose a SQLite database might
      > solve this, but I'm not sure.

      It will. Remember, once in a database, you have value-added
      features like filtering, sorting, etc. that you would have
      to do yourself if you simply read in text files.
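
      For example, once the rows are in a table, the filtering and
      sorting are one-liners. A sketch, reusing the hypothetical
      classes table from the earlier sqlite3 example:

      cur.execute("""SELECT student, gpa FROM classes
                     WHERE gpa >= 3.5 ORDER BY gpa DESC""")
      for row in cur.fetchall():
        print row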
      > I'm just getting started, and I
      > don't know too much about it yet.

      Trust me, a database is the way to go.
      My preference is MS-Access, because I need it for work.
      It is a great tool for learning databases because its
      visual interface can make you productive BEFORE you learn
      SQL.
      > I'm also still not sure how to download and associate the pictures
      > that each entry has for it.

      See example at end of post.
      > The main thing for me now is getting
      > started. It needs to get information from the web. In this case,
      > it's a simple XML feed.

      BeautifulSoup also has an XML parser. Go to their
      web page and read the documentation.
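
      A rough sketch of the XML flavor, BeautifulStoneSoup (the tag
      and attribute names here are invented, since I don't know your
      feed's actual structure):

      from BeautifulSoup import BeautifulStoneSoup

      xml = '''<feed>
      <entry href="http://example.edu/econ">A Economics, 312, John Carbroil, 4.0</entry>
      </feed>'''
      soup = BeautifulStoneSoup(xml)
      for entry in soup.findAll('entry'):
        print entry['href'], entry.string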
      > The one thing that seems that would
      > make it easier is every post to the feed is very consistent.
      > Each header starts with the letter A, which stands for Alpike
      > Tech, followed by the name of the class, the room number, the
      > leading student, and his GPA. All that is one line of text.
      > But it's also a link to more information. For example:
      >
      > A Economics, 312, John Carbroil, 4.0
      >
      > That's one whole post to the feed. Like I say, it's very
      > simple and consistent. Which should make this easier.

      That's what you want for parsing: how to separate
      a composite set of data. Simple can sometimes be
      done with split(), complex with regular expressions.
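
      For a line as regular as yours, split() is probably enough.
      A sketch (assuming commas never appear inside a field):

      line = 'A Economics, 312, John Carbroil, 4.0'
      parts = [p.strip() for p in line.split(',')]
      school, subject = parts[0].split(' ', 1)   # 'A', 'Economics'
      room, student, gpa = parts[1], parts[2], float(parts[3])
      print school, subject, room, student, gpa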
      > Eventually I want it to follow that link and grab information
      > from there too, but I'll worry about that later. Technically,
      > if I figure this first part out, that problem should take
      > care of itself.

      A sample picture scraper:

      from BeautifulSoup import BeautifulSoup
      import urllib2
      import urllib

      #
      # start by scraping the web page
      #
      the_url="http://members.aol.com/mensanator/OHIO/TheCobs.htm"
      req = urllib2.Request(url=the_url)
      f = urllib2.urlopen(req)
      www = f.read()
      soup = BeautifulSoup(www)
      print soup.prettify()

      #
      # a simple page with pictures
      #
      ##<html>
      ## <head>
      ## <title>
      ## Ohio - The Cobs!
      ## </title>
      ## </head>
      ## <body>
      ## <h1>
      ## Ohio Vacation Pictures - The Cobs!
      ## </h1>
      ## <hr />
      ## <img src="AUT_2784.J PG" />
      ## <br />
      ## WTF?
      ## <p>
      ## <img src="AUT_2764.J PG" />
      ## <br />
      ## This is surreal.
      ## </p>
      ## <p>
      ## <img src="AUT_2765.J PG" />
      ## <br />
      ## Six foot tall corn cobs made of concrete.
      ## </p>
      ## <p>
      ## <img src="AUT_2766.J PG" />
      ## <br />
      ## 109 of them, laid out like a modern Stonehenge.
      ## </p>
      ## <p>
      ## <img src="AUT_2769.J PG" />
      ## <br />
      ## With it's own Druid worshippers.
      ## </p>
      ## <p>
      ## <img src="AUT_2781.J PG" />
      ## <br />
      ## Cue the
      ## <i>
      ## Also Sprach Zarathustra
      ## </i>
      ## soundtrack.
      ## </p>
      ## <p>
      ## <img src="100_0887.J PG" />
      ## <br />
      ## Air & Space Museums are a dime a dozen.
      ## <br />
      ## But there's only
      ## <b>
      ## one
      ## </b>
      ## Cobs!
      ## </p>
      ## <p>
      ## </p>
      ## </body>
      ##</html>

      #
      # parse the page to find all the pictures (image tags)
      #
      the_pics = soup.findAll('img')

      for i in the_pics:
        print i

      ##<img src="AUT_2784.J PG" />
      ##<img src="AUT_2764.J PG" />
      ##<img src="AUT_2765.J PG" />
      ##<img src="AUT_2766.J PG" />
      ##<img src="AUT_2769.J PG" />
      ##<img src="AUT_2781.J PG" />
      ##<img src="100_0887.J PG" />

      #
      # the pictures have no path, so they must be in the
      # same directory as the web page
      #
      the_jpg_path = "http://members.aol.com/mensanator/OHIO/"

      #
      # now with urllib, copy the picture files to the local
      # hard drive renaming with sequence id at the same time
      #
      for i,j in enumerate(the_pics):
        p = the_jpg_path + j['src']
        q = 'C:\\scrape\\' + 'pic' + str(i).zfill(4) + '.jpg'
        urllib.urlretrieve(p,q)

      #
      # and here's the captured files
      #
      ## C:\>dir scrape
      ## Volume in drive C has no label.
      ## Volume Serial Number is D019-C60D
      ##
      ## Directory of C:\scrape
      ##
      ## 05/17/2008 07:06 PM <DIR> .
      ## 05/17/2008 07:06 PM <DIR> ..
      ## 05/17/2008 07:05 PM 69,877 pic0000.jpg
      ## 05/17/2008 07:05 PM 71,776 pic0001.jpg
      ## 05/17/2008 07:05 PM 70,958 pic0002.jpg
      ## 05/17/2008 07:05 PM 69,261 pic0003.jpg
      ## 05/17/2008 07:05 PM 70,653 pic0004.jpg
      ## 05/17/2008 07:05 PM 70,564 pic0005.jpg
      ## 05/17/2008 07:05 PM 113,356 pic0006.jpg
      ## 7 File(s) 536,445 bytes
      ## 2 Dir(s) 27,823,570,944 bytes free
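
      To associate each picture with its feed entry, one option is to
      store the local filename in the same row as the entry it belongs
      to. A sketch, reusing the hypothetical classes table from earlier
      and assuming picture i goes with entry i:

      cur.execute("ALTER TABLE classes ADD COLUMN pic TEXT")
      for i in range(len(the_pics)):
        local_name = 'C:\\scrape\\pic' + str(i).zfill(4) + '.jpg'
        cur.execute("UPDATE classes SET pic = ? WHERE rowid = ?",
                    (local_name, i + 1))
      con.commit()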


      • inhahe

        #4
        Re: how would you...?


        "Sanoski" <Joshuajruss@gm ail.comwrote in message
        news:1449c36e-10ce-42f4-bded-99d53a0a2569@a1 g2000hsb.google groups.com...
        > <snip>

        People usually say BeautifulSoup for getting stuff from the web. I
        think I tried it once and had some problem and gave up. But since
        this is XML, I think all you'd need is xml.dom.minidom or
        xml.etree.ElementTree; I'm not sure which is easier. See
        doc\python.chm in the Python directory to study up on those. To
        grab the webpage to begin with, you'd use urllib2. That takes
        around one line of code.

        I wouldn't save it to a text file, because text files aren't that
        good for random access, or for storing images. I'd save it in a
        database. There are other database modules than SQLite, but SQLite
        is a good one for simple projects like this, where you're just
        going to be running one instance of the program at a time. SQLite
        is fast, and it's the only one that doesn't require a separate
        database engine to be installed and running.

        Classes are just a way of organizing code (and a little more, but
        they don't have a lot to do with saving stuff to a file).

        I'm not clear on whether the GPA is available as text and an
        image, or just an image. If it's just available as an image,
        you're going to want to use PIL (the Python Imaging Library). Btw,
        use float() to convert a textual GPA to a number.

        You'll have to learn some basics of the SQL language (that applies
        to any database). Or maybe not with SQLObject or SQLAlchemy, but I
        don't know how easy those are to learn. Or, if you don't want to
        learn SQL, you could use a text file with fixed-length fields and
        perhaps references to the individual filenames that store the
        pictures, and I could tell you how to do that. But a database is a
        lot more flexible, and you wouldn't need to learn much SQL for the
        same purposes.

        Btw, I used SQLite version 2 before, and it didn't allow me to
        return query results as dictionaries (i.e., indexable by field
        name), just lists of values, except by using some strange code I
        found somewhere. But a list is also easy to use. And if version 3
        doesn't do it either and you want that code, I have it.
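
        Here's a rough sketch of the ElementTree route, by the way (the
        URL and tag names are invented; the real feed will differ):

        import urllib2
        from xml.etree import ElementTree

        feed = urllib2.urlopen('http://example.edu/gpa-feed.xml').read()
        root = ElementTree.fromstring(feed)
        for entry in root.findall('entry'):
            # each entry might carry its link as an attribute and the
            # class/room/student/GPA line as its text
            print entry.get('href'), entry.text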



        • Martin Sand Christensen

          #5
          Re: how would you...?

          >>>>"inhahe" == inhahe <inhahe@gmail.c omwrites:
          inhaheBtw, use float() to convert a textual GPA to a number.

          It would be much better to use Decimal() instead of float(). A
          GPA of 3.6000000000000001 probably doesn't make much sense; this
          problem doesn't arise when using the Decimal type.
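
          A quick illustration (the exact float repr depends on your
          Python build):

          from decimal import Decimal

          print repr(float('3.6'))  # e.g. '3.6000000000000001'
          print Decimal('3.6')      # 3.6, stored exactly as written
          print Decimal('3.6') + Decimal('0.1')  # 3.7, no rounding noise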

          Martin
