Help extracting info from HTML source ..

  • s. d. rose

    Help extracting info from HTML source ..

    Hello All.
    I am learning Python, and have never worked with HTML. However, I would
    like to write a simple script to audit my 100+ Netware servers via their web
    portal.

    I was reading Chapter 8 of Dive into Python, which deals with this topic.
    In the web portal of the server, there is a section similar to this:

    -- clients and <A
    href="http://eugenia.blogsome.com/?s=ipkall">clever</a> services. <--

    which I took from SlashDot, but what I'm talking about is using the word
    'services' to represent the link to eugenia.blogsome.com.

    What I'd like to do is save the two pieces of info relative to the server
    name. Probably in a dictionary, such as server1[link] to the page on
    eugenia.blogsome.com and server1[description] to 'services'.

    I've used the example from Dive into Python to get the actual link in the
    source of the HTML, but I don't know how to get the text that is the
    hyperlink.

    So in the portal, I've got a link 'Scheduled Server Reboot' going to say
    /ScheduledTasks/ID000000003/ on Server1, using similar to above clipped HTML
    source code.
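
A minimal sketch of the layout described above, in Python 3: one dictionary per server holding the link target and its visible text. The host name `server1.example.com` and the key names `link` and `description` are assumptions for illustration; only the path and the link text come from the post.

```python
from urllib.parse import urljoin

# Hypothetical base URL for one server's web portal (not from the post).
base_url = "http://server1.example.com/"

# Values as they would be scraped from the portal page.
href = "/ScheduledTasks/ID000000003/"
text = "Scheduled Server Reboot"

# One dictionary per server, keyed by server name.
servers = {
    "server1": {
        "link": urljoin(base_url, href),  # resolve the relative path
        "description": text,
    }
}

print(servers["server1"]["link"])
print(servers["server1"]["description"])
```

Keying the outer dictionary by server name keeps the audit of 100+ servers in a single structure that can be looped over later.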

    Can someone please help me? Sure, I could manually go to each server, but I
    wouldn't learn anything. I've learned some, but also have real deadlines,
    so I eagerly hope for any assistance & instruction.

    Thank you!
    -Dave
    Shelton, CT



  • Miki

    #2
    Re: Help extracting info from HTML source ..

    Hello Shelton,
    > I am learning Python, and have never worked with HTML. However, I would
    > like to write a simple script to audit my 100+ Netware servers via their web
    > portal.
    Always use the right tool. BeautifulSoup
    (http://www.crummy.com/software/BeautifulSoup/) is best for web
    scraping (IMO).

    from urllib import urlopen
    from BeautifulSoup import BeautifulSoup

    # Python 2, BeautifulSoup 3
    html = urlopen("http://www.python.org").read()
    soup = BeautifulSoup(html)
    for link in soup("a"):
        # link.contents is the list of child nodes, i.e. the link text
        print link["href"], "-->", link.contents
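
Where installing BeautifulSoup is not an option, the same href-plus-text pairing can be sketched with the standard library's `html.parser` module in Python 3. The class name `LinkCollector` and the sample HTML are made up for illustration and are not part of the original post; anchors without an `href` are simply skipped.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for <a> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Sample markup modelled on the portal link described in the question.
sample = '<p>See <a href="/ScheduledTasks/ID000000003/">Scheduled Server Reboot</a> here.</p>'
parser = LinkCollector()
parser.feed(sample)
parser.close()
print(parser.links)
```

Each pair can then be stored under the server's name, for example as the `link` and `description` entries of a per-server dictionary.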

    HTH,
    --
    Miki
    If it won't be simple, it simply won't be.



    • Nikita the Spider

      #3
      Re: Help extracting info from HTML source ..

      In article <1169819118.201093.267320@h3g2000cwc.googlegroups.com>,
      "Miki" <miki.tebeka@gmail.com> wrote:
      > Hello Shelton,
      >
      > > I am learning Python, and have never worked with HTML. However, I would
      > > like to write a simple script to audit my 100+ Netware servers via their web
      > > portal.
      > Always use the right tool. BeautifulSoup
      > (http://www.crummy.com/software/BeautifulSoup/) is best for web
      > scraping (IMO).
      >
      > from urllib import urlopen
      > from BeautifulSoup import BeautifulSoup
      >
      > html = urlopen("http://www.python.org").read()
      > soup = BeautifulSoup(html)
      > for link in soup("a"):
      >     print link["href"], "-->", link.contents

      Agreed. HTML scraping is really complicated once you get into it. It
      might be interesting to write such a library just for your own
      satisfaction, but if you want to get something done then use a module
      that's already written, like BeautifulSoup. Another module that will do
      the same job but works differently (and more simply, IMO) is HTMLData by
      Connelly Barnes.


      --
      Philip

      Whole-site HTML validation, link checking and more
