Extracting text from a Webpage using BeautifulSoup

This topic is closed.
  • Magnus.Moraberg@gmail.com

    Extracting text from a Webpage using BeautifulSoup

    Hi,

    I wish to extract all the words on a set of webpages and store them in
    a large dictionary. I then wish to produce a list with the most common
    words for the language under consideration. So, my code below reads
    the page -

    BBC, News, BBC News, news online, world, wales, welsh, uk, international, foreign, british, online, service


    a Welsh-language page. I hope to then establish the 1000 most commonly
    used words in Welsh. The problem I'm having is that
    soup.findAll(text=True) is returning the likes of -

    u'doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://
    www.w3.org/TR/REC-html40/loose.dtd"'

    and -

    <a href=" \'+url+\'?rss=\ '+rssURI+\'" class="sel"

    Any suggestions how I might overcome this problem?

    Thanks,

    Barry.


    Here's my code -

    import urllib
    import urllib2
    from BeautifulSoup import BeautifulSoup

    # proxy_support = urllib2.ProxyHandler({"http": "http://999.999.999.999:8080"})
    # opener = urllib2.build_opener(proxy_support)
    # urllib2.install_opener(opener)

    page = urllib2.urlopen('http://news.bbc.co.uk/welsh/hi/newsid_7420000/newsid_7420900/7420967.stm')
    soup = BeautifulSoup(page)

    pageText = soup.findAll(text=True)
    print pageText
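    Once the text nodes are extracted, the "most common words" step described above can be sketched as a plain frequency count. This is only an illustrative sketch: the `most_common_words` helper and the sample strings are invented, not part of the original code.

```python
from collections import Counter
import re

def most_common_words(text_nodes, n=3):
    """Count word frequencies across a list of extracted text strings
    and return the n most common (word, count) pairs."""
    words = []
    for node in text_nodes:
        # lower-case and keep only alphabetic runs as "words"
        words.extend(re.findall(r"[a-z']+", node.lower()))
    return Counter(words).most_common(n)

# hypothetical sample standing in for soup.findAll(text=True) output
sample = ["news about wales", "welsh news online", "news from wales"]
print(most_common_words(sample))
```

    For the real task, `n=1000` on the combined output of all pages would give the target list.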

  • Marc 'BlackJack' Rintsch

    #2
    Re: Extracting text from a Webpage using BeautifulSoup

    On Tue, 27 May 2008 03:01:30 -0700, Magnus.Moraberg wrote:

    > I wish to extract all the words on a set of webpages and store them in
    > a large dictionary. I then wish to produce a list with the most common
    > words for the language under consideration. So, my code below reads
    > the page -
    >
    > BBC, News, BBC News, news online, world, wales, welsh, uk, international, foreign, british, online, service
    >
    > a Welsh-language page. I hope to then establish the 1000 most commonly
    > used words in Welsh. The problem I'm having is that
    > soup.findAll(text=True) is returning the likes of -
    >
    > u'doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://
    > www.w3.org/TR/REC-html40/loose.dtd"'

    Just extract the text from the body of the document.

    body_texts = soup.body(text=True)

    > and -
    >
    > <a href=" \'+url+\'?rss=\ '+rssURI+\'" class="sel"
    >
    > Any suggestions how I might overcome this problem?

    Ask the BBC to produce HTML that's less buggy. ;-)

    http://validator.w3.org/ reports bugs like "'body' tag not allowed here"
    or closing tags without opening ones and so on.
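    The idea behind restricting extraction to the body, skipping the doctype and any script content, can also be sketched with only the standard library's HTML parser. The `BodyTextExtractor` class below is invented for illustration (shown in modern Python), not BeautifulSoup's API:

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collect text that appears inside <body>, skipping <script>/<style>."""
    def __init__(self):
        super().__init__()
        self.in_body = False     # only keep text once <body> has opened
        self.skip_depth = 0      # >0 while inside <script> or <style>
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # the doctype arrives via handle_decl, so it never reaches here
        if self.in_body and not self.skip_depth and data.strip():
            self.texts.append(data.strip())

parser = BodyTextExtractor()
parser.feed('<html><head><title>t</title></head>'
            '<body><p>Hello</p><script>var x=1;</script><p>world</p></body></html>')
print(parser.texts)  # the <title>, doctype, and script body are all excluded
```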

    Ciao,
    Marc 'BlackJack' Rintsch


    • Magnus.Moraberg@gmail.com

      #3
      Re: Extracting text from a Webpage using BeautifulSoup

      On 27 May, 12:54, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:

      > Just extract the text from the body of the document.
      >
      > body_texts = soup.body(text=True)
      Great, thanks!


      • Paul McGuire

        #4
        Re: Extracting text from a Webpage using BeautifulSoup

        On May 27, 5:01 am, Magnus.Morab...@gmail.com wrote:

        > I wish to extract all the words on a set of webpages and store them in
        > a large dictionary. I then wish to produce a list with the most common
        > words for the language under consideration.
        >
        > Any suggestions how I might overcome this problem?

        As an alternative datapoint, you can try out the htmlStripper example
        on the pyparsing wiki: http://pyparsing.wikispaces.com/spac...tmlStripper.py

        -- Paul
