Stripping HTML from RSS feed

  • Jason

    Stripping HTML from RSS feed

    First things first, let me say that I couldn't decide whether to post
    this to the PHP newsgroup or to an XML newsgroup. I know from experience
    that you guys know what you're talking about, though, and all of my
    questions boil down to "how to do this in PHP," so I hope I picked the
    right one ;-)

    For about a year, I've been importing Yahoo News headlines into my site
    via their RSS feed. But I would much rather import Google News
    headlines because I can make them specific to my location. The problem
    is that their RSS feed includes HTML, and the script I use can't figure
    it out.

    Here's an example tag from their most recent feed:

    <description><br><table border=0 width= valign=top cellpadding=2
    cellspacing=7><tr><td valign=top><a
    href="http://news.google.com/news/url?sa=T&ct=us/0-0&fd=R&url=http://www.raleighchronicle.com/2006072105.html&cid=1108150860&ei=hUzBRO_JMpe2aInH4PsB">Truck
    Driver Wins $100,000 In Lottery; <b>NC</b> Powerball
    Winners</a><br><font size=-1><font color=#6f6f6f>Raleigh
    Chronicle,&nbsp;NC&nbsp;-</font> <nobr>2 hours
    ago</nobr></font><br><font size=-1>REIDSVILLE -- According to the
    <b>NC</b> Lottery Commission, when Dennis Mebane collected his prize
    this week from a winning $100,000 instant scratch-off ticket, he
    <b>...</b></font><br></table></description>


    The XML tag is <description>, but the script I use to parse it
    (rss2array, a cookie-cutter script that I downloaded) can't
    differentiate between <description> and <table>. Usually, I would call
    <description> by using the variable $rss['items'][$i]['description']
    (where $i is the index counter), but with this output I would have to
    do something like
    $rss['items'][$i]['description']['table']['tr']['td']['b']['font']...

    So I guess I really have 2 questions:

    1. Is there a way to strip HTML completely out of the tag? Or better
    yet, to make rss2array read the HTML as actual HTML instead of XML
    tags?

    2. If not, is there a better way to read the XML than through the use
    of rss2array? I'm a fairly experienced coder, but don't really use XML
    often enough to have a good grasp of the logic.

    TIA,

    Jason

  • Noodle

    #2
    Re: Stripping HTML from RSS feed

    See if this works:

    1. Use $rss['items'][$i]['description'] to return the contents of the
    description node (including html markup) as a string.
    2. Use the 'strip_tags' function to remove unwanted HTML from the
    returned string.

    See http://www.php.net/manual/en/function.strip-tags.php for more info.
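
A minimal sketch of the two steps above, assuming the description has already been obtained as a plain string. The function name description_to_text() and the sample string are made up for illustration; they are not part of rss2array. Since strip_tags() simply deletes tags, the sketch pads each tag with a space first so words on either side of a <br> don't run together, and it decodes entities such as &nbsp; afterwards.

```php
<?php
// Hypothetical helper: turn an HTML description into plain text.
function description_to_text(string $html): string
{
    // Put a space before every tag so "Lottery</a><br>Raleigh" keeps a gap.
    $spaced = str_replace('<', ' <', $html);

    // Remove all markup; entities such as &nbsp; are left untouched.
    $text = strip_tags($spaced);

    // Decode entities, then turn non-breaking spaces into ordinary spaces.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    $text = str_replace("\xC2\xA0", ' ', $text);

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Sample string shaped like the Google News description (shortened, made up).
$description = '<br><table><tr><td><a href="http://example.com/story">'
    . 'Truck Driver Wins $100,000 In Lottery</a><br>'
    . '<font size=-1>Raleigh Chronicle,&nbsp;NC</font></td></tr></table>';

echo description_to_text($description), "\n";
```

In the real script the argument would be $rss['items'][$i]['description'], provided rss2array actually fills that value in.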




    • Jason

      #3
      Re: Stripping HTML from RSS feed

      I'm afraid that didn't work. It reads the variable as empty.

      As a test, I tried print_r($rss), and while it reads everything else
      correctly, it shows ['description'] as an empty variable. The only ones
      that it reads correctly are the ones that don't have HTML.

      Knowing this, it's got to be a "problem" with rss2array. I put
      "problem" in quotes, because technically I think it's the XML file
      that's flawed, but there's not much I can do about that. What other way
      can I parse an XML database using PHP?

      - Jason



      • Noodle

        #4
        Re: Stripping HTML from RSS feed

        There are two things I can suggest:

        1. Use the DOM XML functions to extract the values (see
        http://au3.php.net/domxml), then use the strip_tags() function.
        or
        2. Use regular expressions to remove the HTML tags from the XML before
        you pass it to rss2array.

        e.g.

        // Remove all <p> tags (opening and closing), keeping their contents
        $xml = preg_replace('/<\/?p\b[^>]*>/i', '', $xml);

        // Remove all <font> tags (opening and closing), keeping their contents
        $xml = preg_replace('/<\/?font\b[^>]*>/i', '', $xml);

        //etc...
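
For the first suggestion, here is a rough sketch. It rests on two assumptions. First, it uses the DOM extension that ships with PHP 5 and later (DOMDocument) rather than the older PHP 4 DOM XML functions linked above. Second, it assumes the feed is well-formed XML, with the HTML inside <description> entity-escaped or wrapped in CDATA. The feed string and the function name feed_descriptions() are made up for illustration.

```php
<?php
// Hypothetical feed: the HTML inside <description> is entity-escaped.
$feed = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Truck Driver Wins $100,000 In Lottery</title>
      <description>&lt;br&gt;&lt;font size=-1&gt;REIDSVILLE -- According to the &lt;b&gt;NC&lt;/b&gt; Lottery Commission&lt;/font&gt;</description>
    </item>
  </channel>
</rss>
XML;

// Return the plain-text description of every <item> in the feed.
function feed_descriptions(string $xml): array
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        return []; // not well-formed XML
    }

    $out = [];
    foreach ($doc->getElementsByTagName('item') as $item) {
        $nodes = $item->getElementsByTagName('description');
        if ($nodes->length === 0) {
            continue; // item without a description
        }
        // textContent yields the unescaped HTML as one string.
        $html = $nodes->item(0)->textContent;
        $out[] = trim(strip_tags($html));
    }
    return $out;
}

print_r(feed_descriptions($feed));
```

If the feed really contains raw, unescaped HTML such as a bare <br>, loadXML() will refuse it, and the regular-expression clean-up from the second suggestion would have to run first.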

