Extracting data from HTML using PERL Regex

typedefcoder

New Member

Join Date: Oct 2011
Posts: 1

Oct 16 '11, 11:09 AM

I have two files, an XML file and an HTML file, and I need to extract data from them based on certain patterns. My XML file is pretty well formatted, so I can use getline to read a line at a time and search for the data between tags.

if ($line =~ m{<tag1>(.*?)</tag1>})
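For context, the line-by-line approach described above might look roughly like the sketch below. The tag name `tag1`, the sample data, and the variable names are placeholders, since the real XML file is not shown, and the approach only holds up when each element sits entirely on one line.

```perl
use strict;
use warnings;

# Hypothetical sample input; the real XML file is not shown in the post.
my $xml = <<'XML';
<root>
<tag1>first value</tag1>
<tag2>ignored</tag2>
<tag1>second value</tag1>
</root>
XML

# Read the sample through an in-memory filehandle, one line at a time.
open my $fh, '<', \$xml or die "Cannot open in-memory file: $!";

my @values;
while (my $line = <$fh>) {
    # m{...} delimiters avoid having to escape the slash in the closing tag.
    if ($line =~ m{<tag1>(.*?)</tag1>}) {
        push @values, $1;    # text captured between the tags
    }
}
close $fh;

print "$_\n" for @values;
```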

However, my HTML file has some of the worst code I have seen. The file looks like this:

Code:

<div class="theater">
                                        <h2>

<a href="/showtimes/university-village-3" >**University Village 3**</a></h2>
                                        <div class="address">
                                            <i>**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**</i>
                                        </div>
                                    </div>


                                              <div class="mtitle">


<a href="/movie/dream-house-2011"  title="Dream House" onmouseover="mB(event, 771204354);"  >**Dream House**</a>
                                                            <span>**(PG-13 , 1 hr. 31 min.)**</span>
                                                        </div>




                                                <div class="times">

                                                                        **1:00 PM,**
                                                                                                        </div>

Tags: perl xml html regex

RonB

Recognized Expert Contributor

Join Date: Jun 2009

Posts: 589
#2

Oct 16 '11, 04:20 PM

You should not be using a regex to parse XML or HTML. You should be using one of the parsers on CPAN, such as HTML::Parser.

This list contains several HTML parsers.

http://search.cpan.org/modlist/World_Wide_Web/HTML
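
As a rough illustration of the parser-based approach, here is a minimal sketch using HTML::Parser (version 3 API) on the snippet from the question. It assumes the `**` markers in the post are only highlighting and are not in the real file, and that the data of interest is the text inside the `theater`, `address`, `mtitle`, and `times` divs. The names `@div_stack` and `%found` are made up for this example.

```perl
use strict;
use warnings;
use HTML::Parser;

# Snippet from the question, with the ** highlighting removed (assumption).
my $html = <<'HTML';
<div class="theater">
  <h2>
<a href="/showtimes/university-village-3" >University Village 3</a></h2>
  <div class="address">
    <i>3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321</i>
  </div>
</div>
<div class="mtitle">
<a href="/movie/dream-house-2011"  title="Dream House" onmouseover="mB(event, 771204354);"  >Dream House</a>
  <span>(PG-13 , 1 hr. 31 min.)</span>
</div>
<div class="times">
    1:00 PM,
</div>
HTML

my @div_stack;    # class attribute of every <div> that is currently open
my %found;        # div class => list of text chunks seen inside it

my $parser = HTML::Parser->new(
    api_version => 3,
    # Remember the class of each <div> as it opens.
    start_h => [ sub {
        my ($tag, $attr) = @_;
        push @div_stack, ($attr->{class} // '') if $tag eq 'div';
    }, 'tagname, attr' ],
    # Forget it again when the <div> closes.
    end_h => [ sub {
        my ($tag) = @_;
        pop @div_stack if $tag eq 'div';
    }, 'tagname' ],
    # File non-blank text under the innermost open <div>.
    text_h => [ sub {
        my ($text) = @_;
        $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        return unless length($text) && @div_stack;
        push @{ $found{ $div_stack[-1] } }, $text;
    }, 'dtext' ],
);
$parser->unbroken_text(1);    # deliver each text run in one piece
$parser->parse($html);
$parser->eof;

for my $class (qw(theater address mtitle times)) {
    print "$class: ", join(' | ', @{ $found{$class} // [] }), "\n";
}
```

Because the handlers react to tags rather than to line layout, the odd indentation and blank lines in the file do not matter. The real page may need extra handling, for example grouping each movie and its times under the theater that precedes them.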