Extracting data from HTML using PERL Regex

  • typedefcoder
    New Member
    • Oct 2011
    • 1

    Extracting data from HTML using PERL Regex

    I have two files, one XML and one HTML, and need to extract data from them based on certain patterns. My XML file is pretty well formatted, so I can use getline to read a line and search for data between tags.

    if ($line =~ /<tag1>$varvalue<\/tag1>/)

    However, my HTML has some of the worst code I have seen, and the file looks like this:


    Code:
    <div class="theater">
                                            <h2>
    
    <a href="/showtimes/university-village-3" >**University Village 3**</a></h2>
                                            <div class="address">
                                                <i>**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**</i>
                                            </div>
                                        </div>
    
    
                                                  <div class="mtitle">
    
    
    <a href="/movie/dream-house-2011"  title="Dream House" onmouseover="mB(event, 771204354);"  >**Dream House**</a>
                                                                <span>**(PG-13 , 1 hr. 31 min.)**</span>
                                                            </div>
    
    
    
    
                                                    <div class="times">
    
                                                                            **1:00 PM,**
                                                                                                            </div>
  • RonB
    Recognized Expert Contributor
    • Jun 2009
    • 589

    #2
    You should not be using a regex to parse XML or HTML. You should be using one of the parsers on CPAN, such as HTML::Parser.

    This list contains several HTML parsers.
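
    To illustrate the parser approach, here is a minimal sketch using HTML::Parser's version 3 handler API on markup modeled on the snippet in the question. The class names (theater, address, mtitle, times) come from that snippet; the %wanted, @stack, and %found variables and the overall structure are assumptions made for this example, not a definitive implementation. It tracks which div is currently open and collects the text found inside it, so the irregular whitespace in the file does not matter.

```perl
use strict;
use warnings;
use HTML::Parser;

# Sample markup modeled on the question's snippet (bold markers removed).
my $html = <<'HTML';
<div class="theater">
        <h2>
<a href="/showtimes/university-village-3" >University Village 3</a></h2>
        <div class="address">
            <i>3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321</i>
        </div>
</div>
        <div class="mtitle">
<a href="/movie/dream-house-2011"  title="Dream House" onmouseover="mB(event, 771204354);"  >Dream House</a>
                <span>(PG-13 , 1 hr. 31 min.)</span>
        </div>
        <div class="times">
                        1:00 PM,
        </div>
HTML

# Hypothetical bookkeeping for this sketch.
my %wanted = map { $_ => 1 } qw(theater address mtitle times);
my @stack;   # class of each open <div>, '' when it is not one we want
my %found;   # class name => list of text chunks seen inside that div

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $attr) = @_;
        return unless $tag eq 'div';
        my $class = $attr->{class} // '';
        push @stack, $wanted{$class} ? $class : '';
    }, 'tagname, attr' ],
    end_h => [ sub {
        my ($tag) = @_;
        pop @stack if $tag eq 'div' && @stack;
    }, 'tagname' ],
    text_h => [ sub {
        my ($text) = @_;
        $text =~ s/^\s+|\s+$//g;                      # trim surrounding whitespace
        return unless length $text;
        my ($class) = grep { length } reverse @stack; # innermost wanted div
        push @{ $found{$class} }, $text if defined $class;
    }, 'dtext' ],
);
$parser->unbroken_text(1);   # deliver each text run as a single chunk
$parser->parse($html);
$parser->eof;

for my $class (qw(theater address mtitle times)) {
    print "$class: ", join('; ', @{ $found{$class} || [] }), "\n";
}
```

    For a real file, `$parser->parse_file($path)` could replace the `parse`/`eof` pair. A tree-building module such as HTML::TreeBuilder may be more convenient if the data has to be grouped per theater.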
