count same HTML tags using regex in perl

  • techtween
    New Member
    • May 2011
    • 7

    count same HTML tags using regex in perl

    Kindly help me with a regular expression to count occurrences of the same HTML tag.

    For example:
    <html>
    <body>
    <b>file1</b><b>file2</b>
    </body>
    </html>

    output => two (since two <b> tags are present in the file)

    I need the expression to work for any repeated tag, not just one specific tag.
  • miller
    Recognized Expert Top Contributor
    • Oct 2006
    • 1086

    #2
    Don't use a regex; use an HTML parser such as HTML::TreeBuilder.

    Code:
    use HTML::TreeBuilder;
    
    use strict;
    use warnings;
    
    my $data = do {local $/; <DATA>};
    
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($data);
    $tree->eof;    # signal end of input so the tree is finalized
    
    my @tags = $tree->look_down('_tag' => 'b');
    
    print "b has: " . scalar(@tags) . "\n";
    
    $tree = $tree->delete;
    
    __DATA__
    <html>
    <body>
    <b>file1</b><b>file2</b>
    </body>
    </html>
    If you insist on a regex, then something like the following would work:

    Code:
    use strict;
    use warnings;
    
    my $data = do {local $/; <DATA>};
    
    my $tag = 'b';
    
    # [^>]* (rather than .*?) lets attributes span lines; /i also matches <B>
    my $count = () = $data =~ /<\Q$tag\E(?:\s[^>]*)?>/gi;
    
    print "$tag has: " . $count . "\n";
    
    __DATA__
    <html>
    <body>
    <b>file1</b><b>file2</b>
    </body>
    </html>
    Last edited by miller; May 4 '11, 04:56 PM.
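
The original question asked for a count that is not tied to one tag. A minimal sketch of the parser approach extended to tally every tag in one pass, assuming HTML::TreeBuilder is installed; the `%count` hash and the inline `$html` string are illustrative choices, not part of the posts above:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup from the question, inlined for the sketch.
my $html = '<html><body><b>file1</b><b>file2</b></body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->parse_content($html);    # parse the string and signal end of input

# Tally the tag name of the root element and of every element below it.
my %count;
$count{ $_->tag }++ for $tree, $tree->descendants;

# The parser may add elements the markup only implies (such as <head>),
# so the tally can contain tags that never appear in the source text.
print "$_ occurs $count{$_} times\n" for sort keys %count;

$tree->delete;    # free the tree when done
```

For the sample input this reports two `b` elements. Because the parser normalizes the document, the other counts describe the parsed tree rather than the raw text.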


    • techtween
      New Member
      • May 2011
      • 7

      #3
      Hi Miller,

      Thanks a ton. I knew it could best be done using HTML::TreeBuilder, but I needed it using regex logic. Your code works perfectly. :)


      • rampdv
        New Member
        • Feb 2013
        • 7

        #4
        Hi techtween,
        The code below works for me; it counts HTML tags by tag name using regular expressions.

        I used the following HTML to test it:
        <html>
        <h1>hi this is ramanjaneyulu</h1>
        <b>this is bold text</b>
        <br/>
        <br/>
        <body>this is the body of the html</body><body> this is another body</body>
        <b>this is bold again </b><b> this is another bold </b>
        <head>this is head</head>
        </html>

        If you have any questions, please follow up.
        Code:
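        use strict;
        use warnings;

        # Read the HTML file and tally the name of every opening tag.
        open(my $htm, '<', 'checktag.html') or die "Cannot open checktag.html: $!";
        my %count;
        while (my $line = <$htm>) {
            # Match an opening or self-closing tag and capture its name;
            # closing tags start with "</" and so never match.
            while ($line =~ /<(\w+)(?:\s[^>]*)?\/?>/g) {
                $count{lc $1}++;
            }
        }
        close($htm);

        # Report how many times each tag occurs.
        foreach my $tag (sort keys %count) {
            print " $tag occurs  $count{$tag} times \n";
        }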
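
Reading the file one line at a time misses a tag whose attributes wrap onto the next line. A minimal sketch that works on the whole document as a single string instead; `count_tags` is a hypothetical helper name, and the sample string is made up for illustration. It is still a regex, so it can miscount when an attribute value contains `>` or when markup appears inside `<script>` blocks:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): tally opening tags in an HTML string.
sub count_tags {
    my ($html) = @_;
    my %count;
    # Drop comments first so commented-out markup is not counted.
    $html =~ s/<!--.*?-->//gs;
    # Capture the name of each opening or self-closing tag; closing tags
    # start with "</" and therefore never match.
    while ($html =~ /<([a-zA-Z][\w:-]*)(?:\s[^>]*)?\/?>/g) {
        $count{lc $1}++;
    }
    return \%count;
}

# Made-up sample: the second <b> tag wraps across two lines.
my $html = "<html>\n<body>\n<b>file1</b><b\n class=\"x\">file2</b><br/>\n</body>\n</html>\n";
my $count = count_tags($html);
print "$_ occurs $count->{$_} times\n" for sort keys %$count;
```

To run it against a file, slurp the contents first (for example with `local $/;`) and pass the resulting string to `count_tags`.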
