How can I delete contents between

  • zhengmath
    New Member
    • Jun 2010
    • 3

    How can I delete the contents between <SEC-HEADER> and </SEC-HEADER> in an .htm file?
    Why does my code not work?
    Thanks!
    Code:
    #!/usr/bin/perl
    
    # This is a program which can process the Edgar 10-k html file into a plain text
    # file without graphs and tables.
    
    $filename="H:/Test Data/wmt2004.htm";
    open IN, '<', $filename or die;
    @contents = <IN>;
    close IN;
    
    @contents = grep !/<SEC-HEADER>.*</SEC-HEADER>/ @contents;
    
    $filenameout="H:/Test Data/wmt2004-processed.htm";
    open OUT, '>', $filenameout or die;
    print OUT @contents;
    close OUT;
  • numberwhun
    Recognized Expert Moderator Specialist
    • May 2007
    • 3467

    #2
    Have you taken a look at the perldoc page for grep in Perl? You will note that your grep statement should actually be coded as follows:

    Code:
    @contents = grep {!/<SEC-HEADER>.*<\/SEC-HEADER>/} @contents;
    As for "not working", can you please elaborate? What are you seeing that is going wrong and what are you expecting to see?

    Regards,

    Jeff

    • toolic
      Recognized Expert New Member
      • Sep 2009
      • 70

      #3
      Your code has syntax errors and does not compile. Please post the actual code you are running.

      You should have also posted a small snippet of your input file. Here is my guess: your input file has start and end tags on different lines. Consider:

      Code:
      use warnings;
      use strict;

      my @contents = <DATA>;
      @contents = grep { !/<SEC-HEADER>.*<\/SEC-HEADER>/ } @contents;
      print @contents;

      __DATA__
      <html>

      <SEC-HEADER>foo</SEC-HEADER>

      <SEC-HEADER>
      bar</SEC-HEADER>

      </html>

      This prints out:

      Code:
      <html>
      
      
      <SEC-HEADER>
      bar</SEC-HEADER>
      
      </html>
      In any case, you really should use one of the HTML parser modules from CPAN instead of regular expressions.
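
If a quick regex is still wanted for the multi-line case shown in the test data above, one option is to treat the input as a single string instead of a list of lines. The following is only a minimal sketch: the helper name `strip_sec_header` and the sample string are made up for illustration, and it assumes the tags are never nested and always appear in exactly this upper-case form.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every <SEC-HEADER>...</SEC-HEADER> block
# from a string, including blocks whose tags sit on different lines.
sub strip_sec_header {
    my ($text) = @_;
    # /s lets "." match newlines, ".*?" is non-greedy so each block is
    # removed on its own, and /g repeats the substitution for every block.
    $text =~ s/<SEC-HEADER>.*?<\/SEC-HEADER>//gs;
    return $text;
}

# Sample input modeled on the test data above (not a real EDGAR filing).
my $sample = join "\n",
    '<html>',
    '<SEC-HEADER>foo</SEC-HEADER>',
    '<SEC-HEADER>',
    'bar</SEC-HEADER>',
    '</html>',
    '';

print strip_sec_header($sample);
```

To apply this to a file, the whole file has to be read into one scalar first (for example by locally undefining `$/` before reading), because a line-by-line `grep` never sees a start tag and its end tag in the same element.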

      • zhengmath
        New Member
        • Jun 2010
        • 3

        #4
        Thank you, guys. I figured it out. Thanks very much!
        I am trying to get familiar with Perl.
