Changing Tag case, while ignoring tag attribute values.

  • bluemaxx
    New Member
    • Jul 2007
    • 2


    I'm required to write a script that will take an HTML document and alter the tags to all be upper case.
    For example:
    Code:
    <html>
    <head>
    <title>Please help I'm going mad</title>
    </head>
    <body>
    <img src="pic.gif></img>
    </body>
    </html>
    would output
    Code:
    <HTML>
    <HEAD>
    <TITLE>Please help I'm going mad</TITLE>
    </HEAD>
    <BODY>
    <IMG SRC="pic.gif></IMG>
    </BODY>
    </HTML>
    I can open files and use the tr/// operator easily enough; it's just getting it to ignore anything outside <> and within "".

    Current code:
    Code:
    #!/usr/local/bin/perl
    print "Enter name of file you wish to edit.\n";
    chomp ($filename = <STDIN>);
    validateExtension();
    readTags();
    
    
    sub validateExtension
    {
    	if ($filename =~ m/\w{1,}[.]{1}[htmlHTML]{3,4}$/)
    	{
    		print "pass\n";
    	}
    else
    	{
    		print "fail\n";
    	}
    }	
    
    sub readTags
    {
    	open(READFILE, "$filename") || die "Couldn't open the file: $!";
    	while(<READFILE>)
    		{
    			$_ =~ tr/[a-z]/[A-Z]$/;
    		}
    	close(READFILE);
    }
  • miller
    Recognized Expert Top Contributor
    • Oct 2006
    • 1086

    #2
    First. Lower case is better than upper case. It's just easier to read.

    Second. Your html is malformed. You're missing the closing quote in the image tag.

    Third. Always, always, always "use strict;". It will simply require you to declare your variables, but this is a good thing.

    Fourth. Don't have your subroutines work on global variables. Always pass parameters to your subroutines. Always. It's the simplest way to document what things are doing.

    Fifth. You should probably do a little studying of regular expressions. I laud your attempt at using them to verify your data, but you need a little more knowledge.
    perldoc perlrequick

    Sixth. Your script won't currently edit the file. To do that you should read this. Specifically the second question is very relevant.
    perldoc perlfaq5 Files and Formats
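
One common way to edit a file in place is Perl's in-place edit variable `$^I`. The following is a minimal sketch, not code from this thread: the helper name `uc_tag_names_in_place` and the `.bak` backup suffix are assumptions chosen for illustration, and it only upper-cases tag names.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): rewrite a file in place,
# upper-casing tag names such as <html> or </body>.
sub uc_tag_names_in_place {
    my ($path) = @_;

    # Setting $^I turns on in-place editing for files read through <>;
    # the original is kept as "$path.bak" (an assumed suffix).
    local ($^I, @ARGV) = ('.bak', $path);

    while (my $line = <>) {
        $line =~ s{(</?\w+)}{\U$1}g;   # upper-case "<name" and "</name"
        print $line;                   # print goes to the rewritten file
    }
    return;
}

# Usage (assumes page.html exists in the current directory):
# uc_tag_names_in_place('page.html');
```

Keeping the backup suffix non-empty means the original file survives if the substitution turns out to be wrong.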

    Finally, here is your script so modified. Note, I personally would simplify it more by removing the subroutines. But I left them in there to demonstrate better coding practices with regard to subs.

    Code:
    #!/usr/local/bin/perl

    # Upper case all the tags within an html file.

    use Tie::File;

    use strict;
    use warnings;

    # Get filename from the command line
    my $filename = shift;

    # Else prompt for it
    if (!$filename) {
        print "Enter name of file you wish to edit.\n";
        chomp($filename = <STDIN>);
    }

    validate_extension($filename);
    uc_tags($filename);

    sub validate_extension {
        my $filename = shift or die "Filename required";
        print $filename =~ m{^\w+\.html?$}i ? "pass\n" : "fail\n";
    }

    sub uc_tags {
        my $filename = shift or die "Filename required";
        tie my @array, 'Tie::File', $filename or die "Can't open $filename: $!";

        foreach my $line (@array) {
            $line =~ s{(</?\w+)}{\U$1}g;
        }
    }

    1;

    __END__
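
The substitution in the script above upper-cases tag names only. The original example output also upper-cases attribute names (SRC) while leaving quoted values alone, so here is a rough sketch of one way that might be done. The helper names `uc_tags_and_attrs` and `uc_outside_quotes` are made up for illustration, and the sketch assumes well-formed tags with quoted attribute values: unquoted values, comments and doctype declarations would be upper-cased too, and a tag with an unbalanced quote is not handled reliably.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): upper-case everything inside
# each <...> tag except quoted attribute values.
sub uc_tags_and_attrs {
    my ($html) = @_;
    # Match one whole tag, treating each quoted string as a single unit
    # so that a '>' inside a quoted value does not end the tag early.
    $html =~ s{(<(?:"[^"]*"|'[^']*'|[^'">])*>)}{ uc_outside_quotes($1) }ge;
    return $html;
}

# Hypothetical helper: upper-case the unquoted pieces of a single tag.
sub uc_outside_quotes {
    my ($tag) = @_;
    # The capturing group makes split return the quoted pieces as well.
    my @pieces = split /("[^"]*"|'[^']*')/, $tag;
    for my $piece (@pieces) {
        $piece = uc $piece unless $piece =~ /^["']/;
    }
    return join '', @pieces;
}

print uc_tags_and_attrs(qq{<img src="pic.gif" alt='a cat'>Some text</img>\n});
# prints: <IMG SRC="pic.gif" ALT='a cat'>Some text</IMG>
```

For anything beyond a quick exercise, a real HTML parser is usually a safer choice than regular expressions.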

    - Miller

    • KevinADC
      Recognized Expert Specialist
      • Jan 2007
      • 4092

      #3
      The validation of the file name is not very good:

      Code:
      if ($filename =~ m/\w{1,}[.]{1}[htmlHTML]{3,4}$/)
      You are using a character class incorrectly: [htmlHTML]

      Anything inside a character class is interpreted not as a string but as individual characters in any order. So a file extension of .HhH will match as well as .html or any other valid html extension. All you really need is:

      Code:
      if ($filename =~ m/^.*?\.html?$/i)
      to see if a file is named with a .htm or .html extension. The "i" modifier ignores case, so upper and lower case will both match.
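
To make the difference concrete, here is a small sketch that runs both patterns over a few made-up file names. The helper name `has_html_extension` and the sample names are assumptions for illustration only; page.HhH passes the character-class pattern but fails the html? one.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the name ends in .htm or .html, any case.
sub has_html_extension {
    my ($filename) = @_;
    return $filename =~ m/^.*?\.html?$/i ? 1 : 0;
}

for my $name (qw(index.html INDEX.HTM page.HhH notes.txt)) {
    # The original pattern: any 3 or 4 characters drawn from h, t, m, l
    my $class  = $name =~ m/\w{1,}[.]{1}[htmlHTML]{3,4}$/ ? 'match' : 'no match';
    my $string = has_html_extension($name) ? 'match' : 'no match';
    print "$name: character class = $class, html? = $string\n";
}
```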

      Now later you have:

      Code:
                  $_ =~ tr/[a-z]/[A-Z]$/;
      The tr operator does not recognize [] as a character class, and the "$" at the end of the replacement side does nothing. "tr" has no concept of anchors (^$) like "m" and "s" do.

      You should just use a "range", which "tr" does understand:
      Code:
                 $_ =~ tr/a-z/A-Z/;
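
As a quick check of how tr treats brackets, the sketch below counts what each search list matches. With an empty replacement list tr leaves the string unchanged and returns the number of characters matched. The helper name `upcase_ascii` and the sample string are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: upper-case ASCII letters using a tr/// range.
sub upcase_ascii {
    my ($text) = @_;
    (my $copy = $text) =~ tr/a-z/A-Z/;   # work on a copy, keep the argument intact
    return $copy;
}

my $sample = 'a[b]c$';

# With an empty replacement list, tr/// only counts matching characters.
my $bracketed = ($sample =~ tr/[a-z]//);   # '[' and ']' are literal: a [ b ] c
my $range     = ($sample =~ tr/a-z//);     # letters only: a b c

print "with brackets: $bracketed, range only: $range\n";
# prints: with brackets: 5, range only: 3

print upcase_ascii($sample), "\n";
# prints: A[B]C$
```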

      Also, you have not attempted to differentiate between html tags and text at all. I realize that is where you are confused, but I would be more comfortable helping you with your course work if I saw some attempt to do so.

      • miller
        Recognized Expert Top Contributor
        • Oct 2006
        • 1086

        #4
        Sorry Kevin. I failed to clue in to the fact that this was most likely homework. I was busy remembering some unnamed obsessive compulsive programmer creating such a script back in the day to change all tag casing to lower case, as it should be.

        What do you think? Delete my provided code?

        - Miller

        • KevinADC
          Recognized Expert Specialist
          • Jan 2007
          • 4092

          #5
          No, don't delete your code. The OP has at least posted some code so appears to be making an effort.

          I wonder who this could be: unnamed obsessive compulsive programmer ;)

          • miller
            Recognized Expert Top Contributor
            • Oct 2006
            • 1086

            #6
            Sure thing.

            I went ahead and added a little comment at the top of the script to state what it does. The number of times I've reopened a script having no clue what it accomplishes... oi.

            - M

            • bluemaxx
              New Member
              • Jul 2007
              • 2

              #7
              Thank you for the help gents, much appreciated.
