Building a File Download Script with Perl

**KevinADC** · Aug 9 '08, 06:41 AM

I realize we don't have too many perl coders here, but any and all feedback will be appreciated. Here is my working draft so far. I only need to add the resources section (links to online reference material) to the end of the script and my copyright/license notice (Creative Commons License). Article begins now.

Note: You may skip to the end of the article if all you want is the perl code.

Introduction

Many websites have a form or a link you can use to download a file. You click a form button or a link, and after a moment or two a file download dialog box pops up in your web browser and prompts you for instructions such as “open” or “save”. I’m going to show you how to do that using a perl script.

This article will not teach you how to write perl programs but will introduce you to CGI scripting with perl and the CGI module that comes with perl. If you already have some experience with perl and CGI scripting you should still find the information in the article useful and hopefully interesting. At the end of the article is a list of online resources you can access for more information concerning some of the details that will be discussed in this article.

What You Need

Any recent version of perl (5.8 or newer should be good, since that is when Tie::File became part of the standard distribution) and a server to run the script on. You also need the ability to upload and run perl scripts on that server. A server that allows you to store files above the web root is preferable but not necessary; that’s the safest place to put files you don’t want people or ‘bots’ to be able to access directly. A little bit of prior HTML knowledge would be helpful but is not necessary.

The Perl Code

Just about all perl scripts that run as a CGI process need to start with what is called the shebang line. The most common shebang line is:

Code:

#!/usr/bin/perl

It simply tells the server where to find perl. The shebang line your server requires might be different. Most web hosts will have that information posted on their site somewhere. In the interest of good perl coding practices and CGI security we are going to add a switch to the shebang line: -T. Note: it must be an uppercase T.

Code:

#!/usr/bin/perl -T

The T stands for "taint" mode. This is really to prevent you, as the programmer of the script, from making a terrible mistake and allowing the users of your CGI form to send data to the server that can be used in an insecure way. All perl scripts that run as a CGI process should use the -T switch so I include it for that reason.

Modules

Modules are sort of like separate perl programs you can use in your perl program. Many people have written modules that have become standards that other perl programmers use all the time. We will be using these modules:

Code:

use strict;
use warnings;
use CGI;
use CGI::Carp qw/fatalsToBrowser/;
use Tie::File;

The first two are not actually modules; they are pragmas, which affect the way perl itself functions. I’m not going to explain them for the purpose of this article. You’ll need to trust me that they are important to use in nearly all of your perl programs. The "CGI" module is the module that will do most of the work for us: process form data, print HTTP headers, and more. The "CGI::Carp" module is really for debugging and may help you to get your script running if you have problems. If there are any fatal errors that cause the script to fail, it will print an error message to the screen. These are the same errors that will be printed in the server error log. “Tie::File” is a module that treats files like perl arrays. It has a simple interface that makes editing files in place very easy.

The next two lines in the program establish some important parameters:

Code:

$CGI::POST_MAX = 1024;
$CGI::DISABLE_UPLOADS = 1;

“POST_MAX” sets the maximum amount of data, in bytes, that the script will accept; anything larger causes the script to return an error. I have set this limit low (1024 bytes, or 1 KB) because this script needs very little data sent to it to work. The second line tells the script not to accept file uploads, which makes sense since we want to download files, not upload them. This prevents users from using a “hacked” form to send files to your script. Why is this important? Any form can be saved to a file and its HTML code changed, so the user can send anything he wants to your script. It's up to you to prevent this on the server end. What the user does on their end is entirely out of your control.

Setting Paths and Options

Code:

####################################
#### User Configuration Section ####
####################################

# The path to where the downloadable files are. 
# Preferably this should be above the web root folder.
my $path_to_files = '/home/users/downloads/';

# The path to the error log file
my $error_log     = '/home/users/downloads/logs/errors.txt';

# The path to the counter file
my $counter_log   = '/home/users/downloads/logs/counter.txt';

# Option to log errors: 1 = yes, 0 = no
my $log           = 1;

# Option to count downloads: 1 = yes, 0 = no
my $counter       = 1;

# Checks if someone is trying to hot-link to your script
my $url = 'http://www.yoursite.com';
########################################
#### End User Configuration Section ####
########################################

$path_to_files is the directory where you store the files to be downloaded. I recommend you store them in a folder that is not web accessible. This is commonly done by putting them in a folder parallel to your root web folder (public_html or www) or above it.

$error_log is the path to the errors.txt file that logs errors generated by the script.

$counter_log is the path to the counter.txt file that keeps track of how many times each file is downloaded.

$log and $counter turn the logs on or off.

$url should be the name of your website including the “http://” part.

Create the CGI object

Code:

my $q = CGI->new;

$q is the object we will use to execute various methods of the CGI module. I like to compare it to a butler. You tell the butler what you want and he knows how to get it done, you don’t have to worry about the details. Our “butler”, $q, will know what to do with the “commands” we will give him.

In reality, the CGI module has many “commands” you can give to the “butler”. We will use but a few of them. Learning to use the CGI module is almost like learning a small programming language. But the beauty is you only need to know what the commands do, not how they do it. Just like with a real butler, you have to trust that he knows what he is doing and will get the job done efficiently and effectively without you looking over his shoulder. I recommend you take the time to read the CGI module's documentation. Even if you don’t understand much of it, you should at least be familiar with the basic form processing methods. I leave that up to you.

Security Checkpoint

Never underestimate the need for security when running scripts as a CGI. We are going to use three “checkpoints” to detect any suspicious activity. The first checks the amount of data sent to the script. We give the cgi_error() command to our trusty butler, $q. An error that begins with “413” indicates that the limit we set for $CGI::POST_MAX has been exceeded.

Code:

if (my $error = $q->cgi_error()){
   if ($error =~ /^413\b/o) {
      error('Maximum data limit exceeded.');
   }
   else {
      error('An unknown error has occurred.');
   }
}

Next I am going to see if someone has altered the form to try to upload a file to the script. “multipart/form-data” must be used in a form's “enctype” attribute in order to send files.

Code:

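# CONTENT_TYPE is not set for plain GET requests, so default it to ''
# to avoid an "uninitialized value" warning.
if (($ENV{'CONTENT_TYPE'} || '') =~ m|^multipart/form-data|i) {
   error('Invalid Content-Type : multipart/form-data.');
}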

Next we check that the request to use the script comes from your website.

Code:

if ($ENV{'HTTP_REFERER'} && $ENV{'HTTP_REFERER'} !~ m|^\Q$url|io) {
   error('Access forbidden.')
}
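
To see what the referer check accepts and refuses, here is a small stand-alone sketch that does not need a web server. The sub name referer_ok and the sample addresses are my own inventions for illustration; only the pattern is taken from the check above.

```perl
use strict;
use warnings;

# Hypothetical site URL, standing in for the $url configuration setting.
my $url = 'http://www.yoursite.com';

# Returns 1 when the request would be allowed, 0 when it would be refused.
# Mirrors the rule above: an empty or missing referer is let through,
# anything else must begin with $url (case-insensitive).
sub referer_ok {
   my ($referer) = @_;
   return 1 unless $referer;
   return $referer =~ m|^\Q$url\E|i ? 1 : 0;
}

for my $r ('http://www.yoursite.com/downloads.html',
           'http://www.yoursite.com.evil.example/page.html',
           'http://other.example/page.html',
           '') {
   my $label = $r eq '' ? '(no referer)' : $r;
   printf "%-50s %s\n", $label, referer_ok($r) ? 'allowed' : 'refused';
}
```

Because the pattern only tests how the referer begins, the second sample address is allowed as well. Ending $url with a slash ('http://www.yoursite.com/') should tighten that up. Keep in mind that the referer is sent by the browser, so this check discourages casual hot-linking more than it stops a determined visitor.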

Get the Filename

I am going to use the Vars method to get all the parameters sent to the script into a hash. Once again, we call on “q” to do the actual work.

Code:

my %IN = $q->Vars;

Now we make sure there is a parameter named “file”.

Code:

my $file = $IN{'file'} or error('No file selected.');

Validate, Validate, Validate

You can’t say it enough: all data sent to a CGI script has to be validated. If we allowed just anything to be sent to the script, someone could send something like this: /foo/bar and, depending on the path you append that to, the script will obediently go find the foo directory and download the bar file. There are of course much worse things a person could try, but this is not an article about how to hack into a website using the front door. To prevent the user from getting away with such a dangerous stunt we need to validate the data sent to the script.

Code:

if ($file =~ /^(\w+[\w.-]+\.\w+)$/) {
   $file = $1;
}
else {
   error('Invalid characters in filename.');
}

The really cryptic looking part of that code, ($file =~ /^(\w+[\w.-]+\.\w+)$/), is called a regular expression (regexp). Typically a regexp is what you would use to validate/filter data. Regular expressions are way beyond the scope of this article. If you are interested in understanding that regexp you will have to read some regexp tutorials. See the online resources at the end of the article. Basically it is checking that the data is something like this: frog.gif, or puppy-dog.jpg, or meatloaf.txt. It checks for a restricted set of characters, “a-zA-Z0-9_-.”, in a basic filename format, filename.ext, and rejects anything else as invalid.

The above code is also “untainting” the data. Since the data will be used to open a file on the server we must untaint it to satisfy the -T switch that we are not doing anything insecure. The only way to untaint data is to use a regexp. The parentheses in the regexp store the pattern match in memory, and we get that value using $1. We then assign the value back to our variable $file. Now the data we will use to open the file is internal to our script and the -T switch will consider it safe to use. It’s up to you to know that your validation/filtering is sufficient for the task. If, for example, you used this pattern in the regexp: /(.*)/ the -T switch will not complain, but the data will be passed into the script just like it was entered in the form or sent via a hyperlink. That would be a silly thing to do.

If the data does not pass the validation routine a message is sent to the error subroutine and the user is alerted.
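
If you want to try the pattern against a few names before trusting it, a stand-alone sketch like the following may help. The sub name clean_filename and the sample names are assumptions made up for this illustration; the regexp itself is the one from the script.

```perl
use strict;
use warnings;

# Returns the cleaned filename, or undef when the name is rejected.
# The pattern is the one used in the script above.
sub clean_filename {
   my ($name) = @_;
   return undef unless defined $name;
   return $name =~ /^(\w+[\w.-]+\.\w+)$/ ? $1 : undef;
}

for my $name ('frog.gif', 'puppy-dog.jpg', '/foo/bar', '../../etc/passwd', 'a.gif') {
   my $clean = clean_filename($name);
   printf "%-20s %s\n", $name, defined $clean ? 'accepted' : 'rejected';
}
```

Note that the pattern as written needs at least two characters before the final dot, so a name like a.gif is rejected along with the dangerous ones. If you have files with one-letter names you would need to adjust the pattern.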

Ready for Downloading

Code:

if (download($file)) {
   # increment the file's download count
   counter($file) if ($counter);
}
else {
   error('An unknown error has occurred. Try again.');
}

The “if” condition calls the download() subroutine and checks that a true value is returned which indicates the file download was successful. If it was successful, the counter() subroutine is called and increments the file download count for that file. If the file download fails, the “else” condition is evaluated and alerts the user that an error occurred.

The download() Subroutine

Code:

sub download {
   my $file = $_[0] or return(0);

   # Uncomment the next line only for debugging the script 
   #open(my $DLFILE, '<', "$path_to_files/$file") or die "Can't open file '$path_to_files/$file' : $!";

   # Comment the next line if you uncomment the above line 
   open(my $DLFILE, '<', "$path_to_files/$file") or return(0);

   # This prints the download headers with the file size included
   # so you get a progress bar in the dialog box that displays during file downloads.
   print $q->header(-type            => 'application/x-download',
                    -attachment      => $file,
                    'Content-length' => -s "$path_to_files/$file",
   );

   binmode $DLFILE;
   print while <$DLFILE>;
   undef ($DLFILE);
   return(1);
}

The first line of the subroutine gets the filename or returns 0 (zero) back to the caller to indicate failure. There are two lines that open the file, one is for debugging purposes and one is for running the script when all is working properly. The next section of the code prints the headers that cause the web browser to download the file instead of trying to display it.

Code:

   print $q->header(-type            => 'application/x-download',
                    -attachment      => $file,
                    'Content-length' => -s "$path_to_files/$file",
   );

The “-type” option in the header() command sets the Content-Type header. The value 'application/x-download' is not a type the browser knows how to display, so it offers to download the file instead. The “-attachment” option marks the response as an attachment and defines the name of the file being downloaded. You could give the file any name you wanted to; it does not have to be the actual filename. That can be useful if you have a reason to hide the real name of the file or need to give the downloaded file a different name. The “Content-length” option uses the -s file test operator to get the size of the file. This allows the file download dialog box to display the file size and a progress bar and to estimate the time remaining to complete the download.
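
For the curious, this is roughly what those options turn into on the wire. The sketch below builds the header text by hand instead of calling header(), so the sub name download_headers is hypothetical and the exact output of the CGI module may differ in header order or include extra headers.

```perl
use strict;
use warnings;

# Hypothetical helper: builds a download header block by hand, without CGI.pm.
sub download_headers {
   my ($name, $size) = @_;
   return join("\r\n",
      'Content-Type: application/x-download',
      qq{Content-Disposition: attachment; filename="$name"},
      "Content-Length: $size",
      '', '');   # the empty line marks the end of the headers
}

print download_headers('frog.gif', 2048);
```

Everything the script prints after that empty line is treated by the browser as the contents of the file.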

The last four lines of the subroutine complete the download process.

Code:

   binmode $DLFILE;
   print while <$DLFILE>;
   undef ($DLFILE);
   return(1);

The binmode() function tells perl to read the file in “binary” mode, which means no newline translation is done on the data. On Unix-like servers it makes no practical difference, but on systems such as Windows it is necessary for anything other than plain text, because without it files like images and zip archives can arrive corrupted. On those systems you may also need to call binmode on STDOUT. See the binmode function's documentation for more details. The “print” line is what actually transfers the file from the server to the client. “undef” closes the file because I used an indirect filehandle. We return 1 (one) at the end of the subroutine to indicate success.
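
The “print while” loop reads the file one line at a time, and a binary file may contain very few newlines, so a “line” can be large. One alternative, shown here only as a sketch and not as part of the script, is to copy the file in fixed-size chunks with read(). The sub name copy_in_chunks and the 8192 byte default are assumptions of mine.

```perl
use strict;
use warnings;

# Copies everything from $in to $out in fixed-size chunks.
# Returns the number of bytes copied, or undef if a read fails.
sub copy_in_chunks {
   my ($in, $out, $chunk_size) = @_;
   $chunk_size ||= 8192;                  # assumed default buffer size
   my $total = 0;
   while (1) {
      my $got = read($in, my $buffer, $chunk_size);
      return undef unless defined $got;   # read error
      last if $got == 0;                  # end of file
      print {$out} $buffer;
      $total += $got;
   }
   return $total;
}

# Usage with in-memory filehandles so the sketch runs without any files.
my $data = "\x00\x01binary\ndata\xff" x 1000;
open(my $src, '<', \$data) or die "Can't open source: $!";
binmode $src;
my $copy = '';
open(my $dst, '>', \$copy) or die "Can't open destination: $!";
binmode $dst;
my $bytes = copy_in_chunks($src, $dst, 4096);
close $dst;
print "copied $bytes bytes\n";
```

In the download() subroutine the call would look something like copy_in_chunks($DLFILE, \*STDOUT), placed after the headers are printed.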

Subroutines

The “error” subroutine is very simple. It uses a few HTML generating methods to print a basic HTML document that displays the error message we send to it. The error message is stored in $_[0]. Each of these methods is discussed in the CGI module's documentation. If you have error logging turned on, the “log_error” function is also called. Any time the “error” subroutine is called it will display the HTML document and then terminate the script, which is what exit() does.

Code:

sub error {
   print $q->header(-type=>'text/html'),
         $q->start_html(-title=>'Error'),
         $q->h3("Error: $_[0]"),
         $q->end_html;
   log_error($_[0]) if $log;
   exit(0);
}

Next is the “log_error” subroutine. Each error the script detects can be logged so you can see how visitors to your site are misusing the script. This is good information to keep track of. It might be overkill, but I am a great believer in tracking errors since they can help you write more secure scripts and alert you to bots or people trying to abuse the script. It appends the errors and some other information to a file. I personally like to record the name/value pairs that are sent to the script to see if the form or query string has been altered by the user. Those values will be in $params, formatted like so: “name=value:::name=value:::name=value”. scalar localtime() is a convenience to you so you can easily read the date/time of the error. “time” records the date/time in epoch seconds, which is a standard way of recording the date/time so computer programs and scripts can make sense of it. It's ultimately up to you to decide what, if anything, to do with this information. I suggest you check the error log once in a while. You can delete it and the script will create a new one. Or turn off error logging entirely in the User Configuration Section of the script.

Code:

sub log_error {
   my $error = $_[0];

   #open (my $log, ">>", $error_log) or die "Can't open error log: $!";

   open (my $log, ">>", $error_log) or return(0);

   flock $log,2;
   my $params = join(':::', map{"$_=$IN{$_}"} keys %IN) || 'no params';
   print $log '"', join('","',time, 
                      scalar localtime(),
                      $ENV{'REMOTE_ADDR'},
                      $ENV{'SERVER_NAME'},
                      $ENV{'HTTP_HOST'},
                      $ENV{'HTTP_REFERER'},
                      $ENV{'HTTP_USER_AGENT'},
                      $ENV{'SCRIPT_NAME'},
                      $ENV{'REQUEST_METHOD'},
                      $params,
                      $error),
                      "\"\n";
}
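
Some of the environment variables logged above are not always set. HTTP_REFERER, for example, is missing when the browser sends no referer, and with “use warnings” that produces “uninitialized value” warnings in the server log. The following sketch shows one way to format the line that tolerates missing values. The sub name format_log_line is hypothetical, and doubling embedded quotes is my own addition so the line stays readable as quoted, comma-separated data.

```perl
use strict;
use warnings;

# Hypothetical helper: formats one log record as a quoted, comma-separated line.
# Undefined fields become empty strings and embedded double quotes are doubled.
sub format_log_line {
   my @fields = map { defined $_ ? $_ : '' } @_;
   s/"/""/g for @fields;
   return '"' . join('","', @fields) . qq{"\n};
}

# 192.0.2.1 is a documentation address; undef stands in for a missing referer.
print format_log_line(time, scalar localtime(), '192.0.2.1', undef, 'No file selected.');
```

Inside log_error() the same idea would mean passing the list of values through format_log_line() instead of joining them directly.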

The last subroutine is the counter. Because we are not creating a new file or appending to a file each time the script runs, I use Tie::File to modify just the line that has the filename and count. Tie::File is very handy for that purpose, but if you get a corrupted file you may have to stop counting. One bad thing about Tie::File is that it does not handle concurrent access to a file very well. I have used “flock” to lock the file, but that may not be sufficient on some systems. See the Tie::File documentation for details.

Basically the file will look like this:

Frog.jpg,12
Meatloaf.txt,10000
Babypics.zip,1234

The filename is on the left of the comma and the count is on the right. The “counter” subroutine should only be called if the download is successful, so the counts should be accurate.

Code:

sub counter {
   my $filename = $_[0] or return(0);

   #my $o = tie my @array, "Tie::File", $counter_log or die "Can't open counter log: $!";

   my $o = tie my @array, "Tie::File", $counter_log or return(0);

   $o->flock;
   my $flag = 0;
   if ($array[0]) { 
      foreach my $line (@array) {
         my ($name,$count) = split(/,/,$line);
         if ($filename eq $name) {
            $count++; 
            $line = qq{$name,$count};
            $flag = 1;
            last;
         }
      }
      if ($flag == 0) {
         push @array, qq{$filename,1};
      }
   }
   else {$array[0] = qq{$filename,1};}
   undef $o;
   untie @array;
}
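
If you would like to experiment with the counting logic without touching a real counter file, the same idea can be tried on an ordinary array. This is only a sketch: bump_count is a made-up name, and it leaves out the file tying and locking that the real subroutine does.

```perl
use strict;
use warnings;

# Increments the count for $filename in a list of "name,count" lines,
# adding a new line when the file has not been counted before.
# Returns the new count.
sub bump_count {
   my ($lines, $filename) = @_;
   for my $line (@$lines) {
      my ($name, $count) = split /,/, $line;
      next unless defined $name && $name eq $filename;
      $count = ($count || 0) + 1;
      $line = "$name,$count";      # $line aliases the array element
      return $count;
   }
   push @$lines, "$filename,1";
   return 1;
}

my @log = ('Frog.jpg,12', 'Meatloaf.txt,10000');
bump_count(\@log, 'Frog.jpg');
bump_count(\@log, 'Babypics.zip');
print "$_\n" for @log;
```

Because the loop simply falls through to the push when nothing matches, an empty list needs no special case here.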

Resources

Resources will be here.

Complete script

Code:

#!/usr/bin/perl -T

# Copyright 2008 Kevin Ruggles.  All rights reserved.
# It may be used and modified freely, but I request that this copyright
# notice remain attached to the file.

## Load pragmas and modules
use strict;
use warnings;
use CGI;
use Tie::File;
# Uncomment the next line only for debugging the script.
#use CGI::Carp qw/fatalsToBrowser/;

# The next two lines are very important. Do not modify them
# if you do not understand what they do.
$CGI::POST_MAX = 1024;
$CGI::DISABLE_UPLOADS = 1; 


####################################
#### User Configuration Section ####
####################################

#/home/users/web/b706/ipw.beaspart/contacts/pages/error.html
# The path to where the downloadable files are. 
# Preferably this should be above the web root folder.
#my $path_to_files = '/home/users/downloads/';
my $path_to_files = '/home/users/web/b706/ipw.beaspart/downloads/';

# The path to the error log file
my $error_log     = '/home/users/web/b706/ipw.beaspart/downloads/logs/errors.txt';

# The path to the counter file
my $counter_log   = '/home/users/web/b706/ipw.beaspart/downloads/logs/counter.txt';

# Option to log errors: 1 = yes, 0 = no
my $log           = 1;

# Option to count downloads: 1 = yes, 0 = no
my $counter       = 1;

# Checks if someone is trying to hot-link to your script
my $url = 'http://www.beaspartyponies.com';
####################################
## End User Configuration Section ##
####################################

# Edit below here at your own risk

my $q = CGI->new;

######################################
## This section checks for a number ##
## of possible errors or suspicious ##
## activity.                        ##
######################################

# check to see if data limit is exceeded
if (my $error = $q->cgi_error()){
   if ($error =~ /^413\b/o) {
      error('Maximum data limit exceeded.');
   }
   else {
      error('An unknown error has occurred.');
   }
}

# Check to see if the content-type is acceptable.
# multipart/form-data indicates someone is trying
# to upload data to the script with a hacked form.
# $CGI::DISABLE_UPLOADS prevents uploads. This routine
# is to catch the attempt and log it.
# CONTENT_TYPE is not set for plain GET requests, so default it to ''.
if (($ENV{'CONTENT_TYPE'} || '') =~ m|^multipart/form-data|i) {
   error('Invalid Content-Type : multipart/form-data.');
}

# Check if the request came from your website, if not
# it indicates remote access or hot linking.
if ($ENV{'HTTP_REFERER'} && $ENV{'HTTP_REFERER'} !~ m|^\Q$url|io) {
   error('Access forbidden.')
}

################################
## End error checking section ##
################################

# Get the data sent to the script.
my %IN = $q->Vars;

# Parse the "file" parameter sent to the script.
my $file = $IN{'file'} or error('No file selected.');

# Here we untaint the filename and make sure there are no characters like '/' 
# in the name that could be used to download files from any folder on the website.
if ($file =~ /^(\w+[\w.-]+\.\w+)$/o) {
   $file = $1;
}
else {
   error('Invalid characters in filename.');
}	

# Check if the download succeeded
if (download($file)) {
   #increments the files download count
   counter($file) if ($counter);
}
else {
   error('An unknown error has occurred.');
}  

#################
## SUBROUTINES ##
#################

# download the file
sub download {
   my $file = $_[0] or return(0);

   # Uncomment the next line only for debugging the script 
   #open(my $DLFILE, '<', "$path_to_files/$file") or die "Can't open file '$path_to_files/$file' : $!";

   # Comment the next line if you uncomment the above line 
   open(my $DLFILE, '<', "$path_to_files/$file") or return(0);

   # This prints the download headers with the file size included
   # so you get a progress bar in the dialog box that displays during file downloads.
   print $q->header(-type            => 'application/x-download',
                    -attachment      => $file,
                    'Content-length' => -s "$path_to_files/$file",
   );

   binmode $DLFILE;
   print while <$DLFILE>;
   undef ($DLFILE);
   return(1);
}

# This is a very generic error page. You should make a better one.
sub error {
   print $q->header(-type=>'text/html'),
         $q->start_html(-title=>'Error'),
         $q->h3("Error: $_[0]"),
         $q->end_html;
   log_error($_[0]) if $log;
   exit(0);
}

# Log the error to a file
sub log_error {
   my $error = $_[0];

   # Uncomment the next line only for debugging the script
   #open (my $log, ">>", $error_log) or die "Can't open error log: $!";

   # Comment the next line if you uncomment the above line
   open (my $log, ">>", $error_log) or return(0);

   flock $log,2;
   my $params = join(':::', map{"$_=$IN{$_}"} keys %IN) || 'no params';
   print $log '"', join('","',time, 
                      scalar localtime(),
                      $ENV{'REMOTE_ADDR'},
                      $ENV{'SERVER_NAME'},
                      $ENV{'HTTP_HOST'},
                      $ENV{'HTTP_REFERER'},
                      $ENV{'HTTP_USER_AGENT'},
                      $ENV{'SCRIPT_NAME'},
                      $ENV{'REQUEST_METHOD'},
                      $params,
                      $error),
                      "\"\n";
}

# Increment the file download counter
sub counter {
   my $filename = $_[0] or return(0);

   # Uncomment the next line only for debugging the script 
   #my $o = tie my @array, "Tie::File", $counter_log or die "Can't open counter log: $!";

   # Comment the next line if you uncomment the above line 
   my $o = tie my @array, "Tie::File", $counter_log or return(0);

   $o->flock;
   my $flag = 0;
   if ($array[0]) { 
      foreach my $line (@array) {
         my ($name,$count) = split(/,/,$line);
         if ($filename eq $name) {
            $count++; 
            $line = qq{$name,$count};
            $flag = 1;
            last;
         }
      }
      if ($flag == 0) {
         push @array, qq{$filename,1};
      }
   }
   else {$array[0] = qq{$filename,1};}
   undef $o;
   untie @array;
}

**eWish** · Aug 9 '08, 06:14 PM

I felt that the article was well written. For me it had a nice flow to it and was informative.

In this section of code you are not closing the filehandle. Is that not necessary?

Code:

binmode $DLFILE;
print while <$DLFILE>;
undef ($DLFILE);
return(1);

Nice job Kevin!

**KevinADC** · Aug 9 '08, 06:50 PM

Originally posted by eWish

In this section of code you are not closing the filehandle. Is that not necessary?

I had originally explained that but removed it because the article seemed to be getting too long and I still have some stuff to add.

I am also a little surprised you don't know that it does close the filehandle. When you "undef" an indirect filehandle, $DLFILE in this case, that closes the file. The file would also be closed automatically once the $DLFILE scalar went out of scope, which is the end of the subroutine block. It could also be written using close():

close $DLFILE;

**eWish** · Aug 9 '08, 07:33 PM

I did not know that by using undef that it would close the file handle. I did know that it would go out of scope and not be an issue because of that. Anytime I open a filehandle I use close() at the end even though it goes out of scope.

--Kevin

**KevinADC** · Aug 9 '08, 09:13 PM

Originally posted by eWish

I did not know that by using undef that it would close the file handle. I did know that it would go out of scope and not be an issue because of that. Anytime I open a filehandle I use close() at the end even though it goes out of scope.

--Kevin

Quote from perldoc's open tutorial, perlopentut, indirect filehandles:

Another convenient behavior is that an indirect filehandle automatically closes when it goes out of scope or when you undefine it:

Code:

    sub firstline {
        open( my $in, shift ) && return scalar <$in>;
        # no close() required
    }
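
A quick way to see this for yourself is the sketch below. It assumes the File::Temp module that comes with perl; the file it writes is a throwaway temp file and is removed when the script exits.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file; it is removed automatically when the script exits.
my (undef, $path) = tempfile(UNLINK => 1);

{
   open(my $out, '>', $path) or die "Can't open '$path': $!";
   print {$out} "hello\n";
   # no close(): $out goes out of scope at the end of this block
}

# By now the handle has been closed and its buffer flushed,
# so the data is on disk and the file has a size.
my $size = -s $path;
print "file size after the block: $size\n";
```

An explicit close() still has one advantage: you can check its return value, which matters more when writing files than when reading them.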

**KevinADC** · Sep 19 '08, 05:59 AM

I have actually been doing a lot of freelancing lately so this article has been delayed but I do plan to finish it.

**KevinADC** · Dec 4 '08, 06:40 AM

I have completed and posted the article:

http://bytes.com/insights/perl/857373-how-make-file-download-script-perl#post3440183

I scaled it down a little from the above article but it is essentially the same. Any editorial comments please post them here.

**Markus** · Dec 4 '08, 03:36 PM

Shouldn't

Next I am going to see if someone has altered the form to try and upload a file to the script. “multi-part/form-data” must be used in a CGI forms “encypt” attribute in order to send files.

be

Next I am going to see if someone has altered the form to try and upload a file to the script. “multi-part/form-data” must be used in a CGI forms “enctype” attribute in order to send files.

.. paying attention to the bold'd text. I know nothing of Perl or CGI so maybe I'm wrong here. But, in HTML, I know the attribute is 'enctype'.

Very well documented article, btw.

**KevinADC** · Dec 4 '08, 05:33 PM

Oops, good catch. It very well should be 'enctype'. But I can no longer edit the article. Please make the correction for me if you want to.

Thanks,
Kevin

**Nepomuk** · Dec 4 '08, 05:58 PM

There, I was here anyway, so I've corrected that for you.

Great article by the way! Well done! You may want to post it in several parts though, as it's a bit overwhelming all at once.

Greetings,
Nepomuk

**KevinADC** · Dec 4 '08, 06:14 PM

Yes, it might be long still, I did cut it down from the originally planned article, but the user is free to skip to the code at the end of the article and then read any section of the article to understand a section of the code if need be. Does that make sense or seem plausible to you?

And thanks for editing the article to correct that error.

**numberwhun** · Dec 11 '08, 09:04 PM

I have to agree with Kevin. I usually get a bit "upset" when people break an article up into multiple parts, especially when it has to do with coding. If I am reading it, then I typically have the time to finish it and prefer it all on one page. But, that's my preference. :)

Great article Kevin! Thanks a bunch for writing it.

Regards,

Jeff

**KevinADC** · Dec 11 '08, 10:33 PM

Your money is in the mail Jeff. :)