How can I get my https file fetch working?

  • MimiMi
    New Member
    • Aug 2007
    • 42

    How can I get my https file fetch working?

    Hi y'all!
    I'm new to Perl, and I'm trying to automate a file fetch.

    I have this url (in this example called 'https://GetMyFile') which, when I paste it into a browser, gives me the "File Download" pop-up ("Do you want to open or save this file?"), and clicking 'Save' gives me the file I want.

    I would like to achieve the same result automatically, without having to paste the url into a browser, click 'Save' and specify where to save my file.

    So, here's my first attempt:
    ---------------------------------------------------------------------
    Code:
    use strict;
    use warnings;
    use WWW::Mechanize;
    use LWP::UserAgent;
    use LWP::Debug qw(+);      # verbose LWP tracing
    
    # Note: the proxy is set on this separate LWP::UserAgent object only;
    # $mech below is a different user agent and never sees these settings.
    my $ua = LWP::UserAgent->new;
    $ua->proxy( [qw( https http )], "myProxyAddress" );
    
    my $url = "https://GetMyFile";
    
    my $mech = WWW::Mechanize->new();
    
    print "Fetching $url\n";
    $mech->get( $url, ':content_file' => 'C:\Tmp\myFile.zip' );
    die "Ooops, that didn't work: ", $mech->response->status_line
        unless $mech->success;
    --------------------------------------------------------------------
    The thing is, I don't get the "Ooops" printout; instead, myFile.zip is downloaded to the correct location, but the file is corrupted. It seems it isn't downloaded entirely, since the file is much bigger when I download it manually.
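
    In case it helps with the diagnosis, here's the extra check I'm thinking of adding after the get (just a sketch, untested; status(), ct() and uri() are standard WWW::Mechanize accessors, and the URL and file path are the same placeholders as above):

    Code:
    use strict;
    use warnings;
    use WWW::Mechanize;
    
    my $url  = "https://GetMyFile";
    my $file = 'C:\Tmp\myFile.zip';
    
    my $mech = WWW::Mechanize->new();
    $mech->get( $url, ':content_file' => $file );
    
    # See whether the "zip" is really a zip, or an HTML page I was redirected to.
    print "HTTP status : ", $mech->status, "\n";
    print "Content-Type: ", $mech->ct, "\n";     # expect application/zip, not text/html
    print "Final URL   : ", $mech->uri, "\n";    # where any redirects ended up
    print "File size   : ", -s $file, " bytes\n";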

    Here are some debug printouts I get:

    Code:
    LWP::UserAgent::new: ()
    LWP::UserAgent::proxy: ARRAY(someHexNumber) myProxyAddress
    LWP::UserAgent::proxy: https myProxyAddress
    LWP::UserAgent::proxy: http myProxyAddress
    LWP::UserAgent::new: ()
    LWP::UserAgent::request: ()
    HTTP::Cookies::add_cookie_header: Checking GetMyFile for cookies
    LWP::UserAgent::send_request: GET https://GetMyFile
    LWP::UserAgent::_need_proxy: Not proxied
    LWP::Protocol::http::request: ()
    LWP::Protocol::collect: read 336 bytes
    LWP::UserAgent::request: Simple response: Found
    LWP::UserAgent::request: ()
    HTTP::Cookies::add_cookie_header: Checking GetMyFile for cookies
    LWP::UserAgent::send_request: GET https://GetMyFile
    LWP::UserAgent::_need_proxy: Not proxied
    LWP::Protocol::http::request: ()
    LWP::Protocol::collect: read 439 bytes
    LWP::Protocol::collect: read 176 bytes
    LWP::UserAgent::request: Simple response: Found
    
    ... Then these printouts are repeated
    
    ...
    
    LWP::UserAgent::_need_proxy: Not proxied
    LWP::Protocol::http::request: ()
    LWP::Protocol::collect: read 869 bytes
    LWP::Protocol::collect: read 4096 bytes
    LWP::Protocol::collect: read 4096 bytes
    LWP::Protocol::collect: read 2395 bytes
    LWP::UserAgent::request: Simple response: OK
    Fetching https://GetMyFile
    Any help or suggestions as to why I don't get the entire file (?) would be greatly appreciated!

    Cheers
    Last edited by eWish; Feb 4 '09, 10:49 PM. Reason: Please use the code tags.
  • KevinADC
    Recognized Expert Specialist
    • Jan 2007
    • 4092

    #2
    Looks like it should work. Don't know what the problem is.

    • numberwhun
      Recognized Expert Moderator Specialist
      • May 2007
      • 3467

      #3
      I agree with Kevin. Right off the bat, it looks like it might work, but I haven't gone through it thoroughly. What I can say is that you want to look at the book "Spidering Hacks". Specifically, this part here:

      The Internet, with its profusion of information, has made us hungry for ever more, ever better data. Out of necessity, many of us have become pretty adept with search engine queries, but there are times when even the most powerful search engines aren't enough. If you've ever wanted your data in a different form than it's presented, or wanted to collect data from several sites and see it side-by-side without the constraints of a browser, then Spidering Hacks is for you.

      Spidering Hacks takes you to the next level in Internet data retrieval--beyond search engines--by showing you how to create spiders and bots to retrieve information from your favorite sites and data sources. You'll no longer feel constrained by the way host sites think you want to see their data presented--you'll learn how to scrape and repurpose raw data so you can view it in a way that's meaningful to you.

      Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content.

      By the time you finish Spidering Hacks, you'll be able to:
      • Aggregate and associate data from disparate locations, then store and manipulate the data as you like
      • Gain a competitive edge in business by knowing when competitors' products are on sale, and comparing sales ranks and product placement on e-commerce sites
      • Integrate third-party data into your own applications or web sites
      • Make your own site easier to scrape and more usable to others
      • Keep up-to-date with your favorite comic strips, news stories, stock tips, and more without visiting the site every day

      Like the other books in O'Reilly's popular Hacks series, Spidering Hacks brings you 100 industrial-strength tips and tools from the experts to help you master this technology. If you're interested in data retrieval of any type, this book provides a wealth of data for finding a wealth of data.


      That will help you with a fetch using the Mechanize module.

      Regards,

      Jeff

      • MimiMi
        New Member
        • Aug 2007
        • 42

        #4
        Hi! Thank you so much, guys, for the quick feedback!
        I now know, though, that the problem is related to credentials...
        When, in the script, I change
        $mech->get( $url, ':content_file' => 'C:\Tmp\myFile.zip' );
        to
        $mech->get( $url, ':content_file' => 'C:\Tmp\myFile.html' );
        I can see that the downloaded file is indeed a webpage, and a login page at that..

        I don't really know how to solve this yet; I will have to investigate further.
        There is some autologin ASP session involved when fetching files from that location. The browser probably handles a lot of that behind the scenes, and I don't know exactly what's going on, which of course I need to understand in order to get my script to work.. These enterprise networks.. *sigh* :)...
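
        While I dig into it, here's a little probe I plan to run against that login page, to see which fields its form actually expects (just a sketch, untested; forms() is a standard WWW::Mechanize method, the rest are HTML::Form accessors, and the URL is the same placeholder as before):

        Code:
        use strict;
        use warnings;
        use WWW::Mechanize;
        
        my $url  = "https://GetMyFile";    # same placeholder as in my first post
        my $mech = WWW::Mechanize->new();
        $mech->get($url);
        
        # List every form on the page we actually received, with its action URL
        # and its input names, to see what a scripted login would have to fill in.
        for my $form ( $mech->forms ) {
            print "Form: ", ( $form->attr('name') || '(unnamed)' ), " -> ", $form->action, "\n";
            for my $input ( $form->inputs ) {
                printf "  %-8s %s\n", $input->type, ( $input->name || '(no name)' );
            }
        }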

        • numberwhun
          Recognized Expert Moderator Specialist
          • May 2007
          • 3467

          #5
          Originally posted by MimiMi
          Hi! Thank you so much, guys, for the quick feedback!
          I now know, though, that the problem is related to credentials...
          When, in the script, I change
          $mech->get( $url, ':content_file' => 'C:\Tmp\myFile.zip' );
          to
          $mech->get( $url, ':content_file' => 'C:\Tmp\myFile.html' );
          I can see that the downloaded file is indeed a webpage, and a login page at that..

          I don't really know how to solve this yet; I will have to investigate further.
          There is some autologin ASP session involved when fetching files from that location. The browser probably handles a lot of that behind the scenes, and I don't know exactly what's going on, which of course I need to understand in order to get my script to work.. These enterprise networks.. *sigh* :)...
          Check out the module documentation on CPAN for WWW::Mechanize. I am pretty positive that it provides options for logging in to such pages; you just have to code for it.

          I don't know if it will help any, but here is a script I wrote a while ago that logs into a website (you had to log in before you could see the list of files) and then downloads everything that was there:

          Code:
          #!/usr/bin/perl
          
          use strict;
          use warnings;
          use File::Basename;
          use WWW::Mechanize;
          use MIME::Base64;
          
          $|++;    # turn off output buffering
          
          my $username = "username";
          my $password = "password";
          my $url      = "http://www.site.com/page.asp";
          my $realm;                  # left undef here; use the realm your server reports
          my $tempfile = "temp.txt";
          
          my $agent = WWW::Mechanize->new();
          
          # Pre-built Basic authentication header sent along with the request.
          my @args = (
              Authorization => "Basic " . MIME::Base64::encode( $username . ':' . $password )
          );
          
          # Also register the credentials with the agent for challenge/response auth.
          $agent->credentials( $url, $realm, $username, $password );
          
          $agent->get( $url, @args );
          Obviously, the site name, username and password have all been changed to protect the innocent, and the above values should be replaced with whatever you are using.
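
          If you then want the response written straight to a file once the authenticated get succeeds, something along these lines should work (untested sketch, continuing from the script above; ':content_file' is the same option you are already using, and the path is only an example):

          Code:
          # Fetch with the same auth header, but stream the body into a local
          # file instead of keeping it in memory.
          $agent->get( $url, @args, ':content_file' => 'C:\Tmp\myFile.zip' );
          
          die "Download failed: ", $agent->response->status_line
              unless $agent->success;
          
          print "Saved ", -s 'C:\Tmp\myFile.zip', " bytes\n";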


          Regards,

          Jeff

          • KevinADC
            Recognized Expert Specialist
            • Jan 2007
            • 4092

            #6
            Look into Win32::IE::Mechanize, which can handle a lot more things than WWW::Mechanize can.

            • MimiMi
              New Member
              • Aug 2007
              • 42

              #7
              Hello again!
              I appreciate all your efforts to help me out here!

              I've been working on other things, but now it's time to get back to this. (I still haven't got it working).

              Here's my current status:

              The myFile.html I get from
              Code:
              $mech->get( $url, ':content_file' => 'C:\Tmp\myFile.html' );
              (see previous posts if I'm unclear)
              contains JavaScript. Here are some parts of the HTML file (including the JavaScript):

              Code:
              <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
              <HTML>
              	<head>
              		<title>TheCompany Portal Login</title>
              		<link type="text/css" rel="stylesheet" href="styles.css">	
              		<META HTTP-EQUIV="Pragma" CONTENT="no-cache">
              		<META HTTP-EQUIV="Expires" CONTENT="-1">
              		<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type">
              
              			<SCRIPT LANGUAGE="JavaScript">
              				function resetCredFields()
              				{
              				  document.Login.PASSWORD.value = "";
              				}
              
              				function submitForm()
              				{		 
              				     document.Login.submit();
              				}
              
              				function cancelLogin()
              				{
                  					window.history.go(-1);
              				}
              
              				if (top.frames.length > 1)
              				{
              					top.location.href = document.location;
              				}
              
              				function checkEnter(event)
              				{
              					var code = 0;
              					NS4 = (document.layers) ? true : false;
              					if (NS4)
              					code = event.which;
              					else
              					code = event.keyCode;
              					if (code==13)
              					document.Login.submit();
              				}
              
              			</SCRIPT>
              
              
              	</head>
              	
              <BODY topmargin="0" leftmargin="0" marginwidth="0" marginheight="0">
              
              
              
               
              <table height="95%" width="100%" border="0" cellspacing="0" cellpadding="0">
              <tr>
              ... And so on and so forth..

              I don't know whether WWW::Mechanize can work with JavaScript at all.. is that even possible? And how could I provide the JavaScript with the right credentials?

              Cheers
              Last edited by eWish; Feb 27 '09, 11:47 PM. Reason: Please use the code tags

              • MimiMi
                New Member
                • Aug 2007
                • 42

                #8
                Sorry sorry.. I don't need to waste your time by asking silly questions such as whether WWW::Mechanize works with JavaScript; that wasn't hard to find out for myself. The answer is NO, unfortunately.

                Have to figure out how to solve this then.. some other way.. :/

                Cheers

                • KevinADC
                  Recognized Expert Specialist
                  • Jan 2007
                  • 4092

                  #9
                  I guess you missed my previous post:

                  Look into Win32::IE::Mechanize, which can handle a lot more things than WWW::Mechanize can.

                  • MimiMi
                    New Member
                    • Aug 2007
                    • 42

                    #10
                    Hi!
                    KevinADC: Yes, that's right, I missed looking into Win32::IE::Mechanize, sorry about that!
                    Now I've started looking into it, and it seems to fit my needs somewhat better. It feels like I'm almost there, but I still don't see how to get my files downloaded without providing any manual input whatsoever.

                    As of now, an IE browser starts up and I get to the file download prompt, but I don't want to have to manually click
                    "Save" and specify a location etc.. Plus, I don't want IE to show at all.. Is that possible?

                    This script is to be run at a server, so I want everything to be "invisible" ..

                    Here's my current script:
                    Code:
                    use strict;
                    use warnings;
                    use Win32::IE::Mechanize;
                    
                    # visible => 1 opens a visible Internet Explorer window
                    my $ie = Win32::IE::Mechanize->new( visible => 1 );
                    
                    my $username = "user";
                    my $password = "pwd";
                    
                    my $url = "http://weblink.To.TheFile";
                    my $realm;
                    
                    $ie->credentials( 'myHostname:myPort', $realm, $username, $password );
                    
                    print "Fetching $url\n";
                    $ie->get( $url, ':content_file' => 'C:\Temp\result\result.zip' );
                    die "Ooops, this didn't work: ", $ie->response->status_line
                        unless $ie->success;
                    Last edited by numberwhun; Mar 9 '09, 03:50 PM. Reason: Please use code tags

                    • KevinADC
                      Recognized Expert Specialist
                      • Jan 2007
                      • 4092

                      #11
                      Sorry, but I don't know the answer or have any suggestions for your last questions. All I can suggest is to carefully read the module's documentation and see if there is anything that can help you solve those parts of your question.

                      • Icecrack
                        Recognized Expert New Member
                        • Sep 2008
                        • 174

                        #12
                        That is not a Perl issue, it's a browser issue. You have to look into your browser settings, or use the first version of Google Chrome, which started the download as soon as a file was clicked. (This was changed in newer versions because it is a security risk, which is why they now have a save prompt.)

                        • AsiaWired
                          New Member
                          • Apr 2010
                          • 1

                          #13
                          Silly question...

                          Why not use wget?

                          No need for a big Perl script--assuming you're running *nix.

                          man wget

                          You can use it to imitate a browser, including login information and site cookies, while downloading files or webpages.

                          Example:

                          Code:
                           wget -m -c --convert-links --user="Mister Man" --password=PreTTyPlease --load-cookies cookie.txt --user-agent="Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)" http://www.your-special-site.com/get-that-archive.zip
                          You can automate the process via a cronjob on your *nix server to get those files from the remote location.
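
                          For example, a crontab entry along these lines would pull the file every night (sketch only; the schedule, the local paths and the output filename are placeholders to adapt, and the wget options are the same ones used above plus -O to name the saved file):

                          Code:
                          # run every night at 02:00
                          0 2 * * * /usr/bin/wget --user="Mister Man" --password=PreTTyPlease --load-cookies /home/you/cookie.txt --user-agent="Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)" -O /home/you/get-that-archive.zip "http://www.your-special-site.com/get-that-archive.zip"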

                          Of course, there are some security issues with putting your password into a shell command, and anyone with access to your crontab will be able to see it in plain text...but there are other options if you need more security.
