Warning: robots.txt unreliable in Apache servers

**Benjamin Niemann** · Oct 30 '05, 09:05 AM

Re: Warning: robots.txt unreliable in Apache servers

Anonymous, quoting Philip Ronan wrote:
> I recently discovered that robots.txt files aren't necessarily any use on
> Apache servers.
>
> For some reason, the Apache developers decided to treat multiple
> consecutive forward slashes in a request URI as a single forward slash. So
> for example, <http://apache.org/foundation/> and
> <http://apache.org//////foundation/> both resolve to the same page.

I could not find anything about the semantics of empty path segments in http
URLs, but this behaviour seems to be common practice. What about IIS or
other webservers?

> Let's suppose the Apache website owners want to stop search engine robots
> crawling through their "foundation" pages. They could put this rule in
> their robots.txt file:
>
> Disallow: /foundation/
>
> But if I posted a link to //////foundation/ somewhere, the search engines
> will be quite happy to index it because it isn't covered by this rule.
>
> As a result of all this, Google is currently indexing a page on my website
> that I specifically asked it to stay away from :-(

I would tend to blame Googlebot (and any other affected robot). Unless a
different behaviour ('...foo//bar...' and '...foo/bar...' resolve to
different resources on the server) is common practice, the robot should
normalize such paths (removing empty segments) before matching them against
the rules from the robots.txt file.
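
As a rough illustration of that normalization step, here is a minimal Python sketch. The names `collapse_slashes` and `is_disallowed` are made up for this example, and the matching is plain prefix matching; it is an assumption about how a robot could apply the rules, not a description of what Googlebot or any other crawler actually does.

```python
import re
from typing import Iterable
from urllib.parse import urlsplit


def collapse_slashes(path: str) -> str:
    """Replace every run of two or more slashes with a single slash."""
    return re.sub(r"/{2,}", "/", path)


def is_disallowed(url: str, disallow_rules: Iterable[str], normalize: bool = True) -> bool:
    """Return True if the URL path starts with any Disallow prefix.

    Hypothetical helper: simple prefix matching only, assumed for illustration.
    """
    path = urlsplit(url).path or "/"
    if normalize:
        path = collapse_slashes(path)
    # An empty Disallow value blocks nothing, so empty rules are skipped.
    return any(bool(rule) and path.startswith(rule) for rule in disallow_rules)


rules = ["/foundation/"]
url = "http://apache.org//////foundation/"
print(is_disallowed(url, rules, normalize=False))  # raw path is compared as-is
print(is_disallowed(url, rules))                   # path is collapsed first
```

Without normalization the raw path `//////foundation/` does not start with `/foundation/`, which is the gap described in the original post.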

--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/

**Nick Kew** · Oct 30 '05, 10:05 AM

Re: Warning: robots.txt unreliable in Apache servers

Anonymous wrote:
> For some reason, the Apache developers decided to treat multiple consecutive
> forward slashes in a request URI as a single forward slash. So for example,
> <http://apache.org/foundation/> and <http://apache.org//////foundation/>
> both resolve to the same page.

Yep. If you apply filesystem semantics to that, you have a whopping
great security hole. Of course you could just return "bad request",
but that just transfers the risk, leaving server admins to shoot
their own feet.

There was a story in The Register a couple of weeks ago about someone
who got a criminal conviction (for attempted unauthorized access)
after he requested a URL like that and it triggered an intrusion
detection alarm.

If you have links to things like "////" and dumb robots, put the
paths in your robots.txt. Don't forget that robots.txt is only
advisory and is commonly ignored by evil and/or broken robots.

--
Nick Kew

**Philip Ronan** · Oct 30 '05, 11:25 AM

Re: Warning: robots.txt unreliable in Apache servers

"Nick Kew" wrote:
> If you have links to things like "////" and dumb robots, put the
> paths in your robots.txt. Don't forget that robots.txt is only
> advisory and is commonly ignored by evil and/or broken robots.

But retroactively adding to the robots.txt file every time someone posts a
bad link to your site just isn't a practical solution. I realize not all
robots bother with the robots.txt protocol, but if even the legitimate
spiders can be misdirected then the whole point of having a robots.txt file
goes out the window.

--
phil [dot] ronan @ virgin [dot] net

http://vzone.virgin.net/phil.ronan/

**Nick Kew** · Oct 30 '05, 01:15 PM

Re: Warning: robots.txt unreliable in Apache servers

Philip Ronan wrote:

[please don't crosspost without warning. Or with inadequate context]

> "Nick Kew" wrote:
>
>> If you have links to things like "////" and dumb robots, put the
>> paths in your robots.txt. Don't forget that robots.txt is only
>> advisory and is commonly ignored by evil and/or broken robots.
>
> But retroactively adding to the robots.txt file every time someone posts a
> bad link to your site just isn't a practical solution.

Who said anything about that? What's impractical about "Disallow //" ?

--
Nick Kew

**Stan Brown** · Oct 30 '05, 01:15 PM

Re: Warning: robots.txt unreliable in Apache servers

Sun, 30 Oct 2005 09:34:36 +0000 from Nick Kew
<nick@asgard.webthing.com>:
> If you have links to things like "////" and dumb robots, put the
> paths in your robots.txt. Don't forget that robots.txt is only
> advisory and is commonly ignored by evil and/or broken robots.

Wouldn't it be more effective to have any URL containing http://.*//
return a 403 Forbidden or a 404 Not Found? This could be done in
.htaccess or perhaps httpd.conf. I may be having a failure of
imagination, but I can't think of any legitimate reason for such a
link.
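
To make the idea concrete outside of Apache configuration, here is a hedged Python sketch of the same policy as a WSGI wrapper. The names `reject_double_slashes` and `demo_app` are invented for this example, and it assumes the server hands the application the raw request path in `PATH_INFO`; some servers collapse the slashes before the application ever sees them.

```python
def reject_double_slashes(app):
    """Wrap a WSGI app so that any path containing '//' gets a 404."""
    def middleware(environ, start_response):
        if "//" in environ.get("PATH_INFO", ""):
            body = b"Not Found\n"
            start_response("404 Not Found", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Paths without empty segments are passed through untouched.
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in application used only for this illustration."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]


wrapped = reject_double_slashes(demo_app)
for path in ("/contact/", "//contact/"):
    wrapped({"PATH_INFO": path}, lambda status, headers: print(path, "->", status))
```

Swapping the status line for `403 Forbidden` gives the other variant mentioned above.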

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA

Oak Road Systems -- Software Since 1984

http://OakRoadSystems.com/

HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2.1 spec: http://www.w3.org/TR/CSS21/
validator: http://jigsaw.w3.org/css-validator/
Why We Won't Help You:

http://diveintomark.org/archives/2003/05/05/why_we_wont_help_you

**Philip Ronan** · Oct 30 '05, 02:55 PM

Re: Warning: robots.txt unreliable in Apache servers

"Nick Kew" wrote:
> [please don't crosspost without warning. Or with inadequate context]

My original post was copied over to ciwah, so now there are two threads with
the same subject. I'm trying to tie them together, mkay?

> Philip Ronan wrote:
>>
>> But retroactively adding to the robots.txt file every time someone posts a
>> bad link to your site just isn't a practical solution.
>
> Who said anything about that?

You did, in your earlier post:

If you have links to things like
"////" and dumb robots, put the paths in your robots.txt.

> What's impractical about "Disallow //" ?

It's a partial solution. If you're trying to protect content at deeper
levels in the hierarchy, you will also need:

Disallow: /path//to/file
Disallow: /path/to//file
Disallow: /path//to//file
Disallow: /path///to/file
etc.

As I said, robots.txt is inadequate for this purpose because it doesn't
support pattern matching.
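
To show how quickly that list grows, here is a small Python sketch that enumerates the spellings of one path. The function `slash_variants` and its `max_run` cap are assumptions made for this illustration; a real link can contain any number of slashes, so no finite list is complete.

```python
from itertools import product


def slash_variants(path: str, max_run: int = 2) -> list:
    """List every spelling of `path` with 1..max_run slashes before each segment."""
    segments = path.strip("/").split("/")
    variants = []
    for runs in product(range(1, max_run + 1), repeat=len(segments)):
        variants.append("".join("/" * n + seg for n, seg in zip(runs, segments)))
    return variants


variants = slash_variants("/path/to/file", max_run=2)
print(len(variants))  # 2 ** 3 spellings for three segments
for v in variants:
    print("Disallow:", v)
```

The count is `max_run` raised to the number of path segments, so allowing up to three slashes per position already needs 27 rules for this one file.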

--
phil [dot] ronan @ virgin [dot] net

http://vzone.virgin.net/phil.ronan/

**Philip Ronan** · Oct 30 '05, 02:55 PM

Re: Warning: robots.txt unreliable in Apache servers

In comp.infosystems.www.authoring.html, "Stan Brown" wrote:

> Wouldn't it be more effective to have any URL containing http://.*//
> return a 403 Forbidden or a 404 Not Found? This could be done in
> .htaccess or perhaps httpd.conf. I may be having a failure of
> imagination, but I can't think of any legitimate reason for such a
> link.

That would also be effective, but maybe it's better to do something useful
with the URL if you can.

Most servers will redirect to a URL with a trailing slash when the name of a
directory is requested. Why not treat multiple slashes in a similar way?

Besides, it might help in terms of page rank.
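
A minimal Python sketch of that redirect idea, written as a WSGI wrapper. The names `redirect_double_slashes` and `demo_app` are hypothetical, the application is assumed to be mounted at the site root, and the server is assumed to pass the raw path through in `PATH_INFO`. This is not the .htaccess rule or PHP script from the original post.

```python
import re


def redirect_double_slashes(app):
    """Wrap a WSGI app so paths with repeated slashes get a 301 to the clean path."""
    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")
        clean = re.sub(r"/{2,}", "/", path)
        if clean != path:
            query = environ.get("QUERY_STRING", "")
            location = clean + ("?" + query if query else "")
            start_response("301 Moved Permanently", [
                ("Location", location),
                ("Content-Length", "0"),
            ])
            return [b""]
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in application used only for this illustration."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]


wrapped = redirect_double_slashes(demo_app)
wrapped({"PATH_INFO": "//contact/"},
        lambda status, headers: print(status, dict(headers).get("Location")))
```

A robot that follows the redirect ends up requesting `/contact/`, which an ordinary `Disallow: /contact/` rule does cover.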

[[Crossposted to alt.internet.search-engines, with apologies to Nick]]

--
phil [dot] ronan @ virgin [dot] net

http://vzone.virgin.net/phil.ronan/

**David** · Oct 30 '05, 03:55 PM

Re: Warning: robots.txt unreliable in Apache servers

On Sun, 30 Oct 2005 11:15:03 +0000, Philip Ronan
<invalid@invalid.invalid> wrote:

>"Nick Kew" wrote:
>
>> If you have links to things like "////" and dumb robots, put the
>> paths in your robots.txt. Don't forget that robots.txt is only
>> advisory and is commonly ignored by evil and/or broken robots.
>
>But retroactively adding to the robots.txt file every time someone posts a
>bad link to your site just isn't a practical solution. I realize not all
>robots bother with the robots.txt protocol, but if even the legitimate
>spiders can be misdirected then the whole point of having a robots.txt file
>goes out the window.

A simple solution would be to add the robots meta tag to all pages you
don't want indexing as a backup for when someone links with //. Kind
of defeats the whole point of using a robots.txt file, but what else
can you do?

David
--
Free Search Engine Optimization Tutorial

SEO Gold Services

http://www.seo-gold.com/tutorial/


**David Ross** · Oct 30 '05, 06:06 PM

Re: Warning: robots.txt unreliable in Apache servers

Anonymous, quoting Philip Ronan wrote:
>
> Subject: Warning: robots.txt unreliable in Apache servers
> From: Philip Ronan <invalid@invalid.invalid>
> Newsgroups: alt.internet.search-engines
> Message-ID: <BF89BF33.39FDF%invalid@invalid.invalid>
> Date: Sat, 29 Oct 2005 23:07:46 GMT
>
> Hi,
>
> I recently discovered that robots.txt files aren't necessarily any use on
> Apache servers.
>
> For some reason, the Apache developers decided to treat multiple consecutive
> forward slashes in a request URI as a single forward slash. So for example,
> <http://apache.org/foundation/> and <http://apache.org//////foundation/>
> both resolve to the same page.
>
> Let's suppose the Apache website owners want to stop search engine robots
> crawling through their "foundation" pages. They could put this rule in their
> robots.txt file:
>
> Disallow: /foundation/
>
> But if I posted a link to //////foundation/ somewhere, the search engines
> will be quite happy to index it because it isn't covered by this rule.
>
> As a result of all this, Google is currently indexing a page on my website
> that I specifically asked it to stay away from :-(
>
> You might want to check the behaviour of your servers to see if you're
> vulnerable to the same sort of problem.
>
> If anyone's interested, I've put together a .htaccess rule and a PHP script
> that seem to sort things out.

I thought that parsing and processing a robots.txt file is the
responsibility of the bot and not the Web server. All the Web
server has to do is deliver the robots.txt file to the bot.

If that is true, the problem lies within Google and not Apache.

--

David E. Ross
<URL:http://www.rossde.com/>

I use Mozilla as my Web browser because I want a browser that
complies with Web standards. See <URL:http://www.mozilla.org/>.

**Guy Macon** · Oct 30 '05, 06:56 PM

Re: Warning: robots.txt unreliable in Apache servers

David Ross wrote:
>
>Philip Ronan wrote:
>>
>> I recently discovered that robots.txt files aren't necessarily any use on
>> Apache servers.
>>
>> For some reason, the Apache developers decided to treat multiple consecutive
>> forward slashes in a request URI as a single forward slash. So for example,
>> <http://apache.org/foundation/> and <http://apache.org//////foundation/>
>> both resolve to the same page.
>>
>> Let's suppose the Apache website owners want to stop search engine robots
>> crawling through their "foundation" pages. They could put this rule in their
>> robots.txt file:
>>
>> Disallow: /foundation/
>>
>> But if I posted a link to //////foundation/ somewhere, the search engines
>> will be quite happy to index it because it isn't covered by this rule.
>>
>> As a result of all this, Google is currently indexing a page on my website
>> that I specifically asked it to stay away from :-(
>>
>> You might want to check the behaviour of your servers to see if you're
>> vulnerable to the same sort of problem.
>>
>> If anyone's interested, I've put together a .htaccess rule and a PHP script
>> that seem to sort things out.
>
>I thought that parsing and processing a robots.txt file is the
>responsibility of the bot and not the Web server. All the Web
>server has to do is deliver the robots.txt file to the bot.
>
>If that is true, the problem lies within Google and not Apache.

I was about to opine that "http://apache.org//////" is not the same
as "http://apache.org/", but it appears that IIS has the same behavior:
See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
Is there something in the specs that says that treating "//////" and
"/" the same is proper behavior?

--
Guy Macon <http://www.guymacon.com/>

**Brian Wakem** · Oct 30 '05, 07:26 PM

Re: Warning: robots.txt unreliable in Apache servers

Guy Macon <http://www.guymacon.com/> wrote:

> I was about to opine that "http://apache.org//////" is not the same
> as "http://apache.org/", but it appears that IIS has the same behavior:
> See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
> Is there something in the specs that says that treating "//////" and
> "/" the same is proper behavior?

Don't know, but it seems to be the case on Unix/Linux filesystems too.

If I 'cd //////usr////////////local////apache2' I end up
in /usr/local/apache2

The web servers are probably mimicking this behaviour.
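
The same collapsing can be seen from Python's `posixpath.normpath`, which is pure string manipulation modelled on the POSIX rules rather than the kernel's own path resolution. One wrinkle worth knowing: POSIX leaves exactly two leading slashes implementation-defined, so `normpath` preserves them, while three or more are treated as one.

```python
import posixpath

# Six leading slashes and long interior runs all collapse to single slashes.
print(posixpath.normpath("//////usr////////////local////apache2"))
# -> /usr/local/apache2

# Exactly two leading slashes are preserved, as POSIX permits.
print(posixpath.normpath("//usr/local/apache2"))
# -> //usr/local/apache2
```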

--
Brian Wakem
Email: http://homepage.ntlworld.com/b.wakem/myemail.png

**Jim Moe** · Oct 30 '05, 08:25 PM

Re: Warning: robots.txt unreliable in Apache servers

Guy Macon wrote:
>
> I was about to opine that "http://apache.org//////" is not the same
> as "http://apache.org/", but it appears that IIS has the same behavior:
> See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
> Is there something in the specs that says that treating "//////" and
> "/" the same is proper behavior?

You are referring to which specs?
This behavior for following paths is from unix and is how all C
compilers handle paths. It is simply applied to URLs as well. There may
even be a requirement in the C specification about paths.

--
jmm (hyphen) list (at) sohnen-moe (dot) com
(Remove .AXSPAMGN for email)

**Dave0x1** · Oct 30 '05, 08:55 PM

Re: Warning: robots.txt unreliable in Apache servers

Guy Macon wrote:

> I was about to opine that "http://apache.org//////" is not the same
> as "http://apache.org/", but it appears that IIS has the same behavior:
> See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
> Is there something in the specs that says that treating "//////" and
> "/" the same is proper behavior?

Hint: Read the documentation offered at either of the first two URLs.

I don't understand why this is a big deal. The issue can be addressed
by numerous methods, including patching of the Apache web server source
code.

It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results. Does
someone have an example?

Dave

**Philip Ronan** · Oct 30 '05, 11:55 PM

Re: Warning: robots.txt unreliable in Apache servers

"Dave0x1" wrote:
> I don't understand why this is a big deal. The issue can be addressed
> by numerous methods, including patching of the Apache web server source
> code.

OK, so as long as the robots.txt documentation includes a note saying that
you have to patch your server software to get reliable results, then we'll
all be fine.

> It's not clear exactly what the problem *is*. I've never seen a URL
> with multiple adjacent forward slashes in my search results. Does
> someone have an example?

Which bit didn't I explain properly? I'm not going to post a link for you to
check, but here's the response I got from Google on the issue:

>> Thank you for your note. We apologize for our delayed response.
>> We understand you're concerned about the inclusion of
>> http://###.####.###//contact/ in our index.
>>
>> It's important to note that we visited the live page in question
>> and found that it currently exists on the web as listed above.
>> Because this page falls outside your robots.txt file, you may
>> want to use meta tags to remove this page from our index. For
>> more information about using meta tags, please visit
>> http://www.google.com/remove.html
>>
>> [remainder snipped]

I didn't publish the link to //contact/, someone else did. So that means the
robots.txt protocol is ineffective on (probably) most servers because it can
be circumvented without your knowledge by a third party.

Hope that's all clear now.

--
phil [dot] ronan @ virgin [dot] net

http://vzone.virgin.net/phil.ronan/
