Warning: robots.txt unreliable in Apache servers
-
Borek -
Robi
Re: Warning: robots.txt unreliable in Apache servers
Borek wrote in message news:op.szim0cam584cds@borek...
> On Mon, 31 Oct 2005 15:58:43 +0100, Robi wrote:
>
> > and at the same time posting b0rken links.
>
> Is it a typo, or bad joke? ;)
nothing to do with your name, sorry ;-)
-
Philip Ronan
Re: Warning: robots.txt unreliable in Apache servers
"Robi" wrote:
> Philip Ronan wrote in message news:BF8BAF28.3A0BA%invalid@invalid.invalid...
>>
>> Maybe there's something wrong with your newsreader then.
>> <http://groups.google.com/group/comp.....html/msg/9a0f
>> 7baad24c74dc>
>
> I don't know what is worse than telling someone
> "there's something wrong with newsreader"
> and at the same time posting b0rken links.
.... using a crap newsreader and blaming everyone else when it doesn't work?
If your newsreader can't handle this link:
<http://groups.google.com/group/comp.....html/msg/9a0f
7baad24c74dc>
then try this one instead: <http://tinyurl.com/89bmv>
If you're not too busy then try this one too:
<http://rfc.net/rfc2396.html#sE .>
--
phil [dot] ronan @ virgin [dot] net
-
Guy Macon
Re: Warning: robots.txt unreliable in Apache servers
Tim wrote:
>
>Philip Ronan:
>
>> the robots.txt protocol is ineffective on (probably) most servers because
>> it can be circumvented without your knowledge by a third party.
>
>It always has been, anyway. For numerous reasons. Your multiple slash
>example is just one of them. Some robots will ignore them altogether,
>others will deliberately look at what you tell them to ignore.
The robots.txt protocol has always been ineffective on bad
robots, but this is, as far as I know, the first example of
it being ineffective on good robots.
--
Guy Macon <http://www.guymacon.com>
-
Guy Macon
Re: Warning: robots.txt unreliable in Apache servers
D. Stussy wrote:
>
>Guy Macon wrote:
>
>> I am still hoping that one of the .htaccess experts will come up
>> with a way to make all multiple-slash requests 301 redirect to
>> their single-slash versions.
>
>Trivial. Do it yourself.
What I described appears to not only be non-trivial, but also
appears to be impossible. Feel free to prove me wrong by posting
a counterexample that redirects all multiple-slash requests to
their single-slash versions. I don't think that you can do it,
but I am not an expert on .htaccess wizardry, so I may be wrong.
One would think that if such a trivial fix existed that someone
in the last 40+ posts would have posted it, thus solving the
problem...
--
Guy Macon <http://www.guymacon.com/>
-
Philip Ronan
Re: Warning: robots.txt unreliable in Apache servers
"Guy Macon" wrote:
> One would think that if such a trivial fix existed that someone
> in the last 40+ posts would have posted it, thus solving the
> problem...
Guy, if you've seen my solution at <http://tinyurl.com/89bmv> and you
haven't got access to PHP, you could try a recursive solution using
.htaccess by itself:
RewriteEngine On
RewriteCond %{REQUEST_URI} //+
RewriteRule ^(.*)//+(.*)$ $1/$2 [R=301,L]
I haven't tested this, but -- in theory -- if the server detects a cluster
of forward slashes in a request URI, it will redirect the client to a URI
containing a single slash in its place. If a request contains more than one
cluster of forward slashes, then the client will be redirected more than
once, but it should eventually get to the right place.
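As an illustrative aside, the redirect loop described above can be simulated offline. This is only a sketch: it uses Python's `re` module as a stand-in for the server's regex engine (an assumption; nothing here was run against Apache), and `redirect_chain` is a hypothetical helper name. Each pass through the loop represents one 301 redirect issued by the rule.

```python
import re

# The pattern from the RewriteRule above, transcribed unchanged.
RULE = re.compile(r"^(.*)//+(.*)$")


def redirect_chain(path, max_hops=50):
    """Return every URI the client would visit, starting with the request."""
    chain = [path]
    while RULE.match(path):
        if len(chain) > max_hops:
            raise RuntimeError("too many redirects")
        # One simulated redirect: substitute $1/$2 for the whole path.
        path = RULE.sub(r"\1/\2", path, count=1)
        chain.append(path)
    return chain


for hop in redirect_chain("/one////two//three"):
    print(hop)
```

In this simulation the chain does end at /one/two/three, but only after four redirects, because each hop removes a single slash.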
--
phil [dot] ronan @ virgin [dot] net
-
Alan J. Flavell
Re: Warning: robots.txt unreliable in Apache servers
On Mon, 31 Oct 2005, Philip Ronan wrote:
> RewriteRule ^(.*)//+(.*)$ $1/$2 [R=301,L]
Actually, a RewriteMatch would suffice, it doesn't need the
full panoply of mod_rewrite...
Your regex doesn't do quite what you hope, due to the greedy nature of
the first "(.*)"
Incidentally, I recommend "pcretest" for this kind of fun.
$ pcretest
PCRE version 3.9 02-Jan-2002
re> "^(.*)//+(.*)$"
data> /one////two/three
0: /one////two/three
1: /one//
2: two/three
As you see, $1 captures a pair of slashes which you really wanted
to be captured by your "//+" portion. As I say, I made the same
mistake at first.
I'd then got closer, with ^(.*?)/{2,}(.*)$ $1/$2
re> "^(.*?)/{2,}(.*)$"
data> /one////two/three
0: /one////two/three
1: /one
2: two/three
with the end result being /one/two/three , as desired.
I think your "//+" is pretty much synonymous with my "/{2,}";
the key difference is to make the first regex non-greedy.
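For readers without pcretest to hand, the same comparison can be sketched with Python's `re` module, which treats greedy and non-greedy quantifiers the same way for these two patterns. The function name `capture_groups` is made up for this illustration.

```python
import re

GREEDY = r"^(.*)//+(.*)$"
NON_GREEDY = r"^(.*?)/{2,}(.*)$"


def capture_groups(pattern, path):
    """Return the ($1, $2) pair captured by pattern, or None if no match."""
    match = re.match(pattern, path)
    return match.groups() if match else None


path = "/one////two/three"
print(capture_groups(GREEDY, path))      # ('/one//', 'two/three')
print(capture_groups(NON_GREEDY, path))  # ('/one', 'two/three')
```

With the greedy pattern, $1 keeps two of the four slashes, so the rewritten path still contains a cluster. The non-greedy pattern leaves the whole cluster to the `/{2,}` part.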
> If a request contains more than one cluster of forward slashes, then
> the client will be redirected more than once, but it should
> eventually get to the right place.
Indeed.
But aren't there also analogous abuse possibilities with things like
/././ and /.././ and so on?
-
Alan J. Flavell
Re: Warning: robots.txt unreliable in Apache servers
On Mon, 31 Oct 2005, Alan J. Flavell wrote:
> On Mon, 31 Oct 2005, Philip Ronan wrote:
>
> > RewriteRule ^(.*)//+(.*)$ $1/$2 [R=301,L]
>
> Actually, a RewriteMatch would suffice,
*RATS*: I meant of course "RedirectMatch". Sorry.
But I think the rest of what I posted is OK.
-
Philip Ronan
Re: Warning: robots.txt unreliable in Apache servers
"Alan J. Flavell" wrote:
> On Mon, 31 Oct 2005, Philip Ronan wrote:
>
>> RewriteRule ^(.*)//+(.*)$ $1/$2 [R=301,L]
>
> Actually, a [RedirectMatch] would suffice, it doesn't need the
> full panoply of mod_rewrite...
>
> Your regex doesn't do quite what you hope, due to the greedy nature of
> the first "(.*)"
Ah, well spotted. :-)
In which case, this ought to do the trick:
# Eliminate forward slash clusters
RedirectMatch 301 ^(.*?)//+(.*)$ $1/$2
> But aren't there also analogous abuse possibilities with things like
> /././ and /.././ and so on?
Another good point. I thought my server was already redirecting those, but
apparently not -- it was the browser correcting my URLs for me.
Perhaps someone can debug these for me?
# Replace /./ with /
RedirectMatch 301 ^(.*?)/\./(.*)$ $1/$2
# Replace /../foo/bar with /foo/bar (at beginning of URI)
RedirectMatch 301 ^/\.\./(.*)$ /$1
# Replace /foo/../bar with /bar
RedirectMatch 301 ^(.*?)/[^/]+/\.\./(.*)$ $1/$2
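One way to sanity-check rules like these without touching a live server is to replay them in a script. The sketch below is a rough offline model only. It transcribes the four patterns into Python's `re` syntax, joins the captured groups with a forward slash in every replacement, and assumes the server applies the first matching rule on each request. The names `RULES` and `follow_redirects` are invented for this example.

```python
import re

# (pattern, replacement) pairs modelled on the RedirectMatch lines above.
RULES = [
    (re.compile(r"^(.*?)//+(.*)$"), r"\1/\2"),              # slash clusters
    (re.compile(r"^(.*?)/\./(.*)$"), r"\1/\2"),             # /./
    (re.compile(r"^/\.\./(.*)$"), r"/\1"),                  # leading /../
    (re.compile(r"^(.*?)/[^/]+/\.\./(.*)$"), r"\1/\2"),     # /foo/../
]


def follow_redirects(path, max_hops=20):
    """Return the list of URIs visited until no rule matches any more."""
    hops = [path]
    for _ in range(max_hops):
        for pattern, replacement in RULES:
            if pattern.match(path):
                path = pattern.sub(replacement, path, count=1)
                hops.append(path)
                break
        else:
            # No rule matched: the request would be served as-is.
            return hops
    raise RuntimeError("redirect loop suspected")


print(follow_redirects("/a//b/./c/../d"))
```

In this model the request /a//b/./c/../d settles on /a/b/d after three hops. A path that ends in /. or /.. with no trailing slash is not matched by any of the four patterns, so it would pass through unchanged.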
Phil
--
phil [dot] ronan @ virgin [dot] net
-
Tim
Re: Warning: robots.txt unreliable in Apache servers
Tim:
>> It always has been, anyway. For numerous reasons. Your multiple slash
>> example is just one of them. Some robots will ignore them altogether,
>> others will deliberately look at what you tell them to ignore.
Guy Macon:
> The robots.txt protocol has always been ineffective on bad robots, but
> this is, as far as I know, the first example of it being ineffective on
> good robots.
I'm not so sure that it's a fault with robots.txt. After all,
strangeness notwithstanding, ///example isn't the same as /example.
Personally, I think this is an issue you'd need to deal with within the
server (e.g. filter requests to disallow access to URIs with multiple
consecutive slashes in them, rather than work around such conditions).
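To make the mismatch concrete, here is a deliberately simplified model of how a well-behaved robot might compare a request path with Disallow rules, assuming plain prefix matching as in the original robots.txt convention. It ignores user-agent sections, Allow lines and wildcards, and `allowed_by_robots` is a hypothetical name, not a function from any real crawler.

```python
def allowed_by_robots(path, disallow_prefixes):
    """Return True if no Disallow prefix matches the start of the path."""
    return not any(path.startswith(prefix) for prefix in disallow_prefixes)


disallow = ["/example"]
print(allowed_by_robots("/example", disallow))    # False: blocked as intended
print(allowed_by_robots("///example", disallow))  # True: the prefix no longer matches
```

Under this model the robot follows the rules exactly. It is the server's willingness to serve the same content for both paths that lets the second request through.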
--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.
This message was sent without a virus, please destroy some files yourself.
-
Dave0x01
Re: Warning: robots.txt unreliable in Apache servers
Borek wrote:
> On Sun, 30 Oct 2005 21:45:32 +0100, Dave0x1 <ask@example.com> wrote:
>
>
>>It's not clear exactly what the problem *is*. I've never seen a URL
>>with multiple adjacent forward slashes in my search results. Does
>>someone have an example?
<snip>
> All of these generated 404 in last few weeks on my site.
>
> No additional slashes inside of the url, although several times
> they were added at the end.
>
> &amp; vs & and wrong capitalization (bate, casc instead of BATE, CASC)
> are most prominent sources of errors. But it seems every error is possible
> :)
Sorry, I should've been more clear. I wanted to know whether anyone
could point to an actual URL (e.g., a search query) demonstrating that
URLs with multiple adjacent forward slashes are actually being indexed
by any of the major search engines. I haven't seen one.
However, I don't think that the original poster was concerned with
whether these multiple slashed URLs appear in the index as such, so it's
probably not terribly important.
Dave
-
Dave0x01
Re: Warning: robots.txt unreliable in Apache servers
Guy Macon wrote:
> Dave0x1 wrote:
>
>
>>It's not clear exactly what the problem *is*. I've never seen a URL
>>with multiple adjacent forward slashes in my search results.
>
>
> If there exists a way for someone else on the Internet to override
> your spidering decisions as defined in robots.txt, there will be
> those who use that ability to inconvenience, harass or hurt others.
A robots.txt file doesn't make any decisions about which parts of a site
are indexed; it merely offers suggestions.
Dave
-
Dave0x01
Re: Warning: robots.txt unreliable in Apache servers
Philip Ronan wrote:
> "Dave0x1" wrote:
>
>
>>I don't understand why this is a big deal. The issue can be addressed
>>by numerous methods, including patching of the Apache web server source
>>code.
>
>
> OK, so as long as the robots.txt documentation includes a note saying that
> you have to patch your server software to get reliable results, then we'll
> all be fine.
I wouldn't consider patching of the Apache source code either necessary
or desirable in this situation.
>>It's not clear exactly what the problem *is*. I've never seen a URL
>>with multiple adjacent forward slashes in my search results. Does
>>someone have an example?
>
>
> Which bit didn't I explain properly? I'm not going to post a link for you to
> check, but here's the response I got from Google on the issue:
>
>
>>>Thank you for your note. We apologize for our delayed response.
>>>We understand you're concerned about the inclusion of
>>>http://###.####.###//contact/ in our index.
Does the URL in question appear in the index as
<http://###.####.###//contact/>, or as <http://###.####.###/contact/>?
My assumption is the latter.
Dave
-
Big Bill
Re: Warning: robots.txt unreliable in Apache servers
On Wed, 02 Nov 2005 17:45:05 -0500, Dave0x01 <ask@example.com> wrote:
>Guy Macon wrote:
>
>> Dave0x1 wrote:
>>
>>
>>>It's not clear exactly what the problem *is*. I've never seen a URL
>>>with multiple adjacent forward slashes in my search results.
>>
>>
>> If there exists a way for someone else on the Internet to override
>> your spidering decisions as defined in robots.txt, there will be
>> those who use that ability to inconvenience, harass or hurt others.
>
>A robots.txt file doesn't make any decisions about which parts of a site
>are indexed; it merely offers suggestions.
>
>Dave
Which is a good way of putting it.
BB
--
www.kruse.co.uk/ seo@kruse.demon.co.uk
Elvis does my SEO
-
Guy Macon
Re: Warning: robots.txt unreliable in Apache servers
Dave0x01 wrote:
>
>Guy Macon wrote:
>
>> Dave0x1 wrote:
>>
>>>It's not clear exactly what the problem *is*. I've never seen a URL
>>>with multiple adjacent forward slashes in my search results.
>>
>> If there exists a way for someone else on the Internet to override
>> your spidering decisions as defined in robots.txt, there will be
>> those who use that ability to inconvenience, harass or hurt others.
>
>A robots.txt file doesn't make any decisions about which parts of a site
>are indexed; it merely offers suggestions.
A robots.txt file most certainly does decide which parts of a site
are indexed - by good robots. It offers suggestions that every good
robot obeys. The effect we are discussing allows someone else on the Internet
to override your good-robot spidering decisions as defined in robots.txt.