Warning: robots.txt unreliable in Apache servers

  • Borek

    #31
    Re: Warning: robots.txt unreliable in Apache servers

    On Mon, 31 Oct 2005 15:58:43 +0100, Robi <me@privacy.net> wrote:

    > and at the same time posting b0rken links.

    Is it a typo, or bad joke? ;)

    Best,
    Borek
    --





    • Robi

      #32
      Re: Warning: robots.txt unreliable in Apache servers

      Borek wrote in message news:op.szim0cam584cds@borek...
      > On Mon, 31 Oct 2005 15:58:43 +0100, Robi wrote:
      >
      > > and at the same time posting b0rken links.
      >
      > Is it a typo, or bad joke? ;)



      nothing to do with your name, sorry ;-)


      • Philip Ronan

        #33
        Re: Warning: robots.txt unreliable in Apache servers

        "Robi" wrote:
        > Philip Ronan wrote in message news:BF8BAF28.3A0BA%invalid@invalid.invalid...
        >>
        >> Maybe there's something wrong with your newsreader then.
        >> <http://groups.google.com/group/comp.....html/msg/9a0f
        >> 7baad24c74dc>
        >
        > I don't know what is worse than telling someone
        > "there's something wrong with newsreader"
        > and at the same time posting b0rken links.

        ... using a crap newsreader and blaming everyone else when it doesn't work?

        If your newsreader can't handle this link:
        <http://groups.google.com/group/comp.....html/msg/9a0f
        7baad24c74dc>

        then try this one instead: <http://tinyurl.com/89bmv>

        If you're not too busy then try this one too:
        <http://rfc.net/rfc2396.html#sE.>

        --
        phil [dot] ronan @ virgin [dot] net



        • Guy Macon

          #34
          Re: Warning: robots.txt unreliable in Apache servers




          Tim wrote:
          >
          >Philip Ronan:
          >
          >> the robots.txt protocol is ineffective on (probably) most servers because
          >> it can be circumvented without your knowledge by a third party.
          >
          >It always has been, anyway. For numerous reasons. Your multiple slash
          >example is just one of them. Some robots will ignore them altogether,
          >others will deliberately look at what you tell them to ignore.

          The robots.txt protocol has always been ineffective on bad
          robots, but this is, as far as I know, the first example of
          it being ineffective on good robots.
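
A minimal sketch of why an extra slash can slip past a well-behaved robot, assuming the robot does a plain prefix comparison between the requested path and each Disallow value. The function name, the rule list and the sample paths are hypothetical, and real crawlers may normalise paths differently.

```python
def is_disallowed(path, disallow_rules):
    """Simplified robots.txt check: blocked if the path starts with any rule."""
    return any(rule and path.startswith(rule) for rule in disallow_rules)


# Hypothetical rule set, equivalent to a single "Disallow: /contact/" line.
RULES = ["/contact/"]

for path in ["/contact/", "/contact/form.html", "//contact/", "///contact/form.html"]:
    # The server may serve the same page for every one of these paths,
    # but only the single-slash forms begin with the disallowed prefix.
    print(f"{path!r:28} disallowed={is_disallowed(path, RULES)}")
```

Under this model the multiple-slash forms are not covered by the rule, so a link pointing at them would look fetchable to a robot that honours robots.txt.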

          --
          Guy Macon <http://www.guymacon.com>



          • Guy Macon

            #35
            Re: Warning: robots.txt unreliable in Apache servers




            D. Stussy wrote:
            >
            >Guy Macon wrote:
            >
            >> I am still hoping that one of the .htaccess experts will come up
            >> with a way to make all multiple-slash requests 301 redirect to
            >> their single-slash versions.
            >
            >Trivial. Do it yourself.

            What I described appears to not only be non-trivial, but also
            appears to be impossible. Feel free to prove me wrong by posting
            a counterexample that redirects all multiple-slash requests to
            their single-slash versions. I don't think that you can do it,
            but I am not an expert on .htaccess wizardry, so I may be wrong.

            One would think that if such a trivial fix existed that someone
            in the last 40+ posts would have posted it, thus solving the
            problem...

            --
            Guy Macon <http://www.guymacon.com/>


            • Philip Ronan

              #36
              Re: Warning: robots.txt unreliable in Apache servers

              "Guy Macon" wrote:
              > One would think that if such a trivial fix existed that someone
              > in the last 40+ posts would have posted it, thus solving the
              > problem...

              Guy, if you've seen my solution at <http://tinyurl.com/89bmv> and you
              haven't got access to PHP, you could try a recursive solution using
              .htaccess by itself:

              RewriteEngine On
              RewriteCond %{REQUEST_URI} //+
              RewriteRule ^(.*)//+(.*)$ $1/$2 [R=301,L]

              I haven't tested this, but -- in theory -- if the server detects a cluster
              of forward slashes in a request URI, it will redirect the client to a URI
              containing a single slash in its place. If a request contains more than one
              cluster of forward slashes, then the client will be redirected more than
              once, but it should eventually get to the right place.
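
The redirect chain described above can be modelled in Python, under the assumption that each 301 applies the posted pattern once and the client then requests the result again. This only imitates the regex substitution; it says nothing about how Apache itself presents the URI to mod_rewrite. The names `follow_redirects` and `max_hops` are illustrative.

```python
import re

# The pattern from the RewriteRule above, expressed for Python's re module.
CLUSTER = re.compile(r"^(.*)//+(.*)$")


def follow_redirects(path, max_hops=20):
    """Return every path visited while the rule is applied once per hop."""
    hops = [path]
    for _ in range(max_hops):
        match = CLUSTER.match(hops[-1])
        if match is None:
            break  # no "//" left, so the server would serve the page
        # Equivalent of the "$1/$2" substitution target.
        hops.append(match.group(1) + "/" + match.group(2))
    return hops


for hop in follow_redirects("/one////two//three"):
    print(hop)
```

Every hop shortens the path by at least one character, so the chain always terminates; the hop limit is only a safety net.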

              --
              phil [dot] ronan @ virgin [dot] net



              • Alan J. Flavell

                #37
                Re: Warning: robots.txt unreliable in Apache servers

                On Mon, 31 Oct 2005, Philip Ronan wrote:

                > RewriteRule ^(.*)//+(.*)$ $1/$2 [R=301,L]

                Actually, a RewriteMatch would suffice, it doesn't need the
                full panoply of mod_rewrite...

                Your regex doesn't do quite what you hope, due to the greedy nature of
                the first "(.*)"

                Incidentally, I recommend "pcretest" for this kind of fun.

                $ pcretest
                PCRE version 3.9 02-Jan-2002

                re> "^(.*)//+(.*)$"
                data> /one////two/three
                0: /one////two/three
                1: /one//
                2: two/three

                As you see, $1 captures a pair of slashes which you really wanted
                to be captured by your "//+" portion. As I say, I made the same
                mistake at first.

                I'd then got closer, with ^(.*?)/{2,}(.*)$ $1/$2

                re> "^(.*?)/{2,}(.*)$"
                data> /one////two/three
                0: /one////two/three
                1: /one
                2: two/three

                with the end result being /one/two/three , as desired.

                I think your "//+" is pretty much synonymous with my "/{2,}";
                the key difference is to make the first regex non-greedy.
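
The same comparison can be reproduced without pcretest. A short sketch using Python's re module, which handles these two particular patterns the same way as the PCRE session above; the variable names are illustrative.

```python
import re

GREEDY = re.compile(r"^(.*)//+(.*)$")     # first group grabs as much as it can
LAZY = re.compile(r"^(.*?)/{2,}(.*)$")    # first group stops at the first cluster

path = "/one////two/three"
for name, pattern in (("greedy", GREEDY), ("lazy", LAZY)):
    match = pattern.match(path)
    rewritten = match.group(1) + "/" + match.group(2)
    print(name, match.groups(), "->", rewritten)
```

With the greedy pattern the rewritten path still contains a slash cluster and needs further redirects; with the lazy one a single pass is enough for this input.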

                > If a request contains more than one cluster of forward slashes, then
                > the client will be redirected more than once, but it should
                > eventually get to the right place.

                Indeed.

                But aren't there also analogous abuse possibilities with things like
                /././ and /.././ and so on?


                • Alan J. Flavell

                  #38
                  Re: Warning: robots.txt unreliable in Apache servers

                  On Mon, 31 Oct 2005, Alan J. Flavell wrote:

                  > On Mon, 31 Oct 2005, Philip Ronan wrote:
                  >
                  > > RewriteRule ^(.*)//+(.*)$ $1/$2 [R=301,L]
                  >
                  > Actually, a RewriteMatch would suffice,

                  *RATS*: I meant of course "RedirectMatch". Sorry.

                  But I think the rest of what I posted is OK.


                  • Philip Ronan

                    #39
                    Re: Warning: robots.txt unreliable in Apache servers

                    "Alan J. Flavell" wrote:
                    > On Mon, 31 Oct 2005, Philip Ronan wrote:
                    >
                    >> RewriteRule ^(.*)//+(.*)$ $1/$2 [R=301,L]
                    >
                    > Actually, a [RedirectMatch] would suffice, it doesn't need the
                    > full panoply of mod_rewrite...
                    >
                    > Your regex doesn't do quite what you hope, due to the greedy nature of
                    > the first "(.*)"

                    Ah, well spotted. :-)

                    In which case, this ought to do the trick:

                    # Eliminate forward slash clusters
                    RedirectMatch 301 ^(.*?)//+(.*)$ $1/$2

                    > But aren't there also analogous abuse possibilities with things like
                    > /././ and /.././ and so on?

                    Another good point. I thought my server was already redirecting those, but
                    apparently not -- it was the browser correcting my URLs for me.

                    Perhaps someone can debug these for me?

                    # Replace /./ with /
                    RedirectMatch 301 ^(.*?)/\./(.*)$ $1/$2

                    # Replace /../foo/bar with /foo/bar (at beginning of URI)
                    RedirectMatch 301 ^/\.\./(.*)$ /$1

                    # Replace /foo/../bar with /bar
                    RedirectMatch 301 ^(.*?)/[^/]+/\.\./(.*)$ $1/$2
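
One way to try these rules out before touching a live server is to replay them in Python. This is only a sketch: it assumes the first matching rule wins on each request, that the last rule was meant to join its two halves with a forward slash like the others, and that Python's re behaves like the server's regex engine for these patterns. `normalise` and `max_hops` are made-up names.

```python
import re

# (pattern, replacement) pairs mirroring the RedirectMatch lines above.
RULES = [
    (re.compile(r"^(.*?)//+(.*)$"), r"\1/\2"),           # slash clusters
    (re.compile(r"^(.*?)/\./(.*)$"), r"\1/\2"),          # /./
    (re.compile(r"^/\.\./(.*)$"), r"/\1"),               # /../ at the start
    (re.compile(r"^(.*?)/[^/]+/\.\./(.*)$"), r"\1/\2"),  # /foo/../
]


def normalise(path, max_hops=50):
    """Apply one matching rule per hop, as a chain of 301 redirects would."""
    for _ in range(max_hops):
        for pattern, replacement in RULES:
            new_path, count = pattern.subn(replacement, path)
            if count:
                path = new_path
                break  # one redirect issued; start again with the new path
        else:
            return path  # no rule matched, so nothing is left to redirect
    return path


for sample in ["/a/./b//c/../d", "/../secret/", "/a/.."]:
    print(sample, "->", normalise(sample))
```

Note that every dot-segment pattern requires a trailing slash, so a path ending in /. or /.. is not matched by any of the rules and passes through unchanged.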

                    Phil

                    --
                    phil [dot] ronan @ virgin [dot] net



                    • Tim

                      #40
                      Re: Warning: robots.txt unreliable in Apache servers

                      Tim:

                      >> It always has been, anyway. For numerous reasons. Your multiple slash
                      >> example is just one of them. Some robots will ignore them altogether,
                      >> others will deliberately look at what you tell them to ignore.

                      Guy Macon:

                      > The robots.txt protocol has always been ineffective on bad robots, but
                      > this is, as far as I know, the first example of it being ineffective on
                      > good robots.

                      I'm not so sure that it's a fault with robots.txt. After all,
                      strangeness notwithstanding, ///example isn't the same as /example.
                      Personally, I think this is an issue you'd need to deal with within the
                      server (e.g. filter requests to disallow access to URIs with multiple
                      consecutive slashes in them, rather than work around such conditions).
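
The "filter it in the server" idea can be sketched outside Apache as a small Python WSGI wrapper. This is not an Apache configuration, just an illustration of refusing such requests outright instead of redirecting them. It assumes the hosting server passes PATH_INFO through without collapsing slashes first; `reject_slash_clusters` and `demo_app` are hypothetical names.

```python
def reject_slash_clusters(app):
    """Wrap a WSGI app so that any path containing '//' gets a 404."""
    def middleware(environ, start_response):
        if "//" in environ.get("PATH_INFO", ""):
            body = b"Not Found\n"
            start_response("404 Not Found", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Single-slash paths are handed to the wrapped application untouched.
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Hypothetical stand-in for the real site."""
    body = b"ok\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


application = reject_slash_clusters(demo_app)
```

Answering with an error means the multiple-slash address never yields a page that could be indexed, at the cost of breaking any legitimate link that happens to contain a doubled slash.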

                      --
                      If you insist on e-mailing me, use the reply-to address (it's real but
                      temporary). But please reply to the group, like you're supposed to.

                      This message was sent without a virus, please destroy some files yourself.


                      • Dave0x01

                        #41
                        Re: Warning: robots.txt unreliable in Apache servers

                        Borek wrote:

                        > On Sun, 30 Oct 2005 21:45:32 +0100, Dave0x1 <ask@example.com> wrote:
                        >
                        >
                        >>It's not clear exactly what the problem *is*. I've never seen a URL
                        >>with multiple adjacent forward slashes in my search results. Does
                        >>someone have an example?

                        <snip>
                        > All of these generated 404 in last few weeks on my site.
                        >
                        > No additional slashes inside of the url, although several times
                        > they were added at the end.
                        >
                        > & vs &amp; and wrong capitalization (bate, casc instead of BATE, CASC)
                        > are most prominent sources of errors. But it seems every error is possible
                        > :)

                        Sorry, I should've been more clear. I wanted to know whether anyone
                        could point to an actual URL (e.g., a search query) demonstrating that
                        URLs with multiple adjacent forward slashes are actually being indexed
                        by any of the major search engines. I haven't seen one.

                        However, I don't think that the original poster was concerned with
                        whether these multiple slashed URLs appear in the index as such, so it's
                        probably not terribly important.


                        Dave



                        • Dave0x01

                          #42
                          Re: Warning: robots.txt unreliable in Apache servers

                          Guy Macon wrote:

                          > Dave0x1 wrote:
                          >
                          >
                          >>It's not clear exactly what the problem *is*. I've never seen a URL
                          >>with multiple adjacent forward slashes in my search results.
                          >
                          >
                          > If there exists a way for someone else on the Internet to override
                          > your spidering decisions as defined in robots.txt, there will be
                          > those who use that ability to inconvenience, harass or hurt others.

                          A robots.txt file doesn't make any decisions about which parts of a site
                          are indexed; it merely offers suggestions.

                          Dave


                          • Dave0x01

                            #43
                            Re: Warning: robots.txt unreliable in Apache servers

                            Philip Ronan wrote:

                            > "Dave0x1" wrote:
                            >
                            >
                            >>I don't understand why this is a big deal. The issue can be addressed
                            >>by numerous methods, including patching of the Apache web server source
                            >>code.
                            >
                            >
                            > OK, so as long as the robots.txt documentation includes a note saying that
                            > you have to patch your server software to get reliable results, then we'll
                            > all be fine.

                            I wouldn't consider patching of the Apache source code either necessary
                            or desirable in this situation.

                            >>It's not clear exactly what the problem *is*. I've never seen a URL
                            >>with multiple adjacent forward slashes in my search results. Does
                            >>someone have an example?
                            >
                            >
                            > Which bit didn't I explain properly? I'm not going to post a link for you to
                            > check, but here's the response I got from Google on the issue:
                            >
                            >
                            >>>Thank you for your note. We apologize for our delayed response.
                            >>>We understand you're concerned about the inclusion of
                            >>>http://###.####.###//contact/ in our index.

                            Does the URL in question appear in the index as
                            <http://###.####.###//contact/>, or as <http://###.####.###/contact/>?
                            My assumption is the latter.

                            Dave






                            • Big Bill

                              #44
                              Re: Warning: robots.txt unreliable in Apache servers

                              On Wed, 02 Nov 2005 17:45:05 -0500, Dave0x01 <ask@example.com> wrote:

                              >Guy Macon wrote:
                              >
                              >> Dave0x1 wrote:
                              >>
                              >>
                              >>>It's not clear exactly what the problem *is*. I've never seen a URL
                              >>>with multiple adjacent forward slashes in my search results.
                              >>
                              >>
                              >> If there exists a way for someone else on the Internet to override
                              >> your spidering decisions as defined in robots.txt, there will be
                              >> those who use that ability to inconvenience, harass or hurt others.
                              >
                              >A robots.txt file doesn't make any decisions about which parts of a site
                              >are indexed; it merely offers suggestions.
                              >
                              >Dave

                              Which is a good way of putting it.

                              BB
                              --
                              www.kruse.co.uk/ seo@kruse.demon.co.uk
                              Elvis does my SEO


                              • Guy Macon

                                #45
                                Re: Warning: robots.txt unreliable in Apache servers




                                Dave0x01 wrote:
                                >
                                >Guy Macon wrote:
                                >
                                >> Dave0x1 wrote:
                                >>
                                >>>It's not clear exactly what the problem *is*. I've never seen a URL
                                >>>with multiple adjacent forward slashes in my search results.
                                >>
                                >> If there exists a way for someone else on the Internet to override
                                >> your spidering decisions as defined in robots.txt, there will be
                                >> those who use that ability to inconvenience, harass or hurt others.
                                >
                                >A robots.txt file doesn't make any decisions about which parts of a site
                                >are indexed; it merely offers suggestions.

                                A robots.txt file most certainly does decide which parts of a site
                                are indexed - by good robots. It offers suggestions that every good
                                robot obeys. The effect we are discussing allows someone else on the
                                Internet to override your good-robot spidering decisions as defined in
                                robots.txt.

