Warning: robots.txt unreliable in Apache servers

  • Anonymous, quoting Philip Ronan

    Subject: Warning: robots.txt unreliable in Apache servers
    From: Philip Ronan <invalid@invalid.invalid>
    Newsgroups: alt.internet.search-engines
    Message-ID: <BF89BF33.39FDF%invalid@invalid.invalid>
    Date: Sat, 29 Oct 2005 23:07:46 GMT

    Hi,

    I recently discovered that robots.txt files aren't necessarily any use on
    Apache servers.

    For some reason, the Apache developers decided to treat multiple consecutive
    forward slashes in a request URI as a single forward slash. So for example,
    <http://apache.org/foundation/> and <http://apache.org//////foundation/>
    both resolve to the same page.

    Let's suppose the Apache website owners want to stop search engine robots
    crawling through their "foundation" pages. They could put this rule in their
    robots.txt file:

    Disallow: /foundation/

    But if I posted a link to //////foundation/ somewhere, the search engines
    will be quite happy to index it because it isn't covered by this rule.

    As a result of all this, Google is currently indexing a page on my website
    that I specifically asked it to stay away from :-(

    You might want to check the behaviour of your servers to see if you're
    vulnerable to the same sort of problem.

    If anyone's interested, I've put together a .htaccess rule and a PHP script
    that seem to sort things out.

    Phil

    --
    phil [dot] ronan @ virgin [dot] net


  • Benjamin Niemann

    #2
    Re: Warning: robots.txt unreliable in Apache servers

    Anonymous, quoting Philip Ronan wrote:
    > I recently discovered that robots.txt files aren't necessarily any use on
    > Apache servers.
    >
    > For some reason, the Apache developers decided to treat multiple
    > consecutive forward slashes in a request URI as a single forward slash. So
    > for example, <http://apache.org/foundation/> and
    > <http://apache.org//////foundation/> both resolve to the same page.

    I could not find anything about the semantics of empty path segments in http
    URLs, but this behaviour seems to be common practice. What about IIS or
    other webservers?

    > Let's suppose the Apache website owners want to stop search engine robots
    > crawling through their "foundation" pages. They could put this rule in
    > their robots.txt file:
    >
    > Disallow: /foundation/
    >
    > But if I posted a link to //////foundation/ somewhere, the search engines
    > will be quite happy to index it because it isn't covered by this rule.
    >
    > As a result of all this, Google is currently indexing a page on my website
    > that I specifically asked it to stay away from :-(

    I would tend to blame Googlebot (and any other affected robot). Unless a
    different behaviour ('...foo//bar...' and '...foo/bar...' resolve to
    different resources on the server) is common practice, the robot should
    normalize such paths (removing empty segments) before matching them against
    the rules from the robots.txt file.
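
A minimal Python sketch of the normalization suggested above, assuming robots.txt rules are plain path prefixes. The names `normalize_path` and `is_disallowed` are hypothetical and not taken from any real crawler:

```python
import re


def normalize_path(path: str) -> str:
    """Collapse every run of consecutive slashes into a single slash."""
    return re.sub(r"/{2,}", "/", path)


def is_disallowed(path: str, disallow_rules: list[str], normalize: bool = True) -> bool:
    """Prefix-match a request path against Disallow rules.

    With normalize=False the path is matched as written, which is the
    behaviour the thread complains about.
    """
    if normalize:
        path = normalize_path(path)
    # An empty Disallow value blocks nothing, so skip empty rules.
    return any(rule and path.startswith(rule) for rule in disallow_rules)


rules = ["/foundation/"]
print(is_disallowed("//////foundation/", rules, normalize=False))  # False
print(is_disallowed("//////foundation/", rules))                   # True
```

Whether a robot may safely collapse empty segments depends on servers treating both spellings as the same resource, which is exactly the open question in this thread.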

    --
    Benjamin Niemann
    Email: pink at odahoda dot de
    WWW: http://www.odahoda.de/

    • Nick Kew

      #3
      Re: Warning: robots.txt unreliable in Apache servers

      Anonymous wrote:
      > For some reason, the Apache developers decided to treat multiple consecutive
      > forward slashes in a request URI as a single forward slash. So for example,
      > <http://apache.org/foundation/> and <http://apache.org//////foundation/>
      > both resolve to the same page.

      Yep. If you apply filesystem semantics to that, you have a whopping
      great security hole. Of course you could just return "bad request",
      but that just transfers the risk leaving server admins to shoot
      their own feet.

      There was a story in The Register a couple of weeks ago about someone
      who got a criminal conviction (for attempted unauthorized access)
      after he requested a url like that and it triggered an intrusion
      detection alarm.

      If you have links to things like "////" and dumb robots, put the
      paths in your robots.txt. Don't forget that robots.txt is only
      advisory and is commonly ignored by evil and/or broken robots.

      --
      Nick Kew

      • Philip Ronan

        #4
        Re: Warning: robots.txt unreliable in Apache servers

        "Nick Kew" wrote:
        > If you have links to things like "////" and dumb robots, put the
        > paths in your robots.txt. Don't forget that robots.txt is only
        > advisory and is commonly ignored by evil and/or broken robots.

        But retroactively adding to the robots.txt file every time someone posts a
        bad link to your site just isn't a practical solution. I realize not all
        robots bother with the robots.txt protocol, but if even the legitimate
        spiders can be misdirected then the whole point of having a robots.txt file
        goes out the window.

        --
        phil [dot] ronan @ virgin [dot] net


        • Nick Kew

          #5
          Re: Warning: robots.txt unreliable in Apache servers

          Philip Ronan wrote:

          [please don't crosspost without warning. Or with inadequate context]

          > "Nick Kew" wrote:
          >
          >>If you have links to things like "////" and dumb robots, put the
          >>paths in your robots.txt. Don't forget that robots.txt is only
          >>advisory and is commonly ignored by evil and/or broken robots.
          >
          >
          > But retroactively adding to the robots.txt file every time someone posts a
          > bad link to your site just isn't a practical solution.

          Who said anything about that? What's impractical about "Disallow //" ?

          --
          Nick Kew

          • Stan Brown

            #6
            Re: Warning: robots.txt unreliable in Apache servers

            Sun, 30 Oct 2005 09:34:36 +0000 from Nick Kew
            <nick@asgard.webthing.com>:
            > If you have links to things like "////" and dumb robots, put the
            > paths in your robots.txt. Don't forget that robots.txt is only
            > advisory and is commonly ignored by evil and/or broken robots.

            Wouldn't it be more effective to have any URL containing http://.*//
            return a 403 Forbidden or a 404 Not Found? This could be done in
            .htaccess or perhaps httpd.conf. I may be having a failure of
            imagination, but I can't think of any legitimate reason for such a
            link.

            --
            Stan Brown, Oak Road Systems, Tompkins County, New York, USA

            HTML 4.01 spec: http://www.w3.org/TR/html401/
            validator: http://validator.w3.org/
            CSS 2.1 spec: http://www.w3.org/TR/CSS21/
            validator: http://jigsaw.w3.org/css-validator/
            Why We Won't Help You:

            • Philip Ronan

              #7
              Re: Warning: robots.txt unreliable in Apache servers

              "Nick Kew" wrote:
              > [please don't crosspost without warning. Or with inadequate context]

              My original post was copied over to ciwah, so now there are two threads with
              the same subject. I'm trying to tie them together, mkay?

              > Philip Ronan wrote:
              >>
              >> But retroactively adding to the robots.txt file every time someone posts a
              >> bad link to your site just isn't a practical solution.
              >
              > Who said anything about that?

              You did, in your earlier post:

              > If you have links to things like "////" and dumb robots, put the
              > paths in your robots.txt.

              > What's impractical about "Disallow //" ?

              It's a partial solution. If you're trying to protect content at deeper
              levels in the hierarchy, you will also need:

              Disallow: /path//to/file
              Disallow: /path/to//file
              Disallow: /path//to//file
              Disallow: /path///to/file
              etc.

              As I said, robots.txt is inadequate for this purpose because it doesn't
              support pattern matching.
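
To illustrate how quickly the list above grows, here is a rough Python sketch that enumerates the spellings of one path when each separator may be written with one or more slashes. The helper `slash_variants` and its `max_slashes` parameter are hypothetical, introduced only for this illustration:

```python
from itertools import product
from typing import Iterator


def slash_variants(path: str, max_slashes: int = 2) -> Iterator[str]:
    """Yield every spelling of `path` using 1..max_slashes slashes per separator."""
    segments = [s for s in path.split("/") if s]
    for counts in product(range(1, max_slashes + 1), repeat=len(segments)):
        yield "".join("/" * n + seg for n, seg in zip(counts, segments))


variants = list(slash_variants("/path/to/file", max_slashes=2))
print(len(variants))  # 8 spellings, including the canonical one
for variant in variants:
    print("Disallow:", variant)
```

With `max_slashes=3` the same three-segment path already has 27 spellings, and nothing limits a link to three slashes, so a prefix-only rule list can never be complete.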

              --
              phil [dot] ronan @ virgin [dot] net



              • Philip Ronan

                #8
                Re: Warning: robots.txt unreliable in Apache servers

                In comp.infosystems.www.authoring.html, "Stan Brown" wrote:
                > Wouldn't it be more effective to have any URL containing http://.*//
                > return a 403 Forbidden or a 404 Not Found? This could be done in
                > .htaccess or perhaps httpd.conf. I may be having a failure of
                > imagination, but I can't think of any legitimate reason for such a
                > link.

                That would also be effective, but maybe it's better to do something useful
                with the URL if you can.

                Most servers will redirect to a URL with a trailing slash when the name of a
                directory is requested. Why not treat multiple slashes in a similar way?

                Besides, it might help in terms of page rank.
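
The redirect idea can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the .htaccess rule or PHP script mentioned earlier in the thread; the `respond` helper and its return shape are assumptions:

```python
import re
from typing import List, Tuple


def respond(request_path: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return a (status, headers) pair for a request path (hypothetical helper).

    Paths containing repeated slashes get a permanent redirect to the
    collapsed spelling, much like the trailing-slash redirect for directories.
    """
    canonical = re.sub(r"/{2,}", "/", request_path)
    if canonical != request_path:
        return "301 Moved Permanently", [("Location", canonical)]
    return "200 OK", []


print(respond("//contact/"))  # ('301 Moved Permanently', [('Location', '/contact/')])
print(respond("/contact/"))   # ('200 OK', [])
```

A robot that follows the redirect then requests the canonical path, which the existing Disallow rule does cover.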

                [[Crossposted to alt.internet.search-engines, with apologies to Nick]]

                --
                phil [dot] ronan @ virgin [dot] net


                • David

                  #9
                  Re: Warning: robots.txt unreliable in Apache servers

                  On Sun, 30 Oct 2005 11:15:03 +0000, Philip Ronan
                  <invalid@invalid.invalid> wrote:

                  >"Nick Kew" wrote:
                  >
                  >> If you have links to things like "////" and dumb robots, put the
                  >> paths in your robots.txt. Don't forget that robots.txt is only
                  >> advisory and is commonly ignored by evil and/or broken robots.
                  >
                  >But retroactively adding to the robots.txt file every time someone posts a
                  >bad link to your site just isn't a practical solution. I realize not all
                  >robots bother with the robots.txt protocol, but if even the legitimate
                  >spiders can be misdirected then the whole point of having a robots.txt file
                  >goes out the window.

                  A simple solution would be to add the robots meta tag to all pages you
                  don't want indexing as a backup for when someone links with //. Kind
                  of defeats the whole point of using a robots.txt file, but what else
                  can you do?

                  David
                  --
                  Free Search Engine Optimization Tutorial

                  • David Ross

                    #10
                    Re: Warning: robots.txt unreliable in Apache servers

                    Anonymous, quoting Philip Ronan wrote:
                    >
                    >
                    > Subject: Warning: robots.txt unreliable in Apache servers
                    > From: Philip Ronan <invalid@invalid.invalid>
                    > Newsgroups: alt.internet.search-engines
                    > Message-ID: <BF89BF33.39FDF%invalid@invalid.invalid>
                    > Date: Sat, 29 Oct 2005 23:07:46 GMT
                    >
                    > Hi,
                    >
                    > I recently discovered that robots.txt files aren't necessarily any use on
                    > Apache servers.
                    >
                    > For some reason, the Apache developers decided to treat multiple consecutive
                    > forward slashes in a request URI as a single forward slash. So for example,
                    > <http://apache.org/foundation/> and <http://apache.org//////foundation/>
                    > both resolve to the same page.
                    >
                    > Let's suppose the Apache website owners want to stop search engine robots
                    > crawling through their "foundation" pages. They could put this rule in their
                    > robots.txt file:
                    >
                    > Disallow: /foundation/
                    >
                    > But if I posted a link to //////foundation/ somewhere, the search engines
                    > will be quite happy to index it because it isn't covered by this rule.
                    >
                    > As a result of all this, Google is currently indexing a page on my website
                    > that I specifically asked it to stay away from :-(
                    >
                    > You might want to check the behaviour of your servers to see if you're
                    > vulnerable to the same sort of problem.
                    >
                    > If anyone's interested, I've put together a .htaccess rule and a PHP script
                    > that seem to sort things out.

                    I thought that parsing and processing a robots.txt file is the
                    responsibility of the bot and not the Web server. All the Web
                    server has to do is deliver the robots.txt file to the bot.

                    If that is true, the problem lies within Google and not Apache.

                    --

                    David E. Ross
                    <URL:http://www.rossde.com/>

                    I use Mozilla as my Web browser because I want a browser that
                    complies with Web standards. See <URL:http://www.mozilla.org/>.

                    • Guy Macon

                      #11
                      Re: Warning: robots.txt unreliable in Apache servers



                      David Ross wrote:
                      >
                      >Philip Ronan wrote:
                      >>
                      >> I recently discovered that robots.txt files aren't necessarily any use on
                      >> Apache servers.
                      >>
                      >> For some reason, the Apache developers decided to treat multiple consecutive
                      >> forward slashes in a request URI as a single forward slash. So for example,
                      >> <http://apache.org/foundation/> and <http://apache.org//////foundation/>
                      >> both resolve to the same page.
                      >>
                      >> Let's suppose the Apache website owners want to stop search engine robots
                      >> crawling through their "foundation" pages. They could put this rule in their
                      >> robots.txt file:
                      >>
                      >> Disallow: /foundation/
                      >>
                      >> But if I posted a link to //////foundation/ somewhere, the search engines
                      >> will be quite happy to index it because it isn't covered by this rule.
                      >>
                      >> As a result of all this, Google is currently indexing a page on my website
                      >> that I specifically asked it to stay away from :-(
                      >>
                      >> You might want to check the behaviour of your servers to see if you're
                      >> vulnerable to the same sort of problem.
                      >>
                      >> If anyone's interested, I've put together a .htaccess rule and a PHP script
                      >> that seem to sort things out.
                      >
                      >I thought that parsing and processing a robots.txt file is the
                      >responsibility of the bot and not the Web server. All the Web
                      >server has to do is deliver the robots.txt file to the bot.
                      >
                      >If that is true, the problem lies within Google and not Apache.

                      I was about to opine that "http://apache.org//////" is not the same
                      as "http://apache.org/", but it appears that IIS has the same behavior:
                      See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
                      Is there something in the specs that says that treating "//////" and
                      "/" the same is proper behavior?

                      --
                      Guy Macon <http://www.guymacon.com/>



                      • Brian Wakem

                        #12
                        Re: Warning: robots.txt unreliable in Apache servers

                        Guy Macon <http://www.guymacon.com/> wrote:

                        > I was about to opine that "http://apache.org//////" is not the same
                        > as "http://apache.org/", but it appears that IIS has the same behavior:
                        > See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
                        > Is there something in the specs that says that treating "//////" and
                        > "/" the same is proper behavior?


                        Don't know, but it seems to be the case on Unix/Linux filesystems too.

                        If I 'cd //////usr////////////local////apache2' I end up
                        in /usr/local/apache2

                        The web servers are probably mimicking this behaviour.
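
The same collapsing can be reproduced with Python's standard `posixpath.normpath`, shown here purely as an illustration of the filesystem convention (it says nothing about how any particular web server is implemented):

```python
import posixpath

# Runs of slashes collapse to one, as with the `cd` example above.
collapsed = posixpath.normpath("//////usr////////////local////apache2")
print(collapsed)  # /usr/local/apache2

# Exception: exactly two leading slashes are preserved, because POSIX
# leaves their meaning implementation-defined.
double_lead = posixpath.normpath("//usr/local")
print(double_lead)  # //usr/local
```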


                        --
                        Brian Wakem
                        Email: http://homepage.ntlworld.com/b.wakem/myemail.png

                        • Jim Moe

                          #13
                          Re: Warning: robots.txt unreliable in Apache servers

                          Guy Macon wrote:
                          >
                          > I was about to opine that "http://apache.org//////" is not the same
                          > as "http://apache.org/", but it appears that IIS has the same behavior:
                          > See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
                          > Is there something in the specs that says that treating "//////" and
                          > "/" the same is proper behavior?
                          You are referring to which specs?
                          This behavior for following paths is from unix and is how all C
                          compilers handle paths. It is simply applied to URLs as well. There may
                          even be a requirement in the C specification about paths.

                          --
                          jmm (hyphen) list (at) sohnen-moe (dot) com
                          (Remove .AXSPAMGN for email)

                          • Dave0x1

                            #14
                            Re: Warning: robots.txt unreliable in Apache servers

                            Guy Macon wrote:

                            > I was about to opine that "http://apache.org//////" is not the same
                            > as "http://apache.org/", but it appears that IIS has the same behavior:
                            > See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
                            > Is there something in the specs that says that treating "//////" and
                            > "/" the same is proper behavior?

                            Hint: Read the documentation offered at either of the first two URLs.

                            I don't understand why this is a big deal. The issue can be addressed
                            by numerous methods, including patching of the Apache web server source
                            code.

                            It's not clear exactly what the problem *is*. I've never seen a URL
                            with multiple adjacent forward slashes in my search results. Does
                            someone have an example?

                            Dave

                            • Philip Ronan

                              #15
                              Re: Warning: robots.txt unreliable in Apache servers

                              "Dave0x1" wrote:
                              > I don't understand why this is a big deal. The issue can be addressed
                              > by numerous methods, including patching of the Apache web server source
                              > code.

                              OK, so as long as the robots.txt documentation includes a note saying that
                              you have to patch your server software to get reliable results, then we'll
                              all be fine.

                              > It's not clear exactly what the problem *is*. I've never seen a URL
                              > with multiple adjacent forward slashes in my search results. Does
                              > someone have an example?

                              Which bit didn't I explain properly? I'm not going to post a link for you to
                              check, but here's the response I got from Google on the issue:

                              >> Thank you for your note. We apologize for our delayed response.
                              >> We understand you're concerned about the inclusion of
                              >> http://###.####.###//contact/ in our index.
                              >>
                              >> It's important to note that we visited the live page in question
                              >> and found that it currently exists on the web as listed above.
                              >> Because this page falls outside your robots.txt file, you may
                              >> want to use meta tags to remove this page from our index. For
                              >> more information about using meta tags, please visit
                              >> http://www.google.com/remove.html
                              >>
                              >> [remainder snipped]

                              I didn't publish the link to //contact/, someone else did. So that means the
                              robots.txt protocol is ineffective on (probably) most servers because it can
                              be circumvented without your knowledge by a third party.

                              Hope that's all clear now.

                              --
                              phil [dot] ronan @ virgin [dot] net


