Google causing excessive bandwidth usage.

This topic is closed.
  • Doug Laidlaw

    Google causing excessive bandwidth usage.

    I know that this thread is inappropriate for this list, but c.i.w.a.servers
    is dead at my address.

    Google has been around to my site twice this month and downloaded almost a
    GB, putting me over my bandwidth limit both times. I imagine that if I
    wasn't paying a flat fee, that would be costing me money.

    Is there a way of limiting this while at the same time allowing Google
    reasonable indexing? I imagine that they are downloading the whole site
    for their cache, but most of it (21 MB) is a program in PHP, and
    unnecessary for archiving purposes. I have noticed from Google searches
    that there are quite a few PHP reports from my site listed. These are
    generated on-the-fly. Would I be better off asking this on the program site
    at Sourceforge?

    Doug.
    --
    Registered Linux User No. 277548. My true email address has hotkey for
    myaccess.
    The difference between 'involvement' and 'commitment' is like an
    eggs-and-ham breakfast: the chicken was 'involved' - the pig was
    'committed'.
    - Unknown.

  • Philip Ronan

    #2
    Re: Google causing excessive bandwidth usage.

    "Doug Laidlaw" wrote:
    [color=blue]
    > I know that this thread is inappropriate for this list, but c.i.w.a.servers
    > is dead at my address.[/color]

    alt.internet.search-engines might have been a better place to ask.
    [color=blue]
    > Google has been around to my site twice this month and downloaded almost a
    > GB, putting me over my bandwidth limit both times. I imagine that if I
    > wasn't paying a flat fee, that would be costing me money.[/color]

    Then I don't really see what the problem is. You've got all this content on
    your website, and presumably you want it indexed by Google. So you can't
    complain when the googlebot comes along and looks at the stuff.
    [color=blue]
    > Is there a way of limiting this while at the same time allowing Google
    > reasonable indexing? I imagine that they are downloading the whole site
    > for their cache, but most of it (21 MB) is a program in PHP, and
    > unnecessary for archiving purposes.[/color]

    All you can do is switch the indexing on or off. Either with a robots.txt
    file or with robots meta tags in individual pages. If you don't want google
    to crawl parts of your site, then tell it not to. It really is that simple.
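    For example, a robots.txt in the site root along these lines (the directory name is only a placeholder for wherever the PHP program lives) keeps compliant crawlers out of the program while leaving the rest indexable:

```
# robots.txt -- read by all compliant robots
User-agent: *
# keep crawlers out of the generated PHP reports (hypothetical path)
Disallow: /program/
```

    The per-page equivalent is a <meta name="robots" content="noindex"> tag in each page's <head>.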
    [color=blue]
    > I have noticed from Google searches
    > that there are quite a few PHP reports from my site listed. These are
    > generated on-the-fly. Would I be better off asking this on the program site
    > at Sourceforge?[/color]

    I think you would be better off reading this:
    <http://www.google.com/intl/en/webmasters/bot.html>

    --
    phil [dot] ronan @ virgin [dot] net



    • Doug Laidlaw

      #3
      Re: Google causing excessive bandwidth usage.

      Philip Ronan wrote:
      [color=blue]
      > "Doug Laidlaw" wrote:
      >[color=green]
      >> I know that this thread is inappropriate for this list, but
      >> c.i.w.a.servers is dead at my address.[/color]
      >
      > alt.internet.search-engines might have been a better place to ask.
      >[color=green]
      >> Google has been around to my site twice this month and downloaded almost
      >> a
      >> GB, putting me over my bandwidth limit both times. I imagine that if I
      >> wasn't paying a flat fee, that would be costing me money.[/color]
      >
      > Then I don't really see what the problem is. You've got all this content
      > on your website, and presumably you want it indexed by Google. So you
      > can't complain when the googlebot comes along and looks at the stuff.
      >[color=green]
      >> Is there a way of limiting this while at the same time allowing Google
      >> reasonable indexing? I imagine that they are downloading the whole site
      >> for their cache, but most of it (21 MB) is a program in PHP, and
      >> unnecessary for archiving purposes.[/color]
      >
      > All you can do is switch the indexing on or off. Either with a robots.txt
      > file or with robots meta tags in individual pages. If you don't want
      > google to crawl parts of your site, then tell it not to. It really is that
      > simple.
      >[color=green]
      >> I have noticed from Google searches
      >> that there are quite a few PHP reports from my site listed. These are
      >> generated on-the-fly. Would I be better off asking this on the program
      >> site at Sourceforge?[/color]
      >
      > I think you would be better off reading this:
      > <http://www.google.com/intl/en/webmasters/bot.html>
      >[/color]
      Thanks Phil. I am entirely self-taught, and suddenly finding myself with my
      own domain, and having to do a lot more administration. There is a robots.txt
      file in the root directory of the program. I will follow up on that.

      Doug.
      --
      Registered Linux User No. 277548. My true email address has hotkey for
      myaccess.
      The only sure thing about luck is that it will change.
      - Bret Harte.


      • Nick Kew

        #4
        Re: Google causing excessive bandwidth usage.

        Doug Laidlaw wrote:
        [color=blue]
        > Thanks Phil. I am entirely self-taught, and suddenly finding myself with my
        > own domain, and having to do a lot more administration. There is a robots.txt
        > file in the root directory of the program. I will follow up on that.[/color]

        Your robots.txt must be incorrect - so look into it.

        Perhaps more importantly, you need to consider cacheability of your
        contents. Probably simplest is to ensure the server sends Last-Modified
        headers, so that google and others only have to download a page
        when it's changed, not every time it checks for a possible change.
        That'll save you bandwidth all round, not just from google et al.
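        The saving Nick describes comes from conditional requests: a crawler that saw a Last-Modified header can resend that date as If-Modified-Since, and the server answers 304 with no body when nothing has changed. A rough sketch of that decision in Python (illustrative only; Apache does this for you once Last-Modified is being sent):

```python
# Sketch of conditional-GET handling (illustrative, not Apache's code).
from email.utils import format_datetime, parsedate_to_datetime
from datetime import datetime, timezone

def respond(last_modified, if_modified_since=None):
    """Return the status a server would send: 304 if the client's cached
    copy (dated by If-Modified-Since) is still current, else 200."""
    if if_modified_since is not None:
        cached = parsedate_to_datetime(if_modified_since)
        if last_modified <= cached:
            return 304  # headers only; the body is not re-sent
    return 200  # full download, with a fresh Last-Modified header

page_mtime = datetime(2005, 11, 19, tzinfo=timezone.utc)
print(respond(page_mtime))                               # first crawl: 200
print(respond(page_mtime, format_datetime(page_mtime)))  # recheck: 304
```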

        --
        Nick Kew


        • Stan Brown

          #5
          Re: Google causing excessive bandwidth usage.

          Sat, 19 Nov 2005 10:32:26 +0000 from Philip Ronan
          <invalid@invalid.invalid>:
          [color=blue]
          > "Doug Laidlaw" wrote:[color=green]
          > > Google has been around to my site twice this month and downloaded almost a
          > > GB, putting me over my bandwidth limit both times. I imagine that if I
          > > wasn't paying a flat fee, that would be costing me money.
          > > Is there a way of limiting this while at the same time allowing Google
          > > reasonable indexing?[/color][/color]
          [color=blue]
          > alt.internet.search-engines might have been a better place to ask.[/color]

          (follow-ups redirected accordingly)
          [color=blue]
          > Then I don't really see what the problem is. You've got all this content on
          > your website, and presumably you want it indexed by Google. So you can't
          > complain when the googlebot comes along and looks at the stuff.[/color]

          I'm _not_ paying a flat fee, unlike the OP, and I'd like to know the
          answer to this also.
          [color=blue]
          > I think you would be better off reading this:
          > <http://www.google.com/intl/en/webmasters/bot.html>[/color]

          Good heavens! That page says Google trawls my site every few
          _seconds_. Not long ago I remember it used to be every few _days_. I
          noticed activity on my site grew quite a bit a little less than a
          year ago; I wonder if this was the reason?

          --
          Stan Brown, Oak Road Systems, Tompkins County, New York, USA

          HTML 4.01 spec: http://www.w3.org/TR/html401/
          validator: http://validator.w3.org/
          CSS 2.1 spec: http://www.w3.org/TR/CSS21/
          validator: http://jigsaw.w3.org/css-validator/
          Why We Won't Help You:


          • Jim Moe

            #6
            Re: Google causing excessive bandwidth usage.

            Doug Laidlaw wrote:[color=blue]
            >
            > Google has been around to my site twice this month and downloaded almost a
            > GB, putting me over my bandwidth limit both times. I imagine that if I
            > wasn't paying a flat fee, that would be costing me money.
            >[/color]
            Add a robots.txt file to the DocumentRoot.
            Also see <http://www.searchtools.com/robots/>. The "Disallow" keyword is
            what you need.

            --
            jmm (hyphen) list (at) sohnen-moe (dot) com
            (Remove .AXSPAMGN for email)


            • Nick Kew

              #7
              Re: Google causing excessive bandwidth usage.

              Stan Brown wrote:[color=blue]
              > (follow-ups redirected accordingly)[/color]

              And ignored. I'm not posting *only* to a group I don't read.
              [color=blue]
              > Good heavens! That page says Google trawls my site every few
              > _seconds_. Not long ago I remember it used to be every few _days_. I[/color]

              Erm, that'll be URLs that get visited at a high rate while it's
              spidering. So if it visits one per minute and you have 1440 pages,
              it'll take one day to spider the site from scratch.
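              Nick's arithmetic, spelled out (both figures are just his example, not measurements):

```python
# Back-of-envelope from the post: one Googlebot fetch per minute,
# 1440 pages on the site.
pages = 1440
fetches_per_day = 24 * 60          # one request per minute, all day
days_for_full_crawl = pages / fetches_per_day
print(days_for_full_crawl)         # → 1.0
```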

              It'll then revisit in [???] days/weeks to check for changes.

              --
              Nick Kew


              • Alan J. Flavell

                #8
                Re: Google causing excessive bandwidth usage.

                On Sat, 19 Nov 2005, Stan Brown wrote:
                [color=blue][color=green]
                > > alt.internet.search-engines might have been a better place to ask.[/color]
                > (follow-ups redirected accordingly)[/color]

                Urgl. I missed that, first time, but this server doesn't do alt
                groups. So here goes again, including a group that I not only read
                but can post to...
                [color=blue][color=green]
                > > <http://www.google.com/intl/en/webmasters/bot.html>[/color]
                >
                > Good heavens! That page says Google trawls my site every few
                > _seconds_.[/color]

                I don't think so! It says the server shouldn't get *an* access from
                Googlebot more often than once every few seconds. That's a rate control
                mechanism, not a frequency of revisiting.

                Though I'm a bit surprised to see that when I count up the log entries
                for Googlebot on our server, I count some 68K accesses in the current
                log, 13th November onwards, out of the total of some 400K accesses
                over that period.

                But the accesses are clustered by date, implying that they did a trawl
                twice this week - or once (30K hits in the previous week, in just a
                single cluster), with only a few hundreds of Googlebot hits per day on
                the intermediate days (presumably to re-check pages which were
                recently active?).

                I see most of the Googlebot accesses here are returning status 200.
                The references to my own personal space are mostly returning status
                304, but I see a few cases where my "xbithack full" pages are missing
                the g+x bit, and so they always return status 200, which I need to
                rectify.
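                For anyone puzzled by the g+x remark: with Apache's "XBitHack full" (mod_include), the user-execute bit marks an .html file for SSI parsing, and the group-execute bit additionally makes Apache send a Last-Modified header for it, which is what allows the 304s above. Setting it is just (filename hypothetical):

```shell
touch page.html           # stand-in for a real SSI page (hypothetical name)
chmod u+x,g+x page.html   # u+x: SSI-parse; g+x: also emit Last-Modified
ls -l page.html           # group "x" bit now shows in the permissions
```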

                Hmmm, and I have to look into those status 200 responses elsewhere on
                the server, and probably do something about it. I have a theory.


                • Doug Laidlaw

                  #9
                  Re: Google causing excessive bandwidth usage.

                  Thanks, Nick. Is that done in httpd.conf?

                  There was no robots.txt. I have added one.

                  I was going to start a new thread, but here is my draft:

                  "I have Web space on a server with nothing in my root directory, but two
                  subdirectories. It was set up this way by the host.

                  For the benefit of anyone coming to my root directory, I placed there a
                  standard redirection page with a REFRESH meta tag. Apparently it made
                  Google loop the loop, and Google notched up over a GB of downloads. An
                  article says that such a page should not be exposed to search engines.

                  Before I heard that the tag was the problem, I set up a robots.txt file from
                  the URL kindly provided by Philip above. It excludes the program
                  subdirectory and leaves only the HTML. I have also asked Google to have a
                  look."

                  Now, a robots.txt file cannot have exceptions. I can't tell it to disallow
                  everything except the html directory. It is all or nothing, so far as that
                  goes.

                  Doug.

                  Nick Kew wrote:
                  [color=blue]
                  > Doug Laidlaw wrote:
                  >[color=green]
                  >> Thanks Phil. I am entirely self-taught, and suddenly finding myself with
                  >> my
                  >> own domain, and having to do a lot more administration. There is a
                  >> robots.txt
                  >> file in the root directory of the program. I will follow up on that.[/color]
                  >
                  > Your robots.txt must be incorrect - so look into it.
                  >
                  > Perhaps more importantly, you need to consider cacheability of your
                  > contents. Probably simplest is to ensure the server sends Last-Modified
                  > headers, so that google and others only have to download a page
                  > when it's changed, not every time it checks for a possible change.
                  > That'll save you bandwidth all round, not just from google et al.
                  >[/color]

                  --
                  Registered Linux User No. 277548. My true email address has hotkey for
                  myaccess.
                  Love is not blind - it sees more, not less. But because it sees more, it is
                  willing to see less.
                  - Rabbi Julins Gordon. (I could write a page about that.)


                  • Sherm Pendley

                    #10
                    Re: Google causing excessive bandwidth usage.

                    Doug Laidlaw <laidlaws@myaccess.com.au> writes:
                    [color=blue]
                    > For the benefit of anyone coming to my root directory, I placed there a
                    > standard redirection page with a REFRESH meta tag.[/color]

                    But... that's not a standard redirect, that's a hack that relies on browser
                    support for non-standard behavior. The standard way to do a redirect is
                    to tell the server to return a 301 (permanently moved), 302 (temporarily
                    moved), or 303 (see other) HTTP status.

                    If you're using Apache, you can do that with a Redirect directive in either
                    httpd.conf or .htaccess, whichever suits your situation best.
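                    A minimal .htaccess version of that (all names here are placeholders, using Apache's mod_alias):

```apache
# .htaccess in the web root -- hypothetical old and new locations.
# The browser (and googlebot) gets the 302 status directly, instead
# of loading a page and obeying a meta refresh.
Redirect 302 /index.html http://www.example.com/html/
```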

                    sherm--

                    --
                    Cocoa programming in Perl: http://camelbones.sourceforge.net
                    Hire me! My resume: http://www.dot-app.org


                    • Doug Laidlaw

                      #11
                      Re: Google causing excessive bandwidth usage.

                      Sherm Pendley wrote:
                      [color=blue]
                      > Doug Laidlaw <laidlaws@myaccess.com.au> writes:
                      >[color=green]
                      >> For the benefit of anyone coming to my root directory, I placed there a
                      >> standard redirection page with a REFRESH meta tag.[/color]
                      >
                      > But... that's not a standard redirect, that's a hack that relies on
                      > browser support for non-standard behavior. The standard way to do a
                      > redirect is to tell the server to return a 301 (permanently moved), 302
                      > (temporarily moved), or 303 (see other) HTTP status.
                      >
                      > If you're using Apache, you can do that with a Redirect directive in
                      > either httpd.conf or .htaccess, whichever suits your situation best.
                      >
                      > sherm--
                      >[/color]
                      Thank you very much for your answer. It tells me what I need to know. I
                      would rather use an .htaccess file than fiddle with somebody else's
                      httpd.conf.

                      The server uses Apache, and I have played around with Apache on my Linux
                      distro. I think that I am more au fait with that than the admin guy. He
                      wanted me to ask Google whether they owned the IP address causing the
                      trouble. I just used the "host" command.

                      Doug.
                      --
                      Registered Linux User No. 277548. My true email address has hotkey for
                      myaccess.
                      We don't seem to be able to check crime, so why don't we legalize it and
                      then tax it out of business.
                      - Will Rogers.


                      • Doug Laidlaw

                        #12
                        Re: Google causing excessive bandwidth usage.

                        Doug Laidlaw wrote:

                        [color=blue]
                        > The server uses Apache, and I have played around with Apache on my Linux
                        > distro. I think that I am more au fait with that than the admin guy. He
                        > wanted me to ask Google whether they owned the IP address causing the
                        > trouble. I just used the "host" command.
                        >
                        > Doug.[/color]

                        The server uses cPanel. I have just found the Redirect dialog in its
                        config, and have set up a 302 Redirect there.

                        Too easy, after all the head scratching.

                        Doug.
                        --
                        Registered Linux User No. 277548. My true email address has hotkey for
                        myaccess.
                        Maturity begins to grow when you can sense your concern for others
                        outweighing your concern for yourself.
                        - John Macnaughton.

