Can you avoid that googlebot indexes PHPSESSID pages?

  • CAH

    Can you avoid that googlebot indexes PHPSESSID pages?

    Hi

    Can you prevent Googlebot from indexing PHPSESSID pages? Googlebot is
    indexing pages with PHPSESSID in the URL, which makes it think my site
    has an infinite number of pages. How can one avoid this?

    Here is an example of the URLs that Google registers, which might make
    it clearer what is happening:

    https://www.winches.dk/winches.php?a...6f0d46334659ff...
    https://www.winches.dk/winches.php?a...b6aed41fc142ea...

    I do use a registered session ID, but when I visit my site I never see
    those kinds of URLs. So how does Google get hold of them?

    Best regards
    Mads

  • Jerry Stuckle

    #2
    Re: Can you avoid that googlebot indexes PHPSESSID pages?




    --
    ==================
    Remove the "x" from my email address
    Jerry Stuckle
    JDS Computer Training Corp.
    jstucklex@attglobal.net
    ==================


    • Chung Leong

      #3
      Re: Can you avoid that googlebot indexes PHPSESSID pages?

      CAH wrote:
      > Hi
      >
      > Can you avoid that googlebot indexes PHPSESSID pages? Googlebot is
      > indexing pages with PHPSESSID, which makes it think my page has an
      > infinite number of pages. How can one avoid this?

      Well, one way to handle this is to check the User-Agent header to see
      whether the client is Googlebot and, if it is, not start a session.
      Obviously, if a page depends on the session, it then ceases to be
      indexable.

      > Here is an example of url that google register, that might make it
      > more clear what is happening
      >
      > https://www.winches.dk/winches.php?a...6f0d46334659ff...
      > https://www.winches.dk/winches.php?a...b6aed41fc142ea...
      >
      > I do use session registered ID, but if I visit my site I never see those
      > kind of urls? So how come google gets a hold of them?

      If session.use_trans_sid is enabled, PHP tries to compensate for the
      lack of a cookie by inserting the session ID into every link it can
      rewrite. Your browser accepts the cookie, so you never see the ID in
      the URL; Googlebot does not send cookies back, so it does.

      I think you have quite a problem on your hands. Once those links are in
      Google's database, the bot will keep returning to them. You'll need to
      detect the condition and tell Googlebot to buzz off so it doesn't eat
      up your bandwidth quota.
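
A minimal sketch of the "detect the condition" step, assuming the chosen response is a permanent redirect from any URL that carries a PHPSESSID query parameter to the same address without it. The helper name stripSessionParam is hypothetical, and whether a 301 suits a given site is a judgment call the thread does not settle.

```php
<?php
// Hypothetical helper: return the request URI with the session parameter removed.
function stripSessionParam(string $uri, string $param = 'PHPSESSID'): string
{
    $parts = parse_url($uri);
    if ($parts === false || !isset($parts['query'])) {
        return $uri; // nothing to strip
    }
    parse_str($parts['query'], $query);
    unset($query[$param]);
    $path = $parts['path'] ?? '/';
    return $query ? $path . '?' . http_build_query($query) : $path;
}

// Usage sketch (must run before any output is sent):
// if (isset($_GET['PHPSESSID'])) {
//     header('Location: ' . stripSessionParam($_SERVER['REQUEST_URI']), true, 301);
//     exit;
// }
```

Redirecting every visitor this way would also break cookieless sessions for real users, so in practice the check would probably be combined with a crawler test.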


      • CAH

        #4
        Re: Can you avoid that googlebot indexes PHPSESSID pages?



        I am now testing this as a solution:

        "Using .htaccess: you need to put the following two lines in the
        .htaccess file, if your host is running PHP as an Apache module:

        php_value session.use_only_cookies 1
        php_value session.use_trans_sid 0"

        The downside is that my site now only works when the user has cookies
        enabled, and I am still not sure whether this will do the trick.


        • noone

          #5
          Re: Can you avoid that googlebot indexes PHPSESSID pages?

          CAH wrote:

          > I am now testing this as a solution:
          >
          > "Using .htaccess: you need to put the following two lines in the
          > .htaccess file, if your host is running PHP as an Apache module:
          >
          > php_value session.use_only_cookies 1
          > php_value session.use_trans_sid 0"
          >
          > The downside is that my site now only works when the user has cookies
          > enabled, and I am still not sure whether this will do the trick.

          IIRC, Google and other search engines look for a file called robots.txt
          that gives directives on what they can and cannot index. Do a Google
          search for robots.txt to see... (to verify, look in your web server log
          files; it does show up as a request in my Apache log files...)

          If your robots.txt includes the following directives, crawlers will skip
          the entire site:

          User-agent: *
          Disallow: /

          Or, to limit the scope of the crawl (the * and $ wildcards in paths are
          an extension that Googlebot honours; not every crawler does):

          User-agent: *
          Disallow: /cgi-bin/
          Disallow: /images/
          Disallow: /*.php$




          • CAH

            #6
            Re: Can you avoid that googlebot indexes PHPSESSID pages?

            > IIRC, Google and other search engines look for a file called robots.txt
            > that gives directives on what they can and cannot index. Do a Google
            > search for robots.txt to see... (to verify, look in your web server log
            > files; it does show up as a request in my Apache log files...)
            >
            > If your robots.txt includes the following directives, crawlers will skip
            > the entire site:
            >
            > User-agent: *
            > Disallow: /
            >
            > Or, to limit the scope of the crawl:
            > User-agent: *
            > Disallow: /cgi-bin/
            > Disallow: /images/
            > Disallow: /*.php$

            I was testing this robots.txt:

            User-agent: Googlebot
            Disallow: /*PHPSESSID

            That might solve it; I just do not know whether it works or not.

            Mads
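
For checking a rule like this by hand, here is a rough sketch of how the wildcard matching that Googlebot applies to robots.txt paths could be emulated: `*` stands for any run of characters, a trailing `$` anchors the end of the URL, and anything else is a prefix match. The function name robotsPatternMatches is made up for illustration, and this is only an approximation, not Google's actual matcher.

```php
<?php
// Hypothetical helper: does a Disallow pattern cover the given path + query string?
function robotsPatternMatches(string $pattern, string $path): bool
{
    // A trailing "$" means the pattern must match up to the end of the URL.
    $anchored = substr($pattern, -1) === '$';
    if ($anchored) {
        $pattern = substr($pattern, 0, -1);
    }
    // Quote the literal pieces and let each "*" match any run of characters.
    $pieces = array_map(fn ($piece) => preg_quote($piece, '#'), explode('*', $pattern));
    $regex = '#^' . implode('.*', $pieces) . ($anchored ? '$' : '') . '#';
    return (bool) preg_match($regex, $path);
}

var_dump(robotsPatternMatches('/*PHPSESSID', '/winches.php?a=1&PHPSESSID=abc'));
var_dump(robotsPatternMatches('/*PHPSESSID', '/winches.php?a=1'));
```

Under these assumptions the rule above blocks only the session-tagged URLs and leaves the clean ones crawlable.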


            • Scott

              #7
              Re: Can you avoid that googlebot indexes PHPSESSID pages?

              On Mon, 2006-04-03 at 01:20 -0700, CAH wrote:
              > Hi
              >
              > Can you avoid that googlebot indexes PHPSESSID pages? Googlebot is
              > indexing pages with PHPSESSID, which makes it think my page has an
              > infinite number of pages. How can one avoid this?
              >
              > Here is an example of url that google register, that might make it
              > more clear what is happening
              >
              > https://www.winches.dk/winches.php?a...6f0d46334659ff...
              > https://www.winches.dk/winches.php?a...b6aed41fc142ea...
              >
              > I do use session registered ID, but if I visit my site I never see those
              > kind of urls? So how come google gets a hold of them?
              >
              > Best regards
              > Mads

              There was some discussion of forcing cookies, but the author didn't want
              to limit his users, so...

              How about doing something like this:

              // See if the user agent is Googlebot (the header may be absent, so default to '')
              $isGoogle = stripos($_SERVER['HTTP_USER_AGENT'] ?? '', 'Googlebot');

              // If it is, use ini_set to only allow cookies for the session ID.
              // This has to run before session_start().
              if ($isGoogle !== false) {
                  ini_set('session.use_only_cookies', '1');
              }


              • CAH

                #8
                Re: Can you avoid that googlebot indexes PHPSESSID pages?

                > There was some discussion of forcing cookies, but the author didn't want
                > to limit his users, so...
                >
                > How about doing something like this:
                >
                > // See if the user agent is Googlebot
                > $isGoogle = stripos($_SERVER['HTTP_USER_AGENT'], 'Googlebot');
                >
                > // If it is, use ini_set to only allow cookies for the session variable
                > if ($isGoogle !== false) {
                > ini_set('session.use_only_cookies', '1');
                > }

                That is a cool solution, but can one be sure of recognizing Googlebot?
                And what about all the other robots? Could one make an "is not a
                robot" test?

                Thanks for the help
                Mads


                • Scott

                  #9
                  Re: Can you avoid that googlebot indexes PHPSESSID pages?

                  On Mon, 2006-04-03 at 23:57 -0700, CAH wrote:
                  > > There was some discussion of forcing cookies, but the author didn't want
                  > > to limit his users, so...
                  > >
                  > > How about doing something like this:
                  > >
                  > > // See if the user agent is Googlebot
                  > > $isGoogle = stripos($_SERVER['HTTP_USER_AGENT'], 'Googlebot');
                  > >
                  > > // If it is, use ini_set to only allow cookies for the session variable
                  > > if ($isGoogle !== false) {
                  > > ini_set('session.use_only_cookies', '1');
                  > > }
                  >
                  > That is a cool solution, but can one be sure of recognizing Googlebot?
                  > And what about all the other robots? Could one make an "is not a
                  > robot" test?
                  >
                  > Thanks for the help
                  > Mads

                  I wouldn't expect all (or even most) robots to be easily identified by
                  the user-agent. Maybe you could make an array of the most common ones
                  (Googlebot, Inktomi, etc) and loop through it with the logic I
                  suggested. I also don't think you could check to see if it's a browser,
                  because firewalls & proxy servers may not send that information through.

                  Sorry! (It's not my internet. I just work here!)

                  Scott
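
The array-and-loop idea might look roughly like the sketch below. The signature list is an assumption made for illustration (the substrings are examples, not a complete or verified registry of crawlers), and isKnownCrawler is a hypothetical name.

```php
<?php
// Illustrative substrings only; a real list would need maintaining.
const CRAWLER_SIGNATURES = ['Googlebot', 'Slurp', 'msnbot'];

// Return true if the User-Agent string contains any known crawler signature.
function isKnownCrawler(string $userAgent, array $signatures = CRAWLER_SIGNATURES): bool
{
    foreach ($signatures as $signature) {
        if (stripos($userAgent, $signature) !== false) {
            return true;
        }
    }
    return false;
}

// For recognized crawlers, keep the session ID out of URLs.
// This has to run before session_start().
if (isKnownCrawler($_SERVER['HTTP_USER_AGENT'] ?? '')) {
    ini_set('session.use_only_cookies', '1');
}
```

Any client can send any User-Agent it likes, so this only catches robots that identify themselves honestly.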


                  • CAH

                    #10
                    Re: Can you avoid that googlebot indexes PHPSESSID pages?

                    > I wouldn't expect all (or even most) robots to be easily identified by
                    > the user-agent. Maybe you could make an array of the most common ones
                    > (Googlebot, Inktomi, etc) and loop through it with the logic I
                    > suggested. I also don't think you could check to see if it's a browser,
                    > because firewalls & proxy servers may not send that information through.

                    I see what you mean.

                    Do you think this solution will work?

                    "Using .htaccess: you need to put the following two lines in the
                    .htaccess file, if your host is running PHP as an Apache module:

                    php_value session.use_only_cookies 1
                    php_value session.use_trans_sid 0"

                    I think it does, and even though you then have to rely on cookies, I
                    think it is the better solution, because today that is a small drawback
                    compared to the search engine problems.

                    If this solution works

                    User-agent: Googlebot
                    Disallow: /*PHPSESSID

                    it would be by far the simplest. However, I do not feel too sure that
                    it works, and I have no opportunity to check it at this time.

                    Regards
                    Mads


                    • R. Rajesh Jeba Anbiah

                      #11
                      Re: Can you avoid that googlebot indexes PHPSESSID pages?

                      CAH wrote:
                      > Hi
                      >
                      > Can you avoid that googlebot indexes PHPSESSID pages? Googlebot is
                      > indexing pages with PHPSESSID, which makes it think my page has an
                      > infinite number of pages. How can one avoid this?
                      >
                      > Here is an example of url that google register, that might make it
                      > more clear what is happening
                      >
                      > https://www.winches.dk/winches.php?a...6f0d46334659ff...
                      > https://www.winches.dk/winches.php?a...b6aed41fc142ea...

                      Such a change in session ID shouldn't happen on a normal site. Also,
                      AFAIK Google will remove the PHPSESSID from the URL (after crawling(?)).

                      FWIW, <news:1120415602.234835.159790@g43g2000cwa.googlegroups.com> (
                      http://groups.google.com/group/comp....7bb41576afe16d )

                      --
                      <?php echo 'Just another PHP saint'; ?>
                      Email: rrjanbiah-at-Y!com Blog: http://rajeshanbiah.blogspot.com/


                      • CAH

                        #12
                        Re: Can you avoid that googlebot indexes PHPSESSID pages?

                        > > https://www.winches.dk/winches.php?a...6f0d46334659ff...
                        > > https://www.winches.dk/winches.php?a...b6aed41fc142ea...
                        >
                        > Such a change in session id shouldn't happen in a normal site.

                        Why not? I would think a session ID should be unique. If you think I am
                        doing something wrong, what could that be?

                        > Also, AFAIK Google will remove the PHPSESSID from URL (after crawling(?)).

                        You can try this search in Google: site:www.winches.dk

                        or click here:

                        http://www.google.com/search?q=site:...e=off&filter=0

                        Look at the last 100 entries or so.

                        Best regards
                        mads


                        • CAH

                          #13
                          Re: Can you avoid that googlebot indexes PHPSESSID pages?

                          > There was some discussion of forcing cookies, but the author didn't want
                          > to limit his users, so...
                          >
                          > How about doing something like this:
                          >
                          > // See if the user agent is Googlebot
                          > $isGoogle = stripos($_SERVER['HTTP_USER_AGENT'], 'Googlebot');
                          >
                          > // If it is, use ini_set to only allow cookies for the session variable
                          > if ($isGoogle !== false) {
                          > ini_set('session.use_only_cookies', '1');
                          > }

                          Hi Scott

                          The solution you have come up with is the cool one. How can I test
                          whether my host allows me to call
                          ini_set('session.use_only_cookies', '1') the way you suggest in your
                          code? Can everyone do this on any host? And are there any other ideas
                          for how to check whether the user agent is Googlebot?

                          Thanks for the suggestions. I must say I have had my first encounter
                          with cookie problems, so I would like to get the PHPSESSID back in the
                          URL.

                          Best regards
                          Mads
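
One way to find out whether a host lets a script change a setting is to call ini_set() and look at the result: it returns false on failure, and ini_get() reports the value actually in effect. A small sketch; the helper name canChangeIni is hypothetical, and in recent PHP versions session settings can only be changed before session_start() and before any output has been sent.

```php
<?php
// Hypothetical helper: try to change an ini setting and confirm it took effect.
function canChangeIni(string $name, string $value): bool
{
    $previous = ini_set($name, $value); // false means the change was refused
    return $previous !== false && ini_get($name) === $value;
}

// Typical use at the very top of a page, before session_start():
// if (!canChangeIni('session.use_only_cookies', '1')) {
//     error_log('Host does not allow changing session.use_only_cookies at runtime');
// }
```

If the call is refused, the .htaccess route or the host's own php.ini would be the places left to try.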


                          • CAH

                            #14
                            PHPSESSID URLs restricted by robots.txt

                            > If this solution works
                            >
                            > User-agent: Googlebot
                            > Disallow: /*PHPSESSID
                            >
                            > it would be by far the simplest. However, I do not feel too sure that
                            > it works, and I have no opportunity to check it at this time.

                            In Google Sitemaps (beta) I can now see 10 URLs restricted by
                            robots.txt, and that is with the robots.txt above. So I guess that
                            might do the trick. Do you think this is an indication that the
                            robots.txt above is enough?


                            Cah


                            • R. Rajesh Jeba Anbiah

                              #15
                              Re: Can you avoid that googlebot indexes PHPSESSID pages?

                              CAH wrote:
                              > > > https://www.winches.dk/winches.php?a...6f0d46334659ff...
                              > > > https://www.winches.dk/winches.php?a...b6aed41fc142ea...
                              > >
                              > > Such a change in session id shouldn't happen in a normal site.
                              >
                              > Why not? I would think a session ID should be unique. If you think I am
                              > doing something wrong, what could that be then?

                              It shouldn't happen within a single session. The session ID stays the
                              same for the whole session unless:
                              1. The crawler returns and caches pages over multiple runs
                              2. You have used session_regenerate_id()
                              3. There are stray absolute links pointing from your site to your own
                              site (instead of relative links)

                              > Also,
                              > > AFAIK Google will remove the PHPSESSID from URL (after crawling(?)).
                              >
                              > You can try this search in Google: site:www.winches.dk
                              >
                              > or click here:
                              >
                              > http://www.google.com/search?q=site:...e=off&filter=0
                              >
                              > Look at the last 100 entries or so.

                              It doesn't seem to strip the session ID as I thought. If your site's
                              content doesn't rely on the session (for non-members), you may safely
                              turn off trans sid
                              <news:1111603962.594721.154710@l41g2000cwc.googlegroups.com> (
                              http://groups.google.com/group/comp....24f27f2b7ac610 ).
                              You can even turn it off selectively, only for the crawler, by sniffing
                              the user agent string and/or IP.
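
For the IP side of the sniffing, Google documents a reverse-then-forward DNS check for confirming that a visitor really is Googlebot. The sketch below follows that idea under stated assumptions: the function names are hypothetical, the accepted hostname suffixes are limited to googlebot.com and google.com, and gethostbynamel() only returns IPv4 addresses, so IPv6 visitors would not be confirmed.

```php
<?php
// Hypothetical helper: does the reverse-DNS name belong to Google's crawler domains?
function isGoogleHostname(string $host): bool
{
    return (bool) preg_match('/\.(googlebot|google)\.com$/i', $host);
}

// Hypothetical helper: reverse lookup, suffix check, then forward-confirm the IP.
function verifyGooglebotIp(string $ip): bool
{
    $host = gethostbyaddr($ip); // returns the IP itself or false when lookup fails
    if ($host === false || $host === $ip || !isGoogleHostname($host)) {
        return false;
    }
    $addresses = gethostbynamel($host) ?: []; // IPv4 addresses only
    return in_array($ip, $addresses, true);
}

// Usage sketch:
// $isGooglebot = verifyGooglebotIp($_SERVER['REMOTE_ADDR']);
```

DNS lookups are slow, so a real site would likely cache the result per IP rather than resolve on every request.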

                              But if your site depends on the session (for non-members, and hence for
                              the crawler) and you'd like to enable the session for the crawler but
                              don't want the trans sid, you need to go for some other hack. If that
                              is your situation, I may be able to help you with the hack.

                              --
                              <?php echo 'Just another PHP saint'; ?>
                              Email: rrjanbiah-at-Y!com Blog: http://rajeshanbiah.blogspot.com/
