
Textpattern CMS support forum


#1 2010-10-14 00:44:36

Kjeld
Member
From: Tokyo, Japan
Registered: 2005-02-05
Posts: 453
Website

Googlebot goes berserk on my site

After trying to figure out exceedingly heavy server loads for many weeks, I discovered today that googlebot does some really weird searches on my new site MeijiShowa.com. The problems here may explain high server loads on some of my other sites, too. Maybe other people have the same problem?

I get entries like the following in my logs:

calendar/?all&keywords=Japan
calendar/?all&keywords=birds
calendar/?all&keywords=Ikuta
calendar/?all&keywords=piers
maps/?all&keywords=Koinobori
maps/?all&keywords=Motomachi
maps/?all&keywords=Azabu-san
maps/?all&keywords=tradition

Now, these keywords do exist on my site, but not in the calendar and maps sections.
(Ignore all&. It is a solution I came up with for a specific problem on this site, and unrelated to the problem at hand.)

As I have thousands of keywords, all these searches put a heavy burden on my server. Since I am hosted at MediaTemple, this greatly increases my GPU (Grid Performance Unit) usage, and I have been getting hundreds of dollars in overage bills.

My questions:

1. How does googlebot find these keywords and match them up with unrelated sections?
2. Should I put something in my robots.txt file (to be created) to prevent googlebot (and other bots) from making search requests that lead nowhere and just add to server load?
3. Is there anything else that I can do to prevent these useless searches from occurring?


Old Photos of Japan – Japan in the 1850s~1960s (100% txp)
MeijiShowa – Stock photos of Japan in the 1850s~1960s (100% txp)
JapaneseStreets.com – Japanese street fashion (mostly txp)

Offline

#2 2010-10-14 04:28:41

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 9,254
Website GitHub Mastodon Twitter

Re: Googlebot goes berserk on my site

Google’s Webmaster Tools asks for the search query URL on your site, which is then used. I’m not sure where I saw this, but I think I can narrow it down to Google Sitemaps.


Yiannis
——————————
NeMe | hblack.art | EMAP | A Sea change | Toolkit of Care
I do my best editing after I click on the submit button.

Offline

#3 2010-10-14 06:36:25

Kjeld
Member
From: Tokyo, Japan
Registered: 2005-02-05
Posts: 453
Website

Re: Googlebot goes berserk on my site

colak wrote:

Google’s Webmaster Tools asks for the search query URL on your site, which is then used. I’m not sure where I saw this, but I think I can narrow it down to Google Sitemaps.

Google’s webmaster tools has no data yet for MeijiShowa.com.



Offline

#4 2010-10-14 11:01:25

gaekwad
Server grease monkey
From: People's Republic of Cornwall
Registered: 2005-11-19
Posts: 4,482
Bitbucket GitHub

Re: Googlebot goes berserk on my site

I’ve had great success with rah_sitemap in taming GoogleBot. You can also set the GoogleBot crawl rate in GWT, which should help.

Offline

#5 2010-10-14 11:21:33

Kjeld
Member
From: Tokyo, Japan
Registered: 2005-02-05
Posts: 453
Website

Re: Googlebot goes berserk on my site

Thanks, Pete & Yiannis.

What I really want to know though is: how does googlebot find these keywords and match them up with unrelated sections?

Can googlebot somehow get into the database to get to the keywords?

Last edited by Kjeld (2010-10-14 11:22:22)



Offline

#6 2010-10-14 11:51:09

Gocom
Developer Emeritus
From: Helsinki, Finland
Registered: 2006-07-14
Posts: 4,533
Website

Re: Googlebot goes berserk on my site

GoogleBot’s job is to crawl, and you are linking to the pages (maps/calendar) with ‘keywords’ parameters. The end result is that GoogleBot tries the keywords you are using elsewhere on the site and attempts to find something new to index.

You could block those ‘filtered’ pages with robots.txt by telling Google not to use the ‘keywords’ parameter. After all, the only thing those filtered pages do is spread the same content across multiple addresses, and duplicated content is really never a good thing. Plus it wastes bandwidth.

For example adding something like:

User-agent: *
Disallow: /maps/?keywords=*
Disallow: /calendar/?keywords=*

to your robots.txt tells Google not to use the keywords parameter.
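For what it’s worth, you can sanity-check rules like these offline with Python’s standard `urllib.robotparser` (a rough sketch; `example.com` is a stand-in for your own domain). One caveat: Python’s parser does strict prefix matching and does not implement Google’s `*` wildcard extension, so the rules below use plain prefixes with no trailing `*`:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt using plain prefix rules (no trailing '*'),
# which both the original standard and Python's parser understand.
rules = """\
User-agent: *
Disallow: /maps/?keywords=
Disallow: /calendar/?keywords=
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Filtered search URLs are blocked by the prefix rule...
print(rp.can_fetch("*", "http://example.com/maps/?keywords=Koinobori"))  # False
# ...while normal section pages stay crawlable.
print(rp.can_fetch("*", "http://example.com/maps/"))                     # True
```

This only tells you how a strict, standards-style parser reads the file; Google’s own robots.txt tester is still the authority on what GoogleBot will do.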

Kjeld wrote:

Can googlebot somehow get into the database to get to the keywords?

No.

Last edited by Gocom (2010-10-14 12:01:07)

Offline

#7 2010-10-14 12:09:13

Kjeld
Member
From: Tokyo, Japan
Registered: 2005-02-05
Posts: 453
Website

Re: Googlebot goes berserk on my site

Thanks a lot, Jukka. I was wondering how to phrase the Disallow. Your samples help.

I still don’t understand where googlebot gets those non-existing links. For example, links like maps/?all&keywords=Koinobori and maps/?all&keywords=tradition don’t exist anywhere on the site. The keywords Koinobori and tradition do exist, but only in the photography section. How does googlebot manage to combine keywords from one section to create links in another?



Offline

#8 2010-10-14 12:20:14

Gocom
Developer Emeritus
From: Helsinki, Finland
Registered: 2006-07-14
Posts: 4,533
Website

Re: Googlebot goes berserk on my site

Kjeld wrote:

How does googlebot manage to combine keywords from one section to create links in another?

Probably because of the shared ?keywords parameter. It sees you are using the keywords tradition and Koinobori on other pages, so when it finds the same keywords parameter used on a different page, it tries those existing keywords on that page too.

Last edited by Gocom (2010-10-14 12:22:49)

Offline

#9 2010-10-14 12:37:06

Kjeld
Member
From: Tokyo, Japan
Registered: 2005-02-05
Posts: 453
Website

Re: Googlebot goes berserk on my site

Gocom wrote:

Probably because of the shared ?keywords parameter. It sees you are using the keywords tradition and Koinobori on other pages, so when it finds the same keywords parameter used on a different page, it tries those existing keywords on that page too.

Thanks again, Jukka. I didn’t know the googlebot took that much initiative! I always thought bots only followed existing links.

I always turn my textpattern logs off, but I have turned them on this time to see where exactly googlebot goes. Using the info from the logs, I have now created a robots.txt file to warn all bots away from the links with the parameters.

I have also installed rah_sitemap and submitted the site’s sitemap to google webmaster tools. Thanks, Pete.

Once again, thanks to all for the input. I had no idea that the bots were this intrusive.

Now, let’s hope all these efforts bear fruit!



Offline

#10 2010-10-14 21:12:30

Kjeld
Member
From: Tokyo, Japan
Registered: 2005-02-05
Posts: 453
Website

Re: Googlebot goes berserk on my site

Yesterday I created a robots file with entries like the following:

Disallow: /maps/?all&keywords=*
Disallow: /photography/?all&keywords=*
Disallow: /calendar/?all&keywords=*
Disallow: /illustrations/?all&keywords=*
Disallow: /maps/?all&place=*
Disallow: /photography/?all&place=*
Disallow: /calendar/?all&place=*
Disallow: /illustrations/?all&place=*

But it seems that googlebot still accesses such entries. For example, today I found the following googlebot entry in my logs:

maps/?all&keywords=Daibiru%20Honkan

This should have been blocked by the first entry above…

Is there something wrong with my Disallow syntax, or does Disallow only work for actual directories and files? I can’t seem to find info on that at robotstxt.org.
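One possible explanation (an assumption about strict parsers, not a verified account of GoogleBot’s behaviour): under the original robots.txt standard, Disallow is a plain prefix match and `*` is a literal character, so `/maps/?all&keywords=*` only blocks paths that literally contain an asterisk. Python’s strict-prefix `urllib.robotparser` shows the effect, with `example.com` standing in for the real domain:

```python
from urllib.robotparser import RobotFileParser

# The rule as written in the robots.txt above, trailing '*' included.
rules = """\
User-agent: *
Disallow: /maps/?all&keywords=*
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Under strict prefix matching the trailing '*' is literal, so the
# logged URL is NOT considered blocked by this rule.
print(rp.can_fetch("*", "http://example.com/maps/?all&keywords=Daibiru%20Honkan"))  # True
```

Dropping the trailing `*` (i.e. `Disallow: /maps/?all&keywords=`) turns the rule into a prefix that matches under any parser.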



Offline

#11 2010-10-14 22:17:06

MattD
Plugin Author
From: Monterey, California
Registered: 2008-03-21
Posts: 1,254
Website

Re: Googlebot goes berserk on my site

I’ve just tried testing your rule and url on Google’s Webmaster Tools and confirmed it does not work.


My Plugins

Piwik Dashboard, Google Analytics Dashboard, Minibar, Article Image Colorpicker, Admin Datepicker, Admin Google Map, Admin Colorpicker

Offline

#12 2010-10-14 22:53:29

Gocom
Developer Emeritus
From: Helsinki, Finland
Registered: 2006-07-14
Posts: 4,533
Website

Re: Googlebot goes berserk on my site

Google’s own robots.txt manual is pretty helpful.

It might be that the rules are failing because the parameter all doesn’t have a value. I didn’t check, but that might be the case. Don’t quote me on that.

Anyhow, to cut down the number of rules, as there are plenty, you could just block all URLs that use a question mark. For example, something like this might work (not tested):

User-agent: *
Disallow: /*?
Disallow: /present/
Disallow: /featured-images/
Disallow: /explore/
Disallow: /checkout/
Disallow: /confirmation/

If it works, it would also block some other unwanted pages, like ?id=, ?q= and ?s=, which are part of TXP’s URL structure.
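Since GoogleBot does honour `*` and `$` as extensions to the standard, a rule like `/*?` can be sanity-checked by translating the pattern into a regular expression. This is a rough sketch of the documented wildcard matching, not an official implementation, and the helper names are made up for illustration:

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a Google-style robots.txt path pattern into a regex:
    '*' matches any run of characters, a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile("^" + body + ("$" if anchored else ""))

def is_blocked(path: str, disallow_patterns) -> bool:
    """True if any Disallow pattern matches the start of the path."""
    return any(robots_pattern_to_regex(p).match(path) for p in disallow_patterns)

rules = ["/*?", "/present/", "/featured-images/"]

print(is_blocked("/maps/?all&keywords=Japan", rules))  # True: the URL contains '?'
print(is_blocked("/photography/history", rules))       # False: no '?' and no listed prefix
```

Again, Google’s robots.txt tester in Webmaster Tools is the authoritative check; this only mirrors the documented matching rules.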

Kjeld wrote:

Is there something wrong in the Disallow syntax, or does Disallow only work for actual directories and files?

As with your question about whether GoogleBot can access your database, the answer is no. Visitors don’t see your database, nor do they really know whether the content a URL shows is a directory, a dynamic page or a static file.

robots.txt itself should support a regular-expression-like syntax (well, just a couple of wildcards really) in addition to literal strings and full directives.

Last edited by Gocom (2010-10-14 22:59:30)

Offline


Powered by FluxBB