
Textpattern CMS support forum


#1 2010-10-14 00:44:36

Kjeld
Member
From: Tokyo, Japan
Registered: 2005-02-05
Posts: 453
Website

Googlebot goes berserk on my site

After trying to figure out exceedingly heavy server loads for many weeks, I discovered today that googlebot does some really weird searches on my new site MeijiShowa.com. The problems here may explain high server loads on some of my other sites, too. Maybe other people have the same problem?

I get entries like the following in my logs:

calendar/?all&keywords=Japan
calendar/?all&keywords=birds
calendar/?all&keywords=Ikuta
calendar/?all&keywords=piers
maps/?all&keywords=Koinobori
maps/?all&keywords=Motomachi
maps/?all&keywords=Azabu-san
maps/?all&keywords=tradition

Now, these keywords do exist on my site, but not in the calendar and maps sections.
(Ignore all&. It is a solution I came up with for a specific problem on this site, and unrelated to the problem at hand.)

As I have thousands of keywords, all these searches put a heavy burden on my server. Since I am hosted at MediaTemple, this greatly increases my GPU (Grid Performance Unit) usage, and I have been getting hundreds of dollars in overage bills.

My questions:

1. How does googlebot find these keywords and match them up with unrelated sections?
2. Should I put something in my robots.txt file (to be created) to prevent googlebot (and other bots) from making search requests that lead nowhere and just add to server load?
3. Is there anything else that I can do to prevent these useless searches from occurring?


Old Photos of Japan – Japan in the 1850s~1960s (100% txp)
MeijiShowa – Stock photos of Japan in the 1850s~1960s (100% txp)
JapaneseStreets.com – Japanese street fashion (mostly txp)

Offline

#2 2010-10-14 04:28:41

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 9,254
Website GitHub Mastodon Twitter

Re: Googlebot goes berserk on my site

Google’s Webmaster Tools asks for the search query URL on your site, which is then used. I’m not sure where I saw this, but I think I can narrow it down to Google Sitemaps.


Yiannis
——————————
NeMe | hblack.art | EMAP | A Sea change | Toolkit of Care
I do my best editing after I click on the submit button.

Offline

#3 2010-10-14 06:36:25

Kjeld
Member
From: Tokyo, Japan
Registered: 2005-02-05
Posts: 453
Website

Re: Googlebot goes berserk on my site

colak wrote:

Google’s Webmaster Tools asks for the search query URL on your site, which is then used. I’m not sure where I saw this, but I think I can narrow it down to Google Sitemaps.

Google’s webmaster tools has no data yet for MeijiShowa.com.



Offline

#4 2010-10-14 11:01:25

gaekwad
Server grease monkey
From: People's Republic of Cornwall
Registered: 2005-11-19
Posts: 4,482
Bitbucket GitHub

Re: Googlebot goes berserk on my site

I’ve had great success with rah_sitemap in taming GoogleBot. You can also set the GoogleBot crawl rate in GWT, which should help.

Offline

#5 2010-10-14 11:21:33

Kjeld
Member
From: Tokyo, Japan
Registered: 2005-02-05
Posts: 453
Website

Re: Googlebot goes berserk on my site

Thanks, Pete & Yiannis.

What I really want to know though is: how does googlebot find these keywords and match them up with unrelated sections?

Can googlebot somehow get into the database to get to the keywords?

Last edited by Kjeld (2010-10-14 11:22:22)



Offline

#6 2010-10-14 11:51:09

Gocom
Developer Emeritus
From: Helsinki, Finland
Registered: 2006-07-14
Posts: 4,533
Website

Re: Googlebot goes berserk on my site

GoogleBot’s job is to crawl, and you are linking to the pages (maps/calendar) with ‘keywords’ parameters. The end result is that GoogleBot tries the keywords you are using elsewhere on the site and attempts to find something new to index.

You could block those ‘filtered’ pages with robots.txt by telling Google not to use the ‘keywords’ parameter. After all, the only thing those filtered pages do is spread the same content across multiple addresses, and duplicated content is really never a good thing. Plus it wastes bandwidth.

For example adding something like:

User-agent: *
Disallow: /maps/?keywords=*
Disallow: /calendar/?keywords=*

to your robots.txt tells Google not to use the keywords parameter.
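For what it’s worth, you can sanity-check rules like these offline with Python’s standard `urllib.robotparser` (a rough sketch; `example.com` is a stand-in for your own domain). One caveat: Python’s parser does strict prefix matching and does not implement Google’s `*` wildcard extension, so the rules below use plain prefixes with no trailing `*`:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt using plain prefix rules (no trailing '*'),
# which both the original standard and Python's parser understand.
rules = """\
User-agent: *
Disallow: /maps/?keywords=
Disallow: /calendar/?keywords=
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Filtered search URLs are blocked by the prefix rule...
print(rp.can_fetch("*", "http://example.com/maps/?keywords=Koinobori"))  # False
# ...while normal section pages stay crawlable.
print(rp.can_fetch("*", "http://example.com/maps/"))                     # True
```

This only tells you how a strict, standards-style parser reads the file; Google’s own robots.txt tester is still the authority on what GoogleBot will do.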

Kjeld wrote:

Can googlebot somehow get into the database to get to the keywords?

No.

Last edited by Gocom (2010-10-14 12:01:07)

Offline

#7 2010-10-14 12:09:13

Kjeld
Member
From: Tokyo, Japan
Registered: 2005-02-05
Posts: 453
Website

Re: Googlebot goes berserk on my site

Thanks a lot, Jukka. I was wondering how to phrase the Disallow. Your samples help.

I still don’t understand where googlebot gets those non-existing links. For example, links like maps/?all&keywords=Koinobori and maps/?all&keywords=tradition don’t exist anywhere on the site. The keywords Koinobori and tradition do exist, but only in the photography section. How does googlebot manage to combine keywords from one section to create links in another?



Offline

#8 2010-10-14 12:20:14

Gocom
Developer Emeritus
From: Helsinki, Finland
Registered: 2006-07-14
Posts: 4,533
Website

Re: Googlebot goes berserk on my site

Kjeld wrote:

How does googlebot manage to combine keywords from one section to create links in another?

Probably because of the shared ?keywords parameter. It sees you are using the keywords tradition and Koinobori on other pages, so when it finds the same keywords parameter used on a different page, it tries those existing keywords on that page too.

Last edited by Gocom (2010-10-14 12:22:49)

Offline

#9 2010-10-14 12:37:06

Kjeld
Member
From: Tokyo, Japan
Registered: 2005-02-05
Posts: 453
Website

Re: Googlebot goes berserk on my site

Gocom wrote:

Probably because of the shared ?keywords parameter. It sees you are using the keywords tradition and Koinobori on other pages, so when it finds the same keywords parameter used on a different page, it tries those existing keywords on that page too.

Thanks again, Jukka. I didn’t know the googlebot took that much initiative! I always thought bots only followed existing links.

I always turn my textpattern logs off, but I have turned them on this time to see where exactly googlebot goes. Using the info from the logs, I have now created a robots.txt file to warn all bots away from the links with the parameters.

I have also installed rah_sitemap and submitted the site’s sitemap to google webmaster tools. Thanks, Pete.

Once again, thanks to all for the input. I had no idea that the bots were this intrusive.

Now, let’s hope all these efforts bear fruit!



Offline

#10 2010-10-14 21:12:30

Kjeld
Member
From: Tokyo, Japan
Registered: 2005-02-05
Posts: 453
Website

Re: Googlebot goes berserk on my site

Yesterday I created a robots file with entries like the following:

Disallow: /maps/?all&keywords=*
Disallow: /photography/?all&keywords=*
Disallow: /calendar/?all&keywords=*
Disallow: /illustrations/?all&keywords=*
Disallow: /maps/?all&place=*
Disallow: /photography/?all&place=*
Disallow: /calendar/?all&place=*
Disallow: /illustrations/?all&place=*

But it seems that googlebot still accesses such entries. For example, today I found the following googlebot entry in my logs:

maps/?all&keywords=Daibiru%20Honkan

This should have been blocked by the first entry above…

Is there something wrong with my Disallow syntax, or does Disallow only work for actual directories and files? I can’t seem to find info on that at robotstxt.org.
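One possible explanation (an assumption about strict parsers, not a verified account of GoogleBot’s behaviour): under the original robots.txt standard, Disallow is a plain prefix match and `*` is a literal character, so `/maps/?all&keywords=*` only blocks paths that literally contain an asterisk. Python’s strict-prefix `urllib.robotparser` shows the effect, with `example.com` standing in for the real domain:

```python
from urllib.robotparser import RobotFileParser

# The rule as written in the robots.txt above, trailing '*' included.
rules = """\
User-agent: *
Disallow: /maps/?all&keywords=*
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Under strict prefix matching the trailing '*' is literal, so the
# logged URL is NOT considered blocked by this rule.
print(rp.can_fetch("*", "http://example.com/maps/?all&keywords=Daibiru%20Honkan"))  # True
```

Dropping the trailing `*` (i.e. `Disallow: /maps/?all&keywords=`) turns the rule into a prefix that matches under any parser.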



Offline

#11 2010-10-14 22:17:06

MattD
Plugin Author
From: Monterey, California
Registered: 2008-03-21
Posts: 1,254
Website

Re: Googlebot goes berserk on my site

I’ve just tried testing your rule and url on Google’s Webmaster Tools and confirmed it does not work.


My Plugins

Piwik Dashboard, Google Analytics Dashboard, Minibar, Article Image Colorpicker, Admin Datepicker, Admin Google Map, Admin Colorpicker

Offline

#12 2010-10-14 22:53:29

Gocom
Developer Emeritus
From: Helsinki, Finland
Registered: 2006-07-14
Posts: 4,533
Website

Re: Googlebot goes berserk on my site

Google’s own robots.txt manual is pretty helpful.

It might be that the rules are failing because the parameter all doesn’t have a value. I didn’t check, but that might be the case. Don’t quote me on that.

Anyhow, to cut down the number of rules, as there are plenty, you could just block all URLs that use a question mark. For example, something like this might work (not tested):

User-agent: *
Disallow: /*?
Disallow: /present/
Disallow: /featured-images/
Disallow: /explore/
Disallow: /checkout/
Disallow: /confirmation/

If it works, it would also block some other unwanted pages, like ?id=, ?q= and ?s=, which are part of TXP’s URL structure.
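Since GoogleBot does honour `*` and `$` as extensions to the standard, a rule like `/*?` can be sanity-checked by translating the pattern into a regular expression. This is a rough sketch of the documented wildcard matching, not an official implementation, and the helper names are made up for illustration:

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a Google-style robots.txt path pattern into a regex:
    '*' matches any run of characters, a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile("^" + body + ("$" if anchored else ""))

def is_blocked(path: str, disallow_patterns) -> bool:
    """True if any Disallow pattern matches the start of the path."""
    return any(robots_pattern_to_regex(p).match(path) for p in disallow_patterns)

rules = ["/*?", "/present/", "/featured-images/"]

print(is_blocked("/maps/?all&keywords=Japan", rules))  # True: the URL contains '?'
print(is_blocked("/photography/history", rules))       # False: no '?' and no listed prefix
```

Again, Google’s robots.txt tester in Webmaster Tools is the authoritative check; this only mirrors the documented matching rules.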

Kjeld wrote:

Is there something wrong in the Disallow syntax, or does Disallow only work for actual directories and files?

As with your question about whether GoogleBot can access your database, the answer is no. Visitors don’t see your database, nor do they really know whether the content a URL shows is a directory, a dynamic page or a static file.

robots.txt itself should support a regular-expression-like syntax (well, just a couple of wildcards really) in addition to literal strings and full directives.

Last edited by Gocom (2010-10-14 22:59:30)

Offline


Powered by FluxBB