
Textpattern CMS support forum


#1 2019-03-13 16:47:52

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,909
Website

Robots Exclusion Protocol

I have to admit, this whole robots.txt thing was not even on my radar until about three years ago, and not even on my mind as a site owner until recently. I’m guessing I’m not alone in that, though probably not among you smart bunch. And I’m trying to get my head around a few things.

I know when and how it started, and who the man behind it was. I know it’s not a standard, per se, but it’s still very much used, and will remain so as long as the likes of Google still recognize it. I know that robot makers don’t have to respect it, which is why there are a lot of ‘bad bots’ out there. And I know there have been some variously supported extensions to the original ‘spec’, like Allow, Sitemap, Crawl-delay, use of wildcards (*), and $ (the purpose of which I do not know).

I really started taking an interest in these things when I realized website archives are starting not to respect them (hence another editorial effort in the making).

So I have a few basic questions for those who can help.

What does this do (the UA is arbitrary here)?

User-agent: *
Allow: /$
Disallow: /*

Also, I see a lot of robots.txt files that are long, filled with stuff like this to block the bad bots (if they abide):

User-agent: Digimind
Disallow: /
#
User-agent: Knowings
Disallow: /
#
User-agent: Sindup
Disallow: /
#
User-agent: Cision
Disallow: /
#
User-agent: Talkwater
Disallow: /
#
User-agent: TurnitinBot
Disallow: /

Is it not possible to write that like this?

User-agent: Digimind
User-agent: Knowings
User-agent: Sindup
User-agent: Cision
User-agent: Talkwater
User-agent: TurnitinBot
Disallow: /

And if you’re going to use a robots.txt file, what’s the best way to set up the content attribute in <meta name="robots" content="">? I recall reading somewhere that people often make a mistake here because of the difference between ‘crawler’ bots and ‘indexer’ bots, and that when push comes to shove the robots.txt file is the boss, because it’s the one most likely to get read if, for whatever reason, the page itself is blocked.
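
(By way of illustration, the values I mean are the usual comma-separated directives, something like:

<meta name="robots" content="index, follow">
<meta name="robots" content="noindex, nofollow">

…but which combination makes sense alongside a robots.txt file is exactly what I’m unsure about.)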


#2 2019-03-13 17:28:21

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,909
Website

Re: Robots Exclusion Protocol

Using a potential Txp site example, does this hold water?

User-agent: *              # all user agents
Disallow: /                # disallow access to everything
#
# Now override the global directive;
# allowing access to specific bots,
# except to specific things.
#
User-agent: googlebot
User-agent: duckduckgo
User-agent: Qwantify
User-agent: yandex
User-agent: baidu
User-agent: bingbot
User-agent: teoma
Allow: 
Disallow: /textpattern/
Disallow: /assets/
Disallow: /themes/
Disallow: /images/
Disallow: /files/
Disallow: /rpc/
Disallow: /css.php

The first block is a global rule to block everything. The second is an override to allow only the handful of bots I care about (keeping bandwidth low), which probably covers most of the online searching world anyway.

I just don’t know about the cascading structure of it all.

Is it supposed to be Allow: /?

Could .htaccess go in there? Or does its hidden (dot-file) nature already block access to it?

Come to think of it… Maybe the Txp .htaccess file is set up such that disallowing those core locations via a robots file is redundant anyway? I mean, better not to draw attention to something that isn’t getting attention, right?

Last edited by Destry (2019-03-13 17:55:05)


#3 2019-03-13 18:05:06

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,909
Website

Re: Robots Exclusion Protocol

Btw, here’s something reeeeeaally juicy… I was looking around at the robots.txt files of different heavy hitters, and the one I thought to check in France, Le Monde, has this interesting bit at the top, translated here into English for you:

#It is prohibited to use web indexing robots or other automatic browsing or navigation methods on this website.
#We prohibit crawling our website using a spoofed user agent that does not match your identity.
#“Violation of the right of the database producer – article L 342-1 et seq. of the Intellectual Property Code”.
#We invite you to contact us to contract a user license. Only partners are authorized to use our content for purposes other than strictly individual use.

So what that suggests, assuming Le Monde knows what it’s talking about (and it might), is that French Droit d’Auteur (copyright law) has an article under which doing what the Internet Archive does and ignoring the robots.txt file is, in fact, against the law unless permission has been given; and since Le Monde’s robots.txt is designed to restrict the bot, that’s essentially withholding automatic permission. This blows a big hole in all those dude-bros screaming about ‘fair use’. Sorry. ;)

And here’s the Article, again translated:

The database producer has the right to prohibit:

1° The extraction, by permanent or temporary transfer of all or a qualitatively or quantitatively substantial part of the contents of a database to another medium, by any means and in any form whatsoever;

2° Reuse, by making available to the public all or a qualitatively or quantitatively substantial part of the content of the database, whatever the form.

These rights may be transferred or assigned or licensed.

Public lending is not an act of extraction or reuse.

This could/would be relevant to Txp sites, for example, because they are database-driven. Static sites, say, would presumably fall outside that protection.

I could be interpreting it wrong; it needs following up on. One thing that is still unclear to me is how the IA sees itself. As a national library? How so? What qualifies it as that, legally speaking? That could be their loophole.

But the Archive does grant removals when requested in writing, and that is certainly a good nod toward something like GDPR compliance.

A lot of fuzz, a lot of fuzz on that peach.


#4 2019-03-14 12:32:22

philwareham
Core designer
From: Haslemere, Surrey, UK
Registered: 2009-06-11
Posts: 3,564
Website GitHub Mastodon

Re: Robots Exclusion Protocol

I wouldn’t worry too much over the robots.txt file; bad bots are going to ignore its directives anyway… because, well, they are bad bots.

Just the following is broadly fine:

User-agent: *
Disallow: /file_download/
Disallow: /textpattern/
Sitemap: https://example.com/sitemap.xml

You’d be more successful blocking via the .htaccess file or the Nginx/Lighttpd/whatever equivalent. See this article for some pointers.
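
For example, a block-by-user-agent rule in .htaccess typically looks something along these lines (the bot names here are just placeholders for whatever you want to keep out):

<IfModule mod_rewrite.c>
RewriteEngine On
# Send a 403 Forbidden to any request whose User-Agent matches one of these names (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} (Digimind|Knowings|Sindup) [NC]
RewriteRule .* - [F,L]
</IfModule>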


#5 2019-03-14 13:11:18

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 9,011
Website GitHub Mastodon Twitter

Re: Robots Exclusion Protocol

philwareham wrote #317048:

You’d be more successful blocking via the .htaccess file or the Nginx/Lighttpd/whatever equivalent. See this article for some pointers.

+1 on that. Here’s a list of what we have, to get you started.


Yiannis
——————————
NeMe | hblack.art | EMAP | A Sea change | Toolkit of Care
I do my best editing after I click on the submit button.


#6 2019-03-15 09:23:48

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,909
Website

Re: Robots Exclusion Protocol

Wise counsel. Thanks.

Colak, wow. That’s quite a file…

I’m assuming everything commented out is non-functional? There’s sure a lot of it. E.g. the ‘hotlink protection’ starting at line 105, that’s not doing anything, right?

It looks like you have your ‘deny’ rules really condensed, alphabetically maybe? I don’t really understand it. Can you explain a given line, say the ‘Ms’ on line 90? What do all the backslashes and bars mean?

I’m not sure I want to spend too much time worrying about every bad bot on the planet; might be a time sink. Are there simpler rules to use? Say to first block everything wholesale, then individually allow only the two-hands-worth of agents one cares about?

There is some other stuff in there that looks interesting and useful, like at line 130… What’s that about ‘ETags’?

And line 43, <IfModule mod_expires.c>, what are those lines doing? I might like those too. ;)

And those file rules at the end. Good, good, good. Thanks.

Where must Txp rules go in relation to all this stuff? At very top? Very bottom? Does it matter?


#7 2019-03-15 14:21:31

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 9,011
Website GitHub Mastodon Twitter

Re: Robots Exclusion Protocol

Destry wrote #317059:

Wise counsel. Thanks.

Colak, wow. That’s quite a file…

I’m assuming everything commented out is non-functional? There’s sure a lot of it. E.g. the ‘hotlink protection’ starting at line 105, that’s not doing anything, right?

Indeed. The idea was to prevent hotlinking except from the search engines, as they actually do bring traffic in. Instead, the rules blocked everyone.
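
For anyone curious, the general shape of that kind of hotlink protection is roughly this (example.com is a placeholder, and these are not my exact rules):

RewriteEngine On
# Allow empty referers and our own domain; refuse image requests referred from anywhere else
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
RewriteRule \.(gif|jpe?g|png)$ - [F,NC,L]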

It looks like you have your ‘deny’ rules really condensed, alphabetically maybe? I don’t really understand it. Can you explain a given line, say the ‘Ms’ on line 90? What do all the backslashes and bars mean?

Over the years I compiled the bad bots from my own logs but also from other people. Eventually the list became too large to handle in any way other than alphabetically. I can’t really say what each bot did any more, but I mostly get proper traffic now. Having said that, I still need to do something regarding bad referrers.
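
To give the general idea though, each of those condensed lines is essentially a regular expression: the bars are OR separators between user-agent names and the backslashes escape literal characters such as dots. Something along these lines (the bot names are only examples, in older Apache 2.2-style syntax):

# "|" separates alternatives, "\." escapes a literal dot
SetEnvIfNoCase User-Agent "MJ12bot|Mail\.RU_Bot|MegaIndex" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot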

I’m not sure I want to spend too much time worrying about every bad bot on the planet; might be a time sink. Are there simpler rules to use? Say to first block everything wholesale, then individually allow only the two-hands-worth of agents one cares about?

I’m sure that there is a way… or just copy/paste my rules and you’ll be ready to go.

There is some other stuff in there that looks interesting and useful, like at line 130… What’s that about ‘ETags’?

Are you referring to line 33? en.wikipedia.org/wiki/HTTP_ETag
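
The lines around there are the usual sort of thing for switching ETags off so caching is governed by the expiry headers instead, roughly (a generic example, not necessarily verbatim from my file):

# Stop Apache sending ETag validators; rely on Expires/Cache-Control instead
FileETag None
<IfModule mod_headers.c>
Header unset ETag
</IfModule>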

And line 43, <IfModule mod_expires.c>, what are those lines doing? I might like those too. ;)

Those lines instruct the browser to cache the particular file types for the particular durations. Note that when I make major changes to the site, I assign 1 second to all of those values and leave it like that for a couple of weeks, to make sure as many people as possible will see the changes.
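
By way of illustration, it is the standard mod_expires pattern, roughly like this (the types and durations here are only examples, not my exact values):

<IfModule mod_expires.c>
ExpiresActive On
# Tell browsers how long to cache each MIME type
ExpiresByType image/jpeg "access plus 1 month"
ExpiresByType image/png "access plus 1 month"
ExpiresByType text/css "access plus 1 week"
ExpiresByType application/javascript "access plus 1 week"
</IfModule>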

And those file rules at the end. Good, good, good. Thanks.

Glad you like :)

Where must Txp rules go in relation to all this stuff? At very top? Very bottom? Does it matter?

I have those on line 238.


Yiannis
——————————
NeMe | hblack.art | EMAP | A Sea change | Toolkit of Care
I do my best editing after I click on the submit button.

