Stop bots touching files / affecting download counters

Bloke · 2007-09-20 23:13:48

This may be a daft question but if I employ a robots.txt file that disallows access to the /files directory, should well-behaved bots be forbidden from following links generated by the <txp:download...> tags? e.g. (http://site.com/file_download/34)

Judging by the download counts of some of the mp3 files I host, I think bots are either downloading or, at the very least, touching the files linked off my site via file_download and then aborting; either way the download count goes up and the stats are misleading.

I’d like to politely tell bots to lay off anything in the files directory, whether accessed directly (site.com/files/my.mp3) or indirectly (site.com/file_download/28). At least my download counts would then be a little more of an accurate measure of human downloads. And I’ll probably save some bandwidth in the process.

Many thanks in advance for any pointers, and sorry if the question is rather naive.

marios · 2007-09-21 02:39:27

I’ve wondered about this one as well, a while ago.
One of the most stupid things I ever did was upload an improperly packed file, which you can see in the third entry of the search results below.

http://www.google.com/search?client=safari&rls=en&q=mrs_exif_lab&ie=UTF-8&oe=UTF-8

Googlebot even unzipped and untarred the file, and then displayed the path in the results. A nice invitation to hackers.

I’ve yet to remove this one and let Google reindex the file.

I also believe that my download count is inaccurate.

Maybe some sort of server-side trickery could resolve this.

regards, marios

obeewan · 2007-09-21 06:48:05

I haven’t used robots.txt for quite some time, but isn’t it supposed to limit bots’ behaviour? So if you specify /files as off-limits and place the text file in the root of the server, a bot should consult robots.txt before proceeding, even if it is following a link.

Sure, this only applies to, as you put it, the well-behaved ones. The badly behaved ones imitate normal web browsers so closely that they’re hard to tell apart anyhow.

Edit:
http://www.robotstxt.org/wc/robots.html seems like a fine read to brush up anyone’s knowledge of robots.txt.

Last edited by obeewan (2007-09-21 06:49:44)

Bloke · 2007-09-21 08:59:30

Thanks for the replies guys.

@marios:

Unzipping and indexing content… ouch! Never realised Googlebot was so bold.

Maybe some sort of server-side trickery could resolve this.

Yeah, perhaps. If I have to play dirty with mod_rewrite or something I will, but before I can work out what’s possible I’d like to find out exactly what happens if a bot follows a file_download link (see below). I’m only half-programmer and the TXP core is too clever for me to fully grasp.

@obeewan:

Thanks for the link, I’ll have a trawl through that and brush up.

even if the bot is following a link it should consult robots.txt before proceeding?

That’s what I hope will happen. But I’d like it confirmed by someone more knowledgeable than myself, since file_download is a sort of ‘virtual’ link that increases the counter before serving the file.

I guess beneath the surface the site.com/file_download/#ID must resolve to a real URL and serve the file to the requesting party as site.com/files/real_file so, in theory, at that point the robots.txt should be consulted and the file denied.

A couple of questions arise:

1. Does it work this way? Not being a bot (or capable of writing one) I can’t test this without putting a robots.txt file there and waiting for the site to be indexed, hoping.
2. Even if the file is denied, is the fact that there has been an attempt at access enough to increase the download count for that file? I suspect so :-(

The 2nd question is the one that’s really bugging me. If I click a download link and then click Cancel at the “Save to disk” request, the download count still goes up by 1. Fair enough, the counter has no way of legislating for me overriding my decision to download.

But, essentially, does a bot that has been denied access to a file do exactly the same thing and increase the counter for a file it may have no hope of ever indexing? That’s what I’d like to stop to try and make sure my download counters are (a little) more meaningful.
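One form the server-side trickery could take is to skip the counter when the request’s User-Agent header looks like a crawler. The sketch below is only an illustration in Python, not Textpattern code: the function names and the list of marker substrings are made up for the example, and bots that imitate normal browsers (as mentioned earlier in the thread) would slip straight through.

```python
# Illustrative substrings only; a real list would need regular maintenance.
BOT_MARKERS = ("bot", "spider", "crawl", "slurp")


def looks_like_bot(user_agent):
    """Guess whether a User-Agent string belongs to a crawler."""
    ua = (user_agent or "").lower()
    if not ua:
        # A missing or empty User-Agent is treated as suspicious here.
        return True
    return any(marker in ua for marker in BOT_MARKERS)


def should_count_download(user_agent):
    """Hypothetical hook: only bump the counter for probable humans."""
    return not looks_like_bot(user_agent)


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) "
        "Gecko/20070725 Firefox/2.0.0.6",
    ]
    for ua in samples:
        print(should_count_download(ua), ua)
```

The file would still be served either way; only the decision to increase the count depends on the check.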

If all my downloads are from real people, that’s way cool and I’m pleased to be so popular. But I have this nagging sensation that a vast proportion are spiders…

Thanks for any wisdom sent this way.

obeewan · 2007-09-21 09:30:57

I guess beneath the surface the site.com/file_download/#ID must resolve to a real URL and serve the file to the requesting party as site.com/files/real_file so, in theory, at that point the robots.txt should be consulted and the file denied.

If you deny both /files and /file_download in robots.txt this should work for the well-behaved ones. I wonder if it is possible to send this info also through headers.
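A rough way to check this without waiting for a crawl: a compliant bot matches the URL it is about to request against the robots.txt rules by path prefix, before it sends the request. What matters is therefore the /file_download/ path in the link rather than wherever the file really lives, and a disallowed request should never reach the counter at all. The sketch below uses Python’s standard urllib.robotparser to play the part of such a bot. The robots.txt content and the site.com URLs are assumptions based on the examples in this thread, and it says nothing about bots that ignore robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt for the example site, placed at the server root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /files/
Disallow: /file_download/
"""


def allowed(url, user_agent="*"):
    """Return True if a robots.txt-compliant crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://site.com/file_download/34",
        "http://site.com/files/my.mp3",
        "http://site.com/articles/1",  # hypothetical ordinary page
    ):
        print(url, allowed(url))
```

Note that /files/ is not a prefix of /file_download/, which is why both rules are listed.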

One more thing:
I have no idea how this tracking works, but it should be able to check for some kind of “download complete” handshake with the user and only update the counter once that handshake has happened.
If you can check this with JavaScript, why not with server-side PHP? My experience and knowledge here are a bit limited; I haven’t worked with file upload/download for quite some time, so I can’t come up with a reference for this. Google is probably the way to go for info on this.
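HTTP itself has no “download complete” handshake, but a server-side script can at least notice when the connection drops before the last chunk has been written. The sketch below shows that idea in Python rather than PHP and is not how Textpattern works (per the thread, its counter goes up before the file is served). The names send_file, write and on_complete are invented for the example.

```python
import io


def send_file(source, write, on_complete, chunk_size=8192):
    """Stream source through write(); call on_complete() only if every
    chunk was written. Returns the number of bytes sent."""
    sent = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        # write() is expected to raise (e.g. ConnectionError) if the
        # client has gone away; the exception skips on_complete().
        write(chunk)
        sent += len(chunk)
    on_complete()
    return sent


if __name__ == "__main__":
    counts = {"downloads": 0}

    def bump():
        counts["downloads"] += 1

    out = io.BytesIO()  # stands in for the client connection
    send_file(io.BytesIO(b"fake mp3 data"), out.write, bump)
    print(counts)
```

This only helps with requests that are aborted part-way through. A bot that fetches the whole file would still be counted, and the server can only tell that it handed over all the bytes, not that anyone kept them.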

Textpattern CMS support forum
