Textpattern CMS support forum
Blocking crawler logging
Does anybody know how to keep crawler visits out of the log? Viewing just referrers isn’t what I’m after.
I don’t mind tweaking the code, but I’m not sure what would do it.
Thanks!
~joe
#2 2006-04-07 16:15:52
- els
- Moderator
- From: The Netherlands
- Registered: 2004-06-06
- Posts: 7,458
Re: Blocking crawler logging
In this thread you’ll find directions on how to modify log.php to do this. At the time I couldn’t get it working, but then I’m not a coder. So if the thread helps you find a way to do it, I’d love to hear about it ;)
Re: Blocking crawler logging
Thanks, Els. That’s not quite what I was after anyway, but it does make it obvious what I need to do.
I’m going to code up a fix that actually prevents the logging of designated IPs and domains, allowing you to specify partial IPs and partial domains (and to avoid wasting clock cycles on domain lookups and DB writes).
It’s pretty straightforward. When I’m sure it works, I’ll post it here.
~joe
Re: Blocking crawler logging
Well, I’ve finished it and have it working, but I just spent the past 45 minutes trying to figure out how to get Textile to post the code to this forum. Putting the code between pre and code tags didn’t work: it would still interpret some of the code and, amazingly, even remove whole sections of it.
Google turns up an article on posting code to the forum, but the article itself is gone. How do I do this?! Thanks!
#5 2006-04-08 05:11:45
- KurtRaschke
- Plugin Author
- Registered: 2004-05-16
- Posts: 275
Re: Blocking crawler logging
http://textpattern.com/faq/43/how-do-i-post-tags-and-code-on-the-forum
or use pastebin.
-Kurt
kurt@kurtraschke.com
Re: Blocking crawler logging
Okay, I have something that appears to work. It’s a hack for TxP 4.0.3. Two steps:
(1) Create a file called ignore.php and put it in textpattern/, the same directory that holds config.php. Put the following code in the file, customizing it as needed:
<pre><code>
<?php
// IP addresses not to log (or front ends thereof, if ending with a dot).
// List an IP address if you can, as doing so eliminates DNS lookups.
$ignore_ips = array(
	'192.168.1.'
);

// Domains not to log (or tail ends thereof, if beginning with a dot).
// List domains in any letter case; they are not case-sensitive.
$ignore_domains = array(
	'.googlebot.com',
	'.inktomisearch.com'
);
?>
</code></pre>
Be sure to put commas between your quoted IPs or domains, as PHP requires.
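To make the matching rules concrete, here’s a standalone sketch of the same prefix/suffix logic (the helper names here are made up for illustration; the real checks live inside the logit() hack later in this thread). A trailing dot on an IP makes it match as a prefix, and a leading dot on a domain makes it match as a case-insensitive suffix:

```php
<?php
// Illustrative sketch of ignore.php's matching rules (standalone demo).

// A trailing dot marks an IP prefix: '192.168.1.' matches 192.168.1.x only.
function ip_is_ignored($ip, $ignore_ips)
{
	foreach ($ignore_ips as $ignore) {
		if (substr($ignore, -1) == '.') {          // prefix match
			if (!strncmp($ignore, $ip, strlen($ignore))) return true;
		}
		elseif (!strcmp($ignore, $ip)) return true; // exact match
	}
	return false;
}

// A leading dot marks a domain suffix: '.googlebot.com' matches any subdomain.
function domain_is_ignored($host, $ignore_domains)
{
	foreach ($ignore_domains as $ignore) {
		if (strlen($ignore) > strlen($host)) continue;
		if ($ignore[0] == '.') {                    // case-insensitive suffix match
			if (!strcasecmp($ignore, substr($host, -strlen($ignore)))) return true;
		}
		elseif (!strcasecmp($ignore, $host)) return true;
	}
	return false;
}

var_dump(ip_is_ignored('192.168.1.100', array('192.168.1.')));  // true
var_dump(ip_is_ignored('192.168.10.5', array('192.168.1.')));   // false: the dot stops .1 matching .10
var_dump(domain_is_ignored('crawl-1.googlebot.com', array('.googlebot.com'))); // true
```

Note that the dots are what keep partial matches safe: without them, `.bop.com` would happily match `beebop.com`.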
(2) In textpattern/publish/log.php, add all of the lines designated by /* JTL */ and found between /* BEGIN JTL */ and /* END JTL */ in the following code (or just replace the function logit() with everything here):
Code for log.php
(Sorry, I had to use pastebin. Textile does not seem to have a working syntax-escape. Thanks Kurt for pointing me there.)
This does not help people with a dynamic IP who want to ignore their own visits. I’m not quite sure how to deal with that.
Lemme know if you have any problems.
~joe
#7 2006-04-08 06:00:25
- zem
- Developer Emeritus
- From: Melbourne, Australia
- Registered: 2004-04-08
- Posts: 2,579
Re: Blocking crawler logging
Joe, grep publish.php for $nolog. You can do this in a plugin.
Alex
Re: Blocking crawler logging
Hi Alex. I must be missing something. I don’t see the hook, but that does look like a good place for a callback. ~joe
#9 2006-04-08 06:55:15
- zem
- Developer Emeritus
- From: Melbourne, Australia
- Registered: 2004-04-08
- Posts: 2,579
Re: Blocking crawler logging
Plugin code loads early on, before that block. You could do your check at plugin load time, and set $nolog if it’s a hit that shouldn’t be logged.
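For anyone following along, here’s a minimal sketch of what this could look like, assuming your TxP version still checks $nolog before logging (the helper name and the IP list below are made up for this example):

```php
<?php
// Sketch of a plugin body that runs at load time, before publish.php's
// logging block, and sets $nolog for hits that shouldn't be logged.
// jtl_should_skip_logging() and $ignore_ips are illustrative names only.
function jtl_should_skip_logging($ip, $ignore_ips)
{
	foreach ($ignore_ips as $ignore) {
		if (substr($ignore, -1) == '.') { // trailing dot = prefix match
			if (!strncmp($ignore, $ip, strlen($ignore))) return true;
		}
		elseif (!strcmp($ignore, $ip)) return true;
	}
	return false;
}

$ignore_ips = array('192.168.1.');
$ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';

if (jtl_should_skip_logging($ip, $ignore_ips)) {
	$GLOBALS['nolog'] = 1; // publish.php then skips the logit() call
}
```

The IP check is cheap enough to run on every hit; the DNS-based check is the part that doesn’t fit here, as the next post explains.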
Last edited by zem (2006-04-08 06:55:35)
Alex
Re: Blocking crawler logging
Ah, I see. Hmmm. That works well for the IP check, but if I do the DNS check there, we’d do two DNS lookups per logged visit (or one lookup and one “cache” retrieval, or two “cache” retrievals). It works, but it’s not exactly desirable.
I’d need to hijack logit() altogether, but I’m not sure I can do that at plugin load time. If I can, I’d be duplicating TxP code (logit) and trying to keep it current with the latest TxP.
I’m not sure what the right answer is.
Re: Blocking crawler logging
Is anybody using this hack?
I’ve just realized that it has a drawback: since ignored domains don’t make it into the log file, and since the log file is the domain cache, every visit by an ignored domain suffers a DNS lookup.
I think the right thing to do is to have a proper domain cache (which would be a faster lookup, anyway, and fewer lookups overall). But perhaps an intermediate fix is to have ignored domains logged only once; if the domain already appears in a log entry, it is ignored, and otherwise it is logged. Reduces log clutter, anyway, and lets you know that your domain is being visited by the search bots.
Would be nice to come up with a solution we can put into TxP proper. Perhaps implementing a domain name cache would allow a plugin to implement this ignore feature and communicate the DNS lookup to the logger without resulting in a second lookup.
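To sketch the cache idea (purely illustrative; TxP 4.0.x has no such helper, and the function name is invented): a per-request memo around gethostbyaddr() would at least let a plugin and the logger share a single lookup.

```php
<?php
// Illustrative per-request rDNS memo: both a plugin and logit() could
// call this, so the reverse lookup happens at most once per IP per hit.
// (Hypothetical helper; not part of TxP 4.0.x.)
function jtl_cached_host($ip)
{
	static $cache = array();
	if (!isset($cache[$ip])) {
		$host = @gethostbyaddr($ip);
		// Keep the bare IP if the lookup fails outright.
		$cache[$ip] = ($host === false) ? $ip : $host;
	}
	return $cache[$ip];
}
```

A persistent cross-request cache (say, a small host-cache table with a TTL) would go further, but even this avoids the double-lookup problem within a single hit.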
~joe
Re: Blocking crawler logging
Grrr. My old pastebin code seems to have disappeared from the pastebin site. Never mind: I came up with a way to post arbitrary code with Textile, and I’ll write a separate post explaining how it’s done.
Here’s the new logit() function, which ignores all but the first visit from each ignored domain. That is, if an ignored domain does not yet have an entry in the log file, its first visit is logged; subsequent visits are not logged, until that previous logged visit expires from the log file. The reasoning is per my previous post. (This is a temporary hack until TxP gets a real domain name cache.)
Replace the logit() function in publish/log.php:
<pre><code>function logit($r='')
{
	global $ignore_ips, $ignore_domains; /* JTL */
	global $siteurl, $prefs, $pretext;

	$mydomain = str_replace('www.', '', preg_quote($siteurl, "/"));
	$out['uri'] = @$pretext['request_uri'];
	$out['ref'] = clean_url(str_replace("http://", "", serverSet('HTTP_REFERER')));
	$host = $ip = serverSet('REMOTE_ADDR');

	/* BEGIN JTL */
	foreach ($ignore_ips as $ignore) {
		$max_len = strlen($ignore);
		if ($ignore[$max_len - 1] == '.') { // don't want .1 matching .100
			if (!strncmp($ignore, $ip, $max_len)) return;
		}
		elseif (!strcmp($ignore, $ip)) return;
	}
	/* END JTL */

	if (!empty($prefs['use_dns'])) {
		// A crude rDNS cache
		if ($h = safe_field('host', 'txp_log', "ip='".doSlash($ip)."' limit 1")) {
			$host = $h;
			/* BEGIN JTL */
			$max_len = strlen($host);
			foreach ($ignore_domains as $ignore) {
				if (strlen($ignore) <= $max_len) {
					if ($ignore[0] == '.') { // don't want bop.com matching beebop.com
						if (!strcasecmp($ignore, substr($host, $max_len - strlen($ignore)))) return;
					}
					elseif (!strcasecmp($ignore, $host)) return;
				}
			}
			/* END JTL */
		}
		else {
			// Double-check the rDNS
			$host = @gethostbyaddr(serverSet('REMOTE_ADDR'));
			if ($host != $ip and @gethostbyname($host) != $ip) $host = $ip;
		}
	}

	$out['ip'] = $ip;
	$out['host'] = $host;
	$out['status'] = 200; // FIXME
	$out['method'] = serverSet('REQUEST_METHOD');

	if (preg_match("/^[^\.]*\.?$mydomain/i", $out['ref'])) $out['ref'] = "";

	if ($r == 'refer') {
		if (trim($out['ref']) != "") insert_logit($out);
	}
	else insert_logit($out);
}</code></pre>