Textpattern CMS support forum
Blocking crawler logging
Does anybody know how to keep crawler visits out of the log? Viewing just referrers isn’t what I’m after.
I don’t mind tweaking the code, but I’m not sure what would do it.
Thanks!
~joe
#2 2006-04-07 16:15:52
- els
- Moderator
- From: The Netherlands
- Registered: 2004-06-06
- Posts: 7,458
Re: Blocking crawler logging
In this thread you’ll find directions on how to modify log.php to do this. At the time I couldn’t get it working, but then I’m not a coder. So if the thread helps you find a way to do it, I’d love to hear about it ;)
Re: Blocking crawler logging
Thanks, Els. That’s not quite what I was after anyway, but it does make it obvious what I need to do.
I’m going to code up a fix that actually prevents the logging of designated IPs and domains, allowing you to specify partial IPs and partial domains (and to avoid wasting clock cycles on domain lookups and DB writes).
It’s pretty straightforward. When I’m sure it works, I’ll post it here.
~joe
Re: Blocking crawler logging
Well, I’ve finished it and have it working, but I just spent the past 45 minutes trying to figure out how to get Textile to post the code to this forum. Putting the code between pre and code tags didn’t work: it would still interpret some of the code and, amazingly, even remove whole sections of it.
Google turns up an article on posting code to the forum, but the article itself is gone. How do I do this?! Thanks!
#5 2006-04-08 05:11:45
- KurtRaschke
- Plugin Author
- Registered: 2004-05-16
- Posts: 275
Re: Blocking crawler logging
http://textpattern.com/faq/43/how-do-i-post-tags-and-code-on-the-forum
or use pastebin.
-Kurt
kurt@kurtraschke.com
Re: Blocking crawler logging
Okay, I have something that appears to work. It’s a hack for TxP 4.0.3. Two steps:
(1) Create a file called ignore.php and put it in textpattern/, the same directory that holds config.php. Put the following code in the file, customizing it as needed:
<pre><code>
<?php
// IP addresses not to log (or front ends thereof, if ending with a dot).
// List an IP address if you can, as doing so eliminates DNS lookups.
$ignore_ips = array(
	'192.168.1.'
);

// Domains not to log (or tail ends thereof, if beginning with a dot).
// List domains in any letter case; they are not case-sensitive.
$ignore_domains = array(
	'.googlebot.com',
	'.inktomisearch.com'
);
?>
</code></pre>
Be sure to put commas between your quoted IPs or domains, as PHP requires.
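To make the matching rules concrete, here’s a standalone sketch of the same prefix/suffix logic (the helper names here are made up for illustration; the real checks live inside the logit() hack later in this thread). A trailing dot on an IP makes it match as a prefix, and a leading dot on a domain makes it match as a case-insensitive suffix:

```php
<?php
// Illustrative sketch of ignore.php's matching rules (standalone demo).

// A trailing dot marks an IP prefix: '192.168.1.' matches 192.168.1.x only.
function ip_is_ignored($ip, $ignore_ips)
{
	foreach ($ignore_ips as $ignore) {
		if (substr($ignore, -1) == '.') {          // prefix match
			if (!strncmp($ignore, $ip, strlen($ignore))) return true;
		}
		elseif (!strcmp($ignore, $ip)) return true; // exact match
	}
	return false;
}

// A leading dot marks a domain suffix: '.googlebot.com' matches any subdomain.
function domain_is_ignored($host, $ignore_domains)
{
	foreach ($ignore_domains as $ignore) {
		if (strlen($ignore) > strlen($host)) continue;
		if ($ignore[0] == '.') {                    // case-insensitive suffix match
			if (!strcasecmp($ignore, substr($host, -strlen($ignore)))) return true;
		}
		elseif (!strcasecmp($ignore, $host)) return true;
	}
	return false;
}

var_dump(ip_is_ignored('192.168.1.100', array('192.168.1.')));  // true
var_dump(ip_is_ignored('192.168.10.5', array('192.168.1.')));   // false: the dot stops .1 matching .10
var_dump(domain_is_ignored('crawl-1.googlebot.com', array('.googlebot.com'))); // true
```

Note that the dots are what keep partial matches safe: without them, `.bop.com` would happily match `beebop.com`.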
(2) In textpattern/publish/log.php, add all of the lines designated by /* JTL */ and found between /* BEGIN JTL */ and /* END JTL */ in the following code (or just replace the function logit() with everything here):
Code for log.php
(Sorry, I had to use pastebin. Textile does not seem to have a working syntax-escape. Thanks Kurt for pointing me there.)
This does not help people with a dynamic IP who want to ignore their own visits. I’m not quite sure how to deal with that.
Lemme know if you have any problems.
~joe
#7 2006-04-08 06:00:25
- zem
- Developer Emeritus
- From: Melbourne, Australia
- Registered: 2004-04-08
- Posts: 2,579
Re: Blocking crawler logging
Joe, grep publish.php for $nolog. You can do this in a plugin.
Alex
Re: Blocking crawler logging
Hi Alex. I must be missing something. I don’t see the hook, but that does look like a good place for a callback. ~joe
#9 2006-04-08 06:55:15
- zem
- Developer Emeritus
- From: Melbourne, Australia
- Registered: 2004-04-08
- Posts: 2,579
Re: Blocking crawler logging
Plugin code loads early on, before that block. You could do your check at plugin load time, and set $nolog if it’s a hit that shouldn’t be logged.
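For anyone following along, here’s a minimal sketch of what this could look like, assuming your TxP version still checks $nolog before logging (the helper name and the IP list below are made up for this example):

```php
<?php
// Sketch of a plugin body that runs at load time, before publish.php's
// logging block, and sets $nolog for hits that shouldn't be logged.
// jtl_should_skip_logging() and $ignore_ips are illustrative names only.
function jtl_should_skip_logging($ip, $ignore_ips)
{
	foreach ($ignore_ips as $ignore) {
		if (substr($ignore, -1) == '.') { // trailing dot = prefix match
			if (!strncmp($ignore, $ip, strlen($ignore))) return true;
		}
		elseif (!strcmp($ignore, $ip)) return true;
	}
	return false;
}

$ignore_ips = array('192.168.1.');
$ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';

if (jtl_should_skip_logging($ip, $ignore_ips)) {
	$GLOBALS['nolog'] = 1; // publish.php then skips the logit() call
}
```

The IP check is cheap enough to run on every hit; the DNS-based check is the part that doesn’t fit here, as the next post explains.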
Last edited by zem (2006-04-08 06:55:35)
Alex
Re: Blocking crawler logging
Ah, I see. Hmmm. That works well for the IP check, but if I do the DNS check there, we’d do two DNS lookups per logged visit (or one lookup and one “cache” retrieval, or two “cache” retrievals). It works, but it’s not exactly desirable.
I’d need to hijack logit() altogether, but I’m not sure I can do that at plugin load time. If I can, I’d be duplicating TxP code (logit) and trying to keep it current with the latest TxP.
I’m not sure what the right answer is.
Re: Blocking crawler logging
Is anybody using this hack?
I’ve just realized that it has a drawback: since ignored domains don’t make it into the log file, and since the log file is the domain cache, every visit by an ignored domain suffers a DNS lookup.
I think the right thing to do is to have a proper domain cache (which would be a faster lookup, anyway, and fewer lookups overall). But perhaps an intermediate fix is to have ignored domains logged only once; if the domain already appears in a log entry, it is ignored, and otherwise it is logged. Reduces log clutter, anyway, and lets you know that your domain is being visited by the search bots.
Would be nice to come up with a solution we can put into TxP proper. Perhaps implementing a domain name cache would allow a plugin to implement this ignore feature and communicate the DNS lookup to the logger without resulting in a second lookup.
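To sketch the cache idea (purely illustrative; TxP 4.0.x has no such helper, and the function name is invented): a per-request memo around gethostbyaddr() would at least let a plugin and the logger share a single lookup.

```php
<?php
// Illustrative per-request rDNS memo: both a plugin and logit() could
// call this, so the reverse lookup happens at most once per IP per hit.
// (Hypothetical helper; not part of TxP 4.0.x.)
function jtl_cached_host($ip)
{
	static $cache = array();
	if (!isset($cache[$ip])) {
		$host = @gethostbyaddr($ip);
		// Keep the bare IP if the lookup fails outright.
		$cache[$ip] = ($host === false) ? $ip : $host;
	}
	return $cache[$ip];
}
```

A persistent cross-request cache (say, a small host-cache table with a TTL) would go further, but even this avoids the double-lookup problem within a single hit.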
~joe
Re: Blocking crawler logging
Grrr. My old pastebin code seems to have disappeared from the pastebin site. Never mind: I came up with a way to post arbitrary code with Textile, and I’ll write a separate post explaining how it’s done.
Here’s the new logit() function, which ignores all but the first visit from each ignored domain. That is, if an ignored domain does not yet have an entry in the log file, its first visit is logged; subsequent visits are not logged, until that previous logged visit expires from the log file. The reasoning is per my previous post. (This is a temporary hack until TxP gets a real domain name cache.)
Replace the logit() function in publish/log.php:
<pre><code>function logit($r='')
{
	global $ignore_ips, $ignore_domains; /* JTL */
	global $siteurl, $prefs, $pretext;

	$mydomain = str_replace('www.', '', preg_quote($siteurl, "/"));
	$out['uri'] = @$pretext['request_uri'];
	$out['ref'] = clean_url(str_replace("http://", "", serverSet('HTTP_REFERER')));
	$host = $ip = serverSet('REMOTE_ADDR');

	/* BEGIN JTL */
	foreach ($ignore_ips as $ignore) {
		$max_len = strlen($ignore);
		if ($ignore[$max_len - 1] == '.') { // don't want .1 matching .100
			if (!strncmp($ignore, $ip, $max_len)) return;
		}
		elseif (!strcmp($ignore, $ip)) return;
	}
	/* END JTL */

	if (!empty($prefs['use_dns'])) {
		// A crude rDNS cache
		if ($h = safe_field('host', 'txp_log', "ip='".doSlash($ip)."' limit 1")) {
			$host = $h;
			/* BEGIN JTL */
			$max_len = strlen($host);
			foreach ($ignore_domains as $ignore) {
				if (strlen($ignore) <= $max_len) {
					if ($ignore[0] == '.') { // don't want bop.com matching beebop.com
						if (!strcasecmp($ignore, substr($host, $max_len - strlen($ignore)))) return;
					}
					elseif (!strcasecmp($ignore, $host)) return;
				}
			}
			/* END JTL */
		}
		else {
			// Double-check the rDNS
			$host = @gethostbyaddr(serverSet('REMOTE_ADDR'));
			if ($host != $ip and @gethostbyname($host) != $ip) $host = $ip;
		}
	}

	$out['ip'] = $ip;
	$out['host'] = $host;
	$out['status'] = 200; // FIXME
	$out['method'] = serverSet('REQUEST_METHOD');

	if (preg_match("/^[^\.]*\.?$mydomain/i", $out['ref'])) $out['ref'] = "";

	if ($r == 'refer') {
		if (trim($out['ref']) != "") insert_logit($out);
	}
	else insert_logit($out);
}</code></pre>