Textpattern CMS support forum
Apache: redirecting deep links
I have a list of publications, some of which link to locally stored PDF files. Crawlers deep link to those files and fetch them directly, which I would like to gently redirect. The publications URL sets a cookie (“bot”) to a number. So, my .htaccess file looks like:
RewriteCond %{HTTP_COOKIE} !"bot=\d*"
RewriteRule "\.pdf$" publications
This doesn’t work. It’s supposed to serve up the publications Txp section static page. Even when the cookie is set, the PDF is returned. It returns the PDF even if I take out the cookie RewriteCond. What am I doing dumb?
Re: Apache: redirecting deep links
Can you post a sample link the crawlers use?
Yiannis
——————————
NeMe | hblack.art | EMAP | A Sea change | Toolkit of Care
I do my best editing after I click on the submit button.
Re: Apache: redirecting deep links
The standard Txp .htaccess file contains these rules:
RewriteCond %{REQUEST_FILENAME} -f [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^(.+) - [PT,L]
which mean that if a file or directory exists, it will be served directly, and the [L] flag stops processing of any further rules. Might they be interfering with your block?
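To make the ordering concrete, here is a rough sketch of how a custom block would sit relative to those stock rules. The `RewriteEngine On` line and the comments are assumptions about the surrounding file, and the custom rule is simply the one from the first post; this is an untested illustration, not a known-good configuration.

```apache
RewriteEngine On

# Custom block first, so it is evaluated before the
# "existing file or directory" short-circuit below.
RewriteCond %{HTTP_COOKIE} !"bot=\d*"
RewriteRule "\.pdf$" publications

# Stock Txp rules quoted above: if the request maps to a real
# file or directory, stop rewriting and serve it as-is.
RewriteCond %{REQUEST_FILENAME} -f [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^(.+) - [PT,L]
```

If the custom block came after the stock rules, a request for a PDF that physically exists on disk would hit [L] first and never reach it.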
Re: Apache: redirecting deep links
Yiannis:
104.222.174.240 - - [09/Sep/2025:02:31:01 +0000] "GET /file_download/54/20_spie5523-6.pdf HTTP/1.1" 200 817527 "skeurae,," "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.3"
23.226.219.216 - - [09/Sep/2025:09:25:53 +0000] "GET /file_download/44/09_spie3355.pdf HTTP/1.1" 200 392117 "skeurae,," "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 OPR/117.0.0.0"
49.206.132.15 - - [10/Sep/2025:04:13:59 +0000] "GET /file_download/44/09_spie3355.pdf HTTP/2" 200 392117 "skeurae,," "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
49.233.159.76 - - [11/Sep/2025:17:31:28 +0000] "GET /file_download/46/12_spie4008.pdf HTTP/2" 200 471571 "skeurae,," "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1; rv:110.0) Gecko/20100101 Firefox/110.0"
If the cookie had been set, then the referer would have been skeurae,,V1. I would normally have a lot more hits, but I currently reject all non-search-engine hits from the Azure, Google, and Amazon clouds.
Oleg:
I have the unmodified Txp .htaccess code at the bottom of the .htaccess file, so those lines are there. Now that you point it out, though, I see that my .htaccess lines are modifying REQUEST_URI, while those lines are using REQUEST_FILENAME. I bet this is the issue. (Do I add [N] to the RewriteRule? I’ve never used [N], to avoid loops. Will [N] recompute REQUEST_FILENAME?)
Edit: The Apache documentation says that REQUEST_FILENAME is rewritten when REQUEST_URI is updated by mod_rewrite. If this is true, then what I did should work. Of course, I have LiteSpeed and not Apache, so it may be bugged.
Last edited by skewray (2025-09-13 17:33:05)
Re: Apache: redirecting deep links
So, I haven’t solved the .htaccess mystery, but I did come up with something that works. Maybe better?
RewriteCond %{HTTP_COOKIE} !"bot=\d*"
RewriteRule "\.pdf$" publications [NC,L,R=303]
This makes the server answer with a 303 “See Other” response, which forces the browser to load the publications section static page.
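For anyone adapting this, a slightly stricter sketch of the same idea follows. The anchored cookie name, the `\d+` (at least one digit), the `^file_download/` prefix, and the absolute `/publications` target are all assumptions of mine, not something tested on this site. The sketch also assumes the .htaccess file lives in the site root. The motivation is that an unanchored `bot=\d*` also matches an empty value, or any cookie whose name merely ends in "bot".

```apache
# Redirect PDF requests that arrive without a numeric "bot" cookie.
# "bot" must be a whole cookie name followed by one or more digits.
RewriteCond %{HTTP_COOKIE} !(^|;\s*)bot=\d+
# Only touch download URLs ending in .pdf (case-insensitive via NC);
# R=303 sends the client to the publications page with "See Other".
RewriteRule ^file_download/.+\.pdf$ /publications [NC,L,R=303]
```

The condition is written without quotes here only because the pattern contains no spaces. Whether LiteSpeed treats it identically to Apache is something to check rather than assume.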
Re: Apache: redirecting deep links
Clever. You could probably avoid the 303 dance with
RewriteRule "\.pdf$" index.php?s=publications [NC,L]
but it’s less clean.
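Spelled out with the cookie condition from the earlier posts, that variant might look like the sketch below. A guess, not verified: it may work where a plain rewrite to `publications` does not because PHP still sees the originally requested URI after an internal rewrite, whereas an explicit `s=` query parameter names the section directly.

```apache
# Same cookie test as in the earlier posts. This is an internal
# rewrite, not a redirect: the client keeps the PDF URL but is
# served the publications section instead.
RewriteCond %{HTTP_COOKIE} !"bot=\d*"
RewriteRule "\.pdf$" index.php?s=publications [NC,L]
```

The trade-off is that a crawler gets a 200 response with HTML at a .pdf URL, rather than an explicit signal that the resource lives elsewhere.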
Re: Apache: redirecting deep links
Oh, yes, that should work. However, 303 isn’t actually a dance; these (LLM) crawler folk don’t come back after the 303, which is what I want anyway. It’s been running overnight, and the bots seem to either GET & GET or HEAD & GET, all on the original URL.
Googlebot and Bingbot both did a single GET, but I’ve seen them use an expired session cookie, so they will figure it out at some point. Or they won’t; search engine traffic nowadays is a joke, so I am not sure I care.
Re: Apache: redirecting deep links
I’m wondering if you need the cookie or if it can be done without it.
Yiannis
Re: Apache: redirecting deep links
The Txp URL sets the cookie, so I use the cookie’s existence to confirm that the visitor may be human before allowing resources like PDFs and images. I have some vague recollection that Txp may have some way to prevent files and images from being deep linked?
Unfortunately, I can’t use this with the OpenGraph images. I doubt that an image fetch for a link posting on Mastodon or Bluesky is going to set a cookie.
Re: Apache: redirecting deep links
I see. Thanks!
Txp does not have a native way to prevent deep linking. /file_download/54/20_spie5523-6.pdf forces the PDF to download and be counted, but that’s as far as it goes.
Yiannis