Apache: redirecting deep links

skewray · 2025-09-12 23:04:09

I have a list of publications, some of which link to locally stored PDF files. Crawlers deep-link to those files and fetch them directly, and I would like to gently redirect those requests. The publications URL sets a cookie (“bot”) to a number. So, my .htaccess file looks like

    RewriteCond %{HTTP_COOKIE}              !"bot=\d*"
    RewriteRule "\.pdf$"                    publications

This doesn’t work. It’s supposed to serve up the publications Txp section static page. Even when the cookie is set, the PDF is returned. It returns the PDF even if I take out the cookie RewriteCond. What am I doing dumb?

colak · 2025-09-13 03:36:36

Can you post a sample link the crawlers use?

etc · 2025-09-13 08:04:54

The standard txp .htaccess file contains these rules:

    RewriteCond %{REQUEST_FILENAME} -f [OR]
    RewriteCond %{REQUEST_FILENAME} -d
    RewriteRule ^(.+) - [PT,L]

which means that if a file or directory exists, it is served directly, and the [L] flag stops any further rules from being processed. Might it be interfering with your block?
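For reference, here is the same stock block with a comment on each line. The directives are exactly the ones quoted above; only the comments are added, and they describe standard mod_rewrite behaviour rather than anything specific to this site.

```apache
# -f is true when the path the request maps to is an existing regular file.
RewriteCond %{REQUEST_FILENAME} -f [OR]
# -d is true when that path is an existing directory.
RewriteCond %{REQUEST_FILENAME} -d
# "-" means no substitution: the URL is left unchanged.
# [PT] hands the result back to URL mapping; [L] ends this pass through the
# rule set, so rules placed further down are skipped for this request.
RewriteRule ^(.+) - [PT,L]
```

Rules placed above this block are evaluated before it. Note also that if /file_download/… is a virtual URL handled by Textpattern rather than a file on disk (an assumption, not confirmed in this thread), the -f test would not match it in the first place.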

skewray · 2025-09-13 15:03:36

Yiannis:

104.222.174.240 - - [09/Sep/2025:02:31:01 +0000] "GET /file_download/54/20_spie5523-6.pdf HTTP/1.1" 200 817527 "skeurae,," "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.3"
23.226.219.216 - - [09/Sep/2025:09:25:53 +0000] "GET /file_download/44/09_spie3355.pdf HTTP/1.1" 200 392117 "skeurae,," "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 OPR/117.0.0.0"
49.206.132.15 - - [10/Sep/2025:04:13:59 +0000] "GET /file_download/44/09_spie3355.pdf HTTP/2" 200 392117 "skeurae,," "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
49.233.159.76 - - [11/Sep/2025:17:31:28 +0000] "GET /file_download/46/12_spie4008.pdf HTTP/2" 200 471571 "skeurae,," "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1; rv:110.0) Gecko/20100101 Firefox/110.0"

If the cookie had been set, then referer would have been skeurae,,V1. I would normally have a lot more hits, but I currently reject all non-search-engine hits from Azure, Google, and Amazon clouds.

Oleg:

I have the unmodified Txp .htaccess code at the bottom of the .htaccess file, so those lines are there. Now that you point it out, though, I see that my .htaccess lines are modifying REQUEST_URI, while those lines are using REQUEST_FILENAME. I bet this is the issue. (Do I add [N] to the RewriteRule? I’ve never used [N], to avoid loops. Will [N] recompute REQUEST_FILENAME?)

Edit: The Apache documentation says that REQUEST_FILENAME is rewritten when REQUEST_URI is updated by mod_rewrite. If this is true, then what I did should work. Of course, I have LiteSpeed and not Apache, so it may be bugged.

Last edited by skewray (2025-09-13 17:33:05)

skewray · 2025-09-13 21:11:41

So, I haven’t solved the .htaccess mystery, but I did come up with something that works. Maybe better?

    RewriteCond %{HTTP_COOKIE}    !"bot=\d*"
    RewriteRule "\.pdf$"          publications        [NC,L,R=303]

This forces a browser load of the publications section static page as a “See Other” server response code.
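A slightly stricter variant of the same idea, as a sketch only. It assumes the cookie is named exactly "bot" and always holds at least one digit, and that the publications page is reachable at /publications from the site root; none of that is confirmed beyond what is posted above. The original pattern bot=\d* also matches an empty value and any cookie whose name merely ends in "bot".

```apache
RewriteEngine On

# True when no cookie named exactly "bot" with a numeric value is present.
# (^|;\s*) anchors the name at the start of the header or after a separator,
# so a cookie such as "robot=1" does not count. \d+ requires at least one digit.
RewriteCond %{HTTP_COOKIE} !(^|;\s*)bot=\d+
# NC: case-insensitive, so .PDF is covered as well.
# R=303: answer with an external "See Other" redirect.
# L: stop processing further rules for this request.
RewriteRule \.pdf$ /publications [NC,R=303,L]
```

Because the redirect is external, the client sees the new URL and has to request it itself, which is different from the internal rewrite in the first post.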

etc · 2025-09-14 09:27:55

Clever. You could probably avoid the 303 dance with

    RewriteRule "\.pdf$"          index.php?s=publications        [NC,L]

but it’s less clean.

skewray · 2025-09-14 15:24:35

Oh, yes, that should work. However, 303 isn’t actually a dance; these (LLM) crawler folk don’t come back after the 303, which is what I want anyway. It’s been running overnight, and the bots seem to either GET & GET or HEAD & GET, all on the original URL.

Googlebot and Bingbot both did a single GET, but I’ve seen them use an expired session cookie, so they will figure it out at some point. Or they won’t; search engine traffic nowadays is a joke, so I am not sure I care.

colak · 2025-09-15 03:24:32

I’m wondering if you need the cookie or if it can be done without it.

skewray · 2025-09-15 06:34:19

The Txp URL sets the cookie, so I use the cookie’s existence to confirm that the visitor may be human before allowing resources like PDFs and images. I have some vague recollection that Txp may have some way to prevent files and images from being deep linked?

Unfortunately, I can’t use this with the OpenGraph images. I doubt that an image fetch for a link posting on Mastodon or Bluesky is going to set a cookie.

colak · 2025-09-15 06:44:39

skewray wrote #340510:

The Txp URL sets the cookie, so I use the cookie’s existence to confirm that the visitor may be human before allowing resources like PDFs and images. I have some vague recollection that Txp may have some way to prevent files and images from being deep linked?

Unfortunately, I can’t use this with the OpenGraph images. I doubt that an image fetch for a link posting on Mastodon or Bluesky is going to set a cookie.

I see. Thanks!

Txp does not have a native way to prevent deep linking. /file_download/54/20_spie5523-6.pdf forces the pdf to download and to be counted but that’s as far as it goes.

skewray · 2025-09-15 15:37:09

So, in theory I could modify Txp to check for the cookie and do the redirect, rather than using .htaccess.

I wish there were some easy way to insert the cookie and then send the rest of a Txp URL. Rather like a 206. Then the crawlers that snipe (just load one URL and never come back) would be stopped.

(ps…I couldn’t stand it and made a hole for googlebot and bingbot. They are my largest rejects.)
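The post doesn't say how that hole was made. One way to do it for the PDF redirect, sketched here under the assumption that matching on the User-Agent header is acceptable, is an extra condition ahead of the cookie check. The target /publications is also an assumption.

```apache
# Skip the redirect when the User-Agent names Googlebot or bingbot.
# Consecutive RewriteCond lines are ANDed, so both must hold for the rule to fire.
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot) [NC]
RewriteCond %{HTTP_COOKIE} !bot=\d+
RewriteRule \.pdf$ /publications [NC,R=303,L]
```

The User-Agent header is trivially spoofed, so this lets through anything that claims to be one of those two crawlers.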

colak · 2025-09-15 16:09:37

I wouldn’t modify txp as it will be an ongoing chore for the next versions.

I’m wondering if a robots.txt directive could do the job.

    User-agent: *
    Disallow: /file_download/

skewray · 2025-09-15 16:30:56

robots.txt is entirely voluntary, and all the evil crawlers ignore it. Also, seemingly legitimate crawlers ignore robots.txt because they say that they aren’t ‘crawling’ but rather ‘fetching’. I have…let me count…51 crawlers blocked in my .htaccess file right now, each of which has, in the past, ignored robots.txt.

(My comment about modifying Txp was motivated by the existence of plugins; I know that they exist, but I have no idea what they can and cannot do.)
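For readers who haven't seen this kind of block, a minimal sketch of rejecting crawlers by User-Agent in .htaccess follows. The bot names are placeholders, not entries from the actual list mentioned above, and would be replaced with substrings seen in the access log.

```apache
RewriteEngine On

# Placeholder names; [NC] makes the match case-insensitive.
RewriteCond %{HTTP_USER_AGENT} (ExampleBot|AnotherExampleCrawler) [NC]
# "-" leaves the URL alone; [F] answers 403 Forbidden and stops rule processing.
RewriteRule ^ - [F]
```

Unlike robots.txt, this is enforced by the server, though it only catches crawlers that send an identifiable User-Agent.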

Textpattern CMS support forum
