
Textpattern CMS support forum


#1 2025-09-12 23:04:09

skewray
Member
From: Sunny Southern California
Registered: 2013-04-25
Posts: 240

Apache: redirecting deep links

I have a list of publications, some of which link to locally stored PDF files. Crawlers deep-link to those files and get them served directly, which I would like to gently redirect. The publications URL sets a cookie (“bot”) to a number, so my .htaccess file looks like

RewriteCond %{HTTP_COOKIE}              !"bot=\d*"
RewriteRule "\.pdf$"                    publications

This doesn’t work. It’s supposed to serve up the publications Txp section’s static page, but the PDF is returned whether or not the cookie is set, and even if I take out the cookie RewriteCond entirely. What dumb thing am I doing?


#2 2025-09-13 03:36:36

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 9,241

Re: Apache: redirecting deep links

Can you post a sample link the crawlers use?


Yiannis
——————————
NeMe | hblack.art | EMAP | A Sea change | Toolkit of Care
I do my best editing after I click on the submit button.


#3 2025-09-13 08:04:54

etc
Developer
Registered: 2010-11-11
Posts: 5,457

Re: Apache: redirecting deep links

The standard txp .htaccess file contains these rules:

    RewriteCond %{REQUEST_FILENAME} -f [OR]
    RewriteCond %{REQUEST_FILENAME} -d
    RewriteRule ^(.+) - [PT,L]

which mean that if a file or directory exists, it is served directly and, via the [L] flag, no further rules are processed. Might that be interfering with your block?
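
If that block is indeed swallowing the request, a minimal sketch (just an illustration, assuming the cookie-gated rule sits above the stock Txp rules and carries its own [L] flag; note the \d+, which, unlike \d*, requires at least one digit) might be:

    # no numeric "bot" cookie -> rewrite the PDF request to the publications page
    RewriteCond %{HTTP_COOKIE}  !bot=\d+
    RewriteRule \.pdf$          publications    [NC,L]

    # the stock Txp rules (the -f/-d passthrough and the catch-all that hands
    # everything else to index.php) then follow, unchanged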


#4 2025-09-13 15:03:36

skewray
Member
From: Sunny Southern California
Registered: 2013-04-25
Posts: 240

Re: Apache: redirecting deep links

Yiannis:

104.222.174.240 - - [09/Sep/2025:02:31:01 +0000] "GET /file_download/54/20_spie5523-6.pdf HTTP/1.1" 200 817527 "skeurae,," "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.3"
23.226.219.216 - - [09/Sep/2025:09:25:53 +0000] "GET /file_download/44/09_spie3355.pdf HTTP/1.1" 200 392117 "skeurae,," "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 OPR/117.0.0.0"
49.206.132.15 - - [10/Sep/2025:04:13:59 +0000] "GET /file_download/44/09_spie3355.pdf HTTP/2" 200 392117 "skeurae,," "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
49.233.159.76 - - [11/Sep/2025:17:31:28 +0000] "GET /file_download/46/12_spie4008.pdf HTTP/2" 200 471571 "skeurae,," "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1; rv:110.0) Gecko/20100101 Firefox/110.0"

If the cookie had been set, then referer would have been skeurae,,V1. I would normally have a lot more hits, but I currently reject all non-search-engine hits from Azure, Google, and Amazon clouds.

Oleg:

I have the unmodified Txp .htaccess code at the bottom of the .htaccess file, so those lines are there. Now that you point it out, though, I see that my .htaccess lines are modifying REQUEST_URI, while those lines are using REQUEST_FILENAME. I bet this is the issue. (Do I add [N] to the RewriteRule? I’ve never used [N], for fear of creating loops. Will [N] recompute REQUEST_FILENAME?)

Edit: The Apache documentation says that REQUEST_FILENAME is rewritten when REQUEST_URI is updated by mod_rewrite. If this is true, then what I did should work. Of course, I’m running LiteSpeed, not Apache, so it may be buggy there.

Last edited by skewray (2025-09-13 17:33:05)


#5 2025-09-13 21:11:41

skewray
Member
From: Sunny Southern California
Registered: 2013-04-25
Posts: 240

Re: Apache: redirecting deep links

So, I haven’t solved the .htaccess mystery, but I did come up with something that works. Maybe better?

RewriteCond %{HTTP_COOKIE}    !"bot=\d*"
RewriteRule "\.pdf$"          publications        [NC,L,R=303]

This forces the browser to load the publications section’s static page via a 303 “See Other” response.
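
For anyone adapting this, a slightly tightened variant (a sketch only; the file_download prefix and the stricter cookie pattern are untested assumptions) would be:

    # anchor the cookie name and require at least one digit,
    # and only touch Txp file-download URLs
    RewriteCond %{HTTP_COOKIE}    !(^|;\s*)bot=\d+
    RewriteRule ^file_download/.+\.pdf$    publications    [NC,L,R=303]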


#6 Yesterday 09:27:55

etc
Developer
Registered: 2010-11-11
Posts: 5,457

Re: Apache: redirecting deep links

Clever. You could probably avoid the 303 dance with

RewriteRule "\.pdf$"          index.php?s=publications        [NC,L]

but it’s less clean.
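
Spelled out with the cookie condition (a sketch; query-string handling is ignored), that would be roughly:

    RewriteCond %{HTTP_COOKIE}  !bot=\d*
    RewriteRule \.pdf$          index.php?s=publications    [NC,L]

The visitor keeps the original .pdf URL in the address bar while Txp renders the publications section, which may be part of why it feels less clean.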


#7 Yesterday 15:24:35

skewray
Member
From: Sunny Southern California
Registered: 2013-04-25
Posts: 240

Re: Apache: redirecting deep links

Oh, yes, that should work. However, the 303 isn’t really a dance; these (LLM) crawler folk don’t come back after the 303, which is what I want anyway. It’s been running overnight, and the bots seem to either GET & GET or HEAD & GET, all on the original URL.

Googlebot and Bingbot both did a single GET, but I’ve seen them use an expired session cookie, so they will figure it out at some point. Or they won’t; search engine traffic nowadays is a joke, so I am not sure I care.


#8 Today 03:24:32

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 9,241

Re: Apache: redirecting deep links

I’m wondering if you need the cookie or if it can be done without it.


Yiannis
——————————
NeMe | hblack.art | EMAP | A Sea change | Toolkit of Care
I do my best editing after I click on the submit button.


#9 Today 06:34:19

skewray
Member
From: Sunny Southern California
Registered: 2013-04-25
Posts: 240

Re: Apache: redirecting deep links

The Txp URL sets the cookie, so I use the cookie’s existence to confirm that the visitor may be human before allowing resources like PDFs and images. I have some vague recollection that Txp may have some way to prevent files and images from being deep linked?

Unfortunately, I can’t use this with the OpenGraph images. I doubt that an image fetch for a link posting on Mastodon or Bluesky is going to set a cookie.
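
If the images ever get gated the same way, one conceivable carve-out (purely a sketch; /images/og/ is a hypothetical path for the card images, not necessarily how the site is actually laid out) would be an extra exclusion condition:

    # hypothetical: skip the cookie check for OpenGraph card images
    RewriteCond %{REQUEST_URI}  !^/images/og/
    RewriteCond %{HTTP_COOKIE}  !bot=\d*
    RewriteRule \.(pdf|png|jpe?g)$    publications    [NC,L,R=303]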


#10 Today 06:44:39

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 9,241

Re: Apache: redirecting deep links

skewray wrote #340510:

The Txp URL sets the cookie, so I use the cookie’s existence to confirm that the visitor may be human before allowing resources like PDFs and images. I have some vague recollection that Txp may have some way to prevent files and images from being deep linked?

Unfortunately, I can’t use this with the OpenGraph images. I doubt that an image fetch for a link posting on Mastodon or Bluesky is going to set a cookie.

I see. Thanks!

Txp does not have a native way to prevent deep linking. /file_download/54/20_spie5523-6.pdf forces the PDF to download and to be counted, but that’s as far as it goes.
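
For completeness, the usual cookie-free approach is a Referer check in .htaccess (a sketch only; example.com stands in for the real domain, and since the Referer header is easily spoofed or simply omitted, it is a much weaker gate than the cookie):

    # redirect PDF requests whose Referer is set but points off-site
    RewriteCond %{HTTP_REFERER}  !^$
    RewriteCond %{HTTP_REFERER}  !^https?://(www\.)?example\.com/  [NC]
    RewriteRule \.pdf$           publications    [NC,L,R=303]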


Yiannis
——————————
NeMe | hblack.art | EMAP | A Sea change | Toolkit of Care
I do my best editing after I click on the submit button.

