Textpattern CMS support forum
Duplicate content
I just discovered that example.com, example.com/, example.com/category/, example.com/author/, example.com/2020/ all have the same content. I can 301 redirect those. There may be others like that?
But I also found that example.com?=a, example.com?=b, example.com?=c etc and even example.com?=sdlkfjl and anything after the ?= also give the same content as example.com. I tried it on textpattern.com and on neme.org and it’s just the same. There’s no 404 not found.
Is there some way of handling ?=* so that the expected 404 Not Found appears, without affecting searches (?q=)?
Re: Duplicate content
Odd it happens on textpattern.com. I wonder if the rewrite rules are not working properly? On a test server of my own (though I am admittedly running 4.8.2-dev), a link to example.org/category does indeed return a 404, while example.org/category/some-cat returns the categorised content as expected. And /YYYY/mm/dd links should definitely return those article lists!
On .com I don’t know if we’ve set up /author links to function or not. But example.org/ and example.org should resolve to the same endpoint automatically.
?=something isn’t a valid thing we look for so that’s expected. ?q=something resolves fine for searches in my tests.
Paging @petecooper who might be able to shed some light on this. Is it a core 4.8.1 bug or a server misconfiguration? Intriguingly, both 4.8.1 and 4.8.2-dev demo sites exhibit the same behaviour.
Last edited by Bloke (2020-07-01 10:59:01)
The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.
Txp Builders – finely-crafted code, design and Txp
Re: Duplicate content
I’m trying to ensure there CANNOT be duplicate content on my site. The examples I gave show a massive amount of potential for duplicate content. For example, G might view example.org/category/ or example.org/2020/ as sections and easily find them.
I remember years ago getting warnings from the G web dev console thingy about ?=something but at the time I didn’t consider it important for some reason or another. Perhaps it is though.
Re: Duplicate content
zero wrote #324140:
But I also found that example.com?=a, example.com?=b, example.com?=c etc and even example.com?=sdlkfjl and anything after the ?= also give the same content as example.com. I tried it on textpattern.com and on neme.org and it’s just the same. There’s no 404 not found.
You won’t see a 404 because the page is rendering correctly. The ?=a, ?=b or whatever you tack on is a query string – but in the context you’re using (i.e. ‘question mark equals something’), there is no parameter to set.
For a valid query string with a parameter, you need something like ‘question mark parameter equals something’. In the case of Textpattern sections, this works: example.com?s=lemon.
If the URL has a query string but the parameters are either a) not set or b) have nothing to do with Textpattern (e.g. UTM stuff for analytics), the page will render as usual. So unless I’m misunderstanding it, no harm, no foul.
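To illustrate the point about parameters, here is a minimal Python sketch using the standard library's urllib.parse. It is not Textpattern's own code (which is PHP) and only approximates the idea. The set of names (q, s, c) is taken from parameters mentioned in this thread, and the helper name recognised_params is made up for illustration. Python's parser files the ?=a pair under an empty name, which matches none of the names a site would look for, so the request looks the same as a plain front-page request.

```python
from urllib.parse import urlsplit, parse_qs

# Assumption: parameter names mentioned in this thread (search, section, category).
KNOWN_PARAMS = {"q", "s", "c"}

def recognised_params(url: str) -> dict:
    """Return only the query parameters whose names are in KNOWN_PARAMS."""
    query = urlsplit(url).query
    # keep_blank_values=True so that "?c=" is still reported as a (blank) category.
    parsed = parse_qs(query, keep_blank_values=True)
    return {name: values for name, values in parsed.items() if name in KNOWN_PARAMS}

if __name__ == "__main__":
    for url in (
        "https://example.com?=a",           # empty parameter name
        "https://example.com?utm_source=x", # unrelated parameter
        "https://example.com?s=lemon",      # section parameter
        "https://example.com/?q=something", # search parameter
    ):
        print(url, recognised_params(url))
```

With nothing recognised in the first two URLs, there is nothing to tell them apart from example.com itself, which is why no 404 is produced.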
I’m trying to ensure there CANNOT be duplicate content on my site. The examples I gave show a massive amount of potential for duplicate content.
Canonical URLs are your friend, friend: moz.com/learn/seo/canonicalization
Edit: words.
Last edited by gaekwad (2020-07-01 15:11:55)
Re: Duplicate content
Yes I know about canonicals and that txp uses them well, but I want to have no duplicate content at all, hidden or not. I’m useless with .htaccess and all that deep coding stuff so that’s why I’ve asked if there’s a way to redirect the ?=* to a blank page or something, in such a way that the search is not interfered with.
Re: Duplicate content
You can try the following, at your own risk :-)
<txp:evaluate query='"<txp:site_url trim="/" /><txp:page_url type="req" />" != "<txp:page_url context />"'>
<txp:txp_die />
</txp:evaluate>
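For anyone unsure what the snippet above is doing, here is a rough Python sketch of the logic as I read it: build the requested URL from the site URL (trailing slash trimmed) plus the request URI, compare it with the URL Textpattern would generate for the current context, and bail out if they differ. The function name check_request and its arguments are hypothetical, and the real tag output may differ in details. The 503 default mirrors what is reported further down the thread.

```python
def check_request(site_url: str, request_uri: str, canonical_url: str,
                  reject_status: int = 503) -> int:
    """Return 200 if the requested URL equals the canonical one, else reject_status.

    Hypothetical helper: site_url, request_uri and canonical_url stand in for
    the output of the three tags used in the snippet above.
    """
    # Equivalent of trimming the trailing slash from the site URL,
    # then appending the request URI (path plus query string).
    requested = site_url.rstrip("/") + request_uri
    return 200 if requested == canonical_url else reject_status

if __name__ == "__main__":
    site = "https://example.com/"
    print(check_request(site, "/", "https://example.com/"))     # canonical request
    print(check_request(site, "/?=a", "https://example.com/"))  # stray query string
```

Note that a strict comparison like this rejects any URL carrying extra parameters, including analytics ones, so it is very much an at-your-own-risk approach.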
Re: Duplicate content
The workarounds and canonicals and hacks are fine but it doesn’t get away from the question: why do the invalid URLs:
example.org/category/
example.org/author/
not 404? Why do they return the home page content? Is that down to the page template setup? The server? And why on my sites do they both 404 (as expected) yet on .com and the demo site (and neme.org, it seems) they just return… well, the front page?
Makes me nervous something’s not right in core.
Re: Duplicate content
Bloke wrote #324172:
And why on my sites do they both 404 (as expected) yet on .com and the demo site (and neme.org, it seems) they just return… well, the front page?
Makes me nervous something’s not right in core.
Unless your public language is not English, that behaviour does not match core, which will (rather logically) handle example.org/category/ in the same way as example.org/?c=.
Re: Duplicate content
etc wrote #324171:
You can try the following, at your own risk :-)
<txp:evaluate query='"<txp:site_url trim="/" /><txp:page_url type="req" />" != "<txp:page_url context />"'>...
Fantastic. 503 Service Unavailable. You just saved me hours, maybe days, of searching. It’s a shame that you guys use PayPal to receive donations (with its reCAPTCHA).
Thanks Oleg.
BTW, your site also has duplicate /category/, /author/ content.
Last edited by zero (2020-07-02 05:25:39)
Re: Duplicate content
etc wrote #324171:
You can try the following, at your own risk :-)
<txp:evaluate query='"<txp:site_url trim="/" /><txp:page_url type="req" />" != "<txp:page_url context />"'>...
Where should this be added? Also, is there a way to return a 404 rather than a 503?
Yiannis
——————————
NeMe | hblack.art | EMAP | A Sea change | Toolkit of Care
I do my best editing after I click on the submit button.
Re: Duplicate content
zero wrote #324179:
It’s a shame that you guys use PayPal to receive donations (with its reCAPTCHA)
Shame on you for worrying about SEO :-)
BTW, your site also has duplicate /category/, /author/ content.
My site uses some extra URL parameters in code examples, so forbidding non-canonical links would not work. And then I don’t care.
colak wrote #324180:
Where should this be added? Also, is there a way to return a 404 rather than a 503?
Try <txp:txp_die status="404" />?