Textpattern CMS support forum

#1 2020-07-01 10:07:18

zero
Member
From: Lancashire
Registered: 2004-04-19
Posts: 1,470
Website

Duplicate content

I just discovered that example.com, example.com/, example.com/category/, example.com/author/, example.com/2020/ all have the same content. I can 301 redirect those. There may be others like that?

But I also found that example.com?=a, example.com?=b, example.com?=c etc and even example.com?=sdlkfjl and anything after the ?= also give the same content as example.com. I tried it on textpattern.com and on neme.org and it’s just the same. There’s no 404 not found.

Is there some way of redirecting ?=* for everything except ?=q so the expected result of 404 not found appears?


Dozy P My music
Gud One My blog

Offline

#2 2020-07-01 10:55:32

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 12,024
Website GitHub

Re: Duplicate content

Odd it happens on textpattern.com. I wonder if the rewrite rules are not working properly? On a test server of my own (though I am admittedly running 4.8.2-dev), a link to example.org/category does indeed return a 404, while example.org/category/some-cat returns the categorised content as expected. And /YYYY/mm/dd links should definitely return those article lists!

On .com I don’t know if we’ve set up /author links to function or not. But example.org/ and example.org should resolve to the same endpoint automatically.

?=something isn’t a valid thing we look for so that’s expected. ?q=something resolves fine for searches in my tests.

Paging @petecooper who might be able to shed some light on this. Is it a core 4.8.1 bug or a server misconfiguration? Intriguingly, both 4.8.1 and 4.8.2-dev demo sites exhibit the same behaviour.

Last edited by Bloke (2020-07-01 10:59:01)


The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.

Txp Builders – finely-crafted code, design and Txp

Online

#3 2020-07-01 11:17:19

zero
Member
From: Lancashire
Registered: 2004-04-19
Posts: 1,470
Website

Re: Duplicate content

I’m trying to ensure there CANNOT be duplicate content on my site. The examples I gave show a massive amount of potential for duplicate content. For example, G might view example.org/category/ or example.org/2020/ as sections and easily find them.

I remember years ago getting warnings from the G web dev console thingy about ?=something but at the time I didn’t consider it important for some reason or another. Perhaps it is though.



Offline

#4 2020-07-01 15:10:20

gaekwad
Server grease monkey
From: People's Republic of Cornwall
Registered: 2005-11-19
Posts: 4,539
Bitbucket GitHub

Re: Duplicate content

zero wrote #324140:

But I also found that example.com?=a, example.com?=b, example.com?=c etc and even example.com?=sdlkfjl and anything after the ?= also give the same content as example.com. I tried it on textpattern.com and on neme.org and it’s just the same. There’s no 404 not found.

You won’t see a 404 because the page is rendering correctly. The ?=a, ?=b or whatever you tack on is a query string, but in the form you’re using (i.e. ‘question mark equals something’) there is no parameter name, so nothing gets set.

For a valid query string with a parameter, you need something like ‘question mark parameter equals something’. In the case of Textpattern sections, this works: example.com?s=lemon.

If the URL has a query string but the parameters are either a) not set or b) have nothing to do with Textpattern (e.g. UTM stuff for analytics), the page will render as usual. So unless I’m misunderstanding it, no harm, no foul.
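As a rough illustration of the point above, here is a minimal Python sketch (not Textpattern's own PHP, and not how core actually parses requests). The `KNOWN_PARAMS` set and the `recognised_params` helper are hypothetical names made up for this example; `s`, `q` and `c` are simply the parameters mentioned in this thread. It shows that `?=sdlkfjl` carries no usable parameter name, so nothing recognisable is left to act on.

```python
from urllib.parse import urlsplit, parse_qsl

# Assumption: a toy list of parameter names the site acts on.
# "s" (section), "q" (search) and "c" (category) are the ones named in this thread.
KNOWN_PARAMS = {"s", "q", "c"}


def recognised_params(url):
    """Return only the query parameters whose name is in KNOWN_PARAMS."""
    query = urlsplit(url).query
    # keep_blank_values=True so that pairs such as "s=" are not silently dropped.
    pairs = parse_qsl(query, keep_blank_values=True)
    # "?=a" parses to an empty name, which is never in KNOWN_PARAMS.
    return {name: value for name, value in pairs if name in KNOWN_PARAMS}


print(recognised_params("http://example.com?=sdlkfjl"))
print(recognised_params("http://example.com?s=lemon"))
print(recognised_params("http://example.com?utm_source=x&q=tea"))
```

In this sketch the first URL yields no recognised parameters at all, which is why the page simply renders as if the query string were not there.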

I’m trying to ensure there CANNOT be duplicate content on my site. The examples I gave show a massive amount of potential for duplicate content.

Canonical URLs are your friend, friend: moz.com/learn/seo/canonicalization

Edit: words.

Last edited by gaekwad (2020-07-01 15:11:55)

Offline

#5 2020-07-01 15:23:36

zero
Member
From: Lancashire
Registered: 2004-04-19
Posts: 1,470
Website

Re: Duplicate content

Yes I know about canonicals and that txp uses them well, but I want to have no duplicate content at all, hidden or not. I’m useless with .htaccess and all that deep coding stuff so that’s why I’ve asked if there’s a way to redirect the ?=* to a blank page or something, in such a way that the search is not interfered with.



Offline

#6 2020-07-01 20:55:10

etc
Developer
Registered: 2010-11-11
Posts: 5,524
Website GitHub

Re: Duplicate content

You can try the following, at your own risk :-)

<txp:evaluate query='"<txp:site_url trim="/" /><txp:page_url type="req" />" != "<txp:page_url context />"'>
    <txp:txp_die />
</txp:evaluate>

Offline

#7 2020-07-01 21:06:35

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 12,024
Website GitHub

Re: Duplicate content

The workarounds and canonicals and hacks are fine but it doesn’t get away from the question: why do the invalid URLs:

example.org/category/
example.org/author/

not 404? Why do they return the home page content? Is that down to the page template setup? The server? And why on my sites do they both 404 (as expected) yet on .com and the demo site (and neme.org, it seems) they just return… well, the front page?

Makes me nervous something’s not right in core.



Online

#8 2020-07-01 21:31:33

etc
Developer
Registered: 2010-11-11
Posts: 5,524
Website GitHub

Re: Duplicate content

Bloke wrote #324172:

And why on my sites do they both 404 (as expected) yet on .com and the demo site (and neme.org, it seems) they just return… well, the front page?

Makes me nervous something’s not right in core.

Unless your public language is not English, the behaviour on your sites does not comply with core, which will (rather logically) handle example.org/category/ in the same way as example.org/?c=.

Offline

#9 2020-07-02 05:24:04

zero
Member
From: Lancashire
Registered: 2004-04-19
Posts: 1,470
Website

Re: Duplicate content

etc wrote #324171:

You can try the following, at your own risk :-)

<txp:evaluate query='"<txp:site_url trim="/" /><txp:page_url type="req" />" != "<txp:page_url context />"'>...

Fantastic. 503 Service Unavailable. You just saved me hours, maybe days, of searching. It’s a shame that you guys use PayPal to receive donations (with its reCAPTCHA).

Thanks Oleg.

BTW, your site also has duplicate /category/, /author/ content.

Last edited by zero (2020-07-02 05:25:39)



Offline

#10 2020-07-02 05:26:59

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 9,316
Website GitHub Mastodon Twitter

Re: Duplicate content

etc wrote #324171:

You can try the following, at your own risk :-)

<txp:evaluate query='"<txp:site_url trim="/" /><txp:page_url type="req" />" != "<txp:page_url context />"'>...

Where should this be added? Also, is there a way to return a 404 rather than a 503?


Yiannis
——————————
NeMe | hblack.art | EMAP | A Sea change | Toolkit of Care
I do my best editing after I click on the submit button.

Offline

#11 2020-07-02 05:28:32

zero
Member
From: Lancashire
Registered: 2004-04-19
Posts: 1,470
Website

Re: Duplicate content

colak wrote #324180:

Where should this be added? Also, is there a way to return a 404 rather than a 503?

I put it in the head of the default page.



Offline

#12 2020-07-02 10:32:39

etc
Developer
Registered: 2010-11-11
Posts: 5,524
Website GitHub

Re: Duplicate content

zero wrote #324179:

It’s a shame that you guys use Paypal to receive donations, (with its recaptcha)

Shame on you to worry about SEO :-)

BTW, your site also has duplicate /category/, /author/ content.

My site uses some extra URL parameters in code examples, so forbidding non-canonical links would not work. And then I don’t care.

colak wrote #324180:

Where should this be added? Also, is there a way to return a 404 rather than a 503?

Try <txp:txp_die status="404" />?
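For readers who want the logic of the check spelled out, here is a toy model in Python. It is only a sketch of the idea discussed in this thread (compare the requested URL with a canonical one and refuse the request if they differ), not Textpattern's implementation. `KNOWN_PARAMS`, `canonical_url` and `status_for` are hypothetical names invented for the example, and the way the canonical URL is built here (dropping unrecognised query parameters) is an assumption. The default of 503 mirrors what was reported above, and `die_status=404` corresponds to the `status="404"` suggestion.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: parameter names treated as meaningful in this toy model.
KNOWN_PARAMS = {"s", "q", "c"}


def canonical_url(url):
    """Rebuild the URL, keeping only recognised query parameters (toy model)."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in KNOWN_PARAMS
    ]
    # The fragment is dropped; it is never sent to the server anyway.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def status_for(url, die_status=503):
    """Return 200 when the request is already canonical, else die_status."""
    return 200 if url == canonical_url(url) else die_status


print(status_for("https://example.com/?=a"))
print(status_for("https://example.com/?q=tea"))
print(status_for("https://example.com/?=a", die_status=404))
```

Note that a strict comparison like this also rejects URLs carrying unrelated parameters such as analytics tags, which is the trade-off mentioned above for sites that rely on extra URL parameters.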

Offline
