Duplicate content

zero · 2020-07-01 10:07:18

I just discovered that example.com, example.com/, example.com/category/, example.com/author/, example.com/2020/ all have the same content. I can 301 redirect those. There may be others like that?

But I also found that example.com?=a, example.com?=b, example.com?=c etc and even example.com?=sdlkfjl and anything after the ?= also give the same content as example.com. I tried it on textpattern.com and on neme.org and it’s just the same. There’s no 404 not found.

Is there some way of redirecting ?=* for everything except ?=q so the expected result of 404 not found appears?

Bloke · 2020-07-01 10:55:32

Odd it happens on textpattern.com. I wonder if the rewrite rules are not working properly? On a test server of my own (though I am admittedly running 4.8.2-dev), a link to example.org/category does indeed return a 404, while example.org/category/some-cat returns the categorised content as expected. And /YYYY/mm/dd links should definitely return those article lists!

On .com I don’t know if we’ve set up /author links to function or not. But example.org/ and example.org should resolve to the same endpoint automatically.

?=something isn’t a valid thing we look for so that’s expected. ?q=something resolves fine for searches in my tests.

Paging @petecooper who might be able to shed some light on this. Is it a core 4.8.1 bug or a server misconfiguration? Intriguingly, both 4.8.1 and 4.8.2-dev demo sites exhibit the same behaviour.

Last edited by Bloke (2020-07-01 10:59:01)

zero · 2020-07-01 11:17:19

I’m trying to ensure there CANNOT be duplicate content on my site. The examples I gave show a massive amount of potential for duplicate content. For example, G might view example.org/category/ or example.org/2020/ as sections and easily find them.

I remember years ago getting warnings from the G web dev console thingy about ?=something but at the time I didn’t consider it important for some reason or another. Perhaps it is though.

gaekwad · 2020-07-01 15:10:20

zero wrote #324140:

But I also found that example.com?=a, example.com?=b, example.com?=c etc and even example.com?=sdlkfjl and anything after the ?= also give the same content as example.com. I tried it on textpattern.com and on neme.org and it’s just the same. There’s no 404 not found.

You won’t see a 404 because the page is rendering correctly. The ?=a, ?=b or whatever you tack on is a query string – but in the context you’re using (i.e. ‘question mark equals something’), there is no parameter to set.

For a valid query string with a parameter, you need something like ‘question mark parameter equals something’. In the case of Textpattern sections, this works: example.com?s=lemon.

If the URL has a query string but the parameters are either a) not set or b) have nothing to do with Textpattern (e.g. UTM stuff on analytics), the page will render as usual. So unless I’m misunderstanding it, no harm, no foul.
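
To illustrate the difference between ?=a and ?s=lemon, here is a minimal Python sketch. It is not Textpattern code, and `named_params` is a hypothetical helper made up for this example; it only shows that a query string of the form ‘question mark equals something’ carries a value with an empty name, so there is nothing for an application to look up.

```python
from urllib.parse import urlsplit, parse_qs


def named_params(url):
    """Return only the query parameters that actually have a name.

    Hypothetical helper for illustration; not part of Textpattern.
    """
    query = urlsplit(url).query
    # parse_qs files a bare "=a" under the empty-string name, so drop that entry.
    return {name: values for name, values in parse_qs(query).items() if name}


if __name__ == "__main__":
    for url in (
        "https://example.com/?=a",
        "https://example.com/?s=lemon",
        "https://example.com/?q=something",
    ):
        print(url, named_params(url))
```

Under these assumptions, the first URL yields no named parameters at all, while the other two yield `s` and `q` respectively.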

I’m trying to ensure there CANNOT be duplicate content on my site. The examples I gave show a massive amount of potential for duplicate content.

Canonical URLs are your friend, friend: moz.com/learn/seo/canonicalization

Edit: words.

Last edited by gaekwad (2020-07-01 15:11:55)

zero · 2020-07-01 15:23:36

Yes, I know about canonicals and that txp uses them well, but I want to have no duplicate content at all, hidden or not. I’m useless with .htaccess and all that deep coding stuff, so that’s why I’ve asked if there’s a way to redirect the ?=* to a blank page or something, in such a way that the search is not interfered with.

etc · 2020-07-01 20:55:10

You can try the following, at your own risk :-)

<txp:evaluate query='"<txp:site_url trim="/" /><txp:page_url type="req" />" != "<txp:page_url context />"'>
    <txp:txp_die />
</txp:evaluate>
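
For readers who want to see the idea behind that check outside of Textpattern tags, here is a rough Python sketch. It is only an approximation of the logic (compare the requested URL with a cleaned-up form and refuse the request if they differ), not how Textpattern implements it. `KNOWN_PARAMS`, `canonical_form` and `status_for` are made-up names, and the allow-list holds just the parameters mentioned in this thread (q, s and c), which is an assumption rather than a complete list.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: only the parameters named in this thread count as recognised.
KNOWN_PARAMS = {"q", "s", "c"}


def canonical_form(url):
    """Rebuild the URL, keeping only recognised query parameters."""
    parts = urlsplit(url)
    kept = [(name, value) for name, value in parse_qsl(parts.query)
            if name in KNOWN_PARAMS]
    # An empty path is treated as "/" and any fragment is dropped.
    return urlunsplit((parts.scheme, parts.netloc, parts.path or "/",
                       urlencode(kept), ""))


def status_for(url, reject_status=503):
    """Return 200 when the URL is already in its cleaned-up form,
    otherwise the status to refuse the request with."""
    return 200 if url == canonical_form(url) else reject_status


if __name__ == "__main__":
    print(status_for("https://example.com/?q=something"))
    print(status_for("https://example.com/?=sdlkfjl"))
    print(status_for("https://example.com/?=sdlkfjl", reject_status=404))
```

The string comparison is deliberately naive: differently encoded but equivalent URLs would be refused too, so treat it as a sketch of the idea rather than something to deploy.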

Bloke · 2020-07-01 21:06:35

The workarounds and canonicals and hacks are fine but it doesn’t get away from the question: why do the invalid URLs:

example.org/category/
example.org/author/

not 404? Why do they return the home page content? Is that down to the page template setup? The server? And why on my sites do they both 404 (as expected) yet on .com and the demo site (and neme.org, it seems) they just return… well, the front page?

Makes me nervous something’s not right in core.

etc · 2020-07-01 21:31:33

Bloke wrote #324172:

And why on my sites do they both 404 (as expected) yet on .com and the demo site (and neme.org, it seems) they just return… well, the front page?

Makes me nervous something’s not right in core.

Unless your public language is not English, it does not comply with core which will (rather logically) handle example.org/category/ in the same way as example.org/?c=.

zero · 2020-07-02 05:24:04

etc wrote #324171:

You can try the following, at your own risk :-)

<txp:evaluate query='"<txp:site_url trim="/" /><txp:page_url type="req" />" != "<txp:page_url context />"'>...

Fantastic. 503 Service Unavailable. You just saved me hours, maybe days, of searching. It’s a shame that you guys use PayPal to receive donations (with its reCAPTCHA).

Thanks Oleg.

BTW, your site also has duplicate /category/, /author/ content.

Last edited by zero (2020-07-02 05:25:39)

colak · 2020-07-02 05:26:59

etc wrote #324171:

You can try the following, at your own risk :-)

<txp:evaluate query='"<txp:site_url trim="/" /><txp:page_url type="req" />" != "<txp:page_url context />"'>...

Where should this be added? Also, is there a way to return a 404 rather than a 503?

zero · 2020-07-02 05:28:32

colak wrote #324180:

Where should this be added? Also, is there a way to return a 404 rather than a 503?

I put it in the head of the default page.

etc · 2020-07-02 10:32:39

zero wrote #324179:

It’s a shame that you guys use PayPal to receive donations (with its reCAPTCHA).

Shame on you to worry about SEO :-)

BTW, your site also has duplicate /category/, /author/ content.

My site uses some extra URL parameters in code examples, so forbidding non-canonical links would not work. And then I don’t care.

colak wrote #324180:

Where should this be added? Also, is there a way to return a 404 rather than a 503?

Try <txp:txp_die status="404" />?

colak · 2020-07-02 13:16:56

etc wrote #324189:

Try <txp:txp_die status="404" />?

As you said, try “at your own risk.” Unfortunately, it does not work as ~~expected~~ desired on deeper URL schemas (section/categories/article) and returns a 404 when landing on /section/cat1/cat2/ pages.

Last edited by colak (2020-07-02 13:18:11)

gaekwad · 2020-07-02 13:21:27

I will admit to being a bit baffled by this – no antagonistic intentions, just trying to understand.

If you want search engines to have non-duplicate content, totally understandable – that’s what canonical is for. I’m a bit fuzzy on where the 503 / 404 errors are useful – does this mean anyone with a link to a page on your site that’s considered duplicate content gets an error instead of the page content, or is it intended as a search engine housekeeping exercise so any dupes will get rinsed out on the next run?

Last edited by gaekwad (2020-07-02 13:22:56)

colak · 2020-07-02 14:40:47

gaekwad wrote #324199:

I will admit to being a bit baffled by this – no antagonistic intentions, just trying to understand.

If you want search engines to have non-duplicate content, totally understandable – that’s what canonical is for. I’m a bit fuzzy on where the 503 / 404 errors are useful – does this mean anyone with a link to a page on your site that’s considered duplicate content gets an error instead of the page content, or is it intended as a search engine housekeeping exercise so any dupes will get rinsed out on the next run?

Hi Pete,

There is an issue with the way txp understands the desired canonical URL.

For example when using

<link rel="canonical" href="<txp:site_url trim="/" /><txp:page_url />" />

in the head of the document, it will parse the tags whatever the URL is, which renders it semantically wrong. So, what would the recommendation be in order to have the desired URL in there? Also, the breadcrumbs tag returns URLs that are, again, not the desired ones.

I agree with many people in this community regarding the issues plaguing the search engines but, at the same time, we all desire traffic from beyond those we have sent an email or given our card to. As such, we are engaged in a Hegelian master–slave dialectic that we cannot escape from, i.e. there is no master unless recognised by the slaves.

In my view, maybe because I vividly remember and believe in the 90s web, search engines are the most appropriate places to have our work discovered. The idea should not be to destroy Google but to apply pressure in order to get it fixed.

As such, and having personally accepted that my relationship with Google and other search engines is a relationship of connivance, I am trying, like many others, to play ball and serve our content in a way which would increase our visibility (not our SEO).

Textpattern CMS support forum
