Duplicate Content due to section and article URL

demoncleaner · 2020-05-20 16:20:40

Weird.

<txp:if_individual_article> <txp:header name value="301" /> <txp:header name="Location" value='<txp:section url />' /> </txp:if_individual_article>

seems to work in the latest 4.8.0 only.

I tested it on a couple of 4.7.x Installations and nothing happens there.

colak · 2020-05-20 16:26:20

On Team you might have 12 team members. Each with a little text but they do not have any permlink or any further information to link to. So what you want is that the URL /team/donald_duck is not been seen or indexed or creating duplicate content, right?

Got it now! I normally just use one article (with leave text untouched) for those pages but you are absolutely right that using a number of them is better as you separate the html layout tags from the articles. Your method also allows each team member to be in charge of their own bio.

demoncleaner · 2020-05-20 17:05:38

Exactly. And it is easier to change order or to take one team member out. You just need to delete the article and not fiddle around and find the text inside of a long article. For the people maintaining the site this would be much easier in most cases.

But I thought a bit longer about the txp:header aproach to come over the problem I have (because it is not totaly working for me)

The same could be done with a canoncial of course.
If there is really not any issue with having those single articles around.
They would just have a canoncial of
<link rel=“canonical“ href='<txp:section url/>'> and thats it.

Or maybe somehow give single articles a noindex by default.

I think I have to investigate a bit to be sure if there is no problem about it and what would be best practice.

demoncleaner · 2020-05-20 18:36:50

OK so what about this solution?

If there are already unwanted /section/title pages that are indexed by google probably a “Manual” 301 redirect is the best.

If it is a brandnew website than probably something with the help of arc_meta plugin in my case would work pretty well.

<txp:if_individual_article> <txp:arc_meta_robots robots="noindex,follow"/> <txp:else/> <txp:arc_meta_robots /> </txp:if_individual_article>

I would say that using canonicals for a solution in my case might not be the best. Because here it says:
Google will choose one URL as the canonical version and crawl that, and all other URLs will be considered duplicate URLs and crawled less often.

Less often would be still too often =)
It should not be crawled at all. It should not exist for google.

Does that make all sense to you?

Last edited by demoncleaner (2020-05-20 18:37:53)

Bloke · 2020-05-20 18:45:04

Ah it makes sense now you mention /teams. What threw me was when you said you were doing it on a /news section. Yes, preventing individual access on bio pages is perfectly valid, and etc’s approach should work well.

And I suspect your SEO guru is not completely wrong in this case. If your page layouts normally have links on, say, h2 tags to individual articles, Google may well expect that all sections behave likewise. So if it sees a section with a list of h2 tags it might presume that they link somewhere.

If it tries a few links and finds articles there (either through lucky guesses or through analysis of the way your url-titles are structured) then it might assume you simply forgot to put the links in and, being “helpful” might spider the individual articles for you. Not sure how invasive/clever it is, but there may be some truth in the advice that Big G can discover links that aren’t explicitly made by site designers.

Yes, you could make the individual articles a canonical link to the landing page of the section. I’m honestly not sure what search engines will think to that practice. Might be fine. Might not. Would be interested to know your findings.

Last edited by Bloke (2020-05-20 19:39:15)

hilaryaq · 2020-05-21 18:56:53

I find the most helpful way to think about this, as if you don’t link anywhere to an article, Google and other search engines do not know that article exists.

So what you want to do, is do not syndicate any section that is a single/static page. News or blog pages can be syndicated, and to help Google along even more, you can create your own sitemap so that the static pages are listed along with the news/blog articles. I’m pasting the code I use on my own site below for this:

- Enable use php in pages (I found mine still works without enabling this but it does display better turned on)
- Create a new page xml_sitemap
- Create a new section called sitemap (not syndicated etc), and set ‘uses page’ to xml_sitemap

Content of xml_sitemap below:

<txp:header value="application/xml; charset=utf-8" /><?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
<txp:section_list break="" exclude="sitemap,404,articles">
<url>
    <loc><txp:section url="1" /></loc>
<txp:evaluate test="article_custom">
    <lastmod><txp:article_custom section='<txp:section />' limit="1" sort="LastMod desc"><txp:modified format="%Y-%m-%d" /></txp:article_custom></lastmod>
</txp:evaluate>
</url>
</txp:section_list>
<txp:article_custom section="services,blog" limit="9999">
<url>
    <loc><txp:permlink /></loc>
    <lastmod><txp:modified format="%Y-%m-%d" /></lastmod>
</url>
</txp:article_custom>
</urlset>

Change the sections as suits your site above, so for my site I have multiple articles on the services and blog pages so I do want those to output article links but every other page will not.

So now on your static pages, you just want to make sure, your title isn’t linkable as it is an article list page technically, so textpattern templates will tend to want to make that title clickable which you don’t want. So you create a custom page form for a static page and override form at the article level, or, create a new page template and then set the static page section to use that page template.

I override the default form with a custom static page one I’ve created, that makes sure the title isn’t clickable and social sharing links etc are sharing the section link not the article link.

So at this stage you can be confident nothing can index a page that has not been linked to.

…

Let’s say it’s too late and Google has already done this, there isn’t such a thing as a duplicate content penalty per se, it’s a legend at this point because Google crawlers are pretty intelligent, all the crawler will do is decide which version of the page is the correct one, and it will present what it feels is the correct url in search without any penalty to you. You might prefer it showed the section link where it links the article page and in that case I would suggest following all the info above until Google accepts your section page as the correct canonical instead.

Hope that helps somewhat!!

hilaryaq · 2020-05-21 19:33:25

Also, if you were still concerned you could put the section canonical as the individual article canonical with something like the code below, but be aware Google can in it’s own discretion still override canonical based on what it thinks is the correct url so it doesn’t necessarily solve the problem but could help if you implemented everything above and want to help Google begin to recognize the section pages instead:

<txp:if_individual_article>
<txp:if_section name="about,contact">
<link rel="canonical" href="<txp:section url='1' />">
<txp:else />
<link rel="canonical" href="<txp:permlink />">
</txp:if_section>
<txp:else />
other code 
</txp:if_individual_article>

demoncleaner · 2020-05-23 10:58:36

Sorry for my late reply.

Thanks for your example of a single_article-aware sitemap hilaryaq.
Looks like a pretty good way. I am going to give it a try.

I am not using textpattern templates anyway so all the permlink problem is not an issue for me.

I am not sure if I can agree with

So at this stage you can be confident nothing can index a page that has not been linked to.

Because the sitemap will just be a suggestion and even if I do not link to any unwanted single_articles (and don´t have them in my sitemap) I think google has still ways to find them.

For Example:

Google may also use data it gets from browsers to start crawling. If you are using a browser with a Google Toolbar (or PageRank checker), then Google gets a list of all the pages that you visit. However, Google denies that they use toolbar data for this purpose. Google does say that a common way for “secret” URLs to be discovered is for them to link out to other sites. Those other sites then see the “secret” page in the referrer and sometimes publish a list of referrer links (a common feature of blogs).

So I think a good way of dealing with it when building a website would be:

- Use your sitemap that you gave us this nice example of (or control in any other way the unwanted orphan pages or secret URLs)
- Plus Use meta robots noindex,follow

You could use it like that for example:

<txp:if_section name="news">
	<meta name="robots" content="index, follow">
<txp:else/>
	<txp:if_individual_article>
			    <meta name="robots" content="noindex, follow">
			<txp:else/>
				 <meta name="robots" content="index, follow">
	</txp:if_individual_article>
</txp:if_section>

I might be wrong but I think the “crawling budget” might not be an issue in most cases.
Let´s imagine we have a news section and 300 news articles (and single_article URLs) over the last 5 years.
Those ones I want google to index of course.
Then we have some other articles that build together the section “Team”. Those articles could be propably crawled but not indexed as we are using “noindex” here.
We also don´t have them in our sitemap. So all good.

Maybe having them as a redirect in my htacces to /section would be most waterproof way?
Please correct me if I am wrong. I am just startig to read about it and try to find out the best and savest way.

Last edited by demoncleaner (2020-05-23 10:59:06)

hilaryaq · 2020-05-23 12:45:09

That would be extra protection alright!

I think if you have the noindex there wouldn’t be any need for a re-direct on top of that. In fact, even without the noindex, Google shouldn’t find the page (bar the caveat you mentioned above), but even if it did, your canonical is pointing to the section page, also with a noindex anyway so that seems pretty bulletproof to me!

Bloke · 2020-05-23 13:05:01

@demoncleaner. Hadn’t thought of Google using referrer links from other sites to sniff out your content and index it that way. That’s sneaky. I’ve learned something today, thank you (and your SEO guru!)

demoncleaner · 2020-05-23 14:37:58

@hilaryaq: No I ment redirects maybe as the most secure way but could be a bit anoying unless someone finds a regex that properly works (which I did not so far). So you would have redirected everything that goes to any /section/title page back to /section and you would want to be able to create exceptions easily.

I think this solution you would use in case you have a running page which you haven´t taken the precautions (noindex to all single_articles) already when setting it up and google has found some orphaned pages already.

I did not ment to use redirects in combination with noindex.

I hoped the noindex-solution will be enough for a site that is freshly setup.
What I think is not 100% perfect is the fact, that for google the site still exists and maybe it is crawled some time, it is just not indexed. So having an eye on your crawling budget this is also not a perfect solution. But I am not 100% sure… it is just the way I understood it so far.

demoncleaner · 2020-05-23 14:44:56

Writing the last thing I had this idea… wouldn´t it be cool to have a function like:

In case the meta url field of an article is filled the article exists as an URL in case it is empty, the URL (/section/title) just does not exist and the articles can be only used without any further worries of sitemap, noindex or redirects in an article_list context =)

Sounds simple. But I am sure it isn´t.
I have no idea if there is a realistic chance to create a plugin that does that.

I had to test what currently happens if you try to have an empty Meta URL field:
It does not work. If you empty and save, it will save it with your last entry/not change anything.

Last edited by demoncleaner (2020-05-23 14:45:43)

Textpattern CMS

Textpattern CMS support forum

#13 2020-05-20 16:20:40

Re: Duplicate Content due to section and article URL

#14 2020-05-20 16:26:20

Re: Duplicate Content due to section and article URL

#15 2020-05-20 17:05:38

Re: Duplicate Content due to section and article URL

#16 2020-05-20 18:36:50

Re: Duplicate Content due to section and article URL

#17 2020-05-20 18:45:04

Re: Duplicate Content due to section and article URL

#18 2020-05-21 18:56:53

Re: Duplicate Content due to section and article URL

#19 2020-05-21 19:33:25

Re: Duplicate Content due to section and article URL

#20 2020-05-23 10:58:36

Re: Duplicate Content due to section and article URL

#21 2020-05-23 12:45:09

Re: Duplicate Content due to section and article URL

#22 2020-05-23 13:05:01

Re: Duplicate Content due to section and article URL

#23 2020-05-23 14:37:58

Re: Duplicate Content due to section and article URL

#24 2020-05-23 14:44:56

Re: Duplicate Content due to section and article URL

Board footer