
Textpattern CMS support forum


#11 2020-05-20 15:23:54

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 7,905
Website

Re: Duplicate Content due to section and article URL

So now I do not want to use excerpts and work with overview and detail pages. It is just one news site with 10 entries but each entry is an article and has the section “news”.

This will basically mean that no article will actually be able to be referenced anywhere as it might reside on the 1st page of your news section today, the second page tomorrow, and so on.

Maybe you can tell us slightly more about the project. Are you doing it for yourself or a client? i.e. will you be the webmaster, or will it be managed by somebody else over whom you will have no control? Also, please tell us more about how you envisage the structure.


Yiannis
——————————
neme.org | hblack.net | LABS | State Machines | NeMe @ github | Covid-19; a resource
I do my best editing after I click on the submit button.

Offline

#12 2020-05-20 15:33:33

demoncleaner
Plugin Author
From: Germany
Registered: 2008-06-29
Posts: 104
Website

Re: Duplicate Content due to section and article URL

Thank you so much, etc! This was exactly what I needed. This way I can redirect all sections in which I do not want single articles to be accessible, and other sections can keep that feature if needed. Perfect!

@Yiannis: I am aware of the nature of a news section, and that it will usually have pagination once it has more entries. I was just using it as an example. Maybe a bad one, sorry.

It is also not about one particular project. It is more about all the projects I have done in the past, because I was not really aware that this could be an issue.

Let me give you another example of how a typical site that I build would look.

Let´s say it has the menu structure:

Home | News | Service | Team | Contact

On News – as you said – you would want to use excerpts and single article pages.
On Team you might have 12 team members, each with a little text, but they do not have any permlink or any further information to link to. So what you want is that the URL /team/donald_duck is not seen, not indexed, and not creating duplicate content, right?

This is what it was all about. I was asking myself whether my idea of structuring it in Textpattern is not ideal, or my understanding of how search engines would react to it is wrong, or whether Textpattern maybe lacks an easy way of dealing with this issue.

But etc's suggestion gave me the perfect workaround now. Thanks again.

Offline

#13 2020-05-20 16:20:40

demoncleaner
Plugin Author
From: Germany
Registered: 2008-06-29
Posts: 104
Website

Re: Duplicate Content due to section and article URL

Weird.

<txp:if_individual_article>
    <txp:header name="Status" value="301 Moved Permanently" />
    <txp:header name="Location" value='<txp:section url />' />
</txp:if_individual_article>

seems to work in the latest 4.8.0 only.

I tested it on a couple of 4.7.x installations and nothing happens there.
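If txp:header turns out to be unreliable on 4.7.x, a possible fallback (a sketch only, assuming "Use PHP in pages" is enabled; it relies on Textpattern's internal $pretext array and the hu site-URL constant, neither of which was mentioned in this thread) would be to send the redirect directly from PHP:

<txp:if_individual_article>
<txp:php>
    // Hypothetical fallback: issue the 301 ourselves instead of via <txp:header />.
    // $pretext['s'] holds the current section name; hu is the site's base URL.
    global $pretext;
    header('Location: ' . hu . $pretext['s'] . '/', true, 301);
    exit;
</txp:php>
</txp:if_individual_article>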

Offline

#14 2020-05-20 16:26:20

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 7,905
Website

Re: Duplicate Content due to section and article URL

On Team you might have 12 team members, each with a little text, but they do not have any permlink or any further information to link to. So what you want is that the URL /team/donald_duck is not seen, not indexed, and not creating duplicate content, right?

Got it now! I normally just use one article (with "Leave text untouched") for those pages, but you are absolutely right that using a number of them is better, as you separate the HTML layout tags from the articles. Your method also allows each team member to be in charge of their own bio.


Yiannis
——————————
neme.org | hblack.net | LABS | State Machines | NeMe @ github | Covid-19; a resource
I do my best editing after I click on the submit button.

Offline

#15 2020-05-20 17:05:38

demoncleaner
Plugin Author
From: Germany
Registered: 2008-06-29
Posts: 104
Website

Re: Duplicate Content due to section and article URL

Exactly. And it is easier to change the order or to take one team member out: you just delete the article instead of fiddling around to find the text inside one long article. For the people maintaining the site this is much easier in most cases.

But I thought a bit longer about the txp:header approach to get around my problem (because it is not totally working for me).

The same could be done with a canonical, of course, if there is really no issue with having those single articles around. They would just have a canonical of

<link rel="canonical" href='<txp:section url />'>

and that's it.

Or maybe somehow give single articles a noindex by default.

I think I have to investigate a bit to be sure there is no problem with it, and what best practice would be.

Offline

#16 2020-05-20 18:36:50

demoncleaner
Plugin Author
From: Germany
Registered: 2008-06-29
Posts: 104
Website

Re: Duplicate Content due to section and article URL

OK so what about this solution?

If there are already unwanted /section/title pages indexed by Google, a "manual" 301 redirect is probably best.

If it is a brand-new website, then something with the help of the arc_meta plugin (in my case) would probably work pretty well.

<txp:if_individual_article>
    <txp:arc_meta_robots robots="noindex,follow" />
<txp:else />
    <txp:arc_meta_robots />
</txp:if_individual_article>

I would say that using canonicals might not be the best solution in my case, because here it says:

Google will choose one URL as the canonical version and crawl that, and all other URLs will be considered duplicate URLs and crawled less often.

Less often would still be too often =)
It should not be crawled at all. It should not exist for Google.

Does that all make sense to you?

Last edited by demoncleaner (2020-05-20 18:37:53)

Offline

#17 2020-05-20 18:45:04

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 9,328
Website

Re: Duplicate Content due to section and article URL

Ah it makes sense now you mention /teams. What threw me was when you said you were doing it on a /news section. Yes, preventing individual access on bio pages is perfectly valid, and etc’s approach should work well.

And I suspect your SEO guru is not completely wrong in this case. If your page layouts normally have links on, say, h2 tags to individual articles, Google may well expect that all sections behave likewise. So if it sees a section with a list of h2 tags it might presume that they link somewhere.

If it tries a few links and finds articles there (either through lucky guesses or through analysis of the way your url-titles are structured), then it might assume you simply forgot to put the links in and, being "helpful", might spider the individual articles for you. Not sure how invasive/clever it is, but there may be some truth in the advice that Big G can discover links that aren't explicitly made by site designers.

Yes, you could make the individual articles a canonical link to the landing page of the section. I’m honestly not sure what search engines will think to that practice. Might be fine. Might not. Would be interested to know your findings.

Last edited by Bloke (2020-05-20 19:39:15)


The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.

Txp Builders – finely-crafted code, design and Txp

Offline

#18 2020-05-21 18:56:53

hilaryaq
Plugin Author
Registered: 2006-08-20
Posts: 263
Website

Re: Duplicate Content due to section and article URL

I find the most helpful way to think about this is: if you don't link to an article anywhere, Google and other search engines do not know that article exists.

So what you want to do is not syndicate any section that is a single/static page. News or blog sections can be syndicated, and to help Google along even more, you can create your own sitemap so that the static pages are listed along with the news/blog articles. I'm pasting the code I use on my own site below for this:

- Enable "Use PHP in pages" (I found mine still works without enabling this, but it displays better turned on)
- Create a new page xml_sitemap
- Create a new section called sitemap (not syndicated etc), and set ‘uses page’ to xml_sitemap

Content of xml_sitemap below:

<txp:header value="application/xml; charset=utf-8" /><?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<txp:section_list break="" exclude="sitemap,404,articles">
<url>
    <loc><txp:section url="1" /></loc>
<txp:evaluate test="article_custom">
    <lastmod><txp:article_custom section='<txp:section />' limit="1" sort="LastMod desc"><txp:modified format="%Y-%m-%d" /></txp:article_custom></lastmod>
</txp:evaluate>
</url>
</txp:section_list>
<txp:article_custom section="services,blog" limit="9999">
<url>
    <loc><txp:permlink /></loc>
    <lastmod><txp:modified format="%Y-%m-%d" /></lastmod>
</url>
</txp:article_custom>
</urlset>

Change the sections above to suit your site. On my site I have multiple articles on the Services and Blog pages, so I do want those to output article links, but every other page will not.
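As a small follow-on (my own suggestion, not part of the instructions above): once the sitemap section exists, you can also point crawlers at it from robots.txt using the standard Sitemap directive. The domain and section name below are placeholders for your own:

Sitemap: https://example.com/sitemap/

Most major crawlers read this directive, so it saves submitting the sitemap URL manually in each search engine's console.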

So now, on your static pages, you just want to make sure your title isn't linkable, since technically it is an article list page, and Textpattern templates will tend to want to make that title clickable, which you don't want. So you either create a custom form for a static page and override the form at the article level, or create a new page template and set the static page's section to use it.

I override the default form with a custom static-page one I've created, which makes sure the title isn't clickable and that social sharing links etc. share the section link, not the article link.

So at this stage you can be confident nothing can index a page that has not been linked to.

Let's say it's too late and Google has already done this. There isn't really such a thing as a duplicate content penalty per se; it's a legend at this point, because Google's crawlers are pretty intelligent. All the crawler will do is decide which version of the page is the correct one, and it will present what it feels is the correct URL in search results without any penalty to you. You might prefer it to show the section link where it currently links the article page; in that case I would suggest following all the info above until Google accepts your section page as the correct canonical instead.

Hope that helps somewhat!!


…………………
I <3 txp
…………………

Offline

#19 2020-05-21 19:33:25

hilaryaq
Plugin Author
Registered: 2006-08-20
Posts: 263
Website

Re: Duplicate Content due to section and article URL

Also, if you were still concerned, you could set the section canonical as the individual article canonical with something like the code below. But be aware that Google can, at its own discretion, still override the canonical based on what it thinks is the correct URL, so it doesn't necessarily solve the problem. It could help, though, if you have implemented everything above and want to help Google begin to recognize the section pages instead:

<txp:if_individual_article>
<txp:if_section name="about,contact">
<link rel="canonical" href="<txp:section url='1' />">
<txp:else />
<link rel="canonical" href="<txp:permlink />">
</txp:if_section>
<txp:else />
other code 
</txp:if_individual_article>

…………………
I <3 txp
…………………

Offline

#20 2020-05-23 10:58:36

demoncleaner
Plugin Author
From: Germany
Registered: 2008-06-29
Posts: 104
Website

Re: Duplicate Content due to section and article URL

Sorry for my late reply.

Thanks for your example of a single-article-aware sitemap, hilaryaq.
Looks like a pretty good way. I am going to give it a try.

I am not using the stock Textpattern templates anyway, so the permlink problem is not an issue for me.

I am not sure if I can agree with

So at this stage you can be confident nothing can index a page that has not been linked to.

Because the sitemap is just a suggestion, and even if I do not link to any unwanted single articles (and don't have them in my sitemap), I think Google still has ways to find them.

For Example:

Google may also use data it gets from browsers to start crawling. If you are using a browser with a Google Toolbar (or PageRank checker), then Google gets a list of all the pages that you visit. However, Google denies that they use toolbar data for this purpose. Google does say that a common way for “secret” URLs to be discovered is for them to link out to other sites. Those other sites then see the “secret” page in the referrer and sometimes publish a list of referrer links (a common feature of blogs).

So I think a good way of dealing with it when building a website would be:

- Use a sitemap like the nice example you gave us (or control the unwanted orphan pages or "secret" URLs in some other way)
- Plus use meta robots noindex,follow

You could use it like that for example:

<txp:if_section name="news">
    <meta name="robots" content="index, follow">
<txp:else />
    <txp:if_individual_article>
        <meta name="robots" content="noindex, follow">
    <txp:else />
        <meta name="robots" content="index, follow">
    </txp:if_individual_article>
</txp:if_section>
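To sanity-check which directive a rendered page actually serves, a small script helps. This is a hypothetical helper (not from this thread, and the function names are my own) using only Python's standard library to extract the content of any <meta name="robots"> tags from fetched HTML:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the content of every <meta name="robots"> tag in a document."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "meta":
            d = dict(attrs)
            if (d.get("name") or "").lower() == "robots":
                self.directives.append(d.get("content") or "")

def robots_directives(html: str) -> list:
    """Return the robots directives declared in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

Feed it the HTML of a page such as /team/donald_duck (e.g. fetched with urllib) and check that "noindex" appears for single articles but not for section pages.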

I might be wrong, but I think the "crawl budget" might not be an issue in most cases.
Let's imagine we have a news section with 300 news articles (and single article URLs) over the last 5 years.
Those I want Google to index, of course.
Then we have some other articles that together make up the section "Team". Those could probably be crawled but not indexed, as we are using "noindex" here.
We also don't have them in our sitemap. So all good.

Maybe redirecting them to /section in my .htaccess would be the most watertight way?
Please correct me if I am wrong. I am just starting to read about this and trying to find the best and safest way.
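For what it's worth, the .htaccess idea could look something like this (a sketch only, assuming Apache with mod_rewrite, clean URLs, and a section named team; none of this was confirmed in the thread):

# Permanently redirect any /team/<article> URL back to the /team/ landing page.
RewriteEngine On
RewriteRule ^team/.+$ /team/ [R=301,L]

A server-level 301 like this is the most definitive of the options discussed, since the article URL never even renders, but it also means maintaining the rule per section in .htaccess rather than in your Textpattern templates.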

Last edited by demoncleaner (2020-05-23 10:59:06)

Offline
