
Textpattern CMS support forum


#1 2020-07-08 10:09:45

gaekwad
Server grease monkey
From: People's Republic of Cornwall
Registered: 2005-11-19
Posts: 4,137
GitHub

Combatting link rot

I will preface this post with a note that I don’t currently use plugins (I’ve been burned in the past), but, paradoxically, that I consider the following to be plugin territory (though I might be wrong) and that I’m open to the idea of using plugins in future. Evolve or die, as they say.

I read this morning that the blogspot.in domain has been dropped by Google, and snagged by a squatter. Blogspot was acquired by Google many years ago, and has (had?) a variety of ccTLD extensions for some markets. Where I live, that would be blogspot.co.uk (active). In India, they had blogspot.in, until that went away a few weeks ago, and many (millions?) of links have now become broken. Whether Google buys the domain back from the squatter to reactivate it, I don’t know, but it reminded me of the relative fragility of some of the web.

I’m starting to write again after a long hiatus. I will have outgoing links, and I’m debating whether I will write the markup inline or use the Textpattern links functionality to manage it for me. This got me thinking while I was eating breakfast. I’ve previously mooted an idea of being able to publish an article in Textpattern and have it mirrored at the Wayback Machine or similar service(s) at the time of posting, but I can’t find the forum post at the moment.

This could extend to links managed in Textpattern – either at the time of posting or ad-hoc at a later time, a saved link could be processed by the Wayback Machine or similar service(s), the appropriate mirror URL stored alongside the original URL in a similar fashion to custom fields, and then extracted to an article with an appropriate attribute (e.g. <txp:link id="123" source="wayback" />).
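
To make that concrete, the archiving step itself doesn’t need much: the Wayback Machine has a Save Page Now endpoint (web.archive.org/save/) and an availability API (archive.org/wayback/available) that returns the closest snapshot for a URL. Something along these lines – purely a sketch, and the function name and the idea of stashing the result next to the link are my guesswork, not anything that exists in core:

<?php
// Sketch only: ask the Wayback Machine to capture a URL, then fetch the
// closest snapshot URL so it could be stored alongside the original link.
function archive_link_url($url)
{
    // Trigger Save Page Now with a plain GET (errors ignored here;
    // a real plugin would want timeouts and proper error handling).
    @file_get_contents('https://web.archive.org/save/'.$url);

    // Ask the availability API for the closest snapshot of this URL.
    $json = @file_get_contents(
        'https://archive.org/wayback/available?url='.urlencode($url)
    );
    $data = json_decode((string) $json, true);

    if (!empty($data['archived_snapshots']['closest']['url'])) {
        return $data['archived_snapshots']['closest']['url'];
    }

    return null; // No snapshot yet – try again later.
}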

If the original URL goes pop, there’s still a way to have a link to what was linked to originally. Sure, there’s some admin involved with switching links out after a service goes down, but it’s considerably less than if there was no mirror. Any site that doesn’t want to be archived can opt out via robots.txt, and there’s rate limiting at Wayback HQ to prevent an out of control client. Textpattern could have a maximum submissions per hour/day as part of the code to further limit anything tripping out.
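
The cap could be as crude as counting what has already been sent before calling out again. The archive_log table here is invented for the sake of the example, but safe_count(), safe_insert() and doSlash() are DB helpers Textpattern already has:

<?php
// Sketch only: a per-hour cap on archive submissions.
define('ARCHIVE_MAX_PER_HOUR', 20); // arbitrary default

function archive_allowed()
{
    $recent = safe_count(
        'archive_log',
        "submitted_at > DATE_SUB(NOW(), INTERVAL 1 HOUR)"
    );

    return $recent < ARCHIVE_MAX_PER_HOUR;
}

function log_archive_submission($url)
{
    safe_insert(
        'archive_log',
        "url = '".doSlash($url)."', submitted_at = NOW()"
    );
}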

What do you think?

Last edited by gaekwad (2020-07-08 10:38:17)


#2 2020-07-08 13:39:04

michaelkpate
Moderator
From: Avon Park, FL
Registered: 2004-02-24
Posts: 1,379
Website GitHub Mastodon

Re: Combatting link rot

gaekwad wrote #324368:

What do you think?

Back in the spring of 1996, a classmate and I created a page of links for Young Adult Literature Resources. I spent several years trying to keep it updated before I finally gave up.

Then I put together LibraryPlanet.com – which started out as a collection of pages of links (powered by Microsoft FrontPage) before becoming a full-fledged blog.

“The core problem is that people expect stability but the web is not stable.” – John S. Rhodes

It’s an admirable goal and I fully support it, but I’ve also become somewhat fatalistic about it.


#3 2020-07-08 14:23:00

jakob
Admin
From: Germany
Registered: 2005-01-20
Posts: 4,595
Website

Re: Combatting link rot

That sounds a bit like the way smd_remote_file works for files, but for links, with an additional routine that calls a Wayback archiver (a separate item/script/API) and retrieves the archived URL (or, if the URL is predictable, builds it whether an actual copy exists yet or not, and hopes it will at some point).
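
The Wayback URLs are predictable in that sense: web.archive.org/web/<timestamp>/<original-url>, and a partial timestamp (even just a year) redirects to the nearest capture, so you could build the link optimistically. A minimal sketch, assuming you don’t yet care whether a copy actually exists:

<?php
// Sketch: build a 'nearest snapshot' Wayback URL without checking that a
// capture exists. A bare year (or any timestamp prefix) makes the Wayback
// Machine redirect to the closest snapshot it holds.
function wayback_guess_url($url, $year = '2020')
{
    return 'https://web.archive.org/web/'.$year.'/'.$url;
}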


TXP Builders – finely-crafted code, design and txp


#4 2020-07-08 17:08:39

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 9,011
Website GitHub Mastodon Twitter

Re: Combatting link rot

gaekwad wrote #324368:

This could extend to links managed in Textpattern – either at the time of posting or ad-hoc at a later time, a saved link could be processed by the Wayback Machine or similar service(s), the appropriate mirror URL stored alongside the original URL in a similar fashion to custom fields, and then extracted to an article with an appropriate attribute (e.g. <txp:link id="123" source="wayback" />).

If the original URL goes pop, there’s still a way to have a link to what was linked to originally. Sure, there’s some admin involved with switching links out after a service goes down, but it’s considerably less than if there was no mirror. Any site that doesn’t want to be archived can opt out via robots.txt, and there’s rate limiting at Wayback HQ to prevent an out of control client. Textpattern could have a maximum submissions per hour/day as part of the code to further limit anything tripping out.

What do you think?

Sounds good and it is worth considering, but how do you imagine it on the front end? At the moment, txp has no way of testing whether a URL returns a 404, or even whether the URL still contains the content we originally linked to.
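
A plugin could at least check the status code with a plain request (whether the content is still what we linked to is a much harder problem). Rough sketch, nothing txp ships with:

<?php
// Sketch of a dead-link check: fetch only the response headers and look
// at the status line. Says nothing about whether the content has changed.
function link_looks_dead($url)
{
    $headers = @get_headers($url);

    if ($headers === false) {
        return true; // DNS failure, timeout, etc.
    }

    // First header line is e.g. "HTTP/1.1 404 Not Found".
    return (bool) preg_match('/\s(404|410)\s/', $headers[0]);
}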

Once we get the unlimited custom fields, this will be very doable. Something like

<txp:evaluate>
    <txp:link_url />
<txp:else />
    <txp:custom_field name="wayback" type="link" />
</txp:evaluate>

Edit 2: This could work by having both URLs filled in at the same time, since time is an important factor when citing a page online. When the original URL changes to the point that it is no longer relevant, or if it just returns a 404, the publisher can delete the link_url, which will then be seamlessly replaced by the Wayback one.

In my code above, I assumed that custom fields for links will be identified by the already existing type attribute used for categories.

Last edited by colak (2020-07-09 05:22:15)


Yiannis
——————————
NeMe | hblack.art | EMAP | A Sea change | Toolkit of Care
I do my best editing after I click on the submit button.

