Textpattern CMS support forum

#1 2018-05-26 16:23:22

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,912
Website

GDPR and Internet Archive

Some of you may recall, I discontinued CSF (csf.community) at the end of 2017 and took the website offline early this year. For those who don’t know, CSF was an open, volunteer-driven community centred around content strategy, which coordinated many international conferences, among other efforts that were never as popular.1

I’ve been getting some inquiries lately from CSF collaborators about the community articles we published, and about making them available again. Legitimate requests. My intention has always been to use the old website’s repo as a new kind of memorial site and convert the old Txp site to a Jekyll install. But damn, that’s a lot of work I have no desire to do, nor would it happen anytime soon. And I never could get my head around that Jekyll Liquid (or water, or WTF it’s called) templating. It’s not easy the way Txp tags are. Much easier would be to throw up a csf.wion.com domain and install a watered-down version of the site there for posterity, and past collaborators could point to it for résumé use or article reference, etc.

Would that be looked at as selfish? Maybe a GitHub site would be more considerate to the community? It was an open community project, after all. One person suggested I give the articles to a commercial outfit popular in that field, which is not going to happen because of the commercial aspect, but it shows you where the headspace in that field is.

Getting to the point, it occurred to me that maybe the site was in the Internet Archive (and it is). I don’t really like that fact, given the self-auditing I’m doing, and perhaps other people in the community who may not want to be found there don’t like it either.

This led me to investigate three things:

  1. What is the Archive’s GDPR stance, if any?
  2. How do you prevent the Archive from capturing/indexing your site?
  3. How can you remove content from the archive if it’s recorded there?

To the first, the only mention of GDPR at all is What is GDPR? in their voluminous FAQs, and it relates to user accounts (not public bystanders or ‘normal persons’). At the top of that response they point to their ToS, which contains their privacy and copyright statements too. The ToS material is dated several years back. They recognize themselves as a ‘Library’ (which seems confirmed in the About) and thus claim a ‘legitimate interest’ to collect and make public, for provenance, all personal info that gets recorded, whether from bot indexing or direct user uploads. Nowhere do they actually say they are GDPR-compliant. It all has the air of the Archive assuming it is exempt. Even if that’s true, they should address it better. (All those FAQs are ridiculous.)

To the second, you can use robots.txt to prevent indexing, but they don’t say what user agent to target, exactly (that I can find). Anyone know? I don’t want my other sites indexed.

To the third, you must send a request to have content removed from the Archive, but it’s worded in a way that suggests removal is not a sure thing. The GDPR accounts for certain situations where such requests can be refused (criminal data, research, etc.), and that might be a sticking point here if they are in fact considered a research entity. I will probably make a request to have the whole site removed, so that it can be re-indexed from the new location after I’ve stripped the content down a bit. We’ll see what happens.

Anyway, another interesting (and BIG) area in relation to the GDPR.

1 In retrospect, an open community on content strategy was a contradiction in terms, and the fact we rarely had volunteers is evidence of it. That field is loaded with corporate and marketing influence, thus egos and visibility motivations, which in turn conflict with all this data privacy stuff. I wasn’t seeing clearly there for a while, but I eventually woke up.

Offline

#2 2018-05-26 16:38:29

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,912
Website

Re: GDPR and Internet Archive

Okay, this is from 2009, kind of old, but it says the WayBack user agent is archive.org_bot. I presume that’s still valid?
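If it is, the robots.txt entry would presumably look like this (untested on my part, and assuming the Wayback crawler still honours robots.txt at all):

User-agent: archive.org_bot
Disallow: /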

Offline

#3 2018-05-26 16:45:28

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,912
Website

Re: GDPR and Internet Archive

Btw, I love the Internet Archive. It’s an important resource I’ve tapped many times. But it’s equally important to know how to navigate the data aspects for one’s own peace of mind.

Offline

#4 2018-05-26 16:53:46

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 9,091
Website GitHub Mastodon Twitter

Re: GDPR and Internet Archive

Destry wrote #312110:

2. How do you prevent the Archive from capturing/indexing your site?

This is from 2017, when the IA decided to ignore robots.txt. I love the Internet Archive too, by the way, but I’m not sure what steps they are taking re GDPR.


Yiannis
——————————
NeMe | hblack.art | EMAP | A Sea change | Toolkit of Care
I do my best editing after I click on the submit button.

Offline

#5 2018-05-26 17:22:54

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,912
Website

Re: GDPR and Internet Archive

colak wrote #312113:

This is from 2017 when the IA decided to ignore robots.txt.

Well, I’m not sure what to think of that. The idea that any website I put online now will be recorded whether I want it to be or not is a little unfair.

I see people making the comparison to book publishing and justifying it on the same level. But it’s not really the same. An author of a book usually puts great effort into their writing, research, and editing; a publisher is potentially involved; and so on. They set out to be remembered and as popular as possible, for commercial or educational reasons. Most websites have no such motivation or effort behind them. Thoughts are half-baked and pointless much of the time. People shouldn’t be held to that if they don’t want to be.

Offline

#6 2018-05-26 18:23:06

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 9,091
Website GitHub Mastodon Twitter

Re: GDPR and Internet Archive

Destry wrote #312114:

Well, I’m not sure what to think of that. The idea that any website I put online now will be recorded whether I want it to be or not is a little unfair.

There is an htaccess way.

# Return 403 Forbidden to the Internet Archive crawler (archive.org_bot)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (archive\.org_bot) [NC]
RewriteRule .* - [R=403,L]

I see people making the comparison to book publishing and justifying it on the same level. But it’s not really the same. An author of a book usually puts great effort into their writing, research, and editing; a publisher is potentially involved; and so on. They set out to be remembered and as popular as possible, for commercial or educational reasons. Most websites have no such motivation or effort behind them. Thoughts are half-baked and pointless much of the time. People shouldn’t be held to that if they don’t want to be.

I guess I agree with you, and I’m sure that with GDPR the IA will provide a method to delete some of their content.

The comparison for me is some recent US president who changed official procedures. That was fine for as long as he was in control, because much of the world community trusted his judgement. Now those powers are (<txp:warning>personal opinion coming up</txp:warning>) in the wrong hands. At the moment we actually trust how the IA is run, but who knows how it will be in 5-10 years.

PS: I guess we are actually talking about the Wayback Machine and not the Internet Archive as a whole.

Last edited by colak (2018-05-26 18:26:07)


Yiannis
——————————
NeMe | hblack.art | EMAP | A Sea change | Toolkit of Care
I do my best editing after I click on the submit button.

Offline

#7 2018-05-27 02:09:30

michaelkpate
Moderator
From: Avon Park, FL
Registered: 2004-02-24
Posts: 1,379
Website GitHub Mastodon

Re: GDPR and Internet Archive

I only found one discussion online (and the quality of the answer is questionable): How does GDPR impact things like Internet Archive?

Reading through the guides related to archiving and research, I think the actual answer is that regulators in each EU member state are free to make up their own rules. I guess we’ll find out what they decide eventually.

I am surprised this hasn’t come up before over the idiotic “Right to be Forgotten.”

Offline

#8 2018-05-27 09:37:51

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,912
Website

Re: GDPR and Internet Archive

colak wrote #312115:

ps. I guess that we are actually talking about the WayBack Machine and not the Internet archive as a whole.

Yes. The Wayback is what I’m focused on here. Thanks.

Offline

#9 2018-05-27 10:39:08

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,912
Website

Re: GDPR and Internet Archive

colak wrote #312115:

There is an htaccess way

It relies on a working user agent, and that info seems dodgy now. This was given in 2009:

User-agent: archive.org_bot

And this was given in 2015, but then removed from the IA site:

User-agent: ia_archiver

My gut is telling me the IA has done away with a published user agent for this very reason. Untested, of course. But to even make a test you need the right user agent.

It’s starting to look like they archive you first without consent, then you have to make a request to be de-archived, and from what I’m beginning to see, people have bad/hard experiences trying to do that.

It’s starting to stink, in fact.

Offline

#10 2018-05-27 11:36:04

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,912
Website

Re: GDPR and Internet Archive

This answer is given to one of their FAQs:

How can I get my site included in the Wayback Machine?

Much of our archived web data comes from our own crawls or from Alexa Internet’s crawls. Neither organization has a “crawl my site now!” submission process. Internet Archive’s crawls tend to find sites that are well linked from other sites. The best way to ensure that we find your web site is to make sure it is included in online directories and that similar/related sites link to you.

Alexa Internet uses its own methods to discover sites to crawl. It may be helpful to install the free Alexa toolbar and visit the site you want crawled to make sure they know about it.

Regardless of who is crawling the site, you should ensure that your site’s ‘robots.txt’ rules and in-page META robots directives do not tell crawlers to avoid your site.

The emphasis in that response falls on the crawl sources and the robots.txt advice. The robots.txt advice at the end suggests the response is probably not well-maintained, especially as they have since said they will ignore robots.txt files. The crawl references suggest there might still be an IA user agent (the question remains what, exactly), and that Alexa crawling should be blocked as well, so their user agent is needed too.

And the Alexa bot is ia_archiver, the same agent the IA said they used in 2015. Signs suggest, then, that this is the user agent to try. I will put that in a robots.txt file and use the .htaccess approach and see what happens.
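If ia_archiver is indeed the right token (still an assumption on my part), the robots.txt entry would be:

User-agent: ia_archiver
Disallow: /

and the RewriteCond in the .htaccess above would need to match ia_archiver instead of (or as well as) archive.org_bot.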

Offline

#11 2018-05-27 15:11:27

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 9,091
Website GitHub Mastodon Twitter

Re: GDPR and Internet Archive

Destry wrote #312126:

It relies on a working user agent, and that info seems dodgy now.
My gut is telling me IA has done away with a user agent for this very reason. Untested, of course. But to even make a test you need the right user agent.

I’m wondering if the test is as easy as asking the archive to list a page, any test page created especially for this purpose, and then checking the server logs. Would the user agent be listed there?
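It should be, as long as the server logs the User-Agent header (the default Apache ‘combined’ log format does). Something like this would show any hits from the two candidate agents, assuming a typical log location (adjust the path for your server):

grep -Ei "archive\.org_bot|ia_archiver" /var/log/apache2/access.log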


Yiannis
——————————
NeMe | hblack.art | EMAP | A Sea change | Toolkit of Care
I do my best editing after I click on the submit button.

Offline

#12 2018-05-28 06:20:42

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,912
Website

Re: GDPR and Internet Archive

I’ve asked around on Masto. There are a lot of dev/infosec/privacy types there. The feedback says ia_archiver, so I’m sure it’s right.

I’ve also heard that some people have used robots.txt for years and never been archived, while others have had the file ignored. Nothing is clear, apparently. The prudent thing to do, it seems, is:

  1. Use a robots.txt file.
  2. Use .htaccess too (see the sketch after this list).
  3. Wait and see what happens.
  4. Contact IA and complain if they archive the site.
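Covering both candidate agents at once would presumably look like this (untested, and assuming either string is still honoured by the crawler):

robots.txt:

User-agent: archive.org_bot
Disallow: /

User-agent: ia_archiver
Disallow: /

.htaccess:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (archive\.org_bot|ia_archiver) [NC]
RewriteRule .* - [R=403,L]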

Offline
