How to deal with scraped content

Destry · 2020-08-20 06:27:53

colak wrote #325348:

A couple of years ago, somebody scraped the whole of the NeMe site and posted it in their domain.

Have you talked about that more somewhere else? I’d be curious to know what you had to go through to deal with them taking it down. Being the kind of person that doesn’t let IA archive his sites, nor use CC licenses for content, I’m always conscious of what I would have to do, short of hiring a lawyer, if someone did that kind of thing for a personal project. Site design I don’t really care about, but the words I string together out of my brain, I’m rather protective of those structures. A lot of time and effort goes into them.

I’m guessing the first step is to contact the web host and hope they take the offender’s site down. Any other muscle tactics you had to do? Frustrations learned?

You can hit me with an email via the forum if you want.

Last edited by Destry (2020-08-20 06:29:22)

jakob · 2020-08-20 07:42:41

[ split from the other thread linked at the top ]

It’s probably hard to prevent when it’s just text. It also depends how much effort people put into copying content.

Here’s a method someone used when all their articles, including links to the original images, were funnelled to another site (FYI, on Twitter: https://twitter.com/shanselman/status/1295958229848006657 ).

phiw13 · 2020-08-20 08:10:07

Destry wrote #325480:

doesn’t let IA archive his sites,

(slight side note) How do you prevent IA from archiving your site(s) ?

I find it so arrogantly annoying that every cockroach with a keyboard can submit a site / article for indexing (with deep apologies to cockroaches), but they make it very hard to block them and force you to jump through hoops to have an article removed.

Thank you.

colak · 2020-08-20 08:21:16

Destry wrote #325480:

Have you talked about that more somewhere else? I’d be curious to know what you had to go through to deal with them taking it down. Being the kind of person that doesn’t let IA archive his sites, nor use CC licenses for content, I’m always conscious of what I would have to do, short of hiring a lawyer, if someone did that kind of thing for a personal project. Site design I don’t really care about, but the words I string together out of my brain, I’m rather protective of those structures. A lot of time and effort goes into them.

I’m guessing the first step is to contact the web host and hope they take the offender’s site down. Any other muscle tactics you had to do? Frustrations learned?

You can hit me with an email via the forum if you want.

The only sections of the site that we are very protective of are the ‘texts’ and the ‘projects’, the reason being that they mostly do not ‘belong’ to us.

The process was simple and admittedly painless.

I found who hosted the site via whois. I first wrote directly to the site owner, who ignored my emails for 15 days. After that I wrote to their host, Amazon. Within 48 hours I received a call from a Spanish number from a very polite Amazon rep who took the matter into her hands, and a few days later the site went offline. After I spoke with Amazon, I added this line to our htaccess (or it might be the next cluster) which prevented them from hotlinking to our JS.

I revisited the URL about a month or so ago, and noticed that it is now populated by other content, which looks like it was scraped as well.

colak · 2020-08-20 08:26:45

phiw13 wrote #325485:

(slight side note) How do you prevent IA from archiving your site(s) ?

I find it so arrogantly annoying that every cockroach with a keyboard can submit a site / article for indexing (with deep apologies to cockroaches), but they make it very hard to block them and force you to jump through hoops to have an article removed.

Thank you.

IA plays by the rules. You can add in robots.txt:

User-agent: archive.org_bot
Disallow: /

You can also check what I included for NeMe, which blocks a number of other bad bots.
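A quick way to sanity-check a robots.txt rule like the one above before deploying it is to feed it to Python's standard-library parser. This is only a sketch: the `is_allowed` helper and the `example.com` URL are made up for illustration, and it only shows how a well-behaved parser reads the rule, not whether any particular crawler actually honours it.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt lines quoted in the post above.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a parser following robots.txt would let user_agent fetch url.

    Hypothetical helper for illustration; it parses the text locally
    and makes no network request.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain, not a real site from this thread.
    for agent in ("archive.org_bot", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/texts/"))
```

The rule only names one user agent, so any bot that identifies itself differently is unaffected by it.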

Destry · 2020-08-20 08:34:40

jakob wrote #325483:

It’s probably hard to prevent when it’s just text. It also depends how much effort people put into copying content.

I’m asking from the understanding that there are vastly more crooks and thieves on the web (beginning with big tech like Google that created ad-revenue systems that motivate online thievery) than law-abiding citizens. So the question is more about what to do after being ripped off than how to prevent it.

Making public displays of being ripped off, like your shared example below, is probably one exhausting tactic.

Here’s a method someone used when all their articles, including links to the original images, were funnelled to another site (FYI, on Twitter: https://twitter.com/shanselman/status/1295958229848006657 ).

That seems to be largely a tactic for blocking images, and a good one to employ, apparently. I never use images/photos of my own creation or ownership (public domain only), so I’m less concerned about that. Any I might create, I fully expect to be gifts to the web cesspool. Usually, however, I might modify images I use from public domain, within limits of fair use, but I don’t care if they get spread.

What I would care about though, is thieves and scrapers getting more traffic (due to their other kinds of cheating) and making money from my content, when I intentionally don’t try to profit off it at my own site.

Yeah, it’s a ‘whack-a-mole’ game you can’t win, but that doesn’t mean one should make it easy for them.

Destry · 2020-08-20 08:41:08

phiw13 wrote #325485:

How do you prevent IA from archiving your site(s) ?

You can write them directly, from an email on the domain you are writing about as evidence, and ask them not to archive your domain, or some sub-directory of it, or whatever.

I asked them two or three years ago to not archive anything on my domain, including all subdomains, and they obliged. Including the removal of anything they had archived. So it appears that owner requests in writing will trump cockroach requests on the fly.

Last edited by Destry (2020-08-20 08:42:06)

Destry · 2020-08-20 08:54:08

colak wrote #325486:

I found who hosted the site via whois. I first wrote directly to the site owner, who ignored my emails for 15 days. After that I wrote to their host, Amazon. Within 48 hours I received a call from a Spanish number from a very polite Amazon rep who took the matter into her hands, and a few days later the site went offline. After I spoke with Amazon, I added this line to our htaccess (or it might be the next cluster) which prevented them from hotlinking to our JS.

Thank you. That’s the routine I was expecting. I’m surprised by the phone call, but I guess that adds a level of legitimacy to it.

I also suspect it depends on how cooperative the web host is. And if the scraper is also hosting their own servers, you’re really screwed. Then it’s lawyer time, I guess.

I revisited the URL about a month or so ago, and noticed that it is now populated by other content, which looks like it was scraped as well.

Yeah, this is usually the problem with content theft; it’s done by sites that don’t care about law and will keep on doing it to make money in the background. I blame Google for this kind of thing.

phiw13 · 2020-08-20 09:30:56

colak wrote #325487:

ia plays by the rules. You can add in robots.txt:

User-agent: archive.org_bot…

Unfortunately they do not (anymore). Their own statement from 2017. This has been discussed ad nauseam everywhere. One of the most recent reasonable articles on that.

Destry wrote #325489:

You can write them directly, from an email on the domain you are writing about as evidence, and ask them not to archive your domain, or some sub-directory of it, or whatever.

Yeah, that is why I said, “jumping through hoops”… It kinda works, yes, but it is slow and tedious. There should be an automatic mechanism in place to prevent indexing if the site owner so prefers.

gaekwad · 2020-08-20 09:36:20

phiw13 wrote #325493:

There should be an automatic mechanism in place to prevent indexing if the site owner prefers so.

Do they advertise their crawler IP range anywhere? Firewall block, perhaps?

phiw13 · 2020-08-20 09:42:42

colak wrote #325486:

The process was simple and admittedly painless.

[…]

You were rather lucky that the scraped content was hosted on Amazon servers. For a friend, we’ve had to deal with pirates hosted in Ukraine. Until recently that had not been resolved. Recently, we think, their hosting service went bust.

Destry · 2020-08-20 13:40:05

phiw13 wrote #325493:

Yeah, that is why I said, “jumping through hoops”… It kinda works, yes, but it is slow and tedious. There should be an automatic mechanism in place to prevent indexing if the site owner so prefers.

Some might call a business email ‘jumping through hoops’. I think of it as a legal tender way of dealing with an otherwise dodgy technological situation (e.g. blocking a bot). And by my experience, writing an email works. Not kinda works. Works. The 20 minutes spent on the email (editor fuss) was well worth the peace of mind since. It took a little over a week, if I recall correctly, before I got the response and confirmation of change, but I wasn’t put off or surprised by that considering they are mostly a volunteer crew. Should they ever back-pedal on their promise, then I’ll have a different opinion about it all.

colak · 2020-08-20 13:55:12

What about blocking the bot via htaccess?

RewriteEngine On
# Match the bot's user agent, case-insensitively; the dot is escaped so it is literal
RewriteCond %{HTTP_USER_AGENT} (archive\.org_bot) [NC]
RewriteRule .* - [R=403,L]

If you want to block more bots, this is how I do it.
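For anyone who wants to test which user-agent strings a rule like that would catch before touching the server, the matching logic can be approximated in a few lines of Python. This is a rough sketch, not Apache's actual behaviour: `BLOCKED_AGENTS` and `status_for` are made-up names, only `archive.org_bot` comes from this thread, and the sample user-agent strings are invented for the example.

```python
import re

# Hypothetical block list; only archive.org_bot is taken from the thread.
BLOCKED_AGENTS = ["archive.org_bot"]

# re.escape keeps the dot literal, and IGNORECASE mirrors the [NC] flag.
BLOCKED_PATTERN = re.compile(
    "|".join(re.escape(agent) for agent in BLOCKED_AGENTS),
    re.IGNORECASE,
)


def status_for(user_agent: str) -> int:
    """Return 403 if the user agent contains a blocked name, else 200."""
    return 403 if BLOCKED_PATTERN.search(user_agent) else 200


if __name__ == "__main__":
    # Invented sample strings, for illustration only.
    samples = [
        "Mozilla/5.0 (compatible; Archive.org_bot)",
        "Mozilla/5.0 (X11; Linux x86_64) Firefox/79.0",
    ]
    for ua in samples:
        print(status_for(ua), ua)
```

Adding more names to the list extends the pattern. Like the htaccess rule, this only catches clients that announce themselves honestly in the user-agent header.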

Destry · 2020-08-20 13:57:48

phiw13 wrote #325497:

For a friend we’ve had to deal with pirates hosted in Ukraine.

Yeah, that’s my worst concern. And I can name a few other countries that probably fit the bill, where even official/appropriate points of contact are going to be a questionable goose chase with equally questionable results.

Destry · 2020-08-20 14:05:09

colak wrote #325502:

What about blocking the bot via htaccess?

I don’t think it’s about their bot, Colak, regardless of how you do it. As 13 alluded to, anyone can suggest a URL to be indexed, in which case it probably becomes somewhat manual at that point. Don’t know. Don’t care at this point.

I do block them via robots.txt and .htaccess, and I wrote them a direct request, which they confirmed (the legal tender). The latter is what they ask you to do, so my bases are covered, evidently.

Last edited by Destry (2020-08-20 14:06:01)

Textpattern CMS support forum
