
Textpattern CMS support forum


#1 2020-08-20 06:27:53

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,909
Website

How to deal with scraped content

colak wrote #325348:

A couple of years ago, somebody scraped the whole of the NeMe site and posted it in their domain.

Have you talked about that more somewhere else? I’d be curious to know what you had to go through to get them to take it down. Being the kind of person who doesn’t let IA archive his sites, nor uses CC licenses for content, I’m always conscious of what I would have to do, short of hiring a lawyer, if someone did that kind of thing for a personal project. Site design I don’t really care about, but the words I string together out of my brain, I’m rather protective of those structures. A lot of time and effort goes into them.

I’m guessing the first step is to contact the web host and hope they take the offender’s site down. Any other muscle tactics you had to do? Frustrations learned?

You can hit me with an email via the forum if you want.

Last edited by Destry (2020-08-20 06:29:22)

Offline

#2 2020-08-20 07:42:41

jakob
Admin
From: Germany
Registered: 2005-01-20
Posts: 4,595
Website

Re: How to deal with scraped content

[ split from the other thread linked at the top ]

It’s probably hard to prevent when it’s just text. It also depends how much effort people put into copying content.

Here’s a method someone used when all their articles were funnelled to another site, including links to the original images: (FYI on Twitter: https://twitter.com/shanselman/status/1295958229848006657 ).


TXP Builders – finely-crafted code, design and txp

Offline

#3 2020-08-20 08:10:07

phiw13
Plugin Author
From: Japan
Registered: 2004-02-27
Posts: 3,079
Website

Re: How to deal with scraped content

Destry wrote #325480:

doesn’t let IA archive his sites,

(slight side note) How do you prevent IA from archiving your site(s) ?

I find it so arrogantly annoying that every cockroach with a keyboard can submit a site / article for indexing (with deep apologies to cockroaches), but they make it very hard to block them and force you to jump through hoops to have an article removed.

Thank you.


Where is that emoji for a solar powered submarine when you need it ?
Sand space – admin theme for Textpattern

Offline

#4 2020-08-20 08:21:16

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 9,011
Website GitHub Mastodon Twitter

Re: How to deal with scraped content

Destry wrote #325480:

Have you talked about that more somewhere else? I’d be curious to know what you had to go through to get them to take it down. Being the kind of person who doesn’t let IA archive his sites, nor uses CC licenses for content, I’m always conscious of what I would have to do, short of hiring a lawyer, if someone did that kind of thing for a personal project. Site design I don’t really care about, but the words I string together out of my brain, I’m rather protective of those structures. A lot of time and effort goes into them.

I’m guessing the first step is to contact the web host and hope they take the offender’s site down. Any other muscle tactics you had to do? Frustrations learned?

You can hit me with an email via the forum if you want.

The only sections of the site that we are very protective of are the ‘texts’ and the ‘projects’, the reason being that they mostly do not ‘belong’ to us.

The process was simple and admittedly painless.

I found who hosted the site via whois. I first wrote directly to the site owner, who ignored my emails for 15 days. After that I wrote to their host, Amazon. Within 48 hours I received a call from a Spanish number from a very polite Amazon rep who took the matter into her hands, and a few days later the site went offline. After I spoke with Amazon, I added this line to our htaccess (or it might be the next cluster), which prevented them from hotlinking to our js.
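For illustration only, the kind of Referer check that a hotlink-blocking htaccess rule performs can be sketched in Python. This is not the rule used on the NeMe site (that line is not reproduced in this thread); the function name `is_hotlink` and the host names are hypothetical.

```python
from urllib.parse import urlparse


def is_hotlink(referer, allowed_hosts):
    """Return True if the Referer header points to a host we did not allow.

    A missing or empty Referer is treated as allowed, because many
    browsers and privacy tools strip the header on legitimate requests.
    """
    if not referer:
        return False
    host = urlparse(referer).hostname  # lowercased by urlparse, or None
    if host is None:
        # A Referer that cannot be parsed into a host is treated as foreign.
        return True
    return host not in allowed_hosts


# Hypothetical hosts: our own site and a copy hosted elsewhere.
OWN_HOSTS = {"www.example.org", "example.org"}
print(is_hotlink("https://scraper.example/copied-page", OWN_HOSTS))
print(is_hotlink("https://www.example.org/texts", OWN_HOSTS))
```

A server would typically answer a request flagged this way with a 403 or a placeholder file. Referer headers are easy to forge, so this only stops casual hotlinking.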

I revisited the url about a month or so ago, and noticed that it is now populated by other content, which looks like it was scraped as well.


Yiannis
——————————
NeMe | hblack.art | EMAP | A Sea change | Toolkit of Care
I do my best editing after I click on the submit button.

Offline

#5 2020-08-20 08:26:45

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 9,011
Website GitHub Mastodon Twitter

Re: How to deal with scraped content

phiw13 wrote #325485:

(slight side note) How do you prevent IA from archiving your site(s) ?

I find it so arrogantly annoying that every cockroach with a keyboard can submit a site / article for indexing (with deep apologies to cockroaches), but they make it very hard to block them and force you to jump through hoops to have an article removed.

Thank you.

IA plays by the rules. You can add in robots.txt:

User-agent: archive.org_bot
Disallow: /

You can also check what I included for neme which blocks a number of other bad bots.
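As a rough sanity check, Python's standard `urllib.robotparser` can show how a crawler that honours robots.txt would read those two lines. The URL and the second bot name are made up for the example, and this says nothing about whether a given crawler actually obeys the file.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt lines quoted above.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(user_agent, url):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


# example.org and SomeOtherBot are placeholders, not real targets.
print(may_crawl("archive.org_bot", "https://example.org/texts/essay"))
print(may_crawl("SomeOtherBot", "https://example.org/texts/essay"))
```

With only this rule in the file, the named bot is refused everywhere and every other user agent is still allowed.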


Yiannis
——————————
NeMe | hblack.art | EMAP | A Sea change | Toolkit of Care
I do my best editing after I click on the submit button.

Offline

#6 2020-08-20 08:34:40

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,909
Website

Re: How to deal with scraped content

jakob wrote #325483:

It’s probably hard to prevent when it’s just text. It also depends how much effort people put into copying content.

I’m asking from the understanding that there are vastly more crooks and thieves on the web (beginning with big tech like Google that created ad-revenue systems that motivate online thievery) than law-abiding citizens. So the question is more about what to do after being ripped off than how to prevent it.

Making public displays of being ripped off, like your shared example below, is probably one exhausting tactic.

Here’s a method someone used when all their articles were funnelled to another site, including links to the original images: (FYI on Twitter: https://twitter.com/shanselman/status/1295958229848006657 ).

That seems to be largely a tactic for blocking images, and a good one to employ, apparently. I never use images/photos of my own creation or ownership (public domain only), so I’m less concerned about that. Any I might create, I fully expect to be gifts to the web cesspool. Usually, however, I might modify images I use from public domain, within limits of fair use, but I don’t care if they get spread.

What I would care about though, is thieves and scrapers getting more traffic (due to their other kinds of cheating) and making money from my content, when I intentionally don’t try to profit off it at my own site.

Yeah, it’s a ‘whack-a-mole’ game you can’t win, but that doesn’t mean one should make it easy for them.

Offline

#7 2020-08-20 08:41:08

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,909
Website

Re: How to deal with scraped content

phiw13 wrote #325485:

How do you prevent IA from archiving your site(s) ?

You can write them directly, from an email on the domain you are writing about as evidence, and ask them not to archive your domain, or some sub-directory of it, or whatever.

I asked them two or three years ago to not archive anything on my domain, including all subdomains, and they obliged, including removing anything they had already archived. So it appears that owner requests in writing will trump cockroach requests on the fly.

Last edited by Destry (2020-08-20 08:42:06)

Offline

#8 2020-08-20 08:54:08

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,909
Website

Re: How to deal with scraped content

colak wrote #325486:

I found who hosted the site via whois. I first wrote directly to the site owner, who ignored my emails for 15 days. After that I wrote to their host, Amazon. Within 48 hours I received a call from a Spanish number from a very polite Amazon rep who took the matter into her hands, and a few days later the site went offline. After I spoke with Amazon, I added this line to our htaccess (or it might be the next cluster), which prevented them from hotlinking to our js.

Thank you. That’s the routine I was expecting. I’m surprised by the phone call, but I guess that adds a level of legitimacy to it.

I also suspect it depends on how cooperative the web host is. And if the scraper is also hosting their own servers, you’re really screwed. Then it’s lawyer time, I guess.

I revisited the url about a month or so ago, and noticed that it is now populated by other content, which looks like it was scraped as well.

Yeah, this is usually the problem with content theft; it’s done by sites that don’t care about law and will keep on doing it to make money in the background. I blame Google for this kind of thing.

Offline

#9 2020-08-20 09:30:56

phiw13
Plugin Author
From: Japan
Registered: 2004-02-27
Posts: 3,079
Website

Re: How to deal with scraped content

colak wrote #325487:

IA plays by the rules. You can add in robots.txt:

User-agent: archive.org_bot…

Unfortunately they do not (anymore). Their own statement from 2017. This has been discussed ad nauseam everywhere. One of the most recent reasonable articles on that.

Destry wrote #325489:

You can write them directly, from an email on the domain you are writing about as evidence, and ask them not to archive your domain, or some sub-directory of it, or whatever.

Yeah, that is why I said “jumping through hoops”… It kinda works, yes, but it is slow and tedious. There should be an automatic mechanism in place to prevent indexing if the site owner prefers so.


Where is that emoji for a solar powered submarine when you need it ?
Sand space – admin theme for Textpattern

Offline

#10 2020-08-20 09:36:20

gaekwad
Server grease monkey
From: People's Republic of Cornwall
Registered: 2005-11-19
Posts: 4,137
GitHub

Re: How to deal with scraped content

phiw13 wrote #325493:

There should be an automatic mechanism in place to prevent indexing if the site owner prefers so.

Do they advertise their crawler IP range anywhere? Firewall block, perhaps?
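If such a range were published, the matching a firewall rule does could be sketched with Python's `ipaddress` module. The networks below are documentation-only placeholders (RFC 5737 and RFC 3849), not any crawler's real addresses, and `is_blocked` is a hypothetical helper name.

```python
import ipaddress

# Placeholder ranges reserved for documentation, standing in for
# whatever range a crawler operator might publish.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(address, networks=BLOCKED_NETWORKS):
    """Return True if address falls inside any of the blocked networks."""
    ip = ipaddress.ip_address(address)
    # Compare only against networks of the same IP version.
    return any(ip.version == net.version and ip in net for net in networks)


print(is_blocked("192.0.2.17"))
print(is_blocked("198.51.100.5"))
```

In practice the block would live in the firewall or web server configuration; the sketch only shows the range test itself.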

Offline

#11 2020-08-20 09:42:42

phiw13
Plugin Author
From: Japan
Registered: 2004-02-27
Posts: 3,079
Website

Re: How to deal with scraped content

colak wrote #325486:

The process was simple and admittedly painless.

[…]

You were rather lucky that the scraped content was hosted on Amazon servers. For a friend we’ve had to deal with pirates hosted in Ukraine. Until recently the matter had not been resolved. Recently, we think, their hosting service went bust.


Where is that emoji for a solar powered submarine when you need it ?
Sand space – admin theme for Textpattern

Offline

#12 2020-08-20 13:40:05

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,909
Website

Re: How to deal with scraped content

phiw13 wrote #325493:

Yeah, that is why I said “jumping through hoops”… It kinda works, yes, but it is slow and tedious. There should be an automatic mechanism in place to prevent indexing if the site owner prefers so.

Some might call a business email ‘jumping through hoops’. I think of it as a legal tender way of dealing with an otherwise dodgy technological situation (e.g. blocking a bot). And in my experience, writing an email works. Not kinda works. Works. The 20 minutes spent on the email (editor fuss) was well worth the peace of mind since. It took a little over a week, if I recall correctly, before I got the response and confirmation of change, but I wasn’t put off or surprised by that considering they are mostly a volunteer crew. Should they ever back-pedal on their promise, then I’ll have a different opinion about it all.

Offline
