Textpattern CMS support forum

#11 2020-08-20 09:42:42

phiw13
Plugin Author
From: Japan
Registered: 2004-02-27
Posts: 2,114
Website

Re: How to deal with scraped content

colak wrote #325486:

The process was simple and admittedly painless.

[…]

You were rather lucky that the scraped content was hosted on Amazon servers. For a friend we’ve had to deal with pirates hosted in Ukraine. Until recently, that had not been resolved; we think their hosting service went bust.


Where is that emoji for a solar powered submarine when you need it ?

Offline

#12 2020-08-20 13:40:05

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,460
Website

Re: How to deal with scraped content

phiw13 wrote #325493:

Yeah, that is why I said, “jumping through hoops”… It kinda works, yes, but it is slow and tedious. There should be an automatic mechanism in place to prevent indexing if the site owner prefers so.

Some might call a business email ‘jumping through hoops’. I think of it as a legal tender way of dealing with an otherwise dodgy technological situation (e.g. blocking a bot). And in my experience, writing an email works. Not kinda works. Works. The 20 minutes spent on the email (editor fuss) was well worth the peace of mind since. It took a little over a week, if I recall correctly, before I got the response and confirmation of change, but I wasn’t put off or surprised by that considering they are mostly a volunteer crew. Should they ever back-pedal on their promise, then I’ll have a different opinion about it all.

Offline

#13 2020-08-20 13:55:12

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 8,269
Website

Re: How to deal with scraped content

What about blocking the bot via htaccess?

# Answer the Internet Archive crawler with 403 Forbidden (User-Agent match, case-insensitive)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} archive\.org_bot [NC]
RewriteRule .* - [F]

If you want to block more bots, this is how I do it.
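
For readers without access to the linked example, a minimal sketch of blocking several bots with one rule might look like the following. It is not necessarily the linked method: archive.org_bot comes from the rule above, while ExampleScraper and AnotherBot are hypothetical placeholders for whatever User-Agent strings show up in your own logs.

```apache
RewriteEngine On

# Match any one of the listed User-Agent substrings, case-insensitively.
# ExampleScraper and AnotherBot are made-up names; replace them with real ones.
RewriteCond %{HTTP_USER_AGENT} (archive\.org_bot|ExampleScraper|AnotherBot) [NC]

# [F] answers with 403 Forbidden and stops further rewriting.
RewriteRule .* - [F]
```

Bear in mind this only stops bots that identify themselves honestly; a scraper that sends a browser-like User-Agent will pass straight through.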


Yiannis
——————————
neme.org | hblack.net | LABS | State Machines | NeMe @ github | Covid-19; a resource
I do my best editing after I click on the submit button.

Offline

#14 2020-08-20 13:57:48

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,460
Website

Re: How to deal with scraped content

phiw13 wrote #325497:

For a friend we’ve had to deal with pirates hosted in Ukraine.

Yeah, that’s my worst concern. And I can name a few other countries that probably fit the bill, where even official/appropriate points of contact are going to be a questionable goose chase with equally questionable results.

Offline

#15 2020-08-20 14:05:09

Destry
Member
From: Haut-Rhin
Registered: 2004-08-04
Posts: 4,460
Website

Re: How to deal with scraped content

colak wrote #325502:

What about blocking the bot via htaccess?

I don’t think it’s about their bot, Colak, regardless of how you do it. As 13 alluded to, anyone can suggest a URL to be indexed, in which case it probably becomes somewhat manual at that point. Don’t know. Don’t care at this point.

I do block them via robots.txt and .htaccess and I wrote them a direct request to which they confirmed (the legal tender). The latter is what they ask you to do, so my bases are covered, evidently.

Last edited by Destry (2020-08-20 14:06:01)

Offline

#16 2020-08-20 14:53:52

jakob
Admin
From: Germany
Registered: 2005-01-20
Posts: 3,934
Website

Re: How to deal with scraped content

I keep on thinking that there’s little you can do about determined thieves, but you can perhaps use non-tech or low-tech solutions in your content to make sure readers of stolen content find their way back to the original site, and unmask thieving sites to their readers. In a perverse way, you might even benefit from their SEO efforts (though not from the ad income they make).

If thieves don’t go to all that much effort, having many back links to your own site or sites in your text would be an easy way of at least discovering thieves and directing readers back to your site. It needn’t be gratuitous, it could for example be footnotes with your domain in the link.

I don’t know much about scrapers but if they happen to convert all links to the original domain to their own, maybe use short urls that direct back to your own site. With luck they won’t be converted and people find you when they click on a link.

——

Another idea: I’ve not thought this through, but might it be possible to use the hot-linking method in reverse?

  • Include a link to a “this is the work of … . Find it here: …” image in your articles.
  • Use htaccess or whatever to prevent it showing on your own site but let it appear when used from another server/domain.

Of course, if you do want to allow sharing, that won’t help.

Also, if you provide a feed or allow ReadItLater/Pocket aggregating, you might want the “this is my work” image to not be too obnoxious so as not to alienate bona-fide interested viewers.
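
The reverse hot-linking idea above could be sketched in .htaccess roughly as follows. This is an untested sketch under stated assumptions: example.com stands in for your own domain, and images/attribution.png and images/blank.png are hypothetical file names, the second being a transparent placeholder of your own making. It relies on the Referer header, which browsers and privacy tools may omit or shorten, so results will vary.

```apache
RewriteEngine On

# Request comes from a page on our own domain (hypothetical example.com):
# serve a transparent placeholder instead of the attribution image.
RewriteCond %{HTTP_REFERER} ^https?://(www\.)?example\.com(/|$) [NC]
RewriteRule ^images/attribution\.png$ /images/blank.png [L]

# Any other referrer, or none at all, falls through to the real
# images/attribution.png, so it shows up on other sites and in feed readers.
```

Because both versions share one URL, browser and proxy caches may occasionally show the wrong one; that is a limitation of the approach rather than something this sketch solves.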

——

Another idea for images:

I’ve always found image watermarks pretty obnoxious but understand why photographers want them. For example, a relative found a photograph of his through TinEye being used for a psychotherapy practice in Canada (they didn’t know it was from him). His solution was not to put anything moderately high-res online, but the upshot is his site looks fuzzy on retina displays.

I always wondered if one could add a copyright strip to the bottom/top of photos, but display them on one’s own site with that bit cropped off so they look pristine. Anyone else who embeds them, and Google’s image search, sees the copyright notice along the top/bottom without the image being ruined by an unsightly watermark. People can, of course, manually trim off that information bar but it means they have to actively, purposely cut your details off the image.

It turns out, that’s quite doable for fixed layouts but as soon as you want to use a lightbox or css object-fit or responsive images where the height varies with the image display width, it gets quite hard to reliably crop the container box to ensure the information bar doesn’t display on one’s own site. Somewhere you always end up with a sliver of it showing. I last investigated that several years ago, so maybe the aspect ratio methods now used for responsive images (those that use percentage padding on a container) could work.


TXP Builders – finely-crafted code, design and txp

Offline

#17 2020-08-21 09:15:33

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 8,269
Website

Re: How to deal with scraped content

I know that it is like working with the devil, but g has a process in place for reporting scraped content.
https://support.google.com/legal/answer/3110420?visit_id=637335976919973417-1020628965

Edit: also, this might be of interest:
https://petapixel.com/2020/08/17/google-images-licensable-badge-to-help-photographers-sell-photos/

Last edited by colak (2020-08-22 09:17:44)


Offline

#18 2020-08-24 06:26:44

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 8,269
Website

Re: How to deal with scraped content


Offline
