How to deal with scraped content

colak · 2020-08-20 13:55:12

What about blocking the bot via htaccess?

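RewriteEngine On
# Match the Internet Archive crawler's user agent, ignoring case
RewriteCond %{HTTP_USER_AGENT} archive\.org_bot [NC]
# Refuse matching requests with 403 Forbidden
RewriteRule .* - [F]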

If you want to block more bots, this is how I do it.
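
For illustration only, here is a minimal sketch of how several bots could be refused in one rule. It is not necessarily the rule set colak uses. ExampleScraper and AnotherBot are hypothetical placeholders; substitute the user-agent strings you actually see in your logs.

```apache
RewriteEngine On
# Match any of the listed user-agent substrings, ignoring case.
# "ExampleScraper" and "AnotherBot" are made-up names for illustration.
RewriteCond %{HTTP_USER_AGENT} (archive\.org_bot|ExampleScraper|AnotherBot) [NC]
# Answer matching requests with 403 Forbidden.
RewriteRule .* - [F]
```

This only stops bots that announce themselves honestly; a scraper that sends a browser-like user agent will pass straight through.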

Destry · 2020-08-20 13:57:48

phiw13 wrote #325497:

For a friend we’ve had to deal with pirates hosted in Ukraine.

Yeah, that’s my worst concern. And I can name a few other countries that probably fit the bill, where even official/appropriate points of contact are going to be a questionable goose chase with equally questionable results.

Destry · 2020-08-20 14:05:09

colak wrote #325502:

What about blocking the bot via htaccess?

I don’t think it’s about their bot, Colak, regardless of how you do it. As 13 alluded to, anyone can suggest a URL to be indexed, in which case it probably becomes somewhat manual at that point. Don’t know. Don’t care at this point.

I do block them via robots.txt and .htaccess and I wrote them a direct request to which they confirmed (the legal tender). The latter is what they ask you to do, so my bases are covered, evidently.

Last edited by Destry (2020-08-20 14:06:01)

jakob · 2020-08-20 14:53:52

I keep on thinking that there’s little you can do about determined thieves, but you can perhaps use non-tech or low-tech solutions in your content to make sure readers of stolen content find their way back to the original site, and unmask thieving sites to their readers. In a perverse way, you might even benefit from their SEO efforts (though not from the ad income they make).

If thieves don’t go to all that much effort, having many back links to your own site or sites in your text would be an easy way of at least discovering thieves and directing readers back to your site. It needn’t be gratuitous, it could for example be footnotes with your domain in the link.

I don’t know much about scrapers but if they happen to convert all links to the original domain to their own, maybe use short urls that direct back to your own site. With luck they won’t be converted and people find you when they click on a link.

——

Another idea: I’ve not thought this through, but might it be possible to use the hot-linking method in reverse?

Include a link to a “this is the work of … . Find it here: …” image in your articles.
Use htaccess or whatever to prevent it showing on your own site but let it appear when used from another server/domain.
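
A rough, untested sketch of how that might look in a root-level .htaccess, assuming your site is example.com, the attribution image is /images/credit.png, and /images/blank.png is a transparent placeholder (all three names are made up for illustration):

```apache
RewriteEngine On
# When the request comes from a page on our own site, swap the
# attribution image for an invisible placeholder.
# example.com, credit.png and blank.png are assumed names.
RewriteCond %{HTTP_REFERER} ^https?://(www\.)?example\.com/ [NC]
RewriteRule ^images/credit\.png$ /images/blank.png [L]
# Any other Referer (or none) falls through and gets the real credit.png.
```

Bear in mind that the Referer header is unreliable: browsers and privacy settings can strip it, in which case visitors to your own site would see the attribution image too, and browser caching may serve whichever version was fetched first.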

Of course, if you do want to allow sharing, that won’t help.

Also, if you provide a feed or allow ReadItLater/Pocket aggregating, you might want the “this is my work” image to not be too obnoxious so as not to alienate bona-fide interested viewers.

——

Another idea for images:

I’ve always found image watermarks pretty obnoxious but understand why photographers want them. For example, a relative found a photograph of his through tineye being used for a psychotherapy practice in Canada (they didn’t know it was from him). His solution was not to put anything of moderately high-res online but the upshot is his site looks fuzzy on retina displays.

I always wondered if one could add a copyright strip to the bottom/top of photos, but display them on one’s own site with that bit cropped off so they look pristine. Anyone else who embeds them, and Google’s image search, would see the copyright notice along the top/bottom without the image being ruined by an unsightly watermark. People can, of course, manually trim off that information bar but it means they have to actively, purposely cut your details off the image.

It turns out, that’s quite doable for fixed layouts but as soon as you want to use a lightbox or css object-fit or responsive images where the height varies with the image display width, it gets quite hard to reliably crop the container box to ensure the information bar doesn’t display on one’s own site. Somewhere you always end up with a sliver of it showing. I last investigated that several years ago, so maybe the aspect ratio methods now used for responsive images (those that use percentage padding on a container) could work.

colak · 2020-08-21 09:15:33

I know that it is like working with the devil, but Google has a process in place for reporting scraped content.
https://support.google.com/legal/answer/3110420?visit_id=637335976919973417-1020628965

Edit: this might also be of interest
https://petapixel.com/2020/08/17/google-images-licensable-badge-to-help-photographers-sell-photos/

Last edited by colak (2020-08-22 09:17:44)

colak · 2020-08-24 06:26:44

Regarding Google and images

jakob · 2021-02-25 21:48:33

One of the web hosts I use has started including Botguard with some of their servers. Turns out anyone can use it, albeit for a small fee. At 1€/month maybe it’s of interest to some of the things discussed in this thread…

bici · 2021-02-26 07:13:46

jakob wrote #329057:

One of the web hosts I use has started including Botguard with some of their servers. Turns out anyone can use it, albeit for a small fee. At 1€/month maybe it’s of interest to some of the things discussed in this thread…

Nice find! I am going to check with our hosting company whether this is something that they can offer.

colak · 2021-02-26 10:43:53

We have this in our htaccess and this in the robots.txt. I’m sure that they are not as complete as Botguard, but they may be of help for some. Because of our content, we like having our site archived by the Wayback Machine, and admittedly it saved my neck a couple of times when I inadvertently deleted parts of articles.

Also, you may be interested in this list maintained by somebody else, or this one, which only allows search engines and disallows all other bots.

towndock · 2021-02-26 19:55:44

jakob wrote #325505:

His solution was not to put anything of moderately high-res online but the upshot is his site looks fuzzy on retina displays.

Sometimes the punishment doesn’t fit the crime, and that is a perfect example. I’m annoyed when our content occasionally gets used without permission. But preventing thieves from getting high res images by preventing every possible legitimate reader from seeing them? That’s not a logical tradeoff.

My key site has over 30,000 images (over a dozen years of news stories). If an image of ours was being used by a psychotherapy practice on the other side of the globe, amusement might be the best response.

Our biggest challenge is our stuff getting “borrowed” on social media. If I tried to chase that all down that would be all I’d do.

Textpattern CMS

Textpattern CMS support forum
