Textpattern CMS support forum

#1 2018-07-09 08:37:23

planeth
Plugin Author
From: Nantes, France
Registered: 2009-03-19
Posts: 234

A page of my website is being crawled a lot by the same IP

This is a question to the community I trust most.

Some context:
As I said here before, I maintain a listing of providers claiming to be GDPR compliant with links to their privacy policy and DPAs.
It’s here: gdpr4saas.eu/providers-list
This page is being crawled by several different IPs (one of them is from AWS) about 4 times an hour, even though the page barely changes once in 24 hours.
This page represents a lot of manual work to update.
I am not sure I feel comfortable having this work sucked up in such an obvious way…

Question:
What should I do?
Let the scraping happen? Prevent it? How?
Any other suggestion is welcome.

#2 2018-07-09 11:08:56

gaekwad
Server grease monkey
From: People's Republic of Cornwall
Registered: 2005-11-19
Posts: 4,259

Re: A page of my website is being crawled a lot by the same IP

planeth wrote #312919:

This page represents a lot of manual work to update.
I am not sure I feel comfortable having this work sucked up in such an obvious way…

Put restrictions in place to greatly reduce crawler traffic. There will always be bad bots that ignore the restrictions, but you can block most of them, including by denying the IP addresses that keep showing up.

What web server are you running? You can (usually) find this from the Textpattern diagnostics panel.

#3 2018-07-09 11:38:36

planeth
Plugin Author
From: Nantes, France
Registered: 2009-03-19
Posts: 234

Re: A page of my website is being crawled a lot by the same IP

gaekwad wrote #312920:

What web server are you running? You can (usually) find this from the Textpattern diagnostics panel.

Apache/2.2
Re restrictions, do you mean putting a deny directive in my .htaccess?

#4 2018-07-09 11:46:30

gaekwad
Server grease monkey
From: People's Republic of Cornwall
Registered: 2005-11-19
Posts: 4,259

Re: A page of my website is being crawled a lot by the same IP

planeth wrote #312921:

Apache/2.2
Re restrictions, do you mean putting a deny directive in my .htaccess?

Yes. If there’s no value to the bot, and / or if you’re not sure of its intentions, then block it at an IP address level. You don’t owe it anything, and if you want to protect your work then it’s a straightforward step to take in blocking access.
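
For Apache 2.2, a block at the IP address level could look something like the sketch below. The addresses are placeholders from the reserved documentation ranges, not the actual crawler IPs (those are not given in this thread), so substitute the ones from your access log.

```apache
# .htaccess sketch for Apache 2.2 (mod_authz_host syntax).
# The addresses below are hypothetical placeholders.
Order Allow,Deny
Allow from all
# Block a single address
Deny from 203.0.113.7
# Block a whole range in CIDR notation
Deny from 198.51.100.0/24
```

With `Order Allow,Deny`, a request that matches both an Allow and a Deny line is refused, so everyone gets in except the listed addresses. Apache 2.4 replaces these directives with `Require`, so this form applies to 2.2 (or to 2.4 with mod_access_compat loaded).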

#5 2018-07-09 12:01:17

planeth
Plugin Author
From: Nantes, France
Registered: 2009-03-19
Posts: 234

Re: A page of my website is being crawled a lot by the same IP

Yes, I was thinking of that.
One thing, though. Is there a chance that these IP addresses are shared with other websites? (They are from AWS.)

#6 2018-07-09 12:17:13

gaekwad
Server grease monkey
From: People's Republic of Cornwall
Registered: 2005-11-19
Posts: 4,259

Re: A page of my website is being crawled a lot by the same IP

planeth wrote #312923:

Is there a chance that these IP addresses are shared with other websites? (They are from AWS.)

In my experience, some AWS instances use a static IP address, and some use one or more pools of IP addresses. From your description, it sounds like somebody has spun up an AWS instance to scrape your content, which is a common thing these days.

Your content has a value to you, and a value to others – you decide how much the content is worth on both sides, and you are very much permitted (expected?) to impose restrictions on content access to your own standards.

Simply, if something is scraping your content without permission, then it’s arguably stealing. You can choose to restrict access to that content.

#7 2018-07-09 15:43:49

colak
Admin
From: Cyprus
Registered: 2004-11-20
Posts: 9,090

Re: A page of my website is being crawled a lot by the same IP

Hi planeth
You could just try to stop the bad bots for now. I maintain a list in my .htaccess file on GitHub: github.com/colak/neme/blob/master/.htaccess#L77
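
As an illustration of the kind of rule a bad-bot list usually contains, here is a minimal sketch for Apache 2.2 that refuses requests by User-Agent. The bot names are invented for the example and are not taken from the linked file, and this assumes mod_setenvif is available on the server.

```apache
# Flag requests whose User-Agent contains one of these (hypothetical) names.
SetEnvIfNoCase User-Agent "ExampleScraperBot" bad_bot
SetEnvIfNoCase User-Agent "HypotheticalHarvester" bad_bot

# Refuse flagged requests, allow everyone else (Apache 2.2 syntax).
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

A scraper can fake its User-Agent, so this only stops bots that identify themselves honestly.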


Yiannis

#8 2018-07-10 11:31:22

gaekwad
Server grease monkey
From: People's Republic of Cornwall
Registered: 2005-11-19
Posts: 4,259

Re: A page of my website is being crawled a lot by the same IP

colak wrote #312930:

You could just try to stop the bad bots for now.

I was quietly hoping you’d appear, colak – you’re a professional at bot blocking in my eyes!
