Textpattern CMS support forum
A page of my website is being crawled a lot by the same IP
This is a question to the community I trust most.
Some context:
As I said here before, I maintain a listing of providers claiming to be GDPR compliant with links to their privacy policy and DPAs.
It’s here: gdpr4saas.eu/providers-list
This page is being crawled by several different IPs (one of them is from AWS) about four times an hour, even though the page barely changes once in 24 hours.
This page represents a lot of manual work to update.
I am not sure I feel comfortable having this work sucked up in such an obvious way…
Question:
What should I do?
Let the scraping happen? Prevent it? How?
Any other suggestion is welcome.
Re: A page of my website is being crawled a lot by the same IP
planeth wrote #312919:
This page represents a lot of manual work to update.
I am not sure I feel comfortable having this work sucked up in such an obvious way…
Put restrictions in place to greatly reduce crawler traffic. There will always be bad bots that ignore the restrictions, but you can block most of them, including those coming from commonly seen IP addresses.
What web server are you running? You can (usually) find this from the Textpattern diagnostics panel.
Re: A page of my website is being crawled a lot by the same IP
gaekwad wrote #312920:
What web server are you running? You can (usually) find this from the Textpattern diagnostics panel.
Apache/2.2
Re restrictions, do you mean putting a deny directive in my .htaccess?
Re: A page of my website is being crawled a lot by the same IP
planeth wrote #312921:
Apache/2.2
Re restrictions, do you mean putting a deny directive in my .htaccess?
Yes. If there’s no value to the bot, and/or if you’re not sure of its intentions, then block it at the IP address level. You don’t owe it anything, and if you want to protect your work, blocking access is a straightforward step to take.
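As a rough sketch of what such a block could look like on Apache 2.2, the following `.htaccess` fragment uses the `Order`, `Allow` and `Deny` directives from mod_authz_host. The addresses are placeholders taken from the reserved documentation ranges, not the crawler's real IPs; substitute the addresses seen in the access log. It also assumes the host permits these directives in `.htaccess` (`AllowOverride Limit` or broader).

```apache
# Apache 2.2 syntax (mod_authz_host). Apache 2.4 uses "Require" directives instead.
# With "Order Allow,Deny", Allow rules are evaluated first, then Deny rules;
# a request matching a Deny rule is refused even if it also matches Allow.
Order Allow,Deny
Allow from all

# Placeholder addresses (documentation ranges): replace with the crawler's IPs.
Deny from 203.0.113.45
# A whole range can be denied with CIDR notation.
Deny from 198.51.100.0/24
```

Blocked clients receive a 403 Forbidden response. Placed in the site's root `.htaccess`, this applies to every URL on the site, not only the providers list.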
Re: A page of my website is being crawled a lot by the same IP
Yes, I was thinking of that.
One thing, though: is there a chance that these IP addresses are shared with other websites (being from AWS)?
Re: A page of my website is being crawled a lot by the same IP
planeth wrote #312923:
Is there a chance that these IP addresses are shared with other websites (being from AWS)?
In my experience, some AWS instances use a static IP address, and some use one or more pools of IP addresses. From your description, it sounds like somebody has spun up an AWS instance to scrape your content, which is a common thing these days.
Your content has a value to you, and a value to others – you decide how much the content is worth on both sides, and you are very much permitted (expected?) to impose restrictions on content access to your own standards.
Simply put, if something is scraping your content without permission, then it’s arguably stealing. You can choose to restrict access to that content.
Re: A page of my website is being crawled a lot by the same IP
Hi planeth
You could just try to stop the bad bots for now. I maintain a list in my .htaccess file at github.com/colak/neme/blob/master/.htaccess#L77
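For illustration only, and not a copy of the linked file: one common way to keep a bad-bot list in `.htaccess` on Apache 2.2 is to flag requests by their User-Agent header with mod_setenvif and deny the flagged requests. The bot names below are made-up placeholders, and the sketch assumes the host allows these directives in `.htaccess` (`AllowOverride FileInfo Limit` or broader).

```apache
# Flag requests whose User-Agent matches a pattern (case-insensitive regex).
# "ExampleScraperBot" and "AnotherBadBot" are hypothetical names.
SetEnvIfNoCase User-Agent "ExampleScraperBot" bad_bot
SetEnvIfNoCase User-Agent "AnotherBadBot" bad_bot

# Apache 2.2 syntax: refuse any request carrying the bad_bot flag.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

The User-Agent header is easy to spoof, so this only catches bots that identify themselves; a crawler that pretends to be a browser would still need blocking by IP address.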
Yiannis
Re: A page of my website is being crawled a lot by the same IP
colak wrote #312930:
You could just try to stop the bad bots for now.
I was quietly hoping you’d appear, colak – you’re a professional at bot blocking in my eyes!