ROFLMAO.
Claude decided to crawl one of the sites on my new server, where known bots are redirected to an iocaine maze. Claude has been in the maze for 13k requests so far, over the course of 30 minutes.
I will need to fine-tune the rate limiting, because it never hit any limits: it scanned from 902 different client IPs. Simply rate limiting by IP doesn't fly, so I'll rate limit by (possibly normalized) user agent instead - they all used the same UA. (Rough sketch of the idea below.)
Over the course of these 30 minutes, it downloaded roughly 300 times less data than if I'd let it scrape the real thing, and each request took about a tenth of the time to serve that the real thing would have. So I saved bandwidth, saved processing time, likely saved RAM too, and served garbage to Claude.
Job well done.
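To sketch what I mean by per-agent limiting: something like the Python below. The real thing will live in the proxy layer in front of iocaine rather than in Python, and the names (`AgentRateLimiter`, `normalize_agent`), the numbers, and the example UA string are all made up for illustration - this is just the idea, not what's running.

```python
import re
import time
from collections import defaultdict


def normalize_agent(ua: str) -> str:
    """Collapse version numbers and comment blobs so minor UA variations
    share a single bucket."""
    ua = ua.lower()
    ua = re.sub(r"\(.*?\)", "", ua)   # drop parenthesised comments
    ua = re.sub(r"[\d.]+", "", ua)    # drop version numbers
    return re.sub(r"\s+", " ", ua).strip()


class AgentRateLimiter:
    """Token bucket keyed by normalized user agent instead of client IP,
    so 902 IPs sharing one UA all drain the same bucket."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.burst = burst
        # each bucket: [tokens remaining, timestamp of last refill]
        self.buckets = defaultdict(lambda: [burst, time.monotonic()])

    def allow(self, user_agent: str) -> bool:
        key = normalize_agent(user_agent)
        tokens, last = self.buckets[key]
        now = time.monotonic()
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        self.buckets[key] = [tokens - 1.0 if allowed else tokens, now]
        return allowed


# Entirely made-up UA string, just to show the keying.
limiter = AgentRateLimiter(rate_per_sec=2, burst=10)
print(limiter.allow("ExampleBot/1.2 (+https://example.com/bot)"))
```

In practice this will probably end up as a rate limit keyed on the (normalized) User-Agent header at the reverse proxy; the one-bucket-per-agent part is the point.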
Claude is back!
17.5k requests made today, between 05:00 and 18:50. 7.5k of those hit the rate limit, 10k did not. It started by requesting /robots.txt, then /, and went from there. It doesn't look like it revisited any of the URLs from the previous scan, but I haven't done a proper comparison yet, just eyeballed things so far.
No other AI visitor has come by yet.
I will tweak the rate limits further. And I will need to do some deep dives into the stats; there are many questions I want to find answers to!
Hope to do that this coming weekend, and post a summary on my blog, along with a writeup of what I'm trying to accomplish, and how, and stuff like that.
Might end up posting the trap logs as well, so others can dive into the data too. IP addresses will be included, as a service.
@algernon Probably a dumb question, but is it ignoring robots.txt? To be clear, I don't think it's your (or my) obligation to block specific bots in robots.txt; I'm just curious why they are fetching it.
@bremner I'm unsure whether it ignores it or not. I never gave it a chance to obey: it was blocked before the site it now tries to crawl even had an A record. (And as such, even /robots.txt is generated bee movie garbage, and thus unparsable for the robots.)