this post was submitted on 21 Mar 2025
1249 points (99.4% liked)

Technology

[–] Buelldozer@lemmy.today 11 points 1 day ago (2 children)

> and try to slam your site with like 200+ requests per second

Your solution would do nothing to stop the crawlers that are operating at 10ish rps. There are ones out there operating at a mere 2 rps, but when multiple companies are doing it at the same time, 24x7x365, it adds up.
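To put rough numbers on that, here is a minimal sketch of why a per-client threshold misses slow crawlers. Every figure in it is made up for illustration (the client counts, the 200 rps cutoff, the `flagged_clients` helper); none of it comes from real traffic.

```python
def flagged_clients(rates_rps, per_client_limit_rps):
    """Return only the clients whose individual rate exceeds the limit."""
    return {
        client: rate
        for client, rate in rates_rps.items()
        if rate > per_client_limit_rps
    }


# Hypothetical traffic mix: many slow crawlers, a few medium ones, one burst.
rates = {f"slow-{i}": 2.0 for i in range(40)}            # 40 crawlers at 2 rps
rates.update({f"medium-{i}": 10.0 for i in range(10)})   # 10 crawlers at 10 rps
rates["burst-0"] = 250.0                                  # the one easy-to-spot offender

limit = 200.0  # assumed per-client cutoff, in requests per second
flagged = flagged_clients(rates, limit)

# Load that still reaches the server after banning everything flagged.
unflagged_load = sum(rate for client, rate in rates.items() if client not in flagged)

print(sorted(flagged))   # only the burst client is caught
print(unflagged_load)    # combined rps from the crawlers that slipped through
```

With these invented numbers the filter bans one client, and the 50 it leaves alone still add up to 180 rps around the clock.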

Some incredibly talented people have been battling this since last year and your solution has been tried multiple times. It's not effective in all instances and can require a LOT of manual intervention and SysAdmin time.

https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/

[–] confusedbytheBasics@lemmy.world 4 points 23 hours ago

Yep. After you ban all the easy-to-spot ones you're still left with far too many hard-to-ID bots. At least if your site is popular and large.

[–] dual_sport_dork@lemmy.world 2 points 1 day ago

It's worked alright for me. Your mileage may vary.

If someone is scraping my site at a low crawl rate I honestly don't care so long as it doesn't impact my performance for everyone else. If I hosted anything that was not just public knowledge or copy regurgitated verbatim from the bumf provided by the vendors of the brands I sell, I might object to it ideologically. But I don't. So I don't.

If parallel crawling from multiple organizations legitimately becomes a concern for us I will have to get more creative. But thus far it hasn't, and honestly just wholesale blocking Amazon from our shit instantly solved 90% of the problem.
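For anyone wondering what "wholesale blocking" can look like in practice, this is a rough sketch of matching a client address against a list of CIDR ranges, using only Python's standard `ipaddress` module. The ranges are placeholder documentation blocks, not Amazon's actual ranges, and `is_blocked` is a hypothetical helper; a real deployment would load the provider's published ranges and more likely enforce them at the firewall or reverse proxy.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real Amazon ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    # Membership is simply False when the IP versions differ (IPv6 vs IPv4).
    return any(addr in network for network in BLOCKED_NETWORKS)


print(is_blocked("192.0.2.17"))   # inside the first placeholder range
print(is_blocked("203.0.113.5"))  # outside both placeholder ranges
```

A linear scan is fine for a handful of ranges. A provider-sized list would probably want a sorted or trie-based lookup instead.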