This discussion has been archived.
No new comments can be posted.
by SlashbotAgent (6477336) writes:
Destroying websites? No, that's bullshit.
Destroying website page views by giving the user the data without attribution, or even a visit to the site? Yeah, that's totally happening.
It's not damaging any sites. It's damaging the revenue of a few sites, and they're pissed. Perhaps rightly so. But the horses have left the barn and the barn has burned down.
by christerk (141744) writes:
As someone who's actively fighting this type of traffic, let me share my perspective.
I have been running a small-ish website with user peaks at around 50 requests per second. Over the last couple of months, my site has been getting hit with loads of up to 300 requests per second by these kinds of bots. They're using distributed IPs and random user agents, making them hard to block.
My site has a lot of data and pages to scan, and despite an appropriate robots.txt, these things ignore it and just scan endlessly. My website isn't designed to be for profit, I run it more or less as a hobby, so it has trouble handling a nearly 10x increase in traffic. My DNS costs have gone up significantly, with 150 or so million DNS requests this month.
The net effect is that my website slows down and becomes unresponsive under these scans, and I am looking at spending more money just to manage the excess traffic.
Is it destroying my site? No, not really. But it absolutely increases costs and forces me to spend more money and hours on infrastructure than I would otherwise have needed to. These things are hurting smaller communities, imposing significant cost increases on people who may have difficulty covering them, so calling it bullshit isn't exactly accurate.
by reanjr (588767) writes:
I know you're saying it's coming from lots of IP addresses, but I wonder if anyone has looked into geofencing to throttle requests coming out of major data center cities. Normal users would get full-speed access, but anyone in the Valley or in Ashburn, VA would have a hard time scraping.
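As an illustration of that idea (not something any poster here deployed), a minimal GeoIP-throttling sketch as Python WSGI middleware might look like this. It assumes MaxMind's free GeoLite2 City database and the geoip2 package; the city list and delay are invented.

```python
# Hypothetical sketch only: throttle requests whose source IP geolocates
# to a known data-center city. Assumes MaxMind's GeoLite2-City database
# and the geoip2 package; the city list and delay are invented.
import time

import geoip2.database
import geoip2.errors

THROTTLED_CITIES = {"Ashburn", "Santa Clara", "The Dalles"}  # illustrative
reader = geoip2.database.Reader("GeoLite2-City.mmdb")

class GeoThrottle:
    """WSGI middleware that adds artificial latency for throttled cities."""

    def __init__(self, app, delay=5.0):
        self.app = app
        self.delay = delay

    def __call__(self, environ, start_response):
        ip = environ.get("REMOTE_ADDR", "")
        try:
            city = reader.city(ip).city.name
        except (geoip2.errors.AddressNotFoundError, ValueError):
            city = None
        if city in THROTTLED_CITIES:
            time.sleep(self.delay)  # slow, don't block: real users still get in
        return self.app(environ, start_response)
```

Note that sleeping ties up a server worker per throttled request, so in practice you would return a 429 or do this at the load balancer; and as the replies below point out, much of this traffic doesn't come from data-center IPs anyway.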
by Halo1 (136547) writes:
It's not just data centres; many of the requests come from regular broadband IP addresses. I think they're using the "services" of bottom feeders like Scraper API [scraperapi.com], or buying access from the authors of malicious web browser extensions [arstechnica.com].
by h33t l4x0r (4107715) writes:
Yeah, and it just gets worse if you try to block them, because instead of something like Python requests they move to Selenium/Playwright to get around those blocks, which means loading your CSS/images/whatever as well, like a regular visitor would.
by sound+vision (884283) writes:
It's really no holds barred, given the amount of money we're talking about. This is an industry that's spent the last three or four decades telling us how terrible unauthorized copying and systems access are. (Don't copy that floppy!) Those rules get thrown right out the window when they're eyeing the kind of cash they think AI will bring.
Working with malware distributors and botnet admins would not surprise me at all, particularly in this Project 2025 era where the government's been purchased, whole-hog, by tech bros.
by Cigamit (200871) writes:
I have had to fight off several of these; in one case I recorded over 1 million unique IPs, all random and coming out of nearly every Vietnamese and Indonesian subnet, mostly residential. My site normally gets 5-10 requests per second and was suddenly getting over 1,000, for 12-14 hours per day, 3 weeks straight. It always started at the same time of day, almost like it was on a timer. Luckily, that one always used a User-Agent with the same old version of Chrome in the string and was easily blocked. But the attack continued.
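As a sketch of that kind of User-Agent blocking, here is hypothetical WSGI middleware pinned to one stale Chrome build; the version string below is invented, not the one from the actual attack.

```python
# Hypothetical sketch: reject any request whose User-Agent pins the one
# stale Chrome build the botnet used. The version string here is invented.
import re

BLOCKED_UA = re.compile(r"Chrome/109\.0\.0\.0")  # made-up stale version

class UABlock:
    """WSGI middleware returning 403 for the blocked User-Agent."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if BLOCKED_UA.search(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return self.app(environ, start_response)
```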
by sound+vision (884283) writes:
"Major data center cities" are also generally major population cities, with a minor in geography. Rate-limiting by GeoIP would also rate-limit big chunks of real users. It might provide a marginal windfall to users outside of those locations - at least until the bots start using rural IP addresses.
There is a boom in rural areas spinning up data centers. That is to say, any random small city may now or in the near future suddenly become a "major data center city" at the behest of a singular tech bro.
by reanjr (588767) writes:
""Major data center cities" are also generally major population cities"
Yes and no. Data centers are usually NEAR cities. But the economics of data centers keep them out in the suburbs. Data centers are more likely to be surrounded by fields than a high rise apartment.
But it sounds like from other comments that the requests are actually much more diffuse.
by Halo1 (136547) writes:
Anubis [github.com] has worked well for us to get rid of most of the scrapers from our wiki, including the ones faking regular user agents.
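Anubis works by making the browser solve a small proof-of-work puzzle before the page is served. As a rough illustration of that concept, not Anubis's actual code, a server-side check might look like this (parameter names and difficulty are made up):

```python
# Conceptual proof-of-work gate, not Anubis's actual code: the client
# must find a nonce such that sha256(challenge + nonce) has DIFFICULTY
# leading zero hex digits before it is allowed through.
import hashlib
import os

DIFFICULTY = 4  # illustrative; tune so a fresh session costs seconds of CPU

def new_challenge() -> str:
    """Random per-session challenge handed to the client."""
    return os.urandom(16).hex()

def verify(challenge: str, nonce: str) -> bool:
    """Check the nonce the client's JavaScript found by brute force."""
    digest = hashlib.sha256((challenge + nonce).encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)
```

The point is the asymmetry: a real browser pays the cost once and keeps a cookie, while a scraper farm cycling IPs and sessions pays it over and over.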
by allo (1728082) writes:
Anubis has the side effect that it stops the Internet Archive crawler.
by Halo1 (136547) writes:
Anubis has the side effect that it stops the Internet Archive crawler.
Even though it whitelists [github.com] the IA crawlers by default?
by allo (1728082) writes:
I'm not sure what they're whitelisting, but I've seen pages in the archive that only show the Anubis girl. Maybe it was from an older version, but I saw the page less than 5 months ago.
by h33t l4x0r (4107715) writes:
Have you considered offering an RSS feed? Bots would rather consume that than HTML. It tastes better.
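For what it's worth, a feed like that can be hand-rolled in a few lines; this sketch uses made-up site details and no feed library:

```python
# Minimal hand-rolled RSS 2.0 feed; site name and URLs are placeholders.
from xml.sax.saxutils import escape

def rss(items):
    entries = "".join(
        f"<item><title>{escape(title)}</title><link>{escape(link)}</link></item>"
        for title, link in items
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<rss version="2.0"><channel>'
        "<title>Example Hobby Site</title>"
        "<link>https://example.org/</link>"
        "<description>Recent pages</description>"
        f"{entries}</channel></rss>"
    )

print(rss([("New page", "https://example.org/new-page")]))
```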
by larryjoe (135075) writes:
Someone should build an AI tool to detect these AI web crawlers and then send back corrupted information (not misspelling but actual falsehoods). The only way to stop the unneighborly actions is to eliminate the expectation of a reward.
by sound+vision (884283) writes:
Cloudflare built it, and it's called "AI Labyrinth". I'd like to deploy a similar webpage generator on my Apache server, without Cloudflare. If you know of any such scripts, link me and I'll check them out.
by Cigamit (200871) writes:
I built something like this a decade ago with PHP and a dictionary file. The problem you run into is that the more bots you trap in the labyrinth, the more CPU you end up using, because they will blindly keep slurping up whatever you give them.
In the end I shut it down, as I would rather just block them to begin with than waste CPU cycles for no real gain on my part.
by WidjettyOne (10203247) writes:
Someone should build an AI tool to detect these AI web crawlers and then send back corrupted information (not misspelling but actual falsehoods). The only way to stop the unneighborly actions is to eliminate the expectation of a reward.
There's Nepenthes [zadzmo.org], and it's open source, though it sends back slow, Markov-chain nonsense rather than actual falsehoods.
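In the same spirit, here is a hypothetical sketch of the slow Markov-chain approach; the training corpus is a stub and all names are invented:

```python
# Hypothetical tarpit in the spirit of Nepenthes / AI Labyrinth: serve
# endless bigram-Markov nonsense, dribbled out slowly. Corpus is a stub.
import random
import time
from collections import defaultdict

def train(text):
    """Build a bigram chain: word -> list of observed successors."""
    chain = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def babble(chain, delay=1.0):
    """Yield nonsense forever, sleeping between words to slow the bot."""
    word = random.choice(list(chain))
    while True:
        yield word + " "
        time.sleep(delay)
        word = random.choice(chain.get(word) or list(chain))

chain = train("the quick brown fox jumps over the lazy dog and the fox naps")
gen = babble(chain, delay=0.0)  # delay=0 just for this demo
print("".join(next(gen) for _ in range(12)))
```

Because the text is generated lazily, a word at a time, the CPU cost per trapped bot stays small, which speaks to the CPU concern raised above, though each open connection still occupies a worker.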