Now this is an AI trap worth using. Don’t waste your money and resources hosting something yourself; let Cloudflare do it for you if you don’t want AI scraping your shit.
I swear someone released this exact thing a few weeks ago
We want names
This is getting ridiculous. Can someone please ban AI? Or at least regulate it somehow?
Once a technology or even an idea is out there, you can’t really make it go away; AI is here to stay. Generative LLMs are just a small part of it.
As with everything, it has good sides and bad sides. We need to be careful and use it properly, and the same applies to the people creating this technology.
The problem is, how? I can set it up on my own computer using open source models and some of my own code. It’s really rough to regulate that.
while allowing legitimate users and verified crawlers to browse normally.
What is a “verified crawler” though? What I worry about is, is it only big companies like Google that are allowed to have them now?
Cloudflare isn’t the best at blocking things. As long as your crawler isn’t horribly misconfigured, you shouldn’t have many issues.
I assume a crawler which adheres to robots.txt
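For what it’s worth, “adhering to robots.txt” has a concrete meaning: the crawler fetches the site’s robots.txt and skips anything disallowed for its user agent. A minimal sketch using Python’s standard-library parser; the bot name and URLs are just placeholders:

```python
from urllib import robotparser

# Fetch and parse the site's robots.txt (placeholder domain).
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A well-behaved crawler checks before every fetch and skips
# anything the site has disallowed for its user agent.
if rp.can_fetch("ExampleBot", "https://example.com/some/page"):
    print("allowed, go ahead and crawl")
else:
    print("disallowed, skip this URL")
```

The whole complaint about AI scrapers is that they never bother with this check.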
I would love to think so. But the word “verified” suggests more.
IP verification is a not uncommon method for commercial crawlers
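“Verified” usually means exactly that kind of IP check: reverse-resolve the connecting IP, make sure the hostname belongs to the crawler operator’s published domain, then forward-resolve the hostname and confirm it maps back to the same IP (this is how Google documents verifying Googlebot, for instance). A rough sketch; the hostname suffixes are an illustrative subset, not a complete list:

```python
import socket

# Hostname suffixes published by some big crawler operators
# (illustrative subset, not exhaustive).
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def is_verified_crawler(ip: str) -> bool:
    """Reverse DNS, suffix check, then a confirming forward lookup."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse DNS
    except socket.herror:
        return False
    if not hostname.endswith(TRUSTED_SUFFIXES):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward DNS
    except socket.gaierror:
        return False
    return ip in forward_ips  # a spoofed reverse record fails this check
```

Nothing in that scheme requires being Google-sized, but in practice it’s the big operators whose hostnames end up on everyone’s allow lists.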
That’s just BattleBots with a different name.
You’re not wrong.
Ok, I now need a screensaver that I can tie to a cloudflare instance that visualizes the generated “maze” and a bot’s attempts to get out.
You should probably just let an AI generate that.
No, it is far less environmentally friendly than RC bots made of metal, plastic, and electronics full of nasty little things like batteries, blasting, sawing, burning, and smashing one another to pieces.
They should program the actions and reactions of each system into actual battle bots and then televise the event for our entertainment.
Then get bored when it devolves into a wedge meta.
Somehow one of them still invents Tombstone.
Putting a chopped-down lawnmower blade in front of a thing and having it spin at hard-drive speeds is honestly kinda terrifying…
Will this further fuck up the inaccurate nature of AI results? While I’m rooting against shitty AI usage, the general population still trusts it, and making results worse will most likely make people believe even more wrong stuff.
The article says it’s not poisoning the AI data, only providing valid facts. The scraper still gets content, just not the content it was aiming for.
Edit:
It is important to us that we don’t generate inaccurate content that contributes to the spread of misinformation on the Internet, so the content we generate is real and related to scientific facts, just not relevant or proprietary to the site being crawled.
Until the AI generating the content starts hallucinating.
hey look it’s that “zip bomb” I mentioned.
fuck cloudflare though.
And this, ladies and gentleman, is how you actually make profits on AI.
Spiderman pointing at Spiderman meme.
Relevant excerpt from part 11 of Anathem (2008) by Neal Stephenson:
Artificial Inanity
Note: Reticulum=Internet, syndev=computer, crap~=spam
“Early in the Reticulum—thousands of years ago—it became almost useless because it was cluttered with faulty, obsolete, or downright misleading information,” Sammann said.
“Crap, you once called it,” I reminded him.
“Yes—a technical term. So crap filtering became important. Businesses were built around it. Some of those businesses came up with a clever plan to make more money: they poisoned the well. They began to put crap on the Reticulum deliberately, forcing people to use their products to filter that crap back out. They created syndevs whose sole purpose was to spew crap into the Reticulum. But it had to be good crap.”
“What is good crap?” Arsibalt asked in a politely incredulous tone.
“Well, bad crap would be an unformatted document consisting of random letters. Good crap would be a beautifully typeset, well-written document that contained a hundred correct, verifiable sentences and one that was subtly false. It’s a lot harder to generate good crap. At first they had to hire humans to churn it out. They mostly did it by taking legitimate documents and inserting errors—swapping one name for another, say. But it didn’t really take off until the military got interested.”
“As a tactic for planting misinformation in the enemy’s reticules, you mean,” Osa said. “This I know about. You are referring to the Artificial Inanity programs of the mid–First Millennium A.R.”
“Exactly!” Sammann said. “Artificial Inanity systems of enormous sophistication and power were built for exactly the purpose Fraa Osa has mentioned. In no time at all, the praxis leaked to the commercial sector and spread to the Rampant Orphan Botnet Ecologies. Never mind. The point is that there was a sort of Dark Age on the Reticulum that lasted until my Ita forerunners were able to bring matters in hand.”
“So, are Artificial Inanity systems still active in the Rampant Orphan Botnet Ecologies?” asked Arsibalt, utterly fascinated.
“The ROBE evolved into something totally different early in the Second Millennium,” Sammann said dismissively.
“What did it evolve into?” Jesry asked.
“No one is sure,” Sammann said. “We only get hints when it finds ways to physically instantiate itself, which, fortunately, does not happen that often. But we digress. The functionality of Artificial Inanity still exists. You might say that those Ita who brought the Ret out of the Dark Age could only defeat it by co-opting it. So, to make a long story short, for every legitimate document floating around on the Reticulum, there are hundreds or thousands of bogus versions—bogons, as we call them.”
“The only way to preserve the integrity of the defenses is to subject them to unceasing assault,” Osa said, and any idiot could guess he was quoting some old Vale aphorism.
“Yes,” Sammann said, “and it works so well that, most of the time, the users of the Reticulum don’t know it’s there. Just as you are not aware of the millions of germs trying and failing to attack your body every moment of every day. However, the recent events, and the stresses posed by the Antiswarm, appear to have introduced the low-level bug that I spoke of.”
“So the practical consequence for us,” Lio said, “is that—?”
“Our cells on the ground may be having difficulty distinguishing between legitimate messages and bogons. And some of the messages that flash up on our screens may be bogons as well.”
Read Anathem last year, really enjoyed it!
Imagine how much power is wasted on this unfortunate necessity.
Now imagine how much power will be wasted circumventing it.
Fucking clown world we live in
From the article it seems like they don’t generate a new labyrinth every single time: “Rather than creating this content on-demand (which could impact performance), we implemented a pre-generation pipeline that sanitizes the content to prevent any XSS vulnerabilities, and stores it in R2 for faster retrieval.”
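That’s obviously not Cloudflare’s actual code, but the shape of a pre-generation pipeline is easy to sketch: generate each decoy page once, sanitize it, and push it into R2 (which speaks the S3 API) so that serving it later is just a key lookup. Everything below is assumed for illustration: the bucket name, credentials, key scheme, and the crude escape-everything sanitizer.

```python
import hashlib
import html

import boto3

# R2 exposes an S3-compatible endpoint; the account ID, credentials,
# and bucket name here are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://<account_id>.r2.cloudflarestorage.com",
    aws_access_key_id="<key_id>",
    aws_secret_access_key="<secret>",
)

def pregenerate(decoy_text: str, bucket: str = "labyrinth-pages") -> str:
    """Sanitize a generated decoy page and store it under a content-addressed
    key, so the serving path is a cheap GET instead of on-demand generation."""
    safe = html.escape(decoy_text)  # crude XSS guard: escape everything
    body = f"<html><body><p>{safe}</p></body></html>".encode()
    key = hashlib.sha256(body).hexdigest() + ".html"
    s3.put_object(Bucket=bucket, Key=key, Body=body, ContentType="text/html")
    return key  # a worker can later serve this object by key
```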
On one hand, yes. On the other… imagine the frustration of the management at companies making and selling AI services. This is such a sweet thing to imagine.
I just want to keep using uncensored AI that answers my questions. Why is this a good thing?
Because it only harms bots that ignore the “no crawl” directive, so your AI remains uncensored.
Good, I ignore that too. I want a world where information is shared. I can get behind the
don’t worry, information is still shared. but with people. not with capitalist pigs
Capitalist pigs are paying media to generate AI hatred to help them convince you people to get behind laws that will limit info sharing under the guise of IP and copyright.
That’s not what the “no follow” command means.
Get behind the what?
Perhaps an AI crawler crashed Melvin’s machine halfway through the reply, denying that information to everyone else!
Because it’s not AI, it’s LLMs, and all LLMs do is guess what word most likely comes next in a sentence. That’s why they are terrible at answering questions and do things like suggest adding glue to the cheese on your pizza because somewhere in the training data some idiot said that.
The training data for LLMs come from the internet, and the internet is full of idiots.
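To put the “guess the next word” bit concretely, here’s a cartoon version with made-up counts; real models work over learned probabilities rather than a lookup table, but the failure mode is the same: a junk continuation that showed up in the training data occasionally gets sampled.

```python
import random

# Toy "predict the next word" table with invented counts.
next_word_counts = {
    "add": {"cheese": 40, "salt": 12, "glue": 3},
}

def next_word(word: str) -> str:
    """Sample the next word in proportion to how often it followed `word`."""
    counts = next_word_counts[word]
    words, weights = zip(*counts.items())
    return random.choices(words, weights=weights)[0]

# Most of the time you get "cheese", but every so often the one idiot
# who said "glue" wins the draw.
print(next_word("add"))
```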
LLM is a subset of AI
That’s what I do too, just with less accuracy and knowledge. I don’t get why I have to hate this. Feels like a bunch of cavemen telling me to hate fire because it might burn the food.
Because we have better methods that are easier, cheaper, and less damaging to the environment. They are solving nothing and wasting a fuckton of resources to do so.
It’s like telling cavemen they don’t need fire because you can mount an expedition to the nearest volcano to cook food without the need for fuel, then bring it back to them.
The best case scenario is the LLM tells you information that is already available on the internet, but 50% of the time it just makes shit up.
Wasteful?
Energy production is an issue. Using that energy isn’t. LLMs are a better use of energy than most of the useless shit we produce everyday.
Did the LLMs tell you that? It’s not hard to look up on your own:
Data centers, in particular, are responsible for an estimated 2% of electricity use in the U.S., consuming up to 50 times more energy than an average commercial building, and that number is only trending up as increasingly popular large language models (LLMs) become connected to data centers and eat up huge amounts of data. Based on current datacenter investment trends, LLMs could emit the equivalent of five billion U.S. cross-country flights in one year.
Far more than straightforward search engines that have the exact same information and don’t make shit up half the time.
I…uh…frick.
This is some fucking stupid situation: we finally got somewhat faster internet, and these bots messing with each other are hogging the bandwidth.
It’s what I’ve been saying about technology for the past decade or two … we’ve hit an upper limit to our technological development … that limit is individual human greed, where small groups of people or massively wealthy people hinder or delay any further development because they’re always trying to find ways to make money off it, prevent others from making money off it, or monopolize an area or section of society … capitalism is literally our world’s bottleneck, and it’s being choked off by an oddly shaped gold bar at this point.
Especially since the solution I cooked up for my site works just fine and took a lot less work. It’s simply to identify the incoming requests from these damn bots – which is not difficult, since they ignore all directives and sanity and try to slam your site with like 200+ requests per second, which makes 'em easy to spot – and simply IP ban them (roughly the sketch below). This is considerably simpler, and doesn’t require an entire nuclear-plant-powered AI to combat the opposition’s nuclear-plant-powered AI.
In fact, anybody who doesn’t exhibit a sane crawl rate gets blocked from my site automatically. For a while, most of them were coming from Russian IP address zones for some reason. These days Amazon is the worst offender, I guess their Rufus AI or whatever the fuck it is tries to pester other retail sites to “learn” about products rather than sticking to its own domain.
Fuck 'em. Route those motherfuckers right to /dev/null.
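The core of that approach fits in a few lines. Here’s a toy version with a fixed window and an in-memory ban list (the thresholds are illustrative, and a real setup would usually push the ban into the firewall via something like fail2ban or an nftables set rather than keep it in the application):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 1.0          # size of the sliding window
MAX_REQUESTS_PER_WINDOW = 50  # anything past this is not a human

recent: dict[str, deque] = defaultdict(deque)
banned: set[str] = set()

def should_block(ip: str) -> bool:
    """Return True if this request should be dropped (the /dev/null route)."""
    if ip in banned:
        return True
    now = time.monotonic()
    timestamps = recent[ip]
    timestamps.append(now)
    # Discard timestamps that have fallen out of the window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    if len(timestamps) > MAX_REQUESTS_PER_WINDOW:
        banned.add(ip)  # in practice: add the IP to a firewall set instead
        return True
    return False
```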
and try to slam your site with like 200+ requests per second
Your solution would do nothing to stop the crawlers operating at 10-ish rps. There are ones out there operating at a mere 2 rps, but when multiple companies are doing it at the same time, 24x7x365, it adds up.
Some incredibly talented people have been battling this since last year, and your solution has been tried multiple times. It’s not effective in all instances and can require a LOT of manual intervention and SysAdmin time.
https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/
It’s worked alright for me. Your mileage may vary.
If someone is scraping my site at a low crawl rate, I honestly don’t care so long as it doesn’t impact performance for everyone else. If I hosted anything that wasn’t just public knowledge or copy regurgitated verbatim from the bumf provided by the vendors of the brands I sell, I might object to it ideologically. But I don’t. So I don’t.
If parallel crawling from multiple organizations legitimately becomes a concern for us I will have to get more creative. But thus far it hasn’t, and honestly just wholesale blocking Amazon from our shit instantly solved 90% of the problem.
Yep. After you ban all the easy to spot ones you’re still left with far too many hard to ID bots. At least if your site is popular and large.
Cloudflare offers that too, but you can’t always tell
Geez, that’s a lot of requests!
It sure is. Needless to say, I noticed it happening.
The only problem with applying that solution to generic websites is that schools and institutions can have many legitimate users behind one IP address, and many sites don’t want to risk accidentally blocking them.
This is fair in those applications. I only run an ecommerce web site, though, so that doesn’t come into play.
Joke’s on them. I’m going to use AI to estimate the value of content, and now I’ll get the kind of content I want, even if it’s fake, that they will have to generate.