We apologize for the long performance degradation today.
Finally, we identified all of the 'tricks' that AI crawlers found today. They no longer bypass the Anubis proof-of-work challenges.

A novelty for us was that the AI crawlers not only crawl URLs actually presented to them by our frontend; they also converted those URLs into a format that bypassed our filter rules.
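A rough illustration of that class of bypass (the path prefix, rule, and matching logic below are made-up examples, not Codeberg's actual setup): a filter that matches literal path patterns can be sidestepped by percent-encoding or dot-segment tricks unless the request path is canonicalized before the rules are applied.

    from urllib.parse import unquote
    import posixpath

    BLOCKED_PREFIXES = ("/cloudstream/",)  # hypothetical filter rule

    def canonicalize(path: str) -> str:
        # Repeatedly percent-decode (to catch double-encoding) and collapse
        # '.'/'..' segments so equivalent spellings map to one canonical form.
        prev = None
        while prev != path:
            prev = path
            path = unquote(path)
        return posixpath.normpath(path)

    def is_blocked(raw_path: str) -> bool:
        # Apply the rule to the canonical form, not the raw request path.
        return canonicalize(raw_path).startswith(BLOCKED_PREFIXES)

    assert is_blocked("/%63loudstream/foo")             # percent-encoded 'c'
    assert is_blocked("/cloudstream/../cloudstream/x")  # dot-segment detour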

By the way, you can track the changes we have been making via

codeberg.org/Codeberg-Infrastr…

in reply to Codeberg

Ah I did notice a super long push time this morning. That explains it. Sucks that we have to deal with that crap. But thanks for the transparency.
in reply to Codeberg

this arms race is so bizarre and absolutely horrendous for our historical record of the evolution of human information technology. The fact that robots.txt is the only defence (which mostly does not work) is maddening.

ML companies are basically ingesting work they do not own and without permission, DDoSing the creators, hosts and providers, and then selling that information.

Surely there should be a universally accepted way to signal no scraping by now???

in reply to Codeberg

Now OpenAI is giving real users its Atlas browser, so it can scrape while users bypass the security and provide it with logins.

Disgusting.

in reply to Codeberg

I'm going to try a normal, respectful, old-fashioned crawl and see if that's possible.
This is the search engine for indyradio; it has documents Google won't find, and for me it usually works better. Excuse the plug.
I'll let you know what results I get.
lookdown.org
in reply to Codeberg

Already witnessing that the harm and impact of AI on everyday life go way beyond its benefits!
in reply to Codeberg

I think your language here is way too soft. It's not crawlers that found it; some company actively worked on breaking the mechanisms. It's nothing other than malicious cracking. Behind every such incident is someone who made the decision to crack you.
in reply to Codeberg

AI companies crawl our websites.

We ask that they stop by using the industry-standard robots.txt (a minimal example follows below).

AI companies ignore those rules.

We start blocking the companies themselves with conventional tools like IP rules.

AI companies start working around those blocks.

We invent ways to specifically make life harder for their crawlers (stuff like Anubis).

AI companies put considerable resources into circumventing that, too.

This industry seriously needs to implode. Fast.
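For reference, the robots.txt mentioned above is just a plain-text file served at the site root. A minimal version asking the major AI crawlers to stay away could look roughly like this (the user-agent tokens are the crawlers' publicly documented names; honoring the file is entirely voluntary on their side, which is exactly the problem):

    # /robots.txt
    User-agent: GPTBot
    User-agent: ClaudeBot
    User-agent: CCBot
    User-agent: Google-Extended
    Disallow: /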

in reply to Claudius πŸŽƒ

As a next step, AI companies are now offering "their" browser (read: Chromium ever so slightly themed with some company bullshit built in)

In part, this is certainly done to have yet another way to crawl the web, but this time user-directed and indistinguishable from actual human requests.

in reply to Claudius πŸŽƒ

we need a crawler honeytrap that sends back random racist 4chan posts as the response.
in reply to Claudius πŸŽƒ

@claudius Frankly, I'm kind of half-OK with that one: There's still the troubling copyright aspect, but being the browser and loading nothing but user-viewed content at least gets their load off our servers.
in reply to chrysn

@chrysn second step: DDoS. If they are on the computer anyway, why not deputize them for crawling?
in reply to Claudius πŸŽƒ

@claudius If there are actual page-consuming users behind every single request, it'd take a colossal effort to pull off a DDoS. Cloudflare (whose business interest admittedly is to over-report DoS attacks) clocks even 2010-level attacks at 600k requests per second, so even with low-attention-span users (maybe 5 s/page), that'd take 3 million humans for the duration of the attack. If someone can somehow convince 3M people to constantly click through slow-loading pages, we have bigger issues than DoS.
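A back-of-the-envelope check of those numbers, using the request rate and dwell time assumed above rather than anything measured:

    attack_rate = 600_000            # requests per second to sustain
    seconds_per_page = 5             # one page view per human every ~5 s
    per_user_rate = 1 / seconds_per_page     # 0.2 requests/s per human
    print(attack_rate / per_user_rate)       # -> 3,000,000 concurrent humans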
in reply to chrysn

@claudius Of course, if their browsers load content *beyond* what the viewed page includes and the explicit preload links, then those users have willingly turned their hosts into part of a botnet, and should expect to be blocked like any other botnet.
in reply to Claudius πŸŽƒ

@claudius At some point we're going to start paying a lawyer a few dollars to send the AI companies a registered return-receipt-requested letter saying "You are denied access to my web site. I have taken every step possible to prevent you from accessing it. If you continue to circumvent these measures and access my site anyway, you will be billed $1000/access. This fee will take effect 14 days after you receive this notice."

Then start sending bills.

in reply to Claudius πŸŽƒ

@claudius
More folks need to begin adopting ... unorthodox solutions for those groups which have been so wonderful as to ignore robots.txt. Disguised petabyte ZIP bombs. Poisoned pages. Image folders chock full of Nightshade.

The legal argument to be made and adopted here is that if the companies weren't willfully breaking the law, then they wouldn't have subjected themselves to those attacks. It certainly doesn't even fall under entrapment in most cases.

in reply to Claudius πŸŽƒ

I would like to believe that if the US federal government weren't completely fucked up right now then OpenAI and the other AI parasites with a nexus in the US would have been criminally charged by now with violating the #CFAA by actively circumventing the crawling protections added recently to websites specifically to block them.
Alas, the government is too busy engaging in vindictive prosecution of #Trump's enemies who aren't actively bribing him.
#infosec #AI
Ref: darmstadt.social/@claudius/115…

in reply to Claudius πŸŽƒ

@claudius
I feel like we are working towards a point where you have to redesign the whole web to account for AI ignoring rules.

New browsers, new protocols, etc.

in reply to YourShadowDani

@YourShadowDani

I look back at the good old days, when one day a client asked me to bulletproof their websites and computers so that no one could ever steal anything, and I went under the desk and unplugged their first computer.
They learned.

But now with AI it's a whole other level.

@claudius @Codeberg

in reply to Codeberg

it’s ok, I still admire the job you’re doing, thanks 🙏
in reply to Codeberg

I think you should file criminal complaints. What they're doing is a computer crime (trying to circumvent protections/security of an automated work).
in reply to Codeberg

this explains why I don't see y'all in malware either 🫢
# Codeberg is not a CDN. Less than 0.1% of the requests (500K req over the last 6 days) looks non CDN related.
    @cloudstream expression path('/cloudstream/*') && path('*/raw/*')
    respond @cloudstream "Codeberg is not a CDN." 403
in reply to Codeberg

how about forcing the user to log in to see any repos, otherwise just display the homepage?

unfortunately we have to take a page out of paywall tactics and make it so that only logged-in users can view public repos. otherwise it's a cat-and-mouse game

easy to rate-limit and ban individuals too when they're logged in
