The worst part? These companies are ignoring well-established robots.txt standards. They have been caught ignoring the robots.txt file multiple times and are now creating issues for the rest of us. They use residential proxy services even if you block all data center ranges. It is like a war for them to collect all code, images, PDFs, books, videos, and stuff on the Internet to train their LLMs without paying for anything. It is wild how they are getting away with copyright violations.

@nixCraft lmao at the guy irritated at the anubis mascot

@Jonly @nixCraft

Great, some comic relief for a serious topic: an issue report because "My girlfriend is gonna be mighty upset if she thinks I'm into that kinda thing" LMAO

@adamsaidsomething @Jonly ngl i also think twice when sharing anything that has furries in it 😂

@nixCraft my own personal gitea instance is currently IP-locked because LLM scrapers were hammering it so hard, which sucks because it hosts release and build cache files I wanted to serve publicly.

@nixCraft We had to implement firewall level blocks on all these AI bots - at one point it was like a DDoS attack on our IdP service, generating enormous logs on our side

@nixCraft Big Tech AI is a culture of kleptocracy. Their GenAI bullshit generators can't generate bullshit without tons of stolen goods.

@AjayCyanide @nixCraft Is this true of all AI use? I just started toying with the idea of using a local instance based on deepseek or something else I can run on PC. It's decent at annoying copy/paste, translate from C++ to Python nonsense. Is it all going to be supporting those guys and this nonsense?

@crazyeddie @nixCraft I can't speak on DeepSeek, but all models depend on the parameters on which they were trained. If the out-of-the-box set of parameters used were stolen, then well...

Google, Meta, and OpenAI (off the top of my head) have used copyrighted works for training data, but that's for the hosted versions of their models, not necessarily for what is available for the average user to download and deploy locally.

@schrotthaufen .... did you bother clicking on the link. it literally mentions anubis.....

@nixCraft

It might be wise to keep records of their traversal of your site in hopes that in the future (if we're lucky) a court case will require them to license all of the content.

@nixCraft I remember someone being mad that their crawler got trapped by something malicious but only because the crawler ignored the robots.txt
Hilarious

@Jonly @nixCraft What's a good way to construct such a bomb? Would be even better if you could not just trap it, but actually poison it and then let it go.

@nixCraft imagine a world where we stayed offline unless we really needed a network connection. That’s where I’m at these days. Starve the beast.

@csgraves @nixCraft there were a couple of open source projects for this purpose floated around in the last few weeks. Unfortunately, I don’t recall their names. Let me see if I can find something, if perhaps I favorited or bookmarked something.

@nixCraft They're afraid they'll lose the fair use argument. They should, but I doubt they will. At any rate, once it's all in they can claim that it would be too difficult to comply, blah blah, and it's a second chance for a judge to let it slip through.

We are all going to have to use PKI to communicate and use white lists. They're not going to let us have this.

@nixCraft Yup. I spent the better part of a month building a web app firewall to keep these buggers away from my internet backwater blogs and stuff.
(For example, if your ASN is even remotely associated with Bytedance, go fish).

Then Facebook hammered my robots.txt for a while, unhappy with being given a 500 in response; somehow it thought 2,000+ subsequent requests would give a different answer?

If they escalate, I swear I'm going to give them the pages they want... after using a Markov chain to mix it up with excerpts from something like The Hitchhiker's Guide to the Galaxy, just to pollute their index!
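For anyone wondering what that Markov-chain mixing might look like, here's a minimal sketch in Python. It assumes you have a plain-text corpus on disk (your own pages blended with some long public-domain prose); the file names and the two-word prefix length are illustrative, not anything the poster actually runs.

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each `order`-word prefix to the words that follow it in the corpus."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=200):
    """Walk the chain from a random prefix, producing plausible-looking nonsense."""
    prefix = random.choice(list(chain.keys()))
    out = list(prefix)
    for _ in range(length):
        followers = chain.get(tuple(out[-len(prefix):]))
        if not followers:  # dead end: restart from a random prefix
            followers = chain[random.choice(list(chain.keys()))]
        out.append(random.choice(followers))
    return " ".join(out)

if __name__ == "__main__":
    # Hypothetical corpus files: your own posts mixed with public-domain prose.
    corpus = open("blog_posts.txt").read() + " " + open("public_domain.txt").read()
    print(generate(build_chain(corpus), length=300))
```

The output still looks statistically like language, which is exactly what makes it worthless as training data.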

@alan
Use the works of authors who have been dead for at least 100 years. Otherwise you might infringe copyright yourself.

Milton's "Paradise Lost" should be a good start. Mixing it with Dante's "Inferno" at random should give a nice poisoning.¹ Add other works at your discretion. :-)

@nixCraft

¹ Of course, all works in the original language.

@PiiiepsBrummm
I looked at that, but the use of archaic words and sentence structures might be too easy to detect. Vaguely related Wikipedia articles seem like the best option.
@nixCraft

@alan
Yes, that might be the case.

On the other hand it would be a piece of art itself. Each verse from a different author. Bonus points if it conforms to a classic form of poetry.

Unfortunately I don't have the time to program it myself.

@PiiiepsBrummm I've written the code to generate the text, but it's sufficiently compute-intensive that it would kill my little VPS. I figure the best approach would be to pre-compute the "alternative" articles, cache them, and then send those when the AI scrapers come calling.

It's on my "when I'm bored and/or devious and/or really pissed off" list, but at the moment other things have my attention.

@nixCraft I always wondered why people really thought that companies would be honest and follow robots.txt without cheating. To me it's a good reminder that you can never ever trust companies to do the right thing if the wrong thing is possible.

@nixCraft because trash can violate copyright but if we do the same to them they sue the fuck out of you

@nixCraft

And it's not just about copyright violations, but also about real money lost. It's like if someone stole your cash - in some cases, the AI "attacks" might raise hosting costs for the website owner.
Think about the website owner who pays ~$500/month and suddenly gets a $5k bill because of an AI company. And if their services go down, they will lose revenue on top of that.
Record everything that is happening in your systems. If you can't file a complaint for some reason, share the info with others.

@nixCraft I'm pretty sure this is also a violation of federal law. Deliberate bypassing of access mechanisms is a felony.

Anyone who runs a site being hit by this likely has standing to seek charges and, in some jurisdictions in the US, you could even prosecute a criminal case even if the State refuses.

(disclaimer: NAL)

@nixCraft It's amazing: good people invent something, and bad people exploit that something until the last drop of blood, all for a shitty amount of money.

@eickot that is how they got rich and powerful. you have to be that kind of shameless person to exploit others' hard work.

@nixCraft yeah, as always, you need to enslave people to get rich. I hate it! If everybody were really working together, we would be traveling among the stars already!

@nixCraft @eickot
I read this in a book recently, what you’re saying reminds me of it.

“But the fact of the matter is that Mammon will demand our blood, the blood of our brothers and sisters, and the blood of our neighbors. Our accumulation necessitates exploitation. And the exploitation that upholds our current economy of racialized capitalism requires murder.”
The Anti-Greed Gospel by Malcolm Foley

@nixCraft they're not just filling most of the web with crap, they're also filling most of the web _traffic_ with crap. Fuckin' seriously.

@nixCraft I think that directing them to and serving up poisoned data is going to be critical as well.

Especially if you can serve up subtly poisoned data, where there's a fatal flaw that passes a quick first glance. "This code works, but after it runs so many times, you start to leak memory."

@DeltaWye I read that as "start to leak money" and I thought, that's what we need
@nixCraft

@nixCraft We live in interesting times. Companies that steal content to train their AI, that steal user data under the false pretext of protecting it, sell this data and bear no responsibility, fight ad blockers and do what they want. But as soon as you download something via torrent, you are immediately a thief, a criminal, you MUST pay fines or even lose your freedom... We live in interesting times, in interesting...

@nixCraft it is wild how once upon a time the response to this kind of problem was a trust-based system to avert disaster.

The new AI generation of script kiddies will take it all down and accuse us of dragging our feet.

My company relies on respectful crawling (of court websites) and we work very hard not to be bad citizens.

The clay feet of giants, and when they crumble...

@nixCraft
Who would win:
The entirety of LLM scrapers
or
One (1) Anime Jackal girl?

@nixCraft Not all that long ago, everything I saw about "how to build a web spider" was VERY clear about how important it was to rate limit yourself and set a user-agent to explain what you're doing and make it easy for an admin to block you

@nixCraft we should fight back. To all these crawlers we should serve HTTP responses filled with all the shit possible, but not code, not the real thing.

@nixCraft this is why i deleted my page permanently (i'm on the other side of the problem)

@nixCraft

allow me to turn this image around for elucidation; from the linked article (thx)

@nixCraft more:
"…a post by Dennis Schubert about the Diaspora (an Open Source decentralized social network) infrastructure, where he says that "looking at the traffic logs made him impressively angry".

"In the blogpost, he claims that one fourth of his entire web traffic is due to bots with an OpenAI user agent, 15% is due to Amazon, 4.3% is due to Anthropic, and so on. Overall, we're talking about 70% of the entire requests being from AI companies…."

@nixCraft On my side I just ban everything that looks slightly suspicious. So much for the public; if people really want access to something I host and get banned, they can email me. If they can't send an email, it's their loss.

@nixCraft Also, a honeypot to kill everything is amazing. Build a bunch of links hidden from the public on each page; bots crawl them. Ban everything that hits those URLs (rough sketch below).
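A minimal sketch of that honeypot idea, assuming /secret-crawler-trap is a path linked only from anchors hidden from humans (and disallowed in robots.txt), so only a misbehaving crawler ever requests it. The path name and the in-memory ban set are assumptions; a real setup would persist the list or feed it to a firewall.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

TRAP_PATH = "/secret-crawler-trap"   # hypothetical URL, linked only via hidden anchors
banned_ips = set()                   # in production: persist this or push it to a firewall

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ip = self.client_address[0]
        if ip in banned_ips:
            self.send_error(403, "Banned")
            return
        if self.path == TRAP_PATH:
            banned_ips.add(ip)       # only a crawler ignoring robots.txt ends up here
            self.send_error(403, "Banned")
            return
        body = b"<p>hello, human</p>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), Handler).serve_forever()
```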

@nixcraft Anubis is the website equivalent of hashcash for email.

It is sad that we have to waste computational cycles (meaning more electricity) to fight these things, but when they don't respect robots.txt there is little other choice.
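For the curious, the hashcash idea underneath proof-of-work gates like Anubis boils down to "find a nonce whose hash has N leading zero bits": verifying costs the server one hash, but finding the nonce costs the client real CPU per request. A toy sketch of the principle, not Anubis's actual protocol:

```python
import hashlib
import os

def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce so sha256(challenge + nonce) starts with N zero bits."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash checks the client's work."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0

if __name__ == "__main__":
    challenge = os.urandom(16)            # server issues a fresh random challenge
    nonce = solve(challenge, 20)          # client burns CPU (~a million hashes on average)
    assert verify(challenge, nonce, 20)   # server verifies cheaply
    print("proof of work accepted, nonce =", nonce)
```

Which is exactly the trade-off being lamented: the cycles are wasted by design, and that waste is the price of scrapers refusing to honor robots.txt.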

@kln @nixCraft Honestly, I'd prefer Anubis everywhere over ReCaptcha. Even at it supposedly slowest (never experienced that; I use Gnome's gitlab, where I encountered it) it's still faster than solving 3 ReCaptchas in a row with those slowly fading images.

@jernej__s @nixcraft
True, and there is less chance of you being used to create training data.

Though knowing the Internet, someone is going to make a version that mines for <cryptocurrency>. Because if it isn't one it is the other these days. :p

@nixCraft i said that before, and ill say it again: allowing #meta to enter #fediverse is a fucking bad idea because they will train their algorithms on our freely available data.

@brawnybunkbedbuddy @nixCraft

I agree that federating with Threads is a bad idea for many reasons, but Meta and other big tech companies are gonna scrape fedi data and train their A.I. with it anyway; they don't need federation or permission for that, unfortunately