196

Be sure to follow the rule before you head out.

Rule: You must post before you leave.

Other rules

Behavior rules:

No bigotry (transphobia, racism, etc…)
No genocide denial
No support for authoritarian behaviour (incl. Tankies)
No namecalling
Accounts from lemmygrad.ml, threads.net, or hexbear.net are held to higher standards
Other things seen as clearly bad

Posting rules:

No AI generated content (DALL-E etc…)
No advertisements
No gore / violence
Mutual aid posts require verification from the mods first

NSFW: NSFW content is permitted but it must be tagged and have content warnings. Anything that doesn't adhere to this will be removed. Content warnings should be added like: [penis], [explicit description of sex]. Non-sexualized breasts of any gender are not considered inappropriate and therefore do not need to be blurred/tagged.

If you have any questions, feel free to contact us on our matrix channel or email.

founded 2 years ago

rulebots.txt (lemmy.world)

submitted 7 months ago by [email protected] to c/[email protected]

[–] [email protected] 8 points 7 months ago (4 children)

As annoying as this is, it's to prevent LLMs from training themselves using Reddit content, and that's probably the greater of the two evils.

[–] [email protected] 37 points 7 months ago (1 children)

That's all well and good, but how many LLMs do you think actually respect robots.txt?

[–] [email protected] 14 points 7 months ago

from my limited experience, about half? i had to finally set up a robots.txt last month after Anthropic decided it would be OK to crawl my Wikipedia mirror from about a dozen different IP addresses simultaneously, non-stop, without any rate limiting, and bring it to its knees. fuck them for it, but at least it stopped once i added robots.txt.

Facebook, Amazon, and a few others are ignoring that robots.txt, on the other hand. they have the decency to do it slowly enough that i'd never notice unless i checked the logs, at least.
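
For anyone curious what a well-behaved crawler is supposed to do with those rules, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleBot` and `SomeOtherBot` user agents, and the `example.org` URL are all made up for illustration; they are not the actual file or crawlers from the comment above. `Crawl-delay` is a non-standard extension that only some crawlers read.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one crawler entirely and ask
# everyone else to wait 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Crawl-delay: 10
Allow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
page = "https://example.org/wiki/Main_Page"  # placeholder URL

# A polite crawler runs checks like these before every request.
blocked = parser.can_fetch("ExampleBot", page)    # matches the Disallow rule
allowed = parser.can_fetch("SomeOtherBot", page)  # falls through to "*"
delay = parser.crawl_delay("SomeOtherBot")        # seconds requested, or None

print(blocked, allowed, delay)
```

Nothing on the server enforces any of this. The check runs inside the crawler, so a bot that never calls it just keeps fetching, which fits the "about half" experience described above.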

[–] [email protected] 32 points 7 months ago

I thought major LLMs ignored robots.txt

[–] [email protected] 25 points 7 months ago

It's to profit from training LLMs: https://arstechnica.com/information-technology/2024/02/your-reddit-posts-may-train-ai-models-following-new-60-million-agreement/

[–] [email protected] 12 points 7 months ago

It’s to prevent LLMs from training themselves using reddit content, unless they pay the party that took no part in creating said content

FTFY