this post was submitted on 21 Aug 2024
322 points (100.0% liked)

[–] [email protected] 8 points 7 months ago (4 children)

As annoying as this is, it's meant to prevent LLMs from being trained on Reddit content, and that training is probably the greater of the two evils.

[–] [email protected] 37 points 7 months ago (1 children)

That's all well and good, but how many LLM crawlers do you think actually respect robots.txt?

[–] [email protected] 14 points 7 months ago

from my limited experience, about half? i had to finally set up a robots.txt last month after Anthropic decided it would be OK to crawl my Wikipedia mirror from about a dozen different IP addresses simultaneously, non-stop, without any rate limiting, and bring it to its knees. fuck them for it, but at least it stopped once i added robots.txt.

Facebook, Amazon, and a few others are ignoring that robots.txt, on the other hand. they have the decency to do it slowly enough that i'd never notice unless i checked the logs, at least.
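for anyone who wants to check what a robots.txt like that actually tells a crawler, here's a minimal sketch using Python's standard `urllib.robotparser`. the rules, the `/wiki/Main_Page` path, the `SomeOtherBot` name, and the 10-second `Crawl-delay` are all made up for illustration, and `ClaudeBot` as the user-agent token for Anthropic's crawler is an assumption on my part, so check your own logs for the exact string. this only shows what a well-behaved crawler would conclude from the file; it does nothing about bots that ignore it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one crawler entirely, ask everyone else to slow down.
# "ClaudeBot" is an assumed user-agent token; verify it against your server logs.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: *
Crawl-delay: 10
Allow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# What a compliant crawler would decide for an example page.
blocked = parser.can_fetch("ClaudeBot", "/wiki/Main_Page")
allowed = parser.can_fetch("SomeOtherBot", "/wiki/Main_Page")
delay = parser.crawl_delay("SomeOtherBot")

print(blocked, allowed, delay)  # prints: False True 10
```

the file is purely advisory: the crawlers that still show up in the logs after this have to be handled server-side (rate limiting or blocking by user agent or IP).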

[–] [email protected] 32 points 7 months ago

I thought the major LLM crawlers ignored robots.txt.

[–] [email protected] 12 points 7 months ago

It’s to prevent LLMs from training themselves using Reddit content, unless they pay the party that took no part in creating said content.

FTFY