this post was submitted on 07 Aug 2023
57 points (96.7% liked)

Lemmy.ca Support / Questions

Support / Questions specific to lemmy.ca.

For support / questions related to the lemmy software itself, go to [email protected]


Right now, robots.txt on lemmy.ca is configured this way:

User-Agent: *
  Disallow: /login
  Disallow: /login_reset
  Disallow: /settings
  Disallow: /create_community
  Disallow: /create_post
  Disallow: /create_private_message
  Disallow: /inbox
  Disallow: /setup
  Disallow: /admin
  Disallow: /password_change
  Disallow: /search/
  Disallow: /modlog

Would it be a good idea, privacy-wise, to deny GPTBot from scraping content from the server?

User-agent: GPTBot
Disallow: /
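For anyone who wants to sanity-check the proposed rules before deploying them, here is a minimal sketch using Python's standard `urllib.robotparser`. The rule set is abbreviated from the file above, and `is_allowed`, `POST_URL`, and the `SomeOtherBot` agent name are made up for illustration. It only shows how a compliant parser would read the file; it says nothing about whether a given crawler actually honours it.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the existing rules plus the proposed GPTBot group.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /login
Disallow: /settings
Disallow: /search/
Disallow: /modlog

User-agent: GPTBot
Disallow: /
"""


def is_allowed(agent: str, url: str) -> bool:
    """Return True if a compliant crawler named `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical post URL, used only for the check below.
POST_URL = "https://lemmy.ca/post/1"

print("GPTBot on a post:", is_allowed("GPTBot", POST_URL))
print("SomeOtherBot on a post:", is_allowed("SomeOtherBot", POST_URL))
print("SomeOtherBot on /login:", is_allowed("SomeOtherBot", "https://lemmy.ca/login"))
```

Note that the parser is given the short agent token (`GPTBot`), not a full User-Agent header, since robots.txt groups are matched on the product token.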

Thanks!

all 22 comments
[–] ono 21 points 2 years ago

Yes, please.

We can't stop LLM developers from scraping our conversations if they're determined to do so, but we can at least make our wishes clear. If they respect our wishes, then great. If they don't, then they'll be unable to plead ignorance, and our signpost in the road (along with those from other instances) might influence legislation as it's drafted in the coming years.

[–] Shadow 19 points 2 years ago (1 children)

I'm on board for this, but I feel obliged to point out that it's basically symbolic and won't mean anything. Since all the data is federated out, they have a plethora of places to harvest it from, or more likely they'd just run their own ActivityPub harvester.

I've thrown a block into nginx so I don't need to muck with robots.txt inside the lemmy-ui container.

# curl -H 'User-agent: GPTBot' https://lemmy.ca/ -i
HTTP/2 403
[–] skankhunt42 3 points 2 years ago

I imagine they rate limit their requests too so I doubt you'll notice any difference in resource usage. OVH is Unmetered* so bandwidth isn't really a concern either.

I don't think it will hurt anything but adding it is kind of pointless for the reasons you said.

[–] nbailey 18 points 2 years ago (2 children)

Yes. Ban them.

if ($http_user_agent = "GPTBot") {
  return 403;
}
[–] [email protected] 6 points 2 years ago (3 children)

Probably want == instead, or else we will all be forbidden

[–] Shadow 3 points 2 years ago* (last edited 2 years ago)

I would have thought so too, but == failed the syntax check

2023/08/07 15:36:59 [emerg] 2315181#2315181: unexpected "==" in condition in /etc/nginx/sites-enabled/lemmy.ca.conf:50

You actually want ~ though because GPTBot is just in the user agent, it's not the full string.
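To illustrate the difference between an exact comparison and a regex match, here is a rough Python sketch (not nginx itself). The `SAMPLE_UA` string is an assumed example of what a full GPTBot User-Agent header looks like, and the function names are made up; `re.search` stands in for nginx's case-sensitive `~` operator.

```python
import re

# Assumed example of a full GPTBot User-Agent header (illustrative only).
SAMPLE_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
)


def exact_match(user_agent: str) -> bool:
    """Mimics an `=` comparison: the whole header must equal "GPTBot"."""
    return user_agent == "GPTBot"


def regex_match(user_agent: str) -> bool:
    """Mimics a `~` comparison: case-sensitive regex found anywhere in the header."""
    return re.search(r"GPTBot", user_agent) is not None


print("exact match on full header:", exact_match(SAMPLE_UA))
print("regex match on full header:", regex_match(SAMPLE_UA))
print("exact match on bare 'GPTBot':", exact_match("GPTBot"))
```

The bare `GPTBot` header used in the curl test matches either way, which is why that test alone would not reveal the difference.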

[–] nbailey 2 points 2 years ago

Strangely, nginx uses a single = where you'd expect ==. It's a very strange config format...

https://nginx.org/en/docs/http/ngx_http_rewrite_module.html#if

[–] [email protected] 1 points 2 years ago

Look at me! I'm the GPTBot now!

[–] Shadow 4 points 2 years ago

Thanks for empowering my laziness =)

[–] [email protected] 11 points 2 years ago

1000% yes. Please block them.

[–] sndmn 8 points 2 years ago (1 children)

Is this even possible without all federated instances also prohibiting them?

[–] mp3 14 points 2 years ago

You take action where you can ;)

[–] Crocrodile 6 points 2 years ago
[–] narF 5 points 2 years ago (2 children)

Are they even respecting those files?

But yeah, sure, it's worth trying!

[–] mp3 3 points 2 years ago* (last edited 2 years ago)
[–] EhForumUser 1 points 2 years ago

Worth trying for what reason?

[–] Sunshine 2 points 3 months ago

Yes, please prevent them from using our conversations.

[–] [email protected] 2 points 2 years ago (1 children)

Just out of curiosity, why is everyone so up in arms about this? I mean, sure, it's just another corp, but are there any other reasons?

[–] corsicanguppy 6 points 2 years ago (1 children)

Server load spent on a bot scraping our contributions to be used to make money.

There's so much there that it's gonna offend someone.

[–] [email protected] 1 points 2 years ago

Wouldn't it just be scraped once (per company)? That doesn't sound like such a problem.

[–] EhForumUser -2 points 2 years ago* (last edited 2 years ago)

No, definitely not. Our work posted in the open is done so because we want it to be open!

It is understandable that not all work wants to be open, but access would already be appropriately locked down for all robots (and humans!) who are not a member of the secret club in those cases. There is no need for special treatment here.