this post was submitted on 21 May 2025
507 points (99.2% liked)

Researchers published a massive database of more than 2 billion Discord messages that they say they scraped using Discord’s public API. The data was pulled from 3,167 servers and covers posts made between 2015 and 2024, the entire time Discord has been active.

Though the researchers claim they’ve anonymized the data, it’s hard to imagine anyone is comfortable with almost a decade of their Discord messages sitting in a public JSON file online. Separately, a different programmer released a Discord tool called "Searchcord" based on a different data set that shows non-anonymized chat histories.

[–] [email protected] 7 points 1 hour ago

"scraped" via API? I don't think it means what you think it means.

[–] [email protected] 39 points 16 hours ago (1 children)

That’s good news. Internet archiving is an important endeavor because you never know when they’ll pull the plug. Now it’s a little more secured and probably far more useful than in Discord’s hands alone.

[–] [email protected] -1 points 46 minutes ago (1 children)

Not for messages that are supposed to be private lol. Let me just make a copy of all texts you've sent over the last decade, for "archiving".

[–] [email protected] 5 points 42 minutes ago

This says it was done via the API so they wouldn't be private messages.

[–] [email protected] 107 points 19 hours ago

So basically discord finally got a usable search. I count that as a win.

[–] [email protected] 61 points 19 hours ago (3 children)

I see a lot of drama here in the thread, people decrying data leaks, how Discord is very very bad, and a number of people wanting the "good old days" of forums.

Yes. I like forums too, but, uh...

These researchers scraped publicly posted messages. Keyword here being "public". How would anything similarly public, like a forum, be better?

I actually remember the times when forums were at their peak. I hung out on BZPower for Bionicle things, and the Relic News Forum for Homeworld modding. You know what they had? Google bots that scraped messages, looked for certain words, and populated websites with advertisements based on what they could scrape from forums.

Pretty sure Lemmy doesn't do encryption either, unless there's some very special, private Lemmy server that nobody has access to. So the researchers could've just as well scraped the fediverse.

[–] [email protected] 3 points 35 minutes ago

People in general have no idea and just want to get spun up in drama and manufactured outrage.

Same thing happened when people started scraping Twitter 10-15 years ago.

[–] [email protected] 7 points 15 hours ago

How would anything similarly public, like a forum, be better?

Forums were the primary way that groups would talk with one another pre-global scale social media.

They could contain public subforums, but the majority of all of the forums that I've been a part of were not viewable without an account, which was manually approved or required a small payment (to make bans have a chance to actually stick).

[–] [email protected] 17 points 19 hours ago (1 children)

Yeah this being just as easy on bb forums or literally any webpage with a public comment section was my first thought as well..

Isn't most of the internet scraped anyways, by the internet archive? The concerning part is that this is 100% going to be used to train some coomer brained AI. Scraping, botting, scamming: all those things are going to happen on large public communities.

[–] [email protected] 5 points 16 hours ago* (last edited 16 hours ago)

Yeah, a lot of this push is about ushering in new laws to prevent data scraping.

Propaganda spreads easily through fake accounts—but how do we detect large-scale operations if they’re constantly creating and deleting accounts or trying to blend in with the rest of us? We’d need access to massive data sets to mine for patterns and expose coordinated behavior.

But the powers that benefit from shaping the narrative are the same ones pushing the idea that all scraping is bad. They want people to hate it, so they can justify laws that lock down access. That’s the end game.

[–] [email protected] 241 points 1 day ago (3 children)

Probably our only chance to find solutions to problems with open source software that uses Discord as their forum

[–] [email protected] 129 points 1 day ago (3 children)

Seriously. It's beyond painful when some open source project only uses Discord for communication. You have to hope that you post your question at a time when the right people are online, and that there's not a more interesting conversation going on, otherwise it just gets lost. Index that whole dataset.

[–] [email protected] 1 points 27 minutes ago* (last edited 27 minutes ago)

Index that whole dataset

I've seen a few projects doing just that with answeroverflow.com and they have come up in my web searches. Not really a solution but at least a stopgap.

[–] [email protected] 15 points 1 day ago (4 children)

Given some similar issues, why is it some projects still use IRC then?

[–] [email protected] 49 points 23 hours ago

there's a difference between using irc for livetime troubleshooting and not having a forum at all and directing everyone to your livechat discord. i'm sure some sicko out there has run an OSS project on only IRC, but their project likely got no traction because a history of problemsolving posts is important in open source. generally speaking, you need:

  • a wiki
  • a static indexable searchable forum
  • a live chat place for real time communication for novel problems

too many projects these days only have that last one in the form of discord

[–] [email protected] 8 points 19 hours ago

For projects I am involved with, all IRC chats are archived and searchable. Nothing is private and no registration is needed.

Quite a bit different.

[–] [email protected] 11 points 23 hours ago

That would be equally annoying. Probably a better signal to noise ratio on IRC though; Discord descends into memes almost instantly.

[–] phoenixz 12 points 1 day ago

Because IRC is awesome, always has been

[–] [email protected] 6 points 21 hours ago

I've always wanted to contribute to The Cutting Room Floor wiki but they hide registration behind a Discord server bot that will give the registration code.

[–] [email protected] 16 points 22 hours ago (5 children)

I spent nearly three hours today between discord and matrix trying to figure out how to get these two pieces of software to talk using a certain protocol.

Imagine if there were online indexable platforms where people could publish this information so it’s easily accessible rather than having to scour through message logs hoping to find the right keywords. Such a technology surely doesn’t exist already, right?

I hate discord.

[–] [email protected] 35 points 22 hours ago (1 children)

I don't hate Discord, I simply hate that so many projects and companies have unanimously decided to use it as the wrong tool for the wrong job.

It's fine for its intended use case, which is bickering with my friends about video games and fiction, and spamming each other with .gifs and meme images.

[–] [email protected] 18 points 22 hours ago (3 children)

Discord is genuinely a great tool for what I used to use Skype for. Talking to my friends, and sharing dumb memes with them in a groupchat format. Companies need to learn that using it as a forum, a Q&A service, a wiki or any other information sharing purpose, is simply fucking removed.

[–] [email protected] 12 points 1 day ago

Lol, I've read this headline and thought "thank fuck, probably the only option to have Discord's content readable", I like how universal this opinion is

[–] [email protected] 10 points 15 hours ago

wtf…… going to get worse after IPO!

[–] [email protected] 86 points 1 day ago (3 children)

Well yeah, it's not encrypted. It would be the same as 10 years of Reddit posts or Lemmy posts scraped

[–] [email protected] 79 points 1 day ago (7 children)

This isn't even them scraping private chats and small servers, they just scraped public servers in the discovery tab. None of that information was ever private, and every user can browse the chat history there.

[–] [email protected] 35 points 1 day ago

Yeah, exactly. It may sound scary or like a violation of privacy, but there is no privacy when posting to public online areas.

[–] [email protected] 24 points 1 day ago

"Researchers scrape thousands of hours of news footage from their TVs!" is about as big a deal, honestly.

[–] [email protected] 4 points 15 hours ago

There's literally no difference. Each Discord server is like a tiny chunk of Reddit. If anyone expected any privacy on these servers, they're nuts.

[–] toastmeister 7 points 17 hours ago

Great news for open source AI.

[–] [email protected] 19 points 21 hours ago (1 children)

So this is:

"Uh guys, Discord chats leaked..."

For... what, just literally everyone who used Discord between 2015 and 2017, everyone who was an early adopter?

Dear fucking god.

I used to say 'someday, people will learn', but fucking no obviously not, no they won't, almost everyone is an idiot and/or truly doesn't care.

... I guess this'll be fodder for a whole bunch of dramatubers / pedohunters for the next year or so...

[–] [email protected] 30 points 20 hours ago (4 children)

The disappearance of forum public discussion to unsearchable, unpreserved, discord semi-private discussion chambers is probably the largest informational catastrophe of the internet so far.

[–] [email protected] 8 points 20 hours ago (2 children)

I potentially agree, but as a possible competitor, I submit:

Everything DOGE has done in the last 3 months.

[–] [email protected] 12 points 20 hours ago

Saving this article for the next time someone says "Just message me on Discord, it's easier".

[–] [email protected] 10 points 19 hours ago

Ooh! Do Teams next

[–] [email protected] 32 points 1 day ago

If they aren't comfortable with their Discord messages being public, perhaps they shouldn't have posted those messages in a public forum that the public can access.

[–] [email protected] 11 points 20 hours ago* (last edited 17 hours ago)

So how does this work? Like, how did they get those messages through API calls? Also, is this not something that Discord would dislike, since it dilutes the value of their data hoard?

[–] [email protected] 14 points 1 day ago

Meanwhile AI scrapers: This will be a fine addition to my collection.

[–] [email protected] 15 points 1 day ago

🚩

marked safe

from Brazilian mass discord message leak

(never used discord)

[–] [email protected] 10 points 1 day ago

They just wanted to find new slurs.
