this post was submitted on 29 Jan 2025
53 points (98.2% liked)

Technology

[–] masterspace 5 points 1 day ago* (last edited 1 day ago) (14 children)

Lol, Ed Zitron is very paralleled.

He's pessimistic and cynical to the point of being conspiratorial and delusional.

He's someone to listen to when you want to hear someone go on an unhinged rant about the tech industry, not someone you listen to when you want to actually understand how it works.

I mean, look at this trash article: he spends 5,000 words saying effectively nothing. Things he could have covered by just linking to pre-existing, better-written articles he instead rehashes in a snarky tone, while skipping over some of the most important points (like training through distillation).

[–] [email protected] 20 points 1 day ago (7 children)

Wanting a better world, and holding up a light to the current one to show the differences between what could be and what is, is not at all what "cynical" means. "Cynical" is the opposite of what you mean. "Pessimistic" or "negative" is definitely more apt, yes.

Also:

Now, you've likely seen or heard that DeepSeek "trained its latest model for $5.6 million," and I want to be clear that any and all mentions of this number are estimates. In fact, the provenance of the "$5.58 million" number appears to be a citation of a post made by NVIDIA engineer Jim Fan in an article from the South China Morning Post, which links to another article from the South China Morning Post, which simply states that "DeepSeek V3 comes with 671 billion parameters and was trained in around two months at a cost of US$5.58 million" with no additional citations of any kind. As such, take them with a pinch of salt.

While there are some who have estimated the cost (DeepSeek's V3 model was allegedly trained using 2,048 NVIDIA H800 GPUs, according to its paper), as Ben Thompson of Stratechery made clear, the "$5.5 million" number only covers the literal training costs of the official training run (and this is made fairly clear in the paper!) of V3, meaning that any costs related to prior research or experiments on how to build the model were left out.

While it's safe to say that DeepSeek's models are cheaper to train, the actual costs — especially as DeepSeek doesn't share its training data, which some might argue means its models are not really open source — are a little harder to guess at. Nevertheless, Thompson (who I, and a great deal of people in the tech industry, deeply respect) lays out in detail how the specific way that DeepSeek describes training its models suggests that it was working around the constrained memory of the NVIDIA GPUs sold to China (where NVIDIA is prevented by US export controls from selling its most capable hardware over fears they’ll help advance the country’s military development):

Here’s the thing: a huge number of the innovations I explained above are about overcoming the lack of memory bandwidth implied in using H800s instead of H100s. Moreover, if you actually did the math on the previous question, you would realize that DeepSeek actually had an excess of computing; that’s because DeepSeek actually programmed 20 of the 132 processing units on each H800 specifically to manage cross-chip communications. This is actually impossible to do in CUDA. DeepSeek engineers had to drop down to PTX, a low-level instruction set for Nvidia GPUs that is basically like assembly language. This is an insane level of optimization that only makes sense using H800s.
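For reference, the headline figure is at least reproducible from the numbers DeepSeek's own V3 technical report gives (2.788 million H800 GPU-hours, priced at a notional $2 per GPU-hour rental rate). A back-of-the-envelope sketch, not an audited cost:

```python
# Reproducing the widely cited "$5.576 million" DeepSeek V3 figure from the
# numbers in its technical report: 2.788M H800 GPU-hours at an assumed
# rental rate of $2 per GPU-hour. Final-run compute only; prior research,
# ablations, and failed experiments are excluded by construction.
gpu_hours = 2_788_000           # H800 GPU-hours reported for the final run
rate_usd = 2.00                 # notional rental price per GPU-hour, USD
cost = gpu_hours * rate_usd
print(f"${cost / 1e6:.3f}M")    # -> $5.576M

# Sanity check against the "2048 GPUs for around two months" claim
days = gpu_hours / 2048 / 24
print(f"~{days:.0f} days")      # -> ~57 days
```

Which is to say: the number is internally consistent, but it measures exactly one thing, the rental price of the final training run.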

Tell me: What should I be reading, instead, if I want to understand the details of this sort of thing, instead of that type of unhinged, pointless, totally uninformative rant about the tech industry?

[–] masterspace 3 points 1 day ago* (last edited 1 day ago) (6 children)

Wanting a better world, and holding up a light to the current one to show the differences between what could be and what is, is not at all what "cynical" means. "Cynical" is the opposite of what you mean. "Pessimistic" or "negative" is definitely more apt, yes.

No, I said cynical and I meant cynical.

I don't care that he criticizes the tech industry; I care that he feels the innate need to portray everyone in it as moustache-twirling villains rather than normal people caught up in the same capitalist systems and pressures as everyone else.

Even here, he spends the whole article focusing on rumours about Chinese researchers finding novel ways to outperform OpenAI and the like, and makes just a dismissive joke about the accusations that they effectively trained their model using OpenAI's model. Regardless of whether you agree with the morality of ignoring copyright to copy a copier, it's an incredibly important point, because that is not a replicable strategy for actually creating new models. But rather than address it in any way, he dismisses it in a paragraph so he can spend another couple thousand words trying to dunk on the western tech industry in the snarkiest tone possible.

[–] [email protected] 10 points 1 day ago (1 children)

But it's not just that "they effectively trained their model using OpenAI's model". The point Ed goes on to make is why hasn't OpenAI done the same thing? The marvel of DeepSeek is how much more efficient it is, whereas Big Tech keeps insisting that they need ever bigger data centers.

[–] masterspace 1 points 1 day ago

They HAVE done that. It's one of the techniques they use to produce things like o1-mini and the other small models that run on-device.

But that's not a valid technique for creating new foundation models, just for creating refined versions of existing ones. You would never have been able to create, for instance, an o1-class model from GPT-3.5 using distillation.
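For what it's worth, the mechanism is easy to sketch. Below is a minimal, self-contained illustration of the standard distillation loss (temperature-softened teacher distribution, KL divergence to the student), using made-up logits; it is not anyone's actual training code, but it shows why a student can only be pulled toward distributions its teacher already produces:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened outputs to the student's.

    The student is trained to match the teacher's full output distribution,
    not just its top answer. That is why distillation can compress or refine
    an existing model, but cannot surface capabilities the teacher lacks.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # KL(p || q), scaled by T^2 as in the standard formulation
    return temperature ** 2 * sum(
        pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0
    )

teacher = [4.0, 1.0, 0.5]   # hypothetical teacher logits for 3 classes
aligned = [3.8, 1.1, 0.4]   # student that mimics the teacher
off     = [0.5, 4.0, 1.0]   # student that does not

print(distillation_loss(teacher, aligned) < distillation_loss(teacher, off))  # True
```

The loss is zero only when the student exactly reproduces the teacher's distribution, so the ceiling on what the student learns this way is the teacher itself.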
