They HAVE done that. Distillation is one of the techniques they use to produce models like o1-mini and the other small models that run on-device.
But that's not a valid technique for creating new foundation models, just for creating refined versions of existing models. You would never have been able to create, for instance, an o1-class model from GPT-3.5 using distillation.
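To make the distinction concrete, here's a toy sketch of standard knowledge distillation in PyTorch (not OpenAI's or DeepSeek's actual pipeline; the model sizes, data, and hyperparameters are all made up for illustration). The point is that the student is trained to copy the teacher's output distribution, so it can only approximate capabilities the teacher already has.

```python
import torch
import torch.nn.functional as F

# Stand-ins for the real models: a "big" teacher and a smaller student.
teacher = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))
student = torch.nn.Linear(16, 8)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 2.0  # softens the teacher's distribution so the student sees more signal

for _ in range(100):
    x = torch.randn(32, 16)  # placeholder inputs; a real run would use text tokens

    with torch.no_grad():
        teacher_logits = teacher(x)  # teacher is frozen; it only provides targets

    student_logits = student(x)

    # The student minimizes KL divergence to the teacher's softened outputs,
    # which is why distillation refines an existing model rather than
    # producing a fundamentally more capable one.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```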
Picking out random people to lionize too much while you demonize literally everyone else is still being cynical.
Because the paper does not prove what DeepSeek is claiming. The paper outlines a number of clever techniques that might help improve efficiency, but most researchers are still incredibly skeptical that they would add up to a full order-of-magnitude reduction in the compute required for training.
Until someone else uses DeepSeek's techniques to openly train a comparable model off non-distilled data, we have no reason to believe their method is replicable.
Extraordinary claims require extraordinary evidence (or really just concrete, replicable evidence), and we don't have that, at least not yet.