The original version of this story appeared in Quanta Magazine.
The Chinese AI firm DeepSeek released a chatbot called R1 earlier this year, and it attracted an enormous amount of attention. The stock prices of many Western tech companies plummeted as a result; Nvidia, which sells the chips that run leading AI models, lost more stock value in a single day than any other company in history.
Some of that attention was accompanied by accusations. Sources alleged that DeepSeek had obtained, without authorization, knowledge from OpenAI's proprietary o1 model by using a technique known as distillation. Much of the news coverage framed this possibility as a shock to the AI industry, implying that DeepSeek had improved its models by illicit means.
But distillation, also called knowledge distillation, is nothing new. It has been a topic of computer science research for over a decade, and it's a tool that big tech firms already use on their own models. "Distillation is one of the most important tools that companies have today to make models more efficient," said Enric Boix-Adserà, a researcher at the University of Pennsylvania's Wharton School who studies distillation.
Dark Knowledge
The idea of distillation began with a 2015 paper by three Google researchers, among them Geoffrey Hinton, the so-called godfather of AI and a 2024 Nobel laureate. At the time, researchers often ran ensembles of models, "many models glued together," said Oriol Vinyals, a principal scientist at Google DeepMind and one of the paper's authors, to improve their performance. "But it was incredibly cumbersome and expensive to run all the models in parallel," Vinyals said. "We were intrigued with the idea of distilling that onto a single model."
The researchers thought they could make progress by addressing a notable weak point of machine-learning algorithms: Wrong answers were all treated as equally bad, no matter how wrong they were. In an image-classification model, for instance, "confusing a dog with a fox was penalized the same way as confusing a dog with a pizza," Vinyals said. The researchers suspected that ensemble models contained information about which wrong answers were less bad than others. Perhaps a smaller "student" model could use the information from the large "teacher" model to more quickly grasp the categories it was supposed to sort pictures into. Hinton called this "dark knowledge," invoking an analogy with cosmological dark matter.
After discussing the possibility with Hinton, Vinyals developed a way for the large teacher model to pass more information about images to a smaller student model. The key was homing in on "soft targets" in the teacher model, where it assigns probabilities to each possibility rather than firm this-or-that answers. One model, for example, might calculate a 30% chance that an image shows a dog and a 20% chance that it shows a cat. Using these probabilities, the teacher effectively showed the student that dogs and cats are quite similar, while cows and cars differ quite a bit. The researchers found that this information helps the student learn to identify images of dogs, cats, cows and cars more efficiently: A big, complicated model could be reduced to a leaner one with barely any loss of accuracy.
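The soft-target idea can be sketched in a few lines of code. In this illustrative snippet (the class labels, scores and temperature value are invented for the example, not taken from the paper), the teacher's raw scores are softened with a temperature before the student is trained to match the resulting probabilities, following the general recipe of temperature-scaled distillation:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; a higher temperature softens them."""
    z = logits / temperature
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's soft targets and the student's
    softened predictions -- the quantity a student model would minimize."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -np.sum(teacher_probs * np.log(student_probs + 1e-12))

# Hypothetical teacher scores over the classes [dog, cat, cow, car]:
teacher = np.array([3.0, 2.5, 0.5, -2.0])  # dog likeliest, cat close behind
print(softmax(teacher, temperature=2.0))   # soft targets: dog and cat similar
```

The soft targets carry far more information than a hard "dog" label: they tell the student that cat is a near miss while car is wildly wrong, which is exactly the dark knowledge the paper describes.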
Explosive Growth
Distillation was not an immediate hit. Vinyals grew discouraged after the paper was turned down by a conference. But distillation arrived at an important moment: Engineers were discovering that the more training data they fed into neural networks, the more effective those networks became. Models soon ballooned in size and capability, and the cost of operating them grew in proportion.
Many researchers turned to distillation as a way to make smaller models. In 2018, for instance, Google researchers unveiled a powerful language model called BERT, which the company began using to help parse billions of web searches. But BERT was expensive to run, so developers distilled a more manageable version called DistilBERT, which became popular in both business and research. Distillation gradually became common practice, and it is now offered as a service by companies including Google, OpenAI and Amazon. The original distillation paper, still published only on the arxiv.org preprint server, has now been cited more than 25,000 times.
Because distillation requires access to the inner workings of the teacher model, it would not have been possible for DeepSeek to distill a closed-source model in that way. That said, a student model could still learn quite a bit from a teacher model just through prompting the teacher with certain questions and using the answers to train its own models, an almost Socratic approach to distillation.
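This prompting-based approach can be sketched schematically. In the toy example below, the teacher is deliberately modeled as a black box (a stand-in function with canned answers, not a real model or API): the student never sees the teacher's internal probabilities, only its answers, which are collected as training pairs:

```python
def teacher_answer(prompt):
    """Stand-in for a black-box teacher model queried only by prompting.
    The canned answers are invented for this illustration."""
    canned = {
        "What is 2+2?": "4",
        "What is the capital of France?": "Paris",
    }
    return canned.get(prompt, "I don't know.")

def build_training_set(prompts):
    """Collect (prompt, teacher answer) pairs; a student model would then
    be fine-tuned on these pairs, never touching the teacher's internals."""
    return [(p, teacher_answer(p)) for p in prompts]

dataset = build_training_set(["What is 2+2?", "What is the capital of France?"])
print(dataset)
```

The contrast with classic distillation is that here only the teacher's hard answers are available, not its soft targets, which is why this works even when the teacher is a closed model reachable only through queries.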
Other researchers continue to find new applications, too. In January, the NovaSky lab at the University of California, Berkeley, showed that distillation works well for training chain-of-thought reasoning models, which use multistep "thinking" to better answer complicated questions. The lab says its fully open-source Sky-T1 model cost less than $400 to train, and it achieved results similar to those of a much larger open-source model. "We were genuinely surprised by how well distillation worked in this setting," said Dacheng Li, a Berkeley doctoral student on the NovaSky team. "Distillation is a fundamental technique in AI."
The original story was reprinted with permission from Quanta Magazine, an editorially independent publication of the Simons Foundation whose mission is to enhance public understanding of science by covering research developments and trends in mathematics and the physical and life sciences.

