AI Model Distillation: A Literature Review
Introduction: Model distillation, often called knowledge distillation, is a technique where a smaller student model is trained to replicate the behavior of a larger teacher model. Since the advent of the Transformer architecture in 2017 [blog.google], knowledge distillation has become vital for compressing massive language models into more efficient versions. This review covers four key areas of recent (post-2017) research on model distillation: performance benefits, cost savings, industry adoption, and algorithmic innovations, with an emphasis on Large Language Models (LLMs). Citations to academic papers and industry reports are included to substantiate each point.
1. Performance Benefits of Model Distillation
Distillation has been empirically shown to preserve, and sometimes improve, model performance across several dimensions while using a much smaller model:
- Inference Speed and Efficiency: Distilled models run significantly faster than their teachers. For example, DistilBERT (a distilled BERT-base) has 40% fewer parameters than BERT and achieves about 60% faster inference [arxiv.org]. Despite this compression, it retains 97% of BERT's language understanding performance on the GLUE benchmark [arxiv.org]. Similarly, TinyBERT (a 2019 two-stage distilled BERT) is reported to be 7.5× smaller and 9.4× faster at inference than BERT-base, while achieving comparable results on GLUE [openreview.net]. Another example, MobileBERT (2020), is 4.3× smaller and 4× faster than BERT-base, yet only about 0.6 points lower on GLUE score; in fact, on the SQuAD QA benchmark its F1 score slightly exceeds the teacher's (by 1.5–2 points) despite the drastic size reduction [openreview.net]. These cases demonstrate that distillation can accelerate inference substantially without severely sacrificing accuracy.
- Accuracy and Generalization: A well-distilled student often maintains accuracy close to the teacher's. As noted, DistilBERT retained 97% of BERT's accuracy on GLUE [arxiv.org], dropping only a few points on tasks like SQuAD (within ~3.9 points) [arxiv.org]. In some scenarios, students even match or outperform their teachers. Research on born-again networks (Furlanello et al., 2018) found that a student model with identical architecture to its teacher can surprisingly surpass the teacher's accuracy after distillation [aclanthology.org]. This effect is attributed to the regularization benefit of training on the teacher's "soft" outputs, which provide richer information than hard labels and are sometimes called the teacher's "dark knowledge" [aclanthology.org]. Distillation can thus improve generalization; for instance, a multi-task distilled model was shown to outperform all of its individual single-task teachers on NLP tasks by leveraging knowledge from each [aclanthology.org].
- Robustness and Other Metrics: Distillation can also affect properties like model robustness and calibration. Early studies of defensive distillation (pre-2017) suggested distilled models might resist adversarial attacks, though later work found standard distillation alone may reduce adversarial robustness [openreview.net]. Recent research has addressed this: for example, KD with input gradient alignment was proposed to ensure a student inherits the teacher's adversarial robustness, allowing the student to match or exceed the teacher's robustness against attacks [openreview.net]. Beyond robustness, distilled LLMs have shown improvements in calibration and reduced bias. A 2024 study on MiniLLM (distilling open-source LLMs) found the distilled models produced more precise and better-calibrated responses, with lower exposure bias in text generation [openreview.net]. In sum, with the right techniques, a distilled model can be not only faster, but sometimes more robust and better calibrated than the original.
- Model Size vs. Capability Trade-off: Distillation helps navigate the trade-off between model size and capability. It effectively compresses knowledge from a very large model into a smaller one that punches above its weight. For instance, a distilled 13B-parameter model can approach the performance of a 175B-parameter model on certain tasks. This was demonstrated in LLM distillation for instruction-following: a LLaMA-13B model fine-tuned on dialogues from GPT-3.5 was judged to achieve about 90% of ChatGPT's quality in responses [lmsys.org]. Such outcomes highlight that much of a large model's knowledge can be captured by a much smaller network through distillation, yielding strong accuracy-per-parameter efficiency.
2. Cost Savings Through Distillation
Because it improves efficiency, model distillation translates directly into cost savings in both training and deployment:
- Reduced Computational Requirements: Distilled models require far less compute to train and run. The creators of DistilBERT reported that training it from scratch (using the original BERT as a teacher) took only 8 GPUs for 90 hours, whereas training a comparably performing model (RoBERTa) took on the order of 1,024 GPUs for a day [arxiv.org], a massive reduction in computational cost. The resulting model is also much smaller (66M vs. 110M parameters for BERT-base), which means less memory and cheaper hardware can be used for inference. A smaller model can often run on CPU or mobile devices, eliminating the need for costly GPUs in deployment. For example, DistilBERT's 40% size reduction brings the model down to ~200MB, enabling on-device NLP apps [arxiv.org].
- Lower Inference Cost (Latency & Energy): Faster inference not only improves user experience but also reduces the compute time (and thus energy) per query. TinyBERT's 9× speedup [openreview.net] and MobileBERT's 4× speedup [openreview.net] imply that far fewer server resources are needed to handle the same throughput of requests. This directly translates to cost savings in cloud environments where billing is tied to computation time. Anthropic noted that by distilling their larger Claude model into a smaller one, customers could run "sophisticated tasks" at a fraction of the cost, achieving nearly the same accuracy as the larger model but at the price and speed of the smaller model [www.anthropic.com]. In practical terms, if a model is 60% faster (as with DistilBERT [arxiv.org]), each query takes only about 62.5% of the original compute time, so one could serve ~1.6× more requests with the same compute infrastructure, or equivalently cut inference costs by roughly 37.5% (see the worked calculation after this list). These savings scale up in industry settings where model usage is heavy.
- Energy Efficiency and Deployment Scalability: A smaller, distilled model usually consumes less power, which is critical for both data-center efficiency and deploying AI on edge devices. Google's PaLM 2 family, for instance, includes a small "Gecko" variant (<2B parameters) that is so lightweight it can run on a mobile phone [blog.google]. While Google hasn't publicly detailed Gecko's training, such extreme efficiency likely involved techniques like distillation to retain strong performance in a tiny model. The ability to run advanced models on-device opens the door to applications without continuous cloud access, saving costs on cloud inference. Qualcomm's AI research similarly notes that sub-10B-parameter models (e.g., PaLM 2 Gecko) enable on-device generative AI with low latency and no server costs [blog.google]. Distillation is a key enabler in producing these compact, energy-efficient models.
- Training Data and Development Cost: Distillation can also reduce the amount of human-annotated data or experimentation needed to build a model. Instead of collecting enormous labeled datasets or tuning a gigantic model, one can use a powerful teacher (potentially an ensemble or an expensive model) to generate training signals for a student; a minimal sketch of this workflow appears after this list. This approach has proven extremely cost-effective in the LLM domain. A striking example is Stanford Alpaca/Vicuna: by using OpenAI's text-davinci-003 (GPT-3.5) to generate 50K+ instruction-response examples, researchers fine-tuned a 7B–13B model that replicated much of the teacher's capability for a few hundred dollars in compute [crfm.stanford.edu] [lmsys.org]. In one report, training the Vicuna-13B chatbot (distilled from ChatGPT conversations) was estimated at only ~$300 worth of GPU time [lmsys.org], a tiny fraction of what training a large model from scratch would cost (often millions of dollars). Thus, distillation not only saves runtime costs but also democratizes model development, allowing smaller organizations to build high-performance models cheaply by leveraging APIs or public outputs of big models as teachers.
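To make the arithmetic behind the "Lower Inference Cost" point explicit, here is the back-of-the-envelope calculation (a simple illustration, not a figure taken from the cited sources):

\[
\text{speedup} = 1.6 \;\Rightarrow\; \frac{\text{compute per query (student)}}{\text{compute per query (teacher)}} = \frac{1}{1.6} = 0.625 \;\Rightarrow\; \text{cost reduction} = 1 - 0.625 = 37.5\%.
\]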
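The teacher-generated-data workflow behind Alpaca- and Vicuna-style distillation can be summarized in a short sketch. This is a minimal illustration under stated assumptions, not the actual Alpaca or Vicuna code: query_teacher is a hypothetical placeholder for a call to the teacher model's API, and the student fine-tuning step is left as an ordinary supervised training run on the resulting file.

```python
import json
from typing import Callable, Dict, List

def build_distillation_dataset(
    seed_prompts: List[str],
    query_teacher: Callable[[str], str],  # hypothetical wrapper around the teacher's API
) -> List[Dict[str, str]]:
    """Collect (instruction, teacher_response) pairs to use as supervised targets."""
    dataset = []
    for prompt in seed_prompts:
        response = query_teacher(prompt)  # the teacher provides the training signal
        dataset.append({"instruction": prompt, "output": response})
    return dataset

if __name__ == "__main__":
    # Stub teacher for illustration; in practice this would call a large model.
    fake_teacher = lambda p: f"(teacher answer to: {p})"
    seeds = [
        "Explain knowledge distillation in one sentence.",
        "List three benefits of smaller models.",
    ]
    data = build_distillation_dataset(seeds, fake_teacher)
    with open("distill_data.jsonl", "w") as f:
        for example in data:
            f.write(json.dumps(example) + "\n")
    # A 7B-13B student would then be fine-tuned on distill_data.jsonl
    # with a standard supervised instruction-tuning loop.
```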
In summary, model distillation yields significant cost savings by compressing models (smaller, cheaper hardware), speeding up inference (less compute per result), and leveraging existing models to reduce training effort. These efficiency gains are crucial for making advanced AI practical at scale in industry.
3. Industry Adoption of Model Distillation
Many leading AI organizations have embraced knowledge distillation to deploy large models more efficiently or to transfer knowledge between models. Below are notable instances of industry-scale LLMs using distillation techniques:
- OpenAI (GPT Series): OpenAI has increasingly leaned on distillation to optimize its model offerings. In late 2024, OpenAI introduced an official Model Distillation feature in its API, reflecting what many users already did manually [community.openai.com]. This workflow allows developers to use outputs from a larger model (like GPT-4) to fine-tune a smaller model that achieves similar performance on a specific task at lower cost [community.openai.com]. OpenAI's own smaller models (e.g., the GPT-3-series "Ada" and "Babbage") are believed to have been produced by compressing larger models or training with distilled knowledge, though details aren't public. There is evidence that OpenAI's instruction-tuned models (like text-davinci-003) have served as teachers for open-source replicas; for example, Stanford's Alpaca project used GPT-3.5 to train a 7B model that mimics ChatGPT's style [crfm.stanford.edu]. OpenAI's public stance via its API guides is that distillation can "deliver similar performance on specific tasks at a lower cost" [community.openai.com], underscoring its importance in making large models more affordable to deploy.
- Google (Bard and PaLM/Gemini models): Google's LLMs have multiple size tiers and have employed distillation and related compression techniques to make them deployable. PaLM 2, which powers Google's Bard chatbot, comes in sizes from Unicorn (largest) down to Gecko (smallest) [blog.google]. Gecko is so lightweight it can run on a smartphone offline [blog.google], an achievement likely made possible by aggressive distillation or architecture search to retain as much capability as possible in under 2B parameters. While Google hasn't explicitly detailed its distillation methods for PaLM 2, model compression is standard practice for on-device AI at Google (earlier, it distilled BERT into smaller models for Google Assistant, Translate, and Gboard voice typing [arxiv.org]). Industry observers expect that Google's Gemini models (which come in multiple sizes) likewise leverage knowledge distillation to transfer knowledge from ultra-large training runs into more efficient deployment models. In research, Google has explored sequence-level knowledge distillation for translation using PaLM 2 models of varying sizes [arxiv.org] and found that the classic "capacity gap" (huge teacher vs. small student) can be mitigated with improved distillation algorithms. All this indicates that distillation is integral to Google's strategy for delivering LLM capabilities in production environments, from cloud servers to mobile devices.
- Meta (LLaMA Series and Derivatives): Meta's LLaMA and LLaMA-2 open-source LLMs have spurred widespread use of distillation in the community and likely internally. While Meta's papers focus on training from scratch, the fine-tuned chat versions (like LLaMA-2-Chat) benefit from distilled knowledge. In the open-source realm, model distillation has been the key to aligning smaller models with instruction-following ability. Notably, Stanford Alpaca distilled the behavior of OpenAI's text-davinci-003 into a 7B LLaMA model by training on 52K generated examples [crfm.stanford.edu]. The resulting Alpaca model demonstrated surprisingly high-quality instruction responses "similar to text-davinci-003" despite its small size [crfm.stanford.edu]. Building on this, the Vicuna model (based on LLaMA-13B) was fine-tuned on about 70K user-shared ChatGPT conversations, effectively distilling ChatGPT's conversational prowess. Vicuna's creators reported it achieved over 90% of ChatGPT's quality as evaluated by GPT-4, and this open model was released at a tiny fraction of the original ChatGPT's cost [lmsys.org]. These are clear cases of industry-scale distillation: taking knowledge from state-of-the-art proprietary models (GPT-3.5/ChatGPT) and injecting it into smaller Meta models. Meta has tacitly acknowledged this trend by not opposing these efforts and even incorporating some community improvements. It is plausible that Meta's own LLaMA-2-Chat training involved leveraging outputs or feedback from larger models (e.g., using GPT-4 to judge responses during fine-tuning, a form of distilled feedback). In summary, Meta's LLM ecosystem has widely adopted distillation, either directly or indirectly via the open-source community, to produce smaller models (7B–13B) that approximate much larger models' capabilities.
- DeepSeek (R1 Model Distillation): DeepSeek is a Chinese AI startup that openly centers its approach on knowledge distillation. Its flagship DeepSeek-R1 is a large reasoning-focused LLM, and the company distilled this model into several smaller variants based on Meta's LLaMA and Alibaba's Qwen models [www.ibm.com]. In fact, DeepSeek used a two-step process: first applying reinforcement learning to train a powerful R1 teacher, then using knowledge distillation to fine-tune multiple smaller models (e.g., LLaMA 8B and Qwen 7B) on data generated by R1 [www.ibm.com]. The distilled DeepSeek-R1 models retain strong reasoning abilities (especially on math and coding tasks) while being much more efficient to run, which has made them popular for deployment on cloud platforms (IBM watsonx, Amazon Bedrock, and others have integrated these distilled models) [www.ibm.com]. This is a cutting-edge example of industry use: a custom large model distilled down into several smaller backbone models to cover different efficiency trade-offs. The success of DeepSeek's distilled models, reportedly rivaling OpenAI's models on some benchmarks [www.ibm.com], also sparked discussion about intellectual property: OpenAI has suggested that DeepSeek's process effectively distilled knowledge from OpenAI's own models via their outputs, blurring the lines of IP in the era of model distillation. Regardless of those debates, DeepSeek's case highlights that distillation is a key tool for new entrants in the LLM space to train competitive models cheaply and deploy them widely.
- Anthropic (Claude models): Anthropic has explicitly used knowledge distillation to optimize its Claude AI assistant. In 2024, Anthropic announced a distillation offering on Amazon Bedrock in which the smaller, faster Claude 3 Haiku is distilled from the more capable Claude 3.5 Sonnet, achieving accuracy approaching the Claude 3.5 level at the speed and cost of the smaller Claude 3 model [www.anthropic.com]. By transferring knowledge from the powerful Claude 3.5 "Sonnet" (teacher) to the faster Claude 3 "Haiku" (student), Anthropic reported "significant performance gains" such that the student's quality approaches the larger model on targeted tasks [www.anthropic.com]. This distilled Claude runs with much lower latency (optimized on AWS Trainium hardware) and is offered at a cheaper price point, enabling high-volume enterprise use of Claude's capabilities at a fraction of the original inference cost [www.anthropic.com]. Anthropic even built automation around this: Amazon Bedrock's distillation service can generate synthetic Q&A pairs from the teacher, fine-tune the student, and deploy it, all in one pipeline [www.anthropic.com]. This marks one of the first public distillation-as-a-service offerings and shows Anthropic's commitment to distillation as a means of delivering "frontier" model performance cost-effectively, following the pattern of other industry leaders.
In addition to the above, many other industry-scale deployments use knowledge distillation behind the scenes. For instance, Microsoft has used distillation in its DeepSpeed optimization library to compress transformer models for production, and Alibaba's Qwen LLM family (which includes 14B and 7B versions) may well have involved distilling the larger model's knowledge into the smaller variant for efficiency; Qwen models have also served as student backbones for DeepSeek's distilled releases [www.ibm.com]. Even beyond NLP, companies apply distillation to speech recognition models, recommender systems, and more to shrink models before deployment. The widespread adoption across OpenAI, Google, Meta, Anthropic, and others underlines that distillation is standard practice in industrial AI: it enables the use of cutting-edge models under real-world constraints of latency, memory, and cost.
4. Algorithmic Innovations in Distillation (2018–2024)
Research in the past several years has greatly expanded and improved the original idea of knowledge distillation (Hinton et al., 2015). Innovations span new distillation objectives, training schemes, and use-cases, especially geared toward large transformers and LLMs. Key developments include:
- Improved Loss Functions and Knowledge Types: The classic distillation approach uses the teacher's soft logits as targets for the student (response-based KD); a minimal sketch of this objective appears after this list. Newer methods explore distilling more than just final outputs. For example, intermediate feature distillation transfers knowledge from the teacher's hidden layers or attention maps to the student. Romero et al. (2015) introduced FitNets to train a student to match the teacher's intermediate representations, a technique later adopted in TinyBERT, which distilled not only the output probabilities but also the transformer layer outputs and attention maps of BERT into a smaller model [openreview.net]. More recently, contrastive distillation objectives have been proposed to better preserve the structure of the teacher's knowledge. Instead of a simple L2 loss on features, Contrastive Distillation on Intermediate Representations (CoDIR) formulates a task in which the student must distinguish the teacher's representation of the correct sample from many negative samples [aclanthology.org]. This forces the student to capture fine-grained similarities in teacher embeddings. CoDIR achieved state-of-the-art compression results on BERT models, outperforming earlier methods on GLUE benchmarks [aclanthology.org]. Likewise, researchers have applied contrastive learning in vision distillation to retain relational knowledge among data points (Tian et al., 2019). In summary, by distilling features, attention patterns, and relationships (not just outputs), these methods provide the student with richer cues, leading to higher accuracy and better mimicry of the teacher's behavior.
- Self-Distillation and Iterative Teaching: A striking discovery of the late 2010s was that a model could benefit from distilling knowledge from itself. In self-distillation (or born-again networks), the same model architecture is first trained to convergence (the teacher), then a new instance of that model (the student) is initialized and trained on the teacher's outputs. Furlanello et al. (2018) found that repeating this process can progressively improve results: the student often outperforms the original teacher despite having the same capacity [aclanthology.org]. This suggests the soft targets provide a form of regularization or curriculum that standard training does not. (A toy sketch of the born-again procedure appears after this list.) Extensions of this idea appeared in online distillation and mutual learning (Zhang et al., 2018), where multiple models learn from each other's predictions in parallel, each essentially acting as both teacher and student. While self-distillation initially showed gains in image classification, it has also been applied in NLP. For instance, Clark et al. (2019) applied born-again distillation to multi-task learning: individual task models taught a single multi-task model, which, with a technique called "teacher annealing," was able to surpass the accuracy of the single-task teachers on the GLUE benchmark [aclanthology.org]. The benefit of self-distillation in LLMs is an area of ongoing research; it hints that even without a larger teacher, a model can teach itself to further polish its knowledge (possibly by learning from its own high-confidence outputs on unlabeled data, akin to "noisy student" training in semi-supervised learning; Xie et al., 2020). These approaches blur the line between distillation and regular training, essentially using the model's own knowledge as a supplemental training signal to improve generalization.
- Progressive and Multi-Teacher Distillation: One challenge identified in knowledge distillation is the capacity gap: if the teacher is too large relative to the student, the student struggles to replicate it, often getting overwhelmed by the teacher's knowledge [arxiv.org]. In 2020, Mirzadeh et al. formalized this problem and introduced Teacher Assistant Knowledge Distillation (TAKD) as a solution [arxiv.org]. In TAKD, instead of distilling directly from a huge teacher to a tiny student, one trains an intermediate-sized "teacher assistant" (or a chain of decreasing-size teachers) to bridge the gap. For example, to distill a 24-layer BERT into a 2-layer student, one might first distill BERT into a 6-layer intermediate model, then distill that into the 2-layer model. Their experiments showed the student's performance improved significantly when using an intermediate teacher, nearly closing the gap to the large teacher in some cases. This idea of progressive distillation (distilling in multiple steps or stages) is especially relevant for LLMs: a 100B-parameter model might be distilled into a 10B model, then into a 1B model, with each step more manageable for the student (the chaining pattern is sketched after this list). Recent research on sequence-to-sequence models also confirms the "capacity gap curse" (distillation effectiveness drops if the teacher is far larger) and suggests that techniques like multiple rounds of distillation or curriculum distillation can help [arxiv.org]. Along similar lines, multi-teacher distillation has been explored, where an ensemble of teachers or multiple expert models teach a single student. Instead of one teacher, the student learns to match the outputs of several teachers (or of a teacher that is itself an ensemble), which can enrich the student's knowledge. For example, Yang et al. (2019) distilled an ensemble of QA models into one smaller QA model, achieving better accuracy than using any single teacher [arxiv.org]. Tsai et al. (2019) used a multilingual ensemble to train one compact multilingual student [arxiv.org]. Such multi-teacher approaches are computationally heavier (since they require multiple forward passes), but they approximate the benefit of model ensembling in a single distilled network.
- Distillation for Complex Abilities (Reasoning and Chain-of-Thought): As LLMs began to demonstrate advanced reasoning (with techniques like chain-of-thought prompting), a natural question arose: can these reasoning abilities be distilled into smaller models? This led to chain-of-thought (CoT) distillation research. The idea is to have the teacher provide not only final answers but also its reasoning steps (intermediate chains) as a teaching signal. One straightforward approach is to fine-tune a student to mimic the teacher's entire reasoning trace. However, recent work has identified challenges: (i) not all tokens in a reasoning chain are equally important (some key steps are critical, others can be skipped), and (ii) generating a long chain step by step can be hard for the student to learn all at once, akin to learning a complex task without a curriculum [arxiv.org]. To address this, Feng et al. (2023) proposed Keypoint-Based Progressive CoT Distillation (KPOD) [arxiv.org]. In KPOD, the student is first taught to reproduce only the key steps or "critical points" of the teacher's chain of thought, and gradually learns to fill in more steps. The authors introduce a weighting mechanism so that the student focuses on accurately imitating the most salient tokens of the rationale [arxiv.org] (a generic token-weighted objective of this kind is sketched after this list). They also use a progressive schedule: train the student to get the final answer and last few steps correct, then progressively include earlier steps [arxiv.org]. This mirrors a curriculum from easy to hard (final answers are easier once the teacher's last step is known; more steps are then added until the full chain is generated). Empirical results on math and commonsense reasoning benchmarks showed large gains from this method: the distilled student significantly outperformed prior distillation baselines, better capturing the teacher's reasoning ability [arxiv.org]. Other teams (e.g., Li et al., 2023) have distilled step-by-step reasoning by fine-tuning 7B–13B models on traces generated by GPT-4, yielding students that can perform multi-step arithmetic or logic at a level previously unattainable by models of that size. This is a rapidly evolving frontier: distilling reasoning and problem-solving skills (which go beyond matching single outputs) requires strategies like selective imitation, feedback mechanisms, or even distilling the policy of a multi-step solver. Nonetheless, early successes indicate that even complex emergent abilities of large models can be partially transplanted into smaller models via specialized distillation techniques.
- Distillation in Generative Model Training: Another innovation is adapting distillation to generative modeling and reinforcement learning settings. Traditional KD is often a one-way, offline transfer (teacher fixed, student learns). With LLMs, however, we see scenarios like distilling a dialogue policy from a model fine-tuned with Reinforcement Learning from Human Feedback (RLHF) into a new model that can be trained with supervised learning. For instance, OpenAI researchers have discussed distilling their RLHF-trained chat model into a regular language model to eliminate the need for RL at deployment. Algorithmically, there are tweaks like using reverse-KL distillation for language generation. The MiniLLM paper (Gu et al., 2023) pointed out that the standard (forward) KL divergence loss used in classification distillation can cause a student to overemphasize low-probability tokens in generation. The authors proposed minimizing the reverse KL instead, so that the student's distribution concentrates on the teacher's high-probability regions, which aligns better with how generation should behave [openreview.net]; the two divergence directions are contrasted in a short sketch after this list. Their method yielded students with lower exposure bias and better long-text generation performance compared to naive distillation [openreview.net]. This kind of objective tweaking matters for LLMs, where we care about distributional properties (like avoiding nonsense completions). Similarly, some works have combined distillation with reward models, effectively distilling not just the teacher's outputs but its preference rankings (e.g., training a model that imitates a teacher's preference-based decisions, as seen in some RLHF distillation research). Another line is data-free distillation, where the original training data is not used at all: synthetic inputs are generated (perhaps via a generator network, or randomly), labeled by the teacher, and used to train the student. This was demonstrated by Micaelli & Storkey (2019) for images and has clear potential for proprietary LLMs: if the original data behind GPT-4 is unavailable, one can query the model with crafted prompts to create a pseudo-dataset for the student. Overall, distillation algorithms have diversified, from better loss functions (contrastive, reverse KL, etc.) to clever training schedules (progressive, curriculum, multi-step) to new application scenarios (distilling dialogue policies, multilingual skills, and more). Each innovation aims to narrow the gap between student and teacher, making it possible to retain more of the teacher's strengths in a much smaller model.
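To make the response-based objective from the "Improved Loss Functions" item concrete, the following PyTorch sketch shows the classic Hinton-style soft-target loss: a temperature-scaled KL term on the teacher's soft outputs blended with ordinary cross-entropy on hard labels. It is a generic illustration rather than the exact loss of any paper cited above; the temperature and weighting values are arbitrary.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            labels: torch.Tensor,
            temperature: float = 2.0,
            alpha: float = 0.5) -> torch.Tensor:
    """Blend hard-label cross-entropy with temperature-scaled soft-target KL."""
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps soft-target gradients comparable to the hard-label term.
    soft_loss = F.kl_div(soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Toy example: batch of 4 examples, 10 classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
kd_loss(student_logits, teacher_logits, labels).backward()
```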
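The born-again / self-distillation scheme is mostly a training schedule built on that same soft-target loss: every generation uses the same architecture and learns from the previous generation's outputs. The toy sketch below illustrates the loop on synthetic data; it is a miniature of the procedure under simplifying assumptions, not the experimental setup of Furlanello et al.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model() -> nn.Module:
    # Identical architecture for every generation, the defining trait of born-again KD.
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 5))

def train(model, x, hard_y, teacher=None, epochs=200, T=2.0):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(epochs):
        logits = model(x)
        loss = F.cross_entropy(logits, hard_y)
        if teacher is not None:
            with torch.no_grad():
                soft = F.softmax(teacher(x) / T, dim=-1)
            # Extra supervision from the previous generation's soft outputs.
            loss = loss + F.kl_div(F.log_softmax(logits / T, dim=-1), soft,
                                   reduction="batchmean") * T ** 2
        opt.zero_grad(); loss.backward(); opt.step()
    return model

torch.manual_seed(0)
x, y = torch.randn(512, 20), torch.randint(0, 5, (512,))
gen0 = train(make_model(), x, y)                # generation 0: ordinary training
gen1 = train(make_model(), x, y, teacher=gen0)  # generation 1: born-again student
# Later generations would repeat the step with gen1 as the new teacher.
```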
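Progressive, teacher-assistant style distillation chains that same distillation step through models of decreasing capacity. The sketch below shows the chaining control flow with toy models; the "big" network stands in for an already-trained teacher, and the code illustrates the TAKD idea of bridging the capacity gap in stages rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_model(width: int) -> nn.Module:
    # Width stands in for capacity (e.g., 24-layer -> 6-layer -> 2-layer BERT).
    return nn.Sequential(nn.Linear(20, width), nn.ReLU(), nn.Linear(width, 5))

def distill_step(teacher: nn.Module, student: nn.Module,
                 x: torch.Tensor, epochs: int = 200, T: float = 2.0) -> nn.Module:
    """One hop: the student matches the frozen teacher's soft outputs."""
    opt = torch.optim.Adam(student.parameters(), lr=1e-2)
    for _ in range(epochs):
        with torch.no_grad():
            target = F.softmax(teacher(x) / T, dim=-1)
        loss = F.kl_div(F.log_softmax(student(x) / T, dim=-1), target,
                        reduction="batchmean") * T ** 2
        opt.zero_grad(); loss.backward(); opt.step()
    return student

torch.manual_seed(0)
x = torch.randn(512, 20)
widths = [256, 64, 16]               # big teacher -> assistant -> small student
teacher = build_model(widths[0])     # assume this one was already trained on the task
for w in widths[1:]:
    teacher = distill_step(teacher, build_model(w), x)  # each output teaches the next
small_student = teacher
```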
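For chain-of-thought distillation, the core mechanical ingredient is a token-weighted language-modeling loss over teacher-generated rationales, so that the student concentrates on the most important reasoning tokens. The sketch below is a generic weighted objective under the assumption that per-token importance weights are available (e.g., key steps weighted higher); it does not reproduce KPOD's actual keypoint scoring or progressive schedule.

```python
import torch
import torch.nn.functional as F

def weighted_cot_loss(student_logits: torch.Tensor,    # (batch, seq_len, vocab)
                      rationale_tokens: torch.Tensor,  # (batch, seq_len) teacher CoT tokens
                      token_weights: torch.Tensor      # (batch, seq_len) importance weights
                      ) -> torch.Tensor:
    """Per-token cross-entropy on the teacher's rationale, weighted by importance."""
    vocab = student_logits.size(-1)
    per_token = F.cross_entropy(student_logits.reshape(-1, vocab),
                                rationale_tokens.reshape(-1),
                                reduction="none").reshape(rationale_tokens.shape)
    return (per_token * token_weights).sum() / token_weights.sum()

# Toy shapes: 2 rationales of 8 tokens over a 100-token vocabulary.
logits = torch.randn(2, 8, 100, requires_grad=True)
tokens = torch.randint(0, 100, (2, 8))
weights = torch.ones(2, 8)
weights[:, -2:] = 2.0  # e.g., emphasize the final-answer tokens
weighted_cot_loss(logits, tokens, weights).backward()
```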
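Finally, the forward- vs. reverse-KL distinction raised in the generative-distillation item can be made concrete at the level of a single next-token distribution. The sketch below simply computes both divergences between a teacher and a student distribution; MiniLLM's actual training (optimizing the reverse KL over whole sequences) is more involved, so this only shows what each direction penalizes.

```python
import torch
import torch.nn.functional as F

def forward_kl(teacher_logits: torch.Tensor, student_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student): the student must cover every teacher mode, which can
    spread probability onto tokens the teacher itself rarely generates."""
    p = F.softmax(teacher_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(log_q, p, reduction="batchmean")

def reverse_kl(teacher_logits: torch.Tensor, student_logits: torch.Tensor) -> torch.Tensor:
    """KL(student || teacher): the student is penalized for placing mass where the
    teacher does not, keeping it inside the teacher's high-probability tokens."""
    q = F.softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")

# Next-token distributions over a toy 50-token vocabulary.
teacher = torch.randn(4, 50)
student = torch.randn(4, 50)
print(forward_kl(teacher, student).item(), reverse_kl(teacher, student).item())
```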
Conclusion: In the transformer era, model distillation has emerged as a cornerstone technique for maximizing AI models' performance-to-cost ratio. The literature from 2017 onwards overwhelmingly shows that distillation can produce smaller, faster models that preserve most of the teacher's accuracy and sometimes even improve certain aspects like generalization or robustness [arxiv.org]. This yields substantial cost savings in computation, enabling broader deployment of AI technologies [arxiv.org] [www.anthropic.com]. It is no surprise that virtually all major AI labs, from OpenAI and Google to Meta, Amazon, Alibaba, and startups like DeepSeek, have adopted knowledge distillation as part of their model development pipelines. Moreover, ongoing research continues to refine distillation algorithms, tackling limitations and extending the paradigm to new frontiers (such as distilling reasoning, or distilling knowledge without the original data). The synergy of these academic advances and industry applications means that knowledge distillation will likely remain a key ingredient in making the next generations of LLMs more efficient, accessible, and widely usable. As one survey succinctly put it, "knowledge distillation has emerged as an effective technique to enhance inference speed without greatly compromising accuracy" [arxiv.org], a benefit that is driving both the science and engineering of AI model compression forward.