Self-Distillation for LLMs

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

1Georgia Institute of Technology    2University of California, Los Angeles   
3Carnegie Mellon University    4William & Mary
*Equal contribution    Corresponding authors
Figure 0. Overview of the UniSD study: three axes of self-distillation mechanisms, evaluated across six benchmarks and three model families, culminating in the UniSD* pipeline.
Overview

Abstract

Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear.

We propose UniSD, a Unified framework to systematically study Self-Distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping.

Across six benchmarks and six models from three families, UniSD reveals when self-distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSD*, an integrated pipeline that combines complementary components and improves over the base model by 5.4 points and over the strongest baseline by 2.8 points, highlighting self-distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.

Motivation

Why Self-Distillation Is Hard

Adapting LLMs without stronger external teachers is desirable but unreliable. Three open challenges have so far prevented a coherent picture of how self-distillation should work.

01 · Open-Ended Generation

LLM outputs are free-form trajectories rather than fixed targets. Multiple valid answers exist; each prefix changes the conditioning state, making correctness assessment task-dependent.

02 · Unreliable Self-Supervision

On-policy trajectories expose the model to its own errors. The teacher signal evolves with the student, and transient mistakes or overconfident predictions can be reinforced over time.

03 · No Systematic Picture

Existing self-distillation methods study mechanisms in isolation. It is unclear which factors drive improvement, how they interact, and when each component is actually beneficial.

The Framework

UniSD: Three Complementary Axes

UniSD casts self-distillation as a reliability-aware self-correction process over on-policy trajectories. The student first attempts a completion, then learns through comparison and supervision across multiple teacher views — weighting reliable signals and consolidating the resulting knowledge.

AXIS 01 · Supervision Reliability
  • Multi-Teacher Agreement
  • Token-Level Contrastive Learning
AXIS 02 · Representation Alignment
  • Feature Matching
AXIS 03 · Training Stability
  • EMA Teacher Stabilization
  • Divergence Clipping
Figure 1. The UniSD framework. Reliability-aware self-correction integrates multi-teacher agreement, contrastive learning, feature matching, EMA stabilization, and divergence clipping into a single, modular pipeline.
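
To make this concrete, the sketch below shows one reliability-aware self-distillation step in PyTorch-style Python: an EMA teacher update, agreement weighting across multiple teacher views, and a clipped token-level KL. It is a minimal sketch under our own naming assumptions; ema_update, self_distill_step, kl_clip, and the variance-based agreement score are illustrative choices, not the exact formulation used in the paper.

import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # EMA teacher stabilization: the teacher slowly tracks the student's weights.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)

def self_distill_step(student, teachers, input_ids, kl_clip=5.0):
    # One reliability-aware step: agreement-weighted, clipped token-level KL.
    # `teachers` is a list of frozen teacher views (e.g., an EMA copy plus
    # dropout or prompt variants); a HuggingFace-style causal LM interface
    # (model(ids).logits -> (batch, seq, vocab)) is assumed.
    log_p_student = F.log_softmax(student(input_ids).logits, dim=-1)

    with torch.no_grad():
        teacher_log_p = [F.log_softmax(t(input_ids).logits, dim=-1) for t in teachers]
        # Multi-teacher agreement: down-weight tokens where the teacher views
        # disagree, measured here by the variance of their top-1 probabilities.
        top_p = torch.stack([lp.exp().max(dim=-1).values for lp in teacher_log_p])
        agreement = 1.0 - top_p.var(dim=0, unbiased=False)      # (batch, seq)
        # Simple fusion of the teacher views: average in log space.
        target_log_p = torch.stack(teacher_log_p).mean(dim=0)

    # Token-level KL(teacher || student), clipped to bound unreliable targets.
    kl = F.kl_div(log_p_student, target_log_p,
                  reduction="none", log_target=True).sum(dim=-1)
    kl = kl.clamp(max=kl_clip)                                   # divergence clipping
    return (agreement * kl).mean()

In practice the teacher views could be an EMA copy of the student plus stochastic variants of the same model; the agreement score simply rescales the per-token loss so that tokens on which the views disagree contribute less.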
Integrated Pipeline

UniSD*: Combining Complementary Components

UniSD* chains the complementary mechanisms identified in our analysis into a single training pipeline that achieves the strongest overall performance across all six benchmarks.

Multi-Teacher Agreement → Contrastive Learning → Feature Matching → EMA Teacher → Divergence Clipping → UniSD*
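
As a rough illustration of how the chained components can combine into one objective, the sketch below adds feature-matching and token-level contrastive terms on top of the distillation loss from the previous sketch. The weights lambda_feat and lambda_ctr, the MSE-based feature matching, and the InfoNCE-style contrastive form are placeholder assumptions rather than values or choices reported by the paper.

import torch
import torch.nn.functional as F

def unisd_star_loss(distill_loss, student_hidden, teacher_hidden,
                    anchor_emb, pos_emb, neg_emb,
                    lambda_feat=0.1, lambda_ctr=0.1, temperature=0.07):
    # Illustrative composition of UniSD* terms (weights are placeholders).
    #   distill_loss           : agreement-weighted, clipped KL from the step above
    #   student/teacher_hidden : (batch, seq, hidden) states for feature matching
    #   anchor/pos/neg_emb     : (n, hidden) token embeddings for contrastive learning

    # Feature matching: align student hidden states with the (EMA) teacher's.
    feat_loss = F.mse_loss(student_hidden, teacher_hidden.detach())

    # Token-level contrastive learning (InfoNCE-style): pull each anchor token
    # toward its positive view, push it away from negative tokens.
    anchor = F.normalize(anchor_emb, dim=-1)
    pos = F.normalize(pos_emb, dim=-1)
    neg = F.normalize(neg_emb, dim=-1)
    pos_sim = (anchor * pos).sum(dim=-1, keepdim=True) / temperature  # (n, 1)
    neg_sim = anchor @ neg.t() / temperature                          # (n, m)
    logits = torch.cat([pos_sim, neg_sim], dim=-1)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    ctr_loss = F.cross_entropy(logits, labels)

    return distill_loss + lambda_feat * feat_loss + lambda_ctr * ctr_loss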
Highlights

Key Contributions

Unified Framework

The first extensible framework that organizes self-distillation in autoregressive LLMs along three axes — supervision reliability, representation alignment, and training stability.

Systematic Study

Extensive evaluation across six benchmarks and six models from three families reveals which components drive gains and how they interact across robustness, transfer, and retention.

UniSD* Pipeline

Guided by the analysis, UniSD* integrates complementary components to achieve the strongest overall performance, outperforming the base model by 5.4 points and the strongest baseline by 2.8 points.

Headline Results

UniSD* Sets a New Bar

Average accuracy across six benchmarks (Qwen2.5-7B-Instruct base). UniSD* outperforms strong distillation baselines while preserving the base model's distribution.

  • +5.4 improvement over the base model (raw Qwen2.5-7B): 67.9 → 73.3 overall
  • +2.8 over the strongest baseline, GKD (best prior method): 70.5 → 73.3 overall
  • 6 × 3 benchmarks × model families covered in our study: Qwen · Llama · Gemma
Generalization

Consistent Across Model Families

UniSD* transfers cleanly between model families: every backbone we tested gains overall accuracy without family-specific tuning.

Figure 2. UniSD* improves the average benchmark score across three model families.
  • Qwen2.5-7B: +5.4 (67.9 → 73.3 overall)
  • Llama-3.1-8B: +3.1 (strong gains on ID and OOD)
  • Gemma-3-4B: +2.2 (improvement without overfitting)
Component Analysis

What Drives the Gains?

Each axis contributes in distinct ways. Multi-teacher agreement and EMA stabilization deliver the largest individual jumps, while contrastive learning is the most uniformly beneficial. Divergence clipping is the cheapest, and feature matching shines when combined with output-level alignment.

Figure 3. Average effect of each UniSD component versus standard SFT.
Figure 4. Accuracy vs. training-time tradeoff across UniSD variants.
Distribution Retention

Improving Without Forgetting

UniSD* raises task accuracy while staying close to the base model's behavior. It achieves lower JSD than SFT on 70.3% of examples and higher base-model log-probability on 60.6% — a clear improvement-with-retention profile.
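
For reference, these retention metrics can be sketched as token-level Jensen-Shannon divergence between the adapted and base distributions, plus the base model's log-probability of the adapted model's tokens. The helper below is an illustrative approximation, not the paper's evaluation code.

import torch
import torch.nn.functional as F

def retention_metrics(adapted_logits, base_logits):
    # adapted_logits, base_logits: (seq, vocab) logits scored on the same tokens.
    p = F.softmax(adapted_logits, dim=-1)
    q = F.softmax(base_logits, dim=-1)
    m = 0.5 * (p + q)
    # JSD = 0.5 * KL(p || m) + 0.5 * KL(q || m), computed per token.
    kl_pm = (p * (p.clamp_min(1e-12).log() - m.clamp_min(1e-12).log())).sum(-1)
    kl_qm = (q * (q.clamp_min(1e-12).log() - m.clamp_min(1e-12).log())).sum(-1)
    jsd = 0.5 * (kl_pm + kl_qm)                               # (seq,)

    # Base-model log-probability of the adapted model's (greedy) tokens, used
    # here as a simple proxy for how well the base distribution is retained.
    tokens = adapted_logits.argmax(dim=-1)                    # (seq,)
    base_lp = F.log_softmax(base_logits, dim=-1).gather(
        -1, tokens.unsqueeze(-1)).squeeze(-1)

    return jsd.mean().item(), base_lp.mean().item()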

Figure 5. Token-level agreement — per-benchmark profile for Qwen2.5-7B.
Figure 6. Sequence-level agreement — smoother, more stable profile.
Figure 7. Distribution of base-scored perplexity and token-level Jensen-Shannon divergence (JSD). UniSD* preserves the base distribution far better than SFT, while still improving accuracy.
Cite

BibTeX

@article{jin2026unisd,
  title={UniSD: Towards a Unified Self-Distillation Framework for Large Language Models},
  author={Jin, Yiqiao and Wang, Yiyang and Fu, Lucheng and Xiao, Yijia and Luo, Yinyi and Liu, Haoxin and Prakash, B Aditya and Hester, Josiah and Wang, Jindong and Kumar, Srijan},
  journal={arXiv preprint arXiv:2605.06597},
  year={2026}
}