On the Convergence and Performance Analysis of Adaptive
Kullback-Leibler Divergence in Knowledge Distillation for Large
Language Models
Joseph Zhong
May 8, 2025
Abstract
In this work, a unified theoretical proof and analysis of divergence-based losses for knowledge distillation in large language models is presented. Building on the previously proposed Adaptive KL (AKL) framework—a convex combination of forward and reverse Kullback–Leibler divergences, shown in earlier work to retain the same global convergence guarantees as its constituent losses—this study investigates its theoretical properties in greater depth. Novel convergence-rate bounds for forward KL (FKL) and reverse KL (RKL) are derived using tools from stochastic nonconvex optimization and Rademacher complexity. The analysis indicates that FKL converges more rapidly when student distributions overestimate teacher probabilities predominantly in high-mass (head) regions, whereas RKL’s rate is dictated by discrepancies in low-mass (long-tail) regions. Additionally, the Jensen–Shannon divergence (JSD), as an alternative to AKL, is incorporated to provide a symmetric, bounded option, and its convergence properties are examined under the same optimization regime. These findings clarify how different divergences emphasize distinct parts of the teacher–student gap and offer a theoretical rationale for the adaptive weighting strategy employed by AKL. The work concludes by suggesting that alternative gap measures might inspire even more effective divergence losses and outlines directions for further theoretical and empirical investigation.
Recent advances in Large Language Models (LLMs) have brought remarkable improvements
across a wide range of natural language tasks. Much of this progress can be attributed to the
growth of model size in terms of the number of parameters. For example, GPT-3 has about
175B parameters, while GPT-4 is speculated to have roughly 10x more parameters than its predecessor.
Such large LLMs are extremely resource-intensive, and their substantial computational demands
have driven active research into efficient model compression techniques such as Knowledge
Distillation (KD).
Knowledge Distillation (KD) involves training a smaller “student” model to mimic the behavior
of a larger “teacher” model. One of the key steps in KD is defining a loss function that captures
how well the student matches the teacher. Traditionally, this has been done using the
Kullback-Leibler (KL) divergence, which measures the difference between the probability
distributions output by the teacher and student models.
The paper [1] challenges the use of plain KL divergence and shows its limitations when applied to
LLMs: the loss can be overly sensitive to small differences in the output probabilities, leading
the student model to focus too much on matching inconsequential details rather than
the core semantics. To address this, the authors propose an alternative strategy that
better captures the meaningful similarities between the teacher and student models.
Building on this work, we aim to explore some theoretical properties of the Adaptive KL (AKL) divergence framework proposed in the paper.
In this section, we show the convergence of the proposed AKL objective for KD in LLMs,
using notation similar to the original paper.
We begin with the definitions of the forward and reverse KL divergences in the context of knowledge distillation for LLMs.
Definition 2.1: Forward KL Divergence (FKL)
The forward KL divergence loss function is
$$\mathcal{L}_{FKL}(\theta) = \mathrm{KL}(p \,\|\, q_\theta) = \sum_{j=1}^{V} p_j \log \frac{p_j}{q_j},$$
where $q_\theta$ is the student model that is being trained and $p$ is the ground truth from the teacher model. For the student network, $z = f(x;\theta)$ is the output of the LLM with parameter set $\theta$, and it is passed through a softmax function to produce the multi-class probability output $q$.
After removing the constant negative entropy part, the composition of the softmax function with the KL divergence reduces to the cross-entropy loss,
$$\mathcal{L}_{FKL}(\theta) = -\sum_{j=1}^{V} p_j \log q_j,$$
and its gradient with respect to the logits is
$$\frac{\partial \mathcal{L}_{FKL}}{\partial z_j} = q_j - p_j, \qquad \text{where } q_j = \frac{e^{z_j}}{\sum_{k} e^{z_k}}.$$
Definition 2.2: Reverse KL Divergence (RKL)
The reverse KL divergence loss function is
$$\mathcal{L}_{RKL}(\theta) = \mathrm{KL}(q_\theta \,\|\, p) = \sum_{j=1}^{V} q_j \log \frac{q_j}{p_j},$$
with $q = \mathrm{softmax}(z)$ as above. Its gradient with respect to the logits is
$$\frac{\partial \mathcal{L}_{RKL}}{\partial z_j} = q_j\left(\log \frac{q_j}{p_j} - \mathrm{KL}(q \,\|\, p)\right).$$
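To make these definitions concrete, the following is a minimal NumPy sketch (toy vocabulary and hypothetical teacher/student values, not taken from the paper) that evaluates FKL and RKL and checks the closed-form logit gradients above against finite differences.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fkl(p, q):   # forward KL: KL(p || q)
    return np.sum(p * np.log(p / q))

def rkl(p, q):   # reverse KL: KL(q || p)
    return np.sum(q * np.log(q / p))

# Toy teacher distribution p and student logits z (hypothetical values).
p = np.array([0.6, 0.25, 0.1, 0.04, 0.01])
z = np.array([1.2, 0.3, -0.5, -1.0, -2.0])
q = softmax(z)

# Closed-form gradients with respect to the student logits z.
grad_fkl = q - p                                   # from Definition 2.1
grad_rkl = q * (np.log(q / p) - rkl(p, q))         # from Definition 2.2

# Finite-difference check of both gradients.
def num_grad(loss, z, eps=1e-6):
    g = np.zeros_like(z)
    for j in range(len(z)):
        zp, zm = z.copy(), z.copy()
        zp[j] += eps
        zm[j] -= eps
        g[j] = (loss(p, softmax(zp)) - loss(p, softmax(zm))) / (2 * eps)
    return g

print(np.allclose(grad_fkl, num_grad(fkl, z), atol=1e-5))  # expect True
print(np.allclose(grad_rkl, num_grad(rkl, z), atol=1e-5))  # expect True
```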
Definition 2.3: Adaptive KL Divergence (AKL)
The adaptive KL divergence loss is a convex combination of FKL and RKL,
$$\mathcal{L}_{AKL}(\theta) = \lambda_1\,\mathrm{KL}(p \,\|\, q_\theta) + \lambda_2\,\mathrm{KL}(q_\theta \,\|\, p),$$
where we define the constants as the following:
$$\lambda_1 = \frac{G_{head}}{G_{head}+G_{tail}}, \quad \lambda_2 = \frac{G_{tail}}{G_{head}+G_{tail}}, \quad G_{head} = \sum_j m_j\,|p_j - q_j|, \quad G_{tail} = \sum_j (1-m_j)\,|p_j - q_j|.$$
In the above, $m_j$ is a mask which has value 1 when index $j$ belongs to the head part of the teacher distribution and 0 otherwise.
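The adaptive weighting can also be sketched numerically. In the snippet below, the head-selection rule (the smallest set of highest-probability teacher classes whose cumulative mass reaches a 0.5 threshold) is an illustrative assumption, as is the assignment of the head-gap ratio to the FKL weight; the exact rule used in [1] may differ.

```python
import numpy as np

def head_mask(p, threshold=0.5):
    """Mask m with m[j] = 1 for the head classes of the teacher distribution p.
    Hypothetical rule: the head is the smallest set of highest-probability
    classes whose cumulative teacher mass reaches `threshold`."""
    order = np.argsort(-p)
    cum = np.cumsum(p[order])
    k = np.searchsorted(cum, threshold) + 1   # number of head classes
    m = np.zeros_like(p)
    m[order[:k]] = 1.0
    return m

def akl_weights(p, q, threshold=0.5):
    m = head_mask(p, threshold)
    g_head = np.sum(m * np.abs(p - q))        # head gap
    g_tail = np.sum((1 - m) * np.abs(p - q))  # tail gap
    lam1 = g_head / (g_head + g_tail)         # weight on FKL (assumed assignment)
    lam2 = g_tail / (g_head + g_tail)         # weight on RKL
    return lam1, lam2

p = np.array([0.6, 0.25, 0.1, 0.04, 0.01])
q = np.array([0.4, 0.35, 0.15, 0.07, 0.03])
lam1, lam2 = akl_weights(p, q)
print(lam1, lam2)  # convex weights: lam1 + lam2 == 1
```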
Lemma 2.1
The sufficient and necessary condition for the convergence of the Adaptive KL divergence framework (i.e., for its gradient to vanish) is
$$q_j = p_j \quad \text{for all } j,$$
so AKL shares the same global optimum as FKL and RKL. The proof is given in Appendix A.
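As a quick numerical check of the lemma, the sketch below (hypothetical values; the adaptive weights are treated as fixed scalars, as in the proof in Appendix A) evaluates the AKL logit gradient at q = p and at a perturbed student distribution.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def akl_logit_grad(p, q, lam1, lam2):
    """Gradient of lam1*FKL + lam2*RKL with respect to the student logits,
    using the closed forms from Definitions 2.1 and 2.2."""
    rkl = np.sum(q * np.log(q / p))
    return lam1 * (q - p) + lam2 * q * (np.log(q / p) - rkl)

p = np.array([0.6, 0.25, 0.1, 0.04, 0.01])
lam1, lam2 = 0.5, 0.5   # any fixed convex weights

# At q = p the gradient vanishes ...
print(np.linalg.norm(akl_logit_grad(p, p.copy(), lam1, lam2)))   # ~0

# ... while at a perturbed q != p it does not.
q = softmax(np.log(p) + np.array([0.3, -0.2, 0.1, 0.0, -0.1]))
print(np.linalg.norm(akl_logit_grad(p, q, lam1, lam2)))          # > 0
```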
We first give the JS divergence by definition.
Definition 3.1: Jensen-Shannon Divergence
$$\mathrm{JSD}(p \,\|\, q) = \tfrac{1}{2}\,\mathrm{KL}(p \,\|\, m) + \tfrac{1}{2}\,\mathrm{KL}(q \,\|\, m), \qquad m = \tfrac{1}{2}(p + q).$$
Lemma 3.1: JS Divergence Derivative
The gradient of the JS divergence with respect to the student logits is
$$\frac{\partial\, \mathrm{JSD}}{\partial z_j} = \tfrac{1}{2}\, q_j\left(\log \frac{q_j}{m_j} - \mathrm{KL}(q \,\|\, m)\right),$$
which has the same form as the RKL gradient with $p$ replaced by the mixture $m$.
Then we prove its convergence globally.
Lemma 3.2: JS Divergence Convergence
The gradient of the JS divergence vanishes if and only if $q_j = p_j$ for all $j$.
Sufficiency. If $q = p$ then $m = q$, so every term $\log(q_j/m_j)$ and $\mathrm{KL}(q\,\|\,m)$ is zero and the gradient vanishes.
Necessity. Since $q = \mathrm{softmax}(z)$ has $q_j > 0$ for every $j$, a zero gradient requires $\log(q_j/m_j) = \mathrm{KL}(q\,\|\,m)$ to be the same constant for all $j$, so $q_j = c\, m_j$ for some constant $c$; summing over $j$ gives $c = 1$, hence $q_j = m_j = \tfrac{1}{2}(p_j + q_j)$, i.e., $q_j = p_j$.
So the JS divergence should work in a similar way to FKL, RKL, and AKL. In fact, since it has a derivative (Lemma 3.1) of the same form as that of RKL (Definition 2.2) under the LLM KD context, its performance can be analyzed in the same way as in Section 4.4.
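This similarity can be verified numerically. The following sketch (toy values) compares the closed form in Lemma 3.1 against a finite-difference gradient of the JSD.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(a, b):
    return np.sum(a * np.log(a / b))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.6, 0.25, 0.1, 0.04, 0.01])
z = np.array([0.8, 0.2, -0.4, -1.1, -1.8])
q = softmax(z)
m = 0.5 * (p + q)

# Closed-form JSD gradient w.r.t. logits (Lemma 3.1): the RKL gradient
# with the teacher p replaced by the mixture m, scaled by 1/2.
grad_jsd = 0.5 * q * (np.log(q / m) - kl(q, m))

# Finite-difference check.
eps = 1e-6
num = np.zeros_like(z)
for j in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[j] += eps
    zm[j] -= eps
    num[j] = (jsd(p, softmax(zp)) - jsd(p, softmax(zm))) / (2 * eps)

print(np.allclose(grad_jsd, num, atol=1e-5))  # expect True
```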
We use the convergence rate theorem [2–4] for the nonconvex stochastic gradient descent setting, which is applicable to our problem.
Theorem 4.1
After $T$ iterations of SGD with a fixed step size $\eta \le 1/L$, the expected gradient norm is bounded by
$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left\|\nabla f(\theta_t)\right\|^2 \;\le\; \frac{2\left(f(\theta_0) - f(\theta_T)\right)}{\eta T} + L\,\eta\,\sigma^2,$$
where $\sigma^2$ bounds the variance of the stochastic gradients and the gradient–Lipschitz constant $L$ is the smallest value such that
$$\left\|\nabla f(\theta) - \nabla f(\theta')\right\| \le L\left\|\theta - \theta'\right\| \quad \text{for all } \theta, \theta'.$$
In nonconvex optimization, convergence to a global minimum is not guaranteed; a standard goal is instead to show that the algorithm converges to a stationary point, i.e., a point where the gradient is close to zero. In the inequality of Theorem 4.1, the LHS is the average expected squared gradient norm over T steps of SGD, which is upper-bounded by the RHS. Given the limited number of training epochs in the LLM knowledge distillation problem, the lower the upper bound on the gradient norm, the faster the convergence rate may be.
Although other optimizers are usually used in LLM training, our focus is not the optimizer but comparing the convergence rates of KL divergence loss functions under different head and tail gap distributions. So we assume a fixed step size; similar results can be obtained for varying step sizes and other gradient-based optimizers. We also assume the variance of the sampled gradient is bounded by $\sigma^2$, which usually arises from the sampling process for a batch gradient update. $f(\theta_0)$ is the initial loss and $f(\theta_T)$ is the loss after $T$ steps, both of which we assume are fixed constants. So only $L$, the gradient–Lipschitz constant of the loss function, is a variable that has an impact on the convergence rate.
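As a small worked example (all constants below are hypothetical placeholders, not measured values), the sketch evaluates the right-hand side of Theorem 4.1 for two candidate Lipschitz constants, illustrating that a smaller L yields a lower bound and hence a potentially faster early-stage rate.

```python
# Hypothetical constants: initial/final loss gap, step size, noise level, steps.
f0_minus_fT = 2.0     # f(theta_0) - f(theta_T)
eta = 1e-3            # fixed step size (satisfies eta <= 1/L for both L below)
sigma2 = 0.5          # bound on the gradient noise variance
T = 10_000            # number of SGD steps

def sgd_bound(L):
    """Right-hand side of Theorem 4.1 for gradient-Lipschitz constant L."""
    return 2 * f0_minus_fT / (eta * T) + L * eta * sigma2

for L in (10.0, 100.0):
    print(L, sgd_bound(L))   # the smaller L gives the smaller bound
```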
We use the definitions of FKL and RKL under the LLM KD context from Definitions 2.1 and 2.2, and then give the upper bound of L.
Lemma 4.1
L is upper bounded in terms of M and a companion constant, both determined by the first- and second-order parameter derivative norms of the LLM, together with a divergence-dependent factor B.
From Lemma 4.1 we can see that M and the companion constant are bounded by the first- and second-order parameter derivative norms of the LLM, which do not depend on the choice of divergence and are therefore not the focus here. In this work, we focus on bounding B under different divergence losses and different head and tail distribution gaps in order to analyze their correlation. For FKL, as we can see in Lemma 4.1, the remaining constant is irrelevant to the head and tail gap. For RKL, it is a complicated but bounded small parameter, which we leave to possible future work.
Suppose the LLM's model class is $\mathcal{F}$. The empirical Rademacher complexity of the induced loss class in our problem can be bounded using the Rademacher contraction lemma,
$$\hat{\mathcal{R}}(\ell \circ \mathcal{F}) \;\le\; B \cdot \hat{\mathcal{R}}(\mathcal{F}),$$
where B is the Lipschitz constant of the divergence loss $\ell$ with respect to the model outputs.
4.3 On the Convergence Rate of FKL
The KL divergence as a loss function has various convergence rates under different distributions of p and q. Any distributional property could affect the convergence rate, so it may be difficult to change the head and tail gap of the two distributions without also changing other properties. Due to the limited resources for LLM training, it is usually unrealistic to run as many epochs as would be needed to reach optimal performance. Therefore, we mainly focus on the beginning stage of training and hypothesize conditions that are not too harsh, but under which it is fairer to compare performance differences largely brought about by the head and tail distribution gap.
Definition 4.1: Hypothesis Conditions
Suppose there is a constant ….
We then give the definitions of the head H and tail T of a probability distribution as follows.
Definition 4.2
The cutoff index between head and tail is …, which may vary in different scenarios. Each of the head and tail can be further divided into two groups, giving the two scenarios shown in Figure 1.
Figure 1: Two scenarios for head and tail gap
And we define the two types of gaps between p and q, the head gap and the tail gap, as
Definition 4.3
Lemma 4.2
In this work, we focus on this gap as the gap for FKL, which is also the one used in [1]. We can see that there are only two possible scenarios for the head and tail gap: the head gap is at least the tail gap, or the tail gap is at least the head gap. They can both be true (when the two gaps are equal) but never both false.
See Figure 1 for a better illustration.
Lemma 4.3: FKL's gap properties
Then we prove that B in Lemma 4.1 is bounded by the total gap and is not related to the head and tail gap.
Lemma 4.4: The upper bound of B
Finally, we show that B has a different upper bound when the difference between the head gap and the tail gap varies.
Lemma 4.5
By Lemmas 4.4 and 4.5, we prove that when the head gap is larger than the tail gap, B has a lower upper bound, so that FKL converges faster.
4.4 On the Convergence Rate of RKL
Similar to the last section, we need to analyze B under RKL for different head and tail distribution gaps. Unfortunately, RKL's gradient is much more complicated than FKL's, so we did not find a way to do this using the gap in Definition 4.3, which is from the AKL paper [1]. We need to start by finding a better way to describe the head and tail distribution gap between q and p for RKL.
Motivated by the fact that RKL is more susceptible to the probability mass of q, we give RKL's definition of the head H and tail T as follows.
Definition 4.4: RKL's head and tail
Suppose q is in non-increasing order.
Definition 4.5: RKL's gap
Lemma 4.6: RKL's gap properties
Lemma 4.6 can be interpreted as saying that RKL's head and tail gap difference is centered at a point closer to the tail, so the gap of RKL has a center of mass that lies closer to the tail part. Then we give the upper bound of B for RKL in terms of the total gap.
Lemma 4.7: The upper bound 1 of B
So B is actually bounded not only by the total gap, but also by the reverse KL divergence itself. By combining Lemma 4.7 with Lemma 4.6, we may obtain a bound that is more meaningful.
Lemma 4.8: The upper bound 2 of B
Lemma 4.8 tells us that only if the total gap gets smaller for the classes identified in the lemma, which account for the gap of the "long tail" part, does B have a lower upper bound, so that RKL may converge faster. That said, RKL's performance may be mostly bounded by the tail gap rather than by the difference between the head and tail gaps, which matches the common understanding that RKL is more vulnerable in regions where p has a low value.
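The tail sensitivity of RKL can be illustrated with toy numbers. In the sketch below (hypothetical distributions, not from the paper), two students deviate from the teacher by the same total gap, one overestimating a head class and one overestimating a tail class; RKL is several times larger in the tail case, consistent with the analysis above, while FKL values are printed only for reference.

```python
import numpy as np

def fkl(p, q):
    return np.sum(p * np.log(p / q))

def rkl(p, q):
    return np.sum(q * np.log(q / p))

# Toy teacher distribution; the first two classes form the "head".
p = np.array([0.5, 0.3, 0.1, 0.06, 0.03, 0.01])

# Two students with the same total gap sum(|p - q|) = 0.2:
# one overestimates a head class, one overestimates a tail class.
q_head = np.array([0.6, 0.2, 0.1, 0.06, 0.03, 0.01])
q_tail = np.array([0.4, 0.3, 0.1, 0.06, 0.03, 0.11])

for name, q in [("head overestimation", q_head), ("tail overestimation", q_tail)]:
    print(f"{name}: total gap = {np.abs(p - q).sum():.2f}, "
          f"FKL = {fkl(p, q):.3f}, RKL = {rkl(p, q):.3f}")
# RKL is roughly 6x larger in the tail-overestimation case (~0.17 vs ~0.03),
# illustrating its sensitivity to regions where p has low probability mass.
```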
4.5 Summary
Intuitively, a KL divergence loss function aims to minimize the gap between two distributions, so the total gap should be highly related to the complexity and performance. However, the total gap usually differs from the KL divergence in value. We first use the convergence rate theorem and the Rademacher contraction lemma to analyze this problem, and then give appropriate definitions of the gap for FKL and RKL respectively. Through a strict analysis of FKL, we prove that the total gap is indeed a key factor for its performance under different head and tail gap distributions. For RKL, we find that it works better when the long-tail gap is smaller, where the long tail consists of the tail part and also some of the head part. Although it is still difficult to compare the complexity of FKL and RKL for a stricter proof of AKL's performance, another interesting extension is whether the gap definitions in this work can be used as a loss function instead of FKL, RKL, or AKL.
5 Conclusion
In this work, we have advanced the theoretical understanding of knowledge distillation for large
language models by revisiting and extending the analysis of KL divergence loss functions. First, we
proved that the proposed Adaptive KL framework retains the same global convergence properties
as both forward and reverse KL. Then, by establishing the convergence guarantee of the
Jensen–Shannon divergence, we provided one more divergence loss option as an alternative
symmetric and bounded objective. Lastly, we derived explicit bounds for forward and reverse KL
through a convergence-rate analysis based on stochastic nonconvex optimization and Rademacher
complexity, revealing that FKL converges faster when the student's overestimation concentrates
more in the head region than in the tail, and that RKL's convergence may be predominantly
governed by the gap in the long tail. Together, these results may help clarify whether and why each divergence favors different regions of the discrepancy between the teacher and student distributions, partly justifying the adaptive convex-combination strategy of AKL. The performance analysis for RKL implies that it may be bounded mostly by the tail part, which does not necessarily support the idea that RKL performs better when the tail gap is larger than the head gap. Beyond validating AKL's theoretical soundness, we raised the question of whether any of the gap definitions in this work can help build a better divergence loss function. For future work, one may look for more theoretical insights and empirical evidence on these gaps.
References
[1] Taiqiang Wu, Chaofan Tao, Jiahao Wang, Runming Yang, Zhe Zhao, and Ngai Wong. Rethinking Kullback-Leibler divergence in knowledge distillation for large language models, 2024. URL https://arxiv.org/abs/2404.02657.
[2] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming, 2013. URL https://arxiv.org/abs/1309.5549.
[3] Guillaume Garrigos and Robert M. Gower. Handbook of convergence theorems for (stochastic) gradient methods, 2024. URL https://arxiv.org/abs/2301.11235.
[4] Liam Madden, Emiliano Dall'Anese, and Stephen Becker. High probability convergence bounds for non-convex stochastic gradient descent with sub-Weibull noise, 2024. URL https://arxiv.org/abs/2006.05610.
A Proof of Lemma 2.1
The convergence condition for AKL is that its gradient with respect to the student logits vanishes. We can consider the coefficients in the AKL formula as scalar weights $\lambda_1, \lambda_2 \ge 0$ with $\lambda_1 + \lambda_2 = 1$.
Sufficiency. Put $q = p$ in the gradient expression to evaluate its value: both the FKL term $q - p$ and the RKL term $q\,(\log(q/p) - \mathrm{KL}(q\,\|\,p))$ vanish, so the gradient is zero.
Necessity. We need to show that if the differential is zero, then it must hold that $q = p$. Let …, then … and subtracting (1) from (2) we have …. However, $\log(x)$ is a strictly concave function, so its secant slope cannot always be a constant unless …. Adding the above together for all $j$ completes the proof.
B Proof
For the first term, …. For the second term, …. Putting these together completes the proof.
C Proof
D Proof
If …, …. If …, …. So ….
Citation
If you find this analysis useful and want to cite it in your work, you can use the following BibTeX entry: @misc{zhong2025convergence,
title={On the Convergence and Performance Analysis of Adaptive Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models},
author={Joseph Zhong},
year={2025},
howpublished={\url{https://josephzhong.github.io/writings/kd/main.html}}
}