On the Convergence and Performance Analysis of Adaptive
Kullback-Leibler Divergence in Knowledge Distillation for Large
Language Models
Joseph Zhong
May 8, 2025
Abstract
In this work, a unified theoretical proof and analysis of divergence-based losses for knowledge distillation in large language models is presented. Building on the previously proposed Adaptive KL (AKL) framework—a convex combination of forward and reverse Kullback–Leibler divergences, shown in earlier work to retain the same global convergence guarantees as its constituent losses—this study investigates its theoretical properties in greater depth. Novel convergence-rate bounds for forward KL (FKL) and reverse KL (RKL) are derived using tools from stochastic nonconvex optimization and Rademacher complexity. The analysis indicates that FKL converges more rapidly when student distributions overestimate teacher probabilities predominantly in high-mass (head) regions, whereas RKL’s rate is dictated by discrepancies in low-mass (long-tail) regions. Additionally, the Jensen–Shannon divergence (JSD), as an alternative to AKL, is incorporated to provide a symmetric, bounded option, and its convergence properties are examined under the same optimization regime. These findings clarify how different divergences emphasize distinct parts of the teacher–student gap and offer a theoretical rationale for the adaptive weighting strategy employed by AKL. The work concludes by suggesting that alternative gap measures might inspire even more effective divergence losses and outlines directions for further theoretical and empirical investigation.
Recent advances in Large Language Models (LLMs) have brought remarkable improvements
across a wide range of natural language tasks. Much of this progress can be attributed to the
growth of model size in terms of the number of parameters. For example, GPT-3 has about
175B parameters, while GPT-4 is speculated to have roughly 10x more parameters than its predecessor.
Such large LLMs are extremely resource-intensive, and their substantial computational demands
have driven active research into efficient model compression techniques such as Knowledge
Distillation (KD).
Knowledge Distillation (KD) involves training a smaller “student” model to mimic the behavior
of a larger “teacher” model. One of the key steps in KD is defining a loss function that captures
how well the student matches the teacher. Traditionally, this has been done using the
Kullback-Leibler (KL) divergence, which measures the difference between the probability
distributions output by the teacher and student models.
The paper [1] challenges the use of plain KL divergence and shows its limitations when applied to
LLMs: the loss can be overly sensitive to small differences in the output probabilities, leading
the student model to focus too much on matching inconsequential details rather than
the core semantics. To address this, the authors propose an alternative strategy that
better captures the meaningful similarities between the teacher and student models.
Building on this work, we aim to explore some theoretical properties of the Adaptive KL (AKL) divergence framework proposed in the paper.
In this section, we show the convergence of the proposed AKL objective for KD in LLMs,
using notation similar to the original paper.
We begin with the definitions of the forward and reverse KL divergences in the context of knowledge distillation for LLMs.
Definition 2.1: Forward KL Divergence (FKL)
The forward KL divergence loss function is
$$\mathcal{L}_{FKL}(\theta) = \mathrm{KL}(p \,\|\, q_\theta) = \sum_{j=1}^{V} p_j \log \frac{p_j}{q_j},$$
where $q_\theta$ is the student model that is being trained and $p$ is the ground truth from the teacher model. For the student network, $z = f(x;\theta)$ is the output of the LLM with parameter set $\theta$, and it is passed through a softmax function to produce the multi-class probability output $q$.
After removing the constant negative entropy part, the composition of the softmax function with the KL divergence reduces to the cross-entropy loss,
$$\mathcal{L}_{FKL}(\theta) = -\sum_{j=1}^{V} p_j \log q_j,$$
and its gradient with respect to the logits is
$$\frac{\partial \mathcal{L}_{FKL}}{\partial z_j} = q_j - p_j, \qquad \text{where } q_j = \frac{e^{z_j}}{\sum_{k} e^{z_k}}.$$
Definition 2.2: Reverse KL Divergence (RKL)
The reverse KL divergence loss function is
$$\mathcal{L}_{RKL}(\theta) = \mathrm{KL}(q_\theta \,\|\, p) = \sum_{j=1}^{V} q_j \log \frac{q_j}{p_j},$$
with $q = \mathrm{softmax}(z)$ as above. Its gradient with respect to the logits is
$$\frac{\partial \mathcal{L}_{RKL}}{\partial z_j} = q_j\left(\log \frac{q_j}{p_j} - \mathrm{KL}(q \,\|\, p)\right).$$
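To make these definitions concrete, the following is a minimal NumPy sketch (toy vocabulary and hypothetical teacher/student values, not taken from the paper) that evaluates FKL and RKL and checks the closed-form logit gradients above against finite differences.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fkl(p, q):   # forward KL: KL(p || q)
    return np.sum(p * np.log(p / q))

def rkl(p, q):   # reverse KL: KL(q || p)
    return np.sum(q * np.log(q / p))

# Toy teacher distribution p and student logits z (hypothetical values).
p = np.array([0.6, 0.25, 0.1, 0.04, 0.01])
z = np.array([1.2, 0.3, -0.5, -1.0, -2.0])
q = softmax(z)

# Closed-form gradients with respect to the student logits z.
grad_fkl = q - p                                   # from Definition 2.1
grad_rkl = q * (np.log(q / p) - rkl(p, q))         # from Definition 2.2

# Finite-difference check of both gradients.
def num_grad(loss, z, eps=1e-6):
    g = np.zeros_like(z)
    for j in range(len(z)):
        zp, zm = z.copy(), z.copy()
        zp[j] += eps
        zm[j] -= eps
        g[j] = (loss(p, softmax(zp)) - loss(p, softmax(zm))) / (2 * eps)
    return g

print(np.allclose(grad_fkl, num_grad(fkl, z), atol=1e-5))  # expect True
print(np.allclose(grad_rkl, num_grad(rkl, z), atol=1e-5))  # expect True
```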
Definition 2.3: Adaptive KL Divergence (AKL)
The adaptive KL divergence loss is a convex combination of FKL and RKL,
$$\mathcal{L}_{AKL}(\theta) = \lambda_1\,\mathrm{KL}(p \,\|\, q_\theta) + \lambda_2\,\mathrm{KL}(q_\theta \,\|\, p),$$
where we define the constants as the following:
$$\lambda_1 = \frac{G_{head}}{G_{head}+G_{tail}}, \quad \lambda_2 = \frac{G_{tail}}{G_{head}+G_{tail}}, \quad G_{head} = \sum_j m_j\,|p_j - q_j|, \quad G_{tail} = \sum_j (1-m_j)\,|p_j - q_j|.$$
In the above, $m_j$ is a mask which has value 1 when index $j$ belongs to the head part of the teacher distribution and 0 otherwise.
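The adaptive weighting can also be sketched numerically. In the snippet below, the head-selection rule (the smallest set of highest-probability teacher classes whose cumulative mass reaches a 0.5 threshold) is an illustrative assumption, as is the assignment of the head-gap ratio to the FKL weight; the exact rule used in [1] may differ.

```python
import numpy as np

def head_mask(p, threshold=0.5):
    """Mask m with m[j] = 1 for the head classes of the teacher distribution p.
    Hypothetical rule: the head is the smallest set of highest-probability
    classes whose cumulative teacher mass reaches `threshold`."""
    order = np.argsort(-p)
    cum = np.cumsum(p[order])
    k = np.searchsorted(cum, threshold) + 1   # number of head classes
    m = np.zeros_like(p)
    m[order[:k]] = 1.0
    return m

def akl_weights(p, q, threshold=0.5):
    m = head_mask(p, threshold)
    g_head = np.sum(m * np.abs(p - q))        # head gap
    g_tail = np.sum((1 - m) * np.abs(p - q))  # tail gap
    lam1 = g_head / (g_head + g_tail)         # weight on FKL (assumed assignment)
    lam2 = g_tail / (g_head + g_tail)         # weight on RKL
    return lam1, lam2

p = np.array([0.6, 0.25, 0.1, 0.04, 0.01])
q = np.array([0.4, 0.35, 0.15, 0.07, 0.03])
lam1, lam2 = akl_weights(p, q)
print(lam1, lam2)  # convex weights: lam1 + lam2 == 1
```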
Lemma 2.1
The sufficient and necessary condition for the convergence of the Adaptive KL divergence framework (i.e., for its gradient to vanish) is
$$q_j = p_j \quad \text{for all } j,$$
so AKL shares the same global optimum as FKL and RKL. The proof is given in Appendix A.
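As a quick numerical check of the lemma, the sketch below (hypothetical values; the adaptive weights are treated as fixed scalars, as in the proof in Appendix A) evaluates the AKL logit gradient at q = p and at a perturbed student distribution.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def akl_logit_grad(p, q, lam1, lam2):
    """Gradient of lam1*FKL + lam2*RKL with respect to the student logits,
    using the closed forms from Definitions 2.1 and 2.2."""
    rkl = np.sum(q * np.log(q / p))
    return lam1 * (q - p) + lam2 * q * (np.log(q / p) - rkl)

p = np.array([0.6, 0.25, 0.1, 0.04, 0.01])
lam1, lam2 = 0.5, 0.5   # any fixed convex weights

# At q = p the gradient vanishes ...
print(np.linalg.norm(akl_logit_grad(p, p.copy(), lam1, lam2)))   # ~0

# ... while at a perturbed q != p it does not.
q = softmax(np.log(p) + np.array([0.3, -0.2, 0.1, 0.0, -0.1]))
print(np.linalg.norm(akl_logit_grad(p, q, lam1, lam2)))          # > 0
```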
We first give the JS divergence by definition.
Definition 3.1: Jensen-Shannon Divergence
$$\mathrm{JSD}(p \,\|\, q) = \tfrac{1}{2}\,\mathrm{KL}(p \,\|\, m) + \tfrac{1}{2}\,\mathrm{KL}(q \,\|\, m), \qquad m = \tfrac{1}{2}(p + q).$$
Lemma 3.1: JS Divergence Derivative
The gradient of the JS divergence with respect to the student logits is
$$\frac{\partial\, \mathrm{JSD}}{\partial z_j} = \tfrac{1}{2}\, q_j\left(\log \frac{q_j}{m_j} - \mathrm{KL}(q \,\|\, m)\right),$$
which has the same form as the RKL gradient with $p$ replaced by the mixture $m$.
Then we prove its convergence globally.
Lemma 3.2: JS Divergence Convergence
The gradient of the JS divergence vanishes if and only if $q_j = p_j$ for all $j$.
Sufficiency. If $q = p$ then $m = q$, so every term $\log(q_j/m_j)$ and $\mathrm{KL}(q\,\|\,m)$ is zero and the gradient vanishes.
Necessity. Since $q = \mathrm{softmax}(z)$ has $q_j > 0$ for every $j$, a zero gradient requires $\log(q_j/m_j) = \mathrm{KL}(q\,\|\,m)$ to be the same constant for all $j$, so $q_j = c\, m_j$ for some constant $c$; summing over $j$ gives $c = 1$, hence $q_j = m_j = \tfrac{1}{2}(p_j + q_j)$, i.e., $q_j = p_j$.
So the JS divergence should work in a similar way to FKL, RKL, and AKL. In fact, since it has a derivative (Lemma 3.1) of the same form as that of RKL (Definition 2.2) under the LLM KD context, its performance can be analyzed in the same way as in Section 4.4.
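This similarity can be verified numerically. The following sketch (toy values) compares the closed form in Lemma 3.1 against a finite-difference gradient of the JSD.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(a, b):
    return np.sum(a * np.log(a / b))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.6, 0.25, 0.1, 0.04, 0.01])
z = np.array([0.8, 0.2, -0.4, -1.1, -1.8])
q = softmax(z)
m = 0.5 * (p + q)

# Closed-form JSD gradient w.r.t. logits (Lemma 3.1): the RKL gradient
# with the teacher p replaced by the mixture m, scaled by 1/2.
grad_jsd = 0.5 * q * (np.log(q / m) - kl(q, m))

# Finite-difference check.
eps = 1e-6
num = np.zeros_like(z)
for j in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[j] += eps
    zm[j] -= eps
    num[j] = (jsd(p, softmax(zp)) - jsd(p, softmax(zm))) / (2 * eps)

print(np.allclose(grad_jsd, num, atol=1e-5))  # expect True
```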
We use the convergence rate theorem [2–4] for the nonconvex stochastic gradient descent setting, which is applicable to our problem.
Theorem 4.1
After $T$ iterations of SGD with a fixed step size $\eta \le 1/L$, the expected gradient norm is bounded by
$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left\|\nabla f(\theta_t)\right\|^2 \;\le\; \frac{2\left(f(\theta_0) - f(\theta_T)\right)}{\eta T} + L\,\eta\,\sigma^2,$$
where $\sigma^2$ bounds the variance of the stochastic gradients and the gradient–Lipschitz constant $L$ is the smallest value such that
$$\left\|\nabla f(\theta) - \nabla f(\theta')\right\| \le L\left\|\theta - \theta'\right\| \quad \text{for all } \theta, \theta'.$$
In nonconvex optimization, convergence to a global minimum is not guaranteed; a standard goal is instead to show that the algorithm converges to a stationary point, i.e., a point where the gradient is close to zero. In the inequality of Theorem 4.1, the LHS is the average expected squared gradient norm over T steps of SGD, which is upper-bounded by the RHS. Given the limited number of training epochs in the LLM knowledge distillation problem, the lower the upper bound on the gradient norm, the faster the convergence rate may be.
Although other optimizers are usually used in LLM training, our focus is not the optimizer but comparing the convergence rates of KL divergence loss functions under different head and tail gap distributions. So we assume a fixed step size; similar results can be obtained for varying step sizes and other gradient-based optimizers. We also assume the variance of the sampled gradient is bounded by $\sigma^2$, which usually arises from the sampling process for a batch gradient update. $f(\theta_0)$ is the initial loss and $f(\theta_T)$ is the loss after $T$ steps, both of which we assume are fixed constants. So only $L$, the gradient–Lipschitz constant of the loss function, is a variable that has an impact on the convergence rate.
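As a small worked example (all constants below are hypothetical placeholders, not measured values), the sketch evaluates the right-hand side of Theorem 4.1 for two candidate Lipschitz constants, illustrating that a smaller L yields a lower bound and hence a potentially faster early-stage rate.

```python
# Hypothetical constants: initial/final loss gap, step size, noise level, steps.
f0_minus_fT = 2.0     # f(theta_0) - f(theta_T)
eta = 1e-3            # fixed step size (satisfies eta <= 1/L for both L below)
sigma2 = 0.5          # bound on the gradient noise variance
T = 10_000            # number of SGD steps

def sgd_bound(L):
    """Right-hand side of Theorem 4.1 for gradient-Lipschitz constant L."""
    return 2 * f0_minus_fT / (eta * T) + L * eta * sigma2

for L in (10.0, 100.0):
    print(L, sgd_bound(L))   # the smaller L gives the smaller bound
```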
We use the definitions of FKL and RKL under the LLM KD context from Definitions 2.1 and 2.2, and then give the upper bound of L.
Lemma 4.1
L is upper bounded in terms of M and a companion constant, both determined by the first- and second-order parameter derivative norms of the LLM, together with a divergence-dependent factor B.
From Lemma 4.1 we can see that M and the companion constant are bounded by the first- and second-order parameter derivative norms of the LLM, which do not depend on the choice of divergence and are therefore not the focus here. In this work, we focus on bounding B under different divergence losses and different head and tail distribution gaps in order to analyze their correlation. For FKL, as we can see in Lemma 4.1, the remaining constant is irrelevant to the head and tail gap. For RKL, it is a complicated but bounded small parameter, which we leave to possible future work.
Suppose the LLM's model class is $\mathcal{F}$. The empirical Rademacher complexity of the induced loss class in our problem can be bounded using the Rademacher contraction lemma,
$$\hat{\mathcal{R}}(\ell \circ \mathcal{F}) \;\le\; B \cdot \hat{\mathcal{R}}(\mathcal{F}),$$
where B is the Lipschitz constant of the divergence loss $\ell$ with respect to the model outputs.
4.3 On the Convergence Rate of FKL
The KL divergence as a loss function has various convergence rates under different distributions of p and q. Any distributional property could affect the convergence rate, so it may be difficult to change the head and tail gap of the two distributions without also changing other properties. Due to the limited resources for LLM training, it is usually unrealistic to run as many epochs as would be needed to reach optimal performance. Therefore, we mainly focus on the beginning stage of training and hypothesize conditions that are not too harsh, but under which it is fairer to compare performance differences largely brought about by the head and tail distribution gap.
Definition 4.1: Hypothesis Conditions
Suppose there is a constant ….
We then give the definitions of the head H and tail T of a probability distribution as follows.
Definition 4.2
The cutoff index between head and tail is …, which may vary in different scenarios. Each of the head and tail can be further divided into two groups, giving the two scenarios shown in Figure 1.
Figure 1: Two scenarios for head and tail gap
And we define the two types of gaps between p and q, the head gap and the tail gap, as
Definition 4.3
Lemma 4.2
In this work, we focus on this gap as the gap for FKL, which is also the one used in [1]. We can see that there are only two possible scenarios for the head and tail gap: the head gap is at least the tail gap, or the tail gap is at least the head gap. They can both be true (when the two gaps are equal) but never both false.
See Figure 1 for a better illustration.
Lemma 4.3: FKL's gap properties
Then we prove that B in Lemma 4.1 is bounded by the total gap and is not related to the head and tail gap.
Lemma 4.4: The upper bound of B
Finally, we show that B has a different upper bound when the difference between the head gap and the tail gap varies.
Lemma 4.5
By Lemmas 4.4 and 4.5, we prove that when the head gap is larger than the tail gap, B has a lower upper bound, so that FKL converges faster.
4.4 On the Convergence Rate of RKL
Similar to the last section, we need to analyze B under RKL for different head and tail distribution gaps. Unfortunately, RKL's gradient is much more complicated than FKL's, so we did not find a way to do this using the gap in Definition 4.3, which is from the AKL paper [1]. We need to start by finding a better way to describe the head and tail distribution gap between q and p for RKL.
Motivated by the fact that RKL is more susceptible to the probability mass of q, we give RKL's definition of the head H and tail T as follows.
Definition 4.4: RKL's head and tail
Suppose q is in non-increasing order.
Definition 4.5: RKL's gap
Lemma 4.6: RKL's gap properties
Lemma 4.6 can be interpreted as saying that RKL's head and tail gap difference is centered at a point closer to the tail, so the gap of RKL has a center of mass that lies closer to the tail part. Then we give the upper bound of B for RKL in terms of the total gap.
Lemma 4.7: The upper bound 1 of B
So B is actually bounded not only by the total gap, but also by the reverse KL divergence itself. By combining Lemma 4.7 with Lemma 4.6, we may obtain a bound that is more meaningful.
Lemma 4.8: The upper bound 2 of B
Lemma 4.8 tells us that only if the total gap gets smaller for the classes identified in the lemma, which account for the gap of the "long tail" part, does B have a lower upper bound, so that RKL may converge faster. That said, RKL's performance may be mostly bounded by the tail gap rather than by the difference between the head and tail gaps, which matches the common understanding that RKL is more vulnerable in regions where p has a low value.
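The tail sensitivity of RKL can be illustrated with toy numbers. In the sketch below (hypothetical distributions, not from the paper), two students deviate from the teacher by the same total gap, one overestimating a head class and one overestimating a tail class; RKL is several times larger in the tail case, consistent with the analysis above, while FKL values are printed only for reference.

```python
import numpy as np

def fkl(p, q):
    return np.sum(p * np.log(p / q))

def rkl(p, q):
    return np.sum(q * np.log(q / p))

# Toy teacher distribution; the first two classes form the "head".
p = np.array([0.5, 0.3, 0.1, 0.06, 0.03, 0.01])

# Two students with the same total gap sum(|p - q|) = 0.2:
# one overestimates a head class, one overestimates a tail class.
q_head = np.array([0.6, 0.2, 0.1, 0.06, 0.03, 0.01])
q_tail = np.array([0.4, 0.3, 0.1, 0.06, 0.03, 0.11])

for name, q in [("head overestimation", q_head), ("tail overestimation", q_tail)]:
    print(f"{name}: total gap = {np.abs(p - q).sum():.2f}, "
          f"FKL = {fkl(p, q):.3f}, RKL = {rkl(p, q):.3f}")
# RKL is roughly 6x larger in the tail-overestimation case (~0.17 vs ~0.03),
# illustrating its sensitivity to regions where p has low probability mass.
```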
4.5 Summary
Intuitively, a KL divergence loss function aims to minimize the gap between two distributions, so the total gap should be highly related to the complexity and performance. However, the total gap usually differs from the KL divergence in value. We first use the convergence rate theorem and the Rademacher contraction lemma to analyze this problem, and then give appropriate definitions of the gap for FKL and RKL respectively. Through a strict analysis of FKL, we prove that the total gap is indeed a key factor for its performance under different head and tail gap distributions. For RKL, we find that it works better when the long-tail gap is smaller, where the long tail consists of the tail part and also some of the head part. Although it is still difficult to compare the complexity of FKL and RKL for a stricter proof of AKL's performance, another interesting extension is whether the gap definitions in this work can be used as a loss function instead of FKL, RKL, or AKL.
5 Conclusion
In this work, we have advanced the theoretical understanding of knowledge distillation for large
language models by revisiting and extending the analysis of KL divergence loss functions. First, we
proved that the proposed Adaptive KL framework retains the same global convergence properties
as both forward and reverse KL. Then, by establishing the convergence guarantee of the
Jensen–Shannon divergence, we provided one more divergence loss option as an alternative
symmetric and bounded objective. Lastly, we derived explicit bounds for forward and reverse KL
through a convergence-rate analysis based on stochastic nonconvex optimization and Rademacher
complexity, revealing that FKL converges faster when the student's overestimation concentrates
more in the head region than in the tail, and that RKL's convergence may be predominantly
governed by the gap in the long tail. Together, these results may help clarify whether and why each divergence favors different regions of the discrepancy between the teacher and student distributions, partly justifying the adaptive convex-combination strategy of AKL. The performance analysis for RKL implies that it may be bounded mostly by the tail part, which does not necessarily support the idea that RKL performs better when the tail gap is larger than the head gap. Beyond validating AKL's theoretical soundness, we raised the question of whether any of the gap definitions in this work can help build a better divergence loss function. For future work, one may look for more theoretical insights and empirical evidence on these gaps.
References
[1] Taiqiang Wu, Chaofan Tao, Jiahao Wang, Runming Yang, Zhe Zhao, and Ngai Wong. Rethinking Kullback-Leibler divergence in knowledge distillation for large language models, 2024. URL https://arxiv.org/abs/2404.02657.
[2] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming, 2013. URL https://arxiv.org/abs/1309.5549.
[3] Guillaume Garrigos and Robert M. Gower. Handbook of convergence theorems for (stochastic) gradient methods, 2024. URL https://arxiv.org/abs/2301.11235.
[4] Liam Madden, Emiliano Dall'Anese, and Stephen Becker. High probability convergence bounds for non-convex stochastic gradient descent with sub-Weibull noise, 2024. URL https://arxiv.org/abs/2006.05610.
A Proof of Lemma 2.1
The convergence condition for AKL is that its gradient with respect to the student logits vanishes. We can consider the coefficients in the AKL formula as scalar weights $\lambda_1, \lambda_2 \ge 0$ with $\lambda_1 + \lambda_2 = 1$.
Sufficiency. Put $q = p$ in the gradient expression to evaluate its value: both the FKL term $q - p$ and the RKL term $q\,(\log(q/p) - \mathrm{KL}(q\,\|\,p))$ vanish, so the gradient is zero.
Necessity. We need to show that if the differential is zero, then it must hold that $q = p$. Let …, then … and subtracting (1) from (2) we have …. However, $\log(x)$ is a strictly concave function, so its secant slope cannot always be a constant unless …. Adding the above together for all $j$ completes the proof.
B Proof
For the first term, …. For the second term, …. Putting these together completes the proof.
C Proof
D Proof
If …, …. If …, …. So ….
Citation
If you find this analysis useful and want to cite it in your work, you can use the following BibTeX entry: @misc{zhong2025convergence,
title={On the Convergence and Performance Analysis of Adaptive Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models},
author={Joseph Zhong},
year={2025},
howpublished={\url{https://josephzhong.github.io/writings/kd/main.html}}
}