KL Regularization Analysis
Two Level KL

For KL divergence in LLM reinforcement learning, let the policy model be $\pi_\theta$ and the reference model be $\pi_{ref}$. The KL divergence between the two models is defined as $$ D_{KL}(\pi_\theta\Vert\pi_{ref})=\mathbb E_{y\sim\pi_\theta}\log\frac{\pi_\theta(y)}{\pi_{ref}(y)}=\sum_{y\in\mathcal Y}\pi_\theta(y)\log\frac{\pi_\theta(y)}{\pi_{ref}(y)} $$ ...
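As a minimal sketch of the definition above (the toy distributions and function name are assumptions, not from the source), the exact sum form $\sum_{y}\pi_\theta(y)\log\frac{\pi_\theta(y)}{\pi_{ref}(y)}$ can be computed directly over a small discrete support:

```python
import numpy as np

def kl_divergence(pi_theta: np.ndarray, pi_ref: np.ndarray) -> float:
    """Exact KL divergence D_KL(pi_theta || pi_ref) over a discrete support,
    i.e. sum_y pi_theta(y) * log(pi_theta(y) / pi_ref(y))."""
    return float(np.sum(pi_theta * np.log(pi_theta / pi_ref)))

# Toy example (hypothetical numbers): a policy that has drifted
# away from a uniform reference distribution over 4 outcomes.
pi_ref = np.array([0.25, 0.25, 0.25, 0.25])
pi_theta = np.array([0.40, 0.30, 0.20, 0.10])

print(kl_divergence(pi_theta, pi_theta))  # KL of a distribution with itself is 0
print(kl_divergence(pi_theta, pi_ref))    # positive, since the two distributions differ
```

For a real LLM, $\mathcal Y$ (the set of full responses) is far too large to enumerate, which is why the expectation form $\mathbb E_{y\sim\pi_\theta}\log\frac{\pi_\theta(y)}{\pi_{ref}(y)}$ is estimated by Monte Carlo over sampled responses instead.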