LLM Notes

LLMs

2.1 rope

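Positional encoding: the attention mechanism in a Transformer is itself insensitive to token order, so positional information has to be injected.

An ideal positional encoding should satisfy:

Each position has a unique representation.
Relative position is observable: the attention score of the $m$-th token against the $n$-th token should depend only on the relative distance $m-n$, not on the absolute positions.
It extrapolates to longer sequences (lengths never seen during training).

Traditional absolute positional encoding (Sinusoidal PE) adds the position information directly to the embedding, so it does not naturally satisfy the "relative position" property.

The attention score is computed as $q_m^Tk_n$. The idea of RoPE is to construct a function $f$ such that:

$$ \langle f(q,m), f(k,n)\rangle = g(q,k,m-n) $$

That is, the inner product depends only on the relative position $m-n$ and not on the absolute positions.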

Derivation from the 2D case

For a vector $q=[q_0,q_1]$, treat it as a complex number:

$$ q\leftrightarrow q_0+iq_1 $$

Define the encoding function as a "rotation":

$$ f(q,m)=q\cdot e^{im\theta}=(q_0+iq_1)\cdot(\cos m\theta+i\sin m\theta) $$

$$ f(q,m)=\begin{pmatrix}q_0^\prime \\ q_1^\prime\end{pmatrix}=\begin{pmatrix}\cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta\end{pmatrix}\begin{pmatrix}q_0 \\ q_1\end{pmatrix}=R_m q $$

This rotates the vector by the angle $m\theta$ in the complex plane, which defines the function $f$.

Verifying the inner-product property:

$$ \langle f(q,m),f(k,n)\rangle=\text{Re}[(qe^{im\theta})\cdot\overline{(ke^{in\theta})}]=\text{Re}[q\overline{k}\cdot e^{i(m-n)\theta}] $$

$$ \begin{align*} \langle f(q,m),f(k,n)\rangle&=\begin{pmatrix}q_0^\prime \\ q_1^\prime\end{pmatrix}^T\begin{pmatrix}k_0^\prime \\ k_1^\prime\end{pmatrix} \\ &=\begin{pmatrix}q_0 \\ q_1\end{pmatrix}^T \begin{pmatrix}\cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta\end{pmatrix}^T\begin{pmatrix}\cos n\theta & -\sin n\theta \\ \sin n\theta & \cos n\theta\end{pmatrix}\begin{pmatrix}k_0 \\ k_1\end{pmatrix} \\ &=\begin{pmatrix}q_0 \\ q_1\end{pmatrix}^T\begin{pmatrix}\cos((n-m)\theta) & -\sin((n-m)\theta) \\ \sin((n-m)\theta) & \cos((n-m)\theta)\end{pmatrix}\begin{pmatrix}k_0 \\ k_1\end{pmatrix} \end{align*} $$

Because $R_m^TR_n=R_{n-m}$, the result depends only on $m-n$.
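
The 2D property can be checked numerically. The following is a minimal sketch using only Python's standard library; the helper names `rotate_2d` and `dot`, the angle `theta = 0.3`, and the sample vectors are illustrative assumptions, not part of the derivation above.

```python
import math

def rotate_2d(vec, pos, theta):
    """Rotate the 2D vector [x0, x1] by the angle pos * theta (the map f above)."""
    x0, x1 = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return [x0 * c - x1 * s, x0 * s + x1 * c]

def dot(a, b):
    """Plain inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

theta = 0.3                       # arbitrary illustrative angle
q, k = [1.0, 2.0], [0.5, -1.5]    # arbitrary illustrative vectors

# Two (m, n) pairs with the same relative distance m - n = 3.
score_a = dot(rotate_2d(q, 5, theta), rotate_2d(k, 2, theta))
score_b = dot(rotate_2d(q, 13, theta), rotate_2d(k, 10, theta))

# One pair with a different relative distance m - n = 1.
score_c = dot(rotate_2d(q, 5, theta), rotate_2d(k, 4, theta))

print("same distance, same score:", abs(score_a - score_b) < 1e-9)
print("different distance, different score:", abs(score_a - score_c) > 1e-3)
```

Shifting both positions by the same offset leaves the score unchanged, while changing the distance between them changes it.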

Extension to higher dimensions (what is used in practice)

The RoPE rotation matrix is a block-diagonal matrix:

$$ R_m = \begin{bmatrix}R_{m\theta_1} & 0 & \cdots & 0 \\ 0 & R_{m\theta_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & R_{m\theta_{d/2}}\end{bmatrix} $$

where each block is:

$$ R_{m\theta_i}=\begin{bmatrix}\cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i)\end{bmatrix} $$

Therefore:

$$ \begin{align*} f(q,m)&= R_m q = \begin{bmatrix}R_{m\theta_1} & 0 & \cdots & 0 \\ 0 & R_{m\theta_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & R_{m\theta_{d/2}}\end{bmatrix} \begin{bmatrix}q_0\\q_1\\\vdots\\ q_{d-1}\end{bmatrix} \\ &= \begin{bmatrix}\cos(m\theta_1) & -\sin(m\theta_1) & 0 & 0 & \cdots & 0 & 0 \\ \sin(m\theta_1) & \cos(m\theta_1) & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos(m\theta_2) & -\sin(m\theta_2) & \cdots & 0 & 0 \\ 0 & 0 & \sin(m\theta_2) & \cos(m\theta_2) & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & 0 & 0 & \cdots & \cos(m\theta_{d/2}) & -\sin(m\theta_{d/2}) \\ 0 & 0 & 0 & 0 & \cdots & \sin(m\theta_{d/2}) & \cos(m\theta_{d/2})\end{bmatrix}\begin{bmatrix}q_0\\q_1\\q_2\\q_3\\\vdots\\ q_{d-2}\\ q_{d-1}\end{bmatrix} \\ &=\begin{bmatrix}q_0\cos(m\theta_1)-q_1\sin(m\theta_1) \\ q_0\sin(m\theta_1)+q_1\cos(m\theta_1) \\ q_2\cos(m\theta_2)-q_3\sin(m\theta_2) \\ q_2\sin(m\theta_2)+q_3\cos(m\theta_2) \\ \vdots \\ q_{d-2}\cos(m\theta_{d/2})-q_{d-1}\sin(m\theta_{d/2}) \\ q_{d-2}\sin(m\theta_{d/2})+q_{d-1}\cos(m\theta_{d/2})\end{bmatrix} = \begin{bmatrix}q_0\cos(m\theta_1)-q_1\sin(m\theta_1) \\ q_1\cos(m\theta_1)+q_0\sin(m\theta_1) \\ q_2\cos(m\theta_2)-q_3\sin(m\theta_2) \\ q_3\cos(m\theta_2)+q_2\sin(m\theta_2) \\ \vdots \\ q_{d-2}\cos(m\theta_{d/2})-q_{d-1}\sin(m\theta_{d/2}) \\ q_{d-1}\cos(m\theta_{d/2})+q_{d-2}\sin(m\theta_{d/2})\end{bmatrix} \\ &=\begin{bmatrix}q_0\cos(m\theta_1) \\ q_1\cos(m\theta_1) \\ q_2\cos(m\theta_2) \\ q_3\cos(m\theta_2) \\ \vdots \\ q_{d-2}\cos(m\theta_{d/2}) \\ q_{d-1}\cos(m\theta_{d/2})\end{bmatrix} + \begin{bmatrix}-q_1\sin(m\theta_1) \\ q_0\sin(m\theta_1) \\ -q_3\sin(m\theta_2) \\ q_2\sin(m\theta_2) \\ \vdots \\ -q_{d-1}\sin(m\theta_{d/2}) \\ q_{d-2}\sin(m\theta_{d/2})\end{bmatrix} \end{align*} $$

Applying the 2D argument to each block, it follows that $\langle f(q,m),f(k,n)\rangle=q^TR_{n-m}k$.

Here $\theta_i$ is defined as follows, for $i=1,2,\cdots,d/2$:

$$ \theta_i = \frac{1}{10000^{\frac{i-1}{d/2}}} = 10000^{-\frac{i-1}{d/2}} $$

$$ \begin{align*} \mathbf{\theta} &= \begin{bmatrix}10000^{-\frac{0}{d/2}} & 10000^{-\frac{1}{d/2}} & \cdots & 10000^{-\frac{(d/2)-1}{d/2}}\end{bmatrix} \\ &=\begin{bmatrix}10000^{-\frac{0}{d}} & 10000^{-\frac{2}{d}} & \cdots & 10000^{-\frac{d-2}{d}}\end{bmatrix} \end{align*} $$
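
As a worked example of the frequency formula and the block-diagonal rotation, here is a small pure-Python sketch. The function names `rope_thetas` and `rope`, the choice `d = 8`, and the sample vectors are assumptions made for illustration only.

```python
import math

def rope_thetas(d, base=10000.0):
    """theta_i = base^(-(i-1)/(d/2)) for i = 1..d/2 (0-based loop index here)."""
    half = d // 2
    return [base ** (-(i / half)) for i in range(half)]

def rope(vec, pos, thetas):
    """Rotate each consecutive pair (x_{2i}, x_{2i+1}) by the angle pos * theta_i."""
    out = []
    for i, theta in enumerate(thetas):
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

thetas = rope_thetas(8)           # d = 8 gives 4 frequencies
print(thetas)                     # approximately 1, 0.1, 0.01, 0.001

q = [0.1 * (j + 1) for j in range(8)]
k = [0.2 * (j - 3) for j in range(8)]

# Same relative distance m - n = 4 at two very different absolute positions.
score_a = dot(rope(q, 7, thetas), rope(k, 3, thetas))
score_b = dot(rope(q, 104, thetas), rope(k, 100, thetas))
print(abs(score_a - score_b) < 1e-9)
```

For $d=8$ the exponents are $0, -1/4, -2/4, -3/4$, so the frequencies fall by a factor of 10 per pair: the first pairs rotate quickly with position and the last pairs rotate slowly.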

import torch
from typing import Tuple

def rotate_half(x: torch.Tensor) -> torch.Tensor:
  """
  [x0, x1, x2, x3, ...] -> [-x1, x0, -x3, x2, ...]
  """
  x_even = x[..., 0::2] # [..., head_dim / 2]
  x_odd = x[..., 1::2] # [..., head_dim / 2]

  # Interleave the (-x_odd, x_even) pairs back into the last dimension
  x_rot = torch.stack((-x_odd, x_even), dim=-1).flatten(-2)

  return x_rot

def apply_rope(q: torch.Tensor, k: torch.Tensor, base: float = 10000.0) -> Tuple[torch.Tensor, torch.Tensor]:
  """
  Apply RoPE positional encoding to query and key.
  Inputs:
    q: Tensor, shape = [bs, seq_len, num_heads, head_dim]
    k: Tensor, shape = [bs, seq_len, num_heads, head_dim]
  Outputs:
    q_rope: Tensor, shape = [bs, seq_len, num_heads, head_dim]
    k_rope: Tensor, shape = [bs, seq_len, num_heads, head_dim]
  Constraints:
    1. q and k have the same shape
    2. head_dim must be even
    3. RoPE acts only on the last dimension, head_dim
  """
  bs, seq_len, num_heads, head_dim = q.shape
  assert head_dim % 2 == 0

  device = q.device
  dtype = q.dtype

  # Frequency for each 2D pair
  # shape: [head_dim / 2]
  # inv_freq is [theta_1, theta_2, ..., theta_{d/2}], with theta_i = base^(-2(i-1)/d)
  # Note the parentheses: the division by head_dim belongs inside the exponent.
  inv_freq = base ** (-torch.arange(0, head_dim, 2, device=device).float() / head_dim)

  # Position indices
  # shape: [seq_len]
  position_ids = torch.arange(seq_len, device=device).float()

  # Rotation angle for every position and every frequency
  # position_ids: [seq_len]
  # inv_freq: [head_dim / 2]
  # freqs: [seq_len, head_dim / 2]
  # Row m of freqs holds m*theta_1, m*theta_2, ..., m*theta_{d/2}
  # for m = 0, 1, ..., seq_len-1
  position_ids = position_ids.unsqueeze(-1) # [seq_len, 1]
  inv_freq = inv_freq.unsqueeze(0) # [1, head_dim / 2]
  freqs = position_ids * inv_freq # [seq_len, head_dim / 2]

  # Each frequency covers the two dimensions of its pair, so duplicate it
  # shape: [seq_len, head_dim]
  #
  # freqs = tensor([
  #     [a, b, c],
  #     [d, e, f],
  # ])
  #
  # ->
  #
  # tensor([
  #     [a, a, b, b, c, c],
  #     [d, d, e, e, f, f],
  # ])
  #
  # freqs:
  # m*theta_1, m*theta_1, m*theta_2, m*theta_2, ..., m*theta_{d/2}, m*theta_{d/2}
  freqs = torch.repeat_interleave(freqs, repeats=2, dim=-1)

  # Build cos / sin and broadcast them to the shape of q/k
  # original: [seq_len, head_dim]
  # target:   [1, seq_len, 1, head_dim]
  cos = freqs.cos()[None, :, None, :].to(dtype)
  sin = freqs.sin()[None, :, None, :].to(dtype)

  # Apply RoPE
  # 2D rotation formula:
  # [x0', x1'] = [x0 * cos - x1 * sin, x0 * sin + x1 * cos]
  #
  # q = [q_0, q_1, q_2, q_3, ..., q_{d-2}, q_{d-1}]
  # rotate_half(q) = [-q_1, q_0, -q_3, q_2, ..., -q_{d-1}, q_{d-2}]
  # cos = [cos(m*theta_1), cos(m*theta_1), cos(m*theta_2), cos(m*theta_2), ..., cos(m*theta_{d/2}), cos(m*theta_{d/2})]
  # sin = [sin(m*theta_1), sin(m*theta_1), sin(m*theta_2), sin(m*theta_2), ..., sin(m*theta_{d/2}), sin(m*theta_{d/2})]
  #
  # q_rope = [q_0*cos(m*theta_1) - q_1*sin(m*theta_1), ...]
  q_rope = q * cos + rotate_half(q) * sin
  k_rope = k * cos + rotate_half(k) * sin

  return q_rope, k_rope

2.2 mhsa

Let the input be:

$$ X\in\mathbb R^{B\times T\times d} $$

where $B=\text{batch size}, T=\text{seq len}, d=d_{model}$.

Let the number of heads be $h$; each head has dimension $d_h=\frac{d}{h}$.

Linear projections to obtain Q, K, V

$$ \begin{aligned} Q &= XW_Q \in \mathbb R^{B\times T\times d}, \\ K &= XW_K \in \mathbb R^{B\times T\times d}, \\ V &= XW_V \in \mathbb R^{B\times T\times d}. \end{aligned} $$

where $W_Q,W_K,W_V\in\mathbb R^{d\times d}$.

Split into heads

Split the last dimension into $h$ heads:

$$ Q, K, V\in\mathbb R^{B\times T\times h\times d_h} $$

After a transpose:

$$ Q, K, V\in\mathbb R^{B\times h\times T\times d_h} $$

Compute the attention scores

For each batch element and each head, compute:

$$ S=\frac{QK^T}{\sqrt{d_h}} $$

where $K^T$ transposes the last two dimensions:

$$ K^T\in\mathbb R^{B\times h\times d_h\times T} $$

Therefore $S\in\mathbb R^{B\times h\times T\times T}$.

Softmax to obtain the attention weights

Apply softmax over the last dimension:

$$ A=\text{softmax}(S, dim=-1)\in\mathbb R^{B\times h\times T\times T} $$

Weighted sum of $V$

$$ \begin{aligned} O_{\text{head}} &= AV, \\ A &\in \mathbb R^{B\times h\times T\times T}, \\ V &\in \mathbb R^{B\times h\times T\times d_h}. \end{aligned} $$

So $O_{\text{head}}\in\mathbb R^{B\times h\times T\times d_h}$.

Merge the heads and apply the output projection

First transpose $O_{\text{head}}$ and merge its last two dimensions:

$$ \begin{aligned} O_{\text{concat}} &\in \mathbb R^{B\times T\times (h d_h)} = \mathbb R^{B\times T\times d}, \\ O &= O_{\text{concat}}W_O \in \mathbb R^{B\times T\times d}. \end{aligned} $$

where $W_O\in\mathbb R^{d\times d}$.
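
To make the score, softmax, and weighted-sum steps concrete for one batch element and one head, here is a minimal pure-Python sketch on nested lists. The names `softmax` and `attention` and the toy values of `Q`, `K`, `V` are illustrative assumptions; the batching, head split, and projections described above are deliberately left out.

```python
import math

def softmax(row):
    """Softmax over one row; subtracting the max keeps exp() numerically stable."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention. Q, K, V are lists of shape [T][d_h]."""
    d_h = len(Q[0])
    # scores[i][j] = <Q[i], K[j]> / sqrt(d_h), shape [T][T]
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d_h) for k in K] for q in Q]
    # Softmax over the last dimension: every row of weights sums to 1.
    weights = [softmax(row) for row in scores]
    # out[i] = sum_j weights[i][j] * V[j], shape [T][d_h]
    out = [[sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))] for row in weights]
    return weights, out

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # T = 3, d_h = 2
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

weights, out = attention(Q, K, V)
print(len(weights), len(weights[0]))   # T x T
print(len(out), len(out[0]))           # T x d_h
```

Each output row is a convex combination of the rows of `V`, which is why every output entry stays within the range of the corresponding column of `V`.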

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
  """
  Input:
    x: [B, T, d]
  Output:
    out: [B, T, d]
  where:
    B = batch size
    T = seq_len
    d = d_model
    h = num_heads
    d_h = d // h
  """
  def __init__(self, d_model: int, num_heads: int):
    super().__init__()
    assert d_model % num_heads == 0
    self.d_model = d_model
    self.num_heads = num_heads
    self.d_h = d_model // num_heads

    # Produce Q, K, V with a single projection
    self.qkv_proj = nn.Linear(d_model, 3 * d_model)

    # Output projection O
    self.out_proj = nn.Linear(d_model, d_model)

  def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
    B, T, d = x.shape

    # x: [B, T, d]
    # qkv: [B, T, 3d]
    qkv = self.qkv_proj(x)

    # qkv: [B, T, 3, h, d_h]
    qkv = qkv.view(B, T, 3, self.num_heads, self.d_h)

    # qkv: [3, B, h, T, d_h]
    qkv = qkv.permute(2, 0, 3, 1, 4)

    # q, k, v: [B, h, T, d_h]
    q, k, v = qkv[0], qkv[1], qkv[2]

    # scores: [B, h, T, T]
    scores = q @ k.transpose(-2, -1)
    scores = scores / (self.d_h ** 0.5)

    # Optional mask: causal mask or padding mask (entries equal to 0 are hidden)
    if mask is not None:
      scores = scores.masked_fill(mask == 0, float("-inf"))
    
    # attn: [B, h, T, T]
    attn = F.softmax(scores, dim=-1)

    # out: [B, h, T, d_h]
    out = attn @ v

    # out: [B, T, h, d_h]
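    # .contiguous(): lays the tensor out contiguously in memory. .permute and .transpose only change the access order (strides); the underlying memory order is unchanged, and the .view call that follows needs a contiguous layout here.
    # .reshape would copy automatically if the memory is not contiguous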
    out = out.transpose(1, 2).contiguous()

    # out: [B, T, d]
    out = out.view(B, T, d)

    # out: [B, T, d]
    out = self.out_proj(out)

    return out

2.3 kvcache

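A KV cache stores the $k,v$ vectors computed in earlier inference steps, so that they can be reused, rather than recomputed, when producing the newest output token.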

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttentionWithKVCache(nn.Module):
  """
  Input:
    x: [B, T, d]
  Output:
    out: [B, T, d]
    new_k: [B, h, past_len + T, d_h]
    new_v: [B, h, past_len + T, d_h]
  where:
    B = batch size
    T = current input length
        prefill stage: T = prompt_len
        decode stage:  T = 1
    d = d_model
    h = num_heads
    d_h = d // h
  """
  def __init__(self, d_model: int, num_heads: int):
    super().__init__()
    assert d_model % num_heads == 0

    self.d_model = d_model
    self.num_heads = num_heads
    self.d_h = self.d_model // self.num_heads

    self.qkv_proj = nn.Linear(d_model, 3 * d_model)
    self.out_proj = nn.Linear(d_model, d_model)

  def forward(self, x: torch.Tensor, mask: torch.Tensor = None, past_k: torch.Tensor = None, past_v: torch.Tensor = None, use_cache: bool = True):
    """
    x: [B, T, d]
    past_k: None or [B, h, past_len, d_h]
    past_v: None or [B, h, past_len, d_h]

    return:
      out: [B, T, d]
      new_k: [B, h, past_len + T, d_h]
      new_v: [B, h, past_len + T, d_h]
    """
    B, T, d = x.shape

    # qkv: [B, T, 3d]
    qkv = self.qkv_proj(x)
    
    # qkv: [3, B, h, T, d_h]
    qkv = qkv.view(B, T, 3, self.num_heads, self.d_h).permute(2, 0, 3, 1, 4)

    # [B, h, T, d_h]
    q, k, v = qkv[0], qkv[1], qkv[2]

    if past_k is not None and past_v is not None:
      k = torch.cat([past_k, k], dim=2)
      v = torch.cat([past_v, v], dim=2)
    
    total_len = k.size(2)

    # Keep for the next decode step
    new_k = k if use_cache else None
    new_v = v if use_cache else None

    # scores: [B, h, T, total_len]
    scores = q @ k.transpose(-2, -1)
    scores = scores / (self.d_h ** 0.5)

    if mask is not None:
      scores = scores.masked_fill(mask == 0, float("-inf"))

    attn = F.softmax(scores, dim=-1)

    # out: [B, h, T, d_h]
    out = attn @ v
    # out: [B, T, h, d_h]
    out = out.transpose(1, 2).contiguous()
    # out: [B, T, d]
    out = out.view(B, T, d)

    out = self.out_proj(out)

    return out, new_k, new_v
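
The claim that cached $k,v$ vectors can be reused is easy to check on a toy example. The sketch below is pure Python and single-head; the helper names `softmax` and `attend` and the per-token vectors `qs`, `ks`, `vs` are hypothetical stand-ins for the projections of four tokens, not taken from the class above.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Attention output of one query vector against a list of keys and values."""
    d_h = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_h) for k in keys]
    w = softmax(scores)
    return [sum(wi * v[c] for wi, v in zip(w, values)) for c in range(len(values[0]))]

# Toy per-token q, k, v vectors for 4 tokens (d_h = 2).
qs = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.5], [0.4, 0.4]]
ks = [[0.5, 0.1], [-0.3, 0.2], [0.0, 0.7], [0.6, -0.4]]
vs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

# Full recomputation under a causal mask: token t attends to tokens 0..t.
full = [attend(qs[t], ks[: t + 1], vs[: t + 1]) for t in range(len(qs))]

# Incremental decoding: append the new k, v to the cache,
# then attend using only the newest query.
cache_k, cache_v, incremental = [], [], []
for q, k, v in zip(qs, ks, vs):
    cache_k.append(k)
    cache_v.append(v)
    incremental.append(attend(q, cache_k, cache_v))

print(all(math.isclose(a, b) for fa, ia in zip(full, incremental) for a, b in zip(fa, ia)))
```

With a cache, each decode step computes one new query, key, and value and attends over `past_len + 1` entries, instead of recomputing the keys and values of the whole prefix.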

2.4 ffn

In a Transformer, FFN usually refers to the feed-forward network, also called the MLP layer.

Within each Transformer block, the usual structure is:

x
-> Multi-Head Self-Attention
-> Add & Norm
-> FFN / MLP
-> Add & Norm

Attention handles the exchange of information between tokens; the FFN applies a nonlinear transformation to each token's own representation and enriches its features.

The basic formula:

$$ \text{FFN}(x) = W_2\sigma(W_1x+b_1)+b_2 $$

Suppose the hidden state of a token is

$$ x\in\mathbb R^d $$

First linear layer:

$$ h = W_1x+b_1 $$

where $W_1\in\mathbb R^{d_{ff}\times d},\ b_1\in\mathbb R^{d_{ff}}$.

So:

$$ h\in\mathbb R^{d_{ff}} $$

Typically

$$ d_{ff} = 4d $$

Then apply the activation function:

$$ \tilde{h}=\sigma(h) $$

Then apply the second linear layer:

$$ y = W_2\tilde{h}+b_2 $$

where $W_2\in\mathbb R^{d\times d_{ff}},\ b_2\in\mathbb R^d$.

So:

$$ y\in\mathbb R^d $$

Putting it all together:

$$ \begin{align*} x&\in\mathbb R^d \\ x\rightarrow W_1x+b_1&\in\mathbb R^{d_{ff}} \\ \rightarrow\sigma(W_1x+b_1)&\in\mathbb R^{d_{ff}} \\ \rightarrow W_2\sigma(W_1x+b_1)+b_2&\in\mathbb R^d \end{align*} $$

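FFN over the whole sequence

$$ X\in\mathbb R^{B\times T\times d} $$

The shapes change analogously:

$$ [B,T,d]\rightarrow [B,T,d_{ff}]\rightarrow [B,T,d] $$

Without the activation function, the FFN becomes:

$$ \begin{align*} \text{FFN}(x)&=W_2(W_1x+b_1)+b_2 \\ &=W_2W_1x+W_2b_1+b_2 \end{align*} $$

This is still just a single linear layer, with weight $W_2W_1$ and bias $W_2b_1+b_2$. A nonlinearity is therefore required for the model to represent complex nonlinear functions.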
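
The collapse to a single linear layer can be verified with a few lines of arithmetic. This is a minimal pure-Python sketch; the helpers `matvec`, `matmul`, `add` and the small matrices (with $d=2$, $d_{ff}=3$) are illustrative assumptions.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

# d = 2, d_ff = 3
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # [d_ff, d]
b1 = [0.5, -0.5, 1.0]
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]     # [d, d_ff]
b2 = [0.1, 0.2]
x = [2.0, -3.0]

# Two linear layers with no activation in between.
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The equivalent single layer: weight W2 W1, bias W2 b1 + b2.
W_eff = matmul(W2, W1)
b_eff = add(matvec(W2, b1), b2)
one_layer = add(matvec(W_eff, x), b_eff)

print(two_layers, one_layer)
```

Both paths give the same vector for any input, so stacking linear layers without an activation adds parameters but no expressive power.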

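More on activation functions

ReLU

The original Transformer paper used ReLU:

$$ \text{ReLU}(x)=\max(0,x) $$

Its advantages are simplicity and speed. Its drawback is that the negative region is set directly to 0, which can lead to the dying-neuron problem.

GELU

GELU is common in BERT and the GPT family. It can be viewed as a smoother ReLU:

$$ \text{GELU}(x)=x\Phi(x) $$

where $\Phi(x)$ is the cumulative distribution function (CDF) of the standard normal distribution. Overall, the larger $x$ is, the more readily it passes through, and the smaller $x$ is, the more it is suppressed. Unlike ReLU, there is no hard cut to 0; the gating is smooth.

Swish/SiLU

SiLU is also called Swish. Its formula is:

$$ \text{SiLU}(x)=x\cdot\text{sigmoid}(x) $$

where:

$$ \text{sigmoid}(x)=\frac{1}{1+e^{-x}} $$

It is likewise a smooth activation function.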
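
The three activations can be compared side by side with the standard library alone. In this sketch GELU uses the exact form $x\Phi(x)$ with $\Phi$ written via `math.erf`; the function names and the sample points are illustrative choices.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    return x * sigmoid(x)

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}  silu={silu(x):.4f}")
```

For negative inputs ReLU returns exactly 0, while GELU and SiLU return small negative values that shrink toward 0 as $x$ becomes more negative; for large positive inputs all three approach $x$.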

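From the plain FFN to GLU (Gated Linear Unit) and SwiGLU

Many current large models, such as the LLaMA family, do not use the plain two-layer FFN but a GLU-style structure, most notably SwiGLU.

The plain FFN (bias terms omitted) is:

$$ \text{FFN}(x)=W_2\sigma(W_1x) $$

The GLU-style FFN is:

$$ \text{GLU-FFN}(x)=W_{down}(\sigma(W_{gate}x)\odot W_{up}x) $$

The key difference is an extra gate branch that acts as a gating signal; $\odot$ denotes elementwise multiplication.

where:

$$ W_{gate}, W_{up}\in\mathbb R^{d_{ff}\times d},\quad W_{down}\in\mathbb R^{d\times d_{ff}} $$

SwiGLU is a GLU variant that uses SiLU as the activation of the gate branch:

$$ \text{SwiGLU}(x)=W_{down}(\text{SiLU}(W_{gate}x)\odot W_{up}x) $$

It can be read as follows:

The gate branch first uses SiLU to produce a smooth gating signal.
That signal is multiplied elementwise with the candidate features produced by the up branch.
Finally, the down projection maps the result back to d_model.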
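
The three steps above can be sketched in pure Python for a single token. The function name `swiglu_ffn`, the helper `matvec`, and the small weight matrices (with $d=2$, $d_{ff}=3$, no biases) are assumptions chosen for illustration, not values from any real model.

```python
import math

def silu(x):
    """SiLU(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [silu(g) for g in matvec(W_gate, x)]   # smooth gating signal, [d_ff]
    up = matvec(W_up, x)                          # candidate features, [d_ff]
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise product, [d_ff]
    return matvec(W_down, hidden)                 # project back to [d]

# d = 2, d_ff = 3
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # [d_ff, d]
W_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]      # [d_ff, d]
W_down = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]      # [d, d_ff]

y = swiglu_ffn([1.0, -2.0], W_gate, W_up, W_down)
print(y)
```

Compared with the plain FFN, this structure has three weight matrices instead of two, so for the same $d_{ff}$ it carries roughly 1.5 times as many parameters.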

Code

Plain FFN

import torch
import torch.nn as nn
import torch.nn.functional as F

class FFN(nn.Module):
  """
  x: [B, T, d]
  out: [B, T, d]
  """

  def __init__(self, d: int, d_ff: int):
    super().__init__()

    self.up_proj = nn.Linear(d, d_ff)
    self.down_proj = nn.Linear(d_ff, d)

  def forward(self, x: torch.Tensor) -> torch.Tensor:
    hidden = self.up_proj(x)
    hidden = F.gelu(hidden)
    out = self.down_proj(hidden)

    return out
