LLMs

2.1 rope

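Position encoding: the attention mechanism in a Transformer has no inherent notion of token order, so position information has to be injected explicitly.

An ideal position encoding should satisfy:

  1. Every position has a unique representation.

  2. Relative position is observable: the attention score of the m-th token against the n-th token should depend only on the relative distance m-n, not on the absolute positions.

  3. It extrapolates to longer sequences (lengths not seen during training).

The traditional absolute position encoding (Sinusoidal PE) adds the position information directly to the embedding and does not naturally satisfy the "relative position" property.

The attention score is computed from $q_m^T k_n$. The idea of RoPE is to construct a function $f$ such that:

$$ \langle f(q,m), f(k,n)\rangle = g(q,k,m-n) $$

That is, the inner product depends only on the relative position $m-n$ and not on the absolute positions.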

Derivation from the 2-D case

For a vector $q=[q_0,q_1]$, treat it as a complex number:

$$ q\leftrightarrow q_0+iq_1 $$

Define the encoding function as a rotation:

$$ f(q,m)=q\cdot e^{im\theta}=(q_0+iq_1)\cdot(\cos m\theta+i\sin m\theta) $$

$$ f(q,m)=\begin{pmatrix}q_0^\prime \\ q_1^\prime\end{pmatrix}=\begin{pmatrix}\cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta\end{pmatrix}\begin{pmatrix}q_0 \\ q_1\end{pmatrix}=R_m q $$

In other words, the vector is rotated by the angle $m\theta$ in the complex plane. This defines the function $f$.

Verifying the inner-product property:

$$ \langle f(q,m),f(k,n)\rangle=\text{Re}[(qe^{im\theta})\cdot\overline{(ke^{in\theta})}]=\text{Re}[q\overline{k}\cdot e^{i(m-n)\theta}] $$

$$ \begin{align*} \langle f(q,m),f(k,n)\rangle&=\begin{pmatrix}q_0^\prime \\ q_1^\prime\end{pmatrix}^T\begin{pmatrix}k_0^\prime \\ k_1^\prime\end{pmatrix} \\ &=\begin{pmatrix}q_0 \\ q_1\end{pmatrix}^T \begin{pmatrix}\cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta\end{pmatrix}^T\begin{pmatrix}\cos n\theta & -\sin n\theta \\ \sin n\theta & \cos n\theta\end{pmatrix}\begin{pmatrix}k_0 \\ k_1\end{pmatrix} \\ &=\begin{pmatrix}q_0 \\ q_1\end{pmatrix}^T\begin{pmatrix}\cos((n-m)\theta) & -\sin((n-m)\theta) \\ \sin((n-m)\theta) & \cos((n-m)\theta)\end{pmatrix}\begin{pmatrix}k_0 \\ k_1\end{pmatrix} \end{align*} $$

The last step uses $R_m^T R_n = R_{-m}R_n = R_{n-m}$, so the result depends only on $m-n$.
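
The 2-D property can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `rotate` and `dot` and the sample values of `q`, `k`, and `theta` are illustrative assumptions, not part of the derivation.

```python
import cmath

def rotate(vec, pos, theta):
    """Encode a 2-D vector (x0, x1) at position `pos` by rotating it by pos * theta."""
    z = complex(vec[0], vec[1]) * cmath.exp(1j * pos * theta)
    return (z.real, z.imag)

def dot(a, b):
    """Plain 2-D inner product."""
    return a[0] * b[0] + a[1] * b[1]

# Arbitrary sample vectors and angle (assumed values, for illustration only).
q, k, theta = (0.3, -1.2), (0.8, 0.5), 0.1

# Same offset m - n = 3 at two very different absolute positions.
near = dot(rotate(q, 5, theta), rotate(k, 2, theta))
far = dot(rotate(q, 103, theta), rotate(k, 100, theta))

# A different offset (m - n = 4) gives a different score for these vectors.
other = dot(rotate(q, 6, theta), rotate(k, 2, theta))

print(abs(near - far) < 1e-9)
print(abs(near - other) > 1e-6)
```

Shifting both positions by the same amount leaves the score unchanged up to floating-point rounding, while changing the offset changes it.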

Extension to higher dimensions (used in practice)

The RoPE rotation matrix is a block-diagonal matrix:

$$ R_m = \begin{bmatrix}R_{m\theta_1} & 0 & \cdots & 0 \\ 0 & R_{m\theta_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & R_{m\theta_{d/2}}\end{bmatrix} $$

where each block is:

$$ R_{m\theta_i}=\begin{bmatrix}\cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i)\end{bmatrix} $$

Therefore:

$$ \begin{align*} f(q,m)&= R_m q = \begin{bmatrix}R_{m\theta_1} & 0 & \cdots & 0 \\ 0 & R_{m\theta_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & R_{m\theta_{d/2}}\end{bmatrix} \begin{bmatrix}q_0\\q_1\\\vdots\\ q_{d-1}\end{bmatrix} \\ &= \begin{bmatrix}\cos(m\theta_1) & -\sin(m\theta_1) & 0 & 0 & \cdots & 0 & 0 \\ \sin(m\theta_1) & \cos(m\theta_1) & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos(m\theta_2) & -\sin(m\theta_2) & \cdots & 0 & 0 \\ 0 & 0 & \sin(m\theta_2) & \cos(m\theta_2) & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & 0 & 0 & \cdots & \cos(m\theta_{d/2}) & -\sin(m\theta_{d/2}) \\ 0 & 0 & 0 & 0 & \cdots & \sin(m\theta_{d/2}) & \cos(m\theta_{d/2})\end{bmatrix}\begin{bmatrix}q_0\\q_1\\q_2\\q_3\\\vdots\\ q_{d-2}\\ q_{d-1}\end{bmatrix} \\ &=\begin{bmatrix}q_0\cos(m\theta_1)-q_1\sin(m\theta_1) \\ q_0\sin(m\theta_1)+q_1\cos(m\theta_1) \\ q_2\cos(m\theta_2)-q_3\sin(m\theta_2) \\ q_2\sin(m\theta_2)+q_3\cos(m\theta_2) \\ \vdots \\ q_{d-2}\cos(m\theta_{d/2})-q_{d-1}\sin(m\theta_{d/2}) \\ q_{d-2}\sin(m\theta_{d/2})+q_{d-1}\cos(m\theta_{d/2})\end{bmatrix} = \begin{bmatrix}q_0\cos(m\theta_1)-q_1\sin(m\theta_1) \\ q_1\cos(m\theta_1)+q_0\sin(m\theta_1) \\ q_2\cos(m\theta_2)-q_3\sin(m\theta_2) \\ q_3\cos(m\theta_2)+q_2\sin(m\theta_2) \\ \vdots \\ q_{d-2}\cos(m\theta_{d/2})-q_{d-1}\sin(m\theta_{d/2}) \\ q_{d-1}\cos(m\theta_{d/2})+q_{d-2}\sin(m\theta_{d/2})\end{bmatrix} \\ &=\begin{bmatrix}q_0\cos(m\theta_1) \\ q_1\cos(m\theta_1) \\ q_2\cos(m\theta_2) \\ q_3\cos(m\theta_2) \\ \vdots \\ q_{d-2}\cos(m\theta_{d/2}) \\ q_{d-1}\cos(m\theta_{d/2})\end{bmatrix} + \begin{bmatrix}-q_1\sin(m\theta_1) \\ q_0\sin(m\theta_1) \\ -q_3\sin(m\theta_2) \\ q_2\sin(m\theta_2) \\ \vdots \\ -q_{d-1}\sin(m\theta_{d/2}) \\ q_{d-2}\sin(m\theta_{d/2})\end{bmatrix} \end{align*} $$

Applying the 2-D argument to each block gives $\langle f(q,m),f(k,n)\rangle=q^TR_m^TR_nk=q^TR_{n-m}k$.

The angles $\theta_i$, for $i=1,2,\cdots,d/2$, are:

$$ \theta_i = \frac{1}{10000^{\frac{i-1}{d/2}}} = 10000^{-\frac{i-1}{d/2}} $$

$$ \begin{align*} \mathbf{\theta} &= \begin{bmatrix}10000^{-\frac{0}{d/2}} & 10000^{-\frac{1}{d/2}} & \cdots & 10000^{-\frac{(d/2)-1}{d/2}}\end{bmatrix} \\ &=\begin{bmatrix}10000^{-\frac{0}{d}} & 10000^{-\frac{2}{d}} & \cdots & 10000^{-\frac{d-2}{d}}\end{bmatrix} \end{align*} $$
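
To make the frequency schedule concrete, here is a small sketch that evaluates $\theta_i$ for a toy head dimension. The function name `rope_frequencies` and the choice `d = 8` are assumptions made for illustration.

```python
def rope_frequencies(d, base=10000.0):
    """Return [theta_1, ..., theta_{d/2}] with theta_i = base ** (-2 * (i - 1) / d)."""
    assert d % 2 == 0, "head dimension must be even"
    return [base ** (-2 * (i - 1) / d) for i in range(1, d // 2 + 1)]

thetas = rope_frequencies(8)
for i, theta in enumerate(thetas, start=1):
    print(f"theta_{i} = {theta:.4f}")
# theta_1 = 1.0000
# theta_2 = 0.1000
# theta_3 = 0.0100
# theta_4 = 0.0010
```

The angles form a geometric sequence: the first pair of dimensions rotates fastest (one radian per position), and later pairs rotate progressively more slowly.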
import torch
from typing import Tuple

def rotate_half(x: torch.Tensor) -> torch.Tensor:
  """
  [x0, x1, x2, x3, ...] -> [-x1, x0, -x3, x2, ...]
  """
  x_even = x[..., 0::2] # [..., head_dim / 2]
  x_odd = x[..., 1::2] # [..., head_dim / 2]

  # Pair up (-x_odd, x_even) on a new last axis, then flatten so the pairs interleave
  x_rot = torch.stack((-x_odd, x_even), dim=-1).flatten(-2)

  return x_rot

def apply_rope(q: torch.Tensor, k: torch.Tensor, base: float = 10000.0) -> Tuple[torch.Tensor, torch.Tensor]:
  """
  Apply RoPE position encoding to query and key.
  Inputs:
    q: Tensor, shape = [bs, seq_len, num_heads, head_dim]
    k: Tensor, shape = [bs, seq_len, num_heads, head_dim]
  Outputs:
    q_rope: Tensor, shape = [bs, seq_len, num_heads, head_dim]
    k_rope: Tensor, shape = [bs, seq_len, num_heads, head_dim]
  Constraints:
    1. q and k have the same shape
    2. head_dim must be even
    3. RoPE acts only on the last dimension, head_dim
  """
  bs, seq_len, num_heads, head_dim = q.shape
  assert q.shape == k.shape
  assert head_dim % 2 == 0

  device = q.device
  dtype = q.dtype

  # Frequency for each 2-D pair of dimensions
  # shape: [head_dim / 2]
  # inv_freq is exactly [theta_1, theta_2, ..., theta_{d/2}]
  # Note the parentheses: the whole exponent is -(2(i-1)) / head_dim
  inv_freq = base ** (-torch.arange(0, head_dim, 2, device=device).float() / head_dim)

  # Position indices
  # shape: [seq_len]
  position_ids = torch.arange(seq_len, device=device).float()

  # Rotation angle for every position and every frequency
  # position_ids: [seq_len]
  # inv_freq: [head_dim / 2]
  # freqs: [seq_len, head_dim / 2]
  # Row m of freqs holds m*theta_1, m*theta_2, ..., m*theta_{d/2}
  # for m = 0, 1, ..., seq_len-1
  position_ids = position_ids.unsqueeze(-1) # [seq_len, 1]
  inv_freq = inv_freq.unsqueeze(0) # [1, head_dim / 2]
  freqs = position_ids * inv_freq # [seq_len, head_dim / 2]

  # Each frequency covers two dimensions of its 2-D pair, so duplicate every entry
  # shape: [seq_len, head_dim]
  #
  # freqs = tensor([
  #     [a, b, c],
  #     [d, e, f],
  # ])
  #
  # ->
  #
  # tensor([
  #     [a, a, b, b, c, c],
  #     [d, d, e, e, f, f],
  # ])
  #
  # Row m becomes:
  # m*theta_1, m*theta_1, m*theta_2, m*theta_2, ..., m*theta_{d/2}, m*theta_{d/2}
  freqs = torch.repeat_interleave(freqs, repeats=2, dim=-1)

  # Build cos / sin and broadcast them to the shape of q/k
  # original: [seq_len, head_dim]
  # target:   [1, seq_len, 1, head_dim]
  cos = freqs.cos()[None, :, None, :].to(dtype)
  sin = freqs.sin()[None, :, None, :].to(dtype)

  # Apply RoPE
  # 2-D rotation formula:
  # [x0', x1'] = [x0 * cos - x1 * sin, x0 * sin + x1 * cos]
  #
  # q = [q_0, q_1, q_2, q_3, ..., q_{d-2}, q_{d-1}]
  # rotate_half(q) = [-q_1, q_0, -q_3, q_2, ..., -q_{d-1}, q_{d-2}]
  # cos = [cos(m*theta_1), cos(m*theta_1), cos(m*theta_2), cos(m*theta_2), ..., cos(m*theta_{d/2}), cos(m*theta_{d/2})]
  # sin = [sin(m*theta_1), sin(m*theta_1), sin(m*theta_2), sin(m*theta_2), ..., sin(m*theta_{d/2}), sin(m*theta_{d/2})]
  #
  # q_rope = [q_0*cos(m*theta_1) - q_1*sin(m*theta_1), ...]
  q_rope = q * cos + rotate_half(q) * sin
  k_rope = k * cos + rotate_half(k) * sin

  return q_rope, k_rope
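
As a sanity check on the cos/sin formulation, the sketch below re-implements it compactly for a single head and compares it with the complex-multiplication view from the 2-D derivation. The function names `rope_interleaved` and `rope_complex` and the toy sizes are assumptions made for this example, and it requires PyTorch to be installed.

```python
import torch

def rope_interleaved(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """RoPE as x * cos + rotate_half(x) * sin; x has shape [seq_len, head_dim]."""
    seq_len, head_dim = x.shape
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]
    angles = angles.repeat_interleave(2, dim=-1)            # [seq_len, head_dim]
    rotated = torch.stack((-x[..., 1::2], x[..., 0::2]), dim=-1).flatten(-2)
    return x * angles.cos() + rotated * angles.sin()

def rope_complex(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """The same rotation written as multiplication by e^{i m theta} per 2-D pair."""
    seq_len, head_dim = x.shape
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]  # [seq_len, head_dim/2]
    phase = torch.polar(torch.ones_like(angles), angles)    # unit-modulus complex numbers
    x_c = torch.view_as_complex(x.float().reshape(seq_len, -1, 2).contiguous())
    return torch.view_as_real(x_c * phase).flatten(-2)

torch.manual_seed(0)
seq_len, head_dim = 16, 8

# 1) Both formulations agree on random input.
x = torch.randn(seq_len, head_dim)
matches = torch.allclose(rope_interleaved(x), rope_complex(x), atol=1e-5)

# 2) Place the same q0 / k0 at every position: scores[m, n] = <f(q0, m), f(k0, n)>.
q0, k0 = torch.randn(head_dim), torch.randn(head_dim)
q = rope_interleaved(q0.expand(seq_len, head_dim))
k = rope_interleaved(k0.expand(seq_len, head_dim))
scores = q @ k.T
same_offset = torch.allclose(scores[5, 2], scores[13, 10], atol=1e-4)

print(matches, same_offset)
```

Entries of `scores` with the same offset `m - n` coincide up to float32 rounding, which is the relative-position property in the high-dimensional case.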

2.2 mhsa

Let the input be:

$$ X\in\mathbb R^{B\times T\times d} $$

where $B=\text{batch size}, T=\text{seq len}, d=d_{model}$.

Let the number of heads be $h$; each head then has dimension $d_h=\frac{d}{h}$.

  1. Linear projections to obtain Q, K, V

$$ \begin{aligned} Q &= XW_Q \in \mathbb R^{B\times T\times d}, \\ K &= XW_K \in \mathbb R^{B\times T\times d}, \\ V &= XW_V \in \mathbb R^{B\times T\times d}. \end{aligned} $$

where $W_Q,W_K,W_V\in\mathbb R^{d\times d}$.

  2. Split into heads

Split the last dimension into $h$ heads:

$$ Q, K, V\in\mathbb R^{B\times T\times h\times d_h} $$

After a transpose:

$$ Q, K, V\in\mathbb R^{B\times h\times T\times d_h} $$

  3. Compute attention scores

For every batch element and every head, compute:

$$ S=\frac{QK^T}{\sqrt{d_h}} $$

where $K^T$ transposes the last two dimensions:

$$ K^T\in\mathbb R^{B\times h\times d_h\times T} $$

Hence $S\in\mathbb R^{B\times h\times T\times T}$.

  4. Softmax to obtain the attention weights

Apply softmax over the last dimension:

$$ A=\text{softmax}(S, dim=-1)\in\mathbb R^{B\times h\times T\times T} $$

  5. Weighted sum of $V$

$$ \begin{aligned} O_{\text{head}} &= AV, \\ A &\in \mathbb R^{B\times h\times T\times T}, \\ V &\in \mathbb R^{B\times h\times T\times d_h}. \end{aligned} $$

So $O_{\text{head}}\in\mathbb R^{B\times h\times T\times d_h}$.

  6. Merge the heads and apply the output projection

First transpose $O_{\text{head}}$ and merge its last two dimensions:

$$ \begin{aligned} O_{\text{concat}} &\in \mathbb R^{B\times T\times (h d_h)} = \mathbb R^{B\times T\times d}, \\ O &= O_{\text{concat}}W_O \in \mathbb R^{B\times T\times d}. \end{aligned} $$

where $W_O\in\mathbb R^{d\times d}$.

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional

class MultiHeadSelfAttention(nn.Module):
  """
  Input:
    x: [B, T, d]
  Output:
    out: [B, T, d]
  where:
    B = batch size
    T = seq_len
    d = d_model
    h = num_heads
    d_h = d // h
  """
  def __init__(self, d_model: int, num_heads: int):
    super().__init__()
    assert d_model % num_heads == 0
    self.d_model = d_model
    self.num_heads = num_heads
    self.d_h = d_model // num_heads

    # Produce Q, K, V with a single projection
    self.qkv_proj = nn.Linear(d_model, 3 * d_model)

    # Output projection O
    self.out_proj = nn.Linear(d_model, d_model)
  
  def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
    B, T, d = x.shape

    # x: [B, T, d]
    # qkv: [B, T, 3d]
    qkv = self.qkv_proj(x)

    # qkv: [B, T, 3, h, d_h]
    qkv = qkv.view(B, T, 3, self.num_heads, self.d_h)

    # qkv: [3, B, h, T, d_h]
    qkv = qkv.permute(2, 0, 3, 1, 4)

    # q, k, v: [B, h, T, d_h]
    q, k, v = qkv[0], qkv[1], qkv[2]

    # scores: [B, h, T, T]
    scores = q @ k.transpose(-2, -1)
    scores = scores / (self.d_h ** 0.5)

    # mask is optional: a causal mask or a padding mask, broadcastable to [B, h, T, T];
    # positions where mask == 0 are blocked
    if mask is not None:
      scores = scores.masked_fill(mask == 0, float("-inf"))
    
    # attn: [B, h, T, T]
    attn = F.softmax(scores, dim=-1)

    # out: [B, h, T, d_h]
    out = attn @ v

    # out: [B, T, h, d_h]
    # .contiguous(): copies the tensor so that its memory layout matches its logical order.
    # .permute and .transpose only change the access order (strides); the memory itself is
    # unchanged, and .view cannot merge dimensions of such a tensor.
    # .reshape would make the copy automatically when the memory is not contiguous.
    out = out.transpose(1, 2).contiguous()

    # out: [B, T, d]
    out = out.view(B, T, d)

    # out: [B, T, d]
    out = self.out_proj(out)

    return out
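
The `mask` argument treats entries equal to 0 as blocked. As a minimal sketch of how a causal mask could be built and what it does, the example below applies steps 3 to 5 by hand on random tensors and cross-checks the result against `F.scaled_dot_product_attention` with `is_causal=True` (available in PyTorch 2.0 and later). The tensor sizes are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, h, T, d_h = 2, 4, 5, 8
q, k, v = (torch.randn(B, h, T, d_h) for _ in range(3))

# Lower-triangular mask: entry [i, j] is 1 when query i may attend to key j (j <= i).
causal_mask = torch.tril(torch.ones(T, T))            # [T, T], broadcasts over B and h

scores = q @ k.transpose(-2, -1) / d_h ** 0.5         # [B, h, T, T]
scores = scores.masked_fill(causal_mask == 0, float("-inf"))
attn = F.softmax(scores, dim=-1)                      # blocked entries become exactly 0
out = attn @ v                                        # [B, h, T, d_h]

# Cross-check against PyTorch's built-in attention with causal masking.
reference = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(torch.allclose(out, reference, atol=1e-5))
```

Because the mask has shape `[T, T]`, it broadcasts across the batch and head dimensions; a padding mask would instead need a batch dimension, for example `[B, 1, 1, T]`.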

2.3 kvcache

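Store the $k,v$ vectors computed during earlier inference steps so that, when the newest output token is computed, the previous $k,v$ vectors are reused instead of being recomputed.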

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional, Tuple

class MultiHeadSelfAttentionWithKVCache(nn.Module):
  """
  Input:
    x: [B, T, d]
  Output:
    out: [B, T, d]
    new_k: [B, h, past_len + T, d_h]
    new_v: [B, h, past_len + T, d_h]
  where:
    B = batch size
    T = length of the current input
        prefill phase: T = prompt_len
        decode phase:  T = 1
    d = d_model
    h = num_heads
    d_h = d // h
  """
  def __init__(self, d_model: int, num_heads: int):
    super().__init__()
    assert d_model % num_heads == 0

    self.d_model = d_model
    self.num_heads = num_heads
    self.d_h = self.d_model // self.num_heads

    self.qkv_proj = nn.Linear(d_model, 3 * d_model)
    self.out_proj = nn.Linear(d_model, d_model)

  def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None, past_k: Optional[torch.Tensor] = None, past_v: Optional[torch.Tensor] = None, use_cache: bool = True) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[torch.Tensor]]:
    """
    x: [B, T, d]
    mask: None or broadcastable to [B, h, T, past_len + T]; 0 marks blocked positions
    past_k: None or [B, h, past_len, d_h]
    past_v: None or [B, h, past_len, d_h]

    return:
      out: [B, T, d]
      new_k: [B, h, past_len + T, d_h] (None if use_cache is False)
      new_v: [B, h, past_len + T, d_h] (None if use_cache is False)
    """
    B, T, d = x.shape

    # qkv: [B, T, 3d]
    qkv = self.qkv_proj(x)
    
    # qkv: [3, B, h, T, d_h]
    qkv = qkv.view(B, T, 3, self.num_heads, self.d_h).permute(2, 0, 3, 1, 4)

    # [B, h, T, d_h]
    q, k, v = qkv[0], qkv[1], qkv[2]

    # Append the new keys/values to the cached ones along the sequence axis
    if past_k is not None and past_v is not None:
      k = torch.cat([past_k, k], dim=2)
      v = torch.cat([past_v, v], dim=2)
    
    total_len = k.size(2)

    # Keep for the next decode step
    new_k = k if use_cache else None
    new_v = v if use_cache else None

    # scores: [B, h, T, total_len]
    scores = q @ k.transpose(-2, -1)
    scores = scores / (self.d_h ** 0.5)

    if mask is not None:
      scores = scores.masked_fill(mask == 0, float("-inf"))

    attn = F.softmax(scores, dim=-1)
    
    # out: [B, h, T, d_h]
    out = attn @ v
    # out: [B, T, h, d_h]
    out = out.transpose(1, 2).contiguous()
    # out: [B, T, d]
    out = out.view(B, T, d)

    out = self.out_proj(out)

    return out, new_k, new_v
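
To see why caching is safe, the following sketch compares one full causal pass with a prefill step followed by token-by-token decoding. It is a deliberately simplified setting (single head, no batch dimension, random weights); the names `attend`, `w_q`, `w_k`, `w_v`, and `prompt_len` are assumptions made for this example and are not part of the class above.

```python
import torch
import torch.nn.functional as F

def attend(q, k, v, mask=None):
    """Scaled dot-product attention; q: [T_q, d], k and v: [T_k, d]."""
    scores = q @ k.T / q.size(-1) ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
T, d = 6, 16
x = torch.randn(T, d)                                  # hidden states of a 6-token sequence
w_q, w_k, w_v = (torch.randn(d, d) / d ** 0.5 for _ in range(3))

# Reference: one causal pass over the whole sequence, no cache.
full = attend(x @ w_q, x @ w_k, x @ w_v, mask=torch.tril(torch.ones(T, T)))

# Prefill on the first 4 tokens, then decode the remaining tokens one at a time.
prompt_len = 4
k_cache, v_cache = x[:prompt_len] @ w_k, x[:prompt_len] @ w_v
outputs = [attend(x[:prompt_len] @ w_q, k_cache, v_cache,
                  mask=torch.tril(torch.ones(prompt_len, prompt_len)))]
for t in range(prompt_len, T):
    x_t = x[t:t + 1]                                   # [1, d], only the new token
    k_cache = torch.cat([k_cache, x_t @ w_k], dim=0)   # append, never recompute
    v_cache = torch.cat([v_cache, x_t @ w_v], dim=0)
    # No mask needed: the newest token may attend to every cached position.
    outputs.append(attend(x_t @ w_q, k_cache, v_cache))
cached = torch.cat(outputs, dim=0)

print(torch.allclose(full, cached, atol=1e-4))
```

Each decode step projects only the new token and reads everything else from the cache, so the per-step projection cost no longer grows with the sequence length, while the outputs match the uncached pass up to rounding.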

2.4 ffn

In a Transformer, FFN usually refers to the feed-forward network, also called the MLP layer.

Within each Transformer block, the typical structure is:

x
-> Multi-Head Self-Attention
-> Add & Norm
-> FFN / MLP
-> Add & Norm

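Attention handles the exchange of information between tokens; the FFN applies a non-linear transformation to each token's own representation and enriches its features.

Core formula:

$$ \text{FFN}(x) = W_2\sigma(W_1x+b_1)+b_2 $$

Suppose the hidden state of some token is

$$ x\in\mathbb R^d $$

First linear layer:

$$ h = W_1x+b_1 $$

where $W_1\in\mathbb R^{d_{ff}\times d},\ b_1\in\mathbb R^{d_{ff}}$

so:

$$ h\in\mathbb R^{d_{ff}} $$

A typical choice is

$$ d_{ff} = 4d $$

The activation function is then applied:

$$ \tilde{h}=\sigma(h) $$

followed by the second linear layer:

$$ y = W_2\tilde{h}+b_2 $$

where $W_2\in\mathbb R^{d\times d_{ff}},\ b_2\in\mathbb R^d$

so:

$$ y\in\mathbb R^d $$

Putting it all together:

$$ \begin{align*} x&\in\mathbb R^d \\ x\rightarrow W_1x+b_1&\in\mathbb R^{d_{ff}} \\ \rightarrow\sigma(W_1x+b_1)&\in\mathbb R^{d_{ff}} \\ \rightarrow W_2\sigma(W_1x+b_1)+b_2&\in\mathbb R^d \end{align*} $$

FFN over the whole sequence

$$ X\in\mathbb R^{B\times T\times d} $$

The same FFN is applied to every position independently, so the shapes change analogously:

$$ [B,T,d]\rightarrow [B,T,d_{ff}]\rightarrow [B,T,d] $$

Without the activation function, the FFN becomes:

$$ \begin{align*} \text{FFN}(x)&=W_2(W_1x+b_1)+b_2 \\ &=W_2W_1x+W_2b_1+b_2 \end{align*} $$

This is still just a single linear layer (weight $W_2W_1$, bias $W_2b_1+b_2$), so a non-linearity is required for the model to express complex non-linear functions.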
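
The collapse into a single linear layer can be verified with a tiny worked example. The sketch below uses plain Python lists with hand-picked integer weights ($d=2$, $d_{ff}=3$); the helper names and all numbers are assumptions chosen so the arithmetic stays exact.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# d = 2, d_ff = 3; integer weights keep the arithmetic exact.
W1, b1 = [[1, 2], [0, -1], [3, 1]], [1, 0, -2]
W2, b2 = [[2, 0, 1], [-1, 1, 0]], [0, 5]
x = [4, -3]

h = add(matvec(W1, x), b1)                      # [-1, 3, 7]

# Two linear layers with no activation in between.
two_layers = add(matvec(W2, h), b2)

# Collapsed single layer: W = W2 W1, b = W2 b1 + b2.
W, b = matmul(W2, W1), add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

# With ReLU in between, the result can no longer be written this way.
with_relu = add(matvec(W2, [max(0, v) for v in h]), b2)

print(two_layers, one_layer, with_relu)   # [5, 9] [5, 9] [7, 8]
```

The two activation-free computations agree exactly, while inserting ReLU changes the output for this input.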

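Activation functions

  • ReLU

The original Transformer paper used ReLU:

$$ \text{ReLU}(x)=\max(0,x) $$

Its advantages are simplicity and speed; its drawback is that the negative region is mapped straight to 0, which can cause the dead-neuron problem.

  • GELU

GELU is common in BERT and the GPT series. It can be viewed as a smoother ReLU:

$$ \text{GELU}(x)=x\Phi(x) $$

where $\Phi(x)$ is the cumulative distribution function (CDF) of the standard normal distribution. Overall, with GELU, the larger $x$ is, the more easily it passes through, and the smaller $x$ is, the more it is suppressed; unlike ReLU there is no hard cut to 0, and the gating is smooth.

  • Swish/SiLU

SiLU, also called Swish, is defined as:

$$ \text{SiLU}(x)=x\cdot\text{sigmoid}(x) $$

where:

$$ \text{sigmoid}(x)=\frac{1}{1+e^{-x}} $$

It is likewise a smooth activation function.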

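  • From the plain FFN to GLU (Gated Linear Unit) and SwiGLU

Many current large models, such as the LLaMA series, do not use the plain two-layer FFN; they use GLU-style structures, SwiGLU in particular.

The plain FFN is:

$$ \text{FFN}(x)=W_2\sigma(W_1x) $$

The GLU-style FFN is:

$$ \text{GLU-FFN}(x)=W_{down}(\sigma(W_{gate}x)\odot W_{up}x) $$

The key difference is the extra gate branch acting as a gating signal; $\odot$ denotes element-wise multiplication.

where:

$$ W_{gate}, W_{up}\in\mathbb R^{d_{ff}\times d},\quad W_{down}\in\mathbb R^{d\times d_{ff}} $$

SwiGLU is a GLU variant that uses SiLU as the activation on the gate branch:

$$ \text{SwiGLU}(x)=W_{down}(\text{SiLU}(W_{gate}x)\odot W_{up}x) $$

In other words:

  1. The gate branch first uses SiLU to produce a smooth gating signal.
  2. This signal is multiplied element-wise with the candidate features produced by the up branch.
  3. Finally, the down projection maps the result back to d_model.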
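
To compare the three activation functions from this section numerically, here is a minimal standard-library sketch. It uses the exact form of GELU via the error function, $\Phi(x)=\tfrac{1}{2}\left(1+\operatorname{erf}(x/\sqrt{2})\right)$; the sample inputs are arbitrary.

```python
import math

def relu(x: float) -> float:
    return max(0.0, x)

def gelu(x: float) -> float:
    # Exact form x * Phi(x), with the normal CDF Phi written via the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x: float) -> float:
    # x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):+.4f}  silu={silu(x):+.4f}")
```

For negative inputs ReLU returns exactly 0, whereas GELU and SiLU return small negative values that fade towards 0 as $x$ decreases; for large positive inputs all three approach $x$.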

coding

  • Plain FFN
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFN(nn.Module):
  """
  x: [B, T, d]
  out: [B, T, d]
  """

  def __init__(self, d: int, d_ff: int):
    super().__init__()

    self.up_proj = nn.Linear(d, d_ff)
    self.down_proj = nn.Linear(d_ff, d)

  def forward(self, x: torch.Tensor) -> torch.Tensor:
    hidden = self.up_proj(x)
    hidden = F.gelu(hidden)
    out = self.down_proj(hidden)

    return out
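
For comparison with the plain FFN, a SwiGLU variant following the formula $W_{down}(\text{SiLU}(W_{gate}x)\odot W_{up}x)$ could be sketched as below. The class name `SwiGLUFFN` and the bias-free projections are assumptions (the formula in the text has no bias terms); this is an illustration of the structure, not the implementation of any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """
    x:   [B, T, d]
    out: [B, T, d]
    """

    def __init__(self, d: int, d_ff: int):
        super().__init__()
        # Assumption: no bias terms, matching the bias-free formula in the text.
        self.gate_proj = nn.Linear(d, d_ff, bias=False)
        self.up_proj = nn.Linear(d, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.silu(self.gate_proj(x))        # smooth gating signal, [B, T, d_ff]
        candidate = self.up_proj(x)             # candidate features,   [B, T, d_ff]
        return self.down_proj(gate * candidate) # project back to d,    [B, T, d]

torch.manual_seed(0)
ffn = SwiGLUFFN(d=16, d_ff=64)
x = torch.randn(2, 5, 16)
out = ffn(x)
print(out.shape)
```

Compared with the plain FFN, this layer has three weight matrices instead of two, so for the same `d_ff` it carries roughly 1.5 times as many parameters.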

2.5 gqa

2.6 grpo ppo dpo dapo gspo

2.7 api calls

2.8 sampling topp topk, softmax

2.9 cross entropy

2.10 kl divergence