LLMs
2.1 rope
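Positional encoding: the attention mechanism in a Transformer is itself insensitive to token order, so positional information must be injected explicitly.
An ideal positional encoding should satisfy:
- Every position has a unique representation.
- Relative position is perceivable: the attention score between the m-th token and the n-th token should depend only on the relative distance m-n, not on the absolute positions.
- It can extrapolate to longer sequences (lengths not seen during training).
Traditional absolute positional encoding (Sinusoidal PE) adds the positional information directly to the embedding, so it does not naturally satisfy the "relative position" property.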
The attention score is computed from $q_m^Tk_n$. The idea of RoPE is to construct a function $f$ such that:
$$
\langle f(q,m), f(k,n)\rangle = g(q,k,m-n)
$$
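That is, the inner product depends only on the relative position m-n, not on the absolute positions.
Derivation from the 2D case
For a vector $q=[q_0,q_1]$, treat it as a complex number:
$$
q\leftrightarrow q_0+iq_1
$$
Define the encoding function as a "rotation":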
$$
f(q,m)=q\cdot e^{im\theta}=(q_0+iq_1)\cdot(\cos m\theta+i\sin m\theta)
$$
$$
f(q,m)=\begin{pmatrix}q_0^\prime \\ q_1^\prime\end{pmatrix}=\begin{pmatrix}\cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta\end{pmatrix}\begin{pmatrix}q_0 \\ q_1\end{pmatrix}=R_m q
$$
Geometrically, this rotates the vector by the angle $m\theta$ in the complex plane, which defines the function $f$.
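As a quick sanity check on this definition, the sketch below computes $f(q,m)$ in both forms, once by complex multiplication and once with the $2\times 2$ rotation matrix, and confirms that they agree. It uses only the Python standard library; the function names and the values of `q`, `m`, and `theta` are illustrative assumptions, not part of the original notes.

```python
import cmath
import math


def rotate_complex(q, m, theta):
    """f(q, m) via complex multiplication: (q0 + i*q1) * e^{i*m*theta}."""
    z = complex(q[0], q[1]) * cmath.exp(1j * m * theta)
    return [z.real, z.imag]


def rotate_matrix(q, m, theta):
    """f(q, m) via the 2x2 rotation matrix R_m applied to [q0, q1]."""
    c, s = math.cos(m * theta), math.sin(m * theta)
    return [c * q[0] - s * q[1], s * q[0] + c * q[1]]


q = [1.0, 2.0]      # arbitrary 2D vector (illustrative values)
m, theta = 3, 0.5   # position and rotation frequency (illustrative values)

a = rotate_complex(q, m, theta)
b = rotate_matrix(q, m, theta)

# Both forms give the same rotated vector, and rotation preserves the norm.
assert all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(a, b))
assert math.isclose(math.hypot(*a), math.hypot(*q))
```

Because a rotation never changes the length of the vector, the encoding only alters the direction of $q$ and $k$, not their magnitude.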
Checking the inner-product property:
$$
\langle f(q,m),f(k,n)\rangle=\text{Re}[(qe^{im\theta})\cdot\overline{(ke^{in\theta})}]=\text{Re}[q\overline{k}\cdot e^{i(m-n)\theta}]
$$
$$
\begin{align*}
\langle f(q,m),f(k,n)\rangle&=\begin{pmatrix}q_0^\prime \\ q_1^\prime\end{pmatrix}^T\begin{pmatrix}k_0^\prime \\ k_1^\prime\end{pmatrix} \\
&=\begin{pmatrix}q_0 \\ q_1\end{pmatrix}^T \begin{pmatrix}\cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta\end{pmatrix}^T\begin{pmatrix}\cos n\theta & -\sin n\theta \\ \sin n\theta & \cos n\theta\end{pmatrix}\begin{pmatrix}k_0 \\ k_1\end{pmatrix} \\
&=\begin{pmatrix}q_0 \\ q_1\end{pmatrix}^T\begin{pmatrix}\cos((n-m)\theta) & -\sin((n-m)\theta) \\ \sin((n-m)\theta) & \cos((n-m)\theta)\end{pmatrix}\begin{pmatrix}k_0 \\ k_1\end{pmatrix}
\end{align*}
$$
The result depends only on m-n.
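The property can also be observed numerically. The following minimal sketch (standard library only; the helper names and all numeric values are illustrative assumptions) evaluates the score for several position pairs that share the same offset and checks that the scores coincide.

```python
import cmath
import math


def f(vec, pos, theta):
    """Rotate the 2D vector `vec` by the angle pos * theta (the 2D encoding f)."""
    z = complex(vec[0], vec[1]) * cmath.exp(1j * pos * theta)
    return [z.real, z.imag]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q, k, theta = [0.3, -1.2], [0.7, 0.4], 0.1  # illustrative values

# Three (m, n) pairs that share the same offset m - n = 3.
scores = [dot(f(q, m, theta), f(k, n, theta)) for m, n in [(3, 0), (10, 7), (103, 100)]]
assert all(math.isclose(s, scores[0], abs_tol=1e-9) for s in scores)

# A different offset (m - n = 5) gives a different score for these vectors.
other = dot(f(q, 5, theta), f(k, 0, theta))
assert not math.isclose(other, scores[0], abs_tol=1e-9)
```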
Extending to higher dimensions (as used in practice)
The RoPE rotation matrix is block-diagonal:
$$
R_m = \begin{bmatrix}R_{m\theta_1} & 0 & \cdots & 0 \\ 0 & R_{m\theta_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & R_{m\theta_{d/2}}\end{bmatrix}
$$
where each small block is:
$$
R_{m\theta_i}=\begin{bmatrix}\cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i)\end{bmatrix}
$$
Therefore:
$$
\begin{align*}
f(q,m)&= R_m q = \begin{bmatrix}R_{m\theta_1} & 0 & \cdots & 0 \\ 0 & R_{m\theta_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & R_{m\theta_{d/2}}\end{bmatrix} \begin{bmatrix}q_0\\q_1\\\vdots\\ q_{d-1}\end{bmatrix} \\
&= \begin{bmatrix}\cos(m\theta_1) & -\sin(m\theta_1) & 0 & 0 & \cdots & 0 & 0 \\ \sin(m\theta_1) & \cos(m\theta_1) & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos(m\theta_2) & -\sin(m\theta_2) & \cdots & 0 & 0 \\ 0 & 0 & \sin(m\theta_2) & \cos(m\theta_2) & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & 0 & 0 & \cdots & \cos(m\theta_{d/2}) & -\sin(m\theta_{d/2}) \\ 0 & 0 & 0 & 0 & \cdots & \sin(m\theta_{d/2}) & \cos(m\theta_{d/2})\end{bmatrix}\begin{bmatrix}q_0\\q_1\\q_2\\q_3\\\vdots\\ q_{d-2}\\ q_{d-1}\end{bmatrix} \\
&=\begin{bmatrix}q_0\cos(m\theta_1)-q_1\sin(m\theta_1) \\ q_0\sin(m\theta_1)+q_1\cos(m\theta_1) \\ q_2\cos(m\theta_2)-q_3\sin(m\theta_2) \\ q_2\sin(m\theta_2)+q_3\cos(m\theta_2) \\ \vdots \\ q_{d-2}\cos(m\theta_{d/2})-q_{d-1}\sin(m\theta_{d/2}) \\ q_{d-2}\sin(m\theta_{d/2})+q_{d-1}\cos(m\theta_{d/2})\end{bmatrix} =
\begin{bmatrix}q_0\cos(m\theta_1)-q_1\sin(m\theta_1) \\ q_1\cos(m\theta_1)+q_0\sin(m\theta_1) \\ q_2\cos(m\theta_2)-q_3\sin(m\theta_2) \\ q_3\cos(m\theta_2)+q_2\sin(m\theta_2) \\ \vdots \\ q_{d-2}\cos(m\theta_{d/2})-q_{d-1}\sin(m\theta_{d/2}) \\ q_{d-1}\cos(m\theta_{d/2})+q_{d-2}\sin(m\theta_{d/2})\end{bmatrix} \\
&=\begin{bmatrix}q_0\cos(m\theta_1) \\ q_1\cos(m\theta_1) \\ q_2\cos(m\theta_2) \\ q_3\cos(m\theta_2) \\ \vdots \\ q_{d-2}\cos(m\theta_{d/2}) \\ q_{d-1}\cos(m\theta_{d/2})\end{bmatrix} +
\begin{bmatrix}-q_1\sin(m\theta_1) \\ q_0\sin(m\theta_1) \\ -q_3\sin(m\theta_2) \\ q_2\sin(m\theta_2) \\ \vdots \\ -q_{d-1}\sin(m\theta_{d/2}) \\ q_{d-2}\sin(m\theta_{d/2})\end{bmatrix}
\end{align*}
$$
It follows that $\langle f(q,m),f(k,n)\rangle=q^TR_{n-m}k$.
Here $\theta_i$ is defined as follows, for $i=1,2,\cdots,d/2$:
$$
\theta_i = \frac{1}{10000^{\frac{i-1}{d/2}}} = 10000^{-\frac{i-1}{d/2}}
$$
$$
\begin{align*}
\mathbf{\theta} &= \begin{bmatrix}10000^{-\frac{0}{d/2}}, 10000^{-\frac{1}{d/2}},\cdots,10000^{-\frac{(d/2)-1}{d/2}}\end{bmatrix} \\
&=\begin{bmatrix}10000^{-\frac{0}{d}} & 10000^{-\frac{2}{d}} & \cdots & 10000^{-\frac{d-2}{d}}\end{bmatrix}
\end{align*}
$$
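To make the frequency schedule concrete, here is a minimal sketch that evaluates $\theta_i$ for a small dimension and checks that the two exponent forms in the formula agree. `rope_thetas` is a hypothetical helper name, and `d = 8` is chosen only for illustration.

```python
import math


def rope_thetas(d, base=10000.0):
    """Return [theta_1, ..., theta_{d/2}] with theta_i = base^(-(i-1)/(d/2))."""
    assert d % 2 == 0
    return [base ** (-(i - 1) / (d / 2)) for i in range(1, d // 2 + 1)]


thetas = rope_thetas(8)

# theta_1 = 1, and the sequence is geometric with ratio base^(-2/d).
assert thetas[0] == 1.0
ratio = 10000.0 ** (-2 / 8)
assert all(math.isclose(thetas[i + 1], thetas[i] * ratio) for i in range(len(thetas) - 1))

# The two exponent forms agree: (i-1)/(d/2) == 2(i-1)/d.
alt = [10000.0 ** (-(2 * j) / 8) for j in range(4)]
assert all(math.isclose(a, b) for a, b in zip(thetas, alt))

# Wavelength (in positions) of each pair is 2*pi / theta_i and grows with i.
wavelengths = [2 * math.pi / t for t in thetas]
assert wavelengths == sorted(wavelengths)
```

The first pairs therefore rotate quickly and resolve nearby positions, while the last pairs rotate slowly and vary over much longer distances.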
```python
import torch
from typing import Tuple


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """
    [x0, x1, x2, x3, ...] -> [-x1, x0, -x3, x2, ...]
    """
    x_even = x[..., 0::2]  # [..., head_dim / 2]
    x_odd = x[..., 1::2]  # [..., head_dim / 2]
    x_rot = torch.stack((-x_odd, x_even), dim=-1).flatten(-2)
    return x_rot


def apply_rope(q: torch.Tensor, k: torch.Tensor, base: float = 10000.0) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Apply RoPE positional encoding to query and key.
    Inputs:
        q: Tensor, shape = [bs, seq_len, num_heads, head_dim]
        k: Tensor, shape = [bs, seq_len, num_heads, head_dim]
    Outputs:
        q_rope: Tensor, shape = [bs, seq_len, num_heads, head_dim]
        k_rope: Tensor, shape = [bs, seq_len, num_heads, head_dim]
    Constraints:
        1. q and k have the same shape
        2. head_dim must be even
        3. RoPE acts only on the last dimension, head_dim
    """
    bs, seq_len, num_heads, head_dim = q.shape
    assert head_dim % 2 == 0
    device = q.device
    dtype = q.dtype

    # Frequency for each 2D pair, shape: [head_dim / 2].
    # inv_freq is exactly [theta_1, theta_2, ..., theta_{d/2}].
    # Note the parentheses: the exponent is -(2j / head_dim).
    inv_freq = base ** (-torch.arange(0, head_dim, 2, device=device).float() / head_dim)

    # Position indices, shape: [seq_len].
    position_ids = torch.arange(seq_len, device=device).float()

    # Rotation angle for every position and every frequency.
    #   position_ids: [seq_len]
    #   inv_freq:     [head_dim / 2]
    #   freqs:        [seq_len, head_dim / 2]
    # Row m of freqs holds m*theta_1, m*theta_2, ..., m*theta_{d/2},
    # for m = 0, 1, ..., seq_len - 1.
    position_ids = position_ids.unsqueeze(-1)  # [seq_len, 1]
    inv_freq = inv_freq.unsqueeze(0)  # [1, head_dim / 2]
    freqs = position_ids * inv_freq  # [seq_len, head_dim / 2]

    # Each frequency covers the two dimensions of one 2D pair, so repeat each angle twice.
    # shape: [seq_len, head_dim]
    #   [[a, b, c],        [[a, a, b, b, c, c],
    #    [d, e, f]]   ->    [d, d, e, e, f, f]]
    # Row m: m*theta_1, m*theta_1, m*theta_2, m*theta_2, ..., m*theta_{d/2}, m*theta_{d/2}
    freqs = torch.repeat_interleave(freqs, repeats=2, dim=-1)

    # Build cos / sin and broadcast them to the shape of q / k.
    #   original: [seq_len, head_dim]
    #   target:   [1, seq_len, 1, head_dim]
    cos = freqs.cos()[None, :, None, :].to(dtype)
    sin = freqs.sin()[None, :, None, :].to(dtype)

    # Apply RoPE using the 2D rotation formula
    #   [x0', x1'] = [x0 * cos - x1 * sin, x0 * sin + x1 * cos]
    # q              = [q_0, q_1, q_2, q_3, ..., q_{d-2}, q_{d-1}]
    # rotate_half(q) = [-q_1, q_0, -q_3, q_2, ..., -q_{d-1}, q_{d-2}]
    # cos = [cos(m*theta_1), cos(m*theta_1), ..., cos(m*theta_{d/2}), cos(m*theta_{d/2})]
    # sin = [sin(m*theta_1), sin(m*theta_1), ..., sin(m*theta_{d/2}), sin(m*theta_{d/2})]
    # q_rope = [q_0*cos(m*theta_1) - q_1*sin(m*theta_1), ...]
    q_rope = q * cos + rotate_half(q) * sin
    k_rope = k * cos + rotate_half(k) * sin
    return q_rope, k_rope
```
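The identity behind `q * cos + rotate_half(q) * sin` can be checked without PyTorch. The following standard-library sketch compares that form against an explicit pair-by-pair rotation, and then checks the relative-position property in the multi-dimensional case. The helper names `rope_pairwise` and `rope_rotate_half` and all numeric values are assumptions made for illustration.

```python
import math


def rope_pairwise(x, pos, thetas):
    """Rotate each pair (x[2i], x[2i+1]) by the angle pos * thetas[i]."""
    out = []
    for i, t in enumerate(thetas):
        c, s = math.cos(pos * t), math.sin(pos * t)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out


def rope_rotate_half(x, pos, thetas):
    """Same result written as x * cos + rotate_half(x) * sin."""
    rot = []
    for i in range(0, len(x), 2):
        rot += [-x[i + 1], x[i]]  # [-x1, x0, -x3, x2, ...]
    # Each angle is repeated twice, once per element of its pair.
    cos = [math.cos(pos * t) for t in thetas for _ in range(2)]
    sin = [math.sin(pos * t) for t in thetas for _ in range(2)]
    return [a * c + r * s for a, r, c, s in zip(x, rot, cos, sin)]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


d = 4
thetas = [10000.0 ** (-2 * j / d) for j in range(d // 2)]
q = [0.5, -1.0, 2.0, 0.25]  # illustrative values
k = [1.5, 0.3, -0.7, 1.1]

# The two formulations agree element by element.
assert all(math.isclose(a, b, abs_tol=1e-12)
           for a, b in zip(rope_pairwise(q, 7, thetas), rope_rotate_half(q, 7, thetas)))

# Scores depend only on the offset between positions (here m - n = 2).
s1 = dot(rope_pairwise(q, 2, thetas), rope_pairwise(k, 0, thetas))
s2 = dot(rope_pairwise(q, 52, thetas), rope_pairwise(k, 50, thetas))
assert math.isclose(s1, s2, abs_tol=1e-9)
```

The elementwise form matters in practice because it avoids materialising the $d\times d$ block-diagonal matrix; only two vectors of cosines and sines are needed per position.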
2.2 mhsa
Let the input be:
$$
X\in\mathbb R^{B\times T\times d}
$$
where $B=\text{batch size}, T=\text{seq len}, d=d_{model}$.
Let the number of heads be $h$; each head has dimension $d_h=\frac{d}{h}$.
- Linear projections to obtain Q, K, V
$$
\begin{aligned}
Q &= XW_Q \in \mathbb R^{B\times T\times d}, \\
K &= XW_K \in \mathbb R^{B\times T\times d}, \\
V &= XW_V \in \mathbb R^{B\times T\times d}.
\end{aligned}
$$
where $W_Q,W_K,W_V\in\mathbb R^{d\times d}$.
- Split into multiple heads
Split the last dimension into $h$ heads:
$$
Q, K, V\in\mathbb R^{B\times T\times h\times d_h}
$$
After a transpose:
$$
Q, K, V\in\mathbb R^{B\times h\times T\times d_h}
$$
- Compute the attention scores
For each batch element and each head, compute:
$$
S=\frac{QK^T}{\sqrt{d_h}}
$$
where $K^T$ denotes the transpose of the last two dimensions:
$$
K^T\in\mathbb R^{B\times h\times d_h\times T}
$$
Therefore $S\in\mathbb R^{B\times h\times T\times T}$.
- Softmax to obtain the attention weights
Apply softmax over the last dimension:
$$
A=\text{softmax}(S, dim=-1)\in\mathbb R^{B\times h\times T\times T}
$$
- Weighted sum of $V$
$$
\begin{aligned}
O_{\text{head}} &= AV, \\
A &\in \mathbb R^{B\times h\times T\times T}, \\
V &\in \mathbb R^{B\times h\times T\times d_h}.
\end{aligned}
$$
So $O_{head}\in\mathbb R^{B\times h\times T\times d_h}$.
- Merge the heads and apply a linear projection
First transpose $O_{\text{head}}$ and merge its last two dimensions:
$$
\begin{aligned}
O_{\text{concat}} &\in \mathbb R^{B\times T\times (h d_h)}
= \mathbb R^{B\times T\times d}, \\
O &= O_{\text{concat}}W_O \in \mathbb R^{B\times T\times d}.
\end{aligned}
$$
where $W_O\in\mathbb R^{d\times d}$.
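The core of the steps above, for one batch element and one head, can be sketched in plain Python as follows. This is a dependency-free illustration of the score, softmax, and weighted-sum steps only (no projections, no head splitting); the function names and the toy values of `Q`, `K`, `V` are assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    mx = max(row)
    exps = [math.exp(v - mx) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single head: Q, K, V are T x d_h nested lists; returns (A, O)."""
    d_h = len(Q[0])
    # S[i][j] = <Q[i], K[j]> / sqrt(d_h), shape T x T
    S = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_h) for kj in K] for qi in Q]
    A = [softmax(row) for row in S]  # each row sums to 1
    # O[i] = sum_j A[i][j] * V[j], shape T x d_h
    O = [[sum(A[i][j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
         for i in range(len(Q))]
    return A, O


# T = 3 tokens, d_h = 2 (illustrative values)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

A, O = attention(Q, K, V)
assert len(A) == 3 and len(A[0]) == 3  # T x T
assert len(O) == 3 and len(O[0]) == 2  # T x d_h
assert all(math.isclose(sum(row), 1.0) for row in A)
```

Each output row is a convex combination of the rows of `V`, which is why the output keeps the shape $T\times d_h$ regardless of the sequence length.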
```python
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadSelfAttention(nn.Module):
    """
    Input:
        x: [B, T, d]
    Output:
        out: [B, T, d]
    where:
        B = batch size
        T = seq_len
        d = d_model
        h = num_heads
        d_h = d // h
    """

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_h = d_model // num_heads
        # Produce Q, K, V with a single projection
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        # Output projection O
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        B, T, d = x.shape
        # x: [B, T, d]
        # qkv: [B, T, 3d]
        qkv = self.qkv_proj(x)
        # qkv: [B, T, 3, h, d_h]
        qkv = qkv.view(B, T, 3, self.num_heads, self.d_h)
        # qkv: [3, B, h, T, d_h]
        qkv = qkv.permute(2, 0, 3, 1, 4)
        # q, k, v: [B, h, T, d_h]
        q, k, v = qkv[0], qkv[1], qkv[2]
        # scores: [B, h, T, T]
        scores = q @ k.transpose(-2, -1)
        scores = scores / (self.d_h ** 0.5)
        # Optional mask (causal mask or padding mask); positions where mask == 0 are hidden
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        # attn: [B, h, T, T]
        attn = F.softmax(scores, dim=-1)
        # out: [B, h, T, d_h]
        out = attn @ v
        # out: [B, T, h, d_h]
        # .contiguous() lays the tensor out contiguously in memory. .permute and
        # .transpose only change the access order (strides), not the memory layout,
        # and .view needs a compatible layout.
        # .reshape would copy automatically if the memory is not contiguous.
        out = out.transpose(1, 2).contiguous()
        # out: [B, T, d]
        out = out.view(B, T, d)
        # out: [B, T, d]
        out = self.out_proj(out)
        return out
```
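The `mask` argument above follows the convention that entries equal to 0 are hidden. As a dependency-free illustration of what a causal mask does under that convention, the sketch below builds a lower-triangular mask and applies a masked softmax row by row. `causal_mask` and `masked_softmax` are hypothetical helper names, and the all-zero scores are chosen only to make the effect easy to read.

```python
import math


def causal_mask(T):
    """Lower-triangular T x T mask: 1 = may attend, 0 = masked out."""
    return [[1 if j <= i else 0 for j in range(T)] for i in range(T)]


def masked_softmax(scores, mask):
    """Row-wise softmax after setting masked entries to -inf."""
    out = []
    for s_row, m_row in zip(scores, mask):
        filled = [s if m else float("-inf") for s, m in zip(s_row, m_row)]
        mx = max(filled)
        exps = [math.exp(v - mx) for v in filled]  # exp(-inf) == 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


T = 3
mask = causal_mask(T)
assert mask == [[1, 0, 0], [1, 1, 0], [1, 1, 1]]

scores = [[0.0] * T for _ in range(T)]  # equal scores, to make the effect visible
attn = masked_softmax(scores, mask)
assert attn[0] == [1.0, 0.0, 0.0]
assert attn[1] == [0.5, 0.5, 0.0]
assert all(math.isclose(sum(row), 1.0) for row in attn)
```

Token `i` only receives weight from positions `j <= i`. This sketch assumes every row keeps at least one unmasked entry, which a causal mask guarantees through its diagonal.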
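2.3 kvcache
Store the $k,v$ vectors computed in earlier inference steps, so that they can be reused, rather than recomputed, when computing the newest output token.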
```python
from typing import Optional, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadSelfAttentionWithKVCache(nn.Module):
    """
    Input:
        x: [B, T, d]
    Output:
        out: [B, T, d]
        new_k: [B, h, past_len + T, d_h]
        new_v: [B, h, past_len + T, d_h]
    where:
        B = batch size
        T = length of the current input
            prefill stage: T = prompt_len
            decode stage:  T = 1
        d = d_model
        h = num_heads
        d_h = d // h
    """

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_h = self.d_model // self.num_heads
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(
        self,
        x: torch.Tensor,
        mask: Optional[torch.Tensor] = None,
        past_k: Optional[torch.Tensor] = None,
        past_v: Optional[torch.Tensor] = None,
        use_cache: bool = True,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[torch.Tensor]]:
        """
        x: [B, T, d]
        past_k: None or [B, h, past_len, d_h]
        past_v: None or [B, h, past_len, d_h]
        return:
            out: [B, T, d]
            new_k: [B, h, past_len + T, d_h] (None if use_cache is False)
            new_v: [B, h, past_len + T, d_h] (None if use_cache is False)
        """
        B, T, d = x.shape
        # qkv: [B, T, 3d]
        qkv = self.qkv_proj(x)
        # qkv: [3, B, h, T, d_h]
        qkv = qkv.view(B, T, 3, self.num_heads, self.d_h).permute(2, 0, 3, 1, 4)
        # q, k, v: [B, h, T, d_h]
        q, k, v = qkv[0], qkv[1], qkv[2]
        # Prepend the cached keys / values along the sequence dimension
        if past_k is not None and past_v is not None:
            k = torch.cat([past_k, k], dim=2)
            v = torch.cat([past_v, v], dim=2)
        total_len = k.size(2)  # past_len + T
        # Keep k, v for the next decode step
        new_k = k if use_cache else None
        new_v = v if use_cache else None
        # scores: [B, h, T, total_len]
        scores = q @ k.transpose(-2, -1)
        scores = scores / (self.d_h ** 0.5)
        # mask, if given, must broadcast to [B, h, T, total_len]
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        # out: [B, h, T, d_h]
        out = attn @ v
        # out: [B, T, h, d_h]
        out = out.transpose(1, 2).contiguous()
        # out: [B, T, d]
        out = out.view(B, T, d)
        out = self.out_proj(out)
        return out, new_k, new_v
```
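To see why reusing cached keys and values does not change the result, the sketch below compares two ways of decoding a short sequence for a single head: attending over the full prefix at every step, and appending only the newest `k`, `v` to a growing cache. It uses only the standard library and fixed per-token vectors in place of the learned projections; all names and values are illustrative assumptions.

```python
import math


def softmax(row):
    mx = max(row)
    exps = [math.exp(v - mx) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """One query vector attending over lists of key / value vectors."""
    d_h = len(q)
    w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_h) for k in keys])
    return [sum(w[j] * values[j][c] for j in range(len(values))) for c in range(len(values[0]))]


# Per-token q, k, v vectors for a 4-token sequence (illustrative values).
qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
ks = [[0.2, 0.1], [0.9, -0.3], [-0.5, 0.4], [0.3, 0.8]]
vs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

# Without a cache: at step t, attend over the whole prefix 0..t.
full = [attend(qs[t], ks[: t + 1], vs[: t + 1]) for t in range(len(qs))]

# With a cache: append only the new k, v at each step and reuse the rest.
cache_k, cache_v, cached = [], [], []
for q, k, v in zip(qs, ks, vs):
    cache_k.append(k)
    cache_v.append(v)
    cached.append(attend(q, cache_k, cache_v))

assert all(math.isclose(a, b) for fa, ca in zip(full, cached) for a, b in zip(fa, ca))
assert len(cache_k) == len(qs)
```

In a causal model the keys and values of earlier tokens do not depend on later tokens, so each decode step only has to compute `q`, `k`, `v` for the newest token.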
2.4 ffn
In a Transformer, FFN usually refers to the feed-forward network, also called the MLP layer.
Within each Transformer block, the usual structure is:
```text
x
  -> Multi-Head Self-Attention
  -> Add & Norm
  -> FFN / MLP
  -> Add & Norm
```
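Attention handles the exchange of information between tokens; the FFN applies a nonlinear transformation to each token's own representation and enriches its features.
The core formula: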
$$
\text{FFN}(x) = W_2\sigma(W_1x+b_1)+b_2
$$
Suppose the hidden state of some token is
$$
x\in\mathbb R^d
$$
The first linear transformation:
$$
h = W_1x+b_1
$$
where $W_1\in\mathbb R^{d_{ff}\times d},\ b_1\in\mathbb R^{d_{ff}}$.
So:
$$
h\in\mathbb R^{d_{ff}}
$$
Typically:
$$
d_{ff} = 4d
$$
Then apply the activation function:
$$
\tilde{h}=\sigma(h)
$$
Then apply the second linear transformation:
$$
y = W_2\tilde{h}+b_2
$$
where $W_2\in\mathbb R^{d\times d_{ff}},\ b_2\in\mathbb R^d$.
So:
$$
y\in\mathbb R^d
$$
Putting it all together:
$$
\begin{align*}
x&\in\mathbb R^d \\
x\rightarrow W_1x+b_1&\in\mathbb R^{d_{ff}} \\
\rightarrow\sigma(W_1x+b_1)&\in\mathbb R^{d_{ff}} \\
\rightarrow W_2\sigma(W_1x+b_1)+b_2&\in\mathbb R^d
\end{align*}
$$
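The shape chain above can be traced with a tiny single-token example. The sketch below is plain Python with ReLU standing in for $\sigma$; the weights are hand-picked for illustration (not learned), and `d = 2`, `d_ff = 4` are assumed sizes.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(h):
    return [max(0.0, v) for v in h]


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = W2 * relu(W1 * x + b1) + b2 for a single token vector x."""
    h = [a + b for a, b in zip(matvec(W1, x), b1)]     # R^{d_ff}
    h = relu(h)                                        # R^{d_ff}
    return [a + b for a, b in zip(matvec(W2, h), b2)]  # R^{d}


# d = 2, d_ff = 4 (illustrative values)
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # d_ff x d
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]        # d x d_ff
b2 = [0.0, 0.0]

y = ffn([3.0, -2.0], W1, b1, W2, b2)
assert len(y) == 2
assert y == [3.0, 2.0]  # with these weights the FFN computes |x| elementwise
```

With this particular choice of weights the network computes the elementwise absolute value, a function that no single linear layer can represent.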
FFN over the whole sequence:
$$
X\in\mathbb R^{B\times T\times d}
$$
The shapes change in the same way:
$$
[B,T,d]\rightarrow [B,T,d_{ff}]\rightarrow [B,T,d]
$$
Without the activation function, the FFN becomes:
$$
\begin{align*}
\text{FFN}(x)&=W_2(W_1x+b_1)+b_2 \\
&=W_2W_1x+W_2b_1+b_2
\end{align*}
$$
This is still just a single linear layer, so a nonlinearity is required for the model to express complex nonlinear functions.
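The collapse into a single linear layer can be confirmed numerically. In the sketch below (standard library only; all weights are arbitrary illustrative values), two stacked linear layers without an activation are compared against the single layer with $W=W_2W_1$ and $b=W_2b_1+b_2$.

```python
import math


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def matmul(A, B):
    """Matrix product of A (n x m) and B (m x p), both lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def add(a, b):
    return [u + v for u, v in zip(a, b)]


# d = 2, d_ff = 3 (illustrative values)
W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]  # d_ff x d
b1 = [0.1, -0.2, 0.3]
W2 = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.5]]    # d x d_ff
b2 = [0.5, -0.5]
x = [0.7, -1.3]

# Two linear layers applied in sequence, with no activation in between.
two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The single equivalent layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

assert all(math.isclose(u, v, abs_tol=1e-12) for u, v in zip(two_layer, one_layer))
```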
More on activation functions
The original Transformer paper used ReLU:
$$
\text{ReLU}(x)=\max(0,x)
$$
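Its advantages are simplicity and speed; its drawback is that the negative region is set directly to 0, which can cause the dead-neuron problem.
GELU is common in BERT and the GPT series; it can be viewed as a smoother ReLU: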
$$
\text{GELU}(x)=x\Phi(x)
$$
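where $\Phi(x)$ is the cumulative distribution function (CDF) of the standard normal distribution. Overall, with GELU, the larger $x$ is, the more easily it passes through, and the smaller $x$ is, the more it is suppressed; unlike ReLU, it is not hard-clipped to 0 but gated smoothly.
SiLU, also called Swish, is defined as: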
$$
\text{SiLU}(x)=x\cdot\text{sigmoid}(x)
$$
where:
$$
\text{sigmoid}(x)=\frac{1}{1+e^{-x}}
$$
Overall, it is also a smooth activation function.
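For a side-by-side comparison, the three activations can be written directly from their formulas. This is a minimal standard-library sketch; it uses the exact form of GELU, with $\Phi$ expressed through `math.erf`, rather than any approximation.

```python
import math


def relu(x):
    return max(0.0, x)


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF written via erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    """SiLU / Swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


# All three are 0 at x = 0 and close to the identity for large positive x.
assert relu(0.0) == 0.0 and gelu(0.0) == 0.0 and silu(0.0) == 0.0
assert math.isclose(gelu(10.0), 10.0, rel_tol=1e-6)
assert math.isclose(silu(10.0), 10.0, rel_tol=1e-3)

# For negative inputs ReLU is exactly 0, while GELU and SiLU are small but nonzero.
assert relu(-1.0) == 0.0
assert -0.2 < gelu(-1.0) < 0.0
assert -0.3 < silu(-1.0) < 0.0
```

The nonzero values on the negative side are what "smooth gating" means here: small negative inputs are damped rather than cut off.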
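- From the plain FFN to GLU (Gated Linear Unit) and SwiGLU
Many current large models, such as the LLaMA family, do not use the plain two-layer FFN but a GLU-style structure, most notably SwiGLU.
The plain FFN (biases omitted) is: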
$$
\text{FFN}(x)=W_2\sigma(W_1x)
$$
The GLU-style FFN is:
$$
\text{GLU-FFN}(x)=W_{down}(\sigma(W_{gate}x)\odot W_{up}x)
$$
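The key difference is the extra gate branch, which acts as a gating signal; $\odot$ denotes elementwise multiplication.
Here:
$$
W_{gate}, W_{up}\in\mathbb R^{d_{ff}\times d},\quad W_{down}\in\mathbb R^{d\times d_{ff}}
$$
SwiGLU is a variant of GLU that replaces the activation of the gate branch with SiLU: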
$$
\text{SwiGLU}(x)=W_{down}(\text{SiLU}(W_{gate}x)\odot W_{up}x)
$$
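In other words:
- The gate branch first uses SiLU to produce a smooth gating signal.
- This is multiplied elementwise with the candidate features produced by the up branch.
- Finally, the down projection maps back to d_model.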
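The three steps above can be sketched for a single token vector as follows. This is a plain-Python illustration without biases; the function name `swiglu_ffn`, the sizes `d = 2`, `d_ff = 3`, and the weight values are assumptions made for the example, not trained parameters.

```python
import math


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def silu(v):
    return v / (1.0 + math.exp(-v))


def swiglu_ffn(x, W_gate, W_up, W_down):
    """W_down (SiLU(W_gate x) * W_up x), elementwise product, for one token vector x."""
    gate = [silu(v) for v in matvec(W_gate, x)]  # R^{d_ff}, smooth gating signal
    up = matvec(W_up, x)                         # R^{d_ff}, candidate features
    hidden = [g * u for g, u in zip(gate, up)]   # elementwise product
    return matvec(W_down, hidden)                # back to R^{d}


# d = 2, d_ff = 3 (illustrative values)
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # d_ff x d
W_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]   # d_ff x d
W_down = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]   # d x d_ff

y = swiglu_ffn([1.0, -1.0], W_gate, W_up, W_down)
assert len(y) == 2

# A zero input gives a zero output, since SiLU(0) = 0 and there are no biases.
assert swiglu_ffn([0.0, 0.0], W_gate, W_up, W_down) == [0.0, 0.0]
```

Compared with the plain FFN, this structure has three weight matrices instead of two, because the gate and up branches each need their own projection.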
Code
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FFN(nn.Module):
    """
    x: [B, T, d]
    out: [B, T, d]
    """

    def __init__(self, d: int, d_ff: int):
        super().__init__()
        self.up_proj = nn.Linear(d, d_ff)
        self.down_proj = nn.Linear(d_ff, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # [B, T, d] -> [B, T, d_ff]
        hidden = self.up_proj(x)
        hidden = F.gelu(hidden)
        # [B, T, d_ff] -> [B, T, d]
        out = self.down_proj(hidden)
        return out
```
2.5 gqa
2.6 grpo ppo dpo dapo gspo
2.7 API calls
2.8 sampling topp topk, softmax
2.9 cross entropy
2.10 kl divergence