366 lines
16 KiB
Markdown
366 lines
16 KiB
Markdown
## 基于Triton的ROCm 不同后端实现优化,基本实现vllm后端正常推理,以及pipeline后端中第一步layout用的DocLayout-YOLO
|
||
|
||
**已有完整python vllm和mineru环境直接跳转第五步!!!**
|
||
**其他GPU执行问题可以参考,先prof查看定位找到哪个算子问题,然后triton后端实现即可**
|
||
测试了一下,基本和MinerU官网效果差不多,用AMD的人也不是很多,就在评论区分享给大家了
|
||
|
||
### 1.结果介绍
|
||
**补充一个200页的PDF python编程书测试一下速度,可以到1.99it/s:**
|
||
Two Step Extraction: 100%|████████████████████████████████████████| 200/200 [01:40<00:00, 1.99it/s]
|
||
|
||
**下面为之前14学术论文测试结果:**
|
||
7900xtx mineru-gradio --server-name 0.0.0.0 --server-port 7860 --enable-vllm-engine true 速度大概为**1.6-1.8s/it**,没有仔细测试,简单试了两个文档。第二种矩阵乘法代替原来的dots点乘可以进一步提速到1.3s/it,优化后的主要算子耗时在hipblast(这个没法提升了)和vllm triton后端,各占25%耗时吧,vllm tirion后端这个这个只能等官方优化了。。。。
|
||
doclayout-yolo的layout速度从原来的1.6it/s提高到15it/s,注意需要缓存一下输入的pdf尺寸后,triton必须要缓存尺寸没办法。主要是为了保留模型输入输出接口,最小代码改动。
|
||
采用-b vlm-vllm-engine模式举个例子
|
||
|
||
---
|
||
**测试结果为优化为5d矩阵乘代替原来的点积结果:**
|
||
2025-10-05 15:45:12.985 | INFO | mineru.backend.vlm.vlm_analyze:get_model:128 - get vllm-engine predictor cost: 18.45s
|
||
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████| 14/14 [00:01<00:00, 12.20it/s]
|
||
Processed prompts: 100%|█████████████████████| 14/14 [00:08<00:00, 1.56it/s, est. speed input: 2174.18 toks/s, output: 791.87 toks/s]
|
||
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████| 278/278 [00:00<00:00, 323.03it/s]
|
||
Processed prompts: 100%|██████████████████| 278/278 [00:07<00:00, 37.63it/s, est. speed input: 5264.66 toks/s, output: 2733.31 toks/s]
|
||
|
||
mineru-gradio --server-name 0.0.0.0 --server-port 7860 --enable-vllm-engine true测试:
|
||
2025-10-05 15:46:55.953 | WARNING | mineru.cli.common:convert_pdf_bytes_to_bytes_by_pypdfium2:54 - end_page_id is out of range, use pdf_docs length
|
||
Two Step Extraction: 100%|████████████████████████████████████████████████████████████████████████████| 14/14 [00:18<00:00, 1.30s/it]
|
||
|
||
---
|
||
|
||
### 2.原因介绍
|
||
AMD RDNA使用vllm后端有严重的性能问题,原因是因为vllm的**qwen2_vl.py**中有一个算子在rocm kernel上没有对应的实现,导致性能出现严重的卷积计算回退,一次执行花了12s,。。。。。。。。一言难尽。即**MIOpen 库中缺少模型中特定 Conv3d(bfloat16) 的优化内核**。
|
||
DocLayout-YOLO的**g2l_crm.py**空洞卷积也是这个问题,专业的CDNA MI210也没解决这个问题
|
||
正好一起处理了。
|
||
|
||
---
|
||
|
||
### 3.环境介绍
|
||
System: Ubuntu 24.04.3 Kernel: Linux 6.14.0-33-generic ROCm version: 7.0.1
|
||
python环境:
|
||
python 3.12
|
||
pytorch-triton-rocm 3.5.0+gitbbb06c03
|
||
torch 2.10.0.dev20251001+rocm7.0
|
||
torchvision 0.25.0.dev20251003+rocm7.0
|
||
vllm 0.11.0rc2.dev198+g736fbf4c8.rocm701
|
||
不同版本无所谓,处理方法是一样的。
|
||
|
||
---
|
||
|
||
### 4.前置环境安装
|
||
```
|
||
uv venv --python python3.12
|
||
source .venv/bin/activate
|
||
uv pip install --pre torch torchvision -i https://pypi.tuna.tsinghua.edu.cn/simple/ --extra-index-url https://download.pytorch.org/whl/nightly/rocm7.0
|
||
uv pip install pip
|
||
# 避免覆盖我们本地的pytorch,改用pip而没有继续使用uv pip
|
||
pip install -U "mineru[core]" -i https://pypi.mirrors.ustc.edu.cn/simple/
|
||
```
|
||
vllm 安装参考官方手册[Vllm](https://docs.vllm.com.cn/en/latest/getting_started/installation/gpu.html#amd-rocm)
|
||
```
|
||
#手动安装aiter,vllm,amd-smi等,自行找一个位置clone,然后进入该目录吧
|
||
git clone --recursive https://github.com/ROCm/aiter.git
|
||
cd aiter
|
||
git submodule sync; git submodule update --init --recursive
|
||
python setup.py develop
|
||
cd ..
|
||
git clone https://github.com/vllm-project/vllm.git
|
||
cd vllm/
|
||
cp -r /opt/rocm/share/amd_smi ~/Pytorch/vllm/
|
||
pip install amd_smi/
|
||
pip install --upgrade numba \
|
||
scipy \
|
||
huggingface-hub[cli,hf_transfer] \
|
||
setuptools_scm
|
||
pip install -r requirements/rocm.txt
|
||
export PYTORCH_ROCM_ARCH="gfx1100" #根据自己的GPU架构 rocminfo | grep gfx
|
||
python setup.py develop
|
||
```
|
||
---
|
||
|
||
### 5.vllm中关键triton算子添加
|
||
#### 这里我给出两种解决方法,第一种解决方法就是前面提到的优化到1.5到1.8s/it,第二种方法有手动优化算子到矩阵乘法,7900xtx肯定适用,大概1.3s/it,其他AMD GPU相对方案一也有提速,但是不一定是最佳速度实现,里面的手动部分可能需要微调。
|
||
**注意pip把triton 后端的flash_attn卸载了,搞了半天各种尝试还是报错,问题比较大,直接不用就行了**
|
||
```
|
||
#定位自己vllm位置XXX
|
||
pip show vllm
|
||
```
|
||
**关键更改**
|
||
XXX/vllm/model_executor/models/qwen2_vl.py文件:
|
||
**1.qwen2_vl.py文件33行下增加from .qwen2_vl_vision_kernels import triton_conv3d_patchify**
|
||
```
|
||
from collections.abc import Iterable, Mapping, Sequence
|
||
from functools import partial
|
||
from typing import Annotated, Any, Callable, Literal, Optional, Union
|
||
|
||
import torch
|
||
import torch.nn as nn
|
||
import torch.nn.functional as F
|
||
from .qwen2_vl_vision_kernels import triton_conv3d_patchify
|
||
```
|
||
**接下来分为方案一(2.1和3.1)和方案二(2.2和3.2),选取一种实现即可**
|
||
|
||
---
|
||
**方案1**
|
||
**2.1qwen2_vl.py文件498行class Qwen2VisionPatchEmbed(nn.Module),PS.就是这玩意AMD没有现成的内核算子导致回退**
|
||
```
|
||
class Qwen2VisionPatchEmbed(nn.Module):
|
||
|
||
def __init__(
|
||
self,
|
||
patch_size: int = 14,
|
||
temporal_patch_size: int = 2,
|
||
in_channels: int = 3,
|
||
embed_dim: int = 1152,
|
||
) -> None:
|
||
super().__init__()
|
||
self.patch_size = patch_size
|
||
self.temporal_patch_size = temporal_patch_size
|
||
self.embed_dim = embed_dim
|
||
|
||
kernel_size = (temporal_patch_size, patch_size, patch_size)
|
||
self.proj = nn.Conv3d(in_channels,
|
||
embed_dim,
|
||
kernel_size=kernel_size,
|
||
stride=kernel_size,
|
||
bias=False)
|
||
def forward(self, x: torch.Tensor) -> torch.Tensor:
|
||
L, C = x.shape
|
||
x_reshaped = x.view(L, -1, self.temporal_patch_size, self.patch_size,
|
||
self.patch_size)
|
||
|
||
# Call your custom Triton kernel instead of self.proj
|
||
x_out = triton_conv3d_patchify(x_reshaped, self.proj.weight)
|
||
|
||
# The output of our kernel is already the correct shape [L, embed_dim]
|
||
return x_out
|
||
```
|
||
**3.1XXX/vllm/model_executor/models/目录下创建qwen2_vl_vision_kernels.py文件,用triton实现**
|
||
```
|
||
import torch
|
||
from vllm.triton_utils import tl, triton
|
||
|
||
@triton.jit
|
||
def _conv3d_patchify_kernel(
|
||
# Pointers to tensors
|
||
X, W, Y,
|
||
# Tensor dimensions
|
||
N, C_in, D_in, H_in, W_in,
|
||
C_out, KD, KH, KW,
|
||
# Stride and padding for memory access
|
||
stride_xn, stride_xc, stride_xd, stride_xh, stride_xw,
|
||
stride_wn, stride_wc, stride_wd, stride_wh, stride_ww,
|
||
stride_yn, stride_yc,
|
||
# Triton-specific metaparameters
|
||
BLOCK_SIZE: tl.constexpr,
|
||
):
|
||
"""
|
||
Triton kernel for a non-overlapping 3D patching convolution.
|
||
Each kernel instance computes one output value for one patch.
|
||
"""
|
||
# Get the program IDs for the N (patch) and C_out (output channel) dimensions
|
||
pid_n = tl.program_id(0) # The index of the patch we are processing
|
||
pid_cout = tl.program_id(1) # The index of the output channel we are computing
|
||
|
||
# --- Calculate memory pointers ---
|
||
# Pointer to the start of the current input patch
|
||
x_ptr = X + (pid_n * stride_xn)
|
||
# Pointer to the start of the current filter (weight)
|
||
w_ptr = W + (pid_cout * stride_wn)
|
||
# Pointer to where the output will be stored
|
||
y_ptr = Y + (pid_n * stride_yn + pid_cout * stride_yc)
|
||
|
||
# --- Perform the convolution (element-wise product and sum) ---
|
||
# This is a dot product between the flattened patch and the flattened filter.
|
||
accumulator = tl.zeros((BLOCK_SIZE,), dtype=tl.float32)
|
||
|
||
# Iterate over the elements of the patch/filter
|
||
for c_offset in range(0, C_in):
|
||
for d_offset in range(0, KD):
|
||
for h_offset in range(0, KH):
|
||
# Unrolled loop for the innermost dimension (width) for performance
|
||
for w_offset in range(0, KW, BLOCK_SIZE):
|
||
# Create masks to handle cases where KW is not a multiple of BLOCK_SIZE
|
||
w_range = w_offset + tl.arange(0, BLOCK_SIZE)
|
||
w_mask = w_range < KW
|
||
|
||
# Calculate offsets to load data
|
||
patch_offset = (c_offset * stride_xc + d_offset * stride_xd +
|
||
h_offset * stride_xh + w_range * stride_xw)
|
||
filter_offset = (c_offset * stride_wc + d_offset * stride_wd +
|
||
h_offset * stride_wh + w_range * stride_ww)
|
||
|
||
# Load patch and filter data, applying masks
|
||
patch_vals = tl.load(x_ptr + patch_offset, mask=w_mask, other=0.0)
|
||
filter_vals = tl.load(w_ptr + filter_offset, mask=w_mask, other=0.0)
|
||
|
||
# Multiply and accumulate
|
||
accumulator += patch_vals.to(tl.float32) * filter_vals.to(tl.float32)
|
||
|
||
# Sum the accumulator block and store the single output value
|
||
output_val = tl.sum(accumulator, axis=0)
|
||
tl.store(y_ptr, output_val)
|
||
|
||
|
||
def triton_conv3d_patchify(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
|
||
"""
|
||
Python wrapper for the 3D patching convolution Triton kernel.
|
||
"""
|
||
# Get tensor dimensions
|
||
N, C_in, D_in, H_in, W_in = x.shape
|
||
C_out, _, KD, KH, KW = weight.shape
|
||
|
||
# Create the output tensor
|
||
# The output of this specific conv is (N, C_out, 1, 1, 1), which we squeeze
|
||
Y = torch.empty((N, C_out), dtype=x.dtype, device=x.device)
|
||
|
||
# Define the grid for launching the Triton kernel
|
||
# Each kernel instance handles one patch (N) for one output channel (C_out)
|
||
grid = (N, C_out)
|
||
|
||
# Launch the kernel
|
||
# We pass all strides to make the kernel flexible
|
||
_conv3d_patchify_kernel[grid](
|
||
x, weight, Y,
|
||
N, C_in, D_in, H_in, W_in,
|
||
C_out, KD, KH, KW,
|
||
x.stride(0), x.stride(1), x.stride(2), x.stride(3), x.stride(4),
|
||
weight.stride(0), weight.stride(1), weight.stride(2), weight.stride(3), weight.stride(4),
|
||
Y.stride(0), Y.stride(1),
|
||
BLOCK_SIZE=16, # A reasonable default, can be tuned
|
||
)
|
||
|
||
return Y
|
||
```
|
||
---
|
||
**方案2**
|
||
**2.2qwen2_vl.py文件498行class Qwen2VisionPatchEmbed(nn.Module)函数,PS.就是这玩意AMD没有现成的内核算子导致回退,这里我们直接5D张量一步到位,改为矩阵乘法**
|
||
```
|
||
class Qwen2VisionPatchEmbed(nn.Module):
|
||
|
||
def __init__(
|
||
self,
|
||
patch_size: int = 14,
|
||
temporal_patch_size: int = 2,
|
||
in_channels: int = 3,
|
||
embed_dim: int = 1152,
|
||
) -> None:
|
||
super().__init__()
|
||
self.patch_size = patch_size
|
||
self.temporal_patch_size = temporal_patch_size
|
||
self.embed_dim = embed_dim
|
||
|
||
kernel_size = (temporal_patch_size, patch_size, patch_size)
|
||
|
||
self.proj = nn.Conv3d(in_channels,
|
||
embed_dim,
|
||
kernel_size=kernel_size,
|
||
stride=kernel_size,
|
||
bias=False)
|
||
|
||
def forward(self, x: torch.Tensor) -> torch.Tensor:
|
||
L, C = x.shape
|
||
x_reshaped_5d = x.view(L, -1, self.temporal_patch_size, self.patch_size,
|
||
self.patch_size)
|
||
|
||
return triton_conv3d_patchify(x_reshaped_5d, self.proj.weight)
|
||
```
|
||
**3.2XXX/vllm/model_executor/models/目录下创建qwen2_vl_vision_kernels.py文件,用triton实现**
|
||
```
|
||
import torch
|
||
from vllm.triton_utils import tl, triton
|
||
|
||
@triton.jit
|
||
def _conv_gemm_kernel(
|
||
A, B, C, M, N, K,
|
||
stride_am, stride_ak,
|
||
stride_bk, stride_bn,
|
||
stride_cm, stride_cn,
|
||
BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
|
||
):
|
||
pid_m = tl.program_id(0)
|
||
pid_n = tl.program_id(1)
|
||
offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
|
||
offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
|
||
offs_k = tl.arange(0, BLOCK_K)
|
||
a_ptrs = A + (offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak)
|
||
b_ptrs = B + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn)
|
||
accumulator = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
|
||
for k in range(0, K, BLOCK_K):
|
||
a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] < K), other=0.0)
|
||
b = tl.load(b_ptrs, mask=(offs_k[:, None] < K) & (offs_n[None, :] < N), other=0.0)
|
||
accumulator += tl.dot(a, b)
|
||
a_ptrs += BLOCK_K * stride_ak
|
||
b_ptrs += BLOCK_K * stride_bk
|
||
offs_k += BLOCK_K
|
||
c = accumulator.to(C.dtype.element_ty)
|
||
offs_cm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
|
||
offs_cn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
|
||
c_ptrs = C + stride_cm * offs_cm[:, None] + stride_cn * offs_cn[None, :]
|
||
c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)
|
||
tl.store(c_ptrs, c, mask=c_mask)
|
||
|
||
def triton_conv3d_patchify(x_5d: torch.Tensor, weight_5d: torch.Tensor) -> torch.Tensor:
|
||
N_patches, _, _, _, _ = x_5d.shape
|
||
C_out, _, _, _, _ = weight_5d.shape
|
||
A = x_5d.view(N_patches, -1)
|
||
B = weight_5d.view(C_out, -1).transpose(0, 1).contiguous()
|
||
M, K = A.shape
|
||
_K, N = B.shape
|
||
assert K == _K
|
||
C = torch.empty((M, N), device=A.device, dtype=A.dtype)
|
||
|
||
# --- 针对7900xtx的手动调优配置,其他GPU的最优组合可能需要自行寻找,AMD的autotune效果就是没有效果 ---
|
||
best_config = {
|
||
'BLOCK_M': 128,
|
||
'BLOCK_N': 128,
|
||
'BLOCK_K': 32,
|
||
}
|
||
num_stages = 4
|
||
num_warps = 8
|
||
|
||
grid = (triton.cdiv(M, best_config['BLOCK_M']),
|
||
triton.cdiv(N, best_config['BLOCK_N']))
|
||
|
||
_conv_gemm_kernel[grid](
|
||
A, B, C,
|
||
M, N, K,
|
||
A.stride(0), A.stride(1),
|
||
B.stride(0), B.stride(1),
|
||
C.stride(0), C.stride(1),
|
||
**best_config,
|
||
num_stages=num_stages,
|
||
num_warps=num_warps
|
||
)
|
||
|
||
return C
|
||
```
|
||
---
|
||
**4.关闭终端后再次使用mineru-gradio会报一个Lora错误,修改代码跳过它**
|
||
```
|
||
pip show mineru_vl_utils
|
||
```
|
||
|
||
打开该文件XXX/mineru_vl_utils/vlm_client/vllm_async_engine_client.py修改第58行self.tokenizer = vllm_async_llm.tokenizer.get_lora_tokenizer()为:
|
||
```
|
||
try:
|
||
self.tokenizer = vllm_async_llm.tokenizer.get_lora_tokenizer()
|
||
except AttributeError:
|
||
# 如果没有 get_lora_tokenizer 方法,直接使用原始 tokenizer
|
||
self.tokenizer = vllm_async_llm.tokenizer
|
||
```
|
||
|
||
**最后整两个环境变量后愉快玩耍即可**
|
||
```
|
||
export MINERU_MODEL_SOURCE=modelscope
|
||
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
|
||
```
|
||
---
|
||
|
||
### 6.vllm后端已经没有问题,下面是pipeline 中layout用的doclayout-yolo模型空洞卷积问题
|
||
### 我在 [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO/issues/120#issuecomment-3368144275) 下做了一个回答,因此 pipeline 的空洞卷积问题不在这里赘述,直接点击链接查看即可。
|
||
查看自己doclayout-yolo安装位置如下,然后进入修改链接中回复介绍的文件即可
|
||
```
|
||
pip show doclayout-yolo
|
||
```
|
||
|