sglang_benchmark | LiamY Blog


What is FlashInfer (an attention kernel backend)

flashinfer-ai • Updated Jun 15, 2025

FlashInfer is a library and kernel generator for Large Language Models that provides high-performance implementation of LLM GPU kernels such as FlashAttention, SparseAttention, PageAttention, Sampling, and more. FlashInfer focuses on LLM serving and inference, and delivers state-of-the-art performance across diverse scenarios.

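Inference engines

Inference engine architecture

Before digging into the architecture of inference engines, let us first outline the basic concepts. As a key component of AI systems, the inference engine deploys trained models into real applications and runs inference tasks, enabling intelligent decision-making and automated processing. With the rapid development of AI, the design and implementation of inference engines face many challenges while also showing distinctive strengths. This section describes the characteristics of inference engines, the technical challenges, and how to address them, to give readers a fairly complete picture. We will also take a closer look at inference engines...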

https://chenzomi12.github.io/04Inference01Inference/05Inference.html

Some LLM theory

Rotary position embedding (RoPE)

Docker

A few Docker notes

Bugs

CUDA Graph issue in FlashInfer

Error:

💡

NotImplementedError: Error in calling custom op rotary_embedding: Could not run '_C::rotary_embedding' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. '_C::rotary_embedding' is only available for these backends: [HIP, Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastXPU, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

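It runs now:

Something is off, though: its self-check did not pass:

Fixed, and I found the cause. The request came back as 502 Bad Gateway, so I guessed the proxy might be responsible. Sure enough, the HTTP proxy configured on this machine was breaking the server's responses. (I don't fully understand why, since a plain curl from the shell does not hit this problem.)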
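A minimal sketch of one way to test the proxy hypothesis from Python, using only the standard library. It adds the local addresses to `no_proxy` and builds an opener that ignores proxy environment variables. The host, port, and path in the commented call are assumptions; adjust them to wherever the server listens.

```python
import os
import urllib.request


def add_no_proxy(hosts=("localhost", "127.0.0.1")):
    """Append hosts to no_proxy/NO_PROXY so local requests skip the proxy."""
    current = [h for h in os.environ.get("no_proxy", "").split(",") if h]
    for host in hosts:
        if host not in current:
            current.append(host)
    value = ",".join(current)
    os.environ["no_proxy"] = value
    os.environ["NO_PROXY"] = value
    return value


def status_without_proxy(url, timeout=5.0):
    """Fetch url with an opener that ignores proxy environment variables."""
    # An empty mapping tells ProxyHandler not to use any proxy at all.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    print(add_no_proxy())
    # Hypothetical address: adjust host, port, and path to your server.
    # print(status_without_proxy("http://127.0.0.1:30000/health"))
```

If the request succeeds without the proxy but returns 502 with it, the proxy is the likely culprit.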

Running the tests

Checking whether memory offloading is supported

[Feature] Question about kvcache offloading to disk or CPU memory · sgl-project sglang · Discussion #1073

Motivation In sglang's paper, it mentions that sglang plans to implement a kvcache offloading mechanism. I notice that there is a disable_disk_cache=False option in sglang's Server args, an...

https://github.com/sgl-project/sglang/discussions/1073

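--cpu-offload-gb

However, there is an option that specifies how many GB of RAM to reserve for CPU offloading.

Following this option down, I found that it does implement a simple form of CPU offload:

Someone asked for help running multiple models on the same GPU: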

Load multiple models on the same GPU in the cluster

Updated Nov 30, 2024

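Also checking whether vLLM supports it

swap

vLLM has a swap mechanism that looks a lot like offloading, though I don't fully understand it (the swap space size is set with the --swap-space parameter; the default is 4 GB):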

When is the CPU KV cache used and swapping?

Updated Feb 6, 2025

The CPU KV cache is only used in cases where a sequence group has multiple sequences running. An example of this would be in a generation request when beam_search is enabled or best_of>1
You can see the logic here
You can see where the logic is called from here
There does not seem to be a way to enable CPU swapping right now for other cases. You can modify the server code however in the places that I listed above.

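Why is the overhead of recomputation said to be lower than that of swapping here? The answer I got from Copilot was:

This is different from offloading: it is not a dynamic adjustment. It is decided up front from the request's sequences, and only happens when beam_search is used or best_of is greater than 1 (that is, when multiple candidate outputs are produced). The cache is placed directly in CPU memory and swapped back as a whole batch when needed.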
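To reason about recomputation versus swapping, a rough cost model can help. This is only a back-of-envelope sketch under my own assumptions: swapping is modeled as a single bulk transfer over the CPU-GPU link, and recomputation as about 2 FLOPs per parameter per token with the attention term ignored. All parameter names are illustrative and none of this comes from vLLM's code.

```python
def swap_seconds(num_tokens, kv_bytes_per_token, link_bytes_per_s):
    """Time to move the KV cache of num_tokens across the CPU-GPU link."""
    return num_tokens * kv_bytes_per_token / link_bytes_per_s


def recompute_seconds(num_tokens, num_params, achieved_flops_per_s):
    """Time to rebuild the KV cache by re-running prefill.

    Assumes roughly 2 FLOPs per parameter per token for a forward pass
    and ignores the attention term, so it underestimates long sequences.
    """
    return 2 * num_params * num_tokens / achieved_flops_per_s
```

Which side wins depends entirely on the inputs: link bandwidth, achieved compute throughput, and how fragmented the transfer is in practice.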

offloading

[core][model] yet another cpu offload implementation

Updated Jan 23, 2025

This is the PR that implements the --cpu-offload-gb parameter.

CPU offloading support

Updated Apr 23, 2024

vLLM currently focuses on optimizing throughputs on the server side. Swapping weights to CPU RAM will almost always hurt the serving throughput and GPU utilization, and thus it's not in our short-term plan. You are welcome to checkout another project of us to learn more about that constraint scenario: https://github.com/FMInference/FlexGen

vLLM's offload support for prefix caching:

[Usage]: Does Prefix Caching currently support offloading to the CPU?

Updated Nov 24, 2024

The key setting is:

💡

--cpu-offload-gb

The space in GiB to offload to CPU, per GPU. Default is 0, which means no offloading. Intuitively, this argument can be seen as a virtual way to increase the GPU memory size. For example, if you have one 24 GB GPU and set this to 10, virtually you can think of it as a 34 GB GPU. Then you can load a 13B model with BF16 weight, which requires at least 26GB GPU memory. Note that this requires fast CPU-GPU interconnect, as part of the model is loaded from CPU memory to GPU memory on the fly in each model forward pass. Default: 0
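The arithmetic in that description can be checked with a small sketch. The helper names are mine, and the check covers weights only: KV cache and activations are deliberately ignored, so it is a lower bound on what is needed.

```python
def weights_gb(num_params, bytes_per_param=2):
    """Weight footprint in GB (10**9 bytes); BF16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9


def fits_with_offload(num_params, gpu_gb, cpu_offload_gb, bytes_per_param=2):
    """True if the weights alone fit in GPU memory plus the offload space."""
    return weights_gb(num_params, bytes_per_param) <= gpu_gb + cpu_offload_gb


if __name__ == "__main__":
    print(weights_gb(13e9))                 # 13B parameters in BF16
    print(fits_with_offload(13e9, 24, 0))   # 24 GB GPU, no offload
    print(fits_with_offload(13e9, 24, 10))  # 24 GB GPU plus 10 GB offload
```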

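My impression is that this offloading is meant to extend capacity: when GPU memory is not enough, CPU memory supplements it. That is indeed what offloading is for. Ideally, several models running on one card would keep their GPU memory resident the whole time, but when GPU memory is too small we may need to offload the memory held by an old model before loading a new one. Judging from vLLM's description, however, it is unclear whether the implementation covers this. Before running experiments, several points are unknown:

vLLM appears to run only one model, so its offload exists to support larger models rather than more models.

If we use vLLM from multiple terminals, that is, two independent vLLM engine services running multiple models on a single card, this offloading probably cannot achieve what we want. Suppose we plan to run two models with two vLLM instances (one model each) on a GPU with 23 GB of memory. We would assign each instance a GPU memory budget in advance and set cpu-offloading so that anything beyond it goes to CPU memory. That requires knowing each model's optimal GPU memory size up front, which is impossible, mainly because of the KV cache: it is a dynamic runtime cost that changes with the sequences during inference. It does have an upper bound, but running the two vLLM instances separately inevitably leaves a lot of "internal fragmentation".

The biggest drawback is the static GPU memory allocation. The more separate vLLM engines we start, the more each one holds its own dynamic space for KV cache, activations, and other intermediate results. That reserved space is necessarily over-provisioned, and the excess grows as vLLM instances are added.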
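The over-provisioning argument can be illustrated with made-up numbers. In this sketch each engine's dynamic memory demand is a hypothetical trace in GB per time step. Static partitioning must reserve every engine's own peak, whereas a shared pool would only need the peak of the combined demand.

```python
def static_reservation(traces):
    """Memory reserved when each engine gets its own worst-case budget."""
    return sum(max(trace) for trace in traces)


def shared_peak(traces):
    """Peak of the combined demand if the engines could share one pool."""
    return max(sum(step) for step in zip(*traces))


if __name__ == "__main__":
    # Hypothetical dynamic demand (GB) of two engines over four time steps.
    engine_a = [2, 6, 3, 1]
    engine_b = [1, 2, 5, 6]
    traces = [engine_a, engine_b]
    print(static_reservation(traces), shared_peak(traces))
```

The gap between the two numbers is the excess reservation. It can only stay the same or grow as more independent engines are added, because the peaks rarely coincide.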

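Why does vLLM work this way? I think the main reason is that its design assumes it occupies at least one whole card. For large models that is entirely reasonable, and CPU offloading is likewise meant for models too large to fit on a single card. From what I have read, it only offloads model parameters. KV cache offloading is probably rare in the single-card, single-model case because the KV cache needs fast access and does not compete with another model's KV cache. If the KV cache still does not fit, is recomputing the upcoming KV pairs more reasonable than storing them in CPU RAM and loading them back? (Possibly because conversations that long have not come up yet.)

In fact, cpu-offload-gb only offloads the model parameters, and the feature supports only models with certain structures. To see which models are supported, check which models call the make_layers function defined at models/utils.py#L181.

I first wanted to see whether the community has discussed running multiple models on vLLM: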

Support Multiple Models

Updated May 25, 2025

One work around that I notice vLLM already gives is using Docker containers.
Use docker compose to serve multiple models that way two or more models can be served at a time based on your memory capabilities.

What the gpu-memory-utilization parameter does in vLLM:

from:https://blog.csdn.net/weixin_43408232/article/details/143117513

and: https://docs.vllm.ai/en/latest/serving/engine_args.html (search gpu-memory-utilization)

💡

--gpu-memory-utilization

The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9. This is a per-instance limit, and only applies to the current vLLM instance. It does not matter if you have another vLLM instance running on the same GPU. For example, if you have two vLLM instances running on the same GPU, you can set the GPU memory utilization to 0.5 for each instance. Default: 0.9

It is the fraction of free GPU memory to use for the model executor, ranging from 0 to 1, with a default of 0.9. This parameter lets us control how much GPU memory is used and thereby improve GPU utilization.

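Concretely, when we need to deploy a model, the GPU memory layout may look like this, with two parts:

Allocated: shown in red, this is GPU memory that is already in use. It may be 0, meaning no GPU memory is in use.

Free: shown as the blue box, this is GPU memory that is currently unused. Model deployment can only use the Free part, but whether to use all of it depends on the situation. By adjusting the --gpu-memory-utilization parameter we can control how much GPU memory is used and thereby improve GPU utilization. So after the model is deployed, the GPU memory layout may look like this, with three parts:

Allocated: shown in red, this is GPU memory that is already in use. It may be 0, meaning no GPU memory is in use.

Model Deployment: the memory that may be used for model deployment.

Unused: shown in gray, this is memory that is not used at all. It may also be 0, meaning no GPU memory is left over.

The memory cost of the model deployment part breaks down further:

The peak value can be split into two parts, so the GPU memory used by model deployment breaks down into three parts:

Model weights: shown in dark blue, this is the memory allocated to the model weights. It is relatively fixed.

Activations, etc.: shown in green, this is everything other than the model weights, such as activations (intermediate results during inference). Its size may vary with the config and may scale linearly with the maximum number of sequences (max_num_seqs).

Memory used by the KV cache: shown in gray, this is the memory reserved for the KV cache. The KV cache provides fast access to intermediate data produced during inference, to speed up later computation. It should be large enough to hold at least one request, and its size may scale linearly with the model's maximum length (max_model_len, the maximum length including both the prompt and the generated tokens).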
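The breakdown above can be turned into a rough capacity estimate. This is a sketch under stated assumptions: the budget is taken as total GPU memory times the utilization fraction (following the vLLM doc text quoted earlier), the per-token KV size uses the usual formula of one key and one value vector per layer, and the model shape and memory figures in the example are hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV cache bytes for one token; the factor 2 covers key and value."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_capacity(total_gib, utilization, weights_gib, activation_gib,
                      per_token_bytes):
    """Tokens that fit in what remains after weights and activations."""
    budget_gib = total_gib * utilization
    kv_gib = budget_gib - weights_gib - activation_gib
    if kv_gib <= 0:
        return 0  # nothing left for the KV cache
    return int(kv_gib * 1024**3 // per_token_bytes)


if __name__ == "__main__":
    # Hypothetical model shape: 28 layers, 4 KV heads, head_dim 128, BF16.
    per_token = kv_bytes_per_token(28, 4, 128, dtype_bytes=2)
    # Hypothetical split: 24 GiB card, 0.5 utilization, 3 GiB weights,
    # 1 GiB for activations and other overhead.
    print(per_token, kv_token_capacity(24, 0.5, 3, 1, per_token))
```

Dividing the token capacity by max_model_len gives a rough count of worst-case requests that fit at once.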

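Running an embedding model and an LLM at the same time

The main question is whether time-sharing is supported. As I see it, that comes down to whether SGLang can run multiple models at once. (I'm not sure: every usage example I have seen specifies a single model path, never two. So is it simply a matter of opening two shells?)

One thing I didn't understand: what do online and offline mean in SGLang?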

💡

SGLang provides a direct inference engine without the need for an HTTP server, especially for use cases where additional HTTP server adds unnecessary complexity or overhead. Here are two general use cases:

Offline Batch Inference

Custom Server on Top of the Engine

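It refers to whether the interface is an HTTP interface (online) or a non-HTTP engine API (offline).

Test results

When downloading models, remember to export HF_ENDPOINT=https://hf-mirror.com

Then download with huggingface-cli, for example: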
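A minimal sketch of the download step driven from Python. It builds the `huggingface-cli download` command with `HF_ENDPOINT` set in the child environment and only runs it when asked. The repo id and local directory in the example are assumptions for illustration, not the exact command used in these notes.

```python
import os
import subprocess


def download_model(repo_id, local_dir, endpoint="https://hf-mirror.com",
                   dry_run=True):
    """Build (and optionally run) a huggingface-cli download command."""
    env = dict(os.environ, HF_ENDPOINT=endpoint)  # mirror for this call only
    cmd = ["huggingface-cli", "download", repo_id, "--local-dir", local_dir]
    if not dry_run:
        subprocess.run(cmd, env=env, check=True)
    return cmd, env


if __name__ == "__main__":
    # Assumed repo id and target directory; change them to what you need.
    cmd, _ = download_model("Qwen/Qwen2.5-7B-Instruct",
                            "./models/Qwen2.5-7B-Instruct")
    print(" ".join(cmd))
```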

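LLM: Qwen2.5-7B-Instruct test results

Mainly the final report:

Qwen 7B turned out to be too large: the embedding model could not run alongside it on the 4090, so I switched to deepseek-coder-1.3b-instruct.

LLM: deepseek-coder-1.3b-instruct

Embedding model: gte-Qwen2-1.5B-instruct

Testing for preemption

The idea is to let the LLM saturate the compute cores first, then start the embedding workload and see whether it gets to execute.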
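The test idea can be sketched as a small harness: start the first workload, wait briefly, start the second, and record when each one starts and finishes. The workloads here are stand-in callables; in a real run they would send benchmark requests to the two servers. All names are hypothetical.

```python
import threading
import time


def run_timed(name, workload, log, lock):
    """Run workload and record its (start, end) timestamps under name."""
    start = time.perf_counter()
    workload()
    end = time.perf_counter()
    with lock:
        log[name] = (start, end)


def overlap_test(first, second, delay_s):
    """Start first, then second after delay_s; return both time spans."""
    log, lock = {}, threading.Lock()
    t1 = threading.Thread(target=run_timed, args=("first", first, log, lock))
    t2 = threading.Thread(target=run_timed, args=("second", second, log, lock))
    t1.start()
    time.sleep(delay_s)  # give the first workload time to occupy the GPU
    t2.start()
    t1.join()
    t2.join()
    return log


def second_finished_first(log):
    """True if the later-started workload completed before the earlier one."""
    return log["second"][1] < log["first"][1]


if __name__ == "__main__":
    # Stand-in workloads; replace with functions that drive the two servers.
    log = overlap_test(lambda: time.sleep(0.3), lambda: time.sleep(0.05), 0.05)
    print(second_finished_first(log))
```

If the second workload finishes while the first is still running, it was clearly not blocked until the first one completed.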

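Measuring the embedding model alone as a baseline:

The results differ little: total time is about 22 s, GPU utilization is around 90%, and GPU memory is statically allocated at 11 GB.

Now the LLM alone, DeepSeek 1.3B:

The numbers are all close: about 41 s, with runtime utilization above 85% throughout and averaging about 90%.

Results

LLM first, then the embedding model

Once the embedding model is running, if it starts quickly enough, it may even finish before the LLM.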

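Additional tests

I also ran the two models above with the bench_in_batch_prefix test script from the benchmark folder (with the number of samples adjusted slightly):

Much the same: a single model reaches about 90% GPU utilization, and with both models running utilization is 100%, yet the slowdown is not severe. (I consider a 100% increase in latency normal behavior, since each model's average share of compute is halved.)

Comparison below:

Problems

Oddly, even though GPU memory did not appear to be full, it still errored: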

💡

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 380.75 MiB is free. Process 1545321 has 9.65 GiB memory in use. Including non-PyTorch memory, this process has 13.61 GiB memory in use. Of the allocated memory 13.19 GiB is allocated by PyTorch, and 14.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

It turned out that one intermediate step consumes a lot of GPU memory, which is why I did not catch it by eye at the time.
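Short spikes like this are easy to miss by eye, so one option is to poll `nvidia-smi` at a short interval and keep the maximum. This is a sketch: the `read` parameter is my own addition so the sampler can be exercised without a GPU, and polling can still miss spikes shorter than the interval.

```python
import subprocess
import time

QUERY = ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"]


def parse_used_mib(output):
    """Parse nvidia-smi output: one integer (MiB) per GPU, one per line."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def read_nvidia_smi():
    """Return the raw memory.used output from nvidia-smi."""
    return subprocess.run(QUERY, capture_output=True, text=True,
                          check=True).stdout


def sample_peak_mib(duration_s, interval_s=0.05, gpu_index=0, read=None):
    """Poll used memory for duration_s seconds and return the peak in MiB."""
    read = read or read_nvidia_smi
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        peak = max(peak, parse_used_mib(read())[gpu_index])
        time.sleep(interval_s)
    return peak
```

Inside a PyTorch process, `torch.cuda.max_memory_allocated()` reports the peak allocated by that process directly, which avoids the sampling gap.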

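Summary

On the SGLang side:

CPU offload is available through the --cpu-offload-gb parameter, but for now it looks like a fairly simple implementation.

On a single card, running multiple models with SGLang does not block; it only slows down because each model has fewer resources. I don't think this has anything to do with SGLang, though (or maybe my test method is wrong?). I ran SGLang from two shells, and each instance calls CUDA through PyTorch to execute on the GPU, so I can model it as two SGLang engines running two tasks. What governs this behavior would then be how PyTorch drives the GPU, or possibly something lower: how the GPU executes the work it is given. My feeling is that this comes from the scheduling policy the GPU hardware/firmware applies to multiple tasks running on it.

SGLang offloading implementation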

Hierarchical Caching for SGLang by xiezhq-hermann · Pull Request #2693 · sgl-project/sglang

Motivation While RadixTree-based context caching provides significant performance benefits, these gains are not always fully realized. A key bottleneck is the capacity limit of GPU memory. Currentl...

https://github.com/sgl-project/sglang/pull/2693

GitHub - sgl-project/sglang at xiezhq-hierarchical

SGLang is a fast serving framework for large language models and vision language models. - GitHub - sgl-project/sglang at xiezhq-hierarchical

https://github.com/sgl-project/sglang/tree/xiezhq-hierarchical

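Results:

Above is without hierarchical caching; below is with 4.19 GB given to the multi-level KV cache.

Looks like no difference:

Results further into the run:

I increased the CPU offloading share; this time 5.96 GB of host memory was allocated.

The cache hit rate is very high during this run (60%~70% later on), but it is slow:

I suspect the problem lies in the LRU replacement policy, because it is fast early on.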
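To make the LRU suspicion concrete, here is a toy two-tier LRU cache: entries evicted from the device tier are demoted to a host tier, and a host hit promotes the entry back. It is only a sketch of the general idea with names of my own choosing, not SGLang's hierarchical cache implementation.

```python
from collections import OrderedDict


class TwoTierLRU:
    """Toy device/host cache with LRU eviction in both tiers."""

    def __init__(self, device_slots, host_slots):
        self.device = OrderedDict()  # least recently used entry first
        self.host = OrderedDict()
        self.device_slots = device_slots
        self.host_slots = host_slots
        self.stats = {"device_hit": 0, "host_hit": 0, "miss": 0}

    def _admit(self, key):
        """Insert key into the device tier, demoting LRU victims to host."""
        self.device[key] = True
        self.device.move_to_end(key)
        while len(self.device) > self.device_slots:
            victim, _ = self.device.popitem(last=False)
            if self.host_slots > 0:
                self.host[victim] = True
                self.host.move_to_end(victim)
                while len(self.host) > self.host_slots:
                    self.host.popitem(last=False)  # dropped for good

    def access(self, key):
        """Return 'device_hit', 'host_hit', or 'miss' for this access."""
        if key in self.device:
            self.device.move_to_end(key)
            outcome = "device_hit"
        elif key in self.host:
            del self.host[key]  # promote back to the device tier
            self._admit(key)
            outcome = "host_hit"
        else:
            self._admit(key)    # recompute, then cache on the device
            outcome = "miss"
        self.stats[outcome] += 1
        return outcome
```

A host hit counts toward the hit rate but still pays for a host-to-device copy, and each promotion can push another entry out of the device tier. So a high hit rate alone does not guarantee a speedup, which may be related to what I observed.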

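I gave nearly 100 GB of CPU memory to offloading, and speed improved noticeably (though it is still worse than without offloading; I suspect the replacement policy does not make good use of the KV cache already held in GPU memory):

Testing with different parameters

I increased the conversation rounds per client (8→20) and reduced the number of clients (100→20).

Without offloading the cache hit rate is in the 20%~50% range; with offloading it is 80%~90%.

Next I reduced the rounds per client (8→2) and increased the number of clients (100→200).

Next I increased the input and output lengths, request-length (512→2048) and output-length (64→256), with 10 clients and 10 rounds per client:

Cache hit rate 70%~80%

Cache hit rate 40%~50%

Another set of parameters:

For a single client with a multi-turn conversation, results are almost identical with or without offloading; the LRU mechanism is enough.

So I wondered whether offloading pays off, and beats recomputation, when many clients cause KV cache eviction.

But it does not. With 200 clients at one round each, it still does no better and is consistently worse than without offloading. I think there must be a problem in the implementation; otherwise this should not happen, since the idea should be able to speed up clients.

Slowed the request rate a little:

Latency dropped a bit, but the offloading side is still slower no matter what.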

Long text

Fixes for some problems hit during testing

Memory management for large batch sizes · sgl-project sglang · Discussion #187

Does SGLang automatically manage memory when a large batch (e.g. 100 batch size with avg. token per sequence 6,000) is submitted for processing? I have been getting OOM errors when increasing my ba...

https://github.com/sgl-project/sglang/discussions/187

Parameter configuration

max_total_tokens: Optional[int] = None

--max_total_tokens