ktransformers小功能补丁

type

status

slug

date

summary

将config文件整理

Server 通过 RESTful API 对外提供模型推理服务，提供 ChatCompletion 和 Assistant 两种调用方式。

ChatCompletion 接口要求用户一次提供所有的历史对话，然后返回模型的回复。AI 服务提供商（例如OpenAI ）和本地推理框架（例如Ollama ）都提供 ChatCompletion 接口。为了兼容 OpenAI 和 Ollama，Server 分别提供和它们一致的 API 接口。因此，当前使用 OpenAI 和 Ollama 的应用可以无缝切换到我们的 Server。例如：如何使用 Tabby 和 ktransformers 在本地利用 236B 的大模型做代码补全？。

Assistant 适用于应用需要复用一系列资源并调用模型的场景。例如，在教育应用场景中，应用开发者可以创建一个名为二年级数学老师的 Assistant，并设置初始prompt（“你是一个有经验的的二年级数学老师...”），上传相关的资料（二年级数学教材）。创建 Assistant 后，应用需要创建一个 Thread 来存储用户和模型的对话消息（Message）。调用模型时，应用需要创建一个 Run 来获得 Assistant 的回复。相对于 ChatCompletion，实现了 Assistant 的 Server 代替应用实现了对话背景复用和多轮对话，使得复杂场景下的模型的调用更加方便。 OpenAI Assistant API 提出了这样的 Assistant 接口，而 Server 也提供和它一致的 API 。

这些 API 定义在server/api中，它们的具体使用请见这里。

格式化

目前f-string无法格式化

Black doesn't format expressions inside f-strings

Updated Apr 19, 2024

E203

PEP 8 recommends to treat : in slices as a binary operator with the lowest priority, and to leave an equal amount of space on either side, except if a parameter is omitted (e.g. ham[1 + 1 :]). It recommends no spaces around : operators for “simple expressions” (ham[lower:upper]), and extra space for “complex expressions” (ham[lower : upper + offset]). Black treats anything more than variable names as “complex” (ham[lower : upper + 1]). It also states that for extended slices, both : operators have to have the same amount of spacing, except if a parameter is omitted (ham[1 + 1 ::]). Black enforces these rules consistently.

This behaviour may raise E203 whitespace before ':' warnings in style guide enforcement tools like Flake8. Since E203 is not PEP 8 compliant, you should tell Flake8 to ignore these warnings.

The Black code style - Black 24.8.0 documentation

Black aims for consistency, generality, readability and reducing git diffs. Similar language constructs are formatted with similar rules. Style configuration options are deliberately limited and rarely added. Previous formatting is taken into account as little as possible, with rare exceptions like the magic trailing comma. The coding style used by Black can be viewed as a strict subset of PEP 8.

https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#slices

vscode 的settings.json设置：

flake忽略的一些W、E、F

思路

当前存在两个不同的config文件，一个yaml文件，一个main文件，都可以对任务启动的参数配置产生影响，需要将这种混乱的配置方式统一起来，实现命令行输入-大于-配置文件-大于-默认配置的效果。并且命令行可以输入所有的可变更的参数。

将原来arg的内容放到config文件里面配置，同时生效到配置文件和命令行

一些问题记录

from ktransformers/server/main.py: 112

这个optimize_config_path应该就是ktransformers/local_chat.py:51里面的optimize_rule_path

那对于这个配置的描述应该是写错了的，它应该是yml文件而不是json，默认路径也应该是ktransformers/optimize/optimize_rules/DeepSeek-V2-Chat.yaml而不是这个：

注意不能出现args没输入参数，结果把默认值改成None这种覆盖的情况

对于local_chat 和 server_chat实现方式的学习

local_chat和server使用了逻辑相当的两套代码完成模型启动和推理的任务，需要完成合并

一些发现

server这边这里看起来的embedding只有新输入的内容

但是上面的

这边new_message是把历史消息也加入进去了的

但是上面这个切片逻辑和原因我不能理解

generated_ids和input_ids并不是有重叠的啊，generated_ids只有推理出来的token，input_ids的前self.seq_length并不是只有回复，同时也是包含了问题的吧？

这边看到unequal_mask 也能发现：

后面都不一样了。这个代码应该是错的吧

不过我理解了原理，是因为prefill会塞一些内容填kvcache，如果线程没变的话，那再次prefill的时候只需要填新加入的message就行，而不需要再把之前的加进去。

问题1：

这个有什么用？

超大模型加载转换Trick

在深度学习领域，大模型的训练和推理通常需要消耗大量的计算和内存。如何高效地加载和使用大模型是一个相当关键的问题。在这篇博客中，我将分享一些关于更快加载大模型和减少内存的技巧. 问题分析假设现在我们有一个236B 超大模型的原始权重的 checkpoint.pth 文件, 比如 DeepSeek Chat V2, 以BF16 格式存储, 一个标准的加载流程如下 123456import torchs

https://fazzie-key.cool/2024/05/20/big-model/

偶然也发现个大佬，跪了

torch.device() 上下文管理器确保工厂调用将像它们被传递了指定的”device”作为参数一样执行。在 torch.device('meta') 上的张量不携带数据。然而，它们具有张量所具有的所有其他元数据，例如.size()、.stride()、.requires_grad等。注意, 在使用 torch.device('meta')后, 我们需要加上 assign=True参数来让参数被加载. 最后一段代码可以check 所有参数被正确加载了, 加载后的参数的 device应该不再是 meta 了.

使用 mmap + meta device 加载几乎没有时间开销, 只有模型真正运行时才会从硬盘拷贝权重到CPU RAM.

问题2：

在server的后端的transformers这里的inference里面thread_id是干嘛的？

成员变量里面的cache: StaticCache是干嘛的？

cuda_graph_runner是什么技术？

这里莫名加了一个空字符

这里self.current_ids好像是个函数?

单步调试看看细节过程：

输入hello：

prefill的时候产生一个token是37727，它decode出来是’ Hello’,目前的TextStreamer里的print_len为0，通过text[self.print_len : text.rfind(" ") + 1]来截取字段打印的话（左闭右开），那么会把空格单独答应出来，而print_len就会更新+1

后续的decode，先产出了一个token是0，这样进入TextStreamer里的text解析token_cache出来就是’ Hello!’,由于最后一个空格在开头，并且已经打印过了（print_len=1），那么text[self.print_len : text.rfind(" ") + 1]给出的printable_text就是个空字符，那么这次不会改变这里print_len,继续decode吧，所以这里就出现上面发现的问题，明明是一个空字符，但是append_message_delta的时候，调用的filter_append里面delta_index +1了，不过这里的确不是很清除这个delta_index是干嘛的，暂且搁置。

又一轮接着decode，token是1724，加入token_cache得到的text是’ Hello! How‘,所以我们的printable_text就是’Hello! ‘,然后print_len更新为8

下一轮decode，token是481

问题3：

这里应该加上空格，在合并相邻用户的input的时候

python 的协程使用

Python: what are the advantages of async over threads?

I've had a hard time trying to understand how and why async functionality works in python and I am still not sure I understand everything correctly (especially the 'why' part). Please correct me if...

https://stackoverflow.com/questions/48020593/python-what-are-the-advantages-of-async-over-threads

asyc/asyncio allows concurrency within a single thread. This gives you, as the developer, much more fine grained control of the task switching and can give much better performance for concurrent I/O bound tasks than Python threading. Asyncio is a Python library designed for writing single-threaded concurrent code using coroutines. It excels in situations involving high-level structured network code or when handling multiple I/O-bound tasks simultaneously.

找出来的结论应该是python的异步是单线程的异步，不引入线程的话，应该是不涉及临界区的。

CudaGraph

zhuanlan.zhihu.com

https://zhuanlan.zhihu.com/p/700224642

zhuanlan.zhihu.com

https://zhuanlan.zhihu.com/p/700224642

zhuanlan.zhihu.com

https://zhuanlan.zhihu.com/p/681904382

Stream & event

cuda 有个stream的概念，就是初步的对于程序执行依赖关系进行划分，有依赖前后关系的放在一个stream里面，就会保证队首先执行而不会一个stream里面的任务并行，而不同stream的内容可以并行，这个并行取决于映射的硬件execution engine，多个任务被下发给kernel之后，只要没超过硬件资源限额，那就能并行了。

event的机制只是为了让不同stream之间也能有个同步机制，也就是stream2可以等待steam1里面的一个kernel，将其作为一个event加入steam2里面，然后接着向stream2里面插入kernel2时，不会立刻执行，而会等待kernel1执行完成。

cudagraph

理解ktransformers的CudaRunner里面的capture

CUDA C++ Programming Guide

The programming guide to the CUDA model and interface.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs

Work submission using graphs is separated into three distinct stages: definition, instantiation, and execution.

During the definition phase, a program creates a description of the operations in the graph along with the dependencies between them.

Instantiation takes a snapshot of the graph template, validates it, and performs much of the setup and initialization of work with the aim of minimizing what needs to be done at launch. The resulting instance is known as an executable graph.

An executable graph may be launched into a stream, similar to any other CUDA work. It may be launched any number of times without repeating the instantiation.

flash attention

Flash Attention

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

https://huggingface.co/docs/text-generation-inference/conceptual/flash_attention

HBM is large in memory, but slow in processing, meanwhile SRAM is smaller in memory, but faster in operations. In the standard attention implementation, the cost of loading and writing keys, queries, and values from HBM is high. It loads keys, queries, and values from HBM to GPU on-chip SRAM, performs a single step of the attention mechanism, writes it back to HBM, and repeats this for every single attention step. Instead, Flash Attention loads keys, queries, and values once, fuses the operations of the attention mechanism, and writes them back.

By improving the efficiency of attention operations, Flash Attention allows for faster training and inference of transformer-based models. Rather than loading queries, keys, and values, or intermediate computation results multiple times for each computation iteration, Flash Attention loads all the data (queries, keys, and values) just once. It then computes the attention score (conducts a series of operations) on this loaded data before writing back the final results. Additionally, it divides the loaded data into smaller blocks, aiding parallel processing.

Batch for kv cache

zhuanlan.zhihu.com

https://zhuanlan.zhihu.com/p/719610083

LLM Inference Optimisation — Continuous Batching

Summary from Achieve 23x LLM Inference Throughput & Reduce p50 Latency (anyscale.com)

https://medium.com/@yohoso/llm-inference-optimisation-continuous-batching-2d66844c19e9

In the image on the top right, the white blocks represent wasted GPU time. If under an 8k token context, this could lead to significant underutilization of the GPU.

Orca: A Distributed Serving System for Transformer-Based Generative Models | USENIX

Gyeong-In Yu and Joo Seong Jeong, Seoul National University; Geon-Woo Kim, FriendliAI and Seoul National University; Soojeong Kim, FriendliAI; Byung-Gon Chun, FriendliAI and Seoul National University

https://www.usenix.org/conference/osdi22/presentation/yu

In this blog, we discuss continuous batching, a critical systems-level optimization that improves both throughput and latency under load for LLMs.

Achieve 23x LLM Inference Throughput & Reduce p50 Latency

https://www.anyscale.com/blog/continuous-batching-llm-inference

它的中文翻译：

Continuous Batching：一种提升 LLM 部署吞吐量的利器

连续批处理可以实现23倍的 LLM 推理吞吐量，同时降低延迟（P50）。

https://www.high-flyer.cn/en/blog/continuous-batching/

方老师的技术报告：

zhuanlan.zhihu.com

https://zhuanlan.zhihu.com/p/676109470

论文：

www.usenix.org

https://www.usenix.org/system/files/osdi22-yu.pdf

示意图描述的是，系统正在生成请求x3、x4的第1个Token（论文称为Initiation Phase），同时生成请求x1、x2的后续Token（论文称为Increment Phase）。因为产生第1个Token后才会有K/V Cache，所以从Attention K/V Manager那里看有没有request对应的K/V cache，就能分辨出每个request处于哪个phase； Linear的计算不涉及Token之间的交互，因此将输入的Batch维度和Seq维度Reshape成1个维度，即可完成Linear的Batch计算（注：有的读者可能会疑惑为什么Linear的输入和输出维度分别是[7,H]和[7,3H]，为什么升维了？这是因为通过矩阵拼接可以实现用一个Linear同时计算出QKV，即输出包含了有H维的Q、H维的K和H维的V，合计3H维）； Attention的计算设涉及Token之间的交互，OCRA的操作是：将输入按Batch维Split，每个样本分别计算Attention，最后再将结果Merge在一起。

迭代级调度（Iteration-level Scheduling）：提出一种新的调度机制，调度执行时以迭代为单位，而不是整个请求。这样，每次迭代后，检测到完成的请求，并立即将生成的Token返回给客户端。对于新到达的请求，有机会在当前的迭代执行后进行处理，从而减少等待时间。通过迭代级调度，调度器可以完全控制每次迭代处理的请求数量和哪些请求，具体如下图所示。

选择性批处理（Selective Batching）：在应用批处理和迭代级调度时，只对选定的少数操作(与形状不规则的输入张量兼容的操作)应用批处理。即所有非注意力操作，包括线性、层归一化、Add和GeLU操作，它们不需要区分不同请求的张量元素，而注意力操作需要请求的概念（即需要批量维度）来仅在相同请求的Token之间计算注意力。选择性批处理了解每个操作的不同特性；它将批次分割并对每个请求单独处理注意力操作，同时将其他操作应用到没有请求概念的Token级（而不是请求级）批处理。这样可以在不同的操作中灵活地处理请求，避免因不同请求处理不同数量的Token 而导致的批处理问题，具体如下图所示，可以看到Attn前后有Split和Merge操作。

TabbyAPI

试试搭建跑一下：