TASK
Task:
ktransformers
kvcache-ai • Updated Dec 15, 2024
Goal: understand this project and some of the related technologies.
I want to try building the project first
Installing CUDA
Please note that CUDA is not yet installed at this stage.
The CUDA toolkit is now installed, and a few manual actions must be executed to complete the setup. We will now proceed to update the environment variables as recommended by the NVIDIA documentation.
Because it inexplicably reported `CUDA error: invalid device function`, I decided to reinstall CUDA 12.1:
Here is how to uninstall:
Compatibility issues that came up when switching drivers:
Installing PyTorch
Outstanding issues
Still not working: after downgrading the CUDA version I get the same error. I can't solve it for now, so I'm shelving it. I plan to try some CUDA programming first and, for now, just make sure my driver is correct.
Installing Miniconda
Can't reach Hugging Face, or going through a proxy uses too much data:
Bugs encountered
ModuleNotFoundError: No module named 'flash_attn'
Hugging Face Transformers
What is this thing?
Theory:
On parallelism in Transformer inference and training, I finally found an answer to my question:
Can the decoder in a transformer model be parallelized like the encoder?
Generally NO
But how does the Transformer encoder parallelize on the GPU?
Project content
The first thing I noticed is that the project compares itself against llama.cpp, so let's look at llama.cpp first.
llama.cpp
This is where LLaMa.cpp (or LLaMa C++) comes to the rescue, providing a lighter, more portable alternative to the heavyweight frameworks.
The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.
About the project's deepseek-v2-injection and injection_tutorial docs
What is quantization?
The basic idea behind quantization is quite easy: going from high-precision representation (usually the regular 32-bit floating-point) for weights and activations to a lower precision data type.
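To make the idea concrete, here is a minimal sketch of absmax INT8 quantization in PyTorch. It only illustrates the general principle, not the quantization kernels that KTransformers or Marlin actually use; the function names are mine.

```python
import torch

def absmax_quantize_int8(w: torch.Tensor):
    """Quantize an FP32 weight tensor to INT8 with a single absmax scale."""
    scale = w.abs().max() / 127.0                         # map the largest magnitude to 127
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    """Recover an FP32 approximation of the original weights."""
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)
q, scale = absmax_quantize_int8(w)
print((w - dequantize(q, scale)).abs().max())             # small quantization error
```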
Storing all 236 billion parameters of a model in GPU VRAM is clearly impractical for local users. Therefore, we strategically store only the most computationally intensive parameters on the GPU. For instance, after our optimizations, the MLA operator, which contains 128 heads with a shared compressed key-value representation, shows an arithmetic intensity of 512. This makes it the most intensive operator, particularly during smaller inference batch sizes. Hence, it is allocated to the GPU to leverage the power of tensor cores.
What is Marlin?
marlin
IST-DASLab • Updated Dec 15, 2024
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
Requirements:
• NVIDIA GPU with compute capability >= 8.0 (Ampere or Ada, Marlin is not yet optimized for Hopper)
I think I've found why it wouldn't run earlier.
Marlin's optimizations
- all activations are essentially always fetched from L2 cache and are further reused several times within registers
- We execute global weight loads asynchronously, to all compute operations but also activations loads, with a cache policy that allows immediate eviction in order to not unnecessary pollute the L2 cache with values that are never reused.
What is llamafile:
llamafile
Mozilla-Ocho • Updated Dec 15, 2024
How llamafile works
A llamafile is an executable LLM that you can run on your own computer. It contains the weights for a given open LLM, as well as everything needed to actually run that model on your computer. There's nothing to install or configure (with a few caveats, discussed in subsequent sections of this document).
MoE (Mixture of Experts)?
What is the GGUF format?
GGUF and GGML are file formats used for storing models for inference, especially in the context of language models like GPT
GGUF (GPT-Generated Unified Format), introduced as a successor to GGML (GPT-Generated Model Language)
GGUF represents an upgrade to GGML, offering greater flexibility, extensibility, and compatibility. It aims to streamline the user experience and support a wider range of models beyond llama.cpp.
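As a quick way to peek at a GGUF file, here is a tiny header reader. It assumes the GGUF v2/v3 on-disk layout (little-endian: 4-byte magic `GGUF`, uint32 version, uint64 tensor count, uint64 metadata KV count); the file name in the example is hypothetical.

```python
import struct

def read_gguf_header(path: str):
    """Read the fixed-size GGUF header fields (assumes the GGUF v2/v3 layout)."""
    with open(path, "rb") as f:
        magic = f.read(4)                                  # b"GGUF" for valid files
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: {magic!r}")
        version, = struct.unpack("<I", f.read(4))
        tensor_count, = struct.unpack("<Q", f.read(8))
        metadata_kv_count, = struct.unpack("<Q", f.read(8))
    return {"version": version,
            "tensor_count": tensor_count,
            "metadata_kv_count": metadata_kv_count}

# e.g. read_gguf_header("DeepSeek-V2-Chat.Q4_K_M.gguf")    # hypothetical file name
```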
Tokens and embeddings
In the world of natural language processing, it is the smallest unit of analysis that we define. What you call a token depends on your tokenization method; plenty of such methods exist. Creating tokens is basically the first step to perform for most NLP tasks.
What do tokens look like in the context of LLMs like ChatGPT? The tokenization methods used for LLMs differ from those used in general NLP.
Broadly speaking, we can call it “subword tokenization,” where we create tokens that need not necessarily be complete words as we see in whitespace tokenization. This is precisely why one word is not equal to one token.
When they say GPT-4 Turbo has 128K tokens as its context length, it is not exactly 128K words but a number close to it.
- Token IDs are a straightforward numerical representation of tokens. It is, in fact, a basic form of vectorization. They do not capture any deeper relationships or patterns between the tokens.
- Standard vectorization techniques (like TF-IDF) include creating more complex numerical representations based on some logic.
- Embeddings are advanced vector representations of tokens. They try to capture the most nuance, connections, and semantic meanings between tokens. Each embedding is generally a series of real numbers on a vector space computed by a neural network.
In short, text is converted to tokens. Tokens are assigned token IDs. These token IDs can be used to create embeddings for more nuanced numerical representation in complex models.
Embeddings are the “real inputs” of LLMs.
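A minimal sketch of the text → tokens → token IDs → embeddings chain with Hugging Face Transformers (bert-base-uncased is just an example model):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example model
model = AutoModel.from_pretrained("bert-base-uncased")

text = "hello my dog is cute"
tokens = tokenizer.tokenize(text)             # subword tokens, e.g. ['hello', 'my', ...]
ids = tokenizer(text, return_tensors="pt")    # token IDs (plus special tokens)

with torch.no_grad():
    out = model(**ids)

print(tokens)
print(ids["input_ids"])                       # basic numeric representation
print(out.last_hidden_state.shape)            # (1, seq_len, 768): the contextual embeddings
```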
About long_context_introduction.md
Prune or Retrieval?
Even when using the KVCache Offload feature of llama.cpp to offload the KVCache to CPU/DRAM, barely making the model runnable, performance remains unacceptable due to the need to fully scan the entire KVCache each time a single token is generated.
Following this, a series of works built on H2O's approach by designing more complex methods for selecting tokens that perform better in different scenarios. These methods are quite reasonable for single-word inference. However, as we previously explored in the Mooncake project, we believe that the future trend is to precompute reusable KVCache as much as possible, and then use it to answer different questions.
Therefore, with this goal in mind, we prefer not to delete any tokens from the KVCache, or at least not remove a significant portion of them, to ensure that different questions can focus on different parts of the context in the future.
We further investigated related research, among which InfLLM proposed a very promising framework. Not only does it recognize that attention is sparse, but it also suggests that overly long contexts can cause attention to be dispersed into irrelevant noise, thereby reducing the model's ability to focus on key information. To address this issue, InfLLM introduces an external memory module (Memory Units) to store the context's KVCache. In each computation step, the most relevant semantic information is retrieved from this external memory module to participate in the calculation, thus enhancing the model's ability to handle long-context inference.
Specifically, InfLLM organizes the external memory module using semantic blocks composed of neighboring tokens and employs a sliding window mechanism during computation. In each step, it selects only the semantic blocks at the head of the context (Initial Tokens), the blocks near the current token (Local Tokens), and a few blocks with the highest semantic similarity to the current token to participate in the attention calculation. As shown in Equation 1, to efficiently retrieve the blocks with the highest similarity, InfLLM selects a few representative tokens whose scores are the highest within each block, and Equation 2 is then used to calculate the semantic similarity between the current token and each semantic block.
Compared to the previously mentioned H2O, the differences in InfLLM are as follows:
- The KVCache is not discarded but stored in memory and dynamically loaded onto the GPU during inference.
- KVCache is managed at the granularity of blocks rather than tokens, with each block selecting a few tokens as its representative index tokens.
InfLLM's proposed method aligns with our "compute once, use many" approach of reusing KVCache. The external memory units in this method can be offloaded to CPU/DRAM or even SSD storage, allowing different parts to be selected for computation based on the specific question. This significantly improves the efficiency of attention computation.
Quest analyzed the recall rate of key tokens in H2O and full attention, finding that the Top-10 attention score token recall rate for the H2O algorithm is around 50%, which indicates that too much key information was lost.
The multiplication in this figure seems off to me.
During the attention computation stage, the dot product is computed between the current query vector and the max key and min key of each KVCache block, respectively. Then, for each channel, the maximum value between the two resulting product vectors is selected and summed to serve as the upper bound of the relevance score for that KVCache block, as shown in stage 1 of the diagram. Based on the relevance scores, the top-k KVCache blocks are selected to participate in the attention computation, as illustrated in stage 2 of the diagram.
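My reading of that scoring step, as a rough sketch (tensor shapes and function names are mine, not Quest's code):

```python
import torch

def quest_block_scores(q, k_max, k_min):
    """Upper-bound relevance score per KVCache block (Quest, stage 1).

    q:     (d,)    current query vector
    k_max: (B, d)  per-channel max key of each block
    k_min: (B, d)  per-channel min key of each block
    """
    prod_max = q * k_max                       # (B, d) element-wise products
    prod_min = q * k_min
    upper = torch.maximum(prod_max, prod_min)  # channel-wise max of the two products
    return upper.sum(dim=-1)                   # (B,) upper bound per block

def select_topk_blocks(q, k_max, k_min, k=8):
    """Stage 2: keep only the top-k blocks for the actual attention computation."""
    scores = quest_block_scores(q, k_max, k_min)
    return scores.topk(min(k, scores.numel())).indices
```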
Compared to InfLLM, Quest does not take heterogeneous architectures into account. Instead, it assumes that all KVCache can still fit into memory, simply leveraging sparse attention to accelerate the inference process. Ultimately, Quest achieves a 7.03x speedup in attention computation and a 2.23x improvement in end-to-end inference latency.
MagicPiG, based on Locality-Sensitive Hashing (LSH), proposes a dynamic KVCache management strategy.
There is one part here I don't quite understand.
H2O uses a greedy method: for a given token, say A, token A itself and every token after A attend to A; summing these attention weights gives A's "importance score". Compute the importance score for every token, then simply discard the tokens with the lowest scores.
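A minimal sketch of that greedy scoring, assuming we already have the causal attention weight matrix of one head (names are illustrative):

```python
import torch

def h2o_keep_mask(attn: torch.Tensor, keep: int) -> torch.Tensor:
    """attn: (T, T) causal attention weights; attn[i, j] = how much token i attends to token j.

    Because of the causal mask, the column sum for token j only accumulates attention
    from tokens j..T-1, i.e. from j itself and everything after it.
    """
    importance = attn.sum(dim=0)                                   # importance score per token
    keep_idx = importance.topk(min(keep, importance.numel())).indices
    mask = torch.zeros(attn.shape[1], dtype=torch.bool)
    mask[keep_idx] = True
    return mask                                                    # True = keep this token's KV entry
```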
SnapKV's starting point: LLM applications such as dialogue (especially RAG) and document processing usually have a very long input and a relatively short output. The paper doesn't seem to say the following explicitly, but I inferred it from the code: because the input is long and the output is short, the KV cache can be processed during the prefill stage by dropping some unimportant tokens, while decoding proceeds as usual. No extra computation is needed at decode time, so overall it both saves GPU memory and speeds things up.
Retrieval-Augmented Generation (RAG)
KTransformers CPU Sparse Attn Framework
Based on the introduction of the above papers, we have distilled the following key points:
- The distribution of attention weights is sparse, and useless KVCache may introduce noise, which could actually reduce performance during the inference stage.
- For the KVCache eviction strategy during the inference stage, the common approach is to retain the tokens from the beginning and the end of the prompt, while designing algorithms to select the tokens from the middle portion. One of the main factors affecting the model's performance is the ability to accurately identify the key tokens.
- Managing the middle portion of tokens in blocks can improve memory swapping and attention computation efficiency, and smaller blocks do not seem to perform worse than token-level granularity.
- The tokens that each attention layer focuses on during inference differ, and even the allocated KVCache capacity for different layers should vary.
We organized the KVCache in units of blocks. Specifically:
- KVCache Partitioning: A complete input prompt is divided into three configurable parts: Initial, Context, and Local. During the computation process, the Initial/Local parts will be fully attended to, while the Context part will be sparsely retrieved. This approach is based on findings from many papers (such as streamingLLM and Minference) which mention the existence of "attention sinks," where higher attention weights are often found at the beginning and the end of the sequence.
- Context Block Partitioning: For the middle Context, we follow the InfLLM approach by dividing it into blocks based on a configurable fixed number of tokens. Each block can select 1 to k tokens as its representative tokens. During the actual inference phase, the Context blocks that require attention are selected based on these representative tokens.
- Specifically, we have implemented the following methods for selecting representative tokens, based on the approaches outlined in various papers.
- Max: The maximum values of multiple tokens within a block, across each channel, are concatenated to form the representative token for the current block.
- Mean: The average values of multiple tokens within a block, across each channel, are concatenated to form the representative token for the current block.
- Quest: A combination of the previous two methods: the maximum and minimum values of multiple tokens within a block, across each channel, are taken as the representative tokens for the block. Under this method, the number of representative tokens is fixed at 2.
- Dynamic: By calculating the cumulative attention score for each token using a specific method, each block selects the top-k tokens with the highest scores as the representative tokens for the block. This is similar to InfLLM but with some simplifications.
- Once the representative tokens for each block are determined, use Equation 2 from InfLLM to calculate the similarity between the input X and the k representative tokens of each block B, and only select the top r_k blocks for attention computation, where l_P represents the length of the historical tokens (a rough sketch of this selection is given below):
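A rough sketch of the representative-token options and the top-k block selection described above. It is simplified: blocks are scored by the dot product of the query with their representatives rather than the exact Equation 2, and the names and shapes are mine, not the project's code.

```python
import torch

def representative_tokens(block_k: torch.Tensor, method: str = "max"):
    """block_k: (block_size, d) keys of one Context block. Returns (r, d) representatives."""
    if method == "max":
        return block_k.max(dim=0).values.unsqueeze(0)          # per-channel max, r = 1
    if method == "mean":
        return block_k.mean(dim=0, keepdim=True)               # per-channel mean, r = 1
    if method == "quest":
        return torch.stack([block_k.max(dim=0).values,         # per-channel max and min, r = 2
                            block_k.min(dim=0).values])
    raise ValueError(f"unknown method: {method}")

def select_blocks(q: torch.Tensor, blocks: list, top_r: int, method: str = "max"):
    """Score each Context block by q against its representatives and keep the top-r blocks."""
    scores = []
    for block_k in blocks:
        reps = representative_tokens(block_k, method)          # (r, d)
        scores.append((q @ reps.T).max())                      # best-matching representative
    scores = torch.stack(scores)
    return scores.topk(min(top_r, len(blocks))).indices
```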
Since InfLLM requires calculating a representative score for each token during the prefill stage and then selecting a representative token for each block based on these scores, this operation involves invasive modifications to the prefill implementation, making it difficult to integrate with other methods. Furthermore, in actual testing, we found that in most scenarios, similar or even better results can be achieved through a combination of other methods. Therefore, we ultimately decided not to integrate this method into the framework.
Further Optimizations
We can see that both the original model and the accelerated KTransformers achieve perfect scores on the relatively simpler datasets, such as Single Needle Retrieval and passkey. At the same time, the generation speed has significantly improved, increasing from 4.86 tokens/s with llama.cpp to 27.49 tokens/s with KTransformers, achieving up to a 5.65x speedup. Although the current configuration shows a noticeable drop in performance on the more challenging kvretrieval dataset, in the next section, we will address this by implementing a more optimized selection strategy to compensate for or even surpass the original model's accuracy.
As mentioned earlier, the goal of the kvretrieval dataset is to find a matching key-value pair within a long sequence of semantically meaningless pairs. If tokens are generated by reselecting based on the current query each time, the likelihood of deviation increases as the text grows, leading to the selection of different KVCache blocks compared to previous selections. To address this, we introduced a preselection mechanism using SnapKV to calculate the method for selecting representative tokens, which preselects a portion of the KVCache blocks. During the subsequent inference process, the selection is limited to these blocks. After one round of preselection, the score increased from 15.4 to 24.2, surpassing the original model + full attention's performance of 21 points. Further research indicates that the sparsity effect of the KVCache in the first few layers of LLMs is not as significant. Therefore, we set the first two layers to fully reuse the KVCache, ultimately achieving a score of 24.4.
The Needle In a Haystack Test
It works by embedding specific, targeted information (the “needle”) within a larger, more complex body of text (the “haystack”). The goal is to assess an LLM’s ability to identify and utilize this specific piece of information amidst a vast amount of data.
The Needle in a Haystack test was first used to evaluate the recall of two popular LLMs, OpenAI’s ChatGPT-4 and Anthropic’s Claude 2.1. An out of place statement, “The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day,” was placed at varying depths within snippets of varying lengths taken from essays by Paul Graham, similar to this:
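For reference, a tiny sketch of how such a test prompt can be assembled (the helper function and depth parameter are made up for illustration):

```python
def build_haystack_prompt(haystack: str, needle: str, depth: float) -> str:
    """Insert the needle at roughly `depth` (0.0 = start, 1.0 = end) of the haystack text."""
    pos = int(len(haystack) * depth)
    context = haystack[:pos] + " " + needle + " " + haystack[pos:]
    return context + "\n\nWhat is the best thing to do in San Francisco?"

needle = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
prompt = build_haystack_prompt("(snippets from Paul Graham's essays ...)", needle, depth=0.5)
```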
That said, I don't think this "needle in a haystack" test is all that good. Later the article mentions some tweaks for Claude that give better results, and I think those tweaks are reasonable. Two approaches: either make the needle's content closer to the haystack, so that Claude has more justification when making an assertion, or simply don't ask it to make an assertion at all and just ask it to find the most relevant sentence.
Reading the code
This feels odd: if the config file has already been copied to the user's local directory, then later changes to the config file in the repo won't take effect? I think when there's a change, the user-side copy should also be updated. (Maybe add a checksum? That would make the caching more efficient.)
Never mind, I misunderstood: the design intends for users to go to ~/.ktransformers/config.yaml and edit it there.
When I wanted to try other models, I picked the smallest one, Mixtral-8x7B, but even with the Marlin optimization removed it still wouldn't run (memory overflowed). However, I did notice one thing: for models with access control, you need to pass a token: 👇
Transformer
A rough understanding first:
The attention mechanism
RNN: Recurrent Neural Networks
Transformer
Studying the paper: Attention Is All You Need
- Uses only the attention mechanism, with no recurrence or convolutions. Note the encoder-decoder architecture. Transformer: eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. This enables parallelization.
- It replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
- RNN characteristics (drawbacks): given a sequence, it works step by step from left to right; for a sentence, it looks at one word at a time. For the t-th word it computes an output called the hidden state, determined by the previous word's hidden state and the current t-th word itself. This carries the history learned so far into the present, combines it with the current word, and produces an output. All earlier information is packed into the hidden state and passed along one step at a time. It is sequential and computed step by step (hard to parallelize), and history is propagated forward one step at a time; if the sequence is long, early information may be lost later on.
- Difficult to learn dependencies between distant positions. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention. (Each step can attend to every position; a single layer sees the whole sequence.)
- Autoregressive: outputs from previous time steps can be used as inputs at the current time step.
- An attention function can be described as mapping a query and a set of key-value pairs to an output
- Scaled Dot-Product Attention: take the dot product of the query with the keys to measure similarity, divide by the square root of the dimension, then apply softmax to obtain non-negative weights that sum to 1, and finally apply these weights to the values to get the output: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V.
The extra division by the square root is because, when d_k is large, the dot-product values can become large and the relative gaps grow, so the softmax of the largest value gets pushed very close to 1.
Note that the mask here is to avoid seeing, at time t, anything after time t. That is, for time t the query q_t should only look at k_1, …, k_t and not at the keys after it; but in the attention computation as written, q_t would interact with everything in K. So when computing the weights we simply make sure the later content isn't used: replace the scores for keys after t with a very large negative number, so that they effectively vanish after the softmax.
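A minimal PyTorch sketch of scaled dot-product attention with the causal mask described above:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, causal: bool = True):
    """q, k: (T, d_k), v: (T, d_v). Returns the attended output of shape (T, d_v)."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5          # similarity, scaled by sqrt(d_k)
    if causal:
        T = scores.shape[-1]
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))   # future keys vanish after softmax
    weights = F.softmax(scores, dim=-1)                    # non-negative, rows sum to 1
    return weights @ v

q = k = v = torch.randn(5, 64)
print(scaled_dot_product_attention(q, k, v).shape)         # torch.Size([5, 64])
```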
- Multi-Head Attention
- How the Transformer uses the attention mechanism
How does the Transformer do inference?
Inference time (generating new text)
- The Transformer doesn't produce the translation in a single forward pass.
- We tokenize the input sentence (source sentence → source tokens → input_ids) using WordPiece, BPE, … e.g. hello my dog is cute → hello, my, dog, is, cu, te → 38, 69, 54, 2345, 222, 494
- The encoder turns every one of the input_ids/tokens into an embedding vector, and we call these the last hidden states. Each vector has size 768 (typical for a base-size Transformer), so the last hidden states have shape (1, 6, 768), e.g. [-0.2, 0.5, …, 0.9] (→ hello), [0.34, 0.64, …, 0.8] (→ my), … (6 in total)
Then go to the decoder
- We first feed the decoder input_ids to the model.
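A hedged sketch of this encoder-decoder inference loop with Hugging Face Transformers, writing out greedy decoding by hand instead of calling generate() (the translation model name is just an example):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "Helsinki-NLP/opus-mt-en-de"                       # example translation model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

enc = tokenizer("hello my dog is cute", return_tensors="pt")      # encoder runs once
decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])

with torch.no_grad():
    for _ in range(30):                                   # one new token per forward pass
        out = model(**enc, decoder_input_ids=decoder_ids)
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy choice
        decoder_ids = torch.cat([decoder_ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(decoder_ids[0], skip_special_tokens=True))
```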
Links/papers that seem useful but that I probably don't have time to read right now (low cost-effectiveness):
VS Code + Python configuration
My understanding of the project
In short, I feel the project modularizes optimizations at many levels, exposes interfaces to form a unified framework, and lets you configure which optimizations to use (e.g. FlashAttention, Marlin (GPU), llamafile (CPU)). You can also develop and experiment with your own optimization strategies (such as the long-context methods borrowed from InfLLM).
It feels like there is still a lot that could be done in each area, but much of it comes from third-party modules that are already built; my impression is that a lot of the work is integrating with those optimized APIs, or porting their implementations over directly.
- Author: liamY
- Link: https://liamy.clovy.top/article/madsys/ktransformer01
- License: This article is published under the CC BY-NC-SA 4.0 license; please credit the source when reposting.