TASK
Task:
ktransformers
kvcache-ai • Updated Jan 15, 2025
Goal: understand this project and the technologies related to it
First, I want to try building the project
Installing CUDA
Please note that CUDA is not yet installed at this stage.
The CUDA toolkit is now installed, and a few manual actions must be executed to complete the setup. We will now proceed to update the environment variables as recommended by the NVIDIA documentation.
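Because I kept getting an unexplained "CUDA error: invalid device function", I decided to reinstall CUDA 12.1:
Here is how to uninstall:
Compatibility problems that came up when changing the driver:
Installing PyTorch
Open issues
Still not working. After downgrading CUDA I get the same error. I cannot solve it for now, so I am shelving it and will try some CUDA programming first, just to make sure my driver is set up correctly.
Installing Miniconda
When Hugging Face is unreachable, or going through a proxy uses too much data:
Bugs encountered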
ModuleNotFoundError: No module named 'flash_attn'
Hugging Face transformers
What is this?
Theory:
On parallelism in transformer inference and training, I finally found something that answers my question:
Can the decoder in a transformer model be parallelized like the encoder?
Generally NO
But how is the transformer encoder parallelized on the GPU?
About the project
The project compares itself against llama.cpp, so let's look at llama.cpp first
llama.cpp
This is where LLaMa.cpp (or LLaMa C++) comes to the rescue, providing a lighter, more portable alternative to the heavyweight frameworks.
The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.
About the project's deepseek-v2-injection and injection_tutorial
What is quantization
The basic idea behind quantization is quite easy: going from high-precision representation (usually the regular 32-bit floating-point) for weights and activations to a lower precision data type.
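To make the idea concrete, here is a minimal sketch of symmetric per-tensor quantization to signed 8-bit integers. It is only an illustration of the general principle, not the scheme used by KTransformers, llama.cpp, or Marlin; the function names and the sample weights are made up for this example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to signed 8-bit integers."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the stored integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.12, -0.53, 0.98, -1.27, 0.0]  # illustrative values only
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)         # small integers in [-127, 127]
print(restored)  # close to the original weights, but not identical
```

Each weight now needs one byte plus a shared scale instead of four bytes, at the cost of a rounding error of at most half a quantization step per value.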
Storing all 236 billion parameters of a model in GPU VRAM is clearly impractical for local users. Therefore, we strategically store only the most computationally intensive parameters on the GPU. For instance, after our optimizations, the MLA operator, which contains 128 heads with a shared compressed key-value representation, shows an arithmetic intensity of 512. This makes it the most intensive operator, particularly during smaller inference batch sizes. Hence, it is allocated to the GPU to leverage the power of tensor cores.
What is Marlin?
marlin
IST-DASLab • Updated Jan 14, 2025
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
Requirements:
• NVIDIA GPU with compute capability >= 8.0 (Ampere or Ada, Marlin is not yet optimized for Hopper)
I think I have found why it would not run earlier
Marlin's optimizations
- all activations are essentially always fetched from L2 cache and are further reused several times within registers
- We execute global weight loads asynchronously, to all compute operations but also activations loads, with a cache policy that allows immediate eviction in order to not unnecessarily pollute the L2 cache with values that are never reused.
What is llamafile:
llamafile
Mozilla-Ocho • Updated Jan 16, 2025
How llamafile works
A llamafile is an executable LLM that you can run on your own computer. It contains the weights for a given open LLM, as well as everything needed to actually run that model on your computer. There's nothing to install or configure (with a few caveats, discussed in subsequent sections of this document).
MOE?
What is the GGUF format?
GGUF and GGML are file formats used for storing models for inference, especially in the context of language models like GPT.
GGUF (GPT-Generated Unified Format) was introduced as a successor to GGML (GPT-Generated Model Language).
GGUF represents an upgrade to GGML, offering greater flexibility, extensibility, and compatibility. It aims to streamline the user experience and support a wider range of models beyond llama.cpp.
Tokens and embeddings
In the world of natural language processing, a token is the smallest unit of analysis that we define. What you call a token depends on your tokenization method; plenty of such methods exist. Creating tokens is basically the first step to perform for most NLP tasks.
What do tokens look like in the context of LLMs like ChatGPT? The tokenization methods used for LLMs differ from those used in general NLP.
Broadly speaking, we can call it “subword tokenization,” where we create tokens that need not necessarily be complete words as we see in whitespace tokenization. This is precisely why one word is not equal to one token.
When they say GPT-4 Turbo has 128K tokens as its context length, it is not exactly 128K words but a number close to it.
- Token IDs are a straightforward numerical representation of tokens. It is, in fact, a basic form of vectorization. They do not capture any deeper relationships or patterns between the tokens.
- Standard vectorization techniques (like TF-IDF) include creating more complex numerical representations based on some logic.
- Embeddings are advanced vector representations of tokens. They try to capture the most nuance, connections, and semantic meanings between tokens. Each embedding is generally a series of real numbers on a vector space computed by a neural network.
In short, text is converted to tokens. Tokens are assigned token IDs. These token IDs can be used to create embeddings for more nuanced numerical representation in complex models.
Embeddings are the “real inputs” of LLMs.
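The text → tokens → token IDs → embeddings pipeline can be sketched in a few lines. Everything here is a toy assumption: the vocabulary, the greedy longest-match splitter, and the random embedding table stand in for a learned tokenizer (such as BPE) and learned embedding weights.

```python
import random

# Hypothetical toy vocabulary; real LLM vocabularies are learned and far larger.
VOCAB = {"hello": 0, "my": 1, "dog": 2, "is": 3, "cu": 4, "te": 5, "<unk>": 6}
EMBED_DIM = 4


def tokenize(text):
    """Greedy longest-match subword split of each whitespace-separated word."""
    tokens = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            # Try the longest remaining piece first, then shrink it.
            for end in range(len(word), start, -1):
                if word[start:end] in VOCAB:
                    tokens.append(word[start:end])
                    start = end
                    break
            else:
                tokens.append("<unk>")  # no known piece starts here
                start += 1
    return tokens


rng = random.Random(0)
# One vector per token ID; in a real model these values are learned.
EMBEDDINGS = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]

tokens = tokenize("hello my dog is cute")
token_ids = [VOCAB[t] for t in tokens]
vectors = [EMBEDDINGS[i] for i in token_ids]
print(tokens)     # five words become six tokens, since "cute" splits in two
print(token_ids)
```

The token IDs are just table indices; only the looked-up vectors carry information a model can compute with, which is why embeddings are described as the real inputs.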
About long_context_introduction.md
Prune or Retrieval?
Even when using the KVCache Offload feature of llama.cpp to offload the KVCache to CPU/DRAM, barely making the model runnable, performance remains unacceptable due to the need to fully scan the entire KVCache each time a single token is generated.
Following this, a series of works built on H2O's approach by designing more complex methods for selecting tokens that perform better in different scenarios. These methods are quite reasonable for single-word inference. However, as we previously explored in the Mooncake project, we believe that the future trend is to precompute reusable KVCache as much as possible, and then use it to answer different questions.
Therefore, with this goal in mind, we prefer not to delete any tokens from the KVCache, or at least not remove a significant portion of them, to ensure that different questions can focus on different parts of the context in the future.
We further investigated related research, among which InfLLM proposed a very promising framework. Not only does it recognize that attention is sparse, but it also suggests that overly long contexts can cause attention to be dispersed into irrelevant noise, thereby reducing the model's ability to focus on key information. To address this issue, InfLLM introduces an external memory module (Memory Units) to store the context's KVCache. In each computation step, the most relevant semantic information is retrieved from this external memory module to participate in the calculation, thus enhancing the model's ability to handle long-context inference.
Specifically, InfLLM organizes the external memory module using semantic blocks composed of neighboring tokens and employs a sliding window mechanism during computation. In each step, it selects only the semantic blocks at the head of the context (Initial Tokens), the blocks near the current token (Local Tokens), and a few blocks with the highest semantic similarity to the current token to participate in the attention calculation. As shown in equation 1, to efficiently retrieve the blocks with the highest similarity, InfLLM selects a few representative tokens whose scores are the highest within each block. Use Equation 2 to calculate the semantic similarity between the current token and each semantic block.
Compared to the previously mentioned H2O, the differences in InfLLM are as follows:
- The KVCache is not discarded but stored in memory and dynamically loaded onto the GPU during inference.
- KVCache is managed at the granularity of blocks rather than tokens, with each block selecting a few tokens as its representative index tokens.
InfLLM's proposed method aligns with our "compute once, use many" approach of reusing KVCache. The external memory units in this method can be offloaded to CPU/DRAM or even SSD storage, allowing different parts to be selected for computation based on the specific question. This significantly improves the efficiency of attention computation.
Quest analyzed the recall rate of key tokens in H2O and full attention, finding that the Top-10 attention score token recall rate for the H2O algorithm is around 50%, which indicates that too much key information was lost.
The multiplication in this figure seems to be wrong here
During the attention computation stage, the dot product is computed between the current query vector and the max key and min key of each KVCache block, respectively. Then, for each channel, the maximum value between the two resulting product vectors is selected and summed to serve as the upper bound of the relevance score for that KVCache block, as shown in stage 1 of the diagram. Based on the relevance scores, the top-k KVCache blocks are selected to participate in the attention computation, as illustrated in stage 2 of the diagram.
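The two stages described above can be sketched as follows. This is a plain-Python illustration of the per-channel upper bound and the top-k selection, written from the description in the text rather than from Quest's actual code; the function names and the sample vectors are assumptions.

```python
def block_upper_bound(query, block_keys):
    """Upper bound of q . k over every key k stored in one KVCache block."""
    channels = range(len(query))
    max_key = [max(k[c] for k in block_keys) for c in channels]
    min_key = [min(k[c] for k in block_keys) for c in channels]
    # Per channel, keep the larger of the two products, then sum over channels.
    return sum(max(query[c] * max_key[c], query[c] * min_key[c]) for c in channels)


def select_top_k_blocks(query, blocks, k):
    """Stage 1: score every block. Stage 2: keep the k highest-scoring blocks."""
    scores = [block_upper_bound(query, b) for b in blocks]
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), scores


query = [1.0, -2.0]
blocks = [
    [[0.5, 0.1], [0.2, 0.3]],
    [[-1.0, -1.0], [0.0, 2.0]],
    [[2.0, 0.5], [1.5, 1.0]],
]
selected, scores = select_top_k_blocks(query, blocks, k=2)
print(selected, scores)
```

Taking the per-channel maximum matters because a negative query channel is maximized by the block's minimum key value, not its maximum; that is why both the max key and the min key are stored.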
Compared to InfLLM, Quest does not take heterogeneous architectures into account. Instead, it assumes that all KVCache can still fit into memory, simply leveraging sparse attention to accelerate the inference process. Ultimately, Quest achieves a 7.03x speedup in attention computation and a 2.23x improvement in end-to-end inference latency.
MagicPiG, based on Locality-Sensitive Hashing (LSH), proposes a dynamic KVCache management strategy.
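There is one part here that I do not understand
H2O uses a greedy approach: for a given token, say A, both A itself and every token after A attend to A; summing these attention weights gives A's "importance score". Compute the importance score for every token, then simply discard the tokens with the lowest scores.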
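A minimal sketch of this greedy scoring, following the simplified description above. It works on a full causal attention matrix in one shot and ignores details of the real H2O algorithm, such as doing the eviction incrementally during decoding; the function names and the attention values are made up.

```python
def importance_scores(attention):
    """attention[i][j] is the weight query i puts on key j (causal: j <= i)."""
    n = len(attention)
    # Column sum: how much token j is attended to by itself and all later tokens.
    return [sum(attention[i][j] for i in range(j, n)) for j in range(n)]


def evict_lowest(attention, budget):
    """Return the sorted indices of the `budget` tokens that are kept."""
    scores = importance_scores(attention)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:budget])


# Illustrative causal attention weights; each row sums to 1.
attention = [
    [1.0, 0.0, 0.0, 0.0],
    [0.6, 0.4, 0.0, 0.0],
    [0.5, 0.2, 0.3, 0.0],
    [0.4, 0.1, 0.2, 0.3],
]
print(importance_scores(attention))
print(evict_lowest(attention, budget=2))
```

Note that early tokens have more rows contributing to their column sum, so this score naturally favors the beginning of the sequence.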
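SnapKV starts from the observation that LLM applications such as chat (especially RAG) and document processing usually have long inputs and relatively short outputs. The article does not seem to say the following, but I inferred it from their code: because the input is long and the output short, the KV cache can be processed during the prefill stage, dropping some unimportant tokens, while the decoding stage proceeds as usual. No extra computation is needed during decoding, so overall it saves GPU memory and speeds things up.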
Retrieval-Augmented Generation(RAG)
KTransformers CPU Sparse Attn Framework
Based on the introduction of the above papers, we have distilled the following key points:
- The distribution of attention weights is sparse, and useless KVCache may introduce noise, which could actually reduce performance during the inference stage.
- For the KVCache eviction strategy during the inference stage, the common approach is to retain the tokens from the beginning and the end of the prompt, while designing algorithms to select the tokens from the middle portion. One of the main factors affecting the model's performance is the ability to accurately identify the key tokens.
- Managing the middle portion of tokens in blocks can improve memory swapping and attention computation efficiency, and smaller blocks do not seem to perform worse than token-level granularity.
- The tokens that each attention layer focuses on during inference differ, and even the allocated KVCache capacity for different layers should vary.
We organized the KVCache in units of blocks. Specifically:
- KVCache Partitioning: A complete input prompt is divided into three configurable parts: Initial, Context, and Local. During the computation process, the Initial/Local parts will be fully attended to, while the Context part will be sparsely retrieved. This approach is based on findings from many papers (such as StreamingLLM and MInference) which mention the existence of "attention sinks," where higher attention weights are often found at the beginning and the end of the sequence.
- Context Block Partitioning: For the middle Context, we follow the InfLLM approach by dividing it into blocks based on a configurable fixed number of tokens. Each block can select 1 to k tokens as its representative tokens. During the actual inference phase, the Context blocks that require attention are selected based on these representative tokens.
- Specifically, we have implemented the following methods for selecting representative tokens, based on the approaches outlined in various papers.
- Max: The maximum values of multiple tokens within a block, across each channel, are concatenated to form the representative token for the current block.
- Mean: The average values of multiple tokens within a block, across each channel, are concatenated to form the representative token for the current block.
- Quest: A combination of the previous two methods: the maximum and minimum values of multiple tokens within a block, across each channel, are taken as the representative tokens for the block. Under this method, the number of representative tokens is fixed at 2.
- Dynamic: By calculating the cumulative attention score for each token using a specific method, each block selects the top-k tokens with the highest scores as the representative tokens for the block. This is similar to InfLLM but with some simplifications.
- Once the representative tokens for each block are determined, use Equation 2 from InfLLM to calculate the similarity between the input X and the k representative tokens of each block B, and only select the top r_k blocks for attention computation, where l_P represents the length of the historical tokens:
Since InfLLM requires calculating a representative score for each token during the prefill stage and then selecting a representative token for each block based on these scores, this operation involves invasive modifications to the prefill implementation, making it difficult to integrate with other methods. Furthermore, in actual testing, we found that in most scenarios, similar or even better results can be achieved through a combination of other methods. Therefore, we ultimately decided not to integrate this method into the framework.
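The Initial/Context/Local partitioning and the Max and Mean representative methods listed above can be sketched as below. This is a hedged illustration, not KTransformers code: the function names, parameters, and the plain dot-product block score are assumptions made for the example.

```python
def partition(num_tokens, initial_len, local_len, block_size):
    """Split token positions into Initial, fixed-size Context blocks, and Local."""
    initial = list(range(min(initial_len, num_tokens)))
    local_start = max(len(initial), num_tokens - local_len)
    local = list(range(local_start, num_tokens))
    context = list(range(len(initial), local_start))
    blocks = [context[i:i + block_size] for i in range(0, len(context), block_size)]
    return initial, blocks, local


def representative(keys, method):
    """Per-channel Max or Mean over the keys of one block."""
    channels = range(len(keys[0]))
    if method == "max":
        return [max(k[c] for k in keys) for c in channels]
    if method == "mean":
        return [sum(k[c] for k in keys) / len(keys) for c in channels]
    raise ValueError(f"unknown method: {method}")


def select_context_blocks(query, keys, blocks, top_k, method="mean"):
    """Score each Context block by q . representative and keep the top_k."""
    def score(block):
        rep = representative([keys[i] for i in block], method)
        return sum(q * r for q, r in zip(query, rep))

    ranked = sorted(range(len(blocks)), key=lambda b: score(blocks[b]), reverse=True)
    return sorted(ranked[:top_k])


keys = [[float(i), 1.0] for i in range(10)]  # illustrative key vectors
initial, blocks, local = partition(len(keys), initial_len=2, local_len=2, block_size=3)
chosen = select_context_blocks([1.0, 0.0], keys, blocks, top_k=1)
print(initial, blocks, local, chosen)
```

Attention would then run over the Initial tokens, the Local tokens, and only the tokens inside the chosen Context blocks.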
Further Optimizations
We can see that both the original model and the accelerated KTransformers achieve perfect scores on the relatively simpler datasets, such as Single Needle Retrieval and passkey. At the same time, the generation speed has significantly improved, increasing from 4.86 tokens/s with llama.cpp to 27.49 tokens/s with KTransformers, achieving up to a 5.65x speedup. Although the current configuration shows a noticeable drop in performance on the more challenging kvretrieval dataset, in the next section, we will address this by implementing a more optimized selection strategy to compensate for or even surpass the original model's accuracy.
As mentioned earlier, the goal of the kvretrieval dataset is to find a matching key-value pair within a long sequence of semantically meaningless pairs. If tokens are generated by reselecting based on the current query each time, the likelihood of deviation increases as the text grows, leading to the selection of different KVCache blocks compared to previous selections. To address this, we introduced a preselection mechanism using SnapKV to calculate the method for selecting representative tokens, which preselects a portion of the KVCache blocks. During the subsequent inference process, the selection is limited to these blocks. After one round of preselection, the score increased from 15.4 to 24.2, surpassing the original model + full attention's performance of 21 points. Further research indicates that the sparsity effect of the KVCache in the first few layers of LLMs is not as significant. Therefore, we set the first two layers to fully reuse the KVCache, ultimately achieving a score of 24.4.
The Needle In a Haystack Test
It works by embedding specific, targeted information (the “needle”) within a larger, more complex body of text (the “haystack”). The goal is to assess an LLM’s ability to identify and utilize this specific piece of information amidst a vast amount of data.
The Needle in a Haystack test was first used to evaluate the recall of two popular LLMs, OpenAI’s ChatGPT-4 and Anthropic’s Claude 2.1. An out of place statement, “The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day,” was placed at varying depths within snippets of varying lengths taken from essays by Paul Graham, similar to this:
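That said, I do think this "needle in a haystack" test is not very good. Later on it mentions some changes made for Claude that gave better results, and I think those changes are reasonable. There are two approaches: one is to make the needle's content closer to the haystack, so that Claude has better grounds when making its assertion; the other is simply not to ask it to make a judgment and instead have it find the most relevant sentence.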
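For reference, building the test cases is simple. The sketch below places a needle at a relative depth inside filler text and generates one prompt per (length, depth) pair. It is an assumption-laden simplification: context length is counted in sentences rather than tokens, the filler text is synthetic, and the keyword check stands in for whatever grading the original evaluation used.

```python
def insert_needle(haystack_sentences, needle, depth):
    """Place `needle` at a relative depth in [0, 1] of the haystack."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    position = round(depth * len(haystack_sentences))
    sentences = haystack_sentences[:position] + [needle] + haystack_sentences[position:]
    return " ".join(sentences)


def build_cases(haystack_sentences, needle, lengths, depths):
    """One prompt per (context length, depth) pair."""
    cases = []
    for length in lengths:
        for depth in depths:
            context = insert_needle(haystack_sentences[:length], needle, depth)
            cases.append({"length": length, "depth": depth, "context": context})
    return cases


def recalled(answer, keywords):
    """Crude scoring: did the model's answer mention every keyword?"""
    return all(word.lower() in answer.lower() for word in keywords)


NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
haystack = [f"Filler sentence number {i}." for i in range(10)]
cases = build_cases(haystack, NEEDLE, lengths=[4, 10], depths=[0.0, 0.5, 1.0])
print(len(cases), cases[1]["context"])
```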
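Reading the code
This feels odd: if the config file is copied to a local location, wouldn't later edits to the config file fail to take effect? I think when something changes, the copy on the user's side should be updated too. (Add a checksum, perhaps, so the cache stays efficient?)
Never mind, I misunderstood. The design intends for the user to go and edit ~/.ktransformers/config.yaml directly
When I wanted to try other models, I picked the smallest one, Mixtral-8x7B, but even with the Marlin optimization removed it still would not run (overflow). I did notice one thing, though: for models with access control, you need to add a token: 👇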
Transformer
A rough understanding first:
Attention mechanism
RNN: Recurrent Neural Network
Transformer
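Studying the paper: Attention Is All You Need
- It uses only the attention mechanism, with no recurrence or convolutions. Note the encoder-decoder architecture. The Transformer: eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. This allows parallelization.
- It does so by replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
- Characteristics (drawbacks) of RNNs: given a sequence, an RNN processes it step by step from left to right; for a sentence, it reads one word at a time. For the t-th word it computes an output called the hidden state, which is determined by the hidden state of the previous word and the t-th word itself. In this way the history learned so far is carried into the present step and combined with the current word to produce the output. All earlier information is packed into the hidden state and passed along one step at a time. Because the computation is sequential in time, it is hard to parallelize, and since history is handed forward step by step, early information may be lost later on when the sequence is long.
- In earlier architectures it is difficult to learn dependencies between distant positions. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention. (Each attention step can see all the pixels at once; a single layer sees the entire sequence.)
- Auto-regressive: outputs from earlier time steps are also used as input at the current time step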
- An attention function can be described as mapping a query and a set of key-value pairs to an output
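- Scaled Dot-Product Attention: take the inner product of the query and each key as a similarity score, divide by the square root of the key dimension d_k, and apply softmax to get non-negative weights that sum to 1; then apply these weights to the values to get the output:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
The reason for the extra division by sqrt(d_k) is that when d_k is large, the dot products can become large, the relative gaps between them grow, and the softmax output for the largest score gets closer to 1 while the rest approach 0.
Note that the mask here keeps position t from seeing anything that comes after time t. In other words, the query at time t should only look at the keys up to the current step and never at the ones that come later. In the attention computation, however, the query sees everything, because it is multiplied against every key in K. So when computing the weights we just make sure the later entries are not used: replace the scores for those later keys with a very large negative number, so that they vanish after softmax.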
- Multi-Head Attention
- How the Transformer uses the attention mechanism
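The scaled dot-product attention and masking notes above can be sketched in plain Python. This is a single-head illustration with made-up inputs; it assumes the usual decoder convention that position t may attend to positions up to and including t, and it uses -1e9 as the "very large negative number".

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V with a mask hiding later positions."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for t, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if j > t:
                scores.append(-1e9)  # masked: vanishes after softmax
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
        all_weights.append(weights)
    return outputs, all_weights


# Illustrative 3-token sequence with d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = causal_attention(Q, K, V)
print(weights[0])  # the first position can only attend to itself
```

All rows of Q are processed independently of one another, which is what makes this computation easy to parallelize when the whole sequence is known in advance.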
How does a transformer do inference?
Inference time (generating new text)
- The transformer doesn't come up with the translation in a single forward pass.
- We tokenize the input sentence (source sentence → source tokens → input_ids) using WordPiece, BPE, etc. E.g. hello my dog is cute → hello, my, dog, is, cu, te → 38, 69, 54, 2345, 222, 494
- The encoder turns each of the input_ids/tokens into an embedding vector; together these are called the last hidden states. Each vector has size 768 (typical for a base-size Transformer), so the last hidden states have shape (1, 6, 768). E.g. we can get: [-0.2, 0.5, …, 0.9] (→ hello), [0.34, 0.64, …, 0.8] (→ my), … (6 in total)
Then go to the decoder
- we first feed decoder input_ids to the model.
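The point that generation takes one forward pass per output token can be sketched with a greedy decoding loop. The "model" here is a hypothetical stand-in that simply copies the source tokens and then emits an end token; the special token IDs and function names are assumptions, and a real decoder would return a probability distribution whose argmax is taken at each step.

```python
BOS, EOS = 0, 1  # hypothetical start-of-sequence and end-of-sequence IDs


def toy_next_token(source_ids, decoder_ids):
    """Stand-in for one decoder forward pass: copies the source, then stops."""
    produced = len(decoder_ids) - 1  # tokens generated so far (BOS excluded)
    return source_ids[produced] if produced < len(source_ids) else EOS


def greedy_decode(source_ids, next_token=toy_next_token, max_new_tokens=20):
    decoder_ids = [BOS]  # the decoder input starts from a start token
    for _ in range(max_new_tokens):
        token = next_token(source_ids, decoder_ids)  # one forward pass per token
        decoder_ids.append(token)  # the output is fed back in as input
        if token == EOS:
            break
    return decoder_ids


print(greedy_decode([38, 69, 54, 2345, 222, 494]))
```

Each iteration depends on the tokens produced by the previous ones, which is why decoding at inference time cannot be parallelized across positions the way the encoder can.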
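Links/papers that look useful but that I probably do not have time to read right now (not worth the effort yet):
VS Code + Python setup
My understanding of the project
Put simply, I think the project modularizes optimizations at many levels, exposes some interfaces, and combines them into a unified framework. The choice of optimizations is then configurable (for example flash attention, Marlin (GPU), llamafile (CPU)), and you can also develop your own optimization strategies and experiment with them (for example the long-context methods that borrow from InfLLM).
There still seems to be a lot that could be done in each part, but most of it comes from ready-made third-party modules. My impression is that much of the work is hooking up to optimized APIs, or simply porting the implementations over?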
- Author: liamY
- Link: https://liamy.clovy.top/article/madsys/ktransformer01
- Notice: This article is licensed under CC BY-NC-SA 4.0; please credit the source when reposting.