Thoughts While Reading
1. Introduction
To ensure high-quality and consistent responses, applications often supplement the user query with additional texts to provide the necessary context of domain knowledge or user-specific information. A typical example is Retrieval-Augmented Generation (RAG), where a user query will be prepended by multiple text chunks retrieved from a database to form the LLM input.
This is the first problem the paper identifies: in techniques like RAG, several text chunks are prepended to the user query, so the prefill phase has a lot more work to do.
before generating any token, an LLM first uses prefill to go through the entire LLM input to produce the KV cache—concatenation of tensors associated with each input token that embeds the token’s “attention” with its preceding tokens.
This default prefill approach introduces significant latency; the paper calls it full KV recompute.
Despite many optimizations, the delay and computation of prefill grow super-linearly with the input length, and can easily slow down the service, especially on long LLM inputs
The paper later brings up a cross-attention issue: in its figure, chunk 2's KV cache depends on the content of chunk 1, and chunk 3's KV cache depends on chunks 1 and 2. At first I couldn't see why. Isn't prefill just multiplying the trained weight matrices with the corresponding input token embeddings? Why would the result depend on the content of the preceding tokens?
My current understanding: this is not a strict encoder-decoder setup but a decoder-only model, and during prefill each token attends to all tokens before it, so a chunk's KV on deeper layers is affected by what the paper calls cross-attention with preceding chunks.
But how exactly does prefill compute this cross-attention? A: this is explained in the Background section below.
selective KV recompute performs prefill on the input text in a traditional layer-by-layer fashion; however, in each layer, it updates the KV of a small fraction of tokens while reusing the KV of other tokens. Compared with full KV reuse, CacheBlend achieves much better generation quality with a small amount of extra KV update.
The paper's approach feeds the chunks' precomputed KV through the model layer by layer, but at each layer it recomputes the KV of a small fraction of tokens (about 15%) to account for cross-attention, which is enough to keep inference quality high.
Fortunately, this small amount of extra computation does not affect the end-to-end inference latency, as it is hidden through pipeline parallelism by CacheBlend. Specifically, CacheBlend parallelizes partial KV update on one layer with the fetching of the KV cache on the next layer into GPU memory
Thanks to pipelining, while the current layer's KV update is running, the next layer's KV cache can be fetched into GPU memory at the same time.
Though pipelining KV compute and fetching is not new, we note that pipelining enables CacheBlend to store KV caches in slower non-volatile devices (e.g., disk or a separate storage server) without incurring extra delay as putting them in CPU memory.
The paper pushes the pipelining one step further: KV caches can be stored on slower non-volatile devices, and as long as the loading speed is comparable to the KV update speed, there is no extra performance loss.
2. Background
The prefill phase computes the KV cache layer by layer. The input tokens embeddings on each layer are first transformed into query (Q), key (K), and value (V) vectors, of which the K and V vectors form one layer of the KV cache. The LLM then multiplies Q and K vectors to obtain the attention matrix—the attention between each token and its preceding tokens—and does another dot product between the (normalized and masked) attention matrix with the V vector. The resulting vector will go through multiple neural layers to obtain the tokens’ embeddings on the next layer.
This answers the question above: why there is cross-attention during prefill, and why we cannot simply concatenate each chunk's per-layer KV cache (plus positional encoding) when the input tokens are split into chunks. The key point is how each layer passes information to the next: the attention mechanism computes the attention matrix from the Q and K of the entire input embedding and multiplies it with V to produce the next layer's input embedding. For the first chunk (the prefix) this is indeed fine: at every layer it has no preceding tokens, and the later chunks have no influence on the input embeddings it passes forward, so its previously cached KV can be reused directly. But look at the second chunk: its cached KV was computed layer by layer with no preceding tokens, whereas now chunk 1 sits in front of it, adding extra columns of K and V. When chunk 2's Q is used to compute the next layer's input embedding, the weights multiplied with V will necessarily differ from what produced the cached K and V, especially for tokens close to the previous chunk, since the attention may now put considerable weight on the previous chunk's content when producing the next layer's input embedding.
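To convince myself of this, here is a toy numerical sketch (random weights, tiny dimensions, single head, no positional encoding; these are all my own simplifications, not the paper's setup): chunk C2's K/V on the current layer are unchanged by prepending C1, but its layer output, and therefore its KV on the next layer, is not.

```python
# Toy illustration: the layer OUTPUT for chunk C2's tokens (hence its KV on the
# next layer) changes once C1 is prepended, even though C2's K/V on this layer
# are the same linear projection of the same embeddings.
import torch

torch.manual_seed(0)
d = 8                                   # toy hidden size
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def layer_output(x):
    """One causal self-attention layer (no positional encoding, for brevity)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / d ** 0.5
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    attn = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
    return attn @ v                     # becomes the next layer's input embedding

c1 = torch.randn(4, d)                  # chunk C1: 4 tokens
c2 = torch.randn(3, d)                  # chunk C2: 3 tokens

out_c2_alone = layer_output(c2)                           # how C2's KV cache was precomputed
out_c2_with_c1 = layer_output(torch.cat([c1, c2]))[-3:]   # C2's rows when C1 precedes it

# K/V of C2 on THIS layer are identical either way (pure linear projection)...
print(torch.allclose(c2 @ Wk, torch.cat([c1, c2])[-3:] @ Wk))   # True
# ...but the layer output for C2 differs, so its KV on the NEXT layer deviates.
print((out_c2_alone - out_c2_with_c1).abs().max())              # > 0
```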
When the KV cache of a prefix is available, the prefill phase will only need to compute the forward attention matrix (between the suffix tokens and the prefix tokens) on each layer which directly affects the generated token.
3. Motivation
3.1 Opportunities of reusing KV caches
Since the reused contexts typically contain more information than the user queries, the prefill on the “context” part of the input accounts for the bulk of prefill overhead. Thus, it would be ideal to store and reuse the KV caches of reused texts, in order to avoid the prefill overhead when these texts are used again in different LLM inputs.
For example, when asking "which people in the company graduated from university Y", the prompt must be prefixed with extra company personnel information, which makes the KV cache expensive to compute; and when later questions about other employees are asked, this cached content can be reused.
3.2 Why is prefix caching insufficient?
in prefix caching, the KV cache of a reusable text chunk is precomputed once, and if the text chunk is at the prefix of an LLM input, then the precomputed KV cache can be reused to avoid prefill on the prefix. The advantage of prefix caching is that the KV cache of a prefix is not affected by the succeeding text, so the generation result will be identical to full KV recompute (without the KV cache).
The prefix-caching mechanism means the KV cache of the first chunk can be reused as-is (it has no preceding content), but it has a limitation:
The disadvantage of prefix caching is also clear. To answer one query, applications, such as RAG, often prepend multiple text chunks in the LLM input to provide different contexts necessary for answering the query. As a result, except the first chunk, all other chunks' KV caches are not reused since they are not the prefix of the LLM input. In short, prefix caching can only save the prefill of the first text chunk, so the saving will be marginal when the LLM input includes more text chunks, even if they are reused.
In practice there are usually many chunks, and the reusable prefix chunk is only a small share of the input, so prefill remains slow.
3.3 Why is full KV reuse insufficient?
It concatenates independently precomputed KV caches of recurring text chunks with the help of buffers to maintain the positional accuracy of each text chunk.
This is the full KV reuse approach: each chunk's KV cache is computed independently and the results are concatenated to form each layer's KV cache. One detail is that buffers are used to maintain positional information:
For instance, to concatenate the KV caches of chunks 𝐶1 and 𝐶2, PromptCache first needs to precompute the KV cache of 𝐶2 by running prefill on a hypothetical input that prepends 𝐶2 with a dummy prefix of length greater than or equal to 𝐶1. This way, even if 𝐶2 is not the prefix, we still correctly preserve the positional information of 𝐶2's KV cache, though each chunk's KV cache will have to be precomputed multiple times.
My understanding here: if, just to preserve positional information, C2 has to be padded with dummy data of C1's length (call it n), a full recompute should not be necessary. Look at the first layer: we only need to prepend the K and V with the n dummy tokens multiplied by the weight matrices, then obtain the second layer's input embedding; the second layer likewise gets the dummy data prepended. But note that the K and V derived from this input embedding differ from the cached K and V computed without the dummy prefix, so they do have to be precomputed again. Still, this computation does not depend on the actual content of the preceding chunk, which means it can be done in parallel: as long as the length is known in advance, the chunk's index tells us its position, so positional information is preserved, and since the length is fixed the result could even be cached and reused. Problems remain nonetheless.
a more fundamental problem is that the KV cache of a non-prefix text chunk (e.g., 𝐶2) ignores the cross-attention between the chunk and the preceding text (e.g., 𝐶1). This is because the preceding text is not known when precomputing the KV cache. Ignoring cross-attention can lead to a wrong response.
With full prefill or prefix caching, the result is clear and correct. With full KV reuse the KV caches of the two text chunks are precomputed, with each chunk having the right positional embedding, and then concatenated to form the KV cache. However, if the LLM uses this KV cache to generate the answer, it will start to ramble and not produce the right answer
This is an example of what goes wrong when cross-attention is ignored; let's look closely at the reason.
The absence of cross-attention in full KV reuse causes significant discrepancies in the forward attention matrix (explained in §2), which contains the attention between context tokens and the last few tokens, and directly affects the generated tokens.
4 Fast KV Cache Fusing
Goal. When an LLM input includes multiple re-used text chunks, how to quickly update the pre-computed KV cache, such that the forward attention matrix (and subsequently the output text) has minimum difference with the one produced by full KV recompute.
The goal is to combine the speed of full KV reuse with the accuracy of full KV recompute.
To achieve our goal, we present CacheBlend, which recomputes the KV of a selective subset of tokens on each layer while reusing other tokens’ KV
With CacheBlend, a selected subset of tokens has its KV recomputed at each layer while the rest is reused; next let's look at how exactly this is done.
We begin with the notations (§4.1), and then describe how to recompute the KV of only a small subset of tokens (§4.2), and finally explain how to select the tokens on each layer whose KV will be recomputed (§4.3).
4.1 Terminology
Table 1 summarizes the notations used in this section. For a given list of 𝑁 text chunks, we use KV^full to denote the KV cache from full KV recompute, KV^pre to denote the precomputed KV cache, and KV^new to denote the CacheBlend-updated KV cache. Here, each of these KV caches is a concatenation of KV caches associated with different text chunks. Each layer 𝑖 of the KV cache, KV_i, produces the forward attention matrix A_i.
What exactly is this forward attention matrix, and why does each layer's KV cache produce one? My original understanding was that the forward attention matrix is simply part of the KV cache, namely the part produced by the user's own prompt.
Also, what does it mean that each KV cache is a concatenation of the KV caches of different text chunks? This cannot mean simply stitching the KV matrices together, because KV^full must contain the cross-attention information, and KV^new surely has to incorporate positional information as well, so the chunks' caches cannot just be concatenated directly.
KV deviation: We define the KV deviation of a KV cache 𝐾𝑉 on layer 𝑖 of token 𝑗 as the absolute difference between KV_i[j] and KV_i^full[j], denoted as Δ_kv(KV_i, j). It measures how much different the given KV is on a particular token and layer compared to the full-prefilled KV cache. We will later use the KV deviation to identify which tokens' KV has higher deviation and thus need to be updated.
What is this difference exactly? What does "the KV deviation on layer i of token j" mean? The key point is the per-token granularity: can the KV cache really be aligned to individual tokens? Once an input token is embedded it becomes a tensor of fixed size, so one token should correspond to one row of the KV matrices (the figure below 👇 is a KV-cache computation diagram I found). In that case the comparison should be over one row, i.e., the difference of a vector?
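Here is how I would write that down; the reduction from a per-row vector difference to one scalar per token (an L1 sum here) is my own guess, since the paper only says "absolute difference":

```python
# Sketch of per-token KV deviation: each token j corresponds to one row of the
# layer-i K and V matrices, so the "absolute difference" is taken over that row
# (a vector), then reduced to a scalar per token.
import torch

def kv_deviation(kv_pre_i, kv_full_i):
    """kv_*_i: tuple (K, V), each of shape [num_tokens, head_dim]."""
    k_pre, v_pre = kv_pre_i
    k_full, v_full = kv_full_i
    per_token = (k_pre - k_full).abs().sum(dim=-1) + (v_pre - v_full).abs().sum(dim=-1)
    return per_token          # shape [num_tokens]: one deviation value per token j

# tokens with the largest deviation would be the HKVD candidates on this layer:
# deviations = kv_deviation(kv_pre[i], kv_full[i])
# hkvd = deviations.topk(int(0.15 * len(deviations))).indices
```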
Attention deviation: Similarly, for the forward attention matrix A_i on layer 𝑖, we define the attention deviation, denoted as Δ_attn(A_i), to be the L-2 norm of its difference with A_i^full. Recall from §3.3 that full KV reuse suffers from deviation in the forward attention matrix (illustrated in Figure 4) due to the absence of cross-attention.
What has kept confusing me is what the forward attention matrix actually is; the paper only mentions it like this:
“
When the KV cache of a prefix is available, the prefill phase will only need to compute the forward attention matrix (between the suffix tokens and the prefix tokens) on each layer which directly affects the generated token.
”
This term does not seem to appear anywhere else, so it looks like the paper's own coinage. "The KV cache of a prefix is available" should mean the prefix is the material prepended before the user's own prompt; once its KV cache is available, the prefill phase only needs to compute over the user's own prompt (the part after the prefix), and what that computation produces is the forward attention matrix.
Using these notations, our goal can be formulated as how to quickly update the precomputed KV cache KV^pre to the new KV cache KV^new, such that the attention deviation on any layer 𝑖, Δ_attn(A_i^new), is minimized.
4.2 Selectively recomputing KV cache
How the selected tokens are recomputed at each layer.
Workflow: The default implementation of prefill (depicted in Figure 5(a)) does not "skip" tokens while only computing the KV of a subset of tokens. Instead, CacheBlend runs the following steps (depicted in Figure 5(b)):
• It first applies a mask on the input of each layer 𝑖 to reduce it to a subset of selected tokens.
• It then transforms the reduced input into the Q, K, V vectors, which will also be restricted to the selected tokens.
• It then expands the K vector and V vector by reusing the KV cache entries associated with the un-selected tokens on layer 𝑖, so that the attention matrix includes attention between selected tokens and all other tokens.
• Finally, it runs the same attention module to produce the input of the next layer.
The reuse idea is simple: on layer i, pick some tokens, compute their KV, and overwrite the corresponding rows of the cached KV; then the new input for layer i+1 can be computed. Think it through: on the first layer, the cached KV values are all exact, and recomputing those few tokens gives the same result as the cache. On the second layer, because of cross-attention, the input embeddings produced by applying the QK attention weights to V have changed, so some cached KV entries are no longer accurate; we pick a few tokens and update their part of the KV, then again compute the new input embedding that is passed to the next layer for its KV computation.
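A minimal sketch of that per-layer workflow as I understand §4.2 (single head, no positional encoding or normalization; the function name and shapes are mine, not the paper's code):

```python
# One layer of "partial prefill": recompute Q/K/V only for the selected (HKVD)
# tokens, splice their fresh K/V into the reused cache, and attend the selected
# rows against the full K/V so they see cross-attention with all other tokens.
import torch

def partial_prefill_layer(x, kv_cached, selected, Wq, Wk, Wv):
    """x: [T, d] layer input; kv_cached: (K, V) each [T, d]; selected: LongTensor of token ids."""
    K, V = kv_cached[0].clone(), kv_cached[1].clone()
    x_sel = x[selected]                         # 1) mask the input down to selected tokens
    q_sel = x_sel @ Wq                          # 2) Q/K/V only for selected tokens
    K[selected] = x_sel @ Wk                    # 3) expand: fresh K/V for selected tokens,
    V[selected] = x_sel @ Wv                    #    reused cache entries for all the others
    scores = (q_sel @ K.T) / K.shape[-1] ** 0.5
    # causal mask: a selected token j may only attend to positions <= j
    pos = torch.arange(K.shape[0])
    scores = scores.masked_fill(pos[None, :] > selected[:, None], float("-inf"))
    attn = torch.softmax(scores, dim=-1)        # attention between selected tokens and ALL tokens
    out_sel = attn @ V                          # 4) feeds the next layer's input for selected tokens
    return out_sel, (K, V)
```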
These changes make little assumption on the exact transformer process and can be integrated with many popular transformers (more details in §6). It is important to notice that the compute overhead is proportional to the number of selected tokens. This is because it only runs computation associated with the selected tokens. If we recompute 𝑟% of tokens per layer, the total compute overhead will be 𝑟% of full prefill.
4.3 Selecting which tokens to recompute
On how the tokens to recompute are selected.
Our intuition, therefore, is to prioritize recomputing the KV of tokens who have high KV deviations. Of course, this intuitive scheme is not feasible as it needs to know the full-prefilled KV cache, and we will make it practical shortly.
The tokens that should be chosen for updating are those with large KV deviations.
To show the effectiveness of selecting tokens with high KV deviations, Figure 6 uses three models on the dataset of Musique (please see §7.1 for details). It shows the change of average attention deviation across all layers 𝑖, Δ_attn(A_i), after we use the aforementioned scheme (§4.2) to select and recompute the 𝑟% of tokens 𝑗 who have the highest KV deviation Δ_kv(KV_i^pre, j). As the recompute ratio (𝑟) increases, we can see that the attention deviation gradually reduces, and the biggest drops happen when the top few tokens with the highest KV deviations are recomputed.
Empirically, it suggests the following insight. Insight 1. On layer 𝑖, recomputing the KV of token 𝑗 who has a higher KV deviation (i.e., Δ_kv(KV_i, j)) reduces the attention deviation (i.e., Δ_attn(A_i)) by a greater amount. Insight 2. Tokens with the highest KV deviations on one layer are likely to have the highest KV deviations on the next layer.
Two empirical insights: 1. recomputing the tokens with high KV deviation reduces the attention deviation substantially; 2. the tokens with the highest KV deviation on one layer are very likely to also have the highest KV deviation on the next layer.
Thus, if we recompute the KV of, say 10%, of tokens on a layer 𝑖, we should choose the 10% of tokens which have the highest KV deviations. We refer to these tokens as the High-KV-Deviation (or HKVD) tokens on layer 𝑖 .
Following the first insight, we only need to recompute roughly the 10% of tokens with the highest KV deviations; these tokens are called High-KV-Deviation (HKVD) tokens.
Do we need to recompute KV for most tokens? In §7, we empirically show that choosing 10-20% tokens as HKVD tokens and recomputing their KV suffices to greatly reduce the attention deviation and preserve generation quality. This can be intuitively explained by attention sparsity, a well-studied property observed in many transformer models by prior research. It says that in an attention matrix, high attention typically only occurs between a small number of tokens and their preceding tokens.
We do not need to recompute the KV of most tokens; choosing 10-20% of tokens as HKVD tokens is enough for good results. The reason is attention sparsity: in an attention matrix, high attention typically occurs only between a small number of tokens and their preceding tokens.
We can see that a small fraction, about 10-15%, of tokens have much higher KV deviation than others, which corroborates the sparsity of cross-attention. If a token has very low attention with other chunks' tokens (i.e., low cross-attention with other chunks), the KV deviation between KV^pre and KV^full will be low and thus does not need to be recomputed. Only when a token has a high attention with other chunks (high KV deviation compared with the ground truth KV^full), should its KV be recomputed.
The figure shows that the fraction of tokens with large KV deviation is small. Besides attention sparsity, this is because most tokens have very low attention with other chunks and therefore do not need to be recomputed. It also shows that the KV deviation originates from cross-attention.
How to identify the HKVD tokens without knowing the true KV values or attention matrix? Naively, to identify the HKVD tokens, one must know the fully recomputed KV_i^full of each layer 𝑖 in the first place, but doing so is too expensive and defeats the purpose of selective KV recompute. Instead, we observe that the HKVD tokens on different layers are not independent: Insight 2. Tokens with the highest KV deviations on one layer are likely to have the highest KV deviations on the next layer.
This is the important Insight 2. We obviously cannot do a full recompute on every layer just to select tokens, since that would defeat the optimization; but notice that in most cases the highest-KV-deviation tokens of each layer almost coincide.
For instance, if the HKVD tokens on the first layer are tokens 2, 3, and 5, these three tokens will likely also have higher KV deviations than most other tokens on the second layer. Figure 8 uses the same setting as Figure 7 and shows Spearman’s rank correlation score between the KV deviation of tokens between two neighboring layers. The figure shows a consistently high similarity of HKVD tokens between different layers
From the figure, the HKVD tokens indeed differ little between neighboring layers; they almost coincide.
The intuition behind this correlation lies in the previous observation that the input embedding of each token changes slowly between layers in transformer models. Therefore, KV cache between layers should also bear similarity as KV cache is generated from the input embedding with a linear transformation.(We should clarify that although the HKVD tokens are similar across layers, the attention matrices between layers can still be quite different.)
Why would we expect the HKVD tokens to change little across layers? Because it was previously observed that each token's input embedding changes slowly from layer to layer, and each layer's KV cache is just a fixed linear transformation of that layer's input embedding, so it likely also changes slowly, and the set of HKVD tokens should therefore stay roughly the same.
Given the substantial correlation between the HKVD tokens, a straightforward solution is that we can perform prefill on the first layer first, pick the HKVD tokens of the first layer, and only update their KV on all other layers. Since an LLM usually has over 30 layers, this process can save most of the compute compared to full KV recompute. That said, using only the attention deviation of different tokens on the first layer may not be statistically reliable to pick HKVD tokens of all layers, especially deeper layers.
So we can run a full prefill on the first layer and pick that layer's HKVD tokens.
Q: Why is there any KV deviation on the first layer? The first layer's K and V are a linear transformation of the input embeddings and are independent of any preceding content; only after the first layer's attention matrix is multiplied with V does the next layer's input embedding differ from what produced the cache, which is what causes KV deviation. Or does "computing the first layer's HKVD tokens" mean comparing the KV derived from the computed second-layer input embeddings against the cached KV? That definition doesn't feel right either.
Thus, we opt for a gradual filtering scheme (depicted in Figure 9). If on average we want to pick 𝑟% HKVD tokens per layer, we will pick 𝑟₁% tokens based on the token-wise attention deviation on the first layer, with 𝑟₁ being slightly higher than 𝑟, and use them as the HKVD tokens on the second layer. Then we recompute the KV of these HKVD tokens on the second layer and pick 𝑟₂% tokens that have the highest token-wise attention deviation, with 𝑟₂ slightly less than 𝑟₁, as the HKVD tokens on the next layer, and so forth. Intuitively, this gradual-filtering scheme eventually picks the HKVD tokens who have high attention deviation, not only on the first layer but also on multiple layers, which empirically is statistically more reliable to identify the HKVD tokens on each layer.
The logic of the gradual filtering scheme is actually clearer in the figure caption: to choose a layer's HKVD tokens, CacheBlend computes, on this layer, the KV deviation only of the tokens selected on the previous layer, and the ones with high KV deviation among them move on to the next layer. A sketch of this loop follows below.
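A sketch of the loop under my reading of the scheme; `deviation_fn` is an assumed helper that returns the per-token KV deviation of the given candidates on a layer, not part of the paper's code:

```python
# Gradual filtering: start from a slightly over-sized candidate set (r1% > r%),
# then at each later layer recompute only the current candidates, re-rank them
# by deviation, and keep a slightly smaller set for the next layer.
import torch

def gradual_filter(deviation_fn, num_layers, num_tokens, r=0.15, r1=0.20):
    """deviation_fn(layer, token_ids) -> per-token KV deviation for those tokens (assumed helper)."""
    ratios = torch.linspace(r1, r, num_layers)      # shrink the kept fraction from r1 down to r
    candidates = torch.arange(num_tokens)           # layer 0: rank every token
    for layer in range(num_layers):
        dev = deviation_fn(layer, candidates)       # recompute KV only for the current candidates
        keep = max(1, int(ratios[layer] * num_tokens))
        keep = min(keep, len(candidates))
        candidates = candidates[dev.topk(keep).indices]
        # `candidates` are the HKVD tokens whose KV will be updated on the NEXT layer
    return candidates
```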
5 CacheBlend System Design
We present a concrete system design for CacheBlend, which reduces the impact of the selective KV recompute using the following basic insight. Basic insight: If the delay for selective KV recompute (§4.3) is faster than the loading of KV into GPU memory, then properly pipelining the selective KV recompute and KV loading makes the extra delay of KV recompute negligible.
With pipelining, if the selective-recompute latency is lower than the latency of loading the KV cache into GPU memory, the recompute cost is essentially free.
Pipelining KV loading and recompute: In CacheBlend, the selective recompute of one layer can start immediately after the pre-computed KV cache of the previous layer is loaded into the GPU. This is because which tokens' KV to recompute on one layer only depends on the KV deviation of the previous layer's tokens. As a result, if loading the pre-computed KV for one layer is faster [slower?] than or equal to selective KV recompute of one layer, the KV-loading delay should be able to hide the selective recompute delay, i.e., without incurring any extra delay on time-to-first-token (TTFT).
The selective recompute of a layer can start as soon as the previous layer's pre-computed KV cache has been loaded into the GPU (this pre-computed KV cache should be the updated one). This is because which tokens to recompute on a layer depends only on the KV deviation of the previous layer's tokens (those selected there). As a result, if loading one layer's pre-computed KV is slower than or equal to the selective recompute of one layer (I think the paper has this backwards: it says "faster"), the KV-loading delay can hide the selective-recompute delay, i.e., no extra delay is added to the TTFT.
Take the Llama-7B model and a 4K-long context, recomputing 15% of the tokens (the default recompute ratio) only takes 3 ms per layer, while loading one layer’s KV cache takes 16 ms from an NVME SSD (§7). In this case, KV loading can hide the delay for KV recompute on 15% of the tokens, i.e., KV recompute incurs no extra delay. Recomputing more tokens, which can slightly improve generation quality, may not incur extra delay either, as long as the delay is below 16 ms. On the contrary, with another model, Llama-70B, recomputing 15% of tokens takes 7 ms, but it only takes 4 ms to load one layer’s KV from an NVME SSD. Here KV loading does not completely hide the recompute delay. In short, a controller is needed to intelligently pick the recompute ratio as well as where to store the KV cache (if applicable).
Since the KV recompute time varies across models, and so does the loading time, we want to pick a recompute ratio such that the recompute delay stays below the loading time, so that in the pipeline the recompute adds no extra cost; we can even go further and dynamically choose where to store the KV cache. This motivates a component the paper introduces: the controller.
5.1 Key Components
To realize the benefit of pipelining KV loading and recompute, our system has three major components. Loading Controller: We face two design questions in practice: First, given a storage device, how to choose a recompute ratio (what fraction of tokens to recompute KV per layer) without incurring extra delay to time-to-first-token (TTFT)? Figure 10(a) illustrates an example that, if we select a recompute ratio wisely, the recompute will not cause any extra delay to loading.
Figure 10(b) is a bit odd. For GPU and CPU RAM, there is little difference between without-pipelining and with-pipelining; given how much faster those storage devices are, the delay must be dominated by recompute, i.e., the recompute delay already exceeds the load delay. At 15% on SSD, loading is slower, so without pipelining the recompute+load time makes SSD clearly slower than the other two; but once pipelining is used, since the recompute delay is not small (it determines the GPU and CPU RAM latencies) and is roughly comparable to the SSD load delay, the pipelined CPU RAM and GPU cases (whose time is set by the recompute delay) end up close to the SSD case.
But note: if the 15% recompute had a delay noticeably larger than the load, the result would look the same, because with pipelining the SSD's load time would be hidden behind the recompute, and GPU, RAM, and SSD would again have similar pipelined results. Getting SSD to match GPU therefore requires the load delay and the recompute delay to be close. So I think the premise for claiming SSD can match GPU is that we genuinely need to recompute a sufficient fraction of tokens: that time is fixed, so we should pick a storage device whose load time is comparable to the recompute time. Otherwise, however fast the storage is, after the pipelining optimization the time is decided by the slower stage, i.e., the recompute; and conversely, if the recompute delay drops (smaller ratio or faster compute), then SSD (32 Gbps) is certainly no longer optimal. This is why choosing both the storage location and the recompute ratio matters so much.
For this, the controller uses two delay estimators to find an idealized recompute ratio, such that the recompute delay is close to the loading delay. Given the recompute ratio 𝑟, length of context to be loaded 𝐿, and LLM, the recompute delay estimator calculates the expected delay T_recompute(𝑟%, 𝐿𝐿𝑀, 𝐿). The loading delay estimator estimates the loading delay of the KV cache of one layer, T_load(𝐿𝐿𝑀, 𝐿, 𝑠𝑡𝑜𝑟𝑎𝑔𝑒_𝑑𝑒𝑣𝑖𝑐𝑒), based on the LLM, the storage device's speed (which is measured offline), and the length of context 𝐿.
There are two delay estimators, one estimating the recompute delay and one estimating the loading delay.
The controller calculates an idealized recomputation ratio such that the loading delay can hide the recompute delay, without degrading the inference quality. It first picks the recompute ratio 𝑟% such that T_recompute(𝑟%, 𝐿𝐿𝑀, 𝐿) is equal to T_load(𝐿𝐿𝑀, 𝐿, 𝑠𝑡𝑜𝑟𝑎𝑔𝑒_𝑑𝑒𝑣𝑖𝑐𝑒), and then takes the max of 𝑟% and 𝑟_min%, where 𝑟_min% is the minimal recompute ratio that empirically has negligible quality drop from full KV recompute. In practice, we found 𝑟_min% to be 15% from Figure 16.
A suitable 𝑟% is found by setting T_recompute(𝑟%, 𝐿𝐿𝑀, 𝐿) equal to T_load(𝐿𝐿𝑀, 𝐿, 𝑠𝑡𝑜𝑟𝑎𝑔𝑒_𝑑𝑒𝑣𝑖𝑐𝑒), so the recompute cost is fully hidden by loading; and a lower bound 𝑟_min% (15% from the empirical measurements in the figure above) ensures accuracy does not drop too much, since too small a recompute ratio loses accuracy.
Another design question facing the loading controller is: If we only want to do KV recompute of a fixed selective recompute ratio 15%, how can we choose the right storage device to store KVs such that no extra delay is caused? As shown in Figure 10(b), under a fixed recompute ratio, the controller should pick the cheapest storage device among all devices that do not increase the delay. The system developers can provide a list of potential storage device(s), and the controller uses a storage cost estimator which estimates the cost of storing KVs for each device, namely Cost(𝐿𝐿𝑀, 𝐿, 𝑇, 𝑠𝑡𝑜𝑟𝑎𝑔𝑒_𝑑𝑒𝑣𝑖𝑐𝑒), based on the LLM, length of context 𝐿 and time duration 𝑇 needed to store it (if it is cloud storage). Then it uses T_recompute(15%, 𝐿𝐿𝑀, 𝐿) and T_load(𝐿𝐿𝑀, 𝐿, 𝑠𝑡𝑜𝑟𝑎𝑔𝑒_𝑑𝑒𝑣𝑖𝑐𝑒) to estimate the recompute and loading delays for all devices. Lastly, it finds out which storage device is the cheapest where T_recompute ≥ T_load.
This is what was mentioned earlier: once the recompute ratio is fixed at something like 15%, we should pick the cheapest (i.e., slowest) storage device that still satisfies T_recompute ≥ T_load. At this point the delay floor is T_recompute, so the loading only has to fit within it; a faster device cannot optimize further, because with pipelining the slower stage determines the total, which here is the recompute. A sketch of both controller decisions follows below.
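A hedged sketch of the controller's two decisions; `t_recompute`, `t_load` and `storage_cost` are placeholders standing in for the paper's (offline-profiled) estimators, not the real CacheBlend API:

```python
R_MIN = 0.15   # minimal recompute ratio with negligible quality drop (the paper's empirical value)

def pick_recompute_ratio(t_recompute, t_load, llm, ctx_len, device, step=0.01):
    """Largest ratio whose per-layer recompute delay is still hidden by per-layer KV loading."""
    load_delay = t_load(llm, ctx_len, device)
    r = R_MIN
    while r + step <= 1.0 and t_recompute(r + step, llm, ctx_len) <= load_delay:
        r += step
    return r                                          # never goes below the quality floor R_MIN

def pick_storage_device(t_recompute, t_load, storage_cost, llm, ctx_len, duration, devices):
    """With the ratio fixed at R_MIN, cheapest device whose loading fits within the recompute time."""
    recompute_delay = t_recompute(R_MIN, llm, ctx_len)
    feasible = [d for d in devices if t_load(llm, ctx_len, d) <= recompute_delay]
    if not feasible:                                  # every device would add delay: fall back to fastest
        return min(devices, key=lambda d: t_load(llm, ctx_len, d))
    return min(feasible, key=lambda d: storage_cost(llm, ctx_len, duration, d))
```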
KV cache store (mapping LLM input to KV caches): The KV cache store splits an LLM input into multiple text chunks, each of which can be reused or new. For instance, a RAG input typically consists of multiple retrieved context chunks (likely of a fixed length) and the user input. The splitting of LLM inputs is specific to the application, and we implement the same strategy as described in recent work. Once the input is split into text chunks, each chunk is hashed to find their corresponding KV cache, in the same way as the block hashing is implemented in vLLM. The KV caches of new chunks by the fusor (explained soon) are added to the devices. When the storage devices are full, the least recently used KV caches are evicted.
The KV cache store splits an LLM input into text chunks, each of which may be reused or new; each chunk is hashed to find its corresponding KV cache (so the input's text chunks map to their KV caches). The paper says this hashing works the same way as block hashing in vLLM, which is worth looking into later. The KV caches of new chunks, produced by the fusor (explained below), are added to the storage devices, and when the devices are full, entries are evicted by LRU.
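A rough sketch of the lookup shape I imagine for the KV cache store (vLLM's actual block hashing is more involved than this; all names here are mine):

```python
# Chunk-hash -> KV cache mapping with least-recently-used eviction.
import hashlib
from collections import OrderedDict

class KVCacheStore:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()          # chunk hash -> KV cache object

    @staticmethod
    def chunk_hash(chunk_text: str) -> str:
        return hashlib.sha256(chunk_text.encode()).hexdigest()

    def get(self, chunk_text):
        h = self.chunk_hash(chunk_text)
        if h in self.entries:
            self.entries.move_to_end(h)       # mark as recently used
            return self.entries[h]
        return None                           # new chunk: the fusor/LLM will produce its KV

    def put(self, chunk_text, kv_cache):
        h = self.chunk_hash(chunk_text)
        self.entries[h] = kv_cache
        self.entries.move_to_end(h)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used KV cache
```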
Fusor: The cache fusor (§4) merges pre-computed KV caches via selective recompute. Recall from §4.3, the decision of which tokens need to be recomputed for one layer depends on the recompute of the previous layer. Thus, the fusor waits until the recompute for the previous layer is done, and the KV caches for layer 𝐿 are loaded into the queue on GPU memory and then perform selective recompute using the recompute ratio 𝑟% calculated by the loading controller. The fusor repeats this process until all the layers are recomputed.
The fusor runs the algorithms discussed in §4: it merges the pre-computed KV caches via selective recompute, updating some tokens' KV at each layer to obtain the fused KV cache. The fusor waits until the previous layer's recompute has finished and the current layer's KV cache has been loaded, then performs selective recompute (with the ratio r determined by the loading controller), and proceeds this way until all layers are done.
The figure below is the complete workflow diagram 👇:
We put the key components together in an LLM inference workflow in Figure 11.
5.2 Putting them together
When a user of an LLM application submits a question(1), a list of relevant text chunks will be queried. The loading controller then queries the KV cache manager on whether the KV caches for those text chunks exist, and where they are stored.(2)(3) Next, the KV cache manager returns this information back to the loading controller and the controller computes the idealized selective recomputation ratio, sends it to the fusor(4), and loads the KV caches into a queue in GPU memory.(4) The KV cache fusor continuously recomputes the KV caches in the queue, until all layers are recomputed(5). Lastly, the fused KV cache is input into the LLM inference engine, which generates the answer to the user question based on the KV cache.
Referring to Figure 11 above, this describes the process; I have annotated the corresponding step numbers from the figure.
6 Implementation
• fetch_kv(text, layer_id) -> KVCache: given a piece of text and a layer id, CacheBlend fetches the corresponding KV cache from KV store into the GPU. Returns -1 if the KV cache is not in the system.
• prefill_layer(input_dict, KVCache) -> output_dict: CacheBlend takes in the input and KV cache of this layer and performs the partial prefill process for this particular layer. The output is used as the input for the next layer.
• synchronize(): CacheBlend requires synchronization before prefilling every layer to make sure the KV cache of this layer has already been loaded into the GPU.
Three interfaces are implemented here in a layer-wise manner.
fetch_kv fetches the corresponding KV cache (for a given text and layer) from the KV cache store into the GPU.
prefill_layer performs the partial prefill for one layer given that layer's input and KV cache; its output becomes the input of the next layer.
synchronize synchronizes before prefilling each layer to guarantee that the KV cache needed by that layer has already been loaded into the GPU.
We implement these three interfaces inside vLLM. For fetch_kv, we first calculate the hash of the text and search if it is inside the KV store system. If it is present, we call torch.load() to load it into GPU memory if the KV cache is on disk, or use torch.cuda() if the KV cache is inside CPU memory.
For prefill_layer, we implement this interface on top of the original layer function in vLLM that performs one layer of prefill. Three key-value pairs are recorded in the input_dict: (1) the original input data input_org required for prefilling an LLM layer (e.g., input_tensor, input_metadata), (2) a check_flag indicating whether HKVD tokens will be selected in this layer, and (3) HKVD_indices that track the indices of HKVD tokens. If check_flag is True, the input tokens with the largest deviation between the newly computed KV cache and the loaded KV cache will be selected as the HKVD tokens. If check_flag is False, partial prefill will only be performed on the current HKVD tokens indicated by HKVD_indices. Only the KV cache of the HKVD tokens will be computed and updated at each layer.
In the partial prefill for layer 𝑖, two threads are used to pipeline the computation (prefill_layer) of layer 𝑖 and the KV cache loading (fetch_kv) of the next layer 𝑖 + 1. synchronize is called before prefill_layer to assure the KV cache needed for prefill has been loaded into GPU.
fetch_kv computes the hash of the text, and if a cache exists, loads it according to where it lives (torch.load from disk, torch.cuda if it is in CPU memory).
The prefill_layer implementation records three key-value pairs: 1) input_org, the original data used for prefill (input_tensor, input_metadata); 2) check_flag, which marks whether HKVD tokens should be (re-)selected in this layer (i.e., whether the gradual filtering scheme described earlier is applied here); 3) HKVD_indices, which records the indices of the HKVD tokens.
Because two threads are used to pipeline prefill_layer and the fetch, synchronize is used to line them up, guaranteeing that when a layer is prefilled, the KV cache it needs has already been loaded onto the GPU. A sketch of how these interfaces could be wired together is below.
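Here is how I picture the two-thread pipeline being built out of those interfaces; the flag handling is simplified and the bodies of `fetch_kv` / `prefill_layer` are assumed to exist elsewhere, so this is only my mental model of the implementation, not its actual code:

```python
# Overlap layer i's partial prefill with loading of layer i+1's KV cache.
from concurrent.futures import ThreadPoolExecutor

def pipelined_partial_prefill(text, num_layers, fetch_kv, prefill_layer, check_layers=(1,)):
    """The pool loads layer i+1's KV while the main thread runs partial prefill on layer i."""
    pool = ThreadPoolExecutor(max_workers=1)
    pending = pool.submit(fetch_kv, text, 0)               # prefetch layer 0's KV cache
    input_dict = {"HKVD_indices": None}
    for layer_id in range(num_layers):
        kv_cache = pending.result()                        # plays the role of synchronize()
        if layer_id + 1 < num_layers:
            pending = pool.submit(fetch_kv, text, layer_id + 1)   # overlap the next layer's load
        input_dict["check_flag"] = layer_id in check_layers       # select HKVD tokens on this layer?
        input_dict = prefill_layer(input_dict, kv_cache)          # partial prefill of layer i
    pool.shutdown()
    return input_dict
```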
Managing KV cache: CacheBlend manages the KV caches such that: If KV cache is not inside the system and is recomputed by the LLM engine in the runtime, we will move the KV cache into CPU by torch.cpu() and open a thread to write it back to disk in the background with torch.save(). During fetch_kv, we go through the hash tables to fetch KV cache for the fusor. The hash tables are kept in CPU for their relatively small size (16MB for one million chunks).
A question: what is the situation with the KV cache recomputed by the LLM engine at runtime? That KV cache should have no preceding content, right? Or rather, is this KV cache not used in the prefill phase (i.e., not belonging to the pre-computed category), but instead the KV cache that keeps being appended during decoding?
7 Evaluation
Our key takeaways from the evaluation are:
• TTFT reduction: Compared to full KV recompute, CacheBlend reduces TTFT by 2.2-3.3× over several models and tasks.
• High quality: Compared with full KV reuse, CacheBlend improves quality from 0.15 to 0.35 in F1-score and Rouge-L score, while having no more than 0.01-0.03 quality drop compared to full KV recompute and prefix caching.
• Higher throughput: At the same TTFT, CacheBlend can increase throughput by up to 5× compared with full KV recompute and 3.3× compared with prefix caching.
Reproduction
CacheBlend repository by YaoJiayi (updated Dec 11, 2024)
On the inference process (especially hidden states)
hidden states
The stack of embedding vectors produced at the end of the encoder can be called the last hidden states of the encoder, and likewise the embedding vectors the decoder finally produces for each token are the last hidden states of the decoder. In other words, hidden states are simply the output embedding vectors produced by each layer; only the final layer's get the extra word "last".
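A quick way to see these per-layer hidden states with Hugging Face transformers (gpt2 is just a small example model here):

```python
# Inspect every layer's hidden states; the last entry is the "last hidden state".
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

with torch.no_grad():
    out = model(**tok("reading notes", return_tensors="pt"), output_hidden_states=True)

print(len(out.hidden_states))          # num_layers + 1 (the embedding output is included)
print(out.hidden_states[-1].shape)     # [batch, seq_len, hidden_size]; corresponds to last_hidden_state
```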
tokenizer
Why is the first token removed here? Its id is 1, which corresponds to the start symbol <s>; since this is input text (a chunk), the <s> start token is not wanted.
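A quick check of that, assuming a Llama-family tokenizer (where the BOS token <s> has id 1 and is prepended automatically; the model name is just an example):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
ids = tok("Hello world").input_ids
print(ids[0], tok.convert_ids_to_tokens(ids[:1]))   # 1 ['<s>']

ids_no_bos = ids[1:]                                # drop <s> when the text is a non-prefix chunk
# or avoid adding it in the first place:
ids_no_bos = tok("Hello world", add_special_tokens=False).input_ids
```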
Python-related
Python decorators
Mainly this example (shown as an image in the original notes):
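Since the original example is an image, here is a minimal stand-in of my own: a decorator that times whatever function it wraps.

```python
import functools, time

def timed(func):
    """A decorator that wraps func and reports how long each call takes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
        return result
    return wrapper

@timed
def prefill_dummy(n):
    return sum(i * i for i in range(n))

prefill_dummy(100_000)   # prints e.g. "prefill_dummy took 0.01xxs"
```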
Python class methods
The use of cls also has its limitations. Since it refers to the class itself, it cannot access instance attributes. This means that if a method needs to work with an individual instance's data, an instance method should be used rather than a class method. Into the code:
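A small illustration of that limitation (class and attribute names are made up): a classmethod sees class attributes through cls but has no particular instance, so no self.capacity.

```python
class KVStoreConfig:
    default_capacity = 1024            # class attribute: visible to class methods

    def __init__(self, capacity):
        self.capacity = capacity       # instance attribute: NOT reachable via cls

    @classmethod
    def with_default(cls):
        # cls refers to the class itself, so it can read default_capacity
        # but has no access to any particular instance's self.capacity.
        return cls(cls.default_capacity)

cfg = KVStoreConfig.with_default()     # alternative constructor via classmethod
print(cfg.capacity)                    # 1024
```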
(Question for my senior labmate:) does this refer to the other functions that call this forward function?
This forward function appears to do the deviation computation: it takes the indices of the top-K tokens according to recomp_ratio, and then…
Here a unified check decides whether an update is needed, but I don't follow this part: key_old should be the previous key, and the update replaces the rows of old_key corresponding to the top-k tokens with the new values in the variable key. Does that mean the values in key have all been freshly recomputed? Because its source is:
The comments say status 1 and 2 correspond to partial update and full update; partial and full should correspond to blend's gradual filtering of the KV cache versus the fixed-ratio scheme where the first (checked) layer decides the tokens, right? And further on,
decoding is done here.
So is this forward the operation of the paper's prefill_layer function? Except that for this kv_cache, I think the caller should be passing in the KV result for the corresponding ratio after recompute.
So let's look one level up:
I don't yet understand the rotary_emb operation above: why does rotating the old K allow modifying the kernel to only take K as input?
Then going up one more level:
Here the self_attn operation goes into the next layer, heading deeper into the KV cache computation and token generation.
Does this mean the residual connection only uses the updated tokens (i.e., the selected partial subset, 16% by default)?
But again, kv_cache is a parameter passed in from above, so we have to look at its caller.
This is where each layer's KV cache processing happens. When a check is needed, self.cache_fuse_metadata["check_layers"] decides whether a new partial ratio is used to partially update the KV cache: temp_status 1 means use the new ratio (i.e., gradual filtering), 2 means update using the previous layer's selection. Note that for the first layer, i=0, neither branch is taken and temp_status is 0, which means the first layer can directly use the previously cached KV content (as long as a cache exists) without any update, since it is exact by itself. That is,
here the KV cache is reused directly. Also, the blend.py example uses the default cache_fuse_metadata["check_layers"] = [1], so there is one thing to note: for the case where a check is needed, the code handles it like this:
Note that it is self.cache_fuse_metadata["check_layer"], without the s.
But this kv_cache is also passed in from above, so let's go up another level.
The upper-level caller of this function is:
In this part, everything up to and including self.cache_swap should be the fetch-cache operation; I suspect self.cache_swap opens a dedicated thread to do the fetch, which is what makes the concurrent pipelining possible.
My impression is that the kv_cache contents live on the GPU, and reading them runs in parallel with the recompute (as long as they are placed in different kernels on the GPU they should run concurrently?), but that would seem to require async support. I have also been trying to find where exactly the KV cache recompute happens, and I can't find it in the Python code; is the computation placed into a CUDA graph and written in CUDA? If so, the system's parallelism would really be the pipelining of threads inside the GPU. But where is the function that calls into CUDA? It seems to mask out the KV cache entries that do not need recomputation and merge the rest into vLLM's execution flow?
I seem to have found a place where the computation happens:
But that isn't it either, because this is an operation used by the vLLM framework itself:
And note that this call stack shows no pipelining across multiple threads:
Questions
Q1:
This figure picks one input (composed of multiple chunks); does the critical score refer to each token's KV deviation? The pink bars show the proportion, and the chunks towards the back generally have higher deviation.
Q2:
The first case here looks like full recompute, the second is the case where the positional encoding is not adjusted, and the third is the full KV reuse from the paper (positional encoding adjusted, but no cross-attention).
Q3:
"My understanding is that the forward attention matrix is the attention score, computed while prefilling the query, toward the preceding context chunks."
So it refers to, at each layer, the attention values toward the preceding context chunks computed while prefilling the user's query (the non-chunk part), right?
Because the description I read in the paper is:
When the KV cache of a prefix is available, the prefill phase will only need to compute the forward attention matrix (between the suffix tokens and the prefix tokens) on each layer which directly affects the generated token.
Reproduction question Q4:
This is where each layer's KV cache processing happens (though the recompute itself is not performed in here; this only sets some parameters before descending into the layers). When a check is needed, self.cache_fuse_metadata["check_layers"] decides whether a new partial ratio is used to partially update the KV cache: temp_status 1 means use the new ratio (i.e., gradual filtering), 2 means update using the previous layer's selection. Note that for the first layer, i=0, neither branch is taken and temp_status is 0, which means the first layer can directly use the previously cached KV content (as long as a cache exists) without any update, since it is exact by itself. That is,
here the KV cache is reused directly. Also, the blend.py example uses the default cache_fuse_metadata["check_layers"] = [1], so there is one thing to note: for the case where a check is needed, the code handles it like this:
q1
I misread it: it is self.cache_fuse_metadata["check_layer"] without the s, a different variable. Ouch.
q2
In this part, everything up to and including self.cache_swap should be the fetch-cache operation; I suspect self.cache_swap opens a dedicated thread to do the fetch, which is what makes the concurrent pipelining possible.
My impression is that the kv_cache contents live on the GPU, and reading them runs in parallel with the recompute (as long as they are placed in different kernels on the GPU they should run concurrently?), but that would seem to require async support. I have also been trying to find where exactly the KV cache recompute happens, and I can't find it in the Python code; is the computation placed into a CUDA graph and written in CUDA? If so, the system's parallelism would really be the pipelining of threads inside the GPU. But where is the function that calls into CUDA? It seems to be handled by masking out the KV cache entries that do not need recomputation and merging the rest into vLLM's execution flow?
Q3:
I found a piece of hash-block code, but I don't think it is the chunk-hash lookup described in the paper?
Because this is vLLM's code, I suspect the hash lookup mentioned in the paper actually means reusing this part of vLLM's KV cache storage machinery rather than something implemented separately, since these calls are basically used internally by vLLM.
So my impression is that this paper's own code is quite small: it essentially modifies vLLM's existing KV cache computation (neither prefix caching nor full recompute), adds some indices to recompute the specified content, and the system-level benefits such as pipelining come entirely from the existing vLLM machinery.
- Author: liamY
- Link: https://liamy.clovy.top/article/madsys/CacheBlend
- Notice: This article is licensed under CC BY-NC-SA 4.0. Please credit the source when reposting.