Abstract
We discover that each attention head in the model consistently focuses on specific prompt attention features during generation. Meanwhile, this robust pattern can be obtained from an ‘observation’ window located at the end of the prompts. Drawing on this insight, SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head. Our approach significantly reduces the growing computational overhead and memory footprint when processing long input sequences. Specifically, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency compared to the baseline when processing inputs of 16K tokens. At the same time, it maintains comparable performance to the baseline models across 16 long sequence datasets. Moreover, SnapKV can process up to 380K context tokens on a single A100-80GB GPU using HuggingFace implementation with minor changes, exhibiting only a negligible accuracy drop in the Needle-in-a-Haystack test. Further comprehensive studies suggest SnapKV’s potential for practical applications.
SnapKV mainly exploits the fact that, during generation, each attention head in the model consistently attends to specific prompt attention features, and that these features can be obtained from an observation window located at the end of the prompt.
In other words, LLMs show a very consistent and stable pattern of attention over the input tokens during generation, and this pattern can already be observed within a small window at the end of the input sequence. SnapKV exploits this insight by keeping only the KVs of the input tokens that the attention heads keep attending to, which sharply compresses the KV cache.
SnapKV's starting point is that LLM applications such as chat (especially RAG) and document processing usually have very long inputs but relatively short outputs. The paper doesn't quite say the following, but I inferred it from the code: because the input is long and the output is short, the KV cache can be pruned during the prefill stage by dropping unimportant tokens, while decoding proceeds as usual. No extra work is needed at decode time, so we save memory and speed things up at the same time.
1 Introduction
There are many approaches to mitigate these problems, such as KV cache eviction during generation stage [5–8]. However, most of these methods lack a detailed evaluation in long-context settings. Moreover, they mainly focus on compressing the KV cache appended during decoding steps, while overlooking the realistic problem of compressing KV cache for prompts, which is typically the bottleneck in memory efficiency. An additional challenge lies in compressing KV cache for such vast prompts without losing crucial information for accurate generation, especially in scenarios with various noisy contexts.
Other methods mainly compress the KV cache produced during decoding, but the key problem is that the prompt is very long, so the prompt KV cache is the real bottleneck. Then how can we compress it without losing information? After all, when the prompt is long there is also a lot of irrelevant content, and it is not obvious how to filter it.
In our paper, we find an important attention allocation phenomenon: only a portion of prompt tokens convey essential information for response generation, and these tokens remain unchanged during generation.
I don't fully understand this: what does it mean that these prompt tokens "remain unchanged during generation"? Does it mean that, throughout generation, it is always this same subset of prompt tokens that conveys the essential information?
From our observations, we derive an innovative and intuitive method, SnapKV, which can smartly identify the attention allocation pattern and compress the KV cache for long sequence prompts without compromising the model’s accuracy.
Here it says SnapKV can smartly identify the attention allocation pattern and use it to compress the KV cache for long-sequence prompts.
We design experiments to explore the attention allocation pattern during generation, focusing on two key questions:
- Is there a consistent attention allocation pattern for input sequence tokens?
- Is it feasible to identify this pattern prior to the generation stage?
Our finding suggests that for LLMs, the attention allocation of most input sequence tokens stays consistent during generation. Thus, LLMs know what you are looking for before generation.
So I think the "consistent attention allocation pattern" means: during generation, only a subset of the prompt tokens actually matters, and this subset (the pattern) remains the working set throughout generation, without growing, shrinking, or swapping its tokens much. Therefore it is feasible to find this pattern before the generation stage.
Inspired by our observations above, we develop an efficient and fine-tuning-free algorithm, SnapKV, which efficiently identifies critical attention features and compresses KV cache correspondingly with minimal model modification (See Fig. 1).
For SnapKV, the key here is the figure. The orange parts are the clustered features selected by SnapKV for each head. These features are then used to form new KV pairs, concatenated with the features in the observation window. Overall, the selected prefix plus the observation window makes up the new KV cache used for generation.
An example I saw on Zhihu that explains it well:
A simple example: suppose the prompt length is 1000, the observation window size is 16, and we want to compress the KV cache down to 256. SnapKV first uses the attention distribution of the last 16 tokens and selects the 240 most important positions via the voting algorithm. The keys and values at these 240 positions are then concatenated with the observation window, forming a new KV cache of size 256. Compared with H2O, it adds a pooling-based clustering step: after summing the attention weights, a 1-D pooling layer with padding performs sliding-window aggregation.
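To make this concrete, here is a minimal PyTorch sketch of the prefill-time selection as I understand it (this is my own reconstruction, not the authors' code; the function name `compress_prompt_kv`, the per-head tensor layout, and the use of max pooling are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def compress_prompt_kv(attn_weights, key_states, value_states,
                       window_size=16, max_capacity=256, kernel_size=5):
    """Prefill-time KV compression in the SnapKV style, for one batch element.

    attn_weights: (num_heads, prompt_len, prompt_len) softmax attention over the prompt
    key_states, value_states: (num_heads, prompt_len, head_dim)
    Returns keys/values of length max_capacity per head.
    """
    num_heads, prompt_len, _ = attn_weights.shape
    prefix_len = prompt_len - window_size        # e.g. 1000 - 16 = 984
    k = max_capacity - window_size               # e.g. 256 - 16 = 240 prefix positions to keep

    # Voting: sum the attention that the observation-window queries put on each prefix position.
    votes = attn_weights[:, -window_size:, :prefix_len].sum(dim=1)      # (heads, prefix_len)

    # Clustering: 1-D max pooling with padding, so neighbours of a spike also score high.
    votes = F.max_pool1d(votes.unsqueeze(1), kernel_size,
                         stride=1, padding=kernel_size // 2).squeeze(1)

    # Top-k prefix positions per head, then gather the corresponding keys/values.
    indices = votes.topk(k, dim=-1).indices                             # (heads, k)
    idx = indices.unsqueeze(-1).expand(-1, -1, key_states.shape[-1])    # (heads, k, head_dim)
    k_keep = torch.gather(key_states[:, :prefix_len], 1, idx)
    v_keep = torch.gather(value_states[:, :prefix_len], 1, idx)

    # New cache = selected prefix KVs + the observation window itself.
    k_new = torch.cat([k_keep, key_states[:, -window_size:]], dim=1)    # (heads, max_capacity, head_dim)
    v_new = torch.cat([v_keep, value_states[:, -window_size:]], dim=1)
    return k_new, v_new
```

With a prompt of length 1000 this keeps exactly 240 + 16 = 256 entries per head, and decoding then runs on the compressed cache unchanged.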
2 Related Works
Heavy-Hitter Oracle (H2O) introduces a policy that greedily drops KVs during generation based on a scoring function derived from cumulative attention. While this approach effectively compresses the KVs appended to the cache during generation, it overlooks compression of prompt KVs, which is crucial for reducing memory and computational overhead.
H2O compresses the KV cache produced during generation: it mainly compresses the KVs appended to the cache during decoding efficiently, but it does not optimize the prompt KVs, which are the dominant cost in long-context inference.
Building on a similar concept, Adaptive KV Compression (FastGen) implements a dual-phase algorithm that encompasses four KV cache compression policies. Initially, it identifies optimal policies through profiling results obtained from prompt encoding. Subsequently, it dynamically evicts caches during the generation phase based on these policies. Nonetheless, it faces a similar problem to H2O.
Adaptive KV Compression (FastGen) first identifies an optimal compression policy via profiling during prompt encoding (choosing from the four policies), and then dynamically evicts the generated KV cache during the generation stage based on that policy. Its problem is similar to H2O's: it cannot reduce the prompt's KV cache.
ScissorHands focuses on identifying and retaining pivotal tokens that exhibit a consistent attention weight pattern with previous token windows during generation steps. However, this method concentrates solely on the window of previous pivotal tokens in generation and neglects the extensive prompt that contains essential information for generating accurate responses. This oversight could lead to an inability to extract detailed information from prompts.
The idea of ScissorHands is to identify and keep the tokens (pivotal tokens) that exhibit a consistent attention weight pattern with the previous token windows during generation. But this method only looks at the window of previous pivotal tokens during generation and thus neglects the prompt, which contains a lot of essential information, so detail is lost.
One question: what is ScissorHands' token window exactly? I think it is a neighbourhood of tokens, so the algorithm only picks out some key tokens within that neighbourhood during generation, which certainly cannot cover much of the prompt.
3 Observations
- Pattern can be identified before generation. In this experiment, we split the attention features of input sequence of each layer into multiple windows, each with 128 tokens, and calculate the averaged attention weights of the last 20 windows separately. To understand the attention allocation patterns along input sequences, we calculate the overlap rates between important attention features of input sequence (those with high average attention weights) identified by each window and the actual ones used by generation. The experimental results are shown in Fig. 2. We observe that the last window of input sequence recognizes highly similar attention allocation pattern with the actual generation.
Here the input is split into many windows; the attention pattern is computed separately for each of the last 20 windows and compared against the attention pattern during actual generation. The pattern predicted from the last window of the input overlaps very strongly with the attention pattern at generation time.
This overlap suggests we can find an attention pattern to filter out the useful information in a very long input. But could this pattern change a lot from layer to layer, i.e., could later layers attend to information that was filtered out? That is the next observation:
- Pattern is consistent during generation. We study if the positions of features identified as crucial in the last window of input sequence maintain their significance in the subsequent token generation. In the experiment, we split the generated tokens into 4 windows for every layer, each spanning 128 tokens, to compute the averaged overlap rates of these windows versus the last window of input sequence. As shown in Fig. 3, active attention features of input sequence obtained from the last window exhibit remarkable consistency throughout the generation process, as evidenced by high overlap rates.
At every layer, the overlap between the input-sequence attention features selected during actual generation and those predicted from the last window of the input sequence stays very high, so the phenomenon holds consistently across layers.
Thinking about it naively: this black-box transformer is mostly linear computation (token embeddings are linearly projected into Q, K and V, the QK weights are multiplied onto V, plus some non-linear pieces such as residual connections and softmax), so its attention probably isn't that flexible. For a piece of information (the input prompt), the text that follows it during generation should keep attending to largely the same parts of it, so the attention pattern of the tail of the input prompt over that information matches the attention pattern of the subsequently generated text over it. This explains the two observations:
Pattern can be identified before generation (we can identify it using the attention pattern that the window at the end of the input prompt has over the text).
Pattern is consistent during generation (and precisely because of this, we can both predict from the end of the input prompt and reuse that prediction directly at every layer).
Q: Looking at Fig. 3 (and even Fig. 2), at around layers 13 and 30 the hit rate is not that high. Could we drop a bit less there? When predicting from the last window, maybe the token weights shift somewhat as the layers go deeper instead of being uniform, so dropping less and accounting for this deviation might improve the overlap. Then again, SnapKV may already build in some redundancy; let's keep reading.
4 SnapKV
In the attention mechanism, the growth in prompts will significantly increase time complexity for generation due to the Query-Key matrix multiplication. SnapKV addresses this issue by maintaining a constant amount of prompt KVs during generation, significantly reducing serving times for long-context LLMs. To structure our method coherently, we propose the following terminologies:
- Prompt Length ($L_{prompt}$): The total length of the user-provided input.
- Observation Window ($L_{obs}$): The last segment of the prompt. This window is crucial for analyzing the influence of different contexts on attention allocation patterns.
- Prefix Length ($L_{prefix}$): The length of the input preceding the observation window. It is part of the prompt and does not include the observation window. Overall, we have: $L_{prompt} = L_{prefix} + L_{obs}$.
- Voting: The process of calculating attention weights for each query within the observation window across all heads, aggregating these weights to highlight the prefix positions that are considered most significant. For a single batch of sequence, formally:
$$\mathbf{C} = \sum_{i=0}^{L_{obs}} \mathbf{W}_{obs}[:, i, :], \qquad \mathbf{I} = \mathrm{Top}_k(\mathbf{C}, k)$$
where $\mathrm{Top}_k(\cdot, k)$ selects the indices $\mathbf{I}$ of the top $k$ values in tensor $\mathbf{C}$ per head, $k$ is defined as $\lfloor p \times L_{prefix} \rfloor$ with $p$ the compression rate, and the tensor $\mathbf{W}_{obs} \in \mathbb{R}^{N \times L_{obs} \times L_{prefix}}$ represents the subset of the prompt softmax-normalized attention features over $N$ heads.
- Hit Rate: We define attention features above a predefined threshold $\theta$ during generation as important features. The hit rate, $H$, is the number of important features successfully selected by the previous voting process over the total number of important features. $H$ quantifies the effectiveness of the voting mechanism and is calculated as follows:
$$\mathbf{M} = \mathrm{zeros\_like}(\mathbf{A}_{cur}),\; \mathbf{M}[\mathbf{I}] = 1, \qquad \mathbf{O} = \mathbf{M} \odot \mathbb{1}(\mathbf{A}_{cur} > \theta), \qquad H = \frac{\sum \mathbf{O}}{\sum \mathbb{1}(\mathbf{A}_{cur} > \theta)}$$
where $\mathbf{A}_{cur}$ represents the attention features between the current generated query and the prefix keys, and $\mathbf{M}$ selects attention features by the voted indices $\mathbf{I}$. The threshold operation filters $\mathbf{A}_{cur}$ to retain only features with values over $\theta$, indicating important attention activations. The overlap $\mathbf{O}$ measures the agreement between the features selected by voting and the currently important ones, quantifying the alignment of the current attention with the previously identified important features. The hit rate $H$ is then the ratio of the sum of the overlap to the sum of the important features, providing a metric for the efficacy of the voting mechanism in recognizing and emphasizing important attention features within the context. We use $\mathcal{H}(\cdot)$ to denote the combination of the last two equations.
This part mainly defines the mathematical notation used in the rest of the paper.
The important part is the voting process: using the observation window, it computes, for each head, the aggregated attention weights that the observation-window tokens put on the prefix, giving the matrix $\mathbf{C}$; then $\mathrm{Top}_k$ takes, within each of the $N$ heads, the indices of the top-$k$ values. These indices are what we take to be the fixed attention pattern/features. Whether these indices really point to the important tokens is checked via the hit rate against the attention features of the actual generation. The idea of the hit rate is simple: take the current attention feature matrix $\mathbf{A}_{cur}$, build a zero matrix of the same shape, set to 1 the positions of the key tokens found from the observation window, also mark the entries of $\mathbf{A}_{cur}$ above the threshold (which we regard as key tokens), and then measure the overlap between the two.
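Here is a small sketch of how I read the hit-rate computation (my own simplification; the shapes, the threshold `theta`, and the toy numbers are illustrative assumptions):

```python
import torch

def hit_rate(A_cur, voted_indices, theta=0.05):
    """A_cur: (num_heads, prefix_len) attention of the current generated query over
    the prefix keys; voted_indices: (num_heads, k) positions picked by the voting."""
    M = torch.zeros_like(A_cur)
    M.scatter_(1, voted_indices, 1.0)            # 1 at positions selected by voting

    important = (A_cur > theta).float()          # features above the threshold
    overlap = M * important                      # important features that voting also caught
    return overlap.sum() / important.sum().clamp(min=1)

# Toy check: if the voted top-240 positions are taken from A_cur itself,
# every entry above the threshold is covered and the hit rate is 1.0.
A_cur = torch.rand(32, 984) * 0.1
voted = torch.topk(A_cur, 240, dim=-1).indices
print(hit_rate(A_cur, voted, theta=0.09))
```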
4.1 Observation Window-based Algorithm
Overall, SnapKV operates through two stages as follows:
- Vote for important previous features. By the voting process defined above (Eq. 2), we select the important attention features based on the observation window. Sec.3 highlights the consistency of the attention allocation pattern within observation windows throughout the generation, suggesting that these selected attention features are also vital for subsequent generation. Furthermore, we implement clustering to retain the features surrounding the selected attention features (Sec. 4.3). Line 8-17 shows the pseudo code of the voting process.
- Update and store compressed keys and values. We concatenate the selected attention features with all features within the observation window, which encompasses all features containing the necessary prompt information. Line 18-24 shows the compressing process. The concatenated KVs are stored for later use in generation, thereby saving memory usage.
There is a pooling operation on line 13 that I don't quite understand yet.
Later in the paper the advantage of pooling is explained:
(Sec. 4.3, Efficient Clustering via Pooling) In LLMs, information retrieval and generation rely on features with high attention weight and are supplemented by copying the rest of features in context using induction heads [15]. Hence, naively selecting the top features results in retaining only portions of details and then losing the completeness of the information. For example, such compression might cause the LLMs to retrieve only the country code of a phone number and hallucinate the rest. Our experiment also revealed that only selecting the features with the highest weights is insufficient (Sec. 5.2). Such sparse selection risks compromising the contextual integrity encapsulated in between features, thereby reducing accuracy. Based on the insights, we propose a fine-grained clustering algorithm utilizing a pooling layer shown in Line 13.
An explanation of the `expand` operation on line 17:
- `indices.unsqueeze(-1)`: `unsqueeze(dim)` inserts a new dimension of size 1 at the given dimension `dim`; `-1` means the new dimension goes at the end. If `indices` originally has shape `(batch_size, num_heads, seq_length)`, then `indices.unsqueeze(-1)` has shape `(batch_size, num_heads, seq_length, 1)`.
- `.expand(-1, -1, -1, head_dim)`: `expand(*sizes)` broadcasts the tensor to the given shape without allocating new memory, returning a view; each entry in `sizes` is either the target size or `-1`, which keeps that dimension unchanged. Here `head_dim` is the target size of the last dimension, so if `head_dim` is 64, `indices.unsqueeze(-1).expand(-1, -1, -1, head_dim)` has shape `(batch_size, num_heads, seq_length, head_dim)`.
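A quick shape check of these two calls (the numbers — 240 selected positions out of a 984-token prefix, `head_dim` of 64 — are just illustrative):

```python
import torch

batch_size, num_heads, seq_length, head_dim = 1, 32, 240, 64
indices = torch.randint(0, 984, (batch_size, num_heads, seq_length))

expanded = indices.unsqueeze(-1).expand(-1, -1, -1, head_dim)
print(indices.shape)    # torch.Size([1, 32, 240])
print(expanded.shape)   # torch.Size([1, 32, 240, 64])

# The expanded index is what torch.gather needs to pull out whole key/value vectors:
key_states = torch.randn(batch_size, num_heads, 984, head_dim)
selected = key_states.gather(2, expanded)
print(selected.shape)   # torch.Size([1, 32, 240, 64])
```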
4.2 Robustness Analysis of Hit Rate
Overall, we want to answer the following two questions:
- Does the nature of instructions in the prompt affect the hit rate?
- Does the context and instruction positioning affect the hit rate?
4.2.1 Contextual Dependency of Patterns
We analyze whether instructions will affect the selection of important features even if the provided context is the same. Our experiment utilizes different instructions on the same document and selects the important features based on the observation window that consists of both the instructions and their corresponding responses. Then we calculate the hit rates between important features selected by different instruction-response pairs within the same document by using $\mathcal{H}(\cdot)$. By varying the instructions, we observe that different instructions prioritize different prefix attention features, as indicated by the descending trend in hit rates shown in Fig. 4. Our findings reveal an interesting aspect of KV cache in LLMs: the important attention features change with different instructions. This variability challenges the effectiveness of static compression methods that depend on constant weighted importance or fixed policies. Thus, the complex relationship between context and related KV cache emphasizes the need for context-aware compression strategies and highlights the capability of SnapKV that recognizes this dynamic.
What do the "instructions" here refer to? I think they are the test questions (i.e., the user's prompt), and the "response" is the generated content.
The main point is that, for the same context, attention is not fixed: different instructions genuinely need different information, so under different instruction-response pairs the selected important features differ a lot. Hence we cannot use a static scheme to lock in the important positions for a piece of context; they need to be determined dynamically (based on the current instruction).
4.2.2 Invariance to Instruction Positions
Our analysis also extends to the significance of instruction positioning on the interpretability of LLMs and their selection of important features. We calculate the average hit rate for the responses using the same observation window size as in the previous experiment. Our results shown in Fig. 5 indicate that across all three datasets, the hit rates are consistently high regardless of whether instructions are positioned before or after extensive supplementary contexts. This consistency suggests that SnapKV is able to identify attention allocation patterns regardless of the question’s positions.
This answers the second question from the beginning of the section, "Does the context and instruction positioning affect the hit rate?" → No.
The tests show that the important features found through the instruction overlap heavily with the important features used in the subsequent generation, and it makes little difference whether the instruction is placed at the beginning or at the end. (One doubt: around layers 10-20, especially layer 15, putting the question at the beginning vs. at the end does look somewhat different, with the beginning position showing a lower hit rate.) This means we can make the prediction regardless of where the question sits.
4.3 Efficient Clustering via Pooling
In LLMs, information retrieval and generation rely on features with high attention weight and are supplemented by copying the rest of features in context using induction heads [15]. Hence, naively selecting the top features results in retaining only portions of details and then losing the completeness of the information. For example, such compression might cause the LLMs to retrieve only the country code of a phone number and hallucinate the rest. Our experiment also revealed that only selecting the features with the highest weights is insufficient (Sec. 5.2). Such sparse selection risks compromising the contextual integrity encapsulated in between features, thereby reducing accuracy. Based on the insights, we propose a fine-grained clustering algorithm utilizing a pooling layer shown in Line 13.
The main point is that we cannot take only the KVs of the high-weight tokens in isolation; the tokens around them are also needed and play a crucial supporting role. Hence the pooling (grouping) operation, which selects a whole patch of tokens at a time, i.e. the fine-grained clustering algorithm mentioned in the paper.
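A toy illustration of the difference (my own example, not from the paper): only the first token of a multi-token fact gets a high vote, so naive top-k keeps isolated spikes, while max pooling lets the whole neighbourhood of the spike survive the selection.

```python
import torch
import torch.nn.functional as F

# Votes for 20 prefix positions of one head. Positions 10-14 hold a multi-token
# fact, but only its first token (position 10) scored high.
votes = 0.01 * torch.arange(20, dtype=torch.float32)
votes[10], votes[3], votes[7], votes[17] = 5.0, 2.0, 1.5, 1.0

naive = torch.topk(votes, 5).indices.sort().values
pooled = F.max_pool1d(votes.view(1, 1, -1), kernel_size=5, stride=1, padding=2).view(-1)
clustered = torch.topk(pooled, 5).indices.sort().values

print(naive.tolist())      # [3, 7, 10, 17, 19] -- isolated spikes, the rest of the fact is dropped
print(clustered.tolist())  # [8, 9, 10, 11, 12] -- the neighbourhood around position 10 survives
```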
5 Experiments
5.1.1 Needle-in-a-Haystack
The Needle-in-a-Haystack test challenges the model to accurately retrieve information from a specific sentence ("needle") concealed within an extensive document (the "haystack"), with the sentence placed at a random location. Typically, sentences that are inserted in the middle of prompts are harder to retrieve. To rigorously evaluate SnapKV’s capabilities, we extended the document length to 380k tokens which is the longest content that can be processed by a single A100-80GB GPU. We configured the prompt KV cache size to 1024, enabling SnapKV to select the most crucial 1024 attention features from the prompt for answer generation, with a maximum pooling kernel size of 5 and an observation window size of 16, both of which are hyperparameters that can be customized. The compelling outcomes in Fig. 6 from the Needle-in-a-Haystack test underscore SnapKV’s potential to precisely manage small details on extremely long input contexts with a 380x compression ratio.
I find this result quite predictable rather than surprising. As mentioned before, the Needle-in-a-Haystack test measures whether the model attends to a specific detail, and the model's attention is consistent under the same instruction; the compression fully exploits this, so it at least preserves how much the original model attends to the material. A "did you notice this text" test is therefore unlikely to show much degradation. But recall that the pooling/clustering discussion said that reasoning also relies on tokens whose attention weights are not that high; that would not hurt this retrieval-style test much, but I suspect it would affect the correctness of the model's reasoning, which matters a lot. So the good result here feels natural, yet I expect a non-trivial impact when the model has to reason and produce new content.
So let me jump ahead to the test in Sec. 5.2 first:
5.2 Ablation Study of Effectiveness of Pooling
Our evaluation utilizes the modified LongEval-Lines benchmark, incorporating randomly generated pairs and averaged scores. LongEval-Lines presents a greater challenge compared to Needle-in-a-Haystack because it involves identifying key-value pairs in noisy contexts of the same format, while in Needle-in-a-Haystack, the relevant information is more distinctly separated from other contexts. We apply max pooling with a kernel size of 5 and use the observation window with a size of 16, which are hyperparameters and could be customized according to different models. As illustrated in our results (Fig. 8), we find that pooling significantly enhances retrieval accuracy compared to methods not utilizing pooling. We hypothesize that this is because the initial portions of critical token clusters are weighted higher by attention mechanisms. Typically, large language models tend to copy the tokens surrounding the initial portions to keep the contextual integrity. However, naively compressed KV cache breaks this mechanism and could lead to partially correct results (Fig. 8). Note that throughout our experiments, the choice between max pooling and average pooling did not yield significant differences in performance.
Nitpick: the title of the second subplot here is a typo; it should be "Mistral-7B-Instruct-v0.2 with Pooling".
The highlighted part shows the authors are also aware of the issue with the Needle-in-a-Haystack test: the needle is too clearly separated from the rest of the content, so attention on it naturally rises, and passing that test is almost automatic under this scheme. To really see whether the compression hurts anything, a better test is needed, such as this modified LongEval-Lines benchmark.
The authors' explanation (story) here is somewhat convincing to me:
The idea is that the first tokens of a critical piece of information receive high attention, while the equally important tokens that follow do not. A typical LLM will still "copy" those following tokens (because they adjoin the high-attention part) to keep the context coherent. (I don't quite understand what "copy" means here; does it mean manually raising the weights of these tokens, or that the model reproduces the surrounding content of the original text?)
Looking at the test setup, every input is normalized into the form "line makeshift-penguin: REGISTER_CONTENT is <10536>", and the model must find the value for a given key. The results show pooling clearly helps, and the no-pooling variant looks very bad. I agree that keeping the KVs of the relevant neighbouring tokens via pooling is reasonable, but is that range really fixed? (The paper uses a kernel size of 5, and since the pooling is never described in detail, I think a fixed range must be problematic.) Also, this test only shows that when the model looks up known details, this compression plus pooling preserves its ability to retrieve information from the material. What about reasoning that extends a bit beyond that information? I suspect such information may not carry very high attention weights, since it is only used when inferring new content and likely draws on the broader context of the whole document, and I doubt this compression scheme can preserve that.
That said, the "Visualization of the Generated Context" example at the end of the paper looks fairly good:
So I plan to keep reading the later experiments to see how it performs.
BTW: up to here I really couldn't find the authors explaining how their pooling is implemented, and I even have a vague feeling it might be a bit hand-wavy. Or is this pool/clustering operation actually a well-known trick?
5.3 Experiments on LongBench
We evaluate SnapKV on these four models using LongBench, a multi-task benchmark designed to rigorously evaluate long context understanding capabilities across various datasets, spanning single and multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion. We choose LWM-Text-Chat-1M with 1 million context length, LongChat-7b-v1.5-32k, Mistral-7B-Instruct-v0.2, Mixtral-8x7B-Instruct-v0.1 with 32k context length as our baselines. For each model, we test SnapKV with various settings: compressing KV caches in the prompt to 1024, 2048, and 4096 tokens. We use max pooling with kernel size 7 and observation window size 32. Table 1 illustrates a negligible performance drop from models with SnapKV compared with original implementations for 16 different datasets, even with prompt-KV with 1024 tokens. Some models even outperform the baseline.
The benchmarks cover single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion. In the table, bold marks the best value per task; SnapKV's results are not far off, and in some places even beat the baseline. So my earlier worry seems to be resolved? (I feel that code completion should require producing information that was never in the prompt. But I'm not familiar with what, e.g., few-shot learning and synthetic tasks actually test here; searching the keywords didn't turn up much — is this assumed common knowledge?)
I looked up few-shot learning:
The goal of zero-shot and few-shot learning is to get a machine-learning model to perform a new task it was not trained for.
I feel this is exactly the kind of test I care about: give the model a few examples, let it pick up a pattern or approach, and then "create" some new content.
Is few-shot prompting the same as few-shot learning? "Few-shot learning" and "zero-shot learning" are well-known concepts in machine learning that were studied long before LLMs appeared on the scene. In the context of LLMs, these terms are sometimes used interchangeably with "few-shot prompting" and "zero-shot prompting." However, they are not the same. Few-shot prompting refers to constructing a prompt consisting of a couple of examples of input-output pairs, with the goal of providing an LLM with a pattern to pick up. Few-shot learning is a model adaptation resulting from few-shot prompting, in which the model changes from being unable to solve the task to being able to solve it thanks to the provided examples. In the context of LLMs, the "learning" is temporary and only applies to a particular chat conversation. The model's parameters are not updated, so it doesn't retain the knowledge or capabilities.
I'm not that interested in the remaining experiments, so I'll skip them for now...
5.5 Case Study: Compatibility with Parallel Decoding
In this section, we provide a novel perspective on employing KV cache compression synergistically with parallel decoding. Parallel decoding leverages a lightweight model or an adaptor to draft initial tokens, which are subsequently verified by larger LLMs. This strategy effectively reduces memory overhead, a critical concern given the autoregressive nature of LLMs that renders them more memory-intensive than computationally demanding. Specifically, in LLMs, each decoding step involves generating a single token, with the transfer of weights between High Bandwidth Memory (HBM) and cache contributing to significant overhead.
What is this Medusa decoding framework? How does it work?
Medusa (FasterDecoding's GitHub repository)
- Author: liamY
- Link: https://liamy.clovy.top/article/madsys/SnapKV
- License: This post is released under the CC BY-NC-SA 4.0 license; please credit the source when reposting.