Let's work through the numbers using an A10 GPU as the example:
A10 key specs
| Spec | Value | With sparsity* |
| --- | --- | --- |
| FP32 | 31.2 TF | |
| TF32 Tensor Core | 62.5 TF | 125 TF* |
| BFLOAT16 Tensor Core | 125 TF | 250 TF* |
| FP16 Tensor Core | 125 TF | 250 TF* |
| INT8 Tensor Core | 250 TOPS | 500 TOPS* |
| INT4 Tensor Core | 500 TOPS | 1000 TOPS* |
| GPU Memory | 24 GB GDDR6 | |
| GPU Memory Bandwidth | 600 GB/s | |
| Max TDP Power | 150W | |
Key specs to focus on:
- FP16 Tensor Core: This is our compute bandwidth. We have 125 TFLOPS (teraflops, or a trillion floating-point operations per second) of available compute for models in half-precision (also known as FP16).
Half-precision is a binary number format that occupies 16 bits per number, as opposed to full-precision, which refers to a binary format that utilizes 32 bits per number. For many ML applications, using half-precision is a practical choice as it requires less memory without losing accuracy. In this blog post, we ignore datasheet values associated with sparsity (denoted by an asterisk).
- GPU Memory: We can quickly estimate the size of a model in gigabytes by multiplying the number of parameters (in billions) by 2. This approach is based on a simple formula: with each parameter using 16 bits (or 2 bytes) of memory in half-precision, the memory usage in GB is approximately twice the number of parameters. Therefore, a 7B parameter model (7 billion parameters), for instance, will take up approximately 14 GB of memory. Why does this matter? Well, with our A10's 24 GB of VRAM, we can comfortably run a 7B parameter model and still have about 10 GB of memory remaining as a buffer. This spare memory plays an important role in model execution, something we will elaborate on later.
- GPU Memory Bandwidth: We can move 600 GB/s from GPU memory (also known as HBM or high bandwidth memory) to our on-chip processing units (also known as SRAM or shared memory).
Calculating the operations per byte (ops:byte) ratio
We can calculate the ops:byte ratio of our hardware. This tells us how many floating point operations (FLOPs) we can complete for every byte of memory we access.
Performance of a function on a given processor is limited by one of the following three factors; memory bandwidth, math bandwidth and latency. Consider a simplified model where a function reads its input from memory, performs math operations, then writes its output to memory.
Given the numbers from the spec sheet, we calculate the ops:byte ratio for the A10:

ops:byte ratio = compute bandwidth / memory bandwidth = 125 TFLOPS / 600 GB/s ≈ 208.3 ops per byte

This means that to take full advantage of our compute resources, we have to complete 208.3 floating point operations for every byte of memory we access.
- If we find ourselves only able to complete fewer than 208.3 operations per byte, our system performance is memory bound. This essentially means that the speed and efficiency of our system are constrained by the rate at which we can transfer data or the input-output operations that it can handle.
- If we want to do more than 208.3 floating point operations per byte, our system is instead compute bound. In this state, our effectiveness and performance are restrained not by the memory, but rather the number of compute units that our chip possesses.
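As a quick sanity check, here's the same ops:byte arithmetic as a tiny Python snippet, using the A10 datasheet values quoted above:

```python
# ops:byte ratio of an NVIDIA A10, from the datasheet values above
compute_bw = 125e12   # FP16 Tensor Core compute: 125 TFLOPS
memory_bw = 600e9     # GPU memory bandwidth: 600 GB/s

ops_per_byte = compute_bw / memory_bw
print(f"A10 ops:byte ratio = {ops_per_byte:.1f}")   # ~208.3
```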
So the question becomes: what is our algorithm's arithmetic intensity?
Calculating arithmetic intensity
Next, we need to calculate the arithmetic intensity of our 7 billion parameter LLM. Arithmetic intensity is the number of compute operations an algorithm takes divided by the number of byte accesses it requires, and it is a hardware-agnostic measurement.
The most computationally expensive parts of our 7B parameter LLM are the attention layers, which ensure next token predictions are weighted based on the relevance of previous tokens. Because attention layers are the most computationally demanding part of the inference, we’ll calculate our arithmetic intensity there.
Understanding attention layers requires getting just a bit more specific with how the model works under the hood. When sampling from a transformer, there are two phases:
- Prefill: In the first phase, the model ingests your prompt tokens in parallel, populating the key-value (KV) cache. The KV cache can be thought of as the state for your model, nestled within the attention operation. During the prefill, no tokens are being generated.
- Autoregressive sampling: In the second phase, we leverage our current state (stored in the KV cache) to sample and decode the next token. We pay a small price in storage in order to not recalculate the cache for every single new token. Without the KV cache, every successive token would take longer to sample because we would have to pass all previously seen tokens through the model.
Breaking down the attention equation
The authors of the FlashAttention paper give a helpful breakdown of the standard attention algorithm. This framing will make it easier for us to calculate memory and compute in the algorithm. With Q, K, and V stored in HBM, the standard algorithm is:

1. Load Q and K from HBM, compute S = Q * K^T, write S to HBM.
2. Read S from HBM, compute P = softmax(S), write P to HBM.
3. Load P and V from HBM, compute O = P * V, write O to HBM.

Eagle-eyed readers might notice this algorithm drops the scaling by sqrt(d_k). It's a minor factor that we can safely ignore. (When I read "Attention Is All You Need", the authors do argue this scaling matters, but ignoring it here barely changes the arithmetic-intensity estimate.)

- N is the sequence length of the LLM, which sets the context window. For Llama 2 7B, N = 4096.
- d is the dimension of a single attention head. For Llama 2 7B, d = 128.
- Q, K, and V are all matrices used to compute attention. Their dimensions are N by d, or in our case 4096x128.
- S and P are both matrices calculated during the equation. Their dimensions are N by N, or in our case 4096x4096.
- O is the output matrix with the results of the attention calculation. It is an N by d matrix, or in our case 4096x128.
- HBM is high bandwidth memory.
- From the data sheet, we know that we have 24 GB of HBM on the A10 operating at 600 GB/s.
| Line in algorithm | Load from memory | Compute | Store to memory |
| --- | --- | --- | --- |
| Line 1 | size_fp16 * (size_Q + size_K) = 2 * 2 * (N * d) | cost_dot_product_QK * size_S = (2 * d) * (N * N) | size_fp16 * size_S = 2 * (N * N) |
| Line 2 | size_fp16 * size_S = 2 * (N * N) | cost_softmax * size_P = 3 * (N * N) | size_fp16 * size_P = 2 * (N * N) |
| Line 3 | size_fp16 * (size_P + size_V) = 2 * ((N * N) + (N * d)) | cost_dot_product_PV * size_O = (2 * N) * (N * d) | size_fp16 * size_O = 2 * (N * d) |
We calculate total memory movement by summing the first and third columns (the loads from and stores to memory):

total memory movement = 8 * N^2 + 8 * N * d bytes

And we calculate total compute by summing the second column (the compute on the loaded data):

total compute = 4 * d * N^2 + 3 * N^2 FLOPs

The arithmetic intensity is their ratio:

arithmetic intensity = (4 * d * N^2 + 3 * N^2) / (8 * N^2 + 8 * N * d) ≈ 62 ops per byte for N = 4096, d = 128
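To keep the bookkeeping honest, here's a small Python sketch that sums the table above and evaluates the arithmetic intensity for Llama 2 7B's attention, with N and d as defined earlier:

```python
# Arithmetic intensity of the standard attention algorithm (per the table above)
N, d = 4096, 128            # Llama 2 7B: sequence length and head dimension
size_fp16 = 2               # bytes per FP16 value

# memory movement: loads + stores, summed over lines 1-3 of the algorithm
memory_bytes = (
    size_fp16 * 2 * (N * d) + size_fp16 * (N * N)             # line 1: load Q, K; store S
    + size_fp16 * (N * N) + size_fp16 * (N * N)               # line 2: load S; store P
    + size_fp16 * ((N * N) + (N * d)) + size_fp16 * (N * d)   # line 3: load P, V; store O
)

# compute: dot products and softmax, summed over lines 1-3
compute_flops = (2 * d) * (N * N) + 3 * (N * N) + (2 * N) * (N * d)

print(f"arithmetic intensity = {compute_flops / memory_bytes:.1f} ops/byte")  # ~62
```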
Discovering our inference bottleneck
Our arithmetic intensity for Llama 2 7B is 62 operations per byte, which is way less than our A10’s ops:byte ratio of 208.3.
Thus, during the autoregressive phase, our model is memory bound. In other words, in the time it takes us to move a single byte from memory to compute, we could have completed many, many more calculations than just on that byte.
This is a problem. We’re paying good money to keep our GPUs up, but are not using the compute that’s available to us.
Batching memory-bound processes on a GPU
One solution is to leverage the spare GPU memory to run forward passes through our model in batches. In other words, we can wait a couple hundred milliseconds to rack up a few requests and run them all in a single pass instead of greedily processing requests as they arrive. This enables us to reuse parts of the model that we've already loaded into the GPU's SRAM.
Batching increases the model’s arithmetic intensity by doing more computation for the same number of loads and stores from memory, which in turn reduces the degree to which the model is memory bound.
How big can we make our batches? Recall that we have about 10 GB of memory left on our A10 after loading in our 7B parameter model: 24 GB of VRAM - (2 bytes * 7B parameters) ≈ 10 GB spare.
Now, the question is how many sequences can we fit in that spare GPU memory at once?
To calculate this figure, we’ll need to return to the KV cache. Recall that during the prefill step in the attention layer, we populate the KV cache based on the prompt, or input sequence.
The KV cache contains the matrices K and V that we used during attention calculation. We need some of the values from earlier and a couple of new ones to calculate the size of the KV cache:
- d, which can also be notated as d_head, is the dimension of a single attention head. For Llama 2 7B, d = 128.
- n_heads is the number of attention heads. For Llama 2 7B, n_heads = 32.
- n_layers is the number of times the attention block shows up. For Llama 2 7B, n_layers = 32.
- d_model is the dimension of the model, where d_model = d_head * n_heads. For Llama 2 7B, d_model = 4096.

It's worth noting that d_model being the same as N (the context window length) is coincidental. As the Llama paper shows, other sizes of Llama 2 have a larger d_model (see the "dimension" column).

At half precision (FP16), each floating point number takes 2 bytes to store. There are 2 matrices (K and V), and to calculate the KV cache size per token, we multiply both by n_layers and d_model, yielding the following equation:

kv_cache_size_per_token = 2 * 2 bytes * n_layers * d_model = 2 * 2 * 32 * 4096 = 524288 bytes per token

Given that the KV cache requires 524288 bytes per token, how large can the KV cache be in terms of tokens?

Our KV cache can comfortably accommodate around 19,230 tokens. Thus, for Llama 2's standard sequence length of 4096 tokens, our system has the capacity to handle a batch of 4 sequences concurrently.
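Here's the same batch-size estimate as a Python sketch; the exact token count depends on how you round the ~10 GB of spare memory, so it lands near (not exactly at) the 19,230 figure above:

```python
# How many tokens -- and 4096-token sequences -- fit in the A10's spare memory?
n_layers, d_model = 32, 4096              # Llama 2 7B
bytes_per_param = 2                       # FP16

kv_bytes_per_token = 2 * bytes_per_param * n_layers * d_model   # K and V -> 524288 bytes
spare_memory = 10e9                       # roughly 10 GB left after loading the model

max_tokens = spare_memory / kv_bytes_per_token
print(f"{kv_bytes_per_token} bytes/token, ~{max_tokens:,.0f} tokens of KV cache")
print(f"concurrent 4096-token sequences: {int(max_tokens // 4096)}")   # 4
```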
Evaluating GPUs for LLM inference
In some cases, batching may not make sense. For example, if you’re building a user-facing chatbot, your product is much more sensitive to latency, so you can’t wait for a batch to fill before running inference. What should we do in this case?
One option is to recognize that we won’t be able to fully utilize our GPU’s on-chip memory, and downsize. For example, we can move to a T4 GPU, which has 16 GB of VRAM. This can still hold our 7B parameter model, but there’s much less leftover capacity — only 2 GB — for batching and KV caching.
Generating a single token on each GPU
Recall that during the autoregressive part of generation, we are memory bandwidth bound if our batch size is 1. Let’s quickly calculate how long it takes to generate a single token using the following equation:
time/token = total number of bytes moved (the model weights) / accelerator memory bandwidth
- On a T4:
(2 * 7B) bytes / (300 GB/s)
= 46 ms/token
- On an A10:
(2 * 7B) bytes / (600 GB/s)
= 23 ms/token
- On an A100 SXM 80 GB:
(2 * 7B) bytes / (2039 GB/s)
= 6 ms/token
These numbers are only an approximation, because they assume there is zero communication within the GPU during inference, zero overhead on each forward pass, and perfect parallelization during computation.
Prefilling with batched prompt tokens on each GPU
We can also compute the time it takes for the prefill section assuming that we batch all of the prompt tokens into a single forward pass. Let’s assume that the prompt has 350 tokens, for simplicity, and that the limiting bottleneck is compute, and not memory.
Prefill time = number of tokens * (2 * number of parameters) / accelerator compute bandwidth

(using approximately 2 FLOPs per parameter per token)
- On a T4:
350 * (2 * 7B) FLOP / 65 TFLOP/s
= 75 ms
- On an A10:
350 * (2 * 7B) FLOP / 125 TFLOP/s
= 39 ms
- On an A100 SXM 80 GB:
350 * (2 * 7B) FLOP / 312 TFLOP/s
= 16 ms
Estimating total generation time on each GPU
Assuming we allow for 150 completion tokens (and we suppress any stop tokens), our total generation time will be as follows.
Total generation time = prefill time + number of tokens * time/token
- On a T4 =
75 ms + 150 tokens * 46 ms/token
= 6.98 s
- On an A10 =
39 ms + 150 tokens * 23 ms/token
= 3.49 s
- On an A100 SXM 80 GB:
16 ms + 150 tokens * 6 ms/token
= 0.92s
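The same back-of-envelope math as a Python sketch; because it skips the intermediate rounding used above, its outputs differ slightly from the quoted figures, and like them it ignores real-world overheads:

```python
# Rough latency estimates for a 7B-parameter model in FP16 on several GPUs
params = 7e9
weight_bytes = 2 * params                 # FP16: 2 bytes per parameter

gpus = {
    # name: (memory bandwidth in bytes/s, FP16 compute in FLOP/s)
    "T4":             (300e9,  65e12),
    "A10":            (600e9, 125e12),
    "A100 SXM 80 GB": (2039e9, 312e12),
}

prompt_tokens, completion_tokens = 350, 150

for name, (mem_bw, flops) in gpus.items():
    time_per_token = weight_bytes / mem_bw            # memory-bound decode
    prefill = prompt_tokens * (2 * params) / flops    # compute-bound prefill
    total = prefill + completion_tokens * time_per_token
    print(f"{name}: {time_per_token * 1e3:.0f} ms/token, "
          f"prefill {prefill * 1e3:.0f} ms, total {total:.2f} s")
```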
Optimizing LLM model inference with transformer math
We want to make the most of compute capacity during LLM inference, but we can’t do that when we’re memory bound. Calculating the operations per byte possible on a given GPU and comparing it to the arithmetic intensity of our model’s attention layers lets us understand if we’re memory bound or compute bound.
When memory bound, batching lets us make the most of our compute capacity, though batching isn’t possible for many latency-sensitive use cases. When we start with a strong latency requirement, we can use similar calculations to estimate which GPUs can meet our needs.
Looking under the hood of LLM inference is fascinating, and there’s always further to dig. Here are some great resources for learning more:
And I still have some confusion about kv cache and inference
The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:
The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We’ll look closer at self-attention later in the post.
The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).
Bringing The Tensors Into The Picture
As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.
For example, word2vec. Keep in mind that Word2vec is a two-layer shallow neural net, and so is not itself an example of deep learning. But techniques like Word2vec and GloVe can turn raw text into a numerical form that deep nets can understand, for instance using Recurrent Neural Networks with Word Embeddings. In summary, the purpose of word embedding is to turn words into numbers, which algorithms like deep learning can then ingest and process to formulate an understanding of natural language.

It's like numbers are language, like all the letters in the language are turned into numbers, and so it's something that everyone understands the same way. You lose the sounds of the letters and whether they click or pop or touch the palate, or go ooh or aah, and anything that can be misread or con you with its music or the pictures it puts in your mind, all of that is gone, along with the accent, and you have a new understanding entirely, a language of numbers, and everything becomes as clear to everyone as the writing on the wall. So as I say there comes a certain time for the reading of the numbers.
The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors each of the size 512– In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below.
After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.
Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer (the self-attention layer computes attention weights between every word and every other word; those weights express how much each word attends to the others, so each word's representation depends on information from all the other words). The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.
Now We’re Encoding!
As we’ve mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.
Self-Attention at a High Level
Say the following sentence is an input sentence we want to translate:
“The animal didn't cross the street because it was too tired”

What does “it” in this sentence refer to? Is it referring to the street or to the animal? It's a simple question to a human, but not as simple to an algorithm.
When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.
As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.
Self-Attention in Detail
- The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector.
These vectors are created by multiplying the embedding by three matrices that we trained during the training process.
Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have dimensionality of 512. They don’t HAVE to be smaller, this is an architecture choice to make the computation of multiheaded attention (mostly) constant.
- The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.
The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.
- The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper – 64. This leads to having more stable gradients. There could be other possible values here, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they're all positive and add up to 1.
This softmax score determines how much each word will be expressed at this position. Clearly the word at this position will have the highest softmax score, but sometimes it’s useful to attend to another word that is relevant to the current word.
- The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).
- The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural network. In the actual implementation, however, this calculation is done in matrix form for faster processing. So let’s look at that now that we’ve seen the intuition of the calculation on the word level.
Matrix Calculation of Self-Attention
- The first step is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we’ve trained (WQ, WK, WV).
- Finally, since we’re dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.
The Beast With Many Heads
The paper further refined the self-attention layer by adding a mechanism called “multi-headed” attention. This improves the performance of the attention layer in two ways:
- It expands the model’s ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. If we’re translating a sentence like “The animal didn’t cross the street because it was too tired”, it would be useful to know which word “it” refers to.
- It gives the attention layer multiple “representation subspaces”. As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.
If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices.
This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.
How do we do that? We concat the matrices then multiply them by an additional weights matrix WO.
That’s pretty much all there is to multi-headed self-attention. It’s quite a handful of matrices, I realize. Let me try to put them all in one visual so we can look at them in one place
Now that we have touched upon attention heads, let’s revisit our example from before to see where the different attention heads are focusing as we encode the word “it” in our example sentence:
If we add all the attention heads to the picture, however, things can be harder to interpret:
Representing The Order of The Sequence Using Positional Encoding
the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence.
If we assumed the embedding has a dimensionality of 4, the actual positional encodings would look like this:
What might this pattern look like?
In the following figure, each row corresponds to a positional encoding of a vector. So the first row would be the vector we’d add to the embedding of the first word in an input sequence. Each row contains 512 values – each with a value between 1 and -1. We’ve color-coded them so the pattern is visible.
July 2020 Update: The positional encoding shown above is from the Tensor2Tensor implementation of the Transformer. The method shown in the paper is slightly different in that it doesn’t directly concatenate, but interweaves the two signals. The following figure shows what that looks like. Here’s the code to generate it:
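A sketch of that generation code, following the paper's interleaved sin/cos formula (not the exact notebook code; sizes here are arbitrary):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))  -- sin and cos interleaved
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model / 2)
    angle = pos / np.power(10000, (2 * i) / d_model)   # (max_len, d_model / 2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                        # even indices get sine
    pe[:, 1::2] = np.cos(angle)                        # odd indices get cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)                                        # (50, 512)
```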
The Residuals
One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it, and is followed by a layer-normalization step.
If we’re to visualize the vectors and the layer-norm operation associated with self attention, it would look like this:
This goes for the sub-layers of the decoder as well. If we’re to think of a Transformer of 2 stacked encoders and decoders, it would look something like this:
The Decoder Side
The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its “encoder-decoder attention” layer, which helps the decoder focus on appropriate places in the input sequence:
The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.
The self attention layers in the decoder operate in a slightly different way than the one in the encoder:
In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.

The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.
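A minimal sketch of that masking step, setting future positions to -inf before the softmax:

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)        # raw attention scores for the decoded prefix

# Position i may only attend to positions <= i, so mask out the upper triangle
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

weights = torch.softmax(scores, dim=-1)       # each row sums to 1 over allowed positions
print(weights[0])                             # the first token can only attend to itself
```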
The Final Linear and Softmax Layer
The decoder stack outputs a vector of floats. How do we turn that into a word? That’s the job of the final Linear layer which is followed by a Softmax Layer.
The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders, into a much, much larger vector called a logits vector.
Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.
The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
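In PyTorch terms, this final step is just a Linear projection to vocabulary size followed by a softmax; the 10,000-word vocabulary below matches the example above, while the other sizes are illustrative:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10_000

decoder_output = torch.randn(1, d_model)      # the decoder stack's vector for this time step
linear = nn.Linear(d_model, vocab_size)       # projects it into a logits vector

logits = linear(decoder_output)               # shape (1, 10000): one score per vocab word
probs = torch.softmax(logits, dim=-1)         # all positive, summing to 1.0
print(probs.argmax(dim=-1))                   # index of the most probable word
```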
Recap Of Training
During training, an untrained model would go through the exact same forward pass. But since we are training it on a labeled training dataset, we can compare its output with the actual correct output.
To visualize this, let’s assume our output vocabulary only contains six words(“a”, “am”, “i”, “thanks”, “student”, and “<eos>” (short for ‘end of sentence’)).
Once we define our output vocabulary, we can use a vector of the same width to indicate each word in our vocabulary. This also known as one-hot encoding. So for example, we can indicate the word “am” using the following vector:
Following this recap, let’s discuss the model’s loss function – the metric we are optimizing during the training phase to lead up to a trained and hopefully amazingly accurate model.
The Loss Function
Say we are training our model. Say it’s our first step in the training phase, and we’re training it on a simple example – translating “merci” into “thanks”.
What this means, is that we want the output to be a probability distribution indicating the word “thanks”. But since this model is not yet trained, that’s unlikely to happen just yet.
How do you compare two probability distributions? We simply subtract one from the other. For more details, look at cross-entropy and Kullback–Leibler divergence.
But note that this is an oversimplified example. More realistically, we'll use a sentence longer than one word. For example – input: “je suis étudiant” and expected output: “i am a student”. What this really means is that we want our model to successively output probability distributions where:
- Each probability distribution is represented by a vector of width vocab_size (6 in our toy example, but more realistically a number like 30,000 or 50,000)
- The first probability distribution has the highest probability at the cell associated with the word “i”
- The second probability distribution has the highest probability at the cell associated with the word “am”
- And so on, until the fifth output distribution indicates the ‘<end of sentence>’ symbol, which also has a cell associated with it from the 10,000 element vocabulary.
After training the model for enough time on a large enough dataset, we would hope the produced probability distributions would look like this:
Now, because the model produces the outputs one at a time, we can assume that the model is selecting the word with the highest probability from that probability distribution and throwing away the rest. That’s one way to do it (called greedy decoding). Another way to do it would be to hold on to, say, the top two words (say, ‘I’ and ‘a’ for example), then in the next step, run the model twice: once assuming the first output position was the word ‘I’, and another time assuming the first output position was the word ‘a’, and whichever version produced less error considering both positions #1 and #2 is kept. We repeat this for positions #2 and #3…etc. This method is called “beam search”, where in our example, beam_size was two (meaning that at all times, two partial hypotheses (unfinished translations) are kept in memory), and top_beams is also two (meaning we’ll return two translations). These are both hyperparameters that you can experiment with.
An article explaining the KV cache
This is where the KV cache comes into play. By caching the previous Keys and Values, we can focus on only calculating the attention for the new token.
Why is this optimization important? As seen in the picture above, the matrices obtained with KV caching are way smaller, which leads to faster matrix multiplications. The only downside is that it needs more GPU VRAM (or CPU RAM if GPU is not being used) to cache the Key and Value states.
A more detailed calculation
Let’s look at how it works in the context of a decoder-only Transformer.
Because of the causal self-attention mask, all previously computed rows are unaffected by the new token. At each decoding step we therefore only need to:
- Compute new q, k, v rows for only the new token.
- The new q row is used immediately (this is why there is no query cache).
- Append the new key and value entries to the existing K and V caches.
- Compute the new att row by doing a matrix-vector multiplication between the new q row and k_cache.transpose().
- Compute the new output row by doing a matrix-vector multiplication between the new att row and v_cache.
- The output (which is just for the latest token) is passed to the next layer.
- This can proceed through subsequent layers because we only care about the latest token.
This is a tradeoff that saves repeated computation by increasing memory usage, but it is worthwhile because without this optimization we'd be wasting cycles recomputing the key, value, and attention matrices.
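A simplified sketch of one decoding step with a KV cache for a single attention head (my own toy implementation of the steps above; it ignores multi-head splitting and the surrounding projections):

```python
import torch

d_head = 64
k_cache = torch.empty(0, d_head)      # cached K rows for tokens seen so far
v_cache = torch.empty(0, d_head)      # cached V rows for tokens seen so far

def decode_step(q_new, k_new, v_new, k_cache, v_cache):
    """One autoregressive step: only the new token's q, k, v rows are computed."""
    k_cache = torch.cat([k_cache, k_new[None, :]])     # append the new key row
    v_cache = torch.cat([v_cache, v_new[None, :]])     # append the new value row
    att = torch.softmax(q_new @ k_cache.T / d_head ** 0.5, dim=-1)  # (T,) attention row
    out = att @ v_cache                                # output row for the latest token
    return out, k_cache, v_cache

for _ in range(5):                                     # pretend we decode 5 tokens
    q, k, v = torch.randn(3, d_head)                   # stand-ins for the projected q, k, v rows
    out, k_cache, v_cache = decode_step(q, k, v, k_cache, v_cache)

print(out.shape, k_cache.shape)                        # torch.Size([64]) torch.Size([5, 64])
```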
Size of the KV Cache
This can be calculated as follows:
- There are n_layers blocks in the transformer.
- There is one multi-headed attention layer in each block.
- Each multi-headed attention layer has n_heads heads, each with d_head dimensions for k and v.
- We need a cache for both K and V.
- The maximum context length is n_context.
- The precision is n_bytes, e.g. 4 for FP32.
- The inference batch size is batch_size.

Thus the total size (in bytes) would be:

kv_cache_size = 2 * n_bytes * n_layers * n_heads * d_head * n_context * batch_size

(The factor of 2 is there because we store both K and V, which are the same size.)

But since d_model = n_heads * d_head (usually), this reduces to:

kv_cache_size = 2 * n_bytes * n_layers * d_model * n_context * batch_size

Try it hands-on
For how to use PyTorch's nn library, you can refer to:
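As a rough starting point, the toy snippet below runs torch.nn.MultiheadAttention on random input and plugs the Llama 2 7B numbers into the KV cache formula above; the dimensions and values are my own illustrative choices, not taken from any particular reference:

```python
import torch
import torch.nn as nn

# Part 1: toy multi-head self-attention using torch.nn
d_model_toy, n_heads_toy = 64, 8
mha = nn.MultiheadAttention(embed_dim=d_model_toy, num_heads=n_heads_toy, batch_first=True)
x = torch.randn(1, 10, d_model_toy)      # (batch, seq_len, d_model)
out, attn = mha(x, x, x)                 # self-attention: Q = K = V = x
print(out.shape, attn.shape)             # torch.Size([1, 10, 64]) torch.Size([1, 10, 10])

# Part 2: KV cache size for Llama 2 7B at FP16, batch size 1, full 4096-token context
n_bytes, n_layers, d_model, n_context, batch_size = 2, 32, 4096, 4096, 1
kv_cache_bytes = 2 * n_bytes * n_layers * d_model * n_context * batch_size
print(f"KV cache: {kv_cache_bytes / 1e9:.2f} GB")   # ~2.15 GB for one full-length sequence
```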