Intro
Technique Report
Multi-Head Latent Attention
Recap: MHA
Before moving on, we need to understand how RoPE is applied in MHA.
Detailed Guide to Rotary Position Embedding
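As a quick reminder of the mechanism (a minimal sketch, not DeepSeek's actual kernel): RoPE rotates consecutive pairs of each query/key vector by a position-dependent angle before the attention dot product, so the inner product depends only on relative position.

```python
import math

def rope(x, pos, theta=10000.0):
    """Rotate consecutive pairs of x by position-dependent angles.

    x: a flat feature vector of even length d.
    pos: the token position.
    Pair i is rotated by pos * theta^(-2i/d), the standard RoPE schedule.
    """
    d = len(x)
    out = []
    for i in range(0, d, 2):
        ang = pos * theta ** (-i / d)
        c, s = math.cos(ang), math.sin(ang)
        out += [x[i] * c - x[i + 1] * s,
                x[i] * s + x[i + 1] * c]
    return out
```

At `pos = 0` every angle is zero, so `rope(x, 0)` returns `x` unchanged; rotating q and k by their own positions makes their dot product a function of the position difference only.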
On to MLA






The figure above shows an implementation without matrix absorption, expanded into plain MHA.


Reference: madsys-dev, optimizing-mla.md
Ratio of the dominant compute terms:

With matrix absorption, q (after q_absorb) * compressed_kv^T:
[bsz, num_heads, q_len, kv_lora_rank] * [bsz, kv_lora_rank, kv_len]
-> bsz * num_heads * q_len * kv_len * kv_lora_rank

Without matrix absorption (the CD path), compressed_kv * W^U:
[bsz, kv_len, kv_lora_rank] * [kv_lora_rank, num_heads * nope]
-> bsz * kv_len * kv_lora_rank * num_heads * nope

The ratio of the two is:
absorbed : non-absorbed (CD) -> q_len / nope (a constant)

In other words, past the crossover point where q_len > nope (128), matrix absorption actually does more compute.
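The crossover can be checked numerically (a sketch; the shapes below, e.g. kv_lora_rank = 512 and nope = 128, are assumed DeepSeek-V3-like values used only for illustration):

```python
def absorbed_flops(bsz, num_heads, q_len, kv_len, kv_lora_rank):
    # q (after absorption) @ compressed_kv^T:
    # [bsz, num_heads, q_len, kv_lora_rank] x [bsz, kv_lora_rank, kv_len]
    return bsz * num_heads * q_len * kv_len * kv_lora_rank

def decompressed_flops(bsz, kv_len, kv_lora_rank, num_heads, nope):
    # compressed_kv @ W^U:
    # [bsz, kv_len, kv_lora_rank] x [kv_lora_rank, num_heads * nope]
    return bsz * kv_len * kv_lora_rank * num_heads * nope

bsz, num_heads, kv_len, kv_lora_rank, nope = 1, 128, 4096, 512, 128
for q_len in (1, 128, 1024):
    ratio = absorbed_flops(bsz, num_heads, q_len, kv_len, kv_lora_rank) \
          / decompressed_flops(bsz, kv_len, kv_lora_rank, num_heads, nope)
    print(q_len, ratio)  # the ratio works out to q_len / nope
```

During decode (q_len = 1) absorption wins by a factor of nope; during long prefill it loses by q_len / nope.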
This estimate is rough, though. A fuller comparison collects all the compute along the whole pipeline into a single expression and takes the ratio.
FFN
Embedding (Positional Encoding)





CUDA Graph
Chunked Prefill & Batch

RMSNorm

Attention detail
NUMA-aware thread pool design
A single worker pool manages all threads. subpool_count is the number of logical NUMA nodes, each mapped to a physical NUMA node via subpool_numa_map, and subpool_thread_count gives the number of threads owned by each logical NUMA node.
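The bookkeeping described above can be sketched as follows (a hypothetical Python sketch of just the mapping, not the actual C++ implementation; the class and method names are mine, while the field names mirror those mentioned above):

```python
class NumaWorkerPool:
    """Sketch of a NUMA-aware worker pool's bookkeeping.

    subpool_numa_map[i]     -- physical NUMA node backing logical subpool i
    subpool_thread_count[i] -- number of threads that subpool owns
    """
    def __init__(self, subpool_numa_map, subpool_thread_count):
        assert len(subpool_numa_map) == len(subpool_thread_count)
        self.subpool_count = len(subpool_numa_map)      # logical NUMA node count
        self.subpool_numa_map = list(subpool_numa_map)  # logical -> physical node
        self.subpool_thread_count = list(subpool_thread_count)

    def physical_node(self, logical_subpool):
        # resolve a logical subpool to the physical NUMA node its threads pin to
        return self.subpool_numa_map[logical_subpool]

    def total_threads(self):
        return sum(self.subpool_thread_count)

# two logical subpools pinned to physical nodes 0 and 1, 8 threads each
pool = NumaWorkerPool([0, 1], [8, 8])
```

The indirection through subpool_numa_map is what lets the logical layout stay stable even when the physical node numbering differs across machines.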
hwloc 使用
Some terminology
Processing Unit (PU)
The smallest processing element that can be represented by a hwloc object. It may be a single-core processor, a core of a multicore processor, or a single thread in a SMT processor (also sometimes called "Logical processor", not to be confused with "Logical index of a processor"). hwloc's PU acronym stands for Processing Unit.
package
A processor Package is the physical package that usually gets inserted into a socket on the motherboard. It is also often called a physical processor or a CPU even if these names bring confusion with respect to cores and processing units. A processor package usually contains multiple cores (and may also be composed of multiple dies). hwloc Package objects were called Sockets up to hwloc 1.10.
NUMA Node
An object that contains memory that is directly and byte-accessible to the host processors. It is usually close to some cores as specified by its CPU set. Hence it is attached as a memory child of the object that groups those cores together, for instance a Package object with 4 Core children (see Hierarchy, Tree and Levels).
Memory-side Cache
A cache in front of a specific memory region (e.g. a range of physical addresses). It caches all accesses to that region without caring about which core issued the request. This is the opposite of usual CPU caches where only accesses from the local cores are cached, without caring about the target memory.
In hwloc, memory-side caches are memory objects placed between their local CPU objects (parent) and the target NUMA node memory (child).

The NUMA node is on the side because it is not part of the main tree but rather attached to the object that corresponds to its locality (the entire machine here, hence the root object). It is attached as a Memory child (in green) and has a virtual depth (negative). It could also have siblings if there were multiple local NUMA nodes, or cousins if other NUMA nodes were attached somewhere else in the machine.
Observations on the jump:


Results:

Some tricks
Tensor
- A tensor has shape [512, 256, 64], corresponding to page_num, page_size, cache_dim. I want to index all of page 0, all of page 1, and positions 0 to 127 of page 2; that is, the page indices are kv_index = [0, 1, 2] and the in-page indices are [[0~255], [0~255], [0~127]]. How can this be implemented with tensor operations?
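One way to build that index is to flatten the (page, offset) pairs into positions over a [page_num * page_size, cache_dim] view of the cache (a plain-Python sketch of the arithmetic; the function and variable names are mine, and in PyTorch the same flat indices could then be used as `cache.view(-1, cache_dim)[flat_idx]`):

```python
def paged_gather_indices(kv_index, lengths, page_size):
    """Flatten (page, offset) pairs into indices over a
    [page_num * page_size, cache_dim] view of the paged cache.

    kv_index: pages to read, e.g. [0, 1, 2]
    lengths:  how many leading slots to take from each page, e.g. [256, 256, 128]
    """
    flat = []
    for page, n in zip(kv_index, lengths):
        base = page * page_size  # first flat slot of this page
        flat.extend(range(base, base + n))
    return flat

# all of pages 0 and 1, plus the first 128 slots of page 2
idx = paged_gather_indices([0, 1, 2], [256, 256, 128], page_size=256)
```

This yields 256 + 256 + 128 = 640 positions; a single fancy-indexing gather with them avoids looping over pages on the hot path.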
Debug




There is a sudden jump in x

At this position:

Success:


- Author: liamY
- Link: https://liamy.clovy.top/article/madsys/deepseekV3
- Notice: This post is licensed under CC BY-NC-SA 4.0; please credit the source when reposting.
Related posts