Intro
Technique Report
Multi-Head Latent Attention
Recap: MHA
Before moving on, we need to understand how RoPE is applied in MHA.
Detailed Guide to Rotary Position Embedding
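As a quick reminder of the mechanism (a minimal sketch, not DeepSeek's actual kernel): RoPE rotates consecutive pairs of each query/key vector by a position-dependent angle before the attention dot product, so the inner product depends only on relative position.

```python
import math

def rope(x, pos, theta=10000.0):
    """Rotate consecutive pairs of x by position-dependent angles.

    x: a flat feature vector of even length d.
    pos: the token position.
    Pair i is rotated by pos * theta^(-2i/d), the standard RoPE schedule.
    """
    d = len(x)
    out = []
    for i in range(0, d, 2):
        ang = pos * theta ** (-i / d)
        c, s = math.cos(ang), math.sin(ang)
        out += [x[i] * c - x[i + 1] * s,
                x[i] * s + x[i + 1] * c]
    return out
```

At `pos = 0` every angle is zero, so `rope(x, 0)` returns `x` unchanged; rotating q and k by their own positions makes their dot product a function of the position difference only.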
On to MLA






The figure above shows an implementation without matrix absorption, expanded into plain MHA.


Reference: madsys-dev, optimizing-mla.md
Ratio of the dominant compute terms:

With matrix absorption, q (after q_absorb) * compressed_kv^T:
[bsz, num_heads, q_len, kv_lora_rank] * [bsz, kv_lora_rank, kv_len]
-> bsz * num_heads * q_len * kv_len * kv_lora_rank

Without matrix absorption (the CD path), compressed_kv * W^U:
[bsz, kv_len, kv_lora_rank] * [kv_lora_rank, num_heads * nope]
-> bsz * kv_len * kv_lora_rank * num_heads * nope

The ratio of the two is:
absorbed : non-absorbed (CD) -> q_len / nope (a constant)

In other words, past the crossover point where q_len > nope (128), matrix absorption actually does more compute.
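The crossover can be checked numerically (a sketch; the shapes below, e.g. kv_lora_rank = 512 and nope = 128, are assumed DeepSeek-V3-like values used only for illustration):

```python
def absorbed_flops(bsz, num_heads, q_len, kv_len, kv_lora_rank):
    # q (after absorption) @ compressed_kv^T:
    # [bsz, num_heads, q_len, kv_lora_rank] x [bsz, kv_lora_rank, kv_len]
    return bsz * num_heads * q_len * kv_len * kv_lora_rank

def decompressed_flops(bsz, kv_len, kv_lora_rank, num_heads, nope):
    # compressed_kv @ W^U:
    # [bsz, kv_len, kv_lora_rank] x [kv_lora_rank, num_heads * nope]
    return bsz * kv_len * kv_lora_rank * num_heads * nope

bsz, num_heads, kv_len, kv_lora_rank, nope = 1, 128, 4096, 512, 128
for q_len in (1, 128, 1024):
    ratio = absorbed_flops(bsz, num_heads, q_len, kv_len, kv_lora_rank) \
          / decompressed_flops(bsz, kv_len, kv_lora_rank, num_heads, nope)
    print(q_len, ratio)  # the ratio works out to q_len / nope
```

During decode (q_len = 1) absorption wins by a factor of nope; during long prefill it loses by q_len / nope.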
This estimate is rough, though. A fuller comparison collects all the compute along the whole pipeline into a single expression and takes the ratio.
FFN
Embedding (Positional Encoding)





CUDA Graph
Chunked Prefill & Batch

RMSNorm

Attention detail
NUMA-aware thread pool design
A single worker pool manages all threads. subpool_count is the number of logical NUMA nodes, each mapped to a physical NUMA node via subpool_numa_map, and subpool_thread_count gives the number of threads owned by each logical NUMA node.
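The bookkeeping described above can be sketched as follows (a hypothetical Python sketch of just the mapping, not the actual C++ implementation; the class and method names are mine, while the field names mirror those mentioned above):

```python
class NumaWorkerPool:
    """Sketch of a NUMA-aware worker pool's bookkeeping.

    subpool_numa_map[i]     -- physical NUMA node backing logical subpool i
    subpool_thread_count[i] -- number of threads that subpool owns
    """
    def __init__(self, subpool_numa_map, subpool_thread_count):
        assert len(subpool_numa_map) == len(subpool_thread_count)
        self.subpool_count = len(subpool_numa_map)      # logical NUMA node count
        self.subpool_numa_map = list(subpool_numa_map)  # logical -> physical node
        self.subpool_thread_count = list(subpool_thread_count)

    def physical_node(self, logical_subpool):
        # resolve a logical subpool to the physical NUMA node its threads pin to
        return self.subpool_numa_map[logical_subpool]

    def total_threads(self):
        return sum(self.subpool_thread_count)

# two logical subpools pinned to physical nodes 0 and 1, 8 threads each
pool = NumaWorkerPool([0, 1], [8, 8])
```

The indirection through subpool_numa_map is what lets the logical layout stay stable even when the physical node numbering differs across machines.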
hwloc 使用
Some terminology
Processing Unit (PU)
The smallest processing element that can be represented by a hwloc object. It may be a single-core processor, a core of a multicore processor, or a single thread in a SMT processor (also sometimes called "Logical processor", not to be confused with "Logical index of a processor"). hwloc's PU acronym stands for Processing Unit.
package
A processor Package is the physical package that usually gets inserted into a socket on the motherboard. It is also often called a physical processor or a CPU even if these names bring confusion with respect to cores and processing units. A processor package usually contains multiple cores (and may also be composed of multiple dies). hwloc Package objects were called Sockets up to hwloc 1.10.
NUMA Node
An object that contains memory that is directly and byte-accessible to the host processors. It is usually close to some cores as specified by its CPU set. Hence it is attached as a memory child of the object that groups those cores together, for instance a Package object with 4 Core children (see Hierarchy, Tree and Levels).
Memory-side Cache
A cache in front of a specific memory region (e.g. a range of physical addresses). It caches all accesses to that region without caring about which core issued the request. This is the opposite of usual CPU caches where only accesses from the local cores are cached, without caring about the target memory.
In hwloc, memory-side caches are memory objects placed between their local CPU objects (parent) and the target NUMA node memory (child).

The NUMA node is on the side because it is not part of the main tree but rather attached to the object that corresponds to its locality (the entire machine here, hence the root object). It is attached as a Memory child (in green) and has a virtual depth (negative). It could also have siblings if there were multiple local NUMA nodes, or cousins if other NUMA nodes were attached somewhere else in the machine.
Observations on the jump:


Results:

Some tricks
Tensor
- A tensor has shape [512, 256, 64], corresponding to page_num, page_size, cache_dim. I want to index all of page 0, all of page 1, and positions 0 to 127 of page 2; that is, the page indices are kv_index = [0, 1, 2] and the in-page indices are [[0~255], [0~255], [0~127]]. How can this be implemented with tensor operations?
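One way to build that index is to flatten the (page, offset) pairs into positions over a [page_num * page_size, cache_dim] view of the cache (a plain-Python sketch of the arithmetic; the function and variable names are mine, and in PyTorch the same flat indices could then be used as `cache.view(-1, cache_dim)[flat_idx]`):

```python
def paged_gather_indices(kv_index, lengths, page_size):
    """Flatten (page, offset) pairs into indices over a
    [page_num * page_size, cache_dim] view of the paged cache.

    kv_index: pages to read, e.g. [0, 1, 2]
    lengths:  how many leading slots to take from each page, e.g. [256, 256, 128]
    """
    flat = []
    for page, n in zip(kv_index, lengths):
        base = page * page_size  # first flat slot of this page
        flat.extend(range(base, base + n))
    return flat

# all of pages 0 and 1, plus the first 128 slots of page 2
idx = paged_gather_indices([0, 1, 2], [256, 256, 128], page_size=256)
```

This yields 256 + 256 + 128 = 640 positions; a single fancy-indexing gather with them avoids looping over pages on the hot path.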
Debug




There is a sudden jump in x

At this position:

Success:


- Author: liamY
- Link: https://liamy.clovy.top/article/madsys/deepseekV3
- Notice: This post is licensed under CC BY-NC-SA 4.0; please credit the source when reposting.
Related posts