
Intro

Technical Report

Multi-Head Latent Attention

recap MHA

notion image
We need to know how RoPE is added in MHA.

Detailed Guide to Rotary Position Embedding
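As a quick reminder before moving on: RoPE rotates pairs of q/k channels by position-dependent angles right before the attention dot product. Below is a minimal sketch using the interleaved-pair convention; real implementations differ (e.g. the half-rotation/NeoX layout), so treat it as illustrative only.

```python
import torch

def apply_rope(x, positions, base=10000.0):
    # x: [batch, heads, seq, head_dim]; rotate channel pairs by position-dependent angles.
    head_dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = positions.float()[:, None] * inv_freq[None, :]   # [seq, head_dim // 2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                        # even / odd channels
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).flatten(-2)

# In MHA, RoPE is applied to q and k (not v) before the attention scores are computed.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 8, 16, 64)
pos = torch.arange(16)
q, k = apply_rope(q, pos), apply_rope(k, pos)
```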

 

Go to MLA

notion image
The figure above shows an implementation without matrix absorption, expanded into MHA form.
notion image
notion image
Reference: madsys-dev, optimizing-mla.md
Ratio of the dominant compute:
With matrix absorption:
q (after q_absorb) * compressed_kv^T :
[bsz, num_heads, q_len, kv_lora_rank] * [bsz, kv_lora_rank, kv_len]
-> bsz * num_heads * q_len * kv_len * kv_lora_rank
Without matrix absorption (the CD path):
compressed_kv * W^U :
[bsz, kv_len, kv_lora_rank] * [kv_lora_rank, num_heads * nope]
-> bsz * kv_len * kv_lora_rank * num_heads * nope
The ratio of the two is:
absorbed : non-absorbed (CD) ->
q_len / nope (a constant)
In other words, the crossover sits exactly where q_len exceeds nope (128): beyond that point, the absorbed path does more compute.
This estimate is rough, though. It is worth collecting all of the computation along the whole path into a single expression and working out the ratio:
notion image
notion image
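To make the rough estimate concrete, here is a small sketch that counts the dominant matmul FLOPs on both paths using the shapes listed above. nope = 128 comes from the text; the other sizes below are only illustrative assumptions.

```python
# Rough FLOP count for the two MLA paths discussed above.
def absorbed_flops(bsz, num_heads, q_len, kv_len, kv_lora_rank):
    # q (after q_absorb) @ compressed_kv^T:
    # [bsz, num_heads, q_len, kv_lora_rank] x [bsz, kv_lora_rank, kv_len]
    return bsz * num_heads * q_len * kv_len * kv_lora_rank

def non_absorbed_flops(bsz, num_heads, kv_len, kv_lora_rank, nope):
    # compressed_kv @ W^U: [bsz, kv_len, kv_lora_rank] x [kv_lora_rank, num_heads * nope]
    return bsz * kv_len * kv_lora_rank * num_heads * nope

bsz, num_heads, kv_len, kv_lora_rank, nope = 1, 128, 4096, 512, 128  # illustrative sizes
for q_len in (1, 128, 4096):  # decode, the q_len == nope crossover, long prefill
    ratio = absorbed_flops(bsz, num_heads, q_len, kv_len, kv_lora_rank) \
          / non_absorbed_flops(bsz, num_heads, kv_len, kv_lora_rank, nope)
    print(f"q_len={q_len:5d}  absorbed / non-absorbed = {ratio:.3f}")  # equals q_len / nope
```

With these numbers, decode (q_len = 1) strongly favors absorption, while a long prefill favors the non-absorbed path, which is exactly the q_len / nope behaviour derived above.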

FFN

 

Embedding (Positional Encoding)

notion image
 
 

CUDA_GRAPH
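A minimal, generic sketch of capturing and replaying a step with PyTorch's CUDA graph API; the model and shapes here are placeholders, not the actual integration.

```python
import torch

# Static buffers: CUDA graphs replay fixed memory addresses and shapes.
static_input = torch.zeros(1, 16, device="cuda")
model = torch.nn.Linear(16, 16).cuda()

# Warm up on a side stream before capture, as the capture rules require.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output = model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one step into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: copy new data into the static input buffer, then launch the whole graph at once.
static_input.copy_(torch.randn(1, 16, device="cuda"))
g.replay()
```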

 

Chunked Prefill & Batch

notion image

RMSNorm

notion image
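The figure above presumably gives the formula; for reference, a minimal RMSNorm sketch, assuming the common LLaMA-style convention (learned scale, no mean subtraction):

```python
import torch

def rms_norm(x, weight, eps=1e-6):
    # Scale by the reciprocal root-mean-square over the last dimension; no centering.
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight

hidden = torch.randn(2, 16, 64)
weight = torch.ones(64)   # learned per-channel scale
out = rms_norm(hidden, weight)
```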

Attention detail

NUMA-aware thread pool design

A worker pool manages all of the threads. subpool_count is the number of logical NUMA nodes, which are mapped to physical NUMA nodes through subpool_numa_map, and each logical NUMA node owns subpool_thread_count threads.
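A minimal sketch of that subpool idea, assuming Linux thread affinity and an illustrative CPU layout; numa_cpus and the task handling below are placeholders, not the actual worker pool implementation.

```python
import os
import queue
import threading

subpool_numa_map = {0: 0, 1: 1}             # logical subpool index -> physical NUMA node
numa_cpus = {0: set(range(0, 8)),           # physical NUMA node -> CPU ids (machine specific)
             1: set(range(8, 16))}
subpool_thread_count = 4                    # threads per logical subpool

tasks = queue.Queue()

def worker(numa_node):
    # Pin this worker thread to the CPUs of its NUMA node, then drain the task queue.
    os.sched_setaffinity(0, numa_cpus[numa_node])
    while True:
        fn = tasks.get()
        if fn is None:                      # poison pill ends the worker
            break
        fn()

threads = []
for subpool, node in subpool_numa_map.items():
    for _ in range(subpool_thread_count):
        t = threading.Thread(target=worker, args=(node,), daemon=True)
        t.start()
        threads.append(t)

tasks.put(lambda: print("hello from", threading.current_thread().name))
for _ in threads:
    tasks.put(None)
for t in threads:
    t.join()
```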
 

Using hwloc

 

Some terminology

Processing Unit (PU)
The smallest processing element that can be represented by a hwloc object. It may be a single-core processor, a core of a multicore processor, or a single thread in a SMT processor (also sometimes called "Logical processor", not to be confused with "Logical index of a processor"). hwloc's PU acronym stands for Processing Unit.
In hwloc (Hardware Locality), a PU (Processing Unit) is the smallest processing element in the system. It may be:
  • a single-core processor,
  • one core of a multicore processor, or
  • a single thread of an SMT-capable processor.
package
A processor Package is the physical package that usually gets inserted into a socket on the motherboard. It is also often called a physical processor or a CPU even if these names bring confusion with respect to cores and processing units. A processor package usually contains multiple cores (and may also be composed of multiple dies). hwloc Package objects were called Sockets up to hwloc 1.10.
NUMA Node
An object that contains memory that is directly and byte-accessible to the host processors. It is usually close to some cores as specified by its CPU set. Hence it is attached as a memory child of the object that groups those cores together, for instance a Package object with 4 Core children (see Hierarchy, Tree and Levels).
Memory-side Cache
A cache in front of a specific memory region (e.g. a range of physical addresses). It caches all accesses to that region without caring about which core issued the request. This is the opposite of usual CPU caches where only accesses from the local cores are cached, without caring about the target memory.
In hwloc, memory-side caches are memory objects placed between their local CPU objects (parent) and the target NUMA node memory (child).
notion image
The NUMA node is on the side because it is not part of the main tree but rather attached to the object that corresponds to its locality (the entire machine here, hence the root object). It is attached as a Memory child (in green) and has a virtual depth (negative). It could also have siblings if there were multiple local NUMA nodes, or cousins if other NUMA nodes were attached somewhere else in the machine.
Observations on the sudden jump:
notion image
 
 
notion image

Results

notion image

Some tricks

Tensor

  1. A tensor has shape [512, 256, 64], corresponding to page_num, page_size, cache_dim. I want to index all of page 0 and page 1, plus positions 0-127 of page 2; that is, the page indices are kv_index = [0, 1, 2] and the in-page position indices are [[0..255], [0..255], [0..127]]. How can this be done with tensor operations? (See the sketch below.)
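One way to do it: gather the whole pages with advanced indexing, then drop the unused tail of the partial page with a boolean mask. The per-page lengths 256/256/128 below are taken from the question.

```python
import torch

cache = torch.randn(512, 256, 64)                 # [page_num, page_size, cache_dim]
kv_index = torch.tensor([0, 1, 2])                # pages to read
lengths  = torch.tensor([256, 256, 128])          # valid slots in each page

pos  = torch.arange(cache.shape[1]).expand(len(kv_index), -1)   # [3, 256] slot ids per page
mask = pos < lengths.unsqueeze(1)                                # [3, 256] valid-slot mask

pages    = cache[kv_index]        # [3, 256, 64], advanced indexing on the page dimension
selected = pages[mask]            # [640, 64]: all of pages 0 and 1, first 128 slots of page 2
```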

    Debug

    notion image
     
    notion image
    There is a sudden jump in x
    notion image
    At this position:
    notion image
    Success:
    notion image
    notion image