To use it on Linux, see this doc.
We can note that:
In its initial form, it implements a set of up to eight "tiles", which are arrays of 16 64-byte rows.
So one tile is 16 x 64 = 1024 bytes.
Programmers can store matrices in these tiles of any dimension that will fit therein; a matrix of 16x16 32-bit floating-point values would work, but other geometries are supported too.
A 16x16 matrix of 4-byte values fits exactly (since 16x16x4 = 16x64 bytes = 1 tile).
The one supported operation currently will multiply the matrices stored in two tiles, then add the result to a third tile. By chaining these operations, multiplication of matrices of any size can be implemented. Evidently other operations are meant to come in the future.
The design of AMX gives the kernel control over whether these features can be used by any given process. There are a couple of reasons for this, one being that AMX instructions, as one might imagine, use a lot of processor resources. A process doing heavy AMX work on a shared computer may adversely affect other processes. But AMX also cannot be supported properly unless both the kernel and the user-space process are ready for it.

Intro From Intel

Intel® Advanced Matrix Extensions (Intel® AMX) is a new 64-bit programming paradigm consisting of two components: a set of 2-dimensional registers (tiles) representing sub-arrays from a larger 2-dimensional memory image, and an accelerator able to operate on tiles, the first implementation is called TMUL (tile matrix multiply unit).
An Intel AMX implementation enumerates to the programmer how the tiles can be programmed by providing a palette of options. Two palettes are supported; palette 0 represents the initialized state, and palette 1 consists of 8 KB of storage spread across 8 tile registers named TMM0..TMM7. Each tile has a maximum size of 16 rows x 64 bytes, (1 KB), however the programmer can configure each tile to smaller dimensions appropriate to their algorithm.
The tile dimensions supplied by the programmer (rows and bytes_per_row, i.e., colsb) are metadata that drives the execution of tile and accelerator instructions. In this way, a single instruction can launch autonomous multi-cycle execution in the tile and accelerator hardware. The palette value (palette_id) and metadata are held internally in a tile related control register (TILECFG). The TILECFG contents will be commensurate with that reported in the palette_table.

Walk Through Basic Eg.

Tile Matrix Multiplication (TMUL): TMUL is an accelerator engine connected to the tiles that performs matrix-multiplication calculations for AI.
Before using TMUL instructions, the tile architecture must be configured specifying the tile configuration including number of tiles and tile sizes (the palette). This configuration step is to be performed once and the configuration remains until it is either changed by the code or is released. Once the tiles are configured, TMUL instructions can be used to perform matrix multiplications (currently, INT8 and BF16 types are supported). The TMUL instructions, when executed, will dynamically check the maximum sizes of the tile and the matrix sizes that allow a mathematically correct matrix multiplication.
 
In the code sample walkthrough next, an INT8 matrix multiplication will demonstrate the above procedure step by step. Specifically, the code sample will multiply matrices A and B of size 16 x 64 containing INT8 values, and accumulate the result to a 16 x 16 matrix C containing INT32 values.

Config

Define some configuration constants: the maximum number of elements per tile, the maximum number of rows and columns used to configure the tiles, and the actual number of columns in the matrices:
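A sketch of these constants, following the Intel code sample (the names MAX, MAX_ROWS, MAX_COLS and STRIDE come from that sample):
```cpp
#define MAX       1024  // maximum number of INT8 elements per tile (16 x 64 bytes)
#define MAX_ROWS  16    // maximum number of rows per tile
#define MAX_COLS  64    // maximum number of bytes per tile row
#define STRIDE    64    // row stride in bytes for the load/store operations
```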
This data structure in the code sample is designed to match the tile configuration format defined as a 64-byte memory location, as defined in the Intel® Intrinsics Guide:
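A sketch of that structure, laid out to occupy exactly 64 bytes as the format requires (field names follow the Intel sample):
```cpp
#include <stdint.h>

// 64-byte tile configuration: palette id, start row, 16 per-tile
// bytes-per-row entries, and 16 per-tile row counts.
typedef struct __tile_config {
  uint8_t  palette_id;
  uint8_t  start_row;
  uint8_t  reserved_0[14];
  uint16_t colsb[16];   // bytes per row for each tile
  uint8_t  rows[16];    // number of rows for each tile
} __tilecfg;
```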
In our example, we initialize those values as follows:
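A sketch of the init function (tile numbers follow the walkthrough: tile 1 holds the 16 x 16 INT32 result, tiles 2 and 3 hold the 16 x 64 INT8 inputs; the struct is assumed to be zero-initialized beforehand, and `<immintrin.h>` with `-mamx-tile` is needed for `_tile_loadconfig`):
```cpp
static void init_tile_config (__tilecfg *tileinfo)
{
  tileinfo->palette_id = 1;       // palette 1: 8 tiles of 1 KB each
  tileinfo->start_row  = 0;

  // Tile 1: result C, 16 rows x 16 INT32 = 64 bytes per row
  tileinfo->rows[1]  = MAX_ROWS;
  tileinfo->colsb[1] = MAX_COLS;

  // Tiles 2 and 3: inputs A and B, 16 rows x 64 INT8 = 64 bytes per row
  tileinfo->rows[2]  = MAX_ROWS;
  tileinfo->colsb[2] = MAX_COLS;
  tileinfo->rows[3]  = MAX_ROWS;
  tileinfo->colsb[3] = MAX_COLS;

  _tile_loadconfig (tileinfo);    // load the 64-byte TILECFG
}
```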
In the above function, the _tile_loadconfig() intrinsic is used to load the tile configuration metadata from the 64-byte memory location specified by tileinfo.
Currently two palettes are supported: palette 0 represents the initialized state, whereas palette 1 consists of 8 KB of storage divided across 8 tile registers, with each tile having a maximum size of 16 rows by 64 bytes. For this example, 2 tiles will be able to hold a matrix of size 16 x 64 (INT8 values), and 1 tile will hold a matrix of size 16 x 16 (INT32 values).

The init process
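On Linux, user space first has to ask the kernel for permission to use the AMX tile data state (kernel 5.16+); a minimal sketch following the Intel sample:
```cpp
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define ARCH_GET_XCOMP_PERM 0x1022
#define ARCH_REQ_XCOMP_PERM 0x1023
#define XFEATURE_XTILEDATA  18

// Ask the kernel to allow this process to use the XTILEDATA state.
static bool set_tiledata_use (void)
{
  if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)) {
    printf("Failed to enable XFEATURE_XTILEDATA\n");
    return false;
  }
  return true;
}
```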

But this is too simple; I refer to a more robust version from here and here.

TMUL process

Load tiles from memory specified by base address (src1, src2 and res) and stride into tiles (tiles # 2, 3 and 1, respectively).
STRIDE (which in this case has a value of 64) indicates how the load operations should be strided (assumes row major data layout).
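A sketch of these steps with the tile numbers and STRIDE described above (following the Intel sample; src1, src2 and res are the INT8/INT8/INT32 buffers):
```cpp
// Load A (src1) into tile 2, B (src2) into tile 3, and the accumulator
// C (res) into tile 1; the last argument is the row stride in bytes.
_tile_loadd (2, src1, STRIDE);
_tile_loadd (3, src2, STRIDE);
_tile_loadd (1, res,  STRIDE);

// tile1 += tile2 * tile3: signed INT8 inputs, INT32 accumulation
_tile_dpbssd (1, 2, 3);
```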
💡
The above instruction computes dot-product of bytes in tiles with a source/destination accumulator. Specifically, it multiplies groups of 4 adjacent pairs of signed 8-bit integers in tile 2 with corresponding signed 8-bit integers in tile 3, producing 4 intermediate 32-bit results. These 4 results are added to the corresponding 32-bit integer in tile 1 and the 32-bit result is stored back to tile 1. Details can be found in the Intel® Intrinsics Guide.
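After the computation, the result tile is written back to memory (again per the Intel sample):
```cpp
// Store the 16 x 16 INT32 result from tile 1 back to the res buffer
_tile_stored (1, res, STRIDE);
```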

The Full Code

intel/AMX-TMUL-Code-Samples (updated Mar 10, 2025)
explain __tile_dpbssd
💡

Description

Compute dot-product of bytes in tiles with a source/destination accumulator. Multiply groups of 4 adjacent pairs of signed 8-bit integers in src0 with corresponding signed 8-bit integers in src1, producing 4 intermediate 32-bit results. Sum these 4 results with the corresponding 32-bit integer in dst, and store the 32-bit result back to tile dst. The tile shape is specified in the __tile1024i struct, and the tile register is allocated by the compiler.
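Its prototype from the Intrinsics Guide, plus a rough usage sketch (the shape literals here are illustrative):
```cpp
void __tile_dpbssd (__tile1024i *dst, __tile1024i src0, __tile1024i src1);

// Rough usage sketch: declare tiles with their shape (rows, bytes per row),
// and let the compiler map each __tile1024i to a TMM register.
// __tile1024i c = {16, 64};            // 16 x 16 INT32 accumulator
// __tile1024i a = {16, 64}, b = {16, 64};
// __tile_dpbssd(&c, a, b);             // c += a * b
```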

KTransformers

The CUDA-support part of the CMake setup is such a mess. Refer to the guide above ☝️
Useful variables
You can use FindCUDAToolkit to find a variety of useful targets and variables even without enabling the CUDA language.
Variables that find_package(CUDAToolkit) provides:
  • CUDAToolkit_BIN_DIR: Directory that holds the nvcc executable
  • CUDAToolkit_INCLUDE_DIRS: Lists of directories containing headers for built-in Thrust, etc
  • CUDAToolkit_LIBRARY_DIR: Directory that holds the CUDA runtime library
Variables that enabling the CUDA language provides:
  • CMAKE_CUDA_COMPILER: NVCC with location
  • CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES: Place for built-in Thrust, etc

Some function

get_submat

from ktransformers/ktransformers_ext/operators/amx/la/amx.hpp:1569
To get a submatrix from buffer A (the simple one).
Let's look at it from its usage site:
from ktransformers/ktransformers_ext/operators/amx/la/amx.hpp:2907
m = 1 (the current state is a single row vector), k = 7168 (the hidden dim size), n = 1024 (the intermediate size, after tensor parallelism)
For buffer A, we see ba→get_submat.

avx512_32xbf16_to_32xfp32

Implementation: from ktransformers/ktransformers_ext/operators/amx/la/utils.hpp:42. The intrinsics involved are listed below (a sketch of the overall idea follows this list):
  1. _mm256_loadu_si256: see guide (u→unaligned, si→signed integer)
    Loads 256 bits of integer data from an unaligned memory location pointed to by __p into a 256-bit integer vector. This intrinsic corresponds to the VMOVDQU instruction.
    Here, we load 256 bits (i.e. 16 bf16 values) of source data.
    Some related links: print_m128i, print_m256i
  2. _mm512_cvtepu16_epi32: see guide (epu→extended packed unsigned, epi→extended packed integer)
    Zero-extend packed unsigned 16-bit integers in a to packed 32-bit integers, and store the results in dst.
  3. _mm512_slli_epi32: see guide (slli→shift logical left immediate)
    Shift packed 32-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.
  4. _mm512_castsi512_ps: see guide (ps→packed single-precision floating-point)
    Cast vector of type __m512i to type __m512. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
  5. _mm512_storeu_ps: see guide
    Store 512 bits (composed of 16 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr does not need to be aligned on any particular boundary.
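Putting the above together, a minimal sketch of the idea (not the exact ktransformers code; the helper name here is hypothetical):
```cpp
#include <immintrin.h>
#include <cstdint>

// bf16 is the upper 16 bits of an fp32, so widening is: zero-extend each
// 16-bit value to 32 bits, shift left by 16, then reinterpret as float.
static inline void bf16x16_to_fp32(const uint16_t *src, float *dst) {
    __m256i raw   = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(src)); // 16 bf16 values
    __m512i wide  = _mm512_cvtepu16_epi32(raw);         // zero-extend to 16 x 32-bit ints
    __m512i moved = _mm512_slli_epi32(wide, 16);        // move the bits into fp32 position
    _mm512_storeu_ps(dst, _mm512_castsi512_ps(moved));  // reinterpret and store
}

// The 32-element routine is just this applied twice: (src, dst) and (src + 16, dst + 16).
```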

buffer A from_mat!!!

from ktransformers/ktransformers_ext/operators/amx/la/amx.hpp:1526

The max for scale (for quantization)

Profiling amax
amax for a row
Note that m, n, k indicate the number of elements. Before the conversion the element type is bf16 and afterwards it is int8, but the number of elements stays the same.
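A minimal sketch (not the ktransformers implementation) of what an amax-over-a-row kernel looks like with AVX-512, over fp32 values (in the real code the row is bf16 that has been widened); the int8 scale is then amax / 127:
```cpp
#include <immintrin.h>
#include <algorithm>
#include <cmath>
#include <cstddef>

// Maximum absolute value over a row of fp32 values.
static inline float row_amax(const float *row, size_t n) {
    __m512 acc = _mm512_setzero_ps();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 v = _mm512_loadu_ps(row + i);
        acc = _mm512_max_ps(acc, _mm512_abs_ps(v));    // running max of |x|
    }
    float amax = _mm512_reduce_max_ps(acc);
    for (; i < n; ++i)                                 // scalar tail
        amax = std::max(amax, std::fabs(row[i]));
    return amax;                                       // scale = amax / 127.0f
}
```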

for BufferB

From this part, we know the B matrix needs a kind of transpose over 4-byte groups (the groups run along each row).
The code for the transpose process is below: 👇
The explanation:

intrinsics Function

_mm512_unpacklo_epi32: guide
__m512i _mm512_unpacklo_epi32 (__m512i a, __m512i b)
Unpack and interleave 32-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.
_mm512_unpacklo_epi64: guide
__m512i _mm512_unpacklo_epi64 (__m512i a, __m512i b)
Unpack and interleave 64-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.
_mm512_shuffle_i32x4: guide
__m512i _mm512_shuffle_i32x4 (__m512i a, __m512i b, const int imm8)
Shuffle 128-bits (composed of 4 32-bit integers) selected by imm8 from a and b, and store the results in dst.
A toy demonstration of the per-lane unpack behavior follows below.
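A toy demonstration (not the ktransformers repack code) of how _mm512_unpacklo_epi32 interleaves within each 128-bit lane:
```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main() {
    alignas(64) int32_t a[16], b[16], out[16];
    for (int i = 0; i < 16; ++i) { a[i] = i; b[i] = 100 + i; }

    __m512i va = _mm512_load_si512(a);
    __m512i vb = _mm512_load_si512(b);
    // Within each 128-bit lane {a0,a1,a2,a3} / {b0,b1,b2,b3},
    // unpacklo_epi32 yields {a0,b0,a1,b1}.
    __m512i lo = _mm512_unpacklo_epi32(va, vb);
    _mm512_store_si512(out, lo);

    for (int i = 0; i < 16; ++i) printf("%d ", out[i]);
    printf("\n");  // 0 100 1 101 4 104 5 105 8 108 9 109 12 112 13 113
    return 0;
}
```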
 

Matrix Mul

ktransformers/ktransformers_ext/operators/amx/la/amx.hpp
 

Things

type def

The type __m512i is a 512-bit (64-byte) vector type, typically defined as 8 long long (8-byte) elements, and it is aligned to 64 bytes.
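Roughly how the GCC headers define it (paraphrased, not an exact quote):
```cpp
// 64-byte vector of integers; with 8-byte 'long long' elements this is
// 8 elements, i.e. 512 bits, and the type is 64-byte aligned.
typedef long long __m512i __attribute__ ((__vector_size__ (64), __may_alias__));
```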

How to know L3 cache size?
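One simple way on Linux (a sketch, not from the original post): read it from sysfs.
```cpp
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // index3 is normally the unified L3 cache shared by cpu0; the file
    // contains a human-readable size such as "32768K".
    std::ifstream f("/sys/devices/system/cpu/cpu0/cache/index3/size");
    std::string size;
    if (f >> size)
        std::cout << "L3 cache size: " << size << "\n";
    else
        std::cout << "Could not read the L3 cache size from sysfs\n";
    return 0;
}
```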

 

Knowledge

MOE Mul

gate: [hidden_size,intermediate_size]
up: [hidden_size,intermediate_size]
down: [intermediate_size,hidden_size]
The current_state has shape [1, hidden_size], for example. Then:
G = current_state @ gate; A = act_fn(G); U = current_state @ up; H = A * U; current_hidden_state = (H @ down) * routing_weights
But in MoE, because the intermediate_size is "split" across experts, the up projection actually shrinks the dimension and the down projection expands it. E.g., for DeepSeek-V3 the hidden_size is 7168 and each routed expert's intermediate_size is 2048, so gate/up map 7168 → 2048 and down maps 2048 → 7168.

OpenMP (Open Multi-Processing)

 

VNNI

 

Row- and column-major order


read simultaneously with ifstream

 

learn from llama.cpp

amx tile config

Some important points:
💡
Block B needs to be prepacked to vnni format before feeding into TMUL:
packed_B: from {n, k} to {k/vnni_blk, n, vnni_blk}; viewed in 2D, we get {8, 64}
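A toy scalar sketch (not the llama.cpp kernel) of that repack: B stored as {n, k} is rearranged to {k/vnni_blk, n, vnni_blk}, so every group of vnni_blk consecutive K-elements of one output column becomes contiguous, which is what TMUL expects; vnni_blk is 4 for int8 (4 x 1 byte) and 2 for bf16 (2 x 2 bytes):
```cpp
#include <cstdint>
#include <vector>

// Repack B from {n, k} (row-major, one row per output column) into
// {k/vnni_blk, n, vnni_blk}.
std::vector<int8_t> pack_B_vnni(const int8_t *B, int n, int k, int vnni_blk = 4) {
    std::vector<int8_t> packed(static_cast<size_t>(n) * k);
    for (int kb = 0; kb < k / vnni_blk; ++kb)        // K blocks
        for (int j = 0; j < n; ++j)                  // output columns
            for (int v = 0; v < vnni_blk; ++v)       // elements inside one block
                packed[(static_cast<size_t>(kb) * n + j) * vnni_blk + v] =
                    B[static_cast<size_t>(j) * k + kb * vnni_blk + v];
    return packed;
}
```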