To use it on Linux, see the corresponding kernel documentation.
We can note that:
In its initial form, it implements a set of up to eight "tiles", which are arrays of 16 64-byte rows.
So one tile is 16 × 64 = 1024 bytes.
Programmers can store matrices in these tiles of any dimension that will fit therein; a matrix of 16x16 32-bit floating-point values would work, but other geometries are supported too.
A 16×16 matrix of 4-byte values fits exactly: 16 × 16 × 4 = 16 × 64 bytes = 1 tile.
The one supported operation currently will multiply the matrices stored in two tiles, then add the result to a third tile. By chaining these operations, multiplication of matrices of any size can be implemented. Evidently other operations are meant to come in the future.
The design of AMX gives the kernel control over whether these features can be used by any given process. There are a couple of reasons for this, one being that AMX instructions, as one might imagine, use a lot of processor resources. A process doing heavy AMX work on a shared computer may adversely affect other processes. But AMX also cannot be supported properly unless both the kernel and the user-space process are ready for it.
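On Linux, a process must explicitly ask the kernel for the AMX tile state before using it. A sketch of that handshake (constants are from the kernel ABI; this mirrors the set_tiledata_use() helper in Intel's code sample):

```cpp
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_REQ_XCOMP_PERM 0x1023  // request permission for an XSAVE feature
#define XFEATURE_XTILEDATA  18      // the AMX tile data state component

// Returns true if the kernel granted this process the right to use tiles.
static bool set_tiledata_use() {
  return syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) == 0;
}
```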
Intro From Intel
Intel® Advanced Matrix Extensions (Intel® AMX) is a new 64-bit programming paradigm consisting of two components: a set of 2-dimensional registers (tiles) representing sub-arrays from a larger 2-dimensional memory image, and an accelerator able to operate on tiles; the first implementation is called TMUL (tile matrix multiply unit).
An Intel AMX implementation enumerates to the programmer how the tiles can be programmed by providing a palette of options. Two palettes are supported; palette 0 represents the initialized state, and palette 1 consists of 8 KB of storage spread across 8 tile registers named TMM0..TMM7. Each tile has a maximum size of 16 rows x 64 bytes, (1 KB), however the programmer can configure each tile to smaller dimensions appropriate to their algorithm.
The tile dimensions supplied by the programmer (rows and bytes_per_row, i.e., colsb) are metadata that drives the execution of tile and accelerator instructions. In this way, a single instruction can launch autonomous multi-cycle execution in the tile and accelerator hardware. The palette value (palette_id) and metadata are held internally in a tile related control register (TILECFG). The TILECFG contents will be commensurate with that reported in the palette_table.
Walk Through a Basic Example
Tile Matrix Multiply (TMUL)
• Tile Matrix Multiplication (TMUL): TMUL is an accelerator engine connected to Tile that performs matrix multiplication calculations for AI.
Before using TMUL instructions, the tile architecture must be configured specifying the tile configuration including number of tiles and tile sizes (the palette). This configuration step is to be performed once and the configuration remains until it is either changed by the code or is released. Once the tiles are configured, TMUL instructions can be used to perform matrix multiplications (currently, INT8 and BF16 types are supported). The TMUL instructions, when executed, will dynamically check the maximum sizes of the tile and the matrix sizes that allow a mathematically correct matrix multiplication.
In the code sample walkthrough next, an INT8 matrix multiplication will demonstrate the above procedure step by step. Specifically, the code sample will multiply matrices A and B of size 16 x 64 containing INT8 values, and accumulate the result to a 16 x 16 matrix C containing INT32 values.
Config
Define some configuration. First, define constants declaring the max number of elements per tile, the maximum number of rows and columns used to configure the tiles, and the actual number of columns in the matrices:
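A sketch of those constants (names follow Intel's code sample; values match the 16-row × 64-byte tiles used throughout this example):

```cpp
#define MAX      1024  // max elements per tile: 16 rows x 64 bytes
#define MAX_ROWS 16    // rows per tile
#define MAX_COLS 64    // bytes per tile row
#define STRIDE   64    // distance in bytes between consecutive rows
```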
This data structure in the code sample is designed to match the tile configuration format defined as a 64-byte memory location, as defined in the Intel® Intrinsics Guide:
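The 64-byte layout looks like this (field layout per the Intel documentation; the struct name follows the code sample):

```cpp
#include <cstdint>

typedef struct __tile_config {
  uint8_t  palette_id;      // byte 0: palette to use (1 here)
  uint8_t  start_row;       // byte 1: restart row for interrupted ops
  uint8_t  reserved_0[14];  // bytes 2-15: must be zero
  uint16_t colsb[16];       // bytes 16-47: bytes-per-row of each tile
  uint8_t  rows[16];        // bytes 48-63: rows of each tile
} __tilecfg;
```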
In our example, we init those values as:
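A sketch of that initialization (tile numbering matches the walkthrough: tile 1 holds C, tiles 2 and 3 hold A and B; the rest of the struct is assumed zeroed):

```cpp
#include <immintrin.h>

static void init_tile_config(__tilecfg *tileinfo) {
  tileinfo->palette_id = 1;         // palette 1: 8 tiles of 16 x 64 bytes
  tileinfo->start_row  = 0;

  tileinfo->rows[1]  = MAX_ROWS;    // tile 1: C, 16 rows x 16 int32
  tileinfo->colsb[1] = MAX_ROWS * 4;

  for (int i = 2; i < 4; ++i) {     // tiles 2 and 3: A and B, 16 x 64 int8
    tileinfo->rows[i]  = MAX_ROWS;
    tileinfo->colsb[i] = MAX_COLS;
  }

  _tile_loadconfig(tileinfo);       // program TILECFG from the 64-byte blob
}
```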
In the above function, the _tile_loadconfig() intrinsic is used to load the tile configuration metadata from the 64-byte memory location specified by tileinfo.
Currently two palettes are supported: palette 0 represents the initialized state, whereas palette 1 consists of 8 KB of storage divided across 8 tile registers, with each tile having a maximum size of 16 rows by 64 bytes. For this example, 2 tiles will be able to hold a matrix of size 16 x 64 (INT8 values), and 1 tile will hold a matrix of size 16 x 16 (INT32 values).
The init process
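A minimal sketch of a buffer-fill helper for the test matrices (name and fill values assumed):

```cpp
#include <cstdint>

static void init_buffer(int8_t *buf, int rows, int colsb, int8_t value) {
  for (int i = 0; i < rows; ++i)
    for (int j = 0; j < colsb; ++j)
      buf[i * colsb + j] = value;  // fill row-major, colsb bytes per row
}
```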
TMUL process
Load tiles from memory specified by base address (src1, src2 and res) and stride into tiles (tiles # 2, 3 and 1, respectively).
STRIDE (which in this case has a value of 64) indicates how the load operations should be strided (assumes row major data layout).
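A sketch of this step with the raw tile intrinsics (tile numbers as above; requires -mamx-tile -mamx-int8; storing back via _tile_stored(1, res, STRIDE) and resetting with _tile_release() would follow):

```cpp
#include <immintrin.h>
#include <cstdint>

void tmul_int8(const int8_t *src1, const int8_t *src2, int32_t *res) {
  _tile_loadd(2, src1, STRIDE);  // A -> tile 2
  _tile_loadd(3, src2, STRIDE);  // B -> tile 3
  _tile_loadd(1, res,  STRIDE);  // C accumulator -> tile 1
  _tile_dpbssd(1, 2, 3);         // tile1 += tile2 . tile3 (int8 dot products)
}
```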
The above instruction computes dot-product of bytes in tiles with a source/destination accumulator. Specifically, it multiplies groups of 4 adjacent pairs of signed 8-bit integers in tile 2 with corresponding signed 8-bit integers in tile 3, producing 4 intermediate 32-bit results. These 4 results are added to the corresponding 32-bit integer in tile 1 and the 32-bit result is stored back to tile 1. Details can be found in the Intel® Intrinsics Guide.
The Full Code
The samples live in the intel/AMX-TMUL-Code-Samples repository on GitHub.
Explanation
__tile_dpbssd
Description
Compute dot-product of bytes in tiles with a source/destination accumulator. Multiply groups of 4 adjacent pairs of signed 8-bit integers in src0 with corresponding signed 8-bit integers in src1, producing 4 intermediate 32-bit results. Sum these 4 results with the corresponding 32-bit integer in dst, and store the 32-bit result back to tile dst. The tile shape is specified in the __tile1024i struct; the tile register is allocated by the compiler.
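A hedged usage sketch with this compiler-managed tile API (shapes follow the earlier example; assumes AMX permission and requires -mamx-tile -mamx-int8):

```cpp
#include <immintrin.h>
#include <cstddef>

// c = a * b on int8 tiles; the compiler allocates the physical TMM registers.
void dpbssd_demo(const void *A, const void *B, void *C, size_t stride) {
  __tile1024i a = {16, 64};  // 16 rows x 64 int8 bytes
  __tile1024i b = {16, 64};  // VNNI-packed operand
  __tile1024i c = {16, 64};  // 16 rows x 16 int32 (64 bytes per row)
  __tile_loadd(&a, A, stride);
  __tile_loadd(&b, B, stride);
  __tile_zero(&c);
  __tile_dpbssd(&c, a, b);
  __tile_stored(C, stride, c);
}
```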

KTransformers
The CUDA support part of the CMake build is such a mess. Refer to the guide above ☝️
Useful variables
You can use FindCUDAToolkit to find a variety of useful targets and variables even without enabling the CUDA language.
Variables that find_package(CUDAToolkit) provides:
• CUDAToolkit_BIN_DIR: directory that holds the nvcc executable
• CUDAToolkit_INCLUDE_DIRS: list of directories containing headers for built-in Thrust, etc.
• CUDAToolkit_LIBRARY_DIR: directory that holds the CUDA runtime library
Variables that enabling the CUDA language provides:
• CMAKE_CUDA_COMPILER: NVCC with its location
• CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES: place for built-in Thrust, etc.
Some functions
get_submat
from ktransformers/ktransformers_ext/operators/amx/la/amx.hpp:1569
To get a submatrix from buffer A (the simple one). Let's look at it from its usage:
from ktransformers/ktransformers_ext/operators/amx/la/amx.hpp:2907
m=1 (the current state is a single row vector), k=7168 (hidden_dim_size), n=1024 (intermediate_size, after tensor parallelism)
for buffer A, we see ba→get_submat
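As a rough guess at what "the simple one" does (the real code is at amx.hpp:1569; the signature and the plain row-major layout here are assumptions, not the actual implementation):

```cpp
#include <cstdint>

// Hypothetical sketch: return a pointer to the submatrix whose top-left
// element is (m_begin, k_begin) inside a row-major m x k int8 buffer.
inline int8_t *get_submat(int8_t *base, int64_t k,
                          int64_t m_begin, int64_t k_begin) {
  return base + m_begin * k + k_begin;  // row-major offset arithmetic
}
```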

avx512_32xbf16_to_32xfp32
Implementation: ktransformers/ktransformers_ext/operators/amx/la/utils.hpp:42
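A sketch of what this routine plausibly does, reconstructed from the intrinsics explained below (the real code is at utils.hpp:42; the signature is assumed):

```cpp
#include <immintrin.h>

// Convert 32 bf16 values to 32 fp32 values: widen each 16-bit value to
// 32 bits, shift it into the high half (bf16 is the top 16 bits of fp32),
// and reinterpret the lanes as floats.
inline void avx512_32xbf16_to_32xfp32(const void *src, float *dst) {
  const __m256i *p = reinterpret_cast<const __m256i *>(src);
  for (int i = 0; i < 2; ++i) {
    __m256i bf16 = _mm256_loadu_si256(p + i);    // 16 bf16 values (256 bits)
    __m512i wide = _mm512_cvtepu16_epi32(bf16);  // zero-extend to 32-bit lanes
    __m512i bits = _mm512_slli_epi32(wide, 16);  // move to fp32's high 16 bits
    _mm512_storeu_ps(dst + i * 16, _mm512_castsi512_ps(bits));
  }
}
```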
_mm256_loadu_si256: see guide (u → unaligned, si → signed integer)
Loads 256 bits of integer data from an unaligned memory location pointed to by __p into a 256-bit integer vector. This intrinsic corresponds to the VMOVDQU instruction. Here, we load 256 bits (i.e., 16 bf16 values) of source data.
Some related links: print_m128i, print_m256i
_mm512_cvtepu16_epi32: see guide (epu → extended packed unsigned, epi → extended packed integer)
Zero extend packed unsigned 16-bit integers in a to packed 32-bit integers, and store the results in dst.
_mm512_slli_epi32: see guide (slli → shift logical left immediate, epi → extended packed integer)
Shift packed 32-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.
_mm512_castsi512_ps: see guide (ps → packed single-precision floating-point)
Cast vector of type __m512i to type __m512. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
_mm512_storeu_ps: see guide
Store 512 bits (composed of 16 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr does not need to be aligned on any particular boundary.

Buffer A: from_mat
from ktransformers/ktransformers_ext/operators/amx/la/amx.hpp:1526
The max for scale (for quantization)
Profiling amax (the maximum absolute value) for a row
Note that m, n, k indicate the number of elements. Before the conversion the element type is bf16 and after it is int8, but the number of elements stays the same.
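Putting the two steps together, a scalar sketch of the per-row quantization (fp32 stands in for bf16 for simplicity; the names are mine, not ktransformers'):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// For each row: profile amax, derive the scale, and map values to int8.
void quantize_row_q8(const float *row, int8_t *out, int k, float *scale) {
  float amax = 0.0f;
  for (int i = 0; i < k; ++i)
    amax = std::max(amax, std::fabs(row[i]));  // amax for the row
  *scale = amax / 127.0f;                      // dequantize later: q * scale
  const float inv = amax > 0.0f ? 127.0f / amax : 0.0f;
  for (int i = 0; i < k; ++i)
    out[i] = static_cast<int8_t>(std::lroundf(row[i] * inv));
}
```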

for BufferB
From this part, we know the B matrix needs a kind of transpose over 4-byte groups (each group lies along a row), i.e., the VNNI layout. The graph is copied here:

The transpose code is below 👇
The explanation:
Intrinsics functions
_mm512_unpacklo_epi32: see guide
__m512i _mm512_unpacklo_epi32 (__m512i a, __m512i b)
Unpack and interleave 32-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.
_mm512_unpacklo_epi64: see guide
__m512i _mm512_unpacklo_epi64 (__m512i a, __m512i b)
Unpack and interleave 64-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.
_mm512_shuffle_i32x4: see guide
__m512i _mm512_shuffle_i32x4 (__m512i a, __m512i b, const int imm8)
Shuffle 128-bits (composed of 4 32-bit integers) selected by imm8 from a and b, and store the results in dst.
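The net effect of these in-register shuffles, written as a plain scalar sketch (names assumed; K divisible by 4, as for int8 VNNI):

```cpp
#include <cstdint>

// Repack B from {K, N} row-major into {K/4, N, 4}: the 4 int8 values a
// TMUL dot product consumes along K become contiguous in memory.
void pack_b_vnni_int8(const int8_t *B, int8_t *packed, int K, int N) {
  for (int kb = 0; kb < K / 4; ++kb)
    for (int n = 0; n < N; ++n)
      for (int v = 0; v < 4; ++v)
        packed[(kb * N + n) * 4 + v] = B[(kb * 4 + v) * N + n];
}
```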
Matrix Mul
ktransformers/ktransformers_ext/operators/amx/la/amx.hpp
Things
type def
The type __m512i is a 512-bit (64-byte) vector type holding 8 long long elements (8 bytes each), aligned to 64 bytes.
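For reference, GCC's definition is essentially the following (simplified sketch; renamed so it does not clash with the real header):

```cpp
// 8 lanes x 8-byte long long = 64 bytes = 512 bits; vector_size(64)
// gives the type 64-byte natural alignment.
typedef long long m512i_sketch __attribute__((__vector_size__(64)));

static_assert(sizeof(m512i_sketch) == 64, "512 bits");
static_assert(alignof(m512i_sketch) == 64, "64-byte aligned");
```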
How to know L3 cache size?
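One programmatic answer (a sketch; _SC_LEVEL3_CACHE_SIZE is a glibc extension, and /sys/devices/system/cpu/cpu0/cache/index3/size or `getconf LEVEL3_CACHE_SIZE` work as alternatives):

```cpp
#include <unistd.h>
#include <cstdio>

int main() {
  long l3 = sysconf(_SC_LEVEL3_CACHE_SIZE);  // bytes; -1 or 0 if unknown
  std::printf("L3 cache size: %ld bytes\n", l3);
  return 0;
}
```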
Knowledge
MOE Mul
gate: [hidden_size,intermediate_size]
up: [hidden_size,intermediate_size]
down: [intermediate_size,hidden_size]
the current_state (size is like [1,hidden_size] as example)
G = current_state @ gate; A = act_fn(G); U = current_state @ up; H = A * U; current_hidden_state = (H @ down) * routing_weights;
But in MoE, the intermediate_size is "split" across many experts, so it ends up smaller than hidden_size: the up projection actually reduces the dimension and down expands it. E.g., for DeepSeek-V3 the hidden_size is 7168 and the per-expert intermediate_size is 2048, so [1, 7168] @ up [7168, 2048] → [1, 2048], and [1, 2048] @ down [2048, 7168] → [1, 7168].
OpenMP (Open Multi-Processing)
VNNI
Row- and column-major order

read simultaneously with ifstream
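A minimal sketch of reading one file concurrently: each thread opens its own std::ifstream (a stream object is not safe to share across threads) and reads a disjoint region. The file name "weights.bin" and the sizes are made up.

```cpp
#include <fstream>
#include <thread>
#include <vector>

static void read_chunk(const char *path, std::streamoff off,
                       std::vector<char> *buf) {
  std::ifstream f(path, std::ios::binary);  // private stream per thread
  f.seekg(off);
  f.read(buf->data(), static_cast<std::streamsize>(buf->size()));
}

int main() {
  std::vector<char> lo(1 << 20), hi(1 << 20);
  std::thread t1(read_chunk, "weights.bin", std::streamoff{0}, &lo);
  std::thread t2(read_chunk, "weights.bin", std::streamoff{1 << 20}, &hi);
  t1.join();
  t2.join();
}
```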
learn from llama.cpp
amx tile config
from llama.cpp
Some important points:
Block B needs to be prepacked to vnni format before feeding into TMUL:
packed_B: from {n, k} to {k/vnni_blk, n, vnni_blk}; viewed in 2D, we get {8, 64}

- Author: liamY
- Link: https://liamy.clovy.top/article/madsys/amx
- License: This article is published under the CC BY-NC-SA 4.0 license; please credit the source when reposting.