Some Concepts
Thread Batching
The GPU is modeled as a compute device that can execute a very large number of threads in parallel (a massively multithreaded co-processor).
Read this alongside the code below: add is a kernel, N specifies the number of blocks in a grid, and 1 specifies the number of threads in each block.
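A minimal runnable sketch of that launch shape, reconstructed in the style of CUDA by Example (the size N and all variable names here are illustrative, not the book's exact listing):

```cuda
#include <cstdio>

#define N 8  // illustrative element count

// add is the kernel; one copy runs per block, so blockIdx.x picks the element.
__global__ void add(const int *a, const int *b, int *c) {
    int tid = blockIdx.x;
    if (tid < N) c[tid] = a[tid] + b[tid];
}

int main(void) {
    int ha[N], hb[N], hc[N];
    for (int i = 0; i < N; ++i) { ha[i] = i; hb[i] = 2 * i; }

    int *da, *db, *dc;
    cudaMalloc((void **)&da, N * sizeof(int));
    cudaMalloc((void **)&db, N * sizeof(int));
    cudaMalloc((void **)&dc, N * sizeof(int));
    cudaMemcpy(da, ha, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<N, 1>>>(da, db, dc);  // N blocks in the grid, 1 thread per block

    cudaMemcpy(hc, dc, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i) printf("%d + %d = %d\n", ha[i], hb[i], hc[i]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```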
A block is also called a thread block.
Threads within a thread block form a batch (each identified by a thread ID) and can cooperate through shared memory; cooperation between threads belonging to different blocks of the same grid (identified by block IDs) is much more restricted, as the sketch after this paragraph shows.
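A small illustration of that block-level cooperation (my example, assuming blocks of at most 256 threads): threads in one block stage data in shared memory and wait at a barrier; CUDA offers no such barrier between different blocks of a grid.

```cuda
__global__ void reverse_within_block(int *data) {
    __shared__ int tile[256];                   // visible only to this block's threads
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = data[base + t];                   // each thread stages one element
    __syncthreads();                            // barrier: block-local cooperation
    data[base + t] = tile[blockDim.x - 1 - t];  // safely read a neighbor's element
}
```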
Memory Model
Relationships among GPU concepts
GPC — Graphics Processing Cluster
TPC — Texture Processing Cluster
SM — Streaming Multiprocessor
HBM — High Bandwidth Memory
Containment hierarchy: GPC > TPC > SM > Core
SM (Streaming Multiprocessor)
An SM contains CUDA cores, shared memory, registers, and more:
- CUDA Cores (vector execution units: FP32 FPU, INT32 ALU), Tensor Cores (matrix units for FP16/BF16), and Special Function Units
- Warp schedulers and dispatch units (which issue instructions)
- Multi-level caches, the register file, and load/store units
The SP (Streaming Processor) was later superseded by the CUDA Core, and the CUDA Core itself was in turn split into separate FP32 FPU and INT32 ALU units (so that each SM can execute FP32 and INT32 instructions concurrently).
From the software side we see threads executing; each thread maps onto a CUDA Core in hardware. Software threads are oversubscribed, while the number of CUDA Cores is fixed. A thread block is scheduled onto exactly one SM, warp by warp; once a block has been dispatched to an SM, it stays there until the kernel finishes. An SM can hold several blocks at once, and multiple SMs grouped into TPCs and GPCs are what implement GPU parallel computing in hardware.
Example
Fermi
Fermi has 16 SMs, each with 32 CUDA Cores, giving 512 CUDA Cores in total.
Tensor Core
GEMM: General Matrix Multiply
FMA: Fused Multiply-Add, i.e., a multiply and an add fused into one operation
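As a one-line sketch of what "fused" means here (my example, not from the notes): a*b + c is computed in a single instruction with a single rounding, which CUDA exposes for float as fmaf:

```cuda
__device__ float fma_demo(float a, float b, float c) {
    return fmaf(a, b, c);  // a * b + c in one instruction, rounded once
}
```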
CH1 Heterogeneous Parallel Computing with CUDA
Something new: the parameters within the triple angle brackets, as in add<<<N, 1>>>.
The parameters placed in the triple angle brackets are runtime arguments that describe how to launch the kernel:
the first number represents the number of parallel blocks in which we would like the device to execute our kernel.
The GPU runs N copies of the kernel code (each running copy is called a block). How do we tell which block is currently executing? Through the built-in blockIdx.x variable. Why blockIdx.x and not just blockIdx? Because CUDA C lets us define a group of blocks in two dimensions, although we have no need for that feature here.
We call the collection of parallel blocks a grid.
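For completeness, a hedged sketch of that two-dimensional option (kernel name and sizes are illustrative): the grid is declared as a 2D array of blocks, and each block locates itself with both blockIdx.x and blockIdx.y:

```cuda
__global__ void kernel2d(float *out, int width) {
    int x = blockIdx.x;           // block column within the grid
    int y = blockIdx.y;           // block row within the grid
    out[y * width + x] = 0.0f;    // flatten the 2D block coordinate
}

void launch2d(float *d_out) {
    dim3 grid(16, 16);            // a 16 x 16 grid of blocks, one thread each
    kernel2d<<<grid, 1>>>(d_out, 16);
}
```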
Harvard architecture:
- task parallelism
arises when there are many tasks or functions that can be operated independently and largely in parallel. Task parallelism focuses on distributing functions across multiple cores.
- data parallelism (what GPUs are built for)
arises when there are many data items that can be operated on at the same time. Data parallelism focuses on distributing the data across multiple cores.
Two ways to partition data across threads (see the sketch after this list):
- the 1D pattern
- the 2D pattern
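A minimal sketch of the two patterns (illustrative kernels; each thread owns one element):

```cuda
// 1D partitioning: threads cover a flat array.
__global__ void scale1d(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

// 2D partitioning: threads cover a matrix by (column, row) coordinates.
__global__ void scale2d(float *a, int nx, int ny) {
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix < nx && iy < ny) a[iy * nx + ix] *= 2.0f;
}
```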
Architectures
Single Instruction, Single Data: SISD
refers to the traditional computer: a serial architecture. There is only one core in the computer. At any time only one instruction stream is executed, and operations are performed on one data stream.
Single Instruction, Multiple Data: SIMD
refers to a type of parallel architecture. There are multiple cores in the computer. All cores execute the same instruction stream at any time, each operating on different data streams. Vector computers are typically characterized as SIMD, and most modern computers employ a SIMD architecture. Perhaps the biggest advantage of SIMD is that, while writing code on the CPU, programmers can continue to think sequentially yet achieve parallel speed-up from parallel data operations because the compiler takes care of the details.
Multiple Instruction, Single Data: MISD
refers to an uncommon architecture, where each core operates on the same data stream via separate instruction streams.
Multiple Instruction, Multiple Data: MIMD
refers to a type of parallel architecture in which multiple cores operate on multiple data streams, each executing independent instructions. Many MIMD architectures also include SIMD execution sub-components.
SIMT:
GPUs represent a many-core architecture, and have virtually every type of parallelism described previously: multithreading, MIMD, SIMD, and instruction-level parallelism. NVIDIA coined the phrase Single Instruction, Multiple Thread (SIMT) for this type of architecture.
Heterogeneous Architecture
The GPU is connected to the CPU via the PCIe bus; the GPU acts as a co-processor.
Describing GPU capability:
- number of CUDA cores
- memory size
Describing GPU performance:
- peak computational throughput
- memory bandwidth
NVIDIA uses a special term, compute capability, to describe hardware versions of GPU accelerators that belong to the entire Tesla product family.
Therefore, for optimal performance you may need to use both CPU and GPU for your application, executing the sequential parts or task parallel parts on the CPU and intensive data parallel parts on the GPU, as shown in Figure 1-11.
To support joint CPU + GPU execution of an application, NVIDIA designed a programming model called CUDA.
Q: Does each GPU thread have dedicated registers paired with it? Otherwise, how would its context be saved and restored?
An SM (Streaming Multiprocessor) is the core of an NVIDIA GPU; its key components include CUDA cores, shared memory, and registers. An SM contains many cores that execute arithmetic for threads and can run hundreds of threads under CUDA. All threads of a block run on the same SM, and the SM's limited cache and register resources constrain how many threads each block can have.
So the answer appears to be that this per-thread state is kept on the SM.
CUDA: A Platform for Heterogeneous Computing
CUDA is a parallel computing platform and programming model with a small set of extensions to the C language.
CUDA PROGRAM STRUCTURE A typical CUDA program structure consists of five main steps:
- Allocate GPU memories.
- Copy data from CPU memory to GPU memory.
- Invoke the CUDA kernel to perform program-specific computation.
- Copy data back from GPU memory to CPU memory.
- Destroy GPU memories.
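A compact sketch mapping these five steps onto CUDA runtime calls (kernel, h_buf, nBytes, grid, and block are placeholders introduced here for illustration):

```cuda
#include <cuda_runtime.h>

__global__ void kernel(float *data) { /* program-specific computation */ }

void run(float *h_buf, size_t nBytes, dim3 grid, dim3 block) {
    float *d_buf;
    cudaMalloc((void **)&d_buf, nBytes);                       // 1. allocate GPU memory
    cudaMemcpy(d_buf, h_buf, nBytes, cudaMemcpyHostToDevice);  // 2. copy CPU -> GPU
    kernel<<<grid, block>>>(d_buf);                            // 3. invoke the kernel
    cudaMemcpy(h_buf, d_buf, nBytes, cudaMemcpyDeviceToHost);  // 4. copy GPU -> CPU
    cudaFree(d_buf);                                           // 5. free GPU memory
}
```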
CH2 CUDA Programming Model
Parallel computing can be understood at three levels:
1. Domain level
2. Logic level
3. Hardware level
A kernel is the code that runs on the GPU.
A CUDA program executes in the following flow:
- Copy data from CPU memory to GPU memory.
- Invoke kernels to operate on the data stored in GPU memory.
- Copy data back from GPU memory to CPU memory.
A typical CUDA program consists of serial code complemented by parallel code. As shown in Figure 2-2, the serial code (as well as task parallel code) is executed on the host, while the parallel code is executed on the GPU device.
memory hierarchy
DIFFERENT MEMORY SPACES One of the most common mistakes made by those learning to program in CUDA C is to improperly dereference the different memory spaces. For memory allocated on the GPU, the device pointers may not be dereferenced in host code. If you improperly use an assignment, for example: gpuRef = d_C
instead of using: cudaMemcpy(gpuRef, d_C, nBytes, cudaMemcpyDeviceToHost)
the application will crash at runtime. To help avoid these kinds of mistakes, Unified Memory was introduced with CUDA 6; it lets you access both CPU and GPU memory through a single pointer. You will learn more about unified memory in Chapter 4.
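A hedged sketch of the Unified Memory alternative (cudaMallocManaged is the CUDA 6+ entry point; the kernel here is a placeholder of my own):

```cuda
#include <cstdio>

__global__ void kernel(float *data) { data[threadIdx.x] = threadIdx.x; }

void unified_demo(void) {
    float *buf;                                    // one pointer for host and device
    cudaMallocManaged(&buf, 256 * sizeof(float));
    kernel<<<1, 256>>>(buf);                       // device writes through buf
    cudaDeviceSynchronize();                       // finish before the host reads
    printf("%f\n", buf[0]);                        // no explicit cudaMemcpy needed
    cudaFree(buf);
}
```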
Organizing Threads
CUDA has an abstract, two-level hierarchy of threads.
One kernel launch corresponds to one grid, and a grid is composed of thread blocks.
Matching the memory model above, all threads in a grid share the same global memory.
Threads within a single block can interact with each other through:
1. Block-local synchronization
2. Block-local shared memory
Threads from different blocks cannot cooperate.
Threads are distinguished by blockIdx (the block's index within a grid) and threadIdx (the thread's index within a block). These are built in and assigned by the runtime, so we can use them in kernel code to tell threads apart.
Both variables are three-dimensional (the coordinate variable is of type uint3, a CUDA built-in vector type derived from the basic integer type):
blockIdx.x blockIdx.y blockIdx.z
threadIdx.x threadIdx.y threadIdx.z
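The usual way these coordinates combine into a unique global index, sketched for the 1D case (kernel name is illustrative; it assumes the grid exactly covers the array):

```cuda
__global__ void fill(float *a) {
    // Blocks of blockDim.x threads laid end to end give a unique 1D index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = (float)i;
}
```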
For a layout like Figure 2-5, how do we get blockIdx.x to range from 0 to 2? There is a matching pair of variables for that: blockDim (block dimension, measured in threads) and gridDim (grid dimension, measured in blocks).
The dimensions of a grid and a block are specified by the following two built-in variables:
➤ blockDim (block dimension, measured in threads)
➤ gridDim (grid dimension, measured in blocks)
These variables are of type dim3, an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.
These two variables are likewise three-dimensional, giving the size along each dimension:
blockDim.x blockDim.y blockDim.z
GRID AND BLOCK DIMENSIONS Usually, a grid is organized as a 2D array of blocks, and a block is organized as a 3D array of threads. Both grids and blocks use the dim3 type with three unsigned integer fields. The unused fields will be initialized to 1 and ignored.
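On the host side this typically looks like the following sketch (nElem and the kernel launch are illustrative; the rounding-up division ensures the grid covers every element):

```cuda
void configure(int nElem) {
    dim3 block(256);                             // block.y and block.z default to 1
    dim3 grid((nElem + block.x - 1) / block.x);  // ceil(nElem / 256) blocks
    // kernel<<<grid, block>>>(...);             // launch with these dimensions
}
```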
There are two distinct sets of grid and block variables in a CUDA program: manually-defined dim3 data type and pre-defined uint3 data type.
The manually-defined grid and block variables for the dim3 data type are only visible on the host side, and the built-in, pre-initialized grid and block variables of the uint3 data type are only visible on the device side.
Running the checkDimension example (Listing 2-2) prints each thread's threadIdx and blockIdx along with blockDim and gridDim.
The note below explains that grid/block dimension variables come in two matching sets: one that you define on the host before invoking the kernel, and a built-in set used on the device while the kernel executes.
ACCESS GRID/BLOCK VARIABLES FROM THE HOST AND DEVICE SIDE It is important to distinguish between the host and device access of grid and block variables. For example, using a variable declared as block from the host, you define the coordinates and access them as follows: block.x, block.y, and block.z
On the device side, you have the pre-initialized, built-in block size variable available as: blockDim.x, blockDim.y, and blockDim.z
In summary, you define variables for grid and block on the host before launching a kernel, and access them there with the x, y, and z fields of the vector structure from the host side. When the kernel is launched, you can use the pre-initialized, built-in variables within the kernel. In code, this corresponds to:
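A sketch in the spirit of the book's checkDimension listing (my reconstruction, so details may differ from Listing 2-2):

```cuda
#include <cstdio>

__global__ void checkDimension(void) {
    // Device side: pre-initialized built-in variables.
    printf("threadIdx:(%d,%d,%d) blockIdx:(%d,%d,%d) blockDim:(%d,%d,%d) gridDim:(%d,%d,%d)\n",
           threadIdx.x, threadIdx.y, threadIdx.z,
           blockIdx.x, blockIdx.y, blockIdx.z,
           blockDim.x, blockDim.y, blockDim.z,
           gridDim.x, gridDim.y, gridDim.z);
}

int main(void) {
    int nElem = 6;
    dim3 block(3);                               // host side: manually defined dim3
    dim3 grid((nElem + block.x - 1) / block.x);  // two blocks of three threads
    printf("grid.x=%d block.x=%d\n", grid.x, block.x);  // host access via .x/.y/.z
    checkDimension<<<grid, block>>>();
    cudaDeviceReset();                           // flush device printf and clean up
    return 0;
}
```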
- Author: liamY
- Link: https://liamy.clovy.top/article/madsys/cuda_learn01
- License: This article is licensed under CC BY-NC-SA 4.0; please credit the source when reposting.