GPU Divergence in AMD CDNA 3 - StructurizeCFG Pass

Introduction

I've been exploring GPU divergence, especially on AMD CDNA-3 based GPUs (MI300X), and in this post we will dive into how divergence is handled on these GPUs. When a wavefront diverges, we need to remember the state of the threads before the divergence, the diverging paths themselves (along with which thread takes which path), and finally the point of reconvergence. There are two approaches to this: special hardware instructions that push/pop the required details on a dedicated stack, or compiler-managed thread-mask control. Interestingly, I found that CDNA-3 GPUs seem to rely on the compiler to handle divergence despite having special instructions for it. Let us look at both approaches.

Two approaches - Software vs Hardware

Consider a simple control flow as follows

Diamond CFG: entry branches to True and False paths, both merge

When a wavefront encounters the condition at the end of the entry block, it must save the current set of active threads, disable the threads that meet the condition, and start executing the false path. The hardware must also know when to stop executing the false path and switch over to the true path. Finally, it must know how to restore the active-thread state as it was before this region.

Hardware based approach

The AMD CDNA-3 ISA documents a hardware mechanism for handling divergence using S_CBRANCH_I/G_FORK and S_CBRANCH_JOIN instructions. These FORK/JOIN instructions use a six-deep stack stored in SGPRs, with a Control Stack Pointer (CSP) to track nesting depth.

When a wavefront encounters a FORK:

  1. It computes mask_pass and mask_fail from the condition and current EXEC
  2. If all threads go the same way (uniform branch), just jump
  3. If divergent, the path with fewer threads executes first (EXEC = smaller group), and the other path's {PC, EXEC mask} is pushed onto the SGPR stack. CSP is incremented

By always executing the smaller group first, the stack depth is bounded by log₂ 64 = 6, because at each nesting level at most half of the remaining threads stay active.

When a path reaches JOIN:

  1. It compares the current CSP against the saved CSP value from when the FORK started (passed via arg0)
  2. If they differ, not all paths have completed yet. CSP is decremented, and {PC, EXEC} is popped from the stack to begin executing the next path
  3. If they match, all paths have completed. Execution continues past the JOIN

Each stack entry is 128 bits: {EXEC[63:0], PC[47:2]}, stored across 4 consecutive SGPRs. FORK/JOIN blocks can be nested to any depth (limited only by SGPR availability), and can coexist with other conditional flow control.

FORK/JOIN control flow example from CDNA-3 ISA manual
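To make the mechanism concrete, here is a minimal software model of the FORK/JOIN behavior described above. This is my own sketch based on the ISA manual's description, not the hardware implementation: the Wave/StackEntry types and function names are invented, uniform-branch handling is simplified, and how the full pre-fork EXEC is restored after the final JOIN is not modeled.

```cpp
#include <cstdint>
#include <vector>

struct StackEntry { uint64_t exec; uint64_t pc; };   // one pushed {EXEC, PC} pair

struct Wave {
    uint64_t exec;                    // active-lane mask
    uint64_t pc;                      // fall-through PC of the current instruction
    std::vector<StackEntry> stack;    // six-deep in hardware; CSP == stack.size()
};

// S_CBRANCH_*_FORK: cond has one bit per lane that takes the branch target.
void fork(Wave &w, uint64_t cond, uint64_t target_pc) {
    uint64_t mask_pass = w.exec & cond;
    uint64_t mask_fail = w.exec & ~cond;
    if (mask_pass == w.exec) { w.pc = target_pc; return; }   // uniform: all branch
    if (mask_fail == w.exec) { return; }                     // uniform: none branch
    // Divergent: run the smaller group first, push the larger group's {EXEC, PC}.
    if (__builtin_popcountll(mask_fail) < __builtin_popcountll(mask_pass)) {
        w.stack.push_back({mask_pass, target_pc});           // taken side deferred
        w.exec = mask_fail;                                   // fall through first
    } else {
        w.stack.push_back({mask_fail, w.pc});                 // fall-through side deferred
        w.exec = mask_pass;
        w.pc = target_pc;                                     // take the branch first
    }
}

// S_CBRANCH_JOIN: saved_csp is the stack depth captured when the FORK executed.
void join(Wave &w, size_t saved_csp) {
    if (w.stack.size() == saved_csp) return;   // both sides done: fall through the JOIN
    StackEntry e = w.stack.back();             // otherwise pop and run the other side
    w.stack.pop_back();
    w.exec = e.exec;
    w.pc   = e.pc;
}
```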

This brings us to the twist — these FORK/JOIN instructions are never emitted by the compiler. In all my experiments involving complicated and twisted control flows, I was never able to get the compiler to actually emit these instructions.

A brief comparison with NVIDIA

Let us also briefly take a look at what we know about divergence in NVIDIA GPUs. NVIDIA GPUs (Volta and later) seem to take the hardware approach. They use BSSY and BSYNC instructions with convergence barrier registers (B0, B1, ...):

  • BSSY B0, <target> — Saves the current active thread mask into barrier register B0. Placed before a divergent branch.
  • BSYNC B0 — Reconvergence barrier. Threads arriving here are marked as "arrived" in B0. If all threads recorded in B0 have arrived, execution continues with the full mask restored. If not, the arriving threads are blocked and the hardware switches to execute the other path.
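Based only on the behavior described in these two bullets (the microarchitectural details are not publicly documented, as noted below), here is a rough behavioral model of a single convergence barrier register; the type and method names are mine.

```cpp
#include <cstdint>

// Hypothetical model of one convergence barrier register (e.g. B0).
struct ConvergenceBarrier {
    uint64_t expected = 0;   // thread mask saved by BSSY
    uint64_t arrived  = 0;   // threads that have reached the matching BSYNC

    // BSSY B0, <target>: record the currently active threads.
    void bssy(uint64_t active_mask) {
        expected = active_mask;
        arrived  = 0;
    }

    // BSYNC B0: mark the arriving threads. Returns true once every recorded
    // thread has arrived, i.e. the warp continues past the barrier with
    // `expected` restored as the active mask; otherwise the arriving threads
    // are parked and the scheduler switches to the other path.
    bool bsync(uint64_t arriving_mask) {
        arrived |= arriving_mask;
        return arrived == expected;
    }
};
```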

Here is the actual SASS generated by NVIDIA's compiler (Turing, sm_75) for a diamond (if-then-else) kernel:

__device__ __noinline__ void path_a(volatile int *out, int tid) {
    out[tid] = 1;
    out[tid + 64] = 11;
    out[tid + 128] = 111;
}

__device__ __noinline__ void path_b(volatile int *out, int tid) {
    out[tid] = 2;
    out[tid + 64] = 22;
    out[tid + 128] = 222;
}

__global__ void diamond(volatile int *out) {
    int tid = threadIdx.x;
    if (tid > 10) {
        path_a(out, tid);
    } else {
        path_b(out, tid);
    }
    out[tid + 256] = 3;
}
        BSSY B0, .reconverge       ; save active mask into B0, set reconvergence point
        ISETP.GT P0, R0, 0xa       ; P0 = (tid > 10)
    @P0 BRA .then                  ; threads with tid > 10 jump to then

        ; else path
        CALL path_b                ; call path_b(out, tid)
        BRA .reconverge            ; skip then path

.then:
        CALL path_a                ; call path_a(out, tid)

.reconverge:
        BSYNC B0                   ; reconverge: wait for all threads saved in B0
        STG [R2+0x400], 3          ; out[tid + 256] = 3 (all threads)

The compiler inserts BSSY B0 before the divergent branch to record which threads are active. Both paths converge at BSYNC B0, where the hardware waits for all threads recorded in B0 to arrive, then restores the full active mask and continues. Unlike AMD's FORK/JOIN, these hardware instructions are actively used by NVIDIA's compiler.

Unfortunately, how exactly these instructions work at the microarchitectural level is not documented anywhere. I found one paper that attempts to reverse engineer and model it. Find it here - Control Flow Management in Modern GPUs.

Software (compiler) based approach

Since I couldn't get the compiler to emit any FORK/JOIN instructions, it seems the compiler handles divergence and reconvergence on its own. So how does it do it? By directly flipping bits in the EXEC register. EXEC is a 64-bit register with one bit per thread that controls whether that thread is enabled (1) or disabled (0), and the compiler's approach to divergent control flow is to manipulate this register directly.

Consider the following kernel:

__global__ void diamond(int *out) {
    int tid = threadIdx.x;
    if (tid == 0) {
        out[0] = 1;   // then
    } else {
        out[1] = 2;   // else
    }
    out[2] = 3;       // merge
}

The corresponding control flow graph looks like this:

Diamond CFG before structurization

To handle divergence, the compiler must restructure (linearize) this so that both the True path and the False path execute sequentially with different EXEC masks applied.

Diamond CFG after structurization: linearized with Flow block between else and then

Now inserting the EXEC masking instructions becomes straightforward.

  • Save the EXEC register at the end of entry and mask it so that only the else path executes.
  • Execute the else path.
  • At the Flow block, flip the EXEC register bits, enabling the other set of threads.
  • Execute the then path.
  • At merge, restore EXEC as it was at the end of entry.

At each condition, we also have the option of skipping a path entirely if no thread takes it. For example, at the end of entry, we can jump directly to the Flow block if no thread meets the condition for the else block.
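To make those steps concrete, here is a small host-side sketch that simulates EXEC as a plain 64-bit integer and walks the linearized diamond. The function names and the execute_* placeholders are mine, purely for illustration; on the real hardware the work happens in vector instructions predicated by EXEC.

```cpp
#include <cstdint>

// Placeholders standing in for the code of each path; a real wavefront would
// execute the vector instructions with the given mask applied.
void execute_else(uint64_t mask)  { /* out[1] = 2 for lanes set in mask */ }
void execute_then(uint64_t mask)  { /* out[0] = 1 for lanes set in mask */ }
void execute_merge(uint64_t mask) { /* out[2] = 3 for lanes set in mask */ }

void linearized_diamond(uint64_t cond_then) {   // bit i set => lane i takes "then"
    uint64_t exec  = ~0ull;                     // all 64 lanes active in entry
    uint64_t saved = exec;                      // save EXEC at the end of entry

    exec = saved & ~cond_then;                  // restrict to else-lanes
    if (exec != 0)                              // s_cbranch_execz-style skip
        execute_else(exec);

    exec = saved & cond_then;                   // Flow: flip to the other set of lanes
    if (exec != 0)
        execute_then(exec);

    exec = saved;                               // merge: restore EXEC from entry
    execute_merge(exec);
}
```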

Here is the actual ISA generated by the compiler for gfx942 (MI300X):

; entry:
    v_cmp_ne_u32_e32 vcc, 0, v0                  ; compare tid != 0, result in VCC
    s_and_saveexec_b64 s[2:3], vcc               ; save EXEC into s[2:3], EXEC &= VCC (else-lanes only)
    s_xor_b64 s[2:3], exec, s[2:3]               ; s[2:3] = saved ^ current (compute then-lanes for later)
    s_cbranch_execz .Flow                        ; if no else-lanes active, skip else block

; else:
    global_store_dword v0, v1, s[0:1] offset:4   ; out[1] = 2

.Flow:
    s_andn2_saveexec_b64 s[2:3], s[2:3]          ; flip EXEC to then-lanes (EXEC = s[2:3] & ~EXEC)
    s_cbranch_execz .merge                       ; if no then-lanes active, skip then block

; then:
    global_store_dword v0, v1, s[0:1]            ; out[0] = 1

.merge:
    s_or_b64 exec, exec, s[2:3]                  ; restore full EXEC (all threads active again)
    global_store_dword v0, v1, s[0:1] offset:8   ; out[2] = 3

Note: I generated these examples by running the StructurizeCFG pass in isolation to avoid other optimizations altering the results.
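For readers unfamiliar with the saveexec-style instructions, here is a rough scalar model of the EXEC manipulations used in the listing above, written as C++ helpers over a simulated 64-bit EXEC. This reflects my reading of the ISA manual and the comments in the listing; among other simplifications, it omits that these instructions also write SCC.

```cpp
#include <cstdint>

// Simulated wavefront state: just the EXEC mask, for this sketch.
struct SimWave { uint64_t exec; };

// s_and_saveexec_b64 sdst, ssrc
// sdst receives the old EXEC, then EXEC is ANDed with ssrc.
uint64_t s_and_saveexec(SimWave &w, uint64_t ssrc) {
    uint64_t old_exec = w.exec;
    w.exec &= ssrc;
    return old_exec;            // value written to sdst (e.g. s[2:3])
}

// s_andn2_saveexec_b64 sdst, ssrc
// sdst receives the old EXEC, then EXEC = ssrc & ~old EXEC
// (used in Flow blocks to "flip" to the lanes that have not run yet).
uint64_t s_andn2_saveexec(SimWave &w, uint64_t ssrc) {
    uint64_t old_exec = w.exec;
    w.exec = ssrc & ~old_exec;
    return old_exec;
}

// s_cbranch_execz <label>: the branch is taken when no lane is active.
bool s_cbranch_execz(const SimWave &w) { return w.exec == 0; }

// s_or_b64 exec, exec, ssrc: re-enable the lanes recorded in ssrc.
void s_or_exec(SimWave &w, uint64_t ssrc) { w.exec |= ssrc; }
```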

The pass that handles this restructuring is the StructurizeCFG pass. The core idea behind the pass is to sequentialize the execution of all divergent paths by inserting Flow blocks, which act as merge/flip points from which execution can safely switch to the sibling path (by flipping the EXEC mask). The pass must handle nested divergent branches as well, which it does by merging all appropriate threads back in before flipping the EXEC register, as we will see below.

Now let us consider a slightly more complicated kernel:

__global__ void multi_exit(volatile int *out, volatile int *cond) {
    int tid = threadIdx.x;
    int c1 = cond[0];
    int c2 = cond[1];

    if (tid > c1) {
        out[1] = 0;   // early_exit
        return;
    }
    // path1: only threads where tid <= c1 reach here
    if (tid > c2) {
        out[1] = 0;   // early_exit
        return;
    }
    // path2: only threads where tid <= c2 reach here
    out[0] = 42;
}

The corresponding control flow graph would look something like this:

Multi-exit CFG: entry and path1 both branch to early_exit

In this case, observe that the early_exit block has two predecessors. The EXEC mask is saved at the end of the entry block and again at the end of path1. In early_exit, depending on whether threads arrive from entry or from path1, the compiler would need to emit different EXEC restore instructions. This is how the compiler transforms the structure:

Multi-exit CFG after structurization: Flow1 and Flow blocks inserted, early_exit has single predecessor

Now early_exit has a single predecessor (Flow), and the EXEC operations become easier to insert correctly:

  • At entry: compute the entry exit mask (tid > c1) and store it in s[4:5]. Save EXEC, restrict to path1 lanes (tid <= c1)
  • At path1: save EXEC again, restrict to path2 lanes (tid <= c2)
  • Execute path2
  • At Flow1: restore EXEC to path1 level. Merge the path1 exit mask into s[4:5] — now s[4:5] contains the combined mask of all threads that should go to early_exit (from both entry and path1)
  • At Flow: restore EXEC to entry level (all threads). Save EXEC, restrict to early_exit lanes using s[4:5]. Execute early_exit

Here is the actual ISA generated by the compiler for gfx942 (MI300X). Let's trace through it with a concrete example: 8 threads, c1=3, c2=2.

  • Threads 0-3: tid <= c1, enter path1
  • Threads 4-7: tid > c1, should exit early
  • Threads 0-2: tid <= c2, enter path2
  • Thread 3: tid > c2, should exit early from path1

Let's track the EXEC mask and key registers at each step (1 = active, 0 = masked off):

; entry:                                         ; EXEC = 1111 1111 (all 8 threads)
    global_load_dword v2, v1, s[2:3] sc0 sc1     ; load c1 (=3)
    v_cmp_gt_i32_e64 s[4:5], v0, v2              ; s[4:5] = 1111 0000 (tid > 3 = entry exit mask)
    v_cmp_le_i32_e32 vcc, v0, v2                 ; vcc = 0000 1111 (tid <= 3)
    s_and_saveexec_b64 s[6:7], vcc               ; s[6:7] = 1111 1111 (saved EXEC)
                                                 ; EXEC = 0000 1111 (path1 lanes only)
    s_cbranch_execz .Flow                        ; EXEC != 0, so continue to path1

; path1:                                         ; EXEC = 0000 1111 (threads 0-3)
    global_load_dword v1, v1, s[2:3] offset:4    ; load c2 (=2)
    s_mov_b64 s[2:3], -1                         ; s[2:3] = 1111 1111 (assume all path1 threads exit)
    v_cmp_le_i32_e32 vcc, v0, v1                 ; vcc = 0000 0111 (tid <= 2, masked by EXEC to threads 0-2)
    s_and_saveexec_b64 s[8:9], vcc               ; s[8:9] = 0000 1111 (saved EXEC = path1 mask)
                                                 ; EXEC = 0000 0111 (path2 lanes: threads 0-2)
    s_cbranch_execz .Flow1                       ; EXEC != 0, so continue to path2

; path2:                                         ; EXEC = 0000 0111 (threads 0-2)
    global_store_dword v0, v1, s[0:1]            ; out[0] = 42
    s_xor_b64 s[2:3], exec, -1                   ; s[2:3] = 1111 1000 (threads NOT in path2 should exit)

.Flow1:                                          ; EXEC = 0000 0111 (path2 threads)
    s_or_b64 exec, exec, s[8:9]                  ; EXEC = 0000 1111 (restore to path1 level)
    s_andn2_b64 s[4:5], s[4:5], exec             ; s[4:5] = 1111 0000 (clear path1 bits from entry exit)
    s_and_b64 s[2:3], s[2:3], exec               ; s[2:3] = 0000 1000 (path1 exit = thread 3 only)
    s_or_b64 s[4:5], s[4:5], s[2:3]              ; s[4:5] = 1111 1000 (merged: threads 3-7 exit early)

.Flow:                                           ; EXEC = 0000 1111 (path1 threads)
    s_or_b64 exec, exec, s[6:7]                  ; EXEC = 1111 1111 (restore to entry level, all threads)
    s_and_saveexec_b64 s[2:3], s[4:5]            ; s[2:3] = 1111 1111 (saved)
                                                 ; EXEC = 1111 1000 (early_exit lanes: threads 3-7)
    s_cbranch_execz .exit                        ; EXEC != 0, so continue to early_exit

; early_exit:                                    ; EXEC = 1111 1000 (threads 3-7)
    global_store_dword v0, v0, s[0:1] offset:4   ; out[1] = 0

.exit:
    s_endpgm

Notice how the Flow1 block collects exit masks from different nesting depths and merges them into s[4:5]. By the time we reach Flow, s[4:5] = 1111 1000, which is exactly the set of threads that should exit early (thread 3 from path1, threads 4-7 from entry). The save/restore pairs nest cleanly, like parentheses: SAVE → SAVE → RESTORE → RESTORE → SAVE.
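The mask algebra above is easy to replay on the host. Here is a small C++ snippet (my own construction, using the same 8 lanes with c1=3 and c2=2) that mirrors the scalar operations from the listing and arrives at the same merged early-exit mask, 1111 1000 (0xf8).

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const int lanes = 8, c1 = 3, c2 = 2;
    auto mask = [&](auto pred) {                  // build a lane mask from a predicate
        uint64_t m = 0;
        for (int tid = 0; tid < lanes; ++tid)
            if (pred(tid)) m |= 1ull << tid;
        return m;
    };

    uint64_t exec = mask([](int) { return true; });           // 1111 1111
    uint64_t s45  = mask([&](int t) { return t > c1; });      // entry exit mask: 1111 0000
    uint64_t vcc  = mask([&](int t) { return t <= c1; });     // 0000 1111
    uint64_t s67  = exec; exec &= vcc;                        // entry s_and_saveexec

    uint64_t s23  = ~0ull;                                    // path1: assume every lane exits
    vcc           = exec & mask([&](int t) { return t <= c2; });
    uint64_t s89  = exec; exec &= vcc;                        // path1 s_and_saveexec

    s23 = exec ^ ~0ull;                                       // path2: lanes NOT in path2 exit

    exec |= s89;                                              // Flow1: restore to path1 level
    s45  = (s45 & ~exec) | (s23 & exec);                      // merge the two exit masks

    exec |= s67;                                              // Flow: restore to entry level
    uint64_t early_exit = s45 & exec;                         // Flow s_and_saveexec

    printf("early-exit lanes = 0x%02llx\n",                   // expect 0xf8 = 1111 1000
           (unsigned long long)early_exit);
    return 0;
}
```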

Conclusion

So CDNA-3 GPUs rely entirely on the compiler to handle divergent control flow. The StructurizeCFG pass linearizes the CFG, and the backend emits EXEC mask operations to sequence through each path. This is despite the ISA manual documenting support for hardware-based divergence handling via the FORK/JOIN instructions. I am not sure why the compiler doesn't generate these instructions, but I plan to investigate further.