
[AMDGPU] Add hazard workarounds to insertIndirectBranch #109127

Merged 2 commits into llvm:main on Sep 22, 2024

Conversation

@perlfu (Contributor) commented Sep 18, 2024

BranchRelaxation runs after the hazard recognizer, so workarounds for SGPR accesses need to be applied directly inline to the code it generates.
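The inserted s_waitcnt_depctr (s_wait_alu in GFX12 syntax) waits on the sa_sdst dependency counter after s_getpc_b64 and after each SALU add in the long-branch expansion. As a minimal sketch of where the 0xfffe immediate in the test checks comes from (this models, rather than calls, AMDGPU::DepCtr::encodeFieldSaSdst, and assumes sa_sdst is bit 0 of the 16-bit DEPCTR immediate):

#include <cstdint>
#include <cstdio>

// Sketch only: the s_waitcnt_depctr immediate packs several dependency
// counters into 16 bits. Requiring sa_sdst == 0 clears its bit while
// every other field keeps its "don't wait" value, which is why the
// tests below check for 0xfffe.
static uint16_t encodeSaSdst(uint16_t Wait) {
  const uint16_t NoWait = 0xffff; // wait on nothing
  return static_cast<uint16_t>((NoWait & ~uint16_t(1)) | (Wait & 1u));
}

int main() {
  std::printf("s_waitcnt_depctr 0x%04x\n", encodeSaSdst(0)); // 0xfffe
}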

@llvmbot (Collaborator) commented Sep 18, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Carl Ritson (perlfu)

Changes

BranchRelaxation runs after the hazard recognizer, so workarounds for SGPR accesses need to be applied directly inline to the code it generates.


Patch is 42.36 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/109127.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+15)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation.ll (+937)
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 30aa36be99c95f..b110b5fc19097a 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -2867,6 +2867,12 @@ void SIInstrInfo::insertIndirectBranch(MachineBasicBlock &MBB,
   MachineRegisterInfo &MRI = MF->getRegInfo();
   const SIMachineFunctionInfo *MFI = MF->getInfo<SIMachineFunctionInfo>();
 
+  // Note: as this is used after hazard recognizer we need to apply some hazard
+  // workarounds directly.
+  const GCNSubtarget &ST = MF->getSubtarget<GCNSubtarget>();
+  const bool FlushSGPRWrites = (ST.isWave64() && ST.hasVALUMaskWriteHazard()) ||
+                               ST.hasVALUReadSGPRHazard();
+
   // FIXME: Virtual register workaround for RegScavenger not working with empty
   // blocks.
   Register PCReg = MRI.createVirtualRegister(&AMDGPU::SReg_64RegClass);
@@ -2876,6 +2882,9 @@ void SIInstrInfo::insertIndirectBranch(MachineBasicBlock &MBB,
   // We need to compute the offset relative to the instruction immediately after
   // s_getpc_b64. Insert pc arithmetic code before last terminator.
   MachineInstr *GetPC = BuildMI(MBB, I, DL, get(AMDGPU::S_GETPC_B64), PCReg);
+  if (FlushSGPRWrites)
+    BuildMI(MBB, I, DL, get(AMDGPU::S_WAITCNT_DEPCTR))
+        .addImm(AMDGPU::DepCtr::encodeFieldSaSdst(0));
 
   auto &MCCtx = MF->getContext();
   MCSymbol *PostGetPCLabel =
@@ -2890,10 +2899,16 @@ void SIInstrInfo::insertIndirectBranch(MachineBasicBlock &MBB,
       .addReg(PCReg, RegState::Define, AMDGPU::sub0)
       .addReg(PCReg, 0, AMDGPU::sub0)
       .addSym(OffsetLo, MO_FAR_BRANCH_OFFSET);
+  if (FlushSGPRWrites)
+    BuildMI(MBB, I, DL, get(AMDGPU::S_WAITCNT_DEPCTR))
+        .addImm(AMDGPU::DepCtr::encodeFieldSaSdst(0));
   BuildMI(MBB, I, DL, get(AMDGPU::S_ADDC_U32))
       .addReg(PCReg, RegState::Define, AMDGPU::sub1)
       .addReg(PCReg, 0, AMDGPU::sub1)
       .addSym(OffsetHi, MO_FAR_BRANCH_OFFSET);
+  if (FlushSGPRWrites)
+    BuildMI(MBB, I, DL, get(AMDGPU::S_WAITCNT_DEPCTR))
+        .addImm(AMDGPU::DepCtr::encodeFieldSaSdst(0));
 
   // Insert the indirect branch after the other terminator.
   BuildMI(&MBB, DL, get(AMDGPU::S_SETPC_B64))
diff --git a/llvm/test/CodeGen/AMDGPU/branch-relaxation.ll b/llvm/test/CodeGen/AMDGPU/branch-relaxation.ll
index 635f3e4886b875..41023ac5ae4e48 100644
--- a/llvm/test/CodeGen/AMDGPU/branch-relaxation.ll
+++ b/llvm/test/CodeGen/AMDGPU/branch-relaxation.ll
@@ -1,5 +1,7 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -mtriple=amdgcn -mcpu=tahiti -verify-machineinstrs -amdgpu-s-branch-bits=4 -simplifycfg-require-and-preserve-domtree=1 -amdgpu-long-branch-factor=0 < %s | FileCheck -enable-var-scope -check-prefix=GCN %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+wavefrontsize64 -verify-machineinstrs -amdgpu-s-branch-bits=5 -simplifycfg-require-and-preserve-domtree=1 -amdgpu-long-branch-factor=0 < %s | FileCheck -enable-var-scope -check-prefix=GFX11 %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -verify-machineinstrs -amdgpu-s-branch-bits=5 -simplifycfg-require-and-preserve-domtree=1 -amdgpu-long-branch-factor=0 < %s | FileCheck -enable-var-scope -check-prefix=GFX12 %s
 
 
 ; FIXME: We should use llvm-mc for this, but we can't even parse our own output.
@@ -42,6 +44,71 @@ define amdgpu_kernel void @uniform_conditional_max_short_forward_branch(ptr addr
 ; GCN-NEXT:    buffer_store_dword v0, off, s[4:7], 0
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    s_endpgm
+;
+; GFX11-LABEL: uniform_conditional_max_short_forward_branch:
+; GFX11:       ; %bb.0: ; %bb
+; GFX11-NEXT:    s_load_b32 s0, s[2:3], 0x2c
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_cmp_eq_u32 s0, 0
+; GFX11-NEXT:    s_cbranch_scc0 .LBB0_1
+; GFX11-NEXT:  ; %bb.3: ; %bb
+; GFX11-NEXT:    s_getpc_b64 s[4:5]
+; GFX11-NEXT:  .Lpost_getpc0:
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_add_u32 s4, s4, (.LBB0_2-.Lpost_getpc0)&4294967295
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_addc_u32 s5, s5, (.LBB0_2-.Lpost_getpc0)>>32
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_setpc_b64 s[4:5]
+; GFX11-NEXT:  .LBB0_1: ; %bb2
+; GFX11-NEXT:    ;;#ASMSTART
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    ;;#ASMEND
+; GFX11-NEXT:    s_sleep 0
+; GFX11-NEXT:  .LBB0_2: ; %bb3
+; GFX11-NEXT:    s_load_b64 s[2:3], s[2:3], 0x24
+; GFX11-NEXT:    v_mov_b32_e32 v0, 0
+; GFX11-NEXT:    v_mov_b32_e32 v1, s0
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    global_store_b32 v0, v1, s[2:3] dlc
+; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX11-NEXT:    s_nop 0
+; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX11-NEXT:    s_endpgm
+;
+; GFX12-LABEL: uniform_conditional_max_short_forward_branch:
+; GFX12:       ; %bb.0: ; %bb
+; GFX12-NEXT:    s_load_b32 s0, s[2:3], 0x2c
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    s_cmp_eq_u32 s0, 0
+; GFX12-NEXT:    s_cbranch_scc0 .LBB0_1
+; GFX12-NEXT:  ; %bb.3: ; %bb
+; GFX12-NEXT:    s_getpc_b64 s[4:5]
+; GFX12-NEXT:  .Lpost_getpc0:
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_u32 s4, s4, (.LBB0_2-.Lpost_getpc0)&4294967295
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_ci_u32 s5, s5, (.LBB0_2-.Lpost_getpc0)>>32
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_setpc_b64 s[4:5]
+; GFX12-NEXT:  .LBB0_1: ; %bb2
+; GFX12-NEXT:    ;;#ASMSTART
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    ;;#ASMEND
+; GFX12-NEXT:    s_sleep 0
+; GFX12-NEXT:  .LBB0_2: ; %bb3
+; GFX12-NEXT:    s_load_b64 s[2:3], s[2:3], 0x24
+; GFX12-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s0
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    global_store_b32 v0, v1, s[2:3] scope:SCOPE_SYS
+; GFX12-NEXT:    s_wait_storecnt 0x0
+; GFX12-NEXT:    s_nop 0
+; GFX12-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX12-NEXT:    s_endpgm
 bb:
   %cmp = icmp eq i32 %cnd, 0
   br i1 %cmp, label %bb3, label %bb2 ; +8 dword branch
@@ -89,6 +156,71 @@ define amdgpu_kernel void @uniform_conditional_min_long_forward_branch(ptr addrs
 ; GCN-NEXT:    buffer_store_dword v0, off, s[4:7], 0
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    s_endpgm
+;
+; GFX11-LABEL: uniform_conditional_min_long_forward_branch:
+; GFX11:       ; %bb.0: ; %bb0
+; GFX11-NEXT:    s_load_b32 s0, s[2:3], 0x2c
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_cmp_eq_u32 s0, 0
+; GFX11-NEXT:    s_cbranch_scc0 .LBB1_1
+; GFX11-NEXT:  ; %bb.3: ; %bb0
+; GFX11-NEXT:    s_getpc_b64 s[4:5]
+; GFX11-NEXT:  .Lpost_getpc1:
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_add_u32 s4, s4, (.LBB1_2-.Lpost_getpc1)&4294967295
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_addc_u32 s5, s5, (.LBB1_2-.Lpost_getpc1)>>32
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_setpc_b64 s[4:5]
+; GFX11-NEXT:  .LBB1_1: ; %bb2
+; GFX11-NEXT:    ;;#ASMSTART
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    ;;#ASMEND
+; GFX11-NEXT:  .LBB1_2: ; %bb3
+; GFX11-NEXT:    s_load_b64 s[2:3], s[2:3], 0x24
+; GFX11-NEXT:    v_mov_b32_e32 v0, 0
+; GFX11-NEXT:    v_mov_b32_e32 v1, s0
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    global_store_b32 v0, v1, s[2:3] dlc
+; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX11-NEXT:    s_nop 0
+; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX11-NEXT:    s_endpgm
+;
+; GFX12-LABEL: uniform_conditional_min_long_forward_branch:
+; GFX12:       ; %bb.0: ; %bb0
+; GFX12-NEXT:    s_load_b32 s0, s[2:3], 0x2c
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    s_cmp_eq_u32 s0, 0
+; GFX12-NEXT:    s_cbranch_scc0 .LBB1_1
+; GFX12-NEXT:  ; %bb.3: ; %bb0
+; GFX12-NEXT:    s_getpc_b64 s[4:5]
+; GFX12-NEXT:  .Lpost_getpc1:
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_u32 s4, s4, (.LBB1_2-.Lpost_getpc1)&4294967295
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_ci_u32 s5, s5, (.LBB1_2-.Lpost_getpc1)>>32
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_setpc_b64 s[4:5]
+; GFX12-NEXT:  .LBB1_1: ; %bb2
+; GFX12-NEXT:    ;;#ASMSTART
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    ;;#ASMEND
+; GFX12-NEXT:  .LBB1_2: ; %bb3
+; GFX12-NEXT:    s_load_b64 s[2:3], s[2:3], 0x24
+; GFX12-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s0
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    global_store_b32 v0, v1, s[2:3] scope:SCOPE_SYS
+; GFX12-NEXT:    s_wait_storecnt 0x0
+; GFX12-NEXT:    s_nop 0
+; GFX12-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX12-NEXT:    s_endpgm
 bb0:
   %cmp = icmp eq i32 %cnd, 0
   br i1 %cmp, label %bb3, label %bb2 ; +9 dword branch
@@ -138,6 +270,78 @@ define amdgpu_kernel void @uniform_conditional_min_long_forward_vcnd_branch(ptr
 ; GCN-NEXT:    buffer_store_dword v0, off, s[4:7], 0
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    s_endpgm
+;
+; GFX11-LABEL: uniform_conditional_min_long_forward_vcnd_branch:
+; GFX11:       ; %bb.0: ; %bb0
+; GFX11-NEXT:    s_load_b32 s0, s[2:3], 0x2c
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    v_cmp_eq_f32_e64 s[4:5], s0, 0
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT:    s_and_b64 vcc, exec, s[4:5]
+; GFX11-NEXT:    s_cbranch_vccz .LBB2_1
+; GFX11-NEXT:  ; %bb.3: ; %bb0
+; GFX11-NEXT:    s_getpc_b64 s[4:5]
+; GFX11-NEXT:  .Lpost_getpc2:
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_add_u32 s4, s4, (.LBB2_2-.Lpost_getpc2)&4294967295
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_addc_u32 s5, s5, (.LBB2_2-.Lpost_getpc2)>>32
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_setpc_b64 s[4:5]
+; GFX11-NEXT:  .LBB2_1: ; %bb2
+; GFX11-NEXT:    ;;#ASMSTART
+; GFX11-NEXT:     ; 32 bytes
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    ;;#ASMEND
+; GFX11-NEXT:  .LBB2_2: ; %bb3
+; GFX11-NEXT:    s_load_b64 s[2:3], s[2:3], 0x24
+; GFX11-NEXT:    v_mov_b32_e32 v0, 0
+; GFX11-NEXT:    v_mov_b32_e32 v1, s0
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    global_store_b32 v0, v1, s[2:3] dlc
+; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX11-NEXT:    s_nop 0
+; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX11-NEXT:    s_endpgm
+;
+; GFX12-LABEL: uniform_conditional_min_long_forward_vcnd_branch:
+; GFX12:       ; %bb.0: ; %bb0
+; GFX12-NEXT:    s_load_b32 s0, s[2:3], 0x2c
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    s_cmp_eq_f32 s0, 0
+; GFX12-NEXT:    s_cselect_b32 s1, -1, 0
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX12-NEXT:    s_and_b32 vcc_lo, exec_lo, s1
+; GFX12-NEXT:    s_cbranch_vccz .LBB2_1
+; GFX12-NEXT:  ; %bb.3: ; %bb0
+; GFX12-NEXT:    s_getpc_b64 s[4:5]
+; GFX12-NEXT:  .Lpost_getpc2:
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_u32 s4, s4, (.LBB2_2-.Lpost_getpc2)&4294967295
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_ci_u32 s5, s5, (.LBB2_2-.Lpost_getpc2)>>32
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_setpc_b64 s[4:5]
+; GFX12-NEXT:  .LBB2_1: ; %bb2
+; GFX12-NEXT:    ;;#ASMSTART
+; GFX12-NEXT:     ; 32 bytes
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    ;;#ASMEND
+; GFX12-NEXT:  .LBB2_2: ; %bb3
+; GFX12-NEXT:    s_load_b64 s[2:3], s[2:3], 0x24
+; GFX12-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s0
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    global_store_b32 v0, v1, s[2:3] scope:SCOPE_SYS
+; GFX12-NEXT:    s_wait_storecnt 0x0
+; GFX12-NEXT:    s_nop 0
+; GFX12-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX12-NEXT:    s_endpgm
 bb0:
   %cmp = fcmp oeq float %cnd, 0.0
   br i1 %cmp, label %bb3, label %bb2 ; + 8 dword branch
@@ -193,6 +397,85 @@ define amdgpu_kernel void @min_long_forward_vbranch(ptr addrspace(1) %arg) #0 {
 ; GCN-NEXT:    buffer_store_dword v2, v[0:1], s[0:3], 0 addr64
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    s_endpgm
+;
+; GFX11-LABEL: min_long_forward_vbranch:
+; GFX11:       ; %bb.0: ; %bb
+; GFX11-NEXT:    s_load_b64 s[0:1], s[2:3], 0x24
+; GFX11-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(VALU_DEP_1)
+; GFX11-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    global_load_b32 v2, v0, s[0:1] glc dlc
+; GFX11-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-NEXT:    v_add_co_u32 v0, s[2:3], s0, v0
+; GFX11-NEXT:    v_add_co_ci_u32_e64 v1, null, s1, 0, s[2:3]
+; GFX11-NEXT:    s_mov_b64 s[0:1], exec
+; GFX11-NEXT:    v_cmpx_ne_u32_e32 0, v2
+; GFX11-NEXT:    s_cbranch_execnz .LBB3_1
+; GFX11-NEXT:  ; %bb.3: ; %bb
+; GFX11-NEXT:    s_getpc_b64 s[2:3]
+; GFX11-NEXT:  .Lpost_getpc3:
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_add_u32 s2, s2, (.LBB3_2-.Lpost_getpc3)&4294967295
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_addc_u32 s3, s3, (.LBB3_2-.Lpost_getpc3)>>32
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_setpc_b64 s[2:3]
+; GFX11-NEXT:  .LBB3_1: ; %bb2
+; GFX11-NEXT:    ;;#ASMSTART
+; GFX11-NEXT:     ; 32 bytes
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    ;;#ASMEND
+; GFX11-NEXT:  .LBB3_2: ; %bb3
+; GFX11-NEXT:    s_or_b64 exec, exec, s[0:1]
+; GFX11-NEXT:    global_store_b32 v[0:1], v2, off dlc
+; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX11-NEXT:    s_nop 0
+; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX11-NEXT:    s_endpgm
+;
+; GFX12-LABEL: min_long_forward_vbranch:
+; GFX12:       ; %bb.0: ; %bb
+; GFX12-NEXT:    s_load_b64 s[0:1], s[2:3], 0x24
+; GFX12-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
+; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(VALU_DEP_1)
+; GFX12-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    global_load_b32 v2, v0, s[0:1] scope:SCOPE_SYS
+; GFX12-NEXT:    s_wait_loadcnt 0x0
+; GFX12-NEXT:    v_add_co_u32 v0, s0, s0, v0
+; GFX12-NEXT:    v_add_co_ci_u32_e64 v1, null, s1, 0, s0
+; GFX12-NEXT:    s_mov_b32 s0, exec_lo
+; GFX12-NEXT:    v_cmpx_ne_u32_e32 0, v2
+; GFX12-NEXT:    s_cbranch_execnz .LBB3_1
+; GFX12-NEXT:  ; %bb.3: ; %bb
+; GFX12-NEXT:    s_getpc_b64 s[2:3]
+; GFX12-NEXT:  .Lpost_getpc3:
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_u32 s2, s2, (.LBB3_2-.Lpost_getpc3)&4294967295
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_ci_u32 s3, s3, (.LBB3_2-.Lpost_getpc3)>>32
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_setpc_b64 s[2:3]
+; GFX12-NEXT:  .LBB3_1: ; %bb2
+; GFX12-NEXT:    ;;#ASMSTART
+; GFX12-NEXT:     ; 32 bytes
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    ;;#ASMEND
+; GFX12-NEXT:  .LBB3_2: ; %bb3
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_or_b32 exec_lo, exec_lo, s0
+; GFX12-NEXT:    global_store_b32 v[0:1], v2, off scope:SCOPE_SYS
+; GFX12-NEXT:    s_wait_storecnt 0x0
+; GFX12-NEXT:    s_nop 0
+; GFX12-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX12-NEXT:    s_endpgm
 bb:
   %tid = call i32 @llvm.amdgcn.workitem.id.x()
   %tid.ext = zext i32 %tid to i64
@@ -237,6 +520,61 @@ define amdgpu_kernel void @long_backward_sbranch(ptr addrspace(1) %arg) #0 {
 ; GCN-NEXT:    s_setpc_b64 s[2:3]
 ; GCN-NEXT:  .LBB4_2: ; %bb3
 ; GCN-NEXT:    s_endpgm
+;
+; GFX11-LABEL: long_backward_sbranch:
+; GFX11:       ; %bb.0: ; %bb
+; GFX11-NEXT:    s_mov_b32 s0, 0
+; GFX11-NEXT:    .p2align 6
+; GFX11-NEXT:  .LBB4_1: ; %bb2
+; GFX11-NEXT:    ; =>This Inner Loop Header: Depth=1
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT:    s_add_i32 s0, s0, 1
+; GFX11-NEXT:    ;;#ASMSTART
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    ;;#ASMEND
+; GFX11-NEXT:    s_cmp_lt_i32 s0, 10
+; GFX11-NEXT:    s_cbranch_scc0 .LBB4_2
+; GFX11-NEXT:  ; %bb.3: ; %bb2
+; GFX11-NEXT:    ; in Loop: Header=BB4_1 Depth=1
+; GFX11-NEXT:    s_getpc_b64 s[2:3]
+; GFX11-NEXT:  .Lpost_getpc4:
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_add_u32 s2, s2, (.LBB4_1-.Lpost_getpc4)&4294967295
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_addc_u32 s3, s3, (.LBB4_1-.Lpost_getpc4)>>32
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_setpc_b64 s[2:3]
+; GFX11-NEXT:  .LBB4_2: ; %bb3
+; GFX11-NEXT:    s_endpgm
+;
+; GFX12-LABEL: long_backward_sbranch:
+; GFX12:       ; %bb.0: ; %bb
+; GFX12-NEXT:    s_mov_b32 s0, 0
+; GFX12-NEXT:  .LBB4_1: ; %bb2
+; GFX12-NEXT:    ; =>This Inner Loop Header: Depth=1
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX12-NEXT:    s_add_co_i32 s0, s0, 1
+; GFX12-NEXT:    ;;#ASMSTART
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    ;;#ASMEND
+; GFX12-NEXT:    s_cmp_lt_i32 s0, 10
+; GFX12-NEXT:    s_cbranch_scc0 .LBB4_2
+; GFX12-NEXT:  ; %bb.3: ; %bb2
+; GFX12-NEXT:    ; in Loop: Header=BB4_1 Depth=1
+; GFX12-NEXT:    s_getpc_b64 s[2:3]
+; GFX12-NEXT:  .Lpost_getpc4:
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_u32 s2, s2, (.LBB4_1-.Lpost_getpc4)&4294967295
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_ci_u32 s3, s3, (.LBB4_1-.Lpost_getpc4)>>32
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_setpc_b64 s[2:3]
+; GFX12-NEXT:  .LBB4_2: ; %bb3
+; GFX12-NEXT:    s_endpgm
 bb:
   br label %bb2
 
@@ -311,6 +649,125 @@ define amdgpu_kernel void @uniform_unconditional_min_long_forward_branch(ptr add
 ; GCN-NEXT:    s_add_u32 s0, s0, (.LBB5_3-.Lpost_getpc4)&4294967295
 ; GCN-NEXT:    s_addc_u32 s1, s1, (.LBB5_3-.Lpost_getpc4)>>32
 ; GCN-NEXT:    s_setpc_b64 s[0:1]
+;
+; GFX11-LABEL: uniform_unconditional_min_long_forward_branch:
+; GFX11:       ; %bb.0: ; %bb0
+; GFX11-NEXT:    s_load_b32 s0, s[2:3], 0x2c
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_cmp_eq_u32 s0, 0
+; GFX11-NEXT:    s_mov_b64 s[0:1], -1
+; GFX11-NEXT:    s_cbranch_scc0 .LBB5_1
+; GFX11-NEXT:  ; %bb.7: ; %bb0
+; GFX11-NEXT:    s_getpc_b64 s[0:1]
+; GFX11-NEXT:  .Lpost_getpc6:
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_add_u32 s0, s0, (.LBB5_4-.Lpost_getpc6)&4294967295
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_addc_u32 s1, s1, (.LBB5_4-.Lpost_getpc6)>>32
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_setpc_b64 s[0:1]
+; GFX11-NEXT:  .LBB5_1: ; %Flow
+; GFX11-NEXT:    s_and_not1_b64 vcc, exec, s[0:1]
+; GFX11-NEXT:    s_cbranch_vccnz .LBB5_3
+; GFX11-NEXT:  .LBB5_2: ; %bb2
+; GFX11-NEXT:    v_mov_b32_e32 v0, 17
+; GFX11-NEXT:    global_store_b32 v[0:1], v0, off dlc
+; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX11-NEXT:  .LBB5_3: ; %bb4
+; GFX11-NEXT:    s_load_b64 s[0:1], s[2:3], 0x24
+; GFX11-NEXT:    v_mov_b32_e32 v0, 0
+; GFX11-NEXT:    v_mov_b32_e32 v1, 63
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    global_store_b32 v0, v1, s[0:1] dlc
+; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX11-NEXT:    s_nop 0
+; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX11-NEXT:    s_endpgm
+; GFX11-NEXT:  .LBB5_4: ; %bb3
+; GFX11-NEXT:    ;;#ASMSTART
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    ;;#ASMEND
+; GFX11-NEXT:    s_cbranch_execnz .LBB5_5
+; GFX11-NEXT:  ; %bb.9: ; %bb3
+; GFX11-NEXT:    s_getpc_b64 s[0:1]
+; GFX11-NEXT:  .Lpost_getpc7:
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_add_u32 s0, s0, (.LBB5_2-.Lpost_getpc7)&4294967295
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_addc_u32 s1, s1, (.LBB5_2-.Lpost_getpc7)>>32
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_setpc_b64 s[0:1]
+; GFX11-NEXT:  .LBB5_5: ; %bb3
+; GFX11-NEXT:    s_getpc_b64 s[0:1]
+; GFX11-NEXT:  .Lpost_getpc5:
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_add_u32 s0, s0, (.LBB5_3-.Lpost_getpc5)&4294967295
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:   ...
[truncated]

@jayfoad (Contributor) left a comment

Seems reasonable to me

(outdated, resolved inline comment on llvm/lib/Target/AMDGPU/SIInstrInfo.cpp)
@arsenm (Contributor) left a comment

we probably could run branch relaxation before the post RA scheduler

(outdated, resolved inline comment on llvm/lib/Target/AMDGPU/SIInstrInfo.cpp)
@perlfu (Contributor, Author) commented Sep 18, 2024

we probably could run branch relaxation before the post RA scheduler

Presumably this would cause problems if new hazard-related instructions pushed a branch over the representable size?

@arsenm (Contributor) commented Sep 18, 2024

Presumably this would cause problems if new hazard-related instructions pushed a branch over the representable size?

Yes. In principle we could pad the size of every instruction with the worst-case number of instructions required for hazards in getInstSizeInBytes.
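As a rough sketch of the padding idea (all names here are hypothetical, not LLVM API, and this is the alternative under discussion rather than what the patch implements):

#include <cstdint>
#include <cstdio>

// Hypothetical only: report each instruction as large enough to also
// hold its worst-case hazard workarounds, so branch relaxation stays
// conservative even when workarounds are inserted afterwards.
uint64_t paddedInstSizeInBytes(uint64_t RealSizeInBytes,
                               bool MayNeedSGPRFlush,
                               bool MayNeedDelayAlu) {
  uint64_t Size = RealSizeInBytes;
  if (MayNeedSGPRFlush)
    Size += 4; // one s_waitcnt_depctr / s_wait_alu
  if (MayNeedDelayAlu)
    Size += 4; // one s_delay_alu, inserted before relaxation
  return Size;
}

int main() {
  // Worst case for a 4-byte SALU instruction: reported as 12 bytes,
  // i.e. the 3x reduction in effective branch range noted below.
  std::printf("%llu bytes\n",
              (unsigned long long)paddedInstSizeInBytes(4, true, true));
}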

@perlfu (Contributor, Author) commented Sep 19, 2024

In principle we could pad the size of every instruction with the worst-case number of instructions required for hazards in getInstSizeInBytes.

This doesn't really seem great. Maybe I am misunderstanding, but if we assume that every instruction could require zero or one workaround instruction, then that halves the effective direct branch distance? We must also account for s_delay_alu instructions, which are added before relaxation at present, so that's another 4 bytes? That reduces the effective branch distance by 3x.

I think the solution in this patch works fine for now.
If we run into this again, then I think the hazard mitigation infrastructure should be refactored to allow access to it outside the post-RA scheduler, so that insertIndirectBranch and similar could call into the hazard mitigation after inserting instructions.

@jayfoad (Contributor) left a comment

LGTM assuming all the workarounds are actually required.

; GFX11-NEXT: .Lpost_getpc0:
; GFX11-NEXT: s_waitcnt_depctr 0xfffe
; GFX11-NEXT: s_add_u32 s4, s4, (.LBB0_2-.Lpost_getpc0)&4294967295
; GFX11-NEXT: s_waitcnt_depctr 0xfffe
@jayfoad (Contributor):

Is the workaround really needed here, even though there were no VALU instructions since the previous workaround? (Just checking since I can't remember the exact rules.)

@arsenm (Contributor):

I don't know this hazard. The entire branch sequence is pure SALU, so is this about instructions after the branch? Can you consider the branch target?

@perlfu (Contributor, Author):

@jayfoad - The second s_waitcnt_depctr should be unnecessary on both GFX11 and GFX12. I added it out of a sense of caution, but will remove it in line with the normal workaround sequence. The third s_waitcnt_depctr is only needed on GFX12, but in my opinion it is simpler to have the same code for GFX11 and GFX12, as the performance impact in this case will be essentially zero.

@arsenm - This is about VALU accesses to SGPRs before these instructions. If a VALU has accessed the SGPRs used here, then each SALU write must be flushed before a subsequent SALU read of the same register. Technically the flush is not required if the SGPR was not used by a VALU, or if some expiry condition was reached, but adding a backward scan to establish this is excessive compared to the low performance cost of always mitigating this sequence.
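For intuition, a toy model of the flush rule described above (an assumption-laden sketch; real hardware tracks this per SGPR and has expiry conditions that this ignores):

#include <cstdio>

// Toy model: after a VALU has accessed an SGPR, an in-flight SALU write
// to that SGPR is not guaranteed visible to a following SALU read until
// a wait for sa_sdst == 0 flushes it.
struct SgprModel {
  int Committed = 0;   // value visible to SALU readers
  int Pending = 0;     // in-flight SALU write
  bool HasPending = false;

  void saluWrite(int V) { Pending = V; HasPending = true; }
  void waitSaSdst0() { // models s_waitcnt_depctr / s_wait_alu sa_sdst(0)
    if (HasPending) { Committed = Pending; HasPending = false; }
  }
  int saluRead() const { return Committed; } // may be stale without a wait
};

int main() {
  SgprModel S;
  S.saluWrite(42);
  std::printf("read without flush: %d\n", S.saluRead()); // stale 0
  S.waitSaSdst0();
  std::printf("read after flush:   %d\n", S.saluRead()); // 42
}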

@jayfoad (Contributor) commented Sep 19, 2024

In principle we could pad the size of every instruction with the worst-case number of instructions required for hazards in getInstSizeInBytes.

This doesn't really seem great. Maybe I am misunderstanding, but if we assume that every instruction could require zero or one workaround instruction, then that halves the effective direct branch distance? We must also account for s_delay_alu instructions, which are added before relaxation at present, so that's another 4 bytes? That reduces the effective branch distance by 3x.

I think the solution in this patch works fine for now. If we run into this again, then I think the hazard mitigation infrastructure should be refactored to allow access to it outside the post-RA scheduler, so that insertIndirectBranch and similar could call into the hazard mitigation after inserting instructions.

Agreed - I think it would be a mistake to move branch relaxation earlier.

BranchRelaxation runs after the hazard recognizer, so workarounds
for SGPR accesses need to be applied directly inline to the code
it generates.
@perlfu merged commit d147b6d into llvm:main on Sep 22, 2024
8 checks passed