
[AMDGPU] Add hazard workarounds to insertIndirectBranch #109127

Merged 2 commits into llvm:main on Sep 22, 2024

Conversation

@perlfu (Contributor) commented Sep 18, 2024

BranchRelaxation runs after the hazard recognizer, so workarounds for SGPR accesses need to be applied directly inline to the code it generates.
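The inserted s_waitcnt_depctr (s_wait_alu in GFX12 syntax) waits on the sa_sdst dependency counter after s_getpc_b64 and after each SALU add in the long-branch expansion. As a minimal sketch of where the 0xfffe immediate in the test checks comes from (this models, rather than calls, AMDGPU::DepCtr::encodeFieldSaSdst, and assumes sa_sdst is bit 0 of the 16-bit DEPCTR immediate):

#include <cstdint>
#include <cstdio>

// Sketch only: the s_waitcnt_depctr immediate packs several dependency
// counters into 16 bits. Requiring sa_sdst == 0 clears its bit while
// every other field keeps its "don't wait" value, which is why the
// tests below check for 0xfffe.
static uint16_t encodeSaSdst(uint16_t Wait) {
  const uint16_t NoWait = 0xffff; // wait on nothing
  return static_cast<uint16_t>((NoWait & ~uint16_t(1)) | (Wait & 1u));
}

int main() {
  std::printf("s_waitcnt_depctr 0x%04x\n", encodeSaSdst(0)); // 0xfffe
}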

@llvmbot (Collaborator) commented Sep 18, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Carl Ritson (perlfu)

Changes

BranchRelaxation runs after the hazard recognizer, so workarounds for SGPR accesses need to be applied directly inline to the code it generates.


Patch is 42.36 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/109127.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+15)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation.ll (+937)
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 30aa36be99c95f..b110b5fc19097a 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -2867,6 +2867,12 @@ void SIInstrInfo::insertIndirectBranch(MachineBasicBlock &MBB,
   MachineRegisterInfo &MRI = MF->getRegInfo();
   const SIMachineFunctionInfo *MFI = MF->getInfo<SIMachineFunctionInfo>();
 
+  // Note: as this is used after hazard recognizer we need to apply some hazard
+  // workarounds directly.
+  const GCNSubtarget &ST = MF->getSubtarget<GCNSubtarget>();
+  const bool FlushSGPRWrites = (ST.isWave64() && ST.hasVALUMaskWriteHazard()) ||
+                               ST.hasVALUReadSGPRHazard();
+
   // FIXME: Virtual register workaround for RegScavenger not working with empty
   // blocks.
   Register PCReg = MRI.createVirtualRegister(&AMDGPU::SReg_64RegClass);
@@ -2876,6 +2882,9 @@ void SIInstrInfo::insertIndirectBranch(MachineBasicBlock &MBB,
   // We need to compute the offset relative to the instruction immediately after
   // s_getpc_b64. Insert pc arithmetic code before last terminator.
   MachineInstr *GetPC = BuildMI(MBB, I, DL, get(AMDGPU::S_GETPC_B64), PCReg);
+  if (FlushSGPRWrites)
+    BuildMI(MBB, I, DL, get(AMDGPU::S_WAITCNT_DEPCTR))
+        .addImm(AMDGPU::DepCtr::encodeFieldSaSdst(0));
 
   auto &MCCtx = MF->getContext();
   MCSymbol *PostGetPCLabel =
@@ -2890,10 +2899,16 @@ void SIInstrInfo::insertIndirectBranch(MachineBasicBlock &MBB,
       .addReg(PCReg, RegState::Define, AMDGPU::sub0)
       .addReg(PCReg, 0, AMDGPU::sub0)
       .addSym(OffsetLo, MO_FAR_BRANCH_OFFSET);
+  if (FlushSGPRWrites)
+    BuildMI(MBB, I, DL, get(AMDGPU::S_WAITCNT_DEPCTR))
+        .addImm(AMDGPU::DepCtr::encodeFieldSaSdst(0));
   BuildMI(MBB, I, DL, get(AMDGPU::S_ADDC_U32))
       .addReg(PCReg, RegState::Define, AMDGPU::sub1)
       .addReg(PCReg, 0, AMDGPU::sub1)
       .addSym(OffsetHi, MO_FAR_BRANCH_OFFSET);
+  if (FlushSGPRWrites)
+    BuildMI(MBB, I, DL, get(AMDGPU::S_WAITCNT_DEPCTR))
+        .addImm(AMDGPU::DepCtr::encodeFieldSaSdst(0));
 
   // Insert the indirect branch after the other terminator.
   BuildMI(&MBB, DL, get(AMDGPU::S_SETPC_B64))
diff --git a/llvm/test/CodeGen/AMDGPU/branch-relaxation.ll b/llvm/test/CodeGen/AMDGPU/branch-relaxation.ll
index 635f3e4886b875..41023ac5ae4e48 100644
--- a/llvm/test/CodeGen/AMDGPU/branch-relaxation.ll
+++ b/llvm/test/CodeGen/AMDGPU/branch-relaxation.ll
@@ -1,5 +1,7 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -mtriple=amdgcn -mcpu=tahiti -verify-machineinstrs -amdgpu-s-branch-bits=4 -simplifycfg-require-and-preserve-domtree=1 -amdgpu-long-branch-factor=0 < %s | FileCheck -enable-var-scope -check-prefix=GCN %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+wavefrontsize64 -verify-machineinstrs -amdgpu-s-branch-bits=5 -simplifycfg-require-and-preserve-domtree=1 -amdgpu-long-branch-factor=0 < %s | FileCheck -enable-var-scope -check-prefix=GFX11 %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -verify-machineinstrs -amdgpu-s-branch-bits=5 -simplifycfg-require-and-preserve-domtree=1 -amdgpu-long-branch-factor=0 < %s | FileCheck -enable-var-scope -check-prefix=GFX12 %s
 
 
 ; FIXME: We should use llvm-mc for this, but we can't even parse our own output.
@@ -42,6 +44,71 @@ define amdgpu_kernel void @uniform_conditional_max_short_forward_branch(ptr addr
 ; GCN-NEXT:    buffer_store_dword v0, off, s[4:7], 0
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    s_endpgm
+;
+; GFX11-LABEL: uniform_conditional_max_short_forward_branch:
+; GFX11:       ; %bb.0: ; %bb
+; GFX11-NEXT:    s_load_b32 s0, s[2:3], 0x2c
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_cmp_eq_u32 s0, 0
+; GFX11-NEXT:    s_cbranch_scc0 .LBB0_1
+; GFX11-NEXT:  ; %bb.3: ; %bb
+; GFX11-NEXT:    s_getpc_b64 s[4:5]
+; GFX11-NEXT:  .Lpost_getpc0:
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_add_u32 s4, s4, (.LBB0_2-.Lpost_getpc0)&4294967295
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_addc_u32 s5, s5, (.LBB0_2-.Lpost_getpc0)>>32
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_setpc_b64 s[4:5]
+; GFX11-NEXT:  .LBB0_1: ; %bb2
+; GFX11-NEXT:    ;;#ASMSTART
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    ;;#ASMEND
+; GFX11-NEXT:    s_sleep 0
+; GFX11-NEXT:  .LBB0_2: ; %bb3
+; GFX11-NEXT:    s_load_b64 s[2:3], s[2:3], 0x24
+; GFX11-NEXT:    v_mov_b32_e32 v0, 0
+; GFX11-NEXT:    v_mov_b32_e32 v1, s0
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    global_store_b32 v0, v1, s[2:3] dlc
+; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX11-NEXT:    s_nop 0
+; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX11-NEXT:    s_endpgm
+;
+; GFX12-LABEL: uniform_conditional_max_short_forward_branch:
+; GFX12:       ; %bb.0: ; %bb
+; GFX12-NEXT:    s_load_b32 s0, s[2:3], 0x2c
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    s_cmp_eq_u32 s0, 0
+; GFX12-NEXT:    s_cbranch_scc0 .LBB0_1
+; GFX12-NEXT:  ; %bb.3: ; %bb
+; GFX12-NEXT:    s_getpc_b64 s[4:5]
+; GFX12-NEXT:  .Lpost_getpc0:
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_u32 s4, s4, (.LBB0_2-.Lpost_getpc0)&4294967295
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_ci_u32 s5, s5, (.LBB0_2-.Lpost_getpc0)>>32
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_setpc_b64 s[4:5]
+; GFX12-NEXT:  .LBB0_1: ; %bb2
+; GFX12-NEXT:    ;;#ASMSTART
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    ;;#ASMEND
+; GFX12-NEXT:    s_sleep 0
+; GFX12-NEXT:  .LBB0_2: ; %bb3
+; GFX12-NEXT:    s_load_b64 s[2:3], s[2:3], 0x24
+; GFX12-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s0
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    global_store_b32 v0, v1, s[2:3] scope:SCOPE_SYS
+; GFX12-NEXT:    s_wait_storecnt 0x0
+; GFX12-NEXT:    s_nop 0
+; GFX12-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX12-NEXT:    s_endpgm
 bb:
   %cmp = icmp eq i32 %cnd, 0
   br i1 %cmp, label %bb3, label %bb2 ; +8 dword branch
@@ -89,6 +156,71 @@ define amdgpu_kernel void @uniform_conditional_min_long_forward_branch(ptr addrs
 ; GCN-NEXT:    buffer_store_dword v0, off, s[4:7], 0
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    s_endpgm
+;
+; GFX11-LABEL: uniform_conditional_min_long_forward_branch:
+; GFX11:       ; %bb.0: ; %bb0
+; GFX11-NEXT:    s_load_b32 s0, s[2:3], 0x2c
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_cmp_eq_u32 s0, 0
+; GFX11-NEXT:    s_cbranch_scc0 .LBB1_1
+; GFX11-NEXT:  ; %bb.3: ; %bb0
+; GFX11-NEXT:    s_getpc_b64 s[4:5]
+; GFX11-NEXT:  .Lpost_getpc1:
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_add_u32 s4, s4, (.LBB1_2-.Lpost_getpc1)&4294967295
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_addc_u32 s5, s5, (.LBB1_2-.Lpost_getpc1)>>32
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_setpc_b64 s[4:5]
+; GFX11-NEXT:  .LBB1_1: ; %bb2
+; GFX11-NEXT:    ;;#ASMSTART
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    ;;#ASMEND
+; GFX11-NEXT:  .LBB1_2: ; %bb3
+; GFX11-NEXT:    s_load_b64 s[2:3], s[2:3], 0x24
+; GFX11-NEXT:    v_mov_b32_e32 v0, 0
+; GFX11-NEXT:    v_mov_b32_e32 v1, s0
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    global_store_b32 v0, v1, s[2:3] dlc
+; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX11-NEXT:    s_nop 0
+; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX11-NEXT:    s_endpgm
+;
+; GFX12-LABEL: uniform_conditional_min_long_forward_branch:
+; GFX12:       ; %bb.0: ; %bb0
+; GFX12-NEXT:    s_load_b32 s0, s[2:3], 0x2c
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    s_cmp_eq_u32 s0, 0
+; GFX12-NEXT:    s_cbranch_scc0 .LBB1_1
+; GFX12-NEXT:  ; %bb.3: ; %bb0
+; GFX12-NEXT:    s_getpc_b64 s[4:5]
+; GFX12-NEXT:  .Lpost_getpc1:
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_u32 s4, s4, (.LBB1_2-.Lpost_getpc1)&4294967295
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_ci_u32 s5, s5, (.LBB1_2-.Lpost_getpc1)>>32
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_setpc_b64 s[4:5]
+; GFX12-NEXT:  .LBB1_1: ; %bb2
+; GFX12-NEXT:    ;;#ASMSTART
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    ;;#ASMEND
+; GFX12-NEXT:  .LBB1_2: ; %bb3
+; GFX12-NEXT:    s_load_b64 s[2:3], s[2:3], 0x24
+; GFX12-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s0
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    global_store_b32 v0, v1, s[2:3] scope:SCOPE_SYS
+; GFX12-NEXT:    s_wait_storecnt 0x0
+; GFX12-NEXT:    s_nop 0
+; GFX12-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX12-NEXT:    s_endpgm
 bb0:
   %cmp = icmp eq i32 %cnd, 0
   br i1 %cmp, label %bb3, label %bb2 ; +9 dword branch
@@ -138,6 +270,78 @@ define amdgpu_kernel void @uniform_conditional_min_long_forward_vcnd_branch(ptr
 ; GCN-NEXT:    buffer_store_dword v0, off, s[4:7], 0
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    s_endpgm
+;
+; GFX11-LABEL: uniform_conditional_min_long_forward_vcnd_branch:
+; GFX11:       ; %bb.0: ; %bb0
+; GFX11-NEXT:    s_load_b32 s0, s[2:3], 0x2c
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    v_cmp_eq_f32_e64 s[4:5], s0, 0
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT:    s_and_b64 vcc, exec, s[4:5]
+; GFX11-NEXT:    s_cbranch_vccz .LBB2_1
+; GFX11-NEXT:  ; %bb.3: ; %bb0
+; GFX11-NEXT:    s_getpc_b64 s[4:5]
+; GFX11-NEXT:  .Lpost_getpc2:
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_add_u32 s4, s4, (.LBB2_2-.Lpost_getpc2)&4294967295
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_addc_u32 s5, s5, (.LBB2_2-.Lpost_getpc2)>>32
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_setpc_b64 s[4:5]
+; GFX11-NEXT:  .LBB2_1: ; %bb2
+; GFX11-NEXT:    ;;#ASMSTART
+; GFX11-NEXT:     ; 32 bytes
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    ;;#ASMEND
+; GFX11-NEXT:  .LBB2_2: ; %bb3
+; GFX11-NEXT:    s_load_b64 s[2:3], s[2:3], 0x24
+; GFX11-NEXT:    v_mov_b32_e32 v0, 0
+; GFX11-NEXT:    v_mov_b32_e32 v1, s0
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    global_store_b32 v0, v1, s[2:3] dlc
+; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX11-NEXT:    s_nop 0
+; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX11-NEXT:    s_endpgm
+;
+; GFX12-LABEL: uniform_conditional_min_long_forward_vcnd_branch:
+; GFX12:       ; %bb.0: ; %bb0
+; GFX12-NEXT:    s_load_b32 s0, s[2:3], 0x2c
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    s_cmp_eq_f32 s0, 0
+; GFX12-NEXT:    s_cselect_b32 s1, -1, 0
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX12-NEXT:    s_and_b32 vcc_lo, exec_lo, s1
+; GFX12-NEXT:    s_cbranch_vccz .LBB2_1
+; GFX12-NEXT:  ; %bb.3: ; %bb0
+; GFX12-NEXT:    s_getpc_b64 s[4:5]
+; GFX12-NEXT:  .Lpost_getpc2:
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_u32 s4, s4, (.LBB2_2-.Lpost_getpc2)&4294967295
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_ci_u32 s5, s5, (.LBB2_2-.Lpost_getpc2)>>32
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_setpc_b64 s[4:5]
+; GFX12-NEXT:  .LBB2_1: ; %bb2
+; GFX12-NEXT:    ;;#ASMSTART
+; GFX12-NEXT:     ; 32 bytes
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    ;;#ASMEND
+; GFX12-NEXT:  .LBB2_2: ; %bb3
+; GFX12-NEXT:    s_load_b64 s[2:3], s[2:3], 0x24
+; GFX12-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s0
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    global_store_b32 v0, v1, s[2:3] scope:SCOPE_SYS
+; GFX12-NEXT:    s_wait_storecnt 0x0
+; GFX12-NEXT:    s_nop 0
+; GFX12-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX12-NEXT:    s_endpgm
 bb0:
   %cmp = fcmp oeq float %cnd, 0.0
   br i1 %cmp, label %bb3, label %bb2 ; + 8 dword branch
@@ -193,6 +397,85 @@ define amdgpu_kernel void @min_long_forward_vbranch(ptr addrspace(1) %arg) #0 {
 ; GCN-NEXT:    buffer_store_dword v2, v[0:1], s[0:3], 0 addr64
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    s_endpgm
+;
+; GFX11-LABEL: min_long_forward_vbranch:
+; GFX11:       ; %bb.0: ; %bb
+; GFX11-NEXT:    s_load_b64 s[0:1], s[2:3], 0x24
+; GFX11-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(VALU_DEP_1)
+; GFX11-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    global_load_b32 v2, v0, s[0:1] glc dlc
+; GFX11-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-NEXT:    v_add_co_u32 v0, s[2:3], s0, v0
+; GFX11-NEXT:    v_add_co_ci_u32_e64 v1, null, s1, 0, s[2:3]
+; GFX11-NEXT:    s_mov_b64 s[0:1], exec
+; GFX11-NEXT:    v_cmpx_ne_u32_e32 0, v2
+; GFX11-NEXT:    s_cbranch_execnz .LBB3_1
+; GFX11-NEXT:  ; %bb.3: ; %bb
+; GFX11-NEXT:    s_getpc_b64 s[2:3]
+; GFX11-NEXT:  .Lpost_getpc3:
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_add_u32 s2, s2, (.LBB3_2-.Lpost_getpc3)&4294967295
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_addc_u32 s3, s3, (.LBB3_2-.Lpost_getpc3)>>32
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_setpc_b64 s[2:3]
+; GFX11-NEXT:  .LBB3_1: ; %bb2
+; GFX11-NEXT:    ;;#ASMSTART
+; GFX11-NEXT:     ; 32 bytes
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    ;;#ASMEND
+; GFX11-NEXT:  .LBB3_2: ; %bb3
+; GFX11-NEXT:    s_or_b64 exec, exec, s[0:1]
+; GFX11-NEXT:    global_store_b32 v[0:1], v2, off dlc
+; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX11-NEXT:    s_nop 0
+; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX11-NEXT:    s_endpgm
+;
+; GFX12-LABEL: min_long_forward_vbranch:
+; GFX12:       ; %bb.0: ; %bb
+; GFX12-NEXT:    s_load_b64 s[0:1], s[2:3], 0x24
+; GFX12-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
+; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(VALU_DEP_1)
+; GFX12-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    global_load_b32 v2, v0, s[0:1] scope:SCOPE_SYS
+; GFX12-NEXT:    s_wait_loadcnt 0x0
+; GFX12-NEXT:    v_add_co_u32 v0, s0, s0, v0
+; GFX12-NEXT:    v_add_co_ci_u32_e64 v1, null, s1, 0, s0
+; GFX12-NEXT:    s_mov_b32 s0, exec_lo
+; GFX12-NEXT:    v_cmpx_ne_u32_e32 0, v2
+; GFX12-NEXT:    s_cbranch_execnz .LBB3_1
+; GFX12-NEXT:  ; %bb.3: ; %bb
+; GFX12-NEXT:    s_getpc_b64 s[2:3]
+; GFX12-NEXT:  .Lpost_getpc3:
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_u32 s2, s2, (.LBB3_2-.Lpost_getpc3)&4294967295
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_ci_u32 s3, s3, (.LBB3_2-.Lpost_getpc3)>>32
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_setpc_b64 s[2:3]
+; GFX12-NEXT:  .LBB3_1: ; %bb2
+; GFX12-NEXT:    ;;#ASMSTART
+; GFX12-NEXT:     ; 32 bytes
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    ;;#ASMEND
+; GFX12-NEXT:  .LBB3_2: ; %bb3
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_or_b32 exec_lo, exec_lo, s0
+; GFX12-NEXT:    global_store_b32 v[0:1], v2, off scope:SCOPE_SYS
+; GFX12-NEXT:    s_wait_storecnt 0x0
+; GFX12-NEXT:    s_nop 0
+; GFX12-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX12-NEXT:    s_endpgm
 bb:
   %tid = call i32 @llvm.amdgcn.workitem.id.x()
   %tid.ext = zext i32 %tid to i64
@@ -237,6 +520,61 @@ define amdgpu_kernel void @long_backward_sbranch(ptr addrspace(1) %arg) #0 {
 ; GCN-NEXT:    s_setpc_b64 s[2:3]
 ; GCN-NEXT:  .LBB4_2: ; %bb3
 ; GCN-NEXT:    s_endpgm
+;
+; GFX11-LABEL: long_backward_sbranch:
+; GFX11:       ; %bb.0: ; %bb
+; GFX11-NEXT:    s_mov_b32 s0, 0
+; GFX11-NEXT:    .p2align 6
+; GFX11-NEXT:  .LBB4_1: ; %bb2
+; GFX11-NEXT:    ; =>This Inner Loop Header: Depth=1
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT:    s_add_i32 s0, s0, 1
+; GFX11-NEXT:    ;;#ASMSTART
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    ;;#ASMEND
+; GFX11-NEXT:    s_cmp_lt_i32 s0, 10
+; GFX11-NEXT:    s_cbranch_scc0 .LBB4_2
+; GFX11-NEXT:  ; %bb.3: ; %bb2
+; GFX11-NEXT:    ; in Loop: Header=BB4_1 Depth=1
+; GFX11-NEXT:    s_getpc_b64 s[2:3]
+; GFX11-NEXT:  .Lpost_getpc4:
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_add_u32 s2, s2, (.LBB4_1-.Lpost_getpc4)&4294967295
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_addc_u32 s3, s3, (.LBB4_1-.Lpost_getpc4)>>32
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_setpc_b64 s[2:3]
+; GFX11-NEXT:  .LBB4_2: ; %bb3
+; GFX11-NEXT:    s_endpgm
+;
+; GFX12-LABEL: long_backward_sbranch:
+; GFX12:       ; %bb.0: ; %bb
+; GFX12-NEXT:    s_mov_b32 s0, 0
+; GFX12-NEXT:  .LBB4_1: ; %bb2
+; GFX12-NEXT:    ; =>This Inner Loop Header: Depth=1
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX12-NEXT:    s_add_co_i32 s0, s0, 1
+; GFX12-NEXT:    ;;#ASMSTART
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    v_nop_e64
+; GFX12-NEXT:    ;;#ASMEND
+; GFX12-NEXT:    s_cmp_lt_i32 s0, 10
+; GFX12-NEXT:    s_cbranch_scc0 .LBB4_2
+; GFX12-NEXT:  ; %bb.3: ; %bb2
+; GFX12-NEXT:    ; in Loop: Header=BB4_1 Depth=1
+; GFX12-NEXT:    s_getpc_b64 s[2:3]
+; GFX12-NEXT:  .Lpost_getpc4:
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_u32 s2, s2, (.LBB4_1-.Lpost_getpc4)&4294967295
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_add_co_ci_u32 s3, s3, (.LBB4_1-.Lpost_getpc4)>>32
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_setpc_b64 s[2:3]
+; GFX12-NEXT:  .LBB4_2: ; %bb3
+; GFX12-NEXT:    s_endpgm
 bb:
   br label %bb2
 
@@ -311,6 +649,125 @@ define amdgpu_kernel void @uniform_unconditional_min_long_forward_branch(ptr add
 ; GCN-NEXT:    s_add_u32 s0, s0, (.LBB5_3-.Lpost_getpc4)&4294967295
 ; GCN-NEXT:    s_addc_u32 s1, s1, (.LBB5_3-.Lpost_getpc4)>>32
 ; GCN-NEXT:    s_setpc_b64 s[0:1]
+;
+; GFX11-LABEL: uniform_unconditional_min_long_forward_branch:
+; GFX11:       ; %bb.0: ; %bb0
+; GFX11-NEXT:    s_load_b32 s0, s[2:3], 0x2c
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_cmp_eq_u32 s0, 0
+; GFX11-NEXT:    s_mov_b64 s[0:1], -1
+; GFX11-NEXT:    s_cbranch_scc0 .LBB5_1
+; GFX11-NEXT:  ; %bb.7: ; %bb0
+; GFX11-NEXT:    s_getpc_b64 s[0:1]
+; GFX11-NEXT:  .Lpost_getpc6:
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_add_u32 s0, s0, (.LBB5_4-.Lpost_getpc6)&4294967295
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_addc_u32 s1, s1, (.LBB5_4-.Lpost_getpc6)>>32
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_setpc_b64 s[0:1]
+; GFX11-NEXT:  .LBB5_1: ; %Flow
+; GFX11-NEXT:    s_and_not1_b64 vcc, exec, s[0:1]
+; GFX11-NEXT:    s_cbranch_vccnz .LBB5_3
+; GFX11-NEXT:  .LBB5_2: ; %bb2
+; GFX11-NEXT:    v_mov_b32_e32 v0, 17
+; GFX11-NEXT:    global_store_b32 v[0:1], v0, off dlc
+; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX11-NEXT:  .LBB5_3: ; %bb4
+; GFX11-NEXT:    s_load_b64 s[0:1], s[2:3], 0x24
+; GFX11-NEXT:    v_mov_b32_e32 v0, 0
+; GFX11-NEXT:    v_mov_b32_e32 v1, 63
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    global_store_b32 v0, v1, s[0:1] dlc
+; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX11-NEXT:    s_nop 0
+; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX11-NEXT:    s_endpgm
+; GFX11-NEXT:  .LBB5_4: ; %bb3
+; GFX11-NEXT:    ;;#ASMSTART
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    v_nop_e64
+; GFX11-NEXT:    ;;#ASMEND
+; GFX11-NEXT:    s_cbranch_execnz .LBB5_5
+; GFX11-NEXT:  ; %bb.9: ; %bb3
+; GFX11-NEXT:    s_getpc_b64 s[0:1]
+; GFX11-NEXT:  .Lpost_getpc7:
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_add_u32 s0, s0, (.LBB5_2-.Lpost_getpc7)&4294967295
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_addc_u32 s1, s1, (.LBB5_2-.Lpost_getpc7)>>32
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_setpc_b64 s[0:1]
+; GFX11-NEXT:  .LBB5_5: ; %bb3
+; GFX11-NEXT:    s_getpc_b64 s[0:1]
+; GFX11-NEXT:  .Lpost_getpc5:
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:    s_add_u32 s0, s0, (.LBB5_3-.Lpost_getpc5)&4294967295
+; GFX11-NEXT:    s_waitcnt_depctr 0xfffe
+; GFX11-NEXT:   ...
[truncated]

@jayfoad (Contributor) left a comment

Seems reasonable to me

(outdated, resolved inline comment on llvm/lib/Target/AMDGPU/SIInstrInfo.cpp)
@arsenm (Contributor) left a comment

we probably could run branch relaxation before the post RA scheduler

(outdated, resolved inline comment on llvm/lib/Target/AMDGPU/SIInstrInfo.cpp)
@perlfu (Contributor, Author) commented Sep 18, 2024

we probably could run branch relaxation before the post RA scheduler

Presumably this would cause problems if new hazard-related instructions pushed a branch over the representable size?

@arsenm (Contributor) commented Sep 18, 2024

Presumably this would cause problems if new hazard-related instructions pushed a branch over the representable size?

Yes. In principle we could pad the size of every instruction with the worst-case number of instructions required for hazards in getInstSizeInBytes.
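As a rough sketch of the padding idea (all names here are hypothetical, not LLVM API, and this is the alternative under discussion rather than what the patch implements):

#include <cstdint>
#include <cstdio>

// Hypothetical only: report each instruction as large enough to also
// hold its worst-case hazard workarounds, so branch relaxation stays
// conservative even when workarounds are inserted afterwards.
uint64_t paddedInstSizeInBytes(uint64_t RealSizeInBytes,
                               bool MayNeedSGPRFlush,
                               bool MayNeedDelayAlu) {
  uint64_t Size = RealSizeInBytes;
  if (MayNeedSGPRFlush)
    Size += 4; // one s_waitcnt_depctr / s_wait_alu
  if (MayNeedDelayAlu)
    Size += 4; // one s_delay_alu, inserted before relaxation
  return Size;
}

int main() {
  // Worst case for a 4-byte SALU instruction: reported as 12 bytes,
  // i.e. the 3x reduction in effective branch range noted below.
  std::printf("%llu bytes\n",
              (unsigned long long)paddedInstSizeInBytes(4, true, true));
}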

@perlfu (Contributor, Author) commented Sep 19, 2024

In principle we could pad the size of every instruction with the worst-case number of instructions required for hazards in getInstSizeInBytes.

This doesn't really seem great. Maybe I am misunderstanding, but if we assume that every instruction could require zero or one workaround instruction, then that halves the effective direct branch distance? We must also account for s_delay_alu instructions, which are added before relaxation at present, so that's another 4 bytes? That reduces the effective branch distance by 3x.

I think the solution in this patch works fine for now.
If we run into this again, then I think the hazard mitigation infrastructure should be refactored to allow access to it outside the post-RA scheduler, so that insertIndirectBranch and similar could call into the hazard mitigation after inserting instructions.

@jayfoad (Contributor) left a comment

LGTM assuming all the workarounds are actually required.

; GFX11-NEXT: .Lpost_getpc0:
; GFX11-NEXT: s_waitcnt_depctr 0xfffe
; GFX11-NEXT: s_add_u32 s4, s4, (.LBB0_2-.Lpost_getpc0)&4294967295
; GFX11-NEXT: s_waitcnt_depctr 0xfffe
@jayfoad (Contributor):

Is the workaround really needed here, even though there were no VALU instructions since the previous workaround? (Just checking since I can't remember the exact rules.)

@arsenm (Contributor):

I don't know this hazard. The entire branch sequence is pure SALU, so is this about instructions after the branch? Can you consider the branch target?

@perlfu (Contributor, Author):

@jayfoad - The second s_waitcnt_depctr should be unnecessary on both GFX11 and GFX12. I added it out of a sense of caution, but will remove it in line with the normal workaround sequence. The third s_waitcnt_depctr is only needed on GFX12, but in my opinion it is simpler to have the same code for GFX11 and GFX12, as the performance impact in this case will be essentially zero.

@arsenm - This is about VALU accesses to SGPRs before these instructions. If a VALU has accessed the SGPRs used here, then each SALU write must be flushed before a subsequent SALU read of the same register. Technically the flush is not required if the SGPR was not used by a VALU, or if some expiry condition was reached, but adding a backward scan to establish this is excessive compared to the low performance cost of always mitigating this sequence.
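For intuition, a toy model of the flush rule described above (an assumption-laden sketch; real hardware tracks this per SGPR and has expiry conditions that this ignores):

#include <cstdio>

// Toy model: after a VALU has accessed an SGPR, an in-flight SALU write
// to that SGPR is not guaranteed visible to a following SALU read until
// a wait for sa_sdst == 0 flushes it.
struct SgprModel {
  int Committed = 0;   // value visible to SALU readers
  int Pending = 0;     // in-flight SALU write
  bool HasPending = false;

  void saluWrite(int V) { Pending = V; HasPending = true; }
  void waitSaSdst0() { // models s_waitcnt_depctr / s_wait_alu sa_sdst(0)
    if (HasPending) { Committed = Pending; HasPending = false; }
  }
  int saluRead() const { return Committed; } // may be stale without a wait
};

int main() {
  SgprModel S;
  S.saluWrite(42);
  std::printf("read without flush: %d\n", S.saluRead()); // stale 0
  S.waitSaSdst0();
  std::printf("read after flush:   %d\n", S.saluRead()); // 42
}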

@jayfoad (Contributor) commented Sep 19, 2024

In principle we could pad the size of every instruction with the worst-case number of instructions required for hazards in getInstSizeInBytes.

This doesn't really seem great. Maybe I am misunderstanding, but if we assume that every instruction could require zero or one workaround instruction, then that halves the effective direct branch distance? We must also account for s_delay_alu instructions, which are added before relaxation at present, so that's another 4 bytes? That reduces the effective branch distance by 3x.

I think the solution in this patch works fine for now. If we run into this again, then I think the hazard mitigation infrastructure should be refactored to allow access to it outside the post-RA scheduler, so that insertIndirectBranch and similar could call into the hazard mitigation after inserting instructions.

Agreed - I think it would be a mistake to move branch relaxation earlier.

BranchRelaxation runs after the hazard recognizer, so workarounds
for SGPR accesses need to be applied directly inline to the code
it generates.
@perlfu merged commit d147b6d into llvm:main on Sep 22, 2024
8 checks passed