[AMDGPU] Add hazard workarounds to insertIndirectBranch #109127
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Carl Ritson (perlfu)

Changes

BranchRelaxation runs after the hazard recognizer, so workarounds for SGPR accesses need to be applied directly inline to the code it generates.

Patch is 42.36 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/109127.diff

2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 30aa36be99c95f..b110b5fc19097a 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -2867,6 +2867,12 @@ void SIInstrInfo::insertIndirectBranch(MachineBasicBlock &MBB,
MachineRegisterInfo &MRI = MF->getRegInfo();
const SIMachineFunctionInfo *MFI = MF->getInfo<SIMachineFunctionInfo>();
+ // Note: as this is used after hazard recognizer we need to apply some hazard
+ // workarounds directly.
+ const GCNSubtarget &ST = MF->getSubtarget<GCNSubtarget>();
+ const bool FlushSGPRWrites = (ST.isWave64() && ST.hasVALUMaskWriteHazard()) ||
+ ST.hasVALUReadSGPRHazard();
+
// FIXME: Virtual register workaround for RegScavenger not working with empty
// blocks.
Register PCReg = MRI.createVirtualRegister(&AMDGPU::SReg_64RegClass);
@@ -2876,6 +2882,9 @@ void SIInstrInfo::insertIndirectBranch(MachineBasicBlock &MBB,
// We need to compute the offset relative to the instruction immediately after
// s_getpc_b64. Insert pc arithmetic code before last terminator.
MachineInstr *GetPC = BuildMI(MBB, I, DL, get(AMDGPU::S_GETPC_B64), PCReg);
+ if (FlushSGPRWrites)
+ BuildMI(MBB, I, DL, get(AMDGPU::S_WAITCNT_DEPCTR))
+ .addImm(AMDGPU::DepCtr::encodeFieldSaSdst(0));
auto &MCCtx = MF->getContext();
MCSymbol *PostGetPCLabel =
@@ -2890,10 +2899,16 @@ void SIInstrInfo::insertIndirectBranch(MachineBasicBlock &MBB,
.addReg(PCReg, RegState::Define, AMDGPU::sub0)
.addReg(PCReg, 0, AMDGPU::sub0)
.addSym(OffsetLo, MO_FAR_BRANCH_OFFSET);
+ if (FlushSGPRWrites)
+ BuildMI(MBB, I, DL, get(AMDGPU::S_WAITCNT_DEPCTR))
+ .addImm(AMDGPU::DepCtr::encodeFieldSaSdst(0));
BuildMI(MBB, I, DL, get(AMDGPU::S_ADDC_U32))
.addReg(PCReg, RegState::Define, AMDGPU::sub1)
.addReg(PCReg, 0, AMDGPU::sub1)
.addSym(OffsetHi, MO_FAR_BRANCH_OFFSET);
+ if (FlushSGPRWrites)
+ BuildMI(MBB, I, DL, get(AMDGPU::S_WAITCNT_DEPCTR))
+ .addImm(AMDGPU::DepCtr::encodeFieldSaSdst(0));
// Insert the indirect branch after the other terminator.
BuildMI(&MBB, DL, get(AMDGPU::S_SETPC_B64))
diff --git a/llvm/test/CodeGen/AMDGPU/branch-relaxation.ll b/llvm/test/CodeGen/AMDGPU/branch-relaxation.ll
index 635f3e4886b875..41023ac5ae4e48 100644
--- a/llvm/test/CodeGen/AMDGPU/branch-relaxation.ll
+++ b/llvm/test/CodeGen/AMDGPU/branch-relaxation.ll
@@ -1,5 +1,7 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=amdgcn -mcpu=tahiti -verify-machineinstrs -amdgpu-s-branch-bits=4 -simplifycfg-require-and-preserve-domtree=1 -amdgpu-long-branch-factor=0 < %s | FileCheck -enable-var-scope -check-prefix=GCN %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+wavefrontsize64 -verify-machineinstrs -amdgpu-s-branch-bits=5 -simplifycfg-require-and-preserve-domtree=1 -amdgpu-long-branch-factor=0 < %s | FileCheck -enable-var-scope -check-prefix=GFX11 %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -verify-machineinstrs -amdgpu-s-branch-bits=5 -simplifycfg-require-and-preserve-domtree=1 -amdgpu-long-branch-factor=0 < %s | FileCheck -enable-var-scope -check-prefix=GFX12 %s
; FIXME: We should use llvm-mc for this, but we can't even parse our own output.
@@ -42,6 +44,71 @@ define amdgpu_kernel void @uniform_conditional_max_short_forward_branch(ptr addr
; GCN-NEXT: buffer_store_dword v0, off, s[4:7], 0
; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: s_endpgm
+;
+; GFX11-LABEL: uniform_conditional_max_short_forward_branch:
+; GFX11: ; %bb.0: ; %bb
+; GFX11-NEXT: s_load_b32 s0, s[2:3], 0x2c
+; GFX11-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-NEXT: s_cmp_eq_u32 s0, 0
+; GFX11-NEXT: s_cbranch_scc0 .LBB0_1
+; GFX11-NEXT: ; %bb.3: ; %bb
+; GFX11-NEXT: s_getpc_b64 s[4:5]
+; GFX11-NEXT: .Lpost_getpc0:
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_add_u32 s4, s4, (.LBB0_2-.Lpost_getpc0)&4294967295
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_addc_u32 s5, s5, (.LBB0_2-.Lpost_getpc0)>>32
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_setpc_b64 s[4:5]
+; GFX11-NEXT: .LBB0_1: ; %bb2
+; GFX11-NEXT: ;;#ASMSTART
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: ;;#ASMEND
+; GFX11-NEXT: s_sleep 0
+; GFX11-NEXT: .LBB0_2: ; %bb3
+; GFX11-NEXT: s_load_b64 s[2:3], s[2:3], 0x24
+; GFX11-NEXT: v_mov_b32_e32 v0, 0
+; GFX11-NEXT: v_mov_b32_e32 v1, s0
+; GFX11-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-NEXT: global_store_b32 v0, v1, s[2:3] dlc
+; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
+; GFX11-NEXT: s_nop 0
+; GFX11-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX11-NEXT: s_endpgm
+;
+; GFX12-LABEL: uniform_conditional_max_short_forward_branch:
+; GFX12: ; %bb.0: ; %bb
+; GFX12-NEXT: s_load_b32 s0, s[2:3], 0x2c
+; GFX12-NEXT: s_wait_kmcnt 0x0
+; GFX12-NEXT: s_cmp_eq_u32 s0, 0
+; GFX12-NEXT: s_cbranch_scc0 .LBB0_1
+; GFX12-NEXT: ; %bb.3: ; %bb
+; GFX12-NEXT: s_getpc_b64 s[4:5]
+; GFX12-NEXT: .Lpost_getpc0:
+; GFX12-NEXT: s_wait_alu 0xfffe
+; GFX12-NEXT: s_add_co_u32 s4, s4, (.LBB0_2-.Lpost_getpc0)&4294967295
+; GFX12-NEXT: s_wait_alu 0xfffe
+; GFX12-NEXT: s_add_co_ci_u32 s5, s5, (.LBB0_2-.Lpost_getpc0)>>32
+; GFX12-NEXT: s_wait_alu 0xfffe
+; GFX12-NEXT: s_setpc_b64 s[4:5]
+; GFX12-NEXT: .LBB0_1: ; %bb2
+; GFX12-NEXT: ;;#ASMSTART
+; GFX12-NEXT: v_nop_e64
+; GFX12-NEXT: v_nop_e64
+; GFX12-NEXT: v_nop_e64
+; GFX12-NEXT: ;;#ASMEND
+; GFX12-NEXT: s_sleep 0
+; GFX12-NEXT: .LBB0_2: ; %bb3
+; GFX12-NEXT: s_load_b64 s[2:3], s[2:3], 0x24
+; GFX12-NEXT: v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s0
+; GFX12-NEXT: s_wait_kmcnt 0x0
+; GFX12-NEXT: global_store_b32 v0, v1, s[2:3] scope:SCOPE_SYS
+; GFX12-NEXT: s_wait_storecnt 0x0
+; GFX12-NEXT: s_nop 0
+; GFX12-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX12-NEXT: s_endpgm
bb:
%cmp = icmp eq i32 %cnd, 0
br i1 %cmp, label %bb3, label %bb2 ; +8 dword branch
@@ -89,6 +156,71 @@ define amdgpu_kernel void @uniform_conditional_min_long_forward_branch(ptr addrs
; GCN-NEXT: buffer_store_dword v0, off, s[4:7], 0
; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: s_endpgm
+;
+; GFX11-LABEL: uniform_conditional_min_long_forward_branch:
+; GFX11: ; %bb.0: ; %bb0
+; GFX11-NEXT: s_load_b32 s0, s[2:3], 0x2c
+; GFX11-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-NEXT: s_cmp_eq_u32 s0, 0
+; GFX11-NEXT: s_cbranch_scc0 .LBB1_1
+; GFX11-NEXT: ; %bb.3: ; %bb0
+; GFX11-NEXT: s_getpc_b64 s[4:5]
+; GFX11-NEXT: .Lpost_getpc1:
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_add_u32 s4, s4, (.LBB1_2-.Lpost_getpc1)&4294967295
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_addc_u32 s5, s5, (.LBB1_2-.Lpost_getpc1)>>32
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_setpc_b64 s[4:5]
+; GFX11-NEXT: .LBB1_1: ; %bb2
+; GFX11-NEXT: ;;#ASMSTART
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: ;;#ASMEND
+; GFX11-NEXT: .LBB1_2: ; %bb3
+; GFX11-NEXT: s_load_b64 s[2:3], s[2:3], 0x24
+; GFX11-NEXT: v_mov_b32_e32 v0, 0
+; GFX11-NEXT: v_mov_b32_e32 v1, s0
+; GFX11-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-NEXT: global_store_b32 v0, v1, s[2:3] dlc
+; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
+; GFX11-NEXT: s_nop 0
+; GFX11-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX11-NEXT: s_endpgm
+;
+; GFX12-LABEL: uniform_conditional_min_long_forward_branch:
+; GFX12: ; %bb.0: ; %bb0
+; GFX12-NEXT: s_load_b32 s0, s[2:3], 0x2c
+; GFX12-NEXT: s_wait_kmcnt 0x0
+; GFX12-NEXT: s_cmp_eq_u32 s0, 0
+; GFX12-NEXT: s_cbranch_scc0 .LBB1_1
+; GFX12-NEXT: ; %bb.3: ; %bb0
+; GFX12-NEXT: s_getpc_b64 s[4:5]
+; GFX12-NEXT: .Lpost_getpc1:
+; GFX12-NEXT: s_wait_alu 0xfffe
+; GFX12-NEXT: s_add_co_u32 s4, s4, (.LBB1_2-.Lpost_getpc1)&4294967295
+; GFX12-NEXT: s_wait_alu 0xfffe
+; GFX12-NEXT: s_add_co_ci_u32 s5, s5, (.LBB1_2-.Lpost_getpc1)>>32
+; GFX12-NEXT: s_wait_alu 0xfffe
+; GFX12-NEXT: s_setpc_b64 s[4:5]
+; GFX12-NEXT: .LBB1_1: ; %bb2
+; GFX12-NEXT: ;;#ASMSTART
+; GFX12-NEXT: v_nop_e64
+; GFX12-NEXT: v_nop_e64
+; GFX12-NEXT: v_nop_e64
+; GFX12-NEXT: v_nop_e64
+; GFX12-NEXT: ;;#ASMEND
+; GFX12-NEXT: .LBB1_2: ; %bb3
+; GFX12-NEXT: s_load_b64 s[2:3], s[2:3], 0x24
+; GFX12-NEXT: v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s0
+; GFX12-NEXT: s_wait_kmcnt 0x0
+; GFX12-NEXT: global_store_b32 v0, v1, s[2:3] scope:SCOPE_SYS
+; GFX12-NEXT: s_wait_storecnt 0x0
+; GFX12-NEXT: s_nop 0
+; GFX12-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX12-NEXT: s_endpgm
bb0:
%cmp = icmp eq i32 %cnd, 0
br i1 %cmp, label %bb3, label %bb2 ; +9 dword branch
@@ -138,6 +270,78 @@ define amdgpu_kernel void @uniform_conditional_min_long_forward_vcnd_branch(ptr
; GCN-NEXT: buffer_store_dword v0, off, s[4:7], 0
; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: s_endpgm
+;
+; GFX11-LABEL: uniform_conditional_min_long_forward_vcnd_branch:
+; GFX11: ; %bb.0: ; %bb0
+; GFX11-NEXT: s_load_b32 s0, s[2:3], 0x2c
+; GFX11-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-NEXT: v_cmp_eq_f32_e64 s[4:5], s0, 0
+; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT: s_and_b64 vcc, exec, s[4:5]
+; GFX11-NEXT: s_cbranch_vccz .LBB2_1
+; GFX11-NEXT: ; %bb.3: ; %bb0
+; GFX11-NEXT: s_getpc_b64 s[4:5]
+; GFX11-NEXT: .Lpost_getpc2:
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_add_u32 s4, s4, (.LBB2_2-.Lpost_getpc2)&4294967295
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_addc_u32 s5, s5, (.LBB2_2-.Lpost_getpc2)>>32
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_setpc_b64 s[4:5]
+; GFX11-NEXT: .LBB2_1: ; %bb2
+; GFX11-NEXT: ;;#ASMSTART
+; GFX11-NEXT: ; 32 bytes
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: ;;#ASMEND
+; GFX11-NEXT: .LBB2_2: ; %bb3
+; GFX11-NEXT: s_load_b64 s[2:3], s[2:3], 0x24
+; GFX11-NEXT: v_mov_b32_e32 v0, 0
+; GFX11-NEXT: v_mov_b32_e32 v1, s0
+; GFX11-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-NEXT: global_store_b32 v0, v1, s[2:3] dlc
+; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
+; GFX11-NEXT: s_nop 0
+; GFX11-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX11-NEXT: s_endpgm
+;
+; GFX12-LABEL: uniform_conditional_min_long_forward_vcnd_branch:
+; GFX12: ; %bb.0: ; %bb0
+; GFX12-NEXT: s_load_b32 s0, s[2:3], 0x2c
+; GFX12-NEXT: s_wait_kmcnt 0x0
+; GFX12-NEXT: s_cmp_eq_f32 s0, 0
+; GFX12-NEXT: s_cselect_b32 s1, -1, 0
+; GFX12-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; GFX12-NEXT: s_and_b32 vcc_lo, exec_lo, s1
+; GFX12-NEXT: s_cbranch_vccz .LBB2_1
+; GFX12-NEXT: ; %bb.3: ; %bb0
+; GFX12-NEXT: s_getpc_b64 s[4:5]
+; GFX12-NEXT: .Lpost_getpc2:
+; GFX12-NEXT: s_wait_alu 0xfffe
+; GFX12-NEXT: s_add_co_u32 s4, s4, (.LBB2_2-.Lpost_getpc2)&4294967295
+; GFX12-NEXT: s_wait_alu 0xfffe
+; GFX12-NEXT: s_add_co_ci_u32 s5, s5, (.LBB2_2-.Lpost_getpc2)>>32
+; GFX12-NEXT: s_wait_alu 0xfffe
+; GFX12-NEXT: s_setpc_b64 s[4:5]
+; GFX12-NEXT: .LBB2_1: ; %bb2
+; GFX12-NEXT: ;;#ASMSTART
+; GFX12-NEXT: ; 32 bytes
+; GFX12-NEXT: v_nop_e64
+; GFX12-NEXT: v_nop_e64
+; GFX12-NEXT: v_nop_e64
+; GFX12-NEXT: v_nop_e64
+; GFX12-NEXT: ;;#ASMEND
+; GFX12-NEXT: .LBB2_2: ; %bb3
+; GFX12-NEXT: s_load_b64 s[2:3], s[2:3], 0x24
+; GFX12-NEXT: v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s0
+; GFX12-NEXT: s_wait_kmcnt 0x0
+; GFX12-NEXT: global_store_b32 v0, v1, s[2:3] scope:SCOPE_SYS
+; GFX12-NEXT: s_wait_storecnt 0x0
+; GFX12-NEXT: s_nop 0
+; GFX12-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX12-NEXT: s_endpgm
bb0:
%cmp = fcmp oeq float %cnd, 0.0
br i1 %cmp, label %bb3, label %bb2 ; + 8 dword branch
@@ -193,6 +397,85 @@ define amdgpu_kernel void @min_long_forward_vbranch(ptr addrspace(1) %arg) #0 {
; GCN-NEXT: buffer_store_dword v2, v[0:1], s[0:3], 0 addr64
; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: s_endpgm
+;
+; GFX11-LABEL: min_long_forward_vbranch:
+; GFX11: ; %bb.0: ; %bb
+; GFX11-NEXT: s_load_b64 s[0:1], s[2:3], 0x24
+; GFX11-NEXT: v_and_b32_e32 v0, 0x3ff, v0
+; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(VALU_DEP_1)
+; GFX11-NEXT: v_lshlrev_b32_e32 v0, 2, v0
+; GFX11-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-NEXT: global_load_b32 v2, v0, s[0:1] glc dlc
+; GFX11-NEXT: s_waitcnt vmcnt(0)
+; GFX11-NEXT: v_add_co_u32 v0, s[2:3], s0, v0
+; GFX11-NEXT: v_add_co_ci_u32_e64 v1, null, s1, 0, s[2:3]
+; GFX11-NEXT: s_mov_b64 s[0:1], exec
+; GFX11-NEXT: v_cmpx_ne_u32_e32 0, v2
+; GFX11-NEXT: s_cbranch_execnz .LBB3_1
+; GFX11-NEXT: ; %bb.3: ; %bb
+; GFX11-NEXT: s_getpc_b64 s[2:3]
+; GFX11-NEXT: .Lpost_getpc3:
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_add_u32 s2, s2, (.LBB3_2-.Lpost_getpc3)&4294967295
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_addc_u32 s3, s3, (.LBB3_2-.Lpost_getpc3)>>32
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_setpc_b64 s[2:3]
+; GFX11-NEXT: .LBB3_1: ; %bb2
+; GFX11-NEXT: ;;#ASMSTART
+; GFX11-NEXT: ; 32 bytes
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: ;;#ASMEND
+; GFX11-NEXT: .LBB3_2: ; %bb3
+; GFX11-NEXT: s_or_b64 exec, exec, s[0:1]
+; GFX11-NEXT: global_store_b32 v[0:1], v2, off dlc
+; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
+; GFX11-NEXT: s_nop 0
+; GFX11-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX11-NEXT: s_endpgm
+;
+; GFX12-LABEL: min_long_forward_vbranch:
+; GFX12: ; %bb.0: ; %bb
+; GFX12-NEXT: s_load_b64 s[0:1], s[2:3], 0x24
+; GFX12-NEXT: v_and_b32_e32 v0, 0x3ff, v0
+; GFX12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(VALU_DEP_1)
+; GFX12-NEXT: v_lshlrev_b32_e32 v0, 2, v0
+; GFX12-NEXT: s_wait_kmcnt 0x0
+; GFX12-NEXT: global_load_b32 v2, v0, s[0:1] scope:SCOPE_SYS
+; GFX12-NEXT: s_wait_loadcnt 0x0
+; GFX12-NEXT: v_add_co_u32 v0, s0, s0, v0
+; GFX12-NEXT: v_add_co_ci_u32_e64 v1, null, s1, 0, s0
+; GFX12-NEXT: s_mov_b32 s0, exec_lo
+; GFX12-NEXT: v_cmpx_ne_u32_e32 0, v2
+; GFX12-NEXT: s_cbranch_execnz .LBB3_1
+; GFX12-NEXT: ; %bb.3: ; %bb
+; GFX12-NEXT: s_getpc_b64 s[2:3]
+; GFX12-NEXT: .Lpost_getpc3:
+; GFX12-NEXT: s_wait_alu 0xfffe
+; GFX12-NEXT: s_add_co_u32 s2, s2, (.LBB3_2-.Lpost_getpc3)&4294967295
+; GFX12-NEXT: s_wait_alu 0xfffe
+; GFX12-NEXT: s_add_co_ci_u32 s3, s3, (.LBB3_2-.Lpost_getpc3)>>32
+; GFX12-NEXT: s_wait_alu 0xfffe
+; GFX12-NEXT: s_setpc_b64 s[2:3]
+; GFX12-NEXT: .LBB3_1: ; %bb2
+; GFX12-NEXT: ;;#ASMSTART
+; GFX12-NEXT: ; 32 bytes
+; GFX12-NEXT: v_nop_e64
+; GFX12-NEXT: v_nop_e64
+; GFX12-NEXT: v_nop_e64
+; GFX12-NEXT: v_nop_e64
+; GFX12-NEXT: ;;#ASMEND
+; GFX12-NEXT: .LBB3_2: ; %bb3
+; GFX12-NEXT: s_wait_alu 0xfffe
+; GFX12-NEXT: s_or_b32 exec_lo, exec_lo, s0
+; GFX12-NEXT: global_store_b32 v[0:1], v2, off scope:SCOPE_SYS
+; GFX12-NEXT: s_wait_storecnt 0x0
+; GFX12-NEXT: s_nop 0
+; GFX12-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX12-NEXT: s_endpgm
bb:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%tid.ext = zext i32 %tid to i64
@@ -237,6 +520,61 @@ define amdgpu_kernel void @long_backward_sbranch(ptr addrspace(1) %arg) #0 {
; GCN-NEXT: s_setpc_b64 s[2:3]
; GCN-NEXT: .LBB4_2: ; %bb3
; GCN-NEXT: s_endpgm
+;
+; GFX11-LABEL: long_backward_sbranch:
+; GFX11: ; %bb.0: ; %bb
+; GFX11-NEXT: s_mov_b32 s0, 0
+; GFX11-NEXT: .p2align 6
+; GFX11-NEXT: .LBB4_1: ; %bb2
+; GFX11-NEXT: ; =>This Inner Loop Header: Depth=1
+; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT: s_add_i32 s0, s0, 1
+; GFX11-NEXT: ;;#ASMSTART
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: ;;#ASMEND
+; GFX11-NEXT: s_cmp_lt_i32 s0, 10
+; GFX11-NEXT: s_cbranch_scc0 .LBB4_2
+; GFX11-NEXT: ; %bb.3: ; %bb2
+; GFX11-NEXT: ; in Loop: Header=BB4_1 Depth=1
+; GFX11-NEXT: s_getpc_b64 s[2:3]
+; GFX11-NEXT: .Lpost_getpc4:
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_add_u32 s2, s2, (.LBB4_1-.Lpost_getpc4)&4294967295
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_addc_u32 s3, s3, (.LBB4_1-.Lpost_getpc4)>>32
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_setpc_b64 s[2:3]
+; GFX11-NEXT: .LBB4_2: ; %bb3
+; GFX11-NEXT: s_endpgm
+;
+; GFX12-LABEL: long_backward_sbranch:
+; GFX12: ; %bb.0: ; %bb
+; GFX12-NEXT: s_mov_b32 s0, 0
+; GFX12-NEXT: .LBB4_1: ; %bb2
+; GFX12-NEXT: ; =>This Inner Loop Header: Depth=1
+; GFX12-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; GFX12-NEXT: s_add_co_i32 s0, s0, 1
+; GFX12-NEXT: ;;#ASMSTART
+; GFX12-NEXT: v_nop_e64
+; GFX12-NEXT: v_nop_e64
+; GFX12-NEXT: v_nop_e64
+; GFX12-NEXT: ;;#ASMEND
+; GFX12-NEXT: s_cmp_lt_i32 s0, 10
+; GFX12-NEXT: s_cbranch_scc0 .LBB4_2
+; GFX12-NEXT: ; %bb.3: ; %bb2
+; GFX12-NEXT: ; in Loop: Header=BB4_1 Depth=1
+; GFX12-NEXT: s_getpc_b64 s[2:3]
+; GFX12-NEXT: .Lpost_getpc4:
+; GFX12-NEXT: s_wait_alu 0xfffe
+; GFX12-NEXT: s_add_co_u32 s2, s2, (.LBB4_1-.Lpost_getpc4)&4294967295
+; GFX12-NEXT: s_wait_alu 0xfffe
+; GFX12-NEXT: s_add_co_ci_u32 s3, s3, (.LBB4_1-.Lpost_getpc4)>>32
+; GFX12-NEXT: s_wait_alu 0xfffe
+; GFX12-NEXT: s_setpc_b64 s[2:3]
+; GFX12-NEXT: .LBB4_2: ; %bb3
+; GFX12-NEXT: s_endpgm
bb:
br label %bb2
@@ -311,6 +649,125 @@ define amdgpu_kernel void @uniform_unconditional_min_long_forward_branch(ptr add
; GCN-NEXT: s_add_u32 s0, s0, (.LBB5_3-.Lpost_getpc4)&4294967295
; GCN-NEXT: s_addc_u32 s1, s1, (.LBB5_3-.Lpost_getpc4)>>32
; GCN-NEXT: s_setpc_b64 s[0:1]
+;
+; GFX11-LABEL: uniform_unconditional_min_long_forward_branch:
+; GFX11: ; %bb.0: ; %bb0
+; GFX11-NEXT: s_load_b32 s0, s[2:3], 0x2c
+; GFX11-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-NEXT: s_cmp_eq_u32 s0, 0
+; GFX11-NEXT: s_mov_b64 s[0:1], -1
+; GFX11-NEXT: s_cbranch_scc0 .LBB5_1
+; GFX11-NEXT: ; %bb.7: ; %bb0
+; GFX11-NEXT: s_getpc_b64 s[0:1]
+; GFX11-NEXT: .Lpost_getpc6:
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_add_u32 s0, s0, (.LBB5_4-.Lpost_getpc6)&4294967295
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_addc_u32 s1, s1, (.LBB5_4-.Lpost_getpc6)>>32
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_setpc_b64 s[0:1]
+; GFX11-NEXT: .LBB5_1: ; %Flow
+; GFX11-NEXT: s_and_not1_b64 vcc, exec, s[0:1]
+; GFX11-NEXT: s_cbranch_vccnz .LBB5_3
+; GFX11-NEXT: .LBB5_2: ; %bb2
+; GFX11-NEXT: v_mov_b32_e32 v0, 17
+; GFX11-NEXT: global_store_b32 v[0:1], v0, off dlc
+; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
+; GFX11-NEXT: .LBB5_3: ; %bb4
+; GFX11-NEXT: s_load_b64 s[0:1], s[2:3], 0x24
+; GFX11-NEXT: v_mov_b32_e32 v0, 0
+; GFX11-NEXT: v_mov_b32_e32 v1, 63
+; GFX11-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-NEXT: global_store_b32 v0, v1, s[0:1] dlc
+; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
+; GFX11-NEXT: s_nop 0
+; GFX11-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX11-NEXT: s_endpgm
+; GFX11-NEXT: .LBB5_4: ; %bb3
+; GFX11-NEXT: ;;#ASMSTART
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: v_nop_e64
+; GFX11-NEXT: ;;#ASMEND
+; GFX11-NEXT: s_cbranch_execnz .LBB5_5
+; GFX11-NEXT: ; %bb.9: ; %bb3
+; GFX11-NEXT: s_getpc_b64 s[0:1]
+; GFX11-NEXT: .Lpost_getpc7:
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_add_u32 s0, s0, (.LBB5_2-.Lpost_getpc7)&4294967295
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_addc_u32 s1, s1, (.LBB5_2-.Lpost_getpc7)>>32
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_setpc_b64 s[0:1]
+; GFX11-NEXT: .LBB5_5: ; %bb3
+; GFX11-NEXT: s_getpc_b64 s[0:1]
+; GFX11-NEXT: .Lpost_getpc5:
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: s_add_u32 s0, s0, (.LBB5_3-.Lpost_getpc5)&4294967295
+; GFX11-NEXT: s_waitcnt_depctr 0xfffe
+; GFX11-NEXT: ...
[truncated]
Seems reasonable to me
We probably could run branch relaxation before the post-RA scheduler.
Presumably this would cause problems if new hazard-related instructions pushed a branch over the representable size?

Yes. In principle we could pad the size of every instruction with the worst-case number of instructions required for hazards in getInstSizeInBytes.

This doesn't really seem great. I think the solution in this patch works fine for now.
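The rejected padding alternative can be sketched in isolation. This is not the real SIInstrInfo::getInstSizeInBytes API — the function signature, the one-flush-per-instruction bound, and the size constants below are illustrative assumptions only; it just shows how a conservative size report would make BranchRelaxation over-estimate branch distances:

```cpp
#include <cstdint>

// Hypothetical model of padding instruction sizes for hazard workarounds.
// s_waitcnt_depctr is a single 4-byte instruction; assume (for illustration)
// at most one such flush would ever be inserted after a given instruction.
constexpr unsigned WaitcntDepctrSize = 4;
constexpr unsigned MaxHazardFixups = 1;

// Report an instruction's size as its real encoding size plus the worst-case
// bytes of workaround instructions that might later be inserted after it.
unsigned getInstSizeInBytes(unsigned RealSize, bool MayNeedSGPRFlush) {
  unsigned Padding = MayNeedSGPRFlush ? MaxHazardFixups * WaitcntDepctrSize : 0;
  return RealSize + Padding;
}
```

The cost of this approach is visible even in the sketch: every hazard-capable instruction is over-counted whether or not a flush is ever emitted, which is why keeping the fix local to insertIndirectBranch is preferable.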
LGTM assuming all the workarounds are actually required.
; GFX11-NEXT: .Lpost_getpc0:
; GFX11-NEXT: s_waitcnt_depctr 0xfffe
; GFX11-NEXT: s_add_u32 s4, s4, (.LBB0_2-.Lpost_getpc0)&4294967295
; GFX11-NEXT: s_waitcnt_depctr 0xfffe
Is the workaround really needed here, even though there were no VALU since the previous workaround? (Just checking since I can't remember the exact rules.)
I don't know this hazard. The entire branch sequence is pure SALU, so is this about instructions after the branch? Can you consider the branch target?
@jayfoad - The second s_waitcnt_depctr should be unnecessary on both GFX11 and GFX12. I added it out of a sense of caution, but will remove it in line with the normal workaround sequence.
The third s_waitcnt_depctr is only needed on GFX12, but in my opinion it is simpler to have the same code for GFX11 and GFX12, as the performance impact in this case will be essentially zero.
@arsenm - This is about VALU access to SGPRs before these instructions.
If a VALU has accessed the SGPRs used here, then each SALU write must be flushed before a subsequent SALU read of the same register.
Technically the flush is not required if the SGPR was not used by a VALU, or if some expiry condition was reached, but adding a backward scan to establish this is excessive compared to the low performance cost of always mitigating this sequence.
Agreed - I think it would be a mistake to move branch relaxation earlier.
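As background for the flush discussion: the 0xfffe immediate that appears in every s_waitcnt_depctr / s_wait_alu check above can be modeled with a small sketch. This is not the real AMDGPU::DepCtr::encodeFieldSaSdst implementation; it assumes the SA_SDST field occupies bit 0 of the immediate and that all-ones is the "no wait" default for every field, which is consistent with the generated 0xfffe:

```cpp
#include <cstdint>

// Assumed DepCtr immediate layout: SA_SDST in bit 0, all other fields at
// their "no wait" (all-ones) encoding.
constexpr uint16_t DepCtrNoWait = 0xffff; // wait on nothing
constexpr uint16_t SaSdstMask = 0x0001;   // assumed SA_SDST bit position

// Writing 0 into the SA_SDST field asks the hardware to wait until all
// outstanding SALU writes to SGPRs have completed, which is the flush the
// workaround inserts after s_getpc_b64 / s_add_u32 / s_addc_u32.
constexpr uint16_t encodeFieldSaSdst(uint16_t Val) {
  return static_cast<uint16_t>((DepCtrNoWait & ~SaSdstMask) |
                               (Val & SaSdstMask));
}
```

Under these assumptions, encodeFieldSaSdst(0) produces the 0xfffe immediate checked throughout the test diff, while leaving every other dependency counter untouched.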
BranchRelaxation runs after the hazard recognizer, so workarounds for SGPR accesses need to be applied directly inline to the code it generates.