[VPlan] First step towards VPlan cost modeling. #92555

fhahn · 2024-05-17T14:22:18Z

This adds a new interface to compute the cost of recipes, VPBasicBlocks, VPRegionBlocks
and VPlan, initially falling back to the legacy cost model for all recipes. Follow-up patches
will gradually migrate recipes to compute their own costs step-by-step.

It also adds getBestPlan function to LVP which computes the cost of all
VPlans and picks the most profitable one together with the most
profitable VF.

The VPlan selected by the VPlan cost model is executed and there is an
assert to catch cases where the VPlan cost model and the legacy cost
model disagree. Even though I checked a number of different build
configurations on AArch64 and X86, there may be some differences
that have been missed.

Additional discussions and context can be found in @arcbbb's
#67647 and
#67934 which is an earlier version
of the current PR.

This adds a new computeCost interface to VPReicpeBase and implements it for VPWidenRecipe and VPWidenIntOrFpInductionRecipe. It also adds getBestPlan function to LVP which computes the cost of all VPlans and picks the most profitable one together with the most profitable VF. For recipes that do not yet implement computeCost, the legacy cost for the underlying instruction is used. The VPlan selected by the VPlan cost model is executed and there is an assert to catch cases where the VPlan cost model and the legacy cost model disagree.

github-actions · 2024-05-22T09:07:37Z

✅ With the latest revision this PR passed the C/C++ code formatter.

ayalz

This looks right to me, thanks! Adding various comments.
Worth noting in the commit message that the cost of VPWidenRecipe is computed directly, other recipes to follow.

ayalz · 2024-05-22T21:09:08Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  InstructionCost ScalarCost = computeCost(
+      getBestPlanFor(ElementCount::getFixed(1)), ElementCount::getFixed(1));


Suggested change

InstructionCost ScalarCost = computeCost(

getBestPlanFor(ElementCount::getFixed(1)), ElementCount::getFixed(1));

InstructionCost ScalarCost = computeCost(getBestPlanFor(ScalarVF), ScalarVF);

Done, thanks!

ayalz · 2024-05-23T06:44:41Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  /// Return the cost of instructions in an inloop reduction pattern, if I is
+  /// part of that pattern.


Suggested change

/// Return the cost of instructions in an inloop reduction pattern, if I is

/// part of that pattern.

/// Return the cost of instructions in an inloop reduction pattern, if \p I

/// is part of that pattern.

(unrelated to this patch).

Will adjust separately.

ayalz · 2024-05-23T06:45:34Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  /// Return the cost of instructions in an inloop reduction pattern, if I is
+  /// part of that pattern.
+  std::optional<InstructionCost>
+  getReductionPatternCost(Instruction *I, ElementCount VF, Type *VectorTy,


Better called getInLoopReductionPatternCost()?
(unrelated to this patch).

Will adjust separately.

Will adjust separately.

Very well. Another suggestion is to use Invalid cost for "no cost" instead of optional.

ayalz · 2024-05-23T06:49:54Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  /// Returns the execution time cost of an instruction for a given vector
+  /// width. Vector width of one means scalar.
+  VectorizationCostTy getInstructionCost(Instruction *I, ElementCount VF);
+


nit: hoist this to the public methods area above, along with getReductionCost(), rather than below to the public fields area here?

moved, thanks

ayalz · 2024-05-23T06:53:45Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  return CM.getInstructionCost(UI, VF).first;
+}
+
+bool VPCostContext::skipForCostComputation(Instruction *UI) const {


Suggested change

bool VPCostContext::skipForCostComputation(Instruction *UI) const {

bool VPCostContext::skipCostComputation(Instruction *UI) const {

?

Done, thanks!

ayalz · 2024-05-23T11:28:54Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

+        VPSingleDefRecipe *VPC;
+        if (auto *UV = R.getOperand(0)->getUnderlyingValue())
+          VPC = new VPWidenCastRecipe(Instruction::CastOps(ExtOpcode), A,
+                                      TruncTy, *cast<CastInst>(UV));


This uses an underlying (Trunc) UV whose result type size may differ from the size of the inferred TruncTy of (S/ZExt) VPC, for legacy cost computation, hence dropping the assert in the constructor?

Ahh, UV is the underlying Z/SExt rather than the underlying Trunc, so its type clearly differs from that ot VPC. A more informative name would help, e.g., UnderlyingExt.

Updated the name, thanks!

ayalz · 2024-05-23T11:37:01Z

llvm/lib/Transforms/Vectorize/VPlan.h

@@ -738,6 +760,10 @@ class VPRecipeBase : public ilist_node_with_parent<VPRecipeBase, VPBasicBlock>,
  /// this VPRecipe, thereby "executing" the VPlan.
  virtual void execute(VPTransformState &State) = 0;

+  /// Compute the cost for the recipe. Returns an invalid cost if the recipe
+  /// does not yet implement computing the cost.


Where/Is invalid returned?

Updated to reflect the fact that the default implementation now returns the cost via the legacy CM, thanks!

ayalz · 2024-05-23T11:40:24Z

llvm/lib/Transforms/Vectorize/VPlan.cpp

@@ -730,6 +731,81 @@ void VPRegionBlock::execute(VPTransformState *State) {
  State->Instance.reset();
 }

+static InstructionCost computeCostForRecipe(VPRecipeBase *R, ElementCount VF,


Can this be folded into Recipe::computeCost()?

I don't think so, as VPRecipeBase::computeCost has the generic implementation to fall back on the legacy CM. Subclasses implementing it would all need to invoke computeCostForRecipe, unless there's an elegant alternative?

A public non-virtual VPRecipeBase::cost() or getCost() method can take care of skipping, forcing, or otherwise computing the cost of a recipe. The latter case can invoke a protected virtual VPRecipeBase::computeCost() which actually implements cost computations, with CM of underlying Instructions as default. Another naming option: computeCost() and computeCostImpl(). Otherwise the distinction between computeCostForRecipe() and Recipe::computeCost() needs to be explained.

Moved as suggested, thanks!

ayalz · 2024-05-23T11:51:39Z

llvm/lib/Transforms/Vectorize/VPlan.cpp

+  Instruction *UI = nullptr;
+  if (auto *S = dyn_cast<VPSingleDefRecipe>(R))
+    UI = dyn_cast_or_null<Instruction>(S->getUnderlyingValue());
+  if (UI && Ctx.skipForCostComputation(UI))
+    return 0;


Suggested change

Instruction *UI = nullptr;

if (auto *S = dyn_cast<VPSingleDefRecipe>(R))

UI = dyn_cast_or_null<Instruction>(S->getUnderlyingValue());

if (UI && Ctx.skipForCostComputation(UI))

return 0;

if (auto *S = dyn_cast<VPSingleDefRecipe>(R)) {

auto *UI = dyn_cast_or_null<Instruction>(S->getUnderlyingValue());

if (UI && Ctx.skipForCostComputation(UI))

return 0;

}

done, thanks!

ayalz · 2024-05-23T11:59:06Z

llvm/lib/Transforms/Vectorize/VPlan.cpp

+  if (!match(R, m_Binary<Instruction::ICmp>(m_VPValue(), m_VPValue())))
+    return false;
+  return all_of(R->operands(), [](VPValue *Op) {
+    return vputils::isUniformAfterVectorization(Op);


nit: better to return match() && all_of()?

updated, thanks!

fhahn

Addressed latest comments and also removed the VPWidenRecipe::computeCost in order to try to separate & split things up a bit more

fhahn · 2024-05-23T20:25:04Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  /// Returns the execution time cost of an instruction for a given vector
+  /// width. Vector width of one means scalar.
+  VectorizationCostTy getInstructionCost(Instruction *I, ElementCount VF);
+


moved, thanks

fhahn · 2024-05-23T20:28:40Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  return CM.getInstructionCost(UI, VF).first;
+}
+
+bool VPCostContext::skipForCostComputation(Instruction *UI) const {


Done, thanks!

fhahn · 2024-05-23T20:29:09Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  LLVMContext &Ctx = OrigLoop->getHeader()->getContext();
+  VPCostContext CostCtx(CM.TTI, Legal->getWidestInductionType(), Ctx, CM);


Done, thanks!

fhahn · 2024-05-23T20:29:46Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  LLVMContext &Ctx = OrigLoop->getHeader()->getContext();
+  VPCostContext CostCtx(CM.TTI, Legal->getWidestInductionType(), Ctx, CM);
+
+  // Cost modeling for inductions is inaccurate in the legacy cost model


Added as below, thanks!

fhahn · 2024-05-23T20:29:52Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  // Cost modeling for inductions is inaccurate in the legacy cost model
+  // compared to the recipes that are generated. To match here initially during
+  // VPlan cost model bring up directly use the induction costs from the legacy
+  // cost model and skip induction recipes. Note that we do this as


Added, thanks!

fhahn · 2024-05-23T21:18:54Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+  if (auto *S = dyn_cast<VPSingleDefRecipe>(this))
+    if (auto *UI = dyn_cast_or_null<Instruction>(S->getUnderlyingValue()))
+      return Ctx.getLegacyCost(UI, VF);
+
+  if (auto *IG = dyn_cast<VPInterleaveRecipe>(this))
+    return Ctx.getLegacyCost(IG->getInsertPos(), VF);
+  if (auto *WidenMem = dyn_cast<VPWidenMemoryRecipe>(this))
+    return Ctx.getLegacyCost(&WidenMem->getIngredient(), VF);


Updated, thanks!

fhahn · 2024-05-23T21:19:07Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

@@ -253,6 +253,18 @@ void VPRecipeBase::moveBefore(VPBasicBlock &BB,
  insertBefore(BB, I);
 }

+InstructionCost VPRecipeBase::computeCost(ElementCount VF, VPCostContext &Ctx) {


Added a comment, thanks!

fhahn · 2024-05-23T21:20:09Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

+        VPSingleDefRecipe *VPC;
+        if (auto *UV = R.getOperand(0)->getUnderlyingValue())
+          VPC = new VPWidenCastRecipe(Instruction::CastOps(ExtOpcode), A,
+                                      TruncTy, *cast<CastInst>(UV));


Updated the name, thanks!

fhahn · 2024-05-23T21:21:11Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  /// Return the cost of instructions in an inloop reduction pattern, if I is
+  /// part of that pattern.


Will adjust separately.

fhahn · 2024-05-23T21:21:17Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  /// Return the cost of instructions in an inloop reduction pattern, if I is
+  /// part of that pattern.
+  std::optional<InstructionCost>
+  getReductionPatternCost(Instruction *I, ElementCount VF, Type *VectorTy,


Will adjust separately.

ayalz · 2024-05-24T09:56:33Z

llvm/lib/Transforms/Vectorize/VPlan.cpp

@@ -730,6 +731,81 @@ void VPRegionBlock::execute(VPTransformState *State) {
  State->Instance.reset();
 }

+static InstructionCost computeCostForRecipe(VPRecipeBase *R, ElementCount VF,


A public non-virtual VPRecipeBase::cost() or getCost() method can take care of skipping, forcing, or otherwise computing the cost of a recipe. The latter case can invoke a protected virtual VPRecipeBase::computeCost() which actually implements cost computations, with CM of underlying Instructions as default. Another naming option: computeCost() and computeCostImpl(). Otherwise the distinction between computeCostForRecipe() and Recipe::computeCost() needs to be explained.

ayalz · 2024-05-25T14:29:04Z

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h

@@ -365,6 +374,9 @@ class LoopVectorizationPlanner {
  /// Return the best VPlan for \p VF.
  VPlan &getBestPlanFor(ElementCount VF) const;

+  /// Return the most profitable plan.


Note this also fixes the best VF.

Updated, thanks!

ayalz · 2024-05-26T16:28:49Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  // Add the cost for the backedge.
+  Cost += 1;


This can and should be taken care of by (loop) region::computeCost()?

Moved, thanks!

ayalz · 2024-05-26T16:31:51Z

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h

+  /// The current implementation requires access to the legacy cost model which
+  /// is why it is kept separate from the VPlan-only cost infrastructure.
+  ///
+  /// TODO: Move to VPlan::computeCost once the use of the legacy cost model
+  /// has been retired.


This has more to do with Legal::inductions and reductions, and their CM cost; the former are kept separate from VPlan and its cost implementation, rather than the latter, atm.

Updated, thanks!

ayalz · 2024-05-26T16:34:44Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -7396,6 +7397,122 @@ LoopVectorizationPlanner::plan(ElementCount UserVF, unsigned UserIC) {
  return VF;
 }

+InstructionCost VPCostContext::getLegacyCost(Instruction *UI, ElementCount VF) {


done, thanks!

ayalz · 2024-05-26T16:39:55Z

llvm/lib/Transforms/Vectorize/VPlan.cpp

+  // Compute the cost of a replicate region. Replicating isn't supported for
+  // scalable vectors, return an invalid cost for them.
+  if (VF.isScalable())
+    return InstructionCost::getInvalid();


If it isn't supported, should it be (prevented and) asserted, instead of built and cost invalidated?

Yes, but at the moment this is done via the cost. Might be worth to adjust (e.g. bail out during VPlan construction), but best done separately.

Worth leaving behind a TODO?

Added, thanks!

ayalz · 2024-05-26T16:43:20Z

llvm/lib/Transforms/Vectorize/VPlan.cpp

+  // blockNeedsPredication from Legal is used so as to not include all blocks in
+  // tail folded loops.
+  if (VF.isScalar())
+    return Cost / 2;


Note that getReciprocalPredBlockProb() should be used, which currently returns 2?

Done, required moving it to a header.

ayalz · 2024-05-26T16:47:02Z

llvm/lib/Transforms/Vectorize/VPlan.cpp

+
+  // For the scalar case, we may not always execute the original predicated
+  // block, Thus, scale the block's cost by the probability of executing it.
+  // blockNeedsPredication from Legal is used so as to not include all blocks in


blockNeedsPredication is no longer used here, which only checks if VF is scalar.

Done, thanks

ayalz · 2024-05-26T16:48:55Z

llvm/lib/Transforms/Vectorize/VPlan.cpp

+    Cond = Cond->getDefiningRecipe()->getOperand(0);
+  auto *R = Cond->getDefiningRecipe();
+  if (!R)
+    return true;


Worth adding a TODO to match additional patterns preserving uniformity of booleans, e.g., AND/OR/etc.?

Done, thanks!

ayalz · 2024-05-26T16:58:53Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+    }
+    for (Instruction *I : ReductionOperations) {
+      auto ReductionCost = CM.getReductionPatternCost(
+          I, VF, ToVectorTy(I->getType(), VF), TTI::TCK_RecipThroughput);


Worth a comment that we precompute the cost of I only if it is associated with a reduction pattern, i.e., has ReductionCost.

Added, thanks.

ayalz · 2024-05-27T06:20:26Z

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h

+  /// LoopVectorizationLegality to handle inductions and reductions, which is
+  /// why it is kept separate from the VPlan-only cost infrastructure.
+  ///
+  /// TODO: Move to VPlan::computeCost once the use of the legacy cost model


Suggested change

/// TODO: Move to VPlan::computeCost once the use of the legacy cost model

/// TODO: Move to VPlan::computeCost once the use of Legal

fixed, thanks!

ayalz · 2024-05-27T06:25:10Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+         "More than a single plan/VF w/o any plan having scalar VF");
+
+  InstructionCost ScalarCost =
+      computeCost(getBestPlanFor(ElementCount::getFixed(1)), ScalarVF);


Suggested change

computeCost(getBestPlanFor(ElementCount::getFixed(1)), ScalarVF);

computeCost(getBestPlanFor(ScalarVF), ScalarVF);

?

Done, thanks!

ayalz · 2024-05-27T09:53:42Z

llvm/lib/Transforms/Vectorize/VPlan.cpp

-  });
-  return RecipeCost;
-}
-
 InstructionCost VPBasicBlock::computeCost(ElementCount VF, VPCostContext &Ctx) {


Should VPlan and VPBlockBase have cost() instead of computeCost() to align with VPRecipeBase, complementing their mutual execute()?

Done, thanks!

ayalz · 2024-05-27T09:57:32Z

llvm/lib/Transforms/Vectorize/VPlan.h

+protected:
+  /// Compute the cost of this recipe using the legacy cost model and the
+  /// underlying instructions.
+  InstructionCost computeCost(ElementCount VF, VPCostContext &Ctx) const;


(Currently non-virtual, as only the base default implementation is provided, to be made virtual later.)

ayalz · 2024-05-27T09:59:18Z

llvm/lib/Transforms/Vectorize/VPlanValue.h

@@ -75,7 +75,7 @@ class VPValue {
 public:
  /// Return the underlying Value attached to this VPValue.
  Value *getUnderlyingValue() { return UnderlyingVal; }
-  const Value *getUnderlyingValue() const { return UnderlyingVal; }
+  Value *getUnderlyingValue() const { return UnderlyingVal; }


This loses const'ness? Makes the non-const version redundant?

Yep, removed, thanks!

ayalz

nit

ayalz · 2024-05-28T09:00:56Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  /// Return the cost of instructions in an inloop reduction pattern, if I is
+  /// part of that pattern.
+  std::optional<InstructionCost>
+  getReductionPatternCost(Instruction *I, ElementCount VF, Type *VectorTy,


Will adjust separately.

Very well. Another suggestion is to use Invalid cost for "no cost" instead of optional.

ayalz · 2024-05-28T09:08:46Z

llvm/lib/Transforms/Vectorize/VPlan.h

@@ -2948,6 +2994,9 @@ class VPBasicBlock : public VPBlockBase {
    return NewBlock;
  }

+  /// Compute the cost of this VPBasicBlock


Suggested change

/// Compute the cost of this VPBasicBlock

/// Compute the cost of this VPBasicBlock.

fixed, thanks!

ayalz · 2024-05-28T09:13:05Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

@@ -908,8 +908,13 @@ static void simplifyRecipe(VPRecipeBase &R, VPTypeAnalysis &TypeInfo) {
        unsigned ExtOpcode = match(R.getOperand(0), m_SExt(m_VPValue()))
                                 ? Instruction::SExt
                                 : Instruction::ZExt;
-        auto *VPC =
-            new VPWidenCastRecipe(Instruction::CastOps(ExtOpcode), A, TruncTy);
+        VPSingleDefRecipe *VPC;


Worth a comment that UnderlyingExt, having distinct return type, is recorded only to support legacy cost computation?

Should be clearer in the updated version, thanks!

ayalz · 2024-05-28T09:22:09Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  // VPlan cost model bring up directly use the induction costs from the legacy
+  // cost model and skip induction recipes. Note that we do this as
+  // pre-processing; the VPlan may not have any recipes associated with the
+  // original induction increment instruction.


OK, worth clarifying in the comment?

If original induction increment instructions do have recipes, is this pre-processing needed, in this preliminary version where recipe costs default to the CM cost of their underlying instructions? Perhaps to retain debug dumps.

Instructions associated with in-loop reductions do need to be pre-processed in order to take their getReductionPatternCost() rather than their getInstructionCost().

ayalz · 2024-05-28T09:33:05Z

llvm/lib/Transforms/Vectorize/VPlan.h

@@ -664,6 +676,9 @@ class VPBlockBase {
  /// the cloned recipes, including all blocks in the single-entry single-exit
  /// region for VPRegionBlocks.
  virtual VPBlockBase *clone() = 0;
+
+  /// Compute the cost of the block.
+  virtual InstructionCost cost(ElementCount VF, VPCostContext &Ctx) = 0;


nit: better place all cost() methods next to execute(), as done in VPRecipeBase.

Done, thanks!

alexfh · 2024-07-16T17:38:38Z

And another test case:

$ cat reduced.ll
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define void @_f(ptr %0, i16 %1) #0 {
.preheader.preheader:
  br label %.preheader

.loopexit.loopexit:                               ; preds = %.preheader
  ret void

.preheader:                                       ; preds = %.preheader, %.preheader.preheader
  %2 = phi i64 [ %13, %.preheader ], [ 0, %.preheader.preheader ]
  %3 = shl i64 %2, 3
  %4 = getelementptr i8, ptr %0, i64 %3
  %5 = load i16, ptr null, align 2
  %6 = trunc i32 0 to i16
  store i16 %6, ptr %4, align 2
  %7 = trunc i32 0 to i16
  %8 = getelementptr i8, ptr %4, i64 2
  store i16 %7, ptr %8, align 2
  %9 = sdiv i32 0, 1
  %10 = getelementptr i8, ptr %4, i64 4
  store i16 %1, ptr %10, align 2
  %11 = trunc i32 0 to i16
  %12 = getelementptr i8, ptr %4, i64 6
  store i16 %11, ptr %12, align 2
  %13 = add i64 %2, 1
  %14 = icmp eq i64 %13, 0
  br i1 %14, label %.loopexit.loopexit, label %.preheader
}

attributes #0 = { "target-features"="+aes,+cmov,+crc32,+cx16,+cx8,+fxsr,+mmx,+pclmul,+popcnt,+prfchw,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87" }
$ ./opt-bad -passes=loop-vectorize -S < reduced.ll
Invalid bitcast
  %26 = bitcast <4 x i16> %broadcast.splat to <8 x i16>
LLVM ERROR: Broken module found, compilation aborted!

fhahn · 2024-07-16T20:04:19Z

@AlexHF is it possible the test case is over-reduced? I can reproduce the crash (with assertions, it triggers an assertion), but it is due to a dead load. Would it be possible to get a reproducer also without dead instructions?

alexfh · 2024-07-16T21:55:11Z

The test case can easily be over-reduced. The interestingness test I used is that a known-good version of opt -passes=loop-vectorize succeeds and the same opt -passes=loop-vectorize after this commit crashes. Any advice on what I could do to verify there are no dead loads? opt -passes=loop-vectorizer,verify still succeeds on both reduced inputs.

alexfh · 2024-07-16T22:11:28Z

Luckily, the code involved is open-source (a part of https://chromium.googlesource.com/angle/angle). Here's a couple of the non-reduced IR dumps I got using clang -mllvm -print-before=loop-vectorize -mllvm -print-module-scope:
err3.tar.gz,
err4.tar.gz

fhahn · 2024-07-17T10:44:33Z

@AlexHF thanks, should be fixed in d216615

alexfh · 2024-07-17T10:46:22Z

Thank you! I'll check the original failure as well.

alexey-bataev · 2024-07-17T14:01:35Z

Looks like still crashes:

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define fastcc void @FmoInit() #0 {
entry:
  br label %for.body25.i

for.body25.i:                                     ; preds = %for.body25.i, %entry
  %indvars.iv12.i1 = phi i64 [ 0, %entry ], [ %indvars.iv.next13.i.7, %for.body25.i ]
  %div66.i = lshr i64 %indvars.iv12.i1, 1
  %arrayidx27.i = getelementptr nusw i32, ptr null, i64 %div66.i
  %0 = load i32, ptr %arrayidx27.i, align 4
  %indvars.iv.next13.i = or i64 0, 0
  %arrayidx29.i.1 = getelementptr i32, ptr null, i64 %indvars.iv.next13.i
  store i32 0, ptr %arrayidx29.i.1, align 4
  %indvars.iv.next13.i.1 = or disjoint i64 %indvars.iv12.i1, 2
  %div66.i.2 = lshr i64 %indvars.iv.next13.i.1, 1
  %arrayidx27.i.2 = getelementptr i32, ptr null, i64 %div66.i.2
  %1 = load i32, ptr %arrayidx27.i.2, align 4
  %indvars.iv.next13.i.7 = add i64 %indvars.iv12.i1, 8
  %niter144.ncmp.7 = icmp eq i64 %indvars.iv.next13.i.7, 0
  br i1 %niter144.ncmp.7, label %FmoGenerateMbToSliceGroupMap.exit.loopexit124.unr-lcssa.loopexit, label %for.body25.i

FmoGenerateMbToSliceGroupMap.exit.loopexit124.unr-lcssa.loopexit: ; preds = %for.body25.i
  ret void
}

attributes #0 = { "min-legal-vector-width"="0" "target-cpu"="cascadelake" }

development/bin/opt -S --passes=loop-vectorize ./reduced.ll

fhahn · 2024-07-17T14:24:39Z

@alexey-bataev would it be possible to share the unreduced input without an unused load in the loop?

alexey-bataev · 2024-07-17T14:27:53Z

@alexey-bataev would it be possible to share the unreduced input without an unused load in the loop?

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define fastcc void @FmoInit(ptr %call.i, i64 %unroll_iter143) #0 {
entry:
  %call.i10 = load volatile ptr, ptr null, align 8
  br label %for.body25.i

for.body25.i:                                     ; preds = %for.body25.i, %entry
  %indvars.iv12.i = phi i64 [ 0, %entry ], [ %indvars.iv.next13.i.7, %for.body25.i ]
  %div66.i = lshr i64 %indvars.iv12.i, 1
  %arrayidx27.i = getelementptr nusw i32, ptr %call.i, i64 %div66.i
  %0 = load i32, ptr %arrayidx27.i, align 4
  %arrayidx29.i = getelementptr i32, ptr %call.i10, i64 %indvars.iv12.i
  store i32 %0, ptr %arrayidx29.i, align 4
  %indvars.iv.next13.i = or i64 %indvars.iv12.i, 1
  %arrayidx29.i.1 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i
  store i32 0, ptr %arrayidx29.i.1, align 4
  %indvars.iv.next13.i.1 = or i64 %indvars.iv12.i, 2
  %div66.i.2 = lshr i64 %indvars.iv.next13.i.1, 1
  %arrayidx27.i.2 = getelementptr i32, ptr %call.i, i64 %div66.i.2
  %1 = load i32, ptr %arrayidx27.i.2, align 4
  %arrayidx29.i.2 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.1
  store i32 %1, ptr %arrayidx29.i.2, align 4
  %indvars.iv.next13.i.2 = or i64 %indvars.iv12.i, 3
  %arrayidx29.i.3 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.2
  store i32 0, ptr %arrayidx29.i.3, align 4
  %indvars.iv.next13.i.3 = or i64 %indvars.iv12.i, 4
  %arrayidx27.i.4 = getelementptr i32, ptr %call.i, i64 %indvars.iv12.i
  %2 = load i32, ptr %arrayidx27.i.4, align 4
  %arrayidx29.i.4 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.3
  store i32 %2, ptr %arrayidx29.i.4, align 4
  %indvars.iv.next13.i.4 = or i64 %indvars.iv12.i, 5
  %arrayidx29.i.5 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.4
  store i32 0, ptr %arrayidx29.i.5, align 4
  %indvars.iv.next13.i.5 = or i64 %indvars.iv12.i, 6
  %arrayidx29.i.6 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.5
  store i32 0, ptr %arrayidx29.i.6, align 4
  %indvars.iv.next13.i.6 = or i64 %indvars.iv12.i, 7
  %arrayidx29.i.7 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.6
  store i32 0, ptr %arrayidx29.i.7, align 4
  %indvars.iv.next13.i.7 = add nsw i64 %indvars.iv12.i, 8
  %niter144.ncmp.7 = icmp eq i64 %indvars.iv12.i, %unroll_iter143
  br i1 %niter144.ncmp.7, label %FmoGenerateMbToSliceGroupMap.exit.loopexit124.unr-lcssa.loopexit, label %for.body25.i

FmoGenerateMbToSliceGroupMap.exit.loopexit124.unr-lcssa.loopexit: ; preds = %for.body25.i
  ret void
}

attributes #0 = { "min-legal-vector-width"="0" "target-cpu"="cascadelake" }

Try this one

alexfh · 2024-07-17T16:51:48Z

Thank you! I'll check the original failure as well.

The code from #92555 (comment) compiles without errors now. I checked an assertions-enabled Clang build as well.

fhahn · 2024-07-17T20:54:23Z

@alexey-bataev would it be possible to share the unreduced input without an unused load in the loop?

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define fastcc void @FmoInit(ptr %call.i, i64 %unroll_iter143) #0 {
entry:
  %call.i10 = load volatile ptr, ptr null, align 8
  br label %for.body25.i

for.body25.i:                                     ; preds = %for.body25.i, %entry
  %indvars.iv12.i = phi i64 [ 0, %entry ], [ %indvars.iv.next13.i.7, %for.body25.i ]
  %div66.i = lshr i64 %indvars.iv12.i, 1
  %arrayidx27.i = getelementptr nusw i32, ptr %call.i, i64 %div66.i
  %0 = load i32, ptr %arrayidx27.i, align 4
  %arrayidx29.i = getelementptr i32, ptr %call.i10, i64 %indvars.iv12.i
  store i32 %0, ptr %arrayidx29.i, align 4
  %indvars.iv.next13.i = or i64 %indvars.iv12.i, 1
  %arrayidx29.i.1 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i
  store i32 0, ptr %arrayidx29.i.1, align 4
  %indvars.iv.next13.i.1 = or i64 %indvars.iv12.i, 2
  %div66.i.2 = lshr i64 %indvars.iv.next13.i.1, 1
  %arrayidx27.i.2 = getelementptr i32, ptr %call.i, i64 %div66.i.2
  %1 = load i32, ptr %arrayidx27.i.2, align 4
  %arrayidx29.i.2 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.1
  store i32 %1, ptr %arrayidx29.i.2, align 4
  %indvars.iv.next13.i.2 = or i64 %indvars.iv12.i, 3
  %arrayidx29.i.3 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.2
  store i32 0, ptr %arrayidx29.i.3, align 4
  %indvars.iv.next13.i.3 = or i64 %indvars.iv12.i, 4
  %arrayidx27.i.4 = getelementptr i32, ptr %call.i, i64 %indvars.iv12.i
  %2 = load i32, ptr %arrayidx27.i.4, align 4
  %arrayidx29.i.4 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.3
  store i32 %2, ptr %arrayidx29.i.4, align 4
  %indvars.iv.next13.i.4 = or i64 %indvars.iv12.i, 5
  %arrayidx29.i.5 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.4
  store i32 0, ptr %arrayidx29.i.5, align 4
  %indvars.iv.next13.i.5 = or i64 %indvars.iv12.i, 6
  %arrayidx29.i.6 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.5
  store i32 0, ptr %arrayidx29.i.6, align 4
  %indvars.iv.next13.i.6 = or i64 %indvars.iv12.i, 7
  %arrayidx29.i.7 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.6
  store i32 0, ptr %arrayidx29.i.7, align 4
  %indvars.iv.next13.i.7 = add nsw i64 %indvars.iv12.i, 8
  %niter144.ncmp.7 = icmp eq i64 %indvars.iv12.i, %unroll_iter143
  br i1 %niter144.ncmp.7, label %FmoGenerateMbToSliceGroupMap.exit.loopexit124.unr-lcssa.loopexit, label %for.body25.i

FmoGenerateMbToSliceGroupMap.exit.loopexit124.unr-lcssa.loopexit: ; preds = %for.body25.i
  ret void
}

attributes #0 = { "min-legal-vector-width"="0" "target-cpu"="cascadelake" }

Try this one

Thanks, should be fixed now!

nikic · 2024-07-17T21:38:42Z

Given the assertion whack-a-mole, should this change be reverted until after LLVM 19 has branched early next week? (Or maybe just disable the consistency assertion?)

alexfh · 2024-07-17T23:58:26Z

@alexey-bataev would it be possible to share the unreduced input without an unused load in the loop?

Try this one

Thanks, should be fixed now!

Which commit is the fix for this? 2bb65660ae8b9b2e1896b07b881505a4ffc0393b?

This patch adds a new temporary option to still use the legacy cost model after llvm#92555. It defaults to false and the only intended use is to adjust the default to true in the soon-to-be-cut release branch.

fhahn · 2024-07-18T18:00:38Z

Given the assertion whack-a-mole, should this change be reverted until after LLVM 19 has branched early next week? (Or maybe just disable the consistency assertion?)

How about an option to enable the legacy behavior, off in main and enabled on the release branch? (#99536)

fhahn · 2024-07-18T18:00:56Z

@alexey-bataev would it be possible to share the unreduced input without an unused load in the loop?

Try this one

Thanks, should be fixed now!

Which commit is the fix for this? 2bb65660ae8b9b2e1896b07b881505a4ffc0393b?

Yep

This patch adds a new temporary option to still use the legacy cost model after #92555. It defaults to false and the only intended use is to adjust the default to true in the soon-to-be-cut release branch. PR: #99536

As discussed in llvm#92555 flip the default for the option added in llvm#99536 to true. This restores the original behavior for the release branch to give the VPlan-based cost model more time to mature on main.

As discussed in #92555 flip the default for the option added in #99536 to true. This restores the original behavior for the release branch to give the VPlan-based cost model more time to mature on main.

This patch adds a new temporary option to still use the legacy cost model after llvm#92555. It defaults to false and the only intended use is to adjust the default to true in the soon-to-be-cut release branch. PR: llvm#99536

Summary: This patch adds a new temporary option to still use the legacy cost model after #92555. It defaults to false and the only intended use is to adjust the default to true in the soon-to-be-cut release branch. PR: #99536 Test Plan: Reviewers: Subscribers: Tasks: Tags: Differential Revision: https://phabricator.intern.facebook.com/D60251192

@arcbbb

This adds a new interface to compute the cost of recipes, VPBasicBlocks, VPRegionBlocks and VPlan, initially falling back to the legacy cost model for all recipes. Follow-up patches will gradually migrate recipes to compute their own costs step-by-step. It also adds getBestPlan function to LVP which computes the cost of all VPlans and picks the most profitable one together with the most profitable VF. The VPlan selected by the VPlan cost model is executed and there is an assert to catch cases where the VPlan cost model and the legacy cost model disagree. Even though I checked a number of different build configurations on AArch64 and X86, there may be some differences that have been missed. Additional discussions and context can be found in @arcbbb's llvm#67647 and llvm#67934 which is an earlier version of the current PR. PR: llvm#92555

This reverts commit 0079835. Causes crashes, see comments on llvm#92555.

Extra test cases that caused revert of llvm#92555

fhahn added 6 commits April 28, 2024 13:33

Merge remote-tracking branch 'origin/main' into vplan-cost

6330a67

Merge remote-tracking branch 'origin/main' into vplan-cost

0da9e25

!fixup address latest comments, thanks!

52786ae

Merge remote-tracking branch 'origin/main' into vplan-cost

7043085

!fixup Move legacy CM to context.

d2fa5ee

fhahn changed the title ~~[VPlan] First step towards VPlan cost modeling.~~ [VPlan] First step towards VPlan cost modeling (LegacyCM in CostCtx) May 17, 2024

fhahn mentioned this pull request May 17, 2024

[VPlan] First step towards VPlan cost modeling. #67934

Closed

fhahn requested a review from ayalz May 22, 2024 08:59

fhahn added 2 commits May 22, 2024 10:00

Merge remote-tracking branch 'origin/main' into vplan-cost-cm-in-ctx

b1ab1b8

!fixup

c91f8ba

!fixup fix formatting.

e1cd132

ayalz reviewed May 23, 2024

View reviewed changes

fhahn added 2 commits May 23, 2024 20:57

Merge remote-tracking branch 'origin/main' into vplan-cost-cm-in-ctx

e66563b

!fixup addres latest comments, thanks

9a4111d

fhahn commented May 23, 2024

View reviewed changes

fhahn marked this pull request as ready for review May 23, 2024 21:30

ayalz reviewed May 26, 2024

View reviewed changes

fhahn added 2 commits May 26, 2024 17:22

Merge remote-tracking branch 'origin/main' into vplan-cost-cm-in-ctx

faa855d

!fixup address latest comments, thanks!

860aae1

ayalz reviewed May 27, 2024

View reviewed changes

fhahn added 2 commits May 27, 2024 15:37

Merge remote-tracking branch 'origin/main' into vplan-cost-cm-in-ctx

32fc296

!fixup address comments, thanks!

17442f9

ayalz reviewed May 28, 2024

View reviewed changes

fhahn added 4 commits May 31, 2024 06:08

Merge remote-tracking branch 'origin/main' into vplan-cost-cm-in-ctx

b27201c

Merge remote-tracking branch 'origin/main' into vplan-cost-cm-in-ctx

24e03bd

[LV] Add test with strided interleave groups and maximizing bandwidth.

1ae4d60

[LV] Operands feeding pointers of interleave member pointers are free.

423adca

fhahn mentioned this pull request Jul 18, 2024

[LV] Add option to still enable the legacy cost model. #99536

Merged

patrick-rivos mentioned this pull request Jul 19, 2024

[VPlan] Assertion `VF.Width == Width && "VPlan cost model and legacy cost model disagreed"' failed. #99701

Closed

fhahn mentioned this pull request Jul 23, 2024

[LV] Use legacy cost model by default in 19.x #100093

Closed

fhahn mentioned this pull request Jul 23, 2024

[LV] Disable VPlan-based cost model for 19.x release. #100097

Closed

patrick-rivos mentioned this pull request Jul 24, 2024

[VPlan] VPlan cost model and legacy cost model disagreed #100464

Closed

EthanLuisMcDonough pushed a commit to EthanLuisMcDonough/llvm-project that referenced this pull request Aug 13, 2024

Revert "[VPlan] First step towards VPlan cost modeling. (llvm#92555)"

97b730a

This reverts commit 0079835. Causes crashes, see comments on llvm#92555.

EthanLuisMcDonough pushed a commit to EthanLuisMcDonough/llvm-project that referenced this pull request Aug 13, 2024

[LV] Add extra cost model tests with truncated inductions.

ce79a0a

Extra test cases that caused revert of llvm#92555

ayalz mentioned this pull request Aug 18, 2024

[VPlan] Compute cost for most opcodes in VPWidenRecipe (NFCI). #98764

Merged

		InstructionCost ScalarCost = computeCost(
		getBestPlanFor(ElementCount::getFixed(1)), ElementCount::getFixed(1));

		/// Return the cost of instructions in an inloop reduction pattern, if I is
		/// part of that pattern.

	bool VPCostContext::skipForCostComputation(Instruction *UI) const {
	bool VPCostContext::skipCostComputation(Instruction *UI) const {

		LLVMContext &Ctx = OrigLoop->getHeader()->getContext();
		VPCostContext CostCtx(CM.TTI, Legal->getWidestInductionType(), Ctx, CM);

	/// TODO: Move to VPlan::computeCost once the use of the legacy cost model
	/// TODO: Move to VPlan::computeCost once the use of Legal

	computeCost(getBestPlanFor(ElementCount::getFixed(1)), ScalarVF);
	computeCost(getBestPlanFor(ScalarVF), ScalarVF);

	/// Compute the cost of this VPBasicBlock
	/// Compute the cost of this VPBasicBlock.

[VPlan] First step towards VPlan cost modeling. #92555

[VPlan] First step towards VPlan cost modeling. #92555

Conversation

fhahn commented May 17, 2024 • edited Loading

github-actions bot commented May 22, 2024 • edited Loading

ayalz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ayalz May 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fhahn left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ayalz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fhahn commented May 17, 2024 •

edited

Loading

github-actions bot commented May 22, 2024 •

edited

Loading

ayalz May 23, 2024 •

edited

Loading