Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VPlan] First step towards VPlan cost modeling. #92555

Merged
merged 33 commits into from
Jun 13, 2024

Conversation

fhahn
Copy link
Contributor

@fhahn fhahn commented May 17, 2024

This adds a new interface to compute the cost of recipes, VPBasicBlocks, VPRegionBlocks
and VPlan, initially falling back to the legacy cost model for all recipes. Follow-up patches
will gradually migrate recipes to compute their own costs step-by-step.

It also adds getBestPlan function to LVP which computes the cost of all
VPlans and picks the most profitable one together with the most
profitable VF.

The VPlan selected by the VPlan cost model is executed and there is an
assert to catch cases where the VPlan cost model and the legacy cost
model disagree. Even though I checked a number of different build
configurations on AArch64 and X86, there may be some differences
that have been missed.

Additional discussions and context can be found in @arcbbb's
#67647 and
#67934 which is an earlier version
of the current PR.

This adds a new computeCost interface to VPReicpeBase and implements it
for VPWidenRecipe and VPWidenIntOrFpInductionRecipe.

It also adds getBestPlan function to LVP which computes the cost of all
VPlans and picks the most profitable one together with the most
profitable VF. For recipes that do not yet implement computeCost, the
legacy cost for the underlying instruction is used.

The VPlan selected by the VPlan cost model is executed and there is an
assert to catch cases where the VPlan cost model and the legacy cost
model disagree.
@fhahn fhahn changed the title [VPlan] First step towards VPlan cost modeling. [VPlan] First step towards VPlan cost modeling (LegacyCM in CostCtx) May 17, 2024
@fhahn fhahn requested a review from ayalz May 22, 2024 08:59
Copy link

github-actions bot commented May 22, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Copy link
Collaborator

@ayalz ayalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks right to me, thanks! Adding various comments.
Worth noting in the commit message that the cost of VPWidenRecipe is computed directly, other recipes to follow.

Comment on lines 7484 to 7485
InstructionCost ScalarCost = computeCost(
getBestPlanFor(ElementCount::getFixed(1)), ElementCount::getFixed(1));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
InstructionCost ScalarCost = computeCost(
getBestPlanFor(ElementCount::getFixed(1)), ElementCount::getFixed(1));
InstructionCost ScalarCost = computeCost(getBestPlanFor(ScalarVF), ScalarVF);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

Comment on lines +1625 to +1626
/// Return the cost of instructions in an inloop reduction pattern, if I is
/// part of that pattern.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Return the cost of instructions in an inloop reduction pattern, if I is
/// part of that pattern.
/// Return the cost of instructions in an inloop reduction pattern, if \p I
/// is part of that pattern.

(unrelated to this patch).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will adjust separately.

/// Return the cost of instructions in an inloop reduction pattern, if I is
/// part of that pattern.
std::optional<InstructionCost>
getReductionPatternCost(Instruction *I, ElementCount VF, Type *VectorTy,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better called getInLoopReductionPatternCost()?
(unrelated to this patch).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will adjust separately.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will adjust separately.

Very well. Another suggestion is to use Invalid cost for "no cost" instead of optional.

/// Returns the execution time cost of an instruction for a given vector
/// width. Vector width of one means scalar.
VectorizationCostTy getInstructionCost(Instruction *I, ElementCount VF);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: hoist this to the public methods area above, along with getReductionCost(), rather than below to the public fields area here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved, thanks

return CM.getInstructionCost(UI, VF).first;
}

bool VPCostContext::skipForCostComputation(Instruction *UI) const {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool VPCostContext::skipForCostComputation(Instruction *UI) const {
bool VPCostContext::skipCostComputation(Instruction *UI) const {

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

VPSingleDefRecipe *VPC;
if (auto *UV = R.getOperand(0)->getUnderlyingValue())
VPC = new VPWidenCastRecipe(Instruction::CastOps(ExtOpcode), A,
TruncTy, *cast<CastInst>(UV));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This uses an underlying (Trunc) UV whose result type size may differ from the size of the inferred TruncTy of (S/ZExt) VPC, for legacy cost computation, hence dropping the assert in the constructor?

Copy link
Collaborator

@ayalz ayalz May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, UV is the underlying Z/SExt rather than the underlying Trunc, so its type clearly differs from that ot VPC. A more informative name would help, e.g., UnderlyingExt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the name, thanks!

@@ -738,6 +760,10 @@ class VPRecipeBase : public ilist_node_with_parent<VPRecipeBase, VPBasicBlock>,
/// this VPRecipe, thereby "executing" the VPlan.
virtual void execute(VPTransformState &State) = 0;

/// Compute the cost for the recipe. Returns an invalid cost if the recipe
/// does not yet implement computing the cost.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where/Is invalid returned?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to reflect the fact that the default implementation now returns the cost via the legacy CM, thanks!

@@ -730,6 +731,81 @@ void VPRegionBlock::execute(VPTransformState *State) {
State->Instance.reset();
}

static InstructionCost computeCostForRecipe(VPRecipeBase *R, ElementCount VF,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be folded into Recipe::computeCost()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, as VPRecipeBase::computeCost has the generic implementation to fall back on the legacy CM. Subclasses implementing it would all need to invoke computeCostForRecipe, unless there's an elegant alternative?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A public non-virtual VPRecipeBase::cost() or getCost() method can take care of skipping, forcing, or otherwise computing the cost of a recipe. The latter case can invoke a protected virtual VPRecipeBase::computeCost() which actually implements cost computations, with CM of underlying Instructions as default. Another naming option: computeCost() and computeCostImpl(). Otherwise the distinction between computeCostForRecipe() and Recipe::computeCost() needs to be explained.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved as suggested, thanks!

Comment on lines 736 to 740
Instruction *UI = nullptr;
if (auto *S = dyn_cast<VPSingleDefRecipe>(R))
UI = dyn_cast_or_null<Instruction>(S->getUnderlyingValue());
if (UI && Ctx.skipForCostComputation(UI))
return 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Instruction *UI = nullptr;
if (auto *S = dyn_cast<VPSingleDefRecipe>(R))
UI = dyn_cast_or_null<Instruction>(S->getUnderlyingValue());
if (UI && Ctx.skipForCostComputation(UI))
return 0;
if (auto *S = dyn_cast<VPSingleDefRecipe>(R)) {
auto *UI = dyn_cast_or_null<Instruction>(S->getUnderlyingValue());
if (UI && Ctx.skipForCostComputation(UI))
return 0;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, thanks!

Comment on lines 1565 to 1568
if (!match(R, m_Binary<Instruction::ICmp>(m_VPValue(), m_VPValue())))
return false;
return all_of(R->operands(), [](VPValue *Op) {
return vputils::isUniformAfterVectorization(Op);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: better to return match() && all_of()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, thanks!

Copy link
Contributor Author

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed latest comments and also removed the VPWidenRecipe::computeCost in order to try to separate & split things up a bit more

/// Returns the execution time cost of an instruction for a given vector
/// width. Vector width of one means scalar.
VectorizationCostTy getInstructionCost(Instruction *I, ElementCount VF);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved, thanks

return CM.getInstructionCost(UI, VF).first;
}

bool VPCostContext::skipForCostComputation(Instruction *UI) const {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

Comment on lines 7411 to 7412
LLVMContext &Ctx = OrigLoop->getHeader()->getContext();
VPCostContext CostCtx(CM.TTI, Legal->getWidestInductionType(), Ctx, CM);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

LLVMContext &Ctx = OrigLoop->getHeader()->getContext();
VPCostContext CostCtx(CM.TTI, Legal->getWidestInductionType(), Ctx, CM);

// Cost modeling for inductions is inaccurate in the legacy cost model
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added as below, thanks!

// Cost modeling for inductions is inaccurate in the legacy cost model
// compared to the recipes that are generated. To match here initially during
// VPlan cost model bring up directly use the induction costs from the legacy
// cost model and skip induction recipes. Note that we do this as
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks!

Comment on lines 257 to 264
if (auto *S = dyn_cast<VPSingleDefRecipe>(this))
if (auto *UI = dyn_cast_or_null<Instruction>(S->getUnderlyingValue()))
return Ctx.getLegacyCost(UI, VF);

if (auto *IG = dyn_cast<VPInterleaveRecipe>(this))
return Ctx.getLegacyCost(IG->getInsertPos(), VF);
if (auto *WidenMem = dyn_cast<VPWidenMemoryRecipe>(this))
return Ctx.getLegacyCost(&WidenMem->getIngredient(), VF);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks!

@@ -253,6 +253,18 @@ void VPRecipeBase::moveBefore(VPBasicBlock &BB,
insertBefore(BB, I);
}

InstructionCost VPRecipeBase::computeCost(ElementCount VF, VPCostContext &Ctx) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment, thanks!

VPSingleDefRecipe *VPC;
if (auto *UV = R.getOperand(0)->getUnderlyingValue())
VPC = new VPWidenCastRecipe(Instruction::CastOps(ExtOpcode), A,
TruncTy, *cast<CastInst>(UV));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the name, thanks!

Comment on lines +1625 to +1626
/// Return the cost of instructions in an inloop reduction pattern, if I is
/// part of that pattern.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will adjust separately.

/// Return the cost of instructions in an inloop reduction pattern, if I is
/// part of that pattern.
std::optional<InstructionCost>
getReductionPatternCost(Instruction *I, ElementCount VF, Type *VectorTy,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will adjust separately.

@fhahn fhahn marked this pull request as ready for review May 23, 2024 21:30
@@ -730,6 +731,81 @@ void VPRegionBlock::execute(VPTransformState *State) {
State->Instance.reset();
}

static InstructionCost computeCostForRecipe(VPRecipeBase *R, ElementCount VF,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A public non-virtual VPRecipeBase::cost() or getCost() method can take care of skipping, forcing, or otherwise computing the cost of a recipe. The latter case can invoke a protected virtual VPRecipeBase::computeCost() which actually implements cost computations, with CM of underlying Instructions as default. Another naming option: computeCost() and computeCostImpl(). Otherwise the distinction between computeCostForRecipe() and Recipe::computeCost() needs to be explained.

@@ -365,6 +374,9 @@ class LoopVectorizationPlanner {
/// Return the best VPlan for \p VF.
VPlan &getBestPlanFor(ElementCount VF) const;

/// Return the most profitable plan.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this also fixes the best VF.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks!

Comment on lines 7471 to 7472
// Add the cost for the backedge.
Cost += 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can and should be taken care of by (loop) region::computeCost()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved, thanks!

Comment on lines 349 to 353
/// The current implementation requires access to the legacy cost model which
/// is why it is kept separate from the VPlan-only cost infrastructure.
///
/// TODO: Move to VPlan::computeCost once the use of the legacy cost model
/// has been retired.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has more to do with Legal::inductions and reductions, and their CM cost; the former are kept separate from VPlan and its cost implementation, rather than the latter, atm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks!

@@ -7396,6 +7397,122 @@ LoopVectorizationPlanner::plan(ElementCount UserVF, unsigned UserIC) {
return VF;
}

InstructionCost VPCostContext::getLegacyCost(Instruction *UI, ElementCount VF) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, thanks!

// Compute the cost of a replicate region. Replicating isn't supported for
// scalable vectors, return an invalid cost for them.
if (VF.isScalable())
return InstructionCost::getInvalid();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it isn't supported, should it be (prevented and) asserted, instead of built and cost invalidated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but at the moment this is done via the cost. Might be worth to adjust (e.g. bail out during VPlan construction), but best done separately.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth leaving behind a TODO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks!

// blockNeedsPredication from Legal is used so as to not include all blocks in
// tail folded loops.
if (VF.isScalar())
return Cost / 2;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that getReciprocalPredBlockProb() should be used, which currently returns 2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, required moving it to a header.


// For the scalar case, we may not always execute the original predicated
// block, Thus, scale the block's cost by the probability of executing it.
// blockNeedsPredication from Legal is used so as to not include all blocks in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blockNeedsPredication is no longer used here, which only checks if VF is scalar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks

Cond = Cond->getDefiningRecipe()->getOperand(0);
auto *R = Cond->getDefiningRecipe();
if (!R)
return true;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth adding a TODO to match additional patterns preserving uniformity of booleans, e.g., AND/OR/etc.?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

}
for (Instruction *I : ReductionOperations) {
auto ReductionCost = CM.getReductionPatternCost(
I, VF, ToVectorTy(I->getType(), VF), TTI::TCK_RecipThroughput);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth a comment that we precompute the cost of I only if it is associated with a reduction pattern, i.e., has ReductionCost.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks.

/// LoopVectorizationLegality to handle inductions and reductions, which is
/// why it is kept separate from the VPlan-only cost infrastructure.
///
/// TODO: Move to VPlan::computeCost once the use of the legacy cost model
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// TODO: Move to VPlan::computeCost once the use of the legacy cost model
/// TODO: Move to VPlan::computeCost once the use of Legal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thanks!

"More than a single plan/VF w/o any plan having scalar VF");

InstructionCost ScalarCost =
computeCost(getBestPlanFor(ElementCount::getFixed(1)), ScalarVF);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
computeCost(getBestPlanFor(ElementCount::getFixed(1)), ScalarVF);
computeCost(getBestPlanFor(ScalarVF), ScalarVF);

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

});
return RecipeCost;
}

InstructionCost VPBasicBlock::computeCost(ElementCount VF, VPCostContext &Ctx) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should VPlan and VPBlockBase have cost() instead of computeCost() to align with VPRecipeBase, complementing their mutual execute()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

protected:
/// Compute the cost of this recipe using the legacy cost model and the
/// underlying instructions.
InstructionCost computeCost(ElementCount VF, VPCostContext &Ctx) const;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Currently non-virtual, as only the base default implementation is provided, to be made virtual later.)

@@ -75,7 +75,7 @@ class VPValue {
public:
/// Return the underlying Value attached to this VPValue.
Value *getUnderlyingValue() { return UnderlyingVal; }
const Value *getUnderlyingValue() const { return UnderlyingVal; }
Value *getUnderlyingValue() const { return UnderlyingVal; }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loses const'ness? Makes the non-const version redundant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, removed, thanks!

Copy link
Collaborator

@ayalz ayalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

/// Return the cost of instructions in an inloop reduction pattern, if I is
/// part of that pattern.
std::optional<InstructionCost>
getReductionPatternCost(Instruction *I, ElementCount VF, Type *VectorTy,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will adjust separately.

Very well. Another suggestion is to use Invalid cost for "no cost" instead of optional.

@@ -2948,6 +2994,9 @@ class VPBasicBlock : public VPBlockBase {
return NewBlock;
}

/// Compute the cost of this VPBasicBlock
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Compute the cost of this VPBasicBlock
/// Compute the cost of this VPBasicBlock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thanks!

@@ -908,8 +908,13 @@ static void simplifyRecipe(VPRecipeBase &R, VPTypeAnalysis &TypeInfo) {
unsigned ExtOpcode = match(R.getOperand(0), m_SExt(m_VPValue()))
? Instruction::SExt
: Instruction::ZExt;
auto *VPC =
new VPWidenCastRecipe(Instruction::CastOps(ExtOpcode), A, TruncTy);
VPSingleDefRecipe *VPC;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth a comment that UnderlyingExt, having distinct return type, is recorded only to support legacy cost computation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be clearer in the updated version, thanks!

// VPlan cost model bring up directly use the induction costs from the legacy
// cost model and skip induction recipes. Note that we do this as
// pre-processing; the VPlan may not have any recipes associated with the
// original induction increment instruction.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, worth clarifying in the comment?

If original induction increment instructions do have recipes, is this pre-processing needed, in this preliminary version where recipe costs default to the CM cost of their underlying instructions? Perhaps to retain debug dumps.

Instructions associated with in-loop reductions do need to be pre-processed in order to take their getReductionPatternCost() rather than their getInstructionCost().

@@ -664,6 +676,9 @@ class VPBlockBase {
/// the cloned recipes, including all blocks in the single-entry single-exit
/// region for VPRegionBlocks.
virtual VPBlockBase *clone() = 0;

/// Compute the cost of the block.
virtual InstructionCost cost(ElementCount VF, VPCostContext &Ctx) = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: better place all cost() methods next to execute(), as done in VPRecipeBase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

@alexfh
Copy link
Contributor

alexfh commented Jul 16, 2024

And another test case:

$ cat reduced.ll
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define void @_f(ptr %0, i16 %1) #0 {
.preheader.preheader:
  br label %.preheader

.loopexit.loopexit:                               ; preds = %.preheader
  ret void

.preheader:                                       ; preds = %.preheader, %.preheader.preheader
  %2 = phi i64 [ %13, %.preheader ], [ 0, %.preheader.preheader ]
  %3 = shl i64 %2, 3
  %4 = getelementptr i8, ptr %0, i64 %3
  %5 = load i16, ptr null, align 2
  %6 = trunc i32 0 to i16
  store i16 %6, ptr %4, align 2
  %7 = trunc i32 0 to i16
  %8 = getelementptr i8, ptr %4, i64 2
  store i16 %7, ptr %8, align 2
  %9 = sdiv i32 0, 1
  %10 = getelementptr i8, ptr %4, i64 4
  store i16 %1, ptr %10, align 2
  %11 = trunc i32 0 to i16
  %12 = getelementptr i8, ptr %4, i64 6
  store i16 %11, ptr %12, align 2
  %13 = add i64 %2, 1
  %14 = icmp eq i64 %13, 0
  br i1 %14, label %.loopexit.loopexit, label %.preheader
}

attributes #0 = { "target-features"="+aes,+cmov,+crc32,+cx16,+cx8,+fxsr,+mmx,+pclmul,+popcnt,+prfchw,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87" }
$ ./opt-bad -passes=loop-vectorize -S < reduced.ll
Invalid bitcast
  %26 = bitcast <4 x i16> %broadcast.splat to <8 x i16>
LLVM ERROR: Broken module found, compilation aborted!

@fhahn
Copy link
Contributor Author

fhahn commented Jul 16, 2024

@AlexHF is it possible the test case is over-reduced? I can reproduce the crash (with assertions, it triggers an assertion), but it is due to a dead load. Would it be possible to get a reproducer also without dead instructions?

@alexfh
Copy link
Contributor

alexfh commented Jul 16, 2024

The test case can easily be over-reduced. The interestingness test I used is that a known-good version of opt -passes=loop-vectorize succeeds and the same opt -passes=loop-vectorize after this commit crashes. Any advice on what I could do to verify there are no dead loads? opt -passes=loop-vectorizer,verify still succeeds on both reduced inputs.

@alexfh
Copy link
Contributor

alexfh commented Jul 16, 2024

Luckily, the code involved is open-source (a part of https://chromium.googlesource.com/angle/angle). Here's a couple of the non-reduced IR dumps I got using clang -mllvm -print-before=loop-vectorize -mllvm -print-module-scope:
err3.tar.gz,
err4.tar.gz

@fhahn
Copy link
Contributor Author

fhahn commented Jul 17, 2024

@AlexHF thanks, should be fixed in d216615

@alexfh
Copy link
Contributor

alexfh commented Jul 17, 2024

Thank you! I'll check the original failure as well.

@alexey-bataev
Copy link
Member

Looks like still crashes:

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define fastcc void @FmoInit() #0 {
entry:
  br label %for.body25.i

for.body25.i:                                     ; preds = %for.body25.i, %entry
  %indvars.iv12.i1 = phi i64 [ 0, %entry ], [ %indvars.iv.next13.i.7, %for.body25.i ]
  %div66.i = lshr i64 %indvars.iv12.i1, 1
  %arrayidx27.i = getelementptr nusw i32, ptr null, i64 %div66.i
  %0 = load i32, ptr %arrayidx27.i, align 4
  %indvars.iv.next13.i = or i64 0, 0
  %arrayidx29.i.1 = getelementptr i32, ptr null, i64 %indvars.iv.next13.i
  store i32 0, ptr %arrayidx29.i.1, align 4
  %indvars.iv.next13.i.1 = or disjoint i64 %indvars.iv12.i1, 2
  %div66.i.2 = lshr i64 %indvars.iv.next13.i.1, 1
  %arrayidx27.i.2 = getelementptr i32, ptr null, i64 %div66.i.2
  %1 = load i32, ptr %arrayidx27.i.2, align 4
  %indvars.iv.next13.i.7 = add i64 %indvars.iv12.i1, 8
  %niter144.ncmp.7 = icmp eq i64 %indvars.iv.next13.i.7, 0
  br i1 %niter144.ncmp.7, label %FmoGenerateMbToSliceGroupMap.exit.loopexit124.unr-lcssa.loopexit, label %for.body25.i

FmoGenerateMbToSliceGroupMap.exit.loopexit124.unr-lcssa.loopexit: ; preds = %for.body25.i
  ret void
}

attributes #0 = { "min-legal-vector-width"="0" "target-cpu"="cascadelake" }

development/bin/opt -S --passes=loop-vectorize ./reduced.ll

@fhahn
Copy link
Contributor Author

fhahn commented Jul 17, 2024

@alexey-bataev would it be possible to share the unreduced input without an unused load in the loop?

@alexey-bataev
Copy link
Member

@alexey-bataev would it be possible to share the unreduced input without an unused load in the loop?

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define fastcc void @FmoInit(ptr %call.i, i64 %unroll_iter143) #0 {
entry:
  %call.i10 = load volatile ptr, ptr null, align 8
  br label %for.body25.i

for.body25.i:                                     ; preds = %for.body25.i, %entry
  %indvars.iv12.i = phi i64 [ 0, %entry ], [ %indvars.iv.next13.i.7, %for.body25.i ]
  %div66.i = lshr i64 %indvars.iv12.i, 1
  %arrayidx27.i = getelementptr nusw i32, ptr %call.i, i64 %div66.i
  %0 = load i32, ptr %arrayidx27.i, align 4
  %arrayidx29.i = getelementptr i32, ptr %call.i10, i64 %indvars.iv12.i
  store i32 %0, ptr %arrayidx29.i, align 4
  %indvars.iv.next13.i = or i64 %indvars.iv12.i, 1
  %arrayidx29.i.1 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i
  store i32 0, ptr %arrayidx29.i.1, align 4
  %indvars.iv.next13.i.1 = or i64 %indvars.iv12.i, 2
  %div66.i.2 = lshr i64 %indvars.iv.next13.i.1, 1
  %arrayidx27.i.2 = getelementptr i32, ptr %call.i, i64 %div66.i.2
  %1 = load i32, ptr %arrayidx27.i.2, align 4
  %arrayidx29.i.2 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.1
  store i32 %1, ptr %arrayidx29.i.2, align 4
  %indvars.iv.next13.i.2 = or i64 %indvars.iv12.i, 3
  %arrayidx29.i.3 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.2
  store i32 0, ptr %arrayidx29.i.3, align 4
  %indvars.iv.next13.i.3 = or i64 %indvars.iv12.i, 4
  %arrayidx27.i.4 = getelementptr i32, ptr %call.i, i64 %indvars.iv12.i
  %2 = load i32, ptr %arrayidx27.i.4, align 4
  %arrayidx29.i.4 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.3
  store i32 %2, ptr %arrayidx29.i.4, align 4
  %indvars.iv.next13.i.4 = or i64 %indvars.iv12.i, 5
  %arrayidx29.i.5 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.4
  store i32 0, ptr %arrayidx29.i.5, align 4
  %indvars.iv.next13.i.5 = or i64 %indvars.iv12.i, 6
  %arrayidx29.i.6 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.5
  store i32 0, ptr %arrayidx29.i.6, align 4
  %indvars.iv.next13.i.6 = or i64 %indvars.iv12.i, 7
  %arrayidx29.i.7 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.6
  store i32 0, ptr %arrayidx29.i.7, align 4
  %indvars.iv.next13.i.7 = add nsw i64 %indvars.iv12.i, 8
  %niter144.ncmp.7 = icmp eq i64 %indvars.iv12.i, %unroll_iter143
  br i1 %niter144.ncmp.7, label %FmoGenerateMbToSliceGroupMap.exit.loopexit124.unr-lcssa.loopexit, label %for.body25.i

FmoGenerateMbToSliceGroupMap.exit.loopexit124.unr-lcssa.loopexit: ; preds = %for.body25.i
  ret void
}

attributes #0 = { "min-legal-vector-width"="0" "target-cpu"="cascadelake" }

Try this one

@alexfh
Copy link
Contributor

alexfh commented Jul 17, 2024

Thank you! I'll check the original failure as well.

The code from #92555 (comment) compiles without errors now. I checked an assertions-enabled Clang build as well.

@fhahn
Copy link
Contributor Author

fhahn commented Jul 17, 2024

@alexey-bataev would it be possible to share the unreduced input without an unused load in the loop?

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define fastcc void @FmoInit(ptr %call.i, i64 %unroll_iter143) #0 {
entry:
  %call.i10 = load volatile ptr, ptr null, align 8
  br label %for.body25.i

for.body25.i:                                     ; preds = %for.body25.i, %entry
  %indvars.iv12.i = phi i64 [ 0, %entry ], [ %indvars.iv.next13.i.7, %for.body25.i ]
  %div66.i = lshr i64 %indvars.iv12.i, 1
  %arrayidx27.i = getelementptr nusw i32, ptr %call.i, i64 %div66.i
  %0 = load i32, ptr %arrayidx27.i, align 4
  %arrayidx29.i = getelementptr i32, ptr %call.i10, i64 %indvars.iv12.i
  store i32 %0, ptr %arrayidx29.i, align 4
  %indvars.iv.next13.i = or i64 %indvars.iv12.i, 1
  %arrayidx29.i.1 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i
  store i32 0, ptr %arrayidx29.i.1, align 4
  %indvars.iv.next13.i.1 = or i64 %indvars.iv12.i, 2
  %div66.i.2 = lshr i64 %indvars.iv.next13.i.1, 1
  %arrayidx27.i.2 = getelementptr i32, ptr %call.i, i64 %div66.i.2
  %1 = load i32, ptr %arrayidx27.i.2, align 4
  %arrayidx29.i.2 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.1
  store i32 %1, ptr %arrayidx29.i.2, align 4
  %indvars.iv.next13.i.2 = or i64 %indvars.iv12.i, 3
  %arrayidx29.i.3 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.2
  store i32 0, ptr %arrayidx29.i.3, align 4
  %indvars.iv.next13.i.3 = or i64 %indvars.iv12.i, 4
  %arrayidx27.i.4 = getelementptr i32, ptr %call.i, i64 %indvars.iv12.i
  %2 = load i32, ptr %arrayidx27.i.4, align 4
  %arrayidx29.i.4 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.3
  store i32 %2, ptr %arrayidx29.i.4, align 4
  %indvars.iv.next13.i.4 = or i64 %indvars.iv12.i, 5
  %arrayidx29.i.5 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.4
  store i32 0, ptr %arrayidx29.i.5, align 4
  %indvars.iv.next13.i.5 = or i64 %indvars.iv12.i, 6
  %arrayidx29.i.6 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.5
  store i32 0, ptr %arrayidx29.i.6, align 4
  %indvars.iv.next13.i.6 = or i64 %indvars.iv12.i, 7
  %arrayidx29.i.7 = getelementptr i32, ptr %call.i10, i64 %indvars.iv.next13.i.6
  store i32 0, ptr %arrayidx29.i.7, align 4
  %indvars.iv.next13.i.7 = add nsw i64 %indvars.iv12.i, 8
  %niter144.ncmp.7 = icmp eq i64 %indvars.iv12.i, %unroll_iter143
  br i1 %niter144.ncmp.7, label %FmoGenerateMbToSliceGroupMap.exit.loopexit124.unr-lcssa.loopexit, label %for.body25.i

FmoGenerateMbToSliceGroupMap.exit.loopexit124.unr-lcssa.loopexit: ; preds = %for.body25.i
  ret void
}

attributes #0 = { "min-legal-vector-width"="0" "target-cpu"="cascadelake" }

Try this one

Thanks, should be fixed now!

@nikic
Copy link
Contributor

nikic commented Jul 17, 2024

Given the assertion whack-a-mole, should this change be reverted until after LLVM 19 has branched early next week? (Or maybe just disable the consistency assertion?)

@alexfh
Copy link
Contributor

alexfh commented Jul 17, 2024

@alexey-bataev would it be possible to share the unreduced input without an unused load in the loop?

Try this one

Thanks, should be fixed now!

Which commit is the fix for this? 2bb65660ae8b9b2e1896b07b881505a4ffc0393b?

fhahn added a commit to fhahn/llvm-project that referenced this pull request Jul 18, 2024
This patch adds a new temporary option to still use the legacy cost
model after llvm#92555.
It defaults to false and the only intended use is to adjust the
default to true in the soon-to-be-cut release branch.
@fhahn
Copy link
Contributor Author

fhahn commented Jul 18, 2024

Given the assertion whack-a-mole, should this change be reverted until after LLVM 19 has branched early next week? (Or maybe just disable the consistency assertion?)

How about an option to enable the legacy behavior, off in main and enabled on the release branch? (#99536)

@fhahn
Copy link
Contributor Author

fhahn commented Jul 18, 2024

@alexey-bataev would it be possible to share the unreduced input without an unused load in the loop?

Try this one

Thanks, should be fixed now!

Which commit is the fix for this? 2bb65660ae8b9b2e1896b07b881505a4ffc0393b?

Yep

fhahn added a commit that referenced this pull request Jul 19, 2024
This patch adds a new temporary option to still use the legacy cost
model after #92555. It defaults
to false and the only intended use is to adjust the default to true in
the soon-to-be-cut release branch.

PR: #99536
fhahn added a commit to fhahn/llvm-project that referenced this pull request Jul 23, 2024
As discussed in  llvm#92555 flip
the default for the option added in
llvm#99536 to true.

This restores the original behavior for the release branch to give the
VPlan-based cost model more time to mature on main.
tru pushed a commit that referenced this pull request Jul 23, 2024
As discussed in  #92555 flip
the default for the option added in
#99536 to true.

This restores the original behavior for the release branch to give the
VPlan-based cost model more time to mature on main.
sgundapa pushed a commit to sgundapa/upstream_effort that referenced this pull request Jul 23, 2024
This patch adds a new temporary option to still use the legacy cost
model after llvm#92555. It defaults
to false and the only intended use is to adjust the default to true in
the soon-to-be-cut release branch.

PR: llvm#99536
yuxuanchen1997 pushed a commit that referenced this pull request Jul 25, 2024
Summary:
This patch adds a new temporary option to still use the legacy cost
model after #92555. It defaults
to false and the only intended use is to adjust the default to true in
the soon-to-be-cut release branch.

PR: #99536

Test Plan: 

Reviewers: 

Subscribers: 

Tasks: 

Tags: 


Differential Revision: https://phabricator.intern.facebook.com/D60251192
EthanLuisMcDonough pushed a commit to EthanLuisMcDonough/llvm-project that referenced this pull request Aug 13, 2024
This adds a new interface to compute the cost of recipes, VPBasicBlocks,
VPRegionBlocks and VPlan, initially falling back to the legacy cost model
for all recipes. Follow-up patches will gradually migrate recipes to 
compute their own costs step-by-step.

It also adds getBestPlan function to LVP which computes the cost of all
VPlans and picks the most profitable one together with the most
profitable VF.

The VPlan selected by the VPlan cost model is executed and there is an
assert to catch cases where the VPlan cost model and the legacy cost
model disagree. Even though I checked a number of different build
configurations on AArch64 and X86, there may be some differences
that have been missed.

Additional discussions and context can be found in @arcbbb's
llvm#67647 and 
llvm#67934 which is an earlier
version of the current PR.


PR: llvm#92555
EthanLuisMcDonough pushed a commit to EthanLuisMcDonough/llvm-project that referenced this pull request Aug 13, 2024
EthanLuisMcDonough pushed a commit to EthanLuisMcDonough/llvm-project that referenced this pull request Aug 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.