The replica count of target pods fluctuates when fallback is triggered in scaling-modifier #5666

Closed
Tracked by #5671
SpiritZhou opened this issue Apr 7, 2024 · 11 comments · Fixed by #5684
Labels
bug Something isn't working

Comments

@SpiritZhou (Contributor)

Report

If one of the scalers encounters an error while scaling modifiers are in use, the replica count does not remain stable at the fallback value. Instead, it fluctuates between 1 and the fallback value.

Expected Behavior

The replica count of the target pods stays at the fallback value when a scaler encounters an error.

Actual Behavior

The replica count of the target pods keeps fluctuating between 1 and the fallback value.

Steps to Reproduce the Problem

  1. Run the fallback template in scaling_modifiers_test.go, scale metricsServerDeployment to 0 to trigger the fallback, and keep it at 0 for a while.

Logs from KEDA operator

keda-keda-operator-5789f449c4-dprm4-1712480741312929239.log

KEDA Version

2.13.1

Kubernetes Version

1.27

Platform

Other

Scaler Details

No response

Anything else?

No response

SpiritZhou added the bug label on Apr 7, 2024
@JorTurFer (Member)

Definitely it shouldn't happen; the fallback should always be applied. Could you take a look?

@SpiritZhou (Contributor, Author)

After a quick check, I think the metric being provided to the metrics server is incorrect.

When there is no fallback, the metric is a composite metric.
[Screenshot 2024-04-08 110419]

However, when a fallback occurs, the metric changes to separate per-scaler metrics whose values are not equal to the fallback value. These metrics can cause the HPA to scale to 1.
[Screenshot 2024-04-08 110800]

Meanwhile, the doFallbackScaling() function continues to run, which causes the target pods to fluctuate between 1 and the fallback value.
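
To make the arithmetic concrete, here is a minimal runnable sketch (the numbers are illustrative choices of mine, not values from this issue) of the value KEDA must report so that the HPA's AverageValue algorithm holds the fallback replica count:

package main

import (
	"fmt"
	"math"
)

func main() {
	// Illustrative numbers (assumptions, not taken from the logs):
	target := 2.0           // scalingModifiers.Target — the composite metric target
	fallbackReplicas := 5.0 // spec.fallback.replicas

	// The single composite metric value KEDA should report during fallback:
	metricValue := target * fallbackReplicas // 10

	// The HPA's AverageValue algorithm then computes a stable desired count:
	desired := math.Ceil(metricValue / target)
	fmt.Println(desired) // 5 — pinned at the fallback value

	// Reporting raw per-scaler metrics instead (e.g. a value of 2) makes the
	// HPA compute ceil(2 / 2) = 1, which then fights doFallbackScaling().
}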

@zroubalik (Member)

@SpiritZhou good catch, I also believe this is the root of the problem.

@SpiritZhou (Contributor, Author)

There is another bug in doFallback(): metricSpec.External.Target.AverageValue will be 0 when the scaling modifier is active, so the correct fallback value cannot be calculated. Should it be scaledObject.Spec.Advanced.ScalingModifiers.Target? @zroubalik @JorTurFer

func doFallback(scaledObject *kedav1alpha1.ScaledObject, metricSpec v2.MetricSpec, metricName string, suppressedError error) []external_metrics.ExternalMetricValue {
	replicas := int64(scaledObject.Spec.Fallback.Replicas)
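	// Bug noted above: when scalingModifiers are active, the AverageValue
	// read below is 0, so the fallback metric computed from it is 0 as well.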
	normalisationValue := metricSpec.External.Target.AverageValue.AsApproximateFloat64()
	metric := external_metrics.ExternalMetricValue{
		MetricName: metricName,
		Value:      *resource.NewMilliQuantity(int64(normalisationValue*1000)*replicas, resource.DecimalSI),
		Timestamp:  metav1.Now(),
	}
	fallbackMetrics := []external_metrics.ExternalMetricValue{metric}

	log.Info("Suppressing error, falling back to fallback.replicas", "scaledObject.Namespace", scaledObject.Namespace, "scaledObject.Name", scaledObject.Name, "suppressedError", suppressedError, "fallback.replicas", replicas)
	return fallbackMetrics
}

@JorTurFer (Member)

Yes! We should use scaledObject.Spec.Advanced.ScalingModifiers.Target there when scalingModifiers are in use.
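
A minimal sketch of what that could look like inside doFallback() — IsUsingModifiers() is my assumption of how the check is spelled, ScalingModifiers.Target is assumed to be the string-typed composite target, and error handling is omitted:

// Sketch only: pick the normalisation value from the composite target when
// scalingModifiers are active, since AverageValue is 0 in that case.
// Requires "strconv" in the imports.
normalisationValue := metricSpec.External.Target.AverageValue.AsApproximateFloat64()
if scaledObject.IsUsingModifiers() { // assumed helper name for "scalingModifiers set"
	normalisationValue, _ = strconv.ParseFloat(scaledObject.Spec.Advanced.ScalingModifiers.Target, 64)
}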

@SpiritZhou (Contributor, Author)

If the user sets a failureThreshold, the pods will continue to fluctuate until the number of failures exceeds the threshold, because doFallback() will not be called before then. Should we prevent the user from setting failureThreshold while using the formula? @JorTurFer

@zroubalik (Member)

> If the user sets a failureThreshold, the pods will continue to fluctuate until the number of failures…

Could you please elaborate?

@SpiritZhou (Contributor, Author)

In the fallback logic there is a comparison between healthStatus.NumberOfFailures and scaledObject.Spec.Fallback.FailureThreshold. healthStatus.NumberOfFailures increases each round, and the function returns directly, without generating a fallback metric value, until healthStatus.NumberOfFailures reaches scaledObject.Spec.Fallback.FailureThreshold.
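
For reference, a simplified paraphrase of that gate (a sketch of the behavior described here, not the literal source; field types are simplified):

// Below the threshold, no fallback metric is generated, so with a composite
// scaler the remaining healthy metrics are still served to the HPA.
if healthStatus.NumberOfFailures < scaledObject.Spec.Fallback.FailureThreshold {
	return nil, suppressedError
}
return doFallback(scaledObject, metricSpec, metricName, suppressedError), nil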

With a composite scaler, this returns wrong metrics to the HPA while healthStatus.NumberOfFailures is below scaledObject.Spec.Fallback.FailureThreshold. But at the same time, KEDA keeps scaling the target to the fallback value, resulting in fluctuations in the replica count of the target pods.

For example, take a composite scaler made up of a workload scaler and a metrics-api scaler, with FailureThreshold set to 3 and the fallback value set to 5. If the metrics-api scaler encounters an error, this logic returns only the workload metrics to the HPA for the next 3 rounds. But KEDA scales the target to 5 in RequestScale() at the same time. As a result, the target fluctuates between 5 and the wrong replica count for the following 45 seconds (assuming the HPA's metric request interval is 15 seconds).

[Screenshot 2024-04-12 105424]

In fact, I don't quite understand the purpose of the fallback failureThreshold. Even with a normal, non-composite scaler, this logic just returns an error to the HPA, which does not trigger any scaling, while RequestScale() has already triggered the fallback logic to scale the target to the fallback replica count. Is there any situation that needs the failureThreshold?

@zroubalik (Member)

Hmm, probably some glitch in the logic. The scaler (no matter if composite or single) should, in case of errors, report the errors to the HPA normally, and once the failure threshold is reached it should report the fallback number.

@SpiritZhou (Contributor, Author)

> Hmm, probably some glitch in the logic. The scaler (no matter if composite or single) should, in case of errors, report the errors to the HPA normally, and once the failure threshold is reached it should report the fallback number.

What errors should be reported to the HPA when one of the composite scalers encounters an error? Currently it reports the normal metrics from the healthy scalers.

@zroubalik (Member)

> Hmm, probably some glitch in the logic. The scaler (no matter if composite or single) should, in case of errors, report the errors to the HPA normally, and once the failure threshold is reached it should report the fallback number.

> What errors should be reported to the HPA when one of the composite scalers encounters an error? Currently it reports the normal metrics from the healthy scalers.

I think it should report a new error stating that we weren't able to calculate the composite metric, and then attach the failure from the specific scaler.
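
For illustration, such an error could be wrapped along these lines (a sketch of the idea, not the merged change):

// Sketch: surface a composite-level error and attach the scaler's failure.
return nil, fmt.Errorf("failed to calculate composite metric for ScaledObject %s/%s: %w",
	scaledObject.Namespace, scaledObject.Name, err)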
