The replica count of target pods fluctuates when fallback is triggered in scaling-modifier #5666

Closed
Tracked by #5671
SpiritZhou opened this issue Apr 7, 2024 · 11 comments · Fixed by #5684
Labels
bug Something isn't working

Comments

@SpiritZhou (Contributor)

Report

If one of the scalers encounters an error while scaling modifiers are in use, the replica count does not remain stable at the fallback value. Instead, it fluctuates between 1 and the fallback value.

Expected Behavior

The replica count of the target pods stays at the fallback value when a scaler encounters an error.

Actual Behavior

The replica count of the target pods keeps fluctuating between 1 and the fallback value.

Steps to Reproduce the Problem

  1. Run the fallback template in scaling_modifiers_test.go, scale metricsServerDeployment to 0 to trigger the fallback, and keep it at 0 for a while.

Logs from KEDA operator

keda-keda-operator-5789f449c4-dprm4-1712480741312929239.log

KEDA Version

2.13.1

Kubernetes Version

1.27

Platform

Other

Scaler Details

No response

Anything else?

No response

SpiritZhou added the bug label on Apr 7, 2024
@JorTurFer (Member)

Definitely it shouldn't happen; the fallback should always be applied. Could you take a look?

@SpiritZhou (Contributor, Author)

After a quick check, I think the metric being provided to the metrics server is incorrect.

When there is no fallback, the metric is a composite metric.
[Screenshot 2024-04-08 110419]

However, when a fallback occurs, the metric changes to separate per-scaler metrics whose values are not equal to the fallback value. These metrics can cause the HPA to scale to 1.
[Screenshot 2024-04-08 110800]

Meanwhile, the doFallbackScaling() function continues to run, which causes the target pods to fluctuate between 1 and the fallback value.
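
To make the arithmetic concrete, here is a minimal runnable sketch (the numbers are illustrative choices of mine, not values from this issue) of the value KEDA must report so that the HPA's AverageValue algorithm holds the fallback replica count:

package main

import (
	"fmt"
	"math"
)

func main() {
	// Illustrative numbers (assumptions, not taken from the logs):
	target := 2.0           // scalingModifiers.Target — the composite metric target
	fallbackReplicas := 5.0 // spec.fallback.replicas

	// The single composite metric value KEDA should report during fallback:
	metricValue := target * fallbackReplicas // 10

	// The HPA's AverageValue algorithm then computes a stable desired count:
	desired := math.Ceil(metricValue / target)
	fmt.Println(desired) // 5 — pinned at the fallback value

	// Reporting raw per-scaler metrics instead (e.g. a value of 2) makes the
	// HPA compute ceil(2 / 2) = 1, which then fights doFallbackScaling().
}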

@zroubalik (Member)

@SpiritZhou good catch, I also believe this is the root of the problem.

@SpiritZhou (Contributor, Author)

There is another bug in doFallback(): metricSpec.External.Target.AverageValue will be 0 when the scaling modifier is active, so the correct fallback value cannot be calculated. Should it be scaledObject.Spec.Advanced.ScalingModifiers.Target? @zroubalik @JorTurFer

func doFallback(scaledObject *kedav1alpha1.ScaledObject, metricSpec v2.MetricSpec, metricName string, suppressedError error) []external_metrics.ExternalMetricValue {
	replicas := int64(scaledObject.Spec.Fallback.Replicas)
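	// Bug noted above: when scalingModifiers are active, the AverageValue
	// read below is 0, so the fallback metric computed from it is 0 as well.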
	normalisationValue := metricSpec.External.Target.AverageValue.AsApproximateFloat64()
	metric := external_metrics.ExternalMetricValue{
		MetricName: metricName,
		Value:      *resource.NewMilliQuantity(int64(normalisationValue*1000)*replicas, resource.DecimalSI),
		Timestamp:  metav1.Now(),
	}
	fallbackMetrics := []external_metrics.ExternalMetricValue{metric}

	log.Info("Suppressing error, falling back to fallback.replicas", "scaledObject.Namespace", scaledObject.Namespace, "scaledObject.Name", scaledObject.Name, "suppressedError", suppressedError, "fallback.replicas", replicas)
	return fallbackMetrics
}

@JorTurFer (Member)

Yes! We should use scaledObject.Spec.Advanced.ScalingModifiers.Target there when scalingModifiers are in use.
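
A minimal sketch of what that could look like inside doFallback() — IsUsingModifiers() is my assumption of how the check is spelled, ScalingModifiers.Target is assumed to be the string-typed composite target, and error handling is omitted:

// Sketch only: pick the normalisation value from the composite target when
// scalingModifiers are active, since AverageValue is 0 in that case.
// Requires "strconv" in the imports.
normalisationValue := metricSpec.External.Target.AverageValue.AsApproximateFloat64()
if scaledObject.IsUsingModifiers() { // assumed helper name for "scalingModifiers set"
	normalisationValue, _ = strconv.ParseFloat(scaledObject.Spec.Advanced.ScalingModifiers.Target, 64)
}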

@SpiritZhou (Contributor, Author)

If the user sets a failureThreshold, the pods will continue to fluctuate until the number of failures exceeds the threshold, because doFallback() will not be called before then. Should we prevent the user from setting failureThreshold while using the formula? @JorTurFer

@zroubalik (Member)

> If the user sets a failureThreshold, the pods will continue to fluctuate until the number of failures…

Could you please elaborate?

@SpiritZhou (Contributor, Author)

In the fallback logic there is a comparison between healthStatus.NumberOfFailures and scaledObject.Spec.Fallback.FailureThreshold. healthStatus.NumberOfFailures increases each round, and the function returns directly, without generating a fallback metric value, until healthStatus.NumberOfFailures reaches scaledObject.Spec.Fallback.FailureThreshold.
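
For reference, a simplified paraphrase of that gate (a sketch of the behavior described here, not the literal source; field types are simplified):

// Below the threshold, no fallback metric is generated, so with a composite
// scaler the remaining healthy metrics are still served to the HPA.
if healthStatus.NumberOfFailures < scaledObject.Spec.Fallback.FailureThreshold {
	return nil, suppressedError
}
return doFallback(scaledObject, metricSpec, metricName, suppressedError), nil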

With a composite scaler, this returns wrong metrics to the HPA while healthStatus.NumberOfFailures is below scaledObject.Spec.Fallback.FailureThreshold. But at the same time, KEDA keeps scaling the target to the fallback value, resulting in fluctuations in the replica count of the target pods.

For example, take a composite scaler made up of a workload scaler and a metrics-api scaler, with FailureThreshold set to 3 and the fallback value set to 5. If the metrics-api scaler encounters an error, this logic returns only the workload metrics to the HPA for the next 3 rounds. But KEDA scales the target to 5 in RequestScale() at the same time. As a result, the target fluctuates between 5 and the wrong replica count for the following 45 seconds (assuming the HPA's metric request interval is 15 seconds).

[Screenshot 2024-04-12 105424]

In fact, I don't quite understand the purpose of the fallback failureThreshold. Even with a normal, non-composite scaler, this logic just returns an error to the HPA, which does not trigger any scaling, while RequestScale() has already triggered the fallback logic to scale the target to the fallback replica count. Is there any situation that needs the failureThreshold?

@zroubalik (Member)

Hmm, probably some glitch in the logic. The scaler (no matter if composite or single) should, in case of errors, report the errors to the HPA normally, and once the failure threshold is reached it should report the fallback number.

@SpiritZhou (Contributor, Author)

> Hmm, probably some glitch in the logic. The scaler (no matter if composite or single) should, in case of errors, report the errors to the HPA normally, and once the failure threshold is reached it should report the fallback number.

What errors should be reported to the HPA when one of the composite scalers encounters an error? Currently it reports the normal metrics from the healthy scalers.

@zroubalik (Member)

> Hmm, probably some glitch in the logic. The scaler (no matter if composite or single) should, in case of errors, report the errors to the HPA normally, and once the failure threshold is reached it should report the fallback number.

> What errors should be reported to the HPA when one of the composite scalers encounters an error? Currently it reports the normal metrics from the healthy scalers.

I think it should report a new error stating that we weren't able to calculate the composite metric, and then attach the failure from the specific scaler.
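
For illustration, such an error could be wrapped along these lines (a sketch of the idea, not the merged change):

// Sketch: surface a composite-level error and attach the scaler's failure.
return nil, fmt.Errorf("failed to calculate composite metric for ScaledObject %s/%s: %w",
	scaledObject.Namespace, scaledObject.Name, err)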
