Oscillating Pod counts when misconfigured replicas #5288

Closed
BojanZelic opened this issue Dec 14, 2023 · 1 comment · Fixed by #5289
Labels
bug Something isn't working

BojanZelic commented Dec 14, 2023

Report

I observed oscillating pod counts because both the HPA controller and the ScaledObject controller attempt to update the scale subresource of a Deployment when the min & max replica counts are misconfigured.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-deployment
spec:
  maxReplicaCount: 50
  minReplicaCount: 4
  scaleTargetRef:
    kind: Deployment
    name: my-deployment
  triggers:
  - metadata:
      value: "45"
    metricType: Utilization
    type: cpu
  - metadata:
      desiredReplicas: "15"
      end: 0 12 * * SUN-SAT
      start: 0 7 * * SUN-SAT
      timezone: America/Los_Angeles
    type: cron
  - metadata:
      desiredReplicas: "26"
      end: 0 18 * * SUN-SAT
      start: 0 12 * * SUN-SAT
      timezone: America/Los_Angeles
    type: cron

Here's the order of events:

  • User updated minReplicas & maxReplicas to 27, which in turn updated the HPA and, eventually, the Deployment's replica count
  • User then misconfigured the replica counts by setting minReplicas = 50 and maxReplicas = 10; this meant that the HPA object wasn't updated
  • KEDA then attempted to scale the deployment to 50 replicas in its reconcile loop, and the HPA reverted it back to 27. This happened every 5 minutes (illustrated in the sketch below)
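
To make the tug-of-war concrete, here's a rough Go simulation of the interaction described above. This is not KEDA or HPA source code, just an illustration using the numbers from this report:

package main

import "fmt"

// Rough simulation of the reconcile interaction; not KEDA or HPA source code,
// just an illustration using the numbers from this report.
func main() {
	replicas := int32(27)         // last value the HPA applied (the old max)
	misconfiguredMin := int32(50) // new minReplicaCount on the ScaledObject
	hpaMaxReplicas := int32(27)   // the HPA object was never updated, since min > max

	for cycle := 1; cycle <= 3; cycle++ {
		// KEDA operator reconcile: scale the target up to minReplicaCount via /scale.
		replicas = misconfiguredMin
		fmt.Printf("cycle %d: keda-operator sets replicas=%d\n", cycle, replicas)

		// HPA controller: current replicas are above Spec.MaxReplicas, so rescale down.
		if replicas > hpaMaxReplicas {
			replicas = hpaMaxReplicas
			fmt.Printf("cycle %d: HPA rescales to %d (above Spec.MaxReplicas)\n", cycle, replicas)
		}
	}
}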

Expected Behavior

I would expect replica counts to stay at 27 because the HPA can't be updated.
In addition, I would expect the ScaledObject webhook to enforce valid min & max replica settings; updates with invalid settings should be rejected.

Actual Behavior

  • The KEDA operator would update the /scale subresource and set the replica count to 50
  • The HPA would override this and set it back to 27
  • This behavior repeated over and over until the ScaledObject was fixed so that minReplicas was <= maxReplicas

Steps to Reproduce the Problem

I haven't been able to reproduce it since it first occurred :-(

Logs from KEDA operator

n/a - lost the logs :-(

Here's a k8s audit log entry from when the deployment was scaled by KEDA:

{
  annotations: {
    authorization.k8s.io/decision: 'allow',
    authorization.k8s.io/reason:   'RBAC: allowed by ClusterRoleBinding "keda-operator" of ClusterRole "keda-operator" to ServiceAccount "keda-operator/infra"',
  },
  apiVersion:               'audit.k8s.io/v1',
  auditID:                  '9840e292-822c-43df-8f4f-378ab0b8fd79',
  kind:                     'Event',
  level:                    'RequestResponse',
  message:                  '',
  objectRef: {
    apiGroup:        'apps',
    apiVersion:      'v1',
    name:            'my-deployment',
    namespace:       'default',
    resource:        'deployments',
    resourceVersion: '3635019749',
    subresource:     'scale',
    uid:             'cbc8124a-b008-4336-a5bf-784082c5d310',
  },
  requestObject: {
    apiVersion: 'autoscaling/v1',
    kind:       'Scale',
    metadata: {
      creationTimestamp: '2022-08-09T22:33:21Z',
      name:              'my-deployment',
      namespace:         'default',
      resourceVersion:   '3635019749',
      uid:               'cbc8124a-b008-4336-a5bf-784082c5d310',
    },
    spec: {
      replicas: 50,
    },
    status: {
      replicas: 27,
      selector: 'app=my-app,service=my-deployment',
    },
  },
  requestReceivedTimestamp: '2023-12-14T07:54:46.539407Z',
  requestURI:               '/apis/apps/v1/namespaces/default/deployments/my-deployment/scale?timeout=32s',
  responseObject: {
    apiVersion: 'autoscaling/v1',
    kind:       'Scale',
    metadata: {
      creationTimestamp: '2022-08-09T22:33:21Z',
      name:              'my-deployment',
      namespace:         'default',
      resourceVersion:   '3635022604',
      uid:               'cbc8124a-b008-4336-a5bf-784082c5d310',
    },
    spec: {
      replicas: 50,
    },
    status: {
      replicas: 27,
      selector: 'app=my-app,service=my-deployment',
    },
  },
  responseStatus: {
    code:     200,
    metadata: {
    },
  },
  sourceIPs: [
    '10.21.19.216',
  ],
  stage:                    'ResponseComplete',
  stageTimestamp:           '2023-12-14T07:54:46.545965Z',
  time:                     '2023-12-14T07:54:46.546119559Z',
  user: {
    extra: {
      authentication.kubernetes.io/pod-name: [
        'keda-operator-79b499d775-6sr7n',
      ],
      authentication.kubernetes.io/pod-uid: [
        'bcc0603a-6af2-45cd-b1c6-dd70514f95d1',
      ],
    },
    groups: [
      'system:serviceaccounts',
      'system:serviceaccounts:infra',
      'system:authenticated',
    ],
    uid:      '7381072e-aed9-4a70-b970-bd6cc3280b9a',
    username: 'system:serviceaccount:infra:keda-operator',
  },
  userAgent:                'keda/v0.0.0 (linux/amd64) kubernetes/$Format',
  verb:                     'update',
}

and the HPA attempting to scale it back:


2023-12-14 01:10:50.000 | cluster=cell-002-00 kind=HorizontalPodAutoscaler name=keda-hpa-my-deployment reason=SuccessfulRescale New size: 27; reason: Current number of replicas above Spec.MaxReplicas



KEDA Version

2.12.1

Kubernetes Version

1.27

Platform

None

Scaler Details

No response

Anything else?

I can't seem to reproduce it. Either way, the validating webhook should refuse to create or update a ScaledObject where minReplicas > maxReplicas, to prevent this misconfiguration.
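
Something along these lines would be enough (a rough sketch, not KEDA's actual webhook implementation; the MinReplicaCount/MaxReplicaCount fields mirror the spec fields used above, everything else is hypothetical):

package main

import "fmt"

// ScaledObjectSpec holds only the two fields relevant to this check; the real
// KEDA type has many more. Both are optional pointers, as in the CRD.
type ScaledObjectSpec struct {
	MinReplicaCount *int32
	MaxReplicaCount *int32
}

// validateReplicaCounts is a hypothetical admission check (not KEDA's actual
// webhook code): deny any create or update where minReplicaCount > maxReplicaCount.
func validateReplicaCounts(spec ScaledObjectSpec) error {
	if spec.MinReplicaCount != nil && spec.MaxReplicaCount != nil &&
		*spec.MinReplicaCount > *spec.MaxReplicaCount {
		return fmt.Errorf("minReplicaCount (%d) must be less than or equal to maxReplicaCount (%d)",
			*spec.MinReplicaCount, *spec.MaxReplicaCount)
	}
	return nil
}

func main() {
	min, max := int32(50), int32(10) // the misconfiguration from this report
	if err := validateReplicaCounts(ScaledObjectSpec{MinReplicaCount: &min, MaxReplicaCount: &max}); err != nil {
		fmt.Println("webhook would deny the request:", err)
	}
}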

zroubalik commented Dec 15, 2023

Good catch. Thanks for tackling this!
