Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support when JetStream cluster not on kubernetes #4102

Merged
merged 1 commit into from
Mar 2, 2023

Conversation

mfadhlika
Copy link
Contributor

@mfadhlika mfadhlika commented Jan 13, 2023

I'm trying to use NATS Jetstream scaler, previously I workaround the lack of NATS Jetstream scaler by using Prometheus in middle of NATS and Keda. I run into issue with the NATS Jetstream scaler because our stack use 3 VMs for NATS Jetstream cluster and round robin load balancer.

Currently the scaler assume NATS node always have <node>.<monitoringURL> as URL for the node, which is not the case when NATS cluster is not on kubernetes. <monitoringURL>/varz endpoint respond with

curl -s -X GET "http://nats.company.internal:8222/varz" | jq .cluster
{
  "name": "nats",
  "addr": "172.0.0.2",
  "cluster_port": 4221,
  "auth_timeout": 2,
  "urls": [
    "172.0.0.3:4221", // <--- VM's IP
    "172.0.0.4:4221"
  ],
  "tls_timeout": 2
}

Because the scaler assume it's on kubernetes, it splits url by . (dot) and take first item as node and we get 172.nats.company.internal:8222 for the natsJetStreamMonitoringNodeURL as shown in code.

for _, clusterURL := range jetStreamServerResp.Cluster.HostUrls {
	node := strings.Split(clusterURL, ".")[0]
	natsJetStreamMonitoringNodeURL, err := s.getNATSJetStreamMonitoringNodeURL(node)
	if err != nil {
		return err
	}
	...
}

This PR only works if monitoring port of all node is the same as monitoringEndpoint

Checklist

Fixes #4101

@mfadhlika mfadhlika requested a review from a team as a code owner January 13, 2023 07:08
@semgrep-app
Copy link

semgrep-app bot commented Jan 13, 2023

Semgrep found 6 sprintf-host-port findings:

use net.JoinHostPort instead of fmt.Sprintf($XX, nodeHostname)

Created by sprintf-host-port.

Copy link
Member

@zroubalik zroubalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unit tests are faling:

=== RUN   TestNATSJetStreamIsActive
    nats_jetstream_scaler_test.go:262: Expected error for 'Fail - Bad leader name (clustered)' but got success 
    nats_jetstream_scaler_test.go:266: Expected 'Fail - Bad leader name (clustered)' 'isActive=false', got 'true'
--- FAIL: TestNATSJetStreamIsActive (0.01s)
=== RUN   TestNewNATSJetStreamScaler
--- PASS: TestNewNATSJetStreamScaler (0.00s)
=== RUN   TestNATSJetStreamGetMetrics
    nats_jetstream_scaler_test.go:318: Expected error for 'Fail - Bad leader name (clustered)' but got success 
--- FAIL: TestNATSJetStreamGetMetrics (0.17s)

you can test this locally by running the individual test or make test

Also Static checks are failing because of the use of deprecated package.

CHANGELOG.md Outdated Show resolved Hide resolved
@mfadhlika
Copy link
Contributor Author

mfadhlika commented Jan 13, 2023

@zroubalik the failing unit tests probably no longer relevant because the changes don't use the consumer leader name as monitoring url anymore. Should I remove the test? or is there any other use case where bad leader name is unfavorable?

Copy link
Member

@zroubalik zroubalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mfadhlika yeah, sorry for the delay I missed the notification. Please update the unit tests to reflect the new behavior and also extend the coverage if possible.

@zroubalik
Copy link
Member

@mfadhlika any update on this please?

@mfadhlika
Copy link
Contributor Author

@zroubalik sorry, been busy past couple weeks. I'll try to update this weekend

CHANGELOG.md Outdated Show resolved Hide resolved
@zroubalik
Copy link
Member

zroubalik commented Mar 2, 2023

/run-e2e nats*
Update: You can check the progress here

Signed-off-by: Muhammad Fadhlika <git@fadhlika.com>
Copy link
Member

@zroubalik zroubalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mfadhlika thanks a lot for the contibution!

@zroubalik zroubalik merged commit 55c9c74 into kedacore:main Mar 2, 2023
xoanmm pushed a commit to xoanmm/keda that referenced this pull request Mar 22, 2023
Signed-off-by: Muhammad Fadhlika <git@fadhlika.com>
@rayjanoka
Copy link
Contributor

rayjanoka commented May 1, 2023

This change seems to have broken the scaler on kubernetes.

I'm thinking we need to put the test that was removed back and get these changes working with that.

I'll do some troubleshooting.

@mfadhlika
Copy link
Contributor Author

mfadhlika commented May 2, 2023

@rayjanoka I did read your PR before working on this. All I did is replacing node monitoring url from <node name>.<nats monitoring url> to IPs/URLs from connectUrls. I was using nats cluster with 3 node without stream replica when first testing this

@rayjanoka
Copy link
Contributor

hmm I don't see connect_urls in my /varz. Do I need to upgrade NATS or something?

➜ curl -s 'localhost:8223/varz' | grep connect_urls

@rayjanoka
Copy link
Contributor

I see, I use --no_advertise in nats which does not include these connect_urls on purpose.

https://github.com/nats-io/nats-server/blob/main/server/route.go#LL1705C2-L1709C3

@rayjanoka
Copy link
Contributor

rayjanoka commented May 2, 2023

So maybe we can add back the old way and make it backward compatible...?

I think I get this error because it can't find connect_urls.

result: runtime error: invalid memory address or nil pointer dereference

and maybe we can make the testing include my use case...

@mfadhlika
Copy link
Contributor Author

mfadhlika commented May 2, 2023

I agree we should make it backward compatible, should we look for connect_urls then fallback to old way if empty or use some flag in config?

@rayjanoka
Copy link
Contributor

yes, I think fallback is good if connect_urls doesn't exist.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NATS JetStream scaler assume the cluster runs on kubernetes
3 participants