
cloud-controller-manager metrics port and address are not accessible or documented #912

Closed
1 task done
bgagnon opened this issue Jan 24, 2020 · 10 comments · Fixed by #1178
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@bgagnon

bgagnon commented Jan 24, 2020

Is this a BUG REPORT or FEATURE REQUEST?: Feature

/kind feature

The binaries affected:

  • openstack-cloud-controller-manager

What happened:

I wanted to access the internal metrics of the CCM, just like the in-tree controller-manager which previously served these metrics.

What you expected to happen:

Metrics similar to those of the standard controller-manager (and more).

Anything else we need to know?:

It is not clear how to access the Prometheus metrics for the openstack-cloud-controller-manager binary. There are no documented flags or port number.

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 31, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 30, 2020
@mikejoh
Contributor

mikejoh commented Jun 1, 2020

@bgagnon I haven't been running the in-tree CCM and don't know exactly what it exposed in terms of metrics, but there are a couple of metrics you can get out of the OpenStack (external) CCM. I think there's room for improvement here, not only in the docs but also code-wise, by exporting more metrics (if feasible).

The OCCM listens on port 10258; more precisely, the upstream k/k cloud-controller-manager is assigned this port automatically, and the OCCM imports (and vendors) the packages it needs to run and behave as an external cloud provider (CCM). If that makes any sense.

I did some local tests in one of our clusters (v1.17.0 of the OCCM). On one of the masters I could curl the https://localhost:10258/metrics endpoint, which returned the following metrics (names and types only; I'm leaving out the actual values):

# TYPE apiserver_audit_requests_rejected_total counter
# TYPE apiserver_client_certificate_expiration_seconds histogram
# TYPE apiserver_storage_data_key_generation_duration_seconds histogram
# TYPE apiserver_storage_data_key_generation_failures_total counter
# TYPE apiserver_storage_data_key_generation_latencies_microseconds histogram
# TYPE apiserver_storage_envelope_transformation_cache_misses_total counter
# TYPE authenticated_user_requests counter
# TYPE authentication_attempts counter
# TYPE authentication_duration_seconds histogram
# TYPE go_gc_duration_seconds summary
# TYPE go_goroutines gauge
# TYPE go_info gauge
# TYPE go_memstats_alloc_bytes gauge
# TYPE go_memstats_alloc_bytes_total counter
# TYPE go_memstats_buck_hash_sys_bytes gauge
# TYPE go_memstats_frees_total counter
# TYPE go_memstats_gc_cpu_fraction gauge
# TYPE go_memstats_gc_sys_bytes gauge
# TYPE go_memstats_heap_alloc_bytes gauge
# TYPE go_memstats_heap_idle_bytes gauge
# TYPE go_memstats_heap_inuse_bytes gauge
# TYPE go_memstats_heap_objects gauge
# TYPE go_memstats_heap_released_bytes gauge
# TYPE go_memstats_heap_sys_bytes gauge
# TYPE go_memstats_last_gc_time_seconds gauge
# TYPE go_memstats_lookups_total counter
# TYPE go_memstats_mallocs_total counter
# TYPE go_memstats_mcache_inuse_bytes gauge
# TYPE go_memstats_mcache_sys_bytes gauge
# TYPE go_memstats_mspan_inuse_bytes gauge
# TYPE go_memstats_mspan_sys_bytes gauge
# TYPE go_memstats_next_gc_bytes gauge
# TYPE go_memstats_other_sys_bytes gauge
# TYPE go_memstats_stack_inuse_bytes gauge
# TYPE go_memstats_stack_sys_bytes gauge
# TYPE go_memstats_sys_bytes gauge
# TYPE go_threads gauge
# TYPE kubernetes_build_info gauge
# TYPE process_cpu_seconds_total counter
# TYPE process_max_fds gauge
# TYPE process_open_fds gauge
# TYPE process_resident_memory_bytes gauge
# TYPE process_start_time_seconds gauge
# TYPE process_virtual_memory_bytes gauge
# TYPE process_virtual_memory_max_bytes gauge
# TYPE rest_client_request_duration_seconds histogram
# TYPE rest_client_request_latency_seconds histogram
# TYPE rest_client_requests_total counter

There should actually be two other metrics registered with the OCCM:

  • cloudprovider_openstack_api_requests_duration_seconds
  • cloudprovider_openstack_api_requests_errors

They're not in the list above. I can see that this function:

should register these metrics, and the init() here:

should have triggered a call to the RegisterMetrics() function. I'm not sure what's going on there; I'll have to test some more in our environments, unless someone else has an idea. I can definitely try to expand the docs with information on how to monitor the OCCM.
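One possible explanation (an assumption, not verified against the OCCM source): Go only runs a package's init() functions if that package is actually imported somewhere in the binary, so a missing (blank) import of the metrics package would silently skip registration. A stdlib-only sketch of the once-guarded pattern, with a plain slice standing in for the real Prometheus registry:

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical stand-in for the real Prometheus registry.
var registry []string

var registerMetrics sync.Once

// RegisterMetrics registers the OpenStack metrics exactly once, no matter
// how many callers invoke it.
func RegisterMetrics() {
	registerMetrics.Do(func() {
		registry = append(registry,
			"cloudprovider_openstack_api_requests_duration_seconds",
			"cloudprovider_openstack_api_requests_errors",
		)
	})
}

// init runs only if this package is linked into the binary via an import
// (a blank `import _ "..."` is enough); with no import at all, the
// metrics are never registered.
func init() {
	RegisterMetrics()
}

func main() {
	RegisterMetrics() // safe to call again; sync.Once makes it a no-op
	fmt.Println(len(registry)) // 2
}
```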

Hopefully that clears things up a bit, and thanks for noticing!

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@seanschneeweiss
Contributor

seanschneeweiss commented Aug 12, 2020

/reopen
/remove-lifecycle rotten

We can see that the metrics code was added in PR kubernetes/kubernetes#46008.

Observing durations and incrementing counters was implemented here:
https://github.com/NickrenREN/kubernetes/blob/18852c58c17a6dcd75912057c0e113b469bb2ced/pkg/cloudprovider/providers/openstack/openstack_volumes.go#L491-L498

The time measurement of the API call was implemented here:
https://github.com/NickrenREN/kubernetes/blob/18852c58c17a6dcd75912057c0e113b469bb2ced/pkg/cloudprovider/providers/openstack/openstack_volumes.go#L86-L100

Because the volume-related code was removed in PR #1036, these metrics are no longer present. Metrics only get exposed if a value exists.

Now it would be desirable to create new metrics similar to the implementation that already existed.

We propose changing openstack_metrics.go to something like the following (using legacyregistry, as most other Kubernetes components do):

package openstack

import (
	"sync"

	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

const (
	openstackSubsystem         = "openstack"
	openstackOperationKey      = "cloudprovider_openstack_api_request_duration_seconds"
	openstackOperationErrorKey = "cloudprovider_openstack_api_request_errors"
)

var (
	openstackOperationsLatency = metrics.NewHistogramVec(
		&metrics.HistogramOpts{
			Subsystem: openstackSubsystem,
			Name:      openstackOperationKey,
			Help:      "Latency of openstack api call",
		},
		[]string{"request"},
	)

	openstackAPIRequestErrors = metrics.NewCounterVec(
		&metrics.CounterOpts{
			Subsystem: openstackSubsystem,
			Name:      openstackOperationErrorKey,
			Help:      "Cumulative number of OpenStack API call errors",
		},
		[]string{"request"},
	)
)

var registerMetrics sync.Once

// RegisterMetrics registers the OpenStack metrics.
func RegisterMetrics() {
	registerMetrics.Do(func() {
		legacyregistry.MustRegister(openstackOperationsLatency)
		legacyregistry.MustRegister(openstackAPIRequestErrors)
	})
}

A PR might follow for metrics related to openstack_loadbalancer.go.

Sean Schneeweiss sean.schneeweiss@daimler.com, Daimler TSS GmbH, legal info/Impressum

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 12, 2020
@seanschneeweiss
Contributor

/reopen

@k8s-ci-robot
Contributor

@seanschneeweiss: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen


@lingxiankong
Contributor

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Aug 19, 2020
@k8s-ci-robot
Contributor

@lingxiankong: Reopened this issue.

In response to this:

/reopen

