Use PodMonitors/ServiceMonitors instead of annotations for configuring Prometheus scraping #623

Closed
hashemi-soroush opened this issue Jun 17, 2023 · 4 comments

@hashemi-soroush
Contributor

The issue with the current Prometheus support

Currently, the operator annotates Redis and Sentinel pods with the following annotations and expects whoever runs the operator to write a Prometheus scrape config that uses these annotations to discover the pods and configure scraping.

prometheus.io/scrape: "true"
prometheus.io/port: http
prometheus.io/path: /metrics

Writing a scrape config is tedious and low-level, so most people prefer the higher-level PodMonitor and ServiceMonitor alternatives. Moreover, the operator's maintainer sometimes has no access to the Prometheus configuration at all (not even through additional scrape configs), for example when Prometheus setup and maintenance is delegated to another team inside or outside the organization.
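
To make that concrete, below is a rough sketch of the kind of scrape job a cluster operator has to maintain for these annotations, using standard Kubernetes pod discovery and relabeling. The job name is arbitrary and the exact config depends on each Prometheus setup; note also that the operator sets prometheus.io/port to the named port http rather than a number, so the common address-rewrite rule shown here would need further adaptation.

scrape_configs:
  - job_name: redis-operator-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Honor the prometheus.io/path annotation (falls back to /metrics)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the target address from the prometheus.io/port annotation;
      # this common rule assumes a numeric port, so a named port such as
      # "http" would need different handling.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__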

The alternatives

We can use PodMonitors and ServiceMonitors instead. But there are a number of ways we can use them and we need to decide which one is best for the target users of this operator.

We can create a single PodMonitor/ServiceMonitor that selects the Redis and Sentinel pods of all RedisFailover instances, but this limits our ability to customize the scraping configuration per RedisFailover instance (some may need to be scraped every second, while a 1-minute interval suffices for the rest).

We can make it more configurable by creating one PodMonitor per RedisFailover. That fixes the above problem, but in most cases Sentinels don't need to be scraped as often as the Redis instances.

Creating one PodMonitor for the Sentinels and one for the Redis instances of each RedisFailover fixes that, but we may still want to scrape the Redis master pod more often than the slaves.

And finally, we can have three PodMonitors per RedisFailover: one for the Redis master, one for the Redis slaves, and one for the Sentinels. The problem with this solution is that it crowds the cluster with objects, especially if, due to an organizational standard, all Redis instances must be created in a single designated Kubernetes namespace.
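
For reference, each of the per-RedisFailover PodMonitors discussed above would look roughly like the following sketch; the name, namespace, label selectors, and port name are illustrative assumptions, not necessarily the operator's actual values.

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: rf-myredis-redis          # hypothetical name, one PodMonitor per RedisFailover
  namespace: my-namespace
spec:
  selector:
    matchLabels:
      # Illustrative labels; the operator's actual pod labels may differ.
      app.kubernetes.io/component: redis
      app.kubernetes.io/name: myredis
  podMetricsEndpoints:
    - port: metrics               # exporter port name (assumed)
      path: /metrics
      interval: 15s               # per-RedisFailover scrape interval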

My suggestion

I believe the best course of action is to start with the simplest alternative and then let the community experiment with their ideas and discuss how to mature this feature.

At first sight, the simplest alternative seems to be the single PodMonitor for all RedisFailovers, but it prevents the community from experimenting with customized scraping. We all have critical RedisFailovers that we don't want to touch unless absolutely necessary, and not-so-important RedisFailovers that fall victim to all our experiments, including the playful ones. With a single PodMonitor shared by all RedisFailovers, we won't experiment on it because doing so would also affect the critical ones. So, to let the community run all kinds of experiments, we need separate PodMonitor(s) for each RedisFailover.

If we decide to have a single PodMonitor per RedisFailover, we will face a number of problems in the RedisFailover CRD. To make this easier to follow, here are the RedisFailoverSpec struct and some of its composing structs.

type RedisFailoverSpec struct {
	Redis          RedisSettings      `json:"redis,omitempty"`
	Sentinel       SentinelSettings   `json:"sentinel,omitempty"`
	Auth           AuthSettings       `json:"auth,omitempty"`
	LabelWhitelist []string           `json:"labelWhitelist,omitempty"`
	BootstrapNode  *BootstrapSettings `json:"bootstrapNode,omitempty"`
}

type RedisSettings struct {
	// ...
	Exporter Exporter `json:"exporter,omitempty"`
	// ...
}

type SentinelSettings struct {
	// ...
	Exporter Exporter `json:"exporter,omitempty"`
	// ...
}

type Exporter struct {
	Enabled bool `json:"enabled,omitempty"`
	// ...
}

In order to have a single PodMonitor for the entire RedisFailover, we would have to add a PodMonitor field to RedisFailoverSpec, since it belongs to the whole RedisFailover rather than just the Redis or Sentinel part. The sole function of this PodMonitor would be to connect Prometheus with the endpoints exposed by the Redis and Sentinel exporters.
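
In CR terms, that design would look roughly like the sketch below; the top-level podMonitor field is hypothetical and only shows where such a field would sit in the spec.

spec:
  redis:
    exporter:
      enabled: true
  sentinel:
    exporter:
      enabled: false
  podMonitor:            # hypothetical top-level field for the whole RedisFailover
    interval: 30s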

Both exporters can be disabled, in which case creating the PodMonitor CR is pointless and possibly confusing. If only one of them is enabled, the PodMonitor will only connect that one to Prometheus, and it won't be clear which one unless the maintainer inspects the RedisFailover or the pod containers, which is again confusing.

Also, if the PodMonitor is disabled while both exporters are enabled, neither will be connected to Prometheus. Since the functionality of the PodMonitor and the exporters is so intertwined, from a design standpoint it makes more sense to put them closer together and preferably let them share a few subfields, for example Enabled.

The easiest way to get rid of this confusing design is to have one PodMonitor for the Redis pods and one for the Sentinel pods, add a PodMonitor subfield to Exporter, and let the Exporter's Enabled field also control the creation of the PodMonitor.

type Exporter struct {
	Enabled    bool             `json:"enabled,omitempty"`
	PodMonitor PodMonitorConfig `json:"podMonitor,omitempty"`
	// ...
}
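
With that shape, scraping could be configured per component along the following lines; the podMonitor subfields (such as interval) are hypothetical illustrations, since PodMonitorConfig isn't spelled out here.

apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: myredis
spec:
  redis:
    exporter:
      enabled: true
      podMonitor:          # hypothetical PodMonitorConfig fields
        interval: 15s
  sentinel:
    exporter:
      enabled: true
      podMonitor:
        interval: 60s      # Sentinels scraped less often than Redis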
@hashemi-soroush
Contributor Author

hashemi-soroush commented Jun 17, 2023

Issues #383 and #494 have discussed the same problem but didn't reach a resolution. I tried to take those discussions into account when writing this issue and to take a step forward toward a resolution.

I have also implemented my suggested design in #624 and tested it in production. It is a rather large PR because the operator isn't designed for this kind of change: adding a CR from a new GVK (GroupVersionKind) requires changes all over the codebase. But this PR paves the way for adding more Prometheus-related configuration to the RedisFailover CRD, for example custom PrometheusRules for each RedisFailover.

@hashemi-soroush
Contributor Author

Any thoughts? @sagikazarmark @Wouter0100 @colynn @nantiferov

Also, @ese, this feature was requested over a year ago. Could you give it more priority? The PR is ready.

@github-actions

github-actions bot commented Aug 7, 2023

This issue is stale because it has been open for 45 days with no activity.

@github-actions github-actions bot added the stale label Aug 7, 2023
@github-actions

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as not planned on Aug 22, 2023