
Only do telemetry computation if telemetry is enabled in the node #10245

Closed
4 tasks
ValarDragon opened this issue Sep 28, 2021 · 34 comments · Fixed by #19903
Assignees
Labels
C: telemetry Issues and features pertaining to SDK telemetry. S:zondax Squad: Zondax T: Performance Performance improvements

Comments

@ValarDragon
Contributor

ValarDragon commented Sep 28, 2021

Summary

As detailed in other issues, the mutex locks and time.Now() syscalls made in telemetry are rather expensive for nodes. In general, nodes not using telemetry should not pay these costs.

Problem Definition

Nodes that aren't using telemetry should not pay the costs of telemetry.

Proposal

Somehow make the telemetry package aware of the config option that determines whether it's enabled, then only do operations when it's enabled. This is useful due to the potential for mutex lock contention.

This doesn't solve the extraneous time.Now() call, but that should hopefully be solvable via other mechanisms (e.g. not putting telemetry into hot loops / low-level items). This can't be dead-code eliminated / constant folded, since telemetry enablement is a run-time flag, not a compile-time flag.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@ValarDragon ValarDragon added the C: telemetry Issues and features pertaining to SDK telemetry. label Sep 28, 2021
@alexanderbez
Contributor

ACK, yeah this totally makes sense and TBH was an oversight on my part. I didn't realize how much time these calls took.

@ValarDragon
Contributor Author

To avoid the extraneous time.Now() call, we can make that a telemetry function. So replace time.Now() in the defer statements with telemetry.StartTimer(), where StartTimer is if telemetry enabled { return time.Now() } else { return time.Time{} }.
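A minimal sketch of that helper, assuming a hypothetical package-level enabled flag set once at node startup (names here are illustrative, not the SDK's actual API):

```go
package main

import (
	"sync/atomic"
	"time"
)

// enabled stands in for the proposed telemetry-enabled flag,
// written once at node startup (hypothetical name).
var enabled atomic.Bool

// StartTimer only reads the clock when telemetry is on, so disabled
// nodes skip the time.Now() call entirely and get the zero time.
func StartTimer() time.Time {
	if enabled.Load() {
		return time.Now()
	}
	return time.Time{}
}

func main() {
	_ = StartTimer() // zero time while telemetry is disabled
}
```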

@ValarDragon ValarDragon added the T: Performance Performance improvements label Oct 9, 2021
@ValarDragon
Contributor Author

Does anyone have any guidance on how we can get there to be a flag for whether or not telemetry is enabled? Some thoughts that come to mind for me:

  • Does such a flag belong in the ctx
  • Should it be a global variable in the telemetry package that's set on node startup?
  • Should there be a telemetry object accessible via ctx?

I feel like putting a TelemetryEnabled flag in the context may be the simplest thing that's reusable, but I'm not at all sure it's the right thing to do.

@alexanderbez
Contributor

@ValarDragon the SDK context type? I think that might be our only option. Alternatively, we have a global variable in the telemetry package.

@ValarDragon
Contributor Author

ValarDragon commented Oct 19, 2021

Yeah, I meant the SDK context type.

I feel like a global variable in telemetry will make the syntax for doing telemetry calls simpler at least; no idea if it would cause problems for integrators, though. cc @jackzampolin

e.g. API with global variable

defer telemetry.ModuleMeasureSince(types.ModuleName, telemetry.StartTimer(), telemetry.MetricKeyBeginBlocker)

API choices with it in context:

if ctx.telemetryEnabled {
  defer telemetry.ModuleMeasureSince(types.ModuleName, telemetry.StartTimer(), telemetry.MetricKeyBeginBlocker)
}

or

defer telemetry.ModuleMeasureSince(ctx, types.ModuleName, telemetry.StartTimer(ctx), telemetry.MetricKeyBeginBlocker)

I don't see a reason why, if you're running N tendermint chains, you'd only want telemetry on a subset. Is that an important thing to support?

@ValarDragon
Contributor Author

Asked @jackzampolin over DM; he saw no client issue with adding a global variable to the telemetry package that's set on telemetry initialization. So let's go with that approach! (It has the easier API.)
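A rough sketch of that approach (names are hypothetical, not the final SDK API): the flag lives in the telemetry package and is written exactly once during telemetry initialization, then read on every metric call.

```go
package main

import "sync/atomic"

// telemetryEnabled is the proposed package-level flag: written once
// during telemetry initialization, read on every metric call.
var telemetryEnabled atomic.Bool

// Enable would be called from node startup when the telemetry
// section of the node config is turned on (hypothetical name).
func Enable() { telemetryEnabled.Store(true) }

// IsEnabled gates every metric emission.
func IsEnabled() bool { return telemetryEnabled.Load() }

func main() {
	// Disabled by default; a node with telemetry configured calls Enable().
	_ = IsEnabled()
}
```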

@tac0turtle
Member

can we do what Tendermint does and pass a noop telemetry? or is this also about avoiding time calls?

@ValarDragon
Contributor Author

Also avoiding the time.Now() calls.

@alexanderbez
Contributor

can we do what Tendermint does and pass a noop telemetry? or is this also about avoiding time calls?

Yes

Also avoiding the time.Now() calls.

Yes

@fedekunze
Collaborator

@marbar3778 @alexanderbez, can you provide a list of actionable items here? Happy to work on this

@marbar3778: don't call time.Now() in the function but move time to a higher level and create a noop telemetry struct

What do you mean by higher level? server?

@alexanderbez
Contributor

I'm not really sure what @marbar3778 is referring to, but you can't really avoid defer timing calls. The idea is that we have a config in the config struct and delegate all telemetry calls to a function that checks that config. Something like:

// in the telemetry package
func Emit(ctx sdk.Context, metricCb func()) {
  if ctx.TelemetryEnabled() {
    metricCb()
  }
}

// some function in keeper
func (k Keeper) Foo(ctx sdk.Context, ...) {
  defer telemetry.Emit(ctx, func() {
    telemetry.MeasureSince(time.Now(), "foo", "bar")
  })
}

@ValarDragon
Contributor Author

It means replacing the time.Now() call with telemetry.StartTimer(), which does the config check and only runs time.Now() when telemetry is enabled.

@alexanderbez
Contributor

I think my proposal achieves that whilst also being flexible to other things besides time measurements (e.g. counters)

@tac0turtle
Member

MeasureSince and associated functions should call time.Now() instead of leaving it to the keeper. This way the no-op metric gatherer can be set without time.Now(), or with it.

@elias-orijtech
Contributor

What's the status of this? It sounds like an API change, right? If so, can we agree on the proposed API?

@tac0turtle
Member

This is still relevant. I don't have a particular API in mind; do you have any thoughts?

@elias-orijtech
Contributor

I'm not really sure what @marbar3778 is referring to, but you can't really avoid defer timing calls. The idea is that we have a config in the config struct and delegate all telemetry calls to a function that checks that config. Something like:

// in the telemetry package
func Emit(ctx sdk.Context, metricCb func()) {
  if ctx.TelemetryEnabled() {
    metricCb()
  }
}

// some function in keeper
func (k Keeper) Foo(ctx sdk.Context, ...) {
  defer telemetry.Emit(ctx, func() {
    telemetry.MeasureSince(time.Now(), "foo", "bar")
  })
}

This design seems compelling, but it has the issue that the lazy time.Now will be called when Foo returns, not when it is entered, so MeasureSince will not measure what you expect.

As detailed in other issues, the mutex locks and time.Now() syscalls' taken in telemetry are rather expensive for nodes. In general, nodes not using telemetry should not pay these costs.

What mutex locks are you referring to? go-metric accesses the global metric instance through atomic loads. A global, say, atomic bool to track the status of metrics would probably cost the same in CPU time as the atomic pointer load of the metric object.

Note that

if ctx.telemetryEnabled {
  defer telemetry.ModuleMeasureSince(types.ModuleName, telemetry.StartTimer(), telemetry.MetricKeyBeginBlocker)
}

is racy. It's otherwise a reasonable design if telemetryEnabled is made atomic, because we won't need telemetry.StartTimer.
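A race-free variant of that guard, sketched around an atomic flag with a hypothetical helper; the closure pattern also keeps the time.Now() call out of the disabled path.

```go
package main

import (
	"sync/atomic"
	"time"
)

// telemetryEnabled is written once at startup; the atomic read makes
// the guard safe from any goroutine (names are hypothetical).
var telemetryEnabled atomic.Bool

var recorded []time.Duration

// measureSince returns a closure suitable for defer. When telemetry
// is off it returns a no-op without ever calling time.Now().
func measureSince() func() {
	if !telemetryEnabled.Load() {
		return func() {}
	}
	start := time.Now()
	return func() { recorded = append(recorded, time.Since(start)) }
}

func instrumented() {
	// Double call: measureSince() runs now (capturing the start time),
	// the returned closure runs when instrumented returns.
	defer measureSince()()
	// ... work being measured ...
}

func main() {
	instrumented() // disabled: records nothing
}
```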

@alexanderbez
Contributor

Good point. We could augment the metricCb to take ...any then.

@elias-orijtech
Contributor

Why not check and load the metric at the same time? Say,

	defer telemetry.ModuleMeasureSince(types.ModuleName, time.Now(), telemetry.MetricKeyBeginBlocker)

becomes

    if m := telemetry.Load(); m != nil {
        defer telemetry.ModuleMeasureSinceLocal(m, types.ModuleName, time.Now(), telemetry.MetricKeyBeginBlocker)
    }

?

It's an extra if-check that can side-step all extra work, including time.Now, and it doesn't slow down the telemetry-enabled case, because there is still only the one load of the global metric object.

@tac0turtle
Member

Is there a way to make it so this is handled automatically? The if checks will become repetitive and annoying to write.

@alexanderbez
Contributor

@elias-orijtech I don't think it should be the caller's responsibility -- poor UX IMO

@elias-orijtech
Contributor

What's the alternative? Can you expand on #10245 (comment)?

@alexanderbez
Contributor

Well with my proposal, the APIs wouldn't be that much cleaner anyway. I think we might have to have Emit* functions for each type of metrics call supported. Then that Emit* method checks if telemetry is enabled.

e.g.

// in the telemetry package
func EmitMeasureSince(ctx sdk.Context, t time.Time, args ...string) {
  if ctx.TelemetryEnabled() {
     telemetry.MeasureSince(t, args...)
  }
}

// some function in keeper
func (k Keeper) Foo(ctx sdk.Context, ...) {
  t := time.Now()
  defer telemetry.EmitMeasureSince(ctx, t)
}
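A runnable version of this sketch, with the sdk.Context check replaced by a hypothetical package flag so it stands alone. Note the call site still pays for time.Now() eagerly, even when telemetry is off.

```go
package main

import (
	"sync/atomic"
	"time"
)

// telemetryEnabled stands in for ctx.TelemetryEnabled() (hypothetical).
var telemetryEnabled atomic.Bool

var recorded []time.Duration

// EmitMeasureSince checks the flag and only then records the elapsed
// time (stand-in for delegating to telemetry.MeasureSince).
func EmitMeasureSince(t time.Time, args ...string) {
	if telemetryEnabled.Load() {
		recorded = append(recorded, time.Since(t))
	}
}

// Foo shows the call site: time.Now() is evaluated here regardless of
// whether telemetry is enabled, which is the cost this issue targets.
func Foo() {
	t := time.Now()
	defer EmitMeasureSince(t, "foo")
}

func main() {
	Foo() // disabled: nothing recorded, but the clock was still read
}
```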

@elias-orijtech
Contributor

Well with my proposal, the APIs wouldn't be that much cleaner anyway. I think we might have to have Emit* functions for each type of metrics call supported. Then that Emit* method checks if telemetry is enabled.

e.g.

// in the telemetry package
func EmitMeasureSince(ctx sdk.Context, t time.Time, args ...string) {
  if ctx.TelemetryEnabled() {
     telemetry.MeasureSince(t, args...)
  }
}

// some function in keeper
func (k Keeper) Foo(ctx sdk.Context, ...) {
  t := time.Now()
  defer telemetry.EmitMeasureSince(ctx, t)
}

This design doesn't omit the call to time.Now, as requested by the OP. telemetry.Now was suggested, but that's even more API. Also, the resulting design does 3 atomic loads: one each for telemetry.Now and ctx.TelemetryEnabled to check whether metrics are enabled, and one inside telemetry.MeasureSince to access the global telemetry object. By comparison, my suggestion is 1 atomic load.

I still think my suggestion is simpler overall, although I admit it is somewhat clumsier.

@tac0turtle
Member

can time.Now() be put into EmitMeasureSince?

@elias-orijtech
Contributor

can time.Now() be put into EmitMeasureSince?

EmitMeasureSince is deferred, so its time.Now would be too late.

@alexanderbez
Contributor

  1. You're putting responsibility on the caller via if m := telemetry.Load(); m != nil { ... }
  2. Isn't the time.Now() not evaluated until the deferred call? If so, that won't be accurate.

@elias-orijtech
Contributor

elias-orijtech commented Jun 12, 2023

  1. You're putting responsibility on the caller via if m := telemetry.Load(); m != nil { ... }

Yes, that's the clumsiness of my proposal. However, the idiom is almost akin to if err := ...; err != nil.

  2. Isn't the time.Now() not evaluated until the deferred call? If so, that won't be accurate.

No. The arguments to a deferred function call are evaluated eagerly.
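A tiny illustration of that evaluation order, using a string in place of time.Now():

```go
package main

// evalOrder shows that a deferred call's arguments are evaluated at
// the defer statement, not when the deferred function finally runs.
func evalOrder() (captured string) {
	x := "start"
	defer func(v string) { captured = v }(x) // x is read here, while it is still "start"
	x = "end"                                // too late: the argument was already captured
	return
}

func main() {
	_ = evalOrder()
}
```

This is exactly why `defer telemetry.MeasureSince(time.Now(), ...)` measures from function entry: the time.Now() argument runs at the defer statement.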

@alexanderbez
Contributor

Yeah I just don't see how that's a cleaner API and dev UX personally.

@elias-orijtech
Contributor

Alright. So what's the final API? #10245 (comment) plus telemetry.Now?

@alexanderbez
Contributor

I think so? Unless you can think of a clean way w/o needing the caller to check the enablement?

@lucaslopezf
Contributor

Hi guys, I've been thinking about this and I've come up with two different proposals: one using sdk.Context, and the other using a global variable.

Using sdk.Context

Advantages:

  • Flexibility: Allows greater flexibility where different parts of the application can have different telemetry configurations based on their execution context.

Disadvantages:

  • Verbosity: Requires the context to be explicitly passed through the call chain, which can increase verbosity and code complexity.
  • Context Dependency: The design of functions and methods must consider and depend on sdk.Context, which could increase coupling and reduce modularity.

Using a Global Variable

Advantages:

  • Simplicity: A global variable is simple to implement and use. It does not require passing additional objects through the function call chain.
  • Consistency: Ensures consistent configuration across the entire application, as the telemetry enablement state is unique and centralized.

Disadvantages:

  • Limited Flexibility: Does not allow for contextual variations in telemetry configuration, as the state is global and cannot be adjusted for specific use cases within the application.

From my point of view, the best approach is the global variable. I don't see a need for flexibility in telemetry configuration; generally, it's either activated or not at the start of the application and stays that way for the application's lifetime.

For this, I've created two PRs (for demonstration purposes) showcasing the two approaches, so we can decide together which path to take

sdk.Context: #19857
global variable: #19867

@tac0turtle
Member

This is a hard one. Globals are bad, but sdk.Context is something we want to remove. The global here doesn't seem so bad.

@lucaslopezf
Contributor

This is a hard one. Globals are bad, but sdk.Context is something we want to remove. The global here doesn't seem so bad.

I know, and I don't see many alternatives. Given that it's effectively a global configuration constant, it feels like the lesser of the evils. For now, I'm opting for the global variable approach. If you come up with any other ideas, please let me know.


7 participants