Heartbeat interval: Make it configurable #2243

NickCraver · 2022-09-04T15:43:37Z

When I was looking at a separate timer for async timeout evaluation < 1000ms fidelity it became apparent most of the cost in a heartbeat is the timeout evaluation anyway, but also: if you are chasing a lower timeout fidelity you likely want to observe a server outage ASAP as well so I think it makes more sense to actually open up the heartbeat to make it configurable.

I added some docs about how this is risky but that portion and comments could use scrutiny (if we even agree with doing this) on being strong enough warnings around this. I would prefer to leave this as code-only due to potential downsides people may not appreciate otherwise.

Thoughts?

When I was looking at a separate timer for async timeout evaluation < 1000ms fidelity it became apparent _most_ of the cost in a heartbeat is the timeout evaluation anyway, but also: if you are chasing a lower timeout fidelity you likely want to observe a server outage ASAP as well so I think it makes more sense to actually open up the heartbeat to make it configurable.

NickCraver · 2022-09-04T15:44:09Z

cc @deanward81

tyzoid

Thanks for this. I'll re-run some of the load testing that got me to #2146 in the first place this week. This seems reasonable to me as a way of resolving our original issue.

Given the cautionary notes I've seen from developers here regarding shortening this interval, is this a measurable/meaningful impact? Decreasing the interval from 1s to 100ms will 10x the total amount of work the library will incur overall, but how much of an impact is this? Would this just be burning extra cycles in a separate thread, or does this impact request processing and throughput?

One of the workarounds we implemented in our codebase is await on a Task.Delay with a short timeout, and just abandon the request if the timeout is exceeded. Not a great solution, since this still leaves tasks floating around (since there doesn't seem to be a cancellation interface), and also retries seem to still select the same failing connection.

If decreasing this heartbeat interval is seen as too drastic a measure, perhaps an alternative implementation to achieve a similar result is to present a cancellation interface to the client allowing the requester to abort the request early? I have no data to support either conclusion, hence the need for testing, but I don't have a great way of fixturing this library directly without digging into the code more in depth, only in the black-box method of testing we've performed so far.

tyzoid · 2022-09-05T20:28:19Z

src/StackExchange.Redis/PhysicalBridge.cs

@@ -20,7 +20,7 @@ internal sealed class PhysicalBridge : IDisposable

        private const int ProfileLogSamples = 10;

-        private const double ProfileLogSeconds = (ConnectionMultiplexer.MillisecondsPerHeartbeat * ProfileLogSamples) / 1000.0;
+        private const double ProfileLogSeconds = (1000 /* ms */ * ProfileLogSamples) / 1000.0;


Why multiply by a thousand just to divide again?

This was really just showing intent of the net change without doing a bigger rename, etc. - can be simplified for sure.

NickCraver · 2022-09-05T21:56:46Z

@tyzoid We can't abort it early really, because the command has been sent, so we have to handle/account for it no matter what - any per-command tracking (especially out of order) would be much more costly.

On the overall cost of this...it'd be awesome to experiment and see. The work of a heartbeat isn't crazy but there's a brief lock around the work queue to check timeouts. If people lower it a ton though, say to 10ms, it's probably not awesome and going to have some measurable jitter in spikes. I don't have exact numbers, by nature it's going to depend on the workload a lot as to if it'll have any impact or how much.

slorello89

@NickCraver - So I understand, the issue here is that the event that wakes up to check the heartbeat is the same that checks all of the timeouts?

If the goal here is to allow for tighter propagation of aggressively set timeouts, this seems like a somewhat roundabout way to get there. Wouldn't it make more sense to decouple the process that checks the timeouts from the hearbeat and make that interval configurable? It strikes me that the heartbeat needs a network call and a full round trip to Redis, which, even if it's piggybacking along with other packets being multiplexed to Redis, sounds like an expense you shouldn't need to pay to check some timeouts and wakeup the hung tasks/threads associated with those timeouts?

NickCraver · 2022-09-06T12:15:57Z

@slorello89 We only connect to Redis if we haven't written in a long time, this shouldn't actually increase the network activity any...and most of the work left is the timeout and messages. There's no additional thread here - only effectively the timeout checks and a few if statements ahead of them.

I agree it sounds roundabout and my first work was the split path, but 2 things drove me back to simpler:

Most of the work is message timeouts anyway, or all timeout related (e.g. connections timing out themselves, from Redis)
People who want crazy low timeouts are almost 1:1 with wanting to know Redis is offline and handle it ASAP, so a quicker window in those cases to evaluate the backup checks (if we didn't actively detect a socket failure) is usually desired for the same crowd.

docs/Configuration.md

…eartbeat

NickCraver added ⚙️ area:config ⏱️ timeout ⚙️ area:API labels Sep 4, 2022

NickCraver marked this pull request as ready for review September 4, 2022 16:47

NickCraver requested review from philon-msft, slorello89 and mgravell September 4, 2022 16:48

NickCraver changed the title ~~Heartbeat: make it configurable~~ Heartbeat interval: Make it configurable Sep 4, 2022

NickCraver mentioned this pull request Sep 4, 2022

Small timeouts: Time elapsed is much greater than timeout #2146

Closed

NickCraver linked an issue Sep 4, 2022 that may be closed by this pull request

Small timeouts: Time elapsed is much greater than timeout #2146

Closed

Add release notes

1a58dea

tyzoid reviewed Sep 5, 2022

View reviewed changes

slorello89 reviewed Sep 6, 2022

View reviewed changes

philon-msft approved these changes Sep 6, 2022

View reviewed changes

docs/Configuration.md Outdated Show resolved Hide resolved

docs/Configuration.md Outdated Show resolved Hide resolved

philon-msft reviewed Sep 6, 2022

View reviewed changes

docs/Configuration.md Outdated Show resolved Hide resolved

NickCraver added 2 commits September 6, 2022 10:57

Merge main and tweak docs

c36c11d

Merge remote-tracking branch 'origin/main' into craver/configurable-h…

8b197ee

…eartbeat

NickCraver merged commit d1b802b into main Sep 6, 2022

NickCraver deleted the craver/configurable-heartbeat branch September 6, 2022 15:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Heartbeat interval: Make it configurable #2243

Heartbeat interval: Make it configurable #2243

NickCraver commented Sep 4, 2022

NickCraver commented Sep 4, 2022

tyzoid left a comment

tyzoid Sep 5, 2022

NickCraver Sep 5, 2022

NickCraver commented Sep 5, 2022

slorello89 left a comment

NickCraver commented Sep 6, 2022

Heartbeat interval: Make it configurable #2243

Heartbeat interval: Make it configurable #2243

Conversation

NickCraver commented Sep 4, 2022

NickCraver commented Sep 4, 2022

tyzoid left a comment

Choose a reason for hiding this comment

tyzoid Sep 5, 2022

Choose a reason for hiding this comment

NickCraver Sep 5, 2022

Choose a reason for hiding this comment

NickCraver commented Sep 5, 2022

slorello89 left a comment

Choose a reason for hiding this comment

NickCraver commented Sep 6, 2022