Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Heartbeat interval: Make it configurable #2243

Merged
merged 4 commits into from
Sep 6, 2022

Conversation

NickCraver
Copy link
Collaborator

When I was looking at a separate timer for async timeout evaluation < 1000ms fidelity it became apparent most of the cost in a heartbeat is the timeout evaluation anyway, but also: if you are chasing a lower timeout fidelity you likely want to observe a server outage ASAP as well so I think it makes more sense to actually open up the heartbeat to make it configurable.

I added some docs about how this is risky but that portion and comments could use scrutiny (if we even agree with doing this) on being strong enough warnings around this. I would prefer to leave this as code-only due to potential downsides people may not appreciate otherwise.

Thoughts?

When I was looking at a separate timer for async timeout evaluation < 1000ms fidelity it became apparent _most_ of the cost in a heartbeat is the timeout evaluation anyway, but also: if you are chasing a lower timeout fidelity you likely want to observe a server outage ASAP as well so I think it makes more sense to actually open up the heartbeat to make it configurable.
@NickCraver
Copy link
Collaborator Author

cc @deanward81

@NickCraver NickCraver marked this pull request as ready for review September 4, 2022 16:47
@NickCraver NickCraver changed the title Heartbeat: make it configurable Heartbeat interval: Make it configurable Sep 4, 2022
@NickCraver NickCraver linked an issue Sep 4, 2022 that may be closed by this pull request
Copy link

@tyzoid tyzoid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. I'll re-run some of the load testing that got me to #2146 in the first place this week. This seems reasonable to me as a way of resolving our original issue.

Given the cautionary notes I've seen from developers here regarding shortening this interval, is this a measurable/meaningful impact? Decreasing the interval from 1s to 100ms will 10x the total amount of work the library will incur overall, but how much of an impact is this? Would this just be burning extra cycles in a separate thread, or does this impact request processing and throughput?

One of the workarounds we implemented in our codebase is await on a Task.Delay with a short timeout, and just abandon the request if the timeout is exceeded. Not a great solution, since this still leaves tasks floating around (since there doesn't seem to be a cancellation interface), and also retries seem to still select the same failing connection.

If decreasing this heartbeat interval is seen as too drastic a measure, perhaps an alternative implementation to achieve a similar result is to present a cancellation interface to the client allowing the requester to abort the request early? I have no data to support either conclusion, hence the need for testing, but I don't have a great way of fixturing this library directly without digging into the code more in depth, only in the black-box method of testing we've performed so far.

@@ -20,7 +20,7 @@ internal sealed class PhysicalBridge : IDisposable

private const int ProfileLogSamples = 10;

private const double ProfileLogSeconds = (ConnectionMultiplexer.MillisecondsPerHeartbeat * ProfileLogSamples) / 1000.0;
private const double ProfileLogSeconds = (1000 /* ms */ * ProfileLogSamples) / 1000.0;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why multiply by a thousand just to divide again?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was really just showing intent of the net change without doing a bigger rename, etc. - can be simplified for sure.

@NickCraver
Copy link
Collaborator Author

@tyzoid We can't abort it early really, because the command has been sent, so we have to handle/account for it no matter what - any per-command tracking (especially out of order) would be much more costly.

On the overall cost of this...it'd be awesome to experiment and see. The work of a heartbeat isn't crazy but there's a brief lock around the work queue to check timeouts. If people lower it a ton though, say to 10ms, it's probably not awesome and going to have some measurable jitter in spikes. I don't have exact numbers, by nature it's going to depend on the workload a lot as to if it'll have any impact or how much.

Copy link
Collaborator

@slorello89 slorello89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NickCraver - So I understand, the issue here is that the event that wakes up to check the heartbeat is the same that checks all of the timeouts?

If the goal here is to allow for tighter propagation of aggressively set timeouts, this seems like a somewhat roundabout way to get there. Wouldn't it make more sense to decouple the process that checks the timeouts from the hearbeat and make that interval configurable? It strikes me that the heartbeat needs a network call and a full round trip to Redis, which, even if it's piggybacking along with other packets being multiplexed to Redis, sounds like an expense you shouldn't need to pay to check some timeouts and wakeup the hung tasks/threads associated with those timeouts?

@NickCraver
Copy link
Collaborator Author

@slorello89 We only connect to Redis if we haven't written in a long time, this shouldn't actually increase the network activity any...and most of the work left is the timeout and messages. There's no additional thread here - only effectively the timeout checks and a few if statements ahead of them.

I agree it sounds roundabout and my first work was the split path, but 2 things drove me back to simpler:

  1. Most of the work is message timeouts anyway, or all timeout related (e.g. connections timing out themselves, from Redis)
  2. People who want crazy low timeouts are almost 1:1 with wanting to know Redis is offline and handle it ASAP, so a quicker window in those cases to evaluate the backup checks (if we didn't actively detect a socket failure) is usually desired for the same crowd.

docs/Configuration.md Outdated Show resolved Hide resolved
docs/Configuration.md Outdated Show resolved Hide resolved
docs/Configuration.md Outdated Show resolved Hide resolved
@NickCraver NickCraver merged commit d1b802b into main Sep 6, 2022
@NickCraver NickCraver deleted the craver/configurable-heartbeat branch September 6, 2022 15:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Small timeouts: Time elapsed is much greater than timeout
4 participants