feat(server): introduce rss oom limit #3702
base: main
Conversation
Force-pushed from e2671af to 678a6c9.
src/server/main_service.cc
Outdated
@@ -103,6 +103,11 @@ ABSL_FLAG(double, oom_deny_ratio, 1.1,
    "commands with flag denyoom will return OOM when the ratio between maxmemory and used "
    "memory is above this value");

ABSL_FLAG(double, rss_oom_deny_ratio, 1.1,
@romange do you think this should be enabled by default or disabled by default?
Enabled, but maybe increase the default ratio to 1.25, because it affects maxmemory semantics.
we also need to provide explanation in release notes
What if one upgrades to the new version and does not use the admin port? If they hit the RSS limit, they will not even be able to connect to the server, as we will reject new connections.
Hmm, you are right, but how do we introduce this behavior then? I do think that some sane multiplier is needed. It can even be 2 or 3, but if RSS jumps by a huge factor, that's not OK.
Maybe we should reject new connections only if the admin port is set?
src/server/main_service.cc
Outdated
auto memory_stats = etl.GetMemoryUsage(start_ns);
double oom_deny_ratio = GetFlag(FLAGS_oom_deny_ratio);
if (used_memory > (max_memory_limit * oom_deny_ratio)) {
  DLOG(WARNING) << "Out of memory, used " << used_memory << " vs limit " << max_memory_limit;
double rss_oom_deny_ratio = GetFlag(FLAGS_rss_oom_deny_ratio);
if (memory_stats.used_mem > (max_memory_limit * oom_deny_ratio) ||
    (rss_oom_deny_ratio > 0 &&
     memory_stats.rss_mem > (max_memory_limit * rss_oom_deny_ratio))) {
  DLOG(WARNING) << "Out of memory, used " << memory_stats.used_mem << " ,rss "
                << memory_stats.rss_mem << " ,limit " << max_memory_limit;
I'd wrap this logic, possibly including the DLOG part, in a function.
src/server/main_service.cc
Outdated
auto memory_stats = etl.GetMemoryUsage(start_ns);
double oom_deny_ratio = GetFlag(FLAGS_oom_deny_ratio);
Are we ok with reading 2 flags in the hot path?
We could have thread-local caches, and update them upon flag update
OK, using a thread-local cache now.
Please wait for Roman's approval on the default value
src/server/main_service.cc
Outdated
optional<ErrorReply> Service::VerifyCommandExecution(const CommandId* cid,
                                                     const ConnectionContext* cntx,
                                                     CmdArgList tail_args) {
static optional<ErrorReply> ShouldDenyOnOOM(const CommandId* cid) {
it will be simpler if this function returns bool
@@ -925,7 +955,12 @@ void Service::Init(util::AcceptServer* acceptor, std::vector<facade::Listener*>
}

// Initialize shard_set with a global callback running once in a while in the shard threads.
shard_set->Init(shard_num, [this] { server_family_.GetDflyCmd()->BreakStalledFlowsInShard(); });
shard_set->Init(shard_num, [this] {
  server_family_.GetDflyCmd()->BreakStalledFlowsInShard();
Please note that EngineShard::RunPeriodic runs shard_handler once in 100ms, even though the loop inside RunPeriodic iterates once in 1ms. So with this code, server_family_.UpdateMemoryGlobalStats() will also run only once in 100ms. I think that's not OK. I suggest you move the 100ms check here and apply it only to BreakStalledFlowsInShard(), so that shard_handler is called on every iteration inside RunPeriodic.
src/server/server_family.cc
Outdated
  return;
}
time_t curr_time = time(nullptr);
if (curr_time == global_stats_update_time_) {  // Runs once a second.
I see you changed the semantics of the previous code by updating used_mem_peak once a second. I am now confused about whether we need to run UpdateMemoryGlobalStats as frequently as every 1ms. Maybe you are right and it's not needed. But I think it's an opportunity to simplify this logic so that the timing rules are clearer. Let's run UpdateMemoryGlobalStats once in 100ms, so the whole shard_handler runs once in 100ms. But then let's remove the 1s restriction here and update used_mem_peak, rss_mem_current, and rss_mem_peak once in 100ms as well.
OK, I removed the restriction that this logic runs only once a second.
src/server/server_family.cc
Outdated
if (rss_mem_peak.load(memory_order_relaxed) < total_rss) {
  rss_mem_peak.store(total_rss, memory_order_relaxed);
}
double rss_oom_deny_ratio = absl::GetFlag(FLAGS_rss_oom_deny_ratio);
You added rss_oom_deny_ratio to the server state, but here you use the flag.
will fix
src/server/server_family.h
Outdated
Service& service_;

util::AcceptServer* acceptor_ = nullptr;
std::vector<facade::Listener*> listeners_;
bool accepting_connections_ = true;
time_t global_stats_update_time_ = 0;
no need for global_stats_update_time_
lgtm
Signed-off-by: adi_holden <adi@dragonflydb.io>
Co-authored-by: Shahar Mike <chakaz@users.noreply.github.com> Signed-off-by: adiholden <adi@dragonflydb.io>
Force-pushed from 0bb67ed to bfbde9e.
fixes #3616