Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

numa: fix scheduler panic due to topology serialization bug #23284

Merged
merged 1 commit into from
Jun 11, 2024

Conversation

tgross
Copy link
Member

@tgross tgross commented Jun 10, 2024

The NUMA topology struct field NodeIDs is a idset.Set, which has no public members. As a result, this field is never serialized via msgpack and persisted in state. When numa.affinity = "prefer", the scheduler dereferences this nil field and panics the scheduler worker.

Ideally we would fix this by adding a msgpack serialization extension, but because the field already exists and is just always empty, this breaks RPC wire compatibility across upgrades. Instead, create a new field that's populated at the same time we populate the more useful idset.Set, and repopulate the set on demand.

Fixes: https://hashicorp.atlassian.net/browse/NET-9924
Ref: https://github.com/hashicorp/nomad-enterprise/pull/1527

The NUMA topology struct field `NodeIDs` is a `idset.Set`, which has no public
members. As a result, this field is never serialized via msgpack and persisted
in state. When `numa.affinity = "prefer"`, the scheduler dereferences this nil
field and panics the scheduler worker.

Ideally we would fix this by adding a msgpack serialization extension, but
because the field already exists and is just always empty, this breaks RPC wire
compatibility across upgrades. Instead, create a new field that's populated at
the same time we populate the more useful `idset.Set`, and repopulate the set on
demand.

Fixes: https://hashicorp.atlassian.net/browse/NET-9924
Copy link
Contributor

@pkazmierczak pkazmierczak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tgross tgross added backport/ent/1.6.x+ent Changes are backported to 1.6.x+ent backport/ent/1.7.x+ent Changes are backported to 1.7.x+ent backport/ent/1.8.x+ent Changes are backported to 1.8.x+ent backport/1.8.x backport to 1.8.x release line labels Jun 11, 2024
@tgross tgross merged commit 7d73065 into main Jun 11, 2024
43 checks passed
@tgross tgross deleted the numa-nodeids-serialization branch June 11, 2024 12:55
@tgross tgross removed the backport/ent/1.6.x+ent Changes are backported to 1.6.x+ent label Jun 11, 2024
pkazmierczak added a commit that referenced this pull request Jul 9, 2024
… NUMA (#23399)

After changes introduced in #23284 we no longer need to make a if
!st.SupportsNUMA() check in the GetNodes() topology method. In fact this check
will now cause panic in nomadTopologyToProto method on systems that don't
support NUMA.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/ent/1.7.x+ent Changes are backported to 1.7.x+ent backport/ent/1.8.x+ent Changes are backported to 1.8.x+ent backport/1.8.x backport to 1.8.x release line theme/crash theme/enterprise Issues related to Enterprise features theme/scheduling type/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants