Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YUNIKORN-2832] [core] Add non-Yunikorn allocation tracking logic #975

Closed
wants to merge 3 commits into from

Conversation

pbacsko
Copy link
Contributor

@pbacsko pbacsko commented Sep 26, 2024

What is this PR for?

Add foreign pod tracking logic to the scheduler core.

Note: existing code which updates occupiedResources is not removed in this PR. Removal will be done as a follow-up task.
Updating the allocation resource usage be implemented in YUNIKORN-1765.

What type of PR is it?

  • - Bug Fix
  • - Improvement
  • - Feature
  • - Documentation
  • - Hot Fix
  • - Refactoring

Todos

  • - Task

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-2832

How should this be tested?

Screenshots (if appropriate)

Questions:

  • - The licenses files need update.
  • - There is breaking changes for older versions.
  • - It needs documentation.

@pbacsko pbacsko self-assigned this Sep 26, 2024
@pbacsko pbacsko marked this pull request as draft September 26, 2024 13:50
Copy link

codecov bot commented Sep 26, 2024

Codecov Report

Attention: Patch coverage is 90.83969% with 12 lines in your changes missing coverage. Please review.

Project coverage is 81.49%. Comparing base (fe067b7) to head (d8818e8).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
pkg/scheduler/partition.go 77.77% 10 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #975      +/-   ##
==========================================
+ Coverage   81.03%   81.49%   +0.45%     
==========================================
  Files          97       97              
  Lines       12523    12632     +109     
==========================================
+ Hits        10148    10294     +146     
+ Misses       2104     2055      -49     
- Partials      271      283      +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pbacsko pbacsko force-pushed the YUNIKORN-2832 branch 7 times, most recently from f35f054 to 724f3de Compare September 27, 2024 14:54
@pbacsko pbacsko marked this pull request as ready for review September 27, 2024 14:58
@pbacsko pbacsko changed the title [YUNIKORN-2832] [core] Add non-Yunikorn tracking logic [YUNIKORN-2832] [core] Add non-Yunikorn allocation tracking logic Sep 27, 2024
Copy link
Contributor

@craigcondit craigcondit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor things, and some questions.

pkg/scheduler/objects/allocation.go Outdated Show resolved Hide resolved
pkg/scheduler/objects/node.go Outdated Show resolved Hide resolved
pkg/scheduler/partition.go Show resolved Hide resolved
pkg/scheduler/partition.go Show resolved Hide resolved
Copy link
Contributor

@chenyulin0719 chenyulin0719 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions.

Comment on lines +1296 to +1304
if !allocated {
return false, false, fmt.Errorf("trying to add a foreign request (non-allocation) %s", allocationKey)
}
if alloc.GetNodeID() == "" {
return false, false, fmt.Errorf("node ID is empty for allocation %s", allocationKey)
}
if node == nil {
return false, false, fmt.Errorf("failed to find node %s for allocation %s", nodeID, allocationKey)
}
Copy link
Contributor

@chenyulin0719 chenyulin0719 Sep 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return error means the foreign allocation will be sent back to shim as "rejected allocation"

May I check what will be the followup actions after shim receives the rejected foreign allocations?

The reject reasons for foreign allocations:

  1. foreign request is non-allocated
  2. foreign allocation don't have node ID empty
  3. foreign allocation failed to find node in partition

Besides, should we add a flag of allocation type to &si.RejectedAllocation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, it's sent back and it needs to be handled properly.
Currently, this happens inside scheduler_callback.go:

	for _, reject := range response.RejectedAllocations {
		// request rejected by the scheduler, reject it
		log.Log(log.ShimRMCallback).Debug("callback: response to rejected allocation",
			zap.String("allocationKey", reject.AllocationKey))
		dispatcher.Dispatch(NewRejectTaskEvent(reject.ApplicationID, reject.AllocationKey,
			fmt.Sprintf("task %s allocation from application %s is rejected by scheduler",
				reject.AllocationKey, reject.ApplicationID)))
	}

This doesn't look right, we don't need a task reject event for a non-Yunikorn allocation. Therefore, we need changes in this code path (my feeling is that there's not much we can do other than logging something) but, this is yet to be done.

Copy link
Contributor

@chenyulin0719 chenyulin0719 Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, there is nothing we can do except logging.

If we want to log the information back to the Pod (Pod event? or other places?), we need to determine whether "This rejected allocation is YuniKorn or Non-YuniKorn." How do we identify it?

Will we keep the Non-YuniKorn allocation ID in shim?
My previous question assumes we don't keep the Non-YuniKorn allocation ID list in shim.

Besides, should we add a flag of allocation type to &si.RejectedAllocation?

Therefore, we might need a flag from si.RejectedAllocation.

Copy link
Contributor

@chenyulin0719 chenyulin0719 Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, maybe we can just keep the rejected allocation in core?
Actually, I prefer this because there is nothing helpful the shim can do after receiving the rejected Non-YuniKorn allcoations.

Copy link
Contributor

@craigcondit craigcondit Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only way we would ever reject a non-YK alloc in the core is if the node was not found. This would indicate a bug in the shim, and so there's no appropriate way to handle it. So logging loudly is probably sufficient (ideally with a "BUG:" prefix). Since it should be unreachable code, sending back a reject seems unnecessary.

Copy link
Contributor

@craigcondit craigcondit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good except for debug instead of warning in logging.

Copy link
Contributor

@craigcondit craigcondit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM.

@pbacsko pbacsko closed this in a2d3d43 Oct 2, 2024
Copy link
Contributor

@chenyulin0719 chenyulin0719 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants