Sometimes the binpack plugin scores nodes with terminating pods too high, causing pending pods to be pipelined but not allocated #2782

Closed
jiangkaihua opened this issue Apr 13, 2023 · 4 comments · Fixed by #2786 or #2815

@jiangkaihua
Contributor

jiangkaihua commented Apr 13, 2023

What happened:

In the allocate action, nodes with terminating pods are also permitted to pass predicateFn, as long as futureIdle (idle + releasing) is larger than the pod request:

predicateFn := func(task *api.TaskInfo, node *api.NodeInfo) error {
    // Check for Resource Predicate
    if !task.InitResreq.LessEqual(node.FutureIdle(), api.Zero) {
        return api.NewFitError(task, node, api.NodeResourceFitFailed)
    }
    return ssn.PredicateFn(task, node)
}

So nodes with terminating pods can be candidates in the scoring phase even though they cannot provide enough resources for the pending pod to be allocated immediately. Therefore, in some scenarios, nodes with terminating pods score higher than nodes with enough idle resources.
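For reference, FutureIdle as used above is essentially the node's idle plus releasing resources. A minimal sketch of that relation, assuming the Resource type exposes Clone and Add helpers (the method names here are assumptions, not quotes from the codebase):

// futureIdle sketches the relation described above: resources held by
// terminating (releasing) pods are treated as if they were already free,
// which is why such nodes can pass predicateFn.
func futureIdle(idle, releasing *api.Resource) *api.Resource {
    return idle.Clone().Add(releasing)
}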

What you expected to happen:

Nodes with terminating pods should get a lower score in the scoring phase than nodes with enough idle resources.

How to reproduce it (as minimally and precisely as possible):

  1. Enable the binpack plugin and give it a high weight;
  2. Give each resource a different weight, for example:
- name: binpack
  arguments:
    binpack.cpu: 8
    binpack.memory: 1
    binpack.weight: 4
  3. Compose a scenario where a pod keeps terminating on a node whose idle resources do not meet the memory request of the pending pod, but whose idle CPU is enough, and whose idle + releasing resources just fit the pending pod (idle.mem + releasing.mem = request.mem, idle.cpu = request.cpu).
  4. In the binpack plugin, memory then scores 0 on this node:
    usedFinally := requested + used
    if usedFinally > capacity {
        return 0
    }
  5. But CPU returns a non-zero score since it is enough, and the total node score ends up high because the CPU weight is 8/9 of the weight sum (see the worked sketch after this list):
        score += resourceScore
        weightSum += resourceWeight
    }
    // mapping the result from [0, weightSum] to [0, 10(MaxPriority)]
    if weightSum > 0 {
        score /= float64(weightSum)
    }
    score *= float64(k8sFramework.MaxNodeScore * int64(weight.BinPackingWeight))
    return score
  6. Then the pending pod is pipelined to this node even though other nodes have enough idle resources, because the binpack plugin scores this node higher than the others.
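To make the effect concrete, here is a self-contained Go sketch of the scoring arithmetic described in steps 4-6. The resource weights match the example configuration above; the node capacities and usage figures are illustrative assumptions, not taken from a real cluster.

package main

import "fmt"

// resourceBinPackingScore mirrors the per-resource logic quoted in step 4:
// a resource that would exceed capacity scores 0; otherwise the score grows
// with the final utilization, scaled by the resource weight.
func resourceBinPackingScore(requested, capacity, used, weight float64) float64 {
    usedFinally := requested + used
    if usedFinally > capacity {
        return 0
    }
    return usedFinally * weight / capacity
}

func main() {
    const (
        cpuWeight     = 8.0
        memWeight     = 1.0
        binpackWeight = 4.0
        maxNodeScore  = 100.0
        weightSum     = cpuWeight + memWeight
    )

    // Node A has a terminating pod: on paper its memory is over-committed
    // (requested + used > capacity, so memory contributes 0), while its high
    // CPU utilization is rewarded by binpack.
    nodeA := resourceBinPackingScore(2, 16, 12, cpuWeight) + // cpu: (2+12)/16*8 = 7.0
        resourceBinPackingScore(8, 32, 28, memWeight) // memory: 8+28 > 32 -> 0

    // Node B is genuinely idle: both resources fit comfortably.
    nodeB := resourceBinPackingScore(2, 16, 4, cpuWeight) + // cpu: (2+4)/16*8 = 3.0
        resourceBinPackingScore(8, 32, 8, memWeight) // memory: (8+8)/32*1 = 0.5

    // Map [0, weightSum] to [0, MaxNodeScore] and apply the plugin weight.
    scoreA := nodeA / weightSum * maxNodeScore * binpackWeight // ~311
    scoreB := nodeB / weightSum * maxNodeScore * binpackWeight // ~156

    fmt.Printf("node with terminating pod: %.1f\n", scoreA)
    fmt.Printf("node with enough idle resources: %.1f\n", scoreB)
}

Even though node A cannot actually host the pod right now, it outscores node B, so the pod is pipelined to node A instead of being allocated to node B.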

Anything else we need to know?:

Environment:

  • Volcano Version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@jiangkaihua added the kind/bug label on Apr 13, 2023
@wangyang0616
Member

/reopen

@volcano-sh-bot
Contributor

@wangyang0616: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wangyang0616
Member

Besides the binpack plugin having this problem, I understand that other scoring plugins may encounter similar problems, such as task-topology, nodeorder, etc.

I was wondering if this generic problem could be solved as follows:
When allocate scores nodes, divide them into two groups: the first group is nodes whose idle resources meet the task's resource request, and the second group is nodes whose future idle resources meet the task's resource request.

First, score the first group; if a suitable node is found, schedule the task onto it. If no node in the first group meets the resource request, then score the second group and select a suitable node from it.

In this way, the pod is first dispatched to a node that meets its resource requirements in the current session, so it does not stay pending for a long time. If no node in the current session meets the requirements, the pod can still be pipelined to wait on a node whose future idle resources fit.
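A rough Go sketch of this two-group idea, for illustration only: the selectNode helper and the pickBest callback are hypothetical, and the sketch assumes only the Idle, FutureIdle, and InitResreq fields already shown in the issue.

// selectNode sketches the proposal: prefer nodes that can host the task with
// their current idle resources, and fall back to future-idle nodes only when
// no immediately-fitting node exists.
func selectNode(task *api.TaskInfo, nodes []*api.NodeInfo,
    pickBest func(*api.TaskInfo, []*api.NodeInfo) *api.NodeInfo) *api.NodeInfo {

    var idleFit, futureFit []*api.NodeInfo
    for _, n := range nodes {
        switch {
        case task.InitResreq.LessEqual(n.Idle, api.Zero):
            idleFit = append(idleFit, n) // can be allocated right now
        case task.InitResreq.LessEqual(n.FutureIdle(), api.Zero):
            futureFit = append(futureFit, n) // fits only after releasing pods exit
        }
    }

    // Score the immediately-fitting group first; only if it yields nothing,
    // score the future-idle group, so pipelining becomes the fallback.
    if best := pickBest(task, idleFit); best != nil {
        return best
    }
    return pickBest(task, futureFit)
}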

@zhaizhch

zhaizhch commented Jan 8, 2024

We also met this issue; the PR solves it perfectly.

@william-wang added this to the v1.8 milestone on Jun 12, 2024