
app Container can't reuse its init Container cpuset in a specific condition #124797

Open
lianghao208 opened this issue May 10, 2024 · 11 comments

Labels
kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

@lianghao208
Member

lianghao208 commented May 10, 2024

What happened?

We can't make sure the app container always reuses the init container's cpuset, which may waste CPUs (the init container has already exited, but its cpuset cannot be reused by other containers) and lead to a `not enough cpus available to satisfy request` error.


What did you expect to happen?

The app container should always reuse its init container's cpuset after the init container exits.

How can we reproduce it (as minimally and precisely as possible)?

This is one of the specific conditions that can cause the issue:

  • Pod A is ready to allocate cpusets; its init container and app container both request 92 CPUs.
  • Pod B is already running on the node and is about to be deleted.
  1. Pod A's init container allocates its cpuset (4-24,48-60,73-84,100-120,144-156,169-180):
I0510 16:40:21.232949   20266 state_mem.go:80] "Updated desired CPUSet" podUID="2f9922ce-df66-4b58-abd8-01187b813318" containerName="init-container" cpuSet="4-24,48-60,73-84,100-120,144-156,169-180"
  2. Pod A's init container exits.
  3. Before Pod A's app container allocates its cpuset, Pod B is deleted and releases its cpuset (0-3,25-47,61-72,85-99,121-143,157-168,181-191):
I0510 16:40:27.759335   20266 state_mem.go:107] "Deleted CPUSet assignment" podUID="74510e24-48ba-4fd7-ab85-80dd99c6df5d" containerName="deleted-container"
I0510 16:40:27.759714   20266 state_mem.go:88] "Updated default CPUSet" cpuSet="0-3,25-47,61-72,85-99,121-143,157-168,181-191"
  4. Pod A's app container allocates its cpuset.
     We expect it to reuse its init container's cpuset, but due to Pod B's deletion it allocates a different one (4-49,100-145):
I0510 16:40:27.989453   20266 state_mem.go:80] "Updated desired CPUSet" podUID="2f9922ce-df66-4b58-abd8-01187b813318" containerName="app-container" cpuSet="4-49,100-145"

Now Pod A's init container holds cpuset 4-24,48-60,73-84,100-120,144-156,169-180,
and Pod A's app container holds cpuset 4-49,100-145.
The init container's cpuset is not reused as expected.

A new Pod C that starts to allocate a cpuset may then hit the `not enough cpus available to satisfy request` error.
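The arithmetic behind this can be checked with the k8s.io/utils/cpuset helpers (the package the kubelet itself uses for cpuset math). This is a minimal sketch using the sets from the logs above; calling the leftover CPUs "stranded" follows this issue's description, it is not kubelet terminology:

```go
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

func main() {
	// CPU sets copied from the kubelet logs above.
	initSet := cpuset.MustParse("4-24,48-60,73-84,100-120,144-156,169-180")
	appSet := cpuset.MustParse("4-49,100-145")

	reused := initSet.Intersection(appSet) // CPUs the app container did reuse
	stranded := initSet.Difference(appSet) // init CPUs still assigned to an exited container

	fmt.Println("init:", initSet.Size(), "app:", appSet.Size()) // 92 and 92
	fmt.Println("reused:", reused.Size())                       // 46
	fmt.Println("stranded:", stranded.Size())                   // 46 CPUs unavailable to Pod C
}
```

So although both containers were sized identically, only 46 of the init container's 92 CPUs were reused; the other 46 stay pinned to a container that no longer runs.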

Anything else we need to know?

No response

Kubernetes version

1.30

Cloud provider

NONE

OS version


Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@lianghao208 lianghao208 added the kind/bug Categorizes issue or PR as related to a bug. label May 10, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 10, 2024
@lianghao208
Member Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 10, 2024
@lianghao208
Member Author

/cc @klueska Hi Klues, I noticed you have solved some similar issues, like #102014. I wonder if you have encountered this issue before.

@lianghao208
Member Author

The point is: the cpuset allocations for init containers and app containers differ because the available cpusets change in the interval between the start of the init container and the start of the app container.

@ffromani
Contributor

related: #94220

@chengjoey
Contributor

#124282
similar issue?

@lianghao208
Member Author

> related: #94220

@ffromani Thanks for the mention. The issue I describe here is a little different from #94220.
In #94220, the bug is caused by different CPU requests between the init container and the app container.
This bug is caused by changes in the available cpusets in the interval between the start of the init container and the start of the app container.

@lianghao208
Member Author

> #124282 similar issue?

@chengjoey Not exactly the same issue. In #124282, the init container and app container request different amounts of CPU (init > app), and the init container's cpuset can't be released even though it has exited (similar to #94220).

But in this case, the init container and app container request the same amount of CPU (init == app), so this is a kubelet issue rather than a kube-scheduler issue.

@ffromani
Contributor

> > related: #94220

> @ffromani Thanks for the mention. The issue I describe here is a little different from #94220. In #94220, the bug is caused by different CPU requests between the init container and the app container. This bug is caused by changes in the available cpusets in the interval between the start of the init container and the start of the app container.

Yes, I realized that after re-reading the description of this issue. I'd need to check whether the system actually guarantees maximum reuse of the init container's CPU cores when allocating the app container's cores. Nevertheless, it's a very desirable property the system should strive to ensure. My gut feeling is that there is simply a bug in this area; I remember various conversations about it over time.

@ffromani
Contributor

the core issue is here: https://github.com/kubernetes/kubernetes/blob/v1.30.0/pkg/kubelet/cm/cpumanager/policy_static.go#L394

With this line of code, all the available CPUs are put in a single pool. IOW, nothing guarantees that the reusable CPUs from the terminated init container will be consumed first, or consumed at all if the system has enough CPUs to fulfill the app container's requirement.
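To make that concrete, here is a self-contained sketch of the effect (not the actual kubelet allocator; `takeLowest` is a deliberately naive stand-in for the topology-aware picker in policy_static.go):

```go
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

// takeLowest stands in for the kubelet's topology-aware picker: it just
// takes the numCPUs lowest-numbered CPUs (and assumes the pool is large
// enough). The real picker is NUMA-aware, but once everything sits in one
// pool it is equally free to ignore which CPUs came from the init container.
func takeLowest(pool cpuset.CPUSet, numCPUs int) cpuset.CPUSet {
	return cpuset.New(pool.List()[:numCPUs]...)
}

func main() {
	// Default pool after Pod B's deletion and the reusable CPUs from
	// Pod A's exited init container, both taken from the logs above.
	defaultPool := cpuset.MustParse("0-3,25-47,61-72,85-99,121-143,157-168,181-191")
	reusable := cpuset.MustParse("4-24,48-60,73-84,100-120,144-156,169-180")

	// The linked line merges both into one flat pool; nothing marks the
	// reusable CPUs as preferred.
	allocatable := defaultPool.Union(reusable)

	got := takeLowest(allocatable, 92)
	fmt.Println("allocated:", got.String())
	fmt.Println("reused from init:", got.Intersection(reusable).Size(), "of", reusable.Size())
}
```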

I vaguely remember some past conversations in this area about guaranteeing optimal allocation in the context of topology-manager-enforced constraints. I also wonder whether and how we should extend this guarantee. IOW, should the reuse be best-effort (and so, arguably, there's no bug)?

Perhaps the best way to fix this would be to add a new CPU manager policy option.

@ffromani
Contributor

/triage accepted
/priority backlog

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 13, 2024
@lianghao208
Member Author

@ffromani

> With this line of code, all the available CPUs are put in a single pool. IOW, nothing guarantees that the reusable CPUs from the terminated init container will be consumed first, or consumed at all if the system has enough CPUs to fulfill the app container's requirement.

In this case, should we release the init container's cpuset as soon as it exits? If an init container exits successfully and won't restart anymore, its cpuset can either be reused by its own pod's app container or by other pods' containers; otherwise this "available" cpuset will not be used at all.
From the scheduler's perspective, however, these CPUs are considered available.

> I vaguely remember some past conversations in this area about guaranteeing optimal allocation in the context of topology-manager-enforced constraints. I also wonder whether and how we should extend this guarantee. IOW, should the reuse be best-effort (and so, arguably, there's no bug)?

If we release the init container's cpuset as soon as it exits, the reuse will be guaranteed. See the hypothetical sketch below.
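A hypothetical sketch of that idea against the cpumanager state interface (the `onInitContainerExit` hook and its call site are assumptions for illustration, not existing kubelet code):

```go
package cpumanager

import (
	"k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/state"
)

// onInitContainerExit is a hypothetical hook: once an init container has
// exited and is guaranteed not to restart, return its exclusive CPUs to
// the shared pool immediately instead of waiting for the pod to be removed.
func onInitContainerExit(s state.State, podUID, containerName string) {
	if set, ok := s.GetCPUSet(podUID, containerName); ok {
		s.Delete(podUID, containerName) // drop the exited container's assignment
		s.SetDefaultCPUSet(s.GetDefaultCPUSet().Union(set))
	}
}
```

Restartable (sidecar) init containers would have to be excluded, since they keep running alongside the app containers.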

@pacoxu pacoxu added this to Triage in SIG Node Bugs May 14, 2024
@ffromani ffromani moved this from Triage to Triaged in SIG Node Bugs May 14, 2024