
Issue 1544: Add support for Kubernetes limits #1566

Open · tfurmston wants to merge 4 commits into master

Conversation

@tfurmston (Contributor) commented Oct 3, 2023

This PR adds support for specifying limits (as opposed to requests) for workflows on Kubernetes. It addresses issue 1544.

Notes:

  • I've added a single flag in the Metaflow config to cover CPU, RAM & disk space. Not sure if you would prefer different flags for each resource? (See the sketch after this list.)
  • I've tested by performing some manual runs. (I couldn't find any relevant tests, so I assume this is the current practice for this part of the code base.)
  • There is an open comment on the issue. Not sure how you would like to address that comment?
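
To make the intent concrete, here is a minimal sketch of the single-flag behavior described above; the flag and helper names are illustrative, not the PR's actual code:

```python
# Illustrative sketch only: flag and helper names are hypothetical,
# not the actual Metaflow implementation in this PR.

def build_resources(requests, mirror_requests_as_limits):
    """Build the 'resources' section of a Kubernetes container spec.

    requests: e.g. {"cpu": "2", "memory": "4Gi", "ephemeral-storage": "10Gi"}
    mirror_requests_as_limits: the single config flag covering CPU, RAM and disk.
    """
    resources = {"requests": dict(requests)}
    if mirror_requests_as_limits:
        # Mirror every request into a matching limit; with requests == limits
        # the pod gets Guaranteed QoS and works well with LimitRange policies.
        resources["limits"] = dict(requests)
    return resources


print(build_resources({"cpu": "2", "memory": "4Gi"}, mirror_requests_as_limits=True))
# {'requests': {'cpu': '2', 'memory': '4Gi'}, 'limits': {'cpu': '2', 'memory': '4Gi'}}
```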

@tfurmston tfurmston marked this pull request as draft October 3, 2023 21:10
@tfurmston tfurmston changed the title Issue 1544: Add support for Kubernetes limits [WIP] Issue 1544: Add support for Kubernetes limits Oct 3, 2023
@tfurmston tfurmston marked this pull request as ready for review October 8, 2023 09:20
@tfurmston tfurmston changed the title [WIP] Issue 1544: Add support for Kubernetes limits Issue 1544: Add support for Kubernetes limits Oct 8, 2023
@elgehelge

To me it seems odd to have either requests or limits, but I am not a Kubernetes expert myself, so the only feedback I can really give is to seek advice in a Cloud Native community.

For my own specific use case, I need a couple of different requests to make sure the job can run, while also limiting the memory to avoid all of it being consumed, which has sometimes caused the node to be killed because it could not free memory fast enough.

@shrinandj (Contributor)

There is a genuine use-case of having requests and NOT limits. Consider the case where a user doesn't know exactly how much cpu/mem is going to be used for a given job. The user can specify some basic value so as to get started. The pod starts with that value guaranteed but can use more. When the pod completes, users can look at resource utilization metrics and can adjust their decorators/cli args, etc.

The above would still be possible with this PR. But the user will have to additionally factor in the value of the KUBERNETES_RESOURCE_REQUESTS option.

There is also the case of actually needing to specify both requests AND limits, which will have to be solved at some point.

The approach in #1612 might be better. WDYT?
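
For reference, the requests-only shape described above looks like this in a container spec (values are illustrative):

```python
# Requests-only (Burstable QoS): the scheduler guarantees the requested
# amounts, but the pod may use more if the node has spare capacity.
resources = {
    "requests": {"cpu": "1", "memory": "2Gi"},
    # no "limits": the pod can burst above its requests
}
```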

@tfurmston (Contributor, Author)

Hey @shrinandj,

Thanks for the review!

> The above would still be possible with this PR. But the user will have to additionally factor in the value of the KUBERNETES_RESOURCE_REQUESTS option.

You mean the user will need to remember to switch this flag?

> There is also the case of actually needing to specify both requests AND limits, which will have to be solved at some point.

You mean requests and limits with different values, right?

If there is a need to support the possibility of different requests & limits, then I agree the approach in this PR would be too restrictive. #1612 is definitely more flexible in that respect.

One thing from my perspective is that I would like an easy way to ensure that all our internal users are specifying limits. For example, this would allow us to start using LimitRanges, which is something we would like to do internally. Expecting users to specify both requests & limits in all the steps of their different workflows themselves wouldn't be the best user experience.

Perhaps a mix of the two approaches would work? So something like:

  • The ability to specify limits via arguments in the Kubernetes decorator.
  • Introduction of some new flag. If the flag is set and limits are not specified in the @kubernetes decorator, then the limits are set to match the requests. If limits are specified in the decorator, they take precedence (see the sketch below).

In this way it would be easy to set limits on all workflows, but also provide flexibility if people want them to be different.
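
A rough sketch of that precedence rule, with hypothetical names (not code from either PR):

```python
# Hypothetical sketch of the proposed precedence rule.

def resolve_limits(decorator_limits, requests, default_limits_flag):
    """Return the 'limits' to attach to the pod spec, or None for no limits."""
    if decorator_limits is not None:
        # Limits given explicitly on the @kubernetes decorator win.
        return dict(decorator_limits)
    if default_limits_flag:
        # Flag set and no explicit limits: default the limits to the requests.
        return dict(requests)
    # Neither: leave limits unset, preserving today's behavior.
    return None
```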

What do you think?

@cstano-attn

Adding on some thoughts here - I think support for both requests and limits would be incredibly helpful. Right now on K8s, about 1/4 of our flows hit the noisy-neighbor problem: pods burst CPU or memory past their requests, and when they are scheduled on the same node (which happens about 25% of the time) they exhaust the node's resources and kill the node.

I would +1 the need for both limits and requests as separate values that can be passed. Setting them to the same value just requires teams to increase everything to account for bursting.
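
For illustration, the burstable shape being asked for, with a guaranteed baseline and a separate ceiling (values are made up):

```python
# Separate requests and limits: guarantee a baseline for scheduling,
# but cap bursting so a single pod cannot exhaust the node.
resources = {
    "requests": {"cpu": "1", "memory": "2Gi"},  # guaranteed baseline
    "limits": {"cpu": "2", "memory": "4Gi"},    # burst ceiling
}
```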

@elgehelge

> There is also the case of actually needing to specify both requests AND limits, which will have to be solved at some point.

> You mean requests and limits with different values, right?

> If there is a need to support the possibility of different requests & limits, then I agree the approach in this PR would be too restrictive. #1612 is definitely more flexible in that respect.

> One thing from my perspective is that I would like an easy way to ensure that all our internal users are specifying limits. For example, this would allow us to start using LimitRanges, which is something we would like to do internally. Expecting users to specify both requests & limits in all the steps of their different workflows themselves wouldn't be the best user experience.

I think I was one of the voices arguing for requests AND limits; however, I have become wiser (I think 🤔). As I understand it now, setting limits means that the requests are automatically set to the same values on the pods. This was basically what I wanted in the first place.
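
For anyone following along, this is standard Kubernetes defaulting: a container that specifies only limits gets matching requests (values illustrative):

```python
# Limits-only: Kubernetes defaults each missing request to the matching
# limit, which is equivalent to requests == limits (Guaranteed QoS).
resources = {
    "limits": {"cpu": "2", "memory": "4Gi"},
    # effective requests: {"cpu": "2", "memory": "4Gi"}
}
```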

But @cstano-attn seems to want the values to be able to differ:

> I would +1 the need for both limits and requests as separate values that can be passed. Setting them to the same value just requires teams to increase everything to account for bursting.

Could you elaborate on why it is beneficial to have separate values? What is a scenario where this is preferred?
