
Issue 1544: Add support for Kubernetes limits #1566

Open · tfurmston wants to merge 4 commits into master

Conversation

@tfurmston (Contributor) commented Oct 3, 2023

This PR adds support for specifying limits (as opposed to requests) for workflows on Kubernetes. It addresses issue 1544.

Notes:

  • I've added a single flag in the Metaflow config to cover CPU, RAM & disk space. Not sure if you would prefer different flags for each resource? (See the sketch after this list.)
  • I've tested by performing some manual runs. (I couldn't find any relevant tests, so I assume this is the current practice for this part of the code base.)
  • There is an open comment on the issue. Not sure how you would like to address that comment?
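
To make the intent concrete, here is a minimal sketch of the single-flag behavior described above; the flag and helper names are illustrative, not the PR's actual code:

```python
# Illustrative sketch only: flag and helper names are hypothetical,
# not the actual Metaflow implementation in this PR.

def build_resources(requests, mirror_requests_as_limits):
    """Build the 'resources' section of a Kubernetes container spec.

    requests: e.g. {"cpu": "2", "memory": "4Gi", "ephemeral-storage": "10Gi"}
    mirror_requests_as_limits: the single config flag covering CPU, RAM and disk.
    """
    resources = {"requests": dict(requests)}
    if mirror_requests_as_limits:
        # Mirror every request into a matching limit; with requests == limits
        # the pod gets Guaranteed QoS and works well with LimitRange policies.
        resources["limits"] = dict(requests)
    return resources


print(build_resources({"cpu": "2", "memory": "4Gi"}, mirror_requests_as_limits=True))
# {'requests': {'cpu': '2', 'memory': '4Gi'}, 'limits': {'cpu': '2', 'memory': '4Gi'}}
```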

@tfurmston tfurmston marked this pull request as draft October 3, 2023 21:10
@tfurmston tfurmston changed the title Issue 1544: Add support for Kubernetes limits [WIP] Issue 1544: Add support for Kubernetes limits Oct 3, 2023
@tfurmston tfurmston marked this pull request as ready for review October 8, 2023 09:20
@tfurmston tfurmston changed the title [WIP] Issue 1544: Add support for Kubernetes limits Issue 1544: Add support for Kubernetes limits Oct 8, 2023
@elgehelge

To me it seems odd to have either requests or limits, but I am not a Kubernetes expert myself, so the only feedback I can really give is to seek advice in a Cloud Native community.

For my own specific use case, I need a couple of different requests to make sure the job can run, while also limiting the memory to avoid all of it being consumed, which has sometimes caused the node to be killed because it could not free memory fast enough.

@shrinandj (Contributor)

There is a genuine use-case of having requests and NOT limits. Consider the case where a user doesn't know exactly how much cpu/mem is going to be used for a given job. The user can specify some basic value so as to get started. The pod starts with that value guaranteed but can use more. When the pod completes, users can look at resource utilization metrics and can adjust their decorators/cli args, etc.

The above would still be possible with this PR. But the user will have to additionally factor in the value of the KUBERNETES_RESOURCE_REQUESTS option.

There is also the case of actually needing to specify both requests AND limits, which will have to be solved at some point.

The approach in #1612 might be better. WDYT?
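
For reference, the requests-only shape described above looks like this in a container spec (values are illustrative):

```python
# Requests-only (Burstable QoS): the scheduler guarantees the requested
# amounts, but the pod may use more if the node has spare capacity.
resources = {
    "requests": {"cpu": "1", "memory": "2Gi"},
    # no "limits": the pod can burst above its requests
}
```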

@tfurmston (Contributor, Author)

Hey @shrinandj,

Thanks for the review!

> The above would still be possible with this PR. But the user will have to additionally factor in the value of the KUBERNETES_RESOURCE_REQUESTS option.

You mean the user will need to remember to switch this flag?

> There is also the case of actually needing to specify both requests AND limits, which will have to be solved at some point.

You mean requests and limits with different values, right?

If there is a need to support the possibility of different requests & limits, then I agree the approach in this PR would be too restrictive. #1612 is definitely more flexible in that respect.

One thing from my perspective is that I would like an easy way to ensure that all our internal users are specifying limits. For example, this would allow us to start using LimitRanges, which is something we would like to do internally. Expecting users to specify both requests & limits in all the steps of their different workflows themselves wouldn't be the best user experience.

Perhaps a mix of the two approaches would work? So something like:

  • The ability to specify limits via arguments in the Kubernetes decorator.
  • Introduction of some new flag. If the flag is set and limits are not specified in the @kubernetes decorator, then the limits are set to match the requests. If limits are specified in the decorator, they take precedence (see the sketch below).

In this way it would be easy to set limits on all workflows, but also provide flexibility if people want them to be different.
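
A rough sketch of that precedence rule, with hypothetical names (not code from either PR):

```python
# Hypothetical sketch of the proposed precedence rule.

def resolve_limits(decorator_limits, requests, default_limits_flag):
    """Return the 'limits' to attach to the pod spec, or None for no limits."""
    if decorator_limits is not None:
        # Limits given explicitly on the @kubernetes decorator win.
        return dict(decorator_limits)
    if default_limits_flag:
        # Flag set and no explicit limits: default the limits to the requests.
        return dict(requests)
    # Neither: leave limits unset, preserving today's behavior.
    return None
```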

What do you think?

@cstano-attn

Adding on some thoughts here - I think support for both requests and limits would be incredibly helpful. Right now on K8s, about 1/4 of our flows hit the noisy-neighbor problem: pods burst CPU or memory past their requests, and when they are scheduled on the same node (which happens about 25% of the time) they exhaust the node's resources and kill the node.

I would +1 the need for both limits and requests as separate values that can be passed. Setting them to the same value just requires teams to increase everything to account for bursting.
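
For illustration, the burstable shape being asked for, with a guaranteed baseline and a separate ceiling (values are made up):

```python
# Separate requests and limits: guarantee a baseline for scheduling,
# but cap bursting so a single pod cannot exhaust the node.
resources = {
    "requests": {"cpu": "1", "memory": "2Gi"},  # guaranteed baseline
    "limits": {"cpu": "2", "memory": "4Gi"},    # burst ceiling
}
```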

@elgehelge

> There is also the case of actually needing to specify both requests AND limits, which will have to be solved at some point.

> You mean requests and limits with different values, right?

> If there is a need to support the possibility of different requests & limits, then I agree the approach in this PR would be too restrictive. #1612 is definitely more flexible in that respect.

> One thing from my perspective is that I would like an easy way to ensure that all our internal users are specifying limits. For example, this would allow us to start using LimitRanges, which is something we would like to do internally. Expecting users to specify both requests & limits in all the steps of their different workflows themselves wouldn't be the best user experience.

I think I was one of the voices arguing for requests AND limits; however, I have become wiser (I think 🤔). As I understand it now, setting limits means that the requests are automatically set to the same values on the pods. This was basically what I wanted in the first place.
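
For anyone following along, this is standard Kubernetes defaulting: a container that specifies only limits gets matching requests (values illustrative):

```python
# Limits-only: Kubernetes defaults each missing request to the matching
# limit, which is equivalent to requests == limits (Guaranteed QoS).
resources = {
    "limits": {"cpu": "2", "memory": "4Gi"},
    # effective requests: {"cpu": "2", "memory": "4Gi"}
}
```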

But @cstano-attn seems to want the values to be able to differ:

> I would +1 the need for both limits and requests as separate values that can be passed. Setting them to the same value just requires teams to increase everything to account for bursting.

Could you elaborate on why it is beneficial to have separate values? What is a scenario where this is preferred?
