
(Feature) Configurable cost info messages and limits when creating jobs and endpoints #4643

Open
athewsey opened this issue May 2, 2024 · 0 comments


athewsey commented May 2, 2024

Describe the feature you'd like

To promote better end-user visibility of costs and stronger organizational cost management, it would be useful if the SageMaker Python SDK could (when configured by an administrator):

  • Project (via the AWS Price List API) the following (a rough sketch of such a projection follows this list):
    • Maximum possible compute cost for training/processing jobs (with reference to the max_run stopping condition)
    • Maximum per-hour compute cost for real-time inference endpoints (or perhaps the cost at the initial_instance_count, if tracing the auto-scaling config is too hard?)
  • Emit an informational message including the projected/maximum cost
  • Optionally reject the operation if the projection exceeds an organizationally-configured threshold
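
As a very rough sketch of the projection piece, something like the following could look up the on-demand hourly rate through boto3's Price List (pricing) client and multiply it out against the instance count and max_run. The filter attribute names for the AmazonSageMaker service code are assumptions here and would need to be confirmed against the real price list schema; a production version would also want caching and error handling:

```python
import json

import boto3


def project_max_training_cost(instance_type, instance_count, max_run_seconds, region="us-east-1"):
    """Project the worst-case compute cost (USD) for a training job.

    Illustrative only: the filter Field names below are assumptions about the
    AmazonSageMaker price list attributes, not confirmed values.
    """
    # The Price List API is only served from a couple of regions (e.g. us-east-1)
    pricing = boto3.client("pricing", region_name="us-east-1")
    resp = pricing.get_products(
        ServiceCode="AmazonSageMaker",
        Filters=[
            # Assumed attribute names - validate against the actual price list schema
            {"Type": "TERM_MATCH", "Field": "instanceName", "Value": instance_type},
            {"Type": "TERM_MATCH", "Field": "regionCode", "Value": region},
        ],
        MaxResults=1,
    )
    product = json.loads(resp["PriceList"][0])
    # Walk the standard Price List term structure down to the USD hourly rate
    on_demand_term = next(iter(product["terms"]["OnDemand"].values()))
    price_dimension = next(iter(on_demand_term["priceDimensions"].values()))
    hourly_rate = float(price_dimension["pricePerUnit"]["USD"])
    return hourly_rate * instance_count * (max_run_seconds / 3600.0)
```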

How would this feature be used? Please describe.

  • An administrator would enable max-cost projection and optionally configure hard limits through the SDK intelligent defaults YAML config file (a hypothetical config shape follows this list)
    • We could expose the options as arguments to e.g. Estimator, Predictor, etc. as well, but I see minimal value unless an org can turn them on by default for their team.
    • I acknowledge the messaging (especially around training job maximum run time) could be complex and confusing for new users, so I would not suggest enabling either logging or limits for all SDK users by default; just offer it as a configurable feature.
  • When creating a job or endpoint through the normal SDK methods, the data scientist would be notified of the projected (maximum) costs for the action via log messages, and the action would fail with an error if the pre-configured threshold is exceeded.
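
To make the first bullet concrete, the options might sit alongside the existing intelligent-defaults settings. The structure below (shown as the Python dict the YAML file would load into) is purely hypothetical; no "CostControls" section exists in the SDK today:

```python
# Hypothetical shape of the admin-managed defaults config after the YAML is loaded.
# The "CostControls" section and its keys are invented for illustration; only the
# outer SchemaVersion/SageMaker/PythonSDK layout mirrors the existing defaults config.
hypothetical_defaults = {
    "SchemaVersion": "1.0",
    "SageMaker": {
        "PythonSDK": {
            "Modules": {
                "CostControls": {              # invented section
                    "ProjectMaxCost": True,    # emit the informational log messages
                    "MaxCostUSD": 50.0,        # reject operations projected above this
                },
            },
        },
    },
}
```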

For example, the informational message might look something like:

[INFO] With the configured max_run = 3600 seconds, this training job could
generate up to $2.12 in compute instance charges.
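
And the enforcement side could be as simple as the sketch below: log the projection, and raise if an admin-configured threshold is exceeded. The function name and the idea that the threshold flows in from the defaults config are illustrative, not existing SDK behaviour:

```python
import logging
from typing import Optional

logger = logging.getLogger("sagemaker")


def check_projected_cost(projected_cost: float, max_cost_usd: Optional[float]) -> None:
    """Log the projected worst-case compute cost and optionally reject the operation.

    Illustrative only: `max_cost_usd` would come from the admin-managed defaults
    config described above, and None means "log but never block".
    """
    logger.info(
        "With the configured max_run, this training job could generate up to "
        "$%.2f in compute instance charges.",
        projected_cost,
    )
    if max_cost_usd is not None and projected_cost > max_cost_usd:
        raise ValueError(
            f"Projected compute cost ${projected_cost:.2f} exceeds the "
            f"organization-configured limit of ${max_cost_usd:.2f}."
        )
```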

The messaging would need to be carefully chosen to avoid confusion, because:

  • A total cost estimate would involve a range of other factors like configured job EBS size (known up-front) and S3 data access patterns / data transfer fees (impossible to know).
  • Enabling SageMaker Managed Spot could offer some (unknowable?) discount over the on-demand price

Describe alternatives you've considered

I appreciate that it's already possible to restrict IAM CreateTrainingJob & CreateProcessingJob permissions with both sagemaker:InstanceTypes and sagemaker:MaxRuntimeInSeconds condition keys, for strictly-enforceable controls on this... but the resulting AccessDenied errors are hard for end users to understand and don't foster cost-awareness beyond enforcing the hard limits.

Any additional context

This was raised today by a customer of mine who is considering implementing similar functionality in their own internal Python utility for SageMaker, but it's not immediately clear how their internal SDK and the SageMaker Python SDK could integrate for an effective user workflow.
