
(Feature) Configurable cost info messages and limits when creating jobs and endpoints #4643

Open
athewsey opened this issue May 2, 2024 · 0 comments


athewsey commented May 2, 2024

Describe the feature you'd like

To promote better end-user visibility of costs and stronger organizational cost management, it would be useful if the SageMaker Python SDK could (when configured by an administrator):

  • Project (via the AWS Price List API) the following (a rough sketch of such a projection follows this list):
    • Maximum possible compute cost for training/processing jobs (with reference to the max_run stopping condition)
    • Maximum per-hour compute cost for real-time inference endpoints (or perhaps the cost at the initial_instance_count, if tracing the auto-scaling config is too hard?)
  • Emit an informational message including the projected/maximum cost
  • Optionally reject the operation if the projection exceeds an organizationally-configured threshold
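
As a very rough sketch of the projection piece, something like the following could look up the on-demand hourly rate through boto3's Price List (pricing) client and multiply it out against the instance count and max_run. The filter attribute names for the AmazonSageMaker service code are assumptions here and would need to be confirmed against the real price list schema; a production version would also want caching and error handling:

```python
import json

import boto3


def project_max_training_cost(instance_type, instance_count, max_run_seconds, region="us-east-1"):
    """Project the worst-case compute cost (USD) for a training job.

    Illustrative only: the filter Field names below are assumptions about the
    AmazonSageMaker price list attributes, not confirmed values.
    """
    # The Price List API is only served from a couple of regions (e.g. us-east-1)
    pricing = boto3.client("pricing", region_name="us-east-1")
    resp = pricing.get_products(
        ServiceCode="AmazonSageMaker",
        Filters=[
            # Assumed attribute names - validate against the actual price list schema
            {"Type": "TERM_MATCH", "Field": "instanceName", "Value": instance_type},
            {"Type": "TERM_MATCH", "Field": "regionCode", "Value": region},
        ],
        MaxResults=1,
    )
    product = json.loads(resp["PriceList"][0])
    # Walk the standard Price List term structure down to the USD hourly rate
    on_demand_term = next(iter(product["terms"]["OnDemand"].values()))
    price_dimension = next(iter(on_demand_term["priceDimensions"].values()))
    hourly_rate = float(price_dimension["pricePerUnit"]["USD"])
    return hourly_rate * instance_count * (max_run_seconds / 3600.0)
```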

How would this feature be used? Please describe.

  • An administrator would enable max-cost projection and optionally configure hard limits through the SDK intelligent defaults YAML config file (a hypothetical config shape follows this list)
    • We could expose the options as arguments to e.g. Estimator, Predictor, etc. as well, but I see minimal value unless an org can turn them on by default for their team.
    • I acknowledge the messaging (especially around training job maximum run time) could be complex and confusing for new users, so I would not suggest enabling either logging or limits for all SDK users by default; just offer it as a configurable feature.
  • When creating a job or endpoint through the normal SDK methods, the data scientist would be notified of the projected (maximum) costs for the action via log messages, and the action would fail with an error if the pre-configured threshold is exceeded.
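
To make the first bullet concrete, the options might sit alongside the existing intelligent-defaults settings. The structure below (shown as the Python dict the YAML file would load into) is purely hypothetical; no "CostControls" section exists in the SDK today:

```python
# Hypothetical shape of the admin-managed defaults config after the YAML is loaded.
# The "CostControls" section and its keys are invented for illustration; only the
# outer SchemaVersion/SageMaker/PythonSDK layout mirrors the existing defaults config.
hypothetical_defaults = {
    "SchemaVersion": "1.0",
    "SageMaker": {
        "PythonSDK": {
            "Modules": {
                "CostControls": {              # invented section
                    "ProjectMaxCost": True,    # emit the informational log messages
                    "MaxCostUSD": 50.0,        # reject operations projected above this
                },
            },
        },
    },
}
```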

For example, the informational message might look something like:

[INFO] With the configured max_run = 3600 seconds, this training job could
generate up to $2.12 in compute instance charges.
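
And the enforcement side could be as simple as the sketch below: log the projection, and raise if an admin-configured threshold is exceeded. The function name and the idea that the threshold flows in from the defaults config are illustrative, not existing SDK behaviour:

```python
import logging
from typing import Optional

logger = logging.getLogger("sagemaker")


def check_projected_cost(projected_cost: float, max_cost_usd: Optional[float]) -> None:
    """Log the projected worst-case compute cost and optionally reject the operation.

    Illustrative only: `max_cost_usd` would come from the admin-managed defaults
    config described above, and None means "log but never block".
    """
    logger.info(
        "With the configured max_run, this training job could generate up to "
        "$%.2f in compute instance charges.",
        projected_cost,
    )
    if max_cost_usd is not None and projected_cost > max_cost_usd:
        raise ValueError(
            f"Projected compute cost ${projected_cost:.2f} exceeds the "
            f"organization-configured limit of ${max_cost_usd:.2f}."
        )
```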

The messaging would need to be carefully chosen to avoid confusion, because:

  • A total cost estimate would involve a range of other factors like configured job EBS size (known up-front) and S3 data access patterns / data transfer fees (impossible to know).
  • Enabling SageMaker Managed Spot could offer some (unknowable?) discount over the on-demand price

Describe alternatives you've considered

I appreciate that it's already possible to restrict IAM CreateTrainingJob & CreateProcessingJob permissions with both sagemaker:InstanceTypes and sagemaker:MaxRuntimeInSeconds condition keys, for strictly-enforceable controls on this... but the resulting AccessDenied errors are hard for end users to understand and don't foster cost-awareness beyond enforcing the hard limits.

Any additional context

This was raised today by a customer of mine who is considering implementing similar functionality in their own internal Python utility for SageMaker, but it's not immediately clear how their internal SDK and the SageMaker Python SDK could integrate for an effective user workflow.
