Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Serve] Change back from autodown to autostop #3535

Merged
merged 6 commits into from
May 17, 2024

Conversation

cblmemo
Copy link
Collaborator

@cblmemo cblmemo commented May 10, 2024

#3377 changes the autostop for skyserve controller to autodown, which will teardown the controller when the sky serve controller job exited unexpectedly and remove any related replica information/logs. This PR changes it back to autostop to preserve the info.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Terminate sky serve controller, and spin up a new service on k8s controller, it successfully launched and without autostop. Using the following config yaml:
jobs:
  controller:
    resources:
      cloud: kubernetes
      cpus: 4+
      memory: 4+
kubernetes:
  remote_identity: SERVICE_ACCOUNT
serve:
  controller:
    resources:
      cloud: kubernetes
      cpus: 4+
      memory: 4+
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@Michaelvll
Copy link
Collaborator

For discussion, this seems fine to stop serve controller, since we have made the controller on kubernetes to skip the autostop configuration in #3521. Wdyt @romilbhardwaj?

@cblmemo please check #3521 #3524 #3525, we need to make sure the serve controller will skip the autostop setting to allow the serve controller to run on kubernetes

@romilbhardwaj
Copy link
Collaborator

Yes, should be okay if we skip autostop similar to #3521. So IIUC, the behavior is going to be:

Serve and jobs controller:

  • On k8s - run indefinitely
  • On other clouds - autostop after configured time

Let's go with this for now and reevaluate the "run indefinitely" based on user feedback.

@cblmemo
Copy link
Collaborator Author

cblmemo commented May 14, 2024

bumping for this @Michaelvll @romilbhardwaj - are there any changes I need to make for this PR? IIUC it will automatically skip the autostop for k8s controller?

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this @cblmemo! LGTM.

@cblmemo cblmemo merged commit e134a35 into master May 17, 2024
20 checks passed
@cblmemo cblmemo deleted the fix-down-serve-controller branch May 17, 2024 03:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants