CML kubernetes pod destruction - job still running #1332

Open
rickvanveen opened this issue Feb 1, 2023 · 5 comments
Labels
external-request (You asked, we did) · invalid (This doesn't seem right) · p2-nice-to-have (Low priority)

Comments

@rickvanveen

rickvanveen commented Feb 1, 2023

Hi, this problem occurred after #1330 was fixed. Somehow the runner is still reported as running a job when the deregistration request is made. Can/should I set a "grace period" somewhere?

{"level":"info","message":"Unregistering runner cml-cml-runner-1x1f33ub-5fg9nhw1..."}
{"level":"info","message":"GET /repos/cml-test/cml-tutorial-1/actions/runners?per_page=100 - 200 in 105ms"}
{"level":"info","message":"DELETE /repos/cml-test/cml-tutorial-1/actions/runners/30 - 422 in 136ms"}
{"level":"warn","message":"\tCancelling shutdown: Bad request - Runner \"cml-cml-runner-1x1f33ub-5fg9nhw1\" is still running a job\""}

EDIT:
Found a similar issue #1103.
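
The thread does not say whether CML has a grace-period setting. Purely as an illustration of the idea, retrying the deregistration for a bounded time could look like the sketch below. `delete_runner` is a hypothetical callable that sends the DELETE request and returns the HTTP status code; 422 is the status shown in the log above, and 204 is the success status for that endpoint.

```python
import time
from typing import Callable


def unregister_with_grace(
    delete_runner: Callable[[], int],
    grace_seconds: float = 60.0,
    interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry deregistration until it succeeds or the grace period runs out.

    `delete_runner` is a hypothetical helper (not part of CML) that issues
    the DELETE request and returns the HTTP status code.
    """
    waited = 0.0
    while True:
        status = delete_runner()
        if status == 204:  # runner removed
            return True
        if status != 422:  # anything else is not the "still running a job" case
            raise RuntimeError(f"unexpected status {status}")
        if waited >= grace_seconds:
            return False  # still busy after the grace period
        sleep(interval_seconds)
        waited += interval_seconds
```

The `sleep` parameter is injectable only so the loop can be exercised without real waiting.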

@0x2b3bfa0
Member

@rickvanveen, that warning message is a side effect of the fixes we applied for #1255, and you can safely ignore it as long as runners are being deregistered. [1]

Footnotes

  1. If they aren't being deregistered, then it's an issue. Can you please check that they aren't actually busy?
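
One way to check for busy runners: the response of `GET /repos/{owner}/{repo}/actions/runners` (the call visible in the logs) includes a `busy` boolean for each runner. A minimal sketch that filters such a response follows; the sample payload is made up for illustration and only mirrors the documented response shape.

```python
from typing import Any


def busy_runners(payload: dict[str, Any]) -> list[str]:
    """Return the names of runners that GitHub reports as busy."""
    return [r["name"] for r in payload.get("runners", []) if r.get("busy")]


# Illustrative payload only; real runner names and ids will differ.
sample = {
    "total_count": 2,
    "runners": [
        {"id": 1, "name": "runner-a", "status": "online", "busy": True},
        {"id": 2, "name": "runner-b", "status": "online", "busy": False},
    ],
}
print(busy_runners(sample))
```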

@rickvanveen
Author

Hi, I checked and the runner is deregistered from GitHub. However, the pod is still running and has not been destroyed.

@0x2b3bfa0
Member

Do you have any additional logs after the runner was deregistered?

@0x2b3bfa0
Member

Pods should be destroyed after the runner process finishes running. If not, maybe it was because of #1330 and some stale container cache?

@rickvanveen
Author

rickvanveen commented Feb 2, 2023

> Do you have any additional logs after the runner was deregistered?

I don't have any additional logs after the 4 lines I already included. After that it does nothing anymore, but the pod still shows as running and the job did not complete. Full log of `kubectl logs -f cml-cml-runner-5j5e2hpl-2b7ce9re-wfw7b`:

Failed to get unit file state for cml.service: No such file or directory
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 84.5M  100 84.5M    0     0  21.2M      0  0:00:03  0:00:03 --:--:-- 25.8M
bash: line 46: lsof: command not found
{"level":"info","message":"POST /repos/cml-test/cml-tutorial-1/actions/runners/registration-token - 201 in 147ms"}
{"level":"info","message":"GET /repos/cml-test/cml-tutorial-1/actions/runners?per_page=100 - 200 in 83ms"}
{"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
{"level":"info","message":"Preparing workdir /home/runner..."}
{"level":"info","message":"Launching github runner"}
{"level":"info","message":"Terraform 1.3.7"}
{"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
{"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
{"level":"info","message":"Outputs: 0"}
{"level":"warn","message":"Error connecting to ACPI socket: connect ENOENT /var/run/acpid.socket. The acpid.service helps with instance termination detection."}
{"level":"info","message":"POST /repos/cml-test/cml-tutorial-1/actions/runners/registration-token - 201 in 150ms"}
{"date":"2023-02-02T09:35:11.240Z","level":"info","message":"runner status","repo":"https://github.company/cml-test/cml-tutorial-1","status":"ready"}
{"date":"2023-02-02T09:35:20.655Z","level":"info","message":"runner status","repo":"https://github.company/cml-test/cml-tutorial-1","status":"job_started"}
{"date":"2023-02-02T09:36:41.608Z","level":"info","message":"runner status","repo":"https://github.company/cml-test/cml-tutorial-1","status":"job_ended","success":true}
{"level":"info","message":"Unregistering runner cml-cml-runner-5j5e2hpl-2b7ce9re..."}
{"level":"info","message":"GET /repos/cml-test/cml-tutorial-1/actions/runners?per_page=100 - 200 in 123ms"}
{"level":"info","message":"DELETE /repos/cml-test/cml-tutorial-1/actions/runners/34 - 422 in 169ms"}
{"level":"warn","message":"\tCancelling shutdown: Bad request - Runner \"cml-cml-runner-5j5e2hpl-2b7ce9re\" is still running a job\""}

Plot twist: yesterday I had two attempts that somehow did terminate. Today I tried to reproduce that, but it's hanging again. When it happens again I will add those logs... EDIT: it happened.

{"level":"info","message":"Unregistering runner cml-cml-runner-1t7mdm1s-2q0r74ue..."}
{"level":"info","message":"GET /repos/cml-test/cml-tutorial-1/actions/runners?per_page=100 - 200 in 202ms"}
{"level":"info","message":"DELETE /repos/cml-test/cml-tutorial-1/actions/runners/37 - 204 in 232ms"}
{"level":"info","message":"\tSuccess"}
{"level":"info","message":"Waiting 10 seconds to destroy"}
{"level":"error","message":"\tFailed destroying with LEO: leo,destroy-runner,--cloud,kubernetes,--region,us-west,cml-cml-runner-1t7mdm1s-2q0r74ue\n\t\n\t2023/02/02 09:58:12 [INFO] Deleting job: \"cml-cml-runner-1t7mdm1s-2q0r74ue\"\n2023/02/02 09:58:12 [DEBUG] Received error: &url.Error{Op:\"Get\", URL:\"https://hpc.company:9500/apis/batch/v1/namespaces/demo/jobs/cml-cml-runner-1t7mdm1s-2q0r74ue\", Err:(*errors.errorString)(0xc000379d60)}\nError: Get \"https://hpc.company:9500/apis/batch/v1/namespaces/demo/jobs/cml-cml-runner-1t7mdm1s-2q0r74ue\": getting credentials: exec: executable kubectl not found\n\nIt looks like you are trying to use a client-go credential plugin that is not installed.\n\nTo learn more about this feature, consult the documentation available at:\n      https://kubernetes.io/docs/reference/access-authn-authz/authentication/#client-go-credential-plugins\nUsage:\n  leo destroy-runner <identifier> [flags]\n\nFlags:\n  -h, --help   help for destroy-runner\n\nGlobal Flags:\n      --cloud string    cloud provider\n      --region string   cloud region (default \"us-east\")\n      --verbose         verbose output\n\n"}
{"level":"info","message":"runner status","reason":"single job","status":"terminated"}
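
The error in the log above says the credentials come from a client-go exec plugin whose executable (`kubectl`) is not available where `leo` runs. A rough sketch for spotting this ahead of time is shown below. It assumes the kubeconfig has already been parsed into a dict (for example with a YAML library); the user name `demo-user` in the example is hypothetical.

```python
import shutil
from typing import Any, Callable, Optional


def missing_exec_commands(
    kubeconfig: dict[str, Any],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """List exec credential plugin commands that cannot be found on PATH."""
    missing: list[str] = []
    for entry in kubeconfig.get("users") or []:
        exec_cfg = (entry.get("user") or {}).get("exec")
        if not exec_cfg:
            continue  # this user does not rely on an exec plugin
        command = exec_cfg.get("command")
        if command and which(command) is None and command not in missing:
            missing.append(command)
    return missing


# Hypothetical kubeconfig fragment, already parsed into a dict.
example = {"users": [{"name": "demo-user", "user": {"exec": {"command": "kubectl"}}}]}
print(missing_exec_commands(example))
```

An empty result means every referenced plugin executable was found on the current `PATH`.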

> Pods should be destroyed after the runner process finishes running.

This is something I was wondering about. If the pod finishes, it receives the status "Completed" but is not "destroyed" in the sense that it is "Terminated" and gone. I don't know how this fits within the philosophy of Kubernetes and the expected behavior of CML.
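
For context, on the Kubernetes side a pod whose containers all exited successfully stays in the API in phase `Succeeded` (shown by `kubectl` as "Completed") until something deletes it or its owning Job; Jobs can also be cleaned up automatically via `ttlSecondsAfterFinished`. How CML is meant to handle this is not settled in this thread. A small sketch for listing such leftover pods from the output of `kubectl get pods -o json`, with made-up pod names:

```python
from typing import Any


def finished_pods(pod_list: dict[str, Any]) -> list[str]:
    """Return names of pods whose phase is Succeeded (kubectl shows Completed)."""
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item.get("status", {}).get("phase") == "Succeeded"
    ]


# Illustrative data only, shaped like `kubectl get pods -o json` output.
pods = {
    "items": [
        {"metadata": {"name": "runner-pod-a"}, "status": {"phase": "Succeeded"}},
        {"metadata": {"name": "runner-pod-b"}, "status": {"phase": "Running"}},
    ]
}
print(finished_pods(pods))
```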
