Actions Runner Controller gives up pending or initializing runner pod too quickly #3516

Closed · sungmincs opened this issue May 14, 2024 · 2 comments
Labels
bug · gha-runner-scale-set · needs triage

Comments


sungmincs commented May 14, 2024

Controller Version

0.9.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Start with 0 build nodes.
2. Trigger a new workflow that runs on a new build node.
3. Let the autoscaler (Karpenter in my case, which takes ~40 s to provision a new node) spin up a new node.
4. The controller considers the runner pending and terminates the runner pod while the pod is still initializing (the pod states can be watched as shown below).
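
To watch the lifecycle from step 4 while the workflow is queued (assuming the arc-runners namespace used throughout this report):

kubectl get pods -n arc-runners -w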

Describe the bug

I have a dedicated arc-runners node pool for building arm64 workloads, and the pool count is 0 until someone needs an arm64 runner. When someone launches a new Actions workflow that runs on the arm64 runner, the controller and the listener are quick to schedule the new runner pod onto the arm64 node pool.
The autoscaler (Karpenter in my case) then kicks in to provision a new node; this takes roughly 30-40 s, and the runner pod moves from Pending into the init state (pulling the runner image).
About 50 s in total after the controller creates the pod, it starts killing the runner pod while it is still in the init state.

2024-05-14T02:21:22Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"ephemeralrunner": {"name":"runner-arm64-bvwtl-runner-pbtc7","namespace":"arc-runners"}}
2024-05-14T02:21:26Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"ephemeralrunnerset": {"name":"runner-arm64-bvwtl","namespace":"arc-runners"}, "pending": 1, "running": 0, "finished": 0, "failed": 0, "deleting": 0}
2024-05-14T02:21:26Z	INFO	EphemeralRunnerSet	Scaling comparison	{"ephemeralrunnerset": {"name":"runner-arm64-bvwtl","namespace":"arc-runners"}, "current": 1, "desired": 0}
2024-05-14T02:21:26Z	INFO	EphemeralRunnerSet	Deleting ephemeral runners (scale down)	{"ephemeralrunnerset": {"name":"runner-arm64-bvwtl","namespace":"arc-runners"}, "count": 1}

I also tested the same scenario with a node that was already running, but this case failed as well because the image pull (~30 s) wasn't fast enough for the runner pod to become ready.
The runner image here was just the default (Pulling image "ghcr.io/actions/actions-runner:latest"), not anything big that I customized.
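
For the warm-node case above, one generic Kubernetes mitigation (not an ARC feature) is to keep the runner image cached on the build nodes with a pre-puller DaemonSet, so a fresh runner pod skips the pull entirely. A minimal sketch; the DaemonSet name is made up, and the nodeSelector mirrors the runner spec below:

# Hypothetical pre-puller: keeps the runner image cached on every build-runner node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: runner-image-prepuller   # hypothetical name
  namespace: arc-runners
spec:
  selector:
    matchLabels:
      app: runner-image-prepuller
  template:
    metadata:
      labels:
        app: runner-image-prepuller
    spec:
      nodeSelector:
        myproject.io/node-tier: build-runner   # same tier the runner pods target
      containers:
        - name: prepull
          image: ghcr.io/actions/actions-runner:latest
          command: ["sleep", "infinity"]       # idle; its only job is keeping the image cached

Note that this does not help the scale-from-zero case: a node Karpenter has just provisioned still has to pull the image once.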

Describe the expected behavior

The controller should wait longer for the runner pod to become ready, or expose a wait timeout / retry configuration so users can set how much delay they can tolerate in node scale-up scenarios; a sketch of what such a knob could look like follows.
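
For illustration only, the requested configuration might look like the values below. Both keys are hypothetical and do not exist in the gha-runner-scale-set chart today:

# Hypothetical chart values -- illustrating the feature request, not existing keys.
runnerPodStartupTimeout: 5m   # how long to tolerate a Pending/Init runner pod before deleting it
runnerPodStartupRetries: 3    # how many re-creates to attempt before marking the runner failed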

Additional Context

Controller

VERSION=0.9.1
NAMESPACE="arc-systems"
helm upgrade --install arc \
    --namespace "${NAMESPACE}" \
    --create-namespace \
    --version ${VERSION} \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
    --set nodeSelector."myproject\\.io/node-tier=core-system"

Runner

INSTALLATION_NAME="runner-arm64"
NAMESPACE="arc-runners"
GITHUB_CONFIG_URL="https://github.com/myorg"
helm upgrade --install "${INSTALLATION_NAME}" \
    --namespace "${NAMESPACE}" \
    --create-namespace \
    --version ${VERSION} \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
    --values - <<EOF
githubConfigUrl: ${GITHUB_CONFIG_URL}
githubConfigSecret:
  github_token: ${GITHUB_PAT}
template:
  spec:
    initContainers:
      - name: init-dind-externals
        image: ghcr.io/actions/actions-runner:latest
        command:
          ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
        volumeMounts:
          - name: dind-externals
            mountPath: /home/runner/tmpDir
    nodeSelector:
      myproject.io/node-tier: build-runner
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: DOCKER_HOST
            value: unix:///var/run/docker.sock
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
      - name: dind
        image: docker:dind
        args:
          - dockerd
          - --host=unix:///var/run/docker.sock
          - --group=123
        env:
          - name: DOCKER_GROUP_GID
            value: "123"
        securityContext:
          privileged: true
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
          - name: dind-externals
            mountPath: /home/runner/externals
    volumes:
      - name: work
        emptyDir: {}
      - name: dind-sock
        emptyDir: {}
      - name: dind-externals
        emptyDir: {}
EOF

Controller Logs

https://gist.github.com/sungmincs/5f34fc2e4f59cc34315398408db89861

Runner Pod Logs

The pod was terminated before becoming ready; note the age of the pod below.

$ kubectl get pod -n arc-runners
NAME                              READY   STATUS     RESTARTS   AGE
runner-arm64-bvwtl-runner-pbtc7   0/2     Init:0/1   0          43s
$ kubectl get pod -n arc-runners
NAME                              READY   STATUS            RESTARTS   AGE
runner-arm64-bvwtl-runner-pbtc7   0/2     PodInitializing   0          45s
$ kubectl get pod -n arc-runners
NAME                              READY   STATUS        RESTARTS   AGE
runner-arm64-bvwtl-runner-pbtc7   0/2     Terminating   0          49s
sungmincs added the bug, gha-runner-scale-set, and needs triage labels on May 14, 2024
Contributor

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@nikola-jokic
Member

Closing this one as a duplicate of #3450
