No way to add tagged worker to the default worker pool #6565

Open
drahnr opened this issue Feb 23, 2021 · 3 comments
drahnr commented Feb 23, 2021

Summary

Currently there is no way, from the worker CLI, to register a tagged worker so that it also serves the default (untagged) pool.

The issue at hand:

I have quite a few container images and resources that are used on both tagged and untagged workers.

The tasks either carry the tag framework:cuda or are untagged.

The result is that the containers are unavailable to the tagged task on the tagged worker.

lemonaid is a worker tagged with framework:cuda, framework:opencl, lemonaid, while orangecestswirl is an untagged worker.


In my case, the tagged workers are rather ordinary instances that just happen to have a couple of special features, but they should also be considered for regular, untagged tasks.

Steps to reproduce

resources:
- name: container-fedora-cuda
  type: registry-image

  tags: [framework:cuda]
  source:
    repository: quay.io/spearow/machine-learning-container-fedora-cuda
    username: ((username))
    password: ((password))
- name: container-fedora-default
  type: registry-image

  tags: [framework:opencl,framework:cuda]
  source:
    repository: quay.io/spearow/machine-learning-container-fedora-default
    username: ((username))
    password: ((password))

- name: container-fedora-native
  type: registry-image

  tags: []
  source:
    repository: quay.io/spearow/machine-learning-container-fedora-native
    username: ((username))
    password: ((password))

jobs:
  - name: pr-test-juice
    build_logs_to_retain: 4
    public: true
    serial: true
    plan:
    - get: pr-juice
      trigger: true
      version: every
    - get: container-fedora-cuda
      trigger: true

    - get: container-fedora-default
      trigger: true

    - get: container-fedora-native
      trigger: true

    - put: pr-juice-stat
      resource: pr-juice
      get_params:
        skip_download: true
      params:
        path: pr-juice
        base_context: sirmergealot
        context: greenlit
        status: pending
    - in_parallel:

        limit: 1
        fail_fast: false
        steps:
        - task: pr-coaster-test-fedora-cuda
          image: container-fedora-cuda

          tags: [framework:cuda]

          config:
            platform: linux
            inputs:
            - name: pr-juice
            caches:
            - path: cargo_home
            - path: pr-juice/coaster/target
            run:
              path: sh
              args:
              - -exc
              - |
                # two levels up is the cache
                export CARGO_HOME=$(realpath "$(pwd)/../../cargo_home")
                echo "Sweet (cached) home is \"$CARGO_HOME\""
                # export RUST_LOG=coaster=warn,rcublas=debug,rcublas-sys=debug
                mkdir -p $CARGO_HOME
                prepare
                cargo-override-injection
                export RUST_LOG=juice=debug
                export RUST_BACKTRACE=1
                cargo test --no-default-features --features=native,cuda -- --nocapture
              dir: pr-juice/coaster
          on_failure:
            put: pr-juice-stat
            resource: pr-juice
            get_params:
              skip_download: true
            params:
              path: pr-juice
              base_context: sirmergealot
              context: coaster-test-fedora-cuda
              status: failure
          on_success:
            put: pr-juice-stat
            resource: pr-juice
            get_params:
              skip_download: true
            params:
              path: pr-juice
              base_context: sirmergealot
              context: coaster-test-fedora-cuda
              status: success

https://ci.spearow.io/teams/spearow/pipelines/juice/resources/container-fedora-cuda

Expected results

The resources are available to the tagged task on the tagged worker.

Actual results

Resource is not available on the tagged worker.

Additional context

The whole situation could be solved by adding a CLI flag that allows registering the worker in the default pool as well: the machine really does have special features, but it should be used for normal tasks too.
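For illustration only, roughly what that could look like on the worker side. --work-dir and --tag are existing concourse worker flags; the last flag is purely hypothetical and does not exist today (other required flags such as the TSA host and keys are omitted):

# today: registering the worker with tags takes it out of the default pool
concourse worker \
  --work-dir /opt/concourse/worker \
  --tag framework:cuda --tag framework:opencl --tag lemonaid

# hypothetical proposal (flag name made up, not implemented): keep the worker
# in the default pool so it also picks up untagged steps
concourse worker \
  --work-dir /opt/concourse/worker \
  --tag framework:cuda --tag framework:opencl --tag lemonaid \
  --also-serve-untagged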

Triaging info

  • Concourse version: 6.7.2
  • Browser (if applicable): firefox 85.0.1 (64-bit)
  • Did this used to work? Yes
drahnr added the bug label Feb 23, 2021
taylorsilva (Member) commented

If I'm understanding correctly, you want some (or all?) containers to land on the specified tagged worker or an untagged worker?

This kinda goes against the way tags currently work. Tags are mainly used for isolating workloads and are AND'd together. From the docs:

The step will be placed within a pool of workers that match all of the given set of tags.
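For example (a hypothetical step, not taken from your pipeline), a step declared like this will only ever land on a worker that advertises both framework:cuda and lemonaid; an untagged worker, or a worker with only one of the two tags, is never considered for it:

jobs:
- name: example-job
  plan:
  - task: needs-both-tags
    # only workers registered with BOTH of these tags qualify
    tags: [framework:cuda, lemonaid]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source: {repository: busybox}
      run: {path: uname, args: [-a]}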


So you have a tagged worker, lemonaid. You have a resource, container-fedora-cuda, that should run on that worker, but the error is saying that worker doesn't have the registry-image resource. What do you see when you run fly workers -d under the resource types column for the lemonaid worker? Does it have the registry-image resource?


I'm confused by what you said here:

The result is that the containers are unavailable to the tagged task on the tagged worker.

That's kinda the point... Why would a task on one worker try and reach out to a task/container on another worker? Each step is meant to be run isolated from other steps. If you need to pass information between containers, use inputs/outputs. Or does your use-case fall into something that services would eventually fix? concourse/rfcs#84
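For reference, a minimal sketch (all names made up) of passing data between two task steps through outputs/inputs instead of expecting one container to reach another:

plan:
- task: produce
  config:
    platform: linux
    image_resource:
      type: registry-image
      source: {repository: busybox}
    # whatever this step writes into artifact/ becomes a build artifact
    outputs:
    - name: artifact
    run:
      path: sh
      args: [-exc, 'echo hello > artifact/greeting']
- task: consume
  config:
    platform: linux
    image_resource:
      type: registry-image
      source: {repository: busybox}
    # the artifact produced above is mounted here as an input
    inputs:
    - name: artifact
    run:
      path: sh
      args: [-exc, 'cat artifact/greeting']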


Hope some of that helps! I'm very confused but intrigued by what you're doing :)

drahnr commented Feb 23, 2021

If I'm understanding correctly, you want some (or all?) containers to land on the specified tagged worker or an untagged worker?

This kinda goes against the way tags currently work. Tags are mainly used for isolating workloads and are AND'd together. From the docs:

The step will be placed within a pool of workers that match all of the given set of tags.

So you have a tagged worker, lemonaid. You have a resource, container-fedora-cuda, that should run on that worker, but the error is saying that worker doesn't have the registry-image resource. What do you see when you run fly workers -d under the resource types column for the lemonaid worker? Does it have the registry-image resource?

lemonaid         0  linux  lemonaid, framework:cuda, framework:opencl  none  running  2.3  6h37m  127.0.0.1:45289  http://127.0.0.1:34267  0  bosh-io-release, bosh-io-stemcell, cf, docker-image, git, github-release, hg, mock, pool, registry-image, s3, semver, time, tracker
orangecestswirl  0  linux  none                                        none  running  2.2  9h55m  127.0.0.1:35331  http://127.0.0.1:33887  0  bosh-io-release, bosh-io-stemcell, cf, docker-image, git, github-release, hg, mock, pool, registry-image, s3, semver, time, tracker

The result is that the containers are unavailable to the tagged task on the tagged worker.

The expectation was that, with the task tagged, the resource would be available wherever the task is executed, but I guess that is a misconception on my end.

Hope some of that helps! I'm very confused but intrigued by what you're doing :)

My issue is twofold:

  1. I generally would like tagged workers to also be part of the default pool, since the special hardware is only needed by a small fraction of the tasks that are run.
  2. Duplicating resources once per tag set seemed a bit verbose, but I guess it is at least consistent (a rough sketch of what that looks like follows below).
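Roughly what the duplication looks like; the -untagged name and the YAML anchor are made up for illustration, the source values are the ones already used in the pipeline above:

resources:
# one declaration pinned to the tagged pool ...
- name: container-fedora-cuda
  type: registry-image
  tags: [framework:cuda]
  source: &cuda-image
    repository: quay.io/spearow/machine-learning-container-fedora-cuda
    username: ((username))
    password: ((password))
# ... and a second, otherwise identical one for the default (untagged) pool
- name: container-fedora-cuda-untagged
  type: registry-image
  source: *cuda-image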

drahnr commented Sep 7, 2022

Since there is now configurable task/resource affinity, the issue is partially mitigated. The core issue remains though: being able to schedule both regular tasks (read: without tag requirements) and tasks that do require tagged workers (read: those with special HW capabilities) on the same worker.
