RFC: k8s storage #77

Open · wants to merge 18 commits into master
Conversation

taylorsilva
Member

Rendered

This is meant for the next iteration of our Kubernetes runtime work; it has nothing to do with how the Concourse Helm chart currently functions.


# Summary

After spiking on a few solutions for storage on Kubernetes, our recommendation is to use an image registry to store **cache objects** for steps.
Member

@cirocosta cirocosta Oct 1, 2020


It might be interesting to check out the approach that kaniko took: it essentially caches layers in a container registry if you don't specify a local path (so, e.g., if you want to run a Job that runs kaniko but don't want to provide it a volume, which would imply concurrency=1).

```
$ k logs -f build-v1.18.6-4tn7w  | grep cache
INFO[0008] Checking for cached layer cirocosta/envtest-cache:8642ec4096d286832017c667e68548ac74436793bd49b1cf59d6a238044f6b0a...
INFO[0008] No cached layer found for cmd RUN set -x &&          apt update -y && apt install -y rsync &&                git clone https://github.com/kubernetes/kubernetes &&                 cd ./kubernetes &&                      git checkout $KUBE_TAG &&            go mod download &&                       make generated_files
INFO[0194] Checking for cached layer cirocosta/envtest-cache:21931b04c009b9e67db0c6c8a1f907ee699dd32ad3fef2aa14efaf6b1a5dc995...
INFO[0313] Found cached layer, extracting to filesystem
INFO[0348] Checking for cached layer cirocosta/envtest-cache:b862d7fa29205a55ec028679dc2bb416b707a1a29d24feb9714075e4be5a5c74...
INFO[0435] Found cached layer, extracting to filesystem
INFO[0447] Checking for cached layer cirocosta/envtest-cache:e6c3ee315be35a8483933a79682375de049bdef0560721ce9962d728993803d0...
INFO[0464] Found cached layer, extracting to filesystem
INFO[0470] Checking for cached layer cirocosta/envtest-cache:e78396635e5a8c3a4b750b11796109b2a6a954b628c6e24a3dfbc1f579ce7024...
INFO[0471] No cached layer found for cmd RUN go install -v ./cmd/envtest
INFO[0508] Pushing layer cirocosta/envtest-cache:e78396635e5a8c3a4b750b11796109b2a6a954b628c6e24a3dfbc1f579ce7024 to cache now
```

(https://github.com/GoogleContainerTools/kaniko/blob/master/pkg/cache/cache.go)

I'm not familiar with the details of the implementation, but it might serve as some inspiration 😁


buildkit seems to allow something similar: https://github.com/moby/buildkit/blob/master/README.md#export-cache
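For illustration, both tools drive this purely through flags; a rough sketch of the invocations (the `registry.example.com` repos are placeholders, not anything from this RFC):

```sh
# kaniko: cache RUN layers in a dedicated registry repo instead of a local volume
/kaniko/executor \
  --context=git://github.com/example/repo.git \
  --destination=registry.example.com/app:latest \
  --cache=true \
  --cache-repo=registry.example.com/app/cache

# buildkit: export the build cache to a registry and import it on the next build
buildctl build \
  --frontend dockerfile.v0 \
  --local context=. --local dockerfile=. \
  --output type=image,name=registry.example.com/app:latest,push=true \
  --export-cache type=registry,ref=registry.example.com/app/cache \
  --import-cache type=registry,ref=registry.example.com/app/cache
```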


Hey @cirocosta
We can see this style of local caching being beneficial for pulling the contents of an input for a task. Is that what you meant? We weren't sure how this might be useful for pushing 🤔

Member


Oh, I didn't have any specifics in mind; it was more about showcasing the pattern of using a registry as a place where you just throw layers regardless of what you're building, and then try to reuse those layers (pulling them if they exist) when running another build.

+ would have negligible overhead for steps scheduled on the same node as no input/output stream would be required

### Cons
- Without IaaS-based persistent disks this isn't a viable solution; K8s nodes would need to have large root volumes.

@ari-becker ari-becker Oct 3, 2020


Is this actually an issue? Under Persistent Volumes you note as a con:

IaaS-based volume limits per node prevent this from being a scalable solution

Large root volumes are closer to how Concourse workers work today (i.e. closer to the implicit foundational design assumptions in the Concourse domain that this RFC attempts to deal with), and if there are scalability issues with Persistent Volumes, then doesn't this represent a simpler way (because it builds on prior baggageclaim work) to get started with building the Kubernetes runtime?


Hey @ari-becker,
Yes, this is true. It would be useful to get perspectives such as yours for us to understand what constraints or lack thereof folks are dealing with.

We are assuming that the cons:

  • resizing of nodes with larger root volumes
  • running privileged baggageclaim pods
  • being ok with baggageclaim's current lack of security (this can be enhanced)

create larger challenges for operators.

It sounds like you'd be okay with this setup?


@xtreme-sameer-vohra currently, we already run Concourse workers on a Kubernetes cluster, one where we've architected our workers such that they are restricted to their own underlying nodes (we manually scale both the underlying nodes and the worker StatefulSet in tandem) and the worker pods are privileged. So all of your cons are already present in our deployment and therefore implicitly acceptable.

Maybe you need to clarify what the goals of developing a Kubernetes runtime for Concourse are supposed to be. Is it to phase out the "worker" in favor of ATC directly scheduling Kubernetes Jobs that represent Concourse jobs or steps? Because then running privileged baggageclaim pods is fine; the worker isn't really phased out, it's just simplified down to only the baggageclaim code, and presumably we'd be able to configure ATC to create Jobs with node restrictions / affinity / taints / tolerations to get the Jobs to run on a dedicated worker instance group. Is it the ability to run on common / shared instances? Because then running privileged baggageclaim pods is probably unacceptable. Should untrusted Concourse workloads run on common / shared instances in the first place, even if they're not allowed to be privileged? Ehhh, debatable, and a lot of security experts will tell you no, particularly if the jobs are not configured with PodSecurityPolicies and the like, which in a CI / development environment rarely makes sense.

The main value that we'd get out of a Kubernetes-native runtime would be the ability to set up Cluster Autoscaler so that Kubernetes will add additional nodes (and reduce nodes) as needed according to scheduling workloads, as currently we need to scale workers by hand. Cluster Autoscaler works well with workloads that are scheduled to dedicated instance groups, adding and subtracting nodes from the dedicated instance group that will permit Jobs that are restricted to those instance groups.


Ah yes, we are working on the K8s Runtime RFC that should have preceded this one 😸
The primary objective is to leverage the scaling & operational capabilities of K8s by deploying steps as K8s pods without changing the pipeline schema; the latter would allow pipelines to be portable across runtimes. Along the same lines, we're also assessing:

  • the minimum set of permissions required on K8s to fulfill the same requirements as a current worker (container & storage management)
  • the number of dependencies (besides Postgres) for a Concourse deployment on K8s

As you mentioned, container management would be delegated to K8s. We are spiking on using IaaS PVs + propagating mounts using CSI, hoping to ultimately leverage baggageclaim. We are early in the exploration, but this path might also provide the ability to use volume location for scheduling purposes 💭

The initial draft of the proposal has a worker being mapped to a K8s cluster + namespace. "Instance groups" sounds interesting. Do you see it fitting in naturally with the cluster+namespace mapping, or would you propose something different?


@xtreme-sameer-vohra One of the things that worries me about using an external image registry is the prospect of cross-availability-zone (on AWS, $0.01/GB) and/or VPC egress traffic costs (on AWS, much higher). Currently, all of our workers are in the same availability zone (for CI workers, we're less concerned with the prospect of an AZ-wide outage and more concerned with the traffic costs) and so our traffic costs are much lower.

Effectively using an external image registry while keeping costs under control would mean a) deploying a separate image registry per availability zone alongside the workers, where workers are correctly configured to use the image registry in their availability zone, and b) having the image registry itself use a local disk / IaaS PV (so as not to incur traffic costs when storing layers in S3, for example), so really all the image registry does is consolidate storage usage in one place. This eventually leads to cost savings compared to using a large local disk / IaaS PV per worker, with the tradeoff of additional operational complexity.
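As a hypothetical example of point (b), the stock distribution registry can be pointed at a local disk or PV-backed path rather than S3 just by its storage driver config (the paths here are placeholders):

```sh
# registry config backed by a mounted local disk / IaaS PV instead of S3
cat > /etc/docker/registry/config.yml <<'EOF'
version: 0.1
storage:
  filesystem:
    rootdirectory: /var/lib/registry   # mount a local disk or PV here
http:
  addr: :5000
EOF
```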

When talking about instance groups, I'm using the Kops nomenclature. An instance group maps to an AWS AutoscalingGroup, where the nodes are of a certain instance type (i.e. CI is generally best-served with burstable nodes, which on AWS are t3 nodes, whereas t3 nodes may not make sense for other workloads). In order to schedule workers on a specific ASG that we set up for them, we use a combination of node affinity to direct Kubernetes to schedule the CI workloads (currently, the worker pods) to those nodes, and taints and tolerations to not allow any other workloads to schedule there. This is orthogonal to Kubernetes namespacing, which is a good idea anyway to improve logical isolation between individual CI workloads.
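A minimal sketch of that pattern, assuming a node label `instance-group=ci-workers` and a `dedicated=ci:NoSchedule` taint on the CI instance group (all names here are placeholders):

```sh
# keep non-CI workloads off the CI nodes
kubectl taint nodes -l instance-group=ci-workers dedicated=ci:NoSchedule

# a CI pod selects those nodes and tolerates the taint
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: ci-step
  namespace: ci
spec:
  nodeSelector:
    instance-group: ci-workers
  tolerations:
    - key: dedicated
      operator: Equal
      value: ci
      effect: NoSchedule
  containers:
    - name: step
      image: busybox
      command: ["sh", "-c", "echo hello from a CI node"]
EOF
```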


# Proposal

**TL;DR**: We recommend going with the image registry option because it satisfies all the requirements and gives us a bunch of options to improve performance when compared to the blobstore option. It also provides a very flexible solution that works across multiple runtime workers. [See Details](#image-registry-to-store-artifacts)
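As one rough illustration of what storing a cache object in a registry could look like from the outside (not necessarily how Concourse would implement it), a step's output directory can be packed as a single layer and pushed/pulled with a tool like `crane`; the repository name and tag below are hypothetical:

```sh
# push: pack a step's output directory as a one-layer image in a cache repo
tar -cf output.tar -C ./step-output .
crane append \
  --new_layer output.tar \
  --new_tag registry.example.com/concourse-cache/my-pipeline:cache-key

# pull: flatten the cached image back into a directory for a later step
crane export registry.example.com/concourse-cache/my-pipeline:cache-key - \
  | tar -x -C ./step-input
```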


Trade-study: push/pull from registry using

  1. init & sidecar containers (a rough sketch of this option follows below)
  2. CSI with an independent volume/input (e.g., image populator)
  3. a custom image per combination of step for pulling, and TODO for pushing
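A rough shape of option 1, with an init container pulling cached inputs and a sidecar pushing outputs after the step; the fetcher/pusher images and refs are hypothetical, not part of this RFC:

```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: step-with-registry-cache
spec:
  volumes:
    - name: workdir
      emptyDir: {}
  initContainers:
    # hypothetical helper that pulls the cached input image and unpacks it
    - name: fetch-inputs
      image: registry.example.com/cache-helper:latest
      args: ["pull", "registry.example.com/cache/some-input:key", "/work/inputs"]
      volumeMounts:
        - name: workdir
          mountPath: /work
  containers:
    - name: step
      image: registry.example.com/task-image:latest
      workingDir: /work
      volumeMounts:
        - name: workdir
          mountPath: /work
    # hypothetical sidecar that pushes /work/outputs back to the registry
    - name: push-outputs
      image: registry.example.com/cache-helper:latest
      args: ["push", "/work/outputs", "registry.example.com/cache/some-output:key"]
      volumeMounts:
        - name: workdir
          mountPath: /work
EOF
```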

- Without IaaS-based persistent disks this isn't a viable solution; K8s nodes would need to have large root volumes.
- Wouldn't have support for hosting images by default. However, `baggageclaim` could be extended to add the APIs.
- `baggageclaim` itself doesn't have any authentication/authorization or transport security (HTTPS) mechanisms built into it.


Suggested change
+ - `baggageclaim` needs to be run in a privileged container. It would be useful to understand operators' perspective on this constraint. Is this tenable?

Member

@cirocosta cirocosta Oct 21, 2020


As far as I know, TKG clusters created via TMC expect pods to adhere to a pod security policy that's quite restrictive.

But I'd imagine someone installing a CSI plugin (like anything network-related) would expect to grant higher privileges anyway (it being an "infra component"), though not to just any container (i.e., the containers that run the workloads, like tasks, etc.).


Yep, this does make the assumption that installing Concourse requires more than just pod* capabilities.
However, installing a CSI driver would be an operator action, similar to installing Concourse today.

@taylorsilva and I were also discussing that this approach wouldn't preclude us from using IaaS PVs directly for simpler installations where the user doesn't have privileges to install a CSI driver. This would have performance implications (the overhead of creating/destroying volumes and the per-node volume limit), but it would be a very simple solution 🤔
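For that simpler fallback, each step would just claim a dynamically provisioned IaaS volume through a standard StorageClass; a minimal sketch (the class name is a placeholder):

```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: step-workdir
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard   # IaaS-backed class; placeholder
  resources:
    requests:
      storage: 10Gi
EOF
```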


### Cons
- Wouldn't have support for hosting images by default.
- IaaS-based [volume limits per node](https://kubernetes.io/docs/concepts/storage/storage-limits/#dynamic-volume-limits) prevent this from being a scalable solution
Member


Not super related, but I found this idea interesting: argoproj/argo-workflows#4130

@xtreme-sameer-vohra

TODO - Document the permissions required to set up the baggageclaim CSI + StorageClass
