6.1.0 workers' download speed becomes EXTREMELY slow after a few builds of one job #5754

fg-j · 2020-06-11T19:12:44Z

Hello Concourse folks,
After upgrading to 6.1, we started noticing issues with container creation on our Windows and Linux vSphere workers. We believed the issue was resolved when a change to the task run params led to the next several builds going green. However, it has come back after some delay, even after deleting and recreating the deployment.

We have seen these issues on a variety of tasks run on these workers. The issues manifest in two ways:
1. Builds are slow because of slow download speed onto the worker.
  - For example, downloading a ~4 GB artifact onto the worker normally takes 5 minutes but suddenly starts taking ~1 hour.
    - The integration task of this build takes ~2 hr, whereas the same task of this build from before the upgrade takes ~1 hr.
  - If we bosh ssh onto an affected worker and download something using curl, we see download speeds of about 1.1 MiB/s.
  - Initialization of certain tasks also takes much longer. For example, a task that used to take 10 seconds to initialize now takes ~2 minutes.
2. Volumes fail to mount in task containers with Put "/volumes/{volume-id}/stream-in?path=.": net/http: timeout awaiting response headers, which fails the job.
  - The failure is shown in the contract-test task in this job.
Things we tried:
- We noticed that after deleting and redeploying the workers, the job completed normally (no volume mounting or slowdown issue). However, after a few builds of this job, the workers started to show one of those two issues again.
- bosh recreate did not fix this issue
- We do not believe anything has changed about our vSphere infrastructure.
- Running bosh vms --vitals did not reveal anything abnormal:

Deployment 'vsphere-linux-worker'

Instance                                     Process State  AZ  IPs         VM CID                                   VM Type  Active  Stemcell  VM Created At                 Uptime        Load              CPU    CPU   CPU   CPU   Memory       Swap      System      Ephemeral   Persistent
                                                                                                                                                                                            (1m, 5m, 15m)     Total  User  Sys   Wait  Usage        Usage     Disk Usage  Disk Usage  Disk Usage
worker/05eb84dd-1b4b-4e69-a0bf-7bfbd47316b8  running        z1  10.74.35.8  vm-514b56ac-fe4c-4b4e-86be-9b1c464f6d04  worker   true    -         Thu Jun 11 18:21:24 UTC 2020  0d 0h 8m 2s   0.00, 0.06, 0.05  -      0.0%  0.0%  0.0%  2% (644 MB)  0% (0 B)  51% (34i%)  0% (0i%)    -
worker/b7bc6eec-d362-4f6d-816e-1c736e221114  running        z1  10.74.35.7  vm-bf1134e5-9ef5-4022-84d3-3621d97e2496  worker   true    -         Thu Jun 11 18:27:23 UTC 2020  0d 0h 2m 11s  0.45, 0.16, 0.06  -      0.1%  0.2%  3.0%  2% (648 MB)  0% (0 B)  51% (34i%)  0% (0i%)    -

Could we get some suggestions of things to try or look for to diagnose and resolve our issue?

jamieklassen · 2020-06-11T20:34:17Z

Collaborator

Preliminaries:

- To help understand the impact of the upgrade, what version were you running before 6.1.0?
- Thank you for diligently sharing links. Unfortunately, those jobs are not public: true, so I can't see the logs when unauthenticated, and my GitHub user has no team memberships.

These things point to slow volume i/o (i.e. slow filesystem):

- slow download speed inside a container but fast curl on the host
- slow task initialization (wiring up the input volumes before execution)
- PUT to /volumes/:handle/stream-in on baggageclaim timing out

Do you have any kind of system probe that can tell whether disk i/o has dropped? Tools like telegraf and the datadog agent measure this pretty nicely, but I don't think bosh does out of the box. If you don't, then the only way to get a scientific measurement would be to downgrade, install such a probe, establish a baseline, and then re-upgrade. And that will only be conclusive if the issue is actually correlated with the upgrade.
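If no monitoring agent is available, a rough baseline can still be taken by hand. The following is only a minimal sketch, not a replacement for a real probe: it times a sequential write followed by an fsync into a directory of your choosing and reports MiB/s. The function name `measure_write_throughput` and its parameters are made up for illustration, and a single sequential write says nothing about random i/o or read performance.

```python
import os
import tempfile
import time


def measure_write_throughput(directory, size_mib=64, block_kib=1024):
    """Write roughly size_mib of data into directory and return MiB/s.

    Hypothetical helper for a quick, manual baseline only.
    """
    block = os.urandom(block_kib * 1024)
    blocks = max(1, (size_mib * 1024) // block_kib)
    written_mib = blocks * block_kib / 1024

    # Create the scratch file inside the directory under test so the
    # measurement hits that filesystem rather than the default temp dir.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            # fsync forces the data to disk; without it the page cache
            # can make a slow disk look fast.
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return written_mib / elapsed
```

Calling `measure_write_throughput("/path/on/the/worker/disk")` on a freshly deployed worker and again on a worker that has started showing the slowdown would give two numbers to compare. The path is a placeholder; pick a directory on the same disk that holds the worker's volumes.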

This all reminds me a bit of a customer incident where a vSphere stemcell got upgraded and the baggageclaim driver, through automated detection, got switched from overlay to btrfs. The resemblance is especially strong because the issue takes a little while to surface in a fresh VM. You could kill this wild speculation by determining exactly what the baggageclaim volume driver is.
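One indirect way to get a hint about the driver is to look at which filesystem backs the volumes directory on the worker. The sketch below assumes a Linux worker and parses text in the format of /proc/mounts; the helper name `filesystem_for` and the example path are assumptions for illustration, not something confirmed for this deployment. A btrfs mount under the volumes directory would be consistent with the btrfs driver, but the worker's own configuration and startup logs remain the more reliable source.

```python
import os


def filesystem_for(path, mounts_text):
    """Return (mount_point, fs_type) for the mount that contains path.

    mounts_text is expected in /proc/mounts format:
    device mount_point fs_type options dump pass
    Returns None if no mount matches.
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest mount point; on a tie, the later entry
            # wins because a later mount shadows an earlier one.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


def filesystem_for_live_path(path):
    """Convenience wrapper that reads the real /proc/mounts (Linux only)."""
    with open("/proc/mounts") as f:
        return filesystem_for(path, f.read())
```

On a worker this could be run as `filesystem_for_live_path("/var/vcap/data/worker/work/volumes")`, where the path is a guess at where a BOSH-deployed worker keeps its volumes and should be replaced with the actual work directory.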

EDIT: in general, there may also be some errors in the worker logs, particularly those from baggageclaim.

jamieklassen · 2020-06-11T20:48:25Z

Collaborator

I'm also reminded of #5298, but I have no idea how that helps.
