GH Artifact lookup times out #482

Open

mydea opened this issue Jul 31, 2023 · 23 comments

Comments

@mydea
Member

mydea commented Jul 31, 2023

It happens repeatedly for us that publishes time out when looking for artifacts uploaded to github. One example: https://github.com/getsentry/publish/actions/runs/5715907847/job/15486244767

This happens basically on every sentry-javascript release if we approve it while the checks are still running, so it seems to me something is not right with the waiting. Maybe we need to extend ARTIFACTS_POLLING_INTERVAL (currently 10s), or increase MAX_TRIES (currently 3)?

@asottile-sentry
Member

from the logs it looks like craft sees your CI as successful before the artifact upload job has started -- once it sees success it moves on to trying to locate artifacts from the successful CI

you can configure the necessary jobs here

@mydea
Member Author

mydea commented Jul 31, 2023

Ah, I see, so basically this is completely wrong the way it is set up currently 😅

So I need to add this https://github.com/getsentry/sentry-javascript/actions/runs/5715683859/job/15486522728 as the required context, so:

statusProvider:
  name: github
  config:
    contexts:
      - All required tests passed or skipped

@Lms24
Member

Lms24 commented Aug 3, 2023

@asottile-sentry can you clarify what contexts should be set to? We tried the name property of the job, but that's not it; craft tells us it can't find the context.

For reference, here's the job we want to wait on:
https://github.com/getsentry/sentry-javascript/blob/754b08279c171ab730224d2432e2e89327fc5932/.github/workflows/build.yml#L858-L873
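
Presumably, the values contexts can match are the check run names and legacy status contexts that GitHub reports for the release commit, so listing those for a recent release SHA should show the exact strings to use. Here's a rough sketch with @octokit/rest (the owner, repo, and SHA below are examples taken from this thread; this is not how craft itself resolves contexts):

// List what GitHub reports for a commit: check run names (Actions, apps) and
// legacy commit status contexts. These are the candidate values for `contexts`.
import { Octokit } from '@octokit/rest';

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const owner = 'getsentry';
const repo = 'sentry-javascript';
const ref = '583e0bd53968b56d0f63c132c32b989dbcb999cb'; // example release commit SHA

async function listContexts(): Promise<void> {
  const checks = await octokit.rest.checks.listForRef({ owner, repo, ref });
  console.log('check run names:', checks.data.check_runs.map(run => run.name));

  const combined = await octokit.rest.repos.getCombinedStatusForRef({ owner, repo, ref });
  console.log('legacy status contexts:', combined.data.statuses.map(s => s.context));
}

listContexts().catch(console.error);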

@Lms24
Member

Lms24 commented Aug 3, 2023

Or is it enough to just set this, without any contexts?

statusProvider:
  name: github

@asottile-sentry
Member

I am not sure -- I'm just going by the docs :S

Lms24 added a commit to getsentry/sentry-javascript that referenced this issue Aug 3, 2023
…8685) (#8729)

Seems like the `context` passed isn't found which is blocking our
release. Reverting for now until we know how to configure the status
provider correctly (getsentry/craft#482).

#uncraft
@BYK
Contributor

BYK commented Aug 3, 2023

@Lms24 newest versions of Craft use github as the default status provider. Let me look into your runs.

@BYK
Contributor

BYK commented Aug 3, 2023

Very first step would be to change this line: https://github.com/getsentry/sentry-javascript/blob/99e347907812873cc0c832aa25968c3db6587af1/.craft.yml#L1

To say 1.4.2 instead of 0.23.1. I don't remember when we changed the defaults for the context, but the more recent the version, the better.

Moreover, the issue you originally reported doesn't seem to be related to contexts. I'll dig a bit more and report back here.

@BYK
Contributor

BYK commented Aug 3, 2023

If you look at the logs starting from here: https://github.com/getsentry/publish/actions/runs/5715907847/job/15486244767#step:10:80

It actually just waits for all the checks to pass. The contexts parameter is for giving it an explicit list of contexts to wait for (so you don't have to wait for all of them).

@BYK
Contributor

BYK commented Aug 3, 2023

So here, in the artifact upload job for the failed publish, it says it successfully uploaded the artifacts at 23 minutes past the hour: https://github.com/getsentry/sentry-javascript/actions/runs/5715683859/job/15485977459#step:6:970

The publish job fails at 27 minutes past the hour: https://github.com/getsentry/publish/actions/runs/5715907847/job/15486244767#step:10:133

That means that, for some reason, the GitHub API did not make the uploaded artifacts available for about 4 minutes. That's quite long. It may be because the artifacts seem quite large.

I googled a bit to see if there's any mention of an expected delay before artifacts become available, but failed to find one. You may wanna raise this issue with GitHub support for investigation. In the meantime, either increasing the number of tries or extending the delay, as you originally suggested, makes sense as a workaround.
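
As a rough sketch, that workaround boils down to tuning two values in a polling loop like the one below. The constants mirror the ARTIFACTS_POLLING_INTERVAL and MAX_TRIES values mentioned at the top of the issue; fetchArtifacts is a placeholder, not craft's actual API:

// Sketch of the retry behaviour being discussed: poll for artifacts, waiting
// ARTIFACTS_POLLING_INTERVAL between attempts, for at most MAX_TRIES attempts.
// `fetchArtifacts` is a placeholder, not craft's real artifact lookup.
const ARTIFACTS_POLLING_INTERVAL = 10_000; // 10s today; could be extended
const MAX_TRIES = 3; // could be increased

const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));

async function waitForArtifacts(fetchArtifacts: () => Promise<string[]>): Promise<string[]> {
  for (let attempt = 1; attempt <= MAX_TRIES; attempt++) {
    const artifacts = await fetchArtifacts();
    if (artifacts.length > 0) {
      return artifacts;
    }
    if (attempt < MAX_TRIES) {
      await sleep(ARTIFACTS_POLLING_INTERVAL);
    }
  }
  throw new Error(`No artifacts found after ${MAX_TRIES} tries`);
}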

@Lms24
Member

Lms24 commented Aug 3, 2023

@BYK thanks for looking into this!

To say 1.4.2 instead of 0.23.1. I don't remember when we change the defaults for the context but the more recent version the better.

I'm confused now. I thought we just always use the latest craft version when starting a release via getsentry/publish? We were only recently blocked by a bug in the latest craft version. So why would it take an older version just for the status provider?

It actually just waits for all the checks to pass. The context parameter is to give it an explicit list of contexts to pass (so you don't have to wait for all).

Ok, so this, plus your last reply, plus the fact that GH is already the default status provider, suggests to me that our config doesn't need adjustments and that it's a GH problem 🤔 The weird thing here is that our initial publish attempts fail often. Really often. I'm wondering how much of that is due to test flakes vs. this artifact retrieval problem.

In the meantime, either increasing the number of tries or extending the delay as you originally suggested makes sense for a workaround.

Still seems like a good idea to me.

@BYK
Contributor

BYK commented Aug 3, 2023

My pleasure!

I'm confused now. I thought we just always use the latest craft version when starting a release via getsentry/publish? We only recently were blocked by a bug in the latest craft version. So why would it take an older version just for the status provider?

Sorry for the confusion I caused. Some years ago, we did not default to GitHub as the status provider. That changed around version 0.21.0, and we added a check: if the config file mentioned a version earlier than 0.21.0, don't default to anything; for anything else, use GitHub. This file was already mentioning 0.23.1, which is past that version, so there was nothing to change or worry about. Moreover, that epoch check was removed from the code a long time ago too 😅

Ok, so this, plus your last reply, plus the fact that GH is already the default status provider, suggests to me that our config doesn't need adjustments but it's a GH problem 🤔

Exactly! If you had an issue with your config, you'd have known about it a very long time ago. .craft.yml does not seem to get many updates.

The weird thing here is that our initial publish attempts fail often. Really often. I'm wondering how often that's overall because of test flakes vs. this artifact retrieval problem.

That is very easy to check and quantify: just go through the logs and see why they fail. The one you linked was due to GitHub not making artifacts available quickly enough. If this is frequent, I'd blame that, since if your tests fail, I think you also get a separate failure email for the failed workflow on the release branch.

@Lms24
Member

Lms24 commented Aug 3, 2023

The same thing happened again. Opened #484 to fix this. I spent too much time this week trying to fix releases. This needs to be improved.

@BYK
Contributor

BYK commented Aug 3, 2023

The artifacts however are available here: https://github.com/getsentry/sentry-javascript/actions/runs/5752023602

I'd definitely raise this with GitHub Support. Again, I suspect it may have something to do with the number of files or the size of the unzipped artifacts.

@Lms24
Member

Lms24 commented Aug 7, 2023

Hmm, I'm not sure if we're missing something here. I think there might still be some problem in the status provider. Looking at the timestamps of the run I linked, we started downloading artifacts before the action had even completed.

Shouldn't the status provider in Craft (by default) only start downloading artifacts once the entire action has completed? I have a feeling that the artifacts might only be available once the action has completed. Does this make sense?

@BYK
Contributor

BYK commented Aug 7, 2023

Shouldn't the status provider in Craft only start downloading artifacts once the entire action completed (by default)? I have a feeling that the artifacts might only be available once the action completed. Does this make sense?

Aha, that makes sense! Then I guess the step that calculates the combined status of the commit may have some issues. That part was always a bit tricky. The relevant code is here:

  } else {
    logger.debug(
      'No config provided for GitHub status provider, calculating the combined status...'
    );
    let commitApiStatusResult;
    if (
      revisionStatus.total_count === 0 &&
      revisionStatus.state === 'pending'
    ) {
      // Edge case, this is what GitHub returns when there are no registered legacy checks.
      logger.debug('No legacy checks detected, checking for runs...');
      const hasPendingActiveSuites = revisionCheckSuites.check_suites.some(
        suite =>
          // Need the any cast as octokit lacks this prop in its types
          (suite as any).latest_check_runs_count > 0 &&
          suite.status === 'queued'
      );
      if (revisionChecks.total_count > 0) {
        logger.debug('Check runs exist, continuing...');
        commitApiStatusResult = CommitStatus.SUCCESS;
      } else if (hasPendingActiveSuites) {
        logger.debug('Pending check suites exist, continuing...');
        commitApiStatusResult = CommitStatus.PENDING;
      } else {
        logger.warn('No valid build contexts detected, did any checks run?');
        commitApiStatusResult = CommitStatus.NOT_FOUND;
      }

We may wanna do some digging into the API and modernize that code, especially due to the "legacy checks" part.
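
As one possible direction, here's a sketch that derives a combined status from the check runs API alone, skipping the legacy status logic entirely. This is only an illustration of the idea, not craft's current or planned implementation; CommitStatus mirrors the enum used in the excerpt above:

// Sketch: compute a combined status from check runs only, instead of mixing
// legacy statuses and check suites. Not craft's actual implementation.
import { Octokit } from '@octokit/rest';

enum CommitStatus { SUCCESS, PENDING, FAILURE, NOT_FOUND }

async function combinedCheckRunStatus(
  octokit: Octokit,
  owner: string,
  repo: string,
  ref: string
): Promise<CommitStatus> {
  const { data } = await octokit.rest.checks.listForRef({ owner, repo, ref, per_page: 100 });
  if (data.total_count === 0) {
    return CommitStatus.NOT_FOUND;
  }
  if (data.check_runs.some(run => run.status !== 'completed')) {
    return CommitStatus.PENDING;
  }
  const allGood = data.check_runs.every(
    run =>
      run.conclusion === 'success' ||
      run.conclusion === 'skipped' ||
      run.conclusion === 'neutral'
  );
  return allGood ? CommitStatus.SUCCESS : CommitStatus.FAILURE;
}

Note that this still can't tell whether more check runs are about to be registered, which is the race described earlier, so a gate job or an explicit contexts list would still be needed.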

@asottile-sentry
Member

@Lms24 it seems that in that case a job would need to be added to your workflow which requires all the others to finish -- then you'd use that as the indicator for craft's status
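
A rough sketch of what such a gate job could look like in the workflow YAML; all job ids and names here are made up, and the needs list would have to cover every required job, including the artifact upload:

# Hypothetical gate job: only turns green once every job it needs has finished.
required_jobs_passed:
  name: All required jobs passed or skipped
  needs: [job_build, job_lint, job_unit_test, job_artifacts] # made-up job ids
  if: always()
  runs-on: ubuntu-latest
  steps:
    - name: Fail if any required job failed or was cancelled
      if: contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled')
      run: exit 1

The if: always() on the job is what keeps the gate from being skipped when a dependency fails, so it reports a failure instead of silently disappearing.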

@Lms24
Member

Lms24 commented Aug 7, 2023

I guess it's worth trying as long as sentry-javascript is the only affected repo. But it means we're adapting to broken behaviour.

@asottile-sentry
Member

I think what's happening is that there's a point where everything has passed (all statuses green, none pending) and craft can't know that there are more things to run

Lms24 added a commit to getsentry/sentry-javascript that referenced this issue Aug 8, 2023
… and other required jobs (#8751)

Adjust our final "Required Tests Passed" CI job to not only
depend on required _test_ jobs but on _all_ required jobs. Specifically,
this adds

* Lint
* Circular deps check
* Most importantly: Upload Artifacts

With this change, we can configure craft's status provider to
specifically wait for this job which should fix the artifacts download
timeout (getsentry/craft#482). It's worth
noting that this PR will only fix this timeout in our repo and in a way
we're adjusting to the status provider check. However, given that we're
apparently the only repo where this happens, it's probably justified and
it makes things more explicit.
@Lms24
Member

Lms24 commented Aug 8, 2023

Soo, I just tried to set

statusProvider:
  name: github
  config:
    contexts:
      - job_required_jobs_passed

but the release failed again.

I'm going to try jobs.job_required_jobs_passed next, which should be a valid GH context, but I have no idea if this is even working correctly.

When I query the API endpoint manually with

gh api \
  -H "Accept: application/vnd.github+json" \
  -H "X-GitHub-Api-Version: 2022-11-28" \
  /repos/getsentry/sentry-javascript/commits/583e0bd53968b56d0f63c132c32b989dbcb999cb/status

I get an empty array for statuses which should contain the context we're waiting on.

{
  "state": "pending",
  "statuses": [],
  // ...
}

Not sure if this endpoint even works correctly. For instance, it always returns pending and this was already reported here.

@Lms24
Member

Lms24 commented Aug 8, 2023

Update: I tried jobs.job_required_jobs_passed in dry-run mode and it didn't work either. Not sure what's happening here, but if I had to guess, that endpoint seems broken.

@Lms24
Member

Lms24 commented Aug 8, 2023

Given that there's no way to make this work with our current status checks:

I'm wondering if we should use the Actions endpoint instead of the commit-based endpoints: https://docs.github.com/en/rest/actions/workflow-jobs?apiVersion=2022-11-28#get-a-job-for-a-workflow-run

Seems like this is what we need to poll the status of a specific job. What's left to figure out (see the sketch after this list) is

  • how to get the run id
  • how to not break everyone by changing stuff around 🙃
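
On the first point, here's a rough sketch of how the run id could be looked up from the release commit SHA and a specific job polled via the Actions API with @octokit/rest. The workflow and job names are examples only, and this is not something craft does today:

// Sketch: find the workflow run for a commit via the Actions API, then look up
// a specific job in it. Workflow/job names below are examples, not craft config.
import { Octokit } from '@octokit/rest';

async function getJobStatus(
  octokit: Octokit,
  owner: string,
  repo: string,
  headSha: string,
  jobName: string
): Promise<string | undefined> {
  // "List workflow runs for a repository" accepts a head_sha filter, which
  // answers the "how to get the run id" question.
  const runs = await octokit.rest.actions.listWorkflowRunsForRepo({
    owner,
    repo,
    head_sha: headSha,
  });
  const run = runs.data.workflow_runs.find(r => r.name === 'Build & Test'); // example workflow name
  if (!run) return undefined;

  const jobs = await octokit.rest.actions.listJobsForWorkflowRun({ owner, repo, run_id: run.id });
  const job = jobs.data.jobs.find(j => j.name === jobName);
  return job && `${job.status} / ${job.conclusion ?? 'pending'}`;
}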

@BYK
Contributor

BYK commented Aug 8, 2023

The Actions endpoint shouldn't really be used here, as it's specific to GitHub Actions, whereas the GitHub status provider (and GitHub status checks) is independent from GitHub Actions.

You may wanna try graphql endpoints.

@Lms24
Member

Lms24 commented Aug 9, 2023

Alright folks, I'm done with trying to fix this. Given that there's a somewhat working but annoying workaround (namely, telling our managers to wait ~30 min before adding the accept label in getsentry/publish), I can't justify working on this any longer. My gut feeling tells me that we can't be the only repo affected by this, but what can I do...

To whoever picks this up: if we can't easily change the current GitHub status provider, maybe the path of least friction/resistance is to create a second GitHub (Actions) status provider that uses working endpoints. Individual repos could opt in to this provider instead, and we wouldn't risk breaking existing publishing configs that rely on the older endpoints (if there are any).
