Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

terraform validate: Force rerun t init when plugin cache is enabled and parrallelism conflit happens #640

Open
maxbrunet opened this issue Mar 6, 2024 · 8 comments
Labels
bug_with_workaround Something isn't working but there is a workaround documentation Improvements or additions to documentation estimate/1day Need 1 work day to be done hook/terraform_providers_lock Bash hook hook/terraform_validate Bash hook

Comments

@maxbrunet
Copy link
Contributor

What problem are you facing?

#620 introduces parallelism, this can lead to data races when the plugin cache is enabled, with terraform init writing in parallel to the cache, and creating corrupted providers (especially visible with the lock file where random provider hashes get appended).

This can be worked around by setting the --hook-config=--parallelism-limit=1 arg on terraform_validate hook.

How could pre-commit-terraform help solve your problem?

It would be nice if the hooks using terraform init (mostly terraform_validate, but terragrunt_providers_lock has it deprecated too) could disable parallelism automatically when the TF_PLUGIN_CACHE_DIR environment variable is set.

Not sure if handling the case where plugin_cache_dir is set in the config file would be possible.

@maxbrunet maxbrunet added the feature New feature or request label Mar 6, 2024
@yermulnik
Copy link
Collaborator

I reckon @MaxymVlasov spent more than decent amount of time trying to implement locking for the TF init with plugin cache enabled and, since it was not quite implementable straightforwardly, the outcome was this code snippet: https://github.com/antonbabenko/pre-commit-terraform/pull/620/files#diff-a9db869be8201236cd6abaed19087fc571d29c10d679fc66bf945eba1a4ccd46R461-R483 (long story short: TF locks destination file and, if parallel TF init encounters error writing plugin, it will simply re-try)

If you're hitting any issues with this implementation, please add details so that we can investigate.

Not sure if handling the case where plugin_cache_dir is set in the config file would be possible.

That's a good shout by the way. We missed to check TF config for this parameter and relied upon env var only as far as I can see form the code snippet I mentioned above.
Something simple like grep ^plugin_cache_dir path/to/config/file might work w/o a fuss, though the config file location would need some if/else scaffold: https://developer.hashicorp.com/terraform/cli/config/config-file#locations — standard non-windows location of ~/.terraformrc can be overridden with TF_CLI_CONFIG_FILE env var.

@MaxymVlasov What do you think of adding logic for the plugin_cache_dir rc-file parameter handling (we only need to detect whether it is in config file) or you did not relied upon it in #620 on purpose?

@maxbrunet
Copy link
Contributor Author

Hey @yermulnik, thanks for the quick reply

long story short: TF locks destination file and, if parallel TF init encounters error writing plugin, it will simply re-try

terraform init does not fail writing to it, it writes corrupted providers and the lock files is updated with the hashes of these corrupted providers

Terraform validate.......................................................Failed
- hook id: terraform_validate
- exit code: 1
- files were modified by this hook

Command 'terraform init' successfully done: somedirectory
Validation failed: somedirectory
╷
│ Error: registry.terraform.io/hashicorp/google: the cached package for registry.terraform.io/hashicorp/google 4.84.0 (in .terraform/providers) does not match any of the checksums recorded in the dependency lock file
│ 
│ 
╵
Error: Terraform exited with code 1.


Command 'terraform init' successfully done: someotherdirectory

Lock terraform provider versions.........................................Passed
pre-commit hook(s) made changes.
If you are seeing this message in CI, reproduce locally with: `pre-commit run --all-files`.
To run `pre-commit` as part of git workflow, use `pre-commit install`.
All changes made by hooks:
diff --git a/somedirectory/.terraform.lock.hcl b/somedirectory/.terraform.lock.hcl
index bcd3d4ff..09cb469e 100644
--- a/somedirectory/.terraform.lock.hcl
+++ b/somedirectory/.terraform.lock.hcl
@@ -10,6 +10,7 @@ provider "registry.terraform.io/hashicorp/google" {
     "h1:3qXFXgT2RAseuYW20OUqBLMeSuKt5tbpNI15GU5J7KE=",
     "h1:FtmujB+frf40ztU300FEd/cnah5o/ZVItCTYk+Qj1Wo=",
     "h1:PhC05RRIJi2iPLf9ny2JsYxVlhdC4LvXsuPfg4U/Pbg=",
+    "h1:RhnrI2jZIB6BmqGeP+lNXY5UqRKZp8QcuTz5CG3zPOo=",
     "h1:W9PU6CJQSb0p6pL+OonuWrCW2HJITiIykQkrwCAgYnw=",
     "h1:c4ksHVh0kRBFDvP3peYx99oZgh/afzOGxPJwdApcKBU=",
     "h1:fybaK74buTd4Ys2CUZm6jw7NXtSqtcLoW2jeNB4Ff2E=",

Btw if the automation is too complicated to implement, that's totally ok with me to close as wontfix. I mainly wanted to call out the problem, I think the minimum would be to document in the README that if the plugin cache is enabled then parallelism should be disabled for terraform_validate, it could take a while for users to figure out the issue

@yermulnik
Copy link
Collaborator

That's odd =( I might mixed up the bits about TF locking that I tried to recall from when Max was working on #620. When Max has a chance, he will add insights from his experience in #620 I guess.
Thanks for calling this out though. That's valuable.

I think the minimum would be to document in the README that if the plugin cache is enabled then parallelism should be disabled for terraform_validate, it could take a while for users to figure out the issue

Meanwhile would you be keen to contribute by outlining this in README? 🤞🏻

@MaxymVlasov
Copy link
Collaborator

Okay, let's assume that terraform_validate failed during t init, wrote corrupted SHAs and then terraform_providers_lock passed

  1. You still not able to commit that, as one of hooks failed (terraform_validate)
  2. During rerun terraform_validate will as long as I understand that you need working provider binaies to validate tf code
  3. So you'll reinit broken provider till terraform_validate will be happy => till terraform_providers_lock generates right lock file.

At the same time if you check README for terraform_validate p.3, you'll find that recommended solution is

- id: terraform_validate
  args:
    - --hook-config=--retry-once-with-cleanup=true     # Boolean. true or false

Also, README for terraform_providers_lock in p.1 describe supported modes, and each example includes - --hook-config=--retry-once-with-cleanup=true

@MaxymVlasov MaxymVlasov closed this as not planned Won't fix, can't repro, duplicate, stale Mar 7, 2024
@MaxymVlasov MaxymVlasov added question hook/terraform_validate Bash hook hook/terraform_providers_lock Bash hook and removed feature New feature or request labels Mar 7, 2024
@MaxymVlasov
Copy link
Collaborator

MaxymVlasov commented Mar 7, 2024

If you think that such docs are confusing or not enough and you have ideas in which places there should be some extra mentions of that - please let us know. Ideally, via PR.

@MaxymVlasov MaxymVlasov added the documentation Improvements or additions to documentation label Mar 7, 2024
@maxbrunet
Copy link
Contributor Author

Thank you @MaxymVlasov for the detailed reply

At the same time if you check README for terraform_validate p.3, you'll find that recommended solution is

Since --retry-once-with-cleanup only retries once, the races would still happen and be visible, so I feel like the safest is to recommend disabling parallelism when cache is enabled. Here is my suggestion #642

@MaxymVlasov
Copy link
Collaborator

Since --retry-once-with-cleanup only retries once, the races would still happen and be visible,

I tested it a few times on a repo with more than 200 lockfiles with paralelizm=64, and there were 0 problems. But whatever

btw, init retries 10 times

# Locking just doesn't work, and the below works quicker instead. Details:
# https://github.com/hashicorp/terraform/issues/31964#issuecomment-1939869453
for i in {1..10}; do
init_output=$(terraform init -backend=false "${TF_INIT_ARGS[@]}" 2>&1)
exit_code=$?
if [ $exit_code -eq 0 ]; then
break
fi
sleep 1
common::colorify "green" "Race condition detected. Retrying 'terraform init' command [retry $i]: $dir_path."
[[ $recreate_modules == true ]] && rm -rf .terraform/modules
[[ $recreate_providers == true ]] && rm -rf .terraform/providers
done

@MaxymVlasov
Copy link
Collaborator

Oh, okay, it possible to commit broken checksum in some corner cases

image
There should be a retry and it is specified in the configuration, but it never happened, and then to 1 broken h1, added 4 valid h1:

      - id: terraform_validate
        args:
          - --hook-config=--parallelism-ci-cpu-cores=4 # DinD, can't calculate CPU cores
          - --hook-config=--parallelism-limit=CPU*2
          - --hook-config=--retry-once-with-cleanup=true
          - --tf-init-args=-upgrade

image

@MaxymVlasov MaxymVlasov reopened this Apr 12, 2024
@MaxymVlasov MaxymVlasov added the bug_with_workaround Something isn't working but there is a workaround label Apr 12, 2024
@MaxymVlasov MaxymVlasov changed the title terraform init: Disable parrallelism when plugin cache is enabled terraform validate: Force rerun t init when plugin cache is enabled and parrallelism conflit happens Apr 12, 2024
@MaxymVlasov MaxymVlasov added the estimate/1day Need 1 work day to be done label Apr 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug_with_workaround Something isn't working but there is a workaround documentation Improvements or additions to documentation estimate/1day Need 1 work day to be done hook/terraform_providers_lock Bash hook hook/terraform_validate Bash hook
Projects
None yet
Development

No branches or pull requests

3 participants