Intermittent "unexpected EOF" causes terraform to lose track of resources just created #2335

Open
awood-prologis opened this issue Mar 22, 2024 · 0 comments

Datadog Terraform Provider Version

v3.38.0

Terraform Version

v1.5.7

What resources or data sources are affected?

  • datadog_monitor resource

Terraform Configuration Files

terraform {
  required_providers {
    datadog = {
      source  = "DataDog/datadog"
    }
  }
}

variable "combined_backend_services" {
  type = map(map(string))
  default = {
    a = { id = "28757c79-0258-4df8-bbf3-c3c46dfaadb6" }
    b = { id = "01e40243-67a2-4913-8390-53d363ad944d" }
    c = { id = "5b328afa-a801-4611-acd1-8c1645bb6de1" }
    d = { id = "945417cd-9775-48c4-903f-5d147254ff1e" }
    e = { id = "e4f44c80-a3b5-4465-bb89-a893e057b111" }
  }
}

variable "Name" {
  type = string
  default = "EntAuth"
}

variable "Environment" {
  type = string
  default = "Dev"
}

variable "datadog_api_key" {
  type = string
}

variable "datadog_app_key" {
  type = string
}

locals {
  datadog_monitor_prefix = "DELETE ME - Scratch Work"
  datadog_notify_users_plus_opsgenie = "badvalue"
  default_datadog_notify_users = "@[email protected]"
}

provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
}

resource "datadog_monitor" "health_check_monitor" {
  for_each = toset([for k, v in var.combined_backend_services : v.id])

  name = "${local.datadog_monitor_prefix} | Health Check Failing"
  type = "metric alert"
  message = "One or more health checks in the primary region are failing, which will trigger an automated DR failover. ${var.Environment == "Prod" ? local.datadog_notify_users_plus_opsgenie : local.default_datadog_notify_users}"
  query = "min(last_5m):min:aws.route53.health_check_status{healthcheckid:${each.value}} < 1"
  monitor_thresholds {
    critical = 1
  }
}

resource "datadog_monitor" "health_check_failing" {
  name = "${local.datadog_monitor_prefix} | Automated DR failover"
  type = "event-v2 alert"
  message = "One or more health checks in the primary region are failing, which will trigger an automated DR failover. ${var.Environment == "Prod" ? local.datadog_notify_users_plus_opsgenie : local.default_datadog_notify_users}"
  query = "events(\"\\\"ALARM: \\\\\"${lower("${var.Name}-${var.Environment}")}-primary-unhealthy\\\\\"\\\"\").rollup(\"count\").last(\"5m\") > 0"
  monitor_thresholds {
    critical = 0
  }
}

Relevant debug or panic output

https://gist.github.com/awood-prologis/239b577fcf01e90f2b3719c982450396

Expected Behavior

I expected all resources created by Terraform when applying the given configuration to then be tracked by Terraform.
After running a destroy and then manually deleting, in the Datadog UI, the resources that Terraform created but did not track, I expect the next apply of the same configuration, with the same inputs, against an identical clean environment (i.e. no existing objects that conflict with resources defined in the configuration) to produce the same results as the first run.

Actual Behavior

Terraform attempts to apply the provided configuration. During the run, it lists all resources to be created as expected; however, some of the resources report an error during creation because of "unexpected EOF". Which resources fail, and how many, varies from run to run, but in my testing there have always been at least one and fewer than six failures.
After the run finishes, the Datadog monitor UI shows that all the planned resources reported by Terraform were in fact created and appear correct. However, subsequently running terraform show indicates that Terraform does not know about the resources it reported as failures, presumably because it did not write them into the state, believing the creation operation to have failed.
Running a plan after this point shows that Terraform wants to create the "failed" resources again. An apply then fails, because the new resources have a namespace conflict in the DD API with the "forgotten" resources that already exist.
Running a subsequent terraform destroy succeeds, but terraform cleans up only those resources it knows about. The rest must be cleaned up manually via the DD API.
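
As a possible alternative to manual cleanup, the untracked monitors could in principle be adopted into state with import blocks, which Terraform v1.5 supports. This is only a sketch and I have not verified it against this bug: the numeric monitor IDs below are hypothetical placeholders that would have to be looked up in the Datadog UI, and the instance key assumes the for_each keys are the health check IDs from the configuration above.

```hcl
# Adopt the single event monitor that Terraform lost track of.
import {
  to = datadog_monitor.health_check_failing
  id = "123456789" # hypothetical monitor ID, replace with the real one
}

# Adopt one instance of the for_each monitor. The key is the health check ID,
# because for_each iterates over a set of those IDs.
import {
  to = datadog_monitor.health_check_monitor["28757c79-0258-4df8-bbf3-c3c46dfaadb6"]
  id = "123456790" # hypothetical monitor ID, replace with the real one
}
```

If this works as assumed, a subsequent terraform plan should report the monitors as imports rather than as new resources to create.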

Steps to Reproduce

  • terraform apply
  • terraform show
  • examine the Datadog monitor management UI to identify the newly-created resources in Datadog. It should include all resources reported by Terraform to have been successfully created, as well as all resources that were reported as failed due to "unexpected EOF".
  • terraform apply again
  • this will fail, as the resources that terraform believes failed do in fact exist, and cannot be created again
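
To reproduce against the exact versions listed in this report, the terraform block in the configuration above can be replaced with a pinned one. The version numbers are taken from this report; using exact-match constraints is my assumption about how best to pin them.

```hcl
terraform {
  # Terraform version from this report
  required_version = "1.5.7"

  required_providers {
    datadog = {
      source = "DataDog/datadog"
      # Provider version from this report
      version = "3.38.0"
    }
  }
}
```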

Important Factoids

No response

References

No response
