
Cluster Security Group Change Issue #3014

Open · 1 task done
sidewinder12s opened this issue Apr 17, 2024 · 7 comments

@sidewinder12s commented Apr 17, 2024

Description

It appears I've hit this bug: #2469

But specifically, I am already past the v18/v19 module upgrades. The only change we've made is that our VPC has added a CIDR block, which seems to break the module with:

Plan:

  # module.cluster.module.eks.aws_security_group_rule.cluster["vpc"] must be replaced
+/- resource "aws_security_group_rule" "cluster" {
      ~ cidr_blocks              = [ # forces replacement
            # (2 unchanged elements hidden)
            "10.60.7.0/24",
          + "10.40.0.0/16",
        ]
      ~ id                       = "sgrule-422031136" -> (known after apply)
      - prefix_list_ids          = [] -> null
      + security_group_rule_id   = (known after apply)
      + source_security_group_id = (known after apply)
        # (7 unchanged attributes hidden)
    }

Apply:

module.cluster.module.eks.aws_security_group_rule.cluster["vpc"]: Creating...
╷
│ Error: [WARN] A duplicate Security Group rule was found on (sg-07c6df08897a15c50). This may be
│ a side effect of a now-fixed Terraform issue causing two security groups with
│ identical attributes but different source_security_group_ids to overwrite each
│ other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
│ information and instructions for recovery. Error: InvalidPermission.Duplicate: the specified rule "peer: 10.35.192.0/18, TCP, from port: 443, to port: 443, ALLOW" already exists
│ 	status code: 400, request id: 299a0a30-3cd1-47b9-9192-26f9f0cfda35
│ 
│   with module.cluster.module.eks.aws_security_group_rule.cluster["vpc"],
│   on .terraform/modules/cluster.eks/main.tf line 295, in resource "aws_security_group_rule" "cluster":
│  295: resource "aws_security_group_rule" "cluster" {
│ 

So I am somewhat concerned this is not an issue specific to the module upgrades but a bug/limitation in the module.

  • ✋ I have searched the open/closed issues and my issue is not listed.

Versions

  • Module version [Required]: 20.1.1

  • Terraform version: v1.3.6

  • Provider version(s): v5.35.0

Reproduction Code

Steps to reproduce the behavior:

Modify the incoming VPC to add another CIDR block:

  cluster_security_group_additional_rules = merge(
    {
      # Default Private Control Plane Access
      vpc = {
        description = "Allow VPC Network Access"
        protocol    = "tcp"
        from_port   = 443
        to_port     = 443
        cidr_blocks = data.aws_vpc.vpc.cidr_block_associations[*].cidr_block
        type        = "ingress"
      },
      private_network_access = {
        description = "Allow Private Network Access"
        protocol    = "tcp"
        from_port   = 443
        to_port     = 443
        cidr_blocks = ["10.0.0.0/8"]
        type        = "ingress"
      }
  }, var.cluster_security_group_additional_rules)

Is the issue potentially that we're using a computed value as the input to this variable in the module?

A minimal repro would probably be to set cluster_security_group_additional_rules and then modify it, or provide a computed value that changes and forces modification; a sketch of this is below.
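
Outside the module, that repro might look like the following sketch (hypothetical variable name and placeholder security group ID, not the module's code). The cidr_blocks argument of aws_security_group_rule forces replacement when it changes, which matches the plan output above:

variable "vpc_cidr_blocks" {
  description = "CIDRs associated with the VPC; grows when a CIDR block is added"
  type        = list(string)
  default     = ["10.60.7.0/24"]
}

resource "aws_security_group_rule" "vpc" {
  description       = "Allow VPC Network Access"
  type              = "ingress"
  protocol          = "tcp"
  from_port         = 443
  to_port           = 443
  cidr_blocks       = var.vpc_cidr_blocks
  security_group_id = "sg-0123456789abcdef0" # placeholder

  # Appending e.g. "10.40.0.0/16" to var.vpc_cidr_blocks changes cidr_blocks,
  # which cannot be updated in place on this resource, so the whole rule is
  # replaced.
}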

Expected behavior

Terraform updates the security group rules

Actual behavior

Terraform errors out on an existing cluster

Additional context

@isaccavalcante

I've been seeing this error a lot. All the other issues I've seen about this were closed without a solution. There's the workaround of changing the IP ranges to force a change, but that's not practical. Another hacky way I found is to create a variable and append it to the key of the security group rules. For instance, in your case it would be something like:

variable "security_group_update_trigger" {
  description = "Trigger updates without getting duplicate security group rule error"
  type        = bool
  default     = false
}
  cluster_security_group_additional_rules = merge(
    {
      # Default Private Control Plane Access
      "vpc_${var.security_group_update_trigger}" = {
        description = "Allow VPC Network Access"
        protocol    = "tcp"
        from_port   = 443
        to_port     = 443
        cidr_blocks = data.aws_vpc.vpc.cidr_block_associations[*].cidr_block
        type        = "ingress"
      },
      "private_network_access_${var.security_group_update_trigger}" = {
        description = "Allow Private Network Access"
        protocol    = "tcp"
        from_port   = 443
        to_port     = 443
        cidr_blocks = ["10.0.0.0/8"]
        type        = "ingress"
      }
  }, var.cluster_security_group_additional_rules)

But if you have a huge list of CIDRs, as in my case, this is bad: you lose easy tracking of changes when adding or removing a single CIDR, because the plan will show all of the rules being added and removed.

I hope we can have a proper fix for this issue.

@bryantbiggs (Member)

I don't see how this relates to the module; it looks like you could easily reproduce this with plain security group rule resources.

@sidewinder12s (Author)

I guess I opened this because it was not clear to me why the aws_security_group_rule was triggering a replace as opposed to just being modified in place. It seemed like a side effect of how it has been implemented in the module.

@bryantbiggs (Member)

It's being replaced due to the changes Terraform has identified here:

~ cidr_blocks = [ # forces replacement
      # (2 unchanged elements hidden)
      "10.60.7.0/24",
    + "10.40.0.0/16",
  ]

And I suspect the duplicate rule error is being thrown because both 10.60.7.0/24 and 10.40.0.0/16 are already captured within the rule:

private_network_access = {
    description = "Allow Private Network Access"
    protocol    = "tcp"
    from_port   = 443
    to_port     = 443
    cidr_blocks = ["10.0.0.0/8"]
    type        = "ingress"
  }

@sidewinder12s (Author)

I can confirm the second point isn't the issue: we already had multiple rules in place that overlap with 10.0.0.0/8, and those should have failed if this were an EC2 SG limitation.

I was also able to manually add 10.40.0.0/16 in the EC2 console without error. It's just Terraform failing because it attempts to create_before_destroy the aws_security_group_rule, and the rule it tries to create already exists on the SG.

To get around this, since I already had that overlapping 10.0.0.0/8 rule defined in Terraform, I manually deleted all of the other, more specific rules Terraform was trying to append, removed them from state, and then re-applied Terraform successfully.

I suspected that something about how the variables are passed into the aws_security_group_rule resource is what triggers the create_before_destroy behavior: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/main.tf#L296-L299
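
As a rough illustration of the pattern being described (a sketch under that assumption, not the module's exact code), a create_before_destroy lifecycle on a rule whose cidr_blocks force replacement means the replacement is created while the old rule, which still contains most of the same CIDRs, is still attached:

resource "aws_security_group_rule" "cluster" {
  type              = "ingress"
  protocol          = "tcp"
  from_port         = 443
  to_port           = 443
  cidr_blocks       = var.vpc_cidr_blocks            # hypothetical input
  security_group_id = var.cluster_security_group_id  # hypothetical input

  lifecycle {
    # Replacement order: create the new rule first, then destroy the old one.
    # The new rule shares CIDR entries with the not-yet-destroyed old rule,
    # so EC2 rejects the create step with InvalidPermission.Duplicate.
    create_before_destroy = true
  }
}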

But I wasn't sure how to investigate further. I also wonder whether these even newer resources might avoid this, especially since within the EC2 console each CIDR block we're adding already has its own unique security group rule ID (see the sketch below): https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/vpc_security_group_ingress_rule
https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/vpc_security_group_egress_rule
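
For comparison, a sketch of how the newer per-rule resource could be used (hypothetical inputs; not something the module does today): each CIDR becomes its own rule object, so adding a CIDR only creates one new rule instead of replacing a shared one:

resource "aws_vpc_security_group_ingress_rule" "vpc" {
  for_each = toset(var.vpc_cidr_blocks) # hypothetical input

  security_group_id = var.cluster_security_group_id # hypothetical input
  description       = "Allow VPC Network Access"
  ip_protocol       = "tcp"
  from_port         = 443
  to_port           = 443
  cidr_ipv4         = each.value

  # Adding "10.40.0.0/16" to the list creates one new rule; the existing rules
  # and their security group rule IDs are left untouched.
}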

@bryantbiggs (Member)

The newer individual rule resources (which cannot take a list of CIDRs, etc.) are recommended, but that comes at the cost of disruption to users. We have started to use them where possible, but we have to identify the right time to introduce them since the migration will be disruptive.


This issue has been automatically marked as stale because it has been open for 30 days with no activity. Remove the stale label or comment, or this issue will be closed in 10 days.

@github-actions github-actions bot added the stale label May 23, 2024