✨ proposal: new API for managed Security Groups #1756

Open · wants to merge 1 commit into base: main

Conversation

@EmilienM (Contributor) commented Nov 22, 2023

What this PR does / why we need it:

KEP to enhance Security Group API.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Related #1752

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 22, 2023
netlify bot commented Nov 22, 2023

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

Latest commit: 3992c24
Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/65dc8fdba06ca3000843cf27
Deploy Preview: https://deploy-preview-1756--kubernetes-sigs-cluster-api-openstack.netlify.app/print

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 22, 2023
@k8s-ci-robot (Contributor)
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: EmilienM
Once this PR has been reviewed and has the lgtm label, please assign mdbooth for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

  • Approvers can indicate their approval by writing /approve in a comment
  • Approvers can cancel approval by writing /approve cancel in a comment

@EmilienM EmilienM changed the title proposals: new API for managed Security Groups ✨ proposal: new API for managed Security Groups Nov 22, 2023
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 22, 2023
@EmilienM (Contributor Author)

/cc mdbooth

@EmilienM (Contributor Author)

/cc jichenjc

docs/proposals/20231122-security-groups.md (outdated)
namespace: default
spec:
securityGroupsSpec:
controlPlaneSecurityGroupRules:
A Contributor commented:

Previously we talked about whether we should create a CRD that is reusable across clusters,
e.g. you create multiple clusters, and if you modify one JSON or YAML file it can be used in all of them?

I know it's complicated, so I think we can defer adding such an extension to the next stage?

@EmilienM (Contributor Author) replied:

For now, let's try without a new CRD and see how it goes. I'll keep this comment open as it's still relevant to the latest iteration of the KEP.

@lentzi90 (Contributor) left a comment:

I like the overall design of this

docs/proposals/20231122-security-groups.md (outdated)
2. Avoid any breaking changes for existing users; they'll still be able to use the legacy rules
3. Successfully be able to create and manage security groups for both the Control Plane and the Worker Nodes
4. Provide a migration path (documentation or code) for users who are currently using the legacy rules provided by `OpenStackCluster.spec.managedSecurityGroups`
5. Deprecate `OpenStackCluster.spec.apiServerLoadBalancer.additionalPorts`, as we'll have an API to create Control Plane Security Group rules (with more flexibility than we have now). Conversion will happen between this parameter and the one for Control Plane Security Group rules in order to maintain backward compatibility
A Contributor commented:

We can't do this. This parameter has more uses than just security group rules, and removing it would break existing use cases. See #1749 (comment)

However, I would like to separate responsibility for these rules into:

  • A separate security group
  • Managed by a separate Octavia control plane endpoint provider

i.e. create a new controller for managing an Octavia API loadbalancer. That controller is responsible for creating a security group suitable for its needs and attaching that security group to control plane machines.

In this case, it would simply be the responsibility of the cluster controller to remove the legacy rules from the default security group.

@EmilienM (Contributor Author)

@mkjpryor I'd like your feedback on this one when time permits. Thanks!

@dulek (Contributor) left a comment:

There is an important design decision to be made here: how do we deal with users modifying SGs managed by CAPO through the OpenStack API?

My idea would be to adopt the K8s model here. The controller should periodically monitor the SG rules and SG associations of the SGs it's managing, and reconcile them if a user made any changes. I think this is the only sane way to do it; otherwise we allow users to modify SGs on their own, and we never know whether an attached SG should be kept or removed during e.g. an upgrade.

Comment on lines 136 to 133
# Allow to provide rules for the CNI that will be applied to all nodes
cniSecurityGroupRules:
- direction: ingress
ethertype: IPv4
portRangeMin: 1234
portRangeMax: 1234
protocol: tcp
A Contributor commented:

Why do we call this one CNI when it's basically gathering the rules applied to each of the nodes? I'd abstain from designing this around a CNI. From a technical standpoint, I think it's possible to use different CNIs on different nodes. It's totally not supported at the moment, but it might be one day.

@EmilienM (Contributor Author) replied:

I can name it allNodesSecurityGroupRules, I don't mind.
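With that rename, the snippet quoted above would simply read (same rules, new key, comment adjusted accordingly):

# Rules that will be applied to all nodes
allNodesSecurityGroupRules:
- direction: ingress
  ethertype: IPv4
  portRangeMin: 1234
  portRangeMax: 1234
  protocol: tcp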

@mkjpryor (Contributor) commented Nov 28, 2023

@EmilienM

Do you think that this should be consistent with whatever we do for server groups? Personally, I favour new CRDs for both, i.e. OpenStackSecurityGroup and OpenStackServerGroup CRDs.

I worry about us cluttering up the spec of the OpenStackCluster with too much complicated stuff like this. IMHO we should start to establish a pattern of having additional CRDs for resources that don't have a direct correspondence with a cluster/machine.

For example, a port is something that I don't mind having in OpenStackMachine.spec, as it is clearly related to the machine.
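For illustration only, such a CRD might look roughly like this (a hypothetical sketch; the kind, apiVersion, and field names are assumptions, not something defined in this proposal):

apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1  # hypothetical
kind: OpenStackSecurityGroup
metadata:
  name: calico-rules
  namespace: default
spec:
  # Rules reconciled into a single Neutron security group,
  # which clusters or machines could then reference by name.
  rules:
  - direction: ingress
    ethertype: IPv4
    portRangeMin: 179
    portRangeMax: 179
    protocol: tcp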

@EmilienM (Contributor Author)

> @EmilienM
>
> Do you think that this should be consistent with whatever we do for server groups? Personally, I favour new CRDs for both, i.e. OpenStackSecurityGroup and OpenStackServerGroup CRDs.
>
> I worry about us cluttering up the spec of the OpenStackCluster with too much complicated stuff like this. IMHO we should start to establish a pattern of having additional CRDs for resources that don't have a direct correspondence with a cluster/machine.
>
> For example, a port is something that I don't mind having in OpenStackMachine.spec, as it is clearly related to the machine.

To me there is a relation between Security Groups and Machines, as we want to update a Machine if a security group is added to the control plane machines.
For example: if we move the Calico rules into their own Security Group, we'll need a migration path where we remove the rules from the control plane security group, create a new group with the Calico rules, and add that group to the machines. This requires the controller to update the machine with the new SG.
Do we want to do that in the OpenStackMachine (or Cluster) controller, or via a new controller dedicated to security groups that can also update a Machine?
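Whichever controller owns it, attaching an existing group to a machine would look roughly like this (a sketch; the securityGroups field on OpenStackMachine.spec and the group name are assumptions for illustration):

kind: OpenStackMachine
spec:
  securityGroups:
  # During the migration described above, the controller would add
  # the newly created Calico group here.
  - name: calico-rules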

@EmilienM (Contributor Author)

> There is an important design decision to be made here: how do we deal with users modifying SGs managed by CAPO through the OpenStack API?
>
> My idea would be to adopt the K8s model here. The controller should periodically monitor the SG rules and SG associations of the SGs it's managing, and reconcile them if a user made any changes. I think this is the only sane way to do it; otherwise we allow users to modify SGs on their own, and we never know whether an attached SG should be kept or removed during e.g. an upgrade.

I think we all agree; the question remains whether we add a new CRD (+ controller) dedicated to the Security Groups lifecycle, or continue to use OpenStackCluster.

@mkjpryor (Contributor) commented Nov 29, 2023

It just feels odd to me that all the security groups are defined in the OpenStackCluster object and the OpenStackMachine object says "I want to use security group X from the cluster object", vs having a new OpenStackSecurityGroup CRD 🤷‍♂️ Same for server groups when we do that thing.

I do agree that it probably makes the conversion more complicated.

Happy to be overruled though.

@jichenjc (Contributor) commented Dec 1, 2023

> It just feels odd to me that all the security groups are defined in the OpenStackCluster object and the OpenStackMachine object says "I want to use security group X from the cluster object", vs having a new OpenStackSecurityGroup CRD 🤷‍♂️ Same for server groups when we do that thing.

That's exactly what I asked above, and I think we can check this comment from @EmilienM:

> For now, let's try without a new CRD and see how it goes. I'll keep this comment open as it's still relevant to the latest iteration of the KEP.

So maybe we can check https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1765/files first.

@mdbooth (Contributor) commented Jan 15, 2024

As discussed at the last office hours, I think this is the biggest single API stabilisation task we have for v1beta1. Specifically, we need to remove the hard-coded Calico rules, but allow:

  • CNI rules to be used in managed security groups
  • Safely upgrading clusters which are currently using the hard-coded Calico rules

I would very much like to get to v1beta1 asap, so to make that simpler I'd like to propose reducing the scope of the change here as much as possible.

managedSecurityGroups:
  # Enable the management of security groups with the default rules (kubelet, etcd, etc.)
  # The default stays false
  enabled: true
  # Allow to extend the default security group rules for the Bastion,
  # the Control Plane and the Worker security groups
  additionalBastionSecurityGroupRules:
    - direction: ingress
      ethertype: IPv4
      portRangeMin: 1234
      portRangeMax: 1234
      protocol: tcp
  additionalControlPlaneSecurityGroupRules:
    - direction: ingress
      ethertype: IPv4
      portRangeMin: 1234
      portRangeMax: 1234
      protocol: tcp
  additionalWorkerSecurityGroupRules:
    - direction: ingress
      ethertype: IPv4
      portRangeMin: 1234
      portRangeMax: 1234
      protocol: tcp
  # Allow to provide rules that will be applied to all nodes
  allNodesSecurityGroupRules:
    - direction: ingress
      ethertype: IPv4
      portRangeMin: 1234
      portRangeMax: 1234
      protocol: tcp

I would remove additionalSecurityGroups for now for 2 reasons. Firstly, in this proposal they are nested under managedSecurityGroups, but they're not managed. I think it would be confusing to write:

managedSecurityGroups:
  enabled: false
  additionalSecurityGroups:
  - ...

Does the user expect these to be created when managed security groups are disabled? additionalSecurityGroups are unmanaged, so I'd have them separate, as sketched below.
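Something like this, for example (a sketch of a possible layout, not a settled API; the group name is an assumed example):

managedSecurityGroups:
  enabled: true
# Unmanaged, pre-existing groups referenced by the user,
# living outside managedSecurityGroups entirely.
additionalSecurityGroups:
- name: my-existing-group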

Secondly, adding a new additionalSecurityGroups field would be a backwards-compatible change, so we can do it in v1beta1. Let's punt on anything we don't strictly need right now.

We discussed having a useLegacyCalicoRules field, but it occurs to me we don't need it: we can just add the Calico rules directly to allNodesSecurityGroupRules during upgrade. This means that the legacy cruft only lives in the legacy API code. We can document what the rules are, and others can add to the documentation the required rules for other CNIs.

In the first instance, we can also remove additional controller complexity by not changing the set of security groups attached to nodes. This means we don't have an upgrade problem, and don't need to be able to dynamically attach and detach security groups. Instead, at least in the first instance, we can simply replicate rules in allNodesSecurityGroupRules to control-plane and workers.

Lastly, we make extensive use of RemoteGroupID in our current rules. We need to allow that to be expressed in the new API, but it obviously can't take the UUID of a managed security group. I suggest that in the first instance we add a remoteManagedGroups field to security group rules which references one of bastion, control-plane, or worker. As a fully worked example, I would expect an upgraded configuration containing upgraded Calico rules to be:

managedSecurityGroups:
  enabled: true
  allNodesSecurityGroupRules:
  - description: BGP (calico)
    direction: ingress
    ethertype: IPv4
    portRangeMin: 179
    portRangeMax: 179
    protocol: tcp
    remoteManagedGroups:
    - control-plane
    - worker
  - description: IP-in-IP (calico)
    direction: ingress
    ethertype: IPv4
    protocol: 4
    remoteManagedGroups:
    - control-plane
    - worker

These 2 rules would become 4 rules, as we would create a duplicate for each of the remoteManagedGroups. The resulting 4 rules would be added to both control-plane and worker.
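To illustrate the expansion, the BGP rule above would conceptually become the following two Neutron rules (a sketch; remoteGroupID stands in for Neutron's remote_group_id, which the controller would resolve at reconcile time):

- description: BGP (calico)
  direction: ingress
  ethertype: IPv4
  portRangeMin: 179
  portRangeMax: 179
  protocol: tcp
  remoteGroupID: <UUID of the managed control-plane group>
- description: BGP (calico)
  direction: ingress
  ethertype: IPv4
  portRangeMin: 179
  portRangeMax: 179
  protocol: tcp
  remoteGroupID: <UUID of the managed worker group>

The IP-in-IP rule would expand the same way.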

I suggest that this change alone could also be the minimum change, and we could add additional<X>SecurityGroupRules later as a backwards-compatible change.

@lentzi90 (Contributor)

Thank you for that well-thought-through comment! I completely agree!

@EmilienM (Contributor Author)

Cool, I'm working on a change here: #1826, to address what Matt asked for.

@dulek (Contributor) commented Jan 19, 2024

I'll just note one thing here: in complicated environments, Neutron has performance issues with remote_group_id, as it generates a lot of flows. It's advised to use CIDRs directly if possible. This probably isn't a problem at CAPO's scale, but I'm mentioning it here for the record.
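For comparison, a CIDR-based version of the BGP rule above might look like this (a sketch; the remoteIPPrefix field name mirrors Neutron's remote_ip_prefix and is an assumption here, as is the example CIDR):

- description: BGP (calico)
  direction: ingress
  ethertype: IPv4
  portRangeMin: 179
  portRangeMax: 179
  protocol: tcp
  # The nodes' subnet CIDR, used instead of a remote
  # security group reference to avoid the flow explosion.
  remoteIPPrefix: 10.0.0.0/24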

@EmilienM (Contributor Author)

Note: this one is not stale. I'll come back to it at some point this year, now that we have a new API for Security Groups.

@EmilienM (Contributor Author)

This is still highly WIP; I need to update it based on our recent work. Please don't review it yet.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 26, 2024
@EmilienM (Contributor Author)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 27, 2024
Labels: cncf-cla: yes · do-not-merge/hold · size/L

8 participants