Implement SSL Mode Feature for Cloud SQL in v4 Terraform #1766

Merged

tylerreidwaze
Collaborator

@tylerreidwaze tylerreidwaze commented May 10, 2024

I followed this guide and implemented the feature. I mostly followed the v5 code, as it is very similar.

@maqiuyujoyce maqiuyujoyce self-assigned this May 21, 2024
Collaborator

@maqiuyujoyce maqiuyujoyce left a comment

Thank you for the change, @tylerreidwaze !

Overall LGTM. Have you tested and verified that the new field works? Covering this new field in the testdata is recommended.

@tylerreidwaze
Collaborator Author

Basic tests have been created and run to spin up an instance with these features. I confirmed that the resources are created as expected, with the expected fields.

I confirmed that the requireSSL and sslmode features were set properly.

In the tests, the resource is 1) successfully created, 2) successfully updated, and 3) successfully deleted. However, the tests return FAIL, and I cannot explain why. I compared results between sqlinstance and postgresinstance, and the exact same results occur. I don't understand the testing logic well enough to determine why these would show FAIL. AFAICT, they are succeeding.

go test -v -tags=integration ./pkg/controller/dynamic/ -test.run TestCreateNoChangeUpdateDelete -run-tests postgresinstance -timeout 900s
go test -v -tags=integration ./pkg/controller/dynamic/ -test.run TestCreateNoChangeUpdateDelete -run-tests sqlinstance -timeout 900s

@acpana
Collaborator

acpana commented May 23, 2024

@tylerreidwaze could you share your logs from running the dynamic tests?

@tylerreidwaze
Collaborator Author

Yes
test.txt

Collaborator

@maqiuyujoyce maqiuyujoyce left a comment

You'll want to run make ready-pr again to adopt the schema changes in the third_party folder

pkg/test/resourcefixture/contexts/sql_context.go (outdated, resolved)
Collaborator

@maqiuyujoyce maqiuyujoyce left a comment

Also, a couple more testdata changes will be needed.

@maqiuyujoyce
Collaborator

According to the test log, this is the problem that caused the failure:

k8s.go:84: expected event with reason 'Updating' to not be recorded for SQLInstance mvvefik4ho4qshyqf5nq/sqlinstance-sample-mvvefik4ho4qshyqf5nq, but it was

This means an update was triggered during the no-change step of the test. In other words, a diff was found between the desired state and the live state when there shouldn't be one.
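
As a rough illustration of the check that failed, here is a minimal Go sketch of a "no event with this reason was recorded" assertion. It is not the actual code in k8s.go: the `Event` type and the `hasReason` helper are hypothetical simplifications, and the "UpToDate" reason is only an illustrative value. Only the 'Updating' reason comes from the log above.

```go
package main

import "fmt"

// Event is a hypothetical, simplified stand-in for a recorded Kubernetes event.
type Event struct {
	Reason  string
	Message string
}

// hasReason reports whether any recorded event carries the given reason.
func hasReason(events []Event, reason string) bool {
	for _, e := range events {
		if e.Reason == reason {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative events recorded while re-applying an unchanged resource.
	recorded := []Event{
		{Reason: "UpToDate", Message: "illustrative message"},
		{Reason: "Updating", Message: "illustrative message"},
	}

	// The no-change step should not trigger an update, so an 'Updating'
	// event here indicates an unexpected diff.
	if hasReason(recorded, "Updating") {
		fmt.Println("FAIL: 'Updating' event recorded during the no-change step")
	} else {
		fmt.Println("PASS: no update was triggered")
	}
}
```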

@tylerreidwaze
Collaborator Author

Is there any way to tell which change it found that it did not expect?

Thanks for finding that!

@maqiuyujoyce
Collaborator

To help identify the issue, an easy starting point would be to print out the diff result to understand what's in it. You can add the following code after https://github.com/GoogleCloudPlatform/k8s-config-connector/blob/master/pkg/controller/tf/controller.go#L350:

fmt.Printf("[your username or whatever that highlights this line]...\n%+v\n", diff)

@tylerreidwaze
Collaborator Author

Got the log message in there. Will continue investigation tomorrow

&{mu:{state:0 sema:0} Attributes:map[settings.0.disk_size:0xc00b2069c0 settings.0.ip_configuration.0.ssl_mode:0xc00b207380] Destroy:false DestroyDeposed:false DestroyTainted:false RawConfig:{ty:{typeImpl:<nil>} v:<nil>} RawState:{ty:{typeImpl:<nil>} v:<nil>} RawPlan:{ty:{typeImpl:<nil>} v:<nil>} Meta:map[e2bfb730-ecaa-11e6-8f88-34363bc7c4c0:map[create:2400000000000 delete:1800000000000 update:1800000000000]]}

Looks like the issue might be with disk_size and/or ssl_mode.

@maqiuyujoyce
Collaborator

Nice finding! Yes, you can dig into more details by printing out the content of the config (desired state) and liveState variables after printing out the diff. I didn't suggest it earlier because they may contain many fields. But since you've located the exact problematic fields, you can compare the desired state and the live state to see why those fields differ.
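
To make the comparison concrete, here is a minimal Go sketch that lists the keys whose values differ between two flattened attribute maps. It assumes both states can be represented as `map[string]string`; the real config and liveState variables in the controller are richer types. The `diffKeys` helper and the "ENCRYPTED_ONLY" value are hypothetical and used only for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// diffKeys returns, in sorted order, every key whose value differs between
// desired and live, including keys present in only one of the two maps.
func diffKeys(desired, live map[string]string) []string {
	var keys []string
	for k, want := range desired {
		if got, ok := live[k]; !ok || got != want {
			keys = append(keys, k)
		}
	}
	for k := range live {
		if _, ok := desired[k]; !ok {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Illustrative values only; the desired ssl_mode value is an assumption.
	desired := map[string]string{
		"settings.0.ip_configuration.0.require_ssl": "false",
		"settings.0.ip_configuration.0.ssl_mode":    "ENCRYPTED_ONLY",
	}
	live := map[string]string{
		"settings.0.ip_configuration.0.require_ssl": "false",
		"settings.0.ip_configuration.0.ssl_mode":    "",
	}

	for _, k := range diffKeys(desired, live) {
		fmt.Printf("%s: desired=%q live=%q\n", k, desired[k], live[k])
	}
}
```

Printing only the differing keys keeps the output short even when the two states contain many fields.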

@tylerreidwaze
Collaborator Author

Super helpful again. I was able to find the output; it seems that ssl_mode is not getting set:

settings.0.ip_configuration.0.require_ssl = false
settings.0.ip_configuration.0.ssl_mode = 

This is weird because 1) all the terraform tests pass and 2) the right settings were set on the resource when I looked at the test instance. Continuing to debug

@tylerreidwaze
Collaborator Author

I think I may have just found something looking at the terraform v5 code

Looks like the state of ssl_mode is dropped unless locally defined, so we can't verify it during state checks

@maqiuyujoyce
Collaborator

maqiuyujoyce commented May 25, 2024

Ah, that makes sense. This basically means the field is unreadable.

From the field schema, it looks like this field is also mutable.

For a mutable but unreadable field, we need to add it to the mutableButUnreadableFields list. More information can be found at step 10 in this section.
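
To illustrate why an unreadable field needs special handling, here is a minimal Go sketch of one possible approach: before diffing, overlay the desired value of each unreadable field onto a copy of the live state, so a field the API never returns cannot produce a spurious diff. This is an assumption made for illustration, not necessarily how Config Connector handles the mutableButUnreadableFields list. The `overlayUnreadable` helper and the "ENCRYPTED_ONLY" value are hypothetical.

```go
package main

import "fmt"

// overlayUnreadable returns a copy of live in which every field listed in
// unreadable takes its value from desired. The live map is left unmodified.
func overlayUnreadable(live, desired map[string]string, unreadable []string) map[string]string {
	merged := make(map[string]string, len(live))
	for k, v := range live {
		merged[k] = v
	}
	for _, k := range unreadable {
		if v, ok := desired[k]; ok {
			merged[k] = v
		}
	}
	return merged
}

func main() {
	const sslMode = "settings.0.ip_configuration.0.ssl_mode"

	// Illustrative values: the live state reports an empty ssl_mode.
	desired := map[string]string{sslMode: "ENCRYPTED_ONLY"}
	live := map[string]string{sslMode: ""}

	merged := overlayUnreadable(live, desired, []string{sslMode})
	fmt.Println("spurious diff remains:", merged[sslMode] != desired[sslMode])
}
```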

@tylerreidwaze
Collaborator Author

Tests are passing after adding the field per your instructions! Thanks so much for the help.

Let me know if there is anything else we want to change prior to merge :)

@tylerreidwaze
Collaborator Author

Getting the below error in the fixture tests

2024/05/28 21:41:24 [DEBUG] Retry Transport: Returning after 1 attempts
{"severity":"error","timestamp":"2024-05-28T21:41:24.963Z","logger":"sqlinstance-controller","msg":"error applying desired state","resource":{"name":"sqlinstance-ifae5j5b7defgbhqphmq","namespace":"ifae5j5b7defgbhqphmq"},"error":"summary: Error, failed to create instance because the network doesn't have at least 1 private services connection. Please see https://cloud.google.com/sql/docs/mysql/private-ip#network_requirements for how to create this connection."}

I enabled this manually in my test project. I believe I will need to enable this in the dependencies to resolve this

@maqiuyujoyce
Collaborator

In this case, I suggest you use a new network instead of the default one.

@tylerreidwaze
Collaborator Author

Good call, committed changes. Testing locally now

@tylerreidwaze
Collaborator Author

Network is failing to delete because firewalls still exist

  reconcile.go:193: error was not considered transient; chain is [[*fmt.wrapError: Delete call failed: error deleting resource: [{0 Error waiting for Deleting Network: The network resource 'projects/tylerreid-kcc-sandbox/global/networks/computenetwork-musjs6755w6gx36tb3ra' is already being used by 'projects/tylerreid-kcc-sandbox/global/firewalls/computenetwork-musjs6755w6gx36tb3ra-ba5h4uquy4cktbsldj6ba2g3-v6'

@maqiuyujoyce
Collaborator

Is the firewall resource implicitly created? Can it be represented via a KCC resource so that we can do an explicit deletion?

@tylerreidwaze
Collaborator Author

Yes, the firewalls seem to be created implicitly. I am not sure if we can explicitly define them. It looks like a LOT of them are being created.

https://screenshot.googleplex.com/9GiKRXoXM964fJp

I might be able to rely on the default network. Trying that now

@maqiuyujoyce
Collaborator

It might work locally because for a default network, the test will abandon it instead of deleting it. However, we don't allow making changes to the default network because it is shared by multiple test cases - any little change will fail a bunch of other test cases. So I don't suggest that this test case reuses the default network.

You might also find that it works by simply marking this ComputeNetwork resource with the abandon deletion policy, but this is still not recommended. There is a quota limit of 15 networks per project by default, but there are many test cases (more than 15) relying on a new network (default network included). Not being able to delete the network will increase the flakiness rate of other test cases.

What I'm curious about right now is: if you are not using Config Connector, what process do you need to follow to clean up the network resource in GCP? That might give us a clue as to why you see this error.

@tylerreidwaze
Collaborator Author

Roger that on the default network and the abandon policy. Seems reasonable to not want to cause issues there.

There is a chance that some of these are being created by Waze's folder level policies. Can you re-approve the workflow? I actually think my tests will pass there. I believe this is an issue specific to how Waze is set up.

What I'm curious about right now is: If you are not using Config Connector, what is the process you need to follow to clean up the network resource in GCP? That might give us the clue why you see this error.

To delete the network, I just need to delete the firewall rules from the UI and then I can delete the network.

@tylerreidwaze
Collaborator Author

I think my postgres version is causing errors which eventually cause a timeout

{"severity":"info","timestamp":"2024-05-29T01:36:14.325Z","msg":"Wait completed, proceeding to shutdown the manager"}
    harness.go:534: error from mgr.Start: failed waiting for all runnables to end within grace period of 30s: context deadline exceeded
    harness.go:505: controller-runtime manager is shutdown
    unified_test.go:821: subtest timeout after 3m0s

Resolved in latest commit, awaiting test run. I am still blocked from running tests locally until I resolve some org policies creating my firewall rules

@tylerreidwaze
Collaborator Author

I am able to run tests locally, and they pass. However, they are long-running, so I added them to the regex per

https://github.com/GoogleCloudPlatform/k8s-config-connector/blob/master/README.NewResourceFromTerraform.md#optionally-add-the-test-case-to-long-running-test-list

@tylerreidwaze
Collaborator Author

Rebased to fix merge conflicts with the regex variable

@tylerreidwaze
Collaborator Author

Not sure why the linter failed after changing the region. My fixture tests passed locally:

tylerreid@tylerreid-kcc:~/dev/k8s-config-connector$ RUN_E2E=1 E2E_KUBE_TARGET=envtest E2E_GCP_TARGET=mock go test -test.count=1 -timeout 3600s -v ./tests/e2e -run TestAllInSeries/fixtures/postgresinstance
...
=== NAME  TestAllInSeries
    unified_test.go:108: shutting down manager
--- PASS: TestAllInSeries (45.27s)
    --- PASS: TestAllInSeries/fixtures (45.27s)
        --- PASS: TestAllInSeries/fixtures/postgresinstance (45.12s)
PASS
ok      github.com/GoogleCloudPlatform/k8s-config-connector/tests/e2e   45.558s

@tylerreidwaze
Collaborator Author

/approve

@maqiuyujoyce
Collaborator

Hi @tylerreidwaze, it looks like the latest 3 commits are not part of this PR? Did you add them by mistake or to deal with any presubmit failures?

@tylerreidwaze
Collaborator Author

This was a suggestion from the KCC Self Service chat group: rebase, as there were breaking changes for the linters in master.

@maqiuyujoyce
Collaborator

I think if the PR is rebased properly, irrelevant commits won't show up. Let's discuss offline what's going on there.

@tylerreidwaze
Collaborator Author

Resolved. Tests are passing now too!

@maqiuyujoyce
Collaborator

/lgtm
/approve

@google-oss-prow google-oss-prow bot added the lgtm label May 31, 2024
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: maqiuyujoyce, tylerreidwaze

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit d3d3cdb into GoogleCloudPlatform:master May 31, 2024
12 checks passed