[Flaking Test] chart-lint-test not stable recently #4917

Open
RainbowMango opened this issue May 8, 2024 · 17 comments · May be fixed by #5010
Labels
kind/flake Categorizes issue or PR as related to a flaky test.

Comments

@RainbowMango
Member

RainbowMango commented May 8, 2024

Which jobs are flaking:

chart-lint-test

Which test(s) are flaking:

See the logs below:

Reason for failure:

TBD

Anything else we need to know:

@RainbowMango RainbowMango added the kind/flake Categorizes issue or PR as related to a flaky test. label May 8, 2024
@RainbowMango
Member Author

cc @calvin0327 @chaosi-zju for help

@chaosi-zju
Member

/assign

@chaosi-zju
Member

error initially creating leader election record: namespaces "karmada-system" not found

Actually, I have been seeing this error occasionally for a long time (ಥ_ಥ).

@chaosi-zju
Member

@calvin0327 do you have any initial thoughts?


@RainbowMango
Member Author

Another case: https://github.com/karmada-io/karmada/actions/runs/9168776891/job/25208102953

I0521 04:23:33.859739       1 leaderelection.go:250] attempting to acquire leader lease karmada-system/karmada-scheduler...
E0521 04:23:33.862701       1 leaderelection.go:332] error retrieving resource lock karmada-system/karmada-scheduler: Get "https://karmada-k1ah2k0r9l-apiserver.karmada-k1ah2k0r9l.svc.cluster.local:5443/apis/coordination.k8s.io/v1/namespaces/karmada-system/leases/karmada-scheduler": dial tcp 10.96.241.60:5443: connect: connection refused
E0521 04:23:37.112799       1 leaderelection.go:332] error retrieving resource lock karmada-system/karmada-scheduler: Get "https://karmada-k1ah2k0r9l-apiserver.karmada-k1ah2k0r9l.svc.cluster.local:5443/apis/coordination.k8s.io/v1/namespaces/karmada-system/leases/karmada-scheduler": dial tcp 10.96.241.60:5443: connect: connection refused

@chaosi-zju
Member

Problem localization is in progress; here are some clues:

The direct reason for the helm failure is the CrashLoopBackOff of karmada-controller-manager:

$ kubectl get po -A
NAMESPACE            NAME                                                          READY   STATUS             RESTARTS        AGE
karmada-nkuq2v3017   etcd-0                                                        1/1     Running            0               6m23s
karmada-nkuq2v3017   karmada-nkuq2v3017-aggregated-apiserver-769fff4f58-9prbb      1/1     Running            5 (3m4s ago)    6m23s
karmada-nkuq2v3017   karmada-nkuq2v3017-apiserver-76b5b8894-6g4vw                  1/1     Running            4 (3m48s ago)   6m23s
karmada-nkuq2v3017   karmada-nkuq2v3017-controller-manager-6d775ffc74-tpk4m        0/1     CrashLoopBackOff   4 (75s ago)     6m23s
karmada-nkuq2v3017   karmada-nkuq2v3017-kube-controller-manager-5877d89f57-mtqk5   1/1     Running            5 (3m41s ago)   6m23s
karmada-nkuq2v3017   karmada-nkuq2v3017-scheduler-df578498f-226dx                  1/1     Running            0               6m23s
karmada-nkuq2v3017   karmada-nkuq2v3017-webhook-5f6fc69445-rfhzf                   1/1     Running            0               6m23s
karmada-system       etcd-0                                                        1/1     Running            0               116m
karmada-system       karmada-aggregated-apiserver-6bf466fdc4-fv86h                 1/1     Running            2 (116m ago)    116m
karmada-system       karmada-apiserver-756b559f84-qf2td                            1/1     Running            0               116m
karmada-system       karmada-controller-manager-7b9f6f5f5-v5bwp                    1/1     Running            3 (116m ago)    116m
karmada-system       karmada-kube-controller-manager-7b6d45cbdf-5kk8d              1/1     Running            2 (116m ago)    116m
karmada-system       karmada-scheduler-64db5cf5d6-bgd85                            1/1     Running            0               116m
karmada-system       karmada-webhook-7b6fc7f575-chqjk                              1/1     Running            0               116m

Error logs of karmada-controller-manager:

$ kubectl logs -f karmada-nkuq2v3017-controller-manager-6d775ffc74-tpk4m -n karmada-nkuq2v3017    
I0521 12:47:38.740245       1 feature_gate.go:249] feature gates: &{map[PropagateDeps:false]}
I0521 12:47:38.740443       1 controllermanager.go:139] karmada-controller-manager version: version.Info{GitVersion:"v1.10.0-preview4-130-g53af52e4a", GitCommit:"53af52e4a853ac04efb6c189583b5e63dff3c771", GitTreeState:"clean", BuildDate:"2024-05-21T11:48:26Z", GoVersion:"go1.21.10", Compiler:"gc", Platform:"linux/amd64"}
I0521 12:47:38.781620       1 reflector.go:351] Caches populated for *v1.Service from k8s.io/client-go/informers/factory.go:159
I0521 12:47:38.878755       1 context.go:160] Starting "endpointSlice"
I0521 12:47:38.878799       1 context.go:170] Started "endpointSlice"
I0521 12:47:38.878805       1 context.go:160] Starting "unifiedAuth"
I0521 12:47:38.878826       1 context.go:170] Started "unifiedAuth"
I0521 12:47:38.878830       1 context.go:160] Starting "endpointsliceCollect"
W0521 12:47:38.878840       1 context.go:167] Skipping "endpointsliceCollect"
I0521 12:47:38.878853       1 context.go:160] Starting "cronFederatedHorizontalPodAutoscaler"
I0521 12:47:38.878866       1 context.go:170] Started "cronFederatedHorizontalPodAutoscaler"
W0521 12:47:38.878874       1 context.go:157] "deploymentReplicasSyncer" is disabled
I0521 12:47:38.878878       1 context.go:160] Starting "remedy"
I0521 12:47:38.878893       1 context.go:170] Started "remedy"
I0521 12:47:38.878904       1 context.go:160] Starting "execution"
I0521 12:47:38.878923       1 context.go:170] Started "execution"
I0521 12:47:38.878928       1 context.go:160] Starting "workStatus"
I0521 12:47:38.879028       1 context.go:170] Started "workStatus"
I0521 12:47:38.879038       1 context.go:160] Starting "serviceImport"
I0521 12:47:38.879051       1 context.go:170] Started "serviceImport"
I0521 12:47:38.879056       1 context.go:160] Starting "gracefulEviction"
I0521 12:47:38.879078       1 context.go:170] Started "gracefulEviction"
I0521 12:47:38.879082       1 context.go:160] Starting "federatedHorizontalPodAutoscaler"
I0521 12:47:38.879151       1 context.go:170] Started "federatedHorizontalPodAutoscaler"
I0521 12:47:38.879171       1 context.go:160] Starting "workloadRebalancer"
I0521 12:47:38.879196       1 context.go:170] Started "workloadRebalancer"
I0521 12:47:38.879206       1 context.go:160] Starting "endpointsliceDispatch"
W0521 12:47:38.879213       1 context.go:167] Skipping "endpointsliceDispatch"
I0521 12:47:38.879219       1 context.go:160] Starting "namespace"
I0521 12:47:38.879239       1 context.go:170] Started "namespace"
I0521 12:47:38.879253       1 context.go:160] Starting "serviceExport"
I0521 12:47:38.879326       1 context.go:170] Started "serviceExport"
I0521 12:47:38.879337       1 context.go:160] Starting "federatedResourceQuotaSync"
I0521 12:47:38.879362       1 context.go:170] Started "federatedResourceQuotaSync"
I0521 12:47:38.879376       1 context.go:160] Starting "applicationFailover"
I0521 12:47:38.879397       1 context.go:170] Started "applicationFailover"
I0521 12:47:38.879405       1 context.go:160] Starting "multiclusterservice"
W0521 12:47:38.879412       1 context.go:167] Skipping "multiclusterservice"
W0521 12:47:38.879426       1 context.go:157] "hpaScaleTargetMarker" is disabled
I0521 12:47:38.879436       1 context.go:160] Starting "cluster"
E0521 12:47:38.897338       1 context.go:163] Error starting "cluster"
F0521 12:47:38.897365       1 controllermanager.go:821] error starting controllers: [no matches for kind "ResourceBinding" in version "work.karmada.io/v1alpha2", no matches for kind "ClusterResourceBinding" in version "work.karmada.io/v1alpha2"]

Pay attention to error starting controllers: [no matches for kind "ResourceBinding" in version "work.karmada.io/v1alpha2", no matches for kind "ClusterResourceBinding" in version "work.karmada.io/v1alpha2"]

@chaosi-zju
Member

After several retries, the error log of karmada-controller-manager turned into:

E0521 12:57:17.222169       1 cluster_controller.go:206] Error monitoring cluster health: no matches for kind "Cluster" in version "cluster.karmada.io/v1alpha1"

This may be the same problem as #4942 submitted by @levkp.

@chaosi-zju
Member

chaosi-zju commented May 21, 2024

As for error logs like:

E0521 14:04:55.290515       1 leaderelection.go:336] error initially creating leader election record: namespaces "karmada-system" not found
E0521 14:04:58.278909       1 leaderelection.go:336] error initially creating leader election record: namespaces "karmada-system" not found
E0521 14:05:02.319658       1 leaderelection.go:336] error initially creating leader election record: namespaces "karmada-system" not found

it is because ct install --debug --helm-extra-args "--timeout 800s" installed Karmada into a karmada-xxxxxxxxxx namespace, so karmada-system indeed does not exist.

@chaosi-zju
Member

chaosi-zju commented May 21, 2024

@RainbowMango @XiShanYongYe-Chang

I changed the CI step from ct install --debug --helm-extra-args "--timeout 800s" to ct install --namespace "karmada-system" --debug --helm-extra-args "--timeout 800s", and the problem is gone.

Maybe our install logic in helm jobs, such as the pre-install job, strongly depends on karmada-system.
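
For reference, a minimal sketch of what the adjusted chart-testing step could look like in the CI workflow (the step name and surrounding workflow structure are assumptions, not the actual workflow file; only the ct flags come from this thread):

      # hypothetical GitHub Actions step; only the ct flags are taken from the comment above
      - name: Run chart-testing (install)
        run: ct install --namespace "karmada-system" --debug --helm-extra-args "--timeout 800s"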

@calvin0327
Contributor

calvin0327 commented May 23, 2024

As for error logs like:

E0521 14:04:55.290515       1 leaderelection.go:336] error initially creating leader election record: namespaces "karmada-system" not found
E0521 14:04:58.278909       1 leaderelection.go:336] error initially creating leader election record: namespaces "karmada-system" not found
E0521 14:05:02.319658       1 leaderelection.go:336] error initially creating leader election record: namespaces "karmada-system" not found

it is because ct install --debug --helm-extra-args "--timeout 800s" installed Karmada into a karmada-xxxxxxxxxx namespace, so karmada-system indeed does not exist.

The namespace in the scheduler logs is a namespace on the Karmada control plane, not the k8s control plane. The karmada-system namespace should exist there.

@chaosi-zju
Member

chaosi-zju commented May 24, 2024

Hi @calvin0327, over the past several days I've made some new discoveries, and I'd like to discuss them with you.

I found three problems; let me elaborate on them one by one.


Problem 1

As for the error controllermanager.go:821] error starting controllers: [no matches for kind "ResourceBinding" in version "work.karmada.io/v1alpha2", no matches for kind "ClusterResourceBinding" in version "work.karmada.io/v1alpha2"]:

The root cause is that the CRDs have not yet been installed when the controller-manager starts, and the absence of the CRDs causes the controller-manager to crash. However, once the controller-manager crashes, the post-install job will not run, which means the CRDs will never be installed. So it's a deadlock.

This is also why our previous CI, even when it ran successfully, took nearly 15 minutes and was on the verge of timing out. A controller-manager crash is inevitable, but in many cases a short running window before the crash lets the post-install job run, which breaks the deadlock.


Our installation needs to be ordered, like: issue cert -> etcd -> karmada-apiserver -> crd -> others. If it is not ordered, it brings another problem: by the time the installation succeeds, many components have already been restarted many times, which gives users a bad experience.

However, I don't know what the good practice is for implementing such sequencing in helm. All I know of is pre-install hooks or splitting into sub-charts; do you have more information?

I tried using pre-install hooks to achieve such an install sequence, that is, putting etcd/karmada-apiserver/crd into the pre-install stage. With that, this error was gone, the installation completed quickly, and there were no abnormal component restarts. However, this approach may be a little tricky.
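
For illustration, here is a minimal sketch of the hook annotations meant above (the Job shown is hypothetical, not the actual chart content); putting the CRD-installing Job in the pre-install stage makes Helm run it to completion before regular resources such as karmada-controller-manager are applied:

apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-crd-install          # hypothetical job name
  namespace: {{ .Release.Namespace }}
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "3"                   # after lower-weight pre-install hooks such as etcd and karmada-apiserver
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: crd-install
          image: bitnami/kubectl:latest          # assumed image that provides kubectl
          # placeholder command standing in for the chart's existing CRD-apply logic
          command: ["/bin/sh", "-c", "kubectl apply -f /crds --kubeconfig /etc/kubeconfig"]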


Problem 2

As for the namespaces "karmada-system" not found error, such as:

E0521 14:04:55.290515       1 leaderelection.go:336] error initially creating leader election record: namespaces "karmada-system" not found
E0521 14:04:58.278909       1 leaderelection.go:336] error initially creating leader election record: namespaces "karmada-system" not found
E0521 14:05:02.319658       1 leaderelection.go:336] error initially creating leader election record: namespaces "karmada-system" not found

It is because we defined a systemNamespace in values.yaml:

systemNamespace: "karmada-system"

and in the manifests of components like the scheduler, this namespace is used as a launch parameter:

containers:
  - name: {{ $name }}-scheduler
    image: {{ template "karmada.scheduler.image" .}}
    imagePullPolicy: {{ .Values.scheduler.image.pullPolicy }}
    command:
      - /bin/karmada-scheduler
      - --kubeconfig=/etc/kubeconfig
      - --bind-address=0.0.0.0
      - --secure-port=10351
      - --leader-elect-resource-namespace={{ $systemNamespace }}

It doesn't make sense to me: since we have .Release.Namespace, why do we still need systemNamespace in values.yaml? Supposing we install the karmada components in the karmada-mj94126o67 namespace, why should --leader-elect-resource-namespace use the karmada-system namespace?

I think that if we install karmada in the karmada-mj94126o67 namespace, we should use this namespace uniformly instead of mixing in karmada-system. Is there any potential reason for the current behavior?
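
If unifying on the release namespace is the direction we want, a sketch of one option (not the chart's current code) is to default systemNamespace to .Release.Namespace, so the leader-election namespace is always one that exists:

{{- /* sketch: fall back to the release namespace when systemNamespace is not explicitly set */ -}}
{{- $systemNamespace := .Values.systemNamespace | default .Release.Namespace }}
    command:
      - /bin/karmada-scheduler
      - --kubeconfig=/etc/kubeconfig
      - --leader-elect-resource-namespace={{ $systemNamespace }}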


Problem 3

As for the error log E0521 12:57:17.222169 1 cluster_controller.go:206] Error monitoring cluster health: no matches for kind "Cluster" in version "cluster.karmada.io/v1alpha1", which was also encountered in #4942.

I don't know the root cause yet, but it should also have something to do with the installation sequence: since I fixed the installation sequence, this error has not appeared again.

@calvin0327
Contributor

calvin0327 commented May 27, 2024

@chaosi-zju You are quite right.

For problem 1: Yes, we should ensure the installation order. Currently, apart from using hooks, I don't have a better way either. However, we could install the components that need to watch Karmada CRDs, such as the scheduler and controller components, using post-install hooks. What do you think?

However, I previously discovered some drawbacks to this approach, but I don't remember them clearly. I need to research it further.

For problem 2: In the very early days, we did it this way, using Release.Namespace as the system namespace for Karmada. However, many users wanted to use a unified namespace name or a custom namespace. That's why we defined the systemNamespace variable.

For problem 3: Sorry, I'm not clear about it either.

@chaosi-zju
Member

chaosi-zju commented May 28, 2024

As for the installation order, I browsed a lot of material and consulted others.

Maybe the best way is still hooks.

Other feasible but less ideal ways: an init-container, or waiting for dependencies in the main function code.

Sub-charts or chart dependencies are not feasible.

Some references:

@calvin0327
Contributor

[screenshot: excerpt from the Helm documentation about hooks]

The above is an explanation about Helm hooks.

Running the karmada components as hooks doesn't seem very reasonable. The approach you mentioned of using an init-container seems like a more elegant way: we can use the init-container to detect whether the karmada control plane has installed the CRD resources. This approach is suitable not only for the karmada controller manager but also for the scheduler.
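
A rough sketch of such an init container (the container name and image are assumptions; the kubeconfig path mirrors the one the components above already use, and the CRD probed is just an example):

      initContainers:
        - name: wait-for-karmada-crds            # hypothetical name
          image: bitnami/kubectl:latest          # assumed image that provides kubectl
          command:
            - /bin/sh
            - -c
            - |
              # block until the Karmada CRDs are registered in the karmada-apiserver
              until kubectl --kubeconfig /etc/kubeconfig get crd resourcebindings.work.karmada.io; do
                echo "waiting for Karmada CRDs..."
                sleep 5
              done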

@chaosi-zju
Member

The approach you mentioned of using an init-container seems like a more elegant way: we can use the init-container to detect whether the karmada control plane has installed the CRD resources. This approach is suitable not only for the karmada controller manager but also for the scheduler.

It's okay with me. I can submit a PR so we can see the effect first.

@chaosi-zju
Member

@calvin0327 @RainbowMango

I have another reason why hooks are not the best practice. Through my actual testing and by referring to the official documentation (https://helm.sh/docs/topics/charts_hooks/#hooks-and-the-release-lifecycle), I learned the following:

If I define deployment1 and deployment2 in pre-install hooks, with deployment2 having the greater weight, then as soon as deployment1 is applied, even though it is not yet available/ready/running, deployment2 will be applied.

This is not what we expected; we expect deployment2 to be applied only after deployment1 is ready and running.

This is explained in the document linked above:

What does it mean to wait until a hook is ready? This depends on the resource declared in the hook. If the resource is a Job or Pod kind, Helm will wait until it successfully runs to completion. And if the hook fails, the release will fail. This is a blocking operation, so the Helm client will pause while the Job is run.

For all other kinds, as soon as Kubernetes marks the resource as loaded (added or updated), the resource is considered "Ready".

So a pre-install hook only waits for a Job to complete or a Pod to run; for all other kinds, it just waits for the resource to be applied.
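
For example (illustrative manifests, not taken from the karmada chart): Helm applies these two pre-install hooks in weight order, but it only waits for deployment1 to be accepted by the API server, not for its pods to become Ready, before applying deployment2:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment1                              # hypothetical
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "1"
spec:
  selector:
    matchLabels: {app: deployment1}
  template:
    metadata:
      labels: {app: deployment1}
    spec:
      containers:
        - name: app
          image: nginx                           # placeholder image
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment2                              # hypothetical
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "2"                   # applied once deployment1 is applied, not once it is Ready
spec:
  selector:
    matchLabels: {app: deployment2}
  template:
    metadata:
      labels: {app: deployment2}
    spec:
      containers:
        - name: app
          image: nginx                           # placeholder image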

@chaosi-zju chaosi-zju linked a pull request Jun 1, 2024 that will close this issue