
Kaniko build hangs waiting for a long-running build that has already finished and pushed the image #8658

Open
mikedld opened this issue Apr 8, 2023 · 3 comments · May be fixed by #9373
Labels: build/kaniko, kind/bug (Something isn't working), priority/p2 (May take a couple of releases)

Comments


mikedld commented Apr 8, 2023

Expected behavior

Cluster (kaniko) build succeeds regardless of how long it takes.

Actual behavior

If the build takes considerable time (in my case, more than 35-55 minutes), Skaffold hangs even after the image has been built and pushed.

Information

  • Skaffold version: 2.3.0, 2.1.0, 1.39.7
  • Operating system: alpine:3.17 (amd64)
  • Installed via: downloaded from storage.googleapis.com as indicated in GH release description
  • Contents of skaffold.yaml:
apiVersion: skaffold/v2beta20
kind: Config
build:
  cluster:
    namespace: foobar-ns
    serviceAccount: foobar-sa
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits: {}
    concurrency: 1
    timeout: 40m
  artifacts:
    - image: foobar
      context: foobar-dir
      kaniko:
        logFormat: text
        logTimestamp: true
        verbosity: debug
  tagPolicy:
    sha256: {}
  • Kubernetes version: 1.23 (AWS EKS)
  • Kaniko version: 1.9.2, 1.8.1, 1.7.0
  • Contents of foobar-dir/Dockerfile:
FROM alpine:3.17
RUN sleep $(( 38 * 60 ))

Steps to reproduce the behavior

  1. Run the build in Jenkins; the K8s plugin spins up the pod (definition in the attached log) with a container that has Skaffold preinstalled
  2. skaffold --interactive=false --verbosity=debug build --default-repo=000000000000.dkr.ecr.eu-west-1.amazonaws.com

While troubleshooting, I cloned the repo, added some logging (patch attached), and built it myself, so the version in the log corresponds to current main rather than the 2.3.0 release. The log shows that at some point (before the image is built and before the cluster timeout is reached) the pods watcher starts reporting events with an empty type and a nil object in a tight loop; no pod termination event is ever reported.
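
The empty type and nil object are consistent with reading from a watch channel that the API server has already closed (the server typically ends watch streams after a configurable timeout, often in the 30-60 minute range, which lines up with the 35-55 minutes observed here). Receiving from a closed Go channel returns the zero value without blocking, which produces exactly this kind of tight loop. A minimal standalone sketch, not Skaffold code:

// Minimal sketch (not Skaffold code): receiving from a closed channel in Go
// returns the zero value without blocking, so a loop that keeps reading a
// closed watch.Event channel sees Type == "" and Object == nil forever.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/watch"
)

func main() {
	events := make(chan watch.Event)
	close(events) // simulates the API server ending the watch stream

	for i := 0; i < 3; i++ {
		ev, ok := <-events
		// ok is false and ev is the zero value: empty type, nil object,
		// matching the tight loop of empty events seen in the log.
		fmt.Printf("ok=%v type=%q object=%v\n", ok, ev.Type, ev.Object)
	}
}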

For this report I adjusted the cluster timeout so that it expires 2 minutes after the sleep in the Dockerfile ends, to keep the log small; otherwise Jenkins kills the build. Increasing the cluster timeout doesn't help: I once waited for 4 hours and nothing happened, Skaffold was still waiting after the "Pushed <image>" message.

Increasing the cluster resource requests (for the Kaniko pod) doesn't help either; I tried 3000m CPU and 8Gi memory. The EKS nodes are r5.xlarge and were idling during the test.

Files:


mikedld commented Apr 9, 2023

The patch below seems to fix it for me. I'm not sure how good it is (my first time dealing with Go), but I can open a PR. Also note that the same issue affects WaitForDeploymentToStabilize (and probably other places where Watch is used), but I can't test that, so I didn't patch it.

diff --git a/pkg/skaffold/kubernetes/wait.go b/pkg/skaffold/kubernetes/wait.go
index f1d517f6a..11d5e8f09 100644
--- a/pkg/skaffold/kubernetes/wait.go
+++ b/pkg/skaffold/kubernetes/wait.go
@@ -32,6 +32,8 @@ import (
 	"k8s.io/apimachinery/pkg/watch"
 	"k8s.io/client-go/kubernetes"
 	corev1 "k8s.io/client-go/kubernetes/typed/core/v1"
+	"k8s.io/client-go/tools/cache"
+	watchtools "k8s.io/client-go/tools/watch"
 
 	"github.com/GoogleContainerTools/skaffold/v2/pkg/skaffold/output/log"
 )
@@ -61,7 +63,7 @@ func watchUntilTimeout(ctx context.Context, timeout time.Duration, w watch.Inter
 func WaitForPodSucceeded(ctx context.Context, pods corev1.PodInterface, podName string, timeout time.Duration) error {
 	log.Entry(ctx).Infof("Waiting for %s to be complete", podName)
 
-	w, err := pods.Watch(ctx, metav1.ListOptions{})
+	w, err := newPodsWatcher(ctx, pods)
 	if err != nil {
 		return fmt.Errorf("initializing pod watcher: %s", err)
 	}
@@ -101,7 +103,7 @@ func isPodSucceeded(podName string) func(event *watch.Event) (bool, error) {
 func WaitForPodInitialized(ctx context.Context, pods corev1.PodInterface, podName string) error {
 	log.Entry(ctx).Infof("Waiting for %s to be initialized", podName)
 
-	w, err := pods.Watch(ctx, metav1.ListOptions{})
+	w, err := newPodsWatcher(ctx, pods)
 	if err != nil {
 		return fmt.Errorf("initializing pod watcher: %s", err)
 	}
@@ -154,3 +156,16 @@ func WaitForDeploymentToStabilize(ctx context.Context, c kubernetes.Interface, n
 	return false, nil
 	})
 }
+
+func newPodsWatcher(ctx context.Context, pods corev1.PodInterface) (watch.Interface, error) {
+	initList, err := pods.List(ctx, metav1.ListOptions{})
+	if err != nil {
+		return nil, err
+	}
+
+	return watchtools.NewRetryWatcher(initList.GetResourceVersion(), &cache.ListWatch{
+		WatchFunc: func(listOptions metav1.ListOptions) (watch.Interface, error) {
+			return pods.Watch(ctx, listOptions)
+		},
+	})
+}
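
For completeness, a complementary guard on the consumer side would also avoid the tight loop by treating a closed ResultChan as an error instead of re-reading it. The function below is a hypothetical, illustrative sketch (it is not Skaffold's actual watchUntilTimeout and is not part of the patch above); the name waitForCondition is made up for this example:

// Hypothetical consumer-side guard (illustrative only, not Skaffold's actual
// watchUntilTimeout): treat a closed ResultChan as an error instead of
// spinning on zero-value events.
package wait

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/watch"
)

func waitForCondition(ctx context.Context, w watch.Interface, done func(*watch.Event) (bool, error)) error {
	defer w.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case ev, ok := <-w.ResultChan():
			if !ok {
				// The server closed the stream; without a RetryWatcher this is
				// where the original code kept looping on empty events.
				return fmt.Errorf("watch channel closed before condition was met")
			}
			if finished, err := done(&ev); err != nil || finished {
				return err
			}
		}
	}
}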

@renzodavid9 renzodavid9 added kind/bug Something isn't working priority/p2 May take a couple of releases build/kaniko labels Apr 17, 2023

JRuedas commented Jan 30, 2024

Hi, I'm experiencing the exact same problem. In my case the build takes around 40 minutes, and even though Kaniko's pod finishes successfully, Skaffold is not aware of it and hangs forever.

@mikedld Were you able to solve the problem in another way?


mikedld commented Jan 30, 2024

@JRuedas, my pipeline builds Skaffold with this patch applied instead of installing a prebuilt binary. That adds about 3 minutes, which is negligible compared to the actual image builds (about 3-5 hours). So no, I haven't found another way and I'm still waiting for this to be fixed upstream.

@mikedld mikedld linked a pull request Apr 1, 2024 that will close this issue