
kubectl get pods: lcm ContainerCreating, prometheus, trainer and trainingdata STATUS CrashLoopBackOff #161

Open
Earl-chen opened this issue Jan 17, 2019 · 26 comments

Comments

@Earl-chen

After I installed FfDL following the instructions, I checked the status and got the following output:
screenshot from 2019-01-17 16-56-34

And helm list gives the following:
screenshot from 2019-01-17 17-00-32
Then, for the failing pods, I ran kubectl describe pods <pod name> to view their details. The results are as follows:
screenshot from 2019-01-17 17-05-50
screenshot from 2019-01-17 17-05-02
screenshot from 2019-01-17 17-04-30
screenshot from 2019-01-17 17-06-39

Can anyone give me some advice on what the problem might be? Thanks in advance.

@Tomcli
Contributor

Tomcli commented Jan 17, 2019

Hi @Earl-chen, it looks like the volume configmap is not being created. Can you run the following scripts to generate the necessary configmaps? Thanks.

pushd bin
./create_static_volumes.sh
./create_static_volumes_config.sh
popd
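
To confirm they were created, you can check that both configmaps now exist, for example (run this in the namespace where FfDL is deployed):

kubectl get configmap static-volumes static-volumes-v2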

@Earl-chen
Author

Thank you very much for your help, @Tomcli.
I tried the method you suggested. I also deleted FfDL with `helm delete $(helm list | grep ffdl | awk '{print $1}' | head -n 1)` and then reinstalled it with `helm install .`, but the final result is still a failure. The details are as follows:

Right after I install FfDL, for the first 10 seconds or so, kubectl get pods shows all the pods coming up normally.
screenshot from 2019-01-21 17-33-55

However, after a while, ffdl-trainer-6777dd5756-rjlfl, prometheus-67fb854b59-mxdrn and ffdl-trainingdata-696b99ff5c-2hsc5 often go into CrashLoopBackOff or Error.

screenshot from 2019-01-21 17-34-18

After about 10 minutes, ffdl-lcm-8d555c7bf-r6fj8, ffdl-trainer-6777dd5756-rjlfl, ffdl-trainingdata-696b99ff5c-2hsc5 and prometheus-67fb854b59-mxdrn still have not become Running. At the same time, helm status $(helm list | grep ffdl | awk '{print $1}' | head -n 1) | grep STATUS: shows that the release has changed from DEPLOYED to FAILED.
screenshot from 2019-01-21 18-06-25
screenshot from 2019-01-21 17-00-30
screenshot from 2019-01-21 17-46-58

The details from kubectl describe pods are as follows:

screenshot from 2019-01-21 18-08-17
screenshot from 2019-01-21 18-08-36
screenshot from 2019-01-21 18-08-55
screenshot from 2019-01-21 18-09-54

I can't understand why this is happening.

@Tomcli
Contributor

Tomcli commented Jan 21, 2019

Thank you for taking the time to redeploy FfDL. It looks like many of the pods failed their liveness probes, which means those microservices might not be able to communicate with each other via the KubeDNS server on your cluster. Can you share some logs from your KubeDNS pod in the kube-system namespace?
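
For example, something along these lines should pull the DNS logs (this assumes a standard kube-dns deployment labelled k8s-app=kube-dns; on CoreDNS-based clusters the container name will differ):

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system <kube-dns-pod-name> -c kubedns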

@Earl-chen
Author

@Tomcli Thank you for your prompt reply. Unfortunately, I missed KubeDNS; it is not installed. I will install it now and then report back.

@Eric-Zhang1990

@Tomcli I also have the problem described above.
(screenshot)

I ran kubectl describe pods ffdl-lcm-8d555c7bf-6pg7z --namespace kube-system, and the result is:
(screenshots)

192.168.110.158 is a Kubernetes node. I ran the following commands

pushd bin
./create_static_volumes.sh
./create_static_volumes_config.sh
popd

on both 192.168.110.25 (the Kubernetes master) and 192.168.110.158 (the node).
How can I solve the problem above? Thank you.

@Tomcli
Contributor

Tomcli commented Jan 22, 2019

@Eric-Zhang1990 ./create_static_volumes.sh and ./create_static_volumes_config.sh should be able to create the static-volumes and static-volumes-v2 configmaps for you. Do you still see the configmap "static-volumes" not found error?
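
If you do, it may be worth checking which namespace the configmaps were actually created in, for example:

kubectl get configmap --all-namespaces | grep static-volumes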

@Eric-Zhang1990

Eric-Zhang1990 commented Jan 23, 2019

@Tomcli I get the following output, which shows that static-volumes and static-volumes-v2 are there:
(screenshot)

However, after restarting FfDL, I still hit "SetUp failed for volume "static-volumes-config-volume-v2" : configmap "static-volumes-v2" not found".
(screenshots)

And when I rerun ./create_static_volumes.sh and ./create_static_volumes_config.sh, I get this:
(screenshot)

How can I solve this? Thanks.
Another question: what type should I set for SHARED_VOLUME_STORAGE_CLASS? Thanks.

@Tomcli
Contributor

Tomcli commented Jan 23, 2019

@Eric-Zhang1990 It looks like you deployed the static-volumes in the default namespace while FfDL is in the kube-system namespace. You could deploy FfDL using Helm with the namespace flag (e.g. helm install . --set namespace=default) to deploy FfDL in your default namespace.
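
For example, redeploying into the same namespace as the configmaps could look like this (reusing the delete command from earlier in this thread; adjust if your release name differs):

helm delete $(helm list | grep ffdl | awk '{print $1}' | head -n 1)
helm install . --set namespace=default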

@Tomcli
Contributor

Tomcli commented Jan 23, 2019

The SHARED_VOLUME_STORAGE_CLASS should be the default storageclass of your Kubernetes cluster. You can check the available storageclasses with kubectl get storageclass.
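
If a storageclass exists but is not marked as the default, one way to mark it is the following (a sketch; replace <storageclass-name> with the name shown by the first command):

kubectl get storageclass
kubectl patch storageclass <storageclass-name> -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'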

@Eric-Zhang1990

@Tomcli Could the following state be causing the problem above?
(screenshot)
The status of static-volume-1 is always Pending; is something wrong?

@Eric-Zhang1990

The SHARED_VOLUME_STORAGE_CLASS should be your default storageclass in your Kubernetes cluster. You can check the storageclass with kubectl get storageclass

I ran kubectl get storageclass but got nothing.
(screenshot)

@Eric-Zhang1990

Eric-Zhang1990 commented Jan 23, 2019

It looks like you deployed the static-volumes at the default namespace while FfDL is at kube-system namespace. You could deploy FfDL using helm with the namespace flag (e.g. helm install . --set namespace=default) to deploy FfDL on your default namespace.

Now I have moved the static-volumes to the kube-system namespace, and I also deploy FfDL in the kube-system namespace. The 'ffdl-lcm' pod now runs fine, but the status of the 'ffdl-trainer' and 'ffdl-trainingdata' pods is not stable:
(screenshots)

What could cause this problem? Thank you.

@Eric-Zhang1990

Eric-Zhang1990 commented Jan 23, 2019

@Tomcli After running for one more hour, the status of 'ffdl-trainingdata*' is still changing, sometimes 'Running' and sometimes 'CrashLoopBackOff'.
(screenshot)
I run command "kubectl describe pods ffdl-trainingdata-74f7cdf66c-lkk2p", get following info:
(screenshot)

And I ran "kubectl logs ffdl-trainingdata-74f7cdf66c-lkk2p"; the log output is:

time="2019-01-23T07:06:18Z" level=debug msg="Log level set to 'debug'"
time="2019-01-23T07:06:18Z" level=debug msg="Milli CPU is: 60"
time="2019-01-23T07:06:18Z" level=info msg="GetTrainingDataMemInMB() returns 300"
time="2019-01-23T07:06:18Z" level=debug msg="Training Data Mem in MB is: 300"
time="2019-01-23T07:06:18Z" level=debug msg="No config file 'config-dev.yml' found. Using environment variables only."
{"caller_info":"metrics/main.go:36 main -","level":"debug","module":"training-data-service","msg":"function entry","time":"2019-01-23T07:06:18Z"}
{"caller_info":"metrics/main.go:42 main -","level":"debug","module":"training-data-service","msg":"Port is: 8443","time":"2019-01-23T07:06:18Z"}
{"caller_info":"metrics/main.go:44 main -","level":"debug","module":"training-data-service","msg":"Creating dlaas-training-metrics-service","time":"2019-01-23T07:06:18Z"}
{"caller_info":"service/service_impl.go:147 NewService -","level":"debug","module":"training-data-service","msg":"es address #0: http://elasticsearch:9200","time":"2019-01-23T07:06:18Z"}
{"caller_info":"service/service_impl.go:885 createIndexWithLogsIfDoesNotExist -","level":"debug","module":"training-data-service","msg":"function entry","time":"2019-01-23T07:06:18Z"}
{"caller_info":"service/service_impl.go:887 createIndexWithLogsIfDoesNotExist -","level":"info","module":"training-data-service","msg":"calling IndexExists for dlaas_learner_data","time":"2019-01-23T07:06:18Z"}
{"caller_info":"service/service_impl.go:888 createIndexWithLogsIfDoesNotExist -","error":"Head http://elasticsearch:9200/dlaas_learner_data: dial tcp: lookup elasticsearch on 10.254.0.2:53: read udp 172.17.0.6:53791-\u003e10.254.0.2:53: i/o timeout","level":"error","module":"training-data-service","msg":"IndexExists for dlaas_learner_data failed","time":"2019-01-23T07:06:58Z"}
{"caller_info":"elastic.v5/indices_create.go:31 createIndexWithLogsIfDoesNotExist -","level":"debug","module":"training-data-service","msg":"calling CreateIndex","time":"2019-01-23T07:06:58Z"}
{"caller_info":"service/service_impl.go:907 createIndexWithLogsIfDoesNotExist -","error":"no available connection: no Elasticsearch node available","level":"debug","module":"training-data-service","msg":"CreateIndex failed","time":"2019-01-23T07:06:58Z"}
panic: no available connection: no Elasticsearch node available

goroutine 1 [running]:
github.com/IBM/FfDL/metrics/service.NewService(0xc420479f68, 0xe23640)
/Users/tommyli/go/src/github.com/IBM/FfDL/metrics/service/service_impl.go:167 +0x980
main.main()
/Users/tommyli/go/src/github.com/IBM/FfDL/metrics/main.go:44 +0x16c

Is the problem the "no available connection: no Elasticsearch node available" error?
Thanks.

@Tomcli
Contributor

Tomcli commented Jan 23, 2019

Thank you for taking the time to debug this. Elasticsearch should be part of the storage-0 container. It could be that the Elasticsearch service wasn't properly enabled. Can you run kubectl get svc to check whether elasticsearch is deployed? Also, you might want to run kubectl logs storage-0 to check whether there are any errors related to Elasticsearch.
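
For instance (these names assume the default FfDL quickstart deployment from this thread):

kubectl get svc | grep elasticsearch
kubectl logs storage-0 | grep -i elastic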

Thanks.

@Eric-Zhang1990

@Tomcli I checked that elasticsearch is deployed, and the logs of storage-0 show "Failed to find a usable hardware address from the network interfaces; using random bytes: 64:4b:61:9d:da:79:4a:d3". What could cause this?
Thanks.
(screenshots)

@Eric-Zhang1990

Eric-Zhang1990 commented Jan 25, 2019

@Tomcli Today I ran FfDL again and all the components are running, but they all show some number of RESTARTS. Is that all right? Can I use it for training? Thank you.
(screenshot)

@Tomcli
Contributor

Tomcli commented Jan 25, 2019

Hi @Eric-Zhang1990, sorry for the late reply. Regarding the Elasticsearch error, you are supposed to see the following logs at the end of the storage-0 container output.

[2019-01-24T01:17:28,500][WARN ][o.e.b.BootstrapChecks    ] [2cdcQJ-] max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
127.0.0.1 - - [24/Jan/2019 01:17:30] "GET / HTTP/1.1" 200 -
2019-01-24T01:17:30:WARNING:infra.pyc: Service "elasticsearch" not yet available, retrying...
[2019-01-24T01:17:31,568][INFO ][o.e.c.s.ClusterService   ] [2cdcQJ-] new_master {2cdcQJ-}{2cdcQJ-PT-OgOS1lVhqU_g}{xT1sK8mWRuiaU5zsT5R0pw}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-elected-as-master ([0] nodes joined)
[2019-01-24T01:17:31,605][INFO ][o.e.h.n.Netty4HttpServerTransport] [2cdcQJ-] publish_address {127.0.0.1:4560}, bound_addresses {[::1]:4560}, {127.0.0.1:4560}
[2019-01-24T01:17:31,613][INFO ][o.e.n.Node               ] [2cdcQJ-] started
[2019-01-24T01:17:31,635][INFO ][o.e.g.GatewayService     ] [2cdcQJ-] recovered [0] indices into cluster_state
127.0.0.1 - - [24/Jan/2019 01:17:33] "GET / HTTP/1.1" 200 -
Ready.
[2019-01-24T01:17:53,424][INFO ][o.e.c.m.MetaDataCreateIndexService] [2cdcQJ-] [dlaas_learner_data] creating index, cause [api], templates [], shards [5]/[1], mappings []
[2019-01-24T01:17:53,996][INFO ][o.e.c.m.MetaDataMappingService] [2cdcQJ-] [dlaas_learner_data/uZblTWoeQBurTMiFYUU9Ng] create_mapping [logline]
[2019-01-24T01:17:54,039][INFO ][o.e.c.m.MetaDataMappingService] [2cdcQJ-] [dlaas_learner_data/uZblTWoeQBurTMiFYUU9Ng] create_mapping [emetrics]

The above logs indicate that the Elasticsearch index has been created; the ffdl-trainingdata service pod should become functional after that.
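
A quick way to confirm this is to look for the index-creation lines in the storage-0 log, for example:

kubectl logs storage-0 | grep dlaas_learner_data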

Since I see all your pods are running today, you can go ahead and start using it for training. I can follow up if you run into any further questions. Thank you.

@Eric-Zhang1990

@Tomcli Thank you for your patient reply. I checked the log of the storage-0 container, and it shows the same info as yours.
(screenshot)
However, the status of these pods is still not stable, like this:
(screenshots)
I described the prometheus pod and found that although it is running, it reports "Readiness probe failed: ". Does this error affect the other pods?
(screenshot)

One more thing: I run FfDL on 2 servers on a local area network. Does the network affect the deployment of FfDL?
Thank you.

@Tomcli
Contributor

Tomcli commented Jan 30, 2019

Hi @Eric-Zhang1990, it looks like some internal connections are either refused or timed out. If your local area network has low bandwidth, I recommend deploying FfDL without any monitoring service to reduce the network throughput, e.g.

helm install . --set prometheus.deploy=false

@Eric-Zhang1990

Eric-Zhang1990 commented Jan 31, 2019

@Tomcli I ran 'helm install . --set prometheus.deploy=false' and found that ffdl-trainer still alternates between CrashLoopBackOff and Running, and it always shows "Back-off restarting failed container".
(screenshot)
I run "kubectl describe po ffdl-trainer-7b44999975-d2b7g" and get this:
(screenshots)
I deleted the ffdl-trainer pod and it ran correctly for a while.
(screenshot)
I see that ffdl-lcm is running:
(screenshot)
but I run "kubectl describe po ffdl-lcm-7f69876c98-lrqjj" and get this:
(screenshot)

@Eric-Zhang1990

@Tomcli Thanks, it does seem to be an internal connection issue. Everything runs correctly on one server, but on two servers the status is unstable.

@Eric-Zhang1990

@Tomcli Sorry to bother you. I have the same problem after deploying FfDL on two other servers (192.168.110.158 and 192.168.110.76 as nodes, 192.168.110.25 as master).
screenshot from 2019-02-19 15-51-59
And the log of ffdl-trainer is:
screenshot from 2019-02-19 15-52-22

Is it also the internal connection issue between pods on different servers? I don't know where the problem is. Thanks.

@Tomcli
Contributor

Tomcli commented Feb 19, 2019

Hi @Eric-Zhang1990, it looks like some of the services are not reachable between two of your worker nodes. Also, the earlier errors that failed the liveness probes indicate that the gRPC calls between microservices on different nodes are not getting through.

Since FfDL uses KubeDNS to discover and communicate between the microservices, it could be that your KubeDNS wasn't set up correctly. Another possibility is that something is blocking inter-node communication (e.g. a firewall setting, VLAN, etc.).
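
One way to narrow this down is to test DNS resolution from a throwaway pod (a sketch; busybox:1.28 is used because its nslookup behaves reliably, and -n kube-system assumes FfDL is deployed there, so adjust the namespace and service name as needed):

kubectl -n kube-system run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup elasticsearch
kubectl -n kube-system run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default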

@Eric-Zhang1990

@Tomcli Thank you for your kind reply. I also think the issue is a communication problem. After many tries, I tore down the Kubernetes cluster and redeployed it with kubeadm, and now it runs correctly.

@ZepengW

ZepengW commented Jan 18, 2020

@Tomcli Hello, I have a similar but not identical problem when I deploy FfDL.
Three pods are in CrashLoopBackOff, and static-volume-1 is Pending for the following reason:
(screenshots)

And after I clean up FfDL and rebuild (make deploy-plugin), it shows:
Error from server (AlreadyExists): configmaps "static-volumes-v2" already exists

@Tomcli
Contributor

Tomcli commented Jan 21, 2020

You can check the list of storageclasses on your cluster by running kubectl get storageclass.
Then you can run export SHARED_VOLUME_STORAGE_CLASS="<storageclass>" to use your desired storageclass as FfDL's persistent storage. If you don't have any storageclass, you will need to run export SHARED_VOLUME_STORAGE_CLASS="" and create a static PV using a host path, e.g.

kubectl create -f - <<EOF
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv-volume
spec:
  storageClassName:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt/pv"
EOF

Once you have completed the above steps, you can continue with make deploy-plugin and make quickstart-deploy.
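
After creating the PV, you can verify that the pending static-volume-1 claim binds to it before redeploying, for example:

kubectl get pv
kubectl get pvc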
