
kubectl get pods: lcm ContainerCreating, prometheus, trainer and trainingdata STATUS CrashLoopBackOff #161

Open
Earl-chen opened this issue Jan 17, 2019 · 26 comments

Comments

@Earl-chen

After I installed FfDL following the instructions, I checked the status and got the following output:
screenshot from 2019-01-17 16-56-34

And helm list gives the following:
screenshot from 2019-01-17 17-00-32
Then, for the failing pods, I ran kubectl describe pods <pod name> to view their details. The results are as follows:
screenshot from 2019-01-17 17-05-50
screenshot from 2019-01-17 17-05-02
screenshot from 2019-01-17 17-04-30
screenshot from 2019-01-17 17-06-39

Can anyone give me some advice on what the problem might be? Thanks in advance.

@Tomcli
Contributor

Tomcli commented Jan 17, 2019

Hi @Earl-chen, it looks like the volume configmap is not being created. Can you run the following scripts to generate the necessary configmaps? Thanks.

pushd bin
./create_static_volumes.sh
./create_static_volumes_config.sh
popd
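
To confirm they were created, you can check that both configmaps now exist, for example (run this in the namespace where FfDL is deployed):

kubectl get configmap static-volumes static-volumes-v2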

@Earl-chen
Author

Thank you very much for your help, @Tomcli.
I tried the method you suggested. I also deleted FfDL with `helm delete $(helm list | grep ffdl | awk '{print $1}' | head -n 1)` and then reinstalled it with `helm install .`, but the final result is still a failure. The details are as follows:

Right after I install FfDL, for the first 10 seconds or so, kubectl get pods shows all the pods coming up normally.
screenshot from 2019-01-21 17-33-55

However, after a while, ffdl-trainer-6777dd5756-rjlfl, prometheus-67fb854b59-mxdrn and ffdl-trainingdata-696b99ff5c-2hsc5 often go into CrashLoopBackOff or Error.

screenshot from 2019-01-21 17-34-18

After about 10 minutes, ffdl-lcm-8d555c7bf-r6fj8, ffdl-trainer-6777dd5756-rjlfl, ffdl-trainingdata-696b99ff5c-2hsc5 and prometheus-67fb854b59-mxdrn still have not become Running. At the same time, helm status $(helm list | grep ffdl | awk '{print $1}' | head -n 1) | grep STATUS: shows that the release has changed from DEPLOYED to FAILED.
screenshot from 2019-01-21 18-06-25
screenshot from 2019-01-21 17-00-30
screenshot from 2019-01-21 17-46-58

The details from kubectl describe pods are as follows:

screenshot from 2019-01-21 18-08-17
screenshot from 2019-01-21 18-08-36
screenshot from 2019-01-21 18-08-55
screenshot from 2019-01-21 18-09-54

I can't understand why this is happening.

@Tomcli
Contributor

Tomcli commented Jan 21, 2019

Thank you for taking the time to redeploy FfDL. It looks like many of the pods failed their liveness probes, which means those microservices might not be able to communicate with each other via the KubeDNS server on your cluster. Can you share some logs from your KubeDNS pod in the kube-system namespace?
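
For example, something along these lines should pull the DNS logs (this assumes a standard kube-dns deployment labelled k8s-app=kube-dns; on CoreDNS-based clusters the container name will differ):

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system <kube-dns-pod-name> -c kubedns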

@Earl-chen
Author

@Tomcli Thank you for your prompt reply. Unfortunately, I missed KubeDNS; it is not installed. I will install it now and then report back.

@Eric-Zhang1990

@Tomcli I also have the problem described above.
(screenshot)

I ran kubectl describe pods ffdl-lcm-8d555c7bf-6pg7z --namespace kube-system, and the result is:
(screenshots)

192.168.110.158 is a Kubernetes node. I ran the following commands

pushd bin
./create_static_volumes.sh
./create_static_volumes_config.sh
popd

on both 192.168.110.25 (the Kubernetes master) and 192.168.110.158 (the node).
How can I solve the problem above? Thank you.

@Tomcli
Contributor

Tomcli commented Jan 22, 2019

@Eric-Zhang1990 ./create_static_volumes.sh and ./create_static_volumes_config.sh should be able to create the static-volumes and static-volumes-v2 configmaps for you. Do you still see the configmap "static-volumes" not found error?
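
If you do, it may be worth checking which namespace the configmaps were actually created in, for example:

kubectl get configmap --all-namespaces | grep static-volumes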

@Eric-Zhang1990

Eric-Zhang1990 commented Jan 23, 2019

@Tomcli I get the following output, which shows that static-volumes and static-volumes-v2 are there:
(screenshot)

However, after restarting FfDL, I still hit "SetUp failed for volume "static-volumes-config-volume-v2" : configmap "static-volumes-v2" not found".
(screenshots)

And when I rerun ./create_static_volumes.sh and ./create_static_volumes_config.sh, I get this:
(screenshot)

How can I solve this? Thanks.
Another question: what type should I set for SHARED_VOLUME_STORAGE_CLASS? Thanks.

@Tomcli
Contributor

Tomcli commented Jan 23, 2019

@Eric-Zhang1990 It looks like you deployed the static-volumes in the default namespace while FfDL is in the kube-system namespace. You could deploy FfDL using Helm with the namespace flag (e.g. helm install . --set namespace=default) to deploy FfDL in your default namespace.
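
For example, redeploying into the same namespace as the configmaps could look like this (reusing the delete command from earlier in this thread; adjust if your release name differs):

helm delete $(helm list | grep ffdl | awk '{print $1}' | head -n 1)
helm install . --set namespace=default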

@Tomcli
Contributor

Tomcli commented Jan 23, 2019

The SHARED_VOLUME_STORAGE_CLASS should be the default storageclass of your Kubernetes cluster. You can check the available storageclasses with kubectl get storageclass.
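
If a storageclass exists but is not marked as the default, one way to mark it is the following (a sketch; replace <storageclass-name> with the name shown by the first command):

kubectl get storageclass
kubectl patch storageclass <storageclass-name> -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'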

@Eric-Zhang1990

@Tomcli Could the following state be causing the problem above?
(screenshot)
The status of static-volume-1 is always Pending; is something wrong?

@Eric-Zhang1990

The SHARED_VOLUME_STORAGE_CLASS should be your default storageclass in your Kubernetes cluster. You can check the storageclass with kubectl get storageclass

I ran kubectl get storageclass but got nothing.
(screenshot)

@Eric-Zhang1990

Eric-Zhang1990 commented Jan 23, 2019

It looks like you deployed the static-volumes at the default namespace while FfDL is at kube-system namespace. You could deploy FfDL using helm with the namespace flag (e.g. helm install . --set namespace=default) to deploy FfDL on your default namespace.

Now I have moved the static-volumes to the kube-system namespace, and I also deploy FfDL in the kube-system namespace. The 'ffdl-lcm' pod now runs fine, but the status of the 'ffdl-trainer' and 'ffdl-trainingdata' pods is not stable:
(screenshots)

What could cause this problem? Thank you.

@Eric-Zhang1990

Eric-Zhang1990 commented Jan 23, 2019

@Tomcli After running for one more hour, the status of 'ffdl-trainingdata*' is still changing, sometimes 'Running' and sometimes 'CrashLoopBackOff'.
(screenshot)
I run command "kubectl describe pods ffdl-trainingdata-74f7cdf66c-lkk2p", get following info:
(screenshot)

And I ran "kubectl logs ffdl-trainingdata-74f7cdf66c-lkk2p"; the log output is:

time="2019-01-23T07:06:18Z" level=debug msg="Log level set to 'debug'"
time="2019-01-23T07:06:18Z" level=debug msg="Milli CPU is: 60"
time="2019-01-23T07:06:18Z" level=info msg="GetTrainingDataMemInMB() returns 300"
time="2019-01-23T07:06:18Z" level=debug msg="Training Data Mem in MB is: 300"
time="2019-01-23T07:06:18Z" level=debug msg="No config file 'config-dev.yml' found. Using environment variables only."
{"caller_info":"metrics/main.go:36 main -","level":"debug","module":"training-data-service","msg":"function entry","time":"2019-01-23T07:06:18Z"}
{"caller_info":"metrics/main.go:42 main -","level":"debug","module":"training-data-service","msg":"Port is: 8443","time":"2019-01-23T07:06:18Z"}
{"caller_info":"metrics/main.go:44 main -","level":"debug","module":"training-data-service","msg":"Creating dlaas-training-metrics-service","time":"2019-01-23T07:06:18Z"}
{"caller_info":"service/service_impl.go:147 NewService -","level":"debug","module":"training-data-service","msg":"es address #0: http://elasticsearch:9200","time":"2019-01-23T07:06:18Z"}
{"caller_info":"service/service_impl.go:885 createIndexWithLogsIfDoesNotExist -","level":"debug","module":"training-data-service","msg":"function entry","time":"2019-01-23T07:06:18Z"}
{"caller_info":"service/service_impl.go:887 createIndexWithLogsIfDoesNotExist -","level":"info","module":"training-data-service","msg":"calling IndexExists for dlaas_learner_data","time":"2019-01-23T07:06:18Z"}
{"caller_info":"service/service_impl.go:888 createIndexWithLogsIfDoesNotExist -","error":"Head http://elasticsearch:9200/dlaas_learner_data: dial tcp: lookup elasticsearch on 10.254.0.2:53: read udp 172.17.0.6:53791-\u003e10.254.0.2:53: i/o timeout","level":"error","module":"training-data-service","msg":"IndexExists for dlaas_learner_data failed","time":"2019-01-23T07:06:58Z"}
{"caller_info":"elastic.v5/indices_create.go:31 createIndexWithLogsIfDoesNotExist -","level":"debug","module":"training-data-service","msg":"calling CreateIndex","time":"2019-01-23T07:06:58Z"}
{"caller_info":"service/service_impl.go:907 createIndexWithLogsIfDoesNotExist -","error":"no available connection: no Elasticsearch node available","level":"debug","module":"training-data-service","msg":"CreateIndex failed","time":"2019-01-23T07:06:58Z"}
panic: no available connection: no Elasticsearch node available

goroutine 1 [running]:
github.com/IBM/FfDL/metrics/service.NewService(0xc420479f68, 0xe23640)
/Users/tommyli/go/src/github.com/IBM/FfDL/metrics/service/service_impl.go:167 +0x980
main.main()
/Users/tommyli/go/src/github.com/IBM/FfDL/metrics/main.go:44 +0x16c

Is the problem the "no available connection: no Elasticsearch node available" error?
Thanks.

@Tomcli
Contributor

Tomcli commented Jan 23, 2019

Thank you for taking the time to debug this. Elasticsearch should be part of the storage-0 container. It could be that the Elasticsearch service wasn't properly enabled. Can you run kubectl get svc to check whether elasticsearch is deployed? Also, you might want to run kubectl logs storage-0 to check whether there are any errors related to Elasticsearch.
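
For instance (these names assume the default FfDL quickstart deployment from this thread):

kubectl get svc | grep elasticsearch
kubectl logs storage-0 | grep -i elastic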

Thanks.

@Eric-Zhang1990

@Tomcli I checked that elasticsearch is deployed, and the logs of storage-0 show "Failed to find a usable hardware address from the network interfaces; using random bytes: 64:4b:61:9d:da:79:4a:d3". What could cause this?
Thanks.
(screenshots)

@Eric-Zhang1990

Eric-Zhang1990 commented Jan 25, 2019

@Tomcli Today I ran FfDL again and all the components are running, but they all show some number of RESTARTS. Is that all right? Can I use it for training? Thank you.
(screenshot)

@Tomcli
Contributor

Tomcli commented Jan 25, 2019

Hi @Eric-Zhang1990, sorry for the late reply. Regarding the Elasticsearch error, you are supposed to see the following logs at the end of the storage-0 container output.

[2019-01-24T01:17:28,500][WARN ][o.e.b.BootstrapChecks    ] [2cdcQJ-] max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
127.0.0.1 - - [24/Jan/2019 01:17:30] "GET / HTTP/1.1" 200 -
2019-01-24T01:17:30:WARNING:infra.pyc: Service "elasticsearch" not yet available, retrying...
[2019-01-24T01:17:31,568][INFO ][o.e.c.s.ClusterService   ] [2cdcQJ-] new_master {2cdcQJ-}{2cdcQJ-PT-OgOS1lVhqU_g}{xT1sK8mWRuiaU5zsT5R0pw}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-elected-as-master ([0] nodes joined)
[2019-01-24T01:17:31,605][INFO ][o.e.h.n.Netty4HttpServerTransport] [2cdcQJ-] publish_address {127.0.0.1:4560}, bound_addresses {[::1]:4560}, {127.0.0.1:4560}
[2019-01-24T01:17:31,613][INFO ][o.e.n.Node               ] [2cdcQJ-] started
[2019-01-24T01:17:31,635][INFO ][o.e.g.GatewayService     ] [2cdcQJ-] recovered [0] indices into cluster_state
127.0.0.1 - - [24/Jan/2019 01:17:33] "GET / HTTP/1.1" 200 -
Ready.
[2019-01-24T01:17:53,424][INFO ][o.e.c.m.MetaDataCreateIndexService] [2cdcQJ-] [dlaas_learner_data] creating index, cause [api], templates [], shards [5]/[1], mappings []
[2019-01-24T01:17:53,996][INFO ][o.e.c.m.MetaDataMappingService] [2cdcQJ-] [dlaas_learner_data/uZblTWoeQBurTMiFYUU9Ng] create_mapping [logline]
[2019-01-24T01:17:54,039][INFO ][o.e.c.m.MetaDataMappingService] [2cdcQJ-] [dlaas_learner_data/uZblTWoeQBurTMiFYUU9Ng] create_mapping [emetrics]

The above logs indicate that the Elasticsearch index has been created; the ffdl-trainingdata service pod should become functional after that.
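
A quick way to confirm this is to look for the index-creation lines in the storage-0 log, for example:

kubectl logs storage-0 | grep dlaas_learner_data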

Since I see all your pods are running today, you can go ahead and start using it for training. I can follow up if you run into any further questions. Thank you.

@Eric-Zhang1990

@Tomcli Thank you for your patient reply. I checked the log of the storage-0 container, and it shows the same info as yours.
(screenshot)
However, the status of these pods is still not stable, like this:
(screenshots)
I described the prometheus pod and found that although it is running, it reports "Readiness probe failed: ". Does this error affect the other pods?
(screenshot)

One more thing: I run FfDL on 2 servers on a local area network. Does the network affect the deployment of FfDL?
Thank you.

@Tomcli
Contributor

Tomcli commented Jan 30, 2019

Hi @Eric-Zhang1990, it looks like some internal connections are either refused or timed out. If your local area network has low bandwidth, I recommend deploying FfDL without any monitoring service to reduce the network throughput, e.g.

helm install . --set prometheus.deploy=false

@Eric-Zhang1990

Eric-Zhang1990 commented Jan 31, 2019

@Tomcli I ran 'helm install . --set prometheus.deploy=false' and found that ffdl-trainer still alternates between CrashLoopBackOff and Running, and it always shows "Back-off restarting failed container".
(screenshot)
I run "kubectl describe po ffdl-trainer-7b44999975-d2b7g" and get this:
(screenshots)
I deleted the ffdl-trainer pod and it ran correctly for a while.
(screenshot)
I see that ffdl-lcm is running:
(screenshot)
but I run "kubectl describe po ffdl-lcm-7f69876c98-lrqjj" and get this:
(screenshot)

@Eric-Zhang1990

@Tomcli Thanks, it does seem to be an internal connection issue. Everything runs correctly on one server, but on two servers the status is unstable.

@Eric-Zhang1990

@Tomcli Sorry to bother you. I have the same problem after deploying FfDL on two other servers (192.168.110.158 and 192.168.110.76 as nodes, 192.168.110.25 as master).
screenshot from 2019-02-19 15-51-59
And the log of ffdl-trainer is:
screenshot from 2019-02-19 15-52-22

Is it also the internal connection issue between pods on different servers? I don't know where the problem is. Thanks.

@Tomcli
Contributor

Tomcli commented Feb 19, 2019

Hi @Eric-Zhang1990, it looks like some of the services are not reachable between two of your worker nodes. Also, the earlier errors that failed the liveness probes indicate that the gRPC calls between microservices on different nodes are not getting through.

Since FfDL uses KubeDNS to discover and communicate between the microservices, it could be that your KubeDNS wasn't set up correctly. Another possibility is that something is blocking inter-node communication (e.g. a firewall setting, VLAN, etc.).
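
One way to narrow this down is to test DNS resolution from a throwaway pod (a sketch; busybox:1.28 is used because its nslookup behaves reliably, and -n kube-system assumes FfDL is deployed there, so adjust the namespace and service name as needed):

kubectl -n kube-system run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup elasticsearch
kubectl -n kube-system run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default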

@Eric-Zhang1990

@Tomcli Thank you for your kind reply. I also think the issue is a communication problem. After many tries, I tore down the Kubernetes cluster and redeployed it with kubeadm, and now it runs correctly.

@ZepengW

ZepengW commented Jan 18, 2020

@Tomcli Hello, I have a similar but not identical problem when I deploy FfDL.
Three pods are in CrashLoopBackOff, and static-volume-1 is Pending for the following reason:
(screenshots)

And after I clean up FfDL and rebuild (make deploy-plugin), it shows:
Error from server (AlreadyExists): configmaps "static-volumes-v2" already exists

@Tomcli
Contributor

Tomcli commented Jan 21, 2020

You can check the list of storageclasses on your cluster by running kubectl get storageclass.
Then you can run export SHARED_VOLUME_STORAGE_CLASS="<storageclass>" to use your desired storageclass as FfDL's persistent storage. If you don't have any storageclass, you will need to run export SHARED_VOLUME_STORAGE_CLASS="" and create a static PV using a host path, e.g.

kubectl create -f - <<EOF
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv-volume
spec:
  storageClassName:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt/pv"
EOF

Once you have completed the above steps, you can continue with make deploy-plugin and make quickstart-deploy.
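
After creating the PV, you can verify that the pending static-volume-1 claim binds to it before redeploying, for example:

kubectl get pv
kubectl get pvc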
