
Crash spark-worker-controller pod #20

Open
ghost opened this issue Jun 23, 2016 · 6 comments

@ghost

ghost commented Jun 23, 2016

Hello,
I have installed Seldon on my local machine and I am now trying to run the Reuters Newswire Recommendation example, but I am having problems with the spark-worker-controller and reuters-import-data pods.
The problems start after running the kubectl create -f import-data-job.json command.
PS: I use a proxy to connect to the internet and I have added environment variables for http_proxy and https_proxy.

Can you help me please? Thank you in advance.
Here are the logs of my pods:
for spark-worker-controller:
sed: -e expression #1, char 51: unknown option to `s'
=== Cannot resolve the DNS entry for spark-master. Has the service been created yet, and is SkyDNS functional?
=== See http://kubernetes.io/v1.1/docs/admin/dns.html for more details on DNS integration.
=== Sleeping 10s before pod exit.

for the reuters-import-data pod, which gets stuck in ContainerCreating:
WARNING:kazoo.client:Connection dropped: socket connection error: Name or service not known
Traceback (most recent call last):
  File "/opt/conda/bin/seldon-cli", line 4, in <module>
    __import__('pkg_resources').run_script('seldon==2.0.0', 'seldon-cli')
  File "/opt/conda/lib/python2.7/site-packages/setuptools-18.5-py2.7.egg/pkg_resources/__init__.py", line 742, in run_script
  File "/opt/conda/lib/python2.7/site-packages/setuptools-18.5-py2.7.egg/pkg_resources/__init__.py", line 1667, in run_script
  File "/opt/conda/lib/python2.7/site-packages/seldon-2.0.0-py2.7.egg/EGG-INFO/scripts/seldon-cli", line 5, in <module>
    seldon.cli.start_seldoncli()
  File "/opt/conda/lib/python2.7/site-packages/seldon-2.0.0-py2.7.egg/seldon/cli/__init__.py", line 3, in start_seldoncli
    cli_main.main()
  File "/opt/conda/lib/python2.7/site-packages/seldon-2.0.0-py2.7.egg/seldon/cli/cli_main.py", line 346, in main
    start_zk_client(opts)
  File "/opt/conda/lib/python2.7/site-packages/seldon-2.0.0-py2.7.egg/seldon/cli/cli_main.py", line 301, in start_zk_client
    gdata["zk_client"].start()
  File "/opt/conda/lib/python2.7/site-packages/kazoo/client.py", line 546, in start
    raise self.handler.timeout_exception("Connection time-out")
kazoo.handlers.threading.KazooTimeoutError: Connection time-out connecting to zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181

@ukclivecox
Contributor

This looks like a DNS issue.
How are you running Kubernetes?
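
(For reference, a quick way to confirm whether cluster DNS works at all is the standard busybox check from the Kubernetes DNS docs; this assumes a test pod named busybox is running in the default namespace, and uses the spark-master Service name from the logs above:)

# From the test pod, check that a built-in Service resolves...
kubectl exec busybox -- nslookup kubernetes.default
# ...and that the spark-master Service resolves too:
kubectl exec busybox -- nslookup spark-master
# If the Service itself has not been created, DNS has nothing to resolve:
kubectl get svc spark-master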

@ghost
Author

ghost commented Jun 24, 2016

Yes, I am running Kubernetes. Normally, if I can type kubectl get nodes and it works, that means my Kubernetes cluster is running properly, doesn't it?

For more information, this is the result of kubectl get pods:
kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
influxdb-grafana-xegq8          2/2       Running   0          1d
k8s-etcd-127.0.0.1              1/1       Running   2          8d
k8s-master-127.0.0.1            4/4       Running   0          8d
k8s-proxy-127.0.0.1             1/1       Running   1          8d
kafka-controller-hrrko          1/1       Running   94         16h
memcached1-eo1ci                1/1       Running   0          1d
memcached2-ol2mt                1/1       Running   0          1d
mysql                           1/1       Running   0          1d
nginx-198147104-yq9p7           1/1       Running   0          8d
reuters-import-data-j16r9       1/1       Running   0          43s
seldon-control                  1/1       Running   0          1d
spark-master-controller-y5uoi   1/1       Running   0          16h
spark-worker-controller-cqfgn   1/1       Running   166        16h
spark-worker-controller-rcd98   1/1       Running   166        16h
td-agent-server                 1/1       Running   0          1d
zookeeper-1                     1/1       Running   0          1d
zookeeper-2                     1/1       Running   0          1d
zookeeper-3                     1/1       Running   0          1d

And after 1 minute it becomes:

NAME                            READY     STATUS              RESTARTS   AGE
influxdb-grafana-xegq8          2/2       Running             0          1d
k8s-etcd-127.0.0.1              1/1       Running             2          8d
k8s-master-127.0.0.1            4/4       Running             0          8d
k8s-proxy-127.0.0.1             1/1       Running             1          8d
kafka-controller-hrrko          1/1       Running             95         16h
memcached1-eo1ci                1/1       Running             0          1d
memcached2-ol2mt                1/1       Running             0          1d
mysql                           1/1       Running             0          1d
nginx-198147104-yq9p7           1/1       Running             0          8d
reuters-import-data-0pkmf       0/1       ContainerCreating   0          44s
seldon-control                  1/1       Running             0          1d
spark-master-controller-y5uoi   1/1       Running             0          16h
spark-worker-controller-cqfgn   0/1       CrashLoopBackOff    166        16h
spark-worker-controller-rcd98   0/1       CrashLoopBackOff    166        16h
td-agent-server                 1/1       Running             0          1d
zookeeper-1                     1/1       Running             0          1d
zookeeper-2                     1/1       Running             0          1d
zookeeper-3                     1/1       Running             0          1d
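
(To dig further into why the workers are crash-looping, the usual next step is to look at the previous container's logs and at the pod events; these are standard kubectl commands, with pod names taken from the listing above:)

# Logs from the last crashed container of a worker pod:
kubectl logs spark-worker-controller-cqfgn --previous
# Events for the pod stuck in ContainerCreating (image pulls, volume mounts, etc.):
kubectl describe pod reuters-import-data-0pkmf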

@ukclivecox
Contributor

Yes, but if you run Kubernetes locally via Docker you need to start an internal DNS handler.
Can you tell us how you installed Kubernetes, i.e. which of the ways described at http://kubernetes.io/docs/getting-started-guides/ ?

If it was locally via Docker using http://kubernetes.io/docs/getting-started-guides/docker/ then you may need to set up DNS as described in http://kubernetes.io/docs/getting-started-guides/docker/#deploy-a-dns
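
(For reference, a rough sketch of what the DNS setup in that guide involves; --cluster-dns and --cluster-domain are standard kubelet flags, while the IP, domain, and manifest file names below are placeholders that must match your own DNS addon configuration:)

# The kubelet has to be started with cluster-DNS flags, e.g.:
#   --cluster-dns=10.0.0.10 --cluster-domain=cluster.local
# and the SkyDNS addon (the ReplicationController and Service manifests from the
# guide linked above) has to be deployed into the cluster:
kubectl create -f skydns-rc.yaml
kubectl create -f skydns-svc.yaml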

@bghit

bghit commented Jul 23, 2016

Hi,

I am running Kubernetes on top of Mesos. I've set up SkyDNS and the basic busybox test passes.
However, the spark-workers are not able to resolve spark-master:

=== Cannot resolve the DNS entry for spark-master. Has the service been created yet, and is SkyDNS functional?
=== See http://kubernetes.io/v1.1/docs/admin/dns.html for more details on DNS integration.
=== Sleeping 10s before pod exit.

Do you have suggestions about fixing this issue?

Thanks,
Bogdan

@ukclivecox
Contributor

We've not tried running on Mesos yet.
Have you followed the DNS steps in http://kubernetes.io/docs/getting-started-guides/mesos/ ?
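
(If DNS itself checks out, it is also worth confirming that the spark-master Service exists and actually has endpoints behind it; these are standard kubectl commands:)

# Does the Service exist, and does it have endpoints pointing at the master pod?
kubectl get svc spark-master
kubectl get endpoints spark-master
kubectl describe svc spark-master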

@bghit

bghit commented Jul 23, 2016

Yes. I had an error in the SkyDNS config files; after fixing it, the workers connect to spark-master, but only the workers co-located with the master actually get to run tasks. Remote workers seem to start CoarseGrainedExecutors, but they never execute tasks.
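
(A common cause of that symptom is the driver advertising an address the remote workers cannot resolve, or executors registering with an address the driver cannot reach back to. The settings below are standard Spark configuration knobs; the values are placeholders, not taken from this deployment:)

# Make sure the driver advertises a hostname the remote workers can resolve,
# e.g. via the standard Spark property spark.driver.host:
spark-submit --conf spark.driver.host=spark-master ...
# or via the standard environment variables on the driver side:
export SPARK_LOCAL_HOSTNAME=spark-master
export SPARK_PUBLIC_DNS=spark-master
# The executor logs on the remote workers show the driver URL they connect back
# to; an unresolvable or unreachable hostname there points at the same issue.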
