Issues while using Horovod (spark) for distributed training #2810
Replies: 7 comments 4 replies
-
Hey @Aishwarya2703, is there a larger stack trace you can share that describes which index is out of range?
-
@tgaddair I am also running into a similar issue, where the error message is not clear enough. Following is what I see.
I started with the MNIST example but couldn't tell if it was failing due to the volume of data, so I tried getting it to work with a toy dataset, which runs into the same issue. I couldn't figure out where the logs are being written (I checked the Spark driver and executor logs but didn't see any Horovod-specific logs). I have also tried redirecting the logs to a file using a FileHandler for the logging module by adding a piece of code to runner.py, but that didn't work either. I can prepare a Jupyter notebook that reproduces the issue if that helps.
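The log-redirection attempt described above can be sketched as follows; this is a minimal standalone illustration, not Horovod's own code, and the file path is an arbitrary choice. One caveat worth noting: a handler attached this way only captures records emitted in the driver process, while Horovod workers run in separate interpreters on the executors, which may explain why the file stays empty.

```python
# Minimal sketch: attach a FileHandler to the root logger so log records
# from libraries in this process are written to a file. The path below is
# an illustrative assumption.
import logging
import os
import tempfile

log_path = os.path.join(tempfile.gettempdir(), "horovod_debug.log")
handler = logging.FileHandler(log_path)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)
logging.getLogger().addHandler(handler)   # root logger receives all records
logging.getLogger().setLevel(logging.DEBUG)

# Any logger in this process now propagates to the file handler.
logging.getLogger("example").debug("test record")
handler.flush()
```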
-
Hi @tgaddair, I have exactly the same issue running keras_spark_mnist.py on a Spark cluster.
-
I had similar issues on smaller datasets. Something that helped me was to repartition the input data before fitting.
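The repartitioning tip above can be sketched as follows. The helper is illustrative (names like `train_df`, `num_proc`, and the rows-per-partition threshold are assumptions, not Horovod's API): the idea is to ensure every Horovod worker receives at least one partition of data.

```python
# Hedged sketch: choose a partition count so each Horovod worker gets data,
# while avoiding many tiny partitions. The threshold of 10,000 rows per
# partition is an arbitrary illustrative choice.
def partitions_for(num_rows: int, num_workers: int,
                   rows_per_partition: int = 10_000) -> int:
    """Return at least one partition per worker, more for large datasets."""
    by_size = max(1, num_rows // rows_per_partition)
    return max(num_workers, by_size)

# With PySpark (not executed here), the call before fitting might look like:
#   train_df = train_df.repartition(partitions_for(train_df.count(), num_proc))
```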
-
Thank you for your response, @williambarteck.
-
Hi, I am trying the Horovod (Spark) example with PyTorch on AWS EMR. I am following this notebook (more or less copy-pasting): https://docs.microsoft.com/en-us/azure/databricks/_static/notebooks/deep-learning/horovod-spark-estimator-pytorch.html. I am getting the same error when fitting: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. Has anyone found a solution, or have ideas on what I can try to get this working on AWS EMR?
-
I solved this problem by using HDFS for my Spark cluster and setting data_dir to the HDFS cluster.
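The fix above can be sketched like this. The namenode host, port, and path are illustrative placeholders, and the Horovod calls are shown only in comments (Horovod does ship an HDFS-backed store for its Spark estimators, but check your version's API before relying on the exact signature):

```python
# Hedged sketch: build an hdfs:// URL of the kind a Horovod HDFS-backed
# store expects. Host, port, and path below are placeholder assumptions.
def hdfs_url(namenode: str, port: int, path: str) -> str:
    """Assemble an hdfs:// URL from its parts."""
    return f"hdfs://{namenode}:{port}/{path.lstrip('/')}"

# With Horovod installed (not executed here), usage might look like:
#   from horovod.spark.common.store import HDFSStore
#   store = HDFSStore(hdfs_url('namenode', 8020, '/user/me/horovod'))
#   estimator = hvd.KerasEstimator(..., store=store)
```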
-
I was trying to implement a distributed training framework with Horovod (Spark, Keras) on the MNIST dataset, but I am facing an issue while fitting the model on the training data.
The code fails at this line:
keras_model = keras_estimator.fit(train_df).setOutputCols(['label_prob'])
and gives the following error:
IndexError: list index out of range
RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: 0
Exit code: 127
I have also set the following flags:
os.environ.pop('TF_CONFIG', None)
os.environ['TF_XLA_FLAGS'] = '--tf_xla_enable_xla_devices'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
Could you help with this?
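One detail in the traceback worth noting: exit code 127 is the conventional POSIX status for "command not found", so a worker process likely tried to launch an executable (for example the expected Python interpreter) that is missing from the executor environment. A small standalone illustration of that convention, unrelated to Horovod itself:

```python
# Illustration: POSIX shells return status 127 when the command to run
# cannot be found. The command name below is deliberately nonexistent.
import subprocess

result = subprocess.run(
    ["sh", "-c", "some_nonexistent_command_xyz_2810"],
    capture_output=True,
)
print(result.returncode)  # 127 on POSIX systems
```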