Issues while using Horovod (spark) for distributed training #2810
Replies: 7 comments 4 replies
-
Hey @Aishwarya2703, is there a larger stack trace you can share that describes which index is out of range?
-
@tgaddair I am also running into a similar issue, where the error message is not clear enough. Following is what I see.
I started with the MNIST example but couldn't tell if it was failing due to the volume of data, so I tried getting it to work with a toy dataset, which runs into the same issue. I couldn't figure out where the logs are being written (I checked the Spark driver and executor logs but didn't see any Horovod-specific logs). I have also tried redirecting the logs to a file using a FileHandler for the logging module by adding a piece of code to runner.py, but that didn't work either. I can prepare a Jupyter notebook that reproduces the issue if that helps.
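The log-redirection attempt described above can be sketched as follows; this is a minimal standalone illustration, not Horovod's own code, and the file path is an arbitrary choice. One caveat worth noting: a handler attached this way only captures records emitted in the driver process, while Horovod workers run in separate interpreters on the executors, which may explain why the file stays empty.

```python
# Minimal sketch: attach a FileHandler to the root logger so log records
# from libraries in this process are written to a file. The path below is
# an illustrative assumption.
import logging
import os
import tempfile

log_path = os.path.join(tempfile.gettempdir(), "horovod_debug.log")
handler = logging.FileHandler(log_path)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)
logging.getLogger().addHandler(handler)   # root logger receives all records
logging.getLogger().setLevel(logging.DEBUG)

# Any logger in this process now propagates to the file handler.
logging.getLogger("example").debug("test record")
handler.flush()
```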
-
Hi @tgaddair, I have exactly the same issue running keras_spark_mnist.py on a Spark cluster.
-
I had similar issues on smaller datasets. Something that helped me was to repartition the input data before fitting.
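The repartitioning tip above can be sketched as follows. The helper is illustrative (names like `train_df`, `num_proc`, and the rows-per-partition threshold are assumptions, not Horovod's API): the idea is to ensure every Horovod worker receives at least one partition of data.

```python
# Hedged sketch: choose a partition count so each Horovod worker gets data,
# while avoiding many tiny partitions. The threshold of 10,000 rows per
# partition is an arbitrary illustrative choice.
def partitions_for(num_rows: int, num_workers: int,
                   rows_per_partition: int = 10_000) -> int:
    """Return at least one partition per worker, more for large datasets."""
    by_size = max(1, num_rows // rows_per_partition)
    return max(num_workers, by_size)

# With PySpark (not executed here), the call before fitting might look like:
#   train_df = train_df.repartition(partitions_for(train_df.count(), num_proc))
```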
-
Thank you for your response, @williambarteck.
-
Hi, I am trying the Horovod (Spark) example with PyTorch on AWS EMR. I am following this notebook (more or less copy-pasting): https://docs.microsoft.com/en-us/azure/databricks/_static/notebooks/deep-learning/horovod-spark-estimator-pytorch.html. I am getting the same error when fitting: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. Has anyone found a solution, or have ideas on what I can try to get this working on AWS EMR?
-
I solved this problem by using HDFS for my Spark cluster and setting data_dir to the HDFS cluster.
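The fix above can be sketched like this. The namenode host, port, and path are illustrative placeholders, and the Horovod calls are shown only in comments (Horovod does ship an HDFS-backed store for its Spark estimators, but check your version's API before relying on the exact signature):

```python
# Hedged sketch: build an hdfs:// URL of the kind a Horovod HDFS-backed
# store expects. Host, port, and path below are placeholder assumptions.
def hdfs_url(namenode: str, port: int, path: str) -> str:
    """Assemble an hdfs:// URL from its parts."""
    return f"hdfs://{namenode}:{port}/{path.lstrip('/')}"

# With Horovod installed (not executed here), usage might look like:
#   from horovod.spark.common.store import HDFSStore
#   store = HDFSStore(hdfs_url('namenode', 8020, '/user/me/horovod'))
#   estimator = hvd.KerasEstimator(..., store=store)
```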
-
I was trying to implement a distributed training framework with Horovod (Spark, Keras) on the MNIST dataset, but I am facing an issue while fitting the model on the training data.
The code fails at this line:
keras_model = keras_estimator.fit(train_df).setOutputCols(['label_prob'])
and gives the following error:
IndexError: list index out of range
RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: 0
Exit code: 127
I have also set the following flags:
os.environ.pop('TF_CONFIG', None)
os.environ['TF_XLA_FLAGS'] = '--tf_xla_enable_xla_devices'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
Could you help with this?
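One detail in the traceback worth noting: exit code 127 is the conventional POSIX status for "command not found", so a worker process likely tried to launch an executable (for example the expected Python interpreter) that is missing from the executor environment. A small standalone illustration of that convention, unrelated to Horovod itself:

```python
# Illustration: POSIX shells return status 127 when the command to run
# cannot be found. The command name below is deliberately nonexistent.
import subprocess

result = subprocess.run(
    ["sh", "-c", "some_nonexistent_command_xyz_2810"],
    capture_output=True,
)
print(result.returncode)  # 127 on POSIX systems
```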