Ray retraining fails with StopIteration exception when retraining a model with small datasets #3991

Open · vijayi1 (Contributor) opened this issue Apr 8, 2024 · 1 comment

vijayi1 commented Apr 8, 2024

Describe the bug

When resuming training of a model (retraining) with the Ray backend on a small dataset, the following exception occurs:

    2024-04-08 13:13:36,849	WARNING worker.py:1866 -- Traceback (most recent call last):
      File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/dataset_pipeline.py", line 226, in iter_batches
        blocks_owned_by_consumer = self._peek()._plan.execute()._owned_by_consumer
      File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/dataset_pipeline.py", line 1319, in _peek
        first_dataset_gen = next(dataset_iter)
      File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/dataset_pipeline.py", line 732, in __next__
        raise StopIteration
    StopIteration
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "python/ray/_raylet.pyx", line 850, in ray._raylet.execute_task
      File "python/ray/_raylet.pyx", line 902, in ray._raylet.execute_task
      File "python/ray/_raylet.pyx", line 857, in ray._raylet.execute_task
      File "python/ray/_raylet.pyx", line 861, in ray._raylet.execute_task
      File "python/ray/_raylet.pyx", line 803, in ray._raylet.execute_task.function_executor
      File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/_private/function_manager.py", line 674, in actor_method_executor
        return method(__ray_actor, *args, **kwargs)
      File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 466, in _resume_span
        return method(self, *_args, **_kwargs)
      File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
        raise skipped from exception_cause(skipped)
      File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
        train_func(*args, **kwargs)
      File "/data/vijayi/ludwig/ludwig/backend/ray.py", line 501, in <lambda>
        lambda config: train_fn(**config),
      File "/data/vijayi/ludwig/ludwig/backend/ray.py", line 215, in train_fn
        results = trainer.train(train_shard, val_shard, test_shard, return_state_dict=True, **kwargs)
      File "/data/vijayi/ludwig/ludwig/distributed/base.py", line 157, in wrapped
        res = fn(*args, **kwargs)
      File "/data/vijayi/ludwig/ludwig/trainers/trainer.py", line 1038, in train
        batcher.set_epoch(progress_tracker.epoch, progress_tracker.batch_size)
      File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 355, in set_epoch
        self._fetch_next_epoch()
      File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 380, in _fetch_next_epoch
        self._fetch_next_batch()
      File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 389, in _fetch_next_batch
        self._next_batch = next(self.dataset_batch_iter)
      File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 469, in async_read
        raise batch
      File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 454, in producer
        for batch in pipeline.iter_batches(prefetch_blocks=0, batch_size=batch_size, batch_format="pandas"):
    RuntimeError: generator raised StopIteration

The full exception is attached:
exception_stack_trace.txt
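
For context on the last line: since Python 3.7 (PEP 479), a StopIteration that escapes inside a generator body is re-raised as a RuntimeError, which is why pipeline.iter_batches surfaces "RuntimeError: generator raised StopIteration" instead of the underlying StopIteration from DatasetPipeline._peek(). A minimal, standalone illustration (plain Python, not Ludwig or Ray code):

    def producer():
        exhausted = iter([])
        # next() on an exhausted iterator raises StopIteration inside the
        # generator body; under PEP 479 Python converts it to RuntimeError.
        yield next(exhausted)

    try:
        list(producer())
    except RuntimeError as err:
        print(err)  # "generator raised StopIteration"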

To Reproduce
Steps to reproduce the behavior:

  1. Clone the ludwig repo, then cd to the examples/mnist/ folder.
  2. Run the attached first_run.py in that folder (it uses the config.yaml file from examples/mnist):
     first_run.py.txt
  3. Retrain the model by running the attached second_run.py in the same folder:
     second_run.py.txt

You should see the error when running second_run.py; a rough sketch of the two scripts follows below.
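
The attached files contain the exact code; the following is only an assumed sketch of the train-then-retrain flow (the dataset path, the 20-row limit, the output directory, and resuming via LudwigModel.train(model_resume_path=...) are assumptions based on the description above, not copies of the attachments):

    # Hypothetical sketch, not the attached scripts.
    import pandas as pd
    from ludwig.api import LudwigModel

    df = pd.read_csv("mnist_dataset.csv").head(20)  # small dataset triggers the bug

    # first_run.py: initial training with the ray backend
    model = LudwigModel(config="config.yaml", backend="ray")
    model.train(dataset=df, output_directory="results_first")

    # second_run.py: retrain, resuming from the first run's output directory
    model2 = LudwigModel(config="config.yaml", backend="ray")
    model2.train(
        dataset=df,
        model_resume_path="results_first/api_experiment_run",  # assumed path
    )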

Expected behavior
The second run (retraining) should complete training successfully.

Environment (please complete the following information):

  • OS: Red Hat 8.6
  • Python version: Python 3.8
  • Ludwig version: 0.10.2
  • other versions:
    ray==2.3.1
    dask==2023.3.2
    torch==2.1.2

Additional context

  • The problem only happens with the ray backend; it does not happen with the local backend.
  • Occasionally second_run.py may pass; in that case, rerun first_run.py followed by second_run.py a couple of times.
  • Increasing the dataset size reduces the probability of the exception. The attached files limit the dataset to 20 rows; the problem also occurs at 50 and 100 rows, but less frequently, and at 1000 rows it almost never occurs.

vijayi1 (Contributor, Author) commented Apr 10, 2024

I dug into this further. When training is resumed from a non-zero epoch, RayDatasetBatcher (ludwig/data/dataset/ray.py) calls self._fetch_next_epoch() twice, once in the class's __init__ method and again in the set_epoch() method, without consuming any batches in between.
The following patch fixes the problem, but I'm not sure whether it's the right fix:

    diff --git a/ludwig/data/dataset/ray.py b/ludwig/data/dataset/ray.py
    index 5ad083fa..ba53ad33 100644
    --- a/ludwig/data/dataset/ray.py
    +++ b/ludwig/data/dataset/ray.py
    @@ -352,7 +352,8 @@ class RayDatasetBatcher(Batcher):
         def set_epoch(self, epoch, batch_size):
             self.batch_size = batch_size
             if epoch != self._epoch:
    -            self._fetch_next_epoch()
    +            if self._step or self._last_batch:
    +                self._fetch_next_epoch()
                 self._epoch = epoch

         @property
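
To make the failure mode concrete, here is a plain-Python analogy (not Ludwig or Ray code), assuming the pipeline exposes a bounded number of per-epoch windows: fetching one window in __init__ and another in set_epoch() before any batch is consumed throws a window away, so with a small dataset the iterator runs out one epoch early and next() raises StopIteration.

    # Toy stand-in for a pipeline that yields one dataset per epoch window.
    def make_pipeline(num_epochs, rows):
        for _ in range(num_epochs):
            yield [f"row-{i}" for i in range(rows)]

    pipeline = make_pipeline(num_epochs=2, rows=20)

    next(pipeline)  # analogous to _fetch_next_epoch() in RayDatasetBatcher.__init__
    next(pipeline)  # analogous to the second _fetch_next_epoch() in set_epoch()

    # Only two epoch windows exist, so the next fetch fails:
    try:
        next(pipeline)
    except StopIteration:
        print("pipeline exhausted one epoch early")

The patch above skips the redundant fetch when neither self._step nor self._last_batch indicates that the epoch fetched in __init__ has been consumed, so that window is reused instead of being discarded.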
