Ray retraining fails with StopIteration exception when retraining a model with small datasets #3991
Comments
I dug into this further. When training is resumed from a non-zero epoch, RayDatasetBatcher (ludwig/data/dataset/ray.py) calls self._fetch_next_epoch() twice, once in the class's init method and again in set_epoch(), without consuming any batches in between.
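The double fetch can be illustrated with a small, self-contained sketch. TinyBatcher and epochs below are hypothetical stand-ins for the real RayDatasetBatcher and its Ray data pipeline, not Ludwig code; they only model the iterator-exhaustion pattern described above:

```python
# Sketch (not Ludwig's actual code): when the per-epoch data is backed by a
# finite iterator, fetching the "next epoch" twice without consuming any
# batches in between exhausts the iterator early and raises StopIteration.

class TinyBatcher:
    """Hypothetical stand-in for RayDatasetBatcher."""

    def __init__(self, epoch_iter, resumed_epoch=0):
        self.epoch_iter = epoch_iter
        # First fetch, in the init method.
        self.batches = self._fetch_next_epoch()
        if resumed_epoch > 0:
            # On resume, set_epoch() is called before any batch is consumed.
            self.set_epoch(resumed_epoch)

    def _fetch_next_epoch(self):
        # Advances the underlying iterator by one full epoch.
        return next(self.epoch_iter)

    def set_epoch(self, epoch):
        # Second fetch: discards the unconsumed batches from the first fetch.
        self.batches = self._fetch_next_epoch()


def epochs(n_epochs, batches_per_epoch):
    """Yield one list of batch labels per epoch (a finite pipeline)."""
    for e in range(n_epochs):
        yield [f"epoch{e}-batch{b}" for b in range(batches_per_epoch)]


# Fresh run starting at epoch 0: only one fetch happens, so this is fine.
TinyBatcher(epochs(1, 2))

# Resumed run over a pipeline with only one epoch's worth of data buffered:
# the second fetch exhausts the iterator and StopIteration escapes.
try:
    TinyBatcher(epochs(1, 2), resumed_epoch=3)
except StopIteration:
    print("StopIteration on resume")
```

With a larger dataset the pipeline happens to have more epochs buffered, so the wasted fetch goes unnoticed; with a small dataset it surfaces immediately, which matches the behavior reported here.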
Describe the bug
When resuming model training (retraining) with Ray on a small dataset, a StopIteration exception occurs.
The full exception is attached:
exception_stack_trace.txt
To Reproduce
Steps to reproduce the behavior:
1. Run the attached first_run.py in a working folder (it uses the config.yaml file from examples/mnist): first_run.py.txt
2. Run the attached second_run.py in the same folder: second_run.py.txt
You should see the error when running second_run.py.
Expected behavior
The second run should resume and complete training successfully.
Environment (please complete the following information):
ray==2.3.1
dask==2023.3.2
torch==2.1.2
Additional context