Caught IndexError in DataLoader worker process 0 using pip installations #22

Open
sgbaird opened this issue Jun 10, 2022 · 9 comments

sgbaird commented Jun 10, 2022

Setup

Running on Windows Subsystem for Linux 2 (WSL2).

git clone https://github.com/Janspiry/Palette-Image-to-Image-Diffusion-Models.git
cd Palette-Image-to-Image-Diffusion-Models
conda create -n pip-palette python==3.9.*
conda activate pip-palette
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu113
pip install -r requirements.txt

Config

Same as #21

Directory Structure

Same as #21

Terminal

(pip-palette) sgbaird@Dell-G7:~/GitHub/Palette-Image-to-Image-Diffusion-Models$  cd /home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models ; /usr/bin/env /home/sgbaird/miniconda3/envs/palette/bin/python /home/sgbaird/.vscode-server/extensions/ms-python.python-2022.8.0/pythonFiles/lib/python/debugpy/launcher 36177 -- /home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models/run.py -p train -c config/inpainting_celebahq_dummy.json --debug 
export CUDA_VISIBLE_DEVICES=0
/home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models/run.py:28: UserWarning: You have chosen to use cudnn for accleration. torch.backends.cudnn.enabled=True
  warnings.warn('You have chosen to use cudnn for accleration. torch.backends.cudnn.enabled=True')
(pip-palette) sgbaird@Dell-G7:~/GitHub/Palette-Image-to-Image-Diffusion-Models$  cd /home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models ; /usr/bin/env /home/sgbaird/miniconda3/envs/pip-palette/bin/python /home/sgbaird/.vscode-server/extensions/ms-python.python-2022.8.0/pythonFiles/lib/python/debugpy/launcher 41379 -- /home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models/run.py -p train -c config/inpainting_celebahq_dummy.json --debug 
export CUDA_VISIBLE_DEVICES=0
/home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models/run.py:28: UserWarning: You have chosen to use cudnn for accleration. torch.backends.cudnn.enabled=True
  warnings.warn('You have chosen to use cudnn for accleration. torch.backends.cudnn.enabled=True')
  0%|                                                     | 0/16 [00:00<?, ?it/s]
Close the Tensorboard SummaryWriter.

Error

Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/site-packages/torch/utils/data/dataset.py", line 471, in __getitem__
    return self.dataset[self.indices[idx]]
  File "/home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models/data/dataset.py", line 54, in __getitem__
    path = self.imgs[index]
IndexError: list index out of range
  File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
  File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models/models/model.py", line 106, in train_step
    for train_data in tqdm.tqdm(self.phase_loader):
  File "/home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models/core/base_model.py", line 45, in train
    train_log = self.train_step()
  File "/home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models/run.py", line 58, in main_worker
    model.train()
  File "/home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models/run.py", line 92, in <module>
    main_worker(0, 1, opt)
  File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/runpy.py", line 268, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/runpy.py", line 197, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,
sgbaird changed the title from "Caught IndexError in DataLoader worker process 0" to "Caught IndexError in DataLoader worker process 0 using pip installations" on Jun 10, 2022

sgbaird commented Jun 10, 2022

https://stackoverflow.com/a/62550189/13697228 mentions that the data length needs to be divisible by batch_size. I changed batch_size to 1 everywhere and got the same issue.
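For reference, here's a minimal plain-PyTorch sketch (not this repo's actual loader setup, just a toy example) of what that divisibility workaround would look like, dropping the incomplete final batch with drop_last=True:

import torch
from torch.utils.data import DataLoader, TensorDataset

# toy stand-in dataset: 48 samples, matching the dummy train split in the log below
dataset = TensorDataset(torch.randn(48, 3, 64, 64))

# drop_last=True discards the leftover partial batch, so every batch has exactly batch_size samples
loader = DataLoader(dataset, batch_size=5, shuffle=True, num_workers=0, drop_last=True)
for (batch,) in loader:
    print(batch.shape)  # always torch.Size([5, 3, 64, 64]); the 3 leftover samples are skipped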

Here's the log:

22-06-09 23:28:39.190 - INFO: Create the log file in directory experiments/debug_inpainting_celebahq_220609_232838.

22-06-09 23:28:39.259 - INFO: Dataset [InpaintDataset() form data.dataset] is created.
22-06-09 23:28:39.260 - INFO: Dataset for train have 48 samples.
22-06-09 23:28:39.260 - INFO: Dataset for val have 2 samples.
22-06-09 23:28:39.780 - INFO: Network [Network() form models.network] is created.
22-06-09 23:28:39.781 - INFO: Network [Network] weights initialize using [kaiming] method.
22-06-09 23:28:40.080 - WARNING: Config is a str, converts to a dict {'name': 'mae'}
22-06-09 23:28:40.459 - INFO: Metric [mae() form models.metric] is created.
22-06-09 23:28:40.459 - WARNING: Config is a str, converts to a dict {'name': 'mse_loss'}
22-06-09 23:28:40.468 - INFO: Loss [mse_loss() form models.loss] is created.
22-06-09 23:28:45.991 - INFO: Beign loading pretrained model [Network] ...
22-06-09 23:28:45.992 - WARNING: Pretrained model in [experiments/train_inpainting_celebahq_220426_233652/checkpoint/190_Network.pth] is not existed, Skip it
22-06-09 23:28:45.992 - INFO: Beign loading pretrained model [Network_ema] ...
22-06-09 23:28:45.992 - WARNING: Pretrained model in [experiments/train_inpainting_celebahq_220426_233652/checkpoint/190_Network_ema.pth] is not existed, Skip it
22-06-09 23:28:46.007 - INFO: Beign loading training states
22-06-09 23:28:46.007 - WARNING: Training state in [experiments/train_inpainting_celebahq_220426_233652/checkpoint/190.state] is not existed, Skip it
22-06-09 23:28:46.018 - INFO: Model [Palette() form models.model] is created.
22-06-09 23:28:46.019 - INFO: Begin model train.

Janspiry (Owner) commented:

Feel free to reopen the issue if there are any further questions.


sgbaird commented Jun 21, 2022

@Janspiry if you close the issue, the person who originally opened it can't reopen it.

How do you suggest I fix the Caught IndexError in DataLoader worker process 0 error so that I can actually run the code in this repository? My colleague @hasan-sayeed and I haven't been able to get Palette running at all, despite spending many hours debugging one issue after another.

Janspiry (Owner) commented:

Sorry for the error, I thought you guys had fixed it.
Since the message says Caught IndexError, I suspect that the self.imgs list the dataset is reading may be incorrect.
You could try printing this variable. Also, can you show me the file directory and the contents of train.flist?
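For example, a quick check like this (the flist path below is just a guess; substitute whatever path your config points to) would show whether the list was read correctly:

flist_path = "datasets/celebahq/flist/train.flist"  # hypothetical path; use the one from your config

with open(flist_path) as f:
    paths = [line.strip() for line in f if line.strip()]

print(len(paths), "entries in", flist_path)
print(paths[:3])  # the IndexError suggests this list is shorter than the dataset length the loader expects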


sgbaird commented Jun 21, 2022

@Janspiry thanks for the response. Will take another look and post back.

ani0075saha commented:

Hi @Janspiry @sgbaird.
I am facing a similar issue when running the test script. Maybe they are related because of the way in which data indexing is implemented.

 92%|████████████████████████████████████████████████████████████████████████████████████████████████             | 12/13 [1:00:07<05:00, 300.59s/it]
Close the Tensorboard SummaryWriter.                                                                                                                                                                   
Traceback (most recent call last):                                                                                                                                                                     
  File "/scratch/aniruddha/Palette-Image-to-Image-Diffusion-Models/run.py", line 92, in <module>                                                                                                       
    main_worker(0, 1, opt)                                                                                                                                                                             
  File "/scratch/aniruddha/Palette-Image-to-Image-Diffusion-Models/run.py", line 60, in main_worker                                                                                                    
    model.test()                                                                                                                                                                                       
  File "/scratch/aniruddha/Palette-Image-to-Image-Diffusion-Models/models/model.py", line 190, in test                                                                                                 
    self.writer.save_images(self.save_current_results())                                                                                                                                               
  File "/scratch/aniruddha/Palette-Image-to-Image-Diffusion-Models/models/model.py", line 87, in save_current_results                                                                                  
    ret_path.append('GT_{}'.format(self.path[idx]))                                                                                                                                                    
IndexError: list index out of range

I am running the test on 100 images with a batch size of 8. As you can see from the logs, there are 13 batches (12 batches of 8 images and a last batch of 4). The run fails only on the last batch: the line here looks for 8 images (the batch size) in the last batch even though there are only 4.

https://github.com/Janspiry/Palette-Image-to-Image-Diffusion-Models/blob/main/models/model.py#L86

The test script runs fine when I use a multiple of 8 images.
Could you let me know the easiest fix for this? Thanks.
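Here is a tiny standalone illustration of the failure mode (hypothetical file names, not from the repo):

configured_batch_size = 8
last_batch_paths = ["img_96.png", "img_97.png", "img_98.png", "img_99.png"]  # only 4 samples left in the final batch

# indexing by the configured batch size runs past the end of the list
try:
    for idx in range(configured_batch_size):
        _ = last_batch_paths[idx]
except IndexError as err:
    print("configured size fails:", err)  # list index out of range

# bounding the loop by the actual number of samples avoids the error
for idx in range(len(last_batch_paths)):
    _ = last_batch_paths[idx]
print("actual size: ok")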

ani0075saha commented:

I was able to solve the problem by getting the number of images in the batch explicitly.

# use the actual number of samples in this batch (the last batch may be smaller than the configured batch size)
temp_batch_size = len(self.path)
for idx in range(temp_batch_size):
    ret_path.append('GT_{}'.format(self.path[idx]))
    ret_result.append(self.gt_image[idx].detach().float().cpu())

    ret_path.append('Process_{}'.format(self.path[idx]))
    ret_result.append(self.visuals[idx::temp_batch_size].detach().float().cpu())
    
    ret_path.append('Out_{}'.format(self.path[idx]))
    ret_result.append(self.visuals[idx-temp_batch_size].detach().float().cpu())
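Taking the count from len(self.path) works here presumably because self.path is filled per batch, so on the final batch it holds only the 4 entries that were actually loaded, while the configured batch size is still 8.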

Janspiry reopened this Jul 22, 2022
Janspiry (Owner) commented:

@ani0075, thanks for suggesting this. I will fix it asap.

1808030112 commented:

Sorry to bother you, but did you manage to reproduce this code in the end?
