Does not run in Pytorch 1.3.1 #14

Open
michaelklachko opened this issue Dec 4, 2019 · 0 comments

I just cloned your repo, and when I launch the following command:

CUDA_VISIBLE_DEVICES=2,3,4,5 python imagenet.py -a mobilenetv2 -d /path/to/dataset/ImageNet2012/ --epochs 150 --lr-decay cos --lr 0.05 --wd 4e-5 -c checkpoints --width-mult 1 --input-size 224 -j 12

It gets stuck at this point:

=> creating model 'mobilenetv2'

Epoch: [1 | 150]
Processing

<Ctrl+C pressed after 10 min of nothing happening:>

^CTraceback (most recent call last):
  File "imagenet.py", line 403, in <module>
    main()
  File "imagenet.py", line 224, in main
    train_loss, train_acc = train(train_loader, train_loader_len, model, criterion, optimizer, epoch)
  File "imagenet.py", line 271, in train
    for i, (input, target) in enumerate(train_loader):
  File "/home/michael/mobilenetv2.pytorch/utils/dataloaders.py", line 190, in prefetched_loader
    for next_input, next_target in loader:
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 804, in __next__
    idx, data = self._get_data()
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 761, in _get_data
    success, data = self._try_get_data()
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 724, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/threading.py", line 300, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt

Nothing happens at this point. nvidia-smi shows a single GPU using ~500M of memory, and the CPU cores are ~60% busy, but it's not clear what they are doing. I waited 10 minutes before aborting. I also tried it on a single GPU and hit the same issue.
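For reference, the last frames of the traceback above sit in `self._data_queue.get(timeout=timeout)`, so the main process is waiting for a batch that never arrives. The sketch below reproduces that pattern using only the standard library. It is not the PyTorch implementation: `stalled_worker` and `get_batch` are hypothetical names chosen for illustration, and the worker is deliberately written to never produce anything.

```python
import queue
import threading


def stalled_worker(out_queue, stop_event):
    # Hypothetical worker that never puts a batch on the queue;
    # it only waits until it is told to stop.
    stop_event.wait()


def get_batch(data_queue, timeout):
    """Return (True, item) if a batch arrives within `timeout` seconds, else (False, None)."""
    try:
        return True, data_queue.get(timeout=timeout)
    except queue.Empty:
        return False, None


data_queue = queue.Queue()
stop_event = threading.Event()
worker = threading.Thread(
    target=stalled_worker, args=(data_queue, stop_event), daemon=True
)
worker.start()

# The consumer waits, times out, and reports that no batch was produced.
success, batch = get_batch(data_queue, timeout=0.2)
print("got batch:", success)

# Shut the worker down cleanly.
stop_event.set()
worker.join()
```

With a finite timeout the stall becomes a visible failure instead of an indefinite wait. In PyTorch, `torch.utils.data.DataLoader` accepts `timeout` and `num_workers` arguments, so a positive `timeout`, or `num_workers=0` to load in the main process, may help narrow down where the loader hangs. Whether that applies to this repo's loader wrapper is an assumption.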
If I switch to --data-backend dali-cpu (using nvidia-dali version 0.16), it fails with the following error:

=> creating model 'mobilenetv2'
Traceback (most recent call last):
  File "imagenet.py", line 403, in <module>
    main()
  File "imagenet.py", line 194, in main
    train_loader, train_loader_len = get_train_loader(args.data, args.batch_size, workers=args.workers, input_size=args.input_size)
TypeError: gdtl() got an unexpected keyword argument 'input_size'
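This kind of TypeError appears when a caller passes a keyword that the callee's signature does not declare. The sketch below shows the mismatch in isolation. `gdtl_old`, `gdtl_patched`, and `accepts` are hypothetical stand-ins, not the repo's actual functions, and the patched signature is only one possible way to reconcile the two sides.

```python
import inspect


def gdtl_old(data_path, batch_size, workers=5):
    # Hypothetical stand-in for a loader factory that does not know about input_size.
    return "loader", 0


def gdtl_patched(data_path, batch_size, workers=5, input_size=224):
    # Hypothetical patched variant that accepts (and here simply records) input_size.
    return "loader(input_size=%d)" % input_size, 0


def accepts(func, name):
    """True if `func` declares a parameter called `name` or takes **kwargs."""
    params = inspect.signature(func).parameters
    return name in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )


error_message = None
try:
    # Same call shape as in the traceback: input_size is passed by keyword.
    gdtl_old("/path/to/dataset", 256, workers=12, input_size=224)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

print("old accepts input_size:", accepts(gdtl_old, "input_size"))
print("patched accepts input_size:", accepts(gdtl_patched, "input_size"))
```

Checking the signature before the call makes it easy to see which data backends have been updated to take `input_size` and which have not.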

I'm using Pytorch 1.3.1 with 4x Titan Xp cards. The only thing I had to change in your code was to replace cuda(async=True) with cuda(non_blocking=True). Changing to non_blocking=False does not help.
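The rename is needed because `async` became a reserved keyword in Python 3.7 (the version shown in the traceback paths), so `cuda(async=True)` no longer parses; the PyTorch keyword for the same option is `non_blocking`. The small check below confirms this without needing a GPU or PyTorch, since `compile` only parses the snippets and never runs them. It assumes Python 3.7 or newer, and `parses` is a hypothetical helper name.

```python
import keyword


def parses(source):
    """Return True if `source` is syntactically valid for the running interpreter."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


old_call = "tensor = tensor.cuda(async=True)"
new_call = "tensor = tensor.cuda(non_blocking=True)"

print("async is a keyword:", keyword.iskeyword("async"))
print("old call parses:", parses(old_call))
print("new call parses:", parses(new_call))
```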

Can you please try cloning your repo to a clean Pytorch 1.3.1 environment and see if you can run it? Any idea what's going on?
