Crashes on training #23

Open
JohnBakery opened this issue Feb 19, 2019 · 10 comments

Comments

@JohnBakery

When I run

G:\Users\user\Desktop\cnn>C:\Users\user\AppData\Local\Programs\Python\Python36\python.exe watermarks.py --logdir=save/

The trainer crashes after exactly 17000 TFRecords with the following message:

Traceback (most recent call last):
  File "watermarks.py", line 295, in <module>
    tf.app.run()
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "watermarks.py", line 290, in main
    train(sess, globals()[FLAGS.dataset])
  File "watermarks.py", line 188, in train
    min_opacity, max_opacity)
  File "G:\Users\user\Desktop\cnn\dataset.py", line 16, in batch_masks
    for _ in range(FLAGS.batch_size)], 0)
  File "G:\Users\user\Desktop\cnn\dataset.py", line 16, in <listcomp>
    for _ in range(FLAGS.batch_size)], 0)
  File "G:\Users\user\Desktop\cnn\dataset.py", line 39, in create_mask
    mask, tf.random_uniform([], -max_angle, max_angle, tf.float32))  # Costly
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\contrib\image\python\ops\image_ops.py", line 75, in rotate
    interpolation=interpolation)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\contrib\image\python\ops\image_ops.py", line 170, in transform
    images, transforms, interpolation=interpolation.upper())
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\contrib\image\ops\gen_image_ops.py", line 94, in image_projective_transform
    interpolation=interpolation, name=name)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 2632, in create_op
    set_shapes_for_outputs(ret)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 1911, in set_shapes_for_outputs
    shapes = shape_func(op)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 595, in call_cpp_shape_fn
    require_shape_fn)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 654, in _call_cpp_shape_fn_impl
    input_tensors_as_shapes, status)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\contextlib.py", line 88, in __exit__
    next(self.gen)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: Op type not registered 'ImageProjectiveTransform' in binary running on USER-PC. Make sure the Op and Kernel are registered in the binary running in this process.

Am I doing something wrong?

@marcbelmont
Owner

Maybe it's an issue with the package versions. Can you run pip freeze and post the output?

@JohnBakery
Author

bleach==1.5.0
colorama==0.4.1
cycler==0.10.0
decorator==4.3.2
html5lib==0.9999999
ipython==6.0.0
ipython-genutils==0.2.0
jedi==0.13.2
Markdown==3.0.1
matplotlib==2.0.0
numpy==1.12.1
olefile==0.46
parso==0.3.4
pickleshare==0.7.5
Pillow==4.1.0
prompt-toolkit==1.0.15
protobuf==3.6.1
Pygments==2.3.1
pyparsing==2.3.1
python-dateutil==2.8.0
pytz==2018.9
simplegeneric==0.8.1
six==1.12.0
tensorflow==1.3.0
tensorflow-tensorboard==0.1.5
traitlets==4.3.2
wcwidth==0.1.7
Werkzeug==0.14.1

@marcbelmont
Owner

Thanks. I don't see anything wrong. Is it consistently crashing after 17000 training steps?

@JohnBakery
Author

Yes, exactly at 17000, every single time. If I restart without deleting the tfrecords files, it crashes right away. If I delete them, it runs until 17000. Looking at the files, voc-17000.tfrecords is 65MB, while all the others are ~105MB. Not sure if that matters.

@marcbelmont
Owner

This one is smaller because it is the last one (it contains fewer images). You can try removing it.
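
To compare the shard sizes in one go, something like the following could be used. It is only a sketch and not part of this repository: the `voc-*.tfrecords` pattern is guessed from the file name mentioned above, the directory is whatever folder holds the shards, and the 0.8 ratio is an arbitrary cutoff.

```python
from pathlib import Path
from statistics import median


def shard_sizes(directory, pattern="voc-*.tfrecords"):
    """Return {file name: size in bytes} for every shard matching pattern."""
    return {p.name: p.stat().st_size for p in sorted(Path(directory).glob(pattern))}


def undersized(sizes, ratio=0.8):
    """Return the names of shards smaller than ratio * median shard size."""
    if not sizes:
        return []
    cutoff = ratio * median(sizes.values())
    return [name for name, size in sizes.items() if size < cutoff]


if __name__ == "__main__":
    sizes = shard_sizes(".")  # assumption: run from the folder with the shards
    for name, size in sizes.items():
        print(f"{name}: {size / 1e6:.1f} MB")
    print("smaller than the rest:", undersized(sizes))
```

A single smaller shard at the end is expected if the images do not divide evenly across files, so a hit here is not by itself a sign of corruption.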

@JohnBakery
Author

I deleted voc-17000.tfrecords and restarted training, and it crashes right away with the same message.

@marcbelmont
Owner

It looks like a Windows-specific issue; see tensorflow/tensorflow#9672.
Try using tensorflow==1.4.0 instead.
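
To double-check which TensorFlow version a `pip freeze` listing pins, the output can be parsed with a few lines of Python. This is a minimal sketch: the helper names are made up for illustration, the sample text is a shortened copy of the listing posted above, and 1.4.0 is simply the version suggested here, not a documented minimum.

```python
def parse_freeze(text):
    """Map package name (lower-cased) to version for each 'name==version' line."""
    pins = {}
    for line in text.splitlines():
        name, sep, version = line.strip().partition("==")
        if sep:
            pins[name.lower()] = version
    return pins


def release_tuple(version):
    """Turn '1.3.0' into (1, 3, 0); only handles plain numeric releases."""
    return tuple(int(part) for part in version.split("."))


# Shortened sample of the pip freeze output from this thread.
freeze_output = """\
numpy==1.12.1
tensorflow==1.3.0
tensorflow-tensorboard==0.1.5
"""

pins = parse_freeze(freeze_output)
needs_upgrade = release_tuple(pins["tensorflow"]) < release_tuple("1.4.0")
print(pins["tensorflow"], "-> upgrade suggested:", needs_upgrade)
```

If an older version shows up, `pip install tensorflow==1.4.0` installs the suggested release.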

@JohnBakery
Author

JohnBakery commented Feb 19, 2019

I updated to 1.4.0 and it solved the crashing issue. However, it now gets stuck after printing "Shuffle buffer filled."

WARNING:tensorflow:From G:\Users\user\Desktop\cnn\dataset.py:110: TFRecordDataset.__init__ (from tensorflow.contrib.data.python.ops.readers) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.TFRecordDataset`.
2019-02-19 16:30:40.833944: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 159 of 10000
2019-02-19 16:30:50.826511: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 320 of 10000
2019-02-19 16:30:57.231075: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:121] Shuffle buffer filled.

I thought this could have been due to tensorflow-tensorboard being incompatible, since when I switched to 1.4.0 I got the following message:

tensorflow 1.4.0 has requirement tensorflow-tensorboard<0.5.0,>=0.4.0rc1, but you'll have tensorflow-tensorboard 0.1.5 which is incompatible.

so I updated tensorflow-tensorboard to 0.4.0rc1, but it still hangs at "Shuffle buffer filled."

So I let it run, and apparently it is doing something:

2019-02-19 16:51:05.815504: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 147 of 10000
2019-02-19 16:51:15.779815: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 248 of 10000
2019-02-19 16:51:25.787975: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 379 of 10000
2019-02-19 16:51:29.268296: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:121] Shuffle buffer filled.
2019-02-19 18:37:48.078317: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 153 of 10000
2019-02-19 18:37:58.045792: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 307 of 10000
2019-02-19 18:38:05.494099: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:121] Shuffle buffer filled.
2019-02-19 20:24:20.950043: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 157 of 10000
2019-02-19 20:24:30.912401: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 317 of 10000
2019-02-19 20:24:38.561894: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:121] Shuffle buffer filled.
2019-02-19 22:10:38.268626: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 155 of 10000
2019-02-19 22:10:48.274232: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 314 of 10000
2019-02-19 22:10:55.083987: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\kernels\shuffle_dataset_op.cc:121] Shuffle buffer filled.

For how long should I let it run?
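
One way to see whether the run is progressing is to measure the time between successive "Shuffle buffer filled." lines. The sketch below does that for the log above; the file paths are shortened to "..." for readability and the helper names are made up for illustration. It only reports the gaps and does not say what the trainer is doing in between.

```python
from datetime import datetime

MARKER = "Shuffle buffer filled."


def filled_times(log_lines):
    """Return the timestamp of every line that ends with the marker."""
    times = []
    for line in log_lines:
        if line.rstrip().endswith(MARKER):
            # The first 26 characters are 'YYYY-MM-DD HH:MM:SS.ffffff'.
            times.append(datetime.strptime(line[:26], "%Y-%m-%d %H:%M:%S.%f"))
    return times


def gaps(times):
    """Return the differences between consecutive timestamps."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Timestamps copied from the log above; paths abbreviated to "...".
log = [
    "2019-02-19 16:51:29.268296: I ...shuffle_dataset_op.cc:121] Shuffle buffer filled.",
    "2019-02-19 18:37:48.078317: I ...shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 153 of 10000",
    "2019-02-19 18:38:05.494099: I ...shuffle_dataset_op.cc:121] Shuffle buffer filled.",
    "2019-02-19 20:24:38.561894: I ...shuffle_dataset_op.cc:121] Shuffle buffer filled.",
    "2019-02-19 22:10:55.083987: I ...shuffle_dataset_op.cc:121] Shuffle buffer filled.",
]

for gap in gaps(filled_times(log)):
    print(gap)
```

For this log the gaps are all roughly 1 hour 46 minutes, so the refill happens on a regular cycle rather than at random.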

@toyssamurai

I want to chime in, too. I also got to "Shuffle buffer filled", and the last one I got was about an hour ago. Did yours finish in the end?
