DNN library is not found issue #130

Open
nezarelkadyy opened this issue Jan 17, 2024 · 2 comments


nezarelkadyy commented Jan 17, 2024

Running the training code on the CASIA-WebFace dataset always fails with the following error:

2024-01-17 15:41:15.073468: E tensorflow/stream_executor/cuda/cuda_dnn.cc:398] Possibly insufficient driver version: 460.106.0
2024-01-17 15:41:15.073521: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at conv_ops.cc:1120 : UNIMPLEMENTED: DNN library is not found.
Traceback (most recent call last):
  File "/home/nezar/Synapse/Docxter/Training_Codes/Keras_insightface/run_train.py", line 23, in <module>
    tt.train(sch, 0)
  File "/home/nezar/Synapse/Docxter/Training_Codes/Keras_insightface/train.py", line 545, in train
    self.train_single_scheduler(**sch, initial_epoch=initial_epoch)
  File "/home/nezar/Synapse/Docxter/Training_Codes/Keras_insightface/train.py", line 525, in train_single_scheduler
    self.__basic_train__(initial_epoch + epoch, initial_epoch=initial_epoch)
  File "/home/nezar/Synapse/Docxter/Training_Codes/Keras_insightface/train.py", line 416, in __basic_train__
    self.model.fit(
  File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnimplementedError: Graph execution error:

Detected at node 'model/0_conv/Conv2D' defined at (most recent call last):
    File "/home/nezar/Synapse/Docxter/Training_Codes/Keras_insightface/run_train.py", line 23, in <module>
      tt.train(sch, 0)
    File "/home/nezar/Synapse/Docxter/Training_Codes/Keras_insightface/train.py", line 545, in train
      self.train_single_scheduler(**sch, initial_epoch=initial_epoch)
    File "/home/nezar/Synapse/Docxter/Training_Codes/Keras_insightface/train.py", line 525, in train_single_scheduler
      self.__basic_train__(initial_epoch + epoch, initial_epoch=initial_epoch)
    File "/home/nezar/Synapse/Docxter/Training_Codes/Keras_insightface/train.py", line 416, in __basic_train__
      self.model.fit(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/training.py", line 1409, in fit
      tmp_logs = self.train_function(iterator)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
      return fn(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 915, in __call__
      result = self._call(*args, **kwds)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 980, in _call
      return self._stateless_fn(*args, **kwds)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2452, in __call__
      filtered_flat_args) = self._maybe_define_function(args, kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2711, in _maybe_define_function
      graph_function = self._create_graph_function(args, kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2627, in _create_graph_function
      func_graph_module.func_graph_from_py_func(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 1141, in func_graph_from_py_func
      func_outputs = python_func(*func_args, **func_kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 677, in wrapped_fn
      out = weak_wrapped_fn().__wrapped__(*args, **kwds)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 1116, in autograph_handler
      return autograph.converted_call(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/training.py", line 1051, in train_function
      return step_function(self, iterator)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/training.py", line 1040, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1312, in run
      return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2888, in call_for_each_replica
      return self._call_for_each_replica(fn, args, kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 3689, in _call_for_each_replica
      return fn(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/training.py", line 1030, in run_step
      outputs = model.train_step(data)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/training.py", line 889, in train_step
      y_pred = self(x, training=True)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/training.py", line 490, in __call__
      return super().__call__(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/base_layer.py", line 1014, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/functional.py", line 458, in call
      return self._run_internal_graph(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/functional.py", line 596, in _run_internal_graph
      outputs = node.layer(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/base_layer.py", line 1014, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/layers/convolutional/base_conv.py", line 250, in call
      outputs = self.convolution_op(inputs, self.kernel)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/layers/convolutional/base_conv.py", line 225, in convolution_op
      return tf.nn.convolution(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
      return fn(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py", line 1082, in op_dispatch_handler
      return dispatch_target(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/ops/nn_ops.py", line 1150, in convolution_v2
      return convolution_internal(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/ops/nn_ops.py", line 1282, in convolution_internal
      return op(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/ops/nn_ops.py", line 2756, in _conv2d_expanded_batch
      return gen_nn_ops.conv2d(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 969, in conv2d
      _, _, _op, _outputs = _op_def_library._apply_op_helper(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py", line 797, in _apply_op_helper
      op = g._create_op_internal(op_type_name, inputs, dtypes=None,
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 694, in _create_op_internal
      return super(FuncGraph, self)._create_op_internal(  # pylint: disable=protected-access
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 3754, in _create_op_internal
      ret = Operation(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 2133, in __init__
      self._traceback = tf_stack.extract_stack_for_node(self._c_op)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/util/tf_stack.py", line 183, in extract_stack_for_node
      return _tf_stack.extract_stack_for_node(
Node: 'model/0_conv/Conv2D'
DNN library is not found.
	 [[{{node model/0_conv/Conv2D}}]] [Op:__inference_train_function_27816]

============================================================================================
Note that I have installed CUDA 11.2, cuDNN 8.1, TensorFlow 2.9.1, and tensorflow_addons 0.17.0. The code used for training is as follows:

import tensorflow_addons as tfa
import train, losses, models
import os

data_basic_path = '/home/nezar/Data/v2/faces_webface_112x112'
data_path = data_basic_path + '_112x112_folders'
eval_paths = [os.path.join(data_basic_path, ii) for ii in ['lfw.bin', 'cfp_fp.bin', 'agedb_30.bin']]

""" First, Train with `lossTopK = 3` """
basic_model = models.buildin_models("r34", dropout=0, emb_shape=256, output_layer='E')
tt = train.Train(data_path, save_path='TT_resnet34_topk_bs256.h5', eval_paths=eval_paths,
                 basic_model=basic_model, model=None, lr_base=0.1, lr_decay=0.1, lr_decay_steps=[20, 30],
                 batch_size=16, random_status=0,
                 # output_wd_multiply=1
                 )

optimizer = tfa.optimizers.SGDW(learning_rate=0.1, weight_decay=5e-4, momentum=0.9)
sch = [
    {"loss": losses.ArcfaceLoss(scale=16), "epoch": 5, "optimizer": optimizer, "lossTopK": 3},
    {"loss": losses.ArcfaceLoss(scale=32), "epoch": 5, "lossTopK": 3},
    {"loss": losses.ArcfaceLoss(scale=64), "epoch": 40, "lossTopK": 3},
]
tt.train(sch, 0)

What could be the potential problem here?
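
Since "DNN library is not found" refers to cuDNN, one quick diagnostic is to check which cuDNN the dynamic loader actually picks up at runtime. The following is a minimal sketch, not part of the original report: the helper names `decode_cudnn_version` and `loaded_cudnn_version` are hypothetical, the library file names are assumptions for a Linux install, and the decoding assumes the cuDNN 8.x scheme (major * 1000 + minor * 100 + patch).

```python
import ctypes


def decode_cudnn_version(raw):
    """Split a cuDNN 8.x version integer (e.g. 8100) into (major, minor, patch)."""
    major, rest = divmod(raw, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def loaded_cudnn_version(lib_names=("libcudnn.so.8", "libcudnn.so")):
    """Return the version of the first loadable cuDNN library, or None.

    The file names are assumptions for a typical Linux setup; a result of
    None means the loader could not find cuDNN on its search path.
    """
    for name in lib_names:
        try:
            lib = ctypes.CDLL(name)
            # cudnnGetVersion() is part of the cuDNN C API and returns size_t.
            lib.cudnnGetVersion.restype = ctypes.c_size_t
            raw = lib.cudnnGetVersion()
        except (OSError, AttributeError):
            continue
        return decode_cudnn_version(raw)
    return None


if __name__ == "__main__":
    print("cuDNN seen by the loader:", loaded_cudnn_version())
```

If this prints `None` inside the same virtualenv and shell used for training, cuDNN is probably not on the loader path (for example `LD_LIBRARY_PATH`); if it prints a version other than 8.1.x, a different cuDNN install is likely shadowing the intended one.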

Author

nezarelkadyy commented Jan 17, 2024

The dataset and the model both seem to load successfully, as shown in the output below that is generated by your code, but the error from my question above appears right after these prints:

>>>> L2 regularizer value from basic_model: 0
>>>> Init type by loss function name...
>>>> Train arcface...
>>>> Init softmax dataset...
>>>> reloaded from dataset backup: faces_webface_112x112_112x112_folders_shuffle.npz
>>>> Loaded data image_names: 490623 image_classes: 490623 embeddings: 0 classes: 10572
>>>> Image length: 490623, Image class length: 490623, classes: 10572
>>>> Use specified optimizer: <tensorflow_addons.optimizers.weight_decay_optimizers.SGDW object at 0x7fb1f39c4ca0>
>>>> Append weight decay callback...
>>>> Add arcface layer, arc_kwargs={'loss_top_k': 3, 'append_norm': False, 'partial_fc_split': 0, 'name': 'arcface'}, vpl_kwargs={'vpl_lambda': 0.15, 'start_iters': -30663, 'allowed_delta': 200}...
>>>> loss_weights: {'arcface': 1}

Learning rate for iter 1 is 0.1
Weight decay is 0.0005000000237487257
Epoch 1/5

Owner

leondgarse commented Jan 18, 2024

  • It's hard to reproduce. Most of the time this is caused by a TF / CUDA / cuDNN version mismatch. You may try updating them all to the newest versions, like the Colab version in the last Test part.
  • You may also check whether a basic training run works:
    import tensorflow as tf
    from tensorflow import keras

    # Plain Keras ResNet50 on random data: runs Conv2D on the GPU without any code from this repo
    mm = keras.applications.ResNet50(input_shape=(112, 112, 3), classes=10, weights=None)
    # 1000 random 112x112 RGB images with random one-hot labels
    xx, yy = tf.random.uniform([1000, 112, 112, 3]), tf.one_hot(tf.random.uniform([1000], 1, 10, dtype='int32'), 10)
    mm.compile(loss=keras.losses.categorical_crossentropy)
    mm.fit(xx, yy)
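
To narrow down a possible version mismatch before changing anything, it can help to print which CUDA / cuDNN versions the installed TensorFlow wheel was built against and compare them with what is installed on the machine. This is a minimal sketch rather than a definitive check: the helper name `describe_tf_gpu_stack` is hypothetical, and the `cuda_version` / `cudnn_version` keys are only present in GPU builds, so they are read with `.get`.

```python
def describe_tf_gpu_stack():
    """Collect the TensorFlow version, its build-time CUDA/cuDNN versions, and visible GPUs."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return {"tensorflow": None}

    # get_build_info() describes how the installed wheel was compiled.
    build = tf.sysconfig.get_build_info()
    return {
        "tensorflow": tf.__version__,
        "cuda_build": build.get("cuda_version"),
        "cudnn_build": build.get("cudnn_version"),
        "gpus": [dev.name for dev in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    for key, value in describe_tf_gpu_stack().items():
        print(f"{key}: {value}")
```

If the reported build versions differ from the installed CUDA / cuDNN, or the GPU list is empty, the environment is the likely culprit rather than the training script.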
