dlrm_model.save('my_model') cannot save model after training #52

Open
AmosLewis opened this issue Jul 8, 2020 · 0 comments

I added one line of code at the end of tf2_examples/dlrm_criteo.py to save the model, but the save call fails. The output is attached below. It looks like the call inside TensorFlow requires some information from the dataset. Any idea how to fix this? I tried the TensorFlow GPU versions 2.0, 2.1, and 2.2, and all give the same error.
The line I added:
dlrm_model.save("/home/chi/test/my_model", save_format="tf")

Bug output:
~/test/openrec/tf2_examples(master*) » python dlrm_criteo.py
2020-07-07 17:53:00.942611: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-07 17:53:00.966263: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-07 17:53:00.966536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2080 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:01:00.0
2020-07-07 17:53:00.967845: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-07 17:53:00.983018: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-07 17:53:00.991269: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-07-07 17:53:00.994296: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-07-07 17:53:01.107993: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-07-07 17:53:01.216752: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-07-07 17:53:01.223077: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-07 17:53:01.223144: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-07 17:53:01.223415: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-07 17:53:01.223640: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-07-07 17:53:01.228401: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-07 17:53:01.372497: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2899885000 Hz
2020-07-07 17:53:01.376047: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4e7eca0 executing computations on platform Host. Devices:
2020-07-07 17:53:01.376115: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
2020-07-07 17:53:01.485098: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-07 17:53:01.486519: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x54a7720 executing computations on platform CUDA. Devices:
2020-07-07 17:53:01.486588: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce RTX 2080, Compute Capability 7.5
2020-07-07 17:53:01.490833: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-07 17:53:01.492068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2080 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:01:00.0
2020-07-07 17:53:01.492146: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-07 17:53:01.492187: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-07 17:53:01.492221: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-07-07 17:53:01.492253: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-07-07 17:53:01.492288: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-07-07 17:53:01.492323: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-07-07 17:53:01.492359: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-07 17:53:01.492514: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-07 17:53:01.493762: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-07 17:53:01.494894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-07-07 17:53:01.498147: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-07 17:53:01.504498: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-07 17:53:01.504586: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-07-07 17:53:01.504610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-07-07 17:53:01.507360: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-07 17:53:01.508746: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-07 17:53:01.509974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6724 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-07-07 17:53:07.765291: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
Iter: 0, Loss: 0.27, AUC: 0.4053
Traceback (most recent call last):
File "dlrm_criteo.py", line 72, in <module>
dlrm_model.save('/home/chi/test/my_model', save_format="tf")
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/network.py", line 975, in save
signatures, options)
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/keras/saving/save.py", line 115, in save_model
signatures, options)
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/keras/saving/saved_model/save.py", line 74, in save
save_lib.save(model, filepath, signatures, options)
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/saved_model/save.py", line 870, in save
checkpoint_graph_view)
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/saved_model/signature_serialization.py", line 64, in find_function_to_export
functions = saveable_view.list_functions(saveable_view.root)
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/saved_model/save.py", line 141, in list_functions
self._serialization_cache)
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 2422, in _list_functions_for_serialization
.list_functions_for_serialization(serialization_cache))
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/keras/saving/saved_model/base_serialization.py", line 91, in list_functions_for_serialization
fns = self.functions_to_serialize(serialization_cache)
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/keras/saving/saved_model/layer_serialization.py", line 79, in functions_to_serialize
serialization_cache).functions_to_serialize)
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/keras/saving/saved_model/layer_serialization.py", line 94, in _get_serialized_attributes
serialization_cache)
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/keras/saving/saved_model/model_serialization.py", line 47, in _get_serialized_attributes_internal
default_signature = save_impl.default_save_signature(self.obj)
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/keras/saving/saved_model/save_impl.py", line 206, in default_save_signature
fn.get_concrete_function()
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 776, in get_concrete_function
self._initialize(args, kwargs, add_initializers_to=initializer_map)
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 408, in _initialize
*args, **kwds))
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1848, in _get_concrete_function_internal_garbage_collected
graph_function, _, _ = self._maybe_define_function(args, kwargs)
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2150, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2041, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/framework/func_graph.py", line 915, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 358, in wrapped_fn
return weak_wrapped_fn().wrapped(*args, **kwds)
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/keras/saving/saving_utils.py", line 143, in _wrapped_model
outputs_list = nest.flatten(model(inputs=inputs, training=False))
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 847, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/home/chi/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/api.py", line 292, in wrapper
return func(*args, **kwargs)
TypeError: call() missing 2 required positional arguments: 'sparse_features' and 'label'
(tf2_0) --------
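
The final TypeError can be reproduced without TensorFlow, which may help narrow down the cause. The sketch below is a plain-Python stand-in for the last frames of the traceback: the save path calls the model with a single inputs value, while call() requires more positional arguments. ToyLayer, ToyDLRM, SingleInputAdapter, export_default_signature, and the first parameter name dense_features are hypothetical names, not OpenRec or TensorFlow APIs; only 'sparse_features' and 'label' come from the error message. The adapter at the end is one possible direction for a workaround (a model whose call() takes one structure and unpacks it) and has not been tested against the real DLRM model or dlrm_model.save.

```python
# Plain-Python stand-in for the call chain in the traceback; no TensorFlow needed.
# All class and function names here are hypothetical.


class ToyLayer:
    """Mimics how a layer's __call__ forwards a single `inputs` value to call()."""

    def __call__(self, inputs, *args, training=None, **kwargs):
        # `training` is dropped because the toy call() below does not accept it.
        return self.call(inputs, *args, **kwargs)


class ToyDLRM(ToyLayer):
    """Assumed signature: three required positional arguments."""

    def call(self, dense_features, sparse_features, label):
        return sum(dense_features) + sum(sparse_features) + label


class SingleInputAdapter(ToyLayer):
    """Possible workaround: accept one structure and unpack it for the wrapped model."""

    def __init__(self, model):
        self.model = model

    def call(self, inputs):
        dense_features, sparse_features, label = inputs
        return self.model(dense_features, sparse_features, label)


def export_default_signature(model, inputs):
    """Mimics the save path, which calls the model as model(inputs=..., training=False)."""
    return model(inputs=inputs, training=False)


model = ToyDLRM()

# Calling with one `inputs` value leaves the other two arguments unfilled.
error_message = None
try:
    export_default_signature(model, [1.0, 2.0])
except TypeError as err:
    error_message = str(err)
print(error_message)

# With the adapter, the single `inputs` value carries all three pieces.
adapted_result = export_default_signature(
    SingleInputAdapter(model), ([1.0, 2.0], [3, 4], 1)
)
print(adapted_result)
```

If this reading of the traceback is right, the failure comes from the signature of call() rather than from the training data itself, which would also be consistent with the same error appearing on 2.0, 2.1, and 2.2.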

By the way, dlrm_model.save_weights works fine, and I can get the checkpoints.
Also, on TF 2.0 and 2.1, do not use from tensorflow.data import Dataset; use tf.data.Dataset instead, or you will hit this bug:
https://github.com/tensorflow/tensorflow/issues/33022
