Tensorflow Saved model not portable with latest tf.keras.optimizers #4028

supercharleszhu · 2024-03-11T04:52:40Z

Environment:

Framework: TensorFlow
Framework version: 2.11
Horovod version: 0.28.1
MPI version: N/A
CUDA version: 11.2 (tested in CPU version)
NCCL version: 11.2
Python version: 3.10
Spark / PySpark version: N/A
Ray version: N/A
OS and version: CentOS 7
GCC version: 11.2.0
CMake version:

Checklist:

Did you search issues to find if somebody asked this question before?
If your question is about hang, did you read this doc?
If your question is about docker, did you read this doc?
Did you check if your question is answered in the troubleshooting guide?

Bug report:
Please describe erroneous behavior you're observing and steps to reproduce it.

We hit an issue after running TensorFlow training with Horovod, in both CPU and GPU execution. The TF SavedModel cannot be loaded outside a Horovod environment because the HorovodAllreduce op appears to be saved into the model unexpectedly.

To reproduce, run the following script, which trains a simple Keras model from the test case and saves it:

# test.py
import horovod.tensorflow as hvd
import tensorflow as tf
import keras
import numpy as np


hvd.init()
initial_lr = 0.1 * hvd.size()
opt = tf.keras.optimizers.Adam()
opt = hvd.DistributedOptimizer(opt)

def linear_multiplier(epoch):
    return epoch

model = keras.models.Sequential()
model.add(keras.layers.Dense(2, input_shape=(3,)))
model.add(keras.layers.RepeatVector(3))
model.add(keras.layers.ThresholdedReLU(0.5))
model.compile(loss=keras.losses.mean_squared_error,
              optimizer=opt,
              metrics=[keras.metrics.categorical_accuracy],
              experimental_run_tf_function=False)
x = np.random.random((10, 3))
y = np.random.random((10, 3, 2))


train_history = model.fit(x,
                          y,
                          steps_per_epoch=5,
                          epochs=20)

# test that the metrics average is being respected
loss_metrics = train_history.history["loss"]
loss_metrics_tensor = tf.convert_to_tensor(
    loss_metrics, dtype=tf.float32)
expected_loss_metrics_tensor = hvd.broadcast(
    loss_metrics_tensor, root_rank=0)

if hvd.rank() == 0:
    tf.saved_model.save(model, "test_space/hvd_saved_model_2")

Then run python test.py.

Next, load the model in a process where Horovod has not been imported:

# test_2.py
import tensorflow as tf
tf.saved_model.load("/home/chzhu/test_space/hvd_saved_model_2")

Running python test_2.py fails with:

Traceback (most recent call last):
  File "/home/chzhu/test_space/test_tf_saved_model.py", line 3, in <module>
    tf.saved_model.load("/home/chzhu/test_space/hvd_saved_model_1")
  File "/home/chzhu/test_space/env_310/lib/python3.10/site-packages/tensorflow/python/saved_model/load.py", line 828, in load
    result = load_partial(export_dir, None, tags, options)["root"]
  File "/home/chzhu/test_space/env_310/lib/python3.10/site-packages/tensorflow/python/saved_model/load.py", line 961, in load_partial
    raise FileNotFoundError(
FileNotFoundError: Op type not registered 'HorovodAllreduce' in binary running on chzhu-ld4.linkedin.biz. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
 You may be trying to load on a different device from the computational device. Consider setting the `experimental_io_device` option in `tf.saved_model.LoadOptions` to the io_device such as '/job:localhost'.

Note: Reverting to Horovod 0.26 or switching to tf.keras.optimizers.legacy resolves this issue, but we want to use the latest Horovod with the current optimizers.
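
For anyone triaging this, a rough way to check whether an exported model is affected, without importing TensorFlow or Horovod, is to scan the saved_model.pb file in the export directory for Horovod op names. The sketch below is only a heuristic: it assumes op names appear as plain ASCII strings in the serialized protobuf, and find_horovod_ops is a hypothetical helper name, not part of either library.

```python
import re
from pathlib import Path

# Heuristic pattern: any ASCII run that starts with "Horovod", e.g. the
# 'HorovodAllreduce' op named in the error message above.
_HOROVOD_OP = re.compile(rb"Horovod[A-Za-z]+")


def find_horovod_ops(export_dir):
    """Return sorted, de-duplicated Horovod-looking names found in saved_model.pb.

    Hypothetical helper: it does a raw byte scan and does not parse the
    protobuf, so treat the result as a hint rather than a definitive answer.
    """
    graph_bytes = (Path(export_dir) / "saved_model.pb").read_bytes()
    return sorted({match.decode("ascii") for match in _HOROVOD_OP.findall(graph_bytes)})
```

Calling find_horovod_ops("test_space/hvd_saved_model_2") on the directory written by test.py would be expected to include HorovodAllreduce if the op was serialized, and an empty list would suggest the export is free of Horovod ops. That output has not been verified here.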


supercharleszhu added the bug label Mar 11, 2024

supercharleszhu changed the title ~~Saved model not portable with HorovodAllReduceOps~~ Tensorflow Saved model not portable with HorovodAllReduceOps Mar 11, 2024

supercharleszhu changed the title ~~Tensorflow Saved model not portable with HorovodAllReduceOps~~ Tensorflow Saved model not portable with HorovodAllReduce Ops Mar 11, 2024

supercharleszhu changed the title ~~Tensorflow Saved model not portable with HorovodAllReduce Ops~~ Tensorflow Saved model not portable with latest tf.keras.optimizers Mar 15, 2024

supercharleszhu linked a pull request Mar 15, 2024 that will close this issue

Resolve TF saved model not portable issue with tf.keras.optimizers #4031

Open

4 tasks
