Tensorflow Saved model not portable with latest tf.keras.optimizers #4028

Open
supercharleszhu opened this issue Mar 11, 2024 · 0 comments · May be fixed by #4031

supercharleszhu commented Mar 11, 2024

Environment:

  1. Framework: TensorFlow
  2. Framework version: 2.11
  3. Horovod version: 0.28.1
  4. MPI version: N/A
  5. CUDA version: 11.2 (tested in CPU version)
  6. NCCL version: 11.2
  7. Python version: 3.10
  8. Spark / PySpark version: N/A
  9. Ray version: N/A
  10. OS and version: CentOS 7
  11. GCC version: 11.2.0
  12. CMake version:

Checklist:

  1. Did you search issues to find if somebody asked this question before?
  2. If your question is about a hang, did you read this doc?
  3. If your question is about Docker, did you read this doc?
  4. Did you check if your question is answered in the troubleshooting guide?

Bug report:
Please describe erroneous behavior you're observing and steps to reproduce it.

We hit an issue after running TensorFlow training with Horovod, on both CPU and GPU. The resulting TF SavedModel cannot be loaded outside a Horovod environment, because a HorovodAllreduce op appears to be saved into the model unexpectedly.

To reproduce, run the following script, which trains a simple Keras model (taken from a test case) and saves it:

# test.py
import horovod.tensorflow as hvd
import tensorflow as tf
import keras
import numpy as np


hvd.init()
initial_lr = 0.1 * hvd.size()
opt = tf.keras.optimizers.Adam()
opt = hvd.DistributedOptimizer(opt)

def linear_multiplier(epoch):
    return epoch

model = keras.models.Sequential()
model.add(keras.layers.Dense(2, input_shape=(3,)))
model.add(keras.layers.RepeatVector(3))
model.add(keras.layers.ThresholdedReLU(0.5))
model.compile(loss=keras.losses.mean_squared_error,
                optimizer=opt,
                metrics=[keras.metrics.categorical_accuracy],
                experimental_run_tf_function=False)
x = np.random.random((10, 3))
y = np.random.random((10, 3, 2))


train_history = model.fit(x,
                            y,
                            steps_per_epoch=5,
                            epochs=20)

# test that the metrics average is being respected
loss_metrics = train_history.history["loss"]
loss_metrics_tensor = tf.convert_to_tensor(
    loss_metrics, dtype=tf.float32)
expected_loss_metrics_tensor = hvd.broadcast(
    loss_metrics_tensor, root_rank=0)

if hvd.rank() == 0:
    tf.saved_model.save(model, "test_space/hvd_saved_model_2")

Run it with python test.py.

Then load the model in a script that does not import Horovod:

# test_2.py
import tensorflow as tf
tf.saved_model.load("/home/chzhu/test_space/hvd_saved_model_2")

Running python test_2.py fails with:

Traceback (most recent call last):
  File "/home/chzhu/test_space/test_tf_saved_model.py", line 3, in <module>
    tf.saved_model.load("/home/chzhu/test_space/hvd_saved_model_1")
  File "/home/chzhu/test_space/env_310/lib/python3.10/site-packages/tensorflow/python/saved_model/load.py", line 828, in load
    result = load_partial(export_dir, None, tags, options)["root"]
  File "/home/chzhu/test_space/env_310/lib/python3.10/site-packages/tensorflow/python/saved_model/load.py", line 961, in load_partial
    raise FileNotFoundError(
FileNotFoundError: Op type not registered 'HorovodAllreduce' in binary running on chzhu-ld4.linkedin.biz. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
 You may be trying to load on a different device from the computational device. Consider setting the `experimental_io_device` option in `tf.saved_model.LoadOptions` to the io_device such as '/job:localhost'.
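
One way to confirm that Horovod op names ended up in the export, without loading it, is to scan the serialized graph directly. The snippet below is only a rough diagnostic sketch: it assumes the export directory contains the usual saved_model.pb file, and it relies on the fact that protobuf stores string fields as raw bytes, so a byte-level search can find names such as HorovodAllreduce. The helper name find_horovod_names is hypothetical, and a match may come from a node name rather than an op type.

```python
import re
from pathlib import Path

# Heuristic pattern: any identifier that starts with "Horovod".
# This may match node names as well as op types.
HOROVOD_NAME = re.compile(rb"Horovod[A-Za-z]+")


def find_horovod_names(export_dir):
    """Return the sorted, de-duplicated Horovod-looking names in saved_model.pb."""
    pb_path = Path(export_dir) / "saved_model.pb"
    data = pb_path.read_bytes()
    return sorted({match.decode("ascii") for match in HOROVOD_NAME.findall(data)})
```

Calling find_horovod_names("test_space/hvd_saved_model_2") on the export produced by test.py should list HorovodAllreduce if the op was serialized; an empty list suggests the export contains no Horovod names.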

Note: Reverting to Horovod 0.26, or switching to tf.keras.optimizers.legacy, resolves this issue. However, we would like to keep using the latest Horovod with the new optimizers.
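
For reference, the legacy-optimizer workaround mentioned above might look like the sketch below. It assumes TensorFlow 2.11, where tf.keras.optimizers.legacy.Adam is available, and that hvd.init() has already been called as in test.py. The helper name build_legacy_distributed_adam is hypothetical and only used for illustration.

```python
def build_legacy_distributed_adam():
    """Wrap the legacy Keras Adam optimizer with Horovod (workaround sketch)."""
    # Imported lazily so the helper can be defined on machines
    # where TensorFlow or Horovod is not installed.
    import horovod.tensorflow as hvd
    import tensorflow as tf

    # Use the legacy optimizer implementation instead of
    # tf.keras.optimizers.Adam, which is the new default in TF 2.11.
    opt = tf.keras.optimizers.legacy.Adam()
    return hvd.DistributedOptimizer(opt)
```

In test.py, this would replace the two opt = ... assignments with opt = build_legacy_distributed_adam(), leaving the rest of the script unchanged.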

@supercharleszhu changed the title from "Saved model not portable with HorovodAllReduceOps" to "Tensorflow Saved model not portable with HorovodAllReduceOps" on Mar 11, 2024
@supercharleszhu changed the title from "Tensorflow Saved model not portable with HorovodAllReduceOps" to "Tensorflow Saved model not portable with HorovodAllReduce Ops" on Mar 11, 2024
@supercharleszhu changed the title from "Tensorflow Saved model not portable with HorovodAllReduce Ops" to "Tensorflow Saved model not portable with latest tf.keras.optimizers" on Mar 15, 2024