
apply_vocabulary lookup table initialization needs to be wrapped inside tf.init_scope #249

Open
EdwardCuiPeacock opened this issue Oct 19, 2021 · 5 comments

Comments

@EdwardCuiPeacock

EdwardCuiPeacock commented Oct 19, 2021

We recently encountered scalability issues when trying to apply vocabularies to multiple (5, to be exact) categorical features. We saw multiple lines of the following warning message:

WARNING:tensorflow:Tables initialized inside a tf.function will be re-initialized on every invocation of the function. This re-initialization can have significant impact on performance. Consider lifting them out of the graph context using `tf.init_scope`.

When using tft.apply_vocabulary, the job would get stuck on the transform steps for hours, consuming thousands of CPU hours if we did not kill it early.

Creating a custom lookup table initialization function like the following bypasses the problem; 80M rows of data took only 35 min, consuming ~20 CPU hours.

def create_file_lookup(filename):
    """Builds the vocabulary lookup table, lifted out of the tf.function graph."""
    # Wrapping the table creation in tf.init_scope lifts it out of the
    # function graph, so it is initialized once instead of on every invocation.
    with tf.init_scope():
        initializer = tf.lookup.TextFileInitializer(
            filename,
            key_dtype=tf.string,
            key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
            value_dtype=tf.int64,
            value_index=tf.lookup.TextFileIndex.LINE_NUMBER,
            value_index_offset=1,  # IDs start from 1; 0 is the default (OOV) value.
        )
        table = tf.lookup.StaticHashTable(initializer, 0)
    return table
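
For completeness, here is a minimal, hypothetical sketch of how such a table could be wired into tft.apply_vocabulary through its lookup_fn argument. The feature name 'tags', the helper names, and the assumption that lookup_fn takes (values, deferred vocab filename tensor) and returns (looked-up ids, table size) should be checked against the TFT version in use.

import tensorflow as tf
import tensorflow_transform as tft

def vocab_lookup_fn(values, deferred_vocab_filename_tensor):
    # create_file_lookup (above) lifts the table creation out of the graph
    # via tf.init_scope, so the table is built and initialized only once.
    table = create_file_lookup(deferred_vocab_filename_tensor)
    return table.lookup(values), table.size()

def preprocessing_fn(inputs):
    tokens = inputs['tags']  # hypothetical feature name
    deferred_vocab_file = tft.vocabulary(tokens, vocab_filename='tags')
    ids = tft.apply_vocabulary(tokens, deferred_vocab_file, lookup_fn=vocab_lookup_fn)
    return {'tag_ids': ids}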

Relevant code that needs to be addressed:

initializer = tf.lookup.TextFileInitializer(

This fix probably needs to be applied to TFT versions from 1.0 onward.

@cyc

cyc commented Oct 26, 2021

@EdwardCuiPeacock I don't quite understand how you are getting that warning message; it appears that tf.init_scope is already being used here:

stack.enter_context(tf.init_scope())

@varshaan
Contributor

As Chris commented above, we do enter the tf.init_scope in tf_utils. Are you passing a lookup_fn to tft.apply_vocabulary? If that is the case, you would need to lift the table creation (e.g. with tf.init_scope) inside that lookup_fn yourself, as TFT does not have access to the table creation code and cannot do this automatically.

Could you give me an example of what your calls to tft.vocabulary and tft.apply_vocabulary look like so I can take a look and see if we missed something?

@gfkeith

gfkeith commented Apr 5, 2022

I've encountered the same warning while using TFX. In my case the init_scope is not entered because the check

if isinstance(graph, func_graph.FuncGraph) and isinstance(
    asset_filepath, (ops.EagerTensor, str)):

fails, as asset_filepath is <class 'tensorflow.python.framework.ops.Tensor'>.

My usage is:

transformed = tft.compute_and_apply_vocabulary(
    input_tensor,
    frequency_threshold=100,
    num_oov_buckets=1,
    vocab_filename='tags'
)

It isn't a big issue for me as my vocabulary is small, so I haven't spent much time looking into it and don't have a proper MRE, but maybe that much is helpful.
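
To make that failure mode concrete, here is a standalone illustration (not TFT code; the file name is made up) of why a filename produced inside a traced function does not satisfy an (EagerTensor, str) check:

import tensorflow as tf

# Created eagerly: the class name is EagerTensor, so the check would pass.
eager_filename = tf.constant('vocab.txt')
print(type(eager_filename).__name__)  # EagerTensor

@tf.function
def trace_filename(filename):
    # During tracing, `filename` is a symbolic tf.Tensor rather than an
    # EagerTensor or a str, so an isinstance check against (EagerTensor, str)
    # fails, just like the asset_filepath reported above.
    print(type(filename).__name__)  # printed once at trace time, not EagerTensor
    return filename

trace_filename(eager_filename)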

@gfkeith

gfkeith commented Apr 5, 2022

I've just checked using this notebook: https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/tfx/components_keras.ipynb#scrollTo=jHfhth_GiZI9, and the warning appears there as well. It is possibly related to the preprocessing being wrapped with tf.function (which occurs in TFX Transform), so vocab_filename is not an ops.EagerTensor? It gets wrapped here-ish

vocab_filename_tensor = analyzer_nodes.wrap_as_tensor(vocab_filename_node)

into a tensor, so it is no longer a string either.

@IzakMaraisTAL

Similar to @gfkeith, I also get this warning when using tft.compute_and_apply_vocabulary() inside a TFX Transform component's preprocessing_fn.
