
scale_to_z_score_per_key should give caller control over OOV behavior #252

Open
cyc opened this issue Oct 26, 2021 · 5 comments
cyc commented Oct 26, 2021

Related to #220

Currently, if tft.scale_to_z_score_per_key is used and the key is OOV at inference time, the value is returned unscaled. Per the docs:

If the analysis dataset is empty, contains a single distinct value or the computed key vocabulary doesn't have an entry for key, then the input is returned without scaling.

But this may not be the desired behavior for OOV entries. In some use cases, I may want OOV keys to be mapped to 0, or to some large negative number. It seems fairly application-dependent. It would be good to give the caller control over this behavior.
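To make the request concrete, here is a minimal pure-Python sketch of the proposed semantics. This is not the real tft API: the `oov_default` parameter and the precomputed `stats` dict are illustrative stand-ins for what the analyzer would produce.

```python
def scale_to_z_score_per_key(values, keys, stats, oov_default=None):
    """Sketch of per-key z-scoring with caller-controlled OOV behavior.

    `stats` maps key -> (mean, stddev) computed over the analysis dataset.
    `oov_default` is the hypothetical new argument requested in this issue.
    """
    out = []
    for v, k in zip(values, keys):
        if k in stats:
            mean, std = stats[k]
            # Degenerate key (single distinct value): emit 0 rather than divide by 0.
            out.append((v - mean) / std if std > 0 else 0.0)
        elif oov_default is not None:
            out.append(oov_default)  # caller-controlled OOV behavior
        else:
            out.append(v)  # current tft behavior: pass through unscaled
    return out
```

With a default of 0.0, an OOV key would map to 0.0 instead of leaking the raw (possibly huge) value into the model.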

@pindinagesh pindinagesh self-assigned this Nov 2, 2021
pindinagesh commented:

@cyc

In order to expedite the troubleshooting process, please provide a code snippet to reproduce the issue reported here. Thanks.


cyc commented Nov 2, 2021

Sure:

"""Simple Example of tf.Transform usage."""

import pprint
import tempfile

import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import schema_utils


def main():
  def preprocessing_fn(inputs):
    """Preprocess input columns into transformed columns."""
    x = inputs['x']
    s = inputs['s']
    return {
        'scale_per_key': tft.scale_to_z_score_per_key(x=x, key=s)
    }

  training_data = [
      {'x': 1, 's': 'hello'},
      {'x': 2, 's': 'world'},
      {'x': 3, 's': 'hello'},
      {'x': 4, 's': 'world'},
  ]
  test_data = [
      {'x': 2, 's': 'hello'},
      {'x': 5, 's': 'world'},
      {'x': 1000000, 's': 'foo'},
  ]

  raw_data_metadata = dataset_metadata.DatasetMetadata(
      schema_utils.schema_from_feature_spec({
          's': tf.io.FixedLenFeature([], tf.string),
          'x': tf.io.FixedLenFeature([], tf.float32),
      }))

  with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (  # pylint: disable=unused-variable
        (training_data, raw_data_metadata) | tft_beam.AnalyzeAndTransformDataset(
            preprocessing_fn))
    transformed_test_dataset = (
        ((test_data, raw_data_metadata), transform_fn)
        | tft_beam.TransformDataset())

  transformed_data, transformed_metadata = transformed_dataset  # pylint: disable=unused-variable
  transformed_test_data, transformed_metadata = transformed_test_dataset

  pprint.pprint(transformed_data)
  pprint.pprint(transformed_test_data)

if __name__ == '__main__':
  main()

Output:

[{'scale_per_key': -1.0},
 {'scale_per_key': -1.0},
 {'scale_per_key': 1.0},
 {'scale_per_key': 1.0}]
[{'scale_per_key': 0.0}, {'scale_per_key': 2.0}, {'scale_per_key': 1000000.0}]

Note that this is not a bug; it is working as designed and as documented. My point is that it may not be desirable to simply pass through 1000000.0 for an OOV key.


zoyahav commented Nov 8, 2021

Thanks for the feedback. I actually thought this behaviour had already changed, but I see that was only the case for scale_by_min_max / scale_to_0_1 per key. Perhaps a short-term workaround is to switch to that type of scaling:

If the analysis dataset is empty, contains a single distinct value
or the computed key vocabulary doesn't have an entry for key, then x is
scaled using a sigmoid function.

For mean/var we assume 0 for OOV keys, but perhaps we should scale those too, as suggested here. @iindyk what do you think?
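The sigmoid fallback described above can be sketched in plain Python. This is illustrative only: the real tft.scale_by_min_max_per_key runs inside a TF graph, and the `ranges` dict stands in for the per-key min/max the analyzer would compute.

```python
import math

def scale_by_min_max_per_key(values, keys, ranges, output_min=0.0, output_max=1.0):
    """Sketch of per-key min-max scaling with a sigmoid fallback for OOV keys.

    `ranges` maps key -> (min, max) from the analysis dataset; OOV keys fall
    back to a sigmoid, mirroring the documented tft behavior quoted above.
    """
    out = []
    for v, k in zip(values, keys):
        if k in ranges:
            lo, hi = ranges[k]
            # Degenerate key (min == max): map to the interval midpoint.
            scaled = (v - lo) / (hi - lo) if hi > lo else 0.5
        else:
            scaled = 1.0 / (1.0 + math.exp(-v))  # sigmoid keeps OOV values bounded
        out.append(output_min + scaled * (output_max - output_min))
    return out
```

Unlike the z-score pass-through, an OOV value of 1000000.0 here would saturate near 1.0 instead of reaching the model unscaled.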


cyc commented Nov 8, 2021

For my purposes, allowing the user to specify a default value to impute for OOV keys would be preferable. However, I can understand not wanting to complicate the function by adding too many options. People with more specialized needs can always implement their own mapper that does what they want (which is what I ultimately had to do anyway).


iindyk commented Dec 3, 2021

Sigmoid made sense for scale_to_0_1 because the result is bounded to a specific interval, and there's no such interval for scale_to_z_score (though we could make one up). Specifying a default value does seem sound, but it adds to the list of args and may see rare usage. Moreover, we'd want to be consistent across all *_per_key mappers, so we'd need to add the arg to them all.

Also, adding the arg in a backwards-compatible manner will make it confusing (e.g. if a default value is set, use it; if not, scale in the existing way).
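The backwards-compatibility wrinkle is the classic sentinel-default problem: `None` alone cannot distinguish "no default supplied" (keep legacy pass-through) from an explicit default of 0 or None. A common Python pattern, shown here as an illustration rather than a tft proposal, is a module-level sentinel:

```python
# Sentinel distinguishes "arg omitted" from any real default value,
# so legacy pass-through behavior is preserved unless the caller opts in.
_NO_DEFAULT = object()

def scale_per_key(value, key, stats, oov_default=_NO_DEFAULT):
    """Hypothetical per-key z-scoring; `stats` maps key -> (mean, stddev)."""
    if key in stats:
        mean, std = stats[key]
        return (value - mean) / std if std > 0 else 0.0
    if oov_default is _NO_DEFAULT:
        return value  # existing behavior: pass through unscaled
    return oov_default  # caller opted in, even if they passed 0 or None
```

This keeps existing callers unchanged, at the cost of the "if set, use it; if not, scale the old way" branching the comment above warns about.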
