
scale_to_z_score_per_key should give caller control over OOV behavior #252

Open
cyc opened this issue Oct 26, 2021 · 5 comments
cyc commented Oct 26, 2021

Related to #220

Currently, if tft.scale_to_z_score_per_key is used and the key is OOV at inference time, the value is returned unscaled. Per the docs:

If the analysis dataset is empty, contains a single distinct value or the computed key vocabulary doesn't have an entry for key, then the input is returned without scaling.

But this may not be the desired behavior for OOV entries. In some use cases, I may want OOV keys to be mapped to 0, or to some large negative number. It seems fairly application-dependent. It would be good to give the caller control over this behavior.
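To make the request concrete, here is a minimal pure-Python sketch of the proposed semantics. This is not the real tft API: the `oov_default` parameter and the precomputed `stats` dict are illustrative stand-ins for what the analyzer would produce.

```python
def scale_to_z_score_per_key(values, keys, stats, oov_default=None):
    """Sketch of per-key z-scoring with caller-controlled OOV behavior.

    `stats` maps key -> (mean, stddev) computed over the analysis dataset.
    `oov_default` is the hypothetical new argument requested in this issue.
    """
    out = []
    for v, k in zip(values, keys):
        if k in stats:
            mean, std = stats[k]
            # Degenerate key (single distinct value): emit 0 rather than divide by 0.
            out.append((v - mean) / std if std > 0 else 0.0)
        elif oov_default is not None:
            out.append(oov_default)  # caller-controlled OOV behavior
        else:
            out.append(v)  # current tft behavior: pass through unscaled
    return out
```

With a default of 0.0, an OOV key would map to 0.0 instead of leaking the raw (possibly huge) value into the model.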

@pindinagesh pindinagesh self-assigned this Nov 2, 2021
pindinagesh commented:

@cyc

In order to expedite the troubleshooting process, please provide a code snippet to reproduce the issue reported here. Thanks.


cyc commented Nov 2, 2021

Sure:

"""Simple Example of tf.Transform usage."""

import pprint
import tempfile

import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import schema_utils


def main():
  def preprocessing_fn(inputs):
    """Preprocess input columns into transformed columns."""
    x = inputs['x']
    s = inputs['s']
    return {
        'scale_per_key': tft.scale_to_z_score_per_key(x=x, key=s)
    }

  training_data = [
      {'x': 1, 's': 'hello'},
      {'x': 2, 's': 'world'},
      {'x': 3, 's': 'hello'},
      {'x': 4, 's': 'world'},
  ]
  test_data = [
      {'x': 2, 's': 'hello'},
      {'x': 5, 's': 'world'},
      {'x': 1000000, 's': 'foo'},
  ]

  raw_data_metadata = dataset_metadata.DatasetMetadata(
      schema_utils.schema_from_feature_spec({
          's': tf.io.FixedLenFeature([], tf.string),
          'x': tf.io.FixedLenFeature([], tf.float32),
      }))

  with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (  # pylint: disable=unused-variable
        (training_data, raw_data_metadata) | tft_beam.AnalyzeAndTransformDataset(
            preprocessing_fn))
    transformed_test_dataset = (
        ((test_data, raw_data_metadata), transform_fn)
        | tft_beam.TransformDataset())

  transformed_data, transformed_metadata = transformed_dataset  # pylint: disable=unused-variable
  transformed_test_data, transformed_metadata = transformed_test_dataset

  pprint.pprint(transformed_data)
  pprint.pprint(transformed_test_data)

if __name__ == '__main__':
  main()

Output:

[{'scale_per_key': -1.0},
 {'scale_per_key': -1.0},
 {'scale_per_key': 1.0},
 {'scale_per_key': 1.0}]
[{'scale_per_key': 0.0}, {'scale_per_key': 2.0}, {'scale_per_key': 1000000.0}]

Note that this is not a bug; it is working as designed and as documented. My point is that it may not be desirable to simply pass through 1000000.0 for an OOV key.


zoyahav commented Nov 8, 2021

Thanks for the feedback. I actually thought this behaviour had already changed, but I see that was only the case for scale_by_min_max / scale_to_0_1 per key. Perhaps a short-term workaround is to switch to that type of scaling:

If the analysis dataset is empty, contains a single distinct value
or the computed key vocabulary doesn't have an entry for key, then x is
scaled using a sigmoid function.

For mean/var we assume 0 for OOV keys, but perhaps we should scale those too, as suggested here. @iindyk what do you think?
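The sigmoid fallback described above can be sketched in plain Python. This is illustrative only: the real tft.scale_by_min_max_per_key runs inside a TF graph, and the `ranges` dict stands in for the per-key min/max the analyzer would compute.

```python
import math

def scale_by_min_max_per_key(values, keys, ranges, output_min=0.0, output_max=1.0):
    """Sketch of per-key min-max scaling with a sigmoid fallback for OOV keys.

    `ranges` maps key -> (min, max) from the analysis dataset; OOV keys fall
    back to a sigmoid, mirroring the documented tft behavior quoted above.
    """
    out = []
    for v, k in zip(values, keys):
        if k in ranges:
            lo, hi = ranges[k]
            # Degenerate key (min == max): map to the interval midpoint.
            scaled = (v - lo) / (hi - lo) if hi > lo else 0.5
        else:
            scaled = 1.0 / (1.0 + math.exp(-v))  # sigmoid keeps OOV values bounded
        out.append(output_min + scaled * (output_max - output_min))
    return out
```

Unlike the z-score pass-through, an OOV value of 1000000.0 here would saturate near 1.0 instead of reaching the model unscaled.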


cyc commented Nov 8, 2021

For my purposes, allowing the user to specify a default value to impute for OOV keys would be preferable. However, I can understand not wanting to complicate the function by adding too many options. People with more specialized needs can always implement their own mapper that does what they want (which is what I ultimately had to do anyway).


iindyk commented Dec 3, 2021

Sigmoid made sense for scale_to_0_1 because the result is bounded to a specific interval, and there's no such interval for scale_to_z_score (though we could make one up). Specifying a default value does seem sound, but it adds to the list of args and may see rare usage. Moreover, we'd want to be consistent across all *_per_key mappers, so we'd need to add the arg to them all.

Also, adding the arg in a backwards-compatible manner will make it confusing (e.g. if a default value is set, use it; if not, scale in the existing way).
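The backwards-compatibility wrinkle is the classic sentinel-default problem: `None` alone cannot distinguish "no default supplied" (keep legacy pass-through) from an explicit default of 0 or None. A common Python pattern, shown here as an illustration rather than a tft proposal, is a module-level sentinel:

```python
# Sentinel distinguishes "arg omitted" from any real default value,
# so legacy pass-through behavior is preserved unless the caller opts in.
_NO_DEFAULT = object()

def scale_per_key(value, key, stats, oov_default=_NO_DEFAULT):
    """Hypothetical per-key z-scoring; `stats` maps key -> (mean, stddev)."""
    if key in stats:
        mean, std = stats[key]
        return (value - mean) / std if std > 0 else 0.0
    if oov_default is _NO_DEFAULT:
        return value  # existing behavior: pass through unscaled
    return oov_default  # caller opted in, even if they passed 0 or None
```

This keeps existing callers unchanged, at the cost of the "if set, use it; if not, scale the old way" branching the comment above warns about.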
