[QST] TypeError: unhashable type: 'numpy.ndarray' #1871

Open
dking21st opened this issue Nov 21, 2023 · 4 comments
Labels
question Further information is requested

Comments

@dking21st

What is your question?

I'm trying to use Merlin to build a two-tower NN model. However, when I use an NVTabular workflow to fit my dataset, it raises an error.

import nvtabular as nvt
from nvtabular.ops import HashBucket, TagAsUserFeatures

user_features = (
    ["user_history_1", "user_history_2", "user_gender", "user_age",
     "platform", "object_section", "hour"]
    >> HashBucket({"user_history_1": 500000, "user_history_2": 100000,
                   "user_gender": 3, "user_age": 10, "platform": 3,
                   "object_section": 6, "hour": 24})
    >> TagAsUserFeatures()
)

# user_id, item_id, item_hash_features, and item_dense_features are
# feature groups defined earlier in the notebook
outputs = user_id + item_id + item_hash_features + item_dense_features + user_features
workflow = nvt.Workflow(outputs)
train_dataset = nvt.Dataset(train_data)
workflow.fit(train_dataset)

Calling the fit method raises an error:

TypeError                                 Traceback (most recent call last)
Cell In[18], line 1
----> 1 workflow.fit(train_dataset)

TypeError: unhashable type: 'numpy.ndarray'

Only two features, user_history_1 and user_history_2, are numpy arrays; each contains the item IDs the user has visited, e.g.
[1705022, 1806090, 1801039, 1005001]

When I exclude user_history_1 and user_history_2 from the input features, the fit method succeeds, so I suspect these two features are the cause of the error.

Since the error says numpy.ndarray is unhashable, I converted the arrays to lists, but I still see the same error message.
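
For reference, the conversion I tried looks roughly like this (a sketch; train_data here is the raw pandas DataFrame before wrapping it in nvt.Dataset):

for col in ["user_history_1", "user_history_2"]:
    # replace each numpy array cell with a plain Python list
    train_data[col] = train_data[col].apply(list)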

Does anyone have a suggestion for debugging?

dking21st added the question label on Nov 21, 2023
@piojanu

piojanu commented Nov 22, 2023

I observe a similar problem in Categorify with the hashing of infrequent items. Here is a minimal example:

import nvtabular as nvt
import pandas as pd

# list column; the value 3 occurs only once, so it falls below freq_threshold
df = pd.DataFrame({"items": [[1, 2, 3], [1, 2], [1, 2, 4, 4]]})
dataset = nvt.Dataset(df)

feats = [
    "items",
] >> nvt.ops.Categorify(
    freq_threshold=2,  # categories seen fewer than 2 times are "infrequent"
    num_buckets=10,    # infrequent categories are hashed into 10 buckets
)

workflow = nvt.Workflow(feats)
processed_ds = workflow.fit_transform(dataset)

Error:

File .../lib/python3.10/site-packages/nvtabular/ops/categorify.py:510, in Categorify.transform(self, col_selector, df)
    508 path = self.categories[storage_name]
--> 510 encoded = _encode(
    511     use_name,
    512     storage_name,
    513     path,
    514     df,
    515     self.cat_cache,
    516     freq_threshold=self.freq_threshold[name]
    517     if isinstance(self.freq_threshold, dict)
    518     else self.freq_threshold,
    519     search_sorted=self.search_sorted,
    520     buckets=self.num_buckets,
    521     encode_type=self.encode_type,
    522     cat_names=column_names,
    523     max_size=self.max_size,
    524     dtype=self.output_dtype,
    525     split_out=(
    526         self.split_out.get(storage_name, 1)
    527         if isinstance(self.split_out, dict)
    528         else self.split_out
    529     ),
    530     single_table=self.single_table,
    531 )
    532 new_df[name] = encoded

File .../lib/python3.10/site-packages/nvtabular/ops/categorify.py:1717, in _encode(name, storage_name, path, df, cat_cache, freq_threshold, search_sorted, buckets, encode_type, cat_names, max_size, dtype, split_out, single_table)
   1714 if buckets and storage_name in buckets:
   1715     # apply hashing for "infrequent" categories
   1716     indistinct = (
-> 1717         _hash_bucket(df, buckets, selection_l.names, encode_type=encode_type)
   1718         + bucket_encoding_offset
   1719     )
   1721     if use_collection:
   1722         # Manual broadcast merge

File .../lib/python3.10/site-packages/nvtabular/ops/categorify.py:1844, in _hash_bucket(df, num_buckets, col, encode_type)
   1843     nb = num_buckets[col[0]]
-> 1844     encoded = dispatch.hash_series(df[col[0]]) % nb
   1845 elif encode_type == "combo":

File .../lib/python3.10/site-packages/merlin/core/dispatch.py:294, in hash_series(ser)
    288 if isinstance(ser, pd.Series):
    289     # Using pandas hashing, which does not produce the
    290     # same result as cudf.Series.hash_values().  Do not
    291     # expect hash-based data transformations to be the
    292     # same on CPU and GPU.  TODO: Fix this (maybe use
    293     # murmurhash3 manually on CPU).
--> 294     return hash_object_dispatch(ser).values
    295 elif cudf and isinstance(ser, cudf.Series):

File .../lib/python3.10/site-packages/dask/utils.py:642, in Dispatch.__call__(self, arg, *args, **kwargs)
    641 meth = self.dispatch(type(arg))
--> 642 return meth(arg, *args, **kwargs)

File .../lib/python3.10/site-packages/dask/dataframe/backends.py:502, in hash_object_pandas(obj, index, encoding, hash_key, categorize)
    498 @hash_object_dispatch.register((pd.DataFrame, pd.Series, pd.Index))
    499 def hash_object_pandas(
    500     obj, index=True, encoding="utf8", hash_key=None, categorize=True
    501 ):
--> 502     return pd.util.hash_pandas_object(
    503         obj, index=index, encoding=encoding, hash_key=hash_key, categorize=categorize
    504     )

File .../lib/python3.10/site-packages/pandas/core/util/hashing.py:126, in hash_pandas_object(obj, index, encoding, hash_key, categorize)
    125 elif isinstance(obj, ABCSeries):
--> 126     h = hash_array(obj._values, encoding, hash_key, categorize).astype(
    127         "uint64", copy=False
    128     )
    129     if index:

File .../lib/python3.10/site-packages/pandas/core/util/hashing.py:308, in hash_array(vals, encoding, hash_key, categorize)
    303     raise TypeError(
    304         "hash_array requires np.ndarray or ExtensionArray, not "
    305         f"{type(vals).__name__}. Use hash_pandas_object instead."
    306     )
--> 308 return _hash_ndarray(vals, encoding, hash_key, categorize)

File .../lib/python3.10/site-packages/pandas/core/util/hashing.py:346, in _hash_ndarray(vals, encoding, hash_key, categorize)
    340 from pandas import (
    341     Categorical,
    342     Index,
    343     factorize,
    344 )
--> 346 codes, categories = factorize(vals, sort=False)
    347 cat = Categorical(
    348     codes, Index._with_infer(categories), ordered=False, fastpath=True
    349 )

File .../lib/python3.10/site-packages/pandas/core/algorithms.py:822, in factorize(values, sort, na_sentinel, use_na_sentinel, size_hint)
    820             values = np.where(null_mask, na_value, values)
--> 822     codes, uniques = factorize_array(
    823         values,
    824         na_sentinel=na_sentinel_arg,
    825         size_hint=size_hint,
    826     )
    828 if sort and len(uniques) > 0:

File .../lib/python3.10/site-packages/pandas/core/algorithms.py:578, in factorize_array(values, na_sentinel, size_hint, na_value, mask)
    577 table = hash_klass(size_hint or len(values))
--> 578 uniques, codes = table.factorize(
    579     values,
    580     na_sentinel=na_sentinel,
    581     na_value=na_value,
    582     mask=mask,
    583     ignore_na=ignore_na,
    584 )
    586 # re-cast e.g. i8->dt64/td64, uint8->bool

File pandas/_libs/hashtable_class_helper.pxi:5943, in pandas._libs.hashtable.PyObjectHashTable.factorize()

File pandas/_libs/hashtable_class_helper.pxi:5857, in pandas._libs.hashtable.PyObjectHashTable._unique()

TypeError: unhashable type: 'list'

Do you think this should be a bug report?

EDIT:

  • NVT version: 23.08.00
  • from the docker: nvcr.io/nvidia/merlin/merlin-pytorch:23.08

@dking21st (Author)

The solution that worked for me: although a GPU was present, my notebook was running on the CPU, which forced NVTabular into CPU mode. After installing RAPIDS (https://docs.rapids.ai/install#pip) and restarting the kernel, NVTabular started picking up the GPU automatically and the issue was resolved.
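
If it helps others, here is the sanity check I would run (a sketch; the key point is that NVTabular falls back to CPU mode whenever cuDF cannot be imported, and HAS_GPU is the flag merlin-core exposes):

try:
    import cudf  # without cuDF, NVTabular falls back to pandas (CPU mode)
    print("cuDF version:", cudf.__version__)
except ImportError as err:
    print("cuDF not importable; NVTabular will run on CPU:", err)

from merlin.core.dispatch import HAS_GPU
print("Merlin sees a GPU:", HAS_GPU)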

@piojanu

piojanu commented Nov 25, 2023

I can confirm that my code errors out only on the CPU as well; on the GPU it works fine. Still, this is a bug.

dking21st reopened this on Dec 13, 2023
@dking21st (Author)

OK, this error is occurring again, even after installing RAPIDS... can someone help?

I ran the following code to check whether a GPU exists:

import tensorflow as tf

device_name = tf.test.gpu_device_name()
if len(device_name) > 0:
    print("Found GPU at: {}".format(device_name))
else:
    device_name = "/device:CPU:0"
    print("No GPU, using {}.".format(device_name))

and it returns

Found GPU at: /device:GPU:0

So TensorFlow sees the GPU, but the NVTabular Dataset keeps falling back to the CPU.
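
Is there a way to force GPU mode? I see that nvt.Dataset accepts a cpu flag; a sketch of what I mean (assuming cuDF is installed, this should either use the GPU or fail loudly instead of silently dropping to pandas):

import nvtabular as nvt

# cpu=False explicitly requests a cuDF-backed (GPU) dataset rather than
# letting the library auto-detect and silently fall back to pandas
train_dataset = nvt.Dataset(train_data, cpu=False)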
