[BUG] ops.Categorify frequency hashing raises RuntimeError when the dataset is shuffled by keys #1864

Open
piojanu opened this issue Sep 20, 2023 · 0 comments
Labels
bug Something isn't working

Describe the bug
ops.Categorify raises ValueError: Column must have no nulls. during workflow.fit when frequency hashing is enabled (freq_threshold set together with num_buckets > 1) and the dataset is shuffled by keys. EDIT: The full error message: https://pastebin.com/GJRQhxAi

Steps/Code to reproduce bug

import gc

import dask.dataframe as dd
import numpy as np
import pandas as pd

import nvtabular as nvt

# Generate synthetic data
N_ROWS = 100_000_000
CHUNK_SIZE = 10_000_000

N = N_ROWS // CHUNK_SIZE
dataframes = []
for i in range(N):
    print(f"{i+1}/{N}")
    chunk_data = np.random.lognormal(3., 10., int(CHUNK_SIZE)).astype(np.int32)
    chunk_ddf = dd.from_pandas(pd.DataFrame({'session_id': (chunk_data // 45), 'item_id': chunk_data}), npartitions=1)
    dataframes.append(chunk_ddf)

ddf = dd.concat(dataframes, axis=0)
del dataframes
gc.collect()

# !!! When `shuffle_by_keys` is commented out, the code finishes successfully
dataset = nvt.Dataset(ddf).shuffle_by_keys(keys=["session_id"])

_categorical_feats = [
    "item_id",
] >> nvt.ops.Categorify(
    freq_threshold=5,
    # !!! When `num_buckets=None`, the code finishes successfully
    num_buckets=100,
)

workflow = nvt.Workflow(_categorical_feats)
workflow.fit(dataset)
workflow.output_schema

Expected behavior
ops.Categorify should fit without errors when num_buckets > 1 and the dataset is shuffled by keys.
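
For reference, the encoding expected from frequency hashing can be sketched in plain pandas. This is only an illustration under stated assumptions, not NVTabular's implementation: the function name frequency_hash_encode, the id layout (hash buckets first, frequent categories after them), and the use of pandas.util.hash_array as the hash are all made up for this example.

```python
import numpy as np
import pandas as pd


def frequency_hash_encode(values: pd.Series, freq_threshold: int, num_buckets: int) -> pd.Series:
    """Hypothetical illustration of frequency hashing (NOT NVTabular's code)."""
    counts = values.value_counts()
    # Categories seen at least `freq_threshold` times keep a unique id.
    frequent = counts[counts >= freq_threshold].index
    # Assumed layout: ids 0..num_buckets-1 are hash buckets, unique ids follow.
    vocab = {cat: num_buckets + i for i, cat in enumerate(sorted(frequent))}

    # Every value gets a bucket candidate; it is only used for infrequent ones.
    hashed = (pd.util.hash_array(values.to_numpy()) % num_buckets).astype("int64")
    mapped = values.map(vocab)  # NaN where the category is infrequent
    is_frequent = mapped.notna().to_numpy()
    unique_ids = mapped.fillna(0).to_numpy().astype("int64")

    encoded = np.where(is_frequent, unique_ids, hashed)
    return pd.Series(encoded, index=values.index, name=values.name)


# Items 7 and 8 appear five times each; items 1, 2, 3 appear once.
items = pd.Series([7] * 5 + [8] * 5 + [1, 2, 3], name="item_id")
encoded = frequency_hash_encode(items, freq_threshold=5, num_buckets=4)
```

In this sketch every input row receives an integer id, so the encoded column contains no nulls regardless of how the rows are partitioned. That is the property the fit step appears to be missing after shuffle_by_keys.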

Environment details:

  • Environment location: JupyterLab in Docker on GCP
  • Method of NVTabular install: Docker

My Dockerfile:

# AFTER https://github.com/GoogleCloudPlatform/nvidia-merlin-on-vertex-ai
FROM nvcr.io/nvidia/merlin/merlin-pytorch:23.08

# Install Google Cloud SDK
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" \
        | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list \
    && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg \
        | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - \
    && apt-get update -y \
    && apt-get install google-cloud-sdk -y

# Copy your project to the Docker image
COPY . /project
WORKDIR /project

# Install Python dependencies
RUN pip install -U pip
RUN pip install -r requirements/base.txt

# Run Jupyter Lab by default, with no authentication, on port 8080
EXPOSE 8080
CMD ["jupyter-lab", "--allow-root", "--ip=0.0.0.0", "--port=8080", "--no-browser", "--NotebookApp.token=''", "--NotebookApp.allow_origin='*'"]

Additional context
I need to call shuffle_by_keys because I apply a GroupBy operation afterwards, and that requires all rows of a session to be in the same partition.
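
To illustrate why the shuffle matters for a per-partition GroupBy, here is a minimal pandas-only sketch. The helper shuffle_by_key is hypothetical, and routing rows by hash(key) % n_out is an assumption made for the example; it does not claim to mirror how NVTabular or Dask implement the shuffle.

```python
import pandas as pd


def shuffle_by_key(partitions, key, n_out):
    """Hypothetical helper: send every row to partition hash(key) % n_out."""
    df = pd.concat(partitions, ignore_index=True)
    target = pd.util.hash_array(df[key].to_numpy()) % n_out
    return [df[target == i] for i in range(n_out)]


# Sessions 1 and 2 are split across the two input partitions.
parts = [
    pd.DataFrame({"session_id": [1, 1, 2], "item_id": [10, 11, 20]}),
    pd.DataFrame({"session_id": [1, 3, 2], "item_id": [12, 30, 21]}),
]

shuffled = shuffle_by_key(parts, "session_id", n_out=2)

# After the shuffle, a per-partition groupby sees each session exactly once.
per_partition = [p.groupby("session_id")["item_id"].count() for p in shuffled]
```

Without the shuffle, grouping each input partition separately would produce two partial rows for sessions 1 and 2, which is why dropping shuffle_by_keys is not a usable workaround here.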
