Differentiate between NaN and null in the viewer #2828

Open
2 tasks
polinaeterna opened this issue May 17, 2024 · 6 comments

@polinaeterna
Contributor

polinaeterna commented May 17, 2024

Currently we don't do this: in both cases we display null and return null in the response.
In the discussion in #2797, it was agreed that it's important to let users know how to correctly treat data with these values.
This would require:

  • A change in how we transform Parquet into the /first-rows and /rows responses. I haven't figured out where exactly, but apparently NaN values are somehow replaced with null.
  • A change in the response structure and field names in /statistics: for float columns, add a nan_count field; for other columns, rename nan_count to null_count :/// (my bad with the original naming)
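
To make the second point concrete, here is a minimal sketch of counting the two kinds of values separately for a float column. The function name `count_nan_and_null` and the plain-list input are assumptions for illustration only; the actual /statistics computation is not shown in this thread and may be structured quite differently.

```python
import math
from typing import Optional, Sequence


def count_nan_and_null(values: Sequence[Optional[float]]) -> dict[str, int]:
    """Count missing (None) and NaN entries separately for a float column.

    Hypothetical helper, not part of the dataset viewer code base.
    """
    # A null is an absent value; a NaN is a present float that is "not a number".
    null_count = sum(1 for v in values if v is None)
    nan_count = sum(1 for v in values if v is not None and math.isnan(v))
    return {"nan_count": nan_count, "null_count": null_count}


column = [1.5, float("nan"), None, 2.0, None]
stats = count_nan_and_null(column)
print(stats)
```

Keeping the two counters apart is what would let a client tell "no value stored" from "a float was stored, but it is NaN".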
@polinaeterna polinaeterna self-assigned this May 17, 2024
@polinaeterna
Contributor Author

So apparently it's just that orjson serializes float("nan") as null, so it doesn't differentiate between NaN and null:

>>> import orjson
>>> orjson.dumps([float("nan"), None])
b'[null,null]'

and there is no option to change this. By comparison, json.dumps() serializes NaN as a dedicated NaN token by default, whereas orjson is strictly JSON-conformant here.
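
For comparison, a small sketch of the standard-library behavior mentioned above, using only `json` so it runs without orjson installed. The variable names are illustrative.

```python
import json

values = [float("nan"), None]

# By default the standard library emits the token NaN, which is not valid
# JSON per the specification but keeps NaN distinguishable from null.
lenient = json.dumps(values)
print(lenient)

# With allow_nan=False the encoder refuses to serialize NaN at all.
try:
    json.dumps(values, allow_nan=False)
    strict_error = None
except ValueError as err:
    strict_error = err
print(type(strict_error).__name__)
```

So the standard library offers either a non-standard token or an error, while orjson silently maps NaN to null; none of these produces a strictly valid JSON document that still distinguishes the two.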

I don't see an easy solution here, do you have any ideas @huggingface/dataset-viewer ?

@severo
Collaborator

severo commented May 27, 2024

Isn't it possible to override this behavior here?

import base64
from typing import Any

import pandas as pd


def orjson_default(obj: Any) -> Any:
    if isinstance(obj, bytes):
        # see https://stackoverflow.com/a/40000564/7351594 for example
        # the bytes are encoded with base64, and then decoded as utf-8
        # (ascii only, by the way) to get a string
        return base64.b64encode(obj).decode("utf-8")
    if isinstance(obj, pd.Timestamp):
        return obj.to_pydatetime()
    # fallback: any other unsupported type is serialized as its string form
    return str(obj)

@albertvillanova
Member

albertvillanova commented May 29, 2024

I am afraid the approach above will not work...

Note that float("nan") is an instance of float, which is a type natively supported by orjson. Natively supported types are never passed to the default function...
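
The same dispatch rule can be observed with the standard-library `json` module, used here as a stand-in for orjson so the sketch runs without third-party packages. That the two libraries behave alike on this point is an assumption based on the comment above; `default_hook` and `seen` are illustrative names.

```python
import json
from typing import Any

seen: list[str] = []


def default_hook(obj: Any) -> Any:
    # Record every object the encoder hands over to the fallback hook.
    seen.append(type(obj).__name__)
    return str(obj)


payload = [float("nan"), None, b"abc"]
encoded = json.dumps(payload, default=default_hook)
print(encoded)
print(seen)
```

Only the bytes object reaches the hook. The float NaN is handled by the encoder's built-in float path, so a `default` function never gets a chance to rewrite it.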

@polinaeterna
Contributor Author

Yes, I didn't manage to make it work. I think it's not possible, and that this is intentional. This is from orjson's README:

has strict JSON conformance in not supporting Nan/Infinity/-Infinity

@severo
Collaborator

severo commented May 29, 2024

Should we use ujson instead of orjson, as datasets does?

@severo
Collaborator

severo commented May 31, 2024

Also, from the pyarrow docs: https://arrow.apache.org/docs/python/data.html#none-values-and-nan-handling

None values and NAN handling

As mentioned in the above section, the Python object None is always converted to an Arrow null element on the conversion to pyarrow.Array. For the float NaN value which is either represented by the Python object float('nan') or numpy.nan we normally convert it to a valid float value during the conversion. If an integer input is supplied to pyarrow.array that contains np.nan, ValueError is raised.

To handle better compatibility with Pandas, we support interpreting NaN values as null elements. This is enabled automatically on all from_pandas function and can be enabled on the other conversion functions by passing from_pandas=True as a function parameter.
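
The two conversion policies quoted above can be modeled in plain Python. This is only a sketch of the semantics, not pyarrow's implementation: `to_nullable` and its `nan_as_null` flag are hypothetical names standing in for the default conversion and for the `from_pandas=True` behavior respectively.

```python
import math
from typing import Optional


def to_nullable(
    values: list[Optional[float]], nan_as_null: bool = False
) -> list[Optional[float]]:
    """Model the two policies: keep NaN as a valid float, or map it to null."""
    if not nan_as_null:
        # Default policy: None is null, NaN stays a valid float value.
        return list(values)
    # Pandas-compatible policy: NaN is also interpreted as null.
    return [None if v is not None and math.isnan(v) else v for v in values]


data = [1.0, float("nan"), None]
default_nulls = sum(v is None for v in to_nullable(data))
pandas_like_nulls = sum(v is None for v in to_nullable(data, nan_as_null=True))
print(default_nulls, pandas_like_nulls)
```

Under the default policy only the None entry counts as null, while under the pandas-compatible policy the NaN does too. If data passes through the second policy anywhere upstream, the NaN/null distinction is lost before serialization even starts.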
