Differentiate between `NaN` and `null` in the viewer #2828

polinaeterna · 2024-05-17T12:05:47Z

Currently, we don't do this: in both cases we display null in the viewer and return null in the response.
In the discussion in #2797, we agreed that it's important to let users know how to correctly treat data with these values.
This would require:

- A change in how we transform Parquet data into the /first-rows and /rows responses. I haven't figured out where exactly, but apparently NaN values are somehow replaced with null.
- A change in the response structure and field names in /statistics: for float columns, add a nan_count field; for other columns, rename nan_count to null_count :/// (my bad with the original naming)
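
As a rough illustration of the proposed /statistics split, the sketch below counts nulls and NaNs separately for a float column. The function name `count_missing` and the shape of the returned dict are assumptions made for illustration, not the actual dataset-viewer code; only the field names `null_count` and `nan_count` come from the proposal above.

```python
import math
from typing import Optional, Sequence


def count_missing(values: Sequence[Optional[float]]) -> dict[str, int]:
    """Count missing entries (None) and NaN floats as two separate numbers."""
    # A null is an absent value; a NaN is a present float that is "not a number".
    null_count = sum(1 for v in values if v is None)
    nan_count = sum(1 for v in values if isinstance(v, float) and math.isnan(v))
    return {"null_count": null_count, "nan_count": nan_count}


column = [1.5, None, float("nan"), 2.0, float("nan")]
stats = count_missing(column)
print(stats)
```

For non-float columns, `nan_count` would always be 0 under this sketch, which matches the idea of reporting only `null_count` for them.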


polinaeterna · 2024-05-27T14:40:02Z

So apparently it's just that orjson serializes float("nan") as null, so it doesn't differentiate between NaN and null:

>>> orjson.dumps([float("nan"), None])
b'[null,null]'

and there is no option to make it behave differently. By comparison, json.dumps() does serialize NaN as a dedicated value, but orjson is strictly JSON conformant in this respect.
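
For comparison, here is a minimal sketch of the standard-library behavior mentioned above, using only `json` from the standard library. The variable names are illustrative. By default `json.dumps()` emits a `NaN` token (which is not valid under the JSON specification), and with `allow_nan=False` it refuses to serialize the value at all.

```python
import json
import math

values = [float("nan"), None]

# Default behavior: NaN is written as the non-standard token NaN.
lenient = json.dumps(values)
print(lenient)

# The standard-library decoder accepts that token again, so the
# distinction between NaN and null survives a round trip.
decoded = json.loads(lenient)

# Strict mode: NaN is rejected instead of being silently turned into null.
try:
    json.dumps(values, allow_nan=False)
    strict_error = None
except ValueError as err:
    strict_error = err
```

Note that strict JSON parsers may reject the `NaN` token, so the lenient output is only safe when the consumer is known to accept it.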

I don't see an easy solution here, do you have any ideas @huggingface/dataset-viewer ?

severo · 2024-05-27T15:26:11Z

it's not possible to override this behavior here?

dataset-viewer/libs/libcommon/src/libcommon/utils.py

Lines 24 to 32 in b2c7c36

def orjson_default(obj: Any) -> Any:
    if isinstance(obj, bytes):
        # see https://stackoverflow.com/a/40000564/7351594 for example
        # the bytes are encoded with base64, and then decoded as utf-8
        # (ascii only, by the way) to get a string
        return base64.b64encode(obj).decode("utf-8")
    if isinstance(obj, pd.Timestamp):
        return obj.to_pydatetime()
    return str(obj)

albertvillanova · 2024-05-29T05:37:20Z

I am afraid the approach above will not work...

Note that float("nan") is an instance of float, which is a supported type by orjson. Supported types are not passed through the default function...
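
Since the `default` hook never sees floats, one possible workaround is to rewrite the data before it reaches the serializer. The sketch below is only an illustration of that idea, not something proposed in this thread: the helper `mark_nans` and the `"NaN"` string sentinel are hypothetical, and the standard-library `json` module stands in for orjson so the example stays self-contained.

```python
import json
import math
from typing import Any

# Hypothetical sentinel; any convention agreed with API consumers would do.
NAN_MARKER = "NaN"


def mark_nans(obj: Any) -> Any:
    """Recursively replace float NaN values with a string sentinel."""
    if isinstance(obj, float) and math.isnan(obj):
        return NAN_MARKER
    if isinstance(obj, dict):
        return {key: mark_nans(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [mark_nans(value) for value in obj]
    return obj


row = {"score": float("nan"), "label": None, "values": [1.0, float("nan")]}

# allow_nan=False mimics a strictly conformant serializer: it would raise
# if any NaN were still present after preprocessing.
encoded = json.dumps(mark_nans(row), allow_nan=False)
print(encoded)
```

The trade-offs are an extra pass over every row and a sentinel that is ambiguous in string columns, so this would only make sense for columns known to be floats.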

polinaeterna · 2024-05-29T12:31:04Z

yes, i didn't manage to make it work. i think it's not possible and this is intentional, this is from orjson's readme:

has strict JSON conformance in not supporting Nan/Infinity/-Infinity

severo · 2024-05-29T20:42:39Z

should we use ujson instead of orjson as in datasets?

severo · 2024-05-31T12:36:25Z

Also, in pyarrow doc: https://arrow.apache.org/docs/python/data.html#none-values-and-nan-handling

None values and NAN handling

As mentioned in the above section, the Python object None is always converted to an Arrow null element on the conversion to pyarrow.Array. For the float NaN value which is either represented by the Python object float('nan') or numpy.nan we normally convert it to a valid float value during the conversion. If an integer input is supplied to pyarrow.array that contains np.nan, ValueError is raised.

To handle better compatibility with Pandas, we support interpreting NaN values as null elements. This is enabled automatically on all from_pandas function and can be enabled on the other conversion functions by passing from_pandas=True as a function parameter.

polinaeterna self-assigned this May 17, 2024
