[NeurIPS] How to express data in other binary formats? #679

gcr · 2024-06-05T17:33:54Z

Hi! My colleagues and I are putting together a dataset full of encoded depth maps. These need to be encoded with way more bits than default image data loaders support.

After some technical discussion of various trade-offs, we chose to encode our depth maps as gzipped arrays containing 16-bit floating point numbers (little endian, C-order), where the first two numbers are the (integer) height and width of the map, and the rest are the 2D depth map, like this:

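import gzip
import numpy as np

record = "path/to/record.fp16le.gzip"
with gzip.open(record) as f:
    raw = f.read()
arr = np.frombuffer(raw, dtype="<f2")  # little-endian float16
h, w = arr[:2].astype("uint32")  # first two values: height, width
depth_map = arr[2:].reshape((h, w))  # remaining values: C-order depth map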
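For readers who want to produce files in this layout, here is a minimal round-trip sketch. The function names `encode_depth_map` and `decode_depth_map` are hypothetical, not part of our released loaders. One assumption worth flagging: because height and width are themselves stored as float16 values, only integers up to 2048 are guaranteed to survive exactly, so the sketch rejects larger maps rather than silently rounding.

```python
import gzip

import numpy as np

# float16 has an 11-bit significand, so every integer up to 2048 is exact.
MAX_EXACT_FLOAT16_INT = 2048


def encode_depth_map(depth_map: np.ndarray) -> bytes:
    """Pack a 2D depth map as gzipped little-endian float16: [h, w, values...]."""
    h, w = depth_map.shape  # raises if the array is not 2D
    if max(h, w) > MAX_EXACT_FLOAT16_INT:
        raise ValueError("height/width above 2048 may not be exact in float16")
    header = np.array([h, w], dtype="<f2")
    body = np.ascontiguousarray(depth_map, dtype="<f2").ravel()  # C-order
    payload = np.concatenate([header, body]).astype("<f2")
    return gzip.compress(payload.tobytes())


def decode_depth_map(blob: bytes) -> np.ndarray:
    """Inverse of encode_depth_map."""
    arr = np.frombuffer(gzip.decompress(blob), dtype="<f2")
    h, w = (int(v) for v in arr[:2])
    return arr[2:].reshape(h, w)
```

Spelling the dtype as `"<f2"` instead of `"float16"` pins the byte order, so the same bytes decode identically on big-endian hosts.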

We found through testing that this format fit our needs much better than 16-bit greyscale PNG, JPEG-XL, etc.

Given that the data's already in this format, what's the best way of specifying these in a Croissant record?

The Croissant spec implies that any MIME type is acceptable. However, the code as implemented has a hardcoded list of MIME types and gives an error when a dataset contains any other MIME type. Is the spec or the code incorrect?

I feel like application/octet-stream is the most descriptive MIME type if I had to shoehorn it in, but fully specifying our data format would require some small VM encoded as JSON operators or something like that. My initial thought is to just submit a Croissant specification that includes the metadata but no FileSets or Record specifications, to make it clear that people need to use our own data loaders (for now). That defeats some of the purpose of the format, though.

This is for NeurIPS' dataset track, so we'd appreciate whatever advice you have.

(Related: #635, #282, #649)


pierrot0 · 2024-06-12T16:58:25Z

Hi,

Absolutely, submitting an incomplete Croissant file is probably the way to go.
I would still specify the FileObjects if possible, but I see now that the set of supported MIME types in the checker should be expanded, and we should have a mechanism to allow for "unknown MIME types" which might not be supported by all tools.

As for the particular case you mention, I think we may want to add NumPy array support to the Croissant spec.
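To make the "incomplete Croissant file" suggestion concrete, here is a rough sketch of metadata that lists a FileObject but no record sets, built as a Python dict and printed as JSON. Every name, URL, and checksum in it is a hypothetical placeholder, the `@context` block is left out for brevity, and the property names reflect my reading of the Croissant 1.0 format, so please check them against the spec. Whether a given `encodingFormat` passes the current checker is not guaranteed.

```python
import json

# Hypothetical placeholders throughout: dataset name, URLs, ids and checksum.
metadata = {
    # A real file also needs the Croissant "@context" block (omitted here).
    "@type": "sc:Dataset",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    "name": "example_depth_maps",
    "description": (
        "Depth maps stored as gzipped little-endian float16 arrays in C-order. "
        "The first two values are the height and width, the rest is the map. "
        "Use the dataset's own loaders to decode them."
    ),
    "url": "https://example.org/depth-maps",
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "depth-maps-archive",
            "name": "depth-maps-archive",
            "contentUrl": "https://example.org/depth_maps.tar",
            "encodingFormat": "application/x-tar",
            "sha256": "<sha256 of the archive>",
        }
    ],
    # Deliberately no "recordSet" key: decoding is left to custom loaders.
}

print(json.dumps(metadata, indent=2))
```

Putting the binary layout in `description` at least documents the format for human readers until the spec can describe it directly.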

gcr changed the title ~~[NeurIPS] How to encode gzipped float16 image data (depth maps)?~~ [NeurIPS] How to express data in other binary formats? Jun 5, 2024

4ndr3aR mentioned this issue Jun 11, 2024

[NeurIPS] How to "ingest" multiple datasets made of .xz files as data/samples and space-separated .txt files as ground truth #693

Open

pierrot0 self-assigned this Jun 12, 2024
