[NeurIPS] How to express data in other binary formats? #679

Open
gcr opened this issue Jun 5, 2024 · 1 comment

gcr commented Jun 5, 2024

Hi! My colleagues and I are putting together a dataset full of encoded depth maps. These need far more bits per value than default image data loaders support.

After some technical discussion of various trade-offs, we chose to encode our depth maps as gzipped arrays containing 16-bit floating point numbers (little endian, C-order), where the first two numbers are the (integer) height and width of the map, and the rest are the 2D depth map, like this:

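import gzip
import numpy as np

path = "path/to/record.fp16le.gzip"
with gzip.open(path, "rb") as f:
    raw = f.read()
arr = np.frombuffer(raw, dtype="<f2")  # "<f2" = little-endian float16 on any host
h, w = arr[:2].astype("uint32")  # first two values are height and width
depth_map = arr[2:].reshape((h, w))  # remaining values, C-order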
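For completeness, here is a rough sketch of the matching writer plus a round trip, assuming the layout described above. The function names `encode_depth_map` and `decode_depth_map` are made up for illustration and are not part of any library. One caveat that follows from the format itself: float16 only represents integers exactly up to 2048, so the sketch refuses larger dimensions rather than silently rounding them.

```python
import gzip
import numpy as np

def encode_depth_map(depth_map: np.ndarray) -> bytes:
    """Pack height, width, then row-major depth values as gzipped little-endian float16."""
    h, w = depth_map.shape
    # float16 represents every integer up to 2048 exactly; larger values may round.
    if max(h, w) > 2048:
        raise ValueError("height/width above 2048 may not round-trip through float16")
    header = np.array([h, w], dtype="<f2")
    body = np.ascontiguousarray(depth_map, dtype="<f2").ravel()  # C-order
    packed = np.concatenate([header, body]).astype("<f2")  # force little-endian bytes
    return gzip.compress(packed.tobytes())

def decode_depth_map(blob: bytes) -> np.ndarray:
    """Inverse of encode_depth_map: returns a read-only (h, w) float16 array."""
    arr = np.frombuffer(gzip.decompress(blob), dtype="<f2")
    h, w = (int(x) for x in arr[:2])
    return arr[2:].reshape(h, w)

# Round trip on a small synthetic map (values are illustrative only).
depth = np.linspace(0.0, 10.0, 12, dtype=np.float32).reshape(3, 4)
restored = decode_depth_map(encode_depth_map(depth))
```

The restored array matches the input only up to float16 precision, since the values are quantized when written.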

We found through testing that this format fit our needs much better than 16-bit greyscale PNG, JPEG-XL, etc.

Given that the data's already in this format, what's the best way of specifying these in a Croissant record?

The Croissant spec implies that any MIME type is acceptable; however, the code as implemented has a hardcoded list of MIME types and raises an error when a dataset contains any other MIME type. Is the spec or the code incorrect?

I feel like application/octet-stream is the most descriptive MIME type if I had to shoehorn it in, but fully specifying our data format would require something like a small VM encoded as JSON operators. My initial thought is to just submit a Croissant specification that includes the metadata but no FileSets or Record specifications, to make it clear that people need to use our own data loaders (for now). That defeats some of the purpose of the format, though.
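To make the "metadata only" idea concrete, here is a rough sketch of what such a file might look like, built as a Python dict and dumped to JSON. Treat everything in it as an assumption: the dataset name and URL are placeholders, the `@context` is heavily abbreviated compared with the one in the spec, the checksum is a stand-in string, and the field names reflect my reading of the Croissant 1.0 spec rather than anything verified against the validator.

```python
import json

# Hypothetical metadata-only Croissant description: one opaque binary file,
# no FileSet and no RecordSet, so consumers must use the dataset's own loader.
metadata = {
    "@context": {
        # Abbreviated; real files use the fuller context given in the spec.
        "@vocab": "https://schema.org/",
        "sc": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    "@type": "sc:Dataset",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    "name": "example-depth-maps",  # placeholder name
    "description": "Depth maps stored as gzipped little-endian float16 arrays.",
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "record-0",
            "name": "record-0",
            "contentUrl": "https://example.org/path/to/record.fp16le.gzip",  # placeholder URL
            "encodingFormat": "application/octet-stream",
            "sha256": "<sha256 of the file>",  # placeholder, not a real checksum
        }
    ],
}

print(json.dumps(metadata, indent=2))
```

Whether the current checker accepts `application/octet-stream` here is exactly the open question in this issue.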

This is for NeurIPS' dataset track, so we'd appreciate whatever advice you have.

(Related: #635, #282, #649)

gcr changed the title from "[NeurIPS] How to encode gzipped float16 image data (depth maps)?" to "[NeurIPS] How to express data in other binary formats?" on Jun 5, 2024
pierrot0 self-assigned this on Jun 12, 2024

pierrot0 (Contributor) commented

Hi,

Absolutely, submitting an incomplete Croissant file is probably the way to go.
I would still specify the FileObjects if possible, but I see now that the set of MIME types supported by the checker should be expanded, and that we should have a mechanism to allow "unknown MIME types" which might not be supported by all tools.
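One possible shape for such a mechanism, purely as a sketch: keep the allowlist, but let callers opt into a lenient mode that warns instead of failing. The names `KNOWN_MIME_TYPES` and `check_encoding_format` and the list contents are hypothetical and do not reflect the actual checker code.

```python
import warnings

# Hypothetical allowlist; the real checker's list is not reproduced here.
KNOWN_MIME_TYPES = {"text/csv", "application/json", "image/png"}

def check_encoding_format(mime_type: str, strict: bool = True) -> bool:
    """Return True for known types; for unknown types raise (strict) or warn (lenient)."""
    if mime_type in KNOWN_MIME_TYPES:
        return True
    message = f"Unsupported MIME type: {mime_type!r}"
    if strict:
        raise ValueError(message)
    # Lenient mode: the file stays in the dataset, but tools may not be able to read it.
    warnings.warn(message)
    return False
```

With `strict=False`, a type such as `application/octet-stream` would produce a warning and a `False` return value rather than an error, which leaves it to each tool to decide whether it can handle the file.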

As for the particular case you mention, I think we may want to add NumPy array support to the Croissant spec.
