[NeurIPS] How to express data in other binary formats? #679
Comments
Hi, absolutely, submitting an incomplete Croissant file is probably the way to go. As for the particular case you mention, I think we may want to add NumPy array support to the Croissant spec.
Hi! My colleagues and I are putting together a dataset full of encoded depth maps. These need to be encoded with far more bits per pixel than default image data loaders can handle.
After some technical discussion of various trade-offs, we chose to encode our depth maps as gzipped arrays containing 16-bit floating point numbers (little endian, C-order), where the first two numbers are the (integer) height and width of the map, and the rest are the 2D depth map, like this:
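For readers who want to picture the layout, here is a minimal sketch of a reader and writer for a format matching the description above. It is not the authors' code: the function names `encode_depth_map` and `decode_depth_map` are hypothetical, and it assumes the height and width are themselves stored as 16-bit floats at the start of the same array. It uses only the Python standard library (`struct` format code `e` is IEEE 754 half precision).

```python
import gzip
import struct


def encode_depth_map(rows):
    """Pack a 2D list of depths as gzipped little-endian float16 values.

    Layout (assumed from the description): [height, width, row-major data...].
    """
    height = len(rows)
    width = len(rows[0]) if rows else 0
    values = [float(height), float(width)]
    for row in rows:  # row-major (C-order) flattening
        values.extend(row)
    payload = struct.pack(f"<{len(values)}e", *values)
    return gzip.compress(payload)


def decode_depth_map(blob):
    """Inverse of encode_depth_map: returns a list of rows of floats."""
    payload = gzip.decompress(blob)
    count = len(payload) // 2  # two bytes per float16
    values = struct.unpack(f"<{count}e", payload)
    height, width = int(values[0]), int(values[1])
    flat = values[2:]
    if len(flat) != height * width:
        raise ValueError("header does not match payload size")
    return [list(flat[r * width:(r + 1) * width]) for r in range(height)]


if __name__ == "__main__":
    depth = [[0.5, 1.0, 1.5], [2.0, 2.5, 3.0]]
    assert decode_depth_map(encode_depth_map(depth)) == depth
```

One caveat of this sketch: float16 represents integers exactly only up to 2048, so a header stored this way would not round-trip reliably for larger image dimensions.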
We found through testing that this format fit our needs much better than 16-bit greyscale PNG, JPEG-XL, etc.
Given that the data's already in this format, what's the best way of specifying these in a Croissant record?
The Croissant spec implies that any MIME type is acceptable. However, the code as implemented has a hardcoded list of MIME types and gives an error when a dataset contains any other MIME type. Is the spec or the code incorrect?
I feel like application/octet-stream is the most descriptive MIME type if I had to shoehorn it, but fully specifying our data format would require some small VM encoded as JSON operators or something like that.
My initial thought is to just submit a Croissant specification that includes the metadata but no FileSets or Record specifications, to make it clear that people need to use our own data loaders (for now). That defeats some of the purpose of the format, though.
This is for NeurIPS' dataset track, so we'd appreciate whatever advice you have.
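As a rough illustration of the metadata-only option, the sketch below builds a JSON-LD document with dataset-level fields and no `distribution` or `recordSet` entries. The helper name `build_metadata_only_croissant` and all example values are hypothetical, the `@context` is heavily abbreviated, and the field names reflect my reading of Croissant 1.0. The result has not been run through the Croissant validator, so check it against the spec before submitting.

```python
import json


def build_metadata_only_croissant(name, description, url, license_url):
    """Return a dataset-level JSON-LD dict with no files or record sets.

    Assumption: abbreviated @context; the full Croissant context is longer.
    """
    return {
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
            "dct": "http://purl.org/dc/terms/",
            "conformsTo": "dct:conformsTo",
        },
        "@type": "sc:Dataset",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        # Deliberately no "distribution" or "recordSet": consumers must
        # use the dataset's own loaders for the binary depth maps.
    }


if __name__ == "__main__":
    doc = build_metadata_only_croissant(
        name="example-depth-maps",  # placeholder values
        description="Depth maps stored as gzipped float16 arrays.",
        url="https://example.org/depth-maps",
        license_url="https://creativecommons.org/licenses/by/4.0/",
    )
    print(json.dumps(doc, indent=2))
```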
(Related: #635, #282, #649)