Added new parameter 'compute_key' #390

edwardguil · 2024-01-14T10:53:48Z

Added a new parameter, 'compute_key', that lets users override the default function for computing a sample key (compute_key). This gives finer control over the file names of the downloaded dataset.

An example use case is the following:

If the dataset specifies additional columns, one of which is a uid that is unique across the dataset, a user could define the following:

def compute_key(key, shard_id, oom_sample_per_shard, oom_shard_count, additional_columns):
    return str(additional_columns['uid'])

Then pass this function to the downloader. This changes the default output from:

output_folder
- 00000
  - 000000000.jpg
  - 000000000.txt
  - 000000001.jpg
  - ...

To:

output_folder
- 00000
  - some_uid_1.jpg
  - some_uid_1.txt
  - some_uid_2.jpg
  - ...

As far as I am aware, this customization still aligns with webdataset principles.
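
To make the before/after concrete, here is a minimal sketch of the two key schemes. Note that default_like_key is an assumption reconstructed from the output listing above (zero-padded, shard-prefixed numbers) and is not copied from the library; the argument values 4 and 5 are also assumed, chosen only so the result matches the 9-digit names shown. uid_key is the override from this description.

```python
def default_like_key(key, shard_id, oom_sample_per_shard, oom_shard_count):
    # Assumed scheme: shard id in the high digits, sample index in the low
    # digits, zero-padded to a fixed width.
    true_key = (10**oom_sample_per_shard) * shard_id + key
    width = oom_sample_per_shard + oom_shard_count
    return f"{true_key:0{width}d}"


def uid_key(key, shard_id, oom_sample_per_shard, oom_shard_count, additional_columns):
    # Ignore the positional information and use the dataset's own uid column.
    return str(additional_columns["uid"])


print(default_like_key(1, 0, 4, 5))                 # numeric, shard-prefixed key
print(uid_key(1, 0, 4, 5, {"uid": "some_uid_1"}))   # key taken from the uid column
```

With the override, the file name no longer depends on which shard or position a sample lands in, only on the uid supplied with it.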

rom1504 · 2024-01-14T12:26:29Z

Can you say more on how you build these uuids?
Is it useful to you to precompute them in advance?

I have been thinking to simply generate some uuid during download instead of using these shard id prefixed numbers

edwardguil · 2024-01-14T22:39:19Z

> Can you say more on how you build these uuids?

As an example, using the thread-safe uuid library:

import uuid

def compute_key(key, shard_id, oom_sample_per_shard, oom_shard_count, additional_columns):
    # Generate a fresh random UUID for every sample; the positional arguments are ignored
    unique_id = uuid.uuid4()
    return f"{unique_id}"

Or combining this with an additional column:

import uuid

def compute_key(key, shard_id, oom_sample_per_shard, oom_shard_count, additional_columns):
    # Prefix a random UUID with a value taken from an additional column
    unique_id = uuid.uuid4()
    return f"{additional_columns['someColumn']}_{unique_id}"

I think the point is to allow 'advanced' users to decide what their approach to this is.

> Is it useful to you to precompute them in advance?

Yes, pre-computing the uuids is the most appropriate approach in this case: store them in the input file (csv etc.), then pass them through as additional columns. This also avoids race conditions for people unfamiliar with them.
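
A rough sketch of that pre-computation step, using only the standard library. The column names (url, caption, uid), the example URLs, and the helper add_uid_column are all hypothetical; the point is just that the ids are fixed in the input file before any download worker runs.

```python
import csv
import io
import uuid


def add_uid_column(rows):
    """Return copies of the rows with a freshly generated 'uid' added to each."""
    return [{**row, "uid": str(uuid.uuid4())} for row in rows]


# Hypothetical input rows; a real input file would be read from disk instead.
rows = [
    {"url": "https://example.com/a.jpg", "caption": "first"},
    {"url": "https://example.com/b.jpg", "caption": "second"},
]
rows_with_uid = add_uid_column(rows)

# Written to an in-memory buffer here to keep the sketch self-contained.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["url", "caption", "uid"])
writer.writeheader()
writer.writerows(rows_with_uid)
print(buffer.getvalue())
```

Because the uid is generated once, up front, the compute_key function only has to read it back from additional_columns and needs no shared state.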

> I have been thinking to simply generate some uuid during download instead of using these shard id prefixed numbers

Yes, that could be an appropriate solution. I do think, however, that your current approach works well and is clear. That said, a true UUID would conform better to webdataset standards, as two separate runs of img2dataset into distinct folders run the risk of overlapping basename + key pairs (something I have come across), hence the PR.

If this PR goes ahead, I do think the documentation should mention that the function (compute_key) needs to be thread safe. This could complicate things for users, so an alternative would be to accept an additional parameter, "suffix" or "prefix", to add to the output keys, or 'uuid' to override the keys. Ultimately, though, this is not as good as allowing full key changes.
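
For illustration, the "prefix" idea can already be expressed through a custom compute_key, sketched below. The factory make_prefixed_compute_key and the run prefixes are hypothetical, and the numeric part assumes a zero-padded, shard-prefixed scheme rather than quoting the library's code. The returned function only reads its arguments and the captured prefix, so it mutates no shared state and should be safe to call from several threads.

```python
def make_prefixed_compute_key(prefix):
    """Build a compute_key that prepends a fixed, per-run prefix to a numeric key."""

    def compute_key(key, shard_id, oom_sample_per_shard, oom_shard_count, additional_columns):
        # Assumed numeric scheme: shard id in the high digits, sample index in the low digits.
        true_key = (10**oom_sample_per_shard) * shard_id + key
        width = oom_sample_per_shard + oom_shard_count
        return f"{prefix}{true_key:0{width}d}"

    return compute_key


run_a = make_prefixed_compute_key("runA_")
run_b = make_prefixed_compute_key("runB_")

# Two runs over the same shard and sample indices no longer share any key.
keys_a = {run_a(i, 0, 4, 5, None) for i in range(3)}
keys_b = {run_b(i, 0, 4, 5, None) for i in range(3)}
print(sorted(keys_a))
print(keys_a & keys_b)
```

This covers the overlapping-keys case between separate runs, but unlike a full override it cannot replace the numeric part with a dataset-supplied id.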

edwardguil · 2024-03-10T22:59:44Z

Hey @rom1504, when you get the chance, any feedback on this PR?

edwardguil added 2 commits January 14, 2024 20:45

Added new parameter 'compute_key', that allows users to override the …

2daf299

…default function for computing a sample key. Added information on this parameter to the README.

Removed unnecessary parameter passing in default compute_key function. …

b62bad5

…Fixed typo.

edwardguil force-pushed the main branch from d2a9ee9 to b62bad5 on January 14, 2024 11:33

edwardguil added 2 commits January 14, 2024 21:34

Fixed linter errors

e811489

Fixed non compliant line len.

55d6df6
