Replace specific opt-out support with datadiligence package for more general opt-out support #312

Open, 6 commits to merge into base: main

Padge91 commented May 16, 2023

The logic currently implemented to respect opt-outs (checking HTTP headers) is insufficient for the evolving opt-out landscape. Other methods should also be respected (e.g. HaveIBeenTrained (HIBT), Content Authenticity Initiative (CAI), ArtStation opt-outs, etc.). The datadiligence package aims to encapsulate and manage these methods as standards are introduced and evolve.
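For readers unfamiliar with header-based opt-outs, here is a minimal sketch of what such a check might look like. It assumes the relevant header is `X-Robots-Tag` and that the disallowed directives are `noai` and `noimageai`; neither detail is stated in this description, and the function name `header_opts_out` is hypothetical. It is not the code in this repository or in datadiligence.

```python
from typing import Iterable, Optional

# Assumed directive names; the set a real deployment uses may differ.
DISALLOWED_DIRECTIVES = frozenset({"noai", "noimageai"})


def header_opts_out(
    x_robots_values: Iterable[str],
    user_agent_token: Optional[str] = None,
    disallowed: frozenset = DISALLOWED_DIRECTIVES,
) -> bool:
    """Return True if any X-Robots-Tag value carries a disallowed directive.

    Each value is either "directive, directive" (applies to everyone) or
    "agent: directive, directive" (applies only to the named user agent).
    """
    for value in x_robots_values:
        agent, sep, rest = value.partition(":")
        if sep:
            # Agent-scoped value: ignore it unless it names our user agent.
            # (Simplification: any value containing ":" is treated as scoped.)
            if user_agent_token is None:
                continue
            if agent.strip().lower() != user_agent_token.lower():
                continue
            directive_text = rest
        else:
            directive_text = value
        directives = {d.strip().lower() for d in directive_text.split(",")}
        if directives & disallowed:
            return True
    return False


if __name__ == "__main__":
    print(header_opts_out(["noai, nofollow"]))
    print(header_opts_out(["index, follow"]))
```

A check like this only sees opt-outs that a host expresses in response headers, which is the limitation the paragraph above describes: opt-outs recorded in external registries never appear in the response at all.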

This repository would benefit from shifting the responsibility of respecting opt-outs to a dedicated package. This PR replaces the existing opt-out logic in this repository with calls to the datadiligence package. These changes should free the maintainers of img2dataset to focus on their goals and priorities without needing to revisit this logic in the future.

This PR does three things:

  1. Replaces the HTTP header validation logic with calls to the datadiligence package that perform equivalent checks.
  2. Adds an optional pre-processing step that calls the Spawning API to filter out opted-out URLs (opt-outs made through HIBT, Spawning content partners such as ArtStation, etc.).
  3. Adds a more general and consistent argument for respecting opt-outs, respect_optouts. For backwards compatibility, the behavior can still be controlled via the disallowed_header_directives argument. The new argument defaults to True to maintain parity with current behavior.
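The control flow described in items 2 and 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the environment variable name `EXAMPLE_OPTOUT_API_KEY`, the helpers `resolve_respect_optouts` and `prefilter_urls`, and the `lookup_opted_out` callable (standing in for the remote opt-out lookup) are all hypothetical. The rule that an explicitly empty `disallowed_header_directives` list disables opt-out handling is likewise an assumption about how backwards compatibility might work.

```python
import os
from typing import Callable, Iterable, List, Mapping, Optional, Set

# Hypothetical variable name; the real one is not given in this description.
API_KEY_ENV_VAR = "EXAMPLE_OPTOUT_API_KEY"


def resolve_respect_optouts(
    respect_optouts: bool = True,
    disallowed_header_directives: Optional[List[str]] = None,
) -> bool:
    """Combine the new flag with the legacy argument.

    Assumed rule: an explicitly empty directive list (the legacy way of
    switching the header check off) disables opt-out handling entirely.
    """
    if disallowed_header_directives is not None and not disallowed_header_directives:
        return False
    return respect_optouts


def prefilter_urls(
    urls: Iterable[str],
    lookup_opted_out: Callable[[List[str]], Set[str]],
    respect_optouts: bool = True,
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Drop opted-out URLs before download, but only when enabled.

    The lookup is skipped when opt-outs are not respected or when the
    API key environment variable is unset, so the default path does no
    extra work.
    """
    env = os.environ if environ is None else environ
    url_list = list(urls)
    if not respect_optouts or not env.get(API_KEY_ENV_VAR):
        return url_list
    opted_out = lookup_opted_out(url_list)
    return [u for u in url_list if u not in opted_out]


if __name__ == "__main__":
    def fake_lookup(batch: List[str]) -> Set[str]:
        # Stand-in for a remote call; flags URLs containing "optout".
        return {u for u in batch if "optout" in u}

    sample = ["https://a.example/1.jpg", "https://b.example/optout.jpg"]
    print(prefilter_urls(sample, fake_lookup, environ={API_KEY_ENV_VAR: "key"}))
```

Passing the lookup in as a callable keeps the sketch runnable offline and makes the gating logic easy to test without network access.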

Performance

In the previous PR for these changes, #218, performance metrics were requested, so I am providing them here as well. The test used 1M records from the laion-art dataset. The two versions were run in parallel on two separate machines, each an AWS EC2 m6a.2xlarge (8 vCPU, 32 GB RAM).

The command run on both machines was identical:

time python3 -m img2dataset.main --url_list ./1m.parquet --input-format "parquet" --url_col "URL" \
--caption_col "TEXT" --output_format webdataset --output_folder test --processes_count 8 \
--thread_count 32 --image_size 384 --resize_only_if_bigger=True --resize_mode="keep_ratio" \
 --skip_reencode=True

With Spawning API enabled

This test was run at ~9:20 AM EST May 15th, 2023.

This PR:
real 48m21.964s
user 297m26.488s
sys 22m21.457s

Current head:
real 47m54.071s
user 299m17.610s
sys 22m44.699s

This is roughly a 1% increase in wall-clock time (about 28 seconds on a 48-minute run) with the Spawning API preprocessing step enabled. Because this step only executes when the required environment variable is set, I believe the performance impact is acceptable: a developer must intentionally enable the feature to incur the (relatively small) cost.

Without Spawning API

This test was run at ~noon EST May 15th, 2023. In this test, the Spawning API environment variable was not set, and thus the preprocessing step was skipped.

This PR:
real 44m19.208s
user 299m5.399s
sys 22m28.679s

Current head:
real 44m14.070s
user 302m57.648s
sys 23m3.228s

The difference in wall-clock time is negligible (about 5 seconds, or roughly 0.2%), since without the preprocessing step this PR largely performs the same logic as the current head.
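For anyone who wants to check the relative overhead, here is a small sketch that converts the `real` times reported above into seconds and computes the percentage difference. The helper names `parse_time` and `overhead_percent` are illustrative only; the input strings are copied from the timings above.

```python
import re


def parse_time(text: str) -> float:
    """Convert a `time`-style duration such as '48m21.964s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def overhead_percent(pr_time: str, head_time: str) -> float:
    """Relative wall-clock increase of the PR over the current head, in %."""
    pr_seconds = parse_time(pr_time)
    head_seconds = parse_time(head_time)
    return (pr_seconds - head_seconds) / head_seconds * 100


# `real` times taken from the two test runs reported above.
with_api = overhead_percent("48m21.964s", "47m54.071s")
without_api = overhead_percent("44m19.208s", "44m14.070s")

if __name__ == "__main__":
    print(f"with Spawning API:    {with_api:.2f}%")
    print(f"without Spawning API: {without_api:.2f}%")
```

Each configuration was measured once, on separate machines and at different times of day, so differences this small may be within run-to-run noise.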
