Clarification of correct TSV file for CC3M? #371

Open
RylanSchaeffer opened this issue Dec 22, 2023 · 3 comments

Comments

@RylanSchaeffer

RylanSchaeffer commented Dec 22, 2023

Hi! Your instructions to download CC3M state "Go to https://ai.google.com/research/ConceptualCaptions/download and press download That's a 500MB tsv file"

However, when I go to that URL, there are three download buttons:

[screenshot of the three download buttons]

Your instructions then state to modify cc3m.tsv but none of the three downloaded files are named cc3m.tsv:

  • Training -> Train_GCC-training.tsv (564 MB)
  • Validation -> Validation_GCC-1.1.0-Validation.tsv (2.6 MB)
  • Image Labels -> Image_Labels_Subset_Train_GCC-Labels-training.tsv (1.3 GB)

The instructions do state that the file should be ~500 MB, which roughly matches the file obtained by downloading the training split (Train_GCC-training.tsv).

Could you please clarify how to obtain the correct cc3m.tsv file?

@rom1504
Owner

rom1504 commented Dec 24, 2023

Good question, I don't have the answer. Maybe you can download all 3 files and see which one has the right number of samples and the right columns?

You may compare against https://huggingface.co/datasets/pixparse/cc3m-wds
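
A minimal sketch of that check, assuming the three files are saved in the working directory under the names listed in the issue. `summarize_tsv` is a hypothetical helper written for this illustration, not part of this project. It reports the row count, how many tab-separated columns each row has, and the first few rows, which can then be compared by eye against the reference dataset linked above.

```python
from collections import Counter
from pathlib import Path


def summarize_tsv(path, preview_rows=3):
    """Count rows and columns in a TSV file and keep the first few rows."""
    rows = 0
    column_counts = Counter()  # maps "number of columns" -> "number of rows"
    preview = []
    # errors="replace" keeps the scan going if a caption has odd bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            fields = line.rstrip("\r\n").split("\t")
            rows += 1
            column_counts[len(fields)] += 1
            if len(preview) < preview_rows:
                preview.append(fields)
    return {
        "rows": rows,
        "column_counts": dict(column_counts),
        "preview": preview,
    }


if __name__ == "__main__":
    # File names as reported in this thread; adjust the paths as needed.
    candidates = [
        "Train_GCC-training.tsv",
        "Validation_GCC-1.1.0-Validation.tsv",
        "Image_Labels_Subset_Train_GCC-Labels-training.tsv",
    ]
    for name in candidates:
        if not Path(name).exists():
            print(f"{name}: not found, skipping")
            continue
        info = summarize_tsv(name)
        print(f"{name}: {info['rows']} rows, columns per row: {info['column_counts']}")
        for fields in info["preview"]:
            print("   ", fields)
```

The file is read line by line, so even the 1.3 GB labels file does not need to fit in memory.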

@lishuai-97

@RylanSchaeffer Hi! I have the same problem. Have you solved it?

@RylanSchaeffer
Author

I switched to a different dataset.
