Clarification of correct TSV file for CC3M? #371
Comments
Good question, I don't have the answer. Maybe you can download all three files and see which one has the right number of samples and the right columns? You can compare against https://huggingface.co/datasets/pixparse/cc3m-wds
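A minimal sketch of that check, assuming the three files sit in the current directory under the names the download page gives them. The helper name `summarize_tsv` is hypothetical, and nothing here asserts what the correct row or column count should be; it only reports what each file contains so it can be compared by hand.

```python
import csv
from pathlib import Path


def summarize_tsv(path):
    """Return (row_count, distinct column counts, first non-empty row) for a TSV file."""
    rows = 0
    widths = set()
    first_row = None
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE: treat quote characters in captions as ordinary text.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if not row:  # skip blank lines
                continue
            if first_row is None:
                first_row = row
            rows += 1
            widths.add(len(row))
    return rows, sorted(widths), first_row


if __name__ == "__main__":
    # File names as listed in this issue; adjust the paths to your download location.
    candidates = [
        "Train_GCC-training.tsv",
        "Validation_GCC-1.1.0-Validation.tsv",
        "Image_Labels_Subset_Train_GCC-Labels-training.tsv",
    ]
    for name in candidates:
        if Path(name).exists():
            count, column_counts, first = summarize_tsv(name)
            print(f"{name}: {count} rows, column counts {column_counts}, first row {first}")
        else:
            print(f"{name}: not found")
```

The first row shows whether a file has a header line and what order its columns come in, which is what you would compare against the instructions.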
@RylanSchaeffer Hi! I have the same problem, have you solved it?
I switched to a different dataset.
Hi! Your instructions to download CC3M state "Go to https://ai.google.com/research/ConceptualCaptions/download and press download That's a 500MB tsv file"
However, when I go to that URL, there are three download buttons.
Your instructions then state to modify `cc3m.tsv`, but none of the three downloaded files is named `cc3m.tsv`:
- `Train_GCC-training.tsv` (564 MB)
- `Validation_GCC-1.1.0-Validation.tsv` (2.6 MB)
- `Image_Labels_Subset_Train_GCC-Labels-training.tsv` (1.3 GB)
The instructions do state that the file should be ~500 MB, which roughly matches the file obtained by downloading the training split (`Train_GCC-training.tsv`).
Could you please clarify how to obtain the correct `cc3m.tsv` file?