The download process goes on forever #343

Open
novruzgurbanov opened this issue Sep 1, 2023 · 6 comments
Comments

@novruzgurbanov

novruzgurbanov commented Sep 1, 2023

Hi! After downloading the files from laion2b-en with these parameters:

from img2dataset import download

download(
    processes_count=32,
    url_list=parquet_file,
    resize_mode='no',
    output_folder=output_dir,
    output_format='webdataset',  # write the output as webdataset tar shards
    input_format='parquet',
    url_col="URL",
    caption_col="TEXT",
    number_sample_per_shard=50000,
    distributor='multiprocessing',
)

all the files seem to get downloaded (I think), but then the last iteration runs forever and I have to stop it manually. Could you look into this, please?

P.S. I tried this function a month ago, and it worked seamlessly. But now, no matter what I do and no matter how simple the parameters are, it gets stuck.

@rom1504
Owner

rom1504 commented Sep 1, 2023 via email

@novruzgurbanov
Author

@rom1504 I am running the download inside a Docker container. A month ago, in the same container, it worked seamlessly, but now the download never finishes and I don't know why. I am not an expert with Docker images, but if it would help, I could send you the image so you can run a container and try to download some files (img2dataset is already installed).

@rom1504
Owner

rom1504 commented Sep 1, 2023 via email

@novruzgurbanov
Author

@rom1504 Sorry, I didn't quite get what you mean. If the container and the image are the same, what other configuration should I check? If you have any suggestions on what to check, I would appreciate them!

@rom1504
Owner

rom1504 commented Sep 1, 2023 via email

@novruzgurbanov
Author

@rom1504 Interesting... I downloaded the files with number_sample_per_shard set to 10K, and both the download and the process finished on time. I guess the function, or something else, cannot handle more samples per shard.
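
For anyone hitting the same hang, here is a minimal sketch of the workaround described above: the same call as in the original report, with number_sample_per_shard lowered from 50000 to 10000. The helper names (shard_count, build_download_kwargs, run_download) are hypothetical and only for illustration. Whether the shard size is the actual cause is not confirmed in this thread; it is just the one change that made the run finish here.

```python
import math


def shard_count(total_samples: int, samples_per_shard: int) -> int:
    """Number of shards needed to hold total_samples input rows."""
    if samples_per_shard <= 0:
        raise ValueError("samples_per_shard must be positive")
    return math.ceil(total_samples / samples_per_shard)


def build_download_kwargs(parquet_file: str, output_dir: str,
                          samples_per_shard: int = 10_000) -> dict:
    """Arguments from the original report, with a smaller shard size."""
    return dict(
        processes_count=32,
        url_list=parquet_file,
        resize_mode="no",
        output_folder=output_dir,
        output_format="webdataset",
        input_format="parquet",
        url_col="URL",
        caption_col="TEXT",
        number_sample_per_shard=samples_per_shard,  # was 50000 in the report
        distributor="multiprocessing",
    )


def run_download(parquet_file: str, output_dir: str,
                 samples_per_shard: int = 10_000) -> None:
    # Imported lazily so the helpers above work without img2dataset installed.
    from img2dataset import download

    download(**build_download_kwargs(parquet_file, output_dir, samples_per_shard))
```

Smaller shards mean more, shorter units of work: for the same input, shard_count shows that 10K samples per shard produces five times as many shards as 50K, so each worker finishes a shard sooner and less work is lost if the run has to be stopped.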
