Downloading gets stuck at some particular points #21

Open
fgvfgfg564 opened this issue Mar 11, 2024 · 20 comments

@fgvfgfg564

I tried to download the dataset and it got stuck after about 13.1 GB of files had been downloaded. The command line just hangs with no updates, and network statistics show that transfer has also stopped. I have no idea what happened. Perhaps some entry in the csv file causes this?

I shuffled the csv file several times; each time the stopping point was different, ranging from 11 GB to 14 GB.

We have tried downloading on both Windows and WSL; both lead to the same problem. There is no issue with the network or the disk.

@tsaishien-chen (Contributor)

Hi @fgvfgfg564, are there any error messages?

@fgvfgfg564 (Author)

There is no error message; the downloading process simply got stuck. We are guessing that perhaps some of the data entries were too long and exceeded the command-length limit that Windows supports, resulting in the failure.

@Secant1998

> Hi @fgvfgfg564, are there any error messages?

Hello, I am a co-worker of the questioner; thanks for your reply. There is no error message during the download, and the same situation occurs on Ubuntu 22.04. However, our later tests revealed that the CSV file was not the problem.

@kuno989 commented Mar 19, 2024

@tsaishien-chen
At first I thought it was a CPU usage issue, but it wasn't (CPU usage goes up to 80~90% initially).
I tried running it on an instance with 64 cores and 64 GB RAM, with 16 threads in the config file; CPU usage started dropping after about 40 to 60 minutes and eventually converged to 0% (perhaps the program is stuck).
I did not get any errors in the process.

@AliaksandrSiarohin @Secant1998
Have you guys resolved the issue?

@tsaishien-chen (Contributor)

Hi @fgvfgfg564, @Secant1998, @kuno989,
Sorry for the inconvenience. Which csv file were you downloading at the time?
Does downloading the testing and validation sets also give you the same problem?
Also, when you killed the stuck process, did you notice which line the process was stuck at?

@kuno989 commented Mar 20, 2024

Hi @tsaishien-chen,
I haven't fully tested the issue I commented on above, but I have logged resource usage when using video2dataset.
This could be due to the sheer amount of data, or to poor memory/CPU management in video2dataset.

Here is a graph of usage when downloading the panda70m_training_full.csv data through Spark.
Hardware: [OCPU: 64 (128), 158 GB RAM] × 10 instances

subsampling: {}

reading:
    yt_args:
        download_size: 720 # I had the same issue at 480 resolution, so I don't think it's a quality issue.
        download_audio: True
        yt_metadata_args:
            writesubtitles:  True
            subtitleslangs: ['en']
            writeautomaticsub: True
            get_info: True
    timeout: 60
    sampler: null

storage:
    number_sample_per_shard: 100
    oom_shard_count: 5
    captions_are_subtitles: False

distribution:
    processes_count: 16
    thread_count: 16
    subjob_size: 10000
    distributor: "pyspark"
[Screenshot: resource usage graph]

Usage of panda70m_testing.csv in the same configuration.

[Screenshot: resource usage graph]

@tsaishien-chen (Contributor)

Hi @kuno989,
Thanks for providing the detailed investigation!
I am assuming that downloading gets stuck due to hardware overload. Does this issue also happen when you download panda70m_testing.csv? And have you tried reducing the number of parallel processes? Does that help?
As this seems to be a major issue that many people have encountered, I would like to learn more about it and document the problem and its solution in the readme.
Thanks for letting me know about the issue and providing very useful information!

@kuno989 commented Mar 22, 2024

Hi @tsaishien-chen,
I have not been able to run a proper test with testing.csv, but I was able to test with full_train.csv.
Here is the CPU usage when using full_train.
[Screenshot: CPU usage graph]

RAM
[Screenshot: RAM usage graph]

As you can see, CPU usage spikes up to 90% at the beginning, but after a certain time very little work is being done.
I'm using 16 threads in the config file.

Below is the htop output when CPU usage drops.

[Screenshot: htop output]

I'm still checking and it is working at the moment, but it seems to become thread-locked at a certain point; what do you think?
This is an Ubuntu 20.04 instance with 64 cores and 64 GB of RAM.

Here is the version information:
ffmpeg 4.2.7-0ubuntu0.1
yt-dlp 2024.03.10

Update:

I ran a total of 48 hours of testing, with the following results.
I think it might be an issue with yt-dlp.
I don't know the exact reason, but when I watch it in tmux, I can see that it is stuck inside yt-dlp.
This seems to be caused by excessive CPU or RAM usage.
This issue does not occur when downloading other YouTube-based datasets.

CPU and RAM usage:
[Screenshot: CPU usage graph]

[Screenshot: RAM usage graph]

I say this because, when I exit video2dataset with Ctrl+C, it momentarily starts working again.

[Screenshot]

@tsaishien-chen (Contributor)

Hi @kuno989,
All the tests you showed above are downloads of the full training set, right? Have you tried downloading the test set? Does the same issue occur?
After the downloading gets stuck, how many videos have you downloaded (in terms of both the number and the total size of the downloaded videos)?
You mentioned "This issue does not occur when downloading other YouTube-based datasets." Was that also tested on the same machine, and which exact datasets have you tried?
Again, sorry for the inconvenience; I am still investigating this.

@tsaishien-chen (Contributor)

If downloading the test set (a smaller subset) works, I think one way to fix this issue is to split the whole csv file into multiple smaller ones and download them one by one with a bash script (a sketch of the idea is below).
But before that, I would like to check whether the same issue happens with a smaller dataset (e.g., the test set).
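A rough Python equivalent of that idea, purely as a sketch: the file names, chunk size, and the video2dataset flags below are placeholders, so adapt them to the exact download command from the readme.

import subprocess
import pandas as pd

df = pd.read_csv("panda70m_training_full.csv")
chunk_size = 100_000  # rows per chunk; tune as needed

for i, start in enumerate(range(0, len(df), chunk_size)):
    chunk_path = f"panda70m_training_part{i:03d}.csv"
    df.iloc[start:start + chunk_size].to_csv(chunk_path, index=False)
    # Run the usual video2dataset download command on just this chunk.
    subprocess.run(
        ["video2dataset",
         "--url_list", chunk_path,
         "--output_folder", f"dataset_part{i:03d}",
         "--config", "path/to/panda70m_config.yaml"],
        check=False,  # move on to the next chunk even if one run fails
    )

This way, if one chunk hangs, that single run can be killed and restarted without losing the chunks that already finished.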

@kuno989 commented Mar 23, 2024

As you can see in the wandb logs, video2dataset processed about 5993–6100 samples in 48 hours.

[Screenshot: wandb log]

Update:
Hi @tsaishien-chen,
I have tried working with smaller sets, but the result is the same.
I split the csv into 64 pieces, and the same thing happened after a period of time.

Below are the results:
[Screenshot]

[Screenshot]

@itorone commented Mar 23, 2024

Hi @tsaishien-chen, I have encountered the same issue. We have tried three different datasets: training_full, training_2m, and training_10m. They all get stuck after downloading some content, until we manually stop them with Ctrl+C. It seems that the problem is not caused by the CPU or RAM.

@kuno989 commented Mar 28, 2024

Hi @tsaishien-chen, is there any update?

@tsaishien-chen (Contributor) commented Mar 28, 2024

Hi @itorone: When you terminated the processes, did you see which line the code was stuck at by checking the command window or htop?

Hi @kuno989: Was the screenshot below captured after the processes got stuck?
[Screenshot]
If so, I am wondering whether the code gets stuck when ffmpeg splits the video.
You mentioned that this issue does not occur when downloading other YouTube-based datasets. Which exact YouTube-based datasets have you tried before? If none of those datasets run the splitting step, ffmpeg might be the cause of the hang.

@kuno989 commented Mar 28, 2024

That is a screenshot from when CPU utilization dropped.
As mentioned above, I'm using ffmpeg 4.2.7-0ubuntu0.1. If it's possible that ffmpeg is getting stuck, it could be a version issue; can you tell me what version you're using?

The other YouTube datasets I have used are:
youtube-8m, hdvila-100M

@tsaishien-chen (Contributor)

I used ffmpeg-4.4.1-amd64-static. But since your machine works for hdvila-100m, which also splits videos with ffmpeg, I don't think ffmpeg is the problem.

@tsaishien-chen (Contributor)

Hi @fgvfgfg564 and @Secant1998, have you solved the problem? Could you please share how you fixed the issue?

@Qianjx commented Mar 29, 2024

I had the same problem on one of my servers, and it did not occur on another server. I found that the cause is that, after a video finishes downloading with yt_download, some thread keeps the file occupied and holds a lock on it, so the main Python thread cannot continue reading the video file; it waits for the file to become free and gets stuck. One telling symptom: when you Ctrl+C the Python process, it throws a file-not-found error even though the file has already been completely downloaded into your tmp dir.

I don't have a real solution to this problem yet, and I cannot figure out which process is occupying the downloaded video in the tmp dir. But I tried another way around it: since the download itself is actually done, I can simply skip this read operation and continue downloading the remaining files so the program won't get stuck; after all downloads are finished, do the remaining split and subsample operations (see the sketch below).

See https://github.com/snap-research/Panda-70M/blob/main/dataset_dataloading/video2dataset/video2dataset/data_reader.py, line 267.
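For reference, here is one possible way to implement that skip-with-timeout idea. This is only a sketch, not the actual patch: the helper name, the 120-second timeout, and the call site are assumptions. The read is done in a worker thread so that a file which never gets unlocked is skipped instead of blocking the worker forever.

from concurrent.futures import ThreadPoolExecutor, TimeoutError

def read_with_timeout(path, timeout=120):
    """Read a file in a worker thread; return None if the read does not finish in time."""
    def _read(p):
        with open(p, "rb") as f:
            return f.read()

    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(_read, path).result(timeout=timeout)
    except TimeoutError:
        print(f"Read timed out, skipping: {path}")
        return None
    finally:
        # Do not wait for a stuck read to finish; the worker thread is simply abandoned.
        pool.shutdown(wait=False)

# Around line 267 of data_reader.py, the blocking read could then become something like:
#   data = read_with_timeout(video_path)  # hypothetical variable name
#   if data is None:
#       ...  # skip this sample and handle it in a later pass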

@tsaishien-chen (Contributor)

Hi @Qianjx,
Big thanks for the helpful information! Just to clarify: you found that the video is completely downloaded, but the process gets stuck when it reads the video here:

streams[modality] = modality_file.read()

Is that correct? May I know your solution for that? Do you set a timeout so that, if the video cannot be read in time, it is simply ignored and processing continues with the next video?

Hi @kuno989: Does this information help you solve the issue? And may I know your solution? Thanks!

@kuno989 commented Apr 1, 2024

Hi @tsaishien-chen,
In my case, I took the idea from the case above and modified it in the following way.
It's currently downloading: I tested it yesterday for 12 hours on a single instance and it worked fine, so I'm now doing the real download with Spark. I think it should download without problems now, but we'll see.

import os
import portalocker
...
        streams = {}
        for modality, modality_path in modality_paths.items():
            try:
                # Acquire a lock on the downloaded file; give up after 125 s
                # instead of blocking forever if another process still holds it.
                with portalocker.Lock(modality_path, 'rb', timeout=125) as locked_file:
                    streams[modality] = locked_file.read()
                # The file has been read into memory, so the temporary copy can go.
                os.remove(modality_path)
            except portalocker.exceptions.LockException:
                print(f"Timeout occurred trying to lock the file: {modality_path}")
            except IOError as e:
                print(f"Failed to delete the file: {modality_path}. Error: {e}")

And you're right: at line 268 the process is still holding on to the file, so of course it can't do any more work, which is why the CPU and RAM utilization drop over time.

Timeout occurred trying to lock the file: /sparkdata/tmp/a208b46b-d5f1-460f-891a-4606513370e1.mp4
Traceback (most recent call last):
  File "/opt/environment/lib/python3.10/site-packages/video2dataset/workers/download_worker.py", line 233, in download_shard
    subsampled_streams, metas, error_message = broadcast_subsampler(streams, meta)
  File "/opt/environment/lib/python3.10/site-packages/video2dataset/subsamplers/clipping_subsampler.py", line 237, in __call__
    return streams_clips, metadata_clips, None

If you have a better solution, please share it! Thanks!
