Add an option to limit disk and RAM usage by scheduler queue #6085
Summary
An option to limit RAM and disk usage by the scheduler queue would make the engine take new requests from the spider only while there is space available.
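Purely for illustration, the option could take a shape like the hypothetical settings below; these names do not exist in Scrapy and are not part of this proposal's wording:

```python
# settings.py -- HYPOTHETICAL settings, not part of Scrapy; shown only
# to illustrate the requested behavior: once a queue exceeds its limit,
# the engine stops pulling new requests from the spider until the
# queue drains below the limit again.
SCHEDULER_MEMORY_QUEUE_MAX_BYTES = 200 * 1024**2  # pause above ~200 MB in RAM
SCHEDULER_DISK_QUEUE_MAX_BYTES = 10 * 1024**3     # ...or above ~10 GB on disk
```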
Motivation
Recently we've run into the issue of high disk usage by the scheduler queue. We are going through a company registry and making a lot of requests, and these are the only requests the spider makes. Sample code:
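A minimal sketch reconstructing the missing sample; the registry URL, spider name, and CRN generator below are hypothetical placeholders, not the original code:

```python
import scrapy


def generate_crns():
    # Hypothetical stand-in for the generators that together produce
    # ~142.7 million unique company registration numbers.
    for number in range(1, 142_685_211):
        yield f"{number:09d}"


class CompanyRegistrySpider(scrapy.Spider):
    name = "company_registry"
    # Let 404 (company not found) reach the callback instead of being
    # dropped by HttpErrorMiddleware.
    handle_httpstatus_list = [404]

    def start_requests(self):
        # Every request the spider makes comes from this generator.
        for crn in generate_crns():
            yield scrapy.Request(
                f"https://registry.example.com/company/{crn}",
                callback=self.parse_company,
                cb_kwargs={"crn": crn},
            )

    def parse_company(self, response, crn):
        if response.status == 200:  # company found
            yield {"crn": crn, "found": True}
```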
Currently, the generators produce 142,685,210 unique CRNs. A request can end with 404 (company not found) or 200 (company found). After ~110k successful requests, the disk queue occupies 10 GB, while RAM usage does not exceed 200 MB.
Describe alternatives you've considered
Currently, we worked around the issue by increasing CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN to a big enough number, but this helps only if the site can afford that many connections and does not use any rate-limiting techniques.
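For concreteness, the workaround amounts to something like this (the values here are illustrative, not the exact ones from our project):

```python
# settings.py -- illustrative values, not the exact ones we used.
# More in-flight requests mean the engine drains the scheduler queue
# faster, so fewer pending requests pile up on disk.
CONCURRENT_REQUESTS = 256
CONCURRENT_REQUESTS_PER_DOMAIN = 256
```

Additional context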
There is the SCRAPER_SLOT_MAX_ACTIVE_SIZE setting, whose description says that above the limit "Scrapy does not process new requests". Does that mean Scrapy does not take new requests from the spider, or that it does not put already scheduled requests into the downloader?
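For reference, the setting is a plain byte threshold in settings.py; 5000000 bytes (about 5 MB) is, as far as I know, the default:

```python
# settings.py -- soft limit on the total size of responses being
# processed at once; 5_000_000 bytes is believed to be the default.
SCRAPER_SLOT_MAX_ACTIVE_SIZE = 5_000_000
```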
We are also using a FIFO queue for this spider, but I do not think this matters.
Comments

Somewhat related to #3237, i.e. future work in that direction may help here.

The code (line 164 in 5b0b002) is about taking the next request from the scheduler and putting it into the downloader (and also about processing start requests).
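A paraphrase of the engine logic that comment refers to, reconstructed from memory of Scrapy 2.x since the embedded snippet was lost from the page; this is a sketch, not the exact source at commit 5b0b002:

```python
class ExecutionEngineSketch:
    """Paraphrase of the relevant part of scrapy.core.engine.ExecutionEngine
    (Scrapy 2.x); NOT the exact code at commit 5b0b002."""

    def _next_request(self) -> None:
        # Path 1: move already scheduled requests from the scheduler
        # into the downloader, while nothing asks to back out.
        while not self._needs_backout() and self._next_request_from_scheduler() is not None:
            pass
        # Path 2: pull the next start request from the spider and
        # schedule it, subject to the same backout check.
        if self.slot.start_requests is not None and not self._needs_backout():
            request = next(self.slot.start_requests, None)
            if request is not None:
                self.crawl(request)

    def _needs_backout(self) -> bool:
        return (
            not self.running
            or bool(self.slot.closing)
            or self.downloader.needs_backout()
            # The scraper slot's check is where SCRAPER_SLOT_MAX_ACTIVE_SIZE
            # takes effect: above the limit, both paths above pause.
            or self.scraper.slot.needs_backout()
        )
```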