CachingHostnameResolver with CONCURRENT_REQUESTS_PER_IP fails #6350

Open
mohmad-null opened this issue May 8, 2024 · 7 comments

@mohmad-null

mohmad-null commented May 8, 2024

Scrapy 2.11.1
lxml 5.2.1.0, libxml2 2.11.7, cssselect 1.2.0, parsel 1.9.1, w3lib 2.1.2, Twisted 24.3.0

I'm using the following settings:

```python
'AUTOTHROTTLE_MAX_DELAY': 8,
'AUTOTHROTTLE_START_DELAY': 3,
'AUTOTHROTTLE_TARGET_CONCURRENCY': 3,
'CONCURRENT_REQUESTS': 1,
'CONCURRENT_REQUESTS_PER_DOMAIN': 4,
'CONCURRENT_REQUESTS_PER_IP': 4,
'COOKIES_ENABLED': False,
'DEPTH_PRIORITY': 1,
'DNSCACHE_SIZE': 100000,
'DNS_RESOLVER': 'scrapy.resolver.CachingHostnameResolver',
'DNS_TIMEOUT': 120,
'DOWNLOAD_MAXSIZE': 10000000,
'DOWNLOAD_TIMEOUT': 100,
'HTTPPROXY_ENABLED': False,
'MEMUSAGE_ENABLED': False,
'REACTOR_THREADPOOL_MAXSIZE': 100,
'REFERER_ENABLED': False,
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'RETRY_TIMES': 3,
'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
'SCRAPER_SLOT_MAX_ACTIVE_SIZE': 20000000,
```

When I added the DNS_RESOLVER line, every request except the very first one made by a spider started resulting in:

```
2024-05-08 22:15:05 [GSGenericSpider] WARNING: Twisted/Scrapy Error Detected: 	 http://localhost:8070/get_page?url=example.com
```

The very first query made by the spider works fine, but 100% of the rest produce the above error, at a rate of several hundred per second.

I'm using ProxyMiddleware, so I wonder if it's because all the requests go to localhost. It works absolutely fine with the default resolver.

Edit: Further testing shows it also fails with the proxy off. It seems to work for the first request to any given domain, but all later requests return the Twisted/Scrapy Error Detected message.
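
For reference, a minimal reproduction sketch of the setup described above; the spider name and start URL are placeholders, not taken from the report:

```python
import scrapy


class ReproSpider(scrapy.Spider):
    """Hypothetical reproducer: re-request the same domain repeatedly
    with CachingHostnameResolver and CONCURRENT_REQUESTS_PER_IP set."""

    name = "repro"
    start_urls = ["https://example.com/"]
    custom_settings = {
        "DNS_RESOLVER": "scrapy.resolver.CachingHostnameResolver",
        "CONCURRENT_REQUESTS_PER_IP": 4,
    }

    def parse(self, response):
        # The first request succeeds; per the description above, the
        # follow-up requests to the same domain are the ones that fail.
        yield response.follow(response.url, callback=self.parse, dont_filter=True)
```

Run with `scrapy runspider repro.py` (it deliberately loops on the same domain; stop it with Ctrl-C).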

@mohmad-null mohmad-null changed the title CachingHostnameResolver - localhost - Twisted/Scrapy Error Detected CachingHostnameResolver - Twisted/Scrapy Error Detected for subsequent requests May 8, 2024
@Gallaecio
Member

Where does that message, “Twisted/Scrapy Error Detected”, come from? It does not seem to exist in the Scrapy code base. Is it coming from your own spider code (GSGenericSpider)? From some third-party Scrapy component?

@wRAR
Member

wRAR commented May 9, 2024

If that means some exception was caught (and silenced), you should at least show that exception instead.

@mohmad-null
Author

mohmad-null commented May 9, 2024

Thanks both, you're right, sorry, my bad (it's been years since I last touched this codebase). I thought it was a Twisted/Scrapy error, but yes, it's my generic error catch.

```
[Failure instance: Traceback: <class 'TypeError'>: unhashable type: 'list'
C:\venv\lib\site-packages\scrapy\core\downloader\middleware.py:100:download
C:\venv\lib\site-packages\scrapy\utils\defer.py:81:mustbe_deferred
C:\venv\lib\site-packages\twisted\internet\defer.py:2260:unwindGenerator
C:\venv\lib\site-packages\twisted\internet\defer.py:2172:_cancellableInlineCallbacks
--- <exception caught here> ---
C:\venv\lib\site-packages\twisted\internet\defer.py:2003:_inlineCallbacks
C:\venv\lib\site-packages\scrapy\core\downloader\middleware.py:54:process_request
C:\venv\lib\site-packages\scrapy\core\downloader\__init__.py:146:_enqueue_request
C:\venv\lib\site-packages\scrapy\core\downloader\__init__.py:119:_get_slot
]
```

However, it's for the best that I got lucky investigating this exception message, because I've changed one of my other settings since and now it works. Looking further into it, I can confirm:

Breaks with the above exception:
`CONCURRENT_REQUESTS_PER_IP = 4`

Works:
`CONCURRENT_REQUESTS_PER_DOMAIN = 3`

So it's the combination of CONCURRENT_REQUESTS_PER_IP and the CachingHostnameResolver that breaks.
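
The traceback is consistent with that: with CONCURRENT_REQUESTS_PER_IP set, the downloader keys its per-slot dictionary by the resolved address looked up in the DNS cache, and a list cannot be a dictionary key. A simplified, hypothetical illustration of the failure mode (not Scrapy's actual internals):

```python
# Hypothetical model of the slot lookup, with simplified names; an
# illustration of the failure mode only, not Scrapy's actual code.
dnscache = {}  # hostname -> cached DNS resolution result
slots = {}     # slot key -> per-slot state

def get_slot_key(hostname, ip_concurrency):
    key = hostname
    if ip_concurrency:
        # With CONCURRENT_REQUESTS_PER_IP > 0, the cached address (if any)
        # replaces the hostname as the slot key.
        key = dnscache.get(key, key)
    return key

# First request: nothing cached yet, so the hostname string is the key.
slots[get_slot_key("example.com", ip_concurrency=True)] = "slot state"

# If the resolver caches the *list* of resolved addresses, every later
# request for that hostname tries to key the dict with an unhashable list.
dnscache["example.com"] = ["93.184.215.14", "2606:2800:21f:cb07:6820:80da:af6b:8b2c"]
try:
    slots[get_slot_key("example.com", ip_concurrency=True)] = "slot state"
except TypeError as exc:
    print(exc)  # unhashable type: 'list'
```

This would also explain why the very first request works: until a resolution result lands in the cache, the hostname string itself is used as the key.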

@mohmad-null mohmad-null changed the title CachingHostnameResolver - Twisted/Scrapy Error Detected for subsequent requests CachingHostnameResolver with CONCURRENT_REQUESTS_PER_IP fails May 9, 2024
@Gallaecio Gallaecio added the bug label May 9, 2024
@wRAR
Member

wRAR commented May 9, 2024

Looks like `super().getHostByName()` produced a list as a resolution result in `CachingThreadedResolver.getHostByName()`.
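
For context, the caching path referred to works roughly like this (a paraphrase, not the verbatim `scrapy/resolver.py` source): whatever the wrapped resolver returns is stored in the DNS cache and later reused as-is.

```python
# Rough paraphrase of the caching resolver; not Scrapy's verbatim code.
from twisted.internet import defer
from twisted.internet.base import ThreadedResolver

dnscache = {}  # Scrapy uses a bounded LocalCache; a plain dict suffices here.


class CachingThreadedResolver(ThreadedResolver):
    def getHostByName(self, name, timeout=(1, 3, 11, 45)):
        if name in dnscache:
            # Cache hit: return whatever was stored, unmodified.
            return defer.succeed(dnscache[name])
        d = super().getHostByName(name, timeout)
        d.addCallback(self._cache_result, name)
        return d

    def _cache_result(self, result, name):
        # If `result` is ever a list rather than a single address string,
        # that list ends up in the cache, and (with CONCURRENT_REQUESTS_PER_IP)
        # later becomes a downloader slot key.
        dnscache[name] = result
        return result
```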

@Gallaecio
Member

Somewhat related to #3867.

@kumar-sanchay
Contributor

Let me investigate further.

@wRAR
Member

wRAR commented May 9, 2024

It may be only happening with certain domains.
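
If so, a quick way to check a candidate domain (illustrative, independent of Scrapy) is to see whether it resolves to more than one address, since a multi-address result is exactly the kind of list-valued value the traceback points at:

```python
# Illustrative check: print the distinct addresses a hostname resolves to.
# The hostnames are examples, not taken from the report.
import socket

for host in ("example.com", "www.google.com"):
    addresses = sorted({info[4][0] for info in socket.getaddrinfo(host, 80)})
    print(host, addresses)
```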
