CachingHostnameResolver with CONCURRENT_REQUESTS_PER_IP fails #6350

Open
mohmad-null opened this issue May 8, 2024 · 7 comments

@mohmad-null

mohmad-null commented May 8, 2024

Scrapy 2.11.1
lxml 5.2.1.0, libxml2 2.11.7, cssselect 1.2.0, parsel 1.9.1, w3lib 2.1.2, Twisted 24.3.0

I'm using the following settings:

```python
'AUTOTHROTTLE_MAX_DELAY': 8,
'AUTOTHROTTLE_START_DELAY': 3,
'AUTOTHROTTLE_TARGET_CONCURRENCY': 3,
'CONCURRENT_REQUESTS': 1,
'CONCURRENT_REQUESTS_PER_DOMAIN': 4,
'CONCURRENT_REQUESTS_PER_IP': 4,
'COOKIES_ENABLED': False,
'DEPTH_PRIORITY': 1,
'DNSCACHE_SIZE': 100000,
'DNS_RESOLVER': 'scrapy.resolver.CachingHostnameResolver',
'DNS_TIMEOUT': 120,
'DOWNLOAD_MAXSIZE': 10000000,
'DOWNLOAD_TIMEOUT': 100,
'HTTPPROXY_ENABLED': False,
'MEMUSAGE_ENABLED': False,
'REACTOR_THREADPOOL_MAXSIZE': 100,
'REFERER_ENABLED': False,
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'RETRY_TIMES': 3,
'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
'SCRAPER_SLOT_MAX_ACTIVE_SIZE': 20000000,
```

When I added the DNS_RESOLVER line, every request except the very first one made by a spider started resulting in:

```
2024-05-08 22:15:05 [GSGenericSpider] WARNING: Twisted/Scrapy Error Detected: 	 http://localhost:8070/get_page?url=example.com
```

The very first query made by the spider works fine, but 100% of the rest produce the above error, at a rate of several hundred per second.

I'm using ProxyMiddleware, so I wonder if it's because all the requests go to localhost. It works absolutely fine with the default resolver.

Edit: Further testing shows it also fails with the proxy off. It seems to work for the first request to any given domain, but all later requests return the Twisted/Scrapy Error Detected message.
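
For reference, a minimal reproduction sketch of the setup described above; the spider name and start URL are placeholders, not taken from the report:

```python
import scrapy


class ReproSpider(scrapy.Spider):
    """Hypothetical reproducer: re-request the same domain repeatedly
    with CachingHostnameResolver and CONCURRENT_REQUESTS_PER_IP set."""

    name = "repro"
    start_urls = ["https://example.com/"]
    custom_settings = {
        "DNS_RESOLVER": "scrapy.resolver.CachingHostnameResolver",
        "CONCURRENT_REQUESTS_PER_IP": 4,
    }

    def parse(self, response):
        # The first request succeeds; per the description above, the
        # follow-up requests to the same domain are the ones that fail.
        yield response.follow(response.url, callback=self.parse, dont_filter=True)
```

Run with `scrapy runspider repro.py` (it deliberately loops on the same domain; stop it with Ctrl-C).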

@mohmad-null mohmad-null changed the title CachingHostnameResolver - localhost - Twisted/Scrapy Error Detected CachingHostnameResolver - Twisted/Scrapy Error Detected for subsequent requests May 8, 2024
@Gallaecio
Member

Where does that message, “Twisted/Scrapy Error Detected”, come from? It does not seem to exist in the Scrapy code base. Is it coming from your own spider code (GSGenericSpider)? From some third-party Scrapy component?

@wRAR
Member

wRAR commented May 9, 2024

If that means some exception was caught (and silenced), you should at least show that exception instead.

@mohmad-null
Author

mohmad-null commented May 9, 2024

Thanks both, you're right, sorry, my bad (it's been years since I last touched this codebase). I thought it was a Twisted/Scrapy error, but yes, it's my generic error catch.

```
[Failure instance: Traceback: <class 'TypeError'>: unhashable type: 'list'
C:\venv\lib\site-packages\scrapy\core\downloader\middleware.py:100:download
C:\venv\lib\site-packages\scrapy\utils\defer.py:81:mustbe_deferred
C:\venv\lib\site-packages\twisted\internet\defer.py:2260:unwindGenerator
C:\venv\lib\site-packages\twisted\internet\defer.py:2172:_cancellableInlineCallbacks
--- <exception caught here> ---
C:\venv\lib\site-packages\twisted\internet\defer.py:2003:_inlineCallbacks
C:\venv\lib\site-packages\scrapy\core\downloader\middleware.py:54:process_request
C:\venv\lib\site-packages\scrapy\core\downloader\__init__.py:146:_enqueue_request
C:\venv\lib\site-packages\scrapy\core\downloader\__init__.py:119:_get_slot
]
```

However, it's for the best that I got lucky investigating this exception message, because I've changed one of my other settings since and now it works. Looking further into it, I can confirm:

Breaks with the above exception:
`CONCURRENT_REQUESTS_PER_IP = 4`

Works:
`CONCURRENT_REQUESTS_PER_DOMAIN = 3`

So it's the combination of CONCURRENT_REQUESTS_PER_IP and the CachingHostnameResolver that breaks.
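
The traceback is consistent with that: with CONCURRENT_REQUESTS_PER_IP set, the downloader keys its per-slot dictionary by the resolved address looked up in the DNS cache, and a list cannot be a dictionary key. A simplified, hypothetical illustration of the failure mode (not Scrapy's actual internals):

```python
# Hypothetical model of the slot lookup, with simplified names; an
# illustration of the failure mode only, not Scrapy's actual code.
dnscache = {}  # hostname -> cached DNS resolution result
slots = {}     # slot key -> per-slot state

def get_slot_key(hostname, ip_concurrency):
    key = hostname
    if ip_concurrency:
        # With CONCURRENT_REQUESTS_PER_IP > 0, the cached address (if any)
        # replaces the hostname as the slot key.
        key = dnscache.get(key, key)
    return key

# First request: nothing cached yet, so the hostname string is the key.
slots[get_slot_key("example.com", ip_concurrency=True)] = "slot state"

# If the resolver caches the *list* of resolved addresses, every later
# request for that hostname tries to key the dict with an unhashable list.
dnscache["example.com"] = ["93.184.215.14", "2606:2800:21f:cb07:6820:80da:af6b:8b2c"]
try:
    slots[get_slot_key("example.com", ip_concurrency=True)] = "slot state"
except TypeError as exc:
    print(exc)  # unhashable type: 'list'
```

This would also explain why the very first request works: until a resolution result lands in the cache, the hostname string itself is used as the key.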

@mohmad-null mohmad-null changed the title CachingHostnameResolver - Twisted/Scrapy Error Detected for subsequent requests CachingHostnameResolver with CONCURRENT_REQUESTS_PER_IP fails May 9, 2024
@Gallaecio Gallaecio added the bug label May 9, 2024
@wRAR
Member

wRAR commented May 9, 2024

Looks like `super().getHostByName()` produced a list as a resolution result in `CachingThreadedResolver.getHostByName()`.
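
For context, the caching path referred to works roughly like this (a paraphrase, not the verbatim `scrapy/resolver.py` source): whatever the wrapped resolver returns is stored in the DNS cache and later reused as-is.

```python
# Rough paraphrase of the caching resolver; not Scrapy's verbatim code.
from twisted.internet import defer
from twisted.internet.base import ThreadedResolver

dnscache = {}  # Scrapy uses a bounded LocalCache; a plain dict suffices here.


class CachingThreadedResolver(ThreadedResolver):
    def getHostByName(self, name, timeout=(1, 3, 11, 45)):
        if name in dnscache:
            # Cache hit: return whatever was stored, unmodified.
            return defer.succeed(dnscache[name])
        d = super().getHostByName(name, timeout)
        d.addCallback(self._cache_result, name)
        return d

    def _cache_result(self, result, name):
        # If `result` is ever a list rather than a single address string,
        # that list ends up in the cache, and (with CONCURRENT_REQUESTS_PER_IP)
        # later becomes a downloader slot key.
        dnscache[name] = result
        return result
```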

@Gallaecio
Member

Somewhat related to #3867.

@kumar-sanchay
Contributor

Let me investigate further.

@wRAR
Member

wRAR commented May 9, 2024

It may be only happening with certain domains.
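
If so, a quick way to check a candidate domain (illustrative, independent of Scrapy) is to see whether it resolves to more than one address, since a multi-address result is exactly the kind of list-valued value the traceback points at:

```python
# Illustrative check: print the distinct addresses a hostname resolves to.
# The hostnames are examples, not taken from the report.
import socket

for host in ("example.com", "www.google.com"):
    addresses = sorted({info[4][0] for info in socket.getaddrinfo(host, 80)})
    print(host, addresses)
```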
