The first rule in a robots.txt with BOM will be ignored #6195
Also, while researching this bug, I came across this issue, which has since been resolved. However, in testing I found that links weren't being properly extracted when using the example given there, which mine was based on. It could have been user error, and I didn't dig deeper or create an MRE because that was not my concern at the moment, but it can be replicated by serving content with random links using the server supplied in that issue and changing the spider to parse links:

```python
for link in self.link_extractor.extract_links(response):
    yield scrapy.Request(link.url, callback=self.parse)
```

I leave this (mostly unrelated) comment here in case anyone reads this and wants to pick up on that thread as well.
Description
When a robots.txt is encountered that includes a BOM, not all rules are respected. This is due to the BOM being included in the content passed to protego: the first user-agent line, with the BOM prepended, is not a valid user-agent, so it is ignored.
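protego is not needed to see the failure mode. The standard library's robots.txt parser trips over the same leading BOM, which illustrates why the first `User-agent` line no longer matches (the `mybot`/`/secret` names below are made up for the demonstration):

```python
from urllib import robotparser

# U+FEFF is what a UTF-8 BOM becomes after a plain UTF-8 decode.
content = "\ufeffUser-agent: *\nDisallow: /\n"

rp = robotparser.RobotFileParser()
rp.parse(content.splitlines())
# The BOM is glued to "User-agent", so the group header is not recognized
# and the Disallow rule that follows it is dropped.
print(rp.can_fetch("mybot", "/secret"))  # True, even though the rule says Disallow: /

rp_clean = robotparser.RobotFileParser()
rp_clean.parse(content.lstrip("\ufeff").splitlines())
print(rp_clean.can_fetch("mybot", "/secret"))  # False, as intended
```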
One could argue that protego should handle this, but it seems more appropriate for the content to be stripped of the BOM before being passed to protego.
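A minimal sketch of that kind of fix, assuming the robots.txt body arrives as bytes (the helper name `decode_robotstxt` is hypothetical; Scrapy's actual code path differs):

```python
import codecs

def decode_robotstxt(body: bytes) -> str:
    """Decode robots.txt content, dropping any leading UTF-8 BOM.

    The "utf-8-sig" codec strips the BOM if present and behaves like
    plain UTF-8 otherwise, so the first User-agent line parses cleanly
    when the text is later handed to protego.
    """
    return body.decode("utf-8-sig", errors="replace")

bom_body = codecs.BOM_UTF8 + b"User-agent: *\nDisallow: /\n"
print(decode_robotstxt(bom_body).startswith("User-agent"))  # True
```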
Steps to Reproduce
Expected behavior:
No pages should be crawled as they should be blocked by robots.txt.
Actual behavior:
A page is crawled because the first robots.txt rule is ignored.
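At the byte level, the problem looks like this (an assumed stand-in for what the attached server.py serves; only the standard library is used):

```python
import codecs

# A BOM-prefixed robots.txt body as it might arrive off the wire.
body = codecs.BOM_UTF8 + b"User-agent: *\nDisallow: /\n"

text = body.decode("utf-8")            # plain UTF-8 keeps the BOM as U+FEFF
first_line = text.splitlines()[0]
print(first_line == "User-agent: *")   # False: the directive is "\ufeffUser-agent"

text_sig = body.decode("utf-8-sig")    # "utf-8-sig" drops the BOM
print(text_sig.splitlines()[0] == "User-agent: *")  # True
```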
Versions
Scrapy : 2.9.0
lxml : 4.9.2.0
libxml2 : 2.9.14
cssselect : 1.2.0
parsel : 1.8.1
w3lib : 2.1.1
Twisted : 22.10.0
Python : 3.10.13 (main, Aug 24 2023, 22:48:59) [Clang 14.0.3 (clang-1403.0.22.14.1)]
pyOpenSSL : 23.2.0 (OpenSSL 3.1.1 30 May 2023)
cryptography : 41.0.1
Platform : macOS-14.1.2-x86_64-i386-64bit
Additional context
server.py
spider.py