Allow crawling of items outside of domain #14

Merged (3 commits) on Nov 30, 2023

Conversation

BurnzZ (Member) commented Nov 10, 2023

TODO:

docs/setup.rst Outdated
@@ -70,6 +70,11 @@ The following additional settings are recommended:
``"zyte_crawlers.middlewares.CrawlingLogsMiddleware": 1000``, to log crawl
data in JSON format for debugging purposes.

- Update :setting:`SPIDER_MIDDLEWARES <scrapy:SPIDER_MIDDLEWARES>` to include
``"zyte_crawlers.middlewares.ItemOffsiteMiddleware": 500`` and remove
``"scrapy.spidermiddlewares.offsite.OffsiteMiddleware"``. This allows for the
Member Author

What do you think about moving this to Scrapy?

Member Author

Created a proposal for Scrapy in scrapy/scrapy#6151
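
For reference, the settings change described in the docs/setup.rst hunk above might look roughly like the sketch below. The middleware path and the priority of 500 are copied from that (outdated) hunk and may not match the final merged name. Setting a built-in middleware to None is the usual Scrapy way to disable it, and is assumed here to be what "remove" means.

```python
# settings.py (sketch, not the project's actual file)
SPIDER_MIDDLEWARES = {
    # Disable Scrapy's built-in offsite filtering by mapping it to None.
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
    # Path and priority as quoted in the outdated docs hunk; the name in the
    # merged version may differ.
    "zyte_crawlers.middlewares.ItemOffsiteMiddleware": 500,
}
```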

codecov-commenter commented Nov 10, 2023

Codecov Report

Merging #14 (3342da2) into main (ca4dfb9) will decrease coverage by 0.18%.
Report is 3 commits behind head on main.
The diff coverage is 91.66%.

❗ Current head 3342da2 differs from pull request most recent head a616617. Consider uploading reports for the commit a616617 to get more accurate results.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #14      +/-   ##
==========================================
- Coverage   98.56%   98.39%   -0.18%     
==========================================
  Files           9        9              
  Lines         488      498      +10     
==========================================
+ Hits          481      490       +9     
- Misses          7        8       +1     
| Files | Coverage Δ |
|---|---|
| zyte_spider_templates/spiders/base.py | 100.00% <100.00%> (ø) |
| zyte_spider_templates/middlewares.py | 88.46% <88.88%> (-0.18%) ⬇️ |

@@ -49,6 +49,10 @@ class BaseSpiderParams(BaseModel):
            "widget": "request-limit",
        },
    )
    allow_items_outside_domains: Optional[bool] = Field(
        description="When set to True, items outside of the domains will be crawled.",
        default=False,
Contributor

Do you think it's feasible NOT to add this option, but make it the default instead? It looks safe enough, at least for the "navigation" and "pagination_only" strategies. Maybe it's fine for "full" as well. We might need to run some experiments to make sure it's safe, though.

Member Author

After some analysis, it turns out we can enable crawling products outside of the domain by default. I guess the ultimate switch for users here would be enabling the AllowOffsiteMiddleware in settings.
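
As a rough illustration of the behaviour discussed in this thread (item requests allowed offsite, everything else still restricted to the configured domains), a filter could look like the sketch below. This is not the middleware's implementation; `is_onsite`, `should_crawl`, and the `is_item` flag are hypothetical names used only for the example.

```python
from urllib.parse import urlparse


def is_onsite(url: str, allowed_domains: list[str]) -> bool:
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def should_crawl(url: str, allowed_domains: list[str], is_item: bool) -> bool:
    """Hypothetical rule: item requests bypass the domain check entirely."""
    return is_item or is_onsite(url, allowed_domains)


if __name__ == "__main__":
    domains = ["example.com"]
    # Onsite navigation link: followed.
    print(should_crawl("https://shop.example.com/category", domains, is_item=False))
    # Offsite item link: followed because it is an item request.
    print(should_crawl("https://other.example.org/p/1", domains, is_item=True))
    # Offsite navigation link: dropped.
    print(should_crawl("https://other.example.org/category", domains, is_item=False))
```

Under this rule, navigation and pagination stay within the seed domains while product pages hosted elsewhere are still reachable, which seems to be why the default was considered safe for the "navigation" and "pagination_only" strategies.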

@BurnzZ changed the title from "add new param: allow_items_outside_domains" to "Allow crawling of items outside of domain" on Nov 22, 2023
Co-authored-by: Adrián Chaves <[email protected]>
@BurnzZ BurnzZ merged commit 918e9b4 into main Nov 30, 2023
8 checks passed
@wRAR wRAR deleted the outside-domain-item branch February 9, 2024 09:05