Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update crawler.ts: Add hostname check to keep crawler on the same domain. #15

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

dougwithseismic
Copy link

Problem

The existing crawler logic adds any found URLs to the queue without checking if they belong to the same domain as the starting URL. This could result in the crawler venturing off into unrelated domains, especially social links, which may not be desirable for the scope of the crawl.

Solution

Introduced a hostname check in the addNewUrlsToQueue function. This ensures that only URLs that have the same hostname as the starting URL are added to the queue for crawling. This feature helps in restricting the crawler within the scope of the initial domain, thereby making the crawl more focused and efficient.

Type of Change

New feature (non-breaking change which adds functionality)

Test Plan

Run the crawler on a starting URL that contains both internal and external links.
Verify that only URLs belonging to the same hostname as the starting URL are added to the queue.
Optionally, you can print the queue or keep logs to verify that the URLs are indeed from the same hostname.

Added hostname check before adding a url to the queue to make sure we don't follow social links to the end of the earth if editing the URLs we're crawling.

If the found URL has the same hostname as the starting URL, it will add that URL to the queue.

Thanks for the project!
@dougwithseismic
Copy link
Author

Tested and works like a charm. I'd also like to extend crawler to take an optional regex, what do you think?

@@ -1,5 +1,5 @@
import cheerio from 'cheerio';

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your linter shouldnt be making changes for this PR

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorting now.

Copy link
Author

@dougwithseismic dougwithseismic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR now only brings in a hostname check

@@ -80,8 +102,20 @@ class Crawler {

private extractUrls(html: string, baseUrl: string): string[] {
const $ = cheerio.load(html);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor observation: This will give a 'deprecated' warning. You can do import { load } from 'cheerio'; and use the load by itself instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants