Update crawler.ts: Add hostname check to keep crawler on the same domain. #15
## Problem
The existing crawler logic adds every discovered URL to the queue without checking whether it belongs to the same domain as the starting URL. As a result, the crawler can wander off into unrelated domains (social links in particular), which is usually outside the intended scope of the crawl.
## Solution
Introduced a hostname check in the addNewUrlsToQueue function. Only URLs whose hostname matches the starting URL's hostname are now added to the crawl queue. This keeps the crawler restricted to the initial domain, making the crawl more focused and efficient.
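A minimal sketch of the check described above, using the standard `URL` API to compare hostnames. The exact signature of `addNewUrlsToQueue` in crawler.ts may differ; the parameter names and queue shape here are assumptions for illustration.

```typescript
// Sketch: only enqueue URLs whose hostname matches the starting URL's.
// Signature and names are illustrative, not the repo's actual API.
function addNewUrlsToQueue(
  foundUrls: string[],
  startUrl: string,
  queue: string[]
): void {
  const startHostname = new URL(startUrl).hostname;
  for (const url of foundUrls) {
    try {
      if (new URL(url).hostname === startHostname) {
        queue.push(url);
      }
    } catch {
      // Skip malformed URLs instead of crashing the crawl.
    }
  }
}
```

Comparing `hostname` (rather than raw string prefixes) avoids false matches like `example.com.evil.org`, though note that subdomains such as `blog.example.com` would still be treated as a different host under this check.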
## Type of Change
- New feature (non-breaking change which adds functionality)
## Test Plan
1. Run the crawler on a starting URL whose page contains both internal and external links.
2. Verify that only URLs with the same hostname as the starting URL are added to the queue.
3. Optionally, print the queue or inspect the logs to confirm that every queued URL shares the starting hostname.
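The verification step above can be sketched as a small standalone check; the filter mirrors the hostname comparison the PR adds, and the sample URLs are placeholders.

```typescript
// Standalone sketch of the test-plan verification: filter a mixed list
// of links and confirm only same-hostname URLs survive.
const startUrl = "https://example.com";
const found = [
  "https://example.com/about",   // internal: should be kept
  "https://twitter.com/example", // external social link: should be dropped
  "https://example.com/blog",    // internal: should be kept
];
const startHostname = new URL(startUrl).hostname;
const queued = found.filter((u) => new URL(u).hostname === startHostname);
console.log(queued); // should contain only example.com URLs
```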