Update crawler.ts: Add hostname check to keep crawler on the same domain. #15

dougwithseismic · 2023-09-15T12:02:02Z

Problem

The existing crawler logic adds every URL it finds to the queue without checking whether it belongs to the same domain as the starting URL. The crawler can therefore wander off into unrelated domains, most often through social links, which falls outside the intended scope of the crawl.

Solution

Added a hostname check to the addNewUrlsToQueue function so that only URLs with the same hostname as the starting URL are queued for crawling. This keeps the crawler within the scope of the initial domain and makes the crawl more focused and efficient.
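
A minimal sketch of the kind of check described here, not the actual diff from this PR. The names isSameHostname and filterSameHost are hypothetical stand-ins for whatever addNewUrlsToQueue does internally; only the built-in URL class is used. It assumes an exact hostname comparison, which would mean a subdomain such as blog.example.com counts as external.

```typescript
// Hypothetical helper: true when `candidate` resolves to the same hostname as `startUrl`.
function isSameHostname(candidate: string, startUrl: string): boolean {
  try {
    // Resolve relative links against the starting URL before comparing.
    const candidateHost = new URL(candidate, startUrl).hostname;
    const startHost = new URL(startUrl).hostname;
    return candidateHost === startHost;
  } catch {
    // Malformed URLs are treated as external and skipped.
    return false;
  }
}

// Hypothetical stand-in for the filtering step inside addNewUrlsToQueue.
function filterSameHost(urls: string[], startUrl: string): string[] {
  return urls.filter((url) => isSameHostname(url, startUrl));
}

const found = [
  "https://example.com/docs",
  "/about",
  "https://twitter.com/example",
  "https://blog.example.com/post",
];

// Only the first two entries share the starting hostname.
console.log(filterSameHost(found, "https://example.com/"));
```

If subdomains should stay in scope, the comparison would need to be loosened (for example, by comparing a hostname suffix), which is a separate design decision from what this PR describes.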

Type of Change

New feature (non-breaking change which adds functionality)

Test Plan

Run the crawler on a starting URL that contains both internal and external links.
Verify that only URLs belonging to the same hostname as the starting URL are added to the queue.
Optionally, you can print the queue or keep logs to verify that the URLs are indeed from the same hostname.

Added a hostname check before adding a URL to the queue, so we don't follow social links to the end of the earth when editing the URLs we're crawling. If a found URL has the same hostname as the starting URL, it is added to the queue. Thanks for the project!

dougwithseismic · 2023-09-16T09:23:24Z

Tested and works like a charm. I'd also like to extend the crawler to take an optional regex. What do you think?
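
One way the optional regex could look, purely as a sketch of the idea and not something that exists in the project. The QueueFilterOptions interface, the urlPattern option, and the shouldQueue function are all hypothetical names introduced for illustration.

```typescript
// Hypothetical options object; `urlPattern` is not an existing crawler option.
interface QueueFilterOptions {
  startUrl: string;
  // Assumed to be a pattern without the `g` or `y` flag, since those
  // make RegExp.prototype.test stateful between calls.
  urlPattern?: RegExp;
}

// Returns true when `url` is on the starting hostname and,
// if a pattern is supplied, the resolved URL also matches it.
function shouldQueue(url: string, options: QueueFilterOptions): boolean {
  let resolved: URL;
  try {
    resolved = new URL(url, options.startUrl);
  } catch {
    return false; // skip malformed URLs
  }
  if (resolved.hostname !== new URL(options.startUrl).hostname) {
    return false;
  }
  // With no pattern supplied, the hostname check alone decides.
  return options.urlPattern ? options.urlPattern.test(resolved.href) : true;
}
```

Keeping the pattern optional means existing callers would see no change in behaviour, while a caller who wants to crawl only part of a site could pass something like /\/docs\// to narrow the queue further.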

HarounAns · 2023-09-24T05:56:10Z

src/app/api/crawl/crawler.ts

@@ -1,5 +1,5 @@
-import cheerio from 'cheerio';


Your linter shouldn't be making changes for this PR.

Sorting now.

dougwithseismic

This PR now only brings in a hostname check

athrael-soju · 2023-10-09T14:15:54Z

src/app/api/crawl/crawler.ts

@@ -80,8 +102,20 @@ class Crawler {

 private extractUrls(html: string, baseUrl: string): string[] {
 const $ = cheerio.load(html);


Minor observation: this will give a 'deprecated' warning. You can use import { load } from 'cheerio'; and call load directly instead.

HarounAns suggested changes Sep 24, 2023

dougwithseismic added 2 commits September 25, 2023 10:18

Remove Linting Changes

1b7e03f

Removed additional Code

e934889

dougwithseismic commented Sep 25, 2023

This was referenced Oct 9, 2023

Update crawler.ts: Add hostname check to keep crawler on the same domain. athrael-soju/Iridium-AI#32

Closed

32-update-crawlerts-add-hostname-check-to-keep-crawler-on-the-same-domain athrael-soju/Iridium-AI#33

Merged

athrael-soju reviewed Oct 9, 2023
