Add url filtering to Cheerio scraper. Also fix multiple issues of link limit enforcement. #1417

jitsmaster · 2023-12-20T18:24:40Z

No description provided.


HenryHengZJ · 2023-12-20T21:47:43Z

packages/server/src/index.ts

@@ -1,67 +1,66 @@
+import axios from 'axios'


are you using something like a formatter that rearranges the imports in alphabetical order?

It's a VS Code extension. If it's causing trouble, I can manually change them back.

ah okay, if you can remove the changes that are not part of the actual web scraping changes, that'd be great!

HenryHengZJ · 2023-12-20T21:48:30Z

many thanks for the PR! Can you highlight some of the cases where the current solution doesn't work or has limitations, and which of those this PR manages to solve?

jitsmaster · 2024-01-10T18:35:39Z

Sorry for my late response.

Thank you very much for spending your valuable time reviewing this PR.

The purpose of this PR is to allow scraping a large set of web pages without Node running out of memory.

All the filtering mechanisms and the re-enforcement of the max count are for that purpose.

Since we can never know how much memory it will actually take to store all the scraped content, it will take quite a bit of trial and error to figure out the max size that is acceptable.

My original intent was to use a piece-by-piece model that stores any scraped content right away, but LangChain doesn't support it and seems to have no intention of adopting this model.

I do plan a second stage of changes based on the current model: storing the content on disk instead of in memory, before the text split. That will be another PR.

Cheers!
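
For readers following the thread, the URL filtering and max-count enforcement described in the comment above can be sketched roughly as follows. This is only an illustration and not the code from this PR: `CrawlFilter`, `selectUrls`, and the field names are hypothetical, and prefix matching is assumed from the commit titles rather than confirmed by the diff.

```typescript
// Hypothetical sketch, not the PR's actual implementation.
// Assumption: include/exclude rules are plain URL prefixes.
interface CrawlFilter {
    includePrefixes: string[] // empty list means "allow every URL"
    excludePrefixes: string[] // empty list means "exclude nothing"
    limit: number // maximum number of URLs to keep
}

function selectUrls(candidates: string[], filter: CrawlFilter): string[] {
    const seen = new Set<string>()
    const selected: string[] = []

    for (const url of candidates) {
        // Enforce the limit on what is actually kept, not on what was visited.
        if (selected.length >= filter.limit) break

        // Skip URLs that were already considered, to avoid duplicate documents.
        if (seen.has(url)) continue
        seen.add(url)

        const included = filter.includePrefixes.length === 0 || filter.includePrefixes.some((prefix) => url.startsWith(prefix))
        const excluded = filter.excludePrefixes.some((prefix) => url.startsWith(prefix))

        if (included && !excluded) selected.push(url)
    }

    return selected
}
```

Applying the filter and the limit before any page content is loaded keeps the number of documents held in memory bounded by `limit`, which is the concern raised in the comment.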

luc4t · 2024-05-01T21:26:23Z

this would be really helpful, I just read this PR after posting here: #1566 (comment)

Arnold Wang added 8 commits December 14, 2023 10:28

Fix bug of cheerio loader eventually failing with not iterable error

dd2f359

Added more console logging for web crawl to better indicate progress

exclude archives upfront to prevent time out and not iterable error

10f850a

Fix issue of bypassing cheerio crawl limit due to incorrect logic

9ad89b8

Fix cheerio limit not controlling actual documents to upsert issue

ce307f8

Added inclusion and exclusion settings for web crawl. Allowing finer …

02d0da9

…control of web crawling

Fix cheerio for not filtering out loaded docs and excluded doc, causi…

86bc78a

…ng dupes and not valid docs

prefix and ex prefix url checking to handle empty listing scenario

4db5ac3

change port back to 3000

fd3bf9c

jitsmaster changed the title ~~Merge from main repo~~ Add url filtering to Cheerio scraper. Also fix issue of link limit not getting enforced. Dec 20, 2023

jitsmaster changed the title ~~Add url filtering to Cheerio scraper. Also fix issue of link limit not getting enforced.~~ Add url filtering to Cheerio scraper. Also fix multiple issues of link limit enforcement. Dec 20, 2023

HenryHengZJ reviewed Dec 20, 2023


Merge branch 'main' into main

2161c92

HenryHengZJ requested a review from chungyau97 January 14, 2024 15:10

Merge branch 'main' into main

17605ff
