"Recursive URL" Document loader load useless documents #21204

beethogedeon · 2024-05-02T15:43:26Z

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader
from bs4 import BeautifulSoup as Soup

url = "https://www.example.com/"
loader = RecursiveUrlLoader(
    url=url, max_depth=2, extractor=lambda x: Soup(x, "html.parser").text
)
docs = loader.load()

Error Message and Stack Trace (if applicable)

No response

Description

I'm trying to use the "Recursive URL" document loader from "langchain_community.document_loaders.recursive_url_loader" to load all URLs under a root directory, but CSS and JS links are also processed and end up as useless documents.

System Info

System Information

OS: Linux
OS Version: #1 SMP Tue Dec 19 13:14:11 UTC 2023
Python Version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]

Package Information

langchain_core: 0.1.48
langchain: 0.1.17
langchain_community: 0.0.36
langsmith: 0.1.52
langchain_cohere: 0.1.4
langchain_text_splitters: 0.0.1

Siddhesh-Agarwal · 2024-05-04T11:20:30Z

Hey @beethogedeon, can you provide the URL where you are facing the problem?

For the URL you have given so far (https://example.com/), the problem lies in the extractor. You have used a very basic extractor, and the code can be changed to:

from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader
from bs4 import BeautifulSoup as Soup

def text_extractor(r_text: str) -> str:
    soup = Soup(r_text, "html.parser")
    return " ".join(soup.text.split())

url = "https://www.example.com/"
loader = RecursiveUrlLoader(
    url=url,
    max_depth=2,
    extractor=text_extractor,
)
docs = loader.load()

Siddhesh-Agarwal · 2024-05-04T11:21:26Z

PS: This only solves the problem of extra whitespace.
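
For the CSS and JS links described in the issue, one possible workaround is to filter the loaded documents by their source URL after loading. The following is a minimal sketch, not a fix in the loader itself. It assumes each document carries its URL under metadata["source"]; the names ASSET_SUFFIXES, is_asset_url, and drop_asset_documents are hypothetical helpers introduced only for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Illustrative set of suffixes to treat as non-content assets (an assumption;
# extend it to match the site being crawled).
ASSET_SUFFIXES = {".css", ".js"}


def is_asset_url(url: str) -> bool:
    """Return True if the URL path ends in one of ASSET_SUFFIXES."""
    # urlparse separates the path from any query string or fragment,
    # so "site.css?v=3" is still recognized as a stylesheet.
    path = urlparse(url).path
    return PurePosixPath(path).suffix.lower() in ASSET_SUFFIXES


def drop_asset_documents(docs):
    """Keep only documents whose metadata["source"] is not an asset URL.

    Assumes each item has a `metadata` dict with a "source" key holding
    the URL it was loaded from.
    """
    return [
        doc for doc in docs
        if not is_asset_url(doc.metadata.get("source", ""))
    ]


# Possible usage with the loader from the examples above:
# docs = drop_asset_documents(loader.load())
```

This only discards documents after they have been fetched, so the CSS and JS URLs are still requested during the crawl.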

dosubot (bot) added the labels "Ɑ: doc loader" (Related to document loader module (not documentation)) and "🤖:bug" (Related to a bug, vulnerability, unexpected error with an existing feature) on May 2, 2024.
