"Recursive URL" Document loader load useless documents #21204

Open
5 tasks done
beethogedeon opened this issue May 2, 2024 · 2 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: doc loader Related to document loader module (not documentation)

Comments

@beethogedeon

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader
from bs4 import BeautifulSoup as Soup

url = "https://www.example.com/"
loader = RecursiveUrlLoader(
    url=url, max_depth=2, extractor=lambda x: Soup(x, "html.parser").text
)
docs = loader.load()

Error Message and Stack Trace (if applicable)

No response

Description

I'm trying to use the "Recursive URL" document loader from "langchain_community.document_loaders.recursive_url_loader" to load all URLs under a root directory, but CSS and JS links are also processed, so the result contains useless documents.

System Info

System Information

OS: Linux
OS Version: #1 SMP Tue Dec 19 13:14:11 UTC 2023
Python Version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]

Package Information

langchain_core: 0.1.48
langchain: 0.1.17
langchain_community: 0.0.36
langsmith: 0.1.52
langchain_cohere: 0.1.4
langchain_text_splitters: 0.0.1

@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels May 2, 2024
@Siddhesh-Agarwal

Hey @beethogedeon, can you provide the URL where you are facing the problem?

For the URL you have given (https://example.com/), the problem lies in the extractor. You have used a very basic extractor; the code can be changed to:

from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader
from bs4 import BeautifulSoup as Soup

def text_extractor(r_text: str) -> str:
    soup = Soup(r_text, "html.parser")
    return " ".join(soup.text.split())

url = "https://www.example.com/"
loader = RecursiveUrlLoader(
    url=url,
    max_depth=2,
    extractor=text_extractor,
)
docs = loader.load()

@Siddhesh-Agarwal

PS: This only solves the problem of extra whitespace.
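For the CSS/JS documents themselves, one possible workaround is to filter the loaded documents by their source URL. The following is only a minimal sketch, not the library's own fix: `ASSET_SUFFIXES`, `is_asset_url`, and `drop_asset_docs` are hypothetical names introduced here, and it assumes each document's `metadata["source"]` holds the URL it was fetched from.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical set of static-asset suffixes to discard; extend as needed.
ASSET_SUFFIXES = {
    ".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico",
    ".woff", ".woff2",
}


def is_asset_url(url: str) -> bool:
    """Return True when the URL path ends in a known static-asset suffix."""
    # urlparse strips the query string and fragment, so "app.js?v=2" still matches.
    path = urlparse(url).path
    return PurePosixPath(path).suffix.lower() in ASSET_SUFFIXES


def drop_asset_docs(docs):
    """Keep only documents whose metadata["source"] is not a static asset.

    Assumption: every document exposes a `metadata` dict with a "source" URL.
    """
    return [d for d in docs if not is_asset_url(d.metadata.get("source", ""))]
```

With the loader from the examples above, this could be applied as `docs = drop_asset_docs(loader.load())`. Note that it filters after fetching, so the asset URLs are still requested during the crawl; it only keeps them out of the final list.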


2 participants