With the Google Search tool, only the page snippets are sent to the LLM / GPT #2420

Open
1 task done
cpbotha opened this issue Apr 15, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@cpbotha

cpbotha commented Apr 15, 2024

What happened?

The Google Search tool sends only a tiny fraction of each search result page to the LLM. This does not give the LLM much to work with.

Steps to Reproduce

Configure and add the Google Search plugin in either the Plugins or the Assistants (my preference) mode.

Ask a question that requires a web search, then open the result that is sent back to the LLM. It is the raw Google Search results JSON, which includes only page titles and snippets (a tiny extract of each page), not the actual contents of the search result pages.

PR

I've modified the Google Search tool to extract page contents using Readability.js, and to return all of that to the LLM. See #2419
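
To make the idea concrete, here is a minimal sketch of the enrichment step. It is not the code from #2419: `enrichResults`, `fetchHtml`, `extractText` and `maxChars` are hypothetical names, and the fetching and extraction are injected so the sketch runs without any dependencies. It assumes each search result item carries `title`, `link` and `snippet` fields.

```javascript
// Sketch only: replace each result's snippet with extracted page text.
// `fetchHtml` and `extractText` are hypothetical injected helpers. In a real
// setup they might wrap axios.get and Readability.js (for example
// `new Readability(document).parse()`, which can return null).
async function enrichResults(items, { fetchHtml, extractText, maxChars = 4000 }) {
  return Promise.all(
    items.map(async ({ title, link, snippet }) => {
      try {
        const html = await fetchHtml(link);
        const text = (extractText(html, link) || '').trim();
        // Fall back to the snippet when extraction yields nothing,
        // and cap the length so one page cannot flood the LLM context.
        return { title, link, content: (text || snippet).slice(0, maxChars) };
      } catch (err) {
        // A failed fetch should not sink the whole search; keep the snippet.
        return { title, link, content: snippet };
      }
    }),
  );
}
```

The per-page length cap and the snippet fallback are design assumptions of this sketch, chosen so that one slow or broken page does not break the whole tool call.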

Code of Conduct

  • I agree to follow this project's Code of Conduct
@cpbotha cpbotha added the bug Something isn't working label Apr 15, 2024
@danny-avila danny-avila added enhancement New feature or request and removed bug Something isn't working labels Apr 15, 2024
@danny-avila
Owner

danny-avila commented Apr 15, 2024

This is expected. I appreciate the effort in addressing this, but I'm not sure I would expect the tool to scrape the search results.

@cpbotha
Author

cpbotha commented Apr 15, 2024

Personally I've not been able to come up with another way of giving the LLM access to page contents.

Of course we could give it a PageContentsFetch tool (which one should consider, very helpful when you want to ask it questions about a specific page), but this would come down to the same, just take a bit longer.

Please let me know how you would like to proceed. :)

@danny-avila
Owner

> Personally I've not been able to come up with another way of giving the LLM access to page contents.
>
> Of course we could give it a PageContentsFetch tool (which one should consider, very helpful when you want to ask it questions about a specific page), but this would come down to the same, just take a bit longer.
>
> Please let me know how you would like to proceed. :)

Scraping is fine; just allow some way to configure proxies (including SOCKS5) for however the scraping is done. Personally, I wouldn't want to host any LLM scraping without rotating proxies at work.

@cpbotha
Author

cpbotha commented Apr 20, 2024

Do you want the scraping logic to support rotation internally (i.e. get a list of proxies from configuration and rotate or randomize over them), or are you OK with a single configurable proxy (in which case users will have to use a proxy service that rotates the upstream proxies)?

Note to self: Look into https://github.com/TooTallNate/proxy-agents/tree/main/packages/proxy-agent
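
For illustration, the first option (internal rotation) could look roughly like the sketch below. `createProxyRotator` and the example proxy URLs are hypothetical, not part of the tool or of any library; this is a plain round-robin over a configured list.

```javascript
// Hypothetical round-robin rotation over a configured proxy list.
function createProxyRotator(proxyUrls) {
  if (!Array.isArray(proxyUrls) || proxyUrls.length === 0) {
    throw new Error('At least one proxy URL is required');
  }
  let index = 0;
  // Each call returns the next proxy, wrapping around at the end of the list.
  return function nextProxy() {
    const proxy = proxyUrls[index];
    index = (index + 1) % proxyUrls.length;
    return proxy;
  };
}

// Example usage with made-up proxy addresses.
const nextProxy = createProxyRotator([
  'http://proxy-1.example:3128',
  'socks5://proxy-2.example:1080',
]);
```

Each page fetch would call `nextProxy()` and build its agent from the returned URL. The second option needs none of this: a single proxy URL is configured and any rotation happens upstream.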

@danny-avila
Owner

danny-avila commented Apr 20, 2024

No need, it just needs to handle simple proxy configuration, ideally both regular and SOCKS5 proxies. A lot of proxy services do the rotating for you, and doing that here internally might be beyond the scope of the project.

@cpbotha
Author

cpbotha commented Apr 21, 2024

I've added proxy-agent, which honours the standard proxy environment variables and selects the right {http,https,socks}-proxy-agent, and I've activated it for the axios.get call that fetches the page contents.

How would you like to proceed?
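
For readers unfamiliar with this kind of setup, the selection logic described above can be approximated as follows. This is an illustrative sketch, not the proxy-agent source: `pickProxy` is a hypothetical helper, only a subset of the usual environment variables is handled, and the NO_PROXY matching is deliberately naive.

```javascript
// Illustrative approximation only; NOT the proxy-agent implementation.
// `pickProxy` is a hypothetical helper name.
function pickProxy(targetUrl, env = process.env) {
  const target = new URL(targetUrl);
  const scheme = target.protocol.replace(':', ''); // "http" or "https"

  // Naive NO_PROXY check: exact host or dot-suffix match.
  const noProxy = (env.no_proxy || env.NO_PROXY || '')
    .split(',')
    .map((h) => h.trim().replace(/^\./, ''))
    .filter(Boolean);
  const bypass = noProxy.some(
    (h) => target.hostname === h || target.hostname.endsWith('.' + h),
  );
  if (bypass) return { proxyUrl: null, agentPackage: null };

  // e.g. https_proxy / HTTPS_PROXY, falling back to all_proxy / ALL_PROXY.
  const proxyUrl =
    env[`${scheme}_proxy`] ||
    env[`${scheme.toUpperCase()}_PROXY`] ||
    env.all_proxy ||
    env.ALL_PROXY ||
    null;
  if (!proxyUrl) return { proxyUrl: null, agentPackage: null };

  // SOCKS proxies get the SOCKS agent; HTTP(S) proxies get an agent
  // chosen by whether the target itself is http or https.
  const proxyScheme = new URL(proxyUrl).protocol.replace(':', '');
  let agentPackage;
  if (proxyScheme.startsWith('socks')) {
    agentPackage = 'socks-proxy-agent';
  } else {
    agentPackage = scheme === 'https' ? 'https-proxy-agent' : 'http-proxy-agent';
  }
  return { proxyUrl, agentPackage };
}
```

With this in place, setting something like `HTTPS_PROXY=socks5://127.0.0.1:1080` (a made-up address) before starting the server would route page fetches through a SOCKS5 proxy without any further configuration. In the actual change the package does this work itself; presumably the agent is simply passed to axios through its `httpAgent` and `httpsAgent` request options.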
