With the Google Search tool, only the page snippets are sent to the LLM / GPT #2420

cpbotha · 2024-04-15T14:14:22Z

What happened?

The Google Search tool sends the LLM only a tiny fraction of each search result page. This does not give the LLM much to work with.

Steps to Reproduce

Configure and add the Google Search plugin to either the Plugins or Assistants (my preference) modes.

Ask a question that requires a web search, then open the result that is sent back to the LLM. It is the raw Google Search results JSON, which includes only page titles and snippets (a tiny extract of each page), not the actual contents of the search result pages.

PR

I've modified the Google Search tool to extract page contents using Readability.js, and to return all of that to the LLM. See #2419
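As a rough illustration of the idea (not the code in #2419), enriching each search result with extracted page text could look like the sketch below. The `title`, `link` and `snippet` fields mirror the search results JSON; `EnrichedItem`, `Extractor`, `enrichResults`, `truncate` and the `maxChars` budget are hypothetical names and assumptions made for this example. The extractor is passed in so the sketch stays self-contained; in the PR that role is played by fetching the page and running Readability.js on it.

```typescript
// One entry of the search results JSON, as currently sent to the LLM.
interface SearchItem {
  title: string;
  link: string;
  snippet: string;
}

// Hypothetical enriched result: the original fields plus extracted page text.
interface EnrichedItem extends SearchItem {
  content: string | null; // null when the page could not be fetched or parsed
}

// Hypothetical extractor: returns readable text for a URL, or null.
type Extractor = (url: string) => Promise<string | null>;

// Assumption: cap each page so one long article cannot crowd out the others.
function truncate(text: string, maxChars: number): string {
  return text.length <= maxChars ? text : text.slice(0, maxChars) + " [truncated]";
}

async function enrichResults(
  items: SearchItem[],
  extract: Extractor,
  maxChars = 4000, // assumed per-page character budget
): Promise<EnrichedItem[]> {
  return Promise.all(
    items.map(async (item) => {
      try {
        const text = await extract(item.link);
        return { ...item, content: text === null ? null : truncate(text, maxChars) };
      } catch {
        // A page that fails to load should not sink the whole search;
        // the title and snippet are still returned.
        return { ...item, content: null };
      }
    }),
  );
}
```

Falling back to `content: null` keeps the current snippet-only behaviour for pages that cannot be read, so the LLM never gets less than it does today.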

Code of Conduct

I agree to follow this project's Code of Conduct

danny-avila · 2024-04-15T14:35:07Z

This is expected. I appreciate the effort in addressing this, but I'm not sure I would expect the tool to scrape the search results.

cpbotha · 2024-04-15T14:58:33Z

Personally I've not been able to come up with another way of giving the LLM access to page contents.

Of course we could give it a PageContentsFetch tool (which one should consider; it is very helpful when you want to ask questions about a specific page), but this would amount to the same thing and just take a bit longer.

Please let me know how you would like to proceed. :)

danny-avila · 2024-04-16T11:57:29Z

Scraping is fine; just allow some way to configure proxies (including SOCKS5) for however the scraping is done. Personally, I wouldn't want to host any LLM scraping without rotating proxies at work.

cpbotha · 2024-04-20T12:27:43Z

Do you want the scraping logic to support the rotation internally (i.e. get a list of proxies from configuration and rotate / randomize over them), or are you OK with a single proxy being configurable? (In that case users will have to make use of a proxy-proxy service that rotates the upstreams.)

Note to self: Look into https://github.com/TooTallNate/proxy-agents/tree/main/packages/proxy-agent

danny-avila · 2024-04-20T18:45:42Z

No need, it just needs to handle simple proxy configuration, ideally both regular and SOCKS5 proxies. A lot of proxy services do the rotating for you, and doing that here, internally, might be beyond the scope of the project.

cpbotha · 2024-04-21T18:50:51Z

I've added proxy-agent, which honours the standard proxy environment variables and selects the right {http,https,socks}-proxy-agent, and activated it for the axios.get of the page contents.

How would you like to proceed?
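To make the environment-variable behaviour concrete, here is a simplified, self-contained sketch of the selection logic described above. It is not proxy-agent's actual code: `chooseProxy`, `readEnv`, `isExcluded` and the `ProxyChoice` type are hypothetical names for this example, and the NO_PROXY handling is deliberately reduced (no port or wildcard-prefix matching). It assumes the conventional `HTTP_PROXY`, `HTTPS_PROXY`, `ALL_PROXY` and `NO_PROXY` variables in either case.

```typescript
type AgentKind = "http-proxy-agent" | "https-proxy-agent" | "socks-proxy-agent";

interface ProxyChoice {
  proxyUrl: string;
  agent: AgentKind;
}

type Env = Record<string, string | undefined>;

// Read a variable in lower- or upper-case form, e.g. https_proxy / HTTPS_PROXY.
function readEnv(env: Env, name: string): string | undefined {
  return env[name.toLowerCase()] || env[name.toUpperCase()] || undefined;
}

// True when the host matches a NO_PROXY entry (exact match or domain suffix).
function isExcluded(host: string, noProxy: string | undefined): boolean {
  if (!noProxy) return false;
  return noProxy
    .split(",")
    .map((entry) => entry.trim().toLowerCase())
    .filter((entry) => entry.length > 0)
    .some((entry) => {
      if (entry === "*") return true;
      const suffix = entry.startsWith(".") ? entry : "." + entry;
      return host === suffix.slice(1) || host.endsWith(suffix);
    });
}

// Decide which proxy (if any) and which agent package a request would use.
function chooseProxy(targetUrl: string, env: Env): ProxyChoice | null {
  const target = new URL(targetUrl);
  if (isExcluded(target.hostname.toLowerCase(), readEnv(env, "no_proxy"))) return null;

  const scheme = target.protocol.replace(":", ""); // "http" or "https"
  const proxyUrl = readEnv(env, `${scheme}_proxy`) ?? readEnv(env, "all_proxy");
  if (!proxyUrl) return null; // no proxy configured: connect directly

  // socks://, socks4://, socks5:// ... all go through the SOCKS agent.
  if (new URL(proxyUrl).protocol.startsWith("socks")) {
    return { proxyUrl, agent: "socks-proxy-agent" };
  }
  // For HTTP(S) proxies the agent depends on the target: plain HTTP requests
  // are forwarded, HTTPS requests are tunnelled through the proxy.
  return { proxyUrl, agent: scheme === "https" ? "https-proxy-agent" : "http-proxy-agent" };
}
```

Calling `chooseProxy(pageUrl, process.env)` with `HTTPS_PROXY=socks5://127.0.0.1:1080` set would select the SOCKS agent for an `https://` page, which covers the "regular and SOCKS5" requirement without any rotation logic in the tool itself.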

cpbotha added the "bug" (Something isn't working) label Apr 15, 2024

cpbotha mentioned this issue Apr 15, 2024

Google search page contents #2419

Draft

10 tasks

danny-avila added the "enhancement" (New feature or request) label and removed the "bug" (Something isn't working) label Apr 15, 2024
