Auto-GPT Performance 📈 #5190

Pwuts opened this issue Sep 10, 2023 · 1 comment

Pwuts commented Sep 10, 2023

Note
This issue is a work in progress. It will be expanded and elaborated further as insights advance and questions come in (so feel free to ask!).

Performance: what makes or breaks it?

Q: What makes a generalist agent such as Auto-GPT perform or fail?
A: All of the below, and more.

Task processing ⚙️

Comprehension

First of all, the agent has to understand the task it is given. Otherwise, there is no way the task can be executed correctly, regardless of how well the rest of the application works.

Conversion

Once the task is understood, the agent may convert it to a format that is usable in the rest of the program flow. Examples are a high-level plan, a to-do list, or just a clarified task description.
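
As an illustration, conversion could produce a structure like the one below. This is a minimal sketch, not Auto-GPT's actual implementation; the `llm_complete` helper and the prompt wording are assumptions for illustration:

```python
# Sketch: convert a raw task into a clarified description plus atomic steps.
from dataclasses import dataclass, field

@dataclass
class TaskPlan:
    original_task: str
    clarified_task: str
    steps: list[str] = field(default_factory=list)

def convert_task(raw_task: str, llm_complete) -> TaskPlan:
    """Ask the LLM to restate the task and break it into steps."""
    clarified = llm_complete(
        f"Restate the following task in one unambiguous sentence:\n{raw_task}"
    )
    steps_text = llm_complete(
        f"List the steps needed to complete this task, one per line:\n{clarified}"
    )
    steps = [s.strip("- ").strip() for s in steps_text.splitlines() if s.strip()]
    return TaskPlan(original_task=raw_task, clarified_task=clarified, steps=steps)
```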

Adherence

It is paramount that the agent sticks to the given task and its scope, and does not alter or expand either without the user's involvement, both during setup and over the course of a session.

Self-correction

When the execution of the task does not go according to plan, the agent should recognize this and deal with it appropriately. It is not uncommon for agents not to recognize that an action did not have the intended result and to continue executing as if all is fine, which can lead to hallucinations.
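
A minimal sketch of what such self-correction could look like, assuming hypothetical `action.run()`, `result.ok`/`result.error`, and `replan` interfaces (none of these are Auto-GPT's real API):

```python
def execute_with_correction(action, replan, max_retries: int = 2):
    """Run an action, feeding failures back into planning instead of ignoring them."""
    result = None
    for _ in range(max_retries + 1):
        result = action.run()
        if result.ok:
            return result
        # Recognize the failure and re-plan, rather than continuing as if
        # all is fine -- this is what prevents hallucinated "progress".
        action = replan(failed_action=action, error=result.error)
    raise RuntimeError(f"Action still failing after {max_retries} retries: {result.error}")
```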

Prompting 💬

There are many factors that influence the efficacy of a prompt, but it should be noted that LLMs are not deterministic, linear or time-invariant: changing one word may have unpredictable and seemingly unrelated effects, and LLMs may return different completions when prompted multiple times with the same prompt.

Any prompt must be evaluated for its average performance, and the system as a whole must be designed to correct for negative deviations.
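
A minimal sketch of such an evaluation, assuming hypothetical `llm_complete` (model call) and `score_completion` (task-specific grader) helpers:

```python
def evaluate_prompt(prompt: str, llm_complete, score_completion, n_samples: int = 10) -> float:
    """Mean score over several completions of the same prompt.

    A single completion is not representative, since the model may return
    different completions for the same prompt across tries.
    """
    scores = [score_completion(llm_complete(prompt)) for _ in range(n_samples)]
    return sum(scores) / len(scores)
```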

For a guide on how to write good prompts for OpenAI's LLMs, see their GPT best practices.

A few basic principles:

  • Unnecessary complexity or bloat in the prompt can lead to decreased performance. Conversely, making a prompt more focused can lead to better performance.
  • Especially for GPT-3.5, detailed instructions help comprehension and adherence a lot.

Longevity: context length management

Whereas LLMs have a limit to their context length, the basic agent concept does not. This calls for solutions to manage the variable-length parts of the prompt, such as the execution history. The simplest approach is to compress and/or truncate the execution history in order to fit more in the prompt. Another is to use a semantic document store and to select and inject items based on their current relevance.
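
A minimal sketch of the simplest approach, truncation, assuming a hypothetical `count_tokens` helper (e.g. wrapping a tokenizer such as tiktoken); the budget value is illustrative:

```python
def trim_history(history: list[str], count_tokens, budget: int = 3000) -> list[str]:
    """Keep the most recent history entries that fit within the token budget."""
    kept: list[str] = []
    used = 0
    for entry in reversed(history):  # newest entries first; they are most relevant
        cost = count_tokens(entry)
        if used + cost > budget:
            break
        kept.append(entry)
        used += cost
    return list(reversed(kept))  # restore chronological order
```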

Tools 🛠️

  • Suitable tools should be available to the agent
    Obviously, without the right tools, an agent won't be able to do the job. With good task processing, it can sometimes get close, though, using the tools that are available to achieve a partial solution.

  • The available tools must be suitable for use with an LLM-powered agent
    The input, output, and side effects of a tool must be very well defined to the LLM, e.g. in the system prompt or in the output message of an action with side effects. Also, when a tool fails in some way, the error message should allow the agent to understand the issue so that it can be dealt with appropriately (see the sketch after this list).
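
A minimal sketch of a well-defined tool, with an error path that returns actionable text to the agent. The `ToolSpec` structure is an assumption for illustration, not Auto-GPT's actual command registry:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolSpec:
    name: str
    description: str              # injected into the system prompt
    args_schema: dict[str, str]   # argument name -> type/purpose, shown to the LLM
    run: Callable[..., str]

def safe_run(tool: ToolSpec, **kwargs) -> str:
    """Run a tool, returning failures as text the agent can reason about."""
    try:
        return tool.run(**kwargs)
    except Exception as e:
        # An actionable error message lets the agent recover, instead of
        # continuing as if the call succeeded.
        return f"ERROR: {tool.name} failed: {e}. Expected arguments: {tool.args_schema}"
```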

Cost / Speed / Performance 📊

Aside from an agent's capacity to fulfil tasks, its efficiency in terms of time and money should also be considered as a part of its total performance. This comes down to efficient use of resources, and proper choice of LLMs to use for different internal processes of the agent.

Example considerations:

  • Running a memory system will increase the consumption of API tokens, and may only increase performance in long runs. So in a portion of usage scenarios, the performance benefits will not outweigh the cost of running the memory system.
  • Using multiple smaller prompts as a part of an agent's internal process (e.g. plan-execute-evaluate) will significantly increase overall token usage while mostly benefiting the agent's performance on complex tasks.
  • GPT-4 is slower and more expensive, but better at processing complex tasks and at dealing with setbacks (a per-process model choice is sketched after this list).
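
As an illustration of the last point, an agent could route different internal processes to different models. The mapping below is a sketch with illustrative model names, not project configuration:

```python
# Illustrative mapping of internal processes to models by cost/capability.
MODEL_FOR_PROCESS = {
    "planning": "gpt-4",               # complex reasoning: slower and pricier
    "execution": "gpt-4",
    "summarization": "gpt-3.5-turbo",  # simple compression: cheap and fast
    "evaluation": "gpt-3.5-turbo",
}

def pick_model(process: str) -> str:
    return MODEL_FOR_PROCESS.get(process, "gpt-3.5-turbo")
```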

Measuring performance 📈

To measure the impact of changes and intended improvements on performance, we use our Benchmark. This benchmark is also used and recognized by various other agent developers (see the README).

Notes:

  • As with prompts, the performance of an agent as a whole on a given challenge must be expressed as an average performance number, in this case a plain success rate (see the sketch after this list). The agent may behave differently between tries, leading to varying results.
    • When a challenge gets a success rate between 30% and 80%, pushing it to a >95% success rate is probably a matter of minor prompt optimization.
  • Most if not all of the current benchmark challenges have a time cut-off. If an agent is capable but too slow, it can still fail the challenge.
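
A minimal sketch of expressing performance as a success rate over repeated tries, where `run_challenge` is a hypothetical callable returning True when the agent succeeds within the challenge's cut-off:

```python
def success_rate(run_challenge, tries: int = 10) -> float:
    """Fraction of tries in which the challenge succeeded,
    e.g. 7 successes out of 10 tries -> 0.7."""
    return sum(1 for _ in range(tries) if run_challenge()) / tries
```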

Example of verified improvement over a number of revisions:
[chart: benchmark success rates improving across agent revisions]

@Pwuts Pwuts added the meta Meta-issue about a topic that multiple issues already exist for label Sep 10, 2023
@Pwuts Pwuts self-assigned this Sep 10, 2023
@Pwuts Pwuts pinned this issue Sep 10, 2023

Boostrix commented Oct 4, 2023

Not sure if we are supposed to comment here or not (sorry if this is the wrong place, feel free to delete/move my comment as needed):

Longevity: context length management

I ran into this when I tinkered with recursively splitting up plans using a "compile_plan" command to come up with atomic to-do lists that would eventually end up resembling BIFs: #4107

My idea at the time was to artificially constrain each step so that it is mappable to a built-in function AND fits into the context window, and if not, to divide & conquer. Each step would then be recursively executed by a sub-agent with little to no dependency on outer context, resembling a call stack, as mentioned here: #3933 (comment)

From that perspective, tackling "task management" like this also means managing our context window better, by identifying independent sub-tasks and delegating those to sub-agents: #70
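
A minimal sketch of that recursive divide & conquer idea, with all helper names hypothetical:

```python
def execute_step(step, fits_context, is_atomic, split, run_subagent):
    """Recursively split a step until it fits the context window and maps
    to a built-in function, then run each leaf with a sub-agent."""
    if fits_context(step) and is_atomic(step):
        return run_subagent(step)  # leaf: needs little or no outer context
    # Otherwise divide & conquer, like descending a call stack.
    return [execute_step(sub, fits_context, is_atomic, split, run_subagent)
            for sub in split(step)]
```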
