Auto-GPT Performance 📈 #5190
Not sure if we are supposed to comment here or not (sorry if this is the wrong place, feel free to delete/move my comment as needed):
I ran into this when I tinkered with recursively splitting up plans using a "compile_plan" command to come up with atomic to-do lists that would eventually end up resembling BIFs: #4107

My idea at the time was to artificially constrain each step so that it could be mapped to a built-in function AND fit into the context window, and to divide & conquer if not. Each step would then be recursively executed by a sub-agent with few or no dependencies on outer context, resembling a call stack, as mentioned here: #3933 (comment)

From that perspective, tackling "task management" like this also means managing our context window better, by identifying independent sub-tasks and delegating those to sub-agents: #70
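A minimal sketch of the recursive plan-splitting idea described above, assuming stand-in helpers: `is_atomic` and `split` here are placeholders for the LLM calls a real "compile_plan" command would make, and the splitting heuristic is purely illustrative.

```python
# Hypothetical sketch: split a task until each step is "atomic"
# (mappable to a built-in function and small enough for the context
# window), then collect the atomic steps in order, like a call stack.

def is_atomic(task: str) -> bool:
    # Placeholder: a real agent would ask the LLM whether the task maps
    # directly to a single built-in function.
    return " and " not in task

def split(task: str) -> list[str]:
    # Placeholder: a real agent would ask the LLM to divide the task.
    return [part.strip() for part in task.split(" and ")]

def compile_plan(task: str) -> list[str]:
    """Recursively divide a task into an ordered list of atomic steps."""
    if is_atomic(task):
        return [task]
    steps: list[str] = []
    for subtask in split(task):
        steps.extend(compile_plan(subtask))  # recurse, like a call stack
    return steps

print(compile_plan("download the report and summarize it and email the summary"))
# → ['download the report', 'summarize it', 'email the summary']
```

Each atomic step could then be handed to a sub-agent that only sees its own step, which is what keeps the per-call context small.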
Performance: what makes or breaks it?
Q: What makes a generalist agent such as Auto-GPT perform or fail?
A: All of the below, and more.
Task processing ⚙️
Comprehension
First of all, the agent has to understand the task it is given. Otherwise, there is no way it can be executed correctly, regardless of how well the rest of the application works.
Conversion
Once the task is understood, the agent may convert it to a format that is usable in the rest of the program flow. Examples are a high-level plan, a to-do list, or just a clarified task description.
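One way to make the conversion step concrete is to have the LLM restate the task as structured output, then parse and validate it before use. The JSON schema below (a `goal` plus a list of `steps`) is an illustrative assumption, not Auto-GPT's actual format.

```python
import json

# Sketch of the "conversion" step: parse an LLM's JSON restatement of a
# task into a plan structure, rejecting output that doesn't match the
# expected shape instead of silently continuing with garbage.

def parse_plan(llm_output: str) -> dict:
    plan = json.loads(llm_output)
    if not isinstance(plan.get("goal"), str) or not isinstance(plan.get("steps"), list):
        raise ValueError("LLM output does not match the expected plan schema")
    return plan

raw = '{"goal": "write a report", "steps": ["gather sources", "draft", "revise"]}'
plan = parse_plan(raw)
print(plan["steps"])  # → ['gather sources', 'draft', 'revise']
```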
Adherence
It is paramount that the agent sticks to the given task and its scope, and does not alter or expand them without the user's involvement, during setup and over the course of a session.
Self-correction
When the execution of the task does not go according to plan, the agent should recognize this and deal with it appropriately. Agents often fail to notice that an action did not have the intended result and continue executing as if everything is fine, which can lead to hallucinations.
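A minimal sketch of such a self-correction check, under the assumption that each action's result can be tested against a success predicate; the action and predicate below are stand-ins for real tool calls and LLM-based result evaluation.

```python
# Sketch: verify an action's result before moving on, and retry a
# bounded number of times instead of assuming the action worked.

def run_with_check(action, succeeded, max_retries=2):
    """Run an action, verify its result, and retry on failure."""
    for attempt in range(max_retries + 1):
        result = action()
        if succeeded(result):
            return result
    raise RuntimeError("action failed after retries; re-plan instead of continuing")

# Illustrative flaky action that only succeeds on the second call.
attempts = []
def flaky_action():
    attempts.append(1)
    return "ok" if len(attempts) >= 2 else "error"

print(run_with_check(flaky_action, lambda r: r == "ok"))  # → ok
```

The important part is the `raise` at the end: an unrecoverable failure should surface to the planning loop rather than letting the agent continue as if all is fine.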
Prompting 💬
There are many factors that influence the efficacy of a prompt, but it should be noted that LLMs are not deterministic, linear or time-invariant: changing one word may have unpredictable and seemingly unrelated effects, and LLMs may return different completions when prompted multiple times with the same prompt.
Any prompt must be evaluated for its average performance, and the system as a whole must be designed to correct for negative deviations.
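Since completions vary between runs, "average performance" can be estimated by sampling a prompt many times and scoring the success rate. The sketch below fakes a non-deterministic LLM with a seeded random generator; `sample_completion` and the 80% success rate are illustrative assumptions.

```python
import random

# Sketch: score a prompt by its average success rate over many samples
# rather than a single run, since LLM output is not deterministic.

def sample_completion(prompt: str, rng: random.Random) -> str:
    # Stand-in for an LLM call that succeeds ~80% of the time.
    return "valid" if rng.random() < 0.8 else "invalid"

def average_success(prompt: str, is_success, n: int = 100, seed: int = 0) -> float:
    rng = random.Random(seed)
    hits = sum(is_success(sample_completion(prompt, rng)) for _ in range(n))
    return hits / n

rate = average_success("example prompt", lambda c: c == "valid")
print(round(rate, 2))
```

Fixing the seed makes the evaluation reproducible, which matters when comparing two prompt variants against each other.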
For a guide on how to write good prompts for OpenAI's LLMs, see their GPT best practices.
A few basic principles:
Related:
Longevity: context length management
Whereas LLMs have a limit to their context length, the basic agent concept does not. This calls for solutions to manage the variable-length parts of the prompt, such as the execution history. The simplest approach is to compress and/or truncate the execution history in order to fit more in the prompt. Another is to use a semantic document store and to select and inject items based on their current relevance.
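The simplest strategy mentioned above (truncating the execution history) can be sketched as keeping the most recent messages that fit a token budget. Counting whitespace-separated words is a rough stand-in here for a real tokenizer such as tiktoken.

```python
# Sketch: keep the newest history entries that fit the token budget,
# dropping the oldest first, and preserve chronological order.

def truncate_history(messages: list[str], budget: int) -> list[str]:
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):  # walk newest-first
        cost = len(msg.split())     # crude token estimate
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))     # restore chronological order

history = ["step one done", "step two failed with error",
           "retrying step two", "step two done"]
print(truncate_history(history, budget=8))
# → ['retrying step two', 'step two done']
```

The semantic-store approach would replace the recency criterion with a relevance score, but the budget-fitting logic stays the same.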
Tools 🛠️
Suitable tools should be available to the agent
Obviously, without the right tools, an agent can't do the job. With good task processing, it can sometimes get close, though, using the tools that are available to reach a partial solution.
The available tools must be suitable for use with an LLM-powered agent
The input, output, and side effects of a tool must be very well defined to the LLM, e.g. in the system prompt or the output message of an action with side effects. Also, when a tool fails in some way, the error message should allow the agent to understand the issue, so that it can be dealt with appropriately.
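A sketch of what a well-defined tool might look like, assuming a hypothetical spec shape loosely modelled on function-calling schemas; the key point is that both the spec and the failure message give the LLM something concrete to reason about.

```python
# Sketch: a tool whose input, output, and failure behaviour are spelled
# out for the LLM, and whose errors are descriptive rather than opaque.

def read_file(path: str) -> str:
    """Return a file's contents, or a descriptive error string."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        return f"Error: file '{path}' does not exist. Check the path or create the file first."

# Hypothetical spec the agent would see in its system prompt.
TOOL_SPEC = {
    "name": "read_file",
    "description": "Read a UTF-8 text file and return its contents.",
    "parameters": {"path": "string, path to an existing file"},
    "returns": "file contents, or a message starting with 'Error:' on failure",
}

print(read_file("does_not_exist.txt"))
```

An error like the one above lets the agent decide between correcting the path and creating the file, instead of hallucinating file contents.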
Cost / Speed / Performance 📊
Aside from an agent's capacity to fulfil tasks, its efficiency in terms of time and money should also be considered as a part of its total performance. This comes down to efficient use of resources, and proper choice of LLMs to use for different internal processes of the agent.
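One such consideration can be sketched as model routing: send cheap internal steps to a cheaper model and reserve the expensive model for hard ones. The model names, keyword heuristic, and prices below are illustrative assumptions, not real pricing.

```python
# Sketch: route each internal step to a model tier by difficulty, and
# estimate the resulting cost. All names and prices are made up.

PRICE_PER_1K_TOKENS = {"cheap-model": 0.0005, "smart-model": 0.03}

def choose_model(step: str) -> str:
    hard_keywords = ("plan", "reason", "decide")
    return "smart-model" if any(k in step for k in hard_keywords) else "cheap-model"

def estimate_cost(step: str, tokens: int) -> float:
    model = choose_model(step)
    return tokens / 1000 * PRICE_PER_1K_TOKENS[model]

print(choose_model("summarize this text"))   # → cheap-model
print(choose_model("plan the next action"))  # → smart-model
```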
Example considerations:
Measuring performance 📈
To measure the impact of changes and intended improvements on performance, we use our Benchmark. This benchmark is also used and recognized by various other agent developers (see the README).
Notes:
Example of verified improvement over a number of revisions: