GitHub - marklysze/AutoGenCodeTesting: Testing AutoGen Code generation with alt-models (local LLMs)

Testing Code Generation using alt-models (Local LLMs)

Two tests so far - Fibonacci function and stock market chart. More to come.

Code Executors

Jupyter Notebook (local)

Fibonacci

Prompt for LLM: Write Python code to calculate the 14th Fibonacci number.

Code should generate: 233 or 377

Note: F(n) for n = 14 is 377. If the sequence starts at n = 0, the 14th number is 233, so n = 14 is actually the 15th Fibonacci number in the sequence. Therefore, some LLMs' code will generate 233 and some 377. We will accept both as the question is ambiguous.
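The two accepted answers can be reproduced with a minimal sketch. It assumes the common convention F(0) = 0 and F(1) = 1; the function name `fib` is illustrative and is not taken from any model's output.

```python
def fib(n: int) -> int:
    """Return F(n), assuming F(0) = 0 and F(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


# Reading 1: "14th" is taken as the index n = 14.
by_index = fib(14)

# Reading 2: "14th" is taken as the 14th term, counting F(0) as the first term.
by_position = fib(13)

print(by_index, by_position)  # 377 233
```

Either value counts as a pass in the table below.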

See the results folder for code outputs.

See what ChatGPT 3.5 generated (which output 377).

	Key
✅	Valid code and formatting
❌	Invalid code, formatting, or didn't understand task

Model	Run 1	Run 2
CodeLlama 7b Python	❌	❌
CodeLlama 13b Python	❌	❌
CodeLlama 34b Instruct	✅	✅
CodeLlama 34b Python	❌	❌
DeepSeek Coder 6.7b	✅	✅
Llama2 7b Chat	✅	❌
Llama2 13b Chat	✅	✅
Mistral 7b 0.2 Instruct	✅	✅
Mixtral 8x7b Q4	✅	✅
Mixtral 8x7b Q5	✅	✅
Neural Chat 7b Chat	✅	✅
Nexus Raven	❌	✅
OpenHermes 7b Mistral	✅	✅
Orca2 13b	✅	❌
Phi-2	✅	❌
Phind-CodeLlama34b	✅	✅
Qwen 14b	✅	✅
Solar 10.7b Instruct	❌	✅
StarCoder2 3b
StarCoder2 7b
StarCoder2 15b
Yi-34b Chat	✅	✅

Stock market chart

Prompt for LLM: Today is {today}. Write Python code to plot TSLA's and META's stock prices YTD.

Code should generate: An image for each stock OR a single image including both stocks. Displaying the image inline (in the Jupyter notebook) is a bonus but does not affect whether the result counts as correct.
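As a rough sketch of how the `{today}` placeholder might be filled in and what "YTD" implies for the date range: the helper names `build_prompt` and `ytd_range` and the ISO date format are assumptions made for illustration, not taken from the repository's scripts.

```python
from datetime import date
from typing import Tuple

# Prompt text copied from above; {today} is substituted at run time.
PROMPT_TEMPLATE = (
    "Today is {today}. Write Python code to plot TSLA's and META's "
    "stock prices YTD."
)


def build_prompt(today: date) -> str:
    """Fill the placeholder with an ISO-formatted date (format is an assumption)."""
    return PROMPT_TEMPLATE.format(today=today.isoformat())


def ytd_range(today: date) -> Tuple[date, date]:
    """Year-to-date runs from 1 January of the current year up to today."""
    return date(today.year, 1, 1), today


# Hypothetical example date, used only to show the shape of the output.
example_day = date(2024, 3, 15)
print(build_prompt(example_day))
print(ytd_range(example_day))
```

A model that plots a different window (for example, the last 12 months) gets the timeframe wrong, which is the kind of error recorded in the Notes column below.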

See the results folder for code outputs.

See what ChatGPT 3.5 generated (which was correct).

	Key
✅	Valid code and formatting
🔶	Almost correct
❌	Invalid code, formatting, or didn't understand task

Model	Run 1	Run 2	Notes
CodeLlama 7b Python	❌	❌
CodeLlama 13b Python	🔶	❌	Just missing "import pandas as pd"
CodeLlama 34b Instruct	✅	✅
CodeLlama 34b Python	❌	❌	Expected a CSV file
DeepSeek Coder 6.7b	🔶	✅	Out of date library usage
Llama2 7b Chat	❌	🔶	Close, showed price change instead of price
Llama2 13b Chat	❌	❌
Mistral 7b 0.2 Instruct	🔶	🔶	Minor string formatting error, used "FB" instead of "META" as ticker symbol
Mixtral 8x7b Q4	✅	❌
Mixtral 8x7b Q5	✅	🔶	Not quite right on second run
Neural Chat 7b Chat	❌	❌
Nexus Raven	❌	❌	First run no code, second is a blank chart
OpenHermes 7b Mistral	❌	❌
Orca2 13b	❌	❌	No code
Phi-2	❌	❌	Off the track
Phind-CodeLlama34b	✅	❌
Qwen 14b	🔶	❌	Timeframe out
Solar 10.7b Instruct	❌	❌
StarCoder2 3b
StarCoder2 7b
StarCoder2 15b
Yi-34b Chat	❌	❌	Old retired library use

Function Calling

Prompt for LLM: Draw two agents chatting with each other with an example dialog. Don't add plt.show().

Code should generate: Python code that creates a diagram with two agents on it (e.g. two circles and speech bubbles).

See the results folder for code outputs.

See what OpenAI's API generated.

	Key
✅	Valid code and formatting
🔶	Almost correct
❌	Invalid code, formatting, or didn't understand task

Model	Run 1	Run 2	Notes
CodeLlama 7b Python	❌	❌
CodeLlama 13b Python	❌	❌
CodeLlama 34b Instruct	❌	🔶	Coded an animation with speech bubbles, bug in function parameters, once fixed produced animated text.
CodeLlama 34b Python	❌	❌
DeepSeek Coder 6.7b	❌	❌	Produced a chat conversation
Llama2 7b Chat	❌	❌	Produced a chat conversation
Llama2 13b Chat	❌	❌	Produced a chat conversation
Mistral 7b 0.2 Instruct	❌	❌	Produced a chat conversation
Mixtral 8x7b Q4	❌	❌	Produced a chat conversation
Mixtral 8x7b Q5	❌	❌	Produced a chat conversation
Neural Chat 7b Chat	❌	❌	Tried to create visual animation but failed and ended up producing a chat conversation
Nexus Raven	❌	❌
OpenHermes 7b Mistral	❌	❌	Produced a chat conversation
Orca2 13b	❌	❌
Phi-2	❌	❌
Phind-CodeLlama34b	🔶	🔶	Valid visual generation of two agents' text chat (no imagery just text on a figure)
Qwen 14b	❌	❌	Produced a chat conversation
Solar 10.7b Instruct	❌	❌	Produced a chat conversation
StarCoder2 3b
StarCoder2 7b
StarCoder2 15b
Yi-34b Chat	🔶	🔶	Close to a valid drawing, outdated libraries

Non-coding tests

Termination word

Background: Tests the ability of an LLM to incorporate a termination word into its response.

Scenario: A Group Chat with a Story_writer and a Product_manager. The Story_writer writes some story ideas, and the Product_manager reviews them and terminates the chat when satisfied by outputting a specific word (e.g. "TERMINATE", "BAZINGA", etc.).

Story_writer's system message: An ideas person, loves coming up with ideas for kids books.

Product_manager's system message: Great in evaluating story ideas from your writers and determining whether they would be unique and interesting for kids. Reply with suggested improvements if they aren't good enough, otherwise reply '{termination_word}' at the end when you're satisfied there's one good story idea.

Prompt for the chat manager: Come up with 3 story ideas for Grade 3 kids.

See the results folder for code outputs.

Note 1: TERMINATE is the standard used by AutoGen.

Note 2: Some LLMs included the terminating word but the quality of the full response was not perfect.
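A minimal sketch of how a termination word could be detected in a reply is shown here. It assumes messages are dictionaries with a `content` key (the shape AutoGen passes to an agent's `is_termination_msg` callable) and uses a plain substring match; the helper name `make_termination_check` is hypothetical, and this is not necessarily how the repository's test script performs the check.

```python
from typing import Callable, Dict, Optional


def make_termination_check(termination_word: str) -> Callable[[Dict], bool]:
    """Build a predicate that reports whether a message contains the word."""

    def is_termination_msg(message: Dict) -> bool:
        # Content may be missing or None (e.g. for function-call messages).
        content: Optional[str] = message.get("content")
        return termination_word in (content or "")

    return is_termination_msg


check = make_termination_check("BAZINGA!")
print(check({"content": "Idea 2 is unique and fun for Grade 3. BAZINGA!"}))  # True
print(check({"content": "Idea 1 needs a clearer ending."}))                  # False
```

A substring match is case-sensitive, so a reply containing "bazinga" would not end the chat under this sketch.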

	Key
✅	Included termination word correctly
❌	Performed task, didn't output termination word
👎	Didn't understand/participate in task

There were two runs for each word.

Model	TERMINATE	ACBDEGFHIKJL	AUTOGENDONE	BAZINGA!	DONESKI	Notes
CodeLlama 7b Python	👎 👎	👎 👎	👎 👎	👎 👎	👎 👎
CodeLlama 13b Python	👎 👎	👎 👎	👎 👎	👎 👎	👎 👎
CodeLlama 34b Instruct	✅ ✅	❌ ✅	❌ ✅	✅ ✅	✅ ❌
CodeLlama 34b Python	👎 👎	👎 👎	👎 👎	👎 👎	👎 👎
DeepSeek Coder 6.7b	❌ 👎	👎 👎	👎 👎	👎 👎	👎 👎
Llama2 7b Chat	✅ ✅	✅ ✅	✅ ✅	✅ ✅	✅ ✅
Llama2 13b Chat	❌ ✅	❌ ✅	✅ ✅	✅ ✅	✅ ✅
Mistral 7b 0.2 Instruct	✅ ✅	✅ ✅	✅ ✅	✅ ✅	✅ ✅
Mixtral 8x7b Q4	✅ ✅	✅ ✅	✅ ✅	✅ ✅	✅ ✅
Mixtral 8x7b Q5	✅ ✅	✅ ✅	✅ ✅	✅ ✅	✅ ✅
Neural Chat 7b Chat	✅ ✅	✅ ✅	✅ ✅	✅ ✅	✅ ✅
Nexus Raven	👎 👎	👎 👎	👎 👎	👎 👎	👎 👎	Tried to call a python function to create the stories
OpenHermes 7b Mistral	✅ ✅	✅ ✅	✅ ✅	✅ ✅	✅ ✅
Orca2 13b	✅ ✅	❌ ✅	❌ ✅	❌ ✅	✅ ✅
Phi-2	👎 👎	👎 ❌	❌ ❌	❌ ❌	❌ 👎
Phind-CodeLlama34b	✅ ✅	✅ ✅	✅ ✅	✅ ✅	✅ ✅
Qwen 14b	❌ ❌	✅ ✅	✅ ✅	✅ ✅	✅ ✅
Solar 10.7b Instruct	✅ ✅	✅ ✅	✅ ✅	✅ ✅	✅ ✅
StarCoder2 3b
StarCoder2 7b
StarCoder2 15b
Yi-34b Chat	✅ ✅	✅ ✅	✅ ✅	✅ ✅	✅ ✅
