marklysze/AutoGenCodeTesting

Testing Code Generation using alt-models (Local LLMs)

Two tests so far - Fibonacci function and stock market chart. More to come.

Code Executors

Jupyter Notebook (local)

Basis for testing code here

Fibonacci

Prompt for LLM: Write Python code to calculate the 14th Fibonacci number.

Code should generate: 233 or 377

Note: Fn for n = 14 is 377. Counting from n = 0, however, the 14th number in the sequence is 233 (n = 13), which makes n = 14 the 15th Fibonacci number. Some LLMs' code will therefore generate 233 and some 377. Both are accepted because the question is ambiguous.
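To make the two accepted answers concrete, here is a minimal sketch of both readings of the prompt. The function name `fib` and the `ACCEPTED` set are illustrative only and are not taken from the test harness in this repository.

```python
def fib(n: int) -> int:
    """Return Fn using the convention F0 = 0, F1 = 1."""
    if n < 0:
        raise ValueError("n must be non-negative")
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


# Reading 1: "14th" means n = 14.
by_index = fib(14)

# Reading 2: "14th" means the 14th term when counting F0 as the first term,
# which is n = 13.
by_position = fib(13)

# Hypothetical acceptance set: either reading counts as a correct answer.
ACCEPTED = {by_index, by_position}

print(by_index, by_position)
```

A generated script would pass under this sketch if the number it prints is in `ACCEPTED`.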

See the results folder for code outputs.

See what ChatGPT 3.5 generated (which output 377).

Key
Valid code and formatting
Invalid code, formatting, or didn't understand task
Model Run 1 Run 2 Notes
CodeLlama 7b Python
CodeLlama 13b Python
CodeLlama 34b Instruct
CodeLlama 34b Python
DeepSeek Coder 6.7b
Llama2 7b Chat
Llama2 13b Chat
Mistral 7b 0.2 Instruct
Mixtral 8x7b Q4
Mixtral 8x7b Q5
Neural Chat 7b Chat
Nexus Raven
OpenHermes 7b Mistral
Orca2 13b
Phi-2
Phind-CodeLlama34b
Qwen 14b
Solar 10.7b Instruct
StarCoder2 3b
StarCoder2 7b
StarCoder2 15b
Yi-34b Chat

Stock market chart

Prompt for LLM: Today is {today}. Write Python code to plot TSLA's and META's stock prices YTD.

Code should generate: An image for each stock OR an image including both stocks. If it tries to display it (Jupyter notebook) it's a bonus (but doesn't affect whether it's correct or not).
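The `{today}` placeholder in the prompt is presumably filled with the current date so that the model can work out the year-to-date window. The sketch below shows one way to build the prompt and the window the generated code is expected to cover. The helper names `build_prompt` and `ytd_window` are hypothetical, and the ISO date format is an assumption; the repository may format the date differently.

```python
from datetime import date

PROMPT_TEMPLATE = (
    "Today is {today}. Write Python code to plot TSLA's and META's "
    "stock prices YTD."
)


def build_prompt(today: date) -> str:
    """Fill the prompt template. ISO format (YYYY-MM-DD) is an assumption."""
    return PROMPT_TEMPLATE.format(today=today.isoformat())


def ytd_window(today: date) -> tuple[date, date]:
    """Return the year-to-date range: 1 January of today's year up to today."""
    return date(today.year, 1, 1), today


if __name__ == "__main__":
    today = date.today()
    print(build_prompt(today))
    print(ytd_window(today))
```

A chart whose x-axis starts before 1 January of the current year, or stops well short of the given date, falls outside this window.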

See the results folder for code outputs.

See what ChatGPT 3.5 generated (which was correct).

Key
Valid code and formatting
🔶 Almost correct
Invalid code, formatting, or didn't understand task
Model Run 1 Run 2 Notes
CodeLlama 7b Python
CodeLlama 13b Python 🔶 Just missing "import pandas as pd"
CodeLlama 34b Instruct
CodeLlama 34b Python Expected a CSV file
DeepSeek Coder 6.7b 🔶 Out of date library usage
Llama2 7b Chat 🔶 Close, showed price change instead of price
Llama2 13b Chat
Mistral 7b 0.2 Instruct 🔶 🔶 Minor string formatting error, used "FB" instead of "META" as ticker symbol
Mixtral 8x7b Q4
Mixtral 8x7b Q5 🔶 Not quite right on second run
Neural Chat 7b Chat
Nexus Raven First run no code, second is a blank chart
OpenHermes 7b Mistral
Orca2 13b No code
Phi-2 Off the track
Phind-CodeLlama34b
Qwen 14b 🔶 Timeframe out
Solar 10.7b Instruct
StarCoder2 3b
StarCoder2 7b
StarCoder2 15b
Yi-34b Chat Old retired library use

Function Calling

Prompt for LLM: Draw two agents chatting with each other with an example dialog. Don't add plt.show().

Code should generate: Python code that creates a diagram with two agents on it (E.g. two circles and speech bubbles)

See the results folder for code outputs.

See what OpenAI's API generated.

Key
Valid code and formatting
🔶 Almost correct
Invalid code, formatting, or didn't understand task
Model Run 1 Run 2 Notes
CodeLlama 7b Python
CodeLlama 13b Python
CodeLlama 34b Instruct 🔶 Coded an animation with speech bubbles, bug in function parameters, once fixed produced animated text.
CodeLlama 34b Python
DeepSeek Coder 6.7b Produced a chat conversation
Llama2 7b Chat Produced a chat conversation
Llama2 13b Chat Produced a chat conversation
Mistral 7b 0.2 Instruct Produced a chat conversation
Mixtral 8x7b Q4 Produced a chat conversation
Mixtral 8x7b Q5 Produced a chat conversation
Neural Chat 7b Chat Tried to create visual animation but failed and ended up producing a chat conversation
Nexus Raven
OpenHermes 7b Mistral Produced a chat conversation
Orca2 13b
Phi-2
Phind-CodeLlama34b 🔶 🔶 Valid visual generation of two agents' text chat (no imagery just text on a figure)
Qwen 14b Produced a chat conversation
Solar 10.7b Instruct Produced a chat conversation
StarCoder2 3b
StarCoder2 7b
StarCoder2 15b
Yi-34b Chat 🔶 🔶 Close to a valid drawing, outdated libraries

Non-coding tests

Termination word

Background: Tests the ability of an LLM to incorporate a termination word into its response.

Scenario: A Group Chat with a Story_writer and a Product_manager. The Story_writer writes some story ideas; the Product_manager reviews them and, when satisfied, terminates the chat by outputting a specific word (e.g. "TERMINATE", "BAZINGA", etc.).

Story_writer's system message: An ideas person, loves coming up with ideas for kids books.

Product_manager's system message: Great in evaluating story ideas from your writers and determining whether they would be unique and interesting for kids. Reply with suggested improvements if they aren't good enough, otherwise reply '{termination_word}' at the end when you're satisfied there's one good story idea.

Prompt for the chat manager: Come up with 3 story ideas for Grade 3 kids.

See the results folder for code outputs.

Note 1: TERMINATE is the standard used by AutoGen.

Note 2: Some LLMs included the terminating word but the quality of the full response was not perfect.
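The sketch below shows how the `{termination_word}` placeholder in the Product_manager's system message could be filled, and how a reply could be checked for the word. AutoGen agents accept an `is_termination_msg` callable that receives a message dict and returns a boolean; the functions here follow that shape but are illustrative, not the code used for these tests. Matching the word anywhere in the reply (rather than only at the end) is an assumption.

```python
PRODUCT_MANAGER_TEMPLATE = (
    "Great in evaluating story ideas from your writers and determining "
    "whether they would be unique and interesting for kids. Reply with "
    "suggested improvements if they aren't good enough, otherwise reply "
    "'{termination_word}' at the end when you're satisfied there's one "
    "good story idea."
)


def build_system_message(termination_word: str) -> str:
    """Insert the chosen termination word into the system message."""
    return PRODUCT_MANAGER_TEMPLATE.format(termination_word=termination_word)


def is_termination_msg(message: dict, termination_word: str) -> bool:
    """Return True if the message content contains the termination word.

    Assumption: a substring match anywhere in the reply counts.
    """
    content = message.get("content") or ""
    return termination_word in content


if __name__ == "__main__":
    word = "BAZINGA!"
    print(build_system_message(word))
    print(is_termination_msg({"content": "Idea 2 is great. BAZINGA!"}, word))
```

To pass a check like this to an agent, the word can be bound first, for example with `lambda msg: is_termination_msg(msg, "BAZINGA!")`.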

Key
✅ Included termination word correctly
❌ Performed task, didn't output termination word
👎 Didn't understand/participate in task

There were two runs for each word.

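| Model | TERMINATE (1) | TERMINATE (2) | ACBDEGFHIKJL (1) | ACBDEGFHIKJL (2) | AUTOGENDONE (1) | AUTOGENDONE (2) | BAZINGA! (1) | BAZINGA! (2) | DONESKI (1) | DONESKI (2) | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CodeLlama 7b Python | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | |
| CodeLlama 13b Python | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | |
| CodeLlama 34b Instruct | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | |
| CodeLlama 34b Python | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | |
| DeepSeek Coder 6.7b | ❌ | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | |
| Llama2 7b Chat | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| Llama2 13b Chat | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| Mistral 7b 0.2 Instruct | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| Mixtral 8x7b Q4 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| Mixtral 8x7b Q5 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| Neural Chat 7b Chat | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| Nexus Raven | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | 👎 | Tried to call a python function to create the stories |
| OpenHermes 7b Mistral | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| Orca2 13b | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ | |
| Phi-2 | 👎 | 👎 | 👎 | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | 👎 | |
| Phind-CodeLlama34b | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| Qwen 14b | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| Solar 10.7b Instruct | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| StarCoder2 3b | | | | | | | | | | | |
| StarCoder2 7b | | | | | | | | | | | |
| StarCoder2 15b | | | | | | | | | | | |
| Yi-34b Chat | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |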
