Better image understanding without gpt-vision using LLAVA
Kav-K committed Dec 31, 2023
1 parent 800c81f commit 0548c4e
Showing 6 changed files with 25 additions and 22 deletions.
16 changes: 8 additions & 8 deletions .github/workflows/black-and-deploy.yml
@@ -38,7 +38,7 @@ jobs:
       password: ${{ secrets.SSH_PASS }}
       port: 22
       source: gpt3discord.py
-      target: /home/kaveen/GPTDiscord
+      target: /home/GPTDiscord
   - name: copy file via ssh password
     uses: appleboy/scp-action@master
     with:
@@ -47,7 +47,7 @@
       password: ${{ secrets.SSH_PASS }}
       port: 22
       source: conversation_starter_pretext.txt
-      target: /home/kaveen/GPTDiscord
+      target: /home/GPTDiscord
   - name: copy file via ssh password
     uses: appleboy/scp-action@master
     with:
@@ -56,44 +56,44 @@
       password: ${{ secrets.SSH_PASS }}
       port: 22
       source: image_optimizer_pretext.txt
-      target: /home/kaveen/GPTDiscord/
+      target: /home/GPTDiscord/
   - name: Copy via ssh
     uses: garygrossgarten/github-action-scp@release
     with:
       local: cogs
-      remote: /home/kaveen/GPTDiscord/cogs
+      remote: /home/GPTDiscord/cogs
       host: lab.kaveenk.dev
       username: root
       password: ${{ secrets.SSH_PASS }}
   - name: Copy via ssh
     uses: garygrossgarten/github-action-scp@release
     with:
       local: models
-      remote: /home/kaveen/GPTDiscord/models
+      remote: /home/GPTDiscord/models
       host: lab.kaveenk.dev
       username: root
       password: ${{ secrets.SSH_PASS }}
   - name: Copy via ssh
     uses: garygrossgarten/github-action-scp@release
     with:
       local: openers
-      remote: /home/kaveen/GPTDiscord/openers
+      remote: /home/GPTDiscord/openers
       host: lab.kaveenk.dev
      username: root
       password: ${{ secrets.SSH_PASS }}
   - name: Copy via ssh
     uses: garygrossgarten/github-action-scp@release
     with:
       local: services
-      remote: /home/kaveen/GPTDiscord/services
+      remote: /home/GPTDiscord/services
       host: lab.kaveenk.dev
       username: root
       password: ${{ secrets.SSH_PASS }}
   - name: Restart bot!
     uses: fifsky/ssh-action@master
     with:
       command: |
-        cd /home/kaveen/GPTDiscord
+        cd /home/GPTDiscord
         kill -9 $(cat bot.pid)
         rm bot.pid
         screen -dmS GPTBot python3.9 gpt3discord.py
2 changes: 1 addition & 1 deletion auto_restarter.py
@@ -25,7 +25,7 @@ def restart_service():
     Restarts the service by running a series of system commands.
     """
     commands = [
-        "cd /home/kaveen/GPTDiscord",
+        "cd /home/GPTDiscord",
        "kill -9 $(cat bot.pid)",
         "rm bot.pid",
         "screen -dmS GPTBot python3.9 gpt3discord.py",
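For context, a minimal sketch of how a restarter like this can execute that command list, assuming the commands are joined and run through one shell. The function body around this hunk is not shown, so os.system and the ';' joining are illustrative rather than the repository's actual mechanism:

    import os

    def restart_service():
        """Restarts the service by running a series of system commands."""
        commands = [
            "cd /home/GPTDiscord",
            "kill -9 $(cat bot.pid)",  # stop the running bot via its PID file
            "rm bot.pid",              # clear the stale PID file
            "screen -dmS GPTBot python3.9 gpt3discord.py",  # relaunch detached
        ]
        # Join with ';' so the cd applies to every command that follows,
        # all inside a single shell invocation.
        os.system("; ".join(commands))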
4 changes: 2 additions & 2 deletions conversation_starter_pretext.txt
@@ -26,9 +26,9 @@ Human: I'm making a discord bot <|endofstatement|>
 
 Sometimes, users will upload images during a conversation. When that happens, you will already have an understanding of what that image is; that understanding is denoted by "Image Info-Caption". The piece of information starting with "Image Info-QA" contains an attempted direct answer to what the user originally asked about the image input. There is another version of Info-QA called "Revised Image Info-QA", which is a more important and accurate answer to the question based on multimodal understanding. You should prioritize using the information from Image Info-OCR and Revised Image Info-QA. The results of Optical Character Recognition of the image will be provided, named "Image Info-OCR"; image OCR data is usually more objective.
 For example:
-Human: Image Info-Caption: a sign that says ayr, ohio\nInfo-QA: ayr, ohio\nRevised Image Info-QA: This is a town in Ayr, Ontario\nImage Info-OCR: AYR,\nLONTARIO \nWhere is this? <|endofstatement|>
+Human: Image Info-Caption: a sign that says ayr, ohio\nImage Info-QA: This is a town in Ayr, Ontario\nImage Info-OCR: AYR,\nLONTARIO\nEND IMAGE 1\nWhere is this? <|endofstatement|>
 <yourname>: This is an image of the town Ayr, Ontario <|endofstatement|>
-Human: Image Info-Caption: a landscape with a river and trees\nImage Info-QA: yes\nRevised Image Info-QA: This is a beautiful river and tree landscape, it is in a cartoony art style\nImage Info-OCR: \nWhat is this image? Is it cartoony? <|endofstatement|>
+Human: Image Info-Caption: a landscape with a river and trees\nImage Info-QA: This is a beautiful river and tree landscape, it is indeed a cartoony art style\nImage Info-OCR:\nEND IMAGE 1\nWhat is this image? Is it cartoony? <|endofstatement|>
 <yourname>: This is a landscape with a river and trees, it is indeed cartoony! <|endofstatement|>
 ...
 When there are multiple images, do not pay any attention to Image Info-QA or Revised Image Info-QA. Only pay attention to the captions and OCR data. This image info will always be given to you, you do not need to ever respond with it.
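To make the new single-model format concrete, here is a sketch of the per-image block the bot now builds and prepends to the conversation; the format string is copied from the services/text_service.py hunk further down, and the variable values are illustrative:

    num = 1
    image_caption = "a sign that says ayr, ohio"     # caption-model output
    llava_output = "This is a town in Ayr, Ontario"  # joined LLaVA answer
    image_ocr = "AYR,\nONTARIO"                      # OCR text, usually more objective

    add_prompt = (
        f"BEGIN IMAGE {num} DATA\nImage Info-Caption: {image_caption}\nImage "
        f"Info-QA: {llava_output}\nImage Info-OCR: {image_ocr}\nEND IMAGE {num}\n DATA\n"
    )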
2 changes: 1 addition & 1 deletion gpt3discord.py
@@ -34,7 +34,7 @@
 from models.openai_model import Model
 
 
-__version__ = "12.3.6"
+__version__ = "12.3.7"
 
 
 PID_FILE = Path("bot.pid")
7 changes: 7 additions & 0 deletions models/image_understanding_model.py
@@ -32,6 +32,13 @@ def ask_image_question(self, prompt, filepath):
         )
         return output
 
+    def get_llava_answer(self, prompt, filepath):
+        output = replicate.run(
+            "yorickvp/llava-13b:e272157381e2a3bf12df3a8edd1f38d1dbd736bbb7437277c8b34175f8fce358",
+            input={"image": open(filepath, "rb"), "prompt": prompt, "temperature": 0.2, "top_p": 1, "max_tokens": 1024},
+        )
+        return output
+
     def get_minigpt_answer(self, prompt, filepath):
         output = replicate.run(
             "daanelson/minigpt-4:b96a2f33cc8e4b0aa23eacfce731b9c41a7d9466d9ed4e167375587b54db9423",
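A brief usage note: for streaming models such as llava-13b, replicate.run returns an iterator of output chunks rather than a finished string, which is why the caller in services/text_service.py joins the result (see the hunk below). A minimal sketch, assuming the class is named ImageUnderstandingModel and that REPLICATE_API_TOKEN is set in the environment:

    from models.image_understanding_model import ImageUnderstandingModel  # class name assumed

    model = ImageUnderstandingModel()
    # Ask LLaVA a question about a local image file.
    stream = model.get_llava_answer("What is in this image?", "/tmp/photo.png")
    answer = "".join(stream)  # the model streams chunks; join them into one reply
    print(answer)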
16 changes: 6 additions & 10 deletions services/text_service.py
@@ -812,31 +812,27 @@ async def process_conversation_message(
         try:
             (
                 image_caption,
-                image_qa,
-                minigpt_output,
+                llava_output,
                 image_ocr,
             ) = await asyncio.gather(
                 asyncio.to_thread(
                     image_understanding_model.get_image_caption,
                     temp_file.name,
                 ),
-                asyncio.to_thread(
-                    image_understanding_model.ask_image_question,
-                    prompt,
-                    temp_file.name,
-                ),
                 asyncio.to_thread(
-                    image_understanding_model.get_minigpt_answer,
+                    image_understanding_model.get_llava_answer,
                     prompt,
                     temp_file.name,
                 ),
                 image_understanding_model.do_image_ocr(
                     temp_file.name
                 ),
             )
+            llava_output = ''.join(list(llava_output))
+
             add_prompt = (
-                f"BEGIN IMAGE {num} DATA\nImage Info-Caption: {image_caption}\nImage Info-QA: {image_qa}\nRevised Image "
-                f"Info-QA: {minigpt_output}\nImage Info-OCR: {image_ocr}\nEND IMAGE {num} DATA\n"
+                f"BEGIN IMAGE {num} DATA\nImage Info-Caption: {image_caption}\nImage "
+                f"Info-QA: {llava_output}\nImage Info-OCR: {image_ocr}\nEND IMAGE {num}\n DATA\n"
             )
             add_prompts.append(add_prompt)
             try:
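The hunk above fans three image lookups out concurrently: the two blocking Replicate calls are moved onto worker threads with asyncio.to_thread, the OCR coroutine runs directly on the event loop, and asyncio.gather awaits all three before the streamed LLaVA chunks are joined. A self-contained sketch of the same pattern, with placeholder callables standing in for the real model calls:

    import asyncio

    def get_image_caption(path):
        # Placeholder for the blocking Replicate caption call.
        return "a landscape with a river and trees"

    def get_llava_answer(prompt, path):
        # Placeholder; the real call streams the answer as chunks.
        return iter(["This is a cartoony ", "river landscape."])

    async def do_image_ocr(path):
        # Placeholder async OCR call.
        return ""

    async def describe(prompt, path):
        caption, llava_stream, ocr = await asyncio.gather(
            asyncio.to_thread(get_image_caption, path),         # blocking -> thread
            asyncio.to_thread(get_llava_answer, prompt, path),  # blocking -> thread
            do_image_ocr(path),                                 # already a coroutine
        )
        return caption, "".join(llava_stream), ocr  # join the streamed chunks

    print(asyncio.run(describe("Is it cartoony?", "photo.png")))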
