Better image understanding without gpt-vision using LLAVA
Kav-K committed Dec 31, 2023
1 parent 800c81f commit 0548c4e
Showing 6 changed files with 25 additions and 22 deletions.
16 changes: 8 additions & 8 deletions .github/workflows/black-and-deploy.yml
@@ -38,7 +38,7 @@ jobs:
       password: ${{ secrets.SSH_PASS }}
       port: 22
       source: gpt3discord.py
-      target: /home/kaveen/GPTDiscord
+      target: /home/GPTDiscord
   - name: copy file via ssh password
     uses: appleboy/scp-action@master
     with:
@@ -47,7 +47,7 @@
       password: ${{ secrets.SSH_PASS }}
       port: 22
       source: conversation_starter_pretext.txt
-      target: /home/kaveen/GPTDiscord
+      target: /home/GPTDiscord
   - name: copy file via ssh password
     uses: appleboy/scp-action@master
     with:
@@ -56,44 +56,44 @@
       password: ${{ secrets.SSH_PASS }}
       port: 22
       source: image_optimizer_pretext.txt
-      target: /home/kaveen/GPTDiscord/
+      target: /home/GPTDiscord/
   - name: Copy via ssh
     uses: garygrossgarten/github-action-scp@release
     with:
       local: cogs
-      remote: /home/kaveen/GPTDiscord/cogs
+      remote: /home/GPTDiscord/cogs
       host: lab.kaveenk.dev
       username: root
       password: ${{ secrets.SSH_PASS }}
   - name: Copy via ssh
     uses: garygrossgarten/github-action-scp@release
     with:
       local: models
-      remote: /home/kaveen/GPTDiscord/models
+      remote: /home/GPTDiscord/models
       host: lab.kaveenk.dev
       username: root
       password: ${{ secrets.SSH_PASS }}
   - name: Copy via ssh
     uses: garygrossgarten/github-action-scp@release
     with:
       local: openers
-      remote: /home/kaveen/GPTDiscord/openers
+      remote: /home/GPTDiscord/openers
       host: lab.kaveenk.dev
      username: root
       password: ${{ secrets.SSH_PASS }}
   - name: Copy via ssh
     uses: garygrossgarten/github-action-scp@release
     with:
       local: services
-      remote: /home/kaveen/GPTDiscord/services
+      remote: /home/GPTDiscord/services
       host: lab.kaveenk.dev
       username: root
       password: ${{ secrets.SSH_PASS }}
   - name: Restart bot!
     uses: fifsky/ssh-action@master
     with:
       command: |
-        cd /home/kaveen/GPTDiscord
+        cd /home/GPTDiscord
         kill -9 $(cat bot.pid)
         rm bot.pid
         screen -dmS GPTBot python3.9 gpt3discord.py
2 changes: 1 addition & 1 deletion auto_restarter.py
@@ -25,7 +25,7 @@ def restart_service():
     Restarts the service by running a series of system commands.
     """
     commands = [
-        "cd /home/kaveen/GPTDiscord",
+        "cd /home/GPTDiscord",
        "kill -9 $(cat bot.pid)",
         "rm bot.pid",
         "screen -dmS GPTBot python3.9 gpt3discord.py",
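For context, a minimal sketch of how a restarter like this can execute that command list, assuming the commands are joined and run through one shell. The function body around this hunk is not shown, so os.system and the ';' joining are illustrative rather than the repository's actual mechanism:

    import os

    def restart_service():
        """Restarts the service by running a series of system commands."""
        commands = [
            "cd /home/GPTDiscord",
            "kill -9 $(cat bot.pid)",  # stop the running bot via its PID file
            "rm bot.pid",              # clear the stale PID file
            "screen -dmS GPTBot python3.9 gpt3discord.py",  # relaunch detached
        ]
        # Join with ';' so the cd applies to every command that follows,
        # all inside a single shell invocation.
        os.system("; ".join(commands))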
4 changes: 2 additions & 2 deletions conversation_starter_pretext.txt
@@ -26,9 +26,9 @@ Human: I'm making a discord bot <|endofstatement|>
 
 Sometimes, users will upload images during a conversation. When that happens, you will already have an understanding of what that image is; that understanding is denoted by "Image Info-Caption". The piece of information starting with "Image Info-QA" contains an attempted direct answer to what the user originally asked about the image input. There is another version of Info-QA called "Revised Image Info-QA", which is a more important and accurate answer to the question based on multimodal understanding. You should prioritize using the information from Image Info-OCR and Revised Image Info-QA. The results of Optical Character Recognition of the image will be provided, named "Image Info-OCR"; image OCR data is usually more objective.
 For example:
-Human: Image Info-Caption: a sign that says ayr, ohio\nInfo-QA: ayr, ohio\nRevised Image Info-QA: This is a town in Ayr, Ontario\nImage Info-OCR: AYR,\nLONTARIO \nWhere is this? <|endofstatement|>
+Human: Image Info-Caption: a sign that says ayr, ohio\nImage Info-QA: This is a town in Ayr, Ontario\nImage Info-OCR: AYR,\nLONTARIO\nEND IMAGE 1\nWhere is this? <|endofstatement|>
 <yourname>: This is an image of the town Ayr, Ontario <|endofstatement|>
-Human: Image Info-Caption: a landscape with a river and trees\nImage Info-QA: yes\nRevised Image Info-QA: This is a beautiful river and tree landscape, it is in a cartoony art style\nImage Info-OCR: \nWhat is this image? Is it cartoony? <|endofstatement|>
+Human: Image Info-Caption: a landscape with a river and trees\nImage Info-QA: This is a beautiful river and tree landscape, it is indeed a cartoony art style\nImage Info-OCR:\nEND IMAGE 1\nWhat is this image? Is it cartoony? <|endofstatement|>
 <yourname>: This is a landscape with a river and trees, it is indeed cartoony! <|endofstatement|>
 ...
 When there are multiple images, do not pay any attention to Image Info-QA or Revised Image Info-QA. Only pay attention to the captions and OCR data. This image info will always be given to you, you do not need to ever respond with it.
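To make the new single-model format concrete, here is a sketch of the per-image block the bot now builds and prepends to the conversation; the format string is copied from the services/text_service.py hunk further down, and the variable values are illustrative:

    num = 1
    image_caption = "a sign that says ayr, ohio"     # caption-model output
    llava_output = "This is a town in Ayr, Ontario"  # joined LLaVA answer
    image_ocr = "AYR,\nONTARIO"                      # OCR text, usually more objective

    add_prompt = (
        f"BEGIN IMAGE {num} DATA\nImage Info-Caption: {image_caption}\nImage "
        f"Info-QA: {llava_output}\nImage Info-OCR: {image_ocr}\nEND IMAGE {num}\n DATA\n"
    )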
2 changes: 1 addition & 1 deletion gpt3discord.py
@@ -34,7 +34,7 @@
 from models.openai_model import Model
 
 
-__version__ = "12.3.6"
+__version__ = "12.3.7"
 
 
 PID_FILE = Path("bot.pid")
7 changes: 7 additions & 0 deletions models/image_understanding_model.py
@@ -32,6 +32,13 @@ def ask_image_question(self, prompt, filepath):
         )
         return output
 
+    def get_llava_answer(self, prompt, filepath):
+        output = replicate.run(
+            "yorickvp/llava-13b:e272157381e2a3bf12df3a8edd1f38d1dbd736bbb7437277c8b34175f8fce358",
+            input={"image": open(filepath, "rb"), "prompt": prompt, "temperature": 0.2, "top_p": 1, "max_tokens": 1024},
+        )
+        return output
+
     def get_minigpt_answer(self, prompt, filepath):
         output = replicate.run(
             "daanelson/minigpt-4:b96a2f33cc8e4b0aa23eacfce731b9c41a7d9466d9ed4e167375587b54db9423",
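A brief usage note: for streaming models such as llava-13b, replicate.run returns an iterator of output chunks rather than a finished string, which is why the caller in services/text_service.py joins the result (see the hunk below). A minimal sketch, assuming the class is named ImageUnderstandingModel and that REPLICATE_API_TOKEN is set in the environment:

    from models.image_understanding_model import ImageUnderstandingModel  # class name assumed

    model = ImageUnderstandingModel()
    # Ask LLaVA a question about a local image file.
    stream = model.get_llava_answer("What is in this image?", "/tmp/photo.png")
    answer = "".join(stream)  # the model streams chunks; join them into one reply
    print(answer)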
16 changes: 6 additions & 10 deletions services/text_service.py
@@ -812,31 +812,27 @@ async def process_conversation_message(
         try:
             (
                 image_caption,
-                image_qa,
-                minigpt_output,
+                llava_output,
                 image_ocr,
             ) = await asyncio.gather(
                 asyncio.to_thread(
                     image_understanding_model.get_image_caption,
                     temp_file.name,
                 ),
-                asyncio.to_thread(
-                    image_understanding_model.ask_image_question,
-                    prompt,
-                    temp_file.name,
-                ),
                 asyncio.to_thread(
-                    image_understanding_model.get_minigpt_answer,
+                    image_understanding_model.get_llava_answer,
                     prompt,
                     temp_file.name,
                 ),
                 image_understanding_model.do_image_ocr(
                     temp_file.name
                 ),
             )
+            llava_output = ''.join(list(llava_output))
+
             add_prompt = (
-                f"BEGIN IMAGE {num} DATA\nImage Info-Caption: {image_caption}\nImage Info-QA: {image_qa}\nRevised Image "
-                f"Info-QA: {minigpt_output}\nImage Info-OCR: {image_ocr}\nEND IMAGE {num} DATA\n"
+                f"BEGIN IMAGE {num} DATA\nImage Info-Caption: {image_caption}\nImage "
+                f"Info-QA: {llava_output}\nImage Info-OCR: {image_ocr}\nEND IMAGE {num}\n DATA\n"
             )
             add_prompts.append(add_prompt)
             try:
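The hunk above fans three image lookups out concurrently: the two blocking Replicate calls are moved onto worker threads with asyncio.to_thread, the OCR coroutine runs directly on the event loop, and asyncio.gather awaits all three before the streamed LLaVA chunks are joined. A self-contained sketch of the same pattern, with placeholder callables standing in for the real model calls:

    import asyncio

    def get_image_caption(path):
        # Placeholder for the blocking Replicate caption call.
        return "a landscape with a river and trees"

    def get_llava_answer(prompt, path):
        # Placeholder; the real call streams the answer as chunks.
        return iter(["This is a cartoony ", "river landscape."])

    async def do_image_ocr(path):
        # Placeholder async OCR call.
        return ""

    async def describe(prompt, path):
        caption, llava_stream, ocr = await asyncio.gather(
            asyncio.to_thread(get_image_caption, path),         # blocking -> thread
            asyncio.to_thread(get_llava_answer, prompt, path),  # blocking -> thread
            do_image_ocr(path),                                 # already a coroutine
        )
        return caption, "".join(llava_stream), ocr  # join the streamed chunks

    print(asyncio.run(describe("Is it cartoony?", "photo.png")))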
