
[Feature] Multimodal agents demo #320

Open · wants to merge 8 commits into master

Conversation

@chenllliang commented Oct 19, 2023

add MultiModalPrompt class and an example

Description

This PR introduces a new class, MultiModalPrompt, aimed at facilitating the transfer of information between multimodal agents. The class encapsulates both text prompts and additional multimodal data, thereby allowing seamless integration and interchangeability.

The updated source files are camel/prompts/multimodal.py and camel/prompts/__init__.py.

An example is added in examples/multimodal/formating_example.py.

Key Features:

  1. Multimodal Data Handling: The class can handle both text prompts (from the TextPrompt class) and multimodal information.
  2. Flexible Modality Support: The class comes with a predefined list of modalities (MODALITIES), and it can validate the provided modalities against this list.
  3. Dynamic Formatting: The format method allows the formatting of both text prompts and multimodal information in tandem. It can also distinguish between keyword arguments meant for the text prompt and those intended for multimodal information.
  4. Customizable Model Input Conversion: With the to_model_format method, the prompt can be converted into a model-understandable format. By default, it uses the default_to_model_format method, but custom methods can also be provided.

Code Changes:

  • Added MultiModalPrompt class with methods for initializing, formatting, and converting to a model-understandable format.
  • Included a helper function, default_to_model_format, which serves as the default method to format multimodal prompts for models (a rough sketch of this interface follows below).
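
For concreteness, here is a minimal sketch of the interface described above. This is not the actual code in camel/prompts/multimodal.py; the constructor arguments, the contents of MODALITIES, and the rule used to split format() keyword arguments are assumptions based on this description.

```python
from typing import Any, Callable, Dict, List, Optional

from camel.prompts import TextPrompt

# Assumed contents of the predefined modality list.
MODALITIES = ["image", "audio", "video"]


class MultiModalPrompt:
    """Bundles a TextPrompt with additional multimodal information."""

    def __init__(
        self,
        text_prompt: TextPrompt,
        multimodal_info: Dict[str, List[Any]],
        to_model_format: Optional[Callable[["MultiModalPrompt"], Any]] = None,
    ) -> None:
        # Validate the provided modalities against the predefined list.
        for modality in multimodal_info:
            if modality not in MODALITIES:
                raise ValueError(f"Unsupported modality: {modality}")
        self.text_prompt = text_prompt
        self.multimodal_info = multimodal_info
        self._to_model_format = to_model_format or default_to_model_format

    def format(self, **kwargs: Any) -> "MultiModalPrompt":
        # Keyword arguments named after a modality update the multimodal
        # information; everything else is forwarded to the text prompt.
        text_kwargs = {k: v for k, v in kwargs.items() if k not in MODALITIES}
        mm_kwargs = {k: v for k, v in kwargs.items() if k in MODALITIES}
        new_text = self.text_prompt.format(**text_kwargs)
        new_info = {**self.multimodal_info, **mm_kwargs}
        return MultiModalPrompt(new_text, new_info, self._to_model_format)

    def to_model_format(self) -> Any:
        # Convert the prompt into a model-understandable format using either
        # the default helper or a user-supplied conversion function.
        return self._to_model_format(self)


def default_to_model_format(prompt: "MultiModalPrompt") -> Dict[str, Any]:
    # Default conversion: pair the formatted text with the raw multimodal payloads.
    return {"text": str(prompt.text_prompt), **prompt.multimodal_info}
```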

Example Description for Pull Request

MultiModalPrompt Example Demonstrations

The attached example, examples/multimodal/formating_example.py, demonstrates the capabilities and practical use cases of the newly added MultiModalPrompt class in several multimodal scenarios.

  1. Single Image VQA (Visual Question-Answering) Prompt:

    • We start by initializing a prompt for a basic VQA scenario that involves a single image.
    • This demonstration uses the default model input format.
    • Two different questions are formatted with two separate images: "camel.jpg" and "llama.jpg".
    • Model-understandable formats for both prompts are printed out.
  2. Multi-Image Question with Custom Model Input Format:

    • A more complex scenario is demonstrated where we have multiple images corresponding to a single question.
    • A custom input format, multi_image_input_format, is implemented which labels images in the prompt with numbers. This indexing format is inspired by MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning.
      • Special tokens <Image{i}> are introduced in the textual prompt to indicate image positions.
      • [Image{i}] acts as the visual placeholder for the i-th image in the prompt.
    • The text prompt is dynamically updated based on the number of images provided in the multimodal information.
    • Finally, the complete formatted prompt and the associated images are printed out (a condensed sketch of this flow follows this list).
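
To make these two scenarios concrete, here is a condensed, hypothetical walkthrough in the spirit of examples/multimodal/formating_example.py, building on the MultiModalPrompt sketch above. The question strings, argument names, and exact API are assumptions, not the file's actual contents.

```python
from camel.prompts import TextPrompt

# 1. Single-image VQA prompt using the default model input format.
vqa_prompt = MultiModalPrompt(
    text_prompt=TextPrompt("Question: {question} Answer:"),
    multimodal_info={"image": []},
)
for image, question in [("camel.jpg", "What animal is this?"),
                        ("llama.jpg", "What animal is this?")]:
    formatted = vqa_prompt.format(question=question, image=[image])
    print(formatted.to_model_format())

# 2. Multi-image question with a custom model input format that labels images
#    with <Image{i}> text tokens and [Image{i}] visual placeholders (MMICL-style),
#    updating the text prompt based on how many images are provided.
def multi_image_input_format(prompt: "MultiModalPrompt") -> dict:
    images = prompt.multimodal_info["image"]
    refs = " ".join(f"<Image{i}> [Image{i}]" for i in range(len(images)))
    return {"text": f"{refs} {prompt.text_prompt}", "images": images}

multi_prompt = MultiModalPrompt(
    text_prompt=TextPrompt("Which of the images above shows a camel?"),
    multimodal_info={"image": ["camel.jpg", "llama.jpg"]},
    to_model_format=multi_image_input_format,
)
print(multi_prompt.to_model_format())
```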

This example serves as a practical guide to:

  • Show how the MultiModalPrompt can be seamlessly integrated with existing prompts.
  • Handle various complexities like multiple images.
  • Customize the input format for different multimodal models.

The example showcases the ease of use and flexibility of the MultiModalPrompt class and demonstrates its applicability to real-world scenarios, emphasizing its potential utility for developers and researchers in the multimodal domain.

Future Work

  • Planning to extend the functionality with MultiModalPromptDict.
  • If the prompt class design is settled, I hope to add some demos with multiple LLM and VLLM agents.

Please review the changes and provide feedback.

Motivation and Context

Why is this change required? What problem does it solve?
Closes #317 (feature request).

  • I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)
  • Example (update in the folder of example)

Implemented Tasks

  • Subtask 1
  • Subtask 2
  • Subtask 3

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide. (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly. (required for a bug fix or a new feature)
  • I have updated the documentation accordingly.

@ZIYU-DEEP ZIYU-DEEP assigned ZIYU-DEEP and unassigned ZIYU-DEEP Oct 21, 2023
@ZIYU-DEEP ZIYU-DEEP self-requested a review October 21, 2023 12:30
@chenllliang chenllliang self-assigned this Oct 21, 2023
@ZIYU-DEEP (Contributor)

The high-level class design looks good to me! Left one minor comment in the code.

@chenllliang (Author) commented Oct 23, 2023

The high-level class design looks good to me! Left one minor comment in the code.

Hi, thanks for your review, but I can't see your comment in the code. Could you post it again?
@ZIYU-DEEP

@chenllliang (Author)

TODO:

  • Write document for multimodal function @chenllliang
  • Give an example using the multimodal prompt class @ZIYU-DEEP

@ZIYU-DEEP (Contributor)

TODO:

* Write document for multimodal function @chenllliang

* Give an example using the multimodal prompt class @ZIYU-DEEP

Just to leave a note here: I was planning to combine this with the HuggingFace agent, but I encountered the SSLError both with and without firewall restrictions (cf. #268, #17611); you may check if you have the same issue @chenllliang. Alternatively, I plan to make a minimal example with the interpreter.

@chenllliang (Author) commented Oct 31, 2023

I have updated the documentation of the multimodal prompt class. (I think it could be merged.)

@chenllliang (Author)

Currently developing a multimodal role-playing demo.

@chenllliang (Author)

I designed a pipeline for a possible application of multimodal agent collaboration. It's called "Scientific Graph Painter", and it generates Python code to reproduce a figure from a scientific paper.

It has three roles, with possible models:

  • Draft: GPT-4V, generates the draft code for drawing the graph
  • Critic: GPT-4V, compares the draft and target graphs and gives revision suggestions
  • Polisher: GPT-4, following the suggestions, revises the code for drawing the graph

The pipeline diagram is attached below, along with a rough pseudocode sketch of the loop; sorry, I am too occupied with other work as of Feb. 2024. Anyone interested in the topic is welcome to implement it or discuss.

I think something needs to be done first: adding image information to agents' messages. I have changed the PR from "add multimodal prompt class" to "Multimodal agents demo".

[Attached image: whiteboard_exported_image (pipeline diagram)]
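
For anyone who wants to pick this up, here is a rough, hypothetical outline of the Draft / Critic / Polisher loop. The three agents are passed in as plain callables; none of the names below are existing CAMEL APIs, and the actual demo would wire them to GPT-4V (Draft, Critic) and GPT-4 (Polisher) agents.

```python
from typing import Callable, Optional


def scientific_graph_painter(
    target_figure: bytes,
    draft: Callable[[bytes], str],                    # GPT-4V: target figure -> draft plotting code
    render: Callable[[str], bytes],                   # executes the code, returns the rendered figure
    critic: Callable[[bytes, bytes], Optional[str]],  # GPT-4V: (target, rendered) -> suggestions, or None if satisfied
    polish: Callable[[str, str], str],                # GPT-4: (code, suggestions) -> revised code
    max_rounds: int = 3,
) -> str:
    # Draft the initial plotting code from the target figure.
    code = draft(target_figure)
    for _ in range(max_rounds):
        rendered = render(code)
        # Compare the rendered figure with the target and collect suggestions.
        suggestions = critic(target_figure, rendered)
        if not suggestions:
            break  # The critic is satisfied; stop iterating.
        # Revise the code following the critic's suggestions.
        code = polish(code, suggestions)
    return code
```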

@chenllliang chenllliang changed the title [Feature] Add MultiModalPrompt class [Feature] Multimodal agents demo Feb 4, 2024
Projects
Status: Stuck
Development

Successfully merging this pull request may close these issues: none yet.

2 participants