
Clarification and supplement to the online docs example #1904

Open
2 of 4 tasks
paulcx opened this issue May 16, 2024 · 1 comment
Comments

paulcx commented May 16, 2024

System Info

docs[main]: https://huggingface.co/docs/text-generation-inference/basic_tutorials/visual_language_models
vlm: https://huggingface.co/llava-hf/llava-v1.6-34b-hf

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

In the current docs, there are a few examples of how to query a VLM model, for example:

from huggingface_hub import InferenceClient

client = InferenceClient("http://127.0.0.1:3000")
image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
# The image is embedded directly in the prompt using markdown image syntax.
prompt = f"![]({image})What is this a picture of?\n\n"
for token in client.text_generation(prompt, max_new_tokens=16, stream=True):
    print(token)

Expected behavior

However, there is no example of how to deal with the default chat template. For example, the chat template of llava-hf/llava-v1.6-34b-hf is the following:

"<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n<image>\n<your_text_prompt_here><|im_end|><|im_start|>assistant\n"

Should we ignore it and use the TGI format as shown above? And how should multi-turn queries be handled? Any examples would be appreciated.

drbh (Collaborator) commented May 30, 2024

Hi @paulcx, thanks for pointing this out; we should be clearer about generation and templates in the docs.

In TGI, the chat_template is applied when the chat endpoint is used. In the example above, the generate endpoint is used, so no template is applied.
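For context, using the chat endpoint is roughly equivalent to rendering the model's chat_template locally and sending the result to generate. Here is a minimal sketch of that rendering, assuming the tokenizer for llava-hf/llava-v1.6-34b-hf ships the template quoted above (this uses transformers directly and is only for inspection; TGI does this server-side):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-v1.6-34b-hf")

messages = [
    {"role": "system", "content": "Answer the questions."},
    {"role": "user", "content": "<image>\nWhat is this a picture of?"},
]

# tokenize=False returns the rendered prompt string;
# add_generation_prompt=True appends the assistant turn header.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
# Expect something resembling the template string quoted in the issue above.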

Chat can be used with the chat_completion method, as shown below.

from huggingface_hub import InferenceClient

client = InferenceClient("http://127.0.0.1:3000")

chat = client.chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Whats in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
                    },
                },
            ],
        },
    ],
    seed=42,
    max_tokens=100,
)

print(chat)
# ChatCompletionOutput(choices=[ChatCompletionOutputComplete(finish_reason='length', index=0, message=ChatCompletionOutputMessage(role='assistant', content=" The image you've provided features an anthropomorphic rabbit in spacesuit attire. This rabbit is depicted with human-like posture and movement, standing on a rocky terrain with a vast, reddish-brown landscape in the background. The spacesuit is detailed with mission patches, circuitry, and a helmet that covers the rabbit's face and ear, with an illuminated red light on the chest area.\n\nThe artwork style is that of a", name=None, tool_calls=None), logprobs=None)], created=1714589614, id='', model='llava-hf/llava-v1.6-mistral-7b-hf', object='text_completion', system_fingerprint='2.0.2-native', usage=ChatCompletionOutputUsage(completion_tokens=100, prompt_tokens=2943, total_tokens=3043))

Note that when using the chat endpoint, images are sent as typed messages rather than in markdown format.
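On the multi-turn question: with the chat endpoint you simply keep appending turns to the messages list. A minimal sketch under the same setup (the follow-up question is illustrative):

from huggingface_hub import InferenceClient

client = InferenceClient("http://127.0.0.1:3000")

# First turn: ask about the image using typed messages, as above.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
                },
            },
        ],
    },
]
first = client.chat_completion(messages=messages, max_tokens=100)

# Append the assistant's reply, then the follow-up question;
# the server applies the chat_template to the whole history.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "What colors stand out the most?"})

second = client.chat_completion(messages=messages, max_tokens=100)
print(second.choices[0].message.content)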

I hope this helps clarify! Please let me know if you have any questions.
