Always have same response #21

Open · kehanlu opened this issue Jul 20, 2023 · 5 comments

Comments


kehanlu commented Jul 20, 2023

Hi, I have loaded your pre-trained weights and tried some instructions. However, I found the model responded with the same answer no matter what image I gave.

model = MM_LLMs.from_pretrained(
    "trained_model/mm_llms_trainer",
    config=model_config,
)
model.eval()
# ...

instruction = "How many boats are in the picture?"
template = f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"

input_ids = tokenizer.encode(template.format(instruction))
eos_token_id = tokenizer.eos_token_id
if eos_token_id in input_ids:
    input_ids.remove(eos_token_id)
input_ids = torch.tensor([input_ids], dtype=torch.int).to(device)

# image
# image = preprocess(Image.open("data/image_sample/COCO_train2014_000000492606.jpg"))
# image = preprocess(Image.open("data/image_sample/COCO_train2014_000000344896.jpg"))
image = preprocess(Image.open("data/image_sample/COCO_train2014_000000407061.jpg"))
image = image.unsqueeze(0)

with torch.no_grad():
    bs = 1
    
    inputs = {
        "videos": None,
        "images": image.half(),
        "audios": None,
        "input_ids": input_ids,
        'image_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<image>')] * bs, dtype=torch.int),
        'image_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</image>')] * bs, dtype=torch.int),
        'audio_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<audio>')] * bs, dtype=torch.int),
        'audio_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</audio>')] * bs, dtype=torch.int),
        'video_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<video>')] * bs, dtype=torch.int),
        'video_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</video>')] * bs, dtype=torch.int),
    }

    for k, v in inputs.items():
        if v is not None:
            inputs[k] = v.to(device)
    inputs['inference'] = True

    text_embeddings, attention_mask, labels, debug = model.prepare_inputs_for_generation(inputs)
    print(text_embeddings.size())

    model_output = model.llm(inputs_embeds=text_embeddings, attention_mask=attention_mask, labels=labels)
    generate_ids = model.llm.generate(inputs_embeds=text_embeddings, max_new_tokens=128, eos_token_id=2, bos_token_id=1, pad_token_id=32006)
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
How many boats are in the picture?

### Response:
========================================
There are 5000 in the picture.
========================================

No matter which image I give the model, it always replies "There are 5000 in the picture." to the same prompt. It seems the model ignores the multi-modal inputs entirely and responds based on the text alone.

Did I do anything wrong? Thank you.
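
To narrow this down, a minimal check (a sketch, reusing the model, preprocess, and inputs setup above) is to compare the fused embeddings that prepare_inputs_for_generation returns for two different images. If the tensors are identical, the image features never reach the LLM; if they differ but the generations still match, the LLM is ignoring them.

# Sketch: compare fused embeddings for two different COCO sample images
# (paths taken from the snippet above).
img_a = preprocess(Image.open("data/image_sample/COCO_train2014_000000492606.jpg")).unsqueeze(0)
img_b = preprocess(Image.open("data/image_sample/COCO_train2014_000000407061.jpg")).unsqueeze(0)

emb = {}
with torch.no_grad():
    for name, img in [("a", img_a), ("b", img_b)]:
        inputs["images"] = img.half().to(device)
        text_embeddings, attention_mask, labels, debug = model.prepare_inputs_for_generation(inputs)
        emb[name] = text_embeddings

print(torch.allclose(emb["a"], emb["b"]))        # True => image features never make it in
print((emb["a"] - emb["b"]).abs().max().item())  # 0.0  => points at the fusion step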


chatsci commented Jul 22, 2023

How did you get the tokenizer?

Regarding your problem, I think it may be because you are using model.llm, which is just the LLaMA part? In that case, it seems the Whisper and CLIP parts are not used.

From what I understand, we can run the model by:

model.eval()
with torch.no_grad():
    generate_ids = model(data_item)
input_texts = TOKENIZER.batch_decode(data_item["input_ids"], skip_special_tokens=True, clean_up_tokenization_spaces=False)
generated_texts = TOKENIZER.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_texts)
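
For reference, a sketch of what data_item could look like (assuming the same tokenizer, preprocessed image, and input_ids as in the original post; videos and audios left as None):

data_item = {
    "videos": None,
    "audios": None,
    "images": image.half().to(device),  # preprocessed image, as in the original post
    "input_ids": input_ids,
    "image_starts": torch.tensor([tokenizer.convert_tokens_to_ids('<image>')], dtype=torch.int).to(device),
    "image_ends": torch.tensor([tokenizer.convert_tokens_to_ids('</image>')], dtype=torch.int).to(device),
    "audio_starts": torch.tensor([tokenizer.convert_tokens_to_ids('<audio>')], dtype=torch.int).to(device),
    "audio_ends": torch.tensor([tokenizer.convert_tokens_to_ids('</audio>')], dtype=torch.int).to(device),
    "video_starts": torch.tensor([tokenizer.convert_tokens_to_ids('<video>')], dtype=torch.int).to(device),
    "video_ends": torch.tensor([tokenizer.convert_tokens_to_ids('</video>')], dtype=torch.int).to(device),
    "inference": True,  # makes forward() take the generate branch
}

The point is that calling model(data_item) runs the full forward pass, so the image goes through CLIP before reaching the LLaMA decoder.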

lyuchenyang (Owner) commented

Hi, thanks for sharing the information. We are currently looking into it.

kehanlu (Author) commented Jul 24, 2023

Hi @chatsci,
My code is modified from llm_trainer.py and modeling.py.

Macaw-LLM/llm_trainer.py, lines 466 to 489 at commit d03e59d:

inputs = {'videos': all_video_frames.half(),
          'audios': all_audio_mels.half(),
          'images': all_images.half(),
          'input_ids': input_ids,
          # 'attention_mask': torch.tensor([1] * seq_len, dtype=torch.int).reshape(bs, -1).contiguous(),
          # 'labels': None,
          'image_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<image>')] * bs, dtype=torch.int),
          'image_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</image>')] * bs, dtype=torch.int),
          'audio_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<audio>')] * bs, dtype=torch.int),
          'audio_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</audio>')] * bs, dtype=torch.int),
          'video_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<video>')] * bs, dtype=torch.int),
          'video_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</video>')] * bs, dtype=torch.int),
          }
inputs = {k: inputs[k].to(device) for k in inputs}
inputs['inference'] = True
try:
    generate_ids = model(inputs)
except Exception as e:
    continue
input_text = tokenizer.batch_decode(input_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
generated_text = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

Macaw-LLM/modeling.py, lines 952 to 963 at commit d03e59d:

text_embeddings, attention_mask, labels = self.prepare_inputs_for_generation(inputs)
if 'inference' in inputs and inputs['inference'] is True:
    # generate_ids = self.llm.generate(input_ids=inputs['input_ids'], inputs_embeds=text_embeddings, max_new_tokens=128)
    # generate_ids = self.llm.generate(inputs_embeds=text_embeddings, max_new_tokens=128)
    # The code below will possibly trigger an error in: https://github.com/microsoft/DeepSpeed/issues/3156 (the solution only partially resolves the bug for me)
    generate_ids = self.llm.generate(inputs_embeds=text_embeddings, max_new_tokens=128, eos_token_id=2, bos_token_id=1, pad_token_id=32006)
    return generate_ids
outputs = self.llm(inputs_embeds=text_embeddings, attention_mask=attention_mask, labels=labels)
return outputs

I call these functions from inside the model's forward() so I can test more easily. prepare_inputs_for_generation prepares the multi-modal tokens for the LLM: it encodes the multi-modal features and concatenates them with the text instruction.

I'm fairly sure the input sequence fed to the LLM contains image tokens. Still, in my tests the model appears to disregard the image input and generates responses based only on the text portion.
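
As an extra sanity check (a sketch; the exact layout of text_embeddings depends on how prepare_inputs_for_generation concatenates the features, and I unpack four return values as in my snippet above), one can look at the per-position magnitudes of the fused sequence to see whether the image span is being zeroed out:

with torch.no_grad():
    emb, _, _, _ = model.prepare_inputs_for_generation(inputs)
print(emb.shape)                         # expected: (batch, seq_len, hidden)
print(emb.abs().mean(dim=-1).squeeze())  # a run of near-zero positions would
                                         # suggest the image features are zeroed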

lyuchenyang (Owner) commented

Hi, thanks for sharing this information with us. The possible cause could be some incompatibility issue within the code. As I'm currently traveling, I will look into it as soon as I'm back. Would you mind sending the code you used to my email, [email protected], so I can take a look?

dbountouridis commented

Hey @lyuchenyang, I have been experiencing the same issue during inference. Are there any updates on this? Thank you.
