
Grounding DINO is now available in 🤗 Transformers! #321

Open
NielsRogge opened this issue Apr 12, 2024 · 7 comments

@NielsRogge

Hi folks!

Grounding DINO is now available in the Transformers library, enabling easy inference in a few lines of code.

Here's how to use it:

from transformers import AutoProcessor, GroundingDinoForObjectDetection
from PIL import Image
import requests
import torch

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "a cat."

processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
model = GroundingDinoForObjectDetection.from_pretrained("IDEA-Research/grounding-dino-tiny")

inputs = processor(images=image, text=text, return_tensors="pt")
outputs = model(**inputs)

# convert outputs (bounding boxes and class logits) to COCO API
target_sizes = torch.tensor([image.size[::-1]])
results = processor.image_processor.post_process_object_detection(
    outputs, threshold=0.35, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(f"Detected {label.item()} with confidence " f"{round(score.item(), 3)} at location {box}")

Demo notebook: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Grounding%20DINO/Inference_with_Grounding_DINO_for_zero_shot_object_detection.ipynb

Checkpoints are on the hub: https://huggingface.co/models?other=grounding-dino

@NielsRogge

Relevant for #88

@Lycus99 commented Apr 26, 2024

Thanks for your work!
I found an issue where the same model produces different results. Take the image '000000039769.jpg' as an example: the results of the official code are significantly better than those of the Transformers library.

The results you report are:
Detected 1 with confidence 0.45 at location [344.8, 23.2, 637.4, 373.8]
Detected 1 with confidence 0.41 at location [11.9, 51.6, 316.6, 472.9]

My results based on the Transformers package:
Code:
from transformers import AutoProcessor, GroundingDinoForObjectDetection
from PIL import Image
import requests
import torch

import matplotlib.pyplot as plt
import matplotlib.patches as patches

image = Image.open('000000039769.jpg')

text = "cat"
device = 'cpu'
processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
model = GroundingDinoForObjectDetection.from_pretrained("IDEA-Research/grounding-dino-tiny").to(device)
inputs = processor(images=image, text=text, return_tensors="pt").to(device)
outputs = model(**inputs)

target_sizes = torch.tensor([image.size[::-1]])
results = processor.image_processor.post_process_object_detection(
    outputs, threshold=0.2, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(f"Detected {label.item()} with confidence {round(score.item(), 3)} at location {box}")

Results:
Detected 1 with confidence 0.26 at location [40.29, 72.75, 175.84, 117.19]

The results based on the official code and Colab:

import os

# HOME is the working directory defined earlier in the Colab
CONFIG_PATH = os.path.join(HOME, "GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py")
print(CONFIG_PATH, "; exist:", os.path.isfile(CONFIG_PATH))
WEIGHTS_NAME = "groundingdino_swint_ogc.pth"
WEIGHTS_PATH = os.path.join(HOME, "weights", WEIGHTS_NAME)
print(WEIGHTS_PATH, "; exist:", os.path.isfile(WEIGHTS_PATH))

from groundingdino.util.inference import load_model, load_image, predict, annotate
import supervision as sv

model = load_model(CONFIG_PATH, WEIGHTS_PATH)

IMAGE_NAME = "000000039769.jpg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "cat"
BOX_THRESHOLD = 0.35
TEXT_THRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD,
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

%matplotlib inline
sv.plot_image(annotated_frame, (16, 16))


@NielsRogge

Pinging @EduardoPach here

@EduardoPach

(quoting @Lycus99's comment above)

Actually, the output is the same :D, there's only one catch in your example.

The original implementation takes your text prompt and appends a "." at the end. So in the transformers example you're passing "cat", while in the original you're effectively passing "cat.", and that's what causes the difference you're seeing.
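
In other words, ending the prompt with a period on the transformers side should reproduce the original results. A minimal sketch, reusing the processor and model from the snippet above:

# the original repo normalizes prompts to end with ".", so match that here
text = "cat."  # instead of "cat"
inputs = processor(images=image, text=text, return_tensors="pt")
outputs = model(**inputs)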

@MinGiSa commented May 8, 2024

Is it possible to train it?

@DAAworld

Thank you!

@EduardoPach

Is it possible to train it?

In theory you can. It hasn't been extensively tested, but if you find any problems, open an issue in the transformers repo and tag me there.
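
For anyone who wants to try, here is a minimal training sketch. It assumes the transformers implementation accepts DETR-style labels (a list of per-image dicts with class_labels and boxes in normalized cx, cy, w, h format) and computes a loss from them; given the caveat above that training isn't extensively tested, treat this as an assumption and adapt as needed:

import torch
from PIL import Image
from transformers import AutoProcessor, GroundingDinoForObjectDetection

processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
model = GroundingDinoForObjectDetection.from_pretrained("IDEA-Research/grounding-dino-tiny")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# hypothetical single example: one image with one annotated cat
image = Image.open("000000039769.jpg")
inputs = processor(images=image, text="a cat.", return_tensors="pt")
labels = [{
    "class_labels": torch.tensor([0]),              # index of the matched phrase
    "boxes": torch.tensor([[0.5, 0.5, 0.4, 0.6]]),  # normalized cx, cy, w, h (made up here)
}]

outputs = model(**inputs, labels=labels)  # assumes forward() returns a loss given labels
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()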
