Add ViTPose #30530
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@NielsRogge Hi Niels! Does the current PR work properly? (I want to run some tests on it)
Thanks for working on this @NielsRogge! I can see there are a few unfinished bits, e.g. tests. Is there a particular bit of code you'd like me to look at from a maintainer's perspective? For the issue description, there's a related model request here: #24915
@NielsRogge I'm unsubscribing atm, so that I don't get notifications on every new push. You just need to ping me again with my username when it's ready for review and I'll get notified |
It would be great to have a first round of review as the PR is in a ready state. @amyeroberts |
name_to_path = {
    "vitpose-base-simple": "/Users/nielsrogge/Documents/ViTPose/vitpose-b-simple.pth",
"vitpose-base-simple": "/Users/nielsrogge/Documents/ViTPose/vitpose-b-simple.pth", | |
"vitpose-base-simple": "https:\/\/4mjpca.sn.files.1drv.com\/y4mip6jbupeZ3YzICoNJYUb6yGEheWXkicKj0tvp1Sfq8BztlH8ieD63z2ZRYiTBzvDxKXFqd_wa5m8NHnBsURmpClZySMSJjS3hxrU2bFArawJ5mAVZsni4LmsfWs_K1dnIzDumXXuanSopYKm0O-Bx5z4JerIfGoE6riAtY_ni5_paFl46jGTE82U8J10Cm3gxHv2DSfOkrgV7SkmUKvnjg\/vitpose-b-simple.pth?download&psid=1", |
import requests

def download_file(url, local_filename):
    # Sending requests with stream=True allows downloading large files
    with requests.get(url, stream=True) as response:
        response.raise_for_status()  # Raise an exception if the request fails
        with open(local_filename, 'wb') as file:
            # Download large files efficiently by iterating over content in chunks
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
    return local_filename

# Given URL and the filename to save
url = "https://4mjpca.sn.files.1drv.com/y4mip6jbupeZ3YzICoNJYUb6yGEheWXkicKj0tvp1Sfq8BztlH8ieD63z2ZRYiTBzvDxKXFqd_wa5m8NHnBsURmpClZySMSJjS3hxrU2bFArawJ5mAVZsni4LmsfWs_K1dnIzDumXXuanSopYKm0O-Bx5z4JerIfGoE6riAtY_ni5_paFl46jGTE82U8J10Cm3gxHv2DSfOkrgV7SkmUKvnjg/vitpose-b-simple.pth?download&psid=1"
local_filename = "vitpose-b-simple.pth"

# Execute the file download
download_file(url, local_filename)
print(f"File downloaded as {local_filename}")
I think we can use this snippet to make the conversion work with the cloud-uploaded weights!
Thanks for adding this model!
I've only done a first high-level pass. Normally I'd ask for the backbone to be added in a separate PR, but as the modeling files are relatively small, I think it's OK.
Main comments are about the image processing: post-processing should take and return torch tensors; there should be more tests to make sure the pre- and post-processing work on batched inputs and outputs are as expected, particularly for the custom transforms; and the cv2 logic should be removed.
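For context, the kind of batched round-trip test meant here might look like the following sketch (the processor/method names and the `boxes` argument are assumptions about this PR's API, not confirmed):

import numpy as np
import torch

def test_batched_pre_and_post_processing(image_processor, model):
    # Two dummy images, each with one person box in COCO (x, y, w, h) format.
    images = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(2)]
    boxes = [[[10, 10, 200, 300]], [[50, 40, 150, 250]]]

    inputs = image_processor(images, boxes=boxes, return_tensors="pt")
    assert inputs["pixel_values"].shape[0] == 2  # batch dimension preserved

    with torch.no_grad():
        outputs = model(**inputs)

    # Post-processing should consume and return torch tensors.
    results = image_processor.post_process_pose_estimation(outputs, boxes=boxes)
    assert len(results) == 2
    assert isinstance(results[0][0]["keypoints"], torch.Tensor)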
if is_cv2_available():
    # TODO get rid of cv2?
    import cv2
Yes
cv2_image = (
    image
    if input_data_format == ChannelDimension.LAST
    else to_channel_dimension_format(image, ChannelDimension.LAST, input_data_format)
)
image = cv2.warpAffine(cv2_image, transformation, size, flags=cv2.INTER_LINEAR)
All cv2 logic should be removed
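If it helps, the warp itself doesn't strictly need cv2. Below is a minimal NumPy/SciPy sketch of an equivalent, assuming `matrix` is the usual 2x3 source-to-destination affine in (x, y) pixel coordinates that cv2.warpAffine expects, and the image is channels-last as in the diff above (the function name is made up, and the final implementation may well use torch instead):

import numpy as np
from scipy import ndimage

def warp_affine_without_cv2(image, matrix, size):
    # scipy.ndimage.affine_transform maps output -> input coordinates,
    # so invert the forward (source -> destination) transform first.
    inverse = np.linalg.inv(np.vstack([matrix, [0.0, 0.0, 1.0]]))
    # ndimage indexes in (row, col) = (y, x) order, so flip both axes
    # of the 2x2 block and the translation vector.
    matrix_yx = inverse[:2, :2][::-1, ::-1]
    offset_yx = inverse[:2, 2][::-1]
    width, height = size
    # Warp each channel of the channels-last image separately.
    channels = [
        ndimage.affine_transform(
            image[..., channel],
            matrix_yx,
            offset=offset_yx,
            output_shape=(height, width),
            order=1,  # bilinear interpolation, matching cv2.INTER_LINEAR
        )
        for channel in range(image.shape[-1])
    ]
    return np.stack(channels, axis=-1)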
# transform image back to input_data_format
image = to_channel_dimension_format(image, input_data_format, ChannelDimension.LAST)
This isn't needed; `data_format` is always set.
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
    Classification (or regression if config.num_labels==1) loss.
heatmaps (`torch.FloatTensor` of shape `(batch_size, num_keypoints, height, width)`):
    Heatmaps.
This could do with a bit of expanding: "heatmaps = heatmaps" is a bit redundant.
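For what it's worth, one possible expansion (the wording is only a suggestion):

heatmaps (`torch.FloatTensor` of shape `(batch_size, num_keypoints, height, width)`):
    Predicted heatmaps, one per keypoint. Each map holds per-pixel confidence scores
    for that keypoint's location; final coordinates are typically recovered by taking
    the argmax of each map and rescaling to the original image.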
config.backbone_hidden_size = self.backbone.config.hidden_size
config.image_size = self.backbone.config.image_size
config.patch_size = self.backbone.config.patch_size
These attributes should be checked for and an exception raised if they're not present
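A sketch of the kind of guard meant here, using the attribute names from the diff above:

# Raise early if the backbone config lacks the attributes ViTPose relies on.
for attribute_name in ("hidden_size", "image_size", "patch_size"):
    if getattr(self.backbone.config, attribute_name, None) is None:
        raise ValueError(
            f"Backbone config must define `{attribute_name}` to be usable with ViTPose."
        )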
def __init__(
    self,
    backbone_config: PretrainedConfig = None,
    backbone=None,
Args below should all have typing too
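For example, something along these lines (types are guesses based on the args documented below; only arguments visible in this diff and docstring are shown):

from typing import Optional

def __init__(
    self,
    backbone_config: Optional[PretrainedConfig] = None,
    backbone: Optional[str] = None,
    initializer_range: float = 0.02,
    scale_factor: int = 4,
    **kwargs,
):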
initializer_range (`float`, *optional*, defaults to 0.02):
    The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
scale_factor (`int`, *optional*, defaults to 4):
    Factor to upscale te feature maps coming from the ViT backbone.
Suggested change:
-Factor to upscale te feature maps coming from the ViT backbone.
+Factor to upscale the feature maps coming from the ViT backbone.
@@ -533,6 +533,62 @@ def _center_to_corners_format_tf(bboxes_center: "tf.Tensor") -> "tf.Tensor":
return bboxes_corners


def coco_to_pascal_voc(bboxes: np.ndarray) -> np.ndarray:
Let's just keep these in the vitpose image processor module.
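(For reference, the conversion itself is just a corner-format change; a minimal sketch of what the helper presumably does:)

import numpy as np

def coco_to_pascal_voc(bboxes: np.ndarray) -> np.ndarray:
    # COCO boxes are (x, y, width, height); Pascal VOC boxes are
    # (x_min, y_min, x_max, y_max).
    bboxes = bboxes.copy()
    bboxes[:, 2] = bboxes[:, 0] + bboxes[:, 2]
    bboxes[:, 3] = bboxes[:, 1] + bboxes[:, 3]
    return bboxes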
elif width < aspect_ratio * height:
    width = height * aspect_ratio

# pixel std is 200.0
Always?
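For background, the 200.0 constant appears to follow the COCO convention used in SimpleBaseline/MMPose-style pipelines, where a person box is converted to a center and a scale normalized by a fixed pixel std of 200. A sketch, assuming that convention (names are illustrative, not the PR's API):

import numpy as np

PIXEL_STD = 200.0  # fixed normalization constant in COCO-style pose pipelines

def box_to_center_and_scale(box, aspect_ratio, padding=1.25):
    # `box` is (x, y, w, h) in COCO format.
    x, y, w, h = box
    center = np.array([x + 0.5 * w, y + 0.5 * h])
    # Clamp the box to the target aspect ratio, as in the diff above.
    if w > aspect_ratio * h:
        h = w / aspect_ratio
    elif w < aspect_ratio * h:
        w = h * aspect_ratio
    scale = np.array([w / PIXEL_STD, h / PIXEL_STD]) * padding
    return center, scale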
## Overview

The ViTPose model was proposed in [ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation](https://arxiv.org/abs/2204.12484) by Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao. ViTPose employs a standard [Vision Transformer](vit) as backbone for the task of keypoint estimation.
Could you expand this model explanation a bit please?
What does this PR do?
This PR adds ViTPose as introduced in ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation.
Here's a demo notebook - note that the API might change: https://colab.research.google.com/drive/15_3gjcC0wtKSH85k76zewt81eUJIEWWA?usp=sharing.
To do: