Cannot replicate results from object detection task guide #30557

Open · adam-homeboost opened this issue Apr 29, 2024 · 9 comments
Labels: Examples, Vision

Comments

@adam-homeboost
System Info

  • transformers version: 4.40.1
  • Platform: Linux-5.15.0-1058-aws-x86_64-with-glibc2.31
  • Python version: 3.10.14
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.3
  • Accelerate version: 0.29.3
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

I am following the example given in https://huggingface.co/docs/transformers/en/tasks/object_detection as closely as I possibly can. The only difference is that I am not pushing the training results up to the Hugging Face Hub; instead I am saving (and reloading) them locally.

When I run the evaluation I get terrible results that look nothing like those in the guide. Instead of mAPs in the 0.3-0.7 range, I am getting results well under 0.1:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.053
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.096
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.044
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.039
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.018
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.046
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.078
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.214
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.281
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.128
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.157
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.263

Instead of the expected:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.352
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.681
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.292
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.168
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.208
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.429
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.274
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.484
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.501
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.191
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.323
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.590

This much of a difference makes for a "tuned" model that does not work at all.

If I use the fine-tuned model on the Hugging Face Hub as documented (uploaded by the author?), I get the expected results. For some reason, following the same steps, I cannot get to a tuned model with anywhere near the performance the author did.

I have also tried this on different hardware and gotten different results. The above numbers are from NVIDIA GPUs. I tried this on a Mac M3 GPU and got even worse numbers (all less than 0.001).

I am new to machine learning and this toolset. I would appreciate any suggestions or guidelines as to what I could be doing wrong here. I also do not understand why running this on different hardware and different CPU vs. GPU mixes ends up with different scores.

Any help or suggestions appreciated!

Who can help?

@amyeroberts

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Follow the task guide code exactly, except:

  1. After trainer.train(), call trainer.save_model().

  2. In the eval steps, instantiate AutoImageProcessor and AutoModelForObjectDetection with from_pretrained from the model saved in step 1 (see the sketch below).
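
Concretely, the deviation from the guide looks like this (a minimal sketch; "detr_finetuned_cppe5" is a placeholder local directory rather than a path from the guide, trainer is the Trainer instance built in the guide, and I pass an explicit directory to save_model here only for clarity):

```python
from transformers import AutoImageProcessor, AutoModelForObjectDetection

# Step 1: after training, save the model (and image processor) locally
# instead of pushing to the Hub.
trainer.save_model("detr_finetuned_cppe5")

# Step 2: in the evaluation step, reload from the local directory.
image_processor = AutoImageProcessor.from_pretrained("detr_finetuned_cppe5")
model = AutoModelForObjectDetection.from_pretrained("detr_finetuned_cppe5")
```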

Expected behavior

Expect same or reasonably close scores in the eval step.

@NielsRogge (Contributor)

Relevant to #29964 and cc @qubvel

@qubvel (Member) commented Apr 30, 2024

Hi @adam-homeboost, thanks for reporting the problem! I am working on refining the object detection examples.
You could take a look at #30422; that PR will provide new examples that give even better metrics.

@amyeroberts added the Examples and Vision labels on Apr 30, 2024
@adam-homeboost (Author)

Thank you @qubvel. I definitely appreciate the work you are doing to update the example!

Based on your comments, I ran the original example's training out to 100 epochs instead of the documented 10 and got much better results, close enough to the published numbers (see the sketch below). So there is definitely a documentation issue there. I see that your new example shows the correct number of epochs.
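
For reference, the only change I made was the epoch count in the guide's TrainingArguments (a minimal sketch; the other arguments shown here are illustrative, not copied verbatim from the guide, and the output directory is a placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="detr_finetuned_cppe5",  # placeholder output directory
    num_train_epochs=100,               # instead of the documented 10
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    remove_unused_columns=False,        # keep image/annotation columns for the collator
    fp16=True,
)
```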

As a beginner in ML, may I offer a couple of suggestions for your new examples? These are the questions I have:

  1. It is unclear to me how you decide which specific image transformations are needed. Although I have poked into the original documentation/code of the DETR model you are using, I didn't find anything as clear or obvious as the transform pipeline you set up. I think it would be helpful to have a note on methodology there, especially for cases where a different model is being adapted. What should people look for when picking a different base model?

  2. I see your two examples, trainer vs. no-trainer. It would be helpful to understand why someone would choose to forgo the Trainer. What is the tradeoff there?

@g1y5x3 commented Apr 30, 2024

> It is unclear to me how you decide which specific image transformations are needed. Although I have poked into the original documentation/code of the DETR model you are using, I didn't find anything as clear or obvious as the transform pipeline you set up. I think it would be helpful to have a note on methodology there, especially for cases where a different model is being adapted. What should people look for when picking a different base model?

If you are wondering about the DETR architecture, it should be the same as here, and you can find the paper on the documentation page as well.

> I see your two examples, trainer vs. no-trainer. It would be helpful to understand why someone would choose to forgo the Trainer. What is the tradeoff there?

My understanding is that the Trainer is a much more flexible API that enables you to switch between models very quickly for rapid development. The no-trainer version leverages the Accelerate API, which allows you to perform multi-GPU training to speed up the process once you have decided which model you want to invest a bit more in. It also requires a few more configurations in the code, as you can probably tell.

@qubvel (Member) commented Apr 30, 2024

@adam-homeboost great questions, they will help improve the examples!

@g1y5x3 thanks for the answers, I will add a bit more here:

> It is unclear to me how you decide which specific image transformations are needed. Although I have poked into the original documentation/code of the DETR model you are using, I didn't find anything as clear or obvious as the transform pipeline you set up. I think it would be helpful to have a note on methodology there, especially for cases where a different model is being adapted. What should people look for when picking a different base model?

We might consider two types of transformations:

  1. Image "standardization": usually resizing [+ padding] the image to a particular fixed size, plus normalization. The resizing strategy can be dataset-specific and may depend on image sizes, aspect ratios, and the size of objects. Normalization is usually taken from the original pretrained model (very often imagenet_mean, imagenet_std).
  2. Image augmentations: a way to enlarge dataset variability. These can also be very dataset-specific and should be chosen based on metrics. As a starting point, choose ones that do not significantly change the data distribution (i.e. the images/annotations do not look "strange" after applying them, for this specific task). Here is a space where you can try different augmentations, change their parameters, and inspect the result. A sketch combining both kinds follows this list.
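
For example (a minimal sketch using albumentations, as in the task guide; the sizes, probabilities, and bbox format here are illustrative assumptions rather than values from the guide):

```python
import albumentations as A

transform = A.Compose(
    [
        # 1. "Standardization": resize to a fixed training size. Normalization
        #    to the pretrained model's mean/std is typically handled by the
        #    image processor, so it is omitted here.
        A.Resize(480, 480),
        # 2. Augmentations: mild ones that keep the data distribution sane.
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.5),
    ],
    # Keep the bounding boxes in sync with the image transforms.
    bbox_params=A.BboxParams(format="coco", label_fields=["category"]),
)
```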

> I see your two examples, trainer vs. no-trainer. It would be helpful to understand why someone would choose to forgo the Trainer. What is the tradeoff there?

My opinion differs a bit from @g1y5x3's: both APIs support multi-GPU training. The Trainer provides a simple API, with the training/evaluation loop already implemented, but it is less flexible. Even though the Trainer has a large number of training arguments, at some point you might want to implement custom logic (e.g. custom learning rate scheduling or data sampling), and then you will need access to the training/evaluation loop. For that case the Accelerate example is provided, which has explicit training and evaluation loops (sketched below).
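
Roughly, the explicit loop looks like this (a minimal sketch, assuming model, optimizer, and train_dataloader have already been built; the names are placeholders, not identifiers from the example scripts):

```python
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

num_epochs = 100  # assumed schedule, adjust as needed
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        outputs = model(**batch)    # detection models return a .loss when labels are passed
        loss = outputs.loss
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Any custom scheduling or sampling logic can be added directly inside this loop, which is the main reason to prefer it over the Trainer.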

Please let me know if that makes it a bit clearer, and whether you have any follow-up questions 🤗

@g1y5x3 commented May 1, 2024

Thank you for the clarification, it makes total sense. Quick question: after looking through your PR, I noticed it didn't touch the example. However, I remember this needs a bit of clarification, since training that dataset for 10 epochs won't yield any good predictions; I ran it with 100 epochs, which took ~3 hours on an A6000. Does it need to be updated?

@qubvel (Member) commented May 1, 2024

@g1y5x3 yes, you are right, it needs to be updated. Any help is appreciated; in case you want to contribute, you can open a PR that aligns the notebook example with the Python one and ping me for help or review 🙂

@NielsRogge (Contributor)

That's what my PR #29967 addresses; I'd prefer to have it merged to unblock people.

@qubvel (Member) commented May 1, 2024

Yes, let's have it merged. The next PR can then address the other points in the notebook example from your issue (taking #30422 as a base).
