ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification

by Mohammad Reza Taesiri, Giang Nguyen, Sarra Habchi, Cor-Paul Bezemer, and Anh Nguyen.

Abstract

Image classifiers are information-discarding machines, by design. Yet, how these models discard information remains mysterious. We hypothesize that one way for image classifiers to reach high accuracy is to first zoom to the most discriminative region in the image and then extract features from there to predict image labels, discarding the rest of the image. Studying six popular networks ranging from AlexNet to CLIP, we find that proper framing of the input image can lead to the correct classification of 98.91% of ImageNet images. Furthermore, we uncover positional biases in various datasets, especially a strong center bias in two popular datasets: ImageNet-A and ObjectNet. Finally, leveraging our insights into the potential of zooming, we propose a test-time augmentation (TTA) technique that improves classification accuracy by forcing models to explicitly perform zoom-in operations before making predictions. Our method is more interpretable, accurate, and faster than MEMO, a state-of-the-art (SOTA) TTA method. We introduce ImageNet-Hard, a new benchmark that challenges SOTA classifiers including large vision-language models even when optimal zooming is allowed.
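To make the zoom-based test-time augmentation idea concrete, here is a minimal sketch. It is not the authors' exact procedure: the zoom scales, the 3x3 anchor grid, the averaging rule, and the `classify` callable (assumed to map an image array to a vector of class probabilities, doing any resizing itself) are all illustrative assumptions.

```python
import numpy as np


def zoom_crops(image, scales=(1.0, 0.75, 0.5), grid=3):
    """Return crops of `image` (H, W, C) at several zoom scales.

    For each scale s < 1, a window of size (s*H, s*W) is placed at
    grid x grid evenly spaced positions. A scale of 1.0 contributes the
    full image once. Scales and grid size are illustrative choices.
    """
    h, w = image.shape[:2]
    crops = []
    for s in scales:
        ch, cw = max(1, int(round(h * s))), max(1, int(round(w * s)))
        if ch == h and cw == w:
            crops.append(image)
            continue
        tops = np.linspace(0, h - ch, grid).round().astype(int)
        lefts = np.linspace(0, w - cw, grid).round().astype(int)
        for t in tops:
            for l in lefts:
                crops.append(image[t:t + ch, l:l + cw])
    return crops


def zoom_tta_predict(image, classify, scales=(1.0, 0.75, 0.5), grid=3):
    """Average class probabilities over all zoomed crops.

    `classify` is a hypothetical callable: crop -> 1-D probability array.
    Returns (predicted class index, averaged probabilities).
    """
    crops = zoom_crops(image, scales, grid)
    probs = np.stack([classify(c) for c in crops])
    mean_probs = probs.mean(axis=0)
    return int(mean_probs.argmax()), mean_probs
```

With the default settings, each image is evaluated on 19 views (the full image plus nine crops at each of the two zoomed scales), so inference cost grows linearly with the number of crops.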


ImageNet-Hard

ImageNet-Hard is a new benchmark comprising challenging images curated from several validation datasets of ImageNet. It challenges state-of-the-art vision models because merely zooming in often fails to improve their ability to classify these images correctly. Consequently, even the most advanced models, such as CLIP-ViT-L/14@336px, struggle on this dataset, achieving only 2.02% accuracy.
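One way to read "hard even when optimal zooming is allowed" is that an image stays hard if none of its zoomed views is classified correctly. The sketch below illustrates that reading only. The function names, the `(crops, label)` sample layout, and the `predict` callable are hypothetical and are not taken from the released evaluation code.

```python
from typing import Callable, Iterable, List, Sequence, Tuple, TypeVar

Crop = TypeVar("Crop")
Sample = Tuple[Sequence[Crop], int]  # (zoomed views of one image, true label)


def solved_by_some_zoom(crops: Sequence[Crop], label: int,
                        predict: Callable[[Crop], int]) -> bool:
    """True if at least one zoomed view is classified as `label`."""
    return any(predict(c) == label for c in crops)


def oracle_zoom_accuracy(samples: Iterable[Sample],
                         predict: Callable[[Crop], int]) -> float:
    """Percentage of images that some zoomed view classifies correctly."""
    samples = list(samples)
    if not samples:
        raise ValueError("samples must not be empty")
    hits = sum(solved_by_some_zoom(crops, label, predict)
               for crops, label in samples)
    return 100.0 * hits / len(samples)


def remaining_hard(samples: Iterable[Sample],
                   predict: Callable[[Crop], int]) -> List[int]:
    """Indices of images that no zoomed view classifies correctly."""
    return [i for i, (crops, label) in enumerate(samples)
            if not solved_by_some_zoom(crops, label, predict)]
```

Under this reading, the images returned by `remaining_hard` are the candidates for a benchmark of this kind, since zooming alone does not rescue them.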

The ImageNet-Hard dataset is available to access and browse on Hugging Face:

ImageNet-Hard
ImageNet-Hard-4K

Dataset Distribution

Performance Report

Model	Accuracy (%)
AlexNet	7.34
VGG-16	12.00
ResNet-18	10.86
ResNet-50	14.74
ViT-B/32	18.52
EfficientNet-B0	16.57
EfficientNet-B7	23.20
EfficientNet-L2-Ns	39.00
CLIP-ViT-L/14@224px	1.86
CLIP-ViT-L/14@336px	2.02
OpenCLIP-ViT-bigG-14	15.93
OpenCLIP-ViT-L-14	15.60

Evaluation Code

CLIP
OpenCLIP
Other models

Supplementary Material

You can find all the supplementary material on Google Drive.

Citation information

If you use this software, please consider citing:

@inproceedings{taesiri2023zoom,
  title={ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification},
  author={Taesiri, Mohammad Reza and Nguyen, Giang and Habchi, Sarra and Bezemer, Cor-Paul and Nguyen, Anh},
  booktitle={Advances in Neural Information Processing Systems},
  year={2023}
}
