ViT: Image Classification

ViT (Vision Transformer) is a vision model introduced by Google in 2020 that applies the Transformer architecture to images. Unlike traditional convolutional neural networks (CNNs), ViT divides an image into fixed-size patches, linearly embeds each patch, and feeds the resulting sequence of embeddings into a Transformer encoder. Self-attention allows the model to capture long-range dependencies across the whole image while dispensing with convolutions entirely. Although Transformers were originally designed for natural language processing tasks, ViT delivers excellent image-classification performance, particularly when trained on large datasets such as ImageNet. Its scalability makes it well suited to larger image datasets and to a range of vision tasks, including image classification and object detection.
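To make the patch-to-sequence step concrete, here is a minimal PyTorch sketch of the ViT pipeline. It is illustrative only: the patch size, embedding width, and encoder depth follow the common ViT-Base/16 configuration and are assumptions, not a description of the exact source model.

```python
# Minimal sketch of the ViT patchify-and-encode idea (illustrative only;
# sizes follow the common ViT-Base/16 configuration, an assumption here).
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)          # batch of one 224x224 RGB image

patch_size, embed_dim = 16, 768
# A strided convolution splits the image into 16x16 patches and linearly
# embeds each one: (1, 3, 224, 224) -> (1, 768, 14, 14).
patchify = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
tokens = patchify(image).flatten(2).transpose(1, 2)   # (1, 196, 768) patch sequence

# A learnable [CLS] token is prepended and position embeddings are added
# before the sequence enters a standard Transformer encoder.
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, tokens.shape[1] + 1, embed_dim))
tokens = torch.cat([cls_token.expand(1, -1, -1), tokens], dim=1) + pos_embed

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True),
    num_layers=12,
)
features = encoder(tokens)                             # (1, 197, 768)
logits = nn.Linear(embed_dim, 1000)(features[:, 0])    # classify from the [CLS] token
print(logits.shape)                                    # torch.Size([1, 1000])
```

Real ViT implementations additionally use pre-norm layer placement, GELU activations, and trained weights, which this sketch omits for brevity.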

Source model

  • Input shape: 224x224
  • Number of parameters: 82.55M
  • Model size: 330.5 MB
  • Output shape: 1x1000

Source model repository: ViT
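
As a quick sanity check of the shapes listed above, the snippet below loads a standard ViT-Base ImageNet-1k classifier through the Hugging Face transformers library and runs a dummy 224x224 input. The checkpoint name google/vit-base-patch16-224 is an assumption for illustration; the artifact hosted in Model Farm may differ slightly (for example, in exact parameter count).

```python
# Sketch: verifying the listed input/output shapes with Hugging Face
# `transformers`. The checkpoint name is an assumption (a standard
# ViT-Base ImageNet-1k classifier), not necessarily the Model Farm artifact.
import torch
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
model.eval()

dummy = torch.randn(1, 3, 224, 224)            # 224x224 RGB input, batch size 1
with torch.no_grad():
    logits = model(pixel_values=dummy).logits
print(logits.shape)                            # torch.Size([1, 1000])
print(sum(p.numel() for p in model.parameters()) / 1e6)  # parameters, in millions
```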

Performance Reference

Please search for the model by name in Model Farm.

Inference & Model Conversion

Please search for the model by name in Model Farm.

License
