ViT: Image Classification
ViT (Vision Transformer) is a vision model introduced by Google in 2020, based on the Transformer architecture. Unlike traditional convolutional neural networks (CNNs), ViT splits an image into fixed-size patches, projects each patch into a linear embedding, and feeds the resulting sequence into a Transformer encoder. Self-attention lets the model capture long-range dependencies across the image without any convolutions. Although Transformers were originally designed for natural language processing, ViT achieves excellent performance on image classification, particularly when pre-trained on large datasets such as ImageNet. Its scalability allows it to handle large image datasets and adapt to various vision tasks such as image classification and object detection.
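The patch-sequence idea can be sketched in a few lines of PyTorch. The following is a minimal illustrative implementation, not the source model's code; the hyperparameters (16x16 patches, 768-dim embeddings, 12 layers) follow the common ViT-Base/16 configuration and are assumptions here.

```python
# Minimal ViT-style classifier: patches -> linear embeddings -> Transformer.
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=768,
                 depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196
        # Patch embedding: a strided convolution splits the image into
        # non-overlapping patches and projects each one linearly to `dim`.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                     # x: (B, 3, 224, 224)
        x = self.patch_embed(x)               # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)      # (B, 196, dim) patch sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)        # prepend [CLS] token
        x = x + self.pos_embed                # add learned position encoding
        x = self.encoder(x)                   # self-attention over patches
        return self.head(self.norm(x[:, 0]))  # classify from [CLS] token

logits = SimpleViT()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```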
Source model
- Input shape: 224x224
- Number of parameters: 82.55M
- Model size: 330.5 MB
- Output shape: 1x1000
Source model repository: ViT
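The input and output shapes listed above can be checked with an off-the-shelf ViT-Base/16 checkpoint. The sketch below uses torchvision's `vit_b_16` as a stand-in for the source model (an assumption, since the deployable Model Farm checkpoint may differ); it preprocesses a dummy image to 224x224 and confirms the 1x1000 logits shape.

```python
# Hedged usage sketch: torchvision's vit_b_16 stands in for the source model.
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights).eval()
preprocess = weights.transforms()          # resize/crop to 224x224 + normalize

image = torch.rand(3, 256, 256)            # dummy RGB image tensor
batch = preprocess(image).unsqueeze(0)     # -> (1, 3, 224, 224)
with torch.no_grad():
    logits = model(batch)
print(logits.shape)                        # torch.Size([1, 1000])
```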
Performance Reference
Please search for the model by name in Model Farm.
Inference & Model Conversion
Please search for the model by name in Model Farm.
License
Source Model: BSD-3-Clause
Deployable Model: APLUX-MODEL-FARM-LICENSE