ViT: Image Classification
ViT (Vision Transformer) is a vision model introduced by Google in 2020, based on the Transformer architecture. Unlike traditional convolutional neural networks (CNNs), ViT splits an image into fixed-size patches, projects each patch into a linear embedding, and feeds the resulting sequence into a Transformer encoder. Self-attention lets the model capture long-range dependencies across the image without any convolutions. Although Transformers were originally designed for natural language processing, ViT achieves excellent performance on image classification, particularly when pre-trained on large datasets such as ImageNet. Its scalability allows it to handle large image datasets and adapt to various vision tasks such as image classification and object detection.
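The patch-sequence idea can be sketched in a few lines of PyTorch. The following is a minimal illustrative implementation, not the source model's code; the hyperparameters (16x16 patches, 768-dim embeddings, 12 layers) follow the common ViT-Base/16 configuration and are assumptions here.

```python
# Minimal ViT-style classifier: patches -> linear embeddings -> Transformer.
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=768,
                 depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196
        # Patch embedding: a strided convolution splits the image into
        # non-overlapping patches and projects each one linearly to `dim`.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                     # x: (B, 3, 224, 224)
        x = self.patch_embed(x)               # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)      # (B, 196, dim) patch sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)        # prepend [CLS] token
        x = x + self.pos_embed                # add learned position encoding
        x = self.encoder(x)                   # self-attention over patches
        return self.head(self.norm(x[:, 0]))  # classify from [CLS] token

logits = SimpleViT()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```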
Source model
- Input shape: 224x224
- Number of parameters: 82.55M
- Model size: 330.5 MB
- Output shape: 1x1000
Source model repository: ViT
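The input and output shapes listed above can be checked with an off-the-shelf ViT-Base/16 checkpoint. The sketch below uses torchvision's `vit_b_16` as a stand-in for the source model (an assumption, since the deployable Model Farm checkpoint may differ); it preprocesses a dummy image to 224x224 and confirms the 1x1000 logits shape.

```python
# Hedged usage sketch: torchvision's vit_b_16 stands in for the source model.
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights).eval()
preprocess = weights.transforms()          # resize/crop to 224x224 + normalize

image = torch.rand(3, 256, 256)            # dummy RGB image tensor
batch = preprocess(image).unsqueeze(0)     # -> (1, 3, 224, 224)
with torch.no_grad():
    logits = model(batch)
print(logits.shape)                        # torch.Size([1, 1000])
```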
Performance Reference
Please search for the model by name in Model Farm.
Inference & Model Conversion
Please search for the model by name in Model Farm.
License
Source Model: BSD-3-Clause
Deployable Model: APLUX-MODEL-FARM-LICENSE