---
language:
- en
tags:
- vision-language
- vqa
- text-to-image-evaluation
license: mit
---
# Tiny Random VQAScore Model

This is a tiny, randomly initialized version of the VQAScore architecture, intended for educational and testing purposes.
## Model Architecture

- **Vision Encoder**: Tiny CNN + Transformer (64 hidden size)
- **Language Model**: Tiny Transformer (256 hidden size)
- **Multimodal Projector**: MLP with 256 → 128 → 64 → 1
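The projector's layer sizes above can be sketched as a small PyTorch module. This is an illustrative sketch only: the layer names, activations, and module structure are assumptions, and the actual implementation in `create_tiny_vqa_model.py` may differ.

```python
import torch
import torch.nn as nn

# Sketch of the multimodal projector described above: an MLP that maps
# 256-dim multimodal features down to a single scalar score.
# The ReLU activations are an assumption, not taken from the repo.
projector = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

# A batch of 2 fake 256-dim feature vectors -> 2 scalar scores.
scores = projector(torch.randn(2, 256))
print(tuple(scores.shape))  # (2, 1)

# Parameter count of this projector alone:
# 256*128 + 128  +  128*64 + 64  +  64*1 + 1  =  41,217
print(sum(p.numel() for p in projector.parameters()))  # 41217
```

Note that this projector alone accounts for most of the ~50K parameters quoted below, which is consistent with the tiny encoder and language-model sizes.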
## Usage

```python
from PIL import Image

from create_tiny_vqa_model import TinyVQAScore

# Load the randomly initialized model on CPU
model = TinyVQAScore(device="cpu")

# Score an image against a question
image = Image.open("your_image.jpg")
score = model.score(image, "What is shown in this image?")
print(f"VQA Score: {score}")
```
## Model Size

- **Parameters**: ~50K (vs ~11B for the original XXL model)
- **Memory**: ~200KB (vs ~22GB for the original XXL model)
## Disclaimer

This is a randomly initialized model for testing and educational purposes. It is not trained and will not produce meaningful VQA results.