alvarobartt (HF Staff) and andito (HF Staff) committed
Commit 739c23c · verified · 0 Parent(s):

Duplicate from HuggingFaceTB/SmolVLM-Instruct

Co-authored-by: Andres Marafioti <[email protected]>

.gitattributes ADDED
@@ -0,0 +1,37 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ onnx/decoder_model_merged.onnx_data filter=lfs diff=lfs merge=lfs -text
+ onnx/decoder_model_merged_fp16.onnx_data filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,180 @@
+ ---
+ library_name: transformers
+ license: apache-2.0
+ datasets:
+ - HuggingFaceM4/the_cauldron
+ - HuggingFaceM4/Docmatix
+ pipeline_tag: image-text-to-text
+ language:
+ - en
+ base_model:
+ - HuggingFaceTB/SmolLM2-1.7B-Instruct
+ - google/siglip-so400m-patch14-384
+ ---
+
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/SmolVLM.png" width="800" height="auto" alt="Image description">
+
+ # SmolVLM
+
+ SmolVLM is a compact open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. Designed for efficiency, SmolVLM can answer questions about images, describe visual content, create stories grounded in multiple images, or function as a pure language model without visual inputs. Its lightweight architecture makes it suitable for on-device applications while maintaining strong performance on multimodal tasks.
+
+ ## Model Summary
+
+ - **Developed by:** Hugging Face 🤗
+ - **Model type:** Multi-modal model (image+text)
+ - **Language(s) (NLP):** English
+ - **License:** Apache 2.0
+ - **Architecture:** Based on [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3) (see technical summary)
+
+ ## Resources
+
+ - **Demo:** [SmolVLM Demo](https://huggingface.co/spaces/HuggingFaceTB/SmolVLM)
+ - **Blog:** [Blog post](https://huggingface.co/blog/smolvlm)
+
+ ## Uses
+
+ SmolVLM can be used for inference on multimodal (image + text) tasks where the input comprises text queries along with one or more images. Text and images can be interleaved arbitrarily, enabling tasks like image captioning, visual question answering, and storytelling based on visual content. The model does not support image generation.
+
+ To fine-tune SmolVLM on a specific task, you can follow the fine-tuning tutorial.
+ <!-- todo: add link to fine-tuning tutorial -->
+
+ ### Technical Summary
+
+ SmolVLM leverages the lightweight SmolLM2 language model to provide a compact yet powerful multimodal experience. It introduces several changes compared to previous Idefics models:
+
+ - **Image compression:** We introduce more aggressive image compression than Idefics3, enabling faster inference and lower RAM usage.
+ - **Visual Token Encoding:** SmolVLM uses 81 visual tokens to encode image patches of size 384×384. Larger images are divided into patches, each encoded separately, enhancing efficiency without compromising performance (a rough token-count sketch follows below).
+
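To make the token accounting above concrete, here is a rough back-of-the-envelope sketch, not the actual preprocessing code: it assumes the 384×384 tiling, one extra downscaled global view, and the 81-tokens-per-tile figure stated above. The helper name and the "+1 global view" term are illustrative assumptions.

```python
import math

def approx_visual_tokens(width, height, longest_edge=1536, tile=384, tokens_per_tile=81):
    """Rough estimate of the visual tokens one image costs (illustrative only).

    Assumes the image is resized so its longest edge fits `longest_edge`, then split
    into `tile`x`tile` patches plus one downscaled global view; each patch costs
    `tokens_per_tile` tokens (81 = (384/14)^2 SigLIP patches pixel-shuffled by 3).
    """
    scale = min(longest_edge / max(width, height), 1.0)
    w, h = width * scale, height * scale
    n_tiles = math.ceil(w / tile) * math.ceil(h / tile)
    return (n_tiles + 1) * tokens_per_tile  # +1 for the global view (assumption)

print(approx_visual_tokens(2048, 1536))  # 4 x 3 tiles + 1 global -> 13 * 81 = 1053 tokens
```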
+ More details about the training and architecture are available in our technical report.
+
+ ### How to get started
+
+ You can use Transformers to load, run inference with, and fine-tune SmolVLM.
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoProcessor, AutoModelForVision2Seq
+ from transformers.image_utils import load_image
+
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+
+ # Load images
+ image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
+ image2 = load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg")
+
+ # Initialize processor and model
+ processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
+ model = AutoModelForVision2Seq.from_pretrained(
+     "HuggingFaceTB/SmolVLM-Instruct",
+     torch_dtype=torch.bfloat16,
+     _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
+ ).to(DEVICE)
+
+ # Create input messages
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image"},
+             {"type": "image"},
+             {"type": "text", "text": "Can you describe the two images?"}
+         ]
+     },
+ ]
+
+ # Prepare inputs
+ prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
+ inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
+ inputs = inputs.to(DEVICE)
+
+ # Generate outputs
+ generated_ids = model.generate(**inputs, max_new_tokens=500)
+ generated_texts = processor.batch_decode(
+     generated_ids,
+     skip_special_tokens=True,
+ )
+
+ print(generated_texts[0])
+ """
+ Assistant: The first image shows a green statue of the Statue of Liberty standing on a stone pedestal in front of a body of water.
+ The statue is holding a torch in its right hand and a tablet in its left hand. The water is calm and there are no boats or other objects visible.
+ The sky is clear and there are no clouds. The second image shows a bee on a pink flower.
+ The bee is black and yellow and is collecting pollen from the flower. The flower is surrounded by green leaves.
+ """
+ ```
+
+ ### Model optimizations
+
+ **Precision**: For better performance, load and run the model in half-precision (`torch.float16` or `torch.bfloat16`) if your hardware supports it.
+
+ ```python
+ from transformers import AutoModelForVision2Seq
+ import torch
+
+ model = AutoModelForVision2Seq.from_pretrained(
+     "HuggingFaceTB/SmolVLM-Instruct",
+     torch_dtype=torch.bfloat16
+ ).to("cuda")
+ ```
+
+ You can also load SmolVLM with 4-bit or 8-bit quantization using bitsandbytes, torchao, or Quanto. Refer to [this page](https://huggingface.co/docs/transformers/en/main_classes/quantization) for other options.
+
+ ```python
+ from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
+ import torch
+
+ quantization_config = BitsAndBytesConfig(load_in_8bit=True)
+ model = AutoModelForVision2Seq.from_pretrained(
+     "HuggingFaceTB/SmolVLM-Instruct",
+     quantization_config=quantization_config,
+ )
+ ```
+
+ **Vision Encoder Efficiency**: Adjust the image resolution by setting `size={"longest_edge": N*384}` when initializing the processor, where `N` is your desired value. The default `N=4` works well and corresponds to a longest image edge of 1536 pixels. For documents, `N=5` might be beneficial. Decreasing `N` saves GPU memory and is appropriate for lower-resolution images; it is also useful if you want to fine-tune on videos.
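For example, a lower-memory configuration could look like this (a minimal sketch; `N=3` caps the longest image edge at 1152 pixels, while the default corresponds to `N=4`):

```python
from transformers import AutoProcessor

N = 3  # lower than the default N=4 to trade resolution for GPU memory
processor = AutoProcessor.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    size={"longest_edge": N * 384},
)
```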
+
+ ## Misuse and Out-of-scope Use
+
+ SmolVLM is not intended for high-stakes scenarios or critical decision-making processes that affect an individual's well-being or livelihood. The model may produce content that appears factual but may not be accurate. Misuse includes, but is not limited to:
+
+ - Prohibited Uses:
+   - Evaluating or scoring individuals (e.g., in employment, education, credit)
+   - Critical automated decision-making
+   - Generating unreliable factual content
+ - Malicious Activities:
+   - Spam generation
+   - Disinformation campaigns
+   - Harassment or abuse
+   - Unauthorized surveillance
+
+ ### License
+
+ SmolVLM is built upon [the shape-optimized SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) as its image encoder and [SmolLM2](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct) as its text decoder.
+
+ We release the SmolVLM checkpoints under the Apache 2.0 license.
+
+ ## Training Details
+
+ ### Training Data
+
+ The training data comes from [The Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron) and [Docmatix](https://huggingface.co/datasets/HuggingFaceM4/Docmatix), with an emphasis on document understanding (25%) and image captioning (18%), while maintaining balanced coverage across other crucial capabilities such as visual reasoning, chart comprehension, and general instruction following.
+ <img src="https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct/resolve/main/mixture_the_cauldron.png" alt="Example Image" style="width:90%;" />
+
+ ## Evaluation
+
+ | Model              | MMMU (val) | MathVista (testmini) | MMStar (val) | DocVQA (test) | TextVQA (val) | Min GPU RAM required (GB) |
+ |--------------------|------------|----------------------|--------------|---------------|---------------|---------------------------|
+ | SmolVLM            | 38.8       | 44.6                 | 42.1         | 81.6          | 72.7          | 5.02                      |
+ | Qwen-VL 2B         | 41.1       | 47.8                 | 47.5         | 90.1          | 79.7          | 13.70                     |
+ | InternVL2 2B       | 34.3       | 46.3                 | 49.8         | 86.9          | 73.4          | 10.52                     |
+ | PaliGemma 3B 448px | 34.9       | 28.7                 | 48.3         | 32.2          | 56.0          | 6.72                      |
+ | moondream2         | 32.4       | 24.3                 | 40.3         | 70.5          | 65.2          | 3.87                      |
+ | MiniCPM-V-2        | 38.2       | 39.8                 | 39.1         | 71.9          | 74.1          | 7.88                      |
+ | MM1.5 1B           | 35.8       | 37.2                 | 0.0          | 81.0          | 72.5          | NaN                       |
SmolVLM.png ADDED
added_tokens.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "<end_of_utterance>": 49154,
+ "<fake_token_around_image>": 49152,
+ "<image>": 49153
+ }
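These IDs can be sanity-checked against the tokenizer shipped in this repo; a small sketch using the standard `AutoTokenizer` API:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
tokens = ["<fake_token_around_image>", "<image>", "<end_of_utterance>"]
print(tokenizer.convert_tokens_to_ids(tokens))  # expected: [49152, 49153, 49154]
```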
chat_template.json ADDED
@@ -0,0 +1,3 @@
+ {
+ "chat_template": "<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>\n{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}"
+ }
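To illustrate what this template renders, the snippet below (a hypothetical example built on the processor in this repo) formats a single-image user turn; the expected string is derived by hand from the template above:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe the image."},
        ],
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
print(prompt)
# Expected, per the template above:
# <|im_start|>User:<image>Describe the image.<end_of_utterance>
# Assistant:
```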
config.json ADDED
@@ -0,0 +1,263 @@
1
+ {
2
+ "architectures": [
3
+ "Idefics3ForConditionalGeneration"
4
+ ],
5
+ "image_seq_len": 81,
6
+ "image_token_id": 49153,
7
+ "model_type": "idefics3",
8
+ "scale_factor": 3,
9
+ "text_config": {
10
+ "_attn_implementation_autoset": false,
11
+ "_flash_attn_2_enabled": true,
12
+ "_name_or_path": "/fsx/m4/experiments/local_experiment_dir/s3_async_temporary_checkpoint_folder/tr_324_opt_400/unwrapped_model",
13
+ "add_cross_attention": false,
14
+ "architectures": [
15
+ "VLlama3ForCausalLM"
16
+ ],
17
+ "attention_bias": false,
18
+ "attention_dropout": 0.0,
19
+ "bad_words_ids": null,
20
+ "begin_suppress_tokens": null,
21
+ "bos_token_id": 0,
22
+ "chunk_size_feed_forward": 0,
23
+ "cross_attention_hidden_size": null,
24
+ "decoder_start_token_id": null,
25
+ "diversity_penalty": 0.0,
26
+ "do_sample": false,
27
+ "early_stopping": false,
28
+ "encoder_no_repeat_ngram_size": 0,
29
+ "eos_token_id": 0,
30
+ "exponential_decay_length_penalty": null,
31
+ "finetuning_task": null,
32
+ "forced_bos_token_id": null,
33
+ "forced_eos_token_id": null,
34
+ "head_dim": 64,
35
+ "hidden_act": "silu",
36
+ "hidden_size": 2048,
37
+ "id2label": {
38
+ "0": "LABEL_0",
39
+ "1": "LABEL_1"
40
+ },
41
+ "initializer_range": 0.02,
42
+ "intermediate_size": 8192,
43
+ "is_decoder": false,
44
+ "is_encoder_decoder": false,
45
+ "label2id": {
46
+ "LABEL_0": 0,
47
+ "LABEL_1": 1
48
+ },
49
+ "length_penalty": 1.0,
50
+ "max_length": 20,
51
+ "max_position_embeddings": 16384,
52
+ "min_length": 0,
53
+ "mlp_bias": false,
54
+ "model_type": "llama",
55
+ "neftune_noise_alpha": 0.0,
56
+ "no_repeat_ngram_size": 0,
57
+ "num_attention_heads": 32,
58
+ "num_beam_groups": 1,
59
+ "num_beams": 1,
60
+ "num_hidden_layers": 24,
61
+ "num_key_value_heads": 32,
62
+ "num_return_sequences": 1,
63
+ "output_attentions": false,
64
+ "output_hidden_states": false,
65
+ "output_scores": false,
66
+ "pad_token_id": 2,
67
+ "perceiver_config": {
68
+ "_attn_implementation_autoset": false,
69
+ "_name_or_path": "",
70
+ "add_cross_attention": false,
71
+ "architectures": null,
72
+ "attention_dropout": 0.0,
73
+ "bad_words_ids": null,
74
+ "begin_suppress_tokens": null,
75
+ "bos_token_id": null,
76
+ "chunk_size_feed_forward": 0,
77
+ "cross_attention_hidden_size": null,
78
+ "decoder_start_token_id": null,
79
+ "diversity_penalty": 0.0,
80
+ "do_sample": false,
81
+ "early_stopping": false,
82
+ "encoder_no_repeat_ngram_size": 0,
83
+ "eos_token_id": null,
84
+ "exponential_decay_length_penalty": null,
85
+ "finetuning_task": null,
86
+ "forced_bos_token_id": null,
87
+ "forced_eos_token_id": null,
88
+ "hidden_act": "silu",
89
+ "id2label": {
90
+ "0": "LABEL_0",
91
+ "1": "LABEL_1"
92
+ },
93
+ "is_decoder": false,
94
+ "is_encoder_decoder": false,
95
+ "label2id": {
96
+ "LABEL_0": 0,
97
+ "LABEL_1": 1
98
+ },
99
+ "length_penalty": 1.0,
100
+ "max_length": 20,
101
+ "min_length": 0,
102
+ "model_type": "vllama3",
103
+ "no_repeat_ngram_size": 0,
104
+ "num_beam_groups": 1,
105
+ "num_beams": 1,
106
+ "num_key_value_heads": 1,
107
+ "num_return_sequences": 1,
108
+ "output_attentions": false,
109
+ "output_hidden_states": false,
110
+ "output_scores": false,
111
+ "pad_token_id": null,
112
+ "prefix": null,
113
+ "problem_type": null,
114
+ "pruned_heads": {},
115
+ "qk_layer_norms_perceiver": false,
116
+ "remove_invalid_values": false,
117
+ "repetition_penalty": 1.0,
118
+ "resampler_depth": 6,
119
+ "resampler_head_dim": 96,
120
+ "resampler_n_heads": 16,
121
+ "resampler_n_latents": 64,
122
+ "return_dict": true,
123
+ "return_dict_in_generate": false,
124
+ "sep_token_id": null,
125
+ "suppress_tokens": null,
126
+ "task_specific_params": null,
127
+ "temperature": 1.0,
128
+ "tf_legacy_loss": false,
129
+ "tie_encoder_decoder": false,
130
+ "tie_word_embeddings": true,
131
+ "tokenizer_class": null,
132
+ "top_k": 50,
133
+ "top_p": 1.0,
134
+ "torch_dtype": null,
135
+ "torchscript": false,
136
+ "transformers_version": "4.46.0",
137
+ "typical_p": 1.0,
138
+ "use_bfloat16": false
139
+ },
140
+ "prefix": null,
141
+ "pretraining_tp": 1,
142
+ "problem_type": null,
143
+ "pruned_heads": {},
144
+ "qk_layer_norms": false,
145
+ "remove_invalid_values": false,
146
+ "repetition_penalty": 1.0,
147
+ "return_dict": true,
148
+ "return_dict_in_generate": false,
149
+ "rms_norm_eps": 1e-05,
150
+ "rope_scaling": null,
151
+ "rope_theta": 273768.0,
152
+ "sep_token_id": null,
153
+ "suppress_tokens": null,
154
+ "task_specific_params": null,
155
+ "temperature": 1.0,
156
+ "tf_legacy_loss": false,
157
+ "tie_encoder_decoder": false,
158
+ "tie_word_embeddings": false,
159
+ "tokenizer_class": null,
160
+ "top_k": 50,
161
+ "top_p": 1.0,
162
+ "torch_dtype": "bfloat16",
163
+ "torchscript": false,
164
+ "typical_p": 1.0,
165
+ "use_bfloat16": false,
166
+ "use_cache": true,
167
+ "use_resampler": false,
168
+ "vocab_size": 49155
169
+ },
170
+ "tie_word_embeddings": false,
171
+ "torch_dtype": "bfloat16",
172
+ "transformers_version": "4.46.0",
173
+ "transformers.js_config": {
174
+ "kv_cache_dtype": {
175
+ "q4f16": "float16",
176
+ "fp16": "float16"
177
+ },
178
+ "dtype": {
179
+ "embed_tokens": "auto",
180
+ "vision_encoder": "auto",
181
+ "decoder_model_merged": "q4"
182
+ }
183
+ },
184
+ "use_cache": true,
185
+ "vision_config": {
186
+ "size": {"longest_edge": 1920},
187
+ "max_image_size": {"longest_edge": 384},
188
+ "_attn_implementation_autoset": false,
189
+ "_name_or_path": "",
190
+ "add_cross_attention": false,
191
+ "architectures": null,
192
+ "attention_dropout": 0.0,
193
+ "bad_words_ids": null,
194
+ "begin_suppress_tokens": null,
195
+ "bos_token_id": null,
196
+ "chunk_size_feed_forward": 0,
197
+ "cross_attention_hidden_size": null,
198
+ "decoder_start_token_id": null,
199
+ "diversity_penalty": 0.0,
200
+ "do_sample": false,
201
+ "early_stopping": false,
202
+ "encoder_no_repeat_ngram_size": 0,
203
+ "eos_token_id": null,
204
+ "exponential_decay_length_penalty": null,
205
+ "finetuning_task": null,
206
+ "forced_bos_token_id": null,
207
+ "forced_eos_token_id": null,
208
+ "hidden_act": "gelu_pytorch_tanh",
209
+ "hidden_size": 1152,
210
+ "id2label": {
211
+ "0": "LABEL_0",
212
+ "1": "LABEL_1"
213
+ },
214
+ "image_size": 384,
215
+ "initializer_range": 0.02,
216
+ "intermediate_size": 4304,
217
+ "is_decoder": false,
218
+ "is_encoder_decoder": false,
219
+ "label2id": {
220
+ "LABEL_0": 0,
221
+ "LABEL_1": 1
222
+ },
223
+ "layer_norm_eps": 1e-06,
224
+ "length_penalty": 1.0,
225
+ "max_length": 20,
226
+ "min_length": 0,
227
+ "model_type": "idefics3",
228
+ "no_repeat_ngram_size": 0,
229
+ "num_attention_heads": 16,
230
+ "num_beam_groups": 1,
231
+ "num_beams": 1,
232
+ "num_channels": 3,
233
+ "num_hidden_layers": 27,
234
+ "num_return_sequences": 1,
235
+ "output_attentions": false,
236
+ "output_hidden_states": false,
237
+ "output_scores": false,
238
+ "pad_token_id": null,
239
+ "patch_size": 14,
240
+ "prefix": null,
241
+ "problem_type": null,
242
+ "pruned_heads": {},
243
+ "remove_invalid_values": false,
244
+ "repetition_penalty": 1.0,
245
+ "return_dict": true,
246
+ "return_dict_in_generate": false,
247
+ "sep_token_id": null,
248
+ "suppress_tokens": null,
249
+ "task_specific_params": null,
250
+ "temperature": 1.0,
251
+ "tf_legacy_loss": false,
252
+ "tie_encoder_decoder": false,
253
+ "tie_word_embeddings": false,
254
+ "tokenizer_class": null,
255
+ "top_k": 50,
256
+ "top_p": 1.0,
257
+ "torch_dtype": null,
258
+ "torchscript": false,
259
+ "typical_p": 1.0,
260
+ "use_bfloat16": false
261
+ },
262
+ "vocab_size": 49155
263
+ }
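A few of the values above can be inspected programmatically with the standard `AutoConfig` API (a small sketch; the expected values in the comments simply mirror the JSON in this file):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")

print(config.model_type)                 # idefics3
print(config.scale_factor)               # 3 (pixel-shuffle factor -> 81 tokens per 384x384 patch)
print(config.text_config.hidden_size)    # 2048 (SmolLM2-1.7B backbone)
print(config.vision_config.hidden_size)  # 1152 (SigLIP-SO400M vision encoder)
```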
generation_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+ "_from_model_config": true,
+ "bos_token_id": 0,
+ "eos_token_id": 49154,
+ "pad_token_id": 2,
+ "transformers_version": "4.46.0"
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
mixture_the_cauldron.png ADDED
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8a4f76cb64f6f2e4e74716d8fc1cfc6a70bbb3eeea69d424c3ec9902655065eb
3
+ size 4492630912
onnx/decoder_model_merged.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a27ef6fe177d3109e0913c63da7a4b0f2791fab95da3e5f91b31ba6e03115385
3
+ size 126930
onnx/decoder_model_merged.onnx_data ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d530c318000311b2697d0b891ef46c69f9e9c89688761e043654d08a3cca376c
3
+ size 6849724416
onnx/decoder_model_merged_bnb4.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:39974ffc8a05f4de601005dc555d326f3dd2744ffd544e2892ad065fe25b2b8a
3
+ size 967330291
onnx/decoder_model_merged_fp16.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e6ec36a518b896dfc6448c4343898d0ea5109702677d34cea3919fba074044d1
3
+ size 1342510427
onnx/decoder_model_merged_fp16.onnx_data ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4249b577dcd1cd146c6db45bea29108dd1e3831f1c6a5d6a226d01ac92ab411d
3
+ size 2082471936
onnx/decoder_model_merged_int8.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2a2369559862dd3e40361a2b63ac0e7be18c07c72845393ee89df0e79713f6c7
3
+ size 1716139218
onnx/decoder_model_merged_q4.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:612e5c30793bc2045f9262b597013a25bcca44b4f76a7db196938a57a77e1f79
3
+ size 1074284508
onnx/decoder_model_merged_q4f16.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2d74ec46083829ddb18f58fcceb358d2ba58d2a1320bdab431c32e4d2896981d
3
+ size 965031477
onnx/decoder_model_merged_quantized.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bddb1dcd933e681eb2a542186c081dd1e6cf4b67161d905ef9da31cabbd3474d
3
+ size 1716139269
onnx/decoder_model_merged_uint8.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bddb1dcd933e681eb2a542186c081dd1e6cf4b67161d905ef9da31cabbd3474d
3
+ size 1716139269
onnx/embed_tokens.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8ec8537866d20b78e618e15aea8f91a558266cd77fe783e513f095fc1de1c8c4
3
+ size 402678062
onnx/embed_tokens_bnb4.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:eca0a3199567ba01a76dc6b923fd14bce39d6eb51d26686654bb7a98acfad280
3
+ size 402678081
onnx/embed_tokens_fp16.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:377adedd6ac1975e3afc3fb4c24dd6032a973626da71a5e0648dec3735a56527
3
+ size 201339266
onnx/embed_tokens_int8.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6666a926ca2a65f89016ea19ec7c5b8afd01c58e5aca1f33733f2d936f31c71d
3
+ size 100669984
onnx/embed_tokens_q4.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:eca0a3199567ba01a76dc6b923fd14bce39d6eb51d26686654bb7a98acfad280
3
+ size 402678081
onnx/embed_tokens_q4f16.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d835e5524c9a8b349fe55a7f589ab21780417c2f1e67f52062cf7787dcbefc3b
3
+ size 201339285
onnx/embed_tokens_quantized.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6666a926ca2a65f89016ea19ec7c5b8afd01c58e5aca1f33733f2d936f31c71d
3
+ size 100669984
onnx/embed_tokens_uint8.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6666a926ca2a65f89016ea19ec7c5b8afd01c58e5aca1f33733f2d936f31c71d
3
+ size 100669984
onnx/vision_encoder.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:65bb9b57b64763897cc6dc397450449fce5607138843566a885e2f0a250343c8
3
+ size 1737427560
onnx/vision_encoder_bnb4.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6a8256b74fd9465f859fab31c1840ed073aa0edd7b75d61127eefe1ce1fcf560
3
+ size 251407732
onnx/vision_encoder_fp16.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ab171611906fa056c91a28fde0ef1fda897b44bbf1ca0d9ae692cfaff90947b1
3
+ size 868985807
onnx/vision_encoder_int8.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:352fe86ad7d8358f39fb896de9b2efd0d8a6cf2b6239565841bab5146a735d2f
3
+ size 436180765
onnx/vision_encoder_q4.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e7d20e0f8a6201e4944759f2fbcab4fa035bbb1fb34e14700f25f1f00e678992
3
+ size 278736452
onnx/vision_encoder_q4f16.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0b47406ea04d0754ccdd5bd0d68e827a72979f962886cf9bdeae926342234298
3
+ size 247852840
onnx/vision_encoder_quantized.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:38e7292275057cec773aad0218310041e325289d18b708f89deae541925f4274
3
+ size 436180848
onnx/vision_encoder_uint8.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:38e7292275057cec773aad0218310041e325289d18b708f89deae541925f4274
3
+ size 436180848
preprocessor_config.json ADDED
@@ -0,0 +1,28 @@
+ {
+ "do_convert_rgb": true,
+ "do_image_splitting": true,
+ "do_normalize": true,
+ "do_pad": true,
+ "do_rescale": true,
+ "do_resize": true,
+ "image_mean": [
+ 0.5,
+ 0.5,
+ 0.5
+ ],
+ "image_processor_type": "Idefics3ImageProcessor",
+ "image_std": [
+ 0.5,
+ 0.5,
+ 0.5
+ ],
+ "max_image_size": {
+ "longest_edge": 384
+ },
+ "processor_class": "Idefics3Processor",
+ "resample": 1,
+ "rescale_factor": 0.00392156862745098,
+ "size": {
+ "longest_edge": 1536
+ }
+ }
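In plain terms, these settings rescale 8-bit pixel values by 1/255 (the 0.00392… factor) and then normalize each channel with mean 0.5 and std 0.5, mapping inputs into roughly [-1, 1]; a minimal NumPy sketch of that arithmetic:

```python
import numpy as np

pixels = np.array([0, 128, 255], dtype=np.float32)  # raw 8-bit channel values
rescaled = pixels * (1 / 255)                        # rescale_factor = 0.00392156862745098
normalized = (rescaled - 0.5) / 0.5                  # image_mean = image_std = 0.5
print(normalized)                                    # -> roughly [-1, 0.0039, 1]
```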
processor_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+ "processor_class": "Idefics3Processor",
+ "image_seq_len": 81
+ }
smolvlm-data.pdf ADDED
Binary file (55.4 kB).
 
special_tokens_map.json ADDED
@@ -0,0 +1,53 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ {
4
+ "content": "<fake_token_around_image>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false
9
+ },
10
+ {
11
+ "content": "<image>",
12
+ "lstrip": false,
13
+ "normalized": false,
14
+ "rstrip": false,
15
+ "single_word": false
16
+ },
17
+ {
18
+ "content": "<end_of_utterance>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ }
24
+ ],
25
+ "bos_token": {
26
+ "content": "<|im_start|>",
27
+ "lstrip": false,
28
+ "normalized": false,
29
+ "rstrip": false,
30
+ "single_word": false
31
+ },
32
+ "eos_token": {
33
+ "content": "<|im_end|>",
34
+ "lstrip": false,
35
+ "normalized": false,
36
+ "rstrip": false,
37
+ "single_word": false
38
+ },
39
+ "pad_token": {
40
+ "content": "<|im_end|>",
41
+ "lstrip": false,
42
+ "normalized": false,
43
+ "rstrip": false,
44
+ "single_word": false
45
+ },
46
+ "unk_token": {
47
+ "content": "<|endoftext|>",
48
+ "lstrip": false,
49
+ "normalized": false,
50
+ "rstrip": false,
51
+ "single_word": false
52
+ }
53
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,182 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "0": {
5
+ "content": "<|endoftext|>",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "1": {
13
+ "content": "<|im_start|>",
14
+ "lstrip": false,
15
+ "normalized": false,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "2": {
21
+ "content": "<|im_end|>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": true
27
+ },
28
+ "3": {
29
+ "content": "<repo_name>",
30
+ "lstrip": false,
31
+ "normalized": false,
32
+ "rstrip": false,
33
+ "single_word": false,
34
+ "special": true
35
+ },
36
+ "4": {
37
+ "content": "<reponame>",
38
+ "lstrip": false,
39
+ "normalized": false,
40
+ "rstrip": false,
41
+ "single_word": false,
42
+ "special": true
43
+ },
44
+ "5": {
45
+ "content": "<file_sep>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false,
50
+ "special": true
51
+ },
52
+ "6": {
53
+ "content": "<filename>",
54
+ "lstrip": false,
55
+ "normalized": false,
56
+ "rstrip": false,
57
+ "single_word": false,
58
+ "special": true
59
+ },
60
+ "7": {
61
+ "content": "<gh_stars>",
62
+ "lstrip": false,
63
+ "normalized": false,
64
+ "rstrip": false,
65
+ "single_word": false,
66
+ "special": true
67
+ },
68
+ "8": {
69
+ "content": "<issue_start>",
70
+ "lstrip": false,
71
+ "normalized": false,
72
+ "rstrip": false,
73
+ "single_word": false,
74
+ "special": true
75
+ },
76
+ "9": {
77
+ "content": "<issue_comment>",
78
+ "lstrip": false,
79
+ "normalized": false,
80
+ "rstrip": false,
81
+ "single_word": false,
82
+ "special": true
83
+ },
84
+ "10": {
85
+ "content": "<issue_closed>",
86
+ "lstrip": false,
87
+ "normalized": false,
88
+ "rstrip": false,
89
+ "single_word": false,
90
+ "special": true
91
+ },
92
+ "11": {
93
+ "content": "<jupyter_start>",
94
+ "lstrip": false,
95
+ "normalized": false,
96
+ "rstrip": false,
97
+ "single_word": false,
98
+ "special": true
99
+ },
100
+ "12": {
101
+ "content": "<jupyter_text>",
102
+ "lstrip": false,
103
+ "normalized": false,
104
+ "rstrip": false,
105
+ "single_word": false,
106
+ "special": true
107
+ },
108
+ "13": {
109
+ "content": "<jupyter_code>",
110
+ "lstrip": false,
111
+ "normalized": false,
112
+ "rstrip": false,
113
+ "single_word": false,
114
+ "special": true
115
+ },
116
+ "14": {
117
+ "content": "<jupyter_output>",
118
+ "lstrip": false,
119
+ "normalized": false,
120
+ "rstrip": false,
121
+ "single_word": false,
122
+ "special": true
123
+ },
124
+ "15": {
125
+ "content": "<jupyter_script>",
126
+ "lstrip": false,
127
+ "normalized": false,
128
+ "rstrip": false,
129
+ "single_word": false,
130
+ "special": true
131
+ },
132
+ "16": {
133
+ "content": "<empty_output>",
134
+ "lstrip": false,
135
+ "normalized": false,
136
+ "rstrip": false,
137
+ "single_word": false,
138
+ "special": true
139
+ },
140
+ "49152": {
141
+ "content": "<fake_token_around_image>",
142
+ "lstrip": false,
143
+ "normalized": false,
144
+ "rstrip": false,
145
+ "single_word": false,
146
+ "special": true
147
+ },
148
+ "49153": {
149
+ "content": "<image>",
150
+ "lstrip": false,
151
+ "normalized": false,
152
+ "rstrip": false,
153
+ "single_word": false,
154
+ "special": true
155
+ },
156
+ "49154": {
157
+ "content": "<end_of_utterance>",
158
+ "lstrip": false,
159
+ "normalized": false,
160
+ "rstrip": false,
161
+ "single_word": false,
162
+ "special": true
163
+ }
164
+ },
165
+ "additional_special_tokens": [
166
+ "<fake_token_around_image>",
167
+ "<image>",
168
+ "<end_of_utterance>"
169
+ ],
170
+ "bos_token": "<|im_start|>",
171
+ "clean_up_tokenization_spaces": false,
172
+ "eos_token": "<end_of_utterance>",
173
+ "legacy": false,
174
+ "model_max_length": 16384,
175
+ "pad_token": "<|im_end|>",
176
+ "processor_class": "Idefics3Processor",
177
+ "tokenizer_class": "GPT2Tokenizer",
178
+ "truncation_side": "left",
179
+ "chat_template": "<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>\n{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}",
180
+ "unk_token": "<|endoftext|>",
181
+ "vocab_size": 49152
182
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff