Cosmos-Reason2-2B-W4A16
An optimized version of nvidia/Cosmos-Reason2-2B created with weight quantization. It reduces GPU memory usage and improves inference efficiency while maintaining high-quality multimodal reasoning performance.
This model was created by quantizing the base language model to INT4 weights while keeping activations in FP16 precision. The model preserves the reasoning capabilities of the original Cosmos-Reason2-2B model while significantly reducing the memory footprint of model weights.
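As a back-of-the-envelope illustration of the savings, the sketch below estimates weight memory at FP16 versus INT4, assuming roughly 2 billion parameters (inferred from the model name). This is a lower-bound sketch only: the actual checkpoint also includes components kept at higher precision and the quantization scales/zero-points.

```python
# Rough weight-memory estimate for W4A16 vs FP16 weights.
# ~2e9 parameters is an assumption based on the "2B" model name; the real
# footprint also includes FP16 components and quantization metadata.
params = 2_000_000_000
fp16_gib = params * 2 / 2**30    # 2 bytes per FP16 weight
int4_gib = params * 0.5 / 2**30  # 0.5 bytes (4 bits) per INT4 weight
print(f"FP16: {fp16_gib:.1f} GiB, INT4: {int4_gib:.1f} GiB")
```

INT4 weights occupy a quarter of the FP16 size, which is where the bulk of the memory reduction comes from.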
For more efficient inference, Embedl’s proprietary optimizations and architectural enhancements require patching vLLM. These updates will be released at a later date. For now, the model can be used with vLLM through the NVIDIA Jetson container.
Model Details
| Field | Value |
|---|---|
| Base Model | nvidia/Cosmos-Reason2-2B |
| Input / Output | Text + Image / Video → Text |
| Release Date | 2026-02-13 |
| Version | 1.0 |
| Optimizations | Quantization (W4A16) |
| Developers | Embedl |
| Licenses | Upstream: NVIDIA Open Model License, Additional Information: Apache License 2.0, Optimized Components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, video analytics, planning, and general-purpose NLP on NVIDIA GPUs |
Optimizations
- Quantization (W4A16) - large reduction in memory footprint and latency.
Accuracy
For comparative evaluation, we report benchmark scores using the Physical AI Bench Reason Task.
We have not been able to reproduce the baseline benchmarks reported for nvidia/Cosmos-Reason2-2B on the Physical AI Bench Leaderboard; see the related issue: https://github.com/nvidia-cosmos/cosmos-reason2/issues/52
Overall + Category Scores
| Model | Overall | Embodied Reasoning | Common Sense |
|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | 50.60 | 53.93 | 47.19 |
| embedl/Cosmos-Reason2-2B-NVFP4A16 | 49.84 | 50.16 | 49.50 |
| embedl/Cosmos-Reason2-2B-W4A16 | 48.68 | 50.49 | 46.85 |
| embedl/Cosmos-Reason2-2B-W4A16-Edge2 | 50.58 | 53.61 | 47.52 |
Subcategory Scores
| Model | AV | Physical World | Time | Space | Agibot | HoloAssist | RoboFail | RoboVQA | BridgeData V2 |
|---|---|---|---|---|---|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | 44.00 | 46.90 | 45.30 | 55.00 | 34.00 | 60.00 | 49.00 | 90.91 | 42.00 |
| embedl/Cosmos-Reason2-2B-NVFP4A16 | 44.00 | 45.13 | 52.01 | 52.50 | 28.00 | 58.00 | 51.00 | 84.55 | 32.00 |
| embedl/Cosmos-Reason2-2B-W4A16 | 36.00 | 47.79 | 44.30 | 53.75 | 36.00 | 61.00 | 42.00 | 80.91 | 44.00 |
| embedl/Cosmos-Reason2-2B-W4A16-Edge2 | 45.00 | 44.25 | 48.66 | 52.50 | 32.00 | 59.00 | 54.00 | 85.45 | 43.00 |
Performance
On-device performance benchmarks can be explored on embedl/Edge-Inference-Benchmarks.
Usage Examples
Note (vLLM context length): `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`).
vLLM Video Inference
vLLM image: NVIDIA vLLM 0.14.0 for Jetson
Test Hardware: NVIDIA Jetson AGX Orin
`--gpu-memory-utilization` and `--max-num-seqs` should be adapted to system specifications (i.e., available RAM).
```shell
docker run --rm -it \
  --network host \
  --shm-size=8g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --runtime=nvidia \
  --name=vllm-serve \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
  vllm serve "embedl/Cosmos-Reason2-2B-W4A16" \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.75 \
    --max-num-seqs 2
```
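Once the container is up, the model can be queried through vLLM's OpenAI-compatible `/v1/chat/completions` endpoint. Below is a minimal standard-library sketch; the host, port, and timeout are assumptions (vLLM defaults to port 8000), and the clip URL is the same Cosmos cookbook sample used in the offline example.

```python
import json
import urllib.request

# Default vLLM OpenAI-compatible endpoint; adjust host/port to your setup.
url = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "embedl/Cosmos-Reason2-2B-W4A16",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"
                    },
                },
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ],
    "max_tokens": 256,
    "temperature": 0.0,
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=120) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
except OSError as exc:
    # Server not running or unreachable.
    print(f"Request failed: {exc}")
```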
Test Hardware: NVIDIA Jetson AGX Orin, NVIDIA Jetson Orin Nano Super
`gpu_memory_utilization` and `max_num_seqs` should be adapted to system specifications (i.e., available RAM).
```python
from vllm import LLM, SamplingParams

if __name__ == "__main__":
    model = "embedl/Cosmos-Reason2-2B-W4A16"
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant."}
            ],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {"url": video_url, "fps": 4},
                },
                {
                    "type": "text",
                    "text": "Describe this video in detail.",
                },
            ],
        },
    ]

    llm = LLM(
        model=model,
        limit_mm_per_prompt={
            "video": {
                "count": 1,
                "num_frames": 12,
                "width": 1920,
                "height": 1080,
            },
            "image": 0,
            "audio": 0,
        },
        media_io_kwargs={"video": {"num_frames": -1}},
        max_model_len=8192,
        mm_processor_kwargs={"truncation": False},
        # System-specific settings - adapt depending on available RAM
        disable_log_stats=False,
        gpu_memory_utilization=0.75,
        max_num_seqs=2,
    )

    output = llm.chat(
        messages,
        sampling_params=SamplingParams(temperature=0.0, max_tokens=256),
    )
    print(output[0].outputs[0].text)
```
Transformers Inference
Test Hardware: NVIDIA L4 GPU
Adapted from nvidia/Cosmos-Reason2-2B.
```python
import torch
import transformers

if __name__ == "__main__":
    model_name = "embedl/Cosmos-Reason2-2B-W4A16"
    model = transformers.Qwen3VLForConditionalGeneration.from_pretrained(
        model_name,
        device_map="auto",
        attn_implementation="sdpa",
    )
    processor: transformers.Qwen3VLProcessor = (
        transformers.AutoProcessor.from_pretrained(model_name)
    )

    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"
    video_messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_url, "fps": 4},
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ]

    # Process inputs
    inputs = processor.apply_chat_template(
        video_messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        truncation=False,
        fps=4,
    )
    inputs = inputs.to(model.device)

    # Run inference
    generated_ids = model.generate(**inputs, max_new_tokens=4096)
    generated_ids_trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(
            inputs.input_ids, generated_ids, strict=False
        )
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    print(output_text[0])
```
License
Built on NVIDIA Cosmos
This model is a derivative of nvidia/Cosmos-Reason2-2B.
Licensed by NVIDIA Corporation under the NVIDIA Open Model License
- Upstream: NVIDIA Open Model License
- Additional Information: Apache License 2.0
- Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)
Contact
- Enterprise & Commercial Inquiries: contact@embedl.com
- Technical Issues & Early Access: https://github.com/embedl/embedl-models
- More Information & Model Releases: https://embedl.com
Partner & Developer Opportunities
If you are evaluating on-device inference, building products on this model, or exploring custom model optimization, reach out for:
- Engineering support for on-prem/edge deployments
- Early access & partner co-marketing opportunities
Contact: contact@embedl.com