nolitai-vision: Meeting Vision Model (LoRA Adapter)
A LoRA adapter for FastVLM-1.5B-Stage3 fine-tuned for visual meeting intelligence tasks. Designed for on-device inference on Apple Silicon via MLX.
Status: Early checkpoint. This is an initial training run with limited data (~81 examples, 3 epochs). Performance is not yet production-ready. We're sharing it for research and community collaboration.
Model Details
| Property | Value |
|---|---|
| Base Model | zhaode/FastVLM-1.5B-Stage3 |
| Architecture | LlavaQwen2 (MobileClip vision + Qwen2 language model) |
| Adapter Size | 8.3 MB (LoRA weights only) |
| Training | LoRA (rank=8, alpha=16) on q/k/v/o attention projections |
| Framework | PyTorch (PEFT), convertible to MLX |
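As a rough sanity check on the adapter size, the LoRA parameter count can be estimated from the rank and the shapes of the four attention projections. The sketch below assumes the usual Qwen2-1.5B dimensions (hidden size 1536, 28 layers, 12 attention heads, 2 key/value heads) and 32-bit storage for the adapter weights; none of these values are stated in this card, so treat them as assumptions.

```python
# Assumed Qwen2-1.5B attention dimensions (NOT stated in this model card).
HIDDEN = 1536
NUM_LAYERS = 28
NUM_HEADS = 12
NUM_KV_HEADS = 2
HEAD_DIM = HIDDEN // NUM_HEADS      # 128
KV_DIM = NUM_KV_HEADS * HEAD_DIM    # 256 (grouped-query attention)

# From the table above.
RANK, ALPHA = 8, 16


def lora_params(in_features, out_features, rank=RANK):
    """A LoRA pair adds A (rank x in) and B (out x rank) matrices."""
    return rank * (in_features + out_features)


# (in_features, out_features) for each adapted projection.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

per_layer = sum(lora_params(i, o) for i, o in PROJECTIONS.values())
total_params = per_layer * NUM_LAYERS
size_mib_fp32 = total_params * 4 / 2**20   # 4 bytes per float32 weight
scaling = ALPHA / RANK                     # factor applied to the LoRA update

print(f"{total_params:,} LoRA parameters, ~{size_mib_fp32:.1f} MiB in float32")
print(f"LoRA scaling factor (alpha / rank): {scaling}")
```

Under these assumptions the estimate comes to about 2.18 million trainable parameters, or roughly 8.3 MiB, which is in line with the adapter size listed in the table.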
Capabilities
Given a video call screenshot, the model can:
- Speaker Identification: Detect the active/highlighted speaker in a video call grid
- Participant Listing: List all visible participants by name
- Platform Detection: Identify the meeting platform (Zoom, Teams, Meet, etc.)
- Slide OCR: Extract title and content from shared presentation slides
Example Tasks
Speaker ID
Input: A screenshot of a Zoom call with a highlighted speaker tile
Expected Output:
{"speaker": "Sarah Chen"}
Platform Detection
Input: A screenshot of a video call
Expected Output:
{"platform": "Microsoft Teams"}
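Because the expected outputs are small JSON objects, downstream code will usually want to parse and validate the generated text before using it. The following is a minimal sketch using only the standard library; `parse_response` and its `required_key` argument are hypothetical helpers for illustration, not part of this model or any library.

```python
import json


def parse_response(response, required_key):
    """Extract the outermost {...} span from generated text and parse it.

    Returns the parsed dict, or None if no valid JSON object containing
    `required_key` is found. (Hypothetical helper, not part of the model.)
    """
    start = response.find("{")
    end = response.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or required_key not in data:
        return None
    return data


# Example with the two output shapes shown above.
speaker = parse_response('Answer: {"speaker": "Sarah Chen"}', "speaker")
platform = parse_response('{"platform": "Microsoft Teams"}', "platform")
print(speaker, platform)
```

Returning `None` instead of raising keeps the caller simple when an early checkpoint emits free text rather than JSON.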
Current Performance
| Task | Score | Notes |
|---|---|---|
| Speaker ID | 0% | Needs more diverse training examples |
| Participants | 0% | Needs more training data |
| Platform Detection | 60% | Partially learned |
| Slide OCR | 0% | Needs more training data |
| Overall | 10% | Early checkpoint, needs more data |
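The card does not say how these scores were computed. One plausible reading is exact-match accuracy per task, with the overall figure computed over all evaluation examples; the sketch below illustrates that reading with made-up records. The `score` function, the task names, and the sample data are all hypothetical.

```python
from collections import defaultdict


def score(records):
    """Exact-match accuracy per task and over all examples.

    `records` is an iterable of (task, predicted, expected) tuples.
    This is an assumed scoring scheme, not the one used for the table.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for task, predicted, expected in records:
        totals[task] += 1
        hits[task] += int(predicted == expected)
    per_task = {task: hits[task] / totals[task] for task in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_task, overall


# Hypothetical evaluation records, for illustration only.
records = [
    ("platform", "Zoom", "Zoom"),
    ("platform", "Zoom", "Microsoft Teams"),
    ("speaker_id", "Alex", "Sarah Chen"),
    ("speaker_id", "Alex", "Sam"),
    ("speaker_id", "Alex", "Priya"),
    ("speaker_id", "Alex", "Jordan"),
]
per_task, overall = score(records)
print(per_task, overall)
```

A per-example overall score weights each task by its number of examples, so it need not equal the unweighted mean of the task rows. That may be why the Overall row above differs from the plain average of the four task scores.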
Training Details
- Method: LoRA (full precision base model, adapter-only training)
- LoRA Config: rank=8, alpha=16, dropout=0.05
- Target Modules: q_proj, k_proj, v_proj, o_proj (language model only)
- Frozen Components: vision_tower (MobileClip), mm_projector (MLP)
- Dataset: ~81 synthetic video call screenshots with annotations
- Epochs: 3
- Learning Rate: 2e-5 (cosine scheduler, 5% warmup)
- Hardware: NVIDIA A40 48GB (RunPod)
- Training Time: ~3 minutes
- Final Train Loss: 2.50
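The learning-rate entry above describes a cosine schedule with 5% warmup. A minimal sketch of that shape follows, assuming linear warmup from zero and cosine decay to zero (the common form of such schedules). The `lr_at` function and the step count are illustrative; the actual number of optimizer steps depends on a batch size this card does not state.

```python
import math


def lr_at(step, total_steps, base_lr=2e-5, warmup_frac=0.05):
    """Learning rate at `step`: linear warmup, then cosine decay to zero.

    Assumed schedule shape; the exact scheduler used for training is
    not specified beyond "cosine scheduler, 5% warmup".
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Illustrative run with a hypothetical 100 optimizer steps.
total = 100
for step in (0, 5, 50, 100):
    print(step, lr_at(step, total))
```

With so few examples and a peak rate of 2e-5, the model sees only a small number of updates near the peak, which fits the card's description of this run as an early checkpoint.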
Usage with PyTorch
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoProcessor, CLIPImageProcessor
from PIL import Image
# Load base model
base = AutoModelForCausalLM.from_pretrained(
    "zhaode/FastVLM-1.5B-Stage3",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Load and merge adapter
model = PeftModel.from_pretrained(base, "SearchingBinary/nolitai-vision")
model = model.merge_and_unload()
model.eval()
# Load processors
processor = AutoProcessor.from_pretrained("zhaode/FastVLM-1.5B-Stage3", trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained("zhaode/FastVLM-1.5B-Stage3")
tokenizer = processor.tokenizer
# Inference
image = Image.open("meeting_screenshot.png").convert("RGB")
image_tensor = image_processor.preprocess(image, return_tensors="pt")["pixel_values"]
image_tensor = image_tensor.to(device=model.device, dtype=torch.bfloat16)
prompt = 'Identify the active speaker. Respond with JSON: {"speaker": "Name"}'
chat = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# NOTE: FastVLM expects input_ids as positional arg, not keyword
outputs = model.generate(inputs["input_ids"], images=image_tensor, max_new_tokens=256)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
Roadmap
- Expand training dataset to 1000+ examples
- Add more diverse meeting platforms and layouts
- Train for more epochs (target: >90% overall)
- Convert to MLX format for Apple Silicon deployment
- Integrate with nolitai-2b for full meeting intelligence pipeline
Part of nolit.ai
This model is part of nolit.ai, a native macOS meeting copilot that processes everything locally on your Mac. The vision model is intended to handle real-time speaker identification during video calls.
License
Apache 2.0