Path to Multimodal Generalist

community

https://generalist.top/

path2generalist

AI & ML interests

Multimodal Generalist

Recent Activity

LXT

submitted a paper to Daily Papers 5 days ago

SAMTok: Representing Any Mask with Two Words

Paper • 2601.16093 • Published 5 days ago • 40

scofield7419

authored a paper 25 days ago

JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation

Paper • 2512.22905 • Published about 1 month ago • 20

QingyuShi

authored a paper about 1 month ago

RecTok: Reconstruction Distillation along Rectified Flow

Paper • 2512.13421 • Published Dec 15, 2025 • 5

QingyuShi

submitted a paper to Daily Papers about 1 month ago

RecTok: Reconstruction Distillation along Rectified Flow

Paper • 2512.13421 • Published Dec 15, 2025 • 5

LXT

authored 13 papers about 1 month ago

DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World

Paper • 2506.24102 • Published Jun 30, 2025

One Flight Over the Gap: A Survey from Perspective to Panoramic Vision

Paper • 2509.04444 • Published Sep 4, 2025

VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models

Paper • 2508.12081 • Published Aug 16, 2025

DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training

Paper • 2510.11712 • Published Oct 13, 2025 • 31

Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs

Paper • 2510.18876 • Published Oct 21, 2025 • 37

Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

Paper • 2510.20579 • Published Oct 23, 2025 • 56

From Masks to Worlds: A Hitchhiker's Guide to World Models

Paper • 2510.20668 • Published Oct 23, 2025 • 8

PairUni: Pairwise Training for Unified Multimodal Language Models

Paper • 2510.25682 • Published Oct 29, 2025 • 14

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

Paper • 2510.26802 • Published Oct 30, 2025 • 34

Visual Spatial Tuning

Paper • 2511.05491 • Published Nov 7, 2025 • 52

Towards Open Vocabulary Learning: A Survey

Paper • 2306.15880 • Published Jun 28, 2023

RobuRCDet: Enhancing Robustness of Radar-Camera Fusion in Bird's Eye View for 3D Object Detection

Paper • 2502.13071 • Published Feb 18, 2025

MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

Paper • 2511.09611 • Published Nov 12, 2025 • 70

QingyuShi

authored a paper about 2 months ago

Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation

Paper • 2512.02457 • Published Dec 2, 2025 • 14

scofield7419

authored a paper 2 months ago

UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist

Paper • 2511.08521 • Published Nov 11, 2025 • 38

scofield7419

authored a paper 3 months ago

Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding

Paper • 2509.11866 • Published Sep 15, 2025 • 2