FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference Paper • 2505.22758 • Published May 28, 2025 • 1
PaTH Attention: Position Encoding via Accumulating Householder Transformations Paper • 2505.16381 • Published May 22, 2025
Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence Paper • 2502.09927 • Published Feb 14, 2025 • 1
Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping Paper • 2501.06589 • Published Jan 11, 2025
Selective Self-Rehearsal: A Fine-Tuning Approach to Improve Generalization in Large Language Models Paper • 2409.04787 • Published Sep 7, 2024 • 1
Gated Slot Attention for Efficient Linear-Time Sequence Modeling Paper • 2409.07146 • Published Sep 11, 2024 • 20
Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler Paper • 2408.13359 • Published Aug 23, 2024 • 23
The infrastructure powering IBM's Gen AI model development Paper • 2407.05467 • Published Jul 7, 2024 • 3
FlexAttention for Efficient High-Resolution Vision-Language Models Paper • 2407.20228 • Published Jul 29, 2024 • 1
Enhancing Training Efficiency Using Packing with Flash Attention Paper • 2407.09105 • Published Jul 12, 2024 • 17