Papers - Training Research
Measuring the Effects of Data Parallelism on Neural Network Training
Paper
• 1811.03600
• Published
• 2
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
Paper
• 1804.04235
• Published
• 2
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
Paper
• 1905.11946
• Published
• 3
Yi: Open Foundation Models by 01.AI
Paper
• 2403.04652
• Published
• 65
Extending Context Window of Large Language Models via Positional Interpolation
Paper
• 2306.15595
• Published
• 54
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Paper
• 2403.05135
• Published
• 45
Algorithmic progress in language models
Paper
• 2403.05812
• Published
• 19
Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU
Paper
• 2403.06504
• Published
• 56
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
Paper
• 2310.16795
• Published
• 27
CoCa: Contrastive Captioners are Image-Text Foundation Models
Paper
• 2205.01917
• Published
• 3
Wide Residual Networks
Paper
• 1605.07146
• Published
• 2
Slimmable Encoders for Flexible Split DNNs in Bandwidth and Resource Constrained IoT Systems
Paper
• 2306.12691
• Published
• 3
Learning to Reason and Memorize with Self-Notes
Paper
• 2305.00833
• Published
• 5
Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance
Paper
• 2310.10021
• Published
• 2
Self-Supervised Vision Transformers Learn Visual Concepts in Histopathology
Paper
• 2203.00585
• Published
• 3
DeepNet: Scaling Transformers to 1,000 Layers
Paper
• 2203.00555
• Published
• 2
Gemma: Open Models Based on Gemini Research and Technology
Paper
• 2403.08295
• Published
• 50
Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer
Paper
• 2305.16380
• Published
• 5
SELF: Language-Driven Self-Evolution for Large Language Model
Paper
• 2310.00533
• Published
• 2
GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length
Paper
• 2310.00576
• Published
• 2
JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
Paper
• 2310.00535
• Published
• 2
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
Paper
• 2307.09458
• Published
• 12
The Impact of Depth and Width on Transformer Language Model Generalization
Paper
• 2310.19956
• Published
• 10
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Paper
• 2305.13169
• Published
• 3
MicroNAS: Memory and Latency Constrained Hardware-Aware Neural Architecture Search for Time Series Classification on Microcontrollers
Paper
• 2310.18384
• Published
• 2
PreNAS: Preferred One-Shot Learning Towards Efficient Neural Architecture Search
Paper
• 2304.14636
• Published
• 2
Can GPT-4 Perform Neural Architecture Search?
Paper
• 2304.10970
• Published
• 2
Neural Architecture Search: Insights from 1000 Papers
Paper
• 2301.08727
• Published
• 2
Unified Functional Hashing in Automatic Machine Learning
Paper
• 2302.05433
• Published
• 2
Self-Discover: Large Language Models Self-Compose Reasoning Structures
Paper
• 2402.03620
• Published
• 117
Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models
Paper
• 2310.06117
• Published
• 2
Transformers Can Achieve Length Generalization But Not Robustly
Paper
• 2402.09371
• Published
• 14
Triple-Encoders: Representations That Fire Together, Wire Together
Paper
• 2402.12332
• Published
• 2
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper
• 2403.09611
• Published
• 129
Veagle: Advancements in Multimodal Representation Learning
Paper
• 2403.08773
• Published
• 10
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
Paper
• 2403.09629
• Published
• 79
3D-VLA: A 3D Vision-Language-Action Generative World Model
Paper
• 2403.09631
• Published
• 12
StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control
Paper
• 2403.09055
• Published
• 26
Vision Transformer with Quadrangle Attention
Paper
• 2303.15105
• Published
• 2
Synthetic Shifts to Initial Seed Vector Exposes the Brittle Nature of Latent-Based Diffusion Models
Paper
• 2312.11473
• Published
• 3
Semi-Supervised Semantic Segmentation using Redesigned Self-Training for White Blood Cells
Paper
• 2401.07278
• Published
• 2
Beyond ChatBots: ExploreLLM for Structured Thoughts and Personalized Model Responses
Paper
• 2312.00763
• Published
• 23
Training Compute-Optimal Large Language Models
Paper
• 2203.15556
• Published
• 11
Unified Scaling Laws for Routed Language Models
Paper
• 2202.01169
• Published
• 2
Hash Layers For Large Sparse Models
Paper
• 2106.04426
• Published
• 2
Chain-of-Verification Reduces Hallucination in Large Language Models
Paper
• 2309.11495
• Published
• 40
Adapting Large Language Models via Reading Comprehension
Paper
• 2309.09530
• Published
• 82
Exploring Large Language Models' Cognitive Moral Development through Defining Issues Test
Paper
• 2309.13356
• Published
• 38
Large Language Models Cannot Self-Correct Reasoning Yet
Paper
• 2310.01798
• Published
• 36
Table-GPT: Table-tuned GPT for Diverse Table Tasks
Paper
• 2310.09263
• Published
• 40
TabLib: A Dataset of 627M Tables with Context
Paper
• 2310.07875
• Published
• 8
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
Paper
• 2311.10642
• Published
• 25
Text Generation with Diffusion Language Models: A Pre-training Approach with Continuous Paragraph Denoise
Paper
• 2212.11685
• Published
• 2
Neural networks behave as hash encoders: An empirical study
Paper
• 2101.05490
• Published
• 2
Large Language Models as Optimizers
Paper
• 2309.03409
• Published
• 79
Simple synthetic data reduces sycophancy in large language models
Paper
• 2308.03958
• Published
• 23
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
Paper
• 2305.10429
• Published
• 4
Recurrent Drafter for Fast Speculative Decoding in Large Language Models
Paper
• 2403.09919
• Published
• 21
FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis
Paper
• 2403.12963
• Published
• 8
Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
Paper
• 2403.12596
• Published
• 11
PERL: Parameter Efficient Reinforcement Learning from Human Feedback
Paper
• 2403.10704
• Published
• 60
End-to-End Object Detection with Transformers
Paper
• 2005.12872
• Published
• 7
RewardBench: Evaluating Reward Models for Language Modeling
Paper
• 2403.13787
• Published
• 22
VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis
Paper
• 2403.13501
• Published
• 9
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Paper
• 2305.13245
• Published
• 6
ReNoise: Real Image Inversion Through Iterative Noising
Paper
• 2403.14602
• Published
• 21
DreamReward: Text-to-3D Generation with Human Preference
Paper
• 2403.14613
• Published
• 37
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
Paper
• 2402.12875
• Published
• 13
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
Paper
• 2305.11738
• Published
• 9
Shepherd: A Critic for Language Model Generation
Paper
• 2308.04592
• Published
• 33
DRLC: Reinforcement Learning with Dense Rewards from LLM Critic
Paper
• 2401.07382
• Published
• 2
DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging
Paper
• 2402.02622
• Published
• 3
TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models
Paper
• 2403.17005
• Published
• 13
SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions
Paper
• 2403.16627
• Published
• 22
LLM Agent Operating System
Paper
• 2403.16971
• Published
• 73
Prompt me a Dataset: An investigation of text-image prompting for historical image dataset creation using foundation models
Paper
• 2309.01674
• Published
• 2
Data Distributional Properties Drive Emergent In-Context Learning in Transformers
Paper
• 2205.05055
• Published
• 2
InternLM2 Technical Report
Paper
• 2403.17297
• Published
• 34
LIMA: Less Is More for Alignment
Paper
• 2305.11206
• Published
• 27
Masked Audio Generation using a Single Non-Autoregressive Transformer
Paper
• 2401.04577
• Published
• 45
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models
Paper
• 2403.20331
• Published
• 16
DiJiang: Efficient Large Language Models through Compact Kernelization
Paper
• 2403.19928
• Published
• 12
Why Transformers Need Adam: A Hessian Perspective
Paper
• 2402.16788
• Published
• 2
Bigger is not Always Better: Scaling Properties of Latent Diffusion Models
Paper
• 2404.01367
• Published
• 22
Training LLMs over Neurally Compressed Text
Paper
• 2404.03626
• Published
• 23
Locating and Editing Factual Associations in GPT
Paper
• 2202.05262
• Published
• 1
Prompt-to-Prompt Image Editing with Cross Attention Control
Paper
• 2208.01626
• Published
• 3
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Paper
• 2107.07651
• Published
• 1
Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation
Paper
• 2404.06910
• Published
• 3
RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
Paper
• 2404.07839
• Published
• 48
Instruction Tuning with Human Curriculum
Paper
• 2310.09518
• Published
• 3
OOVs in the Spotlight: How to Inflect them?
Paper
• 2404.08974
• Published
• 1
All you need is a good init
Paper
• 1511.06422
• Published
• 1
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
Paper
• 2404.18796
• Published
• 71
KAN: Kolmogorov-Arnold Networks
Paper
• 2404.19756
• Published
• 116
Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models
Paper
• 2405.16759
• Published
• 8
Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures
Paper
• 2407.09468
• Published
• 2
Differential Transformer
Paper
• 2410.05258
• Published
• 180