11 Alignment and Optimization Algorithms for LLMs

To align a model's behavior with desired objectives, we rely on specialized algorithms that promote helpfulness, accuracy, reasoning, safety, and consistency with user preferences. Much of a model’s usefulness comes from these post-training optimization methods.

Here are the main optimization algorithms (both classic and new) in one place:

1. PPO (Proximal Policy Optimization) -> Proximal Policy Optimization Algorithms (1707.06347)
Clips the probability ratio to prevent the new policy from diverging too far from the old one, which keeps training stable
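
A minimal sketch of the clipped surrogate loss, assuming per-token log-probs and advantages are already computed (function and argument names are illustrative):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective (negated so it can be minimized)."""
    ratio = torch.exp(logp_new - logp_old)            # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```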

2. DPO (Direct Preference Optimization) -> Direct Preference Optimization: Your Language Model is Secretly a Reward Model (2305.18290)
A non-RL method in which the LM itself acts as an implicit reward model. It uses a simple loss to boost the preferred answer’s probability over the less preferred one
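
A minimal sketch of the DPO loss, assuming log-probs of the chosen and rejected answers under the policy and a frozen reference model (names illustrative):

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Raise the chosen answer's likelihood over the rejected one, relative to the reference."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()
```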

3. GRPO (Group Relative Policy Optimization) -> DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (2402.03300)
An RL method that compares a group of model outputs for the same input and updates the policy based on relative rankings. It doesn't need a separate critic model
Its latest application is Flow-GRPO, which adds online RL to flow matching models -> Flow-GRPO: Training Flow Matching Models via Online RL (2505.05470)
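
A sketch of the group-relative advantage that replaces the critic, assuming one reward score per sampled completion (names illustrative); these advantages then feed a PPO-style clipped update:

```python
import torch

def grpo_advantages(rewards):
    """rewards: (num_prompts, group_size) scores for G completions sampled per prompt."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)   # z-score within each group, no learned critic needed
```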

4. DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) -> DAPO: An Open-Source LLM Reinforcement Learning System at Scale (2503.14476)
Decouples the clipping bounds for flexibility, introducing 4 key techniques: clip-higher (to maintain exploration), dynamic sampling (to ensure gradient updates), token-level loss (to balance learning across long outputs), and overlong reward shaping (to handle long, truncated answers)
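
A sketch of the decoupled ("clip-higher") surrogate with a token-level mean, assuming precomputed log-probs and advantages; the bound values here are illustrative defaults:

```python
import torch

def dapo_clip_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with asymmetric clipping bounds, averaged per token."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)   # higher upper bound keeps exploration alive
    # token-level mean balances learning across long and short outputs
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```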

5. Supervised Fine-Tuning (SFT) -> Training language models to follow instructions with human feedback (2203.02155)
Often the first post-pretraining step. A model is fine-tuned on a dataset of high-quality human-written input-output pairs to directly teach desired behaviors
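
A sketch of the SFT objective: standard next-token cross-entropy restricted to the response tokens (the mask and shapes are assumptions):

```python
import torch.nn.functional as F

def sft_loss(logits, labels, response_mask):
    """Next-token cross-entropy on response tokens only (prompt tokens are masked out)."""
    logits, labels = logits[:, :-1], labels[:, 1:]      # token t predicts token t+1
    mask = response_mask[:, 1:].float()
    per_token = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                labels.reshape(-1), reduction="none")
    return (per_token * mask.reshape(-1)).sum() / mask.sum()
```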

More in the comments 👇

If you liked it, also subscribe to the Turing Post: https://www.turingpost.com/subscribe
  1. Reinforcement Learning from Human Feedback (RLHF) -> https://huggingface.co/papers/2203.02155
    The classic approach that combines supervised fine-tuning with RL against a reward model trained on human preference data.
    Check out other RL+F approaches here: https://www.turingpost.com/p/rl-f
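
A sketch of the Bradley-Terry loss typically used to train the reward model on human comparisons (names illustrative); its scores, usually combined with a KL penalty to the SFT model, then drive a PPO update like the one in item 1 of the post:

```python
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """The reward model should score the human-preferred answer higher than the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```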

  2. Monte Carlo Tree Search (MCTS) -> https://huggingface.co/papers/2305.10601
    This planning algorithm builds a search tree by simulating many reasoning paths from the current state, balancing exploration and exploitation.
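
A generic MCTS sketch (not the paper's exact implementation); `expand` and `rollout` are user-supplied functions, e.g. the LM proposing and scoring candidate reasoning steps:

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    if node.visits == 0:                      # always try unvisited children first
        return float("inf")
    return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def mcts(root, expand, rollout, n_sims=100):
    for _ in range(n_sims):
        node = root
        while node.children:                  # 1) selection: follow UCB to a leaf
            node = max(node.children, key=ucb)
        node.children = [Node(s, node) for s in expand(node.state)]   # 2) expansion
        leaf = random.choice(node.children) if node.children else node
        reward = rollout(leaf.state)          # 3) simulation: estimate the leaf's value
        while leaf is not None:               # 4) backpropagation of the result
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return max(root.children, key=lambda n: n.visits)
```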

  3. AMPO (Active Multi-Preference Optimization) -> https://huggingface.co/papers/2502.18293
    Combines on-policy generation, contrastive learning, and smart selection of training examples. From many possible responses, it picks a small, diverse set with both high- and low-quality answers and unique styles
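
An illustrative selection heuristic in the spirit of AMPO, not the paper's exact active-selection procedure; it assumes candidate response embeddings and reward scores are already available:

```python
import torch
import torch.nn.functional as F

def select_preference_set(embeddings, rewards, k=4):
    """Greedily pick k responses that span reward extremes and stay mutually dissimilar."""
    chosen = [int(rewards.argmax()), int(rewards.argmin())]       # anchor with best and worst
    while len(chosen) < k:
        sim_to_chosen = torch.stack(
            [F.cosine_similarity(embeddings, embeddings[i].unsqueeze(0), dim=-1) for i in chosen]
        ).max(dim=0).values
        sim_to_chosen[chosen] = float("inf")                      # never re-pick a selected response
        chosen.append(int(sim_to_chosen.argmin()))                # add the most novel candidate
    return chosen
```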

  4. SPIN (Self-Play Fine-Tuning) -> https://huggingface.co/papers/2401.01335
    Uses self-play: the model improves by learning to distinguish human-annotated examples from its own earlier generations
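
A sketch of a SPIN-style iteration loss, assuming the reference model is the previous iteration and the "rejected" responses are its own generations (names illustrative):

```python
import torch.nn.functional as F

def spin_loss(logp_human, logp_self, ref_logp_human, ref_logp_self, beta=0.1):
    """DPO-style objective where the rejected side is the previous iteration's own generation."""
    margin = (logp_human - ref_logp_human) - (logp_self - ref_logp_self)
    return -F.logsigmoid(beta * margin).mean()
```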

  5. SPPO (Self-Play Preference Optimization) -> https://huggingface.co/papers/2405.00675
    Aligns LMs by framing training as a two-player game where the model learns to improve against itself through preference comparisons, aiming to reach a Nash equilibrium
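
A sketch of an SPPO-style squared objective, assuming an estimated win rate of each response against the current policy (the estimator and `eta` are assumptions):

```python
import torch

def sppo_loss(logp, ref_logp, win_rate, eta=1.0):
    """Regress each response's log-ratio toward its estimated win rate vs. the current policy."""
    target = eta * (win_rate - 0.5)           # responses that beat the policy get pushed up
    return ((logp - ref_logp) - target).pow(2).mean()
```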

  6. RSPO (Regularized Self-Play Policy Optimization) -> https://huggingface.co/papers/2503.00030
    Lets models learn through self-play, with an extra regularization term added to keep training stable. It achieves best results with a linear combination of forward and reverse KL divergence regularization
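
A sketch of the mixed forward/reverse KL regularizer added to the self-play objective; the token-level form and mixing weight `alpha` are assumptions:

```python
import torch
import torch.nn.functional as F

def mixed_kl_penalty(policy_logits, ref_logits, alpha=0.5):
    """Linear combination of forward and reverse KL between the policy and a reference."""
    logp, logq = F.log_softmax(policy_logits, -1), F.log_softmax(ref_logits, -1)
    p, q = logp.exp(), logq.exp()
    reverse_kl = (p * (logp - logq)).sum(-1)    # KL(pi || ref): mode-seeking
    forward_kl = (q * (logq - logp)).sum(-1)    # KL(ref || pi): mass-covering
    return (alpha * forward_kl + (1 - alpha) * reverse_kl).mean()
```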
