Post-Training in 2026: GRPO, DAPO, RLVR & Beyond
How GRPO, DAPO, RLVR, and synthetic self-play replaced RLHF as the dominant post-training stack. A technical survey of what changed and why.


Twelve months ago, the standard recipe was clear: pretrain on trillions of tokens, then run RLHF with human preference labels. That recipe is dead. Every major model released in the past year, from DeepSeek-R1 to Nemotron 3 Super to GPT-5.3 Codex, uses a different post-training stack. The methods changed, the data sources changed, and the results changed with them.
This post maps what happened. Not a survey of every paper, but a practitioner's guide to the techniques that actually shipped in production models during 2025 and early 2026.
The Modern Post-Training Pipeline
Post-training now has three distinct stages after pretraining, each solving a different problem. The order matters.
Post-training (SFT, Preference Optimization, RL) now accounts for the majority of a model's usable capability. Pretraining provides the foundation; post-training shapes behavior.
SFT (Supervised Fine-Tuning) teaches the model the format: how to follow instructions, produce structured outputs, and respond in a conversational style. This stage uses 1-10M curated examples. Nemotron 3 Super used 7M samples from a broader corpus of 40M.
Preference Optimization aligns the model with human values and preferences. This is where DPO and its variants operate. The model learns which responses humans prefer over alternatives.
Reinforcement Learning pushes the model beyond its training data. Using verifiable rewards (math, code) or environment-based feedback (tool use, multi-step tasks), the model discovers new strategies through trial and error. This is the stage that produces reasoning capabilities.
GRPO & DAPO: RL Without the Baggage
PPO (Schulman et al., 2017) dominated RL-based LLM training for years. It works, but it requires a separate critic model that doubles memory usage and produces noisy value estimates. For large models, this cost is prohibitive.
GRPO: Group-Relative Advantages
GRPO (introduced in DeepSeekMath, then used in DeepSeek-R1) eliminates the critic entirely. For each prompt, it samples a group of responses (typically 8-64) and computes advantages by normalizing each response's reward against the group mean and standard deviation. The advantage for response i is simply (reward_i - mean) / std.
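The group-relative advantage is simple enough to sketch in a few lines. This is a hypothetical standalone version for clarity; production implementations operate on batched tensors and fold the advantage into a clipped policy-gradient loss.

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Normalize each response's reward against its group's mean and std.

    `rewards` is the list of scalar rewards for one prompt's sampled group;
    `eps` guards against a zero std when all rewards are identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: a group of 8 responses scored by a binary verifier.
rewards = [1, 0, 0, 1, 1, 0, 0, 0]
advs = grpo_advantages(rewards)
```

Note that the advantages always sum to zero within a group: responses better than their siblings get positive advantage, worse ones negative, with no absolute value judgment needed.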
A recent theoretical analysis shows that GRPO's policy gradient is a U-statistic, making it asymptotically equivalent to an oracle algorithm with access to an ideal value function. This isn't just a hack that happens to work. It's provably optimal within a broad class of policy gradient methods.
GRPO samples multiple responses per prompt, scores them, then normalizes rewards within the group. No critic model needed.
DAPO: Stabilizing Long-Horizon RL
DAPO (ByteDance/Tsinghua, 2025) tackles the specific instabilities that arise when training reasoning models with long chain-of-thought outputs. It introduces four techniques:
- Clip-Higher: increases the upper clip range in the policy ratio to prevent entropy collapse, keeping the model exploratory
- Dynamic Sampling: filters batches to maintain consistent gradient signals, avoiding wasted compute on uninformative samples
- Token-level Policy Gradient Loss: computes the loss per token rather than per sequence, so long CoT responses are not down-weighted by sequence-level averaging
- Overlong Reward Shaping: reduces reward noise from responses that exceed length limits
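The first technique, Clip-Higher, is a one-line change to PPO-style clipping. A minimal sketch, with the asymmetric clip ranges shown as illustrative defaults (DAPO uses a wider upper range than the standard symmetric 0.2):

```python
def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Asymmetric (Clip-Higher) variant of the PPO clipped objective
    for a single token.

    `ratio` is pi_new(token) / pi_old(token). A larger eps_high lets
    low-probability tokens gain more probability mass per update,
    which counteracts entropy collapse and keeps exploration alive.
    """
    clipped_ratio = min(max(ratio, 1 - eps_low), 1 + eps_high)
    # Pessimistic min, exactly as in standard PPO.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With symmetric clipping at 0.2, a token with ratio 1.5 and positive advantage would be capped at 1.2x its old probability mass per step; Clip-Higher raises that ceiling while leaving the lower bound untouched.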
On AIME 2024, DAPO trained Qwen2.5-32B to 50 points, outperforming DeepSeek-R1-Zero with 50% fewer training steps. The full system is open-sourced.
Laid side by side, the requirements shrink: GRPO and DAPO remove the need for critic models, reference models, and preference pair curation.
Preference Optimization: Beyond DPO
DPO (Rafailov et al., 2023) showed you could skip RL entirely for preference alignment. But production use exposed its limitations: length bias, reference model dependency, and the cost of curating preference pairs. Three successors address these gaps.
SimPO: No Reference Model
SimPO uses the average log probability of a response as an implicit reward, removing the reference model entirely. This is a meaningful simplification: no need to keep a frozen copy of the model in memory during training. SimPO outperforms DPO by 6.4 points on AlpacaEval 2 and 7.5 points on Arena-Hard.
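The SimPO objective reduces to a margin loss over length-averaged log probabilities. A sketch under stated assumptions: `beta` and `gamma` here are illustrative hyperparameters, not the paper's tuned values, and the log probabilities would come from a forward pass in practice.

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO sketch: the implicit reward is the length-averaged log
    probability of a response, so no frozen reference model is needed.

    `logp_*` are summed token log probs; `len_*` are token counts;
    `gamma` is a target reward margin between chosen and rejected.
    """
    r_chosen = beta * logp_chosen / len_chosen
    r_rejected = beta * logp_rejected / len_rejected
    margin = r_chosen - r_rejected - gamma
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)
```

Dividing by length is what removes DPO's length bias: a long response can no longer win simply by accumulating more log-probability mass.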
KTO: Binary Feedback
KTO (Kahneman-Tversky Optimization) works with simple thumbs-up/thumbs-down feedback instead of pairwise comparisons. This matters for production systems where collecting paired preferences is expensive, but binary feedback is cheap and abundant (like/dislike buttons, regeneration signals).
ORPO: Merging SFT and Alignment
ORPO combines the SFT and preference optimization stages into a single training objective using odds ratios. One stage instead of two. This reduces training time and eliminates the distribution shift that occurs between SFT and preference tuning.
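The merged objective can be sketched as an NLL term plus an odds-ratio penalty. This is a simplified scalar version: `lam` is an illustrative weight, and the length-normalized likelihoods would be computed from model logits in a real trainer.

```python
import math

def orpo_loss(avg_logp_chosen, avg_logp_rejected, lam=0.1):
    """ORPO sketch: one objective covering both SFT and alignment.

    The SFT term is the usual NLL on the chosen response; the penalty
    pushes the odds of the chosen response above those of the rejected
    one. `avg_logp_*` are per-token average log probs (must be < 0).
    """
    def log_odds(avg_logp):
        p = math.exp(avg_logp)           # length-normalized likelihood
        return math.log(p / (1 - p))

    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    l_or = -math.log(1 / (1 + math.exp(-log_odds_ratio)))  # -log sigmoid
    l_sft = -avg_logp_chosen                               # standard NLL
    return l_sft + lam * l_or
```

Because the SFT term is always present, there is no separate fine-tuning stage for the penalty to drift away from, which is the distribution-shift fix the text describes.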
Each generation removes a requirement. ORPO has zero external dependencies: no reward model, no reference model, no paired data, no separate SFT stage.
Reasoning RL: Verifiable Rewards
The most consequential shift in 2025 was the move from human preference labels to verifiable rewards for reasoning tasks. DeepSeek-R1 demonstrated that pure RL with verifiable rewards (RLVR) can produce emergent reasoning capabilities, including self-reflection and dynamic strategy adaptation, without any human-labeled reasoning traces.
The key insight: for math, code, and structured reasoning, you do not need humans to judge quality. A unit test, a proof checker, or a mathematical verifier provides a binary signal that's faster, cheaper, and more consistent than any human annotator.
No human annotators in the loop. The verifier (unit test, proof checker, math evaluator) provides the reward signal directly.
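In its simplest form, a verifiable reward for code is just "run the tests." A deliberately minimal sketch: the candidate source and test cases are hypothetical, and a real system must sandbox untrusted model output with process isolation and timeouts rather than calling `exec` directly.

```python
def code_reward(candidate_src, tests):
    """Return 1.0 if the candidate passes every test case, else 0.0.

    `tests` is a list of (expression, expected_value) pairs evaluated
    against the namespace the candidate code defines. Any crash,
    syntax error, or wrong answer yields zero reward.
    """
    env = {}
    try:
        exec(candidate_src, env)          # define the candidate function
        for call, expected in tests:
            if eval(call, env) != expected:
                return 0.0
        return 1.0
    except Exception:
        return 0.0

# Hypothetical model output for the prompt "write add(a, b)":
candidate = "def add(a, b):\n    return a + b"
reward = code_reward(candidate, [("add(2, 3)", 5), ("add(-1, 1)", 0)])
```

This binary signal plugs directly into the group-relative advantage computation: sample a group of candidates, score each with the verifier, normalize within the group.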
Self-Verification (RISE)
RISE addresses a subtle problem: models trained with RLVR learn to generate correct answers but not to verify their own reasoning. RISE trains both problem-solving and self-verification within a single RL process, using the verifiable reward signal for both. The result is models that can catch their own mistakes during inference.
Handling Noisy Verifiers
Real verifiers are not perfect. Math checkers have edge cases. Code tests can be incomplete. Recent work develops correction algorithms that de-bias observed rewards under verifier noise, preventing the model from exploiting false positives in the verification signal.
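One simple form such a correction can take, assuming the verifier's error rates are known or estimated: with false positive rate `fp` and false negative rate `fn`, a linear rescaling of the observed reward is unbiased for the true reward. The rates below are illustrative, not from any specific paper.

```python
def debias_reward(r_obs, fp=0.05, fn=0.10):
    """De-bias a noisy binary verifier with known error rates.

    If the true answer is correct, E[r_obs] = 1 - fn; if incorrect,
    E[r_obs] = fp. Solving the linear relation E[r_obs] =
    fp + (1 - fp - fn) * r_true for r_true gives this correction,
    so the expected corrected reward equals the true reward.
    """
    return (r_obs - fp) / (1.0 - fp - fn)
```

Averaged over many rollouts, the correction cancels the verifier's systematic optimism, which is what stops the policy from farming false positives.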
Synthetic Self-Play
The bottleneck in post-training has always been data. Human annotation is slow and expensive. Synthetic data generation flips this: the model generates its own training data, then learns from it.
SPIN: Self-Play Fine-Tuning
SPIN (UCLA, 2024) trains models by having them distinguish their own outputs from human-written ones. The model progressively improves by learning to generate responses that are indistinguishable from human references, without additional human annotation. Performance matches or exceeds DPO-trained models.
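SPIN's per-pair objective is structurally a DPO loss in which the human reference plays the "chosen" role and the previous iterate's own generation plays "rejected." A scalar sketch, with `beta` illustrative and the `logp_old_*` terms coming from the frozen previous-round model:

```python
import math

def spin_pair_loss(logp_cur_human, logp_old_human,
                   logp_cur_model, logp_old_model, beta=0.1):
    """SPIN sketch for one (human response, self-generated response) pair.

    The loss falls as the current model raises its likelihood of the
    human reference relative to the frozen previous model, while
    lowering it on the previous model's own generation.
    """
    margin = beta * ((logp_cur_human - logp_old_human)
                     - (logp_cur_model - logp_old_model))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)
```

Each self-play round regenerates the "rejected" side from the newly trained model, so the discrimination task gets harder as the model's outputs approach the human distribution.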
SPICE: Grounded Self-Play
SPICE extends self-play by grounding it in external documents. A Challenger model mines documents to generate reasoning tasks. A Reasoner model solves them. This document grounding prevents the hallucination amplification and model collapse that ungrounded self-play methods face. Results: +8.9% on mathematical reasoning, +9.8% on general reasoning.
Synthetic Data for Reasoning
Recent work shows that rule-generated synthetic data teaches LLMs to compose knowledge, a fundamental generalizable skill. Models fine-tuned on synthetic multi-hop reasoning tasks transfer to real-world benchmarks, offering a scalable alternative to human annotation.
Agentic Post-Training
The newest frontier is training models for multi-step tool use and autonomous workflows. This requires RL environments, not static datasets.
NeMo Gym
NVIDIA's NeMo Gym provides interactive RL environments for training LLM agents. It supports multi-turn rollouts, tool-calling verification, and decoupled agent/environment architectures. Nemotron 3 Super was trained across 21 environment configurations generating 1.2 million rollouts, spanning math, code, tool use, and multi-turn conversations.
RLFactory
RLFactory is a plug-and-play framework for multi-round tool-use RL. It addresses stability through asynchronous calling and supports diverse reward signals (rule-based, model-judgment, tool-verification). Qwen3-4B trained with RLFactory on Search-R1 surpassed larger models on Natural Questions, with 6.8x throughput improvement.
Safety in Agentic Training
MOSAIC (March 2026) addresses a critical gap: how to train agents that know when to refuse. It structures inference as "plan, check, then act or refuse" and uses trajectory-level preference learning. Testing on Qwen and Phi models showed up to 50% reduction in harmful behavior while preserving task performance.
What's Next
Three directions are likely to define the next year of post-training research.
Unified pipelines. ORPO already merges SFT and preference optimization. The logical next step is merging all three stages into a single training objective that handles instruction following, preference alignment, and reasoning improvement simultaneously. Early work on this exists but nothing has shipped at scale.
Environment-native training. The shift from static datasets to interactive environments (NeMo Gym, RLFactory) is just beginning. As these environments become richer, covering browser use, file systems, databases, and APIs, the gap between "chat model" and "agent model" will widen. Models trained purely on text pairs will increasingly fall behind on agentic tasks.
Automatic curriculum generation. SPICE hints at this: models that generate their own training tasks, calibrated to their current ability level. Combined with RLVR for verification, this creates a closed loop where the model identifies its weaknesses, generates training data targeting those weaknesses, trains on it, and repeats. No human in the loop.