Research · Technical Deep Dive

Is Fine-Tuning Better Than Prompt Engineering in 2026?

I read every important paper from 2025 on SFT, GRPO, GEPA, and DSPy. The fine-tune-or-prompt question is wrong. Here is what actually moves the needle, with numbers, citations, and a decision tree.

Jonathan Chavez
Co-Founder @ LLM Stats
·14 min read

The question gets asked weekly in every AI engineering Slack: should we fine-tune or just write a better prompt? It is the wrong question. A year ago the trade was simple. Prompts were cheap and brittle, fine-tuning was expensive and powerful. In 2026 both halves of that sentence are wrong. Prompt optimization has become a real engineering discipline that beats reinforcement learning on its own benchmarks. Fine-tuning has become a 5GB-VRAM commodity that you can run on unlabeled data with no reward function. The boundary between the two has dissolved.

This piece is the result of reading every paper that mattered in 2025: GEPA, the ICLR 2026 Oral that lit prompt optimization on fire; SFT Memorizes, RL Generalizes from Berkeley and Google DeepMind; the BetterTogether paper that argues you should use both; and the practitioner work from Unsloth, OpenPipe, and the DSPy team. I will tell you what each one actually says, what it costs to run today, and which method to reach for first.

Six Numbers

From the 2025 / 2026 literature

The fine-tune-or-prompt question is the wrong question.

Six results from the last twelve months that reset what each method costs and what each one buys you. The numbers below are measured, not estimated. The conclusion is not balanced; one side gained more than the other.

  • +6 pp: GEPA over GRPO, averaged across six tasks (Agrawal et al., ICLR 2026)
  • 35×: fewer rollouts than GRPO for GEPA, six-task average
  • +60%: hybrid over weight-only optimization (BetterTogether, Soylu 2024)
  • 15 GB: VRAM for GRPO on a 17B model (Unsloth, 2025)
  • 90%: Anthropic savings on cached prompt-prefix reads
  • +57 pp: SFT vs PE accuracy (Highlighter.ai, power outages)
Sources are linked inline through the article. SFT vs PE accuracy delta from the Highlighter.ai 2025 study (Qwen2.5-7B fine-tuned vs Claude Sonnet 3.7 prompted on power-outage classification).

The Question Is Wrong

Fine-tuning and prompt engineering are not two ways to do the same thing. They optimize different objects through different feedback channels.

Fine-tuning updates the weights. The feedback channel is a gradient computed against a loss. Whether that loss comes from token-prediction error (SFT), preference pairs (DPO), or a verifier (GRPO), it ends up as a number. The model changes permanently.

Prompt engineering edits the input. The feedback channel can be anything: an eval score, a stack trace, a judge LLM, a human note. The model never changes; what changes is the conditioning. Prompt edits are free and instant; weight edits are expensive and slow.

The interesting fact, made concrete by GEPA in 2025, is that natural language is a higher-bandwidth feedback signal than a scalar. A reward of 0.43 tells the model less than the sentence “the JSON parser failed because line 4 had a trailing comma”. When you can route that richer signal back into the prompt, prompt optimization becomes cheaper than RL for many tasks. When you cannot (because the cost of running the model at inference is too high, or the verifier is too noisy, or the format must be guaranteed), fine-tuning still wins. The job is to know which regime you are in.


The 2025 Evidence

Three results from last year reset the conversation. Take them in order.

1. SFT memorizes, RL generalizes

Chu et al. (ICML 2025, UC Berkeley + Google DeepMind + NYU) built two clean test environments: GeneralPoints, an arithmetic card game, and V-IRL, a navigation task with both text and visual variants. They trained the same Llama-3 backbone with SFT and with RL, then tested on unseen rules and unseen visuals.

The result was unambiguous. As compute scaled, RL with an outcome-based reward kept improving on out-of-distribution variants and SFT got worse. SFT learned the training rule and refused to generalize. RL also improved the model's underlying visual recognition, which SFT degraded. The headline matters because the industry default for years was "collect more SFT data". That instinct is wrong if you care about anything other than in-distribution accuracy.

The qualifier matters too. The paper shows SFT is still essential as a warmup: it stabilizes the output format so the RL stage can optimize anything at all. The right mental model is SFT to format, RL to generalize, not SFT or RL.

2. Fine-tuning crushes prompting on closed-vocabulary tasks

Highlighter.ai's 2025 study classified electrical power outage reports and serious workplace injury reports. They put fine-tuned Qwen2.5-7B against Claude Sonnet 3.5 and 3.7 with prompt engineering. The fine-tuned 7B model hit 88% accuracy on power outages versus 31% for prompted Claude. On serious injury classification, 78% versus 59%. At inference scale, the 7B model cost $789 per million classifications; prompted Claude cost $11,485 per million. The 14× cost gap came almost entirely from token efficiency: the prompted model needed an exhaustive instruction set on every call.

For supervised classification with a fixed label set and abundant training examples, fine-tuning is the right answer. The interesting question is what counts as "fixed label set with abundant examples". In 2026 that excludes most agent workflows, most coding tasks, most retrieval-grounded QA.

3. A great prompt framework can beat fine-tuning

The opposite case is older but has not been overturned. In 2023, Microsoft researchers showed that GPT-4 with the MedPrompt framework beat Med-PaLM 2, a model fine-tuned specifically on medical data, on every one of nine medical benchmarks, by up to 12 points. The framework was just dynamic few-shot retrieval, chain-of-thought, and ensembling. No weights moved.

The MedPrompt result has aged well because it predicted what GEPA would later prove rigorously: there is enormous slack in off-the-shelf prompts, and a structured optimizer can recover most of the lift that fine-tuning was supposed to provide.


The New Prompt Engineering

"Prompt engineering" in 2026 does not mean a human typing variations into a textbox. It means a compiled program with an optimizer attached, exactly the way you would think about training a small neural net.

DSPy and MIPROv2: Bayesian search over instructions

DSPy from Stanford NLP frames an LLM system as a graph of modules with typed signatures. Once you have that abstraction, you can replace each module's prompt with a searched candidate. MIPROv2 bootstraps few-shot examples by running your program and keeping the traces that scored highly, proposes new instructions grounded in those traces, then uses Bayesian optimization to find the best combination over a validation set. For the kind of pipelines that actually ship (RAG, multi-step agents, structured extraction), MIPROv2 routinely lifts end-to-end accuracy 10 to 30 points without anyone touching weights.
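To make that concrete, here is a minimal sketch of a compiled DSPy module plus a MIPROv2 run. The model string, signature, and training examples are placeholders, and argument names can shift between DSPy releases.

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model string

class AnswerFromContext(dspy.Signature):
    """Answer the question using only the provided context."""
    context: str = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

program = dspy.ChainOfThought(AnswerFromContext)

# A metric the optimizer can score candidate programs against (exact match here).
def exact_match(example, prediction, trace=None):
    return float(example.answer.strip().lower() == prediction.answer.strip().lower())

trainset = [  # placeholder examples; a few hundred labeled cases is typical
    dspy.Example(context="...", question="...", answer="...").with_inputs("context", "question"),
]

optimizer = dspy.MIPROv2(metric=exact_match, auto="light")
optimized = optimizer.compile(program, trainset=trainset)
optimized.save("qa_program.json")
```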

GEPA: language as the gradient

GEPA (Agrawal et al., UC Berkeley + Stanford + Databricks + MIT, ICLR 2026 Oral) is the paper that made the field rethink the prompt-vs-RL tradeoff. The insight is small and devastating: a scalar reward of 0.43 tells the model less than a sentence describing what failed. GEPA collects full execution traces (reasoning, tool calls, errors, intermediate outputs), reflects on them in natural language using an LLM, proposes a targeted prompt edit, and accepts the edit if it improves a score. To avoid local optima, it maintains a Pareto frontier of candidates that excel on different problem instances rather than greedily following the single best.
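In pseudocode, the loop has roughly this shape. This is a paraphrase of the paper's description, not the reference implementation; every helper here (evaluate, sample_pareto, run_with_trace, reflect_llm) is a placeholder.

```python
# Rough shape of a GEPA-style optimization loop (illustrative pseudocode only).
candidates = [{"prompt": seed_prompt, "scores": evaluate(seed_prompt, val_batch)}]

for _ in range(budget):
    parent = sample_pareto(candidates)              # sample candidates that win on *some* instances
    task = pick_task(val_batch)
    trace = run_with_trace(parent["prompt"], task)  # reasoning, tool calls, errors, outputs
    edit = reflect_llm(
        "Here is the current prompt, an execution trace, and the failure:\n"
        f"{parent['prompt']}\n{trace}\n"
        "Explain what went wrong and rewrite the prompt to fix it."
    )
    child = {"prompt": edit, "scores": evaluate(edit, val_batch)}
    if child["scores"].mean() >= parent["scores"].mean():  # keep only edits that help
        candidates.append(child)

best = max(candidates, key=lambda c: c["scores"].mean())
```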

Across six benchmarks, GEPA outperformed GRPO by 6 percentage points on average and up to 19, while using up to 35 times fewer rollouts. It also beat MIPROv2 by 10 percentage points (12 on AIME-2025). It is now shipping inside DSPy as dspy.GEPA and MLflow as mlflow.genai.optimize_prompts().

GEPA vs GRPO

Reflective prompts beat policy gradients.

Same Qwen3-8B base, same task. GRPO updates weights through thousands of scalar rewards. GEPA reads the trace, edits the prompt, and ships the change.

  • HotpotQA (Qwen3-8B): GRPO 39.8 → GEPA 60.1 (+20.3 pp)
  • IFBench (Qwen3-8B): GRPO 39.4 → GEPA 58.5 (+19.1 pp)
  • HoVer (Qwen3-8B): GRPO 32.7 → GEPA 48.3 (+15.6 pp)
  • PUPA (Qwen3-8B): GRPO 71.0 → GEPA 76.8 (+5.8 pp)
  • AIME-2025 (instruct, reflective LM: GPT-4o): GRPO 16.7 → GEPA 28.7 (+12.0 pp)

Average lift: +6 pp across the four DSPy benchmarks above plus AIME-2025.

Maximum lift: +19 pp. On HotpotQA, GEPA hit 60.1 vs GRPO's 39.8 with the same budget.

Rollout efficiency: 35× fewer rollouts than GRPO needed to reach the same point.

Source: Agrawal et al., GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning, arXiv:2507.19457 (ICLR 2026 Oral). Numbers are from Table 2 and the AIME-2025 ablation. GRPO was given 24,000 rollouts; GEPA used a comparable or smaller budget on every task.

System prompt learning: the missing paradigm

In May 2025, Andrej Karpathy posted that the field was missing an entire paradigm. Pretraining gives knowledge. Fine-tuning gives habits. There is a third thing, which he called system prompt learning: the model writes notes for itself about what worked on what kind of problem, edits its own system prompt over time, and accumulates explicit strategies the way a human keeps a notebook.

The trigger was Claude's system prompt, which had ballooned to around 17,000 words by then, much of it general problem-solving strategy. An open-source implementation in optillm showed +8.6 points on Arena Hard and +6.7 points on AIME 2024 with this approach using only Gemini 2.0 Flash Lite. Whether it becomes a third pillar or stays an auxiliary technique, the existence proof matters: a sizeable share of the capability you usually buy with weight updates can be bought with prompt updates instead.


The New Fine-Tuning

The other half of the picture changed just as fast. The fine-tuning stack of late 2024 (PPO, paired preference data, expensive critics, full-model gradients) is mostly gone from new work. What replaced it is dramatically cheaper, dramatically simpler, and increasingly accessible.

GRPO replaced PPO

Group Relative Policy Optimization was introduced by DeepSeek in February 2024 and went mainstream when DeepSeek-R1 used it to elicit reasoning without any human-labeled chains of thought. GRPO removes PPO's critic model entirely. For each prompt, it samples a group of responses, scores each one, and computes the advantage as the group-normalized reward. No value function, no separate network, half the memory.
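The core quantity is easy to write down. A minimal sketch of the group-relative advantage (real trainers add a KL penalty against a reference policy and token-level weighting on top):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantage: each response's reward, normalized against
    the other responses sampled for the same prompt. No value network."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# 8 generations for one prompt, scored by a verifier (1 = passes, 0 = fails).
rewards = np.array([0, 1, 0, 0, 1, 1, 0, 0], dtype=float)
print(grpo_advantages(rewards))  # correct generations get positive advantage, the rest negative
```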

A recent theoretical analysis shows GRPO's policy gradient is a U-statistic and is asymptotically equivalent to an oracle algorithm with a perfect value function. So this is not a hack that happens to work; it is provably optimal in a useful class of policy gradient methods.

Unsloth made GRPO consumer-grade

The big shift between "research result" and "you can run it tonight" was Unsloth. Their team made GRPO work with QLoRA, replaced the standard memory-hungry GRPO loss with a chunked variant, and integrated vLLM for fast generation. The numbers are wild. On a Llama 3.1 8B run with 20K context and 8 generations per prompt, standard implementations need 511GB of VRAM. Unsloth needs 54GB. That is a 90% reduction. A 17B model fits in 15GB. A 1.5B model fits in 5GB. Every parameter you need is a knob you can turn: beta for KL strength, num_generations for group size, loss_type defaulting to DAPO (token-normalized to remove length bias).
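A sketch of what a run looks like with Unsloth plus TRL's GRPOTrainer. The model name, dataset, and reward function are placeholders, and exact argument names depend on the Unsloth and TRL versions installed.

```python
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",   # placeholder
    max_seq_length=4096,
    load_in_4bit=True,               # QLoRA
    fast_inference=True,             # vLLM-backed generation for rollouts
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

args = GRPOConfig(
    num_generations=8,         # group size per prompt
    beta=0.04,                 # KL strength against the reference policy
    loss_type="dapo",          # token-normalized loss to remove length bias
    max_completion_length=1024,
    output_dir="grpo-run",
)
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[my_verifier],    # placeholder: (prompts, completions, **kw) -> list[float]
    args=args,
    train_dataset=train_prompts,   # placeholder dataset of prompts
)
trainer.train()
```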

The November 2025 update added FP8 precision RL. Their March 2026 long-context release pushes Qwen3-8B GRPO to 110K context on a single 80GB H100, and gpt-oss QLoRA to 380K context on a 192GB B200. RL fine-tuning at frontier scale is no longer the exclusive domain of labs with cluster access.

ART and RULER: RL without reward engineering

The other practical breakthrough was OpenPipe's ART framework and its RULER reward function. RULER's pitch: skip the reward function entirely. Generate four to eight trajectories per scenario, send them to an LLM judge with a generic ranking rubric, and use the relative scores as GRPO rewards. Because GRPO normalizes within groups anyway, only the rankings matter. On four agent benchmarks, this matched or beat hand-tuned rewards on three of four, with 2 to 3 times faster development. Their email-search agent built on Qwen 2.5 14B with RULER beat OpenAI o3 on the same task.

What that means in practice: RL is no longer gated on whether you can write a reward function. You define the system prompt, point ART at it, and let an o3 or Gemini Flash judge close the loop. The old "you need labeled preference data to do RL" bottleneck is gone.
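A hedged sketch of the idea (generic code, not OpenPipe's actual API): generate a group of trajectories, have a judge LLM score them against one generic rubric, and hand the scores to GRPO as rewards.

```python
import json

RUBRIC = (
    "You will see several candidate agent trajectories for the same task. "
    "Score each from 0 to 1 on how well it accomplishes the goal. "
    'Respond with JSON: {"scores": [...]}.'
)

def judge_group_rewards(judge_llm, task: str, trajectories: list[str]) -> list[float]:
    """judge_llm is a placeholder callable that takes a prompt string and returns text."""
    prompt = (
        RUBRIC
        + f"\n\nTask:\n{task}\n\n"
        + "\n\n".join(f"Trajectory {i + 1}:\n{t}" for i, t in enumerate(trajectories))
    )
    scores = json.loads(judge_llm(prompt))["scores"]
    # GRPO normalizes rewards within the group, so only the relative ranking
    # of these scores matters -- there is no absolute reward scale to engineer.
    return [float(s) for s in scores]
```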

The 2026 Stack

Three layers, in order

The same task, three optimizers stacked.

Each layer optimizes a different object through a different feedback channel. You stop at the first layer that hits your target; most production teams never go past the first.

01

Prompt optimization

Discovers the strategy.

A compiled program with searched instructions and few-shot examples. Reflects on traces to make targeted edits. No weights move.

DSPy · GEPA · MIPROv2

Feedback signal: natural-language feedback on traces.
Cost: $10 to $200 per optimization run.

02

Supervised fine-tuning

Bakes the format.

Train on a few thousand completions logged from the prompt-optimized frontier model. Stabilizes the output format so RL can optimize anything later.

Unsloth · QLoRA · LoRA

Feedback signal: cross-entropy loss on labeled completions.
Cost: $15 to $500 per training run.

03

Reinforcement learning

Pushes the ceiling.

GRPO over groups of generations, scored by a verifier or an LLM judge. Filter to the hard examples. Generalizes where SFT memorizes.

GRPO · DAPO · RULER (ART)

Feedback signal: verifier reward or LLM-judge ranking.
Cost: $200 to $5,000 per training run.

Cost ranges based on OpenAI fine-tuning pricing ($3 to $25 per million training tokens), Unsloth-published GPU usage, and typical DSPy reflection-LM token spend. RL costs assume judge-model rollouts with o3 or Qwen3 32B.

Hard examples are all you need

One last result worth knowing if you are budgeting fine-tuning data. Hard Examples Are All You Need (2025) showed that under fixed annotation budgets, training GRPO on the 10% of prompts where the base model has the lowest pass@k beats training on a random or easy subset by up to 30 percentage points. The intuition: GRPO learns from variance in outcomes, and easy examples produce no variance. So if you are limited on data, do not crawl a giant corpus. Run the base model against your candidate prompts, throw out the ones it solves, and train only on the failures.
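In code, the filter is a few lines. base_model, verifier, and candidate_prompts are placeholders; the 10% cut follows the paper's budgeted setup.

```python
def pass_at_k(base_model, prompt: str, verifier, k: int = 8) -> float:
    """Fraction of k sampled completions that the verifier accepts."""
    return sum(verifier(prompt, base_model(prompt)) for _ in range(k)) / k

# Score every candidate prompt with the *base* model, then keep only the
# hardest 10%. Easy prompts give every generation the same reward, so the
# group-normalized GRPO advantage is zero and nothing is learned from them.
scored = sorted(candidate_prompts, key=lambda p: pass_at_k(base_model, p, verifier))
hard_set = scored[: max(1, len(scored) // 10)]
```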


The Economics, Recalculated

The classic case for fine-tuning was always token efficiency. Prompts grow long because you stuff in instructions, examples, chain-of-thought scaffolds. A fine-tuned model needs none of that. At sufficient request volume, the saved input tokens pay back the training cost in weeks. That math is still right, but it now has to account for two changes that landed in 2025.

Prompt caching collapsed the long-prompt premium

OpenAI and Anthropic both ship prompt caching. OpenAI's is automatic: 50% off cached input on prompts above 1,024 tokens. Anthropic's is explicit (you mark cacheable blocks with cache_control) and gives 90% off on cache reads after a 25% write premium. Production teams report 60% to 90% input cost reduction depending on cache hit rate.

That changes the long-prompt argument for fine-tuning. If your prompt is 80% static (system prompt, tool schemas, retrieved docs) and 20% dynamic (the user query), Anthropic caching cuts your input bill by 72% on a 90% hit rate. The fine-tuning crossover point at 10K requests per day from a year ago effectively moves to 50K to 100K depending on prompt structure. The 2026 default for mid-volume workloads is "keep using the frontier model, just cache the prefix".

Cost Crossover

Monthly cost · log scale · entity extraction

Prompt caching moved the breakeven point.

A year ago, fine-tuning crossed cached prompts at around 10K daily requests. After Anthropic's 90% cache discount and OpenAI's 50% automatic cache, the crossover sits closer to 100K. For everything below it, optimize the prompt.

[Chart: monthly cost (log scale, $3 to $30,000) against requests per day (100 to 1M) for three curves: prompt with no cache, prompt + cache (90%), and fine-tuned + amortized training. Fine-tuning only wins at the high-volume right edge.]
Entity-extraction workload, GPT-4o-mini base. Prompt assumes 1,200 input + 200 output tokens with 5 few-shot examples; fine-tuned variant drops the examples for 400 input + 200 output tokens. Fine-tuning cost amortized over one month at $50 training spend. Anthropic 90% cache assumed at typical hit rates above 75%.
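The arithmetic behind the chart is simple enough to sanity-check yourself. Every per-million-token price below is a placeholder, and the crossover point moves with whatever prices and prompt shapes you plug in; the token counts and amortized training spend follow the chart footnote.

```python
# Back-of-envelope monthly cost model. All prices are placeholder values;
# substitute your provider's current per-million-token rates.
PROMPTED_IN, PROMPTED_OUT = 0.15, 0.60    # $/M tokens, prompted frontier model (example)
DISTILLED_IN, DISTILLED_OUT = 0.08, 0.30  # $/M tokens, small fine-tuned model (example)
CACHE_DISCOUNT, HIT_RATE, STATIC_FRAC = 0.90, 0.90, 0.80
TRAINING_AMORTIZED = 50.0                 # $ per month, from the chart footnote

def monthly_cost(req_per_day, in_tok, out_tok, p_in, p_out, cached=False):
    reqs = req_per_day * 30
    if cached:
        hit = in_tok * STATIC_FRAC * HIT_RATE
        input_dollars = (hit * p_in * (1 - CACHE_DISCOUNT) + (in_tok - hit) * p_in) / 1e6
    else:
        input_dollars = in_tok * p_in / 1e6
    return reqs * (input_dollars + out_tok * p_out / 1e6)

for rpd in (100, 1_000, 10_000, 100_000, 1_000_000):
    cached = monthly_cost(rpd, 1200, 200, PROMPTED_IN, PROMPTED_OUT, cached=True)
    tuned = monthly_cost(rpd, 400, 200, DISTILLED_IN, DISTILLED_OUT) + TRAINING_AMORTIZED
    print(f"{rpd:>9} req/day   cached prompt ${cached:>9,.0f}   fine-tuned ${tuned:>9,.0f}")
```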

The training side got cheaper too

OpenAI fine-tuning of GPT-4o-mini costs $3 per million training tokens. A typical 1,000-example tune costs $15 to $75 and finishes in 30 to 90 minutes. Self-hosted with Llama 3.1 8B and Unsloth QLoRA, you can train on a single 24GB consumer GPU at $1 to $2 per GPU-hour. The training cost is no longer the bottleneck. The bottleneck is whether you have enough labeled data and whether the fine-tuned model still beats prompt optimization on a frontier model.


BetterTogether: The Hybrid

The most underrated paper in this whole space is Fine-Tuning and Prompt Optimization: Two Great Steps That Work Better Together (Soylu, Potts, Khattab, 2024). The setup: take a multi-module pipeline (multi-hop QA, math reasoning, feature classification), and try three optimization schedules. Optimize prompts only. Optimize weights only. Alternate the two, letting the same model teach itself across stages.

The alternating scheme beat prompt-only by up to 6% and weight-only by up to 60% across mistral-7b, llama-2-7b, and llama-3-8b. Sixty percent. Released in DSPy as BetterTogether, it is the single best argument that the "or" in "fine-tune or prompt" is doing more harm than the actual choice between them.

BetterTogether strategies optimizing the weights and prompts of a pipeline together outperform directly optimizing weights alone and prompts alone by up to 60% and 6%, respectively.

The intuition is straightforward. Prompt optimization discovers what kind of decomposition or strategy works for your task. Fine-tuning then bakes that decomposition into the weights so you do not have to keep paying for it in tokens. After fine-tuning, you can prompt-optimize again on the new weights to find a strategy the smaller model can execute. Each pass builds on the last. MetaTuner (2026) extends this with a joint discrete-continuous objective that learns prompts and weights end-to-end, beating BetterTogether by 10 to 17% on math and QA.
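The schedule itself is simple to state. A conceptual sketch (not the exact DSPy BetterTogether API; every function here is a placeholder):

```python
# Conceptual sketch of the BetterTogether-style alternating schedule.
# Placeholder functions throughout; DSPy wraps this in a single optimizer.
program = build_program(base_model)

program = optimize_prompts(program, trainset, metric)        # prompts discover the strategy
traces = collect_winning_traces(program, trainset, metric)   # keep the runs that scored well
tuned_model = finetune(base_model, traces)                   # weights bake the strategy in

program = build_program(tuned_model)
program = optimize_prompts(program, trainset, metric)        # re-tune prompts for the new weights
```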


When Each Actually Wins

Here is the honest answer. The choice depends on four properties of your task. Read these in order; the first one that applies usually decides it.

Decision Field

Four cases, four winners

Match the method
to the shape of the task.

Read top to bottom. The first row whose conditions you meet is usually the right answer. Hybrid pipelines exist for any of these, but the dominant lever is the one named.

01
Fine-tune

Closed-vocabulary classification at scale

When the labels are fixed and the volume is real.

If you have 10K+ labeled examples, a stable schema, and at least a few thousand requests per day, fine-tune. Token efficiency alone pays back a $50 to $500 training run in weeks.

Fits when you ship: triage routing, compliance tagging, intent detection.

14× inference cost gap (Highlighter.ai 2025)

02
Prompt

Multi-step reasoning, agents, and RAG

When the task structure is the actual problem.

GEPA on the right DSPy program will close most of the gap to a fine-tuned baseline at 35× lower compute cost. The win compounds because the same program ports to a new base model in minutes.

Fits when you ship: email research agents, multi-hop QA, tool-augmented coding.

+19 pp GEPA over GRPO on HotpotQA

03
Hybrid

Distill a frontier model to small and fast

When the prompt works but the model is too expensive.

Hit the accuracy target on a frontier model with prompt optimization, log every prompt-completion pair, then SFT a 7B to 14B open model on those logs. This is OpenAI's own recommended pattern. RL on top only if the verifier is real.

Fits when you ship: latency-critical UX, on-device inference, cost-bound APIs.

≈98% cost reduction in OpenAI's published example

04
RL fine-tune

Verifiable, generalization-bound problems

When SFT's memorization is the bottleneck.

Math, code, structured extraction, anything where correctness is checkable. Chu et al. (ICML 2025) showed RL with outcome rewards generalizes to unseen rule and visual variants where SFT fails. Use Unsloth + GRPO, or ART + RULER if you cannot write the verifier.

Fits when you ship: reasoning-grade math, verified code generation, JSON-schema extraction.

+30 pp hard-example training over random (arXiv 2508.14094)

Stats and examples sourced from the papers cited inline in the article. The pattern in row 03 is the one OpenAI explicitly recommends in its model selection guide.

The pattern that fits the most production workloads in 2026 is the third one: prompt-optimize on a frontier model, then distill to a small fine-tuned model when latency or cost forces it. OpenAI's own model selection guide recommends exactly this. You start by hitting the accuracy target on the most capable model. You log every prompt and completion. You fine-tune a smaller model on those logs. The small fine-tuned model often matches the big model at 2% of the cost.


The 2026 Default Playbook

The pipeline below is what I would build today, in order, and the stop conditions for each step. Every layer is optional; you stop when the previous layer hits your target.

Step 1: Write a clean DSPy program

Define the task as a graph of typed signatures, not as a string. This costs a day and pays back forever, because every subsequent step (prompt optimization, fine-tuning, evaluation) is gated on having a measurable program. If your eval set is too small to give a reliable signal, generate one synthetically with a frontier model before you do anything else. Without an eval set, you cannot optimize anything.

Step 2: Optimize the prompts with GEPA or MIPROv2

Run dspy.GEPA with a small budget (100 to 200 metric calls is enough for most tasks). GEPA uses an LLM as the reflection model, which is the only meaningful cost; it produces a better program with more deliberate instructions and a sensible set of few-shot examples. If the result clears your accuracy bar, ship it. You are done.
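A sketch of what that call looks like, assuming the dspy.GEPA interface described in the DSPy docs; the metric returns a score plus natural-language feedback for the reflection model to read, and exact argument names can differ across versions.

```python
import dspy

def metric_with_feedback(gold, pred, trace=None, pred_name=None, pred_trace=None):
    score = float(gold.answer.strip().lower() == pred.answer.strip().lower())
    feedback = "Correct." if score else (
        f"Expected '{gold.answer}' but the program answered '{pred.answer}'. "
        "Point out which step in the trace went wrong."
    )
    return dspy.Prediction(score=score, feedback=feedback)  # feedback is what the reflection LM reads

gepa = dspy.GEPA(
    metric=metric_with_feedback,
    reflection_lm=dspy.LM("openai/gpt-4o"),  # the reflection model is the main cost of the run
    auto="light",                            # small budget, on the order of 100-200 metric calls
)
optimized = gepa.compile(program, trainset=trainset, valset=valset)
```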

Step 3: Add prompt caching

Restructure your prompt so the static portion (system prompt, tool schemas, retrieved corpora) sits at the front. Add Anthropic cache_control markers or rely on OpenAI's automatic caching. This often cuts your input bill 60% to 90% with no behavioral change. Most teams stop here.
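For the Anthropic path, the change is a single field on the static block. A minimal sketch (the model id and prompt contents are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",       # placeholder model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STATIC_INSTRUCTIONS,            # placeholder: system prompt, tool schemas, docs
            "cache_control": {"type": "ephemeral"}, # 25% premium to write, 90% off on cache reads
        }
    ],
    messages=[{"role": "user", "content": user_query}],  # only this part changes per request
)
```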

Step 4: SFT a smaller model on the optimized prompt

If latency or cost still does not work, generate completions from the prompt-optimized frontier model on a few thousand inputs, then SFT a smaller open-weights model (Qwen3-8B is the current sweet spot) on those completions. Use Unsloth with QLoRA. This is the distillation step OpenAI's guide recommends. If accuracy on your eval drops more than 2 to 3 points, reach for a slightly larger student model rather than more SFT data.
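A sketch of the distillation step with Unsloth and TRL's SFTTrainer. The student model, dataset, and hyperparameters are illustrative, and argument names depend on the installed versions.

```python
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",   # placeholder student model
    max_seq_length=4096,
    load_in_4bit=True,               # QLoRA
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=distill_dataset,   # placeholder: logged frontier-model prompt/completion pairs
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=2,
        learning_rate=2e-4,
        output_dir="qwen3-8b-distill",
    ),
)
trainer.train()
```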

Step 5: GRPO with verifiable or judge-based rewards

For tasks with a real verifier (math, code, structured extraction), run GRPO on top of the SFT model with Unsloth. For agent tasks without a verifier, use ART with RULER and a strong judge model (Qwen3 32B works fine; o3 is overkill). Filter your training set to the hard examples (those the SFT model fails on). Expect 10 to 20 points of improvement on the eval if the prompt-optimized ceiling was real.

Step 6: Re-optimize the prompt for the fine-tuned model

The smaller fine-tuned model often benefits from a different prompt than the frontier model used. Re-running GEPA on the fine-tuned model frequently recovers another 2 to 5 points and shortens the inference prompt. This is the BetterTogether closure: weight optimization, then prompt optimization, then ship.

Most teams will finish at step 2 or 3. A meaningful minority will go to step 4. Only the highest-volume or most quality-critical workloads will need 5 and 6. The wrong move in 2026 is to start at 5 because someone said fine-tuning was the answer.

Questions

Frequently Asked Questions

Is fine-tuning better than prompt engineering?
Not by default, and the framing is misleading. Modern prompt optimization (DSPy with GEPA) outperforms RL fine-tuning (GRPO) by 6 to 19 points on average across six benchmarks, while using up to 35× fewer rollouts. Fine-tuning still wins for high-volume token-cost reduction, hard-to-prompt formats, and when you need a much smaller model. The best production systems combine both: prompt-optimize first, then SFT for format, then GRPO if you have verifiable rewards.

What is GEPA?
GEPA (Genetic-Pareto) is a reflective prompt optimizer from a 2025 paper by Agrawal et al. Instead of compressing execution traces into a single scalar reward (what GRPO does), GEPA reads the full trace, errors, and feedback in natural language, reflects on what went wrong, and proposes targeted prompt edits. Because language is a higher-bandwidth feedback channel than a scalar, it learns from far fewer examples. It was accepted to ICLR 2026 as an Oral.

When should I fine-tune instead of prompting?
Fine-tune when you have more than 10K requests per day (token savings pay back training in 1 to 2 months), need a hyper-specific format that prompts cannot enforce reliably, want to distill a frontier model into a small one for latency, or need verifiable-reward learning on a task where prompt optimization has hit its ceiling. Stay on prompts if your data changes weekly, your task is fluid, or your volume is below roughly 1,000 requests per day.

What is BetterTogether?
BetterTogether (Soylu et al., 2024) is a DSPy meta-optimizer that alternates prompt optimization and weight optimization. On multi-hop QA, math, and feature classification with mistral-7b, llama-2-7b, and llama-3-8b, it beats prompt-only by 6% average and weight-only by up to 60% average. The intuition: prompts discover the strategy, fine-tuning bakes it in.

Does SFT really memorize while RL generalizes?
For rule-based tasks, yes. Chu et al. (ICML 2025) showed that on the GeneralPoints arithmetic game and V-IRL navigation, SFT overfits the training rule and fails on unseen variants, while RL with outcome rewards transfers. But SFT is still required as a warmup: it stabilizes the output format that RL then optimizes. The takeaway is not "skip SFT", it is "use SFT sparingly, then let RL do the heavy lifting on generalization".

How much hardware do I need to run GRPO myself?
With Unsloth and QLoRA, you can run GRPO on a model with up to 17B parameters on 15GB of VRAM, and a 1.5B model fits in 5GB. For Llama 3.1 8B at 20K context with 8 generations per prompt, Unsloth uses 54GB versus 510GB for a standard implementation, a 90% reduction. The November 2025 update added FP8 RL and pushed Qwen3-8B to 110K context on a single 80GB H100.
