What happened?
GPT-5.4 now sits at the top of the PostTrainBench leaderboard with a score of 28.22%, up from a baseline of 20.23% measured without the elicitation prompt. There was no new model release and no architecture change. Researcher Hardik Bhatnagar found that the model was using only about 1.5 of its 10 allocated compute hours during evaluation.
The fix, it turned out, was almost embarrassingly simple.
The sentence that changed everything
"You still have time, keep improving."
That single nudge, with no fine-tuning and no system prompt overhaul, moved GPT-5.4 from 4th place to 1st on PostTrainBench. Its score rose from 20.23% to 28.22%, a relative improvement of roughly 40%.
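The roughly 40% figure can be checked from the two scores quoted in this article. The short calculation below is only a worked example; the variable names are illustrative, and the numbers are the ones reported above (20.23%, 28.22%, and about 1.5 of 10 compute hours).

```python
# Scores as reported in the article, in percent.
baseline = 20.23   # GPT-5.4 without the nudge
elicited = 28.22   # GPT-5.4 with the nudge

absolute_gain = elicited - baseline               # in percentage points
relative_gain = absolute_gain / baseline * 100    # as a percent of the baseline

# Compute budget the model used before being nudged (approximate, per the article).
hours_used, hours_allocated = 1.5, 10
budget_used = hours_used / hours_allocated * 100

print(f"absolute gain: {absolute_gain:.2f} points")          # 7.99 points
print(f"relative gain: {relative_gain:.1f}%")                # 39.5%
print(f"budget used before the nudge: {budget_used:.0f}%")   # 15%
```

The absolute gain is about 8 percentage points; the "40%" describes that gain relative to the 20.23% baseline.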
This is what researchers call elicitation: getting better performance out of the same model by changing how it is asked, not the model itself. The implication is significant. Elicitation quality may matter as much as raw model capability.
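To make the idea concrete, here is a minimal sketch of a time-budget nudge. It is not the harness used for PostTrainBench, which this article does not describe. The function `run_with_nudges`, the `agent_step` callback, the "Continue." prompt, and the toy agent are all hypothetical stand-ins; only the nudge wording comes from the article.

```python
import time
from typing import Callable, List, Tuple

NUDGE = "You still have time, keep improving."  # wording quoted in the article


def run_with_nudges(
    agent_step: Callable[[str], Tuple[str, bool]],
    task: str,
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Keep calling a (hypothetical) agent until the time budget runs out.

    agent_step takes a prompt and returns (output, says_done).
    """
    deadline = clock() + budget_seconds
    outputs: List[str] = []
    prompt = task
    while clock() < deadline:
        output, says_done = agent_step(prompt)
        outputs.append(output)
        # If the agent declares itself done while time remains, nudge it
        # instead of accepting the early stop.
        prompt = NUDGE if says_done else "Continue."
    return outputs


# Toy demo with a fake clock so it runs instantly and deterministically.
fake_now = [0.0]
prompts_seen: List[str] = []


def toy_agent(prompt: str) -> Tuple[str, bool]:
    prompts_seen.append(prompt)
    fake_now[0] += 3600.0  # pretend each step costs one hour
    return f"attempt {len(prompts_seen)}", True  # always claims to be done


results = run_with_nudges(
    toy_agent, "Improve the model.", 3 * 3600.0, clock=lambda: fake_now[0]
)
print(results)       # three attempts fit in the three-hour budget
print(prompts_seen)  # the task first, then the nudge twice
```

The design choice here is that the loop, not the agent, decides when work ends: an early "done" is treated as a cue to nudge again while budget remains.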
What PostTrainBench results show
PostTrainBench is a standardized evaluation framework that measures model performance after post-training, the stage that follows initial training. Its score combines multiple tasks, including BFCL (function calling) and ArenaHard.
Current leaderboard highlights:
- GPT-5.4 (with elicitation): 28.22% — #1
- GPT-5.4 (baseline): 20.23% — #4
- Qwen3-4B: 41.40% average, 100% on BFCL
- Gemma-3-4B: 24.85% average
Qwen3-4B, a 4-billion-parameter model, posts 100% on BFCL. Smaller models can outperform much larger ones on specific tasks, which is further evidence that size alone does not determine capability.
What this means for European developers and businesses
The old paradigm was simple: bigger model, better results. The emerging picture is more nuanced. A model can perform dramatically better with the right prompt — and dramatically worse without it.
For Baltic and European companies integrating AI into their workflows, this has a practical consequence. Investing in prompt engineering, the craft of formulating the right instructions, can sometimes yield the same or better results than subscribing to a more expensive model tier.
Takeaway
As Hardik Bhatnagar noted, "PostTrainBench scores are a function of both model capability and elicitation." (Source: @hrdkbhatnagar on X)
AI is not a black box you throw money at. It responds to how you communicate with it. And sometimes all it takes is: you still have time.