Claude Opus 4.5 vs. GPT-5.2 Codex vs. Gemini 3 Pro on real-world coding tasks
Analysis
The article compares three large language models (Claude Opus 4.5, GPT-5.2 Codex, and Gemini 3 Pro) on real-world coding tasks within a Next.js project. Rather than benchmark scores, the author evaluates practical feature implementation: whether each model ships the feature, how long it takes, and how many tokens and dollars it consumes. Gemini 3 Pro performed best, followed by Claude Opus 4.5, with GPT-5.2 Codex the least dependable. Each model got three runs on the same real-world project, with the best run counted, to mitigate run-to-run variance.
Key Takeaways
- Gemini 3 Pro performed best on the coding task, excelling at the caching and fallback mechanisms (a sketch of that pattern follows the quote below).
- Claude Opus 4.5 was reliable but had some UI issues.
- GPT-5.2 Codex was the least dependable.
- The evaluation focused on real-world feature implementation in an actual Next.js project, weighing practical factors like cost and time.
“Gemini 3 Pro performed the best. It set up the fallback and cache effectively, with repeated generations returning in milliseconds from the cache. The run cost $0.45, took 7 minutes and 14 seconds, and used about 746K input (including cache reads) + ~11K output.”
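The article does not show the winning implementation, but the cache-plus-fallback behavior the quote describes can be sketched. Below is a minimal TypeScript sketch, not the model's actual output: the `generate`, `fallbackGenerate`, and `TTL_MS` names and the in-memory `Map` cache are all assumptions for illustration.

```typescript
// Hypothetical sketch of a cache-plus-fallback wrapper; the article
// does not publish the generated code, so every name here is assumed.

type CacheEntry = { result: string; createdAt: number };

const cache = new Map<string, CacheEntry>();
const TTL_MS = 60 * 60 * 1000; // assumed 1-hour cache lifetime

async function generateWithCacheAndFallback(
  prompt: string,
  generate: (p: string) => Promise<string>,         // primary model call (assumed signature)
  fallbackGenerate: (p: string) => Promise<string>  // secondary provider (assumed)
): Promise<string> {
  // 1. Serve from cache while the entry is fresh: repeated prompts
  //    return in milliseconds, as the quoted run observed.
  const hit = cache.get(prompt);
  if (hit && Date.now() - hit.createdAt < TTL_MS) {
    return hit.result;
  }

  // 2. Try the primary model; fall back to the secondary on failure.
  let result: string;
  try {
    result = await generate(prompt);
  } catch {
    result = await fallbackGenerate(prompt);
  }

  // 3. Store the fresh result for subsequent requests.
  cache.set(prompt, { result, createdAt: Date.now() });
  return result;
}
```

An in-memory `Map` is the simplest possible store; a production Next.js app would more likely persist entries in Redis or the framework's data cache, but the control flow (check cache, try primary, fall back, store) stays the same.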