The Definitive Ranking of Large Language Models in 2025

The large language model landscape has shifted dramatically over the past twelve months. What was once a two-horse race between OpenAI and Google has fractured into a crowded arena where open-source contenders routinely match or exceed proprietary systems on specific benchmarks. Ranking these models is no longer a matter of pointing to a single leaderboard score. It requires weighing trade-offs across reasoning depth, coding proficiency, multilingual fluency, latency, cost, and the less quantifiable dimension of how a model actually feels to use in production.

This ranking reflects hundreds of hours of hands-on testing across real-world workloads, combined with publicly available benchmark data from MMLU, HumanEval, GSM8K, ARC-Challenge, and the newer GPQA and MixEval suites. Where benchmarks disagree with practical experience, we say so. Where a model excels on paper but frustrates in deployment, we flag that too.

Scoring Methodology

Before diving into the rankings, it is worth explaining how we arrived at these positions. Each model was evaluated across six dimensions, each weighted according to its importance for the broadest set of professional use cases:

  • Reasoning & Knowledge (25%) — Performance on graduate-level reasoning tasks, factual accuracy, and the ability to handle multi-step logical chains without losing coherence.
  • Coding Proficiency (20%) — HumanEval pass rates, real-world debugging accuracy, ability to work with large codebases, and support for less common languages and frameworks.
  • Instruction Following (20%) — How precisely the model adheres to complex, multi-constraint prompts. This includes format compliance, length control, and resistance to prompt injection.
  • Multilingual Capability (10%) — Performance across non-English languages, particularly in translation, summarization, and culturally nuanced generation.
  • Latency & Efficiency (10%) — Time-to-first-token and throughput in standard API configurations. For open-source models, this includes performance on consumer-grade hardware.
  • Cost & Accessibility (15%) — Pricing per million tokens, availability of free tiers, and whether the model can be self-hosted.

Each dimension is scored from 1 to 10. The weighted composite produces the final ranking score. We deliberately avoided giving any single dimension more than 25% weight because no single capability defines a model's overall utility.
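To make the weighting concrete, here is a minimal sketch of the composite calculation using the dimension weights above and Claude 3.5 Sonnet's scores from the table below. (Published composites may differ in the last decimal due to rounding.)

```python
# Weights for the six scoring dimensions described above.
WEIGHTS = {
    "reasoning": 0.25,
    "coding": 0.20,
    "instruction": 0.20,
    "multilingual": 0.10,
    "latency": 0.10,
    "cost": 0.15,
}

def composite(scores: dict) -> float:
    """Weighted average of the six dimension scores (each 1-10)."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

claude_35_sonnet = {
    "reasoning": 9.5, "coding": 9.5, "instruction": 9.5,
    "multilingual": 8.5, "latency": 8.0, "cost": 8.0,
}
print(round(composite(claude_35_sonnet), 2))
```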

The Complete Ranking

| Rank | Model | Developer | Reasoning | Coding | Instruction | Multilingual | Latency | Cost | Composite |
|------|-------|-----------|-----------|--------|-------------|--------------|---------|------|-----------|
| 1 | Claude 3.5 Sonnet | Anthropic | 9.5 | 9.5 | 9.5 | 8.5 | 8.0 | 8.0 | 9.03 |
| 2 | GPT-4o | OpenAI | 9.5 | 9.0 | 9.0 | 9.0 | 8.5 | 7.5 | 8.83 |
| 3 | Gemini Ultra 1.5 | Google | 9.0 | 8.5 | 8.5 | 9.5 | 7.5 | 7.5 | 8.50 |
| 4 | Llama 3 405B | Meta | 8.5 | 8.5 | 8.0 | 8.0 | 6.5 | 9.5 | 8.28 |
| 5 | Mistral Large 2 | Mistral AI | 8.5 | 8.0 | 8.5 | 9.0 | 7.5 | 8.0 | 8.28 |
| 6 | Gemini Pro 1.5 | Google | 8.0 | 8.0 | 8.0 | 8.5 | 8.5 | 8.5 | 8.18 |
| 7 | GPT-4o Mini | OpenAI | 7.5 | 7.5 | 8.0 | 7.5 | 9.0 | 9.5 | 8.08 |
| 8 | Qwen 2.5 72B | Alibaba | 8.0 | 8.0 | 7.5 | 9.0 | 7.0 | 9.0 | 8.03 |
| 9 | DeepSeek V3 | DeepSeek | 8.0 | 8.5 | 7.5 | 7.5 | 7.0 | 9.0 | 7.98 |
| 10 | Cohere Command R+ | Cohere | 7.5 | 7.0 | 8.0 | 8.5 | 8.0 | 8.0 | 7.68 |

#1: Claude 3.5 Sonnet — The Model That Changed the Conversation

Anthropic's Claude 3.5 Sonnet did not just iterate on its predecessor; it redefined what a mid-tier model could accomplish. When it launched in mid-2024, it immediately matched or exceeded GPT-4o on most public benchmarks while costing significantly less per token. But the benchmarks only tell half the story.

What sets Claude 3.5 Sonnet apart in daily use is its consistency. Ask it to refactor a 500-line Python module, and it will maintain awareness of dependencies across the entire file. Ask it to write a legal brief following specific jurisdictional formatting, and it does so without the hallucinated citations that still occasionally plague other models. Its instruction-following fidelity is, in our testing, the best in the industry. It respects constraints around output format, length, and tone with a precision that reduces the need for re-prompting.

The coding story is particularly strong. On SWE-bench, Claude 3.5 Sonnet resolves real GitHub issues at a rate that puts it alongside specialized coding models. It handles TypeScript generics, Rust lifetimes, and complex SQL query optimization with genuine fluency. The model also demonstrates a notable ability to reason about code architecture, not just syntax. It will suggest design pattern changes when they are warranted rather than blindly implementing whatever you ask for.

Where it falls short: its multilingual performance, while good, does not match Gemini Ultra's breadth across low-resource languages. And Anthropic's API, while reliable, still lacks some of the tooling ecosystem that OpenAI has built. For a deeper comparison with its closest rival, see our Claude 3.5 Sonnet vs GPT-4o head-to-head analysis.

#2: GPT-4o — The Reliable Workhorse

GPT-4o remains the model that most enterprises reach for first, and for good reason. OpenAI's flagship benefits from years of iterative refinement, an enormous tooling ecosystem, and the broadest third-party integration landscape of any LLM. Its multimodal capabilities, particularly image understanding and voice interaction, remain class-leading.

On raw reasoning benchmarks, GPT-4o trades blows with Claude 3.5 Sonnet. It scores slightly higher on GPQA Diamond questions in our testing and handles certain categories of mathematical proof with more rigor. Its MMLU scores have held steady at the top of the pack, and its ability to synthesize information from long documents is excellent thanks to its 128K context window.

The coding gap between GPT-4o and Claude 3.5 Sonnet is narrower than many analysts suggest. GPT-4o excels at generating boilerplate, scaffolding full-stack applications, and working with popular frameworks. Where it loses ground is in nuanced debugging sessions requiring careful reasoning about state. It tends to over-generate, offering verbose solutions where a targeted fix would suffice.

Pricing remains a concern. At the time of writing, GPT-4o's per-token cost is meaningfully higher than Claude 3.5 Sonnet's for equivalent workloads. For organizations processing millions of tokens daily, that delta compounds quickly. OpenAI has partially addressed this with GPT-4o Mini, but the capability gap between the full model and the mini variant is noticeable on complex tasks.
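To see how that delta compounds, consider a rough back-of-the-envelope calculation. The per-million-token prices here are placeholders, not either provider's actual list prices, which change frequently:

```python
# Illustrative only: these per-million-token prices are placeholders,
# not current list prices for any provider.
def monthly_cost(tokens_per_day: float, price_per_million: float, days: int = 30) -> float:
    return tokens_per_day / 1_000_000 * price_per_million * days

daily_tokens = 50_000_000  # a mid-size production workload
cost_a = monthly_cost(daily_tokens, price_per_million=5.00)  # hypothetical pricier model
cost_b = monthly_cost(daily_tokens, price_per_million=3.00)  # hypothetical cheaper model
print(f"Monthly delta: ${cost_a - cost_b:,.0f}")
```

At this volume, a $2-per-million-token gap becomes $3,000 a month, and the gap scales linearly with usage.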

#3: Gemini Ultra 1.5 — Google's Long-Context Champion

Google's Gemini Ultra 1.5 brought something to the table that no other model could match at launch: a functional million-token context window. While other models advertise large context lengths, Gemini Ultra actually maintains coherence and retrieval accuracy across enormous input spans. In our needle-in-a-haystack tests with 800K+ token inputs, it consistently located and reasoned about embedded information that other models either missed or hallucinated around.
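The basic shape of a needle-in-a-haystack test is simple to reproduce. The sketch below approximates tokens as whitespace-delimited words; a real evaluation would size the input with the model's own tokenizer:

```python
import random

def build_haystack(needle: str, filler: str, target_tokens: int, seed: int = 0) -> str:
    """Embed a 'needle' sentence at a random position in repeated filler text.

    Token counts are crudely approximated as one word per token; real
    harnesses would measure length with the target model's tokenizer.
    """
    rng = random.Random(seed)
    filler_words = filler.split()
    body = (filler_words * (target_tokens // len(filler_words) + 1))[:target_tokens]
    body.insert(rng.randrange(len(body)), needle)
    return " ".join(body)

doc = build_haystack(
    needle="The vault code is 7-4-1-9.",
    filler="The quick brown fox jumps over the lazy dog.",
    target_tokens=1000,
)
# The harness then asks the model for the vault code and checks the answer.
```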

This long-context capability is not a gimmick. It unlocks use cases that are genuinely impractical with other models: analyzing entire codebases in a single pass, processing full legal discovery documents, or maintaining conversation context across days of interaction. For teams working with large document corpora, Gemini Ultra is in a class of its own.

The model also leads in multilingual benchmarks, benefiting from Google's decades of investment in machine translation and its training data diversity. It handles code-switching between languages with unusual grace, and its performance on non-Latin scripts is measurably ahead of competitors.

The downsides are real, however. Gemini Ultra's latency is notably higher than GPT-4o or Claude 3.5 Sonnet for standard queries. Its instruction following, while much improved over earlier Gemini releases, still occasionally drifts on highly constrained prompts. And Google's API pricing and availability have been inconsistent, creating friction for developers who need predictable costs.

#4: Llama 3 405B — Open Source Reaches the Frontier

Meta's Llama 3 405B is the most important open-source model release in the history of the field. That is not hyperbole. For the first time, an openly available model competes directly with the best proprietary systems across a broad range of tasks. Its MMLU score sits within striking distance of GPT-4o, and its coding performance on HumanEval exceeds what GPT-4 achieved just eighteen months ago.

The significance extends beyond raw performance. Llama 3 405B can be fine-tuned, quantized, and deployed on private infrastructure without licensing fees. For organizations with data sovereignty requirements, regulatory constraints around third-party API usage, or simply a desire to control their AI stack, it represents a genuine alternative to the proprietary providers.

The practical challenges are substantial. Running a 405-billion parameter model requires serious hardware. Even with 4-bit quantization, you need multiple high-end GPUs. Inference latency on self-hosted deployments is typically 3-5x slower than calling a commercial API. And while the base model is strong, it lacks the extensive RLHF tuning and safety layers that Anthropic and OpenAI have invested years in building.
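A rough weight-memory estimate shows why. The overhead factor below is an assumption standing in for activations and KV cache; real requirements vary by runtime, batch size, and context length:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    """Rough VRAM needed to hold the weights, with ~20% headroom (an assumed
    overhead factor) for activations and KV cache."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

# Llama 3 405B at 4-bit quantization:
print(round(weight_memory_gb(405, 4), 1))  # ~243 GB with headroom
```

Even at 4 bits per weight, that footprint exceeds any single accelerator on the market, hence the multi-GPU requirement.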

Despite these caveats, Llama 3's existence has permanently altered the competitive dynamics of the industry. It puts a ceiling on how much proprietary providers can charge and a floor on how capable any serious model needs to be.

#5: Mistral Large 2 — Europe's Answer to Silicon Valley

Mistral AI continues to punch well above its weight. Mistral Large 2 delivers performance that rivals models trained with ten times its rumored budget, and it does so with a particular strength in European languages that reflects the company's Paris-based heritage. On French, German, Spanish, and Italian benchmarks, it consistently outperforms GPT-4o and matches Gemini Ultra.

The model's instruction following is excellent, particularly for structured output generation. When asked to produce JSON, XML, or tabular data conforming to a specific schema, Mistral Large 2 has the lowest format-violation rate in our testing. This makes it a strong choice for pipeline integration where downstream systems expect predictable output structures.
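Measuring a format-violation rate amounts to running a validator like the sketch below over each response. The field names and types are a hypothetical schema, not part of any model's API:

```python
import json

REQUIRED_FIELDS = {"name": str, "price": float, "in_stock": bool}  # hypothetical schema

def validate_output(raw: str) -> tuple:
    """Check that model output parses as JSON and matches the expected fields and types."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc}"
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            return False, f"missing field: {field}"
        if not isinstance(payload[field], expected_type):
            return False, f"wrong type for {field}"
    return True, "ok"

ok, msg = validate_output('{"name": "Widget", "price": 9.99, "in_stock": true}')
```

The violation rate is simply the fraction of responses for which the validator returns False.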

Its coding capabilities are solid if not spectacular. It handles Python, JavaScript, and Java with confidence but stumbles more frequently on less common languages. Its mathematical reasoning is improving rapidly but still trails the top three models on olympiad-level problems.

#6-7: Gemini Pro 1.5 and GPT-4o Mini — The Efficiency Tier

These two models represent the sweet spot for many production workloads. Gemini Pro 1.5 offers a remarkable balance of capability and cost, inheriting much of Gemini Ultra's long-context prowess at a fraction of the price. GPT-4o Mini, meanwhile, has become the default choice for high-volume applications where latency matters more than peak capability.

GPT-4o Mini deserves particular attention. Its time-to-first-token is consistently under 200 milliseconds, and its throughput is high enough to handle real-time chat applications without queuing. The capability gap relative to full GPT-4o is most noticeable on complex reasoning chains and advanced mathematics, but for summarization, classification, extraction, and straightforward generation tasks, the mini variant is often indistinguishable from its larger sibling.
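Time-to-first-token is straightforward to measure against any streaming endpoint. The generator below is a stand-in for a real client's token stream, with an artificial 50 ms delay before the first token:

```python
import time

def measure_ttft(stream):
    """Time-to-first-token for any token iterator; swap in a real
    streaming API response in place of the stub below."""
    start = time.perf_counter()
    first = next(iter(stream))
    return time.perf_counter() - start, first

def fake_stream():
    # Stand-in generator simulating a model that starts responding after 50 ms.
    time.sleep(0.05)
    yield "Hello"
    yield ", world"

ttft, token = measure_ttft(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, first token: {token!r}")
```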

#8: Qwen 2.5 72B — The Quiet Competitor

Alibaba's Qwen 2.5 72B does not receive the media attention it deserves. Released under a permissive license, it offers performance that competes with models several times its parameter count. Its Chinese language capability is unmatched, but more importantly, its English performance has improved to the point where it is a credible option for global deployments.

Qwen 2.5 excels in mathematical reasoning and code generation, scoring within a point of GPT-4o on GSM8K and ahead of Mistral Large on HumanEval. Its relatively modest size means it can run on a single high-end GPU when quantized, making it one of the most practical open-source options for self-hosting.

#9: DeepSeek V3 — Cost Efficiency Redefined

DeepSeek V3 is the model that keeps proprietary providers awake at night. Built with a mixture-of-experts architecture that activates only a fraction of its parameters per query, it achieves remarkable cost efficiency. Its API pricing undercuts GPT-4o by more than 90% while delivering performance that is genuinely competitive on coding and reasoning tasks.
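The arithmetic behind mixture-of-experts efficiency is worth making explicit. The expert counts and sizes below are illustrative figures, not DeepSeek V3's actual architecture:

```python
def active_params_billion(n_experts: int, top_k: int,
                          expert_params_b: float, shared_params_b: float) -> float:
    """Parameters touched per token when each token is routed to top_k of
    n_experts experts. All figures here are illustrative assumptions."""
    return shared_params_b + top_k * expert_params_b

# Hypothetical: 64 experts of 8B each plus 20B shared, routing each token to 4 experts.
total = 20 + 64 * 8  # 532B total parameters
active = active_params_billion(n_experts=64, top_k=4, expert_params_b=8, shared_params_b=20)
print(f"{active:.0f}B active of {total}B total per token")
```

Because per-query compute scales with active rather than total parameters, a sparse model can price far below a dense model of comparable capability.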

The model's coding capabilities are its strongest suit. On HumanEval and MBPP, it scores alongside Claude 3.5 Sonnet, which is an extraordinary achievement given its pricing. Its training approach, which emphasizes code and mathematical data, produces a model that thinks in a structured, logical way that many developers prefer.

The caveats center on its instruction following and safety tuning, which are less refined than the top-tier proprietary models. It occasionally produces responses that drift from the prompt's constraints, and its content filtering is less consistent. For regulated industries, these are meaningful concerns.

#10: Cohere Command R+ — The Enterprise Specialist

Cohere's Command R+ rounds out our top ten not because it leads any single benchmark category, but because it offers a combination of capabilities that is uniquely suited to enterprise search and retrieval-augmented generation. Its grounded generation, which ties responses to specific source documents and provides inline citations, is the most mature implementation of RAG-native behavior in any foundation model.

For organizations building knowledge management systems, customer support platforms, or internal search tools, Command R+ offers functionality that would require significant engineering to replicate with other models. Its multilingual support across over 100 languages also makes it a practical choice for global enterprises.
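The grounded-generation pattern can be approximated with any model by numbering sources in the prompt and asking for bracketed citations. This sketch illustrates the general pattern, not Cohere's actual API or prompt format:

```python
def grounded_prompt(question: str, documents: list) -> str:
    """Format numbered source documents so the model can cite them inline as [n].

    A generic illustration of citation-grounded RAG prompting; models with
    native grounding handle this without prompt engineering.
    """
    sources = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Answer using ONLY the sources below. Cite each claim with its "
        f"source number in brackets.\n\nSources:\n{sources}\n\nQuestion: {question}"
    )

prompt = grounded_prompt(
    "When was the warranty policy last updated?",
    ["Warranty policy, revised March 2024.", "Returns are accepted within 30 days."],
)
```

A RAG-native model goes further by verifying that each citation actually supports the claim, which is the part that is hard to replicate with prompting alone.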

Use-Case Recommendations

Rankings are useful as a starting point, but model selection ultimately depends on what you are building. Here is where we would point teams based on their primary use case:

  • Software Development & Code Review: Claude 3.5 Sonnet or DeepSeek V3. The former for highest quality, the latter for cost-sensitive workloads.
  • Long Document Analysis: Gemini Ultra 1.5. Nothing else comes close for million-token workloads.
  • High-Volume Production APIs: GPT-4o Mini or Gemini Pro 1.5. Both offer the latency and throughput characteristics that real-time applications demand.
  • Self-Hosted Deployments: Llama 3 405B for maximum capability, Qwen 2.5 72B for the best performance-per-GPU ratio.
  • Enterprise Search & RAG: Cohere Command R+ for its native citation and grounding features.
  • European Language Workloads: Mistral Large 2 for its exceptional performance on French, German, and Spanish tasks.
  • General-Purpose Excellence: GPT-4o remains the safest all-around choice for teams that need one model to handle everything.

What This Ranking Misses

No ranking captures everything. We deliberately excluded several factors that matter but are difficult to score objectively: the quality of each provider's documentation, the responsiveness of their support teams, the stability of their APIs over time, and the trajectory of their improvement. OpenAI's track record of rapid iteration is itself a competitive advantage that does not show up in any benchmark.

We also did not score safety and alignment, not because they are unimportant but because they require their own dedicated analysis. Anthropic leads in transparency around its safety work, but all top-tier providers have made meaningful investments in this area. For a deeper exploration of the technical underpinnings shared by all these models, our technical deep dive into the transformer architecture provides essential context.

"The gap between the best proprietary model and the best open-source model has never been smaller. By the end of 2025, it may effectively close."

This ranking will age. Models that do not exist today will appear on next quarter's list. The pace of improvement in this field is such that any static ranking becomes a historical document within months. What endures is the framework: evaluate models on what matters to your specific workload, test them against your actual data, and be prepared to switch when something better arrives. In the current market, something better always arrives.