Google Gemini Ultra: A Thorough Review of Google's Most Ambitious AI

Google has never been short on ambition when it comes to artificial intelligence. The company invented the Transformer architecture, published the "Attention Is All You Need" paper that launched the modern AI era, and built some of the most powerful computing infrastructure on the planet. Yet for the better part of two years, Google found itself in the unusual position of playing catch-up, watching OpenAI capture the public imagination with ChatGPT and GPT-4 while Google's own AI products seemed perpetually one step behind.

Gemini Ultra is Google's definitive answer to that narrative. It is the largest and most capable model in the Gemini family, and Google has positioned it as a direct competitor to GPT-4 and whatever comes next from OpenAI. After several months of extensive use across both the Gemini Advanced web interface and the API, I can say that Gemini Ultra is a genuinely impressive model that excels in specific areas while still carrying notable weaknesses. This is not a clear-cut victory for Google, but it is no longer a contest where OpenAI is the obvious frontrunner either.

Architecture: What We Know and What We Can Infer

Google has been characteristically cagey about Gemini Ultra's precise architecture. The original technical report confirmed that the Gemini family uses a decoder-only Transformer architecture with modifications for efficient training on Google's TPUv4 and TPUv5e pods. The models were trained on multimodal data from the start, incorporating text, images, audio, and video into the training process rather than adding these capabilities post-hoc.

What Google has not disclosed is the model's parameter count, the exact composition of its training data, or the specific architectural innovations that differentiate it from prior Google models like PaLM 2. Based on the model's capabilities, context window sizes, and inference latency characteristics, credible estimates from the research community place Gemini Ultra in the range of 1 to 1.5 trillion parameters, likely using a mixture-of-experts (MoE) architecture where only a fraction of the total parameters are active for any given input. This would be consistent with the approach Google pioneered with the Switch Transformer and GLaM papers.

The MoE hypothesis is supported by Gemini Ultra's inference characteristics. Despite its presumed massive size, its response latency is comparable to models like GPT-4 Turbo that are believed to be significantly smaller. This is consistent with an MoE model where routing mechanisms activate only a subset of parameters per token, keeping the effective computation manageable even as the total parameter count scales enormously.
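To make the MoE idea concrete, here is a toy sketch of top-k expert routing. This is an illustration of the general technique, not Gemini's actual implementation; the gate, the experts, and all dimensions are made up for the example.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route a token vector to its top-k experts and mix their outputs
    by the renormalized gate probabilities."""
    logits = x @ gate_w                       # (n_experts,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over experts
    top = np.argsort(probs)[-k:]              # indices of the k best experts
    weights = probs[top] / probs[top].sum()   # renormalize over the chosen k
    # Only k of the n experts actually run -- this is the compute saving.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
gate_w = rng.normal(size=(d, n_experts))
expert_mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda v, W=W: v @ W for W in expert_mats]

x = rng.normal(size=d)
y = moe_forward(x, gate_w, experts, k=2)
```

With k=2 of 4 experts active, each token incurs half the expert computation of a dense model of the same total size, which is exactly why a very large MoE can match the latency of a much smaller dense model.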

The native multimodal training is a genuine architectural distinction. While OpenAI has described GPT-4V as a vision capability added to a text-trained model, Google claims that Gemini processes interleaved multimodal data natively. In theory, this should result in deeper cross-modal understanding, though as we will see, the practical implications of this design choice are more nuanced than the marketing suggests.

Benchmark Performance: Impressive Numbers, Complicated Reality

Google's benchmark results for Gemini Ultra were headline-grabbing. Google claimed the model was the first to achieve human-expert performance on MMLU (Massive Multitask Language Understanding), scoring 90.0% compared to GPT-4's reported 86.4%. On multimodal benchmarks like MMMU and MathVista, Gemini Ultra posted new state-of-the-art results. Its performance on coding benchmarks, particularly HumanEval and Natural2Code, was competitive with the best specialized coding models.

However, the benchmark story requires several caveats. First, Google achieved its MMLU score using a technique called "uncertainty-routed chain-of-thought," where the model uses chain-of-thought reasoning for questions it is uncertain about and direct answering for confident predictions. This is a legitimate evaluation technique, but it produces scores that are not directly comparable to the 5-shot results typically reported for other models. When evaluated under standard 5-shot conditions, Gemini Ultra's MMLU score drops to approximately 83.7%—still excellent, but not the record-setting figure Google highlighted.
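The routing logic behind that headline score can be sketched in a few lines. This is a hedged reconstruction of the idea as the report describes it (sample several chain-of-thought answers; take the majority vote only when consensus clears a tuned threshold, otherwise fall back to the greedy answer); the function name and threshold value are illustrative.

```python
from collections import Counter

def uncertainty_routed_answer(cot_samples, greedy_answer, threshold=0.6):
    """Pick the majority-vote answer from sampled chains of thought when
    the model is 'confident' (consensus >= threshold), else defer to the
    direct greedy answer."""
    counts = Counter(cot_samples)
    answer, votes = counts.most_common(1)[0]
    consensus = votes / len(cot_samples)
    if consensus >= threshold:
        return answer          # chains agree: trust the reasoning path
    return greedy_answer       # chains disagree: fall back to greedy

# Three of four chains say "B": confident, so "B" wins.
print(uncertainty_routed_answer(["B", "B", "B", "A"], greedy_answer="C"))
```

The key point for benchmark comparisons is that the threshold is tuned on a validation split, so the resulting score reflects an ensemble-style inference strategy rather than a single 5-shot forward pass.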

Second, benchmarks measure narrow, well-defined tasks that do not always predict real-world performance. In my own testing, Gemini Ultra's performance on novel, complex reasoning tasks—the kind of problems you encounter in actual work, not in benchmark suites—is roughly comparable to GPT-4. Sometimes Gemini excels, particularly on tasks involving mathematical reasoning, scientific knowledge, and multimodal understanding. Other times, GPT-4 produces more coherent, well-structured responses, especially for nuanced writing, creative tasks, and following complex multi-step instructions.

Multimodal Capabilities: Where Gemini Genuinely Shines

If there is one area where Gemini Ultra has a legitimate claim to superiority over the competition, it is multimodal understanding. The model's ability to process and reason about images, documents, charts, and diagrams is exceptional. Its performance on complex visual reasoning tasks often surpasses GPT-4V, particularly when the task requires integrating visual information with domain-specific knowledge.

I tested Gemini Ultra extensively with scientific figures, engineering diagrams, and financial charts. Its ability to extract quantitative information from complex charts—reading values from axes, identifying trends, comparing multiple data series—is the best I have seen from any model. When given a complex multi-panel scientific figure from a research paper, Gemini Ultra consistently provided more accurate and detailed descriptions than GPT-4V, correctly identifying relationships between panels and drawing appropriate scientific conclusions.

The video understanding capabilities are particularly noteworthy. Gemini can process video inputs (through the API) and answer questions that require understanding temporal sequences, not just individual frames. I tested it with instructional videos, meeting recordings, and lecture content. For short to medium-length videos (under 10 minutes), it could accurately summarize content, identify key moments, and answer detailed questions about what happened at specific points. For longer videos, performance degraded, with the model sometimes confusing the order of events or missing details from the middle sections.

Audio processing, while less publicized, is also strong. Gemini can transcribe speech, understand spoken questions in multiple languages, and even analyze characteristics of audio like background noise, music, and speaker changes. The integration of audio understanding with the text and vision capabilities creates a genuinely multimodal system that handles real-world content like news broadcasts, presentations, and video calls more naturally than competitors.

Reasoning and Mathematics: A Clear Strength

Gemini Ultra demonstrates exceptional mathematical and logical reasoning capabilities. On problems from competition mathematics, advanced physics, and formal logic, it consistently performs at or above the level of GPT-4. Its step-by-step mathematical problem solving is typically clear, well-organized, and correct, though like all large language models, it can make arithmetic errors on complex calculations, particularly those involving many intermediate steps.

The model's strength in mathematical reasoning extends to practical applications. When given data analysis tasks, Gemini Ultra produces thoughtful statistical interpretations, identifies appropriate analytical methods, and writes correct code for implementing analyses. Its understanding of probability and statistics is particularly strong, and it handles Bayesian reasoning problems with a level of sophistication that trips up less capable models.

Scientific reasoning is another high point. Gemini Ultra can engage with technical content from fields like molecular biology, quantum mechanics, and materials science at a level that is useful for researchers. It understands the conventions and methodologies of these fields, can critique experimental designs, and provides references that are generally (though not always) accurate. This makes it a valuable research assistant for literature review, experimental design brainstorming, and technical writing.

The Google Ecosystem Advantage

Gemini's integration into Google's product ecosystem represents both a strategic advantage and a potential limitation. Through Google Workspace, Gemini can access and work with your Gmail, Google Docs, Google Sheets, and Google Drive files. This integration is available to Gemini Advanced subscribers and enables use cases that standalone chatbots cannot match.

The practical value of this integration is significant. You can ask Gemini to summarize your recent emails about a specific project, find and analyze data from a spreadsheet in your Drive, or draft a document based on the contents of multiple files. For users deeply embedded in the Google ecosystem, this creates a level of contextual awareness and utility that ChatGPT, despite its plugins and GPTs, has not matched.

However, this ecosystem integration is also a limitation. If your organization uses Microsoft 365, Slack, or other productivity tools, Gemini's workspace integration is irrelevant. And Google's track record with product integrations—marked by launches, rebrandings, and occasional discontinuations—gives some users pause about building workflows that depend on these integrations remaining stable.

The Gemini API, accessible through Google AI Studio and Vertex AI, is competitive on pricing. Google has been aggressive about undercutting OpenAI's API pricing, and the 1 million token context window available for Gemini 1.5 Pro (the model most API users interact with) is dramatically larger than what competitors offer. This extended context window enables use cases like processing entire codebases, analyzing full-length books, or summarizing hour-long video recordings that are simply impossible with smaller context windows.

Weaknesses and Limitations

Gemini Ultra is not without significant weaknesses, and being honest about them is important for anyone deciding between this and competing models.

Instruction Following and Format Control

Gemini Ultra is less reliable than GPT-4 at following precise formatting instructions. When asked to produce output in specific formats—structured JSON, numbered lists with specific formatting, or text within exact word count constraints—Gemini more frequently deviates from instructions. It tends to be verbose, adding preamble and caveats that were not requested. For applications that require precise, predictable output formatting, this is a meaningful limitation.
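In practice, this kind of drift pushes you toward defensive parsing. Here is one minimal sketch: tolerate chatty preamble or markdown fences around a JSON payload, and retry with a stricter instruction when parsing fails. `call_model` is a stand-in for whatever client function you use; the retry prompt wording is illustrative.

```python
import json
import re

def extract_json(raw: str):
    """Parse JSON from a model reply, tolerating preamble or code fences."""
    try:
        return json.loads(raw)               # the clean case
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} or [...] span in the text.
    match = re.search(r"(\{.*\}|\[.*\])", raw, re.DOTALL)
    if match:
        return json.loads(match.group(1))
    raise ValueError("no JSON found in model output")

def call_with_retries(call_model, prompt, retries=2):
    """Re-prompt with a stricter instruction until the output parses."""
    for _ in range(retries + 1):
        try:
            return extract_json(call_model(prompt))
        except (ValueError, json.JSONDecodeError):
            prompt += "\nReturn ONLY valid JSON, with no preamble."
    raise RuntimeError("model never produced valid JSON")
```

A wrapper like this is cheap insurance regardless of which model you call, but the more often a model deviates, the more retries (and cost) it incurs.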

Creative Writing Quality

In creative writing tasks, Gemini Ultra produces competent but often uninspired output. Its prose tends toward a neutral, informational register even when asked for more evocative or stylistic writing. GPT-4 and Claude 3 both produce more varied, engaging creative writing. This may reflect training priorities—Google appears to have optimized more heavily for factual accuracy and reasoning than for literary quality.

Hallucination and Factual Accuracy

While Gemini Ultra's factual accuracy is generally strong, it shares the hallucination problem common to all large language models. It will occasionally fabricate citations, invent statistics, or state plausible-sounding but incorrect facts with confidence. In my testing, its hallucination rate was roughly comparable to GPT-4—not worse, but not meaningfully better either, despite Google's emphasis on grounding and factuality during training.

Safety Overfiltering

Google has implemented conservative safety filters on Gemini, and they occasionally trigger in contexts where they should not. Legitimate questions about medical topics, historical events, or security research sometimes receive unnecessarily cautious responses or outright refusals. This overfiltering is frustrating for professional users and creates a perception that Google is prioritizing liability protection over user utility. OpenAI and Anthropic have found a better balance on this front.

Pricing and Availability

Gemini Ultra is available through the Gemini Advanced subscription, priced at $19.99 per month as part of the Google One AI Premium plan. This subscription includes access to the latest Gemini models, the extended context window, and integration with Google Workspace. The price is comparable to ChatGPT Plus at $20 per month, making price a non-factor in the decision between them.

For API access, Google offers a generous free tier through Google AI Studio and competitive pricing through Vertex AI. The Gemini 1.5 Pro model, which offers capabilities very close to Ultra for most tasks, is priced at rates that significantly undercut GPT-4 Turbo's per-token costs. The 1 million token context window is included at no extra per-token charge, which is remarkable given the computational cost of processing such long contexts.

The Verdict

Gemini Ultra represents Google's successful return to the frontier of AI capabilities. It is a world-class model that matches or exceeds GPT-4 in several important dimensions: multimodal understanding, mathematical reasoning, scientific knowledge, and processing of long documents. Its integration with Google's ecosystem adds practical value that standalone models cannot replicate.

However, it is not an outright winner. GPT-4 remains more reliable for precise instruction following, creative writing, and applications that require predictable output formatting. Claude 3 Opus offers more nuanced, thoughtful responses for complex analytical tasks. The "best" model depends heavily on your specific use case, and anyone who tells you otherwise is either not testing rigorously or has an agenda.

For users who are already in Google's ecosystem and want a powerful AI assistant integrated into their existing workflow, Gemini Advanced is an easy recommendation. For developers building applications, the Gemini API's pricing and context window make it a compelling choice for retrieval-augmented generation, document analysis, and multimodal applications. For general-purpose use where you need the most reliable, versatile AI assistant, the choice between Gemini and GPT-4 is genuinely close, and you would be well-served by either.

What Gemini Ultra proves definitively is that the AI model landscape is no longer a one-horse race. Google has shipped a model that competes at the highest level, and the competitive pressure between Google, OpenAI, Anthropic, and increasingly Meta and Mistral is driving rapid improvement across the board. For users, this competition is unambiguously good news.