Guide
5 min read · February 24, 2026

Gemini 3.1 Pro: #1 APEX-Agents Score (Review)

By Loïc Jané · Founder, Fleece AI

Gemini 3.1 Pro: Google's Most Capable Agentic AI Model

At a Glance: Gemini 3.1 Pro is Google's frontier AI model released February 19, 2026, featuring a 1M token context window, 77.1% ARC-AGI-2 reasoning, 33.5% APEX-Agents (highest of any model), and a customtools variant optimized for multi-step agentic tool execution. Updated February 20, 2026.

Gemini 3.1 Pro is Google's latest frontier AI model (released February 19, 2026). Purpose-built for complex reasoning, multi-step tool use, and agentic workflows, it currently leads the APEX-Agents benchmark — the most demanding test of professional AI agent capabilities.

This guide covers everything you need to know about Gemini 3.1 Pro: benchmarks, capabilities, pricing, and how it compares to GPT-5.2, Claude Opus 4.6, and Gemini 3 Flash for agentic automation. For a full model comparison, see our Best AI Models for Automation 2026 guide.


Key Capabilities

1M Token Context Window

Gemini 3.1 Pro supports up to 1 million input tokens — the largest context window of any production AI model alongside Gemini 3 Flash. This enables processing entire documents, lengthy email threads, multi-page reports, and extensive conversation histories in a single pass — no chunking, no information loss.

77.1% ARC-AGI-2 Reasoning

On ARC-AGI-2, a benchmark that tests an AI model's ability to solve entirely new logic patterns it has never seen before, Gemini 3.1 Pro scored 77.1% — more than double the reasoning performance of its predecessor, Gemini 3 Pro (~37%). This is the highest ARC-AGI-2 score of any production model as of February 2026.

33.5% APEX-Agents — Leading the Agentic Benchmark

APEX-Agents is the most demanding benchmark for professional AI agent work, testing cross-application tasks from investment banking, management consulting, and corporate law. Gemini 3.1 Pro leads the leaderboard at 33.5% — nearly double Gemini 3 Pro's 18.4%, and ahead of Claude Opus 4.6 (29.8%) and GPT-5.2 (23.0%).

Built-In Custom Tool Execution

For a deep dive on how tool calling benchmarks compare across models, see our Best AI Model for Tool Calling 2026 guide.

Gemini 3.1 Pro includes a specialized variant — gemini-3.1-pro-preview-customtools — optimized for multi-step tool execution in agentic workflows. This variant is designed for:

  • Higher tool precision: Fewer incorrect API calls, better parameter extraction
  • Multi-step orchestration: Planning and executing chains of tool calls efficiently
  • Edit-then-test loops: Strong performance on iterative tool usage patterns
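The multi-step orchestration pattern above can be sketched as a simple dispatch loop. The model client is stubbed here so the pattern runs offline; with the real API you would swap `FakeModel` for calls to `gemini-3.1-pro-preview-customtools` and feed its function-call responses into the same loop. All names in this sketch are illustrative, not part of Google's SDK.

```python
# Sketch of a multi-step tool-execution loop: the model proposes tool
# calls, the runtime dispatches them with the extracted parameters, and
# results are accumulated into the history.
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

# Local "tools" the agent can invoke (stand-ins for real app integrations).
def search_inbox(query: str) -> list:
    return [f"email matching {query!r}"]

def post_to_slack(channel: str, text: str) -> str:
    return f"posted to {channel}: {text}"

TOOLS = {"search_inbox": search_inbox, "post_to_slack": post_to_slack}

class FakeModel:
    """Stand-in for the model: emits a fixed plan of tool calls."""
    def __init__(self):
        self.plan = [
            ToolCall("search_inbox", {"query": "invoice"}),
            ToolCall("post_to_slack",
                     {"channel": "#finance", "text": "1 invoice found"}),
        ]

    def next_call(self, history):
        return self.plan[len(history)] if len(history) < len(self.plan) else None

def run_agent(model, max_steps=10):
    history = []
    for _ in range(max_steps):
        call = model.next_call(history)
        if call is None:  # model decided it is done
            break
        result = TOOLS[call.name](**call.args)  # dispatch with extracted params
        history.append((call.name, result))
    return history

history = run_agent(FakeModel())
```

The customtools variant's advertised gains (fewer incorrect calls, better parameter extraction) apply to the model side of this loop; the runtime side stays the same regardless of variant.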

80.6% SWE-Bench Verified

On SWE-Bench Verified (real-world software engineering tasks), Gemini 3.1 Pro scores 80.6% — effectively tied with Claude Opus 4.6 (80.8%) and ahead of Gemini 3 Flash (78%).

69.2% MCP-Atlas

On MCP-Atlas, which measures tool coordination across MCP (Model Context Protocol) servers, Gemini 3.1 Pro leads at 69.2% — ahead of Claude Sonnet 4.6 (61.3%) and Claude Opus 4.6 (60.3%).

65,536 Output Tokens

With up to 65K output tokens, Gemini 3.1 Pro can generate comprehensive reports, detailed summaries, and lengthy code outputs — essential for workflows that produce multi-page documents or complex data analyses.


Agentic Benchmark Comparison

| Benchmark | Gemini 3.1 Pro | GPT-5.2 | Claude Opus 4.6 | Gemini 3 Flash |
|---|---|---|---|---|
| APEX-Agents | 33.5% | 23.0% | 29.8% | 24.0% |
| ARC-AGI-2 | 77.1% | — | 68.8% | — |
| SWE-Bench Verified | 80.6% | 80% | 80.8% | 78% |
| MCP-Atlas | 69.2% | — | 60.3% | — |
| TAU2-Bench | — | 98.7% | — | — |
| Terminal-Bench 2.0 | 68.5% | 64.0% | #1 | — |
| BrowseComp | 85.9% | — | — | — |

Automate with top-ranked AI models: Start free on Fleece AI and deploy your first agent in under 60 seconds with GPT-5.2, Gemini 3 Flash, or Claude Opus 4.6.


Model Comparison

| Feature | Gemini 3.1 Pro | GPT-5.2 | Claude Opus 4.6 | Gemini 3 Flash |
|---|---|---|---|---|
| Provider | Google | OpenAI | Anthropic | Google |
| Released | Feb 2026 | Dec 2025 | Feb 2026 | Dec 2025 |
| Context Window | 1M tokens | 400K tokens | 200K (1M beta) | 1M tokens |
| Output Tokens | 65K | 128K | 128K | 65K |
| Speed | Fast | Fast | Moderate | Very Fast |
| Cost (input) | $2/M tokens | $1.75/M tokens | $5/M tokens | $0.10/M tokens |
| Cost (output) | $12/M tokens | $14/M tokens | $25/M tokens | $0.40/M tokens |
| Best For | Agentic workflows, tool use | Multi-turn conversations, math | Deep analysis, long output | Quick tasks, cost-sensitive |

Best Use Cases for Gemini 3.1 Pro

Multi-App Workflow Automation

Gemini 3.1 Pro's customtools variant and leading APEX-Agents score make it ideal for workflows that chain 5-20 tool calls across Gmail, Slack, Notion, Sheets, Salesforce, and other business apps.

Large Document Processing

The 1M token context window means entire PDFs, spreadsheets, and reports can be processed in a single pass; a 200-page document can be analyzed without chunking or information loss.
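A quick way to sanity-check whether a document fits the window in one pass is a character-based token estimate. The 4-characters-per-token ratio below is a common English-text heuristic, not an exact tokenizer; for precise counts you would use the API's token-counting endpoint. The helper name is illustrative.

```python
# Rough check that a document fits Gemini 3.1 Pro's 1,048,576-token input
# window without chunking.
CONTEXT_LIMIT = 1_048_576

def fits_in_context(text: str, reserved_for_prompt: int = 2_000) -> bool:
    estimated_tokens = len(text) / 4  # ~4 chars per token (heuristic)
    return estimated_tokens + reserved_for_prompt <= CONTEXT_LIMIT

# A 200-page report (~3,000 chars/page, ~600K chars, ~150K tokens) fits easily.
report = "x" * 600_000
```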

Complex Reporting

Gemini 3.1 Pro can cross-reference data from multiple sources (CRM, payment processor, project manager, analytics) and synthesize the insights into formatted reports.

Agentic Coding Tasks

At 80.6% on SWE-Bench Verified and 68.5% on Terminal-Bench 2.0, Gemini 3.1 Pro is highly capable for software engineering workflows, code review, and automated testing.

Research and Analysis

With the highest ARC-AGI-2 score (77.1%) of any model, Gemini 3.1 Pro excels at novel problem-solving and abstract reasoning. Combined with its 1M context window, it can analyze entire codebases, legal contracts, or financial reports in a single prompt — making it ideal for research-heavy workflows that require deep comprehension.

Customer Intelligence Pipelines

Gemini 3.1 Pro's structured output capabilities make it effective for extracting insights from customer feedback, support tickets, and survey data. Teams can build automated pipelines that classify sentiment, identify trends, and generate executive summaries across thousands of data points.
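The downstream half of such a pipeline can be sketched as plain aggregation code. In practice each record's sentiment and topic labels would come from the model's structured (JSON) output; here the classified records are hard-coded so the trend logic is runnable on its own, and all field names are illustrative.

```python
# Aggregate model-classified support tickets into an executive summary.
from collections import Counter

records = [
    {"ticket": "T-101", "sentiment": "negative", "topic": "billing"},
    {"ticket": "T-102", "sentiment": "positive", "topic": "onboarding"},
    {"ticket": "T-103", "sentiment": "negative", "topic": "billing"},
]

def summarize(classified: list) -> dict:
    sentiments = Counter(r["sentiment"] for r in classified)
    topics = Counter(r["topic"] for r in classified)
    top_topic, top_count = topics.most_common(1)[0]
    return {
        "total": len(classified),
        "sentiment_breakdown": dict(sentiments),
        "top_topic": top_topic,
        "top_topic_count": top_count,
    }

summary = summarize(records)
```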

Try Gemini-powered automation: Start free on Fleece AI and deploy AI agents powered by top-tier models.


Pricing

| Tier | Input | Output |
|---|---|---|
| Under 200K tokens | $2 / 1M tokens | $12 / 1M tokens |
| Over 200K tokens | $4 / 1M tokens | $24 / 1M tokens |

Gemini 3.1 Pro offers strong price-to-performance: $2/M input tokens is 2.5x cheaper than Claude Opus 4.6 ($5/M) while leading most agentic benchmarks.
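The tiered pricing works out as follows for a single request. This sketch assumes the tier is selected by prompt size, which matches Google's usual long-context pricing convention; the function name is illustrative.

```python
# Per-request cost under Gemini 3.1 Pro's tiered pricing:
# $2/$12 per 1M input/output tokens up to 200K input, $4/$24 beyond.
def request_cost(input_tokens: int, output_tokens: int) -> float:
    if input_tokens <= 200_000:
        in_rate, out_rate = 2.00, 12.00
    else:
        in_rate, out_rate = 4.00, 24.00
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# A 50K-token prompt producing a 4K-token report:
# 0.05 * $2 + 0.004 * $12 = $0.148
cost = request_cost(50_000, 4_000)
```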


Technical Details for Developers

  • Model ID: gemini-3.1-pro-preview (standard), gemini-3.1-pro-preview-customtools (agentic)
  • Provider: Google AI / Vertex AI
  • Supports: Text, vision, tool use, structured output, thinking (3 levels)
  • Context: 1,048,576 input tokens, 65,536 output tokens
  • API Access: Gemini API, Vertex AI, Google AI Studio, Gemini CLI

According to Google, Gemini 3.1 Pro is part of the Gemini model family designed for enterprise AI deployment with industry-leading context windows and agentic capabilities.


Frequently Asked Questions

Is Gemini 3.1 Pro available on Fleece AI?

Fleece AI currently offers GPT-5.2 (free tier) and Claude Opus 4.6 (Pro plan). Gemini 3 Flash is available for speed-optimized tasks. Gemini 3.1 Pro access depends on Google's API availability and may be added as a premium model option.

What is Gemini 3.1 Pro's customtools variant?

Gemini 3.1 Pro's customtools variant (gemini-3.1-pro-preview-customtools) is a specialized endpoint optimized for multi-step tool execution in agentic workflows. It delivers higher tool precision, fewer incorrect API calls, and better parameter extraction compared to the standard variant. It is part of Google's "Antigravity" agentic stack on Vertex AI.

How does Gemini 3.1 Pro compare to Claude Opus 4.6?

Gemini 3.1 Pro leads on APEX-Agents (33.5% vs 29.8%), ARC-AGI-2 (77.1% vs 68.8%), and MCP-Atlas (69.2% vs 60.3%). Claude Opus 4.6 leads on Terminal-Bench 2.0 (agentic coding), offers 128K output tokens (vs 65K), and has the best caching economics when combined with Batch API. Gemini 3.1 Pro costs $2/M input vs $5/M for Claude Opus 4.6.

How does Gemini 3.1 Pro compare to GPT-5.2?

Gemini 3.1 Pro offers a larger context window (1M vs 400K tokens) and leads on agentic benchmarks (33.5% APEX-Agents vs 23.0%). GPT-5.2 excels at structured output, math (100% on AIME 2025), and multi-turn tool conversations (98.7% TAU2-Bench). GPT-5.2 also offers 128K output tokens vs Gemini 3.1 Pro's 65K.

What is the APEX-Agents benchmark?

APEX-Agents (AI Productivity Index) is the most demanding benchmark for professional AI agent work, released in early 2026. It tests whether AI agents can execute long-horizon, cross-application tasks from investment banking, management consulting, and corporate law — including navigating chat logs, PDFs, spreadsheets, and calendar items.

