TL;DR: MCP (Model Context Protocol) promised to be the universal standard for AI agent integrations. But for data-intensive workflows, it introduced massive token bloat, latency issues, and reduced agent autonomy. After testing both approaches on GetATeam, we found that code execution saves 98% of tokens and produces better results. Here's why skills beat MCP servers for production workloads, and when you should still use MCP.
The MCP Hype vs Reality
When Anthropic introduced the Model Context Protocol, everyone jumped on board. Finally, a universal standard for AI agent integrations! Connect once, unlock an entire ecosystem of tools.
The promise was beautiful: build your agent, plug in MCP servers, and instantly access Google Drive, Salesforce, Slack, databases, you name it.
The reality? For complex, data-heavy workflows, token consumption exploded. Latency went through the roof. Agents became slower and more expensive.
At GetATeam, we built agents both ways: MCP-based and code-execution-based. The difference was shocking.
Same task. Same agent. Different approach:
- MCP version: 50,000 tokens consumed
- Code execution version: 1,000 tokens consumed
That's a 98% reduction in token usage. Not 10%. Not 50%. Ninety-eight percent.
And it wasn't just cheaper, it was better. The code-execution agent produced higher quality results and worked more autonomously.
So what went wrong with MCP?
Problem 1: Tool Definitions Overload Context
Here's how MCP works:
You connect your agent to an MCP server. That server exposes 20-30 tools. Each tool has a name, description, and parameters.
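For a sense of scale, here's roughly what a single tool definition looks like (an illustrative example following the MCP shape of name, description, and JSON Schema parameters, not copied from any real server):

```typescript
// Illustrative MCP-style tool definition (hypothetical names and schema).
// Serialized into the prompt, a definition like this costs ~100-200 tokens.
const getDocumentTool = {
  name: 'get_document',
  description: 'Fetch the full contents of a Google Drive document by ID.',
  inputSchema: {
    type: 'object',
    properties: {
      doc_id: { type: 'string', description: 'The Drive document ID' },
    },
    required: ['doc_id'],
  },
};
```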
Most developers don't connect just one MCP server. They connect 5-6 different servers.
The math:
- 6 MCP servers × 25 tools each = 150 tools
- Each tool definition: ~100-200 tokens
- Total overhead: 15,000-30,000 tokens
And your agent might only need to use ONE tool.
But it still has to load all 150 tool definitions into its context window. Every. Single. Time.
Important caveat: You could implement selective tool loading in MCP, caching definitions or using a search endpoint to filter tools. But the protocol's design doesn't encourage this pattern. Most implementations we've seen load all available tools upfront, and that's what we benchmarked against.
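For illustration, a selective-loading layer could look something like this sketch, where listTools() mirrors MCP's tools/list call and the keyword filter is our own addition (the interfaces are simplified stand-ins, not the official SDK types):

```typescript
// Simplified stand-ins for MCP types (illustrative, not the official SDK).
interface McpTool { name: string; description: string; }
interface McpServer { listTools(): Promise<{ tools: McpTool[] }>; }

// Filter tool definitions before they ever reach the prompt.
async function loadRelevantTools(
  servers: McpServer[],
  taskKeywords: string[],
): Promise<McpTool[]> {
  const relevant: McpTool[] = [];
  for (const server of servers) {
    const { tools } = await server.listTools();
    for (const tool of tools) {
      const text = `${tool.name} ${tool.description}`.toLowerCase();
      if (taskKeywords.some((kw) => text.includes(kw.toLowerCase()))) {
        relevant.push(tool); // only matching definitions enter the context
      }
    }
  }
  return relevant;
}
```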
Loading everything upfront causes three problems:
1. Higher costs. You're paying for 30,000 tokens of tool definitions before the agent even starts the task.
2. Increased latency. Larger context means slower inference, so your agent takes longer to respond.
3. More hallucinations. Too much irrelevant context confuses the model. It might call the wrong tool or mix up parameters.
Problem 2: Intermediate Results Bloat Context
Let's say your agent needs to read a Google Doc transcript.
With MCP:
- Agent calls get_document(doc_id)
- MCP returns the entire 50,000-token transcript
- Agent reads it all into context
- Agent extracts the 2 paragraphs it actually needed
You just consumed 50,000+ tokens to get 200 tokens of useful information.
Now imagine the agent is coordinating a multi-step workflow:
- Read transcript (50k tokens)
- Query database (20k tokens)
- Fetch Salesforce data (15k tokens)
- Read email thread (10k tokens)
Each intermediate result bloats the context window. You're paying for massive token usage, and the agent is drowning in irrelevant data.
When this doesn't apply: If your agent needs to perform deep semantic analysis of the full document (summarization, sentiment analysis, finding contradictions), you'll need those tokens in context regardless of your approach. This optimization applies to data retrieval, filtering, and aggregation tasks, not tasks that require reasoning about entire large documents.
The Alternative: Code Execution with Skills
Anthropic quietly published a blog post suggesting a different approach: let agents write and execute code instead of calling pre-defined MCP tools.
Here's how it works:
Structure:
skills/
├── google-drive/
│ ├── get-document.ts
│ ├── upload-file.ts
│ └── list-files.ts
├── salesforce/
│ ├── create-lead.ts
│ ├── update-contact.ts
│ └── get-account.ts
└── slack/
├── send-message.ts
└── read-channel.ts
Each skill is a simple TypeScript function. When the agent needs a tool, it:
- Discovers the skill (via search or directory listing)
- Imports only that one skill
- Executes the code
Example:
Instead of loading 150 tool definitions, the agent does this:
// Import only what's needed
import * as fs from 'node:fs';
import { getDocument } from './skills/google-drive/get-document.ts';
// Fetch the document
const transcript = await getDocument('doc_abc123');
// Save to file system (not context!)
fs.writeFileSync('/tmp/transcript.txt', transcript);
// Extract only what's needed
const firstParagraph = transcript.split('\n\n')[0];
// Use it
console.log(firstParagraph);
What changed:
- Before (MCP): 50,000 tokens in context
- After (Code): 200 tokens in context (just the paragraph)
The rest of the transcript is stored in the file system, not the LLM context window.
Why This Works Better
1. Massive Token Savings
MCP approach:
- Tool definitions: 30,000 tokens
- Intermediate results: 95,000 tokens
- Total: 125,000 tokens
Code execution approach:
- Import statements: 100 tokens
- Actual data needed: 500 tokens
- Total: 600 tokens
That's over a 99% reduction on this breakdown (and 98% in our benchmark below). At $3/M tokens (Claude Sonnet), that's the difference between $0.38 and $0.002 per task.
At scale, this matters. If you run 10,000 tasks/month:
- MCP cost: $3,800/month
- Code execution cost: $20/month
2. Progressive Disclosure
With MCP, you're limited by context window size. Want to give your agent access to 1000 tools? Good luck fitting that into 200k tokens.
With code execution, there's no limit. You can have 10,000 skills.
How? The agent uses a search tool to discover what it needs:
// Agent searches for relevant skills
const skills = searchSkills('salesforce customer data');
// Returns: ['salesforce/get-account.ts', 'salesforce/get-contact.ts']
// Import only what's relevant (dynamic import, since the path is a runtime value)
const { getAccount } = await import(`./skills/${skills[0]}`);
The agent only loads what it needs, when it needs it.
Technical note: Our searchSkills() implementation uses keyword matching across skill filenames and inline documentation comments. For larger skill libraries (1000+ skills), we're exploring embedding-based semantic search, though we haven't found keyword search to be a bottleneck yet at our current scale (~200 skills).
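For the curious, an embedding-based version might look like this sketch, assuming a hypothetical embed() that returns a vector for a string (e.g., backed by an embeddings API) and a pre-built index of skill descriptions:

```typescript
// Hypothetical embedding-based skill search (embed() is assumed, not shown).
declare function embed(text: string): Promise<number[]>;

interface SkillEntry { path: string; vector: number[]; }

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function searchSkillsSemantic(query: string, index: SkillEntry[], topK = 5) {
  const queryVector = await embed(query);
  return index
    .map((entry) => ({
      path: entry.path,
      score: cosineSimilarity(queryVector, entry.vector),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK); // highest-similarity skills first
}
```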
3. Privacy and Data Control
Enterprise clients hate exposing sensitive data to third-party LLMs.
With MCP, when your agent calls get_customer_data(), the full response (including emails, phone numbers, SSNs) goes into the LLM context.
With code execution, you can add a harness layer:
// Original skill
export async function getCustomerData(id: string) {
const data = await db.query('SELECT * FROM customers WHERE id = ?', [id]);
return data;
}
// Wrapped with privacy harness
export async function getCustomerDataAnonymized(id: string) {
const data = await getCustomerData(id);
// Anonymize before returning to LLM
return {
...data,
email: anonymize(data.email), // user@example.com → u***@e***.com
phone: anonymize(data.phone), // +1-555-1234 → +1-***-****
ssn: '[REDACTED]'
};
}
The agent never sees sensitive data, but it can still work with customer records.
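The anonymize() helper carries the weight here. A minimal sketch that matches the masking shown in the comments above (illustrative, not our production implementation):

```typescript
// Illustrative masking helper matching the examples above.
// user@example.com → u***@e***.com ; +1-555-1234 → +1-***-****
function anonymize(value: string): string {
  if (value.includes('@')) {
    const [local, domain] = value.split('@');
    const [host, ...rest] = domain.split('.');
    return `${local[0]}***@${host[0]}***.${rest.join('.')}`;
  }
  // Keep the country code, mask the rest of phone-like strings
  return value.replace(/^(\+\d+)-.*$/, (_match, countryCode) => `${countryCode}-***-****`);
}
```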
4. State Persistence and Skill Evolution
This is the game-changer.
With MCP, tools are static. You connect to a server, you get the tools it provides, end of story.
With code execution, the agent can create its own skills.
Example from GetATeam:
// Agent realizes it needs a skill that doesn't exist
// It writes one and saves it
const skillCode = `
export async function analyzeCodeQuality(repoUrl: string) {
// Clone repo
await exec('git clone ' + repoUrl + ' /tmp/repo');
// Run linter
const lintResults = await exec('cd /tmp/repo && eslint .');
// Run tests
const testResults = await exec('cd /tmp/repo && npm test');
// Analyze
return {
lintIssues: lintResults.split('\n').length,
testsPassing: testResults.includes('passing'),
complexity: calculateComplexity(lintResults)
};
}
`;
fs.writeFileSync('./skills/custom/analyze-code-quality.ts', skillCode);
// Now the agent can load and reuse this skill (dynamic import, since the file was just written)
const { analyzeCodeQuality } = await import('./skills/custom/analyze-code-quality.ts');
The agent evolves. It builds tools for itself. Over time, it accumulates a library of specialized skills.
This is impossible with static MCP servers.
The GetATeam Implementation
We tested both approaches on real production tasks.
Test case: Research a competitor, extract key features, generate comparison report.
MCP version:
- Connected 4 MCP servers (web scraping, database, document generation, email)
- 47 tools loaded into context
- Task consumed 87,000 tokens
- Execution time: 45 seconds
- Cost: $0.26
Code execution version:
- Skills folder with 50+ TypeScript functions
- Agent discovered and imported 3 skills
- Task consumed 1,800 tokens
- Execution time: 12 seconds
- Cost: $0.005
Results:
- 98% token reduction
- 73% faster
- 98% cost reduction
- Better output quality (agent had more context budget for reasoning)
The code-execution agent also handled edge cases better because it could write custom logic on the fly.
Benchmark methodology: Token counts include both input (system prompt + tool definitions + conversation history) and output (tool calls + responses + final answer). We used Claude 3.5 Sonnet (claude-3-5-sonnet-20241022) with temperature 0.7 for both tests. The task was run 10 times and results were averaged. MCP servers used: filesystem MCP, postgres MCP, browserbase MCP, and a custom email MCP server.
Security Architecture
The elephant in the room: letting agents execute arbitrary code is a security nightmare if done wrong.
Here's how we approach it at GetATeam:
VM-Level Isolation
Each customer (or group of agents belonging to the same customer) runs in a dedicated VM. This provides strong isolation boundaries:
- Filesystem isolation: One customer's agents can't access another's data
- Network isolation: VMs have separate network namespaces
- Resource limits: CPU, memory, and disk quotas prevent resource exhaustion
- Snapshot rollback: If an agent corrupts its environment, we can restore to last known good state
Capability-Based Permissions
Skills are executed with minimal privileges:
// Skills can only access:
// - Their own skill directory (read-only)
// - /tmp workspace (read-write)
// - Whitelisted network endpoints
// - Specific environment variables
// Skills CANNOT:
// - Access system binaries outside approved list
// - Bind to privileged ports
// - Modify system files
// - Execute sudo commands
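As one concrete example, the network whitelist can be enforced in the execution layer by handing skills a restricted fetch instead of the global one (a sketch; in practice this sits alongside VM-level firewall rules, and the host list is illustrative):

```typescript
// Illustrative capability wrapper: skills receive this instead of global fetch.
const ALLOWED_HOSTS = new Set<string>([
  'api.github.com',       // hypothetical whitelist entries
  'www.googleapis.com',
]);

async function restrictedFetch(url: string, init?: RequestInit): Promise<Response> {
  const host = new URL(url).hostname;
  if (!ALLOWED_HOSTS.has(host)) {
    throw new Error(`Network access to ${host} is not whitelisted`);
  }
  return fetch(url, init);
}
```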
Code Validation
Before execution, we scan agent-generated code:
const BLOCKED_PATTERNS = [
/rm\s+-rf/, // Recursive deletion
/sudo/, // Privilege escalation
/eval\(/, // Dynamic code execution
/child_process/, // Unrestricted shell access
/\bexec\b.*&&.*rm/, // Chained dangerous commands
];
function validateCode(code: string) {
for (const pattern of BLOCKED_PATTERNS) {
if (pattern.test(code)) {
throw new Error(`Blocked dangerous pattern: ${pattern}`);
}
}
// AST-based validation for more sophisticated threats
// (parse() and validateAST() are internal helpers, not shown here)
const ast = parse(code);
validateAST(ast);
}
Post-Execution Cleanup
After each task:
- /tmp workspace is wiped
- Network connections are closed
- Orphaned processes are killed
- File handles are released
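A minimal sketch of the workspace wipe, assuming a per-task directory layout under /tmp (process and file-handle cleanup happen at the VM layer):

```typescript
import * as fs from 'node:fs';

// Illustrative post-task cleanup: remove the task's scratch workspace entirely.
function cleanupWorkspace(taskId: string): void {
  const workspace = `/tmp/workspace-${taskId}`; // hypothetical naming scheme
  fs.rmSync(workspace, { recursive: true, force: true });
}
```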
Incident Response
Despite precautions, things can go wrong. Our monitoring:
- Logs all executed code
- Tracks resource usage patterns
- Alerts on anomalous behavior (sudden CPU spike, unusual network traffic)
- Maintains audit trail for forensics
Has this caused production issues? Yes. In early testing, an agent wrote a skill that entered an infinite loop, consuming 100% CPU for 20 minutes before we caught it. We added timeout guards (max 15 minutes per skill execution) and resource monitoring.
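The timeout guard itself is simple to sketch: race the skill's promise against a timer (illustrative; a production version would also kill the underlying process rather than just abandoning the promise):

```typescript
// Illustrative 15-minute timeout guard around a skill execution.
const MAX_SKILL_RUNTIME_MS = 15 * 60 * 1000;

async function runWithTimeout<T>(skill: () => Promise<T>): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error('Skill exceeded 15-minute execution limit')),
      MAX_SKILL_RUNTIME_MS,
    );
  });
  try {
    return await Promise.race([skill(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```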
When to Still Use MCP
Does this mean MCP is dead? No.
MCP's real value is ecosystem-wide standardization. If 1000 tools expose MCP interfaces, your agent can use them all without custom integration code. For use cases where token overhead is negligible, this interoperability is genuinely valuable.
There are still valid use cases:
1. Simple, Well-Defined APIs
For straightforward integrations where you just need to send data to an API, MCP works fine.
Example: Customer support ticket creation
User message → Agent → MCP tool: create_ticket() → Done
No complex transformations. No multi-step workflows. Just a simple API call.
If your task involves ≤5 tool calls and returns minimal data, MCP's overhead is acceptable.
2. Third-Party Integrations You Don't Control
If you're integrating with an external service that provides an MCP server, and you don't need custom logic, use it.
Why reinvent the wheel? If the MCP server works and token usage is reasonable, stick with it.
The value here is maintenance burden transfer. When Stripe updates their API, their MCP server updates too. You don't need to touch your code.
3. Prototyping and Demos
MCP is faster to set up initially. No need to build a code execution sandbox.
For quick prototypes or demos, MCP gets you up and running fast.
4. Non-Technical Users
If you're building a no-code agent builder for non-developers, MCP provides a simpler abstraction.
Users can "connect integrations" without writing code.
When to Use Code Execution
Use code execution (skills) when:
1. Token Efficiency Matters
If your agent handles large data volumes or runs many tasks, token costs add up fast.
Code execution pays for itself immediately.
Threshold: If you're processing >10k tokens of intermediate data per task, code execution will save significant money.
2. You Need Custom Logic
Anytime you need to transform data, filter results, or combine multiple APIs, code execution gives you full control.
MCP forces you into the constraints of pre-defined tools.
3. Long-Running or Multi-Step Workflows
For complex tasks that span hours or days, code execution with file system persistence is essential.
Don't bloat the context window with intermediate results.
4. Privacy is Critical
Enterprise clients, healthcare, finance, any domain with strict data privacy requirements.
Code execution lets you anonymize, filter, and control exactly what the LLM sees.
5. You Want Agents That Evolve
If your goal is autonomous agents that learn and adapt, code execution enables skill creation and evolution.
MCP keeps agents static.
The Practical Reality: Hybrid Approach
At GetATeam, we use both.
MCP for:
- Simple integrations (Slack notifications, calendar events)
- Third-party services with official MCP servers
- Quick prototypes
- Tasks with <10 tool calls and minimal data transfer
Code execution for:
- Data-heavy tasks (document processing, database queries)
- Multi-step workflows (research → analysis → report generation)
- Custom business logic
- Agent skill evolution
- Privacy-sensitive operations
The key is knowing when to use each approach.
Decision heuristic:
- Will this task process >10k tokens of intermediate data? → Code execution
- Do I need custom data transformation? → Code execution
- Is there an official MCP server that does exactly what I need? → MCP
- Is this a quick prototype? → MCP
- Everything else: evaluate token cost vs. implementation time
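If it helps to see the same heuristic in executable form, a toy encoding (field names and thresholds are ours, mirroring the list above):

```typescript
// Toy encoding of the decision heuristic above (inputs are illustrative).
interface TaskProfile {
  intermediateTokens: number;      // estimated tokens of intermediate data
  needsCustomTransforms: boolean;  // filtering, joining, reshaping data
  officialMcpServerFits: boolean;  // an existing MCP server does exactly this
  isPrototype: boolean;
}

function chooseApproach(task: TaskProfile): 'code-execution' | 'mcp' | 'evaluate' {
  if (task.intermediateTokens > 10_000) return 'code-execution';
  if (task.needsCustomTransforms) return 'code-execution';
  if (task.officialMcpServerFits) return 'mcp';
  if (task.isPrototype) return 'mcp';
  return 'evaluate'; // weigh token cost vs. implementation time
}
```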
Implementation Guide: Building Skills
If you want to implement code execution skills, here's how:
Step 1: Set Up a Secure Sandbox
You need an isolated environment where the agent can execute code safely.
Options:
- VM-based sandboxes (strong isolation, higher overhead) ← We use this
- Docker containers (lightweight, good isolation)
- Serverless functions (Lambda, Cloud Run) (easy scaling, cold start latency)
At GetATeam, we use dedicated VMs per customer with:
- Ubuntu 22.04 minimal
- Docker for additional containerization
- Firewall rules blocking all outbound except whitelisted endpoints
- Quotas: 4 CPU cores, 8GB RAM, 50GB disk per VM
Step 2: Structure Your Skills
Create a clear directory structure:
skills/
├── core/ # Basic utilities
├── integrations/ # API wrappers
├── data/ # Data processing
└── custom/ # Agent-generated skills
Each skill is a simple function:
// skills/integrations/github/get-repo.ts
export async function getRepo(owner: string, repo: string) {
  const response = await fetch(`https://api.github.com/repos/${owner}/${repo}`);
  if (!response.ok) {
    throw new Error(`GitHub API error: ${response.status}`);
  }
  return response.json();
}
Step 3: Add Skill Discovery
Give your agent a way to find skills:
// skills/core/search-skills.ts
export function searchSkills(query: string) {
// Simple keyword search across skill filenames and comments
const allSkills = listAllSkills();
return allSkills.filter(skill =>
skill.name.includes(query) ||
skill.description.includes(query)
);
}
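searchSkills() leans on a listAllSkills() helper that isn't shown above; a minimal version could walk the skills directory and pull a description from each file's first comment line (a sketch, with the comment convention assumed):

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

interface SkillInfo { name: string; path: string; description: string; }

// Sketch: walk skills/ recursively and build searchable entries.
// Assumed convention: each skill file opens with a `// description` comment.
function listAllSkills(dir = './skills'): SkillInfo[] {
  const results: SkillInfo[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      results.push(...listAllSkills(fullPath));
    } else if (entry.name.endsWith('.ts')) {
      const firstLine = fs.readFileSync(fullPath, 'utf8').split('\n')[0];
      results.push({
        name: entry.name.replace(/\.ts$/, ''),
        path: fullPath,
        description: firstLine.replace(/^\/\/\s*/, ''),
      });
    }
  }
  return results;
}
```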
Step 4: Enable File System Persistence
Let the agent save intermediate results:
// The agent can save large data to disk instead of keeping it in context
fs.writeFileSync('/tmp/large-dataset.json', JSON.stringify(data));
// Later, process it without re-loading everything into context
const loaded = JSON.parse(fs.readFileSync('/tmp/large-dataset.json', 'utf8'));
const summary = extractSummary(loaded); // Only the summary goes to the LLM
Step 5: Add Safety Guardrails
Prevent the agent from doing dangerous things:
// Multi-layer validation
function validateCode(code: string) {
// 1. Pattern-based blocking
const BLOCKED_PATTERNS = [
/rm\s+-rf/,
/sudo/,
/shutdown/,
/reboot/,
/\/etc\//, // System config access
];
for (const pattern of BLOCKED_PATTERNS) {
if (pattern.test(code)) {
throw new Error(`Blocked dangerous pattern: ${pattern}`);
}
}
// 2. AST analysis for deeper threats
const ast = parseTypeScript(code);
checkForDangerousAPIs(ast);
// 3. Static analysis
const lintResults = eslint.verify(code, securityRules);
if (lintResults.length > 0) {
throw new Error(`Security lint failed: ${lintResults}`);
}
}
The Future of Agent Tooling
We're seeing a shift in production AI systems toward code execution for data-intensive workloads.
Why? Because abstractions reduce autonomy.
Every layer you add between the agent and the actual task reduces its ability to handle edge cases and adapt.
MCP is an abstraction on top of APIs. It constrains what the agent can do.
Code execution removes that constraint. The agent works directly with the underlying systems.
And as LLMs get better at generating code (which they're already excellent at), this approach becomes more reliable and powerful.
In our production experience at GetATeam, code-execution agents:
- Consume 98% fewer tokens (for data-heavy tasks)
- Run 70%+ faster
- Handle edge cases better
- Evolve and improve over time
MCP still has its place. But for production AI systems that need to be efficient, autonomous, and adaptable, skills beat MCP servers for most complex workflows.
The right question isn't "MCP vs. code execution?" It's "which tool for which job?"
Want to see this in action? GetATeam agents use code execution for all complex workflows. We've built 200+ production skills and our agents create new ones as needed. The architecture is battle-tested on thousands of real tasks.
About the author: Joseph Benguira is the CTO and co-founder of GetATeam, where AI agents execute real work autonomously. He's built production AI systems since the early days of LLMs and has strong opinions about abstractions, complexity, and what actually works in production.