TL;DR: MCP (Model Context Protocol) promised to be the universal standard for AI agent integrations. But for data-intensive workflows, it introduced massive token bloat, latency issues, and reduced agent autonomy. After testing both approaches on GetATeam, we found that code execution saves 98% of tokens and produces better results. Here's why skills beat MCP servers for production workloads, and when you should still use MCP.
The MCP Hype vs Reality
When Anthropic introduced the Model Context Protocol, everyone jumped on board. Finally, a universal standard for AI agent integrations! Connect once, unlock an entire ecosystem of tools.
The promise was beautiful: build your agent, plug in MCP servers, and instantly access Google Drive, Salesforce, Slack, databases, you name it.
The reality? For complex, data-heavy workflows, token consumption exploded. Latency went through the roof. Agents became slower and more expensive.
At GetATeam, we built agents both ways: MCP-based and code-execution-based. The difference was shocking.
Same task. Same agent. Different approach:
- MCP version: 50,000 tokens consumed
- Code execution version: 1,000 tokens consumed
That's a 98% reduction in token usage. Not 10%. Not 50%. Ninety-eight percent.
And it wasn't just cheaper, it was better. The code-execution agent produced higher quality results and worked more autonomously.
So what went wrong with MCP?
Problem 1: Tool Definitions Overload Context
Here's how MCP works:
You connect your agent to an MCP server. That server exposes 20-30 tools. Each tool has a name, description, and parameters.
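For a sense of scale, here's roughly what a single tool definition looks like (an illustrative example following the MCP shape of name, description, and JSON Schema parameters, not copied from any real server):

```typescript
// Illustrative MCP-style tool definition (hypothetical names and schema).
// Serialized into the prompt, a definition like this costs ~100-200 tokens.
const getDocumentTool = {
  name: 'get_document',
  description: 'Fetch the full contents of a Google Drive document by ID.',
  inputSchema: {
    type: 'object',
    properties: {
      doc_id: { type: 'string', description: 'The Drive document ID' },
    },
    required: ['doc_id'],
  },
};
```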
Most developers don't connect just one MCP server. They connect 5-6 different servers.
The math:
- 6 MCP servers × 25 tools each = 150 tools
- Each tool definition: ~100-200 tokens
- Total overhead: 15,000-30,000 tokens
And your agent might only need to use ONE tool.
But it still has to load all 150 tool definitions into its context window. Every. Single. Time.
Important caveat: You could implement selective tool loading in MCP, caching definitions or using a search endpoint to filter tools. But the protocol's design doesn't encourage this pattern. Most implementations we've seen load all available tools upfront, and that's what we benchmarked against.
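For illustration, a selective-loading layer could look something like this sketch, where listTools() mirrors MCP's tools/list call and the keyword filter is our own addition (the interfaces are simplified stand-ins, not the official SDK types):

```typescript
// Simplified stand-ins for MCP types (illustrative, not the official SDK).
interface McpTool { name: string; description: string; }
interface McpServer { listTools(): Promise<{ tools: McpTool[] }>; }

// Filter tool definitions before they ever reach the prompt.
async function loadRelevantTools(
  servers: McpServer[],
  taskKeywords: string[],
): Promise<McpTool[]> {
  const relevant: McpTool[] = [];
  for (const server of servers) {
    const { tools } = await server.listTools();
    for (const tool of tools) {
      const text = `${tool.name} ${tool.description}`.toLowerCase();
      if (taskKeywords.some((kw) => text.includes(kw.toLowerCase()))) {
        relevant.push(tool); // only matching definitions enter the context
      }
    }
  }
  return relevant;
}
```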
Loading everything upfront causes three problems:
1. Higher costs. You're paying for 30,000 tokens of tool definitions before the agent even starts the task.
2. Increased latency. Larger context means slower inference, so your agent takes longer to respond.
3. More hallucinations. Too much irrelevant context confuses the model. It might call the wrong tool or mix up parameters.
Problem 2: Intermediate Results Bloat Context
Let's say your agent needs to read a Google Doc transcript.
With MCP:
- Agent calls get_document(doc_id)
- MCP returns the entire 50,000-token transcript
- Agent reads it all into context
- Agent extracts the 2 paragraphs it actually needed
You just consumed 50,000+ tokens to get 200 tokens of useful information.
Now imagine the agent is coordinating a multi-step workflow:
- Read transcript (50k tokens)
- Query database (20k tokens)
- Fetch Salesforce data (15k tokens)
- Read email thread (10k tokens)
Each intermediate result bloats the context window. You're paying for massive token usage, and the agent is drowning in irrelevant data.
When this doesn't apply: If your agent needs to perform deep semantic analysis of the full document (summarization, sentiment analysis, finding contradictions), you'll need those tokens in context regardless of your approach. This optimization applies to data retrieval, filtering, and aggregation tasks, not tasks that require reasoning about entire large documents.
The Alternative: Code Execution with Skills
Anthropic quietly published a blog post suggesting a different approach: let agents write and execute code instead of calling pre-defined MCP tools.
Here's how it works:
Structure:
skills/
├── google-drive/
│ ├── get-document.ts
│ ├── upload-file.ts
│ └── list-files.ts
├── salesforce/
│ ├── create-lead.ts
│ ├── update-contact.ts
│ └── get-account.ts
└── slack/
├── send-message.ts
└── read-channel.ts
Each skill is a simple TypeScript function. When the agent needs a tool, it:
- Discovers the skill (via search or directory listing)
- Imports only that one skill
- Executes the code
Example:
Instead of loading 150 tool definitions, the agent does this:
// Import only what's needed
import * as fs from 'node:fs';
import { getDocument } from './skills/google-drive/get-document.ts';
// Fetch the document
const transcript = await getDocument('doc_abc123');
// Save to file system (not context!)
fs.writeFileSync('/tmp/transcript.txt', transcript);
// Extract only what's needed
const firstParagraph = transcript.split('\n\n')[0];
// Use it
console.log(firstParagraph);
What changed:
- Before (MCP): 50,000 tokens in context
- After (Code): 200 tokens in context (just the paragraph)
The rest of the transcript is stored in the file system, not the LLM context window.
Why This Works Better
1. Massive Token Savings
MCP approach:
- Tool definitions: 30,000 tokens
- Intermediate results: 95,000 tokens
- Total: 125,000 tokens
Code execution approach:
- Import statements: 100 tokens
- Actual data needed: 500 tokens
- Total: 600 tokens
That's over a 99% reduction on this breakdown (and 98% in our benchmark below). At $3/M tokens (Claude Sonnet), that's the difference between $0.38 and $0.002 per task.
At scale, this matters. If you run 10,000 tasks/month:
- MCP cost: $3,800/month
- Code execution cost: $20/month
2. Progressive Disclosure
With MCP, you're limited by context window size. Want to give your agent access to 1000 tools? Good luck fitting that into 200k tokens.
With code execution, there's no limit. You can have 10,000 skills.
How? The agent uses a search tool to discover what it needs:
// Agent searches for relevant skills
const skills = searchSkills('salesforce customer data');
// Returns: ['salesforce/get-account.ts', 'salesforce/get-contact.ts']
// Import only what's relevant (dynamic import, since the path is a runtime value)
const { getAccount } = await import(`./skills/${skills[0]}`);
The agent only loads what it needs, when it needs it.
Technical note: Our searchSkills() implementation uses keyword matching across skill filenames and inline documentation comments. For larger skill libraries (1000+ skills), we're exploring embedding-based semantic search, though we haven't found keyword search to be a bottleneck yet at our current scale (~200 skills).
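For the curious, an embedding-based version might look like this sketch, assuming a hypothetical embed() that returns a vector for a string (e.g., backed by an embeddings API) and a pre-built index of skill descriptions:

```typescript
// Hypothetical embedding-based skill search (embed() is assumed, not shown).
declare function embed(text: string): Promise<number[]>;

interface SkillEntry { path: string; vector: number[]; }

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function searchSkillsSemantic(query: string, index: SkillEntry[], topK = 5) {
  const queryVector = await embed(query);
  return index
    .map((entry) => ({
      path: entry.path,
      score: cosineSimilarity(queryVector, entry.vector),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK); // highest-similarity skills first
}
```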
3. Privacy and Data Control
Enterprise clients hate exposing sensitive data to third-party LLMs.
With MCP, when your agent calls get_customer_data(), the full response (including emails, phone numbers, SSNs) goes into the LLM context.
With code execution, you can add a harness layer:
// Original skill
export async function getCustomerData(id: string) {
const data = await db.query('SELECT * FROM customers WHERE id = ?', [id]);
return data;
}
// Wrapped with privacy harness
export async function getCustomerDataAnonymized(id: string) {
const data = await getCustomerData(id);
// Anonymize before returning to LLM
return {
...data,
email: anonymize(data.email), // user@example.com → u***@e***.com
phone: anonymize(data.phone), // +1-555-1234 → +1-***-****
ssn: '[REDACTED]'
};
}
The agent never sees sensitive data, but it can still work with customer records.
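The anonymize() helper carries the weight here. A minimal sketch that matches the masking shown in the comments above (illustrative, not our production implementation):

```typescript
// Illustrative masking helper matching the examples above.
// user@example.com → u***@e***.com ; +1-555-1234 → +1-***-****
function anonymize(value: string): string {
  if (value.includes('@')) {
    const [local, domain] = value.split('@');
    const [host, ...rest] = domain.split('.');
    return `${local[0]}***@${host[0]}***.${rest.join('.')}`;
  }
  // Keep the country code, mask the rest of phone-like strings
  return value.replace(/^(\+\d+)-.*$/, (_match, countryCode) => `${countryCode}-***-****`);
}
```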
4. State Persistence and Skill Evolution
This is the game-changer.
With MCP, tools are static. You connect to a server, you get the tools it provides, end of story.
With code execution, the agent can create its own skills.
Example from GetATeam:
// Agent realizes it needs a skill that doesn't exist
// It writes one and saves it
const skillCode = `
export async function analyzeCodeQuality(repoUrl: string) {
// Clone repo
await exec('git clone ' + repoUrl + ' /tmp/repo');
// Run linter
const lintResults = await exec('cd /tmp/repo && eslint .');
// Run tests
const testResults = await exec('cd /tmp/repo && npm test');
// Analyze
return {
lintIssues: lintResults.split('\n').length,
testsPassing: testResults.includes('passing'),
complexity: calculateComplexity(lintResults)
};
}
`;
fs.writeFileSync('./skills/custom/analyze-code-quality.ts', skillCode);
// Now the agent can load and reuse this skill (dynamic import, since the file was just written)
const { analyzeCodeQuality } = await import('./skills/custom/analyze-code-quality.ts');
The agent evolves. It builds tools for itself. Over time, it accumulates a library of specialized skills.
This is impossible with static MCP servers.
The GetATeam Implementation
We tested both approaches on real production tasks.
Test case: Research a competitor, extract key features, generate comparison report.
MCP version:
- Connected 4 MCP servers (web scraping, database, document generation, email)
- 47 tools loaded into context
- Task consumed 87,000 tokens
- Execution time: 45 seconds
- Cost: $0.26
Code execution version:
- Skills folder with 50+ TypeScript functions
- Agent discovered and imported 3 skills
- Task consumed 1,800 tokens
- Execution time: 12 seconds
- Cost: $0.005
Results:
- 98% token reduction
- 73% faster
- 98% cost reduction
- Better output quality (agent had more context budget for reasoning)
The code-execution agent also handled edge cases better because it could write custom logic on the fly.
Benchmark methodology: Token counts include both input (system prompt + tool definitions + conversation history) and output (tool calls + responses + final answer). We used Claude 3.5 Sonnet (claude-3-5-sonnet-20241022) with temperature 0.7 for both tests. The task was run 10 times and results were averaged. MCP servers used: filesystem MCP, postgres MCP, browserbase MCP, and a custom email MCP server.
Security Architecture
The elephant in the room: letting agents execute arbitrary code is a security nightmare if done wrong.
Here's how we approach it at GetATeam:
VM-Level Isolation
Each customer (or group of agents belonging to the same customer) runs in a dedicated VM. This provides strong isolation boundaries:
- Filesystem isolation: One customer's agents can't access another's data
- Network isolation: VMs have separate network namespaces
- Resource limits: CPU, memory, and disk quotas prevent resource exhaustion
- Snapshot rollback: If an agent corrupts its environment, we can restore to last known good state
Capability-Based Permissions
Skills are executed with minimal privileges:
// Skills can only access:
// - Their own skill directory (read-only)
// - /tmp workspace (read-write)
// - Whitelisted network endpoints
// - Specific environment variables
// Skills CANNOT:
// - Access system binaries outside approved list
// - Bind to privileged ports
// - Modify system files
// - Execute sudo commands
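As one concrete example, the network whitelist can be enforced in the execution layer by handing skills a restricted fetch instead of the global one (a sketch; in practice this sits alongside VM-level firewall rules, and the host list is illustrative):

```typescript
// Illustrative capability wrapper: skills receive this instead of global fetch.
const ALLOWED_HOSTS = new Set<string>([
  'api.github.com',       // hypothetical whitelist entries
  'www.googleapis.com',
]);

async function restrictedFetch(url: string, init?: RequestInit): Promise<Response> {
  const host = new URL(url).hostname;
  if (!ALLOWED_HOSTS.has(host)) {
    throw new Error(`Network access to ${host} is not whitelisted`);
  }
  return fetch(url, init);
}
```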
Code Validation
Before execution, we scan agent-generated code:
const BLOCKED_PATTERNS = [
/rm\s+-rf/, // Recursive deletion
/sudo/, // Privilege escalation
/eval\(/, // Dynamic code execution
/child_process/, // Unrestricted shell access
/\bexec\b.*&&.*rm/, // Chained dangerous commands
];
function validateCode(code: string) {
for (const pattern of BLOCKED_PATTERNS) {
if (pattern.test(code)) {
throw new Error(`Blocked dangerous pattern: ${pattern}`);
}
}
// AST-based validation for more sophisticated threats
// (parse() and validateAST() are internal helpers, not shown here)
const ast = parse(code);
validateAST(ast);
}
Post-Execution Cleanup
After each task:
- /tmp workspace is wiped
- Network connections are closed
- Orphaned processes are killed
- File handles are released
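A minimal sketch of the workspace wipe, assuming a per-task directory layout under /tmp (process and file-handle cleanup happen at the VM layer):

```typescript
import * as fs from 'node:fs';

// Illustrative post-task cleanup: remove the task's scratch workspace entirely.
function cleanupWorkspace(taskId: string): void {
  const workspace = `/tmp/workspace-${taskId}`; // hypothetical naming scheme
  fs.rmSync(workspace, { recursive: true, force: true });
}
```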
Incident Response
Despite precautions, things can go wrong. Our monitoring:
- Logs all executed code
- Tracks resource usage patterns
- Alerts on anomalous behavior (sudden CPU spike, unusual network traffic)
- Maintains audit trail for forensics
Has this caused production issues? Yes. In early testing, an agent wrote a skill that entered an infinite loop, consuming 100% CPU for 20 minutes before we caught it. We added timeout guards (max 15 minutes per skill execution) and resource monitoring.
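The timeout guard itself is simple to sketch: race the skill's promise against a timer (illustrative; a production version would also kill the underlying process rather than just abandoning the promise):

```typescript
// Illustrative 15-minute timeout guard around a skill execution.
const MAX_SKILL_RUNTIME_MS = 15 * 60 * 1000;

async function runWithTimeout<T>(skill: () => Promise<T>): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error('Skill exceeded 15-minute execution limit')),
      MAX_SKILL_RUNTIME_MS,
    );
  });
  try {
    return await Promise.race([skill(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```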
When to Still Use MCP
Does this mean MCP is dead? No.
MCP's real value is ecosystem-wide standardization. If 1000 tools expose MCP interfaces, your agent can use them all without custom integration code. For use cases where token overhead is negligible, this interoperability is genuinely valuable.
There are still valid use cases:
1. Simple, Well-Defined APIs
For straightforward integrations where you just need to send data to an API, MCP works fine.
Example: Customer support ticket creation
User message → Agent → MCP tool: create_ticket() → Done
No complex transformations. No multi-step workflows. Just a simple API call.
If your task involves ≤5 tool calls and returns minimal data, MCP's overhead is acceptable.
2. Third-Party Integrations You Don't Control
If you're integrating with an external service that provides an MCP server, and you don't need custom logic, use it.
Why reinvent the wheel? If the MCP server works and token usage is reasonable, stick with it.
The value here is maintenance burden transfer. When Stripe updates their API, their MCP server updates too. You don't need to touch your code.
3. Prototyping and Demos
MCP is faster to set up initially. No need to build a code execution sandbox.
For quick prototypes or demos, MCP gets you up and running fast.
4. Non-Technical Users
If you're building a no-code agent builder for non-developers, MCP provides a simpler abstraction.
Users can "connect integrations" without writing code.
When to Use Code Execution
Use code execution (skills) when:
1. Token Efficiency Matters
If your agent handles large data volumes or runs many tasks, token costs add up fast.
Code execution pays for itself immediately.
Threshold: If you're processing >10k tokens of intermediate data per task, code execution will save significant money.
2. You Need Custom Logic
Anytime you need to transform data, filter results, or combine multiple APIs, code execution gives you full control.
MCP forces you into the constraints of pre-defined tools.
3. Long-Running or Multi-Step Workflows
For complex tasks that span hours or days, code execution with file system persistence is essential.
Don't bloat the context window with intermediate results.
4. Privacy is Critical
Enterprise clients, healthcare, finance, any domain with strict data privacy requirements.
Code execution lets you anonymize, filter, and control exactly what the LLM sees.
5. You Want Agents That Evolve
If your goal is autonomous agents that learn and adapt, code execution enables skill creation and evolution.
MCP keeps agents static.
The Practical Reality: Hybrid Approach
At GetATeam, we use both.
MCP for:
- Simple integrations (Slack notifications, calendar events)
- Third-party services with official MCP servers
- Quick prototypes
- Tasks with <10 tool calls and minimal data transfer
Code execution for:
- Data-heavy tasks (document processing, database queries)
- Multi-step workflows (research → analysis → report generation)
- Custom business logic
- Agent skill evolution
- Privacy-sensitive operations
The key is knowing when to use each approach.
Decision heuristic:
- Will this task process >10k tokens of intermediate data? → Code execution
- Do I need custom data transformation? → Code execution
- Is there an official MCP server that does exactly what I need? → MCP
- Is this a quick prototype? → MCP
- Everything else: evaluate token cost vs. implementation time
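If it helps to see the same heuristic in executable form, a toy encoding (field names and thresholds are ours, mirroring the list above):

```typescript
// Toy encoding of the decision heuristic above (inputs are illustrative).
interface TaskProfile {
  intermediateTokens: number;      // estimated tokens of intermediate data
  needsCustomTransforms: boolean;  // filtering, joining, reshaping data
  officialMcpServerFits: boolean;  // an existing MCP server does exactly this
  isPrototype: boolean;
}

function chooseApproach(task: TaskProfile): 'code-execution' | 'mcp' | 'evaluate' {
  if (task.intermediateTokens > 10_000) return 'code-execution';
  if (task.needsCustomTransforms) return 'code-execution';
  if (task.officialMcpServerFits) return 'mcp';
  if (task.isPrototype) return 'mcp';
  return 'evaluate'; // weigh token cost vs. implementation time
}
```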
Implementation Guide: Building Skills
If you want to implement code execution skills, here's how:
Step 1: Set Up a Secure Sandbox
You need an isolated environment where the agent can execute code safely.
Options:
- VM-based sandboxes (strong isolation, higher overhead) ← We use this
- Docker containers (lightweight, good isolation)
- Serverless functions (Lambda, Cloud Run) (easy scaling, cold start latency)
At GetATeam, we use dedicated VMs per customer with:
- Ubuntu 22.04 minimal
- Docker for additional containerization
- Firewall rules blocking all outbound except whitelisted endpoints
- Quotas: 4 CPU cores, 8GB RAM, 50GB disk per VM
Step 2: Structure Your Skills
Create a clear directory structure:
skills/
├── core/ # Basic utilities
├── integrations/ # API wrappers
├── data/ # Data processing
└── custom/ # Agent-generated skills
Each skill is a simple function:
// skills/integrations/github/get-repo.ts
export async function getRepo(owner: string, repo: string) {
  const response = await fetch(`https://api.github.com/repos/${owner}/${repo}`);
  if (!response.ok) {
    throw new Error(`GitHub API error: ${response.status}`);
  }
  return response.json();
}
Step 3: Add Skill Discovery
Give your agent a way to find skills:
// skills/core/search-skills.ts
export function searchSkills(query: string) {
// Simple keyword search across skill filenames and comments
const allSkills = listAllSkills();
return allSkills.filter(skill =>
skill.name.includes(query) ||
skill.description.includes(query)
);
}
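searchSkills() leans on a listAllSkills() helper that isn't shown above; a minimal version could walk the skills directory and pull a description from each file's first comment line (a sketch, with the comment convention assumed):

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

interface SkillInfo { name: string; path: string; description: string; }

// Sketch: walk skills/ recursively and build searchable entries.
// Assumed convention: each skill file opens with a `// description` comment.
function listAllSkills(dir = './skills'): SkillInfo[] {
  const results: SkillInfo[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      results.push(...listAllSkills(fullPath));
    } else if (entry.name.endsWith('.ts')) {
      const firstLine = fs.readFileSync(fullPath, 'utf8').split('\n')[0];
      results.push({
        name: entry.name.replace(/\.ts$/, ''),
        path: fullPath,
        description: firstLine.replace(/^\/\/\s*/, ''),
      });
    }
  }
  return results;
}
```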
Step 4: Enable File System Persistence
Let the agent save intermediate results:
// The agent can save large data to disk instead of keeping it in context
fs.writeFileSync('/tmp/large-dataset.json', JSON.stringify(data));
// Later, process it without re-loading everything into context
const loaded = JSON.parse(fs.readFileSync('/tmp/large-dataset.json', 'utf8'));
const summary = extractSummary(loaded); // Only the summary goes to the LLM
Step 5: Add Safety Guardrails
Prevent the agent from doing dangerous things:
// Multi-layer validation
function validateCode(code: string) {
// 1. Pattern-based blocking
const BLOCKED_PATTERNS = [
/rm\s+-rf/,
/sudo/,
/shutdown/,
/reboot/,
/\/etc\//, // System config access
];
for (const pattern of BLOCKED_PATTERNS) {
if (pattern.test(code)) {
throw new Error(`Blocked dangerous pattern: ${pattern}`);
}
}
// 2. AST analysis for deeper threats
const ast = parseTypeScript(code);
checkForDangerousAPIs(ast);
// 3. Static analysis
const lintResults = eslint.verify(code, securityRules);
if (lintResults.length > 0) {
throw new Error(`Security lint failed: ${lintResults}`);
}
}
The Future of Agent Tooling
We're seeing a shift in production AI systems toward code execution for data-intensive workloads.
Why? Because abstractions reduce autonomy.
Every layer you add between the agent and the actual task reduces its ability to handle edge cases and adapt.
MCP is an abstraction on top of APIs. It constrains what the agent can do.
Code execution removes that constraint. The agent works directly with the underlying systems.
And as LLMs get better at generating code (which they're already excellent at), this approach becomes more reliable and powerful.
In our production experience at GetATeam, code-execution agents:
- Consume 98% fewer tokens (for data-heavy tasks)
- Run 70%+ faster
- Handle edge cases better
- Evolve and improve over time
MCP still has its place. But for production AI systems that need to be efficient, autonomous, and adaptable, skills beat MCP servers for most complex workflows.
The right question isn't "MCP vs. code execution?" It's "which tool for which job?"
Want to see this in action? GetATeam agents use code execution for all complex workflows. We've built 200+ production skills and our agents create new ones as needed. The architecture is battle-tested on thousands of real tasks.
About the author: Joseph Benguira is the CTO and co-founder of GetATeam, where AI agents execute real work autonomously. He's built production AI systems since the early days of LLMs and has strong opinions about abstractions, complexity, and what actually works in production.