TL;DR: State management is the hardest unsolved problem in production AI agents. After running 1000+ tasks on GetATeam, we've battle-tested every pattern from simple memory to distributed state machines. This article breaks down what actually works, what fails catastrophically, and the architecture patterns that scale.
Why State Management Breaks Most AI Agents
You've built an AI agent that works perfectly in your demo. It handles a task, responds correctly, and you're ready to ship. Then you put it in production, and everything falls apart.
The agent forgets context mid-conversation. It repeats actions it already completed. It loses track of multi-step workflows. Users report inconsistent behavior that you can't reproduce.
Sound familiar?
This isn't a bug in your code. It's a fundamental architecture problem: state management.
Unlike traditional software, where state is predictable and controlled, AI agents make decisions based on context that evolves over time and across multiple interactions, sometimes spanning days or weeks.
At GetATeam, we've run over 1000 production tasks with AI agents. We've seen every failure mode imaginable. And we've learned some hard lessons about what actually works.
The State Problem Nobody Talks About
Here's what makes state management in AI agents uniquely difficult:
1. Context Windows Are Finite
Your agent might need to reference a conversation from 3 days ago, but LLMs have token limits. You can't just dump everything into the prompt every time.
2. Multi-Agent Coordination
When multiple agents work together, they need shared state. Agent A completes step 1, but how does Agent B know to start step 2? Race conditions and inconsistencies emerge fast.
3. Long-Running Tasks
A task might take hours or days. The agent needs to pick up where it left off after restarts, crashes, or deliberate pauses. Traditional session-based state doesn't cut it.
4. Human-in-the-Loop
Users might interrupt, provide new information, or change requirements mid-task. Your state needs to adapt without losing critical context.
5. Idempotency
If a task fails halfway and you retry, you can't have the agent repeat already-completed actions. It needs to know what's done and what's pending.
Pattern 1: Memory Files (Simple but Effective)
The Problem: Agent forgets context between sessions.
The Solution: Persistent memory files.
At GetATeam, every agent has a memory.md file in their directory. It's literally just a markdown file that the agent reads at startup and updates throughout execution.
Example code:
// Read memory at startup
const fs = require('fs');
const memory = fs.readFileSync('/app/agents/employee-id/memory.md', 'utf-8');

// Include it in the agent prompt alongside the new task
const prompt = `Your current memory: ${memory}
New task: ${task}
Update your memory.md file if you learn anything important.`;
What works:
- Dead simple to implement
- Human-readable (you can debug by just reading the file)
- Survives restarts and redeployments
- Works for 80% of use cases
What fails:
- No structured queries (you can't easily ask "what tasks did I complete yesterday?")
- Race conditions if multiple agents access the same file
- Grows unbounded without pruning
- No versioning or rollback
Real example from GetATeam:
Our Joseph Benguira agent maintains memory.md with:
- Current focus areas
- Recent decisions and their rationale
- Ongoing projects and status
- Key learnings from past tasks
When he receives a new task, he reads his memory first. This gives continuity across days and weeks.
Best practices:
- Structure memory with clear sections (## Current Projects, ## Recent Learnings, etc.)
- Timestamp entries so you can prune old data
- Keep it under 2000 tokens (roughly 1500 words)
- Update after significant actions, not every minor step
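To make the "timestamp and prune" practices concrete, here's a minimal sketch of a memory helper. The appendMemory/pruneMemory names and the entry format are illustrative, not our actual production code; the only assumptions are the section headings and timestamped bullets described above.
const fs = require('fs');

// Append a timestamped entry under a given section of memory.md
function appendMemory(path, section, note) {
  const entry = `- [${new Date().toISOString()}] ${note}`;
  const memory = fs.readFileSync(path, 'utf-8');
  // Naive insert: put the entry right after the section heading
  const updated = memory.replace(`## ${section}`, `## ${section}\n${entry}`);
  fs.writeFileSync(path, updated);
}

// Drop timestamped entries older than maxAgeDays so the file stays under budget
function pruneMemory(path, maxAgeDays = 30) {
  const cutoff = Date.now() - maxAgeDays * 24 * 60 * 60 * 1000;
  const lines = fs.readFileSync(path, 'utf-8').split('\n');
  const kept = lines.filter((line) => {
    const match = line.match(/^- \[(\d{4}-\d{2}-\d{2}T[^\]]+)\]/);
    return !match || new Date(match[1]).getTime() >= cutoff;
  });
  fs.writeFileSync(path, kept.join('\n'));
}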
Pattern 2: TODO Lists (Task Tracking)
The Problem: Multi-step tasks lose track of what's done and what's pending.
The Solution: Structured TODO lists with explicit state.
We use TODO.md files with three states: pending, in_progress, completed.
Example:
## Current Tasks
- [x] Create database schema
- [ ] Implement authentication ← CURRENTLY WORKING ON THIS
- [ ] Build frontend components
- [ ] Write tests
- [ ] Deploy to production
## Completed
- [x] Set up project structure
- [x] Configure CI/CD pipeline
The agent reads this file before each action and updates it immediately after completing steps.
Critical rules we learned:
- Only ONE task in_progress at a time - Prevents confusion about current focus
- Mark complete IMMEDIATELY - Not later, not in batches, RIGHT AFTER
- Break down vague tasks - "Build feature" → specific implementable steps
- Remove irrelevant tasks - Don't let stale items clutter the list
Real failure we experienced:
Early on, our agents would mark multiple tasks as in_progress. Result? They'd jump between tasks randomly, duplicate work, or forget which task they were actually doing. Now we enforce: exactly one in_progress task.
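One way to enforce that rule mechanically is to parse TODO.md before each action and refuse to proceed if more than one task carries the in-progress marker. A sketch, assuming the checkbox format shown above with "CURRENTLY WORKING ON THIS" as the marker:
const fs = require('fs');

// Parse TODO.md into pending / in_progress / completed buckets
function parseTodo(path) {
  const lines = fs.readFileSync(path, 'utf-8').split('\n');
  const tasks = { pending: [], in_progress: [], completed: [] };
  for (const line of lines) {
    if (line.startsWith('- [x]')) tasks.completed.push(line);
    else if (line.startsWith('- [ ]') && line.includes('CURRENTLY WORKING ON THIS')) tasks.in_progress.push(line);
    else if (line.startsWith('- [ ]')) tasks.pending.push(line);
  }
  // Enforce the "exactly one in_progress task" rule before doing anything else
  if (tasks.in_progress.length > 1) {
    throw new Error(`Expected at most one in_progress task, found ${tasks.in_progress.length}`);
  }
  return tasks;
}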
Pattern 3: Database State (When Files Aren't Enough)
The Problem: You need to query state, handle concurrency, or coordinate multiple agents.
The Solution: PostgreSQL with proper schema design.
At GetATeam, we use Postgres for:
- Task queue management
- Agent activity logs
- User preferences and profiles
- Email gateway state (tracking conversations)
Schema example for task management:
CREATE TABLE agent_tasks (
  id SERIAL PRIMARY KEY,
  agent_id VARCHAR(255) NOT NULL,
  task_type VARCHAR(100) NOT NULL,
  status VARCHAR(50) NOT NULL,
  context JSONB,
  started_at TIMESTAMP,
  completed_at TIMESTAMP,
  error_message TEXT,
  created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_agent_status ON agent_tasks(agent_id, status);
CREATE INDEX idx_created_at ON agent_tasks(created_at DESC);
Query patterns that work:
// Get current task for agent
const currentTask = await db.query(
  'SELECT * FROM agent_tasks WHERE agent_id = $1 AND status = $2 ORDER BY created_at DESC LIMIT 1',
  [agentId, 'in_progress']
);

// Mark task complete atomically
await db.query(
  'UPDATE agent_tasks SET status = $1, completed_at = NOW() WHERE id = $2 AND status = $3',
  ['completed', taskId, 'in_progress']
);
The JSONB context field is powerful:
Store any task-specific data without schema migrations:
{
  "email_thread_id": "thread_abc123",
  "files_generated": ["/tmp/report.pdf", "/tmp/analysis.csv"],
  "user_preferences": { "format": "markdown", "tone": "technical" },
  "checkpoints": [
    { "step": "research", "completed": true, "timestamp": "2025-11-08T10:30:00Z" },
    { "step": "draft", "completed": true, "timestamp": "2025-11-08T11:15:00Z" },
    { "step": "review", "completed": false }
  ]
}
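As a sketch of working with that field (assuming node-postgres and the agent_tasks schema above), Postgres's JSONB operators let you read or update a single checkpoint without rewriting the whole row:
// Read just the checkpoints array from the context column
const { rows } = await db.query(
  `SELECT context->'checkpoints' AS checkpoints FROM agent_tasks WHERE id = $1`,
  [taskId]
);

// Mark the third checkpoint ("review") as completed in place
await db.query(
  `UPDATE agent_tasks
   SET context = jsonb_set(context, '{checkpoints,2,completed}', 'true'::jsonb)
   WHERE id = $1`,
  [taskId]
);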
Lessons from production:
- Use transactions for multi-step updates - Prevents partial state
- Index on (agent_id, status) - Fast lookups for "what's this agent doing?"
- Archive old completed tasks - Don't let the table grow unbounded
- Store timestamps - Essential for debugging and analytics
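For the first lesson, here's a minimal sketch of the transaction pattern with node-postgres, assuming a connection pool named pool; the follow-up 'send_summary_email' task type is purely illustrative.
const client = await pool.connect();
try {
  await client.query('BEGIN');
  // Complete the task and enqueue a follow-up in one atomic unit
  await client.query(
    'UPDATE agent_tasks SET status = $1, completed_at = NOW() WHERE id = $2',
    ['completed', taskId]
  );
  await client.query(
    'INSERT INTO agent_tasks (agent_id, task_type, status, context) VALUES ($1, $2, $3, $4)',
    [agentId, 'send_summary_email', 'pending', JSON.stringify({ parent_task_id: taskId })]
  );
  await client.query('COMMIT');
} catch (err) {
  // Roll back so neither change lands without the other
  await client.query('ROLLBACK');
  throw err;
} finally {
  client.release();
}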
Pattern 4: Event Sourcing (Auditability + Recovery)
The Problem: You need to understand what happened, replay events, or recover from failures.
The Solution: Store events, not just current state.
Instead of updating a task status directly, store every event that happens:
CREATE TABLE agent_events (
  id SERIAL PRIMARY KEY,
  agent_id VARCHAR(255) NOT NULL,
  event_type VARCHAR(100) NOT NULL,
  event_data JSONB NOT NULL,
  timestamp TIMESTAMP DEFAULT NOW()
);
Events might be:
{ "type": "task_started", "data": { "task_id": 123, "description": "..." } }
{ "type": "file_created", "data": { "path": "/tmp/report.pdf", "size": 45632 } }
{ "type": "email_sent", "data": { "to": "user@example.com", "subject": "..." } }
{ "type": "error_occurred", "data": { "message": "API timeout", "retry_count": 2 } }
{ "type": "task_completed", "data": { "task_id": 123, "duration_seconds": 145 } }
Why this is powerful:
- Replay: Reconstruct exactly what the agent did by replaying events
- Debug: See the full timeline when something goes wrong
- Analytics: Query patterns like "how many tasks fail due to API timeouts?"
- Recovery: If state gets corrupted, rebuild from events
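The replay idea in particular is easy to sketch: fold over the event log in order and rebuild each task's status. This assumes the agent_events table above and a node-postgres-style db.query; the status values are illustrative.
// Rebuild per-task status purely from the event log
async function rebuildTaskState(db, agentId) {
  const { rows } = await db.query(
    'SELECT event_type, event_data FROM agent_events WHERE agent_id = $1 ORDER BY timestamp ASC',
    [agentId]
  );
  const tasks = {};
  for (const { event_type, event_data } of rows) {
    const taskId = event_data.task_id;
    if (event_type === 'task_started') tasks[taskId] = 'in_progress';
    if (event_type === 'task_completed') tasks[taskId] = 'completed';
    if (event_type === 'error_occurred' && taskId) tasks[taskId] = 'failed';
  }
  return tasks; // e.g., { 123: 'completed' }
}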
Trade-off: Storage grows fast. Solution: Archive events older than N days to cold storage.
Pattern 5: Distributed State with Redis
The Problem: Multiple agents need real-time coordination.
The Solution: Redis for shared state, with TTLs and atomic operations.
Use case 1: Rate limiting
// Ensure agent doesn't make more than 10 API calls per minute
const key = 'rate_limit:' + agentId + ':' + Math.floor(Date.now() / 60000);
const count = await redis.incr(key);
await redis.expire(key, 60);
if (count > 10) {
  throw new Error('Rate limit exceeded');
}
Use case 2: Lock-based coordination
// Ensure only one agent processes a task at a time
const lockKey = 'task_lock:' + taskId;
const acquired = await redis.set(lockKey, agentId, 'NX', 'EX', 300);
if (!acquired) {
  console.log('Another agent is handling this task');
  return;
}
try {
  await processTask(taskId);
} finally {
  // If the task can outlive the 300s TTL, verify the lock still holds our agentId before deleting
  await redis.del(lockKey);
}
Use case 3: Pub/Sub for event notifications
// Agent A publishes event
await redis.publish('task_completed', JSON.stringify({ task_id: 123, result: '...' }));

// Agent B subscribes on a dedicated connection (a connection in subscribe mode can't run other commands)
await subscriber.subscribe('task_completed');
subscriber.on('message', (channel, message) => {
  const event = JSON.parse(message);
  console.log('Task completed:', event.task_id);
});
When NOT to use Redis:
- Long-term storage (use Postgres)
- Complex queries (use Postgres)
- Data you can't afford to lose (Redis is in-memory by default; use Postgres for durability)
When to use Redis:
- Short-term coordination
- Rate limiting
- Real-time event notifications
- Caching
The GetATeam Architecture
Here's how we combine these patterns in production:
┌─────────────────────────────────────────────────┐
│ AI Agent (e.g., Joseph Benguira) │
│ │
│ Startup: │
│ 1. Read memory.md (context from past) │
│ 2. Read TODO.md (current tasks) │
│ 3. Query Postgres (pending tasks from queue) │
│ │
│ During execution: │
│ 4. Update TODO.md (mark progress) │
│ 5. Write events to Postgres (audit trail) │
│ 6. Use Redis locks (coordination) │
│ 7. Publish Redis events (notify other agents) │
│ │
│ On completion: │
│ 8. Update memory.md (learnings) │
│ 9. Mark task complete in Postgres │
│ 10. Remove from TODO.md │
└─────────────────────────────────────────────────┘
Layer 1: Files (memory.md, TODO.md)
- Fast, simple, human-readable
- For agent-specific context and task tracking
Layer 2: Postgres
- Durable, queryable, transactional
- For task queues, user data, audit logs
Layer 3: Redis
- Fast, ephemeral, distributed
- For coordination, rate limiting, events
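As a rough sketch of how those three layers meet in a single agent run (fs, db, and redis are the same clients used in the examples above; executeTask stands in for the actual LLM-driven work and is not our real API):
async function runAgentCycle(agentId) {
  // Layer 1: files — cheap, human-readable context
  const memory = fs.readFileSync(`/app/agents/${agentId}/memory.md`, 'utf-8');
  const todo = fs.readFileSync(`/app/agents/${agentId}/TODO.md`, 'utf-8');

  // Layer 2: Postgres — pull the next pending task from the durable queue
  const { rows } = await db.query(
    'SELECT * FROM agent_tasks WHERE agent_id = $1 AND status = $2 ORDER BY created_at ASC LIMIT 1',
    [agentId, 'pending']
  );
  if (rows.length === 0) return;
  const task = rows[0];

  // Layer 3: Redis — take a short-lived lock so no other agent grabs the same task
  const acquired = await redis.set(`task_lock:${task.id}`, agentId, 'NX', 'EX', 300);
  if (!acquired) return;

  try {
    await executeTask(task, { memory, todo });
    await db.query('UPDATE agent_tasks SET status = $1, completed_at = NOW() WHERE id = $2', ['completed', task.id]);
    await redis.publish('task_completed', JSON.stringify({ task_id: task.id }));
  } finally {
    await redis.del(`task_lock:${task.id}`);
  }
}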
Common Pitfalls We've Learned
1. Storing everything in the prompt
Early mistake: Including all context in every LLM call. Result: Hit token limits fast, high costs, slow responses.
Fix: Selective context. Only include relevant memory, not everything.
2. No idempotency
If an agent crashes mid-task and retries, it might duplicate actions (e.g., send the same email twice).
Fix: Check completion status before every action. Use database transactions.
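A sketch of that check, using the agent_events table from Pattern 4 to ask "did I already do this?" before a side effect; sendEmail is a stand-in for whatever delivery function you actually use.
// Only send the email if no email_sent event exists for this task yet
async function sendEmailOnce(db, agentId, taskId, email) {
  const { rows } = await db.query(
    `SELECT 1 FROM agent_events
     WHERE agent_id = $1 AND event_type = 'email_sent' AND event_data->>'task_id' = $2
     LIMIT 1`,
    [agentId, String(taskId)]
  );
  if (rows.length > 0) return; // already done on a previous attempt

  await sendEmail(email);
  await db.query(
    'INSERT INTO agent_events (agent_id, event_type, event_data) VALUES ($1, $2, $3)',
    [agentId, 'email_sent', JSON.stringify({ task_id: taskId, to: email.to })]
  );
}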
3. No versioning
Agent updates its memory, but there's no way to rollback if it made a mistake.
Fix: Git-track memory files or use event sourcing to reconstruct previous states.
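A minimal version of the git-tracking fix: commit memory.md after each update so a bad edit can be reverted. This sketch assumes the agent directory is already a git repository.
const { execSync } = require('child_process');

// Commit the updated memory file so a bad edit can be rolled back with git revert
// (throws if there is nothing to commit)
function commitMemory(agentDir, message) {
  execSync('git add memory.md', { cwd: agentDir });
  execSync(`git commit -m "${message}"`, { cwd: agentDir });
}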
4. Race conditions
Two agents try to update the same TODO.md file simultaneously. Chaos ensues.
Fix: Use database locks or Redis locks for shared state.
5. Unbounded growth
Memory files and TODO lists grow forever until they break.
Fix: Prune old data. Keep memory under 2000 tokens. Archive completed tasks.
Testing State Management
How do you test this? We use these strategies:
1. Checkpoint tests
Run a task, kill the agent mid-execution, restart it. Does it resume correctly?
2. Concurrent execution
Run multiple agents simultaneously on shared resources. Do they coordinate properly or step on each other?
3. Long-running tasks
Let a task run for hours or days. Does state remain consistent?
4. Failure injection
Randomly kill processes, disconnect databases, and force API timeouts. Does the agent recover gracefully?
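As an example of the first strategy, a checkpoint test might look like this sketch using Node's built-in test runner (Node 18+); startAgentTask, waitForCheckpoint, killAgent, resumeAgent, waitForStatus, and countEvents are hypothetical helpers around however you launch and observe agents.
const test = require('node:test');
const assert = require('node:assert');

test('agent resumes a task after being killed mid-execution', async () => {
  const task = await startAgentTask('joseph', 'write weekly report');

  // Kill the agent process once the first checkpoint is written
  await waitForCheckpoint(task.id, 'research');
  await killAgent('joseph');

  // Restart and let it pick the task back up from persisted state
  await resumeAgent('joseph');
  const finished = await waitForStatus(task.id, 'completed');

  assert.strictEqual(finished.status, 'completed');
  // The research step must not have been repeated (idempotency)
  assert.strictEqual(await countEvents(task.id, 'research_started'), 1);
});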
Metrics That Matter
Track these to know if your state management works:
- State consistency rate: % of tasks where final state matches expected
- Recovery success rate: % of crashed tasks that resume correctly
- Duplicate action rate: % of retries that repeat already-completed steps
- Context loss incidents: # of times agent forgets critical information
At GetATeam, we log every state transition and run daily reports to catch regressions.
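In practice, those reports reduce to simple aggregations over the tables above. A sketch against the agent_tasks schema from Pattern 3 (the 'failed'/errored breakdown here is illustrative; the schema leaves status values open):
// Daily report: how many tasks each agent completed vs. errored in the last 7 days
const { rows } = await db.query(
  `SELECT agent_id,
          COUNT(*) FILTER (WHERE status = 'completed') AS completed,
          COUNT(*) FILTER (WHERE error_message IS NOT NULL) AS errored
   FROM agent_tasks
   WHERE created_at > NOW() - INTERVAL '7 days'
   GROUP BY agent_id
   ORDER BY errored DESC`
);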
The Future: Smarter State
We're experimenting with:
1. Semantic memory search
Instead of grep-ing memory.md, use vector embeddings to find relevant context:
const relevantMemories = await vectorDB.query(
  'Find memories related to: user prefers technical tone',
  { limit: 5 }
);
2. Automatic memory summarization
LLM condenses old memories periodically to keep files small while preserving key information.
3. Multi-agent shared memory
Agents can read each other's memory (with permissions) to coordinate better.
4. Predictive state
Agent predicts what state will be needed next and pre-loads it, reducing latency.
Conclusion
State management is hard, but it's solvable. The key lessons:
- Start simple: memory.md and TODO.md get you 80% there
- Add complexity only when needed: Postgres for queries, Redis for coordination
- Make state explicit: Don't hide state in LLM conversations
- Design for failures: Agents will crash and APIs will time out; assume it will happen
- Test relentlessly: Checkpoint tests, concurrency tests, long-running tests
At GetATeam, our agents handle complex multi-day tasks with high reliability because we've invested in solid state architecture. It's not glamorous, but it's what separates demos from production systems.
Want to see this in action? GetATeam agents run in production 24/7, managing email, writing code, coordinating across time zones. The state management patterns in this article power every one of those tasks.
About the author: Joseph Benguira is the CTO and co-founder of GetATeam, where AI agents with real personalities execute actual work. He's spent 25+ years in software engineering, from Microsoft stacks to open source infrastructure, and now builds production AI systems that don't need babysitting.