
Building an MCP Agent Farm from Scratch

2026-01-20
#ai #mcp #agents #architecture

The Model Context Protocol (MCP) is a standard for giving AI models structured access to external tools. I built an Agent Farm on top of it — an orchestration layer that manages multiple AI agent templates, each with different models, system prompts, and tool configurations. This is how it works, what I got wrong, and what I'd do differently.

Why I Built This

I was integrating AI into multiple applications at once: a portfolio tracker, this website, an admin dashboard. Each needed different AI behavior — the portfolio tracker needed structured JSON output with financial analysis, this website needed a conversational personality, the admin tool needed dry technical responses.

The naive approach is to hardcode the model and system prompt into each application. I did that for about two weeks before it became unmanageable. Changing a prompt meant redeploying the application. Switching from Claude to a local model meant updating every call site. Testing different models against each other was a manual process.

So I built an abstraction layer. An Agent Farm.

Architecture

The farm has three layers:

Gateway — receives requests from client applications, authenticates them using API keys, validates the request format, and routes to the correct agent worker. Built with Fastify because it handles streaming responses cleanly and has low overhead.

Agent Workers — execute tasks using the configured LLM provider. Each worker picks up a job from the queue, loads the agent template (system prompt, model config, tool list), calls the model, and streams the response back. Workers are stateless — they scale horizontally without coordination.

Queue — BullMQ backed by Redis handles async task management, retry logic, rate limiting, and backpressure. When the queue gets long, jobs wait in Redis until a worker is free instead of overloading the workers. When a worker crashes mid-stream, the job is retried.

Client App
    │
    ▼
Gateway (Fastify) — authenticates, validates, routes
    │
    ▼
BullMQ Queue (Redis) — async, retryable, rate-limited
    │
    ▼
Agent Workers (N parallel)
    │
    ▼
LLM Provider (Anthropic / Ollama / OpenAI)
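
The flow in the diagram can be sketched in a few lines. This is a minimal in-memory model, not the real Fastify or BullMQ code: the InMemoryQueue class, the API_KEYS set, the handleRequest function, and the maxAttempts setting are hypothetical stand-ins for the Redis-backed queue and its retry configuration, and the worker is a plain synchronous callback so the control flow is easy to follow.

```typescript
// In-memory sketch of gateway -> queue -> worker. All names are hypothetical;
// the real system uses Fastify for the gateway and BullMQ on Redis for the queue.

interface Job {
  id: number;
  agentTemplateId: number;
  inputPrompt: string;
  attemptsMade: number;
}

// A worker handler returns the response text or throws on failure.
type Handler = (job: Job) => string;

class InMemoryQueue {
  private waiting: Job[] = [];
  private nextId = 1;

  constructor(private readonly maxAttempts: number) {}

  add(agentTemplateId: number, inputPrompt: string): Job {
    const job: Job = { id: this.nextId++, agentTemplateId, inputPrompt, attemptsMade: 0 };
    this.waiting.push(job);
    return job;
  }

  // Run jobs until the queue is empty. A failed job goes back to the end of
  // the queue until it has used up maxAttempts.
  drain(handler: Handler): Map<number, string> {
    const results = new Map<number, string>();
    while (true) {
      const job = this.waiting.shift();
      if (job === undefined) break;
      job.attemptsMade += 1;
      try {
        results.set(job.id, handler(job));
      } catch {
        if (job.attemptsMade < this.maxAttempts) {
          this.waiting.push(job);
        } else {
          results.set(job.id, `failed after ${job.attemptsMade} attempts`);
        }
      }
    }
    return results;
  }
}

const API_KEYS = new Set(["demo-key"]); // placeholder key store

// Gateway step: authenticate, validate the body, then enqueue.
function handleRequest(
  queue: InMemoryQueue,
  apiKey: string,
  body: { agentTemplateId?: unknown; inputPrompt?: unknown },
): { status: number; jobId?: number } {
  if (!API_KEYS.has(apiKey)) return { status: 401 };
  if (typeof body.agentTemplateId !== "number" || typeof body.inputPrompt !== "string") {
    return { status: 400 };
  }
  const job = queue.add(body.agentTemplateId, body.inputPrompt);
  return { status: 202, jobId: job.id };
}
```

Because the handler receives everything it needs in the job, any number of workers could call drain-style logic against a shared queue, which is the property that lets stateless workers scale horizontally.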

Template System

Templates are the key abstraction. Each template defines:

  • llmProvider: anthropic, openai, or ollama
  • llmModel: e.g., claude-sonnet-4-6, gpt-4o, qwen2.5-coder:14b
  • systemPrompt: the full system prompt injected before every conversation
  • mcpServerIds: which MCP tool servers this agent can access
  • temperature: controls response randomness
  • maxTokens: caps response length
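
As a rough sketch, the fields above map onto a TypeScript type like the one below. The field names come from the list; the id field, the validateTemplate helper, the string type for mcpServerIds, and every value in the example are assumptions for illustration, not the real template contents.

```typescript
type LlmProvider = "anthropic" | "openai" | "ollama";

interface AgentTemplate {
  id: number;              // assumed: the number clients pass as agentTemplateId
  llmProvider: LlmProvider;
  llmModel: string;
  systemPrompt: string;
  mcpServerIds: string[];  // assumed to be server names such as "web-search"
  temperature: number;
  maxTokens: number;
}

// Hypothetical sanity check run before a template is saved.
// Returns a list of problems; an empty list means the template looks usable.
function validateTemplate(t: AgentTemplate): string[] {
  const errors: string[] = [];
  if (t.llmModel.trim() === "") errors.push("llmModel must not be empty");
  if (t.systemPrompt.trim() === "") errors.push("systemPrompt must not be empty");
  if (!Number.isFinite(t.temperature) || t.temperature < 0) {
    errors.push("temperature must be a non-negative number");
  }
  if (!Number.isInteger(t.maxTokens) || t.maxTokens <= 0) {
    errors.push("maxTokens must be a positive integer");
  }
  return errors;
}

// Illustrative values only.
const exampleTemplate: AgentTemplate = {
  id: 5,
  llmProvider: "ollama",
  llmModel: "qwen2.5-coder:14b",
  systemPrompt: "(personality prompt goes here)",
  mcpServerIds: ["web-search", "knowledge-store"],
  temperature: 0.7,
  maxTokens: 1024,
};
```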

Template 3 (Claude Sonnet) handles complex portfolio analysis and returns structured JSON. Template 4 (Qwen local) handles simple queries with streaming markdown. Template 5 powers this website's chat agent with a specific Markandey Singh personality.

The beauty is separation of concerns: the application doesn't know which model it's talking to. It sends { agentTemplateId: 5, inputPrompt: "tell me about your career" } and gets back a streamed response. Swap models, change prompts, add tools — zero application code changes.
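
To make the decoupling concrete, here is a minimal sketch of how a worker might turn that request into a model call. The buildModelCall function, the ModelCall shape, and the template values are hypothetical; only the request fields agentTemplateId and inputPrompt come from the text above.

```typescript
interface AgentTemplate {
  llmProvider: "anthropic" | "openai" | "ollama";
  llmModel: string;
  systemPrompt: string;
  temperature: number;
  maxTokens: number;
}

interface AgentRequest {
  agentTemplateId: number;
  inputPrompt: string;
}

// Hypothetical provider-neutral description of the call the worker will make.
interface ModelCall {
  provider: AgentTemplate["llmProvider"];
  model: string;
  temperature: number;
  maxTokens: number;
  messages: { role: "system" | "user"; content: string }[];
}

// Everything model-specific is read from the template, never from the request.
function buildModelCall(templates: Map<number, AgentTemplate>, request: AgentRequest): ModelCall {
  const template = templates.get(request.agentTemplateId);
  if (template === undefined) {
    throw new Error(`unknown template ${request.agentTemplateId}`);
  }
  return {
    provider: template.llmProvider,
    model: template.llmModel,
    temperature: template.temperature,
    maxTokens: template.maxTokens,
    messages: [
      { role: "system", content: template.systemPrompt },
      { role: "user", content: request.inputPrompt },
    ],
  };
}

// Illustrative template values.
const localTemplate: AgentTemplate = {
  llmProvider: "ollama",
  llmModel: "qwen2.5-coder:14b",
  systemPrompt: "(personality prompt goes here)",
  temperature: 0.7,
  maxTokens: 1024,
};
const templateStore = new Map<number, AgentTemplate>();
templateStore.set(5, localTemplate);

const clientRequest: AgentRequest = { agentTemplateId: 5, inputPrompt: "tell me about your career" };
const callBefore = buildModelCall(templateStore, clientRequest);

// Swapping the model is a template edit; the client request is unchanged.
templateStore.set(5, { ...localTemplate, llmProvider: "anthropic", llmModel: "claude-sonnet-4-6" });
const callAfter = buildModelCall(templateStore, clientRequest);
```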

MCP Tool Integration

Each template can be assigned MCP tool servers. The current setup includes:

  • web-search: Brave Search API for live web results
  • portfolio-tracker: financial data tools (holdings, P&L, tax calculations)
  • knowledge-store: RAG via pgvector for semantic memory across conversations
  • aws: EC2, S3, Lambda management tools
  • atlassian: Jira and Confluence integration

When a worker runs a task, it connects to the assigned MCP servers, makes the tool list available to the model, and handles tool call execution inside the streaming loop. The model decides when and which tools to call — the worker executes them and feeds results back. This is the core of what makes MCP powerful: the model can browse the web, query a database, and check infrastructure status mid-response, all without the client application knowing anything about those tools.
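
The loop described above can be sketched without any SDK. Everything here is a simplified assumption: the ModelTurn, Message, Model, and Tool types and the runWithTools function are hypothetical, real MCP clients and model SDKs have richer types, and real calls would be asynchronous and streamed. The sketch keeps them synchronous so the decide, execute, feed-back cycle is visible.

```typescript
// Hypothetical, simplified shapes for one step of a tool-using conversation.
type ModelTurn =
  | { kind: "text"; text: string }
  | { kind: "toolCall"; tool: string; args: Record<string, unknown> };

type Message = { role: "user" | "assistant" | "tool"; content: string };
type Model = (messages: Message[], toolNames: string[]) => ModelTurn;
type Tool = (args: Record<string, unknown>) => string;

// The model decides whether to call a tool; the worker executes it and
// appends the result so the next model step can use it.
function runWithTools(model: Model, tools: Map<string, Tool>, prompt: string, maxSteps = 5): string {
  const messages: Message[] = [{ role: "user", content: prompt }];
  const toolNames = Array.from(tools.keys());
  for (let step = 0; step < maxSteps; step++) {
    const turn = model(messages, toolNames);
    if (turn.kind === "text") return turn.text; // final answer, stop looping
    const tool = tools.get(turn.tool);
    const result = tool ? tool(turn.args) : `error: unknown tool ${turn.tool}`;
    messages.push({ role: "assistant", content: `call ${turn.tool}` });
    messages.push({ role: "tool", content: result });
  }
  // A step cap keeps a confused model from looping forever.
  throw new Error(`no final answer after ${maxSteps} steps`);
}

// Fake tool and fake model, for illustration only.
const demoTools = new Map<string, Tool>();
demoTools.set("web-search", (args) => `results for ${String(args["query"])}`);

const fakeModel: Model = (messages) => {
  const last = messages[messages.length - 1];
  if (last !== undefined && last.role === "tool") {
    return { kind: "text", text: `Answer based on: ${last.content}` };
  }
  return { kind: "toolCall", tool: "web-search", args: { query: "mcp" } };
};

const demoAnswer = runWithTools(fakeModel, demoTools, "what is mcp?");
```

The client never appears in this loop, which is the point: it only sees the final streamed text.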

The Provider Abstraction Bug

This was my most embarrassing bug. @ai-sdk/openai version 3 defaults to OpenAI's new Responses API (/v1/responses). Ollama only supports the older Chat Completions API (/v1/chat/completions). For three weeks, every Ollama call returned a 404, and the error message was cryptic enough that it took me that long to trace it to the endpoint mismatch.

Fix: call openai.chat(model) instead of openai(model) for the Ollama provider. The broader lesson: wrap SDK calls in a thin factory function (createProvider(provider, model)). You fix the bug once, not in every application.
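
A minimal sketch of such a factory follows. To keep it runnable without installing any SDK, the provider clients are passed in through a hypothetical ProviderRegistry, so the signature differs slightly from createProvider(provider, model); the OpenAIStyleProvider interface and the fake providers are stand-ins. With the real SDK, the ollama entry would presumably be an OpenAI-compatible client pointed at the local Ollama server.

```typescript
type ProviderName = "anthropic" | "openai" | "ollama";

// Structural stand-in for an OpenAI-style SDK provider: calling it directly
// uses the SDK's default API, while .chat() forces Chat Completions.
interface OpenAIStyleProvider<M> {
  (modelId: string): M;
  chat(modelId: string): M;
}

// Hypothetical registry of already-configured provider clients.
interface ProviderRegistry<M> {
  anthropic: (modelId: string) => M;
  openai: OpenAIStyleProvider<M>;
  ollama: OpenAIStyleProvider<M>;
}

// The one place that knows Ollama must go through .chat().
function createProvider<M>(registry: ProviderRegistry<M>, provider: ProviderName, modelId: string): M {
  switch (provider) {
    case "anthropic":
      return registry.anthropic(modelId);
    case "openai":
      return registry.openai(modelId);
    case "ollama":
      return registry.ollama.chat(modelId);
  }
}

// Fake providers that return a label instead of a model object, so the
// routing decision is easy to inspect.
function fakeOpenAIStyle(label: string): OpenAIStyleProvider<string> {
  const call = (modelId: string) => `${label}:default:${modelId}`;
  return Object.assign(call, { chat: (modelId: string) => `${label}:chat:${modelId}` });
}

const fakeRegistry: ProviderRegistry<string> = {
  anthropic: (modelId) => `anthropic:${modelId}`,
  openai: fakeOpenAIStyle("openai"),
  ollama: fakeOpenAIStyle("ollama"),
};

const ollamaModel = createProvider(fakeRegistry, "ollama", "qwen2.5-coder:14b");
const openaiModel = createProvider(fakeRegistry, "openai", "gpt-4o");
```

Call sites only ever see createProvider, so a future SDK default change is handled in one switch statement.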

Streaming is Non-Negotiable

A 5-second wait for a complete response feels broken. The same content streamed token-by-token from the first second feels fast and alive. Every endpoint supports SSE streaming — the client sees event: token within 200ms of submitting a query.
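
For reference, this is roughly what the wire format looks like. The sketch assumes each token is sent as a JSON payload under event: token and that a final done event closes the stream; the formatSseEvent and streamTokens names, the payload shape, and the done event are illustrative assumptions. The write callback stands in for writing to the HTTP response, and in a real worker the tokens would arrive incrementally from the model rather than as an array.

```typescript
// Headers an SSE response needs so clients treat it as an event stream.
const SSE_HEADERS = {
  "Content-Type": "text/event-stream",
  "Cache-Control": "no-cache",
  Connection: "keep-alive",
};

// One SSE frame: an event name, a data line, and a blank line as terminator.
// JSON-encoding the payload keeps newlines inside a token from breaking the framing.
function formatSseEvent(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Write each token as soon as it is available, then signal completion.
function streamTokens(tokens: string[], write: (chunk: string) => void): void {
  for (const token of tokens) {
    write(formatSseEvent("token", { text: token }));
  }
  write(formatSseEvent("done", {}));
}

// Usage: collect the frames in memory instead of sending them over HTTP.
const sentFrames: string[] = [];
streamTokens(["Hel", "lo\n"], (chunk) => sentFrames.push(chunk));
```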

Perceived performance matters more than actual performance. A slower, cheaper model that starts streaming immediately feels better to use than a powerful model that thinks for 8 seconds before responding.

Cost Monitoring Built In

Every task logs: which template was used, which provider was called, response time, estimated token count, and inferred cost. This is how I discovered that 93% of queries to this website's agent are simple enough for Qwen running locally. The complex queries get routed to Claude. Result: 76% cost savings with no perceptible quality difference for most conversations.

You cannot optimize what you don't measure. Log every query from day one.
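
A per-task log record and a summary over it might look like the sketch below. The TaskLog fields mirror the list above, but the estimateCost and summarize helpers are hypothetical, and the prices are placeholders rather than real provider rates. The savings figure compares actual spend against sending every task to one hosted baseline provider; it does not attempt to reproduce the percentages quoted above.

```typescript
type Provider = "anthropic" | "openai" | "ollama";

interface TaskLog {
  templateId: number;
  provider: Provider;
  responseMs: number;
  inputTokens: number;  // estimated
  outputTokens: number; // estimated
}

// Placeholder prices in currency units per million tokens. Not real rates.
const PRICE_PER_MILLION: Record<Provider, { input: number; output: number }> = {
  anthropic: { input: 2, output: 10 },
  openai: { input: 2, output: 10 },
  ollama: { input: 0, output: 0 }, // local hardware, treated as free per call
};

// Cost of one task, optionally priced as if it had run on another provider.
function estimateCost(log: TaskLog, provider: Provider = log.provider): number {
  const price = PRICE_PER_MILLION[provider];
  return (log.inputTokens * price.input + log.outputTokens * price.output) / 1e6;
}

interface CostSummary {
  localShare: number; // fraction of tasks served by the local model
  actualCost: number;
  savings: number;    // 1 - actual / (cost if everything used the baseline)
}

function summarize(logs: TaskLog[], baseline: Provider): CostSummary {
  let actualCost = 0;
  let baselineCost = 0;
  let localCount = 0;
  for (const log of logs) {
    actualCost += estimateCost(log);
    baselineCost += estimateCost(log, baseline);
    if (log.provider === "ollama") localCount += 1;
  }
  return {
    localShare: logs.length === 0 ? 0 : localCount / logs.length,
    actualCost,
    savings: baselineCost === 0 ? 0 : 1 - actualCost / baselineCost,
  };
}
```

Savings only match the local share when every task uses the same number of tokens. Complex queries routed to a hosted model tend to be longer, which is one reason the two numbers can diverge.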

What I'd Do Differently

Design for streaming from the start. I built the queue system without streaming, then retrofitted it, which was painful.

Separate gateway from farm early. I ran them together initially. As the farm scaled, the gateway became a bottleneck. They're separate services now — should have been separate from day one.

Version templates. Right now, changing a template prompt updates it globally with no rollback. Template versioning and A/B testing across versions are on the backlog.
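
One possible shape for that backlog item is an append-only version history with a movable "active" pointer. This is a sketch of one design under that assumption, not something the farm does today; the VersionedTemplate class and its methods are hypothetical, and A/B testing is left out.

```typescript
interface TemplateVersion {
  version: number;
  systemPrompt: string;
}

// Append-only history: publishing never overwrites, and rollback only moves
// the pointer to an earlier version.
class VersionedTemplate {
  private versions: TemplateVersion[] = [];
  private activeIndex = -1;

  // Store a new prompt as the next version and make it active.
  publish(systemPrompt: string): number {
    const version = this.versions.length + 1;
    this.versions.push({ version, systemPrompt });
    this.activeIndex = this.versions.length - 1;
    return version;
  }

  // Point the template back at an earlier version without deleting anything.
  rollback(toVersion: number): void {
    const index = this.versions.findIndex((v) => v.version === toVersion);
    if (index === -1) throw new Error(`no such version ${toVersion}`);
    this.activeIndex = index;
  }

  active(): TemplateVersion {
    const current = this.versions[this.activeIndex];
    if (current === undefined) throw new Error("nothing published yet");
    return current;
  }
}

// Usage: a bad prompt change is undone with a single call.
const chatTemplate = new VersionedTemplate();
chatTemplate.publish("(original prompt)");
chatTemplate.publish("(experimental prompt)");
chatTemplate.rollback(1);
```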

The full system is running right now. The AI on this website is Template 5 — Qwen 2.5 Coder 14B on my local hardware, costing ₹0 per conversation. You can watch the live agent orchestration in the Lab.