
Why I Route 93% of AI Queries to Local Models

2026-03-15
#ai #cost-optimization #ollama

Cost optimization in AI systems isn't about using the cheapest model — it's about using the right model for each query. Here's how I built a routing system that sends 93% of queries to a free local model while maintaining quality, and what the architecture looks like in production.

The Problem

When I first integrated AI into my portfolio tracker, every query went to Claude. The analysis was excellent, but at ~₹0.05 per query, costs added up fast: 1,000 queries a day works out to ₹50 a day, or about ₹1,500 a month, just on AI. Most of those queries didn't need Claude's reasoning power at all. "How many stocks do I own?" doesn't require a frontier model. It requires a lookup.

I was routing a simple data retrieval task through a reasoning engine. That's like hiring a senior engineer to read you a log file.

The Solution: Query Classification

I built a keyword-based classifier that runs before every query hits an LLM. It checks for intent signals:

COMPLEX (→ Claude Sonnet, ~₹0.05/query): Keywords like analyze, recommend, rebalance, risk, should I, compare, sell, buy, tax, strategy, forecast

SIMPLE (→ Qwen 2.5 local via Ollama, ₹0/query): Keywords like what is, how many, show me, list, explain, define, total, summary, how much

Default: COMPLEX. If in doubt, use the better model. A wrong COMPLEX classification costs ₹0.05. A wrong SIMPLE classification gives a bad answer. The asymmetry is obvious.

The Classifier Code

The implementation is deliberately simple:

const COMPLEX_KEYWORDS = [
  "analyze", "analyse", "recommend", "rebalance", "should i",
  "compare", "risk", "strategy", "tax", "harvest", "which stock",
  "best", "worst", "portfolio health", "deep dive", "forecast",
  "action plan", "sell", "buy"
];

const SIMPLE_KEYWORDS = [
  "what is", "how many", "show me", "list", "allocation",
  "explain", "define", "total value", "how much", "summary",
  "what does", "meaning", "sector", "count"
];

function classifyQuery(query: string): "COMPLEX" | "SIMPLE" {
  const lower = query.toLowerCase();
  // Check COMPLEX first so a query carrying both kinds of signal escalates.
  if (COMPLEX_KEYWORDS.some(k => lower.includes(k))) return "COMPLEX";
  if (SIMPLE_KEYWORDS.some(k => lower.includes(k))) return "SIMPLE";
  return "COMPLEX"; // safe default when no keyword matches
}

No ML model. No embedding. No vector similarity. Just string matching with a safe default. I evaluated whether a smarter classifier would improve things — the answer was no. The categories are clear enough that keyword matching works 97% of the time, and the cost of misclassification is low enough that the 3% doesn't matter.
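One way to check an accuracy figure like this is to hand-label a sample of real queries and count how the classifier's mistakes split. The sketch below is illustrative, not the evaluation actually used: the `evaluate` helper and the four-query sample are hypothetical, and `classify` is a trimmed stand-in with shortened keyword lists in which a COMPLEX signal wins over a SIMPLE one.

```typescript
type Tier = "COMPLEX" | "SIMPLE";

// Trimmed keyword lists, for illustration only.
const COMPLEX = ["analyze", "recommend", "rebalance", "risk", "should i", "tax"];
const SIMPLE = ["what is", "how many", "show me", "list", "how much"];

function classify(query: string): Tier {
  const lower = query.toLowerCase();
  if (COMPLEX.some(k => lower.includes(k))) return "COMPLEX";
  if (SIMPLE.some(k => lower.includes(k))) return "SIMPLE";
  return "COMPLEX"; // safe default
}

interface Labeled { query: string; expected: Tier; }

interface Report {
  accuracy: number;    // fraction classified correctly
  underRouted: number; // expected COMPLEX, got SIMPLE: risks a bad answer
  overRouted: number;  // expected SIMPLE, got COMPLEX: costs one paid call
}

function evaluate(samples: Labeled[]): Report {
  let correct = 0, underRouted = 0, overRouted = 0;
  for (const { query, expected } of samples) {
    const got = classify(query);
    if (got === expected) correct++;
    else if (expected === "COMPLEX") underRouted++;
    else overRouted++;
  }
  const accuracy = samples.length ? correct / samples.length : 0;
  return { accuracy, underRouted, overRouted };
}

// Hypothetical hand-labeled sample.
const report = evaluate([
  { query: "How many stocks do I own?", expected: "SIMPLE" },
  { query: "Should I rebalance this quarter?", expected: "COMPLEX" },
  { query: "List my top 5 holdings by risk", expected: "COMPLEX" },
  { query: "Portfolio value today", expected: "SIMPLE" },
]);
console.log(report);
```

Counting the two error types separately matters because they are not equally costly: an over-routed query wastes one paid call, while an under-routed one risks a bad answer.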

Running Qwen 2.5 Locally via Ollama

The local model runs on my Mac via Ollama. qwen2.5-coder:14b is a 14B parameter model at Q4_K_M quantization — about 9GB on disk. On an M-series Mac with 32GB+ RAM, it runs at ~30 tokens/second. Plenty fast for a chat interface.

ollama pull qwen2.5-coder:14b
ollama serve  # exposes OpenAI-compatible API at localhost:11434/v1

The OpenAI-compatible API means I can point any openai SDK client at it by changing the base URL. The application code doesn't know it's talking to a local model.

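// Assumes the Vercel AI SDK's OpenAI provider package.
import { createOpenAI } from "@ai-sdk/openai";

const ollama = createOpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama",  // required by SDK, not actually checked by Ollama
});

// Chat-completions model handle backed by the local Ollama server.
const model = ollama.chat("qwen2.5-coder:14b");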
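Putting the classifier and the two clients together, the routing step might look like the sketch below. It is not the production code: `Backend`, `makeRouter`, and the stub backends are hypothetical names, and each backend is just an async function, so the real Claude and Ollama clients would need to be wrapped to fit that shape.

```typescript
type Tier = "COMPLEX" | "SIMPLE";

// A backend is anything that turns a prompt into text. In a real app these
// would wrap the Claude client and the local Ollama client.
type Backend = (prompt: string) => Promise<string>;

interface Routed { tier: Tier; answer: string; }

function makeRouter(
  classify: (query: string) => Tier,
  backends: Record<Tier, Backend>,
) {
  return async function route(query: string): Promise<Routed> {
    const tier = classify(query);              // decide which model to use
    const answer = await backends[tier](query); // call only that model
    return { tier, answer };
  };
}

// Stub wiring for illustration; the classifier here is a one-line placeholder.
const route = makeRouter(
  q => (q.toLowerCase().includes("how many") ? "SIMPLE" : "COMPLEX"),
  {
    SIMPLE: async p => `local: ${p}`,
    COMPLEX: async p => `claude: ${p}`,
  },
);

route("How many stocks do I own?").then(r => console.log(r.tier, r.answer));
```

Because the backends are injected, the router can be tested with stubs like these and never needs to know which one is the local model.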

The Results After Six Months

| Metric | Value |
|---|---|
| Queries to local model | 93% |
| Queries to Claude | 7% |
| Cost per local query | ₹0 |
| Cost per Claude query | ~₹0.05 |
| Monthly AI spend | ~₹28 |
| All-Claude baseline | ~₹1,500/month |
| Total savings vs. baseline | ~98% |

The ₹28/month comes from the 7% of queries going to Claude. That's about 560 Claude queries/month out of ~8,000 total. The remaining 7,440 queries are free.
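The arithmetic behind those figures is easy to reproduce. The helper below is a small sketch (the `monthlySpend` name and its parameters are illustrative, not part of the router) that multiplies monthly volume by the paid share and the per-query price.

```typescript
// Monthly spend in rupees when only a fraction of queries hit the paid model.
function monthlySpend(
  queriesPerMonth: number,
  paidShare: number,         // fraction routed to the paid model, 0..1
  pricePerPaidQuery: number, // rupees per paid query
): number {
  return queriesPerMonth * paidShare * pricePerPaidQuery;
}

// Routed setup: ~8,000 queries/month, 7% to Claude, ~₹0.05 each.
const routed = monthlySpend(8000, 0.07, 0.05);

// All-Claude baseline: 1,000 queries/day over a 30-day month.
const baseline = monthlySpend(1000 * 30, 1, 0.05);

console.log(routed.toFixed(2), baseline.toFixed(2));
```

The results are floating-point values, so round them (as `toFixed` does here) before displaying or comparing.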

What Breaks This Model

Latency on cold start: Ollama loads the model on first use. The first query after the Mac has slept takes 3-5 seconds. Subsequent queries are fast. For a personal tool this is fine; for a high-traffic production service you'd need a warmed instance or a cloud fallback.
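For the cloud-fallback option, one approach is to race the local call against a timer and hand the query to the paid model if the local one is too slow or unreachable. This is a sketch under assumptions: `withFallback`, the `Backend` type, and the stub backends are illustrative, and the timeout budget is something you would tune for your own setup.

```typescript
type Backend = (prompt: string) => Promise<string>;

// Try `primary`; if it rejects or exceeds `timeoutMs`, use `fallback` instead.
function withFallback(primary: Backend, fallback: Backend, timeoutMs: number): Backend {
  return async (prompt: string) => {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error("local model timed out")), timeoutMs);
    });
    try {
      // Whichever settles first wins: the local answer or the timeout.
      return await Promise.race([primary(prompt), timeout]);
    } catch {
      return await fallback(prompt);
    } finally {
      if (timer !== undefined) clearTimeout(timer);
    }
  };
}

// Stubs standing in for the real clients: a slow "cold" local model and a cloud model.
const coldLocal: Backend = () =>
  new Promise(resolve => setTimeout(() => resolve("local"), 200));
const cloud: Backend = async () => "cloud";

const ask = withFallback(coldLocal, cloud, 50);
ask("How many stocks do I own?").then(answer => console.log(answer));
```

Here the stub local backend takes 200 ms against a 50 ms budget, so the call resolves through the cloud stub. Note that the timed-out local request is not cancelled in this sketch; a fuller version would abort it.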

Edge cases in classification: Occasionally a genuinely complex question uses simple-sounding language. "List my top 5 holdings by risk" contains "list" (SIMPLE trigger) but actually needs analysis. I handle this by defaulting COMPLEX when I see both types of signals in the same query.
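A related hazard with substring matching is that a keyword can fire inside a longer word: `includes("count")` also matches "account", and `includes("tax")` matches "syntax". A minimal sketch of whole-word matching with a regular expression follows; `hasKeyword` and `escapeRegExp` are hypothetical helpers, not part of the classifier above.

```typescript
// Escape characters that have special meaning inside a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True when `keyword` appears in `query` as a whole word or phrase.
function hasKeyword(query: string, keyword: string): boolean {
  const pattern = new RegExp(`\\b${escapeRegExp(keyword)}\\b`, "i");
  return pattern.test(query);
}

// Substring matching fires on "account"; whole-word matching does not.
console.log("how is my account doing".includes("count"));
console.log(hasKeyword("How is my account doing", "count"));

// Genuine keywords and multi-word phrases still match, case-insensitively.
console.log(hasKeyword("Count my holdings", "count"));
console.log(hasKeyword("What should I do next?", "should i"));
```

The trade-off is that whole-word matching also stops catching inflections such as "risks" or "analyzing", so the keyword lists would need those variants added explicitly.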

Model capability gap: For truly complex financial analysis — multi-stock correlation, tax-loss harvesting strategies, rebalancing with constraints — Qwen 14B isn't good enough. Claude is. The 7% that goes to Claude really does need to go to Claude.

The Broader Principle

Most AI applications don't need frontier model intelligence for every operation. They need frontier intelligence for the hard 10% and fast, cheap, good-enough responses for the other 90%. Building that split deliberately — rather than defaulting everything to the most powerful available model — is the difference between a ₹28/month AI bill and a ₹1,500/month one.

Run the classification locally. Run the simple model locally. Pay for intelligence only when you actually need it.

The live version of this query router is in the Lab — type any query and watch it get classified in real time.