I built an MCP server in a weekend — here's what production-grade actually means
The gap between an MCP demo and an MCP server you'd hand to a paying user. Auth, idempotency, schema versioning, cost instrumentation — the seven unglamorous patterns that close it.

Last weekend I shipped an MCP server. The README says "production-grade." But "production-grade" is one of those phrases that's been beaten flat by software-marketing copy until it carries almost no signal — so before that phrase lands on anything else I publish, I want to define it.
This post is the definition: seven patterns that exist in real MCP servers and are absent from every demo I've reviewed. Auth, idempotency, schema versioning, cost instrumentation — the bits between an MCP server you'd demo on a livestream and one you'd hand to a paying user.
I'll show the gap, name each missing piece, and (where useful) drop the snippet I run in mine. Repo at the bottom — MIT, one-click install for Claude and Cursor.
Key Takeaways
- Most public MCP examples stop at "the model called my tool." That's not the bar.
- Production-grade MCP = bearer auth scoped per tool, a versioned tool registry, idempotency keys, structured per-call observability, deprecate-then-remove for tool removals, per-tool cost instrumentation, and schema validation that survives Anthropic SDK churn.
- Skipping any one of these is fine for a demo. Skipping any one of these in front of paying users produces a specific, named failure mode within 30 days.
- The total code that closes the gap is small (mine is ~600 lines on top of the Anthropic SDK). The thinking is the expensive part.
- I'm releasing the server as an MIT-licensed reference repo, with each of the seven patterns labelled by section number.
Table of Contents
- What "production-grade" actually means here
- #1 — Bearer auth with per-tool scopes
- #2 — A versioned tool registry
- #3 — Idempotency keys on every mutating call
- #4 — Structured logging: input, output, latency, cost
- #5 — Deprecate-then-remove for tool removals
- #6 — Cost instrumentation per tool, not per request
- #7 — Schema validation that survives SDK updates
- The reference repo
1. What "production-grade" actually means here
Two definitions, neither controversial:
A demo is software you'd show on a livestream. It works on the happy path, it works for one user, it works on your laptop. The interesting bits are the parts that work.
Production-grade is software you'd hand to a paying customer. The interesting bits are the parts that fail: what happens when the model calls your tool with stale arguments, when one tool's schema changes, when the SDK gets a minor-version bump, when two concurrent users hit the same handler, when the bill arrives.
MCP is at the awkward stage where demos are everywhere and production servers are rare. The gap is mostly engineering, not concept. The seven things below are what I shipped after working through each one over that weekend.
2. Bearer auth with per-tool scopes
Default MCP examples authenticate at the server level — a single bearer token gates the whole connection. That's fine for one user and one client. It is not fine the moment a server has more than one consumer.
The pattern that holds up:
// Tool registration declares its required scope
server.tool({
  name: "send_email",
  description: "Send an email on behalf of the user",
  scope: "email.write",
  inputSchema: SendEmailSchema,
  handler: async (input, ctx) => {
    // ctx.scopes is verified before this runs
  },
});
Then the connection auth check returns a set, not a boolean:
const scopes = await verifyBearer(req.headers.authorization);
// → ["calendar.read", "email.write"] or throws
On every tool call, before handler invocation, the server compares the required scope against the granted set and returns a structured unauthorized error if absent.
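In mine that comparison is a tiny helper. A minimal sketch; checkScope and the error shape are illustrative names, not the SDK's:
// Run before every handler: required scope comes from the registration,
// granted set comes from verifyBearer at connection time.
function checkScope(required: string, granted: string[]) {
  if (granted.includes(required)) return null; // authorized, proceed to handler
  return {
    error: "unauthorized",
    required_scope: required,
    granted_scopes: granted, // client can tell the user exactly which grant is missing
  };
}
Returning the granted set alongside the missing scope means the error is actionable on the client side, not just a flat 403.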
Why it matters: the first time a customer asks "can I give my MCP server to my agent, but only for read-only access?" — this is the answer. Without per-tool scopes, you can't answer it. With them, it's one line in the tool registration.
Demos skip this because everything is Bearer dev-token and the dev token grants everything. Production-grade has at least three scope tiers (read, write, admin) and at least three rotation paths (issue, revoke, expire).
3. A versioned tool registry
Tool definitions change. You'll rename a parameter, you'll widen a type, you'll add an optional field. In a demo, you push the change, the model picks it up, it works.
In production, the MCP client cached your tool definition at connection time. It will call your tool with the old schema for the lifetime of that connection — which can be hours. If you've widened a type, fine. If you've renamed userId to accountId, you've just broken every in-flight session.
The fix is registry versioning. Each tool registration carries a version. On connect, the client receives the current version per tool. On call, the server checks the called version against the current one. Mismatch returns a structured schema_mismatch error with both signatures, and the client refetches.
server.tool({
  name: "send_email",
  version: "2.1.0", // bump on any breaking change
  // ...
});
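The matching per-call check is a few lines. A sketch, assuming the client echoes back the version it cached at connect time:
const registry = new Map<string, { version: string }>(); // populated at server.tool() time

// Hypothetical guard, run before the handler. Returns null when versions agree.
function checkVersion(call: { name: string; version: string }) {
  const current = registry.get(call.name)?.version;
  if (current !== undefined && current !== call.version) {
    // structured mismatch: client sees both versions and refetches the definition
    return { error: "schema_mismatch", called: call.version, current };
  }
  return null;
}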
There's a quieter benefit: every tool call in your logs is tagged with the schema version it was made against. Six months in, when someone asks "why did this call fail," you can actually answer it.
4. Idempotency keys on every mutating call
LLMs retry. The Anthropic SDK retries. Your client retries. Networks retry.
Every tool call that has a side effect — sends an email, charges a card, writes a row — must be idempotent. The right pattern is the boring one used in every payment API: the caller passes an idempotency key, the server keys writes by it, and a replay returns the same response as the first call.
server.tool({
  name: "send_email",
  idempotent: true,
  handler: async (input, ctx) => {
    const key = ctx.idempotencyKey; // 24-byte random, generated client-side
    return await idempotentInsert(key, () => sendEmail(input));
  },
});
Skip this and you will eventually send the same email three times to the same person. It's not a hypothetical — there are public MCP demos where you can reproduce it by toggling wi-fi during a tool call. The model retries, the tool fires twice, the user gets two emails.
For read-only tools, idempotency is free (just don't mutate). For mutating tools, the implementation cost is one Redis call per invocation. Skipping it is a class of bug that no amount of prompting fixes.
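For the curious, roughly what idempotentInsert looks like. A sketch backed by ioredis; a stricter version would claim the key with SET NX before running the side effect, to close the race between concurrent retries:
import Redis from "ioredis";

const redis = new Redis();

// First call with a key runs the side effect and caches the response;
// any replay with the same key returns the cached response untouched.
async function idempotentInsert<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const cached = await redis.get(`idem:${key}`);
  if (cached !== null) {
    return JSON.parse(cached) as T; // replay: same response as the first call
  }
  const result = await fn();
  await redis.set(`idem:${key}`, JSON.stringify(result), "EX", 60 * 60 * 24); // 24h TTL
  return result;
}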
5. Structured logging: input, output, latency, cost
Every MCP demo I've reviewed logs something. None of them log the right thing.
The schema that earns its keep, one row per tool call:
{
  call_id: string, // generated server-side, propagated downstream
  tool_name: string,
  tool_version: string, // see #3
  scope_required: string,
  user_id: string, // or org_id, whatever your tenancy unit is
  input: object, // the call args
  output_summary: object, // size, type, status — not the whole payload
  latency_ms: number,
  llm_cost_usd: number, // if the tool itself calls an LLM
  external_cost_usd: number, // any third-party API spend on this call
  status: "ok" | "error" | "deprecated",
  error_code?: string,
}
Index tool_name and user_id and you get answers to every question that matters: which tool is slowest, which user costs the most, which tool errors most. Feed it to evals — production logs are the best eval set you'll ever have.
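Producing those rows is one wrapper around handler invocation. A minimal sketch, assuming a pino logger; withCallLog is an illustrative name:
import { randomUUID } from "node:crypto";
import pino from "pino";

const logger = pino();

// Wraps a handler call: times it, then emits one structured row per call.
async function withCallLog<T>(
  meta: { tool: string; version: string; scope: string; userId: string },
  input: object,
  run: () => Promise<T>,
): Promise<T> {
  const call_id = randomUUID(); // generated server-side, propagated downstream
  const start = performance.now();
  try {
    const output = await run();
    logger.info({
      call_id,
      tool_name: meta.tool,
      tool_version: meta.version,
      scope_required: meta.scope,
      user_id: meta.userId,
      input,
      latency_ms: performance.now() - start,
      status: "ok",
    });
    return output;
  } catch (err) {
    logger.error({
      call_id,
      tool_name: meta.tool,
      user_id: meta.userId,
      latency_ms: performance.now() - start,
      status: "error",
      error_code: err instanceof Error ? err.name : "unknown",
    });
    throw err;
  }
}
The two cost columns come from the handler's __meta return (see #7) and get merged into the same row before it's written.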
Demos log console.log("called", name). That isn't telemetry, that's decoration.
6. Deprecate-then-remove for tool removals
You will remove tools. Some were mistakes, some got replaced, some belonged to a feature that pivoted.
Removing a tool from an MCP server breaks anything depending on it. The naive version is: pull the registration, redeploy, surprise. The right version is the same pattern HTTP APIs have used for two decades:
- Phase 1 — tool returns a structured deprecated warning alongside its result. Clients log it.
- Phase 2 (≥30 days later) — tool returns the warning and refuses the call with removed_use_replacement, pointing at the new tool name.
- Phase 3 (≥30 days later) — tool registration is pulled.
In code:
server.tool({
  name: "old_search",
  deprecated: { since: "2026-04-01", replacement: "search_v2" },
  // handler unchanged for phase 1
});
A 20-line addition to your registry. It is the difference between "we shipped a breaking change Friday and went home" and "we shipped a breaking change after 60 days of warnings, with a one-line migration path."
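Roughly what that addition does at call time. A sketch; applyDeprecation is an illustrative name, and the 30-day cutoff is the phase-2 boundary from the list above:
type Deprecation = { since: string; replacement: string };

// Phase 1: pass the result through with a structured warning.
// Phase 2 (30+ days in): refuse and point at the replacement.
// Phase 3 is pulling the registration itself, which needs no code here.
function applyDeprecation(dep: Deprecation, result: unknown, now = new Date()) {
  const days = (now.getTime() - new Date(dep.since).getTime()) / 86_400_000;
  if (days >= 30) {
    return { error: "removed_use_replacement", replacement: dep.replacement };
  }
  return {
    result,
    warning: { code: "deprecated", since: dep.since, replacement: dep.replacement },
  };
}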
7. Cost instrumentation per tool, not per request
This is the line item you don't notice until the bill changes.
A typical MCP request from Claude Opus 4.7 with three tool calls hides three separate cost sources: the model invocation itself, any LLM calls inside the tools, and third-party API spend (if any). If you only measure model cost, you'll be off by 3–10×.
The pattern: every tool handler returns its incurred cost alongside the result. The server attaches it to the structured log (see #5), and a daily roll-up groups by tool and user.
return {
  result: emailId,
  __meta: {
    llm_cost_usd: 0.0021, // a small Haiku 4.5 classification inside the tool
    external_cost_usd: 0.0,
  },
};
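The daily roll-up is then a group-by over the log rows from #5. A sketch:
type CostRow = { tool_name: string; llm_cost_usd: number; external_cost_usd: number };

// Sums both cost columns per tool; the same shape works grouped by user_id.
function rollUpByTool(rows: CostRow[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    const cost = row.llm_cost_usd + row.external_cost_usd;
    totals.set(row.tool_name, (totals.get(row.tool_name) ?? 0) + cost);
  }
  return totals;
}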
The reason this is important isn't bookkeeping — it's product. Once you can answer "which tool costs the most per active user," you can answer "what should I cache, what should I batch, what should I move to a cheaper model." Most demos can't even ask the question.
Concrete number from my server: 80% of the cost was in one tool — a document-search tool that called Sonnet 4.6 to rerank results. Knowing that was the difference between a £40/month server and a £400/month server. Caching reranks for repeat queries got it down to £8/month at the same usage. None of that optimization is possible without per-tool cost.
8. Schema validation that survives SDK updates
The Anthropic SDK is on a steady release cadence, and a few of those releases have, historically, changed tool-call semantics in subtle ways — argument shapes, system-prompt budget accounting, tool-call ID format. None have been hard breakages. All have surfaced for someone as "my server stopped working this morning."
Two patterns that keep this from being a 5am pager:
- Pin the SDK in production. Renovate or Dependabot opens a PR, your eval suite runs against the new version, you merge after green. Never auto-upgrade.
- Schema-validate every tool call at the server boundary. If the SDK ships a new field, log it. If a field disappears, log that too. Both events alert.
const parsed = ToolCallSchema.safeParse(rawCall);
if (!parsed.success) {
  logger.error({ event: "schema_mismatch", issues: parsed.error.issues });
  return errorResponse("schema_mismatch");
}
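What ToolCallSchema might look like, as a zod sketch; the field names are placeholders that depend on your SDK version. The .strict() call is what makes new or vanished fields surface as parse issues instead of passing silently:
import { z } from "zod";

// Placeholder shape for a tool-use block; adjust fields to your SDK version.
const ToolCallSchema = z
  .object({
    id: z.string(),
    name: z.string(),
    input: z.record(z.unknown()),
  })
  .strict(); // unknown keys fail the parse, which the check above logs and alerts on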
This is the cheapest of the seven to implement and the one I've gotten the most value from. Twice now an SDK release has shifted something subtle and the schema check caught it before any user-facing call was affected.
9. The reference repo
I'm releasing the server as MIT-licensed boilerplate. All seven patterns above are implemented and labelled by section number — you can read the post and the code side by side.
- Repo: github.com/webgururobin/mcp-server-production-starter
- TypeScript on the Anthropic MCP SDK (@modelcontextprotocol/sdk), 40 vitest cases
- stdio entry point for Claude Desktop and Cursor (config snippets in examples/)
- HTTP entry point with bearer auth (pnpm start:http)
- Eval harness (pnpm evals) — invokes each tool through the seven patterns and prints the cost roll-up
Fork it, copy the patterns you need, or use it as the starting point for your own.
What "production-grade" lets you do
The shorthand for this post: production-grade MCP is the surface area between your tools and a customer who pays for them. Auth answers who is allowed. Versioning answers what version they're calling. Idempotency answers what happens when something retries. Observability answers what just happened and what it cost. Deprecation answers how we change things without breaking them. Schema validation answers how this survives SDK churn.
Skip any one of these in front of users and the failure mode is specific — and the kind of thing nobody asks about in design review, because they're all variants of "what happens on the bad path."
If you're building an MCP server right now, work through the seven. Most are an afternoon of code; collectively they're a weekend — which is exactly the budget I used.
If you want the long-form dispatch — one engineering write-up per week plus the build I shipped that week, no roundups, no filler — subscribe to The AI Engineer Dispatch.
If you've got an MCP server (or any AI feature) stuck somewhere between a demo and a real deployment and you want a second pair of eyes, book a 20-minute intro call.