A 3-axis scorecard for picking the first AI feature to ship in your SaaS — user value × engineering cost × LLM cost-per-request. Walked through with four real candidate features.
You've decided to add AI to your SaaS. Good. Now comes the harder question: which AI feature should you ship first?
Every founder I talk to has a Notion doc with five candidate features. By the end of the call, three of them are wrong, one is right, and one is a maybe. The pattern is consistent enough that I've stopped being surprised by it.
This post is the framework I use in the £900 AI Integration Audit to answer that question. It's a 3-axis scorecard, weighted, and it works because most "what to build" guides ignore axis 3 entirely — and axis 3 is where 80% of AI projects die in production.
Key Takeaways
- The first AI feature is a strategic decision, not a popularity contest. Picking wrong burns 6–12 weeks before you find out.
- Score each candidate on three axes: user value × engineering cost × LLM cost-per-request. Most teams skip the third one.
- The right first feature has high user value, modest engineering cost, and a tolerable per-request LLM cost at your projected scale.
- "Cool demo" features almost never make good first features. The ones that do are unglamorous and measurable.
- Disqualifiers matter as much as the score: features without clear evals, fallbacks, or a cost ceiling should not be v1.
Table of Contents
- The wrong way to pick a first AI feature
- The 3-axis scorecard
- Walking through 4 real candidate features
- The disqualifier checklist
- What "ship first" actually means
- Get the scorecard template
1. The wrong way to pick a first AI feature
Three patterns I see repeatedly:
Vibes. Someone on the team saw a demo, thought it was cool, and now it's on the roadmap. No connection to the customer's job-to-be-done.
Exec demos. "We need something to show the board." This optimizes for the screenshot, not the user. The feature ships, gets used twice, and dies.
Competitor-driven. "Notion has AI. Linear has AI. We need AI." This is the worst one because it's seductive — it feels like strategy. It isn't. You're picking a feature based on someone else's product surface, not your customer's problem.
All three skip the question that actually matters: does this feature, at our scale, with our cost structure, make economic sense to run for the next 24 months?
2. The 3-axis scorecard
Score each candidate feature 1–10 on each of three axes. Weight them. Pick the highest-scoring feature. That's your v1.
Axis 1: User Value (weight: 40%)
How much does this feature improve the user's job-to-be-done? Specifically:
- How frequently will users hit this feature in a typical workflow?
- How much time / friction does it remove?
- Would a user notice (and complain) if it disappeared after 30 days?
A score of 10 means: every active user encounters this feature weekly, it removes a clearly painful step, and removing it would generate support tickets. A score of 3 means: nice-to-have, low frequency, "would be cool."
Axis 2: Engineering Cost (weight: 30%)
How much engineering work does the v1 require, including the unsexy parts? Score inversely (10 = cheap, 1 = expensive):
- How many distinct components? (UI, retrieval, prompt, eval, monitoring)
- Does it need new data infrastructure? (vector DB, ETL, embeddings)
- How tight is the latency budget?
- What happens on the bad path? (LLM down, output malformed)
A 10 here is something you can build in 2 weeks with no new infra. A 3 is "we need a vector store, a re-ranker, evals, and a UX redesign."
Axis 3: LLM Cost-per-Request (weight: 30%)
This is the one most teams skip and the one that kills the most projects. Score inversely — cheap = high score.
The math, with prices expressed per token:
cost_per_request = (input_tokens × input_price) + (output_tokens × output_price)
monthly_cost = cost_per_request × requests_per_user × MAU
A score of 10 is a feature that runs for fractions of a cent on Claude Haiku 4.5 or GPT-4o-mini. A score of 3 is a feature that needs Claude Opus 4.7 with a 50k-token context window on every request.
I've watched founders ship a feature that cost £8 per active user per month. The first bill arrived, the feature got killed, and eight weeks of engineering time went with it.
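To make that arithmetic concrete, here's a minimal sketch in Python. The prices, token counts, and usage figures are hypothetical placeholders; substitute your provider's current per-million-token rates and your own traffic numbers.

```python
# Hypothetical per-1M-token prices in GBP -- substitute your provider's
# current rates for the model you actually plan to use.
GBP_PER_1M_INPUT = 2.40
GBP_PER_1M_OUTPUT = 12.00

def cost_per_request(input_tokens: int, output_tokens: int) -> float:
    """Per-request LLM cost in GBP, from per-1M-token prices."""
    return (input_tokens * GBP_PER_1M_INPUT
            + output_tokens * GBP_PER_1M_OUTPUT) / 1_000_000

def monthly_cost(per_request: float, requests_per_user: float, mau: int) -> float:
    """Projected monthly bill: per-request cost x usage x active users."""
    return per_request * requests_per_user * mau

# Illustrative example: an 8k-token context, ~500 output tokens,
# 100 requests per user per month, 5,000 MAU.
per_req = cost_per_request(8_000, 500)
print(f"£{per_req:.4f} per request")                           # £0.0252
print(f"£{monthly_cost(per_req, 100, 5_000):,.0f} per month")  # £12,600
```

Run this at your projected scale before kickoff; the per-user figure (here ~£2.52/month) is the number that decides whether the feature survives its first invoice.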
Putting it together:
final_score = (user_value × 0.4) + (engineering_cost_inv × 0.3) + (llm_cost_inv × 0.3)
Highest score wins. Ties broken by user value.
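As a sketch, the whole formula fits in a few lines of Python, with the weights from this section:

```python
def weighted_score(user_value: int,
                   engineering_cost_inv: int,
                   llm_cost_inv: int) -> float:
    """Scorecard formula. All inputs are 1-10; the two cost axes are
    scored inversely (10 = cheap, 1 = expensive)."""
    return user_value * 0.4 + engineering_cost_inv * 0.3 + llm_cost_inv * 0.3

print(f"{weighted_score(8, 5, 6):.1f}")  # 6.5 -- Candidate A in the next section
```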
3. Walking through 4 real candidate features
Let's score four features I see proposed often. Imagine a UK B2B SaaS with 5,000 MAU.
Candidate A: AI document Q&A — users upload PDFs, ask questions, get answers with citations.
| Axis | Score | Reasoning |
|---|---|---|
| User value | 8 | High — used weekly, removes real friction (Ctrl+F across 50 PDFs) |
| Engineering cost | 5 | Moderate — needs chunking, embeddings, vector DB, re-ranking, citation UI |
| LLM cost | 6 | Moderate — Claude Sonnet 4.6 at 8k context, ~£0.03/request, ~£3/MAU/month |
Weighted score: (8 × 0.4) + (5 × 0.3) + (6 × 0.3) = 3.2 + 1.5 + 1.8 = 6.5
Candidate B: AI-generated email replies — when a user gets an email in the app, an AI draft is pre-filled.
| Axis | Score | Reasoning |
|---|---|---|
| User value | 6 | Useful but most users will edit heavily; novelty fades |
| Engineering cost | 8 | Cheap — single prompt, no retrieval needed |
| LLM cost | 7 | Cheap — Haiku 4.5 handles this fine, ~£0.001/request |
Weighted score: (6 × 0.4) + (8 × 0.3) + (7 × 0.3) = 2.4 + 2.4 + 2.1 = 6.9
Candidate C: AI agent that handles support tickets autonomously — multi-step, takes actions, replies to customers.
| Axis | Score | Reasoning |
|---|---|---|
| User value | 9 | Massive — saves human support hours |
| Engineering cost | 2 | Brutal — multi-step tool use, evals across many paths, fallback logic, ops support |
| LLM cost | 3 | Expensive — multi-call agent loops, Opus-tier needed for reliability, ~£0.15+/conversation |
Weighted score: (9 × 0.4) + (2 × 0.3) + (3 × 0.3) = 3.6 + 0.6 + 0.9 = 5.1
Candidate D: AI semantic search across the user's data — natural-language search instead of keyword filters.
| Axis | Score | Reasoning |
|---|---|---|
| User value | 7 | Strong — search is high-frequency in most SaaS |
| Engineering cost | 4 | Moderate-hard — needs embeddings + vector DB + relevance tuning |
| LLM cost | 7 | Cheap — embedding generation is cheap, query-time uses small model |
Weighted score: (7 × 0.4) + (4 × 0.3) + (7 × 0.3) = 2.8 + 1.2 + 2.1 = 6.1
Ranking:
- AI-generated email replies — 6.9
- AI document Q&A — 6.5
- AI semantic search — 6.1
- AI support agent — 5.1
Notice what happened: the most exciting feature on the list (the autonomous agent) ranked last, because it scored badly on engineering cost and LLM cost. The most boring feature (pre-filled email replies) ranked first.
This is the pattern. Boring features ship. Exciting features die in beta.
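If you want to sanity-check the arithmetic, here's the ranking reproduced in Python; the scores are the ones from the tables above, and ties break on user value:

```python
def weighted_score(uv: int, eng_inv: int, llm_inv: int) -> float:
    # Same formula as section 2; inputs 1-10, cost axes scored inversely.
    return uv * 0.4 + eng_inv * 0.3 + llm_inv * 0.3

# (name, user_value, engineering_cost_inv, llm_cost_inv) from the tables above
candidates = [
    ("AI document Q&A",            8, 5, 6),
    ("AI-generated email replies", 6, 8, 7),
    ("AI support agent",           9, 2, 3),
    ("AI semantic search",         7, 4, 7),
]

# Sort by weighted score, breaking ties on user value.
ranked = sorted(candidates,
                key=lambda c: (weighted_score(*c[1:]), c[1]),
                reverse=True)
for name, uv, eng_inv, llm_inv in ranked:
    print(f"{weighted_score(uv, eng_inv, llm_inv):.1f}  {name}")
# 6.9  AI-generated email replies
# 6.5  AI document Q&A
# 6.1  AI semantic search
# 5.1  AI support agent
```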
4. The disqualifier checklist
Score is necessary but not sufficient. A feature can rank #1 and still be wrong as v1 if it fails any of these:
No way to evaluate output quality? Disqualified. If you can't tell whether v1.1 is better than v1.0, you can't iterate. Every production AI feature needs an eval harness — and if the feature has no measurable correctness criterion, you're shipping vibes.
No clear cost ceiling? Disqualified. Before kickoff, write down the maximum monthly LLM bill you'll tolerate at projected scale. If the answer is "we'll see," the feature isn't ready.
No fallback for the bad path? Disqualified. What happens when the LLM API is down for 4 hours? When the response is malformed? When a user asks something the model refuses? If your answer is "the feature breaks," you're not ready for production.
No clear owner after launch? Disqualified. AI features are not fire-and-forget. Prompts drift, models get deprecated, costs creep up. Someone owns this — name them before kickoff.
If your top-scored feature fails any of these, drop to the next one and re-check. The right v1 is the highest-scoring feature that also clears the disqualifier list.
5. What "ship first" actually means
Once you've picked the feature, "ship" has a specific meaning:
- Scope-cut to the user-facing core. The first version does one thing well. For document Q&A, that means cited answers in v1 and conversational follow-ups in v2.
- Evals from day one. Even five hand-curated test cases are enough to start. Add more as you learn.
- Cost monitoring on every request. Tag every LLM call with feature name and user, and export to your dashboard (see the sketch after this list).
- Fallback strategy specified. Stale-cache, downgraded model, or graceful error UI — pick one before kickoff.
- Eval-driven iteration. Changes ship behind a flag, the eval suite runs, and you compare scores. No vibes.
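One possible shape for that cost monitoring, as a minimal sketch: it assumes hypothetical per-million-token prices and emits one JSON line per call, which any dashboard or log pipeline can aggregate by feature or user. The record fields are illustrative, not a prescribed schema.

```python
import json
import sys
from dataclasses import dataclass, asdict

# Hypothetical per-1M-token prices in GBP -- use your provider's real rates.
GBP_PER_1M_INPUT = 2.40
GBP_PER_1M_OUTPUT = 12.00

@dataclass
class LLMCallRecord:
    feature: str        # which AI feature made the call
    user_id: str        # which user triggered it
    input_tokens: int
    output_tokens: int
    cost_gbp: float

def record_llm_call(feature: str, user_id: str,
                    input_tokens: int, output_tokens: int) -> LLMCallRecord:
    """Tag one LLM call with feature and user, then emit a structured
    log line for your dashboard to aggregate."""
    cost = (input_tokens * GBP_PER_1M_INPUT
            + output_tokens * GBP_PER_1M_OUTPUT) / 1_000_000
    record = LLMCallRecord(feature, user_id, input_tokens, output_tokens, cost)
    print(json.dumps(asdict(record)), file=sys.stderr)  # or ship to your metrics store
    return record

# Usage: call this after every provider response, with the token counts
# the API reports back.
record_llm_call("email_replies", "user_123", input_tokens=900, output_tokens=250)
```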
This is what "production-grade" actually means. None of it is exotic. All of it gets skipped under deadline pressure, which is why most AI features never make it past the demo stage.
6. Get the scorecard template
I'm releasing the scorecard as a markdown template you can drop into Notion, Linear, or a Google Doc. It includes the weighted formula, the four worked examples above, and the disqualifier checklist as a pre-flight check.
Download the scorecard template →
Use it on your top three candidate features. Score honestly — there's no point inflating axis 3 to get the answer you wanted.
The shortcut
If your team is asking "which AI feature should we ship first?" — that's the question the £900 AI Integration Audit answers in 5 days, async, no meetings.
I run this scorecard against your codebase, score up to three candidate features, deliver a written architecture and rollout plan for the top recommendation, and project monthly cost at your current MAU and at 10× scale. ~60% of audit clients hire me for the build afterwards. The other 40% take the report and execute internally. Both are fine.
Or if you already know which feature, and you want it live in two weeks, that's the £3,500 AI Feature Sprint. Fixed scope, fixed price, production by week 2.
