Agentic AI7 min readMay 8, 2026

Agentic Pilot-to-Production: A 90-Day Runbook

Most internal AI agents die between the demo and production. A 90-day runbook that takes a pilot from notebook to deployed-behind-IAM with eval harness, observability, and governance — built from real engagement data.

Technova Team

Expert Insights

Share:
Agentic Pilot-to-Production: A 90-Day Runbook

90 days from whiteboard idea to a real agent in production behind your IAM, with eval coverage and ISO-42001-aligned governance docs. That's the spec on our Agentic Pilot-to-Production offer and it's the constraint we've optimised for across multiple production deployments.

This post is the runbook — the sequence of moves, decisions, and deliverables that separate the pilots that ship from the pilots that die. It's written for the engineering leader, head of AI, or product owner running a real attempt, not for the executive doing exploratory reading.

The 90-day spine

Three phases, each with one critical exit criterion.

PhaseWeeksExit Criterion
Discovery1–2Use case locked, data accessible, eval design agreed
Build3–8Agent passing golden set, observable, behind IAM
Hardening + Production9–12Eval-gated CI/CD live, governance package signed off

Most failed pilots fail because they don't actually exit a phase before starting the next. The discovery exit criterion in particular is non-negotiable.

Weeks 1–2: Discovery

The most common failure mode at week 1 is starting to code. Resist.

What discovery has to produce:

Use case lock. One sentence describing the user, the action, and the success criterion. "Customer support agent that responds to Tier-1 inquiries with 90% accuracy on golden set, escalating to human at 80% confidence floor." Not "an AI assistant for customer support." Specificity is the deliverable.

Data inventory. What data the agent needs, where it lives, who owns it, what classification it carries. This is also where you discover the data doesn't exist in queryable form (the FAQ is in a Confluence space nobody's updated since 2023, the product catalog has six versions, the customer history lives in a CRM the IT team can't grant access to in the timeline). Better to discover at week 1 than week 6.

Architecture decision. Model choice (frontier vs sovereign vs hybrid), retrieval pattern (RAG vs tool-use vs both), deployment surface (web vs API vs MCP-bound), tenancy (multi-tenant vs single-tenant). These decisions cascade — making them at week 2 versus week 5 saves 3 weeks of rework.

Eval design. Not the eval itself yet — the design. What does success look like? What inputs go in the golden set? What edge cases go in the adversarial set? Who curates them?

Cost model. Estimated cost per request, expected request volume, monthly cost ceiling. Without this, the pilot becomes a finance issue at week 8.

The output of weeks 1–2 is a 4–8 page scoping document. If you can't produce one, you don't have a use case yet.

Weeks 3–8: Build

Six weeks of engineering. The shape of the work:

Week 3: Skeleton. End-to-end skeleton in production-shaped infrastructure. Hello-world quality response, but routed through your real infrastructure (auth, deployment, telemetry stub). Skeleton-first prevents the late-stage discovery that something foundational doesn't work.

Weeks 4–5: Capability. The actual agent capability. Prompt engineering, retrieval, tool use. By end of week 5, the agent should be doing the right thing on a happy-path query.

Weeks 6–7: Robustness. Edge cases, error handling, escalation paths, fallbacks. The 80/20 inversion — the first 80% of capability happens in 20% of the time; the last 20% takes the rest.

Week 8: Eval pass. Run against the golden set and adversarial set. If pass rates don't hit the agreed thresholds, this is where the timeline slips. Build a buffer; assume it slips.

Production-shaped infrastructure from day one. Three concrete things this means:

  1. Behind your IAM. OIDC or SAML integration to your identity provider, day-one. Not a "we'll add it later" — later is week 11 and SSO integration takes a week, which you don't have.

  2. Through your AI Gateway. Vercel AI Gateway or equivalent. Cost telemetry and provider failover from day one. You'll need both.

  3. MCP-bound. Tool access through a Model Context Protocol layer with auditable permissions. Not direct API calls in the agent code. The audit trail you'll need at week 12 has to be built at week 3.

Weeks 9–10: Hardening

The phase that separates pilots that ship from pilots that pretend to ship.

Eval gates in CI/CD. Every PR runs the eval suite. Drops below the threshold block deploy. This is a culture change as much as a tooling change — engineers will resist eval-gating until they've seen it catch a regression once.

Red-team pass. Adversarial inputs designed to break the agent. Prompt injection, jailbreaks, ambiguous inputs that should escalate, edge cases that should fail gracefully. We use a separate engineer who didn't build the agent to run the red-team — fresh eyes catch what the builder rationalises away.

Cost ceiling enforcement. Daily and monthly limits at the gateway. Anomaly alerts on cost-per-request. Test the ceiling — synthetically push traffic to validate the ceiling actually throttles before production goes live.

Security review. Penetration test on the agent endpoints. Authentication boundaries. Data-leakage testing — does the agent ever return retrieved context that the user shouldn't see based on their permissions?

Weeks 11–12: Production

Production cutover and handover.

Phased rollout. 10% of traffic for 24 hours. 50% for 48 hours. 100%. Each phase gated on metrics holding within tolerance.

Governance package. AI inventory entry, risk classification, monitoring plan, incident response runbook, retirement criteria. Auditor-recognisable format. This is what makes the agent legitimate inside the organisation, not just operational.

Knowledge transfer. Pair-programming sessions with the client team. Documentation written for in-house ownership. The team that built it isn't the team that runs it; the handover has to be deliberate.

Operations runbook. What metrics to watch daily, what to do when each one trips, how to roll back, who to call.

By end of week 12: the agent is in production, instrumented, governed, and the client team can operate it. Codenovai-side, the engagement has often transitioned to a Fractional AI Team retainer for the second and third agents — the playbook compounds, the per-agent cost drops.

What goes wrong (and how to avoid it)

Three failure patterns that account for most pilot failures we've seen:

Failure 1: Scope creep at week 6

The pilot was scoped for one use case. By week 6, stakeholders have asked "can it also do X?" three times. Saying yes once costs two weeks. Saying yes three times costs the timeline.

Defence: lock scope at end of week 2. Track scope-change requests in a backlog for "v2" without expanding v1.

Failure 2: Data access blocked by IT

The agent needs data from a system whose owner won't grant access in the timeline. By the time this surfaces, you're at week 5 and have built around an assumption that's now invalid.

Defence: data access is a discovery exit criterion, not an assumption. If access isn't granted by end of week 2, the use case is wrong or the timeline is wrong.

Failure 3: Eval set built too late

The team starts with "we'll evaluate by hand at the end." By week 10, the eval is informal, no thresholds, no quantitative answer to "is it good enough."

Defence: eval set design at week 1, eval set populated to 50% by end of week 4, full population by end of week 6, eval-gated CI/CD by week 9.

Where Codenovai fits

We run this 90-day program as our Agentic Pilot-to-Production offer at a starting price of USD 165,000 (single-agent) / USD 320,000 (multi-agent). The runbook above is the one we use. The eval harness and observability stack we ship is the same one we run on OpenClaw and Hisabi.ai.

Book a scoping call — we typically write back with a yes/no on fit within 48 hours, and the fixed Statement of Work follows within 5 business days.

The same five reasons in roughly the same order: (1) eval coverage doesn't exist, so behaviour drift is invisible until a customer complains; (2) the agent never gets behind SSO and IAM, so security blocks it; (3) cost runaway risk has no ceiling, so finance blocks it; (4) governance documentation isn't audit-ready, so compliance blocks it; (5) the team that built the pilot can't operate it day-to-day, so it dies of neglect. The 'AI capability' was never the problem.

It's the empirically tight floor for a single-agent production deploy with enterprise governance. We've done it in 60 days when the use case was tightly scoped and the data was clean; we've seen 180-day attempts fail because scope expanded mid-flight. 90 days assumes the use case is locked at scoping, the data is mostly accessible, and the IAM integration follows a known pattern. Multi-agent or cross-system orchestration adds 30–60 days.

A production eval harness is three things: a curated golden set (50–500 representative cases with known-good outputs), an adversarial set (edge cases, attempted prompt injections, ambiguous inputs), and a measurement pipeline that runs the harness pre-deploy and continuously post-deploy with drift alerts. Tools like Braintrust, Langfuse, or Arize provide the pipeline. The golden set is custom and is the deliverable that takes longest.

Three layers. Layer 1: per-tenant rate limits at the application gateway. Layer 2: daily and monthly cost ceilings at the AI Gateway level (Vercel AI Gateway has these built in). Layer 3: anomaly detection on cost telemetry — if average cost-per-request doubles unexpectedly, alert before the bill arrives. We've shipped agents that absorbed 10× a normal traffic day without exceeding their cost ceiling because layer 2 throttled gracefully.

From the agency side: 1 senior AI engineer (full-time), 1 LLMOps/evals engineer (half-time), 1 strategist (quarter-time, mostly weeks 1–3 and 11–12). From the client side: 1 product owner (point of contact, week-by-week decisions), 1 engineering liaison (IAM, infra, deployment), and access to the data owner. Engagements that don't have a named client product owner go sideways by week 4.

Enjoyed this article?

Subscribe to our newsletter for more expert insights on AI, web development, and business growth in Dubai.