Why Your GenAI Pilot Didn't Make It to Production
Most GenAI pilots succeed. Most attempts to take them to production fail. The gap is not the model: it is four structural problems that only surface at production scale.
There is a specific kind of GenAI project post-mortem we have sat through more times than we would like. The pilot produced impressive results. The demo went well. The steering committee approved the business case. Then — nothing. Months later, the project is quietly filed under “learnings” and the team moves on to the next pilot. The model was fine. The use case was real. Something else broke it.
By 2024, Gartner estimated that fewer than 30% of GenAI pilots had reached production. In our experience working with clients across financial services, enterprise software, and operations-heavy industries in Malaysia and Southeast Asia, the figure tracks: we have seen organisations run four or five well-executed pilots, and ship none of them. The reasons are consistent, predictable, and fixable — but only if you know what you are looking for before you start.
Failure Mode 1: Hallucination at Scale
In a pilot, humans review every output. A team of five people evaluating 200 responses over three weeks catches the model’s errors as part of the evaluation process. The errors are noted, the prompt is adjusted, and the evaluation continues. By the time the pilot concludes, the team has seen the model’s worst behaviour and has manually filtered it out of the results they present upward.
In production, nobody reviews every output. A system processing 10,000 requests per day with a 2% hallucination rate produces 200 wrong answers every day — some of them consequential. A customer service bot that confidently cites a return policy that does not exist. A contract analysis tool that misses a clause because the model generated a plausible summary rather than a faithful one. A knowledge base Q&A system that produces an internally consistent but factually incorrect answer to a compliance question.
The 2% figure is not hypothetical. GPT-4, evaluated on the TruthfulQA benchmark, achieved approximately 59% truthfulness at launch — meaning roughly 40% of its responses to adversarial questions were incorrect or misleading. On domain-specific enterprise content that differs from the training distribution, error rates for factual claims are typically higher than on general benchmarks.
The production answer to hallucination is not a better model — it is system design. RAG grounds outputs in retrieved documents, reducing but not eliminating hallucination. Output validation checks flag responses that contain no grounding citations or that contradict retrieved content. Confidence thresholds route low-confidence outputs to human review rather than to the end user. Human review triggers for high-stakes output categories (regulatory content, financial figures, medical information) catch errors before they cause harm. None of these are heroic engineering efforts. All of them must be designed before go-live, not added as patches after the first incident.
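The routing logic described above is simple enough to sketch. In the snippet below, the threshold value, the category names, and the assumption of a calibrated per-output confidence score are all illustrative; real values have to come from your own evaluation data, not from this example.

```python
from dataclasses import dataclass

# Illustrative values -- tune against your own evaluation data.
CONFIDENCE_THRESHOLD = 0.8
HIGH_STAKES_CATEGORIES = {"regulatory", "financial", "medical"}

@dataclass
class ModelOutput:
    text: str
    citations: list[str]   # IDs of retrieved passages the answer cites
    confidence: float      # score from a calibrated verifier, 0..1
    category: str          # output category assigned by a classifier

def route(output: ModelOutput) -> str:
    """Decide whether an output goes to the end user or to human review."""
    if not output.citations:
        # No grounding in retrieved content: never deliver directly.
        return "human_review"
    if output.category in HIGH_STAKES_CATEGORIES:
        # High-stakes categories always get a human in the loop.
        return "human_review"
    if output.confidence < CONFIDENCE_THRESHOLD:
        return "human_review"
    return "deliver"
```

The point of writing it this way is that the routing policy is explicit, testable, and reviewable before go-live, rather than implicit in whatever the application happens to do with a raw model response.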
Failure Mode 2: Prompt Brittleness
A prompt that works on 50 test cases is not a prompt that works. It is a prompt that works on 50 test cases.
Prompts do not generalise the way code does. A function that handles a sorted integer list will handle any sorted integer list. A prompt that correctly extracts invoice data from fifty sample invoices will fail on the fifty-first invoice that has a slightly different format, a non-standard field name, a table spanning two pages, or a scanned image with variable quality. The variance in real-world unstructured data is larger than any pilot dataset captures.
We have seen this failure mode appear in three ways. The first is silent degradation: the prompt works on 95% of production inputs and fails quietly on 5%, producing wrong outputs that are not obviously wrong and so are not caught until a downstream consequence makes them visible. The second is edge-case brittleness: the prompt fails on a specific class of inputs that did not appear in the pilot dataset — a particular document format, a language mix, an unusually long or short input — and the failure is acute rather than gradual. The third is adversarial fragility: in customer-facing applications, some users will discover inputs that break the prompt, whether deliberately or accidentally, and the results are typically embarrassing.
The fix is to treat prompts as engineering artefacts, not as configuration. Prompts should be version-controlled. Every prompt should have an associated test suite: a set of inputs with expected outputs that runs on every change to the prompt. Regression testing should run before any prompt change reaches production. This is not optional overhead — it is the minimum discipline needed to maintain a production LLM system. Teams that skip it spend their maintenance budget on incident response.
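As a sketch of what that discipline looks like in practice: keep each prompt's test cases in a version-controlled file next to the prompt itself, and run them as ordinary tests on every change. The JSON suite format and the `call_llm` signature below are assumptions for illustration, not a standard.

```python
import json
from pathlib import Path
from typing import Callable

def run_prompt_suite(
    prompt: str,
    suite_path: Path,
    call_llm: Callable[[str, str], str],
) -> list[str]:
    """Run every recorded test case against a prompt; return failure messages.

    `suite_path` points to a JSON file of {"id", "input", "expected"} cases,
    kept in version control alongside the prompt it tests.
    """
    failures = []
    for case in json.loads(suite_path.read_text()):
        actual = call_llm(prompt, case["input"])
        # Exact match suits structured extraction tasks; free-text outputs
        # need a semantic or rubric-based comparison instead.
        if actual.strip() != case["expected"].strip():
            failures.append(
                f"{case['id']}: expected {case['expected']!r}, got {actual!r}"
            )
    return failures
```

Wired into CI, a non-empty failure list blocks the prompt change from reaching production, which is exactly the regression gate the paragraph above calls for.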
Failure Mode 3: Context Window Economics
In a pilot, demonstrating that an LLM can process a full contract document by stuffing it into the context window is fine. The demo is compelling. The cost is invisible because the usage volume is low.
At 10,000 requests per day, with each request containing 10,000 tokens of document content plus a 1,000-token system prompt plus a 500-token user query, the arithmetic changes. At GPT-4o's pricing of $2.50 per million input tokens (as of mid-2024), that is 115 million input tokens per day: roughly $287.50 per day, or approximately $8,625 per month, just for input tokens, before output tokens, infrastructure, and operational costs. For a use case that was evaluated on a 20-request-per-day pilot, this number was never modelled.
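That calculation is worth keeping as a small reusable model rather than a one-off spreadsheet cell, so it can be re-run whenever volumes or pricing change. The function below just encodes the arithmetic above; the pricing figure is an input because it will not stay current.

```python
def monthly_input_cost(
    requests_per_day: int,
    tokens_per_request: int,
    price_per_million_tokens: float,
    days_per_month: int = 30,
) -> float:
    """Monthly spend on input tokens alone (excludes output tokens,
    infrastructure, and operational costs)."""
    daily_tokens = requests_per_day * tokens_per_request
    daily_cost = daily_tokens / 1_000_000 * price_per_million_tokens
    return daily_cost * days_per_month

# The scenario above: 10k requests/day, 10,000 + 1,000 + 500 tokens each,
# at $2.50 per million input tokens -> 115M tokens/day, $8,625/month.
cost = monthly_input_cost(10_000, 11_500, 2.50)
```

Running this once per candidate architecture, at realistic production volume, is the cheapest design review a GenAI project will ever get.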
Context window economics drive architectural decisions that cannot be retrofitted easily. If you have designed a system that loads full documents into context for every request, moving to a RAG architecture that retrieves only relevant passages requires re-engineering the core retrieval and prompt assembly logic. The time to make that decision is before you build, when you model the production cost at realistic volume.
Chunking strategy — how you split documents for retrieval — becomes one of the most important engineering decisions in a RAG system once context window cost is factored in. Chunks that are too small lose necessary context and produce poor retrieval quality. Chunks that are too large waste tokens and money on content that is not relevant to the query. The optimal strategy is document-type-specific and requires empirical testing on real data, not theoretical reasoning.
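To make the trade-off concrete, here is a deliberately simple greedy paragraph chunker. Production systems would count tokens rather than characters and apply document-type-specific rules, but the tension between chunk size and chunk coherence is the same.

```python
def chunk_by_paragraph(text: str, max_chars: int = 1500) -> list[str]:
    """Greedily pack paragraphs into chunks of at most `max_chars` characters.

    Splitting on paragraph boundaries keeps retrieved passages coherent;
    the size cap bounds how many tokens each retrieved chunk spends.
    Character counts stand in for token counts to keep the sketch
    dependency-free. A single paragraph longer than `max_chars` passes
    through unsplit and needs separate handling.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Varying `max_chars` against a retrieval quality metric on real documents is the empirical test the paragraph above describes; the right value differs between contracts, support tickets, and knowledge base articles.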
Failure Mode 4: No Production Owner
Pilots are owned by a project team. The project team is assembled for the pilot, produces results, and disbands when the pilot concludes. If the pilot is approved for production, the question of who owns it is typically deferred — someone will figure that out during implementation.
Nobody figures it out. Or rather, it gets figured out by default: the last team that touched it becomes responsible for it, which is usually the team least equipped to own it long-term.
Production GenAI systems require an operational owner with four specific responsibilities. Someone who monitors output quality continuously — not as a project activity but as an ongoing operational function, with dashboards, alerting, and a response process. Someone who manages prompt updates when the model’s outputs degrade or the use case requirements evolve, which happens every few weeks in an active production system. Someone who handles model update incidents when the foundation model provider updates the underlying model and the prompts break in ways that are subtle enough to miss for days. And someone who owns the escalation path when the system produces an output that creates a customer or compliance problem — who gets called, what the response protocol is, and what the rollback procedure looks like.
Asking who owns each of those four responsibilities before go-live, and getting a specific named person for each, is a better predictor of production success than anything in the model evaluation results.
What Production-Ready GenAI Actually Requires
Production-ready GenAI is not a better version of a pilot. It is a different engineering problem. The components that make a pilot succeed — a capable model, a working prompt, a convincing demo — are necessary but nowhere near sufficient.
The additional requirements are:

- An evaluation framework that runs continuously in production against a maintained golden dataset, not just at pilot time.
- Cost monitoring with alerting on anomalies, so that a prompt change that triples token consumption is caught before it reaches the monthly invoice.
- Graceful degradation when the model API is slow, expensive, or returning errors: what does the system do, and is that behaviour acceptable to users?
- Human review triggers for the output categories where errors are consequential.
- An operational runbook that documents what to do when things go wrong, written before go-live and tested before it is needed.
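Graceful degradation, for instance, can be as simple as a retry-then-fallback wrapper around the model call. The retry counts, backoff, and fallback message below are placeholders; what matters is that the degraded behaviour is a deliberate decision made before go-live.

```python
import time
from typing import Callable

def answer_with_fallback(
    query: str,
    call_llm: Callable[[str], str],
    retries: int = 2,
    backoff_s: float = 1.0,
    fallback: str = "We cannot answer that right now; a person will follow up.",
) -> str:
    """Retry transient model-API failures, then return a safe fallback
    message instead of surfacing a raw error or hanging the request."""
    for attempt in range(retries + 1):
        try:
            return call_llm(query)
        except Exception:
            if attempt == retries:
                break
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    return fallback
```

Whether the fallback is a canned message, a cached answer, or a handoff to a human queue is a product decision; the point is that the system has an answer to "what happens when the API is down" before the API goes down.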
The discipline gap between pilot and production is not insurmountable. But it requires acknowledging it exists before the pilot concludes, and budgeting for it before the engineering work begins. Teams that discover these requirements after go-live spend the first three months of production in reactive mode, patching a system that was never designed to handle what production actually demands.
The organisations that are running stable, valuable GenAI systems in production today are not the ones that ran the best pilots. They are the ones that asked the hard questions — about hallucination rates, prompt maintenance, context economics, and operational ownership — before the pilot report was presented upward.
Related Reading
- The True Total Cost of a GenAI Deployment — Cost surprises are one of the most common reasons pilots stall before reaching production.
- Building a GenAI Centre of Excellence — The organisational structure that prevents teams from repeating the same pilot failures indefinitely.
- Nematix Generative AI Services — How Nematix helps engineering teams move GenAI from a successful pilot to a stable production system.
Learn how Nematix’s Innovation Engineering services help businesses build production-ready AI systems.