Generative AI
The True Total Cost of a GenAI Deployment
Nov 01, 2024


API costs are visible. The hidden costs of GenAI — prompt engineering labour, vector storage, evaluation overhead, model churn — are where budgets break.


Every GenAI business case we have reviewed in the past two years has the same structure: a line item for API costs, a headline ROI figure, and a narrative that looks compelling until you run it against what the system actually costs to build and operate. The API costs are real and easy to calculate. The other costs are real and rarely appear in the initial business case, because nobody asks about them until they show up on an invoice or in a headcount request.

This is the “it’s just API calls” trap. It produces two predictable outcomes: projects that are approved on a business case that does not survive contact with reality, and projects that are cancelled when the real costs emerge mid-implementation. Neither outcome is useful. Here is what the full cost picture actually looks like.

Token Costs: The Visible Tip

Token costs are the most visible component and the one most business cases get right. Every major LLM provider publishes pricing per million tokens, and the arithmetic is straightforward once you know your expected usage volume.

A concrete calculation: a document processing system handling 1,000 daily requests, each with a 4,000-token document plus a 500-token system prompt as input, and a 500-token summary as output, uses 4.5 million input tokens and 500,000 output tokens per day. At GPT-4o pricing as of late 2024 ($2.50 per million input tokens, $10.00 per million output tokens), that is $11.25 in input costs and $5.00 in output costs per day — approximately $492 per month. Scale to 10,000 daily requests and the figure reaches $4,920 per month, for token costs alone.
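The arithmetic above can be wrapped in a small estimator. The prices are the GPT-4o rates quoted in the text and will drift; treat this as a sketch, not a billing tool.

```python
# Monthly token cost estimator for the worked example above.
# Prices are GPT-4o rates as of late 2024 (USD per million tokens).

def token_cost(daily_requests, input_tokens, output_tokens,
               input_price=2.50, output_price=10.00, days=30):
    """Return (daily_cost, monthly_cost) in USD."""
    daily = (daily_requests * input_tokens / 1e6 * input_price
             + daily_requests * output_tokens / 1e6 * output_price)
    return daily, daily * days

# 4,000-token document + 500-token system prompt in, 500-token summary out
daily, monthly = token_cost(1_000, 4_500, 500)
```

On a flat 30-day month this yields $16.25 per day, about $488 per month; the article's ~$492 figure assumes a slightly longer average month. The structure matters more than the rounding: scaling `daily_requests` tenfold scales the bill tenfold.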

The calculation is not complex. What is frequently missed: input tokens include everything in the prompt, not just the user’s query. The system prompt, the retrieved document chunks in a RAG pipeline, the conversation history in a chat application — all of these count. A well-engineered system minimises unnecessary context. A naively designed system routinely sends three to five times more tokens than the task requires, and the cost compounds with volume.

Embedding costs add a separate line item. Generating vector embeddings for a corpus of 100,000 documents at OpenAI’s text-embedding-3-small pricing ($0.02 per million tokens) is inexpensive for the initial embedding run, but re-embedding when the corpus updates — or when the embedding model changes — is a recurring cost that should appear in the operating model.
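A sketch of the embedding line item, assuming roughly 1,000 tokens per document (an assumption for illustration, not a figure from the text):

```python
# Embedding cost sketch at text-embedding-3-small pricing
# ($0.02 per million tokens, as of 2024).
# tokens_per_doc is a hypothetical average; measure your own corpus.

EMBED_PRICE_PER_M_TOKENS = 0.02  # USD

def embedding_cost(n_docs, tokens_per_doc=1_000):
    return n_docs * tokens_per_doc / 1e6 * EMBED_PRICE_PER_M_TOKENS

initial_run = embedding_cost(100_000)     # one-off: embed the whole corpus
monthly_refresh = embedding_cost(10_000)  # if ~10% of the corpus changes monthly
model_migration = embedding_cost(100_000) # full re-embed when the model changes
```

The per-run figures are small, a few dollars at most, which is exactly why the recurring re-embedding cycle gets left out of business cases: the real cost is the validation work around each migration, not the compute.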

Prompt Engineering Labour: The Hidden Maintenance Cost

Prompt engineering is frequently treated as a one-time project activity: define the task, write a prompt, test it, ship it. This is accurate for the first deployment. It is completely wrong as a description of what prompt engineering costs in a production system.

Production prompts require ongoing maintenance. When the foundation model provider updates the underlying model — which happens without advance notice, multiple times per year — prompt behaviour changes. Sometimes subtly. Sometimes enough to break outputs that were previously reliable. Detecting the change, diagnosing which prompts are affected, rewriting and retesting the affected prompts, and validating the fix is a significant engineering effort, and it happens repeatedly.

Prompt maintenance also responds to business requirements: the taxonomy of document types expands, the output format is standardised differently by a downstream consumer, a new edge case is discovered in production and must be handled. Each of these requires a prompt engineer or a senior engineer to diagnose, implement, and test.

A mid-complexity GenAI system with ten to fifteen distinct prompts will realistically require one senior engineer spending 20–40% of their time on prompt maintenance and evaluation in steady state. At market rates for a senior engineer in Malaysia of approximately MYR 8,000–12,000 per month, a midpoint 30% allocation costs MYR 2,400–3,600 per month — not a rounding error, and not a cost that appeared in the original API-cost business case.

Vector Infrastructure: Embedding Storage and Retrieval

RAG-based systems require a vector database to store and query embeddings. The infrastructure cost is easy to overlook because it appears at a different vendor from the LLM API and is typically smaller in absolute terms — until the corpus grows.

Pinecone’s pod-based hosting charges per pod-hour (approximately $0.096 per hour for a p1.x1 pod on the standard tier as of 2024, with each pod holding on the order of one million vectors, and query capacity tied to pod count). A corpus of 10 million document chunks — a large but not unusual enterprise knowledge base — needs roughly ten pods, or about $700 per month in storage alone, before scaling up for query volume.

Re-embedding is the cost that most business cases miss entirely. Embedding models improve. When OpenAI released text-embedding-3-small in early 2024, it outperformed text-embedding-ada-002 on most retrieval benchmarks at lower cost. Migrating to the new model required re-embedding the entire corpus — a significant one-time compute cost, plus the storage and validation work to confirm that retrieval quality had improved rather than degraded. This re-embedding cycle is not a one-time event; it is a recurring cost that should be planned for as embedding technology continues to evolve.

Self-hosted alternatives like pgvector eliminate the per-vector storage cost but require infrastructure engineering and operational capacity. The choice between managed and self-hosted vector infrastructure is partly a cost question and partly a capability question: managed services are cheaper at low scale; self-hosted wins at high scale, provided you have the operational capacity to run it.
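The managed-versus-self-hosted trade-off can be made concrete with a break-even sketch. Every number below is an illustrative assumption: the managed per-million-vector rate, the server cost, and the ops figures are hypothetical, not vendor quotes.

```python
# Hypothetical break-even: managed vector DB vs self-hosted pgvector.
# All inputs are assumed figures for illustration only.

def managed_monthly(n_vectors_millions, managed_per_m=70.0):
    # assumed managed price: $70 per million vectors per month
    return n_vectors_millions * managed_per_m

def self_hosted_monthly(server_cost=400.0, ops_hours=10, ops_rate=60.0):
    # assumed: one Postgres server plus ~10 ops-hours per month
    return server_cost + ops_hours * ops_rate

small = managed_monthly(1)    # low scale: managed is cheaper
large = managed_monthly(50)   # high scale: self-hosted wins
baseline = self_hosted_monthly()
```

Under these assumptions the crossover sits around 14 million vectors. The point is not the exact number but that a crossover exists, and that the self-hosted side only wins if the ops-hours line is realistic for your team.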

Evaluation and Quality Assurance: The Cost of Getting It Right

Evaluation is where the most consistent underinvestment occurs, and where the cost of skipping it is paid most expensively later.

Building a golden dataset — a curated set of questions with known correct answers, covering the full range of the production use case — requires domain expert time. For a regulatory document Q&A system, building a 500-question golden dataset with verified correct answers might require two to three weeks of a compliance expert’s time. That expert is not cheap, and the dataset requires maintenance as the underlying content changes.

Running RAGAS evaluations on a production-scale RAG pipeline produces metrics on faithfulness, answer relevancy, and context precision that surface quality issues before they reach users. Setting up the evaluation pipeline is an engineering effort — typically one to two weeks for a senior engineer — and running it continuously as prompts and retrieval parameters change is an ongoing cost.

Human spot-check programmes — where domain experts review a random sample of production outputs weekly — catch degradation that automated metrics miss. A programme that involves five hours per week of expert reviewer time, at market rates, is a modest but real ongoing cost that must appear in the operating model.

The alternative to this investment is not avoiding the cost — it is paying it as incident response after a consequential error surfaces in production.

Human Review and Escalation: Mandatory in Regulated Contexts

Any production GenAI system deployed in a regulated context — financial services, healthcare, legal, insurance — requires a human review layer for a defined category of outputs. This is not optional. Regulators do not currently accept “the model said so” as a sufficient explanation for consequential decisions.

The cost of the human review layer includes: the staff time for reviewers (which scales with volume), the tooling to present outputs for review efficiently (typically a custom interface that is not free to build or maintain), and the workflow design to integrate review into the user journey without creating a bottleneck that destroys the efficiency benefit of automation.

A conservative estimate for a document processing system in financial services: if 15% of outputs require human review, and each review takes three minutes of a qualified reviewer’s time, then 1,000 daily requests generate 150 daily reviews consuming 7.5 hours of reviewer time. At a loaded cost of MYR 40 per hour for a qualified reviewer, that is MYR 300 per day, approximately MYR 9,000 per month — substantially more than the token costs for the same volume.
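The review-cost arithmetic above, as a reusable sketch (the rates are the article’s illustrative MYR figures):

```python
# Human review cost model for the financial-services example above.

def monthly_review_cost(daily_requests, review_rate, minutes_per_review,
                        hourly_rate, days=30):
    """Monthly reviewer cost, in the same currency as hourly_rate."""
    daily_hours = daily_requests * review_rate * minutes_per_review / 60
    return daily_hours * hourly_rate * days

# 15% of 1,000 daily outputs, 3 minutes each, MYR 40/hour loaded cost
cost = monthly_review_cost(1_000, 0.15, 3, 40)
```

The review rate is the lever worth sensitivity-testing: at a 25% review rate the same system costs MYR 15,000 per month in reviewer time alone.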

Model Update Churn: The Invisible Recurring Expense

Foundation model providers update their models. They do not always announce it in advance. They do not guarantee that prompt behaviour is preserved across updates. And when the behaviour changes, the cost falls entirely on the customer.

A model update incident follows a predictable pattern: outputs start degrading in a way that is subtle enough to miss for one to three days, someone notices, diagnosis begins (is this a data issue? a prompt issue? a model issue?), the root cause is identified as a model update, every prompt in the system is retested against the golden dataset, the broken prompts are rewritten and retested, the fix is deployed. The end-to-end process for a mid-complexity system typically takes three to five engineering days.

OpenAI updated the underlying model behind the gpt-4-turbo endpoint multiple times in 2024 without major announcements. Teams that were relying on specific prompt behaviours discovered the updates through output degradation rather than release notes. Teams with comprehensive evaluation pipelines detected the changes quickly. Teams without them discovered them through user complaints.

The cost of a model update incident should be estimated and included in the operating model. For a team of three engineers, five days of incident response is fifteen engineering days per incident, plus the cost of any user-facing quality degradation during the detection window.

A Realistic Total Cost Framework

Building a credible business case for a production GenAI system requires modelling six cost categories:

- API token costs at projected volume
- Prompt engineering labour in steady state (not just at build time)
- Vector infrastructure: storage, query, and re-embedding cycles
- Evaluation and quality assurance: tooling build, golden dataset creation, ongoing evaluation runs
- Human review and escalation: reviewer time, tooling, workflow
- Model update response budget: estimated engineering days per incident, at estimated incident frequency

Token costs are typically 15–30% of the total cost of operating a production GenAI system at enterprise scale. The remaining 70–85% is labour — engineering and domain expert time — and infrastructure. Business cases that model only token costs are underestimating total cost by a factor of three to seven.
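The six categories can be rolled into a single model. Every figure below is a hypothetical monthly amount in one currency, chosen only to illustrate the token-share pattern; substitute your own estimates.

```python
# Illustrative monthly cost rollup. All figures are hypothetical.

monthly_costs = {
    "api_tokens":          4_920,  # 10,000 requests/day token example above
    "prompt_engineering":  5_000,  # steady-state maintenance labour
    "vector_infra":        1_200,  # storage, queries, amortised re-embedding
    "evaluation_qa":       3_500,  # eval runs, golden-set upkeep, spot checks
    "human_review":        6_000,  # reviewer time plus tooling amortisation
    "model_update_budget": 2_000,  # incident-response days, amortised monthly
}

total = sum(monthly_costs.values())
token_share = monthly_costs["api_tokens"] / total
```

In this illustrative mix the token line is about a fifth of the total, consistent with the 15–30% range cited above; the rest is labour and infrastructure.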

GenAI can still be worth it. We have seen it transform document processing workflows that previously required large manual teams, improve decision support quality in ways that have measurable business impact, and enable products that could not have been built without it. But the ROI calculation that justifies the investment must be based on the full cost. The organisations that have built durable GenAI capability know what they are actually spending, because they modelled it before they started.


Learn how Nematix’s Innovation Engineering services help businesses build production-ready AI systems.