LLM Gateway Architecture for Production
When multiple teams use multiple LLM providers across many applications, you need an LLM gateway. Here is what a production-grade gateway does.
A single team with a single LLM integration can manage API keys and costs directly. An organisation with five teams, three LLM providers, and twenty applications cannot — not without infrastructure that sits between the applications and the providers. The LLM gateway is that infrastructure layer, and the teams that invest in it early are the ones that can scale GenAI across the organisation without losing control of costs, security, or reliability.
The pattern is not novel. API gateways for HTTP services have existed for decades. What is specific to LLMs is the shape of the problems: token-based cost attribution, model-specific routing logic, semantic caching, and the particular resilience requirements that come from provider outages and rate limit behaviour that differs significantly between OpenAI, Anthropic, and Google.
What a Gateway Does
At its core, an LLM gateway is a reverse proxy. Every application in the organisation makes LLM requests to a single internal endpoint — the gateway — rather than directly to the provider APIs. The gateway handles authentication, routing, policy enforcement, logging, and resilience. Applications do not need to know which provider they are talking to, which model is handling a given request, or what the current rate limit situation looks like.
Application A ─┐                 ┌──▶ OpenAI
Application B ─┤──▶ LLM Gateway ─┼──▶ Anthropic
Application C ─┤                 └──▶ Google Vertex AI
Application D ─┘
The gateway owns the provider credentials. No application team holds API keys directly. This eliminates the credential sprawl that occurs when each team manages its own provider relationships — a common situation that makes it impossible to rotate credentials across the organisation when a key is compromised, or to enforce consistent usage policies.
Beyond the security benefit, the gateway is the single point at which you can observe, control, and optimise everything that touches the LLM layer.
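As a sketch of that request path, consider the following. All names here are illustrative, not a real gateway's API: applications send a task class to one internal endpoint, and the gateway resolves the provider, model, and credential on their behalf.

```python
from dataclasses import dataclass

@dataclass
class GatewayRequest:
    application_id: str
    task_class: str
    prompt: str

# The gateway, not the application, owns provider credentials (placeholder values).
PROVIDER_KEYS = {"openai": "internal-only", "anthropic": "internal-only"}

# Illustrative route table: task class -> (provider, model).
ROUTES = {
    "classification": ("openai", "gpt-4o-mini"),
    "analysis": ("anthropic", "claude-sonnet"),
}

def handle(req: GatewayRequest) -> dict:
    # Resolve provider and model on the caller's behalf; the application
    # never sees which provider serves it or which key is used.
    provider, model = ROUTES.get(req.task_class, ("openai", "gpt-4o"))
    return {
        "provider": provider,
        "model": model,
        "application_id": req.application_id,
    }
```

A production gateway would authenticate the application, check budgets and rate limits, and forward the request to the provider before returning, but the shape is the same: one entry point, all policy decisions behind it.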
Model Routing: The Core Value Proposition
The routing layer is where most of the economic value of an LLM gateway is realised. Not every request needs the most capable model. Routing intelligently to the cheapest model that can handle a given request meaningfully reduces cost at scale.
A simple routing policy might look like this:
| Task Type | Routed To | Rationale |
|---|---|---|
| Intent classification | GPT-4o mini / Claude Haiku | Low complexity, latency-sensitive |
| FAQ response generation | GPT-4o mini / Claude Haiku | Structured output, constrained domain |
| Complex document analysis | GPT-4o / Claude Sonnet | Multi-step reasoning required |
| Long-context document processing | Gemini 1.5 Pro | 1M token context window |
| Code generation | GPT-4o / Claude Sonnet | Code-specific capability |
Routing decisions can be driven by several signals: explicit request parameters from the calling application (the application knows the task type and sets a task_class header), automated classification of the incoming request by the gateway’s own lightweight classifier, latency requirements (time-sensitive requests route to faster models), or departmental cost budgets (when a department’s budget is 80% consumed, route to cheaper models automatically).
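A minimal sketch of a policy that combines these signals. The model names, task class labels, and the 80% budget cut-over are illustrative values, not fixed recommendations:

```python
def route(task_class: str, latency_sensitive: bool, budget_used: float) -> str:
    """Choose a model from the routing signals above. budget_used is the
    fraction of the department's budget already consumed (0.0 to 1.0)."""
    CHEAP, CAPABLE = "gpt-4o-mini", "gpt-4o"
    if budget_used >= 0.80:       # budget pressure overrides everything else
        return CHEAP
    if task_class in {"intent_classification", "faq_response"}:
        return CHEAP              # low-complexity task classes
    if latency_sensitive:
        return CHEAP              # smaller, faster model for tight latency targets
    return CAPABLE
```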
In practice, well-implemented model routing reduces LLM spend by 30–60% compared to routing everything to the highest-capability model. The savings come from volume: the majority of requests in most enterprise GenAI deployments are classification, extraction, or retrieval tasks that smaller models handle with equivalent quality.
Cost Control: Enforced at the Gateway
Without a gateway, LLM costs accumulate wherever requests happen to originate, and no single team has visibility into the full picture. A batch processing job that runs overnight against a 128K-context model at $15 per million input tokens can generate thousands of dollars of cost in a single run — a fact that typically surfaces on the monthly invoice rather than in real time.
The gateway enforces cost controls at the point of request.
Per-application token budgets — each registered application has a monthly or daily token limit. When the budget is 80% consumed, the gateway begins routing to cheaper fallback models. When it is 100% consumed, requests are throttled until the budget resets or is manually increased.
Per-department cost attribution — every request carries a department or cost centre identifier in its headers. The gateway logs input and output token counts with the department attribution. Real-time dashboards show spend by department, application, model, and time period. Departments can see their own spend without requiring access to the provider billing console.
Cost alerts — configurable thresholds trigger alerts when spend in a rolling window exceeds a defined level. An application that suddenly generates 10x its normal token volume is anomalous; the alert surfaces this within minutes rather than at month-end.
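The budget mechanics above reduce to a small state machine. A sketch, with the 80% and 100% thresholds from the policy described:

```python
class TokenBudget:
    """Per-application token budget with the degrade-then-throttle policy above."""

    def __init__(self, limit_tokens: int):
        self.limit = limit_tokens
        self.used = 0

    def record(self, tokens: int) -> None:
        self.used += tokens

    def decision(self) -> str:
        frac = self.used / self.limit
        if frac >= 1.0:
            return "throttle"   # budget exhausted: reject until reset or increase
        if frac >= 0.8:
            return "fallback"   # 80% consumed: route to cheaper models
        return "normal"
```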
This level of cost governance is effectively impossible to implement when applications communicate directly with providers. The gateway makes it straightforward.
Rate Limiting: Protecting Both Sides
Rate limiting in an LLM gateway serves two distinct purposes: protecting the organisation’s provider quota from being consumed by a single application or user, and protecting downstream providers from requests that exceed their rate limits and would be rejected.
Provider rate limits vary significantly by account tier and model. GPT-4o at the Tier 4 OpenAI account level supports 10,000 requests per minute and 2,000,000 tokens per minute — but a large batch job and several interactive applications consuming from the same quota simultaneously can saturate these limits quickly.
The gateway implements rate limiting at three levels:
Per-user limits: an individual user’s requests are throttled to prevent a single user from consuming disproportionate quota. Relevant for multi-tenant applications where user behaviour is unpredictable.
Per-application limits: each application has a maximum sustained request rate. This prevents a single poorly-optimised application from crowding out other applications.
Global limits: the gateway tracks the organisation’s aggregate token consumption rate against the known provider rate limits, and queues or sheds load before hitting the limit — rather than allowing requests to reach the provider and receive 429 errors.
Queue management for burst traffic uses a back-pressure mechanism: requests that arrive when the rate limit is saturated are queued with a configurable timeout. If the queue drains within the timeout, the request is served. If not, the caller receives an appropriate error with a Retry-After header. This is more graceful than the 429 responses that come from direct provider calls under load.
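The global level is typically a token bucket sized to the provider's tokens-per-minute limit. A sketch with timestamps passed in explicitly (a real implementation would read a monotonic clock):

```python
class TokenBucket:
    """Token-bucket limiter for the global level above: refills continuously
    at the provider's tokens-per-minute rate, capped at one minute's quota."""

    def __init__(self, tokens_per_minute: float, now: float = 0.0):
        self.rate = tokens_per_minute / 60.0   # tokens replenished per second
        self.capacity = float(tokens_per_minute)
        self.level = self.capacity
        self.last = now

    def allow(self, cost: float, now: float) -> bool:
        # Refill according to elapsed time, capped at capacity.
        self.level = min(self.capacity, self.level + (now - self.last) * self.rate)
        self.last = now
        if self.level >= cost:
            self.level -= cost
            return True
        return False   # caller queues or sheds; no 429 ever reaches the provider
```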
Semantic Caching
Not all LLM requests are unique. In high-volume applications — particularly FAQ chatbots, document classification pipelines, and internal knowledge search — a significant fraction of incoming queries are semantically similar to queries the system has already answered. Semantic caching allows the gateway to return a cached response for queries that are close enough to a previously-answered query, without hitting the LLM at all.
Semantic caching works by embedding the incoming query, comparing the embedding to a cache of recent query embeddings, and returning the cached response if a nearest neighbour is found within a configurable similarity threshold. GPTCache is the most commonly used open-source implementation; Redis with a vector index plugin is a lower-infrastructure alternative.
The economics are compelling for the right use cases. A FAQ chatbot where 40% of incoming queries are paraphrases of the same ten questions can reduce its effective LLM call volume by that same 40% — a direct cost reduction with no quality tradeoff, since the cached responses are drawn from previously validated outputs.
The key design decision is the similarity threshold. A threshold too high (requiring very close semantic match) produces a low cache hit rate. A threshold too low (accepting queries that are similar but not equivalent) risks returning incorrect cached responses. The right threshold is determined empirically for each application; we typically start at a cosine similarity of 0.95 and adjust based on the observed hit rate and any incorrect-cache reports from evaluation.
Semantic caching is inappropriate for use cases where responses must be fresh — any application where the correct answer changes frequently, or where the user expects a response tailored to their specific phrasing rather than a closest-match retrieval.
Observability: Every Request, Attributed and Logged
Every request that passes through the gateway is logged with a structured record:
{
  "timestamp": "2025-05-14T09:23:41.882Z",
  "request_id": "req_8f3a9c2e",
  "application_id": "compliance-assistant",
  "department": "legal",
  "user_id_hash": "sha256:a3f8...",
  "model_requested": "gpt-4o",
  "model_served": "gpt-4o-2024-08-06",
  "input_tokens": 1842,
  "output_tokens": 347,
  "latency_ms": 1823,
  "cost_usd": 0.01384,
  "cache_hit": false,
  "provider": "openai"
}
User identifiers are hashed before logging — the gateway never stores raw PII. The structured log feeds into the cost dashboard, the rate limit monitoring, and the anomaly detection layer.
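Building such a record is a few lines. A sketch of the hashing and cost attribution; the per-million-token prices are parameters here, not current provider pricing:

```python
import hashlib

def log_record(user_id: str, input_tokens: int, output_tokens: int,
               price_in_per_mtok: float, price_out_per_mtok: float) -> dict:
    """Build a structured log entry like the one above: the raw user ID is
    hashed before it is stored, and cost is attributed per request."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return {
        "user_id_hash": "sha256:" + digest[:8],
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(input_tokens * price_in_per_mtok / 1e6
                          + output_tokens * price_out_per_mtok / 1e6, 5),
    }
```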
The observability layer is what makes all the other gateway functions useful. Cost control without visibility is guesswork. Rate limiting without observability means you discover problems when requests start failing rather than before they fail. Anomaly detection — a sudden spike in a specific application’s token consumption, a change in the output token distribution that might indicate prompt drift — requires the time-series data that comes from logging every request.
Fallback and Resilience
LLM providers experience outages. Rate limits are hit. Response times degrade during peak periods. The gateway handles this transparently to the calling applications.
The fallback chain is configured per model class: primary model fails → route to secondary model at the same capability tier → if that fails → route to a lower-capability model with a flag in the response indicating degraded mode. Callers that need to know they are in degraded mode (to change their behaviour accordingly) can inspect the response header; callers that do not care receive a successful response regardless.
Circuit breaker logic prevents cascading failures: if the primary model fails five consecutive requests within a 30-second window, the gateway stops routing to it for two minutes and directs all traffic to the fallback. This avoids the latency overhead of repeatedly attempting a failing endpoint under sustained outage conditions.
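The breaker state machine is small. A sketch matching the policy above (five failures in 30 seconds opens the circuit for two minutes), with timestamps injected for testability:

```python
class CircuitBreaker:
    """Open after `threshold` failures within `window_s`; stay open for
    `cooldown_s`, during which traffic goes to the fallback model."""

    def __init__(self, threshold: int = 5, window_s: float = 30.0,
                 cooldown_s: float = 120.0):
        self.threshold, self.window, self.cooldown = threshold, window_s, cooldown_s
        self.failures: list[float] = []
        self.opened_at: float | None = None

    def allow(self, now: float) -> bool:
        if self.opened_at is not None:
            if now - self.opened_at < self.cooldown:
                return False          # open: route this request to the fallback
            self.opened_at = None     # cooldown elapsed: retry the primary
            self.failures.clear()
        return True

    def record_failure(self, now: float) -> None:
        # Keep only failures inside the rolling window, then count this one.
        self.failures = [t for t in self.failures if now - t <= self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now

    def record_success(self) -> None:
        self.failures.clear()
```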
The result is that from the calling application’s perspective, the LLM layer is a single reliable endpoint. The gateway absorbs the complexity of provider availability, rate limit management, and model selection.
Build vs. Buy
LiteLLM is the leading open-source LLM gateway. It supports over 100 providers, handles model routing, implements basic rate limiting and cost tracking, and can be self-hosted. For teams with engineering capacity, it is the most flexible starting point — you own the infrastructure and can extend the routing logic for your specific requirements.
Portkey and Helicone are managed gateway services. Lower operational overhead, faster time to deployment, with managed caching, observability dashboards, and routing rules available through a web interface. The tradeoff is the standard managed-service tradeoff: less flexibility, an ongoing subscription cost, and your request logs leaving your infrastructure.
The case for building custom: proprietary models running on-premise, deep integration with internal authentication and authorisation systems, or routing logic that is specific enough to your use case that it cannot be expressed in a general-purpose gateway’s configuration. The case for LiteLLM or a managed service: everything else. The gateway’s core functions — routing, rate limiting, cost tracking, caching, fallback — are not differentiated. The value is in having them, not in having built them yourself.
Boring Infrastructure, Real Returns
The LLM gateway is unglamorous infrastructure. It does not appear in the product demo. It has no end-user experience to speak of. It is not where the interesting model engineering happens.
It is, however, the infrastructure that makes GenAI manageable at organisational scale. Without it, API keys proliferate, costs are invisible until they appear on an invoice, rate limit errors surface in production, and every team solves the same resilience and fallback problems independently. With it, all of those problems are solved once, in one place, and every application benefits.
The teams that invest in the gateway early spend less time fighting operational problems later. The teams that skip it spend more time than they expected managing the consequences of not having it.
Related Reading
- Building a Production RAG System — The downstream retrieval system that the gateway routes requests into and protects from overload.
- The True Total Cost of a GenAI Deployment — Cost control is one of the gateway’s core functions, and this post explains why that matters at production scale.
- Nematix Generative AI Services — How Nematix designs and implements LLM gateway infrastructure for organisations scaling GenAI across multiple teams.
Learn how Nematix’s Innovation Engineering services help businesses build production-ready AI systems.