GenAI in Financial Services: Use Cases That Work
Not every GenAI use case survives a regulated financial services environment. Here are the ones that do, and what makes each one production-viable.
By mid-2024, every bank and insurer of any scale in Southeast Asia had a GenAI working group. Most were running the same four pilots: a customer service chatbot, a document extraction tool, some form of analyst assistance, and a regulatory summarisation prototype. The working groups looked similar, the vendor shortlists overlapped, and the slide decks describing “the transformative potential of AI” were nearly interchangeable.
What varied dramatically was the outcome. Some of these pilots made it to production. Most did not. The gap was not model quality — the underlying technology was broadly comparable across institutions. The gap was whether the use case was designed for the specific constraints of a regulated financial services environment: structured inputs, constrained outputs, human review on anything consequential, and an audit trail that satisfies the examiners.
Here is what we have observed actually reaching production across the institutions we work with in Malaysia and broader Southeast Asia — and why.
Use Case 1: Document Processing for KYC and Loan Applications
Document processing has the highest production success rate of any GenAI use case in financial services, and the reason is structural: the inputs are constrained, the expected outputs are constrained, and humans review every extraction before any consequential action is taken.
The workflow that works is a four-stage pipeline. First, document ingestion and OCR — Azure Document Intelligence, AWS Textract, or a comparable service — converts physical or scanned documents into machine-readable text. Second, an LLM performs structured extraction: pulling named fields (applicant name, IC number, employer, monthly income, property address) from the extracted text and returning them as a JSON object with a defined schema. Third, the extracted data passes through a validation layer that checks field formats, cross-field consistency, and flags anomalies. Fourth — and this is the step that determines whether the system is production-viable — a human reviewer sees the original document alongside the extraction and approves, corrects, or rejects.
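Stages two and three of that pipeline can be sketched as follows. This is a minimal illustration, not a reference implementation: the field names, the NRIC format check, and the income plausibility range are assumptions chosen for the example, and the actual LLM call is left out of scope.

```python
import json
import re
from dataclasses import dataclass, field

# Stage 2 output contract: the LLM must return these fields as JSON.
EXTRACTION_SCHEMA = {
    "applicant_name": str,
    "ic_number": str,       # Malaysian NRIC in 6-2-4 digit layout, e.g. 880101-14-5678
    "employer": str,
    "monthly_income": float,
    "property_address": str,
}

IC_PATTERN = re.compile(r"^\d{6}-\d{2}-\d{4}$")

@dataclass
class ValidationResult:
    ok: bool
    anomalies: list = field(default_factory=list)

def validate_extraction(record: dict) -> ValidationResult:
    """Stage 3: deterministic checks on the extracted JSON before human review."""
    anomalies = []
    # Schema check: every expected field present with the right type.
    for name, typ in EXTRACTION_SCHEMA.items():
        if name not in record:
            anomalies.append(f"missing field: {name}")
        elif not isinstance(record[name], typ):
            anomalies.append(f"wrong type for {name}: {type(record[name]).__name__}")
    # Format check: NRIC must match the standard 6-2-4 digit layout.
    ic = record.get("ic_number", "")
    if isinstance(ic, str) and not IC_PATTERN.match(ic):
        anomalies.append(f"ic_number format invalid: {ic!r}")
    # Plausibility check: out-of-range income is flagged for human attention.
    income = record.get("monthly_income")
    if isinstance(income, (int, float)) and not (0 < income < 1_000_000):
        anomalies.append(f"monthly_income out of plausible range: {income}")
    return ValidationResult(ok=not anomalies, anomalies=anomalies)

# Any record with anomalies goes to the human review queue, never auto-approved.
record = json.loads(
    '{"applicant_name": "Aisyah Binti Rahman", "ic_number": "880101-14-5678", '
    '"employer": "Contoh Sdn Bhd", "monthly_income": 7500.0, '
    '"property_address": "12 Jalan Contoh, KL"}'
)
result = validate_extraction(record)
```

The important design property is that the validator is deterministic code, not another LLM call: it either passes a record to stage four cleanly or attaches a specific list of anomalies for the reviewer to resolve.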
What makes this work is the last step. The LLM is not making decisions; it is reducing data entry labour by 60–80% on the extraction task, while a trained human retains decision authority. Institutions that skip the human review step to maximise automation savings consistently discover accuracy problems that create remediation costs in excess of the savings. The human review layer is not a concession to caution — it is what makes the system legally defensible under Malaysia’s Anti-Money Laundering, Anti-Terrorism Financing and Proceeds of Unlawful Activities Act 2001 (AMLA), for which Bank Negara Malaysia is the competent authority.
The failure mode is attempting to use this pipeline for documents it was not designed for without revalidating accuracy first: handwritten forms, heavily degraded scans, documents in non-standard layouts, or bilingual documents that switch language mid-page. Each of these requires specific handling, and deploying the pipeline without testing on representative samples from the actual document population is how accuracy surprises appear in production.
Use Case 2: Analyst Assistance — Fraud Narrative Generation and Credit Memo Drafting
The second most reliable path to production is LLM-assisted drafting for analyst workflows. The pattern is consistent: structured data inputs enter the system, the LLM generates a draft narrative in a defined template, and a qualified analyst reviews, edits, and approves before the document is used.
In fraud investigation, this means: transaction data, account history, alert details, and prior investigation notes feed into a prompt; the LLM generates a structured investigation narrative covering the flagged behaviour, context, and preliminary assessment. The fraud analyst uses this as a starting point rather than a blank page. Documented deployments of this pattern report a 40–60% reduction in time spent on narrative drafting per case. The analyst’s time shifts from writing to reviewing and editing — a higher-value activity that also produces better documentation, because analysts are faster at correcting errors than generating text from scratch.
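The drafting step above can be sketched as a prompt template plus a draft state machine. The template wording, section names, and `generate` callable are illustrative assumptions; the point is that structured inputs are rendered into a fixed prompt and the output always lands in a pending-review state.

```python
from textwrap import dedent

def build_fraud_prompt(case: dict) -> str:
    """Render structured case data into a fixed narrative-drafting prompt."""
    return dedent(f"""\
        You are drafting an internal fraud investigation narrative.
        Use ONLY the data below; if a section lacks data, write "No data provided".

        Alert: {case['alert_details']}
        Transactions under review: {case['transactions']}
        Account history summary: {case['account_history']}
        Prior investigation notes: {case['prior_notes']}

        Produce three sections: Flagged Behaviour, Context, Preliminary Assessment.
        """)

def draft_narrative(case: dict, generate) -> dict:
    """`generate` stands in for the model call; inject whatever client is used."""
    text = generate(build_fraud_prompt(case))
    # The draft never leaves this state without an analyst sign-off.
    return {"case_id": case["case_id"], "draft": text,
            "status": "pending_analyst_review"}
```

Constraining the prompt to "use ONLY the data below" does not eliminate hallucination, which is why the status field exists: nothing downstream consumes a draft that an analyst has not approved.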
In credit memo drafting, the same pattern applies: financial ratios, credit bureau outputs, business description, and analyst notes generate a first-draft credit memo. The credit officer reviews and edits. The output quality is consistent; the expertise required to evaluate it is not replaced.
What works is generation from structured inputs with a defined output template. What fails is expecting the LLM to surface insights the data does not contain. An LLM will generate a plausible-sounding credit narrative even when the underlying data is ambiguous or insufficient. Analysts who treat the LLM output as a starting point — to be verified and corrected — get value. Analysts who treat it as an authoritative summary get risk.
Use Case 3: Regulatory Reporting Summarisation
Regulatory circulars from Bank Negara Malaysia, the Securities Commission, and Labuan FSA arrive frequently, run long, and require compliance teams to identify specific action items, assess applicability to the institution’s business, and draft internal memos communicating the implications.
This is a strong GenAI use case because the output goes to internal teams who have the expertise to verify it. The workflow: a new circular arrives, the LLM produces a structured summary (key changes, affected product lines, required actions, implementation timeline), a compliance officer reviews and validates, and a polished internal memo is distributed. Teams using this pattern report reducing the time from circular receipt to internal communication by 50–70%.
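The structured-summary step works best when the model's output is treated as a contract that is enforced before any human sees it. A minimal sketch of that contract check, with the section names taken from the workflow above and the status field as an assumed convention:

```python
import json

# The model is instructed to return exactly these sections as a JSON object;
# a malformed response is rejected rather than passed to the reviewer.
REQUIRED_SECTIONS = ["key_changes", "affected_product_lines",
                     "required_actions", "implementation_timeline"]

def parse_circular_summary(raw: str) -> dict:
    """Validate the model's summary of a circular before compliance review."""
    summary = json.loads(raw)  # raises ValueError on non-JSON output
    missing = [s for s in REQUIRED_SECTIONS if s not in summary]
    if missing:
        raise ValueError(f"summary missing sections: {missing}")
    # Valid structure still requires a compliance officer's sign-off.
    summary["status"] = "pending_compliance_review"
    return summary
```

Rejecting malformed output at this boundary keeps failure cheap: a retry or a manual summary, rather than a half-structured memo circulating internally.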
The same pattern applies to Basel III quarterly narrative sections, JKDM statistical returns commentary, and internal audit responses. In each case, the LLM generates a draft from structured data and prior narrative templates; a qualified human reviews and takes ownership.
The failure mode is using this as a substitute for legal review. Regulatory interpretation — deciding how a new requirement applies to a specific product structure or business arrangement — requires qualified legal and compliance judgment. An LLM summary can tell you what the regulation says; it cannot reliably tell you what it means for your specific situation. The institutions that run into trouble are the ones where the LLM summary becomes the de facto legal interpretation because no one with the appropriate qualification saw it before it was acted upon.
Use Case 4: Customer Service Deflection
Customer service chatbots were the first GenAI use case that most financial institutions deployed, and they remain the most publicly visible. The production success rate is reasonable when the scope is deliberately narrow — and poor when it is not.
What works: FAQ-type queries where the answer can be retrieved from a structured knowledge base and the range of valid responses is bounded. Account balance enquiries directed to secure channels. Product eligibility screening based on defined criteria (minimum income, age, residency). Branch hours and location lookup. Appointment booking through an integrated calendar system. For these query types, a well-scoped chatbot handles the 80% of volume that requires no contextual judgment, freeing human agents for the 20% that does.
What fails: expecting the chatbot to handle edge cases, complaints, disputes, or any situation that requires contextual judgment it was not explicitly designed for. The institutions that have had chatbot incidents — a customer receiving incorrect information about loan eligibility, a complaint being handled in a way that created a liability — consistently deployed systems where the scope was defined too broadly and the escalation triggers were defined too narrowly. Scope discipline is not a technology question; it is a product design question that must be answered before any line of code is written.
The practical recommendation: define the chatbot’s scope as the set of query types you are confident it can handle with 95%+ accuracy, then build escalation to a human agent as the default for everything else. The boundary between “bot handles” and “human handles” should be explicit, tested, and conservative.
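That boundary is small enough to write down explicitly. A sketch of the routing rule, where the intent names and the 0.95 threshold are illustrative placeholders for whatever the institution's own accuracy testing supports:

```python
# The bot only answers intents on an explicit allowlist; everything else,
# including low-confidence classifications of in-scope intents, escalates
# to a human agent by default.
IN_SCOPE_INTENTS = {"branch_hours", "branch_location", "product_eligibility",
                    "appointment_booking", "balance_channel_redirect"}
CONFIDENCE_THRESHOLD = 0.95

def route(intent: str, confidence: float) -> str:
    """Return 'bot' only for allowlisted intents classified with high confidence."""
    if intent in IN_SCOPE_INTENTS and confidence >= CONFIDENCE_THRESHOLD:
        return "bot"
    # Escalation is the default path, not the exception path.
    return "human_agent"
```

The design choice worth noting is the direction of the default: an unrecognised or ambiguous query costs a human agent's time, never an incorrect answer to a customer.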
Use Case 5: Internal Knowledge Search via RAG
Of all the GenAI deployments we have seen reach production in financial services, internal knowledge search has the best ratio of value delivered to implementation risk. Staff query internal policy documents, product manuals, underwriting guidelines, compliance policy libraries, and operational procedures in plain language. The system retrieves relevant passages and generates a grounded response. The user can see the source documents.
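The shape of that workflow can be shown with a toy example. Real deployments use a vector store and an embedding model; plain keyword overlap stands in here, and the document IDs and texts are invented. What the sketch is meant to show is the output contract: every answer carries the source passages it was grounded in.

```python
# Toy corpus standing in for the institution's policy library.
DOCS = [
    {"id": "UW-POLICY-4.2",
     "text": "Minimum gross monthly income for product X is RM3,000."},
    {"id": "OPS-PROC-1.7",
     "text": "Branch escalation requires supervisor approval within 24 hours."},
]

def retrieve(query: str, k: int = 1):
    """Rank documents by naive word overlap with the query (embedding stand-in)."""
    words = set(query.lower().split())
    scored = [(len(words & set(d["text"].lower().split())), d) for d in DOCS]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:k] if score > 0]

def answer(query: str) -> dict:
    passages = retrieve(query)
    # Sources are returned alongside the context so the staff member can open
    # the cited clause and verify it, rather than trusting a paraphrase.
    return {"query": query,
            "sources": [p["id"] for p in passages],
            "context": [p["text"] for p in passages]}
```

Surfacing the source IDs is what makes the risk profile described below work: the qualified reader verifies against the cited clause, not against the model's fluency.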
The risk profile is low because the output is consumed by a qualified staff member who makes the actual decision. If the retrieval returns a wrong passage or the LLM generates a slightly inaccurate summary, the staff member — who knows the domain — is likely to notice. The failure is a minor inefficiency, not a consequential error.
The ROI is consistently positive. Financial institutions have large volumes of internal documentation that is practically inaccessible because the search tools are inadequate. Staff who would otherwise spend 20–30 minutes finding the right policy clause or product specification can retrieve it in seconds. The productivity gain is real and measurable.
Internal knowledge search is frequently the right first GenAI deployment for a financial institution — lower risk than any customer-facing use case, high adoption because the value is immediately obvious to staff, and a practical way to build internal capability in deploying and operating RAG systems before applying that capability to higher-stakes contexts.
Use Cases to Avoid in Regulated Financial Services (For Now)
Not every GenAI use case is ready for a regulated financial services environment, and being clear about what not to build is as important as identifying what works.
Autonomous credit decisioning — using an LLM to make or approve credit decisions without a qualified human in the review loop — fails on regulatory grounds in Malaysia and across most of Southeast Asia before it fails on accuracy grounds. BNM’s FTIS and the fair lending principles embedded in Malaysian financial regulation require explainable, auditable decisions with appropriate human accountability. An LLM that decides whether to approve a loan application is not a production-viable system in 2024; it is a regulatory incident waiting to be discovered.
Real-time trading signals generated by LLMs face a different set of problems: the latency requirements of trading systems are incompatible with LLM inference times, and the hallucination rate of LLMs is categorically unacceptable in a context where a single wrong output costs real money in milliseconds. Purpose-built quantitative models remain the appropriate tool for this use case.
Customer-facing financial advice without human oversight — personalised investment recommendations, insurance coverage advice, retirement planning — requires professional qualification and regulatory authorisation in Malaysia. An LLM that gives investment advice is not providing advice; it is generating plausible-sounding text that may or may not reflect the customer’s actual situation. The liability for acting on that text lands on the institution, not the model.
The Pattern That Predicts Success
Across the use cases that reach production, the pattern is consistent. The inputs are structured or structurable. The output has a defined format and a bounded range of correct answers. A qualified human reviews and approves before any consequential action is taken. The audit trail records every input, output, review decision, and timestamp.
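The audit-trail requirement reduces to a small, append-only record per review event. A sketch under assumed field names, with content hashes added so later tampering with the stored payloads is detectable:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(use_case: str, model_input: str, model_output: str,
                 reviewer: str, decision: str) -> dict:
    """One immutable log entry per human review of a model output."""
    assert decision in {"approved", "corrected", "rejected"}
    return {
        "use_case": use_case,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Hashes let an examiner confirm the stored payloads were not altered.
        "input_sha256": hashlib.sha256(model_input.encode()).hexdigest(),
        "output_sha256": hashlib.sha256(model_output.encode()).hexdigest(),
        "model_input": model_input,
        "model_output": model_output,
        "reviewer": reviewer,
        "decision": decision,
    }

# Append-only log: one JSON object per line, written at review time.
line = json.dumps(audit_record("kyc_extraction", "doc-123 extracted text",
                               '{"applicant_name": "..."}',
                               "analyst_07", "approved"))
```

Whatever the storage backend, the properties that matter to an examiner are the ones encoded here: every input, every output, a named reviewer, an explicit decision, and a UTC timestamp.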
This is not a limitation of ambition. It is a recognition of where LLMs are genuinely reliable and where they are not. The institutions that have built genuine GenAI capability in Southeast Asian financial services are the ones that started with narrow, high-reliability use cases, built the infrastructure and operational patterns to run them well, and then expanded. The institutions still stuck at the pilot stage are frequently the ones that started by trying to solve the most ambitious use case first.
Start with document processing. Start with internal knowledge search. Build the human-in-the-loop workflows and the audit infrastructure. The more complex use cases become tractable once that foundation exists.
Related Reading
- Automating KYC Document Processing with GenAI — A deep dive into one of the most production-ready use cases covered here, with full pipeline architecture and compliance guidance.
- AI in Financial Services: From Pilot to Production — Learn what separates the GenAI use cases that survive a regulated environment from the ones that stall at pilot stage.
- Nematix Generative AI Services — See how Nematix helps financial institutions move GenAI use cases from proof of concept into production.
See how Nematix drives end-to-end digital banking transformation for financial institutions across Southeast Asia.