“Your loan is approved under Section 42 of the Banking Act 2025.”
One problem: there is no Section 42.
That single hallucination triggered a regulatory investigation and a six-figure penalty. In high-stakes domains such as finance, healthcare, legal, and compliance, zero-error tolerance is the rule: your assistant must ground every answer in real, verifiable evidence.
1 – Why high-stakes domains punish guesswork
- Regulatory fines, licence suspensions, lawsuits
- Patient harm or misdiagnosis
- Massive reputational damage and loss of trust
When the error budget is effectively 0%, traditional “chat style” LLMs are not enough.
2 – The three-layer defense against hallucination
2.1 Retrieval-Augmented Generation (RAG)
- What it does – Pulls fresh text from authoritative sources (regulations, peer-reviewed papers, SOPs) before answering; see the sketch after this list.
- Win – Grounds every claim in evidence; supports “latest version” answers.
- Risk – Garbage in, garbage out. A bad retriever seeds bad context.
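To make the layer concrete, here is a minimal sketch of the retrieval-then-generate step in Python. The `vector_store.search` and `llm_complete` calls are illustrative placeholders rather than a specific library's API; swap in your own retriever and model client.

```python
# Minimal RAG sketch. `vector_store` and `llm_complete` are illustrative
# placeholders, not a particular vendor's API.
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    text: str
    source: str  # e.g. regulation clause, SOP section


def answer_with_rag(question: str, vector_store, llm_complete, k: int = 5) -> str:
    # 1. Retrieve the k most relevant passages from authoritative sources.
    passages: list[Document] = vector_store.search(question, top_k=k)

    # 2. Build a grounded prompt: the model may only use the supplied evidence.
    context = "\n\n".join(f"[{p.doc_id} | {p.source}]\n{p.text}" for p in passages)
    prompt = (
        "Answer the question using ONLY the sources below. "
        "Cite the source id for every claim. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

    # 3. Generate the draft answer; the guardrail layer checks it afterwards.
    return llm_complete(prompt)
```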
2.2 Guardrail filter
- What it does – Post-processes the draft answer (a rule-based sketch follows this list). Blocks responses that:
  - lack citations
  - creep into forbidden advice (medical, legal)
  - include blanket “always/never” claims
- Win – Catches risky output before it reaches the user.
- Risk – Over-filtering if rules are too broad or vague.
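A guardrail can start as plain rules before any ML is involved. The sketch below assumes citations appear in square brackets (as in the RAG sketch above); the regexes and the deny-list are illustrative examples, not a production rule set.

```python
import re

# Illustrative guardrail filter: patterns and the deny-list are examples only.
CITATION_PATTERN = re.compile(r"\[[^\]]+\]")  # e.g. "[REG-42 | Banking Act]"
ABSOLUTE_CLAIMS = re.compile(r"\b(always|never|guaranteed)\b", re.IGNORECASE)
FORBIDDEN_TOPICS = ("diagnose", "prescribe", "legal advice")  # hypothetical deny-list


def guardrail_check(draft_answer: str) -> tuple[bool, list[str]]:
    """Return (allowed, reasons). Block the draft if any rule fires."""
    reasons = []

    if not CITATION_PATTERN.search(draft_answer):
        reasons.append("no citation found")

    if ABSOLUTE_CLAIMS.search(draft_answer):
        reasons.append("blanket always/never claim")

    lowered = draft_answer.lower()
    if any(topic in lowered for topic in FORBIDDEN_TOPICS):
        reasons.append("touches a forbidden advice category")

    return (len(reasons) == 0, reasons)


# Usage: release the answer only if it passes; otherwise fall back to a
# safe refusal or route to human review.
allowed, reasons = guardrail_check("Rates may change; see [REG-42 | Banking Act].")
```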
2.3 Question sanitizer
- What it does – Rewrites the user prompt, removing ambiguity and hidden assumptions so retrieval hits the right documents (sketched after the example below).
- Win – Sharper queries ⇒ cleaner answers.
- Risk – Requires strong NLU to keep the chat natural.
Raw prompt
> “Is this drug safe for kids?”
Sanitized prompt
> “According to current Therapeutic Goods Administration (Australia) guidelines, what is the approved dosage and contraindication list for Drug X in children aged 6–12 years?”
✅ Good example – sanitization adds the age range, the official source, and the specific drug name.
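One simple way to implement the sanitizer is to have the model itself rewrite the question against a fixed template before retrieval runs. The sketch below reuses the placeholder `llm_complete` callable from the RAG sketch; the template wording is an assumption, not a prescribed prompt.

```python
# Illustrative question sanitizer. `llm_complete` is the same placeholder
# callable used in the RAG sketch; the rewrite template is an assumption.
SANITIZER_TEMPLATE = (
    "Rewrite the user's question so it is unambiguous for document retrieval.\n"
    "- Name the authoritative source or jurisdiction to consult.\n"
    "- Make implicit entities explicit (drug name, age range, product version).\n"
    "- Do not answer the question; only rewrite it.\n\n"
    "User question: {question}\n"
    "Rewritten question:"
)


def sanitize_question(question: str, llm_complete) -> str:
    # The rewritten query feeds the retriever; the original wording is kept
    # for the final response so the conversation still feels natural.
    return llm_complete(SANITIZER_TEMPLATE.format(question=question)).strip()
```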
Rule of thumb: Use all three layers. One patch isn’t enough.
3 – Reference architecture
- Vector store & embeddings – Pick models that benchmark well on MTEB; keep the DB pluggable (FAISS, Pinecone, Azure Cognitive Search).
- Retriever tuning – Measure recall@k, MRR, and NDCG (see the evaluation sketch after this list); test different chunk sizes and hybrid search.
- Foundation model & versioning – Record the model hash in every call; monitor LiveBench for regressions.
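For the retriever-tuning step, the core metrics are easy to compute yourself. Below is an illustrative evaluation loop for recall@k and MRR over a small labelled set of (query, relevant document ids) pairs; the `retrieve` callable stands in for whichever vector store client you plug in.

```python
# Illustrative retriever evaluation: recall@k and MRR over a labelled set.
# Not tied to a particular vector DB; `retrieve` returns a ranked list of doc ids.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retriever(eval_set, retrieve, k: int = 5) -> dict:
    """eval_set: iterable of (query, set_of_relevant_doc_ids) pairs."""
    recalls, rrs = [], []
    for query, relevant in eval_set:
        ranked = retrieve(query)
        recalls.append(recall_at_k(ranked, relevant, k))
        rrs.append(reciprocal_rank(ranked, relevant))
    return {
        f"recall@{k}": sum(recalls) / len(recalls),
        "mrr": sum(rrs) / len(rrs),
    }
```

Run this on a held-out query set every time you change chunk size, embedding model, or hybrid-search weights, so retrieval regressions surface before they seed hallucinations downstream.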