
RAG Grounding: 7 Ways to Stop LLM Hallucinations in Production

A practitioner's guide to grounding retrieval-augmented generation systems. Covers fixing retrieval first, hybrid dense-plus-keyword search, cross-encoder reranking, contextual compression, refusal prompting, verified citations, Chain-of-Verification, confidence-threshold abstention, and measuring faithfulness with RAGAS.

Marcus Rivera

Jun 9, 2026

Retrieval-augmented generation was supposed to fix hallucinations. Feed the model real documents, the theory went, and it would stop making things up. Anyone who has shipped a RAG system to production knows the reality is messier: a model handed irrelevant context will hallucinate just as confidently as one handed nothing, and a model handed contradictory context will often pick the wrong side.

Grounding is not a feature you switch on. It is a discipline you engineer into every stage of the pipeline. Here are seven techniques — roughly in the order they pay off — for keeping a RAG system honest in production.

1. Fix retrieval before you touch the prompt

Most "hallucinations" in a RAG system are retrieval failures wearing a costume. If the right chunk never reaches the model, no amount of prompt engineering will save you. Garbage in, confident garbage out.

Start by measuring retrieval in isolation. For a sample of real queries, check whether the document containing the answer appears in your top-k results at all. If recall is poor, the generation layer is irrelevant — you are debugging the wrong half of the system.
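As a rough illustration of that isolation test, here is a minimal sketch of a top-k hit-rate check. The `retrieve` callable, the document ids, and the sample queries are all hypothetical stand-ins for your own retriever and evaluation set.

```python
from typing import Callable, Dict, List


def hit_rate_at_k(
    eval_set: Dict[str, str],
    retrieve: Callable[[str, int], List[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose known-answer document id is in the top-k results."""
    if not eval_set:
        return 0.0
    hits = sum(1 for query, gold_id in eval_set.items() if gold_id in retrieve(query, k))
    return hits / len(eval_set)


# Hypothetical stand-in retriever: a fixed ranking per query.
_FAKE_INDEX = {
    "how do I reset my password": ["doc-auth", "doc-billing", "doc-faq"],
    "what does error E-4042 mean": ["doc-faq", "doc-billing", "doc-network"],
}


def fake_retrieve(query: str, k: int) -> List[str]:
    return _FAKE_INDEX.get(query, [])[:k]


# Each query maps to the id of the document known to contain its answer.
eval_set = {
    "how do I reset my password": "doc-auth",
    "what does error E-4042 mean": "doc-errors",
}

print(hit_rate_at_k(eval_set, fake_retrieve, k=3))  # prints 0.5
```

A low number here points at the index, the chunking, or the retriever, not at the prompt.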

The usual culprit is naive chunking. Splitting documents on a fixed character count routinely severs a fact from the sentence that qualifies it. Prefer structure-aware chunking that respects headings, paragraphs, and table boundaries, and keep a little overlap between chunks so context isn't guillotined at the seam.
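One way to approximate structure-aware chunking is to split on paragraph boundaries and carry the last paragraph of each chunk into the next. The sketch below assumes paragraphs are separated by blank lines; the function name, the `max_chars` budget, and the one-paragraph overlap are illustrative choices, and headings and tables would need their own handling.

```python
from typing import List


def chunk_by_paragraph(
    text: str, max_chars: int = 500, overlap_paragraphs: int = 1
) -> List[str]:
    """Group whole paragraphs into chunks, repeating trailing paragraphs as overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            # Close the chunk at a paragraph boundary, never mid-sentence.
            chunks.append("\n\n".join(current))
            # Carry the tail of the closed chunk forward as overlap.
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "A" * 30 + "\n\n" + "B" * 30 + "\n\n" + "C" * 30
for chunk in chunk_by_paragraph(doc, max_chars=70):
    print(repr(chunk))
```

A single paragraph longer than `max_chars` is kept whole in this sketch rather than cut, which trades a larger chunk for an intact fact.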

2. Go hybrid: dense plus keyword

Dense vector search is excellent at semantic similarity and unreliable at exact matches. Ask it for error code E-4042 or a specific product SKU and it will happily return passages that are thematically close and literally wrong.

Hybrid search runs a dense (embedding) retriever and a sparse keyword retriever such as BM25 in parallel, then fuses the results. The keyword side anchors exact identifiers; the dense side catches paraphrases. For most enterprise corpora — full of part numbers, acronyms, and proper nouns — hybrid is not a luxury, it is the baseline.
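The article does not name a fusion method. One common choice is reciprocal rank fusion (RRF), which needs only the rank positions from each retriever, so dense and BM25 scores never have to be put on the same scale. The sketch below assumes each retriever returns an ordered list of chunk ids; the ids are hypothetical, and `k = 60` is the commonly used default constant.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a deterministic order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["c-paraphrase", "c-both", "c-theme"]   # embedding retriever (hypothetical)
sparse_hits = ["c-both", "c-exact", "c-paraphrase"]  # BM25 retriever (hypothetical)

print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
# ['c-both', 'c-paraphrase', 'c-exact', 'c-theme']
```

A chunk found by both retrievers rises to the top, while the exact-match chunk that only the keyword side found still outranks the thematically close one that only the dense side found.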

3. Rerank, then compress

Retrieval optimizes for recall: cast a wide net, get the answer somewhere in the top 20. Generation needs precision: the answer should be in the top 3, near the top of the prompt. Bridge the two with a cross-encoder reranker.

A reranker scores each retrieved chunk jointly with the query rather than comparing pre-computed vectors, which makes it far more accurate at ordering. The price is compute: every query-chunk pair needs its own forward pass, so the reranker runs only on the handful of candidates retrieval already surfaced.
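The shape of that step can be sketched as follows. The `score_pair` callable is where a real cross-encoder would go; `toy_overlap_score` is only a lexical stand-in so the example runs without a model, and both names are hypothetical.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and keep the top_n highest-scoring chunks."""
    scored = [(text, score_pair(query, text)) for text in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): share of query words found in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


candidates = [
    "Billing runs monthly",
    "Use the reset link to change your password",
    "Password rules require 12 characters",
]
for text, score in rerank("reset password link", candidates, toy_overlap_score, top_n=2):
    print(round(score, 2), text)
```

Because the scorer is passed in, the toy function can be swapped for a model-backed one without touching the ranking logic.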

Then apply contextual compression: strip retrieved passages down to the sentences that actually bear on the query before they reach the model. This does two things at once. It removes distractor text that invites the model to wander, and it leaves more of the context window for reasoning.
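A minimal sketch of sentence-level compression, assuming a simple word-overlap test decides which sentences bear on the query. Production systems often use an embedding or model-based filter instead; the function name and the `min_overlap` threshold are illustrative.

```python
import re


def compress_passage(query: str, passage: str, min_overlap: int = 1) -> str:
    """Keep only the sentences that share at least min_overlap words with the query."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    kept = [
        s for s in sentences
        if len(query_terms & set(re.findall(r"\w+", s.lower()))) >= min_overlap
    ]
    return " ".join(kept)


passage = (
    "Our company was founded in 1998. "
    "The warranty period is two years. "
    "Shipping is free."
)
print(compress_passage("warranty period", passage))
# The warranty period is two years.
```

Plain word overlap will also match on stop words such as "the", so a real filter would drop those or use a semantic score.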

4. Make the prompt refuse to guess

Even with perfect context, a model's default instinct is to be helpful — and "helpful" too often means "answer anyway." Counter that explicitly in the system prompt:

Answer only using the provided context. If the context does not contain the answer, reply: "I don't have enough information to answer that." Do not use prior knowledge. Cite the source for every claim.

Three instructions are doing work here. Restricting the model to the provided context, reinforced by the ban on prior knowledge, closes the door on training-data recall. The explicit abstention string gives the model a sanctioned way to say "I don't know" instead of inventing an answer. And the citation requirement, covered next, turns grounding into something you can audit.
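Wired into code, the prompt might look like the sketch below. The system prompt text is the one quoted above; the `[chunk_id]` tagging format, the function names, and the sample chunk are assumptions added for illustration. Keeping the abstention string in a constant lets downstream code detect a refusal reliably.

```python
from typing import Dict

ABSTAIN = "I don't have enough information to answer that."

SYSTEM_PROMPT = (
    "Answer only using the provided context. "
    f'If the context does not contain the answer, reply: "{ABSTAIN}" '
    "Do not use prior knowledge. Cite the source for every claim."
)


def build_user_message(question: str, chunks: Dict[str, str]) -> str:
    """Tag each chunk with its id so the model has something concrete to cite."""
    context = "\n".join(f"[{chunk_id}] {text}" for chunk_id, text in chunks.items())
    return f"Context:\n{context}\n\nQuestion: {question}"


def is_abstention(answer: str) -> bool:
    """True when the model used the sanctioned refusal string."""
    return ABSTAIN.lower() in answer.lower()


chunks = {"c1": "The standard warranty lasts two years from purchase."}
print(build_user_message("How long is the warranty?", chunks))
```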

5. Demand citations — and verify them

Ask the model to attach a source to every claim, ideally a chunk ID or document reference it can only know from the retrieved context. Citations are not just for the reader; they are a grounding signal you can check programmatically.

After generation, run a lightweight verification pass: for each cited claim, confirm the cited chunk actually supports it. A claim whose "source" doesn't contain the relevant text is a hallucination caught before it reaches the user. This attribution check is cheap relative to the cost of shipping a wrong answer with a fake footnote.
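A rough sketch of such a verification pass, assuming citations appear as `[chunk_id]` markers at the end of each sentence and using word overlap as a crude proxy for "supports". The marker format, the `min_support` threshold, and the overlap test are all assumptions; an entailment model or verifier LLM would be a stricter check.

```python
import re
from typing import Dict, List, Tuple

CITATION = re.compile(r"\[([\w-]+)\]")


def _tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def verify_citations(
    answer: str, chunks: Dict[str, str], min_support: float = 0.6
) -> List[Tuple[str, str, bool]]:
    """Return (claim, cited_id, supported) for each cited sentence in the answer."""
    results: List[Tuple[str, str, bool]] = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        cited_ids = CITATION.findall(sentence)
        claim = CITATION.sub("", sentence).strip()
        claim_tokens = _tokens(claim)
        if not claim_tokens:
            continue
        if not cited_ids:
            # An uncited claim cannot be verified, so flag it.
            results.append((claim, "", False))
            continue
        for chunk_id in cited_ids:
            # An unknown chunk id yields an empty token set and fails the check.
            chunk_tokens = _tokens(chunks.get(chunk_id, ""))
            support = len(claim_tokens & chunk_tokens) / len(claim_tokens)
            results.append((claim, chunk_id, support >= min_support))
    return results


chunks = {
    "c1": "The standard warranty lasts two years from purchase.",
    "c2": "Shipping takes five business days.",
}
answer = "The warranty lasts two years [c1]. Returns are free for 90 days [c2]."
for claim, chunk_id, ok in verify_citations(answer, chunks):
    print(chunk_id, "supported" if ok else "UNSUPPORTED")
```

Here the second claim cites a real chunk that says nothing about returns, which is exactly the fake-footnote case the check is meant to catch.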

6. Add a verification loop for high-stakes answers

When correctness matters more than latency, borrow Chain-of-Verification (CoVe), introduced by Meta AI researchers in 2023 (Dhuliawala et al., arXiv:2309.11495). The method has four steps: the model drafts an initial answer, plans a set of verification questions to fact-check that draft, answers those questions independently so the answers aren't biased by the original response, and then produces a final answer revised in light of what it found.

The independence step is the clever bit. By answering each verification question in isolation, the model is less likely to simply rationalize its first attempt. The paper reported reduced hallucinations across list-based questions, closed-book QA, and long-form generation. The trade-off is obvious — several extra model calls per answer — so reserve CoVe for the queries where a wrong answer is expensive.
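The four steps can be sketched as a small orchestration function. This is an illustrative adaptation to a RAG setting, not the paper's reference implementation: the prompts are invented for the example, `llm` is any hypothetical callable that maps a prompt string to a completion, and the stub exists only so the sketch runs without a model.

```python
from typing import Callable, List

calls: List[str] = []  # records prompts so the call pattern can be inspected


def chain_of_verification(question: str, context: str, llm: Callable[[str], str]) -> str:
    # Step 1: draft an initial answer.
    draft = llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    # Step 2: plan verification questions that would fact-check the draft.
    plan = llm(
        "List one verification question per line that would fact-check this answer.\n"
        f"Question: {question}\nAnswer: {draft}"
    )
    checks = [line.strip() for line in plan.splitlines() if line.strip()]
    # Step 3: answer each check independently. The draft is deliberately
    # left out of these prompts so it cannot bias the answers.
    findings = [(q, llm(f"Context:\n{context}\n\nQuestion: {q}\nAnswer:")) for q in checks]
    # Step 4: revise the draft in light of the findings.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in findings)
    return llm(
        f"Original question: {question}\nDraft answer: {draft}\n"
        f"Verification results:\n{evidence}\n"
        "Write a final answer consistent with the verification results."
    )


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call, returning canned text."""
    calls.append(prompt)
    if prompt.startswith("List one verification"):
        return "When was the product released?\nWho released it?"
    if prompt.startswith("Original question"):
        return "FINAL"
    return "DRAFT"


print(chain_of_verification("Tell me about the product.", "ctx", stub_llm))
print(len(calls), "model calls")
```

With two verification questions the sketch makes five model calls for one answer, which is the latency and cost trade-off described above.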

A lighter-weight cousin is the separate verifier model: a second, often smaller, model whose only job is to classify each statement as supported, partially supported, or unsupported against the retrieved context. It doesn't write prose; it audits.

7. Calibrate confidence and let the system abstain

The most underrated grounding technique is permission to stay silent. Set a retrieval confidence threshold: if the top reranked chunk scores below a cutoff, don't generate an answer at all — route to a fallback, a human, or an honest "I couldn't find this."
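A minimal sketch of that gate, assuming the reranker returns (chunk, score) pairs. The 0.5 cutoff is a placeholder rather than a recommendation, since a usable threshold has to be calibrated against your own evaluation set, and `generate` is a hypothetical callable wrapping the model call.

```python
from typing import Callable, List, Tuple

FALLBACK = "I couldn't find this."


def answer_or_abstain(
    question: str,
    reranked: List[Tuple[str, float]],
    generate: Callable[[str, str], str],
    threshold: float = 0.5,
) -> str:
    """Generate only when the best reranked chunk clears the confidence cutoff."""
    if not reranked or max(score for _, score in reranked) < threshold:
        # Skip generation entirely; a router could hand off to a human here.
        return FALLBACK
    context = "\n".join(text for text, _ in reranked)
    return generate(question, context)


def fake_generate(question: str, context: str) -> str:
    """Hypothetical stand-in for the model call."""
    return f"answer from {len(context.splitlines())} chunks"


print(answer_or_abstain("q", [("chunk a", 0.2)], fake_generate))
print(answer_or_abstain("q", [("chunk a", 0.9), ("chunk b", 0.4)], fake_generate))
```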

A system that answers 85% of queries correctly and abstains on the rest is far more trustworthy than one that answers 100% and is wrong 15% of the time. Users forgive "I don't know." They do not forgive being confidently misled.

Measure it, or you're guessing

You cannot improve what you don't measure. Build an evaluation set of real queries with known answers and track grounding-specific metrics rather than vague "quality." Open frameworks like RAGAS score dimensions that matter here: faithfulness (does the answer stay within the retrieved context?), answer relevance, and context precision and recall. Run them on every pipeline change. A reranker that improves faithfulness but tanks recall is a regression, and you will only see that trade-off if you are watching both numbers.
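Watching both numbers can be automated with a simple regression gate in CI. The sketch below assumes the metric scores have already been computed by whatever evaluation framework you use; the metric names, the sample values, and the tolerance are hypothetical.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float], candidate: Dict[str, float], tolerance: float = 0.01
) -> List[str]:
    """Names of metrics that dropped by more than tolerance versus the baseline."""
    return sorted(
        name for name, old in baseline.items()
        if candidate.get(name, 0.0) < old - tolerance
    )


# Hypothetical scores before and after swapping in a new reranker.
baseline = {"faithfulness": 0.82, "context_recall": 0.90}
candidate = {"faithfulness": 0.88, "context_recall": 0.71}

print(find_regressions(baseline, candidate))  # ['context_recall']
```

A change that lifts one metric still fails the gate if another falls, and a metric missing from the candidate run is treated as a regression.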

The Bottom Line

Grounding a RAG system is mostly unglamorous engineering: better chunking, hybrid retrieval, a reranker, a compression step, a prompt that knows how to say no, citations you actually verify, and the discipline to measure faithfulness on every change. None of it is exotic, and that is the point. Hallucinations in production are rarely a mysterious failure of the model — they are a retrieval gap, a missing abstention path, or an unverified citation. Close those gaps in order, instrument the result, and the model stops guessing because you have finally stopped letting it.
