AEGIS-RAG

See Project with Full Code on GitHub

Reliable Retrieval-Augmented Generation for policy documents, with hybrid retrieval, cross-encoder reranking, and safe abstention when evidence is weak.

🚀 Try the live demo · 💻 View on GitHub

AEGIS-RAG demo: a confident grounded answer, an LLM-refusal abstention on out-of-scope content, and a score-gated abstention triggered by raising the rerank threshold.

The Objective

AEGIS-RAG is a production-grade Retrieval-Augmented Generation system for reliable question answering over policy documents. It prioritizes retrieval quality, ranking precision, and safe abstention over flashy generation, and is designed for high-stakes domains where a confidently wrong answer is worse than no answer at all.

The Architecture & Methodology

The system follows a three-stage pipeline with strict separation of concerns:

  1. Hybrid Retrieval: dense embeddings (E5-large-v2) plus sparse BM25, fused via weighted Reciprocal Rank Fusion (RRF).
  2. Cross-Encoder Reranking: ms-marco-MiniLM-L-6-v2 reorders the top candidates by query-passage relevance.
  3. Grounded Generation: llama3.1:8b with a strict system prompt that forces a canonical fallback string when the retrieved context doesn’t support an answer.

This separation enables independent optimization of each stage and makes failure modes traceable.
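The fusion step in stage 1 can be sketched as follows. This is a minimal illustration of weighted Reciprocal Rank Fusion, not the project's actual code: the function name `weighted_rrf`, the weights (0.6 dense, 0.4 BM25), the chunk ids, and the smoothing constant `k=60` are all assumptions chosen for the example.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists of chunk ids with weighted Reciprocal Rank Fusion.

    rankings: dict mapping retriever name -> list of chunk ids, best first
    weights:  dict mapping retriever name -> contribution weight
    k:        smoothing constant (60 is a common choice; assumed here)
    """
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        w = weights.get(name, 1.0)
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            # Each retriever contributes w / (k + rank) for every chunk it returns.
            scores[chunk_id] += w / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical rankings from the two retrievers.
rankings = {
    "dense": ["c3", "c1", "c7"],
    "bm25": ["c1", "c9", "c3"],
}
fused = weighted_rrf(rankings, weights={"dense": 0.6, "bm25": 0.4})
fused_ids = [chunk_id for chunk_id, _ in fused]
```

Because RRF works on ranks rather than raw scores, the dense and BM25 scores never need to be normalized against each other. Here `c1` ends up first because both retrievers rank it highly, even though neither score scale is comparable.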

The Differentiator: Two-Layer Abstention

Most RAG systems either generate confident hallucinations or refuse based on a single retrieval threshold. AEGIS uses two complementary safety layers:

  • Pre-LLM score gating: abstains when the top rerank score (or fused RRF score) falls below a configured threshold. Cheap, runs before any generation.
  • Post-retrieval LLM grounding: the system prompt instructs the model to emit a canonical fallback string when the retrieved chunks don’t actually answer the question. The application detects this and surfaces it as an abstention.

The live demo exposes both layers via a slider, so you can watch them fire on different question types: out-of-scope queries, ambiguous queries, and queries where retrieval succeeds but evidence is weak.
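The two layers can be sketched roughly as below. This is an illustrative outline under stated assumptions, not the repository's implementation: the `FALLBACK` text, the default threshold of 0.0, the `Answer` type, and the `generate` callable (standing in for the LLM call) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical canonical fallback string; the real one lives in the system prompt.
FALLBACK = "I cannot answer this from the provided documents."

@dataclass
class Answer:
    text: str
    abstained: bool
    reason: str  # "score_gate", "llm_refusal", or "answered"

def answer_with_abstention(
    question: str,
    reranked: List[Tuple[str, float]],    # (chunk text, rerank score), best first
    generate: Callable[[str, str], str],  # placeholder for the LLM call
    threshold: float = 0.0,               # placeholder value, tuned in practice
) -> Answer:
    # Layer 1: score gate. Runs before generation, so no LLM call is spent.
    if not reranked or reranked[0][1] < threshold:
        return Answer(FALLBACK, True, "score_gate")

    # Layer 2: let the model answer, then detect the canonical fallback string.
    context = "\n\n".join(chunk for chunk, _ in reranked)
    reply = generate(question, context).strip()
    if FALLBACK.lower() in reply.lower():
        return Answer(FALLBACK, True, "llm_refusal")
    return Answer(reply, False, "answered")

# Stub generators so the sketch runs without a model.
gated = answer_with_abstention("q", [("chunk", -3.0)], lambda q, c: "unused")
refused = answer_with_abstention("q", [("chunk", 2.5)], lambda q, c: FALLBACK)
answered = answer_with_abstention("q", [("chunk", 2.5)], lambda q, c: "30 days.")
```

Returning the `reason` alongside the answer keeps the two abstention paths distinguishable, which is what lets a UI show which layer fired.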

Key Performance Metrics

Metric            | Hybrid  | Hybrid + Rerank
Recall@5          | ~0.82   | 0.85
nDCG@5            | ~0.71   | 0.80
Context Precision | -       | 0.83
Latency           | ~95 ms  | ~720 ms

Reranking lifts nDCG@5 by about 9 points (~0.71 to 0.80) at roughly a 7.6× latency cost (~95 ms to ~720 ms). That is an acceptable trade for policy and compliance domains where ranking precision matters more than throughput.
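For readers unfamiliar with the retrieval metrics, here is a small sketch of how Recall@k and nDCG@k are commonly computed. It assumes binary relevance labels and unique chunk ids; the source does not state how the reported numbers were produced, so treat the function names and the example data as illustrative only.

```python
import math

def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant ids that appear in the top k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def ndcg_at_k(retrieved, relevant, k=5):
    """nDCG@k with binary gains: each relevant hit at 0-based position i adds 1 / log2(i + 2)."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, chunk_id in enumerate(retrieved[:k])
        if chunk_id in relevant
    )
    # Ideal ranking places every relevant id at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# Made-up example: two of three relevant chunks retrieved, at positions 1 and 3.
retrieved = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "z"}
recall = recall_at_k(retrieved, relevant)
ndcg = ndcg_at_k(retrieved, relevant)
```

Recall@k only asks whether relevant chunks made it into the top k, while nDCG@k also rewards placing them near the top, which is why reranking tends to move nDCG more than recall.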

Core Insights

  • Retrieval quality dominates: generation tweaks gave marginal gains; retrieval upgrades gave step-function improvements.
  • Reranking is non-negotiable for precision, but throughput-sensitive deployments should serve the hybrid profile and reserve reranking for high-stakes queries.
  • Safe abstention is cheaper than fact-checking: refusing weak answers eliminates a category of failures that would otherwise require post-hoc validation.
  • Fine-tuning was unnecessary: for small, structured corpora, off-the-shelf embeddings + reranking outperformed any fine-tuning we tested, at a fraction of the cost.

What Didn’t Work

  • Naïve top-k retrieval without RRF fusion produced inconsistent rankings between BM25 and dense retrieval, especially on multi-hop questions.
  • Pure dense retrieval missed exact-wording matches (policy section numbers, defined terms) that BM25 caught trivially.
  • Aggressive abstention thresholds initially caused over-refusal on valid edge cases; tuning the threshold against a held-out eval set was essential.
  • Semantic chunking outperformed recursive chunking for policy text, but at higher ingestion cost. Documented in the failure-mode taxonomy in the repo.
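The threshold tuning mentioned above can be approximated with a simple sweep over a labeled held-out set. The sketch below is an assumption about how such a sweep might look, not the project's procedure: `sweep_thresholds`, the example scores, and the candidate thresholds are all made up for illustration.

```python
def sweep_thresholds(examples, thresholds):
    """Report abstention trade-offs for each candidate threshold.

    examples:   list of (top_rerank_score, is_answerable) pairs from a held-out set
    thresholds: candidate gate values; the gate abstains when score < threshold
    """
    answerable = sum(1 for _, ok in examples if ok)
    unanswerable = len(examples) - answerable
    rows = []
    for t in thresholds:
        # Over-refusal: an answerable question blocked by the gate.
        over_refusals = sum(1 for score, ok in examples if ok and score < t)
        # Missed abstention: an unanswerable question that passes the gate.
        missed = sum(1 for score, ok in examples if not ok and score >= t)
        rows.append({
            "threshold": t,
            "over_refusal_rate": over_refusals / answerable if answerable else 0.0,
            "missed_abstention_rate": missed / unanswerable if unanswerable else 0.0,
        })
    return rows

# Made-up held-out examples: (top rerank score, question is answerable from the corpus).
examples = [(4.2, True), (1.1, True), (-0.5, True),
            (-2.0, False), (0.3, False), (-4.1, False)]
rows = sweep_thresholds(examples, thresholds=[-1.0, 0.0, 1.0])
for row in rows:
    print(row)
```

Raising the threshold trades missed abstentions for over-refusals. Questions that pass the score gate still go through the LLM grounding layer, so a missed abstention here is not necessarily a wrong answer downstream.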

Technical Stack

Python · LangChain · ChromaDB · Sentence-Transformers (E5-large-v2) · BM25 · Cross-Encoders (ms-marco-MiniLM-L-6-v2) · Ollama (llama3.1:8b) · RAGAS · Streamlit · Hugging Face Spaces

