RAG Meta-Agent

See Project with Full Code on GitHub

A strictly-typed LangGraph orchestration engine routing complex user intent across three isolated retrieval systems, backed by a custom 3-axis evaluation harness.

🚀 Try the live demo · 💻 View on GitHub

The Objective

RAG Meta-Agent is a production-grade multi-system orchestrator designed to move beyond single-database, monolithic retrieval. Real-world user intent often spans distinct domains—requiring a system that can intelligently decide whether to query a financial compliance index, an e-commerce catalog, or the live web. The objective was to build a routing pipeline that maintains strict boundary control, sub-second orchestration latency, and deterministic reliability without suffering from state-drift or hallucinations.

The Architecture & Methodology

The system operates on a deterministic state machine with strict separation of concerns:

Plan (Intent & Tool Selection): An LLM classifies the user’s intent and selects the appropriate tools using explicit R1-R5 routing rules.
Execute (Dynamic Dispatch): Tools are executed against external APIs (Modal-hosted AEGIS and ESCI endpoints, plus Tavily web search).
Reflect (Quality Control): The agent evaluates the retrieved context. It can loop back for more tools, proceed to generation, or safely abstain if context is missing.
Compose (Grounded Generation): Synthesizes the final answer with strict, verifiable source citations.

All state transitions are strictly governed by Pydantic models (AgentState, ToolCall, PlannerOutput), ensuring zero “shape-drift” as data moves between nodes.

The Differentiator: 3-Axis Evaluation & Strict State Management

Most agent architectures are tested on “vibes” or slow manual review. The RAG Meta-Agent ships with a custom, automated evaluation framework (judge.py) testing 50 gold-standard queries against three strict axes:

Routing Precision: Deterministic set comparison to ensure the correct databases were queried.
Citation Accuracy: Fuzzy token overlap verifying that the LLM explicitly cited the exact text used.
Fact Presence: An LLM-as-a-judge checking if the required gold facts are present in the final output.

Isolating the deterministic checks from the LLM-judged checks significantly reduces evaluation variance and API costs.

Key Performance Metrics

Metric	v1 Baseline	v2 Optimized	Δ (Improvement)
Routing Correctness	0.840	0.980	+0.140
Citation Correctness	0.730	0.860	+0.130
Answer Correctness	0.758	0.820	+0.062
Overall Score	0.776	0.887	+0.111

The v2 optimizations included fuzzy citation matching, explicit deterministic routing triggers, and refined abstention thresholds in the underlying retrieval tiers.

Core Insights

Strict typing is non-negotiable for LangGraph: Enforcing Pydantic schemas over standard dictionaries eliminated the silent data-mutation bugs that typically plague multi-node agent architectures.
Orchestration overhead must be minimized: By utilizing Cerebras (gpt-oss-120b), forcing JSON-mode, and setting reasoning_effort="low", the planning and reflection nodes execute in milliseconds. This preserves the latency budget for the actual database retrieval.
Evaluation pipelines break at scale: Running heavy LLM-judge evaluations easily shatters standard cloud rate limits (e.g., 100k tokens-per-day ceilings). Engineering local fallbacks or utilizing high-throughput providers is necessary for automated testing.

What Didn’t Work

Multi-tool synthesis is inherently difficult for current LLMs. While the agent routes “multi-intent” queries correctly 90% of the time, the composer struggles to genuinely synthesize across two distinct tool results. It tends to summarize them sequentially rather than weaving the insights together natively.
Pure token overlap for citation validation caused false negatives. The evaluation harness initially used strict token set fractions to verify citations, which heavily penalized minor LLM paraphrasing. Switching to a fuzzy substring/BM25-style overlap dramatically improved evaluation accuracy.
Sequential execution bottlenecks. Running multiple tool calls in a standard for loop caused unnecessary latency stacking. Moving to concurrent asyncio.gather dispatches is required for high-QPS production environments.

Technical Stack

Python · LangGraph · W&B Weave · Pydantic · Cerebras (gpt-oss-120b) · OpenAI SDK · Modal (FastAPI) · Tavily API · Streamlit · Hugging Face Spaces

🚀 Try the live demo · 💻 View on GitHub