RAG Meta-Agent

A strictly typed LangGraph orchestration engine that routes complex user intent across three isolated retrieval systems, backed by a custom 3-axis evaluation harness.
🚀 Try the live demo · 💻 View on GitHub
The Objective
RAG Meta-Agent is a production-grade multi-system orchestrator designed to move beyond single-database, monolithic retrieval. Real-world user intent often spans distinct domains, requiring a system that can intelligently decide whether to query a financial compliance index, an e-commerce catalog, or the live web. The objective was to build a routing pipeline that maintains strict boundary control, sub-second orchestration latency, and deterministic reliability without suffering from state-drift or hallucinations.
The Architecture & Methodology
The system operates on a deterministic state machine with strict separation of concerns:
- Plan (Intent & Tool Selection): An LLM classifies the user’s intent and selects the appropriate tools using explicit R1-R5 routing rules.
- Execute (Dynamic Dispatch): Tools are executed against external APIs (Modal-hosted AEGIS and ESCI endpoints, plus Tavily web search).
- Reflect (Quality Control): The agent evaluates the retrieved context. It can loop back for more tools, proceed to generation, or safely abstain if context is missing.
- Compose (Grounded Generation): Synthesizes the final answer with strict, verifiable source citations.
All state transitions are strictly governed by Pydantic models (AgentState, ToolCall, PlannerOutput), ensuring zero “shape-drift” as data moves between nodes.
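The Plan, Execute, Reflect, Compose loop above can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the project's implementation: standard-library dataclasses stand in for the Pydantic models, keyword rules stand in for the LLM planner, and the tool identifiers, `MAX_LOOPS`, `BACKENDS`, and every field name on `AgentState` and `ToolCall` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical tool identifiers for the three retrieval systems.
ALLOWED_TOOLS = {"aegis", "esci", "web"}
MAX_LOOPS = 2  # assumed cap on reflect -> plan retries


@dataclass(frozen=True)
class ToolCall:
    tool: str
    query: str

    def __post_init__(self) -> None:
        # Reject unknown tools at construction time, as a schema would.
        if self.tool not in ALLOWED_TOOLS:
            raise ValueError(f"unknown tool: {self.tool!r}")


@dataclass
class AgentState:
    question: str
    calls: List[ToolCall] = field(default_factory=list)
    context: List[str] = field(default_factory=list)
    answer: Optional[str] = None
    loops: int = 0


# Stub backends standing in for the remote endpoints.
BACKENDS: Dict[str, Callable[[str], List[str]]] = {
    "aegis": lambda q: ["Policy 4.2 requires an annual review."],
    "esci": lambda q: [],  # simulates an empty retrieval
    "web": lambda q: ["Example web snippet."],
}


def plan(state: AgentState) -> str:
    # Keyword rules stand in for the LLM planner and its routing rules.
    q = state.question.lower()
    tools = [t for t, kw in (("aegis", "compliance"), ("esci", "product")) if kw in q]
    state.calls = [ToolCall(t, state.question) for t in (tools or ["web"])]
    return "execute"


def execute(state: AgentState) -> str:
    for call in state.calls:
        state.context.extend(BACKENDS[call.tool](call.query))
    return "reflect"


def reflect(state: AgentState) -> str:
    # Proceed if there is context; otherwise retry, then abstain.
    if state.context:
        return "compose"
    state.loops += 1
    return "plan" if state.loops < MAX_LOOPS else "abstain"


def compose(state: AgentState) -> str:
    # Number each retrieved passage so the answer carries its citations.
    state.answer = " ".join(f"[{i}] {t}" for i, t in enumerate(state.context, 1))
    return "done"


def abstain(state: AgentState) -> str:
    state.answer = "Not enough context to answer."
    return "done"


NODES = {"plan": plan, "execute": execute, "reflect": reflect,
         "compose": compose, "abstain": abstain}


def run(question: str, max_steps: int = 10) -> AgentState:
    state = AgentState(question=question)
    node = "plan"
    for _ in range(max_steps):
        if node == "done":
            break
        node = NODES[node](state)
    return state
```

Each node returns only the name of the next node and mutates a single typed state object, so a malformed tool call fails at construction rather than several nodes downstream.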
The Differentiator: 3-Axis Evaluation & Strict State Management
Most agent architectures are tested on “vibes” or slow manual review. The RAG Meta-Agent ships with a custom, automated evaluation framework (judge.py) testing 50 gold-standard queries against three strict axes:
- Routing Precision: Deterministic set comparison to ensure the correct databases were queried.
- Citation Accuracy: Fuzzy token overlap verifying that the LLM explicitly cited the exact text used.
- Fact Presence: An LLM-as-a-judge checking if the required gold facts are present in the final output.
Isolating the deterministic checks from the LLM-judged checks significantly reduces evaluation variance and API costs.
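The two deterministic axes can be sketched as follows. This is an illustrative approximation, not the contents of judge.py: the function names, the tokenizer, and the 0.8 threshold are assumptions, and the LLM-judged Fact Presence axis is omitted because it requires a model call.

```python
import re
from typing import Iterable, List


def routing_correct(expected_tools: Iterable[str], called_tools: Iterable[str]) -> bool:
    # Order-insensitive set comparison: exactly the expected systems, no extras.
    return set(expected_tools) == set(called_tools)


def _tokens(text: str) -> List[str]:
    # Lowercased alphanumeric tokens; punctuation is ignored.
    return re.findall(r"[a-z0-9]+", text.lower())


def citation_overlap(cited: str, source: str) -> float:
    """Fraction of the cited tokens that also occur in the source passage."""
    cited_tokens = _tokens(cited)
    if not cited_tokens:
        return 0.0
    source_tokens = set(_tokens(source))
    hits = sum(1 for tok in cited_tokens if tok in source_tokens)
    return hits / len(cited_tokens)


def citation_correct(cited: str, source: str, threshold: float = 0.8) -> bool:
    # The threshold is a hypothetical value chosen for illustration.
    return citation_overlap(cited, source) >= threshold
```

Because both checks are pure functions of the agent's output, they can run on every query at no API cost, leaving the judge model for the one axis that needs it.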
Key Performance Metrics
| Metric | v1 Baseline | v2 Optimized | Δ (Improvement) |
| --- | --- | --- | --- |
| Routing Correctness | 0.840 | 0.980 | +0.140 |
| Citation Correctness | 0.730 | 0.860 | +0.130 |
| Answer Correctness | 0.758 | 0.820 | +0.062 |
| Overall Score | 0.776 | 0.887 | +0.111 |
The v2 optimizations included fuzzy citation matching, explicit deterministic routing triggers, and refined abstention thresholds in the underlying retrieval tiers.
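The Overall Score row is numerically consistent with an unweighted mean of the three axis scores. That reading is an inference from the table rather than something the source states, and the short check below only reproduces the arithmetic.

```python
from statistics import mean

# Axis scores copied from the table above.
v1 = {"routing": 0.840, "citation": 0.730, "answer": 0.758}
v2 = {"routing": 0.980, "citation": 0.860, "answer": 0.820}

# Assumption: overall = unweighted mean of the three axes.
overall_v1 = round(mean(list(v1.values())), 3)
overall_v2 = round(mean(list(v2.values())), 3)
delta = round(overall_v2 - overall_v1, 3)

print(overall_v1, overall_v2, delta)
```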
Core Insights
- Strict typing is non-negotiable for LangGraph: Enforcing Pydantic schemas over standard dictionaries eliminated the silent data-mutation bugs that typically plague multi-node agent architectures.
- Orchestration overhead must be minimized: By utilizing Cerebras (gpt-oss-120b), forcing JSON-mode, and setting reasoning_effort="low", the planning and reflection nodes execute in milliseconds. This preserves the latency budget for the actual database retrieval.
- Evaluation pipelines break at scale: Running heavy LLM-judge evaluations easily shatters standard cloud rate limits (e.g., 100k tokens-per-day ceilings). Engineering local fallbacks or utilizing high-throughput providers is necessary for automated testing.
What Didn’t Work
- Multi-tool synthesis is inherently difficult for current LLMs. While the agent routes “multi-intent” queries correctly 90% of the time, the composer struggles to genuinely synthesize across two distinct tool results. It tends to summarize them sequentially rather than weaving the insights together natively.
- Pure token overlap for citation validation caused false negatives. The evaluation harness initially used strict token set fractions to verify citations, which heavily penalized minor LLM paraphrasing. Switching to a fuzzy substring/BM25-style overlap dramatically improved evaluation accuracy.
- Sequential execution bottlenecks. Running multiple tool calls in a standard for loop caused unnecessary latency stacking. Moving to concurrent asyncio.gather dispatches is required for high-QPS production environments.
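The latency stacking in the last point can be illustrated with a small comparison of sequential and concurrent dispatch. This is a sketch under assumptions: `call_tool` is a hypothetical stand-in that simulates network latency with `asyncio.sleep`, and the tool names and delays are made up.

```python
import asyncio
import time
from typing import List, Tuple

Call = Tuple[str, str, float]  # (tool name, query, simulated latency in seconds)


async def call_tool(name: str, query: str, delay: float) -> str:
    # Stand-in for an HTTP request to a retrieval endpoint.
    await asyncio.sleep(delay)
    return f"{name}: results for {query!r}"


async def dispatch_sequential(calls: List[Call]) -> List[str]:
    # Each call waits for the previous one, so latencies add up.
    return [await call_tool(*c) for c in calls]


async def dispatch_concurrent(calls: List[Call]) -> List[str]:
    # gather runs the calls concurrently and preserves input order.
    return list(await asyncio.gather(*(call_tool(*c) for c in calls)))


def timed(dispatch, calls: List[Call]) -> Tuple[List[str], float]:
    start = time.perf_counter()
    results = asyncio.run(dispatch(calls))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    demo = [("aegis", "q", 0.05), ("esci", "q", 0.05), ("web", "q", 0.05)]
    _, seq_s = timed(dispatch_sequential, demo)
    _, con_s = timed(dispatch_concurrent, demo)
    print(f"sequential: {seq_s:.3f}s, concurrent: {con_s:.3f}s")
```

With concurrent dispatch the total wait approaches the slowest single call instead of the sum of all calls. Passing `return_exceptions=True` to `asyncio.gather` returns exceptions as results rather than raising the first one, which may suit a reflect step that has to decide what to do about a failed tool.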
Technical Stack
Python · LangGraph · W&B Weave · Pydantic · Cerebras (gpt-oss-120b) · OpenAI SDK · Modal (FastAPI) · Tavily API · Streamlit · Hugging Face Spaces