RAG Meta-Agent

A strictly-typed LangGraph orchestration engine routing complex user intent across three isolated retrieval systems, backed by a custom 3-axis evaluation harness.

🚀 Try the live demo · 💻 View on GitHub

The Objective

RAG Meta-Agent is a production-grade multi-system orchestrator designed to move beyond single-database, monolithic retrieval. Real-world user intent often spans distinct domains, which requires a system that can decide whether to query a financial compliance index, an e-commerce catalog, or the live web. The objective was to build a routing pipeline that maintains strict boundary control, sub-second orchestration latency, and deterministic reliability without suffering from state drift or hallucinations.

The Architecture & Methodology

The system operates on a deterministic state machine with strict separation of concerns:

  • Plan (Intent & Tool Selection): An LLM classifies the user’s intent and selects the appropriate tools using explicit R1-R5 routing rules.
  • Execute (Dynamic Dispatch): Tools are executed against external APIs (Modal-hosted AEGIS and ESCI endpoints, plus Tavily web search).
  • Reflect (Quality Control): The agent evaluates the retrieved context. It can loop back for more tools, proceed to generation, or safely abstain if context is missing.
  • Compose (Grounded Generation): Synthesizes the final answer with strict, verifiable source citations.

All state transitions are strictly governed by Pydantic models (AgentState, ToolCall, PlannerOutput), ensuring zero “shape-drift” as data moves between nodes.
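The Plan, Execute, Reflect, Compose loop can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's code: standard-library dataclasses stand in for the Pydantic models so the sketch has no dependencies, a toy keyword rule stands in for the LLM planner and its R1-R5 rules, and the tool identifiers, field names, and max_steps limit are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

TOOLS = ("aegis", "esci", "web")  # hypothetical tool identifiers


@dataclass
class ToolCall:
    tool: str
    query: str

    def __post_init__(self) -> None:
        # Reject unknown tools at construction time, not at dispatch time.
        if self.tool not in TOOLS:
            raise ValueError(f"unknown tool: {self.tool!r}")


@dataclass
class AgentState:
    question: str
    pending: List[ToolCall] = field(default_factory=list)
    context: List[str] = field(default_factory=list)
    answer: Optional[str] = None
    steps: int = 0


def plan(state: AgentState) -> None:
    # Toy keyword routing; an assumption standing in for the LLM planner.
    q = state.question.lower()
    calls = []
    if "compliance" in q:
        calls.append(ToolCall("aegis", state.question))
    if "product" in q:
        calls.append(ToolCall("esci", state.question))
    if not calls:
        calls.append(ToolCall("web", state.question))
    state.pending = calls


def execute(state: AgentState, tools: Dict[str, Callable[[str], List[str]]]) -> None:
    # Dispatch every planned call and collect the retrieved passages.
    for call in state.pending:
        state.context.extend(tools[call.tool](call.query))
    state.pending = []


def reflect(state: AgentState) -> str:
    # Proceed to generation only if some context was retrieved.
    return "compose" if state.context else "retry"


def compose(state: AgentState) -> None:
    if not state.context:
        state.answer = "I don't have enough context to answer."  # safe abstention
        return
    # Attach a numbered citation marker to each passage used.
    state.answer = " ".join(f"{text} [{i}]" for i, text in enumerate(state.context, 1))


def run(question: str, tools: Dict[str, Callable[[str], List[str]]],
        max_steps: int = 2) -> AgentState:
    state = AgentState(question=question)
    while state.steps < max_steps:
        plan(state)
        execute(state, tools)
        state.steps += 1
        if reflect(state) == "compose":
            break
    compose(state)
    return state
```

Calling run with a dictionary of stub tool functions exercises each transition. In the real system, Pydantic validation would play the role that __post_init__ plays here.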

The Differentiator: 3-Axis Evaluation & Strict State Management

Most agent architectures are tested on “vibes” or slow manual review. The RAG Meta-Agent ships with a custom, automated evaluation framework (judge.py) testing 50 gold-standard queries against three strict axes:

  1. Routing Precision: Deterministic set comparison to ensure the correct databases were queried.
  2. Citation Accuracy: Fuzzy token overlap verifying that the LLM explicitly cited the exact text used.
  3. Fact Presence: An LLM-as-a-judge checking if the required gold facts are present in the final output.

Isolating the deterministic checks from the LLM-judged checks significantly reduces evaluation variance and API costs.
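The two deterministic axes need no model calls, which a short sketch makes concrete. The function names and the 0.8 threshold below are assumptions for illustration. The citation check is a simple token-set fraction with a tolerance threshold, not the project's actual fuzzy scoring, which is not shown here.

```python
import re
from typing import Iterable, Set


def routing_correct(expected: Iterable[str], actual: Iterable[str]) -> bool:
    # Order-insensitive: only the set of systems queried matters.
    return set(expected) == set(actual)


def _tokens(text: str) -> Set[str]:
    # Lowercased alphanumeric tokens; punctuation is ignored.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def citation_overlap(cited: str, source: str) -> float:
    # Fraction of the cited tokens that also appear in the source passage.
    cited_tokens = _tokens(cited)
    if not cited_tokens:
        return 0.0
    return len(cited_tokens & _tokens(source)) / len(cited_tokens)


def citation_correct(cited: str, source: str, threshold: float = 0.8) -> bool:
    # The threshold is a hypothetical tolerance for minor paraphrasing.
    return citation_overlap(cited, source) >= threshold
```

Only the third axis, fact presence, would then require an LLM call per query.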

Key Performance Metrics

Metric                  v1 Baseline   v2 Optimized   Δ (Improvement)
Routing Correctness     0.840         0.980          +0.140
Citation Correctness    0.730         0.860          +0.130
Answer Correctness      0.758         0.820          +0.062
Overall Score           0.776         0.887          +0.111

The v2 optimizations included fuzzy citation matching, explicit deterministic routing triggers, and refined abstention thresholds in the underlying retrieval tiers.
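The reported overall scores are consistent with an unweighted mean of the three axis scores. That is an inference from the published numbers, not a formula stated by the project, and the quick check below only reproduces the arithmetic.

```python
# Axis scores copied from the metrics table.
SCORES = {
    "v1": {"routing": 0.840, "citation": 0.730, "answer": 0.758},
    "v2": {"routing": 0.980, "citation": 0.860, "answer": 0.820},
}


def overall(axes: dict) -> float:
    # Assumed aggregation: unweighted mean, rounded to three decimals.
    return round(sum(axes.values()) / len(axes), 3)


v1_overall = overall(SCORES["v1"])
v2_overall = overall(SCORES["v2"])
improvement = round(v2_overall - v1_overall, 3)
print(v1_overall, v2_overall, improvement)
```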

Core Insights

  • Strict typing is non-negotiable for LangGraph: Enforcing Pydantic schemas over standard dictionaries eliminated the silent data-mutation bugs that typically plague multi-node agent architectures.
  • Orchestration overhead must be minimized: By utilizing Cerebras (gpt-oss-120b), forcing JSON-mode, and setting reasoning_effort="low", the planning and reflection nodes execute in milliseconds. This preserves the latency budget for the actual database retrieval.
  • Evaluation pipelines break at scale: Running heavy LLM-judge evaluations easily shatters standard cloud rate limits (e.g., 100k tokens-per-day ceilings). Engineering local fallbacks or utilizing high-throughput providers is necessary for automated testing.
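One way to stay under a daily token ceiling is to track spend and switch judges once the budget runs out. The sketch below is hypothetical: the TokenBudget class, the choose_judge helper, and the "hosted" and "local" labels are illustrative names, and per-call token estimates would have to come from the actual prompts.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    limit: int = 100_000  # example daily ceiling from the text
    used: int = 0

    def try_spend(self, tokens: int) -> bool:
        # Record the spend only if it fits within the remaining budget.
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.used + tokens > self.limit:
            return False
        self.used += tokens
        return True


def choose_judge(budget: TokenBudget, estimated_tokens: int) -> str:
    # Use the hosted judge while budget remains, else fall back locally.
    return "hosted" if budget.try_spend(estimated_tokens) else "local"
```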

What Didn’t Work

  • Multi-tool synthesis is inherently difficult for current LLMs. While the agent routes “multi-intent” queries correctly 90% of the time, the composer struggles to genuinely synthesize across two distinct tool results. It tends to summarize them sequentially rather than weaving the insights together natively.
  • Pure token overlap for citation validation caused false negatives. The evaluation harness initially used strict token set fractions to verify citations, which heavily penalized minor LLM paraphrasing. Switching to a fuzzy substring/BM25-style overlap dramatically improved evaluation accuracy.
  • Sequential execution bottlenecks. Running multiple tool calls in a standard for loop caused unnecessary latency stacking. Moving to concurrent asyncio.gather dispatches is required for high-QPS production environments.
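The latency stacking described in the last point is easy to reproduce with stub tools. In this sketch, fake_tool and its delays are placeholders for the real retrieval calls. The sequential version waits for the sum of the delays, while asyncio.gather waits roughly for the longest one and returns results in input order.

```python
import asyncio
import time
from typing import List, Tuple

Call = Tuple[str, float]  # (tool name, simulated latency in seconds)


async def fake_tool(name: str, delay: float) -> str:
    # Placeholder for a network-bound retrieval call.
    await asyncio.sleep(delay)
    return f"{name}: done"


async def dispatch_sequential(calls: List[Call]) -> List[str]:
    # Each call waits for the previous one, so latencies add up.
    return [await fake_tool(name, delay) for name, delay in calls]


async def dispatch_concurrent(calls: List[Call]) -> List[str]:
    # All calls start together; gather preserves the input order.
    return list(await asyncio.gather(*(fake_tool(n, d) for n, d in calls)))


def timed(dispatch, calls: List[Call]):
    # Run one dispatcher to completion and report wall-clock time.
    start = time.perf_counter()
    results = asyncio.run(dispatch(calls))
    return results, time.perf_counter() - start
```

Comparing timed(dispatch_sequential, calls) with timed(dispatch_concurrent, calls) on the same list of calls shows the gap directly.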

Technical Stack

Python · LangGraph · W&B Weave · Pydantic · Cerebras (gpt-oss-120b) · OpenAI SDK · Modal (FastAPI) · Tavily API · Streamlit · Hugging Face Spaces

