Amazon ESCI Search

Two-stage hybrid retrieval over 480k Amazon products — Matryoshka 64-dim FAISS + SPLADE + BM25 → cross-encoder reranking. Production-deployed.

🚀 Try the live demo · 💻 View on GitHub · ⚙️ Raw API

Amazon ESCI demo: a query for 'wireless gaming mouse' returns ranked products in ~1 second, with per-result retrieval-signal breakdown showing which retrievers (BM25 / SPLADE / Dense) found each product and at what rank.

⏱️ Demo cold-start: ~30-60s on first request, ~1 sec after warmup.

The Objective

Build a production-style e-commerce search engine on the Amazon Shopping Queries dataset (ESCI), maximizing ranking precision (nDCG) while compressing vector storage 12× from 768 → 64 dimensions — without sacrificing recall.

The challenge: standard embedding models lose ~50% of their recall when truncated to 64 dimensions. Matryoshka representation learning is supposed to fix that. Does it?

The Architecture & Methodology

A two-stage pipeline with strict separation of concerns:

  1. Hybrid Candidate Generation
  • Dense retrieval: BGE-base fine-tuned with Matryoshka loss, truncated to 64 dims, indexed in FAISS
  • Lexical retrieval: BM25 over stemmed product text
  • Neural sparse: SPLADE-cocondenser-ensembledistil with pre-encoded inverted index
  • All three fused via weighted Reciprocal Rank Fusion (RRF) with separately-tuned weights
  2. Cross-Encoder Reranking
  • Top 200 candidates rescored by mxbai-rerank-base-v1 (the deployed demo instead reranks the top 100 with the smaller ms-marco-MiniLM-L-6-v2 to keep CPU latency near 1s)
  • Final top-K returned with full per-retriever signal breakdown
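
The weighted fusion step in stage 1 can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `weighted_rrf`, the smoothing constant `k=60` (a commonly used default), the product ids, and the weights are all assumptions made for the example.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists of doc ids with weighted Reciprocal Rank Fusion.

    rankings: dict mapping retriever name -> list of doc ids, best first.
    weights:  dict mapping retriever name -> non-negative weight.
    k:        RRF smoothing constant (assumed value, not from the project).
    """
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        w = weights.get(name, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            # Each retriever contributes w / (k + rank) for every doc it found.
            scores[doc_id] += w / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical rankings and weights, for illustration only.
rankings = {
    "dense": ["B01", "B02", "B03"],
    "bm25": ["B02", "B04"],
    "splade": ["B02", "B01"],
}
weights = {"dense": 1.0, "bm25": 0.5, "splade": 0.7}

fused = weighted_rrf(rankings, weights)
fused_ids = [doc_id for doc_id, _ in fused]
```

A product found by all three retrievers (here `B02`) accumulates three contributions and rises above one that only the dense retriever ranked first, which is the behavior the per-retriever weights are meant to tune.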

The Differentiator: Matryoshka @ 64 Dimensions

Most embedding models collapse when truncated. Matryoshka fine-tuning trains the model to pack the most important features into the first N dimensions, so keeping only a prefix of the vector preserves most of the retrieval signal instead of discarding an arbitrary slice of it.
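
As a rough sketch of what truncation means at inference time: keep the first 64 components of the model output and rescale to unit length, so that inner-product search behaves like cosine similarity. The renormalization step and the helper name `truncate_and_normalize` are assumptions for this example; the same transform would have to be applied to both product and query vectors.

```python
import math


def truncate_and_normalize(vec, dims=64):
    """Keep the first `dims` components and rescale to unit L2 norm."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # Degenerate all-zero prefix: return it unchanged rather than divide by zero.
        return list(head)
    return [x / norm for x in head]


# Toy 768-dim vector standing in for a model output (hypothetical values).
full = [1.0 / (i + 1) for i in range(768)]
short = truncate_and_normalize(full, dims=64)
```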

| Strategy | Baseline 64-dim Recall@200 | Matryoshka 64-dim Recall@200 | Δ |
|---|---|---|---|
| Dense Only | 0.43 | 0.74 | +73% |
| Dense + BM25 | 0.56 | 0.78 | +41% |
| Dense + SPLADE | 0.66 | 0.81 | +22% |
| Dense + BM25 + SPLADE | 0.70 | 0.81 | +16% |

At 64 dimensions, hybrid Matryoshka matches the recall of a 768-dim baseline — at one-twelfth the storage and roughly 4× the QPS. All experiments tracked in MLflow on DagsHub.
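
The one-twelfth storage figure follows directly from the dimension ratio. A back-of-the-envelope check, assuming uncompressed float32 vectors in a flat index and using the approximate 480k catalogue size from the overview (the real index layout may differ):

```python
NUM_PRODUCTS = 480_000   # approximate catalogue size stated in the overview
BYTES_PER_FLOAT32 = 4    # assumption: uncompressed float32 vectors


def flat_index_bytes(num_vectors, dims, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw vector storage for a flat index, ignoring any index overhead."""
    return num_vectors * dims * bytes_per_value


full_bytes = flat_index_bytes(NUM_PRODUCTS, 768)
small_bytes = flat_index_bytes(NUM_PRODUCTS, 64)
ratio = full_bytes / small_bytes

print(f"768-dim: {full_bytes / 1e9:.2f} GB")   # 1.47 GB
print(f"64-dim:  {small_bytes / 1e6:.1f} MB")  # 122.9 MB
print(f"ratio:   {ratio:.0f}x")                # 12x
```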

Key Performance Metrics

| Stage | Recall@200 | nDCG@20 | QPS |
|---|---|---|---|
| Retrieval (hybrid) | 0.8148 | 0.4846 | 69.63 |
| After Reranking | | 0.5378 | 5.72 |

Reranking adds ~5 points of nDCG@20 at a 12× QPS cost — a clear precision/throughput tradeoff that should be made per-deployment.
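
For readers unfamiliar with the metric, here is a minimal sketch of nDCG@k with graded relevance. The gain values in the example are hypothetical and are not the label weights used in the project's evaluation. It also shows why reranking moves nDCG: both orderings below contain the same items, and only their positions differ.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (rank 1 first)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_gains, k, all_gains=None):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal ordering.

    ranked_gains: gains of the returned items, in returned order.
    all_gains:    gains of every judged item for the query
                  (defaults to ranked_gains when omitted).
    """
    pool = ranked_gains if all_gains is None else all_gains
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded labels for one query: same items, two orderings.
before_rerank = [0.1, 1.0, 0.0, 1.0]
after_rerank = [1.0, 1.0, 0.1, 0.0]

ndcg_before = ndcg_at_k(before_rerank, k=20)
ndcg_after = ndcg_at_k(after_rerank, k=20)
```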

Production Deployment

The system is deployed end-to-end, not just benchmarked:

  • FastAPI backend containerized and deployed on Modal (CPU, scales to zero, ~1s warm latency)
  • Streamlit frontend on Hugging Face Spaces, calling the Modal API
  • Persistent Modal Volume holds the 1.7GB of artifacts (FAISS index, SPLADE matrix, BM25, fine-tuned matryoshka weights) — image stays small, cold starts stay reasonable
  • Per-result signal breakdown in the API response: every result shows which retrievers found it and at what rank, exposing the hybrid retrieval dynamics that are usually hidden
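
One way to build the per-result signal breakdown is to look up each final result in every retriever's ranked list. The sketch below is illustrative only: the field names (`product_id`, `rank`, `signals`) and the helper names are assumptions, not the deployed API's actual response schema.

```python
def signal_breakdown(rankings, doc_id):
    """Return {retriever: 1-based rank, or None if that retriever missed the doc}."""
    breakdown = {}
    for name, ranked_ids in rankings.items():
        breakdown[name] = ranked_ids.index(doc_id) + 1 if doc_id in ranked_ids else None
    return breakdown


def build_results(final_ids, rankings):
    """Attach a per-retriever breakdown to each result in final order."""
    return [
        {"product_id": doc_id, "rank": rank, "signals": signal_breakdown(rankings, doc_id)}
        for rank, doc_id in enumerate(final_ids, start=1)
    ]


# Hypothetical retriever outputs and final ordering.
rankings = {
    "bm25": ["B02", "B04"],
    "splade": ["B02", "B01"],
    "dense": ["B01", "B02", "B03"],
}
results = build_results(["B02", "B01", "B03"], rankings)
```

A `None` entry makes it visible when a product reached the final list through only one or two of the three retrievers.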

Core Insights

  • Matryoshka representation learning is real and matters: dense-only 64-dim Recall@200 is 0.74, versus 0.43 for naive truncation of the same base model, and the full hybrid reaches 0.81
  • Hybrid > monolithic — every retriever contributes complementary signal; SPLADE catches what BM25 misses, dense catches what both miss
  • Reranking is non-negotiable for precision — but throughput-sensitive deployments should reserve it for high-stakes queries
  • Production deployment exposes bugs benchmarks never see — silent vector-space mismatches, CUDA-pickled tensors on CPU containers, sklearn version skew. All documented in the failure-mode log

What Didn’t Work

  • Loading the base BGE model in production (instead of the fine-tuned matryoshka) was a silent recall bug — query embeddings ended up in a different vector space than the indexed products. Caught only when productionizing
  • Pure RRF without weighted fusion underperformed; SPLADE and BM25 needed lower weights than dense to avoid drowning out semantic signal on long-tail queries
  • Running the full cross-encoder over the top 200 on CPU took about 16 seconds per query, which is unusable for a live demo. The demo reranks only the top 100 with a smaller cross-encoder, for ~1s latency
  • Reranking improves nDCG far more than recall, with diminishing returns past Recall@10: the cross-encoder reorders the candidates it is given and cannot add products the retrievers missed
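
A silent vector-space mismatch like the one in the first bullet could be caught with a startup check: store a few probe texts alongside their indexed vectors at build time, re-encode them with whatever encoder the service loaded, and fail fast if the vectors disagree. The sketch below is a hypothetical guard, not something the project is known to use. The function names, the `0.99` threshold, and the toy encoders are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def check_encoder_matches_index(encode, probes, min_cosine=0.99):
    """Raise if re-encoding probe texts disagrees with their stored vectors.

    encode: callable text -> vector (the encoder loaded in production).
    probes: list of (text, stored_vector) pairs saved when the index was built.
    """
    for text, stored in probes:
        sim = cosine(encode(text), stored)
        if sim < min_cosine:
            raise RuntimeError(
                f"encoder/index mismatch on probe {text!r}: cosine={sim:.3f}"
            )


# Toy stand-ins for the fine-tuned and base encoders (hypothetical).
def finetuned_encode(text):
    return [float(len(text)), 1.0, 0.0]


def base_encode(text):
    return [0.0, 1.0, float(len(text))]


probes = [("wireless gaming mouse", finetuned_encode("wireless gaming mouse"))]
check_encoder_matches_index(finetuned_encode, probes)  # passes silently
```

Calling the same check with `base_encode` raises `RuntimeError`, turning a quiet recall drop into a failed startup.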

Technical Stack

Python · PyTorch · Sentence-Transformers (BGE-base + Matryoshka loss) · FAISS · BM25 (custom CSR-tensor implementation) · SPLADE · Cross-Encoders · MLflow + DagsHub · FastAPI · Modal (serverless deployment) · Streamlit · Hugging Face Spaces


🚀 Try the live demo · 💻 View on GitHub · 📊 MLflow experiment log
