Amazon ESCI Search

Two-stage hybrid retrieval over 480k Amazon products — Matryoshka 64-dim FAISS + SPLADE + BM25 → cross-encoder reranking. Production-deployed.

🚀 Try the live demo · 💻 View on GitHub · ⚙️ Raw API

Amazon ESCI demo: a query for 'wireless gaming mouse' returns ranked products in ~1 second, with per-result retrieval-signal breakdown showing which retrievers (BM25 / SPLADE / Dense) found each product and at what rank.

⏱️ Demo cold-start: ~30-60s on first request, ~1 sec after warmup.

The Objective

Build a production-style e-commerce search engine on the Amazon Shopping Queries dataset (ESCI), maximizing ranking precision (nDCG) while compressing vector storage 12× from 768 → 64 dimensions — without sacrificing recall.

The challenge: standard embedding models lose ~50% of their recall when truncated to 64 dimensions. Matryoshka representation learning is supposed to fix that. Does it?

The Architecture & Methodology

A two-stage pipeline with strict separation of concerns:

Hybrid Candidate Generation

Dense retrieval: BGE-base fine-tuned with Matryoshka loss, truncated to 64 dims, indexed in FAISS
Lexical retrieval: BM25 over stemmed product text
Neural sparse: SPLADE-cocondenser-ensembledistil with pre-encoded inverted index
All three fused via weighted Reciprocal Rank Fusion (RRF) with separately-tuned weights
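The fusion step can be sketched in a few lines. This is a minimal illustration of weighted RRF, not the project's actual code: the weights, the product IDs, and the smoothing constant k=60 (the value conventionally used for RRF) are assumptions chosen for the example.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists of doc IDs with weighted Reciprocal Rank Fusion.

    score(d) = sum over retrievers r of  w_r / (k + rank_r(d)),
    where rank_r(d) is the 1-based rank of d in retriever r's list.
    Documents a retriever did not return contribute nothing for that retriever.
    """
    scores = defaultdict(float)
    for retriever, ranked_ids in rankings.items():
        w = weights.get(retriever, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += w / (k + rank)
    # Sort by descending fused score; break ties on doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical ranked lists and weights, for illustration only.
rankings = {
    "dense": ["B01", "B02", "B03"],
    "bm25": ["B02", "B04"],
    "splade": ["B02", "B01"],
}
weights = {"dense": 1.0, "bm25": 0.5, "splade": 0.7}

fused = weighted_rrf(rankings, weights)
print(fused)
```

Because RRF uses only ranks, the raw BM25, SPLADE, and cosine scores never need to be put on a common scale; the per-retriever weights are the only tuning knobs besides k.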

Cross-Encoder Reranking

Top 200 candidates rescored by mxbai-rerank-base-v1 (the deployed demo instead reranks the top 100 with ms-marco-MiniLM-L-6-v2 to keep CPU latency low)
Final top-K returned with full per-retriever signal breakdown
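The second stage reduces to "score each (query, candidate) pair, then sort". The sketch below keeps the scorer pluggable so it runs without model weights; `toy_overlap_score`, the candidate dictionaries, and the field names are hypothetical stand-ins. In a real deployment `score_fn` would wrap a cross-encoder's pair-scoring call.

```python
def rerank(query, candidates, score_fn, depth=200, top_k=10):
    """Rescore the first `depth` candidates and return the best `top_k`.

    `candidates` is a list of dicts with at least a "text" field, already
    ordered by the first-stage (fused) retrieval score.
    """
    pool = candidates[:depth]
    scored = [(score_fn(query, c["text"]), c) for c in pool]
    # Sort on the score only; Python's sort is stable, so ties keep
    # their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [{**c, "rerank_score": s} for s, c in scored[:top_k]]


def toy_overlap_score(query, text):
    """Stand-in scorer (NOT a cross-encoder): count of shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


# Hypothetical first-stage output.
candidates = [
    {"id": "A", "text": "usb keyboard"},
    {"id": "B", "text": "wireless gaming mouse rgb"},
    {"id": "C", "text": "gaming mouse pad"},
]

top = rerank("wireless gaming mouse", candidates, toy_overlap_score, top_k=2)
print([c["id"] for c in top])
```

The `depth` parameter is the latency lever: reranking cost grows linearly with the number of pairs scored, which is why a CPU-only deployment might prefer a shallower pool and a smaller model.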

The Differentiator: Matryoshka @ 64 Dimensions

Most embedding models collapse when truncated. Matryoshka fine-tuning teaches the model to encode the most important features in the first N dimensions, so truncation is principled rather than lossy.
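At inference time, using a Matryoshka-trained embedding at a smaller size amounts to keeping the leading dimensions. The dependency-free sketch below shows that step under one assumption not stated above: that the truncated vector is re-normalized to unit length so inner-product search behaves like cosine similarity.

```python
import math


def truncate_and_normalize(vec, dims=64):
    """Keep the first `dims` components and rescale to unit L2 norm."""
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # Degenerate all-zero prefix: nothing to rescale.
    return [x / norm for x in head]


# Hypothetical 768-dim embedding, for illustration only.
full = [3.0, 4.0] + [1.0] * 766

small = truncate_and_normalize(full)            # default: 64 dims
tiny = truncate_and_normalize(full, dims=2)     # [0.6, 0.8]
print(len(small), tiny)
```

The same function applied to a model trained without Matryoshka loss still runs, but nothing in that model's training encourages the first 64 components to carry most of the signal, which is the gap the table below measures.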

Strategy	Baseline 64-dim Recall@200	Matryoshka 64-dim Recall@200	Relative Δ
Dense Only	0.43	0.74	+73%
Dense + BM25	0.56	0.78	+41%
Dense + SPLADE	0.66	0.81	+22%
Dense + BM25 + SPLADE	0.70	0.81	+16%

At 64 dimensions, hybrid Matryoshka matches the recall of a 768-dim baseline — at one-twelfth the storage and roughly 4× the QPS. All experiments tracked in MLflow on DagsHub.
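The one-twelfth storage figure follows directly from the dimension ratio. As a back-of-the-envelope check, the sketch below assumes uncompressed float32 vectors in a flat index and rounds the catalog to 480,000 products; the real index type and exact product count may differ.

```python
def flat_index_bytes(n_vectors, dims, bytes_per_value=4):
    """Raw vector payload of a flat index (float32 by default), excluding overhead."""
    return n_vectors * dims * bytes_per_value


N_PRODUCTS = 480_000  # Rounded catalog size (assumption).

full_bytes = flat_index_bytes(N_PRODUCTS, 768)   # 1_474_560_000 bytes, about 1.47 GB
small_bytes = flat_index_bytes(N_PRODUCTS, 64)   # 122_880_000 bytes, about 123 MB
ratio = full_bytes / small_bytes                 # 12.0

print(f"{full_bytes / 1e9:.2f} GB -> {small_bytes / 1e6:.0f} MB ({ratio:.0f}x smaller)")
```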

Key Performance Metrics

Stage	Recall@200	nDCG@20	QPS
Retrieval (hybrid)	0.8148	0.4846	69.63
After Reranking	—	0.5378	5.72

Reranking adds ~5 points of nDCG@20 at a 12× QPS cost — a clear precision/throughput tradeoff that should be made per-deployment.
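For readers who want to reproduce the two metrics in the table, here is a minimal sketch. The graded gains for the ESCI labels (Exact 1.0, Substitute 0.1, Complement 0.01, Irrelevant 0.0) and the choice of which labels count as "relevant" for recall are assumptions; the evaluation code in the repository may define them differently.

```python
import math

# Assumed graded gains for ESCI labels (not confirmed by this write-up).
ESCI_GAIN = {"E": 1.0, "S": 0.1, "C": 0.01, "I": 0.0}


def dcg(gains, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_labels, judged_labels, k=20):
    """nDCG@k for one query.

    ranked_labels: ESCI labels of the returned products, in ranked order.
    judged_labels: ESCI labels of every judged product for the query.
    """
    gains = [ESCI_GAIN[label] for label in ranked_labels]
    ideal = sorted((ESCI_GAIN[label] for label in judged_labels), reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids, relevant_ids, k=200):
    """Fraction of the relevant products that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


perfect = ndcg_at_k(["E", "S", "I"], ["E", "S", "I"], k=3)
swapped = ndcg_at_k(["I", "E"], ["E", "I"], k=2)
half = recall_at_k(["a", "b", "c"], ["a", "d"], k=2)
print(perfect, round(swapped, 4), half)
```

This also makes the reranking effect concrete: reordering a fixed candidate pool changes `ndcg_at_k` but cannot change `recall_at_k` at the pool depth, since the set of retrieved products stays the same.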

Production Deployment

The system is deployed end-to-end, not just benchmarked:

FastAPI backend containerized and deployed on Modal (CPU, scales to zero, ~1s warm latency)
Streamlit frontend on Hugging Face Spaces, calling the Modal API
Persistent Modal Volume holds the 1.7 GB of artifacts (FAISS index, SPLADE matrix, BM25, fine-tuned Matryoshka weights), so the image stays small and cold starts stay reasonable
Per-result signal breakdown in the API response: every result shows which retrievers found it and at what rank, exposing the hybrid retrieval dynamics that are usually hidden
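The per-result signal breakdown can be derived from the same ranked lists that feed the fusion step. The sketch below shows one possible shape; the field names (`product_id`, `position`, `signals`) and the IDs are hypothetical and not the deployed API's actual schema.

```python
def signal_breakdown(doc_id, rankings):
    """Map each retriever to the 1-based rank at which it returned doc_id, or None."""
    breakdown = {}
    for retriever, ranked_ids in rankings.items():
        breakdown[retriever] = (
            ranked_ids.index(doc_id) + 1 if doc_id in ranked_ids else None
        )
    return breakdown


def build_results(final_ids, rankings):
    """Attach the retrieval-signal breakdown to each final result."""
    return [
        {
            "product_id": doc_id,
            "position": position,
            "signals": signal_breakdown(doc_id, rankings),
        }
        for position, doc_id in enumerate(final_ids, start=1)
    ]


# Hypothetical per-retriever rankings and final ordering.
rankings = {
    "bm25": ["B02", "B04"],
    "splade": ["B02", "B01"],
    "dense": ["B01", "B02", "B03"],
}
results = build_results(["B02", "B01"], rankings)
print(results)
```

A `None` entry is informative in its own right: it marks a product that only some retrievers surfaced, which is exactly the complementary behavior the hybrid design relies on.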

Core Insights

Matryoshka representation learning is real and matters: dense-only 64-dim Recall@200 is 0.74 vs 0.43 for naive truncation of the same base model, and the full hybrid reaches 0.81
Hybrid > monolithic — every retriever contributes complementary signal; SPLADE catches what BM25 misses, dense catches what both miss
Reranking is non-negotiable for precision — but throughput-sensitive deployments should reserve it for high-stakes queries
Production deployment exposes bugs benchmarks never see — silent vector-space mismatches, CUDA-pickled tensors on CPU containers, sklearn version skew. All documented in the failure-mode log

What Didn’t Work

Loading the base BGE model in production (instead of the fine-tuned matryoshka) was a silent recall bug — query embeddings ended up in a different vector space than the indexed products. Caught only when productionizing
Pure RRF without weighted fusion underperformed; SPLADE and BM25 needed lower weights than dense to avoid drowning out semantic signal on long-tail queries
Cross-encoder on CPU at the full top-200 took about 16 seconds per query, which is unusable for a live demo. Reduced to top-100 with a smaller cross-encoder for ~1s demo latency
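Reranking improves nDCG far more than recall: it only reorders the candidates that retrieval already found, so recall gains show up at small cutoffs (around Recall@10) and fade at deeper ones. The value is reordering, not retrieving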
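The first item in this list (base model loaded in place of the fine-tuned one) is the kind of bug a startup check can catch. One possible guard, sketched below, stores an encoder identifier next to the index and refuses to serve when it does not match. The manifest format, field names, and identifiers are all hypothetical; this is not how the project is known to handle it.

```python
class EncoderMismatchError(RuntimeError):
    """Raised when the serving encoder differs from the one that built the index."""


def check_encoder_matches_index(manifest, encoder_id, encoder_dim):
    """Compare the serving encoder against metadata saved at index-build time."""
    problems = []
    if manifest.get("encoder_id") != encoder_id:
        problems.append(
            f"index built with {manifest.get('encoder_id')!r}, serving {encoder_id!r}"
        )
    if manifest.get("dim") != encoder_dim:
        problems.append(
            f"index dim {manifest.get('dim')!r}, encoder dim {encoder_dim!r}"
        )
    if problems:
        raise EncoderMismatchError("; ".join(problems))


# Hypothetical manifest written alongside the FAISS index at build time.
manifest = {"encoder_id": "bge-base-matryoshka-ft", "dim": 64}

check_encoder_matches_index(manifest, "bge-base-matryoshka-ft", 64)  # passes silently

try:
    check_encoder_matches_index(manifest, "bge-base", 64)
    caught = None
except EncoderMismatchError as exc:
    caught = str(exc)
print(caught)
```

Note that a dimension check alone would not have caught this bug: the base model truncated to 64 dimensions produces vectors of the right shape in the wrong space, which is why the sketch compares an identifier as well.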

Technical Stack

Python · PyTorch · Sentence-Transformers (BGE-base + Matryoshka loss) · FAISS · BM25 (custom CSR-tensor implementation) · SPLADE · Cross-Encoders · MLflow + DagsHub · FastAPI · Modal (serverless deployment) · Streamlit · Hugging Face Spaces

🚀 Try the live demo · 💻 View on GitHub · 📊 MLflow experiment log