
🎯 The Objective
This project builds an industry-style search and relevance pipeline on the Amazon Shopping Queries Dataset (ESCI). The goal is to retrieve the most relevant products for a given e-commerce query by implementing a high-performance, two-stage search architecture.
🏗️ The Architecture & Methodology
The pipeline was engineered to balance rapid candidate retrieval with high-precision reranking, utilizing advanced representation learning:
**Matryoshka Fine-Tuning:** The core engineering addition was fine-tuning a Matryoshka bi-encoder with MultipleNegativesRankingLoss (MNRL). This concentrates the most critical semantic information in the earliest dimensions of the vector, enabling highly compressed 64-dimensional embeddings without destroying retrieval quality.
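Conceptually, the training objective can be sketched as below. This is a simplified NumPy illustration of in-batch MNRL applied at several truncation levels, not the actual training code (in practice, sentence-transformers' `MatryoshkaLoss` wrapping `MultipleNegativesRankingLoss` handles this); the dimension list and scale are illustrative.

```python
import numpy as np

def mnrl_loss(q, d, scale=20.0):
    # In-batch contrastive loss: each query's positive is the same-index
    # document; every other document in the batch acts as a negative.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = scale * q @ d.T                      # (N, N) cosine similarities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # cross-entropy, diagonal labels

def matryoshka_mnrl_loss(q, d, dims=(768, 256, 64)):
    # Apply the same contrastive loss at each truncation level, so the
    # earliest dimensions are trained to carry meaning on their own.
    return sum(mnrl_loss(q[:, :k], d[:, :k]) for k in dims)
```

Because the 64-dim prefix is optimized directly, truncating the trained embeddings at serving time degrades gracefully instead of collapsing.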
**Stage 1 (Candidate Generation):** Implemented a hybrid retrieval stack to maximize coverage. This combined a dense bi-encoder, lexical search (BM25) restricted to product titles for precision, and learned sparse expansion (SPLADE) to capture synonyms. The candidate lists were merged with weighted Reciprocal Rank Fusion (RRF).
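Weighted RRF scores each document by the reciprocal of its rank in every retriever's list, scaled by a per-retriever weight. A minimal sketch (the weights and smoothing constant `k` are assumptions, not the tuned values used here):

```python
def weighted_rrf(ranked_lists, weights, k=60):
    """Fuse ranked candidate lists with weighted Reciprocal Rank Fusion.

    ranked_lists: one ranked list of doc ids per retriever, best first.
    weights: per-retriever weights (e.g. dense vs. BM25 vs. SPLADE).
    k: smoothing constant; larger k flattens the rank contributions.
    """
    scores = {}
    for docs, w in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs):
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF operates on ranks rather than raw scores, it sidesteps the incompatible score scales of dense, BM25, and SPLADE retrievers.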
**Stage 2 (Reranking):** Applied a cross-encoder to the retrieved candidates to maximize top-K precision.
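The reranking step reduces to scoring each (query, candidate) pair jointly and keeping the best K. A minimal sketch, where `score_fn` stands in for a cross-encoder's `predict` call; the token-overlap scorer below is a toy placeholder, not the model used in this project:

```python
def rerank(query, candidates, score_fn, top_k=20):
    # Score each (query, candidate) pair jointly, then keep the top_k.
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

def overlap_score(query, doc):
    # Toy stand-in for a cross-encoder: Jaccard overlap of tokens.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0
```

In production the scorer would be a trained cross-encoder (e.g. sentence-transformers' `CrossEncoder.predict` over query-document pairs), which is far more expensive per pair than the bi-encoder, hence applying it only after candidate generation.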
📊 Key Performance Metrics
By leveraging a hybrid architecture at just 64 dimensions, the system achieved strong e-commerce relevance metrics entirely on consumer-grade hardware.
- Reranker nDCG@20: 0.5395
- Retrieval Recall@200 (64-dim): 81.25%
- Retrieval QPS (Queries Per Second): 70.51
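The metrics above follow their standard definitions; a minimal sketch of Recall@K and nDCG@K with graded gains (the gain values for Exact/Substitute labels are illustrative assumptions):

```python
import math

def recall_at_k(retrieved, relevant, k=200):
    # Fraction of all relevant items that appear in the top-k results.
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(retrieved, gains, k=20):
    # gains maps doc id -> graded relevance (e.g. Exact=3, Substitute=2).
    dcg = sum(gains.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

Recall@200 measures candidate-generation coverage (did the relevant items make it into the pool at all), while nDCG@20 measures how well the reranker orders the head of the list.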
💡 Core Insights & Business Impact
- Solving Dimensionality Collapse: Standard baseline embedding models truncated to 64 dimensions suffered a catastrophic recall collapse, dropping to a Recall@200 of 0.4270. The Matryoshka fine-tuned model preserved semantic meaning, reaching a dense-only Recall@200 of 0.7392 at the exact same size.
- Drastic Cost Reduction: Serving 64-dimensional vectors rather than standard 768-dimensional vectors cuts raw vector storage 12x, shrinking index size, memory footprint, and latency, which translates to substantial infrastructure cost savings in a production environment.
- Strategic Retrieval Rules: By treating both “Exact” and “Substitute” items as positives during candidate generation, the pipeline maximizes coverage and surfaces profitable product alternatives rather than aggressively filtering them out.
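The cost argument can be made concrete with back-of-envelope arithmetic, assuming flat float32 storage and a hypothetical catalog of 1M products (real FAISS indexes add some overhead on top of this):

```python
def flat_index_bytes(n_vectors, dim, bytes_per_float=4):
    # Raw storage for a flat float32 vector index (ignores index overhead).
    return n_vectors * dim * bytes_per_float

# Hypothetical 1M-product catalog:
full = flat_index_bytes(1_000_000, 768)   # ~3.07 GB
small = flat_index_bytes(1_000_000, 64)   # ~0.26 GB
```

Truncating from 768 to 64 dimensions is a straight 12x reduction in vector bytes, and smaller vectors also mean proportionally fewer float operations per distance computation, which is what drives the QPS gains.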
⚙️ Technical Stack
- Languages & Tools: Python, SentenceTransformers, FAISS, MLflow.
- Techniques: Bi-encoders, Cross-encoders, Hybrid Retrieval, Learned Sparse Expansion (SPLADE), Matryoshka Representation Learning, Contrastive Learning (MNRL).