Architecture
Measured per-stage latency on the live deploy (rrf3_rerank pipeline, 119-query overall mean): 315 ms p95 end-to-end, well under the 2 s target. Stage timings are emitted on every search via Server-Sent Events.
| Stage | Pipeline step | Budget | Model / library |
|---|---|---|---|
| 0 | Intent router | 150 ms | Gemini 2.5 Flash-Lite + Instructor (lru-cached) |
| 1A | BM25 retrieval | 0.2 ms | BM25S BM25+, k1=1.5, b=0.75 |
| 1B | Dense retrieval | 2.3 ms | BGE-M3 (FP16) → in-memory NumPy cosine |
| 1C | KG retrieval | 25 ms | Cypher templates over Neo4j AuraDB Free |
| 2 | RRF3 fusion | 0.1 ms | k=60, BM25 + dense + KG ranks |
| 3 | Cross-encoder rerank | 285 ms | bge-reranker-v2-m3 FP16, top-25 funnel |
| 4 | Per-stage scores → SSE | 5 ms | FastAPI StreamingResponse |
query
│
▼
intent (Gemini)
│
┌──────┼──────┐
▼ ▼ ▼
BM25 dense KG ← three parallel channels
│ │ │
└──────┼──────┘
▼
RRF k=60 ← rank-fusion (RRF3)
│
▼
cross-encoder rerank ← bge-reranker-v2-m3, top-25 funnel
│
▼
results + per-stage
scores via SSE