Evaluation
Locked snapshot of every offline metric: corpus + KG composition, the full ablation matrix across 119 queries on two strata, and the per-stage latency budget on a single RTX 4060.
| Metric | Value | Note |
|---|---|---|
| Best nDCG@10 | 0.557 | RRF3 (BM25 + dense + KG) |
| Recall@100 | 0.700 | target ≥ 0.97 |
| MRR@10 | 0.591 | target ≥ 0.65 |
| Full pipeline | 315 ms | target p95 < 2000 ms |
Corpus
- Profiles: 1,782
- Jobs: 1,370
- Canonical skills: 4,553
- HAS_SKILL edges: 30,369
- REQUIRES_SKILL edges: 5,129
- Eval queries: 119 (50 candidate · 50 job · 19 paraphrase)
Knowledge graph
- Person nodes: 1,782
- Job nodes: 1,370
- Skill nodes: 4,553
- Other nodes: 1,433 (834 role · 415 designation · 53 industry · 131 location)
- Total relationships: 45,731 (HAS_SKILL · REQUIRES_SKILL · CAN_FILL · IS_DESIGNATION · AT_LOCATION · IN_INDUSTRY)
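The composition figures can be cross-checked with a few lines of arithmetic. This is a minimal sketch using only the counts listed above; the expanded labels for the "Other" node types are assumed from the relationship names, and the variable names are illustrative.

```python
# Node and relationship counts copied from the Corpus and Knowledge graph lists.
nodes = {"Person": 1782, "Job": 1370, "Skill": 4553, "Other": 1433}
# Assumed expansion of the abbreviated labels (desig, ind, loc).
other_breakdown = {"role": 834, "designation": 415, "industry": 53, "location": 131}
relationships_total = 45731
skill_edges = {"HAS_SKILL": 30369, "REQUIRES_SKILL": 5129}

# The four "Other" node types should add up to the listed subtotal.
assert sum(other_breakdown.values()) == nodes["Other"]

total_nodes = sum(nodes.values())
# Edges that are not itemized above
# (CAN_FILL, IS_DESIGNATION, AT_LOCATION, IN_INDUSTRY combined).
remaining_edges = relationships_total - sum(skill_edges.values())

print(total_nodes, remaining_edges)
```

Under these counts the graph has 9,138 nodes, and 10,233 relationships fall outside the two skill edge types.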
Ablation matrix
7 retrieval configurations × 3 views (overall plus the two strata). Highlighted cells are the best nDCG@10 per view.
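For readers who want to reproduce the columns, the three ranking metrics can be sketched as follows. This assumes binary relevance labels (an item is either relevant or not), which the page does not state; the function names and the toy query are illustrative, not taken from the evaluation code.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant item in the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-gain nDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Toy query: two relevant profiles, one of them retrieved at rank 3.
ranked = ["p3", "p7", "p1", "p9"]
relevant = {"p1", "p2"}
print(recall_at_k(ranked, relevant, 10), mrr_at_k(ranked, relevant), ndcg_at_k(ranked, relevant))
```

Each table cell would then be the mean of the per-query value over the queries in that view.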
Overall (mean of candidate + job tasks)
| Configuration | nDCG@10 | R@10 | R@100 | MRR@10 | Latency |
|---|---|---|---|---|---|
| BM25 | 0.540 | 0.514 | 0.704 | 0.567 | 0.2 ms |
| BGE-M3 dense | 0.527 | 0.516 | 0.703 | 0.548 | 2.3 ms |
| RRF (BM25 + dense) | 0.551 | 0.521 | 0.696 | 0.585 | 2.5 ms |
| Cross-encoder rerank top-25 | 0.538 | 0.535 | 0.696 | 0.551 | 285 ms |
| KG channel only | 0.425 | 0.443 | 0.686 | 0.456 | 25 ms |
| RRF3 (BM25 + dense + KG) | **0.557** | 0.540 | 0.700 | 0.591 | 30 ms |
| Full pipeline (RRF3 + rerank top-25) | 0.544 | 0.536 | 0.700 | 0.565 | 315 ms |
Original stratum (100 lexical-anchor queries)
| Configuration | nDCG@10 | R@10 | R@100 | MRR@10 | Latency |
|---|---|---|---|---|---|
| BM25 | 0.604 | 0.580 | 0.770 | 0.634 | 0.1 ms |
| BGE-M3 dense | 0.589 | 0.583 | 0.759 | 0.612 | 2.3 ms |
| RRF (BM25 + dense) | 0.618 | 0.589 | 0.760 | 0.657 | 2.5 ms |
| Cross-encoder rerank top-25 | 0.598 | 0.596 | 0.760 | 0.614 | 285 ms |
| KG channel only | 0.476 | 0.497 | 0.747 | 0.512 | 25 ms |
| RRF3 (BM25 + dense + KG) | **0.621** | 0.601 | 0.756 | 0.663 | 30 ms |
| Full pipeline (RRF3 + rerank top-25) | 0.605 | 0.597 | 0.756 | 0.630 | 315 ms |
Paraphrase stratum (19 Gemini-generated queries)
| Configuration | nDCG@10 | R@10 | R@100 | MRR@10 | Latency |
|---|---|---|---|---|---|
| BM25 | 0.201 | 0.162 | 0.359 | 0.211 | n/a |
| BGE-M3 dense | 0.201 | 0.162 | 0.406 | 0.211 | n/a |
| RRF (BM25 + dense) | 0.201 | 0.162 | 0.358 | 0.211 | n/a |
| Cross-encoder rerank top-25 | 0.220 | 0.215 | 0.358 | 0.219 | n/a |
| KG channel only | 0.158 | 0.158 | 0.368 | 0.158 | n/a |
| RRF3 (BM25 + dense + KG) | 0.217 | 0.215 | 0.400 | 0.216 | n/a |
| Full pipeline (RRF3 + rerank top-25) | **0.224** | 0.215 | 0.400 | 0.224 | n/a |
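The three tables can be checked against each other. The overall nDCG@10 figures appear consistent with a query-weighted mean over the 100 original and 19 paraphrase queries; this is an observation from the published numbers, not a documented aggregation method. The sketch below recomputes that mean and reports the largest gap from the published overall values.

```python
N_ORIGINAL, N_PARAPHRASE = 100, 19

# nDCG@10 per configuration: (original stratum, paraphrase stratum, reported overall).
ndcg = {
    "BM25": (0.604, 0.201, 0.540),
    "BGE-M3 dense": (0.589, 0.201, 0.527),
    "RRF (BM25 + dense)": (0.618, 0.201, 0.551),
    "Cross-encoder rerank top-25": (0.598, 0.220, 0.538),
    "KG channel only": (0.476, 0.158, 0.425),
    "RRF3 (BM25 + dense + KG)": (0.621, 0.217, 0.557),
    "Full pipeline (RRF3 + rerank top-25)": (0.605, 0.224, 0.544),
}

def weighted_mean(original, paraphrase):
    """Mean over all 119 queries, weighting each stratum by its query count."""
    total = N_ORIGINAL + N_PARAPHRASE
    return (N_ORIGINAL * original + N_PARAPHRASE * paraphrase) / total

# Largest absolute difference between the recomputed and reported overall values.
max_gap = max(abs(weighted_mean(o, p) - overall) for o, p, overall in ndcg.values())
print(f"largest gap: {max_gap:.4f}")
```

Every row agrees to within 0.001, which is what rounding the inputs to three decimals would allow.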
Latency budget per stage
- Intent classification: 150 ms
- Query encode (BGE-M3): 1.7 ms
- BM25 retrieve top-100: 0.1 ms
- Dense cosine search: 0.6 ms
- KG Cypher (skill overlap): 25 ms
- RRF fusion: 2.5 ms
- Cross-encoder rerank (top-25): 285 ms
- Total, full pipeline: 315 ms (retrieval and rerank stages; the 150 ms intent classification is not included in this sum)
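Summing the per-stage figures shows how the stated total is composed. This is a small arithmetic sketch; the dictionary keys are illustrative names for the stages listed above.

```python
# Per-stage latency in milliseconds, copied from the budget above.
stages_ms = {
    "intent_classification": 150.0,
    "query_encode_bge_m3": 1.7,
    "bm25_retrieve_top100": 0.1,
    "dense_cosine_search": 0.6,
    "kg_cypher_skill_overlap": 25.0,
    "rrf_fusion": 2.5,
    "cross_encoder_rerank_top25": 285.0,
}

all_stages = sum(stages_ms.values())
# The stated 315 ms total lines up with the sum that leaves out intent classification.
without_intent = all_stages - stages_ms["intent_classification"]

print(round(all_stages, 1), round(without_intent, 1))
```

The stages other than intent classification add up to 314.9 ms, which matches the stated 315 ms; counting intent classification as a sequential step would give 464.9 ms, still well under the 2000 ms p95 target.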
Findings
- RRF3 (BM25 + dense + KG) is the new best on the lexical original stratum (nDCG@10 = 0.621).
- Full pipeline (RRF3 + cross-encoder rerank top-25) is the new best on the paraphrase stratum (nDCG@10 = 0.224).
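- Each configuration serves a different query distribution: RRF3 leads on the original stratum, while the full pipeline leads on the paraphrase stratum at a cost of 0.016 nDCG@10 on the original stratum (0.621 → 0.605).
- The cross-encoder lifts paraphrase nDCG@10 (0.201 for BM25, dense, and RRF versus 0.220 with reranking), likely because bge-reranker-v2-m3 is trained on natural-language pairs (MS-MARCO, MIRACL). It is not the only stage that helps: adding the KG channel raises RRF3 to 0.217 without reranking.
- The paraphrase stratum holds 19 of 100 queries, limited by the Gemini Flash-Lite requests-per-day (RPD) quota; backfill is pending, and the relative ranking of configurations is already established.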
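The RRF and RRF3 configurations merge ranked lists from separate channels. A minimal sketch of reciprocal rank fusion is shown below, assuming the commonly used smoothing constant k = 60 and equal channel weights; neither is stated on this page, and the function name and toy rankings are hypothetical.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60, top_n=10):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc for doc, _ in ordered][:top_n]

# Toy rankings from three channels (BM25, dense, KG), best match first.
bm25 = ["p1", "p2", "p3"]
dense = ["p2", "p4", "p1"]
kg = ["p2", "p1", "p5"]

fused = rrf_fuse([bm25, dense, kg])
print(fused)
```

Because the fusion uses only ranks, not raw scores, the BM25, cosine, and graph-overlap scores never need to be calibrated against each other, which is one reason a third channel can be added without retuning.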