retrieval_basic benchmark. Adjust them only when profiling identifies a
specific problem. This page documents the parameters most likely to need
adjustment, with the rationale for each, drawn from retrieval-tuning.md and the engine-specific docs.
Threshold defaults
Khora’s retrieval defaults follow one principle: retrieve broadly, then rank precisely. An earlier version of Khora filtered aggressively at every stage and produced a 25.5% zero-result rate on the retrieval_basic benchmark. Descriptive
queries such as “wrought-iron tower built for the 1889 World’s Fair” returned
nothing because the 0.5 cosine floor discarded a 0.35-similarity Eiffel Tower
chunk before ranking saw it.
The current defaults lower thresholds to a noise floor (0.05 for chunk and
entity similarity), let RRF fusion and reranking determine relevance, and add a
zero-result fallback that re-queries with min_similarity=0.0. Before raising a
threshold, confirm the underlying issue is filtering rather than something
upstream such as ranking or reranker quality.
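
The retrieve-broadly behavior described above can be illustrated with a toy sketch. It does not use Khora's API: `search_with_fallback`, `make_search`, and the two-item corpus are hypothetical names invented for illustration, and the similarity values are taken from the Eiffel Tower example.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, float]  # (chunk_id, cosine similarity)


def search_with_fallback(
    search: Callable[[float], List[Hit]],
    min_similarity: float = 0.05,
) -> List[Hit]:
    """Search at the noise floor and retry at 0.0 if nothing comes back."""
    hits = search(min_similarity)
    if not hits:
        # Zero-result fallback: re-query with no similarity floor at all.
        hits = search(0.0)
    return hits


def make_search(corpus: List[Hit]) -> Callable[[float], List[Hit]]:
    """Build a toy search function that filters a fixed candidate list."""
    def search(floor: float) -> List[Hit]:
        return [hit for hit in corpus if hit[1] >= floor]
    return search


corpus = [("eiffel-tower", 0.35), ("unrelated", 0.02)]

# The old 0.5 floor drops the relevant chunk before ranking can see it.
strict = [hit for hit in corpus if hit[1] >= 0.5]

# The 0.05 noise floor keeps it and leaves relevance to the ranking stages.
broad = search_with_fallback(make_search(corpus), min_similarity=0.05)

print(strict)
print(broad)
```

In this sketch the strict filter returns an empty list while the broad search keeps the 0.35-similarity chunk for the ranking stages to judge.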
Where parameters are set
| Surface | Examples |
|---|---|
| engine_kwargs={...} to Khora() | vectorcypher_config, storage_backend |
| Per-call to recall() / remember() | temporal_filter, hybrid_alpha, bulk_mode, min_similarity |
| KhoraConfig / KHORA_* env vars | Storage HNSW, LLM model, pipeline chunking, query defaults |
Tuning by symptom
| Symptom | First parameter to try |
|---|---|
| 25% of queries return nothing | Lower min_similarity to 0.0 and min_chunk_similarity / min_entity_similarity to 0.05. These are already the defaults. Verify they have not been overridden. |
| LLM extraction bill is the dominant cost | Lower skeleton_core_ratio (VectorCypher) or KHORA_PIPELINES_EXTRACTION_IMPORTANCE_RATIO (shared). Default 0.7. Drop to 0.3 for roughly 2× cost reduction (0.7 / 0.3 ≈ 2.3) at the price of denser-but-less-precise on-demand expansion. |
| Recent docs buried under older ones | Raise KHORA_QUERY_RECENCY_WEIGHT (default 0.35) and/or shorten KHORA_QUERY_RECENCY_DECAY_DAYS (default 7.0). |
| Graph queries miss multi-hop answers | Raise graph_default_depth from 2 to 3. Accept ~3–10× more candidates per entry entity and a P95 latency bump. |
| Paraphrased queries miss exact-term matches | Lower fusion_hybrid_alpha from 0.7 toward 0.3 to weight BM25 over vectors. Or leave alpha alone and ensure keyword search is enabled (it now runs in HYBRID mode by default). |
| Bulk ingest is slow | Pass bulk_mode=True to remember_batch() to defer the HNSW build until after the batch (~3–5× faster). If the LLM provider isn’t your bottleneck, also raise extraction_batch_size and max_concurrent_extractions. |
| Cross-encoder rerank dominates P95 | It auto-skips when fewer than 5 candidates are available. Otherwise set enable_reranking=False to fall back to raw RRF, losing ~5–10pp precision but saving 1–3s per query. |
| Memory pressure on pgvector | Enable KHORA_STORAGE_USE_HALFVEC=true (default) for ~50% memory savings with minor recall loss. Requires pgvector ≥ 0.7. |
VectorCypher
VectorCypher’s tunables live on VectorCypherConfig and are passed in as
engine_kwargs={"vectorcypher_config": VectorCypherConfig(...)}. The
defaults below were chosen against the retrieval_basic benchmark. The
“When to adjust” column lists the symptom or workload that warrants
overriding them.
Extraction cost
| Parameter | Default | When to adjust | Tradeoff |
|---|---|---|---|
| skeleton_core_ratio | 0.70 | LLM bill is the dominant cost, OR you need maximum precision and have budget. | Lower → fewer LLM calls during ingest (0.7→0.3 is roughly 2× cost reduction), at the cost of a sparser entity graph for queries that hit non-core chunks. Set to 1.0 to recover legacy GraphRAG full-extraction behavior. |
| conversation_skeleton_ratio | 0.90 | Conversational corpora where you can’t afford to drop dialog turns. | Higher than skeleton_core_ratio because chat context degrades more when chunks are skipped. Lower it for cost, raise it for fidelity. |
| lazy_entity_expansion | True | Disable only if you’ve measured that on-demand expansion is hurting recall latency. | Enabled trades a small per-query expansion cost for a much smaller ingest bill. Disabling forces upfront extraction of everything. |
| extraction_batch_size | 5 | LLM provider rate-limits you, or you’re hitting JSON-schema timeouts on large batches. | Larger batches amortize per-call overhead but increase tail latency and rate-limit blast radius. |
| max_concurrent_extractions | 20 | Your provider can take more concurrency (raise) or rate-limits (lower). | Linear with extraction throughput, ungated by Khora. Your LLM provider is the limit. |
| min_extraction_tokens | 50 | Conversational data dominated by short messages where extraction adds noise. | Raise to skip more chunks. Lower if you have many short but information-dense chunks (e.g., support tickets). |
| max_chunks_in_flight | None | Memory pressure during large batches. | None = unlimited. Set a finite number to cap memory at the cost of pipeline parallelism. |
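
The cost effect of skeleton_core_ratio can be estimated with simple arithmetic. The sketch below assumes ingest-time LLM cost scales linearly with the number of chunks extracted and ignores any later on-demand expansion. `extraction_calls` and `cost_ratio` are hypothetical helpers, not part of Khora.

```python
import math


def extraction_calls(n_chunks: int, core_ratio: float, batch_size: int = 5) -> int:
    """Estimate ingest-time LLM calls: extracted chunks grouped into batches."""
    extracted = math.ceil(n_chunks * core_ratio)
    return math.ceil(extracted / batch_size)


def cost_ratio(old_ratio: float, new_ratio: float) -> float:
    """Relative ingest cost of the old setting versus the new one."""
    return old_ratio / new_ratio


# Moving from the 0.7 default to 0.3 cuts extraction volume by roughly 2.3x.
print(round(cost_ratio(0.7, 0.3), 2))
print(extraction_calls(10_000, 0.7), extraction_calls(10_000, 0.3))
```

Under these assumptions a 10,000-chunk corpus needs 1,400 batched calls at 0.7 and 600 at 0.3.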
Fusion and routing
| Parameter | Default | When to adjust | Tradeoff |
|---|---|---|---|
| routing_enabled | True | Disable only if you’ve measured the router making bad decisions on your workload. | Routing varies fusion weights by query complexity. Without it, all queries use the moderate default 0.6 / 0.4. |
| fusion_vector_weight / fusion_graph_weight | 0.6 / 0.4 | Vector-dominant corpora (research text, paraphrased queries) → raise vector. Relationship-heavy corpora (org charts, code) → raise graph. | These are the moderate tier weights. Raising vector toward 0.8 captures more paraphrase recall. Raising graph toward 0.6 emphasizes multi-hop connections. |
| fusion_simple_* weights | 0.8 / 0.2 (vec/graph) | Simple factual lookups should lean even harder on vector. | Routed queries classified SIMPLE get this profile. Lower simple-vector-weight if your simple queries actually benefit from graph context. |
| fusion_complex_* weights | 0.4 / 0.6 (vec/graph) | Complex multi-entity queries should lean on graph traversal. | Routed queries classified COMPLEX get this profile. |
| fusion_rrf_k | 60 | You need either sharper rank-1 dominance (lower k) or smoother contributions across the result set (raise k). | RRF formula 1 / (k + rank). k=60 is the canonical sweet spot for multi-channel fusion. k=1 is aggressive (rank 1 scores 3× rank 5). k=100 is smooth. |
| fusion_hybrid_alpha | 0.7 | Proper-noun-heavy queries → lower toward 0.3 to weight BM25. Paraphrased natural-language queries → raise toward 0.9. | The classic vector ↔ BM25 dial. The default leans semantic. |
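
To see how fusion_rrf_k shapes rank dominance, the 1 / (k + rank) formula from the table can be evaluated directly. This is plain arithmetic with hypothetical helper names (`rrf_score`, `rank_ratio`), using 1-based ranks.

```python
def rrf_score(rank: int, k: int = 60) -> float:
    """RRF contribution of a result at a 1-based rank."""
    return 1.0 / (k + rank)


def rank_ratio(k: int, top: int = 1, lower: int = 5) -> float:
    """How many times more a top-ranked result contributes than a lower one."""
    return rrf_score(top, k) / rrf_score(lower, k)


for k in (1, 60, 100):
    print(f"k={k}: rank 1 contributes {rank_ratio(k):.3f}x rank 5")
```

At k=1 the ratio is exactly 3 (1/2 against 1/6). At k=60 it is 65/61, about 1.07, and at k=100 it is 105/101, about 1.04, so a larger k flattens the gap between neighboring ranks.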
Graph traversal
| Parameter | Default | When to adjust | Tradeoff |
|---|---|---|---|
| graph_default_depth | 2 | Multi-hop queries (“how is X connected to Y?”) want depth 3. Single-hop queries don’t need depth 2. | Each additional hop typically yields ~3–10× more candidate chunks per entry entity. Latency rises with the chunk count, not the depth itself. |
| graph_max_depth | 4 | Hard cap. Raise if your domain genuinely needs 5+ hops (uncommon). | Acts as a guardrail against runaway expansion in pathological queries. |
| graph_max_entry_entities | 10 | Dense graphs with many valid entry points → raise. Noisy vector search injecting bad entities → lower. | More entry entities = broader graph coverage but more candidates to fuse. |
| retriever_min_entity_similarity | 0.3 | Paraphrased entity mentions miss the threshold → lower. False-positive entities dominate → raise. | The default is intentionally lenient. Graph search has its own zero-entity fallback that retries at 0.0 if nothing is found. |
Temporal
| Parameter | Default | When to adjust | Tradeoff |
|---|---|---|---|
| temporal_recency_weight | 0.2 | Recency-sensitive corpora (chat, tickets, news) → raise toward 0.4–0.5. Evergreen reference docs → lower toward 0.05. | Higher weight pushes recent results up the ranking even when their relevance is marginally lower. Too high and you’ll surface fresh-but-irrelevant docs. |
| temporal_recency_decay_days | 30 | Fast-moving data (Slack: ~3 days, email: ~7) → shorten. Slow-moving (legal, manuals) → lengthen to 90+. | Half-life: at the decay window, recency contribution drops to 50%. At 2× the window, 25%. |
| recency_decay_type | "exponential" | Switch to linear only if you want a constant penalty per day rather than half-life behavior. | Exponential matches the Ebbinghaus forgetting curve and is the conventional choice. |
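
The half-life behavior in the table can be sketched as follows. The decay function follows directly from the 50% and 25% figures. The way recency is blended with relevance (a convex combination) is an assumption made for illustration, and `recency_factor` and `blended_score` are hypothetical names rather than Khora functions.

```python
def recency_factor(age_days: float, decay_days: float = 30.0) -> float:
    """Exponential half-life decay: 1.0 when fresh, 0.5 at the decay window."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    return 0.5 ** (age_days / decay_days)


def blended_score(
    relevance: float,
    age_days: float,
    recency_weight: float = 0.2,
    decay_days: float = 30.0,
) -> float:
    """Assumed blend: (1 - w) * relevance + w * recency."""
    recency = recency_factor(age_days, decay_days)
    return (1.0 - recency_weight) * relevance + recency_weight * recency


print(recency_factor(30), recency_factor(60))

# A 90-day-old strong match versus a 1-day-old weaker match.
for weight in (0.05, 0.2):
    old = blended_score(0.80, age_days=90, recency_weight=weight)
    fresh = blended_score(0.70, age_days=1, recency_weight=weight)
    print(weight, "fresh wins" if fresh > old else "old wins")
```

With this blend, the older but more relevant result stays on top at a weight of 0.05 and is overtaken by the fresher one at 0.2, which is the tradeoff the temporal_recency_weight row describes.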
BM25 and reranking
| Parameter | Default | When to adjust | Tradeoff |
|---|---|---|---|
| enable_bm25_channel | False | Technical corpora where exact terms matter (proper nouns, codes, IDs). | Adds a third RRF channel. Note: separate from fusion_hybrid_alpha, which already mixes BM25 into the vector channel. Only enable this when you want BM25 as its own first-class channel. |
| enable_reranking | True | High-precision queries. Disable if rerank latency dominates and you can live with raw RRF. | Cross-encoder reranking improves precision ~5–10pp at the cost of 1–3s of latency. Auto-skipped when fewer than 5 candidates are available. |
| reranking_top_n | 50 | Higher for harder-to-rank pools, lower to reduce rerank cost. | Cross-encoder is linear in top_n. |
| reranking_blend_weight | 0.7 | Lower if the reranker is overriding good RRF rankings. | 0.7 = trust reranker 70%, original RRF 30%. Set to 1.0 to ignore RRF after rerank, 0.0 to disable reranker influence entirely. |
| enable_llm_reranking | False | Temporal queries where ordering nuance matters and cross-encoder isn’t sufficient. | Listwise LLM rerank is expensive, gated by llm_reranking_confidence_threshold to fire only when the top-2 rank gap is small. |
| llm_reranking_confidence_threshold | 0.1 | Raise to fire LLM rerank more often (more cost, more accuracy on tight calls). | Default fires only when the gap between rank 1 and rank 2 is < 0.1, i.e., when cross-encoder isn’t confident. |
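
The three gates in this table (the 5-candidate auto-skip, the blend weight, and the top-2 gap check) can be sketched in a few lines. The function names are hypothetical, and the sketch assumes both scores are already normalized to the same 0–1 range, which the table does not state.

```python
from typing import Sequence


def should_cross_encode(n_candidates: int, min_candidates: int = 5) -> bool:
    """Reranking is skipped when fewer than 5 candidates are available."""
    return n_candidates >= min_candidates


def blend(rerank_score: float, rrf_score: float, blend_weight: float = 0.7) -> float:
    """0.7 trusts the reranker 70% and the original RRF score 30%."""
    return blend_weight * rerank_score + (1.0 - blend_weight) * rrf_score


def should_llm_rerank(scores: Sequence[float], threshold: float = 0.1) -> bool:
    """Fire the listwise LLM rerank only when the top-2 gap is below threshold."""
    if len(scores) < 2:
        return False
    first, second = sorted(scores, reverse=True)[:2]
    return (first - second) < threshold


print(should_cross_encode(4), should_cross_encode(5))
print(round(blend(0.9, 0.5), 3))
print(should_llm_rerank([0.82, 0.78, 0.40]), should_llm_rerank([0.90, 0.60]))
```

Raising the threshold widens the set of "tight calls" that trigger the LLM pass, which matches the direction given in the llm_reranking_confidence_threshold row.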
Per-call parameters
Beyond VectorCypherConfig, these arguments are passed per call to
recall() / remember() / remember_batch():
| Parameter | Default | When to adjust | Tradeoff |
|---|---|---|---|
| temporal_filter (per recall()) | None | Always pass when your query has time bounds. SQL pushdown is cheaper than post-filtering. | Carries occurred_after/occurred_before/author/channel/tags. The fields filter at the database, not in Python. |
| hybrid_alpha (per recall()) | 0.7 (HYBRID mode) | Proper-noun queries → lower toward 0.3. Paraphrased queries → leave near 0.7 or raise. | Vector vs BM25 blend for the RRF stage. |
| min_similarity (per recall()) | 0.0 | Raise only if low-similarity noise is leaking into results. | Default 0.0 matches the retrieve-broadly default. Raise toward 0.3+ for strict filtering. |
| bulk_mode (per remember_batch()) | False | Initial seed loads or any large one-shot batch. | Defers HNSW index build until after the batch, then rebuilds. ~3–5× faster ingest for bulk ops, not appropriate for incremental writes. |
| deduplicate (per remember_batch()) | True | Disable only if you’re certain there are no duplicates and want to skip the checksum step. | The checksum step is inexpensive and is typically left enabled. |
| chunk_strategy (per remember() / batch) | None (uses config default) | Override per-document when the source format diverges from the global default. | conversation for chat, recursive for code/markdown, semantic for prose, fixed for uniform input. |
| max_concurrent (per remember_batch()) | 20 | Constrained downstream resources → lower. | Caps parallel document processing in batch ingest. |
Shared parameters
These live on KhoraConfig (env-var prefix KHORA_*) and affect every
engine. Most have stable defaults. The rationale below covers the ones with
real workload-dependent tradeoffs.
Retrieval thresholds
| Parameter | Default | When to adjust | Tradeoff |
|---|---|---|---|
| min_chunk_similarity | 0.05 | Raise only if low-similarity noise is leaking into results, but check first that your ranking isn’t the real issue. | This is the cosine noise floor that pgvector filters at. Default was 0.5. Benchmark analysis showed that floor caused a 25.5% zero-result rate, so it was lowered. |
| min_entity_similarity | 0.05 | Same reasoning as chunk similarity. | Was 0.3 before the same benchmark fix. |
| entity_linking_fuzzy_threshold | 0.6 | Raise toward 0.8 if fuzzy matching is producing too many spurious entity links. | Was 0.8 before retrieval tuning. Lower thresholds let more candidate links through to the disambiguation step. |
| entity_linking_embedding_threshold | 0.4 | Same reasoning as fuzzy threshold. | Was 0.7 before retrieval tuning. |
Fusion weights (default RRF, used outside VectorCypher’s routed weights)
| Parameter | Default | When to adjust | Tradeoff |
|---|---|---|---|
| KHORA_QUERY_VECTOR_WEIGHT | 0.5 | Semantic-heavy corpora → raise. | Default fusion in HYBRID mode now includes keyword search. These three weights sum to a balanced blend. |
| KHORA_QUERY_GRAPH_WEIGHT | 0.3 | Relationship-heavy corpora → raise. | |
| KHORA_QUERY_KEYWORD_WEIGHT | 0.2 | Technical/proper-noun corpora → raise. | Note: keyword search now runs in HYBRID mode (previously gated to ALL mode). This is a benchmark-driven default change. |
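
One common way to apply per-channel weights in RRF is to scale each channel's 1 / (k + rank) contribution by its weight and sum across channels. The sketch below assumes that scheme. Whether Khora combines the three weights exactly this way is not confirmed here, and `weighted_rrf` and the sample rankings are invented for illustration.

```python
from typing import Dict, List


def weighted_rrf(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = 60,
) -> List[str]:
    """Fuse ranked ID lists by summing weight / (k + rank) per channel."""
    scores: Dict[str, float] = {}
    for channel, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weights[channel] / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Default shared weights from the table above.
weights = {"vector": 0.5, "graph": 0.3, "keyword": 0.2}

rankings = {
    "vector": ["a", "b", "c"],
    "graph": ["b", "d"],
    "keyword": ["d", "a"],
}

fused = weighted_rrf(rankings, weights)
print(fused)
```

In this toy input, "b" ends up first because it ranks well in both the vector and graph channels, while "c" ends up last because it appears in only one channel at rank 3.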
MMR diversity and reranking
| Parameter | Default | When to adjust | Tradeoff |
|---|---|---|---|
| enable_diversity | True | Disable only if you’ve measured MMR removing genuinely relevant results. | Maximal Marginal Relevance reduces same-document dominance in result sets. Backed by Rust acceleration with NumPy/Python fallbacks. |
| diversity_lambda | 0.7 | Lower toward 0.3 for more diversity (good for exploration). Raise toward 1.0 for pure relevance. | 1.0 = pure relevance, 0.0 = pure diversity. |
| enable_reranking | True | Disable to save 1–3s per query when you can live with raw RRF (≈5–10pp precision loss). | Auto-skipped when fewer than 5 candidates are present. |
| reranking_top_n | 50 | Higher for harder-to-rank pools, lower to reduce cost. | Cross-encoder is linear in top_n. |
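
The effect of diversity_lambda is easiest to see in a small pure-Python version of textbook Maximal Marginal Relevance. This is not Khora's Rust-accelerated implementation, and the vectors are toy data chosen so that one candidate is a near-duplicate of another.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    top_k: int,
    diversity_lambda: float = 0.7,
) -> List[int]:
    """Greedy MMR: return candidate indices in selection order."""
    remaining = list(range(len(candidates)))
    selected: List[int] = []

    def score(i: int) -> float:
        relevance = cosine(query, candidates[i])
        # Redundancy is the highest similarity to anything already picked.
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return diversity_lambda * relevance - (1.0 - diversity_lambda) * redundancy

    while remaining and len(selected) < top_k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]  # index 1 nearly duplicates index 0

print(mmr(query, candidates, top_k=2, diversity_lambda=1.0))
print(mmr(query, candidates, top_k=2, diversity_lambda=0.3))
```

With lambda at 1.0 the near-duplicate is selected second. At 0.3 the redundancy penalty pushes it out in favor of the less similar third candidate.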
HyDE (query expansion)
| Parameter | Default | When to adjust | Tradeoff |
|---|---|---|---|
| enable_hyde | "auto" | Force "always" for descriptive/paraphrased query workloads. Force "never" when latency is critical. | auto fires HyDE when query-understanding flags the query as complex or temporal. Each HyDE call adds an LLM hop (~1–2s). |
| hyde_num_hypotheticals | 1 | Raise to 2–3 for diversity at the cost of more LLM calls. | More hypotheticals = better paraphrase recall, linearly more LLM cost. |
| enable_hyde_cypher | False | Opt-in. Enable only after an A/B run on hand-curated structured queries. | Experimental in v0.12.0+. Asks an LLM to pick a parameterized Cypher template (recent_by_type, entity_relationships, cooccurrence) and execute it as an extra retrieval channel. |
Pipeline (extraction and chunking)
| Parameter | Default | When to adjust | Tradeoff |
|---|---|---|---|
| KHORA_PIPELINES_CHUNKING_STRATEGY | "semantic" | recursive for hierarchical content (code, markdown). fixed for uniform-format documents. conversation for chat. | Semantic chunking uses an LLM to split at natural boundaries (higher ingest cost, better retrieval). Fixed is fastest and predictable. |
| KHORA_PIPELINES_CHUNK_SIZE | 512 | Long-form documents → raise to 1024 for more context per chunk. Short messages → drop to 256. | Larger chunks = fewer total chunks (cheaper ingest) but coarser retrieval. |
| KHORA_PIPELINES_CHUNK_OVERLAP | 50 | Raise for entity-dense content to reduce boundary loss. Lower for sparse narrative to save storage. | More overlap = redundancy across chunks, less context loss. |
| KHORA_PIPELINES_SELECTIVE_EXTRACTION | True | Disable only on small corpora (under 5K docs) where every chunk matters. | The cost lever. Disabling means 100% of chunks get LLM extraction, full but ~10× more expensive. |
| KHORA_PIPELINES_EXTRACTION_IMPORTANCE_RATIO | 0.7 | Lower for cost-sensitive workloads, raise toward 1.0 for maximum precision. | Top fraction of chunks (by importance) sent to extraction. |
| KHORA_PIPELINES_EXTRACTION_MIN_IMPORTANCE | 0.2 | Chunks above this importance are always extracted, regardless of the ratio cap. | Floor that prevents the ratio cap from skipping high-importance chunks. |
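
The interaction between the importance ratio and the minimum-importance floor can be sketched as a selection rule. This follows a literal reading of the two rows above (take the top fraction by importance, then add any chunk whose importance exceeds the floor). The actual selection logic may differ, and `select_for_extraction` and the scores are hypothetical.

```python
import math
from typing import List


def select_for_extraction(
    importances: List[float],
    importance_ratio: float = 0.7,
    min_importance: float = 0.2,
) -> List[int]:
    """Return indices of chunks that would be sent to LLM extraction."""
    n_top = math.ceil(len(importances) * importance_ratio)
    ranked = sorted(
        range(len(importances)),
        key=lambda i: importances[i],
        reverse=True,
    )
    chosen = set(ranked[:n_top])
    # Assumed reading: the floor rescues chunks that the ratio cap skipped.
    chosen.update(i for i, s in enumerate(importances) if s > min_importance)
    return sorted(chosen)


scores = [0.9, 0.8, 0.5, 0.3, 0.15, 0.1, 0.05, 0.6, 0.25, 0.02]

print(select_for_extraction(scores, importance_ratio=0.5, min_importance=0.2))
print(select_for_extraction(scores, importance_ratio=0.5, min_importance=1.0))
```

In this example the ratio alone selects five chunks, and the 0.2 floor adds a sixth (index 8, importance 0.25) that the cap would otherwise have dropped.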
Conversation chunking
| Parameter | Default | When to adjust | Tradeoff |
|---|---|---|---|
| conversation_time_gap_minutes | 15 | Raise for naturally batched conversations (email-like threads). Lower for rapid chat. | Idle gaps longer than this start a new conversation chunk. |
| conversation_max_group_size | 50 | Lower for fine-grained retrieval. Raise for more context per chunk. | Caps the number of messages bundled into one chunk. |
| conversation_min_group_size | 2 | Rarely changed. | Groups below this size are merged with adjacent ones. |
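
The time-gap and size-cap rules can be illustrated with a short grouping routine. It is a simplified sketch that works on timestamps only and leaves out the conversation_min_group_size merge step. `group_messages` is a hypothetical name.

```python
from datetime import datetime, timedelta
from typing import List


def group_messages(
    timestamps: List[datetime],
    time_gap_minutes: int = 15,
    max_group_size: int = 50,
) -> List[List[datetime]]:
    """Split time-sorted messages into chunks on idle gaps or the size cap."""
    max_gap = timedelta(minutes=time_gap_minutes)
    groups: List[List[datetime]] = []
    for ts in timestamps:
        start_new = (
            not groups
            or ts - groups[-1][-1] > max_gap        # idle gap exceeded
            or len(groups[-1]) >= max_group_size    # current chunk is full
        )
        if start_new:
            groups.append([ts])
        else:
            groups[-1].append(ts)
    return groups


base = datetime(2024, 1, 1, 9, 0)
messages = [base + timedelta(minutes=m) for m in (0, 2, 5, 30, 31, 90)]

print([len(g) for g in group_messages(messages)])
print([len(g) for g in group_messages(messages, max_group_size=2)])
```

With the defaults, the 25-minute and 59-minute gaps split the six messages into groups of 3, 2, and 1. Capping the group size at 2 splits the first burst further.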
Storage (pgvector)
| Parameter | Default | When to adjust | Tradeoff |
|---|---|---|---|
| KHORA_STORAGE_HNSW_M | 24 | Raise to 32+ for very high recall on billion-vector indexes. Lower to 16 for memory-constrained deployments. | Max connections per HNSW layer. Higher M = denser graph = better recall + more memory + slower build. |
| KHORA_STORAGE_HNSW_EF_CONSTRUCTION | 128 | Raise to 256 if you can pay the build time and want maximum index quality. | Build-time search width. Higher = better index quality, slower build. |
| KHORA_STORAGE_HNSW_EF_SEARCH | 100 | Raise to 200+ for benchmarks or precision-critical workloads. Lower to 50 for latency-sensitive workloads. | Query-time search width. Linear in latency, sub-linear in recall. |
| KHORA_STORAGE_USE_HALFVEC | True | Disable only if you need exact float32 (uncommon) or have pgvector < 0.7. | Float16 quantization saves ~50% memory with negligible recall loss in practice. Graceful fallback when unsupported. |
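
The ~50% figure for halfvec follows from per-vector storage sizes. The sketch below uses the sizes pgvector documents (4 bytes per dimension plus 8 bytes for vector, 2 bytes per dimension plus 8 bytes for halfvec) and assumes 1536-dimensional embeddings, the size text-embedding-3-small produces by default. It counts raw vector storage only and ignores HNSW graph overhead. The helper names are hypothetical.

```python
def vector_bytes(dimensions: int, halfvec: bool) -> int:
    """Per-row storage for a pgvector vector (float32) or halfvec (float16)."""
    bytes_per_value = 2 if halfvec else 4
    return bytes_per_value * dimensions + 8


def column_gib(n_vectors: int, dimensions: int, halfvec: bool) -> float:
    """Raw storage for the embedding column in GiB."""
    return n_vectors * vector_bytes(dimensions, halfvec) / 2**30


dims = 1536  # assumed embedding dimension
full = column_gib(1_000_000, dims, halfvec=False)
half = column_gib(1_000_000, dims, halfvec=True)
saving = 1 - half / full

print(f"float32: {full:.2f} GiB, halfvec: {half:.2f} GiB, saving: {saving:.1%}")
```

For one million vectors this comes to roughly 5.7 GiB against 2.9 GiB, a saving just under 50% because the 8-byte header is not halved.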
Storage (LanceDB embedded)
| Parameter | Default | When to adjust | Tradeoff |
|---|---|---|---|
| lance_index | "auto" | Force ivf_pq for over 1M chunks where memory matters. Force hnsw for under 1M with strict latency. Force brute for small corpora where exact search is fine. | auto picks based on table size. The auto-detection is conservative. |
| retrain_factor | 2.0 | Raise to 3+ to retrain less often (cheaper). Lower to 1.5 to keep the index fresh on growing corpora. | IVF-PQ retraining fires when row count ≥ factor × (rows at last training). Set ≤ 1.0 to disable retraining entirely. |
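
The retraining rule in the retrain_factor row reduces to a single comparison. A minimal sketch, with `should_retrain` as a hypothetical helper name:

```python
def should_retrain(
    row_count: int,
    rows_at_last_training: int,
    retrain_factor: float = 2.0,
) -> bool:
    """True when the table has grown enough to warrant IVF-PQ retraining."""
    if retrain_factor <= 1.0:
        # A factor at or below 1.0 disables retraining entirely.
        return False
    return row_count >= retrain_factor * rows_at_last_training


# Index last trained at 100,000 rows.
print(should_retrain(199_999, 100_000))       # below 2x, no retrain yet
print(should_retrain(200_000, 100_000))       # reached 2x
print(should_retrain(150_000, 100_000, 1.5))  # a lower factor fires sooner
```

With the default factor the index retrains each time the table doubles, so the number of retrains grows with the logarithm of corpus size rather than with every insert.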
Neo4j (VectorCypher only)
| Parameter | Default | When to adjust | Tradeoff |
|---|---|---|---|
| entity_write_concurrency | 12 | Raise on bulk loads where Neo4j has headroom. | Caps concurrent entity-write transactions. |
| relationship_write_concurrency | 8 | Same reasoning as entity writes. | |
| query_timeout | 5.0 s | Lower to fail fast on slow graph queries. Raise for deep traversals on large graphs. | Per-transaction read timeout (1–300 s, None disables). |
LLM
| Parameter | Default | When to adjust | Tradeoff |
|---|---|---|---|
| KHORA_LLM_EXTRACTION_MODEL | unset (falls back to model) | Set to a cheap fast model (Haiku, Gemini Flash) for extraction, reserving the primary model for generation. | The single biggest LLM-cost lever. Extraction is high-volume and tolerates a smaller model. |
| KHORA_LLM_MAX_CONCURRENT_LLM_CALLS | 10 | Raise if your provider can take more concurrency. | Global cap across the whole library. Throttle to respect rate limits. |
| KHORA_LLM_EMBEDDING_MODEL | text-embedding-3-small | Switch to -large for higher quality at higher cost. Must match embedding_dimension. | Changing model dimension requires schema migration. |