Tuning

Khora exposes ~150 tunable parameters. Most ship with defaults tuned against the retrieval_basic benchmark. Adjust them only when profiling identifies a specific problem. This page documents the parameters most likely to need adjustment, with the rationale for each.

Threshold defaults

Khora’s retrieval defaults follow one principle: retrieve broadly, then rank precisely. An earlier version of Khora filtered aggressively at every stage and produced a 25.5% zero-result rate on the retrieval_basic benchmark. Descriptive queries such as “wrought-iron tower built for the 1889 World’s Fair” returned nothing because the 0.5 cosine floor discarded a 0.35-similarity Eiffel Tower chunk before ranking saw it. The current defaults lower thresholds to a noise floor (0.0 for chunk similarity and 0.05 for entity similarity), let RRF fusion and reranking determine relevance, and add a zero-result fallback that re-queries with min_similarity=0.0. Before raising a threshold, confirm the underlying issue is filtering rather than something upstream such as ranking or reranker quality.
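
The failure mode and the fallback described above can be sketched in a few lines. This is an illustration only: `filter_by_floor`, `search_with_fallback`, and the `(chunk_id, similarity)` tuple shape are hypothetical names, not Khora's API. Only the 0.35 similarity, the old 0.5 floor, and the retry at 0.0 come from the text.

```python
from typing import List, Tuple

Candidate = Tuple[str, float]  # (chunk_id, cosine similarity); hypothetical shape


def filter_by_floor(candidates: List[Candidate], min_similarity: float) -> List[Candidate]:
    """Keep only candidates at or above the similarity floor."""
    return [c for c in candidates if c[1] >= min_similarity]


def search_with_fallback(candidates: List[Candidate], min_similarity: float) -> List[Candidate]:
    """Filter once; if nothing survives, retry with the floor at 0.0."""
    hits = filter_by_floor(candidates, min_similarity)
    if not hits and min_similarity > 0.0:
        hits = filter_by_floor(candidates, 0.0)
    return hits


candidates = [("eiffel-tower", 0.35), ("unrelated", 0.02)]

strict = filter_by_floor(candidates, 0.5)          # old behavior: nothing reaches ranking
recovered = search_with_fallback(candidates, 0.5)  # fallback hands both chunks to ranking
```

With the floor at 0.5 the relevant chunk never reaches the ranker. With the fallback, ranking and reranking get the chance to sort it above the noise.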

Where parameters are set

Surface	Examples
`engine_kwargs={...}` to `Khora()`	`vectorcypher_config`, `storage_backend`
Per-call to `recall()` / `remember()`	`temporal_filter`, `hybrid_alpha`, `bulk_mode`, `min_similarity`
`KhoraConfig` / `KHORA_*` env vars	Storage HNSW, LLM model, pipeline chunking, query defaults

Precedence runs from most to least specific: per-call > engine constructor > config > env. See Configuration for the complete env-var reference.
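
One way to picture the precedence order is a layered lookup where the first surface that defines a key wins. The sketch below uses the standard-library `ChainMap`. The dictionaries and key names are placeholders chosen for illustration, and it is an assumption that Khora resolves settings this way internally.

```python
from collections import ChainMap

# Hypothetical values from each surface. Key names are placeholders.
env_vars = {"min_similarity": 0.0, "hybrid_alpha": 0.7, "recency_weight": 0.35}
config = {"recency_weight": 0.5}
engine_kwargs = {"hybrid_alpha": 0.5}
per_call = {"min_similarity": 0.3}

# ChainMap returns the value from the first mapping that defines the key,
# so the mappings are listed from highest to lowest precedence.
effective = ChainMap(per_call, engine_kwargs, config, env_vars)

resolved = dict(effective)
```

Here `min_similarity` comes from the per-call layer, `hybrid_alpha` from the engine constructor, and `recency_weight` from config, because each of those layers shadows the ones below it.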

Tuning by symptom

Symptom	First parameter to try
25% of queries return nothing	Lower `min_similarity` and `min_chunk_similarity` to `0.0` and `min_entity_similarity` to `0.05`. These are already the defaults. Verify they have not been overridden.
LLM extraction bill is the dominant cost	Lower `skeleton_core_ratio` (VectorCypher, default 0.50) or `KHORA_PIPELINES_EXTRACTION_IMPORTANCE_RATIO` (generic pipeline, default 0.7). Drop toward 0.3 for a large cost reduction at the price of denser-but-less-precise on-demand expansion.
Recent docs buried under older ones	Raise `KHORA_QUERY_RECENCY_WEIGHT` (default 0.35) and/or shorten `KHORA_QUERY_RECENCY_DECAY_DAYS` (default 30.0; 7 is the conversational-recency opt-in).
Graph queries miss multi-hop answers	Raise `graph_default_depth` from 2 to 3. Accept ~3–10× more candidates per entry entity and a P95 latency bump.
Paraphrased queries miss exact-term matches	Lower `fusion_hybrid_alpha` from 0.7 toward 0.3 to weight lexical matching over vectors, and/or enable the separate BM25 keyword channel with `enable_bm25_channel=True` (off by default on the VectorCypher recall path).
Bulk ingest is slow	Pass `bulk_mode=True` to `remember_batch()` to defer the HNSW build until after the batch (~3–5× faster), and lower `extraction_batch_size` / raise `max_concurrent_extractions` if the LLM provider isn’t your bottleneck.
Cross-encoder rerank dominates P95	It auto-skips when fewer than 5 candidates are available. Otherwise set `enable_reranking=False` to fall back to raw RRF, losing ~5–10pp precision but saving 1–3s per query.
Memory pressure on pgvector	Enable `KHORA_STORAGE_USE_HALFVEC=true` (default) for ~50% memory savings with minor recall loss. Requires pgvector ≥ 0.7.

VectorCypher

VectorCypher’s tunables live on VectorCypherConfig and are passed in as engine_kwargs={"vectorcypher_config": VectorCypherConfig(...)}. The defaults below were chosen against the retrieval_basic benchmark. The “When to adjust” column lists the symptom or workload that warrants overriding them.

Extraction cost

Parameter	Default	When to adjust	Tradeoff
`skeleton_core_ratio`	`0.50`	LLM bill is the dominant cost, OR you need maximum precision and have budget.	Lower → fewer LLM calls during ingest, at the cost of a sparser entity graph for queries that hit non-core chunks. `0.7+` is the quality opt-in; set to `1.0` for full extraction over every chunk (what the removed `graphrag` engine did).
`conversation_skeleton_ratio`	`0.90`	Conversational corpora where you can’t afford to drop dialog turns.	Higher than `skeleton_core_ratio` because chat context degrades more when chunks are skipped. Lower it for cost, raise it for fidelity.
`lazy_entity_expansion`	`True`	Disable only if you’ve measured that on-demand expansion is hurting recall latency.	Enabled trades a small per-query expansion cost for a much smaller ingest bill. Disabling forces upfront extraction of everything.
`extraction_batch_size`	`5`	LLM provider rate-limits you, or you’re hitting JSON-schema timeouts on large batches.	Larger batches amortize per-call overhead but increase tail latency and rate-limit blast radius.
`max_concurrent_extractions`	`20`	Your provider can take more concurrency (raise) or rate-limits you (lower).	Extraction throughput scales linearly with this value. Khora does not gate it further, so your LLM provider is the limit.
`min_extraction_tokens`	`50`	Conversational data dominated by short messages where extraction adds noise.	Raise to skip more chunks. Lower if you have many short but information-dense chunks (e.g., support tickets).
`max_chunks_in_flight`	`None`	Memory pressure during large batches.	`None` = unlimited. Set a finite number to cap memory at the cost of pipeline parallelism.
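
The interaction between `extraction_batch_size` and `max_concurrent_extractions` can be sketched with `asyncio`. This is a minimal model, not Khora's pipeline: `make_batches`, `extract_all`, and the fake extraction step are hypothetical, and the `asyncio.sleep(0)` stands in for one LLM call per batch.

```python
import asyncio
from typing import List


def make_batches(chunks: List[str], batch_size: int = 5) -> List[List[str]]:
    """Split chunks into consecutive batches of at most batch_size."""
    return [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]


async def extract_all(chunks: List[str], batch_size: int = 5, max_concurrent: int = 20) -> List[str]:
    """Run one (fake) extraction call per batch, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def extract_batch(batch: List[str]) -> List[str]:
        async with semaphore:
            await asyncio.sleep(0)  # placeholder for a single LLM request
            return [f"entities({chunk})" for chunk in batch]

    batches = make_batches(chunks, batch_size)
    results = await asyncio.gather(*(extract_batch(b) for b in batches))
    # gather preserves input order, so flattening keeps chunk order
    return [item for batch in results for item in batch]


extracted = asyncio.run(extract_all(["a", "b", "c"], batch_size=2, max_concurrent=1))
```

Larger batches mean fewer calls for the same number of chunks. A higher concurrency cap means more of those calls are in flight at once, which is where provider rate limits start to matter.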

Fusion and routing

Parameter	Default	When to adjust	Tradeoff
`routing_enabled`	`True`	Disable only if you’ve measured the router making bad decisions on your workload.	Routing varies fusion weights by query complexity. Without it, all queries use the moderate default `0.6 / 0.4`.
`fusion_vector_weight` / `fusion_graph_weight`	`0.6 / 0.4`	Vector-dominant corpora (research text, paraphrased queries) → raise vector. Relationship-heavy corpora (org charts, code) → raise graph.	These are the moderate tier weights. Raising vector toward 0.8 captures more paraphrase recall. Raising graph toward 0.6 emphasizes multi-hop connections.
`fusion_simple_*` weights	`0.8 / 0.2` (vec/graph)	Simple factual lookups should lean even harder on vector.	Routed queries classified SIMPLE get this profile. Lower simple-vector-weight if your simple queries actually benefit from graph context.
`fusion_complex_*` weights	`0.4 / 0.6` (vec/graph)	Complex multi-entity queries should lean on graph traversal.	Routed queries classified COMPLEX get this profile.
`fusion_rrf_k`	`60`	You need either sharper rank-1 dominance (lower k) or smoother contributions across the result set (raise k).	RRF formula `1 / (k + rank)`. `k=60` is the canonical sweet spot for multi-channel fusion. `k=1` is aggressive (rank 1 is 3× rank 5). `k=100` is smooth.
`fusion_hybrid_alpha`	`0.7`	Proper-noun-heavy queries → lower toward 0.3 to weight BM25. Paraphrased natural-language queries → raise toward 0.9.	The classic vector ↔ BM25 dial. The default leans semantic.
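
To see what `fusion_rrf_k` and the channel weights do, it helps to compute the RRF formula `1 / (k + rank)` directly. The sketch below is a generic weighted RRF, assuming each channel's contribution is multiplied by its fusion weight. The function names and sample rankings are illustrative, and Khora's actual fusion code may differ.

```python
from typing import Dict, List


def rrf_score(rank: int, k: int = 60) -> float:
    """Reciprocal-rank contribution for a 1-based rank."""
    return 1.0 / (k + rank)


def rank_ratio(k: int, better: int = 1, worse: int = 5) -> float:
    """How many times more a better rank contributes than a worse one."""
    return rrf_score(better, k) / rrf_score(worse, k)


def fuse(channels: Dict[str, List[str]], weights: Dict[str, float], k: int = 60) -> List[str]:
    """Weighted RRF: sum weight * 1 / (k + rank) per document across channels."""
    scores: Dict[str, float] = {}
    for name, ranked_ids in channels.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weights[name] * rrf_score(rank, k)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


channels = {"vector": ["a", "b", "c"], "graph": ["c", "a", "d"]}
fused = fuse(channels, {"vector": 0.6, "graph": 0.4})  # moderate-tier weights
```

Calling `rank_ratio(k)` for a few values of `k` shows how quickly the advantage of rank 1 over rank 5 flattens as `k` grows, which is the effect the `fusion_rrf_k` row describes.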

Graph traversal

Parameter	Default	When to adjust	Tradeoff
`graph_default_depth`	`2`	Multi-hop queries (“how is X connected to Y?”) want depth 3. Single-hop queries don’t need depth 2.	Each additional hop typically yields ~3–10× more candidate chunks per entry entity. Latency rises with the chunk count, not the depth itself.
`graph_max_depth`	`4`	Hard cap. Raise if your domain genuinely needs 5+ hops (uncommon).	Acts as a guardrail against runaway expansion in pathological queries.
`graph_max_entry_entities`	`10`	Dense graphs with many valid entry points → raise. Noisy vector search injecting bad entities → lower.	More entry entities = broader graph coverage but more candidates to fuse.
`retriever_min_entity_similarity`	`0.3`	Paraphrased entity mentions miss the threshold → lower. False-positive entities dominate → raise.	The default is intentionally lenient. Graph search has its own zero-entity fallback that retries at 0.0 if nothing is found.
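
A quick back-of-the-envelope helper makes the depth tradeoff concrete. It assumes the ~3–10× per-hop growth compounds across hops and that `graph_max_depth` acts as a simple clamp. Both are simplifications for illustration, and the function names are hypothetical.

```python
from typing import Tuple


def effective_depth(requested: int, max_depth: int = 4) -> int:
    """Clamp a requested traversal depth to the hard cap (and to at least 1)."""
    return max(1, min(requested, max_depth))


def growth_range(from_depth: int, to_depth: int, low: int = 3, high: int = 10) -> Tuple[int, int]:
    """Estimated candidate multiplier when going from one depth to a deeper one."""
    extra_hops = to_depth - from_depth
    return (low ** extra_hops, high ** extra_hops)


one_more_hop = growth_range(2, 3)   # (3, 10)
two_more_hops = growth_range(2, 4)  # (9, 100)
capped = effective_depth(6)         # 4
```

Going from depth 2 to 3 is usually affordable. Going to 4 can multiply the candidate pool by two orders of magnitude in dense graphs, which is why the cap exists.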

Temporal

Parameter	Default	When to adjust	Tradeoff
`temporal_recency_weight`	`0.35`	Recency-sensitive corpora (chat, tickets, news) → raise toward 0.4–0.5. Evergreen reference docs → lower toward 0.05.	Higher weight pushes recent results up the ranking even when their relevance is marginally lower. Too high and you’ll surface fresh-but-irrelevant docs.
`temporal_recency_decay_days`	`30`	Fast-moving data (Slack: ~3 days, email: ~7) → shorten. Slow-moving (legal, manuals) → lengthen to 90+.	Half-life: at the decay window, recency contribution drops to 50%. At 2× the window, 25%.
`recency_decay_type`	`"exponential"`	Switch to `linear` only if you want a constant penalty per day rather than half-life behavior.	Exponential matches the Ebbinghaus forgetting curve and is the conventional choice.
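
The half-life behavior is easy to verify numerically. In the sketch below, `exponential_recency` follows the half-life description in the table. The `blended_score` formula, a convex mix of relevance and recency, is an assumption made for illustration and may not match how Khora combines the two.

```python
def exponential_recency(age_days: float, decay_days: float = 30.0) -> float:
    """Half-life decay: 1.0 at age 0, 0.5 at decay_days, 0.25 at 2 * decay_days."""
    return 0.5 ** (age_days / decay_days)


def blended_score(relevance: float, age_days: float,
                  recency_weight: float = 0.35, decay_days: float = 30.0) -> float:
    """Assumed blend: (1 - w) * relevance + w * recency."""
    recency = exponential_recency(age_days, decay_days)
    return (1.0 - recency_weight) * relevance + recency_weight * recency


# An older, more relevant doc versus a fresh, less relevant one.
old_doc = {"relevance": 0.8, "age_days": 90}
new_doc = {"relevance": 0.6, "age_days": 1}

default_prefers_new = blended_score(**new_doc) > blended_score(**old_doc)
evergreen_prefers_old = (
    blended_score(**old_doc, recency_weight=0.05) > blended_score(**new_doc, recency_weight=0.05)
)
```

Under this assumed blend, the default weight lets the fresh document win, while dropping the weight to 0.05 restores the more relevant older one. That is the evergreen-versus-recency-sensitive choice the table describes.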

BM25 and reranking

Parameter	Default	When to adjust	Tradeoff
`enable_bm25_channel`	`False`	Technical corpora where exact terms matter (proper nouns, codes, IDs).	Adds a third RRF channel. Note: separate from `fusion_hybrid_alpha`, which already mixes BM25 into the vector channel. Only enable this when you want BM25 as its own first-class channel.
`enable_reranking`	`True`	On by default. Disable if rerank latency dominates and you can live with raw RRF.	Cross-encoder reranking improves precision ~5–10pp at the cost of 1–3s of latency. Auto-skipped when fewer than 5 candidates are available.
`reranking_top_n`	`50`	Higher for harder-to-rank pools, lower to reduce rerank cost.	Cross-encoder is linear in `top_n`.
`reranking_blend_weight`	`0.7`	Lower if the reranker is overriding good RRF rankings.	`0.7` = trust reranker 70%, original RRF 30%. Set to 1.0 to ignore RRF after rerank, 0.0 to disable reranker influence entirely.
`enable_llm_reranking`	`False`	Temporal queries where ordering nuance matters and cross-encoder isn’t sufficient.	Listwise LLM rerank is expensive, gated by `llm_reranking_confidence_threshold` to fire only when the top-2 rank gap is small.
`llm_reranking_confidence_threshold`	`0.1`	Raise to fire LLM rerank more often (more cost, more accuracy on tight calls).	Default fires only when the gap between rank 1 and rank 2 is < 0.1, i.e., when cross-encoder isn’t confident.
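
The blend weight and the LLM-rerank gate both reduce to one-line rules. The sketch below assumes reranker and RRF scores are already normalized to the same 0–1 scale. The function names are hypothetical and the code is not taken from Khora.

```python
from typing import Sequence


def blend(rerank_score: float, rrf_score: float, blend_weight: float = 0.7) -> float:
    """Linear blend: blend_weight on the reranker, the remainder on the original RRF score."""
    return blend_weight * rerank_score + (1.0 - blend_weight) * rrf_score


def should_llm_rerank(sorted_scores: Sequence[float], threshold: float = 0.1) -> bool:
    """Fire the LLM reranker only when rank 1 and rank 2 are closer than the threshold."""
    if len(sorted_scores) < 2:
        return False
    return (sorted_scores[0] - sorted_scores[1]) < threshold


tight_call = should_llm_rerank([0.82, 0.78])   # gap 0.04, below 0.1
clear_winner = should_llm_rerank([0.90, 0.50])  # gap 0.40, above 0.1
```

Raising the threshold widens the range of gaps that count as a tight call, so the LLM reranker fires on more queries.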

Per-call parameters

Beyond VectorCypherConfig, these arguments are passed per call to recall() / remember() / remember_batch():

Parameter	Default	When to adjust	Tradeoff
`temporal_filter` (per `recall()`)	`None`	Always pass when your query has time bounds. SQL pushdown is cheaper than post-filtering.	Carries `occurred_after`/`occurred_before`/`author`/`channel`/`tags`. The fields filter at the database, not in Python.
`hybrid_alpha` (per `recall()`)	`0.7` (HYBRID mode)	Proper-noun queries → lower toward 0.3. Paraphrased queries → leave near 0.7 or raise.	Vector vs BM25 blend for the RRF stage.
`min_similarity` (per `recall()`)	`0.0`	Raise only if low-similarity noise is leaking into results.	Default `0.0` matches the retrieve-broadly default. Raise toward `0.3+` for strict filtering.
`bulk_mode` (per `remember_batch()`)	`False`	Initial seed loads or any large one-shot batch.	Defers HNSW index build until after the batch, then rebuilds. ~3–5× faster ingest for bulk ops, not appropriate for incremental writes.
`deduplicate` (per `remember_batch()`)	`True`	Disable only if you’re certain there are no duplicates and want to skip the checksum step.	The checksum step is inexpensive and is typically left enabled.
`chunk_strategy` (per `remember()` / batch)	`None` (uses config default)	Override per-document when the source format diverges from the global default.	`conversation` for chat, `recursive` for code/markdown, `semantic` for prose, `fixed` for uniform input.
`max_concurrent` (per `remember_batch()`)	`20`	Constrained downstream resources → lower.	Caps parallel document processing in batch ingest.
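
For a sense of why the `deduplicate` checksum step is cheap, here is a minimal content-hash deduplicator. SHA-256 over the raw text is an assumption, since the page does not say which checksum Khora uses, and the function names are illustrative.

```python
import hashlib
from typing import Iterable, List


def checksum(text: str) -> str:
    """Content hash of a document body (SHA-256 chosen for illustration)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def deduplicate(documents: Iterable[str]) -> List[str]:
    """Drop documents whose checksum was already seen, keeping first occurrences in order."""
    seen = set()
    unique = []
    for doc in documents:
        digest = checksum(doc)
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


batch = ["release notes v1", "meeting summary", "release notes v1"]
unique_docs = deduplicate(batch)
```

The cost is one hash per document and a set lookup, which is small next to chunking, embedding, and extraction.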

Shared parameters

These live on KhoraConfig (env-var prefix KHORA_*) and affect every engine. Most have stable defaults. The rationale below covers the ones with real workload-dependent tradeoffs.

Retrieval thresholds

Parameter	Default	When to adjust	Tradeoff
`min_chunk_similarity`	`0.0`	Raise only if low-similarity noise is leaking into results, but check first that your ranking isn’t the real issue.	This is the cosine noise floor that pgvector filters at. Default was `0.5`, lowered to `0.0` (no floor) after benchmark analysis showed the old floor caused a 25.5% zero-result rate.
`min_entity_similarity`	`0.05`	Same reasoning as chunk similarity.	Was `0.3` before the same benchmark fix.
`entity_linking_fuzzy_threshold`	`0.5`	Raise toward 0.8 if fuzzy matching is producing too many spurious entity links.	Was 0.8 before retrieval tuning. Lower thresholds let more candidate links through to the disambiguation step.
`entity_linking_embedding_threshold`	`0.4`	Same reasoning as fuzzy threshold.	Was 0.7 before retrieval tuning.

Fusion weights

Parameter	Default	When to adjust	Tradeoff
`KHORA_QUERY_VECTOR_WEIGHT`	`0.6`	Semantic-heavy corpora → raise.	The default `recall()` path fuses vector + graph only.
`KHORA_QUERY_GRAPH_WEIGHT`	`0.4`	Relationship-heavy corpora → raise.
`KHORA_QUERY_KEYWORD_WEIGHT`	`0.3`	Technical/proper-noun corpora → raise.	Fills the BM25 slot but is inert until you enable the keyword channel (`enable_bm25_channel=True` / `KHORA_QUERY_ENABLE_BM25_CHANNEL=true`).

MMR diversity and reranking

Parameter	Default	When to adjust	Tradeoff
`enable_diversity`	`True`	Disable only if you’ve measured MMR removing genuinely relevant results.	Maximal Marginal Relevance reduces same-document dominance in result sets. Backed by Rust acceleration with NumPy/Python fallbacks.
`diversity_lambda`	`0.5`	Lower toward 0.3 for more diversity (good for exploration). Raise toward 1.0 for pure relevance.	`1.0` = pure relevance, `0.0` = pure diversity.
`enable_reranking`	`True`	Disable to save 1–3s per query when you can live with raw RRF (≈5–10pp precision loss).	Auto-skipped when fewer than 5 candidates are present.
`reranking_top_n`	`50`	Higher for harder-to-rank pools, lower to reduce cost.	Cross-encoder is linear in `top_n`.
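
The effect of `diversity_lambda` is easiest to see in a small Maximal Marginal Relevance loop. This is a textbook greedy MMR in pure Python, not Khora's Rust-accelerated implementation. The function names and sample vectors are illustrative.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(relevance: List[float], vectors: List[Sequence[float]], top_n: int, lam: float = 0.5) -> List[int]:
    """Greedy MMR: pick the item maximizing lam * relevance - (1 - lam) * max similarity to picks."""
    selected: List[int] = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < top_n:
        def score(i: int) -> float:
            redundancy = max((cosine(vectors[i], vectors[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Items 0 and 1 are near-duplicates; item 2 is different but less relevant.
vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
relevance = [0.9, 0.85, 0.6]

pure_relevance = mmr(relevance, vectors, top_n=2, lam=1.0)  # [0, 1]
balanced = mmr(relevance, vectors, top_n=2, lam=0.5)        # [0, 2]
```

At `lam=1.0` the near-duplicate takes the second slot. At the default 0.5 the redundancy penalty pushes it out in favor of the distinct item.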

HyDE (query expansion)

Parameter	Default	When to adjust	Tradeoff
`enable_hyde`	`"auto"`	Force `"always"` for descriptive/paraphrased query workloads. Force `"never"` when latency is critical.	`auto` fires HyDE when query-understanding flags the query as complex or temporal. Each HyDE call adds an LLM hop (~1–2s).
`hyde_num_hypotheticals`	`1`	Raise to 2–3 for diversity at the cost of more LLM calls.	More hypotheticals = better paraphrase recall, linearly more LLM cost.
`enable_hyde_cypher`	`False`	Currently inert, leave off.	The `khora.query.hyde_cypher` module (LLM-picked parameterized Cypher templates) exists but is not wired into retrieval, so setting this has no effect on `recall()`.

Pipeline (extraction)

Parameter	Default	When to adjust	Tradeoff
`KHORA_PIPELINES_CHUNKING_STRATEGY`	`"semantic"`	`recursive` for hierarchical content (code, markdown). `fixed` for uniform-format documents. `conversation` for chat.	Semantic chunking uses an LLM to split at natural boundaries (higher ingest cost, better retrieval). Fixed is fastest and predictable.
`KHORA_PIPELINES_CHUNK_SIZE`	`512`	Long-form documents → raise to 1024 for more context per chunk. Short messages → drop to 256.	Larger chunks = fewer total chunks (cheaper ingest) but coarser retrieval.
`KHORA_PIPELINES_CHUNK_OVERLAP`	`50`	Raise for entity-dense content to reduce boundary loss. Lower for sparse narrative to save storage.	More overlap = redundancy across chunks, less context loss.
`KHORA_PIPELINES_SELECTIVE_EXTRACTION`	`True`	Disable only on small corpora (under 5K docs) where every chunk matters.	The cost lever. Disabling means 100% of chunks get LLM extraction, full but ~10× more expensive.
`KHORA_PIPELINES_EXTRACTION_IMPORTANCE_RATIO`	`0.7`	Lower for cost-sensitive workloads, raise toward 1.0 for maximum precision.	Top fraction of chunks (by importance) sent to extraction.
`KHORA_PIPELINES_EXTRACTION_MIN_IMPORTANCE`	`0.2`	Chunks above this importance are always extracted, regardless of the ratio cap.	Floor that prevents the ratio cap from skipping high-importance chunks.
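
The interplay between the importance ratio and the minimum-importance floor can be sketched as a union of two selections. This follows the descriptions in the table. The function name, the rounding rule, and the sample scores are assumptions, and Khora's importance scoring itself is not modeled.

```python
from typing import List


def select_for_extraction(importance: List[float], ratio: float = 0.7,
                          min_importance: float = 0.2) -> List[int]:
    """Return indices of chunks to extract: the top `ratio` fraction by importance,
    plus any chunk whose importance is above `min_importance`."""
    keep_count = round(ratio * len(importance))  # rounding rule is an assumption
    by_importance = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    selected = set(by_importance[:keep_count])
    selected.update(i for i, score in enumerate(importance) if score > min_importance)
    return sorted(selected)


scores = [0.9, 0.1, 0.5, 0.05, 0.3, 0.02, 0.15, 0.01, 0.6, 0.03]

ratio_only = select_for_extraction(scores, ratio=0.3, min_importance=1.1)  # floor never triggers
with_floor = select_for_extraction(scores, ratio=0.3, min_importance=0.2)
```

With a ratio of 0.3 alone, chunk 4 (importance 0.3) is skipped. With the 0.2 floor in place it is extracted anyway, which is the guard the floor is meant to provide.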

Conversation chunking

Parameter	Default	When to adjust	Tradeoff
`conversation_time_gap_minutes`	`15`	Raise for naturally batched conversations (email-like threads). Lower for rapid chat.	Idle gaps longer than this start a new conversation chunk.
`conversation_max_group_size`	`50`	Lower for fine-grained retrieval. Raise for more context per chunk.	Caps the number of messages bundled into one chunk.
`conversation_min_group_size`	`2`	Rarely changed.	Groups below this size are merged with adjacent ones.
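
The three conversation parameters combine into a simple grouping pass. The sketch below works on timestamps only and merges undersized groups into the preceding group. That merge direction is an assumption, since the table only says adjacent, and the function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import List


def group_messages(timestamps: List[datetime], time_gap_minutes: int = 15,
                   max_group_size: int = 50, min_group_size: int = 2) -> List[List[datetime]]:
    """Split on idle gaps or full groups, then merge undersized groups into a neighbor."""
    gap = timedelta(minutes=time_gap_minutes)
    groups: List[List[datetime]] = []
    for ts in sorted(timestamps):
        if not groups or ts - groups[-1][-1] > gap or len(groups[-1]) >= max_group_size:
            groups.append([ts])  # idle gap exceeded or group full: start a new chunk
        else:
            groups[-1].append(ts)

    merged: List[List[datetime]] = []
    for group in groups:
        undersized = len(group) < min_group_size
        if merged and (undersized or len(merged[-1]) < min_group_size):
            merged[-1].extend(group)  # note: a merge may exceed max_group_size in this sketch
        else:
            merged.append(list(group))
    return merged


start = datetime(2024, 1, 1, 9, 0)
minutes = [0, 5, 10, 60, 120, 125]
messages = [start + timedelta(minutes=m) for m in minutes]

sizes = [len(g) for g in group_messages(messages)]
```

The lone message at minute 60 would form a one-message chunk on its own. With `min_group_size=2` it is folded into the previous group instead.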

Storage (pgvector)

Parameter	Default	When to adjust	Tradeoff
`KHORA_STORAGE_HNSW_M`	`24`	Raise to 32+ for very high recall on billion-vector indexes. Lower to 16 for memory-constrained deployments.	Max connections per HNSW layer. Higher M = denser graph = better recall + more memory + slower build.
`KHORA_STORAGE_HNSW_EF_CONSTRUCTION`	`128`	Raise to 256 if you can pay the build time and want maximum index quality.	Build-time search width. Higher = better index quality, slower build.
`KHORA_STORAGE_HNSW_EF_SEARCH`	`100`	Raise to 200+ for benchmarks or precision-critical workloads. Lower to 50 for latency-sensitive workloads.	Query-time search width. Linear in latency, sub-linear in recall.
`KHORA_STORAGE_USE_HALFVEC`	`True`	Disable only if you need exact float32 (uncommon) or have pgvector < 0.7.	Float16 quantization saves ~50% memory with negligible recall loss in practice. Graceful fallback when unsupported.
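
The ~50% figure for halfvec follows from pgvector's documented per-value storage: `4 * dimensions + 8` bytes for `vector` and `2 * dimensions + 8` bytes for `halfvec`. The helper below applies that arithmetic. The function names and the one-million-row count are illustrative, and the estimate excludes row overhead and the HNSW graph itself.

```python
def vector_bytes(dimensions: int, halfvec: bool) -> int:
    """Storage for one embedding value in pgvector (excluding row and index overhead)."""
    bytes_per_dimension = 2 if halfvec else 4
    return bytes_per_dimension * dimensions + 8


def embeddings_gib(rows: int, dimensions: int, halfvec: bool) -> float:
    """Raw embedding storage for a table, in GiB."""
    return rows * vector_bytes(dimensions, halfvec) / 2**30


dims = 1536  # default output size of text-embedding-3-small

full = vector_bytes(dims, halfvec=False)  # 6152 bytes
half = vector_bytes(dims, halfvec=True)   # 3080 bytes
savings = 1 - half / full                 # just under 50%

million_rows_full = embeddings_gib(1_000_000, dims, halfvec=False)
million_rows_half = embeddings_gib(1_000_000, dims, halfvec=True)
```

At one million 1536-dimensional chunks, the raw embeddings drop from roughly 5.7 GiB to roughly 2.9 GiB before any index is built.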

Storage (LanceDB)

Parameter	Default	When to adjust	Tradeoff
`lance_index`	`"auto"`	Force `ivf_pq` for over 1M chunks where memory matters. Force `hnsw` for under 1M with strict latency. Force `brute` for small corpora where exact search is fine.	`auto` picks based on table size. The auto-detection is conservative.
`retrain_factor`	`2.0`	Raise to 3+ to retrain less often (cheaper). Lower to 1.5 to keep the index fresh on growing corpora.	IVF-PQ retraining fires when row count ≥ `factor × (rows at last training)`. Set ≤ 1.0 to disable retraining entirely.
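
The `retrain_factor` rule reduces to a single comparison. The function below restates the trigger from the table, including the convention that a factor at or below 1.0 disables retraining. The name `should_retrain` is hypothetical.

```python
def should_retrain(current_rows: int, rows_at_last_training: int, retrain_factor: float = 2.0) -> bool:
    """True when row count >= factor * rows at last training; a factor <= 1.0 disables retraining."""
    if retrain_factor <= 1.0:
        return False
    return current_rows >= retrain_factor * rows_at_last_training


not_yet = should_retrain(199_999, 100_000)        # below 2x: keep the current index
doubled = should_retrain(200_000, 100_000)        # reached 2x: retrain
fresher = should_retrain(150_000, 100_000, 1.5)   # lower factor retrains sooner
disabled = should_retrain(10_000_000, 100_000, 1.0)
```

With the default factor, an index trained at 100K rows is retrained at 200K, then 400K, and so on, so retraining becomes less frequent as the corpus grows.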

Neo4j (VectorCypher only)

Parameter	Default	When to adjust	Tradeoff
`entity_write_concurrency`	`12`	Raise on bulk loads where Neo4j has headroom.	Caps concurrent entity-write transactions.
`relationship_write_concurrency`	`8`	Same reasoning as entity writes.
`query_timeout`	`5.0` s	Lower to fail fast on slow graph queries. Raise for deep traversals on large graphs.	Per-transaction read timeout (1–300 s, `None` disables).

LLM

Parameter	Default	When to adjust	Tradeoff
`KHORA_LLM_EXTRACTION_MODEL`	unset (falls back to `model`)	Set to a cheap fast model (Haiku, Gemini Flash) for extraction, reserving the primary model for generation.	The single biggest LLM-cost lever. Extraction is high-volume and tolerates a smaller model.
`KHORA_LLM_MAX_CONCURRENT_LLM_CALLS`	`10`	Raise if your provider can take more concurrency.	Global cap across the whole library. Throttle to respect rate limits.
`KHORA_LLM_EMBEDDING_MODEL`	`text-embedding-3-small`	Switch to `-large` for higher quality at higher cost. Must match `embedding_dimension`.	Changing model dimension requires schema migration.
