Khora exposes ~150 tunable parameters. Most ship with defaults tuned against the retrieval_basic benchmark. Adjust them only when profiling identifies a specific problem. This page documents the parameters most likely to need adjustment, with the rationale for each, drawn from retrieval-tuning.md and the engine-specific docs.

Threshold defaults

Khora’s retrieval defaults follow one principle: retrieve broadly, then rank precisely. An earlier version of Khora filtered aggressively at every stage and produced a 25.5% zero-result rate on the retrieval_basic benchmark. Descriptive queries such as “wrought-iron tower built for the 1889 World’s Fair” returned nothing because the 0.5 cosine floor discarded a 0.35-similarity Eiffel Tower chunk before ranking saw it. The current defaults lower thresholds to a noise floor (0.05 for chunk and entity similarity), let RRF fusion and reranking determine relevance, and add a zero-result fallback that re-queries with min_similarity=0.0. Before raising a threshold, confirm the underlying issue is filtering rather than something upstream such as ranking or reranker quality.
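The difference between an aggressive floor and the noise-floor-plus-fallback approach can be illustrated with a small standalone sketch. It does not use Khora's API: the function name filter_with_fallback and the candidate tuples are hypothetical, and the fallback here re-filters an in-memory list rather than re-querying storage.

```python
from typing import List, Tuple

Candidate = Tuple[str, float]  # (chunk_id, cosine similarity)


def filter_with_fallback(
    candidates: List[Candidate], min_similarity: float = 0.05
) -> List[Candidate]:
    """Keep candidates at or above the floor; retry with 0.0 if none survive."""
    kept = [c for c in candidates if c[1] >= min_similarity]
    if not kept:
        # Zero-result fallback: drop the floor instead of returning nothing.
        kept = [c for c in candidates if c[1] >= 0.0]
    # Ranking, not the threshold, decides the final order.
    return sorted(kept, key=lambda c: c[1], reverse=True)


candidates = [("eiffel_tower", 0.35), ("paris_metro", 0.12), ("noise", 0.02)]
old_floor = [c for c in candidates if c[1] >= 0.5]  # aggressive filter
broad = filter_with_fallback(candidates)  # noise-floor default
print(old_floor)
print(broad)
```

With the 0.5 floor the relevant chunk never reaches ranking. With the 0.05 floor it survives and sorts to the top.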

Where parameters are set

| Surface | Examples |
| --- | --- |
| engine_kwargs={...} to Khora() | vectorcypher_config, storage_backend |
| Per-call to recall() / remember() | temporal_filter, hybrid_alpha, bulk_mode, min_similarity |
| KhoraConfig / KHORA_* env vars | Storage HNSW, LLM model, pipeline chunking, query defaults |
Precedence, highest first: per-call > engine constructor > config > env. See Configuration for the complete env-var reference.
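The precedence order can be pictured as a layered lookup. The sketch below uses Python's collections.ChainMap as a stand-in. The resolve function and the dictionary contents are hypothetical and say nothing about how Khora resolves settings internally.

```python
from collections import ChainMap


def resolve(per_call: dict, engine_kwargs: dict, config: dict, env: dict) -> ChainMap:
    """Earlier mappings win: per-call > engine constructor > config > env."""
    return ChainMap(per_call, engine_kwargs, config, env)


settings = resolve(
    per_call={"min_similarity": 0.0},
    engine_kwargs={"hybrid_alpha": 0.5},
    config={"hybrid_alpha": 0.7, "min_similarity": 0.05},
    env={"hybrid_alpha": 0.9, "chunk_size": 512},
)
print(settings["min_similarity"], settings["hybrid_alpha"], settings["chunk_size"])
```

A value set at a higher-precedence surface shadows the same key lower down, and keys set only in the environment still resolve.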

Tuning by symptom

| Symptom | First parameter to try |
| --- | --- |
| 25% of queries return nothing | Lower min_similarity to 0.0 and min_chunk_similarity / min_entity_similarity to 0.05. These are already the defaults, so first verify they have not been overridden. |
| LLM extraction bill is the dominant cost | Lower skeleton_core_ratio (VectorCypher) or KHORA_PIPELINES_EXTRACTION_IMPORTANCE_RATIO (shared). Both default to 0.7. Dropping to 0.3 cuts extraction cost roughly 2× (0.7 / 0.3 ≈ 2.3), at the price of relying more on less precise on-demand expansion. |
| Recent docs buried under older ones | Raise KHORA_QUERY_RECENCY_WEIGHT (default 0.35) and/or shorten KHORA_QUERY_RECENCY_DECAY_DAYS (default 7.0). |
| Graph queries miss multi-hop answers | Raise graph_default_depth from 2 to 3. Accept ~3–10× more candidates per entry entity and a P95 latency bump. |
| Paraphrased queries miss exact-term matches | Lower fusion_hybrid_alpha from 0.7 toward 0.3 to weight BM25 over vectors. Alternatively, leave alpha alone and ensure keyword search is enabled (it now runs in HYBRID mode by default). |
| Bulk ingest is slow | Pass bulk_mode=True to remember_batch() to defer the HNSW build until after the batch (~3–5× faster), and lower extraction_batch_size / raise max_concurrent_extractions if the LLM provider isn’t your bottleneck. |
| Cross-encoder rerank dominates P95 | Reranking auto-skips when fewer than 5 candidates are available. Otherwise set enable_reranking=False to fall back to raw RRF, losing ~5–10pp precision but saving 1–3s per query. |
| Memory pressure on pgvector | Confirm KHORA_STORAGE_USE_HALFVEC=true (the default) is still set for ~50% memory savings with minor recall loss. Requires pgvector ≥ 0.7. |

VectorCypher

VectorCypher’s tunables live on VectorCypherConfig and are passed in as engine_kwargs={"vectorcypher_config": VectorCypherConfig(...)}. The defaults below were chosen against the retrieval_basic benchmark. The “When to adjust” column lists the symptom or workload that warrants overriding them.

Extraction cost

| Parameter | Default | When to adjust | Tradeoff |
| --- | --- | --- | --- |
| skeleton_core_ratio | 0.70 | LLM bill is the dominant cost, OR you need maximum precision and have budget. | Lower → fewer LLM calls during ingest (0.7→0.3 is roughly 2× cost reduction), in exchange for a sparser entity graph for queries that hit non-core chunks. Set to 1.0 to recover legacy GraphRAG full-extraction behavior. |
| conversation_skeleton_ratio | 0.90 | Conversational corpora where you can’t afford to drop dialog turns. | Higher than skeleton_core_ratio because chat context degrades more when chunks are skipped. Lower it for cost, raise it for fidelity. |
| lazy_entity_expansion | True | Disable only if you’ve measured that on-demand expansion is hurting recall latency. | Enabled trades a small per-query expansion cost for a much smaller ingest bill. Disabling forces upfront extraction of everything. |
| extraction_batch_size | 5 | LLM provider rate-limits you, or you’re hitting JSON-schema timeouts on large batches. | Larger batches amortize per-call overhead but increase tail latency and rate-limit blast radius. |
| max_concurrent_extractions | 20 | Your provider can take more concurrency (raise) or rate-limits (lower). | Linear with extraction throughput, ungated by Khora. Your LLM provider is the limit. |
| min_extraction_tokens | 50 | Conversational data dominated by short messages where extraction adds noise. | Raise to skip more chunks. Lower if you have many short but information-dense chunks (e.g., support tickets). |
| max_chunks_in_flight | None | Memory pressure during large batches. | None = unlimited. Set a finite number to cap memory at the cost of pipeline parallelism. |
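As a rough back-of-the-envelope check on the skeleton_core_ratio tradeoff, the sketch below estimates upfront extraction calls. It assumes cost scales with the number of core chunks divided by the batch size and ignores any later on-demand expansion. The function extraction_calls is hypothetical, not part of Khora.

```python
def extraction_calls(num_chunks: int, core_ratio: float, batch_size: int = 5) -> int:
    """Estimate upfront LLM extraction calls: core chunks, grouped into batches."""
    core_chunks = round(num_chunks * core_ratio)
    return -(-core_chunks // batch_size)  # ceiling division


baseline = extraction_calls(10_000, 0.7)
reduced = extraction_calls(10_000, 0.3)
print(baseline, reduced, round(baseline / reduced, 2))
```

Under these assumptions, moving from 0.7 to 0.3 on a 10,000-chunk corpus cuts upfront calls by a factor of about 2.3.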

Fusion and routing

| Parameter | Default | When to adjust | Tradeoff |
| --- | --- | --- | --- |
| routing_enabled | True | Disable only if you’ve measured the router making bad decisions on your workload. | Routing varies fusion weights by query complexity. Without it, all queries use the moderate default 0.6 / 0.4. |
| fusion_vector_weight / fusion_graph_weight | 0.6 / 0.4 | Vector-dominant corpora (research text, paraphrased queries) → raise vector. Relationship-heavy corpora (org charts, code) → raise graph. | These are the moderate tier weights. Raising vector toward 0.8 captures more paraphrase recall. Raising graph toward 0.6 emphasizes multi-hop connections. |
| fusion_simple_* weights | 0.8 / 0.2 (vec/graph) | Simple factual lookups should lean even harder on vector. | Routed queries classified SIMPLE get this profile. Lower simple-vector-weight if your simple queries actually benefit from graph context. |
| fusion_complex_* weights | 0.4 / 0.6 (vec/graph) | Complex multi-entity queries should lean on graph traversal. | Routed queries classified COMPLEX get this profile. |
| fusion_rrf_k | 60 | You need either sharper rank-1 dominance (lower k) or smoother contributions across the result set (raise k). | RRF formula: 1 / (k + rank). k=60 is the canonical sweet spot for multi-channel fusion. k=1 is aggressive (with 1-based ranks, rank 1 scores 3× rank 5). k=100 is smooth. |
| fusion_hybrid_alpha | 0.7 | Proper-noun-heavy queries → lower toward 0.3 to weight BM25. Paraphrased natural-language queries → raise toward 0.9. | The classic vector ↔ BM25 dial. The default leans semantic. |
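To make the effect of fusion_rrf_k and the channel weights concrete, here is a minimal weighted reciprocal rank fusion sketch. It assumes 1-based ranks and that each channel's weight simply multiplies its 1 / (k + rank) contribution. Khora's actual fusion may differ, and rrf_fuse and rank_ratio are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(
    channels: Dict[str, List[str]], weights: Dict[str, float], k: int = 60
) -> List[str]:
    """Weighted RRF with 1-based ranks: score += weight / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for name, ranked_ids in channels.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weights[name] / (k + rank)
    # Sort by descending score, breaking ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def rank_ratio(k: int, hi: int = 1, lo: int = 5) -> float:
    """How much more a hit at rank `hi` contributes than one at rank `lo`."""
    return (k + lo) / (k + hi)


channels = {"vector": ["a", "b", "c"], "graph": ["c", "a", "d"]}
fused = rrf_fuse(channels, {"vector": 0.6, "graph": 0.4})
print(fused)
print(rank_ratio(1), round(rank_ratio(60), 3), round(rank_ratio(100), 3))
```

With 1-based ranks, rank_ratio(1) is 3.0 (it would be 5.0 if ranks started at 0), while k=60 and k=100 both keep the rank-1 to rank-5 ratio under 1.1. This is why large k values spread influence more evenly across the result set.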

Graph traversal

| Parameter | Default | When to adjust | Tradeoff |
| --- | --- | --- | --- |
| graph_default_depth | 2 | Multi-hop queries (“how is X connected to Y?”) want depth 3. Single-hop queries don’t need depth 2. | Each additional hop typically yields ~3–10× more candidate chunks per entry entity. Latency rises with the chunk count, not the depth itself. |
| graph_max_depth | 4 | Hard cap. Raise if your domain genuinely needs 5+ hops (uncommon). | Acts as a guardrail against runaway expansion in pathological queries. |
| graph_max_entry_entities | 10 | Dense graphs with many valid entry points → raise. Noisy vector search injecting bad entities → lower. | More entry entities = broader graph coverage but more candidates to fuse. |
| retriever_min_entity_similarity | 0.3 | Paraphrased entity mentions miss the threshold → lower. False-positive entities dominate → raise. | The default is intentionally lenient. Graph search has its own zero-entity fallback that retries at 0.0 if nothing is found. |

Temporal

| Parameter | Default | When to adjust | Tradeoff |
| --- | --- | --- | --- |
| temporal_recency_weight | 0.2 | Recency-sensitive corpora (chat, tickets, news) → raise toward 0.4–0.5. Evergreen reference docs → lower toward 0.05. | Higher weight pushes recent results up the ranking even when their relevance is marginally lower. Too high and you’ll surface fresh-but-irrelevant docs. |
| temporal_recency_decay_days | 30 | Fast-moving data (Slack: ~3 days, email: ~7) → shorten. Slow-moving (legal, manuals) → lengthen to 90+. | Half-life: at the decay window, recency contribution drops to 50%. At 2× the window, 25%. |
| recency_decay_type | "exponential" | Switch to linear only if you want a constant penalty per day rather than half-life behavior. | Exponential matches the Ebbinghaus forgetting curve and is the conventional choice. |
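The half-life behavior described for temporal_recency_decay_days can be written as a one-line formula. The sketch below is illustrative only: recency_factor is a hypothetical function, and the linear branch is an assumed shape (a constant per-day penalty that also reaches 0.5 at the decay window), not Khora's documented formula.

```python
def recency_factor(
    age_days: float, decay_days: float = 30.0, decay_type: str = "exponential"
) -> float:
    """Return a recency multiplier in [0, 1] for a document of the given age."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if decay_type == "exponential":
        # Half-life: 0.5 at one decay window, 0.25 at two windows.
        return 0.5 ** (age_days / decay_days)
    if decay_type == "linear":
        # Assumption: constant penalty per day, hitting zero at 2x the window.
        return max(0.0, 1.0 - age_days / (2.0 * decay_days))
    raise ValueError(f"unknown decay_type: {decay_type}")


for age in (0, 30, 60, 90):
    print(age, recency_factor(age), recency_factor(age, decay_type="linear"))
```

Shortening decay_days to 3 for chat-like data means a week-old message keeps only about 20% of its recency contribution under the exponential curve.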

BM25 and reranking

| Parameter | Default | When to adjust | Tradeoff |
| --- | --- | --- | --- |
| enable_bm25_channel | False | Technical corpora where exact terms matter (proper nouns, codes, IDs). | Adds a third RRF channel. Note: separate from fusion_hybrid_alpha, which already mixes BM25 into the vector channel. Only enable this when you want BM25 as its own first-class channel. |
| enable_reranking | False | High-precision queries. Disable if rerank latency dominates and you can live with raw RRF. | Cross-encoder reranking improves precision ~5–10pp at the cost of 1–3s of latency. Auto-skipped when fewer than 5 candidates are available. |
| reranking_top_n | 50 | Higher for harder-to-rank pools, lower to reduce rerank cost. | Cross-encoder is linear in top_n. |
| reranking_blend_weight | 0.7 | Lower if the reranker is overriding good RRF rankings. | 0.7 = trust reranker 70%, original RRF 30%. Set to 1.0 to ignore RRF after rerank, 0.0 to disable reranker influence entirely. |
| enable_llm_reranking | False | Temporal queries where ordering nuance matters and cross-encoder isn’t sufficient. | Listwise LLM rerank is expensive, gated by llm_reranking_confidence_threshold to fire only when the top-2 rank gap is small. |
| llm_reranking_confidence_threshold | 0.1 | Raise to fire LLM rerank more often (more cost, more accuracy on tight calls). | Default fires only when the gap between rank 1 and rank 2 is < 0.1, i.e., when cross-encoder isn’t confident. |
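The blend weight and the two gating rules described above can be sketched as three small functions. This assumes RRF and reranker scores are already normalized to a comparable 0–1 range, which the source does not state. All function names are hypothetical.

```python
from typing import List


def blend_scores(rrf: float, rerank: float, blend_weight: float = 0.7) -> float:
    """Linear blend: blend_weight on the reranker, the remainder on RRF."""
    return blend_weight * rerank + (1.0 - blend_weight) * rrf


def should_cross_encode(num_candidates: int, min_candidates: int = 5) -> bool:
    """Cross-encoder rerank is skipped when fewer than 5 candidates exist."""
    return num_candidates >= min_candidates


def should_llm_rerank(scores: List[float], threshold: float = 0.1) -> bool:
    """Fire the LLM rerank only when the top-two score gap is below threshold."""
    if len(scores) < 2:
        return False
    first, second = sorted(scores, reverse=True)[:2]
    return (first - second) < threshold


print(round(blend_scores(rrf=0.2, rerank=0.9), 2))
print(should_cross_encode(4), should_llm_rerank([0.82, 0.78, 0.30]))
```

Raising the threshold widens the band of "close calls" that trigger the expensive listwise pass, which is why it increases both cost and accuracy on tight rankings.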

Per-call parameters

Beyond VectorCypherConfig, these arguments are passed per call to recall() / remember() / remember_batch():

| Parameter | Default | When to adjust | Tradeoff |
| --- | --- | --- | --- |
| temporal_filter (per recall()) | None | Always pass when your query has time bounds. SQL pushdown is cheaper than post-filtering. | Carries occurred_after/occurred_before/author/channel/tags. The fields filter at the database, not in Python. |
| hybrid_alpha (per recall()) | 0.7 (HYBRID mode) | Proper-noun queries → lower toward 0.3. Paraphrased queries → leave near 0.7 or raise. | Vector vs BM25 blend for the RRF stage. |
| min_similarity (per recall()) | 0.0 | Raise only if low-similarity noise is leaking into results. | Default 0.0 matches the retrieve-broadly default. Raise toward 0.3+ for strict filtering. |
| bulk_mode (per remember_batch()) | False | Initial seed loads or any large one-shot batch. | Defers HNSW index build until after the batch, then rebuilds. ~3–5× faster ingest for bulk ops, not appropriate for incremental writes. |
| deduplicate (per remember_batch()) | True | Disable only if you’re certain there are no duplicates and want to skip the checksum step. | The checksum step is inexpensive and is typically left enabled. |
| chunk_strategy (per remember() / batch) | None (uses config default) | Override per-document when the source format diverges from the global default. | conversation for chat, recursive for code/markdown, semantic for prose, fixed for uniform input. |
| max_concurrent (per remember_batch()) | 20 | Constrained downstream resources → lower. | Caps parallel document processing in batch ingest. |

Shared parameters

These live on KhoraConfig (env-var prefix KHORA_*) and affect every engine. Most have stable defaults. The rationale below covers the ones with real workload-dependent tradeoffs.

Retrieval thresholds

| Parameter | Default | When to adjust | Tradeoff |
| --- | --- | --- | --- |
| min_chunk_similarity | 0.05 | Raise only if low-similarity noise is leaking into results, but check first that your ranking isn’t the real issue. | This is the cosine noise floor that pgvector filters at. Default was 0.5. Benchmark analysis showed that floor caused a 25.5% zero-result rate, so it was lowered. |
| min_entity_similarity | 0.05 | Same reasoning as chunk similarity. | Was 0.3 before the same benchmark fix. |
| entity_linking_fuzzy_threshold | 0.6 | Raise toward 0.8 if fuzzy matching is producing too many spurious entity links. | Was 0.8 before retrieval tuning. Lower thresholds let more candidate links through to the disambiguation step. |
| entity_linking_embedding_threshold | 0.4 | Same reasoning as fuzzy threshold. | Was 0.7 before retrieval tuning. |

Fusion weights (default RRF, used outside VectorCypher’s routed weights)

| Parameter | Default | When to adjust | Tradeoff |
| --- | --- | --- | --- |
| KHORA_QUERY_VECTOR_WEIGHT | 0.5 | Semantic-heavy corpora → raise. | Default fusion in HYBRID mode now includes keyword search. These three weights sum to a balanced blend. |
| KHORA_QUERY_GRAPH_WEIGHT | 0.3 | Relationship-heavy corpora → raise. | |
| KHORA_QUERY_KEYWORD_WEIGHT | 0.2 | Technical/proper-noun corpora → raise. | Note: keyword search now runs in HYBRID mode (previously gated to ALL mode). This is a benchmark-driven default change. |

MMR diversity and reranking

| Parameter | Default | When to adjust | Tradeoff |
| --- | --- | --- | --- |
| enable_diversity | True | Disable only if you’ve measured MMR removing genuinely relevant results. | Maximal Marginal Relevance reduces same-document dominance in result sets. Backed by Rust acceleration with NumPy/Python fallbacks. |
| diversity_lambda | 0.7 | Lower toward 0.3 for more diversity (good for exploration). Raise toward 1.0 for pure relevance. | 1.0 = pure relevance, 0.0 = pure diversity. |
| enable_reranking | True | Disable to save 1–3s per query when you can live with raw RRF (≈5–10pp precision loss). | Auto-skipped when fewer than 5 candidates are present. |
| reranking_top_n | 50 | Higher for harder-to-rank pools, lower to reduce cost. | Cross-encoder is linear in top_n. |
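For readers unfamiliar with how diversity_lambda trades relevance against redundancy, the following is a plain-Python sketch of greedy Maximal Marginal Relevance. It is not Khora's Rust-accelerated implementation. The mmr_select function, the chunk ids, and the similarity values are all made up for illustration.

```python
from typing import Callable, Dict, List


def mmr_select(
    relevance: Dict[str, float],
    similarity: Callable[[str, str], float],
    top_k: int,
    diversity_lambda: float = 0.7,
) -> List[str]:
    """Greedy MMR: pick items maximizing lambda*relevance - (1-lambda)*redundancy."""
    selected: List[str] = []
    remaining = sorted(relevance)  # sorted for deterministic tie-breaking
    while remaining and len(selected) < top_k:

        def score(doc: str) -> float:
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((similarity(doc, s) for s in selected), default=0.0)
            return diversity_lambda * relevance[doc] - (1.0 - diversity_lambda) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Two chunks from the same document (a1, a2) and one from another (b1).
relevance = {"a1": 0.90, "a2": 0.85, "b1": 0.70}
pair_similarity = {frozenset(("a1", "a2")): 0.95}


def sim(x: str, y: str) -> float:
    return pair_similarity.get(frozenset((x, y)), 0.1)


print(mmr_select(relevance, sim, top_k=2, diversity_lambda=0.7))
print(mmr_select(relevance, sim, top_k=2, diversity_lambda=1.0))
```

At 0.7 the near-duplicate a2 loses its slot to b1. At 1.0 the selection collapses to pure relevance order.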

HyDE (query expansion)

| Parameter | Default | When to adjust | Tradeoff |
| --- | --- | --- | --- |
| enable_hyde | "auto" | Force "always" for descriptive/paraphrased query workloads. Force "never" when latency is critical. | auto fires HyDE when query-understanding flags the query as complex or temporal. Each HyDE call adds an LLM hop (~1–2s). |
| hyde_num_hypotheticals | 1 | Raise to 2–3 for diversity at the cost of more LLM calls. | More hypotheticals = better paraphrase recall, linearly more LLM cost. |
| enable_hyde_cypher | False | Opt-in. Enable only after an A/B run on hand-curated structured queries. | Experimental in v0.12.0+. Asks an LLM to pick a parameterized Cypher template (recent_by_type, entity_relationships, cooccurrence) and execute it as an extra retrieval channel. |

Pipeline (extraction and chunking)

| Parameter | Default | When to adjust | Tradeoff |
| --- | --- | --- | --- |
| KHORA_PIPELINES_CHUNKING_STRATEGY | "semantic" | recursive for hierarchical content (code, markdown). fixed for uniform-format documents. conversation for chat. | Semantic chunking uses an LLM to split at natural boundaries (higher ingest cost, better retrieval). Fixed is fastest and predictable. |
| KHORA_PIPELINES_CHUNK_SIZE | 512 | Long-form documents → raise to 1024 for more context per chunk. Short messages → drop to 256. | Larger chunks = fewer total chunks (cheaper ingest) but coarser retrieval. |
| KHORA_PIPELINES_CHUNK_OVERLAP | 50 | Raise for entity-dense content to reduce boundary loss. Lower for sparse narrative to save storage. | More overlap = redundancy across chunks, less context loss. |
| KHORA_PIPELINES_SELECTIVE_EXTRACTION | True | Disable only on small corpora (under 5K docs) where every chunk matters. | The cost lever. Disabling means 100% of chunks get LLM extraction, full but ~10× more expensive. |
| KHORA_PIPELINES_EXTRACTION_IMPORTANCE_RATIO | 0.7 | Lower for cost-sensitive workloads, raise toward 1.0 for maximum precision. | Top fraction of chunks (by importance) sent to extraction. |
| KHORA_PIPELINES_EXTRACTION_MIN_IMPORTANCE | 0.2 | Chunks above this importance are always extracted, regardless of the ratio cap. | Floor that prevents the ratio cap from skipping high-importance chunks. |
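The interaction between chunk size and overlap is easiest to see with a sliding window. The sketch below mimics a "fixed" strategy over a token list. It assumes both settings are measured in tokens, which the source does not state, and fixed_chunks is a hypothetical helper rather than Khora's chunker.

```python
from typing import List, Sequence


def fixed_chunks(
    tokens: Sequence[int], chunk_size: int = 512, overlap: int = 50
) -> List[Sequence[int]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    stride = chunk_size - overlap
    chunks: List[Sequence[int]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the document
    return chunks


document = list(range(2000))  # stand-in for a 2,000-token document
default_chunks = fixed_chunks(document)
large_chunks = fixed_chunks(document, chunk_size=1024)
print(len(default_chunks), len(large_chunks))
```

Doubling the chunk size here reduces the chunk count from 5 to 3, which is the "cheaper ingest, coarser retrieval" tradeoff in miniature. The last 50 tokens of each chunk reappear at the start of the next.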

Conversation chunking

| Parameter | Default | When to adjust | Tradeoff |
| --- | --- | --- | --- |
| conversation_time_gap_minutes | 15 | Raise for naturally batched conversations (email-like threads). Lower for rapid chat. | Idle gaps longer than this start a new conversation chunk. |
| conversation_max_group_size | 50 | Lower for fine-grained retrieval. Raise for more context per chunk. | Caps the number of messages bundled into one chunk. |
| conversation_min_group_size | 2 | Rarely changed. | Groups below this size are merged with adjacent ones. |
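A simplified picture of how the time gap and the size cap split a message stream is sketched below. It operates on sorted timestamps only, omits the conversation_min_group_size merge step, and uses a hypothetical group_messages function rather than Khora's chunker.

```python
from datetime import datetime, timedelta
from typing import List


def group_messages(
    timestamps: List[datetime], time_gap_minutes: int = 15, max_group_size: int = 50
) -> List[List[datetime]]:
    """Start a new group after an idle gap longer than the limit or at the size cap."""
    gap = timedelta(minutes=time_gap_minutes)
    groups: List[List[datetime]] = []
    for ts in timestamps:
        current = groups[-1] if groups else None
        if current and ts - current[-1] <= gap and len(current) < max_group_size:
            current.append(ts)
        else:
            groups.append([ts])
    return groups


start = datetime(2024, 1, 1, 9, 0)
stamps = [start + timedelta(minutes=m) for m in (0, 5, 10, 40, 42)]
groups = group_messages(stamps)
print([len(g) for g in groups])
```

The 30-minute pause between the third and fourth message exceeds the 15-minute default, so the stream splits into groups of 3 and 2.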

Storage (pgvector)

| Parameter | Default | When to adjust | Tradeoff |
| --- | --- | --- | --- |
| KHORA_STORAGE_HNSW_M | 24 | Raise to 32+ for very high recall on billion-vector indexes. Lower to 16 for memory-constrained deployments. | Max connections per HNSW layer. Higher M = denser graph = better recall + more memory + slower build. |
| KHORA_STORAGE_HNSW_EF_CONSTRUCTION | 128 | Raise to 256 if you can pay the build time and want maximum index quality. | Build-time search width. Higher = better index quality, slower build. |
| KHORA_STORAGE_HNSW_EF_SEARCH | 100 | Raise to 200+ for benchmarks or precision-critical workloads. Lower to 50 for latency-sensitive workloads. | Query-time search width. Linear in latency, sub-linear in recall. |
| KHORA_STORAGE_USE_HALFVEC | True | Disable only if you need exact float32 (uncommon) or have pgvector < 0.7. | Float16 quantization saves ~50% memory with negligible recall loss in practice. Graceful fallback when unsupported. |
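The ~50% figure for halfvec follows directly from storing 2 bytes per dimension instead of 4. The arithmetic below covers raw vector payload only and ignores HNSW graph links and per-row overhead. vector_bytes is a hypothetical helper, and 1536 is the native dimension of text-embedding-3-small.

```python
def vector_bytes(num_vectors: int, dimensions: int, use_halfvec: bool = True) -> int:
    """Raw embedding payload: 2 bytes per value for float16, 4 for float32."""
    bytes_per_value = 2 if use_halfvec else 4
    return num_vectors * dimensions * bytes_per_value


full = vector_bytes(1_000_000, 1536, use_halfvec=False)
half = vector_bytes(1_000_000, 1536, use_halfvec=True)
print(f"float32: {full / 1e9:.2f} GB, float16: {half / 1e9:.2f} GB")
```

For one million 1536-dimension embeddings this is roughly 6.1 GB versus 3.1 GB of vector data.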

Storage (LanceDB embedded)

| Parameter | Default | When to adjust | Tradeoff |
| --- | --- | --- | --- |
| lance_index | "auto" | Force ivf_pq for over 1M chunks where memory matters. Force hnsw for under 1M with strict latency. Force brute for small corpora where exact search is fine. | auto picks based on table size. The auto-detection is conservative. |
| retrain_factor | 2.0 | Raise to 3+ to retrain less often (cheaper). Lower to 1.5 to keep the index fresh on growing corpora. | IVF-PQ retraining fires when row count ≥ factor × (rows at last training). Set ≤ 1.0 to disable retraining entirely. |
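The retraining rule stated for retrain_factor reduces to a single comparison. The sketch below restates it in code. should_retrain is a hypothetical function, not Khora's or LanceDB's API.

```python
def should_retrain(
    current_rows: int, rows_at_last_training: int, retrain_factor: float = 2.0
) -> bool:
    """Retrain when rows have grown to factor x the count at the last training."""
    if retrain_factor <= 1.0:
        return False  # a factor of 1.0 or less disables retraining
    return current_rows >= retrain_factor * rows_at_last_training


print(should_retrain(199_999, 100_000))
print(should_retrain(200_000, 100_000))
print(should_retrain(150_000, 100_000, retrain_factor=1.5))
```

With the default of 2.0 an index trained at 100,000 rows retrains once the table doubles. At 1.5 it retrains after 50% growth.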

Neo4j (VectorCypher only)

| Parameter | Default | When to adjust | Tradeoff |
| --- | --- | --- | --- |
| entity_write_concurrency | 12 | Raise on bulk loads where Neo4j has headroom. | Caps concurrent entity-write transactions. |
| relationship_write_concurrency | 8 | Same reasoning as entity writes. | |
| query_timeout | 5.0 s | Lower to fail fast on slow graph queries. Raise for deep traversals on large graphs. | Per-transaction read timeout (1–300 s, None disables). |

LLM

| Parameter | Default | When to adjust | Tradeoff |
| --- | --- | --- | --- |
| KHORA_LLM_EXTRACTION_MODEL | unset (falls back to model) | Set to a cheap fast model (Haiku, Gemini Flash) for extraction, reserving the primary model for generation. | The single biggest LLM-cost lever. Extraction is high-volume and tolerates a smaller model. |
| KHORA_LLM_MAX_CONCURRENT_LLM_CALLS | 10 | Raise if your provider can take more concurrency. | Global cap across the whole library. Throttle to respect rate limits. |
| KHORA_LLM_EMBEDDING_MODEL | text-embedding-3-small | Switch to -large for higher quality at higher cost. Must match embedding_dimension. | Changing model dimension requires schema migration. |