Retrieval

recall() is not a single similarity lookup. Khora understands the query, searches multiple backends in parallel, fuses the rankings, and optionally reranks, then returns a typed RecallResult.

query ─▶ ① understand ─▶ ② link entities ─▶ ③ search ─▶ ④ fuse ─▶ ⑤ filter ─▶ ⑥ rerank ─▶ ⑦ limit
                                      │ vector ∥ graph ∥ keyword │  (RRF)   (time/MMR)  (optional)

Search modes

The mode kwarg picks which channels run:

Mode	Channels	Best for
`VECTOR`	Semantic similarity only	“What’s similar to X?”
`GRAPH`	Entity-relationship traversal only	“Who works with X?”
`KEYWORD`	BM25 / full-text only	Exact terms, names, acronyms
`HYBRID` (default)	Vector + graph, fused via RRF (keyword channel opt-in)	Balanced, the usual choice
`ALL`	Every channel	Effectively the same as HYBRID today

VectorCypher fuses the vector and graph channels per query; the BM25 keyword channel is opt-in (enable_bm25_channel). See the search-modes guidance for picking one.

How a query flows

1. Understand: one LLM call classifies intent, extracts entity mentions and temporal references (resolved to ISO-8601), proposes per-query fusion weights, and scores complexity. This shapes everything downstream.
2. Link entities: query mentions are matched to stored entities by exact match, fuzzy match (edit distance), and embedding similarity. Matches seed graph traversal.
3. Search: the vector (pgvector) and graph (Neo4j) channels run in parallel. The BM25 keyword channel joins them only when enable_bm25_channel is set.
4. Fuse with RRF: Reciprocal Rank Fusion combines the channels by rank, not score, because scores aren’t comparable across channels: score = Σ weight / (k + rank), with k=60 and default weights of 0.6 for vector and 0.4 for graph (the keyword weight, 0.3, applies only when the BM25 channel is enabled). Chunks surfaced by multiple channels rise to the top.
5. Filter: apply temporal windows and (optionally) MMR diversity.
6. Rerank: an optional cross-encoder reorders the top candidates. It is skipped when there are fewer than 5 results, where RRF order is already fine.
7. Limit: return the top-k with full provenance.
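The fusion formula in the RRF step can be worked through by hand. The following is a minimal sketch of weighted RRF using the k and default weights quoted above; it is not Khora's implementation. The function name, the input layout (channel name mapped to an ordered list of chunk ids), and the choice to start ranks at 1 are assumptions made for illustration.

```python
from collections import defaultdict

RRF_K = 60                                        # k from the formula above
DEFAULT_WEIGHTS = {"vector": 0.6, "graph": 0.4}   # default weights from the text


def rrf_fuse(rankings, weights=DEFAULT_WEIGHTS, k=RRF_K):
    """Fuse per-channel rankings with weighted Reciprocal Rank Fusion.

    rankings maps a channel name to chunk ids ordered best first.
    Ranks are assumed to start at 1 (hypothetical; the source does not say).
    """
    scores = defaultdict(float)
    for channel, ranked_ids in rankings.items():
        weight = weights.get(channel, 0.0)
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] += weight / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


rankings = {
    "vector": ["c1", "c2", "c3"],   # c3 is only third by similarity...
    "graph": ["c3", "c4"],          # ...but first in the graph channel
}
fused = rrf_fuse(rankings)
print([chunk_id for chunk_id, _ in fused])
```

Here c3 scores 0.6/63 + 0.4/61, which is more than the 0.6/61 that c1 earns from the vector channel alone, so the chunk found by both channels moves to the top even though neither raw score was ever compared.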

Recall filters

These are the public knobs on recall(). Everything else is global config (KhoraConfig.query):

from khora import SearchMode

result = await kb.recall(
    "product updates",
    namespace=ns_id,
    limit=10,
    mode=SearchMode.HYBRID,
    min_similarity=0.0,                                      # cast a wide net (see below)
    filter={"source_name": "linear", "metadata.tier": {"$in": ["gold", "silver"]}},
)

limit: cap the response at the engine level (cheaper than over-fetching).
min_similarity: raw cosine cutoff on the semantic channel, applied before normalization (a real quality gate, unlike thresholding chunk.score).
mode: the channel selection above.
filter: a deterministic RecallFilter: exact predicates on system fields and metadata, evaluated as a hard gate alongside the ranking. It’s also the supported way to bound recall by time: filter={"occurred_at": {"$gte": ...}}.
start_time / end_time: deprecated recency window. Prefer filter= (the two can’t be combined). See Filtering by time.
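To make the hard-gate semantics of filter concrete, here is a small sketch of how such predicates could be evaluated against a single record. It is an illustration only, not Khora's RecallFilter code: the record layout, the helper names, and the dotted-path lookup are assumptions, and only the two operators that appear in this section ($in and $gte) are handled.

```python
from datetime import datetime, timezone


def _lookup(record, dotted_key):
    """Resolve a key such as 'metadata.tier' against nested dicts."""
    value = record
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def matches(record, flt):
    """Return True only if the record satisfies every predicate (a hard gate)."""
    for key, condition in flt.items():
        actual = _lookup(record, key)
        if isinstance(condition, dict):
            for op, expected in condition.items():
                if op == "$in":
                    if actual not in expected:
                        return False
                elif op == "$gte":
                    if actual is None or actual < expected:
                        return False
                else:
                    raise ValueError(f"unsupported operator: {op}")
        elif actual != condition:   # a bare value means exact equality
            return False
    return True


# Hypothetical record shape, used only for this example.
chunk = {
    "source_name": "linear",
    "occurred_at": datetime(2024, 5, 1, tzinfo=timezone.utc),
    "metadata": {"tier": "gold"},
}

tier_filter = {"source_name": "linear", "metadata.tier": {"$in": ["gold", "silver"]}}
time_filter = {"occurred_at": {"$gte": datetime(2024, 6, 1, tzinfo=timezone.utc)}}

print(matches(chunk, tier_filter))   # passes both predicates
print(matches(chunk, time_filter))   # occurred before the lower bound
```

A record that fails any predicate is excluded no matter how well it ranks, which is what distinguishes a deterministic filter from a similarity threshold.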

Threshold philosophy: cast a wide net, rank carefully. recall()’s min_similarity default is 0.0 on purpose. Khora’s strength is the ranking pipeline (RRF + entity boosting + reranking), which works better with more candidates. A 0.35-similarity chunk that’s the right answer beats zero results. Raise min_similarity only when you want strict, high-confidence-only matches. (An earlier 0.5 default caused ~25% of queries to return nothing. Lowering it was the fix.)
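The trade-off behind the 0.0 default can be seen with two candidates. This is a toy sketch, not engine code: the function name, the (chunk id, raw cosine) tuples, and the inclusive comparison are assumptions.

```python
def apply_min_similarity(candidates, min_similarity=0.0):
    """Keep candidates whose raw cosine meets the cutoff (assumed inclusive)."""
    return [(chunk_id, cosine) for chunk_id, cosine in candidates
            if cosine >= min_similarity]


# Hypothetical raw cosine scores from the semantic channel.
candidates = [("right_answer", 0.35), ("tangent", 0.28)]

wide = apply_min_similarity(candidates)                        # default 0.0
strict = apply_min_similarity(candidates, min_similarity=0.5)  # the earlier default

print(len(wide), len(strict))
```

With the wide net, both candidates reach the ranking pipeline, which can still put the right answer first. With the strict cutoff, the query returns nothing.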

Diversity, reranking, and HyDE

These are global toggles on KhoraConfig.query (env KHORA_QUERY_*), not per-call:

MMR diversity (enable_diversity, default on): Maximal Marginal Relevance removes near-duplicate chunks after fusion. diversity_lambda=0.5 balances relevance against diversity, and the step is Rust-accelerated.
Cross-encoder reranking (enable_reranking): neural reorder of the top candidates for precision.
LLM reranking (enable_llm_reranking): an LLM pass on temporal queries.
HyDE (enable_hyde: auto / always / never): generates a hypothetical answer document and searches with its embedding. In auto mode it fires on complex/temporal queries. A HyDE-Cypher module (KHORA_QUERY_ENABLE_HYDE_CYPHER) for structured “latest X” queries exists but is not yet wired into retrieval, so the flag currently has no effect.

To skip all LLM-side work for latency, set enable_llm_reranking=False and enable_hyde="never". The engine also adapts how many chunks it retrieves to query complexity, from very_focused (≤3 chunks for simple lookups) to broad (15 for multi-hop questions).
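For readers unfamiliar with the MMR step listed above, the following is a textbook greedy MMR sketch in pure Python. It is not the Rust-accelerated implementation the engine uses; the candidate layout (id, relevance, embedding) and function names are assumptions, and only the diversity_lambda=0.5 default comes from the text.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(candidates, limit, diversity_lambda=0.5):
    """Greedy Maximal Marginal Relevance.

    candidates: list of (chunk_id, relevance, embedding) tuples (hypothetical layout).
    Each round picks the candidate with the best trade-off between its own
    relevance and its similarity to anything already selected.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < limit:
        def mmr_score(candidate):
            _, relevance, embedding = candidate
            redundancy = max(
                (cosine(embedding, chosen[2]) for chosen in selected),
                default=0.0,
            )
            return diversity_lambda * relevance - (1 - diversity_lambda) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [chunk_id for chunk_id, _, _ in selected]


candidates = [
    ("a", 0.90, [1.0, 0.0]),
    ("a_dup", 0.89, [1.0, 0.05]),   # nearly identical to "a"
    ("b", 0.70, [0.0, 1.0]),        # less relevant, but different content
]

print(mmr_select(candidates, limit=2))                        # balanced
print(mmr_select(candidates, limit=2, diversity_lambda=1.0))  # relevance only
```

At lambda 0.5 the near-duplicate loses its slot to the less relevant but distinct chunk. At lambda 1.0 the redundancy term vanishes and the selection falls back to pure relevance order.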

Reading the result

recall() returns a RecallResult with chunks, entities, relationships, a deduplicated documents list (every chunk/entity/relationship document_id is guaranteed to appear there), and engine_info.

chunk.score is a normalized rank within the result, not a confidence measure (the top hit is always ~1.0). For “is this corpus actually relevant?”, read result.engine_info:

max_raw_vector_score: raw pre-rerank cosine of the top hit. Below ~0.3 means nothing on-topic; above ~0.5 is a confident match.
abstention_signals: pre-computed “should we even answer?” flags. See Abstention below.
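One way to act on max_raw_vector_score is to bucket it using the approximate bands quoted above. This is a sketch under assumptions: the helper name and the returned labels are hypothetical, and the 0.3 and 0.5 cut points are the rough guidance from this section rather than fixed engine constants.

```python
def corpus_relevance(engine_info):
    """Bucket the raw top-hit cosine into a coarse relevance label."""
    score = (engine_info or {}).get("max_raw_vector_score")
    if score is None:
        return "unknown"      # e.g. no vector channel ran
    if score < 0.3:
        return "off_topic"    # nothing on-topic in the corpus
    if score > 0.5:
        return "confident"    # a confident match
    return "uncertain"        # in between: inspect the chunks before trusting them


print(corpus_relevance({"max_raw_vector_score": 0.62}))
print(corpus_relevance({"max_raw_vector_score": 0.21}))
```

In application code you would pass result.engine_info. Unlike chunk.score, this label can differ between a query the corpus covers and one it does not.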

See the grounded-answers and support-ticket graph examples.

Abstention

Sometimes a search turns up nothing solid because the corpus simply doesn’t cover the question. Abstaining just means choosing to say “I don’t know” instead of guessing. Rather than let your app answer confidently from weak matches, Khora pre-computes a set of “should we decline to answer?” flags in result.engine_info["abstention_signals"]:

chunks_empty: no matching text was found.
entities_empty: no matching entities were found.
chunks_below_min: fewer chunks matched than the minimum worth answering from.
top_score_low: even the best match scored low.
combined_score: a single 0–1 blend of the signals above.
should_abstain: the overall verdict, set when the results are too thin to answer from.

result = await kb.recall(query, namespace=ns)
if (result.engine_info or {}).get("abstention_signals", {}).get("should_abstain"):
    return "I don't have enough in memory to answer that confidently."

top_score_low is computed from the raw, pre-rerank similarity (max_raw_vector_score), not the final reranked score. Reranking squeezes every result into a narrow high band (even off-topic ones), so the raw score is the honest signal of whether anything actually matched. (A graph-only recall, where nothing matched by text, therefore reads top_score_low = true.)
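When an app does abstain, it can be useful to log which signals fired. The helper below is a sketch that only reads the keys listed above from a plain dict; the function name is hypothetical, and it assumes the individual flags are booleans.

```python
def abstention_reasons(engine_info):
    """Return the names of the flags that fired, or [] if no abstention is advised."""
    signals = (engine_info or {}).get("abstention_signals") or {}
    if not signals.get("should_abstain"):
        return []
    flags = ["chunks_empty", "entities_empty", "chunks_below_min", "top_score_low"]
    return [name for name in flags if signals.get(name)]


# Hypothetical engine_info payload for a graph-only recall with thin results.
engine_info = {
    "abstention_signals": {
        "chunks_empty": False,
        "entities_empty": True,
        "chunks_below_min": True,
        "top_score_low": True,
        "should_abstain": True,
    }
}

print(abstention_reasons(engine_info))
```

Passing result.engine_info from a real recall would work the same way, and an empty list means the engine did not recommend abstaining.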

Render a flat LLM-context string with khora.context_text(result, max_chunks=...). For multi-step exploration, use khora.query.agentic.AgenticSearchAgent directly. Agentic search isn’t exposed on recall().

Ingestion

The write path that builds what retrieval searches over.

Engine tuning

Per-engine retrieval knobs and how to tune fusion, decay, and reranking.
