recall() is not a single similarity lookup. Khora understands the query, searches multiple backends in parallel, fuses the rankings, and optionally reranks, then returns a typed RecallResult.
query ─▶ ① understand ─▶ ② link entities ─▶ ③ search ─▶ ④ fuse ─▶ ⑤ filter ─▶ ⑥ rerank ─▶ ⑦ limit
                                      │ vector ∥ graph ∥ keyword │  (RRF)   (time/MMR)  (optional)

Search modes

The mode kwarg picks which channels run:
Mode | Channels | Best for
VECTOR | Semantic similarity only | “What’s similar to X?”
GRAPH | Entity-relationship traversal only | “Who works with X?”
KEYWORD | BM25 / full-text only | Exact terms, names, acronyms
HYBRID (default) | Vector + graph + keyword, fused via RRF | Balanced, the usual choice
ALL | Every channel | Effectively the same as HYBRID today
VectorCypher populates all three channels and fuses them per query. See the search-modes guidance for picking one.

How a query flows

  1. Understand: one LLM call classifies intent, extracts entity mentions and temporal references (resolved to ISO-8601), proposes per-query fusion weights, and scores complexity. This shapes everything downstream.
  2. Link entities: query mentions are matched to stored entities by exact match, fuzzy match (edit distance), and embedding similarity. Matches seed graph traversal.
  3. Search: vector (pgvector), graph (Neo4j), and keyword (BM25) channels run in parallel.
  4. Fuse with RRF: Reciprocal Rank Fusion combines the rankings by rank, not score (scores aren’t comparable across channels): score = Σ weight / (k + rank), with k=60 and default weights vector 0.5 / graph 0.3 / keyword 0.2. Chunks surfaced by multiple channels rise to the top.
  5. Filter: apply temporal windows and (optionally) MMR diversity.
  6. Rerank: an optional cross-encoder reorders the top candidates. It is skipped when there are fewer than 5 results, where RRF order is already fine.
  7. Limit: return the top-k with full provenance.
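The fusion formula in step 4 can be sketched in a few lines. This is a minimal illustration of weighted RRF, not Khora's internal code: the function name rrf_fuse, the input shape (a dict of channel name to ranked chunk ids), and the choice to start ranks at 1 are assumptions made for the example. Only k=60 and the 0.5 / 0.3 / 0.2 weights come from the text above.

```python
from collections import defaultdict

RRF_K = 60  # smoothing constant from the formula above
DEFAULT_WEIGHTS = {"vector": 0.5, "graph": 0.3, "keyword": 0.2}


def rrf_fuse(rankings, weights=DEFAULT_WEIGHTS, k=RRF_K):
    """Fuse per-channel rankings (best first) into [(chunk_id, score), ...].

    Hypothetical helper: `rankings` maps a channel name to a list of chunk ids.
    Ranks are assumed to start at 1.
    """
    scores = defaultdict(float)
    for channel, ranked_ids in rankings.items():
        weight = weights.get(channel, 0.0)
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            # Only the rank matters; raw channel scores are never compared.
            scores[chunk_id] += weight / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


rankings = {
    "vector": ["c1", "c2", "c3"],
    "graph": ["c2", "c4"],
    "keyword": ["c2", "c1"],
}
fused = rrf_fuse(rankings)
print([chunk_id for chunk_id, _ in fused])
```

Here c2 is never first in the vector channel, yet it ends up on top because all three channels surface it, which is the "multiple channels rise to the top" behavior described in step 4.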

Recall filters

These are the public knobs on recall(). Everything else is global config (KhoraConfig.query):
from khora import SearchMode
from datetime import datetime, timezone, timedelta

result = await kb.recall(
    "product updates",
    namespace=ns_id,
    limit=10,
    mode=SearchMode.HYBRID,
    min_similarity=0.0,                                      # cast a wide net (see below)
    start_time=datetime.now(timezone.utc) - timedelta(days=30),
    end_time=None,
)
  • limit: cap the response at the engine level (cheaper than over-fetching).
  • min_similarity: raw cosine cutoff on the semantic channel, applied before normalization (a real quality gate, unlike thresholding chunk.score).
  • mode: the channel selection above.
  • start_time / end_time: explicit temporal window. Setting it bypasses NLP date detection and is honored on all three engines. Pass both bounds as timezone-naive datetimes or both as timezone-aware ones; don't mix the two.
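The naive/aware rule for start_time and end_time can be checked on the caller's side before the request is sent. The helper below is a hypothetical pre-flight check built only on the standard library; check_window is not part of Khora, and how recall() itself reacts to mixed bounds is not specified here.

```python
from datetime import datetime, timezone, timedelta


def is_aware(dt):
    """True if dt carries a usable UTC offset (the stdlib definition of 'aware')."""
    return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None


def check_window(start_time=None, end_time=None):
    """Raise ValueError if the two bounds mix naive and aware datetimes.

    Hypothetical helper: open-ended windows (either bound None) always pass.
    """
    if start_time is None or end_time is None:
        return
    if is_aware(start_time) != is_aware(end_time):
        raise ValueError("start_time and end_time must both be naive or both be aware")


now = datetime.now(timezone.utc)
check_window(now - timedelta(days=30), now)   # both aware: fine
check_window(now - timedelta(days=30), None)  # open-ended: fine
```

Using datetime.now(timezone.utc) for both bounds, as the recall() example above does for start_time, keeps the window aware throughout.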
Threshold philosophy: cast a wide net, rank carefully. recall()’s min_similarity default is 0.0 on purpose. Khora’s strength is the ranking pipeline (RRF + entity boosting + reranking), which works better with more candidates. A 0.35-similarity chunk that’s the right answer beats zero results. Raise min_similarity only when you want strict, high-confidence-only matches. (An earlier 0.5 default caused ~25% of queries to return nothing. Lowering it was the fix.)

Diversity, reranking, and HyDE

These are global toggles on KhoraConfig.query (env KHORA_QUERY_*), not per-call:
  • MMR diversity (enable_diversity, default on): Maximal Marginal Relevance removes near-duplicate chunks after fusion (diversity_lambda=0.7 balances relevance vs. diversity, Rust-accelerated).
  • Cross-encoder reranking (enable_reranking): neural reorder of the top candidates for precision.
  • LLM reranking (enable_llm_reranking): an LLM pass on temporal queries.
  • HyDE (enable_hyde: auto / always / never) generates a hypothetical answer doc and searches its embedding. In auto it fires on complex/temporal queries. An opt-in HyDE-Cypher channel (KHORA_QUERY_ENABLE_HYDE_CYPHER) runs parameterized graph templates for structured “latest X” queries.
To skip all LLM-side work for latency, set enable_llm_reranking=False and enable_hyde="never". The engine also adapts how many chunks it retrieves to query complexity, from very_focused (≤3 chunks for simple lookups) to broad (15 for multi-hop questions).
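To make the diversity_lambda trade-off concrete, here is a small pure-Python sketch of textbook greedy MMR selection. It is not Khora's Rust-accelerated implementation; the function names, the (chunk_id, relevance, embedding) tuple shape, and the use of cosine similarity for redundancy are assumptions for illustration. Only the 0.7 default comes from the text above.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(candidates, limit, lam=0.7):
    """Greedy MMR over (chunk_id, relevance, embedding) tuples; returns chunk ids.

    Each step picks the candidate maximizing
    lam * relevance - (1 - lam) * max similarity to anything already picked.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < limit:
        def mmr_score(item):
            _, relevance, emb = item
            redundancy = max((cosine(emb, s[2]) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [chunk_id for chunk_id, _, _ in selected]


candidates = [
    ("a", 0.95, [1.0, 0.0]),
    ("a_dup", 0.94, [1.0, 0.0]),  # near-duplicate of "a"
    ("b", 0.80, [0.0, 1.0]),
]
picked = mmr_select(candidates, limit=2)
print(picked)
```

With lam=0.7 the near-duplicate loses its second-place slot to the less relevant but distinct chunk. Pushing lam toward 1.0 weights relevance more heavily and lets duplicates back in.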

Reading the result

recall() returns a RecallResult with chunks, entities, relationships, a deduplicated documents list (every chunk/entity/relationship document_id is guaranteed to appear there), and engine_info.
chunk.score is a normalized rank within the result, not a confidence measure (the top hit is always ~1.0). For “is this corpus actually relevant?”, read result.engine_info:
  • max_raw_vector_score: raw pre-rerank cosine of the top hit. Below ~0.3 means nothing on-topic; above ~0.5 is a confident match.
  • abstention_signals: pre-computed “should we even answer?” flags. See Abstention below.
See the grounded-answers and support-ticket graph examples.
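One way to act on max_raw_vector_score is to bucket it with the rough thresholds quoted above. The helper below is a hypothetical sketch: relevance_band and its band names are not part of Khora, and it assumes engine_info behaves like a dict (or is None), as the abstention snippet further down also does.

```python
def relevance_band(engine_info, low=0.3, high=0.5):
    """Map max_raw_vector_score to a coarse band.

    Returns 'off_topic', 'uncertain', 'confident', or 'unknown' when the
    score is absent. Thresholds are the approximate values from the text,
    not hard guarantees.
    """
    score = (engine_info or {}).get("max_raw_vector_score")
    if score is None:
        return "unknown"
    if score < low:
        return "off_topic"
    if score > high:
        return "confident"
    return "uncertain"


print(relevance_band({"max_raw_vector_score": 0.62}))
print(relevance_band({"max_raw_vector_score": 0.12}))
```

In an application this would be called as relevance_band(result.engine_info) after recall(), instead of thresholding chunk.score.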

Abstention

Sometimes a search turns up nothing solid. The corpus simply doesn’t cover the question. Rather than let your app answer confidently from weak matches, Khora pre-computes a set of “should we decline to answer?” flags. Abstaining just means choosing to say “I don’t know” instead of guessing. They live in result.engine_info["abstention_signals"]:
  • chunks_empty: no matching text was found.
  • entities_empty: no matching entities were found.
  • chunks_below_min: fewer chunks matched than the minimum worth answering from.
  • top_score_low: even the best match scored low.
  • combined_score: a single 0–1 blend of the signals above.
  • should_abstain: the overall verdict. The results are too thin to answer from.
result = await kb.recall(query, namespace=ns)
if (result.engine_info or {}).get("abstention_signals", {}).get("should_abstain"):
    return "I don't have enough in memory to answer that confidently."
top_score_low is computed from the raw, pre-rerank similarity (max_raw_vector_score), not the final reranked score. Reranking squeezes every result into a narrow high band (even off-topic ones), so the raw score is the honest signal of whether anything actually matched. (A graph-only recall, where nothing matched by text, therefore reads top_score_low = true.)
Render a flat LLM-context string with khora.context_text(result, max_chunks=...). For multi-step exploration, use khora.query.agentic.AgenticSearchAgent directly. Agentic search isn’t exposed on recall().

Ingestion

The write path that builds what retrieval searches over.

Engine tuning

Per-engine retrieval knobs and how to tune fusion, decay, and reranking.