Performance & scaling

Khora’s pipelines are built around batching and parallelism. Embedding and extraction run concurrently, database writes are batched, and identical queries are cached. This page covers the levers you control when throughput or latency matters.

Large initial loads: bulk mode

For a one-time bulk import, bulk_mode=True trades steady-state consistency for write throughput: pgvector defers HNSW index creation until after the load, and Neo4j uses larger batches with deferred constraints. When the load finishes, rebuild the indexes and re-enable the constraints:

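from khora import KhoraConfig  # assumed import path; adjust to where KhoraConfig lives in your install
from khora.storage import StorageSettings
from khora.storage.optimize import ensure_hnsw_indexes

# Enable bulk mode for the initial load only.
config = KhoraConfig(storage=StorageSettings(bulk_mode=True))
# ... ingest everything ...

# After the load: `engine` and `neo4j_backend` are the storage handles from your existing setup.
await ensure_hnsw_indexes(engine, schema="public")   # idempotent; ef_construction=128
await neo4j_backend.ensure_constraints()              # re-enable Neo4j constraints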

Bulk mode is for initial loading, not production traffic: it relaxes write-time validation. Turn it off for steady state.

Large ongoing ingests: prefer batch APIs

Use remember_batch or submit_batch instead of calling remember() in a loop. Both delegate to the staged batch pipeline, which gives you:

Smart-mode entity resolution (the default): per-document dedup is an O(1) index lookup. One cross-document resolution pass runs after all docs using token blocking (only entities sharing a name token are compared), turning what was an O(n²)-per-doc cost into O(n·k). This is what lets ingestion scale past tens of thousands of documents without stalling.
Concurrent embed + extract, concurrent LLM extraction batches (≈5 chunks/call), and concurrent embedding sub-batches.
Batch writes: UNWIND + MERGE entity upserts and UNWIND + CREATE relationship writes in Neo4j, plus multi-row INSERT … ON CONFLICT in PostgreSQL, collapsing N+1 patterns into ceil(N/batch_size) round-trips.
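The token blocking idea from the first item can be illustrated with a small standalone sketch. This is not Khora's implementation: candidate_pairs and the whitespace tokenizer are assumptions made for illustration. It only shows why blocking shrinks the comparison set: entities are grouped by name token, and pairs are generated only inside a group.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(names: list[str]) -> set[tuple[int, int]]:
    """Return index pairs of names that share at least one lowercase token.

    Hypothetical helper for illustration; tokenization is a plain whitespace split.
    """
    blocks: dict[str, list[int]] = defaultdict(list)
    for idx, name in enumerate(names):
        # set() so a name with a repeated token joins each block only once
        for token in set(name.lower().split()):
            blocks[token].append(idx)

    pairs: set[tuple[int, int]] = set()
    for members in blocks.values():
        # Only entities inside the same block are ever compared.
        pairs.update(combinations(members, 2))
    return pairs


names = ["Ada Lovelace", "A. Lovelace", "Alan Turing", "Turing Institute", "Grace Hopper"]
pairs = candidate_pairs(names)
all_pairs = len(names) * (len(names) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

Here only the two "Lovelace" names and the two "Turing" names are compared. A real resolver would then score each candidate pair (for example with a string distance) before merging.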

Tune ingest concurrency with max_concurrent on the batch call, and size the global processor pool with KHORA_PIPELINES_PENDING_PROCESSOR_MAX_CONCURRENT.
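As a rough mental model for what a concurrency cap like max_concurrent does, the sketch below bounds in-flight work with asyncio.Semaphore. It is an assumption about the general technique, not a description of Khora's internals; ingest_all and process_doc are hypothetical stand-ins.

```python
import asyncio


async def ingest_all(docs: list[str], max_concurrent: int) -> tuple[list[str], int]:
    """Process docs with at most `max_concurrent` in flight.

    Returns the results (in input order) and the observed peak concurrency.
    """
    gate = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def process_doc(doc: str) -> str:
        nonlocal in_flight, peak
        async with gate:  # waits here once max_concurrent docs are running
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for embedding / extraction I/O
            in_flight -= 1
            return doc.upper()

    results = await asyncio.gather(*(process_doc(d) for d in docs))
    return list(results), peak


results, peak = asyncio.run(ingest_all([f"doc-{i}" for i in range(10)], max_concurrent=3))
print(f"processed {len(results)} docs, peak concurrency {peak}")
```

Raising the cap helps only until a downstream limit (LLM rate limits, the database pool) becomes the bottleneck, so it is worth increasing it in steps and watching throughput.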

Concurrency & connection pooling

When PostgreSQL, pgvector, and the event store share one database URL (the common case), StorageFactory caches the engine by normalized URL, so all three reuse a single connection pool. That is a third of the connections that three independent pools would open. Pools are sized for concurrent operation (by default, Postgres pool 20 with overflow 30, and Neo4j pool 50). Neo4j entity writes go through a key-aware gate: batches touching the same (namespace_id, name, entity_type) are serialized to avoid lock-contention retries, while non-overlapping batches stay concurrent.
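One way to picture a key-aware gate is a lock per entity key, acquired in a fixed order. The sketch below is a minimal illustration under that assumption; KeyedGate and its hold method are hypothetical names, and Khora's actual gate may work differently.

```python
import asyncio
from collections import defaultdict
from contextlib import asynccontextmanager

EntityKey = tuple[str, str, str]  # (namespace_id, name, entity_type)


class KeyedGate:
    """Serialize batches that share an entity key; let disjoint batches overlap."""

    def __init__(self) -> None:
        self._locks: dict[EntityKey, asyncio.Lock] = defaultdict(asyncio.Lock)

    @asynccontextmanager
    async def hold(self, keys: set[EntityKey]):
        acquired = []
        try:
            # Sorted acquisition order prevents deadlock between overlapping batches.
            for key in sorted(keys):
                lock = self._locks[key]
                await lock.acquire()
                acquired.append(lock)
            yield
        finally:
            for lock in reversed(acquired):
                lock.release()


async def demo() -> list[str]:
    gate = KeyedGate()
    log: list[str] = []

    async def write(label: str, keys: set[EntityKey]) -> None:
        async with gate.hold(keys):
            log.append(f"start {label}")
            await asyncio.sleep(0.01)  # stand-in for the Neo4j write
            log.append(f"end {label}")

    ada = ("ns1", "Ada Lovelace", "person")
    alan = ("ns1", "Alan Turing", "person")
    # a and b touch the same key, c touches a different one.
    await asyncio.gather(write("a", {ada}), write("b", {ada}), write("c", {alan}))
    return log


log = asyncio.run(demo())
print(log)
```

Batches a and b run one after the other because they share a key, while c starts before a finishes. This sketch never removes locks for keys that are no longer in use; a long-running service would need to prune them.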

Query latency

Query cache: an LRU+TTL cache keyed on (query, namespace, mode) short-circuits identical repeat queries (default 5-minute TTL).
Reranking skip: the cross-encoder is skipped when there are fewer than 5 candidate chunks, where the reciprocal rank fusion (RRF) order is already good enough. This saves seconds on sparse results.
HNSW ef_search: the pgvector backend raises ef_search to 200 per transaction (SET LOCAL) for better recall at negligible latency cost.
Skip LLM-side work: set enable_llm_reranking=False and enable_hyde="never" on KhoraConfig.query when you want the fastest path.
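To make the query cache behavior concrete, here is a minimal LRU+TTL cache keyed on a (query, namespace, mode) tuple. It is a sketch of the general technique, not Khora's cache: the class name, the max_size default, the injectable clock, and the "hybrid" mode value are all assumptions for illustration.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class LruTtlCache:
    """Small LRU cache whose entries also expire after ttl_seconds."""

    def __init__(
        self,
        max_size: int = 256,
        ttl_seconds: float = 300.0,  # 5 minutes, matching the default TTL above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: OrderedDict = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: Hashable, value: Any) -> None:
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict least recently used


now = [0.0]  # fake clock so the example does not have to wait
cache = LruTtlCache(max_size=2, ttl_seconds=300.0, clock=lambda: now[0])
key = ("what is khora?", "ns1", "hybrid")
cache.put(key, ["chunk-1"])
hit = cache.get(key)   # inside the TTL window
now[0] = 301.0
miss = cache.get(key)  # past the TTL window
```

Because the key includes the namespace and mode, the same query text in another namespace or mode is a separate entry. Note that a cached answer can be up to one TTL stale after new documents are ingested.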

CPU-bound work: Rust acceleration

At scale, the remaining hot spots (cosine similarity, Levenshtein distance, PageRank, BM25, keyword extraction) are CPU-bound. The optional Rust layer accelerates them 5–40× with automatic fallback to NumPy or pure Python, and it releases the GIL so asyncio I/O keeps flowing.
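The fallback chain can be pictured as a tiered import. The sketch below shows the pattern for cosine similarity only; the module name khora_native_example is hypothetical and exists purely to illustrate the try/except structure, and the real package may wire this up differently.

```python
import math
from typing import Sequence

try:
    # Hypothetical native module name, used only to illustrate the pattern.
    from khora_native_example import cosine_similarity as _cosine

    BACKEND = "native"
except ImportError:
    try:
        import numpy as np

        def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
            va = np.asarray(a, dtype=float)
            vb = np.asarray(b, dtype=float)
            denom = float(np.linalg.norm(va) * np.linalg.norm(vb))
            return float(va @ vb) / denom if denom else 0.0

        BACKEND = "numpy"
    except ImportError:

        def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / denom if denom else 0.0

        BACKEND = "python"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dispatch to whichever backend was importable at load time."""
    return _cosine(a, b)


print(BACKEND, cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

Callers use the same function regardless of backend, so installing the native layer changes speed but not code. The sketch does not demonstrate the GIL release, which only a compiled extension can provide.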

Picking the right backend

Performance is also an architecture choice: the production PostgreSQL + pgvector + Neo4j stack trades infrastructure for multi-hop graph recall and scale, while the embedded sqlite_lance stack keeps everything in-process for local development and evaluation. See VectorCypher and Storage backends.

Rust acceleration

The native layer behind the CPU-bound speedups.

Engine tuning

Per-engine knobs for extraction ratio, fusion weights, and decay.