Large initial loads: bulk mode
For a one-time bulk import, bulk_mode=True trades steady-state consistency for write
throughput: pgvector defers HNSW index creation until after the load, and Neo4j
uses larger batches with deferred constraints. Rebuild the indexes when the load
finishes.
Large ongoing ingests: prefer batch APIs
Use remember_batch / submit_batch, not remember() in
a loop. They delegate to the staged batch pipeline, which gives you:
- Smart-mode entity resolution (the default): per-document dedup is an O(1) index lookup. One cross-document resolution pass runs after all docs using token blocking (only entities sharing a name token are compared), turning what was an O(n²)-per-doc cost into O(n·k). This is what lets ingestion scale past tens of thousands of documents without stalling.
- Concurrent embed + extract, concurrent LLM extraction batches (≈5 chunks/call), and concurrent embedding sub-batches.
- Batch writes: UNWIND + MERGE entity upserts and UNWIND + CREATE relationship writes in Neo4j, plus multi-row INSERT … ON CONFLICT in PostgreSQL, collapsing N+1 patterns into ceil(N/batch_size) round-trips.
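To make the token-blocking idea concrete, here is a minimal sketch of how candidate pairs can be restricted to entities that share a name token. It is an illustration only, not the pipeline's actual code: the function name candidate_pairs, the whitespace tokenizer, and the sample names are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(names):
    """Return index pairs of entity names that share at least one token."""
    # Inverted index: token -> indices of the names containing that token.
    blocks = defaultdict(set)
    for idx, name in enumerate(names):
        for token in name.lower().split():
            blocks[token].add(idx)

    # Only names that landed in the same block are ever compared.
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical sample data.
names = ["Acme Corp", "ACME Corporation", "Globex", "Globex Inc", "Initech"]
pairs = candidate_pairs(names)
all_pairs = len(names) * (len(names) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

With five names, an all-pairs comparison would check 10 pairs; blocking leaves only the two pairs that share a token. A real resolver would then run a similarity check on each surviving pair.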
Tune parallelism with max_concurrent on the batch call (and the global
processor pool via KHORA_PIPELINES_PENDING_PROCESSOR_MAX_CONCURRENT).
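A concurrency cap of this kind is commonly implemented with a semaphore. The sketch below shows that general pattern with asyncio; it is not taken from the library, and run_bounded, fake_extract, and the default limit are hypothetical names and values chosen for illustration.

```python
import asyncio


async def run_bounded(items, worker, max_concurrent=4):
    """Run worker(item) for every item with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_extract(doc):
        # Stand-in for an extraction or embedding call; tracks overlap.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return doc.upper()

    docs = ["a", "b", "c", "d", "e"]
    results = await run_bounded(docs, fake_extract, max_concurrent=2)
    return results, peak


results, peak = asyncio.run(demo())
```

Raising the limit increases throughput until a downstream resource (LLM rate limits, database connections) becomes the bottleneck, so the cap is best chosen with the pool sizes in mind.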
Concurrency & connection pooling
When PostgreSQL, pgvector, and the event store share one database URL (the common case), StorageFactory caches the engine by normalized URL so all three reuse one
connection pool, which opens a third of the connections that three independent pools would.
Pools are sized for concurrent operation (Postgres pool 20 / overflow 30; Neo4j pool
50 by default).
Neo4j entity writes use a key-aware gate: batches touching the same
(namespace_id, name, entity_type) are serialized to avoid lock-contention retries,
while non-overlapping batches stay concurrent.
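One way to build such a gate is a lock per entity key, acquired in a fixed order. The sketch below illustrates that approach under stated assumptions; the KeyGate class and its run method are hypothetical and do not describe the library's internals.

```python
import asyncio
from collections import defaultdict


class KeyGate:
    """Serialize batches that share a key; let disjoint batches overlap."""

    def __init__(self):
        self._locks = defaultdict(asyncio.Lock)

    async def run(self, keys, write):
        # Acquire in sorted order so two overlapping batches cannot deadlock.
        locks = [self._locks[key] for key in sorted(set(keys))]
        for lock in locks:
            await lock.acquire()
        try:
            return await write()
        finally:
            for lock in reversed(locks):
                lock.release()


async def demo():
    gate = KeyGate()
    log = []

    def make_write(label):
        async def write():
            log.append(f"{label}:start")
            await asyncio.sleep(0.01)  # stand-in for the Neo4j round-trip
            log.append(f"{label}:end")
        return write

    alice = ("ns1", "Alice", "PERSON")  # (namespace_id, name, entity_type)
    bob = ("ns1", "Bob", "PERSON")
    await asyncio.gather(
        gate.run([alice], make_write("A")),
        gate.run([alice], make_write("B")),  # shares a key with A: waits
        gate.run([bob], make_write("C")),    # disjoint keys: overlaps A
    )
    return log


log = asyncio.run(demo())
```

In the resulting log, C starts before A finishes, while B starts only after A has released the shared key.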
Query latency
- Query cache: an LRU+TTL cache keyed on (query, namespace, mode) short-circuits identical repeat queries (default 5-minute TTL).
- Reranking skip: the cross-encoder is skipped below 5 candidate chunks, where RRF order is already fine. This saves seconds on sparse results.
- HNSW ef_search: the pgvector backend raises ef_search to 200 per transaction (SET LOCAL) for better recall at negligible latency cost.
- Skip LLM-side work: set enable_llm_reranking=False and enable_hyde="never" on KhoraConfig.query when you want the fastest path.
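For readers unfamiliar with the LRU+TTL combination, the following is a minimal sketch of such a cache keyed on a (query, namespace, mode) tuple. It is not the library's cache: the LruTtlCache class, its parameters, the injected clock, and the "hybrid" mode value are assumptions made for the example.

```python
import time
from collections import OrderedDict


class LruTtlCache:
    """Small cache that evicts by recency (LRU) and by age (TTL)."""

    def __init__(self, max_size=128, ttl_seconds=300.0, clock=time.monotonic):
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # drop least recently used


# A fake clock makes the TTL behaviour visible without sleeping.
now = [0.0]
cache = LruTtlCache(max_size=2, ttl_seconds=300.0, clock=lambda: now[0])
key = ("what is khora?", "docs", "hybrid")
cache.put(key, ["chunk-1"])
hit = cache.get(key)   # inside the 5-minute window
now[0] = 301.0
miss = cache.get(key)  # past the TTL
```

The first lookup is served from the cache; the second falls through because the entry has aged out, which is when a real query would run again.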
CPU-bound work: Rust acceleration
At scale, the remaining hot spots (cosine similarity, Levenshtein, PageRank, BM25, keyword extraction) are CPU-bound. The optional Rust layer accelerates them 5–40× with automatic NumPy/pure-Python fallback, and frees the GIL so asyncio I/O keeps flowing.
Picking the right backend
Performance is also an architecture choice: the production PostgreSQL + pgvector + Neo4j stack trades infrastructure for multi-hop graph recall and scale, while the embedded sqlite_lance stack keeps everything in-process for local development and evaluation.
See VectorCypher and Storage backends.
Rust acceleration
The native layer behind the CPU-bound speedups.
Engine tuning
Per-engine knobs for extraction ratio, fusion weights, and decay.