Extraction reads your documents and pulls out entities and relationships.
The two lists every remember() takes, entity_types and relationship_types,
tell Khora which entities and relationships to extract.
These flat lists can work for simple content, but they say nothing about how to describe a type,
when two mentions are the same entity, or what to infer.
ExpertiseConfig is the next level up: a
complete, reusable domain ontology that tells Khora not only which types to
extract but how to describe them, how to recognise the same entity across sources,
what new edges to infer, how confident to be, and what prompt to extract with.
# The minimum every remember() needs: the types as anonymous, inline lists.
await kb.remember(text, namespace=ns,
                  entity_types=["PERSON", "ORG"], relationship_types=["WORKS_AT"])
The sections below build a reusable ExpertiseConfig (call it ontology) and pass
it via expertise= on the same call. The section "A minimal ontology in Python"
shows the whole define-then-use flow in one block.
What an ontology bundles
One ExpertiseConfig carries all of this, most of it optional, with sensible defaults:
| Piece | What it controls |
|---|---|
| entity_types | The types to extract, each with an attribute schema, identifiers, and aliases |
| relationship_types | Edge types, with source/target constraints and direction |
| correlation_rules | Cross-source entity unification (merge the same person from Slack and email) |
| inference_rules | New edges derived from existing ones (a when → then graph-pattern DSL) |
| confidence | Per-ontology thresholds that filter low-confidence output |
| expansion | How aggressively to unify entities and infer relationships |
| system_prompt / extraction_prompt | The Jinja2 prompt templates the extractor renders |
| events / facts | Event / atomic-fact extraction toggles (default on) |
| name / version / extends | Identity, versioning, and inheritance from other ontologies |
A minimal ontology in Python
The three building blocks (ExpertiseConfig, EntityTypeConfig,
RelationshipTypeConfig) are importable straight from khora:
from khora import ExpertiseConfig, EntityTypeConfig, RelationshipTypeConfig

ontology = ExpertiseConfig(
    name="product_engineering",
    description="People, teams, and services in a product org.",
    system_prompt="Extract the people, teams, and services discussed in engineering docs.",
    entity_types=[
        EntityTypeConfig(name="PERSON", description="An engineer or stakeholder."),
        EntityTypeConfig(name="TEAM", description="An engineering team."),
        EntityTypeConfig(name="SERVICE", description="A deployable service or component."),
    ],
    relationship_types=[
        RelationshipTypeConfig(name="MEMBER_OF", description="A person belongs to a team.",
                               source_types=["PERSON"], target_types=["TEAM"]),
        RelationshipTypeConfig(name="OWNS", description="A team owns a service.",
                               source_types=["TEAM"], target_types=["SERVICE"]),
    ],
)

await kb.remember(
    text,
    namespace=ns,
    expertise=ontology,
    entity_types=ontology.get_entity_type_names(),  # still required, see below
    relationship_types=ontology.get_relationship_type_names(),
)
entity_types and relationship_types are required on every remember() even
when you pass expertise=. The expertise object doesn’t replace them. Pass
ontology.get_entity_type_names() / get_relationship_type_names() so the lists
stay in sync with the ontology. (The field is name=, not type=.)
Entity types: attributes, identifiers, aliases
An EntityTypeConfig is more than a label. Three fields shape extraction and matching:
attributes: {required: [...], optional: [...]}, a soft schema that nudges the
LLM to pull the fields you care about.
identifiers: the attributes that identify the same entity across documents
(e.g. email for a person, repo_url for a service). These drive deduplication.
aliases: alternative type labels, so a model that emits COMPANY still lands
on your ORGANIZATION type.
- name: PERSON
  description: "An engineer or stakeholder."
  attributes:
    required: [name]
    optional: [email, role]
  identifiers: [email, name]   # two people sharing an email are one entity
  aliases: [ENGINEER, EMPLOYEE]
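To make the alias idea concrete, here is a minimal sketch of how a label emitted by a model could be normalised to a declared type. It is not Khora's implementation: build_alias_index and canonical_type are hypothetical helpers, plain dicts stand in for EntityTypeConfig objects, and the case-insensitive comparison is an assumption.

```python
# Illustrative only: hypothetical helpers, not part of Khora's API.

def build_alias_index(entity_types):
    """Map every declared name and alias (upper-cased) to its canonical type name."""
    index = {}
    for et in entity_types:
        index[et["name"].upper()] = et["name"]
        for alias in et.get("aliases", []):
            index[alias.upper()] = et["name"]
    return index


def canonical_type(label, index):
    """Return the declared type for a label, or None if the label is unknown."""
    # Assumption: labels are compared case-insensitively after trimming.
    return index.get(label.strip().upper())


entity_types = [
    {"name": "PERSON", "aliases": ["ENGINEER", "EMPLOYEE"]},
    {"name": "ORGANIZATION", "aliases": ["COMPANY"]},
]
index = build_alias_index(entity_types)

print(canonical_type("company", index))   # ORGANIZATION
print(canonical_type("Engineer", index))  # PERSON
print(canonical_type("ROBOT", index))     # None
```

An unknown label returns None here. What the real extractor does with such a label is not something this sketch claims to reproduce.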
Relationship types
source_types / target_types constrain which entity types an edge may connect
(["*"] means any). bidirectional: true marks symmetric edges like
COLLABORATES_WITH.
- name: MEMBER_OF
  source_types: [PERSON]
  target_types: [TEAM]
- name: COLLABORATES_WITH
  source_types: [PERSON]
  target_types: [PERSON]
  bidirectional: true
A relationship type’s source_types / target_types must reference entity type
names that exist in the same ontology (or *). An edge that points at an
undeclared type is dropped.
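The constraint check described above can be pictured with a small sketch. This is an assumption-laden illustration rather than Khora's code: filter_edges and allowed are hypothetical helpers, edges are plain dicts, and MENTIONS is a made-up relationship type used only to show the ["*"] wildcard.

```python
# Illustrative only: a hypothetical validator, not Khora's code.

def allowed(entity_type, allowed_types):
    """True if the type is listed, or the list contains the "*" wildcard."""
    return "*" in allowed_types or entity_type in allowed_types


def filter_edges(edges, relationship_types, declared_entity_types):
    """Keep edges whose endpoints are declared and satisfy the type constraints."""
    rules = {rt["name"]: rt for rt in relationship_types}
    kept = []
    for edge in edges:
        rule = rules.get(edge["type"])
        if rule is None:
            continue  # relationship type is not part of the ontology
        if edge["source_type"] not in declared_entity_types:
            continue  # source points at an undeclared entity type
        if edge["target_type"] not in declared_entity_types:
            continue  # target points at an undeclared entity type
        if allowed(edge["source_type"], rule["source_types"]) and allowed(
            edge["target_type"], rule["target_types"]
        ):
            kept.append(edge)
    return kept


relationship_types = [
    {"name": "MEMBER_OF", "source_types": ["PERSON"], "target_types": ["TEAM"]},
    {"name": "MENTIONS", "source_types": ["*"], "target_types": ["*"]},  # hypothetical
]
declared = {"PERSON", "TEAM"}
edges = [
    {"type": "MEMBER_OF", "source_type": "PERSON", "target_type": "TEAM"},  # kept
    {"type": "MEMBER_OF", "source_type": "TEAM", "target_type": "PERSON"},  # wrong direction
    {"type": "MENTIONS", "source_type": "PERSON", "target_type": "ROBOT"},  # undeclared type
    {"type": "MENTIONS", "source_type": "TEAM", "target_type": "PERSON"},   # kept via wildcard
]
kept = filter_edges(edges, relationship_types, declared)
print(len(kept))  # 2
```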
Cross-source entity unification (correlation rules)
Correlation rules merge the same real-world entity seen through different sources,
the canonical “the Slack @ada and the Gmail ada@acme.com are one person” problem.
A rule matches on match_fields (or a regex pattern), scoped to entity_types:
correlation_rules:
  - name: dedupe_people_by_email
    description: "People sharing an email are the same entity."
    match_fields: [email]
    entity_types: [PERSON]
    confidence: 0.9
Tie a rule to stable identifiers (email, url, id) at high confidence
(0.85–0.95). Match on names only at lower confidence (0.7–0.8), since names collide.
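As a rough illustration of what a match_fields rule does, the sketch below groups entities that share the same values for the rule's fields. It is not Khora's unification code: correlate is a hypothetical function, the entity dicts and ids are invented, and lower-casing values before comparison is an assumption.

```python
# Illustrative only: a hypothetical grouping step, not Khora's unifier.
from collections import defaultdict


def correlate(entities, rule):
    """Return groups of entity ids that agree on every field in rule['match_fields']."""
    groups = defaultdict(list)
    for ent in entities:
        if ent["type"] not in rule["entity_types"]:
            continue  # the rule is scoped to specific entity types
        values = [ent["attributes"].get(field) for field in rule["match_fields"]]
        if any(v is None for v in values):
            continue  # a missing field cannot produce a match
        # Assumption: values are compared after trimming and lower-casing.
        key = tuple(str(v).strip().lower() for v in values)
        groups[key].append(ent["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


rule = {
    "name": "dedupe_people_by_email",
    "match_fields": ["email"],
    "entity_types": ["PERSON"],
    "confidence": 0.9,
}
entities = [
    {"id": "slack:@ada", "type": "PERSON", "attributes": {"email": "ada@acme.com"}},
    {"id": "gmail:ada", "type": "PERSON", "attributes": {"email": "Ada@Acme.com "}},
    {"id": "slack:@bob", "type": "PERSON", "attributes": {}},
    {"id": "team:payments", "type": "TEAM", "attributes": {"email": "ada@acme.com"}},
]
merge_groups = correlate(entities, rule)
print(merge_groups)  # [['slack:@ada', 'gmail:ada']]
```

The TEAM entity is ignored even though it shares the email, because the rule is scoped to PERSON.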
Inferring new edges (inference rules)
Inference rules derive relationships that were never written down, from ones that
were. Each rule is a when → then: a list of conditions to match, and the edge to
create. The matcher walks chains of relationships and supports three linking shapes:
| Shape | Pattern | Example |
|---|---|---|
| Transitive | prev.target → next.source | A MEMBER_OF Team, Team OWNS Service ⇒ A CONTRIBUTES_TO Service |
| Shared-target | prev.target == next.target | A MEMBER_OF Team, B MEMBER_OF Team ⇒ A COLLABORATES_WITH B |
| Shared-source | prev.source == next.source | A OWNS X, A OWNS Y ⇒ relate X and Y |
then.source / then.target pick which matched entities form the new edge, by ordinal
(first.source, second.target, …):
inference_rules:
  - name: teammates_collaborate          # shared-target
    when:
      - {relationship: MEMBER_OF, source_type: PERSON, target_type: TEAM}
      - {relationship: MEMBER_OF, source_type: PERSON, target_type: TEAM}
    then: {relationship: COLLABORATES_WITH, source: first.source, target: second.source}
    confidence: 0.5
  - name: contribute_to_owned_services   # transitive
    when:
      - {relationship: MEMBER_OF, source_type: PERSON, target_type: TEAM}
      - {relationship: OWNS, source_type: TEAM, target_type: SERVICE}
    then: {relationship: CONTRIBUTES_TO, source: first.source, target: second.target}
    confidence: 0.6
Inference only runs when an ontology with inference_rules is loaded and
expansion.relationship_inference is on with a non-none inference_mode. A plain
remember() with no expertise does no inference. The inferrer logs “No expertise
or inference rules configured, skipping inference” and creates nothing. Keep 2–4
rules per ontology. Broad rules can explode the edge count.
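The three linking shapes can be sketched with a small two-condition matcher. This is only an illustration of the when → then idea, not Khora's inferrer: infer, matches, and resolve are hypothetical helpers, edges are plain dicts, and the explicit link argument is an assumption made so the sketch does not have to guess which shape a rule intends.

```python
# Illustrative only: a two-condition matcher, not Khora's inferrer.
# The explicit `link` argument is an assumption made for this sketch.

LINKS = {
    "transitive": ("target", "source"),     # prev.target -> next.source
    "shared_target": ("target", "target"),  # prev.target == next.target
    "shared_source": ("source", "source"),  # prev.source == next.source
}


def matches(edge, cond):
    """True if an edge has the relationship and endpoint types a condition asks for."""
    return (
        edge["type"] == cond["relationship"]
        and edge["source_type"] == cond["source_type"]
        and edge["target_type"] == cond["target_type"]
    )


def resolve(ref, bound):
    """Turn an ordinal reference such as 'first.source' into an entity name."""
    ordinal, end = ref.split(".")
    return bound[ordinal][end]


def infer(edges, rule, link):
    """Return the set of (source, relationship, target) triples the rule derives."""
    prev_end, next_end = LINKS[link]
    first_cond, second_cond = rule["when"]
    inferred = set()
    for first in edges:
        if not matches(first, first_cond):
            continue
        for second in edges:
            if second is first or not matches(second, second_cond):
                continue
            if first[prev_end] != second[next_end]:
                continue  # the two edges are not linked in the requested shape
            bound = {"first": first, "second": second}
            src = resolve(rule["then"]["source"], bound)
            dst = resolve(rule["then"]["target"], bound)
            if src != dst:
                inferred.add((src, rule["then"]["relationship"], dst))
    return inferred


def edge(source, rel, target, source_type, target_type):
    return {"source": source, "type": rel, "target": target,
            "source_type": source_type, "target_type": target_type}


edges = [
    edge("Ada", "MEMBER_OF", "Payments", "PERSON", "TEAM"),
    edge("Bob", "MEMBER_OF", "Payments", "PERSON", "TEAM"),
    edge("Payments", "OWNS", "billing-api", "TEAM", "SERVICE"),
]
member = {"relationship": "MEMBER_OF", "source_type": "PERSON", "target_type": "TEAM"}
owns = {"relationship": "OWNS", "source_type": "TEAM", "target_type": "SERVICE"}
teammates = {"when": [member, member],
             "then": {"relationship": "COLLABORATES_WITH",
                      "source": "first.source", "target": "second.source"}}
contribute = {"when": [member, owns],
              "then": {"relationship": "CONTRIBUTES_TO",
                       "source": "first.source", "target": "second.target"}}

collab = infer(edges, teammates, "shared_target")
contrib = infer(edges, contribute, "transitive")
print(sorted(collab))
print(sorted(contrib))
```

Note that the shared-target rule yields an edge in each direction (Ada to Bob and Bob to Ada) in this sketch, which hints at why broad rules can multiply edges quickly.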
Confidence thresholds
The ontology carries its own thresholds, and the extractor drops anything below them,
a per-ontology precision knob:
confidence:
  min_entity: 0.5         # extracted entities below this are discarded
  min_relationship: 0.5   # extracted relationships below this are discarded
  min_inferred: 0.4       # inferred relationships below this are discarded
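A minimal sketch of how such thresholds act as a filter, assuming each extracted item carries a kind and a confidence score. apply_thresholds and the item dicts are hypothetical and are not Khora's internal representation.

```python
# Illustrative only: a hypothetical filter, not Khora's extractor.

def apply_thresholds(items, thresholds):
    """Drop items whose confidence is below the floor for their kind."""
    floor_for = {
        "entity": thresholds["min_entity"],
        "relationship": thresholds["min_relationship"],
        "inferred": thresholds["min_inferred"],
    }
    return [item for item in items if item["confidence"] >= floor_for[item["kind"]]]


thresholds = {"min_entity": 0.5, "min_relationship": 0.5, "min_inferred": 0.4}
items = [
    {"kind": "entity", "name": "Ada", "confidence": 0.92},
    {"kind": "entity", "name": "payments-ish", "confidence": 0.45},   # below 0.5
    {"kind": "relationship", "name": "MEMBER_OF", "confidence": 0.5},  # at the floor
    {"kind": "inferred", "name": "COLLABORATES_WITH", "confidence": 0.45},
    {"kind": "inferred", "name": "CONTRIBUTES_TO", "confidence": 0.3},  # below 0.4
]
kept_items = apply_thresholds(items, thresholds)
print([item["name"] for item in kept_items])
# ['Ada', 'MEMBER_OF', 'COLLABORATES_WITH']
```

The same score of 0.45 is dropped for an entity but kept for an inferred edge, because each kind has its own floor.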
Expansion behavior
expansion controls the optional expansion phase of ingestion:
expansion:
  enabled: true
  inference_mode: smart          # smart | incremental | batch | none
  relationship_inference: true
  cross_tool_unification: true
  depth: 2                       # transitive inference passes
  preload_existing: true         # smart mode: load existing entities for cross-doc dedup
Authoring in YAML and loading it
For anything beyond a couple of types, YAML is the natural home. It keeps the whole
ontology in one reviewable file. Load it with ExpertiseLoader:
from khora.extraction.skills import ExpertiseLoader

ontology = ExpertiseLoader().load_file("ontologies/product_engineering.yaml")
# or a bundled starting point:
ontology = ExpertiseLoader().load_builtin("general")

await kb.remember(text, namespace=ns, expertise=ontology,
                  entity_types=ontology.get_entity_type_names(),
                  relationship_types=ontology.get_relationship_type_names())
A complete ontology pulling the pieces together:
name: product_engineering
version: "1.0.0"
description: "People, teams, and services in a product org."

system_prompt: |
  You extract the people, teams, and services discussed in internal engineering
  documents. Capture a person's email or a service's repository URL as an identifier
  when present, so the same entity mentioned in different documents is recognised as one.

entity_types:
  - name: PERSON
    description: "An engineer or stakeholder."
    attributes: {required: [name], optional: [email, role]}
    identifiers: [email, name]
    aliases: [ENGINEER, EMPLOYEE]
  - name: TEAM
    description: "An engineering team or squad."
    attributes: {required: [name], optional: [mission]}
    identifiers: [name]
  - name: SERVICE
    description: "A deployable service or component."
    attributes: {required: [name], optional: [repo_url, language]}
    identifiers: [repo_url, name]
    aliases: [COMPONENT, MICROSERVICE]

relationship_types:
  - {name: MEMBER_OF, source_types: [PERSON], target_types: [TEAM]}
  - {name: OWNS, source_types: [TEAM], target_types: [SERVICE]}
  - {name: CONTRIBUTES_TO, source_types: [PERSON], target_types: [SERVICE]}
  - {name: COLLABORATES_WITH, source_types: [PERSON], target_types: [PERSON], bidirectional: true}

correlation_rules:
  - name: dedupe_people_by_email
    match_fields: [email]
    entity_types: [PERSON]
    confidence: 0.9

inference_rules:
  - name: teammates_collaborate
    when:
      - {relationship: MEMBER_OF, source_type: PERSON, target_type: TEAM}
      - {relationship: MEMBER_OF, source_type: PERSON, target_type: TEAM}
    then: {relationship: COLLABORATES_WITH, source: first.source, target: second.source}
    confidence: 0.5
  - name: contribute_to_owned_services
    when:
      - {relationship: MEMBER_OF, source_type: PERSON, target_type: TEAM}
      - {relationship: OWNS, source_type: TEAM, target_type: SERVICE}
    then: {relationship: CONTRIBUTES_TO, source: first.source, target: second.target}
    confidence: 0.6

confidence: {min_entity: 0.5, min_relationship: 0.5, min_inferred: 0.4}
expansion: {enabled: true, inference_mode: smart, relationship_inference: true, depth: 2}
Ingesting one document (“Ada (ada@acme.com) and Bob are engineers on the Payments
team. The Payments team owns the billing-api service.”) with this ontology extracts
PERSON/TEAM/SERVICE entities and the MEMBER_OF / OWNS edges, then infers
Ada COLLABORATES_WITH Bob and Ada/Bob CONTRIBUTES_TO billing-api.
Composing ontologies with extends
Ontologies are versioned and inherit. A config can extend one or more parents through
its extends field. The loader resolves the chain and merges: entity/relationship types
add-or-override by name, rules combine (later wins on a name clash), and
prompts/confidence/expansion are overridden by the child:
name: hiring
extends: [ontologies/base_people.yaml]   # inherits PERSON, ORGANIZATION, …
system_prompt: |
  {{ parent_prompt }}
  Additionally, extract job candidates and roles.
  Known types: {% for t in entity_types %}{{ t.name }} {% endfor %}
entity_types:
  - name: CANDIDATE
    description: "A job applicant."
System prompts are Jinja2 templates rendered against the ontology, so a child can
wrap its parent’s prompt with {{ parent_prompt }} and iterate the merged
{{ entity_types }}. Loading hiring above yields the types PERSON, ORGANIZATION,
CANDIDATE and a prompt that embeds the parent’s text plus the live type list.
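The merge semantics described in this section can be approximated in a few lines. The sketch below is an assumption about the behaviour, not the loader's actual code: merge_ontology and merge_by_name are hypothetical, ontologies are plain dicts, the parent's contents are invented, and treating confidence as a whole-block override (rather than a per-field merge) is a guess. Prompt templating with {{ parent_prompt }} is left out.

```python
# Illustrative only: a hypothetical merge, not ExpertiseLoader's implementation.

LIST_KEYS = ("entity_types", "relationship_types", "correlation_rules", "inference_rules")
OVERRIDE_KEYS = ("name", "system_prompt", "confidence", "expansion")


def merge_by_name(parent_items, child_items):
    """Add-or-override by name: later definitions win, first-seen order is kept."""
    merged = {item["name"]: item for item in parent_items}
    for item in child_items:
        merged[item["name"]] = item
    return list(merged.values())


def merge_ontology(parent, child):
    """Combine a parent and child ontology following the rules described above."""
    merged = dict(parent)
    for key in LIST_KEYS:
        merged[key] = merge_by_name(parent.get(key, []), child.get(key, []))
    for key in OVERRIDE_KEYS:
        if key in child:
            merged[key] = child[key]  # assumption: the child's block replaces the parent's
    return merged


base_people = {  # hypothetical parent contents
    "name": "base_people",
    "system_prompt": "Extract people and organizations.",
    "entity_types": [
        {"name": "PERSON", "description": "A person."},
        {"name": "ORGANIZATION", "description": "A company or institution."},
    ],
    "confidence": {"min_entity": 0.5},
}
hiring = {
    "name": "hiring",
    "entity_types": [
        {"name": "CANDIDATE", "description": "A job applicant."},
        {"name": "PERSON", "description": "A person involved in hiring."},
    ],
    "confidence": {"min_entity": 0.6},
}
merged = merge_ontology(base_people, hiring)
print([t["name"] for t in merged["entity_types"]])
# ['PERSON', 'ORGANIZATION', 'CANDIDATE']
```

Here the child redefines PERSON, so its description replaces the parent's while the type keeps its original position in the list.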
The built-in type hierarchy
Inference and relationship rules match through a built-in subtype map, so a rule
written for a general type also fires on specific ones: EMPLOYEE and
EXTERNAL_PERSON satisfy a rule expecting PERSON. COMPANY / DEPARTMENT / TEAM
satisfy ORGANIZATION, and CALL satisfies EVENT. You can write rules against broad
types and still match a richly-typed graph.
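A small sketch of subtype-aware matching, using only the pairs named in this section. The real built-in map may contain more entries; SUBTYPES and satisfies are hypothetical names for illustration.

```python
# Illustrative only: the map lists just the subtypes mentioned in the text above.

SUBTYPES = {
    "PERSON": {"EMPLOYEE", "EXTERNAL_PERSON"},
    "ORGANIZATION": {"COMPANY", "DEPARTMENT", "TEAM"},
    "EVENT": {"CALL"},
}


def satisfies(actual_type, expected_type):
    """True if actual_type is the expected type or one of its listed subtypes."""
    return actual_type == expected_type or actual_type in SUBTYPES.get(expected_type, set())


print(satisfies("EMPLOYEE", "PERSON"))    # True
print(satisfies("TEAM", "ORGANIZATION"))  # True
print(satisfies("PERSON", "EMPLOYEE"))    # False: matching does not run from general to specific
```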
Engine notes
VectorCypher supports ontology-driven typed extraction via the entity_types /
relationship_types kwargs, and runs expansion, so it honours correlation_rules,
inference_rules, and the expansion block.
Ingestion
Where the ontology plugs in: the three-phase write path.
Workloads example
A runnable ExpertiseConfig in the resume-search walkthrough.
API reference
ExpertiseConfig and the ontology dataclasses.
Retrieval
The read path that queries what extraction produced.