Context
Semantic search quality is limited by what goes into the embeddings. Today, the entity serializer only embeds URN, type, name, and description. Properties (column names, tags, owners), attached documents, and freshness signals are all ignored. This means:
- Searching for a column name like "bounce_rate" won't find the table that has it
- Searching for "incident" won't surface entities whose runbooks describe incidents
- A freshly updated entity ranks the same as one untouched for a year
Scope
1. Embed entity properties and tags
The properties JSONB field often contains the most useful metadata — column names, schema details, owners, tags, labels. The entity serializer (core/chunking/serializer.go) should flatten and include relevant properties in the text sent to the embedding provider.
Example: a BigQuery table entity with properties: {columns: ["user_id", "session_duration", "bounce_rate"], owner: "analytics-team", tags: ["pii", "tier-1"]} should produce an embedding that understands "bounce_rate", "analytics-team", and "tier-1".
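A minimal sketch of the flattening step, assuming the properties JSONB has already been decoded into a `map[string]any` (as `encoding/json` would produce). `FlattenProperties` and `flattenValue` are hypothetical names, not existing functions in core/chunking/serializer.go, and the "key: value" line format is only one possible choice.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// FlattenProperties (hypothetical helper) renders decoded properties as
// "key: value" lines. Keys are sorted so the same entity always produces
// the same text, which keeps embeddings stable across re-serialization.
func FlattenProperties(props map[string]any) string {
	keys := make([]string, 0, len(props))
	for k := range props {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	for _, k := range keys {
		if v := flattenValue(props[k]); v != "" {
			fmt.Fprintf(&b, "%s: %s\n", k, v)
		}
	}
	return b.String()
}

// flattenValue turns scalars, arrays, and nested objects into one line of text.
func flattenValue(v any) string {
	switch t := v.(type) {
	case nil:
		return ""
	case string:
		return t
	case []any:
		parts := make([]string, 0, len(t))
		for _, item := range t {
			if s := flattenValue(item); s != "" {
				parts = append(parts, s)
			}
		}
		return strings.Join(parts, ", ")
	case map[string]any:
		keys := make([]string, 0, len(t))
		for k := range t {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		parts := make([]string, 0, len(keys))
		for _, k := range keys {
			if s := flattenValue(t[k]); s != "" {
				parts = append(parts, k+"="+s)
			}
		}
		return strings.Join(parts, "; ")
	default:
		// Numbers and booleans decoded from JSON.
		return fmt.Sprint(t)
	}
}
```

For the BigQuery example above, this would yield three lines (columns, owner, tags) that can be appended to the URN, type, name, and description text before it is sent to the embedding provider.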
2. Cross-embed entity + document content
When a document is attached to an entity, the document's content should enrich the entity's embedding context. If a runbook for table:user_sessions mentions "incident", "SLA", and "late-arriving events", searching for those terms should boost that entity in semantic results.
Approach options:
- At embedding time: When an entity is embedded, also pull its document content into the embedding context (heavier, richer)
- At search time: When semantic search returns document chunks, propagate their scores to the parent entity (lighter, but less precise)
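The search-time option could look roughly like the sketch below. The `ChunkHit` and `EntityScore` types, the `PropagateToEntities` function, and the `damping` factor are all assumptions made for illustration; taking the best chunk score per entity is one possible aggregation, not a decided design.

```go
package main

import "sort"

// ChunkHit is a hypothetical semantic-search hit on a document chunk.
type ChunkHit struct {
	EntityURN string  // URN of the entity the document is attached to
	Score     float64 // similarity score, higher is better
}

// EntityScore is the score propagated to a parent entity.
type EntityScore struct {
	URN   string
	Score float64
}

// PropagateToEntities gives each parent entity the best score among its
// document chunks, scaled by damping (for example 0.5) so that indirect
// matches rank below direct entity matches of the same strength.
func PropagateToEntities(hits []ChunkHit, damping float64) []EntityScore {
	best := make(map[string]float64)
	for _, h := range hits {
		if h.EntityURN == "" {
			continue // chunk from a document with no parent entity
		}
		s := h.Score * damping
		if cur, ok := best[h.EntityURN]; !ok || s > cur {
			best[h.EntityURN] = s
		}
	}

	out := make([]EntityScore, 0, len(best))
	for urn, s := range best {
		out = append(out, EntityScore{URN: urn, Score: s})
	}
	// Highest score first; ties broken by URN for deterministic output.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].URN < out[j].URN
	})
	return out
}
```

Using the maximum rather than the sum avoids favoring entities that simply have longer documents, at the cost of ignoring how many chunks matched.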
3. Freshness decay in ranking
Add a mild freshness boost to search and context assembly scoring. Entities with a recent updated_at get a small multiplier. This is not a popularity signal — it's an objective liveness indicator.
This applies to:
- SearchEntities hybrid ranking (RRF score adjustment)
- AssembleContext entity scoring (alongside intent weights)
Design Considerations
- Property embedding should be selective — not all JSONB fields are useful. A configurable allowlist or heuristic (e.g., skip fields > 1000 chars) may be needed.
- Cross-embedding creates a dependency: document upsert should trigger re-embedding of the parent entity. The pipeline already handles async re-embedding on entity upsert, so this is an extension of existing behavior.
- Freshness decay should be gentle — a 1.1-1.2x multiplier for entities updated in the last 7 days, not a hard penalty for old entities. Old but relevant entities should still surface.
- All changes should degrade gracefully when embeddings are disabled.
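The selectivity consideration above could be sketched as a filter that runs before properties are serialized. `SelectProperties`, its parameters, and the rule that an empty allowlist means "allow every key" are assumptions for illustration; only string values are length-checked here, so arrays and nested objects would need their own limit.

```go
package main

import "unicode/utf8"

// SelectProperties (hypothetical) returns the subset of props worth embedding.
// A key is kept when it is on the allowlist (or the allowlist is empty) and,
// if its value is a string, the value is at most maxChars characters long.
func SelectProperties(props map[string]any, allow map[string]bool, maxChars int) map[string]any {
	out := make(map[string]any)
	for k, v := range props {
		if len(allow) > 0 && !allow[k] {
			continue // not on the configured allowlist
		}
		if s, ok := v.(string); ok && utf8.RuneCountInString(s) > maxChars {
			continue // long blobs (raw DDL, serialized payloads) add noise
		}
		out[k] = v
	}
	return out
}
```

With `maxChars` set to 1000 this matches the "skip fields > 1000 chars" heuristic mentioned above, while the allowlist keeps the behavior configurable per deployment.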
Related