Enrich embedding pipeline for better semantic search #252

@ravisuhag

Context

Semantic search quality is limited by what goes into the embeddings. Today, the entity serializer only embeds URN, type, name, and description. Properties (column names, tags, owners), attached documents, and freshness signals are all ignored. This means:

  • Searching for a column name like "bounce_rate" won't find the table that has it
  • Searching for "incident" won't surface entities whose runbooks describe incidents
  • A freshly updated entity ranks the same as one untouched for a year

Scope

1. Embed entity properties and tags

The properties JSONB field often contains the most useful metadata — column names, schema details, owners, tags, labels. The entity serializer (core/chunking/serializer.go) should flatten and include relevant properties in the text sent to the embedding provider.

Example: a BigQuery table entity with properties: {columns: ["user_id", "session_duration", "bounce_rate"], owner: "analytics-team", tags: ["pii", "tier-1"]} should have those values serialized into the embedded text, so that queries for "bounce_rate", "analytics-team", or "tier-1" match the entity.
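
A minimal sketch of what the flattening could look like, in Go. The `Entity` struct, `SerializeForEmbedding`, the `key: value` text layout, and the allowlist argument are all hypothetical names for illustration and are not taken from the existing serializer. It assumes properties arrive as a decoded JSON map.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Entity is a hypothetical, simplified stand-in for the real entity type.
type Entity struct {
	URN         string
	Type        string
	Name        string
	Description string
	Properties  map[string]any // decoded JSONB
}

// maxValueLen is an assumed cutoff: flattened values longer than this are skipped.
const maxValueLen = 1000

// flattenValue renders scalars and lists as a short string.
// Nested objects and other unsupported types yield "".
func flattenValue(v any) string {
	switch t := v.(type) {
	case string:
		return t
	case []string:
		return strings.Join(t, ", ")
	case []any:
		parts := make([]string, 0, len(t))
		for _, item := range t {
			if s := flattenValue(item); s != "" {
				parts = append(parts, s)
			}
		}
		return strings.Join(parts, ", ")
	case float64, bool, int:
		return fmt.Sprint(t)
	default:
		return ""
	}
}

// SerializeForEmbedding builds the text sent to the embedding provider:
// the four existing fields, then allowlisted properties in sorted key order
// so the output is deterministic.
func SerializeForEmbedding(e Entity, allow map[string]bool) string {
	var b strings.Builder
	fmt.Fprintf(&b, "urn: %s\ntype: %s\nname: %s\ndescription: %s\n",
		e.URN, e.Type, e.Name, e.Description)

	keys := make([]string, 0, len(e.Properties))
	for k := range e.Properties {
		if allow[k] {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)

	for _, k := range keys {
		s := flattenValue(e.Properties[k])
		if s == "" || len(s) > maxValueLen {
			continue
		}
		fmt.Fprintf(&b, "%s: %s\n", k, s)
	}
	return b.String()
}

func main() {
	e := Entity{
		URN:  "urn:bigquery:user_sessions", // made-up URN for the example
		Type: "table",
		Name: "user_sessions",
		Properties: map[string]any{
			"columns": []any{"user_id", "session_duration", "bounce_rate"},
			"owner":   "analytics-team",
			"tags":    []any{"pii", "tier-1"},
		},
	}
	allow := map[string]bool{"columns": true, "owner": true, "tags": true}
	fmt.Print(SerializeForEmbedding(e, allow))
}
```

Sorting the keys keeps the serialized text stable across runs, which matters if re-embedding is skipped when the text has not changed.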

2. Cross-embed entity + document content

When a document is attached to an entity, the document's content should enrich the entity's embedding context. If a runbook for table:user_sessions mentions "incident", "SLA", and "late-arriving events", searching for those terms should boost that entity in semantic results.

Approach options:

  • At embedding time: When an entity is embedded, also pull its document content into the embedding context (heavier, richer)
  • At search time: When semantic search returns document chunks, propagate their scores to the parent entity (lighter, but less precise)
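
To make the search-time option concrete, here is a rough sketch of score propagation. `ChunkHit`, `EntityScore`, `PropagateChunkScores`, and the damping `weight` are assumptions for illustration only. It takes the best chunk score per parent entity, scales it, and adds it to whatever score the entity already had. Scores are assumed to be non-negative.

```go
package main

import (
	"fmt"
	"sort"
)

// ChunkHit is a hypothetical semantic-search hit on a document chunk.
type ChunkHit struct {
	EntityURN string  // parent entity the document is attached to
	Score     float64 // similarity score, assumed >= 0
}

// EntityScore is a hypothetical ranked result row.
type EntityScore struct {
	URN   string
	Score float64
}

// PropagateChunkScores adds weight * (best chunk score) to each parent entity.
// Entities that only matched through a document still appear in the output.
func PropagateChunkScores(entityScores map[string]float64, hits []ChunkHit, weight float64) []EntityScore {
	best := make(map[string]float64)
	for _, h := range hits {
		if h.Score > best[h.EntityURN] {
			best[h.EntityURN] = h.Score
		}
	}

	combined := make(map[string]float64, len(entityScores)+len(best))
	for urn, s := range entityScores {
		combined[urn] = s
	}
	for urn, s := range best {
		combined[urn] += weight * s
	}

	out := make([]EntityScore, 0, len(combined))
	for urn, s := range combined {
		out = append(out, EntityScore{URN: urn, Score: s})
	}
	// Highest score first; ties broken by URN for deterministic output.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].URN < out[j].URN
	})
	return out
}

func main() {
	entityScores := map[string]float64{"table:orders": 0.5, "table:user_sessions": 0.4}
	hits := []ChunkHit{
		{EntityURN: "table:user_sessions", Score: 0.8}, // runbook chunk mentioning "incident"
		{EntityURN: "table:user_sessions", Score: 0.2},
	}
	for _, r := range PropagateChunkScores(entityScores, hits, 0.5) {
		fmt.Printf("%s %.2f\n", r.URN, r.Score)
	}
}
```

Using the maximum rather than the sum keeps an entity with many weakly matching chunks from outranking one with a single strong match; either choice is a judgment call.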

3. Freshness decay in ranking

Add a mild freshness boost to search and context assembly scoring. Entities with a recent updated_at get a small multiplier. This is not a popularity signal — it's an objective liveness indicator.

This applies to:

  • SearchEntities hybrid ranking (RRF score adjustment)
  • AssembleContext entity scoring (alongside intent weights)
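
One possible shape for the multiplier, as a sketch only. The linear falloff, the 7-day window, and the 0.2 maximum boost are assumptions chosen to match the "gentle" intent of this issue; `freshnessMultiplier` and `ApplyFreshness` are hypothetical names, not existing functions. Entities older than the window get exactly 1.0, so they are never penalized.

```go
package main

import (
	"fmt"
	"time"
)

const (
	freshnessWindow = 7 * 24 * time.Hour // assumed window
	maxBoost        = 0.2                // assumed cap: at most a 1.2x multiplier
)

// freshnessMultiplier returns a value in [1.0, 1.0+maxBoost].
// The boost falls off linearly from maxBoost at age 0 to 0 at freshnessWindow.
func freshnessMultiplier(age time.Duration) float64 {
	if age < 0 {
		age = 0 // clock skew: treat future timestamps as "just updated"
	}
	if age >= freshnessWindow {
		return 1.0
	}
	remaining := 1.0 - float64(age)/float64(freshnessWindow)
	return 1.0 + maxBoost*remaining
}

// ApplyFreshness scales a ranking score (for example an RRF score)
// by the freshness multiplier for the entity's updated_at.
func ApplyFreshness(score float64, updatedAt, now time.Time) float64 {
	return score * freshnessMultiplier(now.Sub(updatedAt))
}

func main() {
	now := time.Date(2025, 1, 8, 0, 0, 0, 0, time.UTC)
	rrf := 0.03 // example RRF score
	for _, days := range []int{0, 3, 7, 365} {
		updated := now.AddDate(0, 0, -days)
		fmt.Printf("updated %3d days ago: %.4f\n", days, ApplyFreshness(rrf, updated, now))
	}
}
```

Passing `now` in explicitly keeps the function deterministic and easy to test.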

Design Considerations

  • Property embedding should be selective — not all JSONB fields are useful. A configurable allowlist or heuristic (e.g., skip fields > 1000 chars) may be needed.
  • Cross-embedding creates a dependency: document upsert should trigger re-embedding of the parent entity. The pipeline already handles async re-embedding on entity upsert, so this is an extension of existing behavior.
  • Freshness decay should be gentle — a 1.1-1.2x multiplier for entities updated in the last 7 days, not a hard penalty for old entities. Old but relevant entities should still surface.
  • All changes should degrade gracefully when embeddings are disabled.
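
The dependency between document upserts and entity re-embedding, together with the graceful-degradation requirement, could be sketched roughly as follows. `ReembedScheduler` and its methods are hypothetical and only stand in for the existing async pipeline, whose actual API is not shown in this issue. The in-memory set is an assumption used here to deduplicate repeated upserts for the same parent.

```go
package main

import (
	"fmt"
	"sort"
)

// ReembedScheduler is a hypothetical stand-in for the async re-embedding queue.
type ReembedScheduler struct {
	embeddingsEnabled bool
	pending           map[string]struct{} // parent entity URNs awaiting re-embedding
}

func NewReembedScheduler(enabled bool) *ReembedScheduler {
	return &ReembedScheduler{
		embeddingsEnabled: enabled,
		pending:           make(map[string]struct{}),
	}
}

// OnDocumentUpsert marks the document's parent entity for re-embedding.
// It is a no-op when embeddings are disabled or the document has no parent,
// so callers need no special-casing. It reports whether work was scheduled.
func (s *ReembedScheduler) OnDocumentUpsert(parentURN string) bool {
	if !s.embeddingsEnabled || parentURN == "" {
		return false
	}
	s.pending[parentURN] = struct{}{}
	return true
}

// Drain returns the pending URNs in sorted order and clears the set.
func (s *ReembedScheduler) Drain() []string {
	urns := make([]string, 0, len(s.pending))
	for urn := range s.pending {
		urns = append(urns, urn)
	}
	sort.Strings(urns)
	s.pending = make(map[string]struct{})
	return urns
}

func main() {
	s := NewReembedScheduler(true)
	s.OnDocumentUpsert("table:user_sessions")
	s.OnDocumentUpsert("table:user_sessions") // second upsert is deduplicated
	fmt.Println(s.Drain())
}
```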
