Add automatic substructure for dense nodes (fixes #614) by ADunfield · Pull Request #4532 · neo4j-contrib/neo4j-apoc-procedures

ADunfield · 2026-03-17T08:22:00Z

Summary

Implements #614: automatic multi-level B-tree substructure for dense (supernode) nodes with millions of relationships. Distributes relationships across intermediate __DenseBucket nodes to reduce both read fan-out and write lock contention.

This has been an open request since 2017. The implementation went through three rounds of peer review, with systematic fixes between rounds. All Round 3 feedback is addressed: indexed delete (O(buckets_for_source) instead of a linear scan), streaming analyze, schema transaction safety, and 27 tests.

cc @jexp — would appreciate your eyes on this given the scope.

Architecture

When a node has millions of relationships of a given type, this module distributes them across a configurable N-ary tree of bucket nodes:

[Dense Node]
     |
{_DENSE_META_LIKES}     ← one per type+direction, carries metadata
     |
[Root Bucket]           ← __DenseBucket label
   / | \
{_DENSE_BRANCH_LIKES}
 /   |   \
[Bucket] [Bucket] ...  ← leaf buckets, capacity configurable (default 1000)
 /|\       /|\
{_DENSE_LIKES}          ← prefixed to avoid Cypher collision
/ | \
[target nodes]

Key Design Decisions

Prefixed relationship types (_DENSE_META_<TYPE>, _DENSE_BRANCH_<TYPE>, _DENSE_<TYPE>): Prevents silent Cypher query breakage. After migration, MATCH (n)-[:LIKES]->(t) returns 0 results instead of a silently incomplete result set, so the breakage is immediately visible; users must go through the apoc.dense.* API. This was identified as the highest-risk item in Round 1 review.
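To make the naming scheme concrete, here is a minimal sketch of how the three prefixes could be applied to a user-facing type such as LIKES. The prefix strings come from the description above; the class and method names are hypothetical and are not taken from DenseConstants.java.

```java
/** Illustrative helper for the prefixed relationship type names (names are hypothetical). */
final class DenseTypeNames {
    // Prefixes as listed in the PR description.
    static final String META_PREFIX = "_DENSE_META_";
    static final String BRANCH_PREFIX = "_DENSE_BRANCH_";
    static final String LEAF_PREFIX = "_DENSE_";

    private DenseTypeNames() {}

    /** Type of the relationship from the dense node to its root bucket. */
    static String meta(String userType) {
        return META_PREFIX + userType;
    }

    /** Type of the relationships between bucket levels. */
    static String branch(String userType) {
        return BRANCH_PREFIX + userType;
    }

    /** Type of the relationships from leaf buckets to target nodes. */
    static String leaf(String userType) {
        return LEAF_PREFIX + userType;
    }

    /** All three internal prefixes start with "_DENSE_", so one check covers them. */
    static boolean isInternal(String relTypeName) {
        return relTypeName.startsWith(LEAF_PREFIX);
    }
}
```

Because every internal type starts with `_DENSE_`, a plain `[:LIKES]` pattern can never match substructure relationships by accident.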

Deadlock-safe locking: Uses optimistic double-checked locking on the fast path (lock bucket only, re-check count). On the split path when a bucket is full, acquires locks in strict hierarchical order: meta-relationship → root → branch → leaf. This prevents the A→B / B→A deadlock pattern that would arise from concurrent writers.
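The two locking paths can be sketched with plain `java.util.concurrent` locks. This is an illustration of the pattern only: the real implementation locks Neo4j entities inside a transaction, and the `Bucket` class and method names below are hypothetical stand-ins.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Sketch of the fast path and the ordered split path, using in-memory locks as stand-ins. */
final class HierarchicalLocking {

    /** Hypothetical stand-in for a leaf bucket; real buckets are __DenseBucket nodes. */
    static final class Bucket {
        final ReentrantLock lock = new ReentrantLock();
        final int capacity;
        volatile int count; // volatile so the unlocked pre-check sees recent writes

        Bucket(int capacity) {
            this.capacity = capacity;
        }
    }

    private HierarchicalLocking() {}

    /**
     * Fast path: lock only the leaf bucket and re-check the count under the lock.
     * Returns false when the bucket is full, so the caller must take the split path.
     */
    static boolean tryFastInsert(Bucket leaf) {
        if (leaf.count >= leaf.capacity) {
            return false; // optimistic check without the lock
        }
        leaf.lock.lock();
        try {
            if (leaf.count >= leaf.capacity) {
                return false; // another writer filled the bucket in the meantime
            }
            leaf.count++;
            return true;
        } finally {
            leaf.lock.unlock();
        }
    }

    /**
     * Split path: acquire locks in the fixed order meta, root, branch, leaf,
     * run the split, then release in reverse order.
     */
    static void withSplitLocks(ReentrantLock meta, ReentrantLock root,
                               ReentrantLock branch, ReentrantLock leaf, Runnable split) {
        ReentrantLock[] ordered = {meta, root, branch, leaf};
        int held = 0;
        try {
            for (ReentrantLock lock : ordered) {
                lock.lock();
                held++;
            }
            split.run();
        } finally {
            for (int i = held - 1; i >= 0; i--) {
                ordered[i].unlock();
            }
        }
    }
}
```

Since every writer on the split path requests the same locks in the same order, no two writers can each hold a lock the other is waiting for.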

Indexed delete via __dense_target: Each leaf relationship stores the target node's element ID as a property. Combined with the node index on __DenseBucket.__dense_source, delete locates the right bucket in O(buckets_for_source) in the worst case, instead of an O(total_relationships) linear scan. The worst case occurs when the same source→target pair has duplicate rels across buckets; for typical unique-target workloads this is effectively O(1) per bucket.
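The lookup can be pictured with an in-memory model, where a map plays the role of the `__dense_source` node index and each bucket holds the `__dense_target` values of its leaf relationships. This is a sketch under those assumptions; the class, field, and method names are hypothetical, and how the real code searches within a single bucket is not stated in this description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory model of the indexed delete; not the DenseNodeManager implementation. */
final class IndexedDeleteSketch {

    /** Stand-in for a __DenseBucket node; each entry models one leaf rel's __dense_target. */
    static final class Bucket {
        final List<String> targets = new ArrayList<>();
    }

    /** Stand-in for the node index on __DenseBucket.__dense_source. */
    private final Map<String, List<Bucket>> bucketsBySource = new HashMap<>();

    /** Registers a new bucket under the given source element ID. */
    Bucket addBucket(String sourceId) {
        Bucket bucket = new Bucket();
        bucketsBySource.computeIfAbsent(sourceId, k -> new ArrayList<>()).add(bucket);
        return bucket;
    }

    /**
     * Removes one leaf relationship from source to target.
     * Only the buckets indexed under this source are visited; buckets that
     * belong to other dense nodes are never touched.
     */
    boolean delete(String sourceId, String targetId) {
        List<Bucket> candidates = bucketsBySource.getOrDefault(sourceId, Collections.emptyList());
        for (Bucket bucket : candidates) {
            if (bucket.targets.remove(targetId)) {
                return true;
            }
        }
        return false;
    }
}
```

The search is bounded by the number of buckets that belong to one source, which is what makes the cost independent of the total relationship count in the database.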

Schema transaction safety: Index creation uses a read-only check in a separate transaction first, then DDL in its own committed schema transaction. Catches EquivalentSchemaRuleAlreadyExistsException from concurrent creation gracefully.

Direction.BOTH semantics: Rejected on writes with IllegalArgumentException and clear guidance to call twice. Supported on reads as union (relationships) or sum (degree).
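A minimal sketch of these rules follows. The enum mirrors the three values of Neo4j's `org.neo4j.graphdb.Direction` but is redefined locally so the example compiles without Neo4j; the class name, method names, and exception message are hypothetical.

```java
/** Sketch of the Direction.BOTH rules: rejected on writes, summed on degree reads. */
final class DirectionRules {

    /** Local copy of the three direction values, so the sketch has no Neo4j dependency. */
    enum Direction { OUTGOING, INCOMING, BOTH }

    private DirectionRules() {}

    /** Write procedures accept only OUTGOING or INCOMING. */
    static Direction requireWritable(Direction direction) {
        if (direction == Direction.BOTH) {
            throw new IllegalArgumentException(
                    "Direction.BOTH is not supported for writes; "
                            + "call once with OUTGOING and once with INCOMING");
        }
        return direction;
    }

    /** Degree for BOTH is the sum of the two per-direction counts. */
    static long degree(Direction direction, long outgoingCount, long incomingCount) {
        switch (direction) {
            case OUTGOING:
                return outgoingCount;
            case INCOMING:
                return incomingCount;
            default:
                return outgoingCount + incomingCount;
        }
    }
}
```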

Element ID portability caveat: __dense_target uses Neo4j element IDs which are stable within a database lifecycle but NOT portable across dump/restore. After restore, the recovery path is flatten() → migrate() to rebuild with fresh IDs. Documented in class javadoc.

autoMigrate warning behavior: When autoMigrate: true, the procedure emits a log.warn() to neo4j.log when a node exceeds the dense threshold. It does not silently migrate or throw — operators decide when to run apoc.dense.migrate().

Alternatives Considered

Five storage strategies were evaluated:

| Approach | Write | Delete | Read | Verdict |
|---|---|---|---|---|
| B-tree (selected) | O(1) fast path | O(buckets) indexed | O(log n + results) | Best balanced |
| Hash-routed flat buckets | O(1) direct | O(1) hash | O(N buckets) all-scan | Write-heavy only |
| Probabilistic skip-list | O(log n) expected | O(log n) expected | O(log n) expected | Disk-unfriendly |
| Append-only LSM | O(1) append | Tombstone + GC | Degrades pre-compaction | Write-only use case |
| Adaptive branching | Same as B-tree | Same as B-tree | Same as B-tree | Tuning layer, not different structure |

B-tree was selected for balanced read/write performance, clean mapping to Neo4j's node/relationship model, and deterministic worst-case guarantees.

API

Write Procedures (OUTGOING or INCOMING only)

| Procedure | YIELD |
|---|---|
| `apoc.dense.create.relationship(src, type, tgt, props?, config?)` | `rel, bucket` |
| `apoc.dense.create.relationship.incoming(src, type, tgt, props?, config?)` | `rel, bucket` |
| `apoc.dense.delete.relationship(src, type, tgt, matchProps?)` | `removed, remainingCount` |
| `apoc.dense.migrate(src, type, dir?, config?)` | `migratedCount, bucketsCreated, levels, migrationComplete` |
| `apoc.dense.flatten(src, type, dir?, config?)` | `flattenedCount, bucketsRemoved` |

Read Procedures (BOTH supported)

| Procedure/Function | YIELD |
|---|---|
| `apoc.dense.relationships(src, type, dir?, config?)` | `rel, node, cursor` |
| `apoc.dense.degree(src, type, dir?)` | `Long` (O(1) metadata) |
| `apoc.dense.analyze(config?)` | `node, type, direction, degree, alreadyManaged` |
| `apoc.dense.status(src, type, dir?)` | `type, direction, totalCount, levels, bucketCount, ...` |
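Since `apoc.dense.relationships` yields a `cursor` alongside each result, the resume-token pattern can be sketched as below. The token format used by the PR is not described here, so this example assumes the simplest possible one, a position in a stable ordering; the class and method names are hypothetical. A limit of 0 means unlimited, matching the `limit` config parameter.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of cursor-based pagination over a stable ordering (token format is an assumption). */
final class CursorPaging {

    /** One page of results plus the token needed to fetch the next page. */
    static final class Page {
        final List<String> items;
        final String nextCursor; // null once the results are exhausted

        Page(List<String> items, String nextCursor) {
            this.items = items;
            this.nextCursor = nextCursor;
        }
    }

    private CursorPaging() {}

    /**
     * Returns up to `limit` items starting at the position encoded in `cursor`.
     * A null cursor starts from the beginning; a limit of 0 or less returns everything left.
     */
    static Page page(List<String> all, int limit, String cursor) {
        int start = (cursor == null) ? 0 : Integer.parseInt(cursor);
        if (start < 0 || start > all.size()) {
            throw new IllegalArgumentException("Invalid cursor: " + cursor);
        }
        // Use long arithmetic so a very large limit cannot overflow.
        int end = (limit <= 0) ? all.size() : (int) Math.min((long) all.size(), (long) start + limit);
        String next = (end < all.size()) ? Integer.toString(end) : null;
        return new Page(new ArrayList<>(all.subList(start, end)), next);
    }
}
```

A caller would pass each page's `nextCursor` back as the `cursor` config value until it comes back null.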

Config Parameters

| Key | Default | Description |
|---|---|---|
| `bucketCapacity` | 1000 | Max leaf relationships per bucket |
| `branchFactor` | 100 | Max child buckets per branch node |
| `denseThreshold` | 10000 | Degree threshold for analyze/auto-detect |
| `batchSize` | 5000 | Relationships per batch in migrate/flatten |
| `autoMigrate` | false | Warn via log if direct rels exceed threshold |
| `limit` | 0 (unlimited) | Max results for query |
| `cursor` | null | Resume token for pagination |
| `sampleRate` | 1.0 | Probabilistic sampling for analyze (0.0–1.0) |
| `analyzeLimit` | 500 | Hard cap on analyze results |
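To get a feel for how `bucketCapacity` and `branchFactor` interact, the following back-of-the-envelope sketch estimates the tree depth for a given relationship count. It assumes fully packed buckets and that a single bucket serves as both root and leaf when everything fits; the PR does not state either, and the class and method names are hypothetical.

```java
/** Rough sizing arithmetic for the bucket tree, assuming fully packed buckets. */
final class TreeSizing {

    private TreeSizing() {}

    /** Number of leaf buckets needed to hold relCount relationships (ceiling division). */
    static long leafBuckets(long relCount, int bucketCapacity) {
        if (relCount < 0 || bucketCapacity < 1) {
            throw new IllegalArgumentException("relCount must be >= 0 and bucketCapacity >= 1");
        }
        return (relCount + bucketCapacity - 1) / bucketCapacity;
    }

    /**
     * Number of bucket levels, counting the leaf level as 1.
     * Each extra level groups up to branchFactor child buckets under one parent.
     */
    static int levels(long relCount, int bucketCapacity, int branchFactor) {
        if (branchFactor < 2) {
            throw new IllegalArgumentException("branchFactor must be >= 2");
        }
        long nodesAtLevel = Math.max(1L, leafBuckets(relCount, bucketCapacity));
        int levels = 1;
        while (nodesAtLevel > 1) {
            nodesAtLevel = (nodesAtLevel + branchFactor - 1) / branchFactor;
            levels++;
        }
        return levels;
    }
}
```

Under these assumptions and the default settings, one million relationships need 1,000 leaf buckets, 10 branch buckets, and one root, which gives three levels.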

Files

Source (5 files, ~1,800 lines):
  Dense.java            — 8 procedures + 1 function, @Extended annotated
  DenseConfig.java      — Config POJO with type-safe converters
  DenseConstants.java   — Labels, rel type prefixes, property keys
  DenseNodeManager.java — Core B-tree logic, locking, indexed delete
  DenseResult.java      — 7 result types

Tests (1 file, ~870 lines):
  DenseTest.java        — 27 integration tests

Test Coverage

Create: single, with properties, bucket fill, bucket split, multi-level tree
Degree: normal, zero for non-dense, BOTH direction sum
Query: full traversal, limit, cursor-based pagination
Delete: basic, property matching, non-existent, empty bucket compaction, indexed lookup across multiple buckets
Analyze: detection, label filter, limit cap, bucket node filtering
Migrate: full, batched, BOTH direction rejection
Flatten: reverse migration, internal property stripping
Status: normal, empty for non-dense
Concurrency: 4-thread concurrent writes with metadata integrity verification

v2 Enhancements (deferred)

Bucket rebalancing: Merging partially-empty sibling buckets after heavy delete. Deferred because it violates the strict top-down lock hierarchy.
Native relationship property index: Neo4j 5 supports relationship property indexes. Running CREATE INDEX FOR ()-[r:_DENSE_LIKES]-() ON (r.__dense_target) would enable O(1) delete. Deferred because it requires per-type lazy index creation.

Relationship ID Contract

WARNING: Relationship element IDs are NOT preserved across migrate() or flatten(). These procedures delete and recreate relationships. This is consistent with apoc.refactor.* behavior.

Implements multi-level B-tree bucket procedures for managing nodes with millions of relationships. Distributes relationships across intermediate __DenseBucket nodes to reduce read fan-out and write lock contention.

Procedures: apoc.dense.create.relationship, delete.relationship, relationships, analyze, migrate, flatten, status.
Function: apoc.dense.degree (O(1) metadata count).

Design decisions:
- Prefixed relationship types (_DENSE_META_, _DENSE_BRANCH_, _DENSE_) to prevent silent Cypher query breakage after migration
- Optimistic fast-path locking with strict hierarchical ordering on split path (meta-rel -> root -> branch -> leaf) for deadlock safety
- __dense_target property on leaf rels for O(buckets_for_source) delete instead of O(total_relationships) linear scan
- Schema DDL in separate committed transaction with concurrent-creation exception handling
- Direction.BOTH rejected on writes, returns union on reads
- Streaming analyze() with sampleRate + analyzeLimit to prevent OOM
- Cursor-based pagination for relationship queries
- Element ID portability caveat documented (dump/restore requires flatten + re-migrate)

27 integration tests covering create, delete, query, migrate, flatten, status, analyze, concurrent writes, direction handling, and indexed delete correctness.

ADunfield wants to merge 1 commit into neo4j-contrib:5.26 from ADunfield:feature/614-dense-node-substructure
