Add automatic substructure for dense nodes (fixes #614) #4532
Open
ADunfield wants to merge 1 commit into neo4j-contrib:5.26 from
Implements multi-level B-tree bucket procedures for managing nodes with millions of relationships. Distributes relationships across intermediate __DenseBucket nodes to reduce read fan-out and write lock contention.
Procedures: apoc.dense.create.relationship, delete.relationship, relationships, analyze, migrate, flatten, status.
Function: apoc.dense.degree (O(1) metadata count).
Design decisions:
- Prefixed relationship types (_DENSE_META_, _DENSE_BRANCH_, _DENSE_) to prevent silent Cypher query breakage after migration
- Optimistic fast-path locking with strict hierarchical ordering on split path (meta-rel -> root -> branch -> leaf) for deadlock safety
- __dense_target property on leaf rels for O(buckets_for_source) delete instead of O(total_relationships) linear scan
- Schema DDL in separate committed transaction with concurrent-creation exception handling
- Direction.BOTH rejected on writes, returns union on reads
- Streaming analyze() with sampleRate + analyzeLimit to prevent OOM
- Cursor-based pagination for relationship queries
- Element ID portability caveat documented (dump/restore requires flatten + re-migrate)
27 integration tests covering create, delete, query, migrate, flatten, status, analyze, concurrent writes, direction handling, and indexed delete correctness.
Summary
Implements #614: automatic multi-level B-tree substructure for dense nodes (supernodes) with millions of relationships. Distributes relationships across intermediate __DenseBucket nodes to reduce both read fan-out and write lock contention.
This has been an open request since 2017. The implementation went through three rounds of peer review with systematic fixes between each round. Addressed all Round 3 feedback: indexed delete O(log n), streaming analyze, schema transaction safety, 27 tests.
cc @jexp — would appreciate your eyes on this given the scope.
Architecture
When a node has millions of relationships of a given type, this module distributes them across a configurable N-ary tree of bucket nodes.
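As a rough illustration of how such a tree scales, the sketch below estimates the number of leaf buckets and bucket levels for a given relationship count. It assumes every bucket is fully packed, which a real B-tree with splits will not guarantee, and the class and method names (DenseTreeSizing, leafBuckets, levels) are hypothetical and not taken from this PR. Only the parameter names bucketCapacity and branchFactor come from the config list.

```java
// Hypothetical sizing helper, not part of the PR.
// Assumption: buckets are fully packed, so these are lower bounds.
final class DenseTreeSizing {

    /** Smallest number of leaf buckets that can hold relCount relationships. */
    static long leafBuckets(long relCount, int bucketCapacity) {
        if (relCount < 0 || bucketCapacity < 1) {
            throw new IllegalArgumentException("relCount >= 0 and bucketCapacity >= 1 required");
        }
        // Ceiling division; an empty tree still has one (empty) bucket.
        return Math.max(1, (relCount + bucketCapacity - 1) / bucketCapacity);
    }

    /** Number of bucket levels, counting the leaf level as level 1. */
    static int levels(long relCount, int bucketCapacity, int branchFactor) {
        if (branchFactor < 2) {
            throw new IllegalArgumentException("branchFactor >= 2 required");
        }
        long buckets = leafBuckets(relCount, bucketCapacity);
        int levels = 1;
        // Each extra level groups up to branchFactor buckets under one parent.
        while (buckets > 1) {
            buckets = (buckets + branchFactor - 1) / branchFactor;
            levels++;
        }
        return levels;
    }

    public static void main(String[] args) {
        System.out.println(leafBuckets(1_000_000, 1_000)); // 1000
        System.out.println(levels(1_000_000, 1_000, 100)); // 3
    }
}
```

Under these assumptions, one million relationships with bucketCapacity 1000 and branchFactor 100 fit in 1000 leaf buckets under three bucket levels, so a lookup touches a small, fixed number of buckets instead of one node with a million relationships.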
Key Design Decisions
- Prefixed relationship types (_DENSE_META_<TYPE>, _DENSE_BRANCH_<TYPE>, _DENSE_<TYPE>): Prevents silent Cypher query breakage. After migration, MATCH (n)-[:LIKES]->(t) correctly returns 0 results rather than silently missing the substructure; users must go through the apoc.dense.* API. This was identified as the highest-risk item in Round 1 review.
- Deadlock-safe locking: Uses optimistic double-checked locking on the fast path (lock bucket only, re-check count). On the split path when a bucket is full, acquires locks in strict hierarchical order: meta-relationship → root → branch → leaf. This prevents the A→B / B→A deadlock pattern that would arise from concurrent writers.
- Indexed delete via __dense_target: Each leaf relationship stores the target node's element ID as a property. Combined with the __DenseBucket.__dense_source node index, delete finds the right bucket in O(buckets_for_source) worst case instead of O(total_relationships) linear scan. Worst case occurs when the same source→target pair has duplicate rels across buckets; for typical unique-target workloads this is effectively O(1) per bucket.
- Schema transaction safety: Index creation uses a read-only check in a separate transaction first, then DDL in its own committed schema transaction. Catches EquivalentSchemaRuleAlreadyExistsException from concurrent creation gracefully.
- Direction.BOTH semantics: Rejected on writes with IllegalArgumentException and clear guidance to call twice. Supported on reads as union (relationships) or sum (degree).
- Element ID portability caveat: __dense_target uses Neo4j element IDs which are stable within a database lifecycle but NOT portable across dump/restore. After restore, the recovery path is flatten() → migrate() to rebuild with fresh IDs. Documented in class javadoc.
- autoMigrate warning behavior: When autoMigrate: true, the procedure emits a log.warn() to neo4j.log when a node exceeds the dense threshold. It does not silently migrate or throw; operators decide when to run apoc.dense.migrate().
Alternatives Considered
Five storage strategies were evaluated:
B-tree was selected for balanced read/write performance, clean mapping to Neo4j's node/relationship model, and deterministic worst-case guarantees.
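The locking scheme listed under Key Design Decisions (double-checked fast path, strict meta-relationship → root → branch → leaf order on the split path) can be sketched roughly as follows. This is a standalone model, not the PR's code: it uses java.util.concurrent ReentrantLock objects as stand-ins for Neo4j write locks, and the names HierarchicalLockSketch, Lockable, lockInOrder, unlockAll and tryFastInsert are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the lock ordering; ReentrantLock stands in for Neo4j locks.
final class HierarchicalLockSketch {

    // Declaration order is the acquisition order described in the PR.
    enum Level { META_REL, ROOT, BRANCH, LEAF }

    static final class Lockable {
        final Level level;
        final long id;
        final int capacity;
        volatile int count;
        final ReentrantLock lock = new ReentrantLock();

        Lockable(Level level, long id, int capacity) {
            this.level = level;
            this.id = id;
            this.capacity = capacity;
        }
    }

    /** Fast path: lock only the leaf bucket and re-check its count under the lock. */
    static boolean tryFastInsert(Lockable leaf) {
        if (leaf.count >= leaf.capacity) {
            return false; // looks full, caller falls back to the split path
        }
        leaf.lock.lock();
        try {
            if (leaf.count >= leaf.capacity) {
                return false; // filled up by a concurrent writer in the meantime
            }
            leaf.count++;
            return true;
        } finally {
            leaf.lock.unlock();
        }
    }

    /** Split path: every writer locks by level (then id), so no two writers lock in opposite order. */
    static List<Lockable> lockInOrder(List<Lockable> needed) {
        List<Lockable> ordered = new ArrayList<>(needed);
        ordered.sort(Comparator.comparing((Lockable l) -> l.level)
                .thenComparingLong(l -> l.id));
        for (Lockable l : ordered) {
            l.lock.lock();
        }
        return ordered;
    }

    /** Releases in reverse acquisition order. */
    static void unlockAll(List<Lockable> ordered) {
        for (int i = ordered.size() - 1; i >= 0; i--) {
            ordered.get(i).lock.unlock();
        }
    }

    public static void main(String[] args) {
        List<Lockable> needed = Arrays.asList(
                new Lockable(Level.LEAF, 4, 2),
                new Lockable(Level.ROOT, 2, 2),
                new Lockable(Level.META_REL, 1, 2),
                new Lockable(Level.BRANCH, 3, 2));
        List<Lockable> held = lockInOrder(needed);
        for (Lockable l : held) {
            System.out.println(l.level);
        }
        unlockAll(held);
    }
}
```

Because every writer sorts its lock set the same way before acquiring anything, two concurrent splits can block each other but cannot wait on each other in a cycle, which is the property the A→B / B→A remark refers to.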
API
Write Procedures (OUTGOING or INCOMING only)
| Procedure | Returns |
| --- | --- |
| apoc.dense.create.relationship(src, type, tgt, props?, config?) | rel, bucket |
| apoc.dense.create.relationship.incoming(src, type, tgt, props?, config?) | rel, bucket |
| apoc.dense.delete.relationship(src, type, tgt, matchProps?) | removed, remainingCount |
| apoc.dense.migrate(src, type, dir?, config?) | migratedCount, bucketsCreated, levels, migrationComplete |
| apoc.dense.flatten(src, type, dir?, config?) | flattenedCount, bucketsRemoved |
Read Procedures (BOTH supported)
| Procedure | Returns |
| --- | --- |
| apoc.dense.relationships(src, type, dir?, config?) | rel, node, cursor |
| apoc.dense.degree(src, type, dir?) | Long (O(1) metadata) |
| apoc.dense.analyze(config?) | node, type, direction, degree, alreadyManaged |
| apoc.dense.status(src, type, dir?) | type, direction, totalCount, levels, bucketCount, ... |
Config Parameters
- bucketCapacity
- branchFactor
- denseThreshold
- batchSize
- autoMigrate
- limit
- cursor
- sampleRate
- analyzeLimit
Files
Test Coverage
v2 Enhancements (deferred)
CREATE INDEX FOR ()-[r:_DENSE_LIKES]-() ON (r.__dense_target) would enable O(1) delete. Deferred because it requires per-type lazy index creation.
Relationship ID Contract
WARNING: Relationship element IDs are NOT preserved across migrate() or flatten(). These procedures delete and recreate relationships. This is consistent with apoc.refactor.* behavior.