fix(cassandra): auto-recover session after Cassandra restart by dpol1 · Pull Request #2997 · apache/hugegraph

dpol1 · 2026-04-18T15:38:23Z

Purpose of the PR

HugeGraphServer stops responding after Cassandra is restarted and never
recovers without a full server restart.

Root cause: CassandraSessionPool builds the Datastax Cluster without a
ReconnectionPolicy, CassandraSession.execute(...) calls the driver once
with no retry, and thread-local sessions are never probed for liveness.
Once Cassandra goes down, transient NoHostAvailableException /
OperationTimedOutException errors surface to the user and the pool stays
dead even after Cassandra comes back online.

Main Changes

- Register ExponentialReconnectionPolicy(baseDelay, maxDelay) on the
  Cluster builder so the Datastax driver keeps retrying downed nodes in
  the background.
- Wrap every Session.execute(...) in executeWithRetry(Statement) with
  exponential backoff on transient connectivity failures.
- Implement reconnectIfNeeded() / reset() on CassandraSession so the
  pool reopens closed sessions and issues a lightweight health-check
  (SELECT now() FROM system.local) before subsequent queries.
- Add four tunables in CassandraOptions (defaults preserve previous
  behavior for healthy clusters):

| Option | Default | Meaning |
|---|---|---|
| `cassandra.reconnect_base_delay` | `1000` ms | Initial backoff for driver reconnection policy |
| `cassandra.reconnect_max_delay` | `60000` ms | Cap for reconnection backoff |
| `cassandra.reconnect_max_retries` | `10` | Per-query retries on transient errors (`0` disables) |
| `cassandra.reconnect_interval` | `5000` ms | Base interval for per-query exponential backoff |

- Add unit tests covering defaults, overrides, disabling retries and option keys.
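For readers who want to see the per-query retry idea in isolation, here is a minimal, driver-free sketch. It is not the code in this PR: `RetrySketch`, `TransientException` and `backoffMillis` are hypothetical names, `TransientException` stands in for the driver's connectivity errors, and the doubling schedule and the cap on the per-query delay are assumptions made for illustration.

```java
import java.util.concurrent.Callable;

class RetrySketch {

    /** Hypothetical stand-in for transient driver connectivity errors. */
    static class TransientException extends RuntimeException {
        TransientException(String message) {
            super(message);
        }
    }

    /**
     * Delay before retry number `attempt` (0-based):
     * interval * 2^attempt, capped at maxDelay (assumed schedule).
     */
    static long backoffMillis(long interval, long maxDelay, int attempt) {
        // Clamp the shift so the factor stays small enough for typical intervals.
        long factor = 1L << Math.min(attempt, 20);
        return Math.min(maxDelay, interval * factor);
    }

    /**
     * Runs the query, retrying only on TransientException.
     * maxRetries == 0 means the first failure is rethrown immediately.
     */
    static <T> T executeWithRetry(Callable<T> query, int maxRetries,
                                  long interval, long maxDelay) throws Exception {
        int attempt = 0;
        while (true) {
            try {
                return query.call();
            } catch (TransientException e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted, surface the original error
                }
                Thread.sleep(backoffMillis(interval, maxDelay, attempt));
                attempt++;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated query: fails twice, then succeeds.
        String result = executeWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new TransientException("no host available");
            }
            return "ok";
        }, 10, 1L, 8L);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

With the defaults from the table (interval 5000 ms) and an assumed 60000 ms cap, this schedule would wait 5 s, 10 s, 20 s, 40 s and then 60 s for each later retry. Non-transient exceptions are deliberately not caught, so query errors unrelated to connectivity still fail fast.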

Verifying these changes

This change needs tests and can be verified as follows:
- `mvn -pl hugegraph-server/hugegraph-test -am test -Dtest=CassandraTest` (13/13 pass)

Does this PR potentially affect the following parts?

Modify configurations

Documentation Status

Doc - TODO

- Register ExponentialReconnectionPolicy on the Cluster builder so the Datastax driver keeps retrying downed nodes in the background.
- Wrap every Session.execute() in executeWithRetry() with exponential backoff on transient connectivity failures.
- Implement reconnectIfNeeded()/reset() so the pool reopens closed sessions and issues a lightweight health-check (SELECT now() FROM system.local) before subsequent queries.
- Add tunable options: cassandra.reconnect_base_delay, cassandra.reconnect_max_delay, cassandra.reconnect_max_retries, cassandra.reconnect_interval.
- Add unit tests covering defaults, overrides, disabling retries and option keys.

Fixes apache#2740

imbajin · 2026-04-18T19:47:32Z

⚠️ commitAsync() bypasses retry — still calls this.session.executeAsync(s) directly

The PR wraps execute() and commit() with executeWithRetry, but commitAsync() (line 177 in the base file) still calls this.session.executeAsync(s) directly. If a Cassandra restart happens during an async batch commit, the same connectivity failure will surface without any retry.

Consider wrapping the async path as well, or at minimum adding a TODO/comment explaining why async commits are deliberately left un-retried (e.g., if retry semantics for async batches are too complex for this PR).
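One possible shape for retrying the async path, sketched with the JDK's `CompletableFuture` rather than the driver's own future type. Everything here is an assumption for illustration: `AsyncRetrySketch`, `retryAsync` and `TransientException` are hypothetical names, and the real `commitAsync()` would need an adapter from the driver's future plus a decision on whether the batch is safe to replay.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class AsyncRetrySketch {

    /** Hypothetical stand-in for transient driver connectivity errors. */
    static class TransientException extends RuntimeException {
        TransientException(String message) {
            super(message);
        }
    }

    /** Re-invokes `op` on transient failure, doubling the delay up to maxDelay. */
    static <T> CompletableFuture<T> retryAsync(Supplier<CompletableFuture<T>> op,
                                               int retries, long delayMillis,
                                               long maxDelay,
                                               ScheduledExecutorService scheduler) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(op, retries, delayMillis, maxDelay, scheduler, result);
        return result;
    }

    private static <T> void attempt(Supplier<CompletableFuture<T>> op,
                                    int retriesLeft, long delayMillis,
                                    long maxDelay,
                                    ScheduledExecutorService scheduler,
                                    CompletableFuture<T> result) {
        CompletableFuture<T> future;
        try {
            future = op.get();
        } catch (RuntimeException e) {
            // Treat a synchronous throw like a failed future.
            future = new CompletableFuture<>();
            future.completeExceptionally(e);
        }
        future.whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
                return;
            }
            // Dependent stages wrap failures in CompletionException.
            Throwable cause = (error instanceof CompletionException
                               && error.getCause() != null)
                              ? error.getCause() : error;
            if (!(cause instanceof TransientException) || retriesLeft <= 0) {
                result.completeExceptionally(cause);
                return;
            }
            long next = Math.min(maxDelay, delayMillis * 2);
            // Schedule the next attempt instead of blocking a thread.
            scheduler.schedule(
                () -> attempt(op, retriesLeft - 1, next, maxDelay, scheduler, result),
                delayMillis, TimeUnit.MILLISECONDS);
        });
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        AtomicInteger calls = new AtomicInteger();
        // Simulated async commit: fails twice, then succeeds.
        Supplier<CompletableFuture<String>> commit = () -> {
            CompletableFuture<String> f = new CompletableFuture<>();
            if (calls.incrementAndGet() < 3) {
                f.completeExceptionally(new TransientException("no host available"));
            } else {
                f.complete("committed");
            }
            return f;
        };
        try {
            String outcome = retryAsync(commit, 5, 1L, 8L, scheduler)
                                 .get(5, TimeUnit.SECONDS);
            System.out.println(outcome + " after " + calls.get() + " calls");
        } finally {
            scheduler.shutdownNow();
        }
    }
}
```

The supplier is re-invoked on each attempt, so a retried batch is sent again in full. That is only safe if the batched statements are idempotent, which may be one reason to leave async commits un-retried and document that choice instead.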

dpol1 · 2026-04-20T14:38:04Z

Thanks @imbajin for the feedback, changed!

dosubot bot added the size:L, bug and store labels Apr 18, 2026

github-project-automation bot added this to HugeGraph PD-Store Tasks Apr 18, 2026

github-project-automation bot moved this to In progress in HugeGraph PD-Store Tasks Apr 18, 2026

dpol1 force-pushed the fix/2740-cassandra-reconnect branch from 97de8e9 to fc3d291 April 18, 2026 17:37

imbajin reviewed Apr 18, 2026

Comment thread ...ssandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraSessionPool.java

fix: Address reviewer feedback

5ac3990

dpol1 requested a review from imbajin April 20, 2026 14:49

dpol1 wants to merge 2 commits into apache:master from dpol1:fix/2740-cassandra-reconnect
