fix(cassandra): auto-recover session after Cassandra restart #2997
Open
dpol1 wants to merge 2 commits into apache:master from
- Register ExponentialReconnectionPolicy on the Cluster builder so the
Datastax driver keeps retrying downed nodes in the background.
- Wrap every Session.execute() in executeWithRetry() with exponential
backoff on transient connectivity failures.
- Implement reconnectIfNeeded()/reset() so the pool reopens closed
sessions and issues a lightweight health-check (SELECT now() FROM
system.local) before subsequent queries.
- Add tunable options: cassandra.reconnect_base_delay,
cassandra.reconnect_max_delay, cassandra.reconnect_max_retries,
cassandra.reconnect_interval.
- Add unit tests covering defaults, overrides, disabling retries and
option keys.
Fixes apache#2740
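To make the backoff behaviour behind cassandra.reconnect_base_delay and cassandra.reconnect_max_delay concrete, here is a minimal sketch of a capped doubling schedule. BackoffSchedule and delayMs are hypothetical names, not code from this PR, and the sketch assumes plain doubling with no jitter; the driver's policy and the PR's retry loop may compute delays differently.

```java
// Hypothetical helper, not code from the PR: models a doubling backoff
// schedule that is capped at a maximum delay.
final class BackoffSchedule {

    // Delay before the given 0-based attempt: base * 2^attempt, capped at maxDelayMs.
    static long delayMs(int attempt, long baseDelayMs, long maxDelayMs) {
        if (attempt < 0 || baseDelayMs <= 0 || maxDelayMs < baseDelayMs) {
            throw new IllegalArgumentException("invalid backoff arguments");
        }
        int shift = Math.min(attempt, 62);
        // Compare against the shifted cap first so the left shift cannot overflow.
        if (baseDelayMs > (maxDelayMs >> shift)) {
            return maxDelayMs;
        }
        return baseDelayMs << shift;
    }

    public static void main(String[] args) {
        // Assumed values: 1000 ms base delay and 60000 ms maximum delay.
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.println("attempt " + attempt + ": "
                    + delayMs(attempt, 1000L, 60000L) + " ms");
        }
    }
}
```

With a 1000 ms base and a 60000 ms cap, the delay doubles on each attempt until it reaches the cap and then stays there.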
97de8e9 to fc3d291
Member
The PR wraps synchronous execution in retries but leaves the async path un-retried. Consider wrapping the async path as well, or at minimum adding a TODO/comment explaining why async commits are deliberately left un-retried (e.g., if retry semantics for async batches are too complex for this PR).
Author
Thanks @imbajin for the feedback, changed!
Purpose of the PR
closes #2740
HugeGraphServer stops responding after Cassandra is restarted and never
recovers without a full server restart.
Root cause:
CassandraSessionPool builds the Datastax Cluster without a ReconnectionPolicy, CassandraSession.execute(...) calls the driver once with no retry, and thread-local sessions are never probed for liveness.
Once Cassandra goes down, transient NoHostAvailableException/OperationTimedOutException errors surface to the user and the pool stays dead even after Cassandra comes back online.
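The missing liveness probe can be illustrated with a small, driver-independent sketch. LivenessGuard and its fields are hypothetical and are not the PR's CassandraSession; the sketch also assumes that a setting such as cassandra.reconnect_interval acts as a minimum interval between probes, which the PR text does not spell out.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

// Hypothetical sketch, not the PR's CassandraSession: rate-limits a liveness
// probe and triggers a reopen when the probe reports failure.
final class LivenessGuard {
    private final BooleanSupplier probe; // e.g. runs a cheap query, true on success
    private final Runnable reopen;       // e.g. closes and reopens the session
    private final LongSupplier clockMs;  // injectable clock, useful in tests
    private final long intervalMs;       // assumed: minimum time between probes
    private long lastCheckMs;
    private boolean checkedOnce;

    LivenessGuard(BooleanSupplier probe, Runnable reopen,
                  LongSupplier clockMs, long intervalMs) {
        this.probe = probe;
        this.reopen = reopen;
        this.clockMs = clockMs;
        this.intervalMs = intervalMs;
    }

    // Returns true if a reopen was triggered by this call.
    synchronized boolean reconnectIfNeeded() {
        long now = clockMs.getAsLong();
        if (checkedOnce && now - lastCheckMs < intervalMs) {
            return false; // probed recently, skip the health check
        }
        checkedOnce = true;
        lastCheckMs = now;
        if (probe.getAsBoolean()) {
            return false; // session is alive, nothing to do
        }
        reopen.run();
        return true;
    }
}
```

In a real session the probe would run something like the SELECT now() FROM system.local query mentioned in this PR, and the clock would typically be System::currentTimeMillis.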
Main Changes
- Register ExponentialReconnectionPolicy(baseDelay, maxDelay) on the Cluster builder so the Datastax driver keeps retrying downed nodes in the background.
- Wrap every Session.execute(...) in executeWithRetry(Statement) with exponential backoff on transient connectivity failures.
- Implement reconnectIfNeeded()/reset() on CassandraSession so the pool reopens closed sessions and issues a lightweight health-check (SELECT now() FROM system.local) before subsequent queries.
- Add four tunables in CassandraOptions (defaults preserve previous behavior for healthy clusters):
  - cassandra.reconnect_base_delay: 1000 ms
  - cassandra.reconnect_max_delay: 60000 ms
  - cassandra.reconnect_max_retries: 10 (0 disables)
  - cassandra.reconnect_interval: 5000 ms
- Add unit tests covering defaults, overrides, disabling retries and option keys.
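The retry wrapper described in the list above can be sketched without the driver on the classpath. RetryingExecutor, Sleeper and the predicate-based classification are hypothetical names introduced for illustration; the PR's executeWithRetry(Statement) takes a driver Statement and may differ in structure. The sketch assumes that a retry limit of 0 means the call is attempted once and never retried.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical sketch, not the PR's implementation: retries an action with
// capped exponential backoff, but only for failures classified as transient.
final class RetryingExecutor {
    // Abstracts the pause between attempts so tests can avoid real sleeping.
    interface Sleeper {
        void sleep(long ms) throws InterruptedException;
    }

    private final int maxRetries;   // assumed: 0 disables retries
    private final long baseDelayMs;
    private final long maxDelayMs;
    private final Predicate<Exception> isTransient;
    private final Sleeper sleeper;

    RetryingExecutor(int maxRetries, long baseDelayMs, long maxDelayMs,
                     Predicate<Exception> isTransient, Sleeper sleeper) {
        this.maxRetries = maxRetries;
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.isTransient = isTransient;
        this.sleeper = sleeper;
    }

    <T> T executeWithRetry(Callable<T> action) throws Exception {
        long delay = baseDelayMs;
        for (int attempt = 0; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                // Give up on non-transient errors or once the retry budget is spent.
                if (!isTransient.test(e) || attempt >= maxRetries) {
                    throw e;
                }
                sleeper.sleep(delay);
                // Double the delay without overflowing, capped at maxDelayMs.
                delay = delay > maxDelayMs / 2 ? maxDelayMs : delay * 2;
            }
        }
    }
}
```

A caller could construct it as new RetryingExecutor(3, 1000L, 60000L, e -> e instanceof java.io.IOException, Thread::sleep). In the PR's setting the predicate would presumably match the driver's NoHostAvailableException and OperationTimedOutException instead.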
Verifying these changes
mvn -pl hugegraph-server/hugegraph-test -am test -Dtest=CassandraTest (13/13 pass)
Does this PR potentially affect the following parts?
Documentation Status
Doc - TODO