
perf: move request submission off event loop thread in execute_concurrent #827

Draft
mykaul wants to merge 1 commit into scylladb:master from mykaul:perf/concurrent-submitter-thread

Conversation


@mykaul mykaul commented Apr 21, 2026

Summary

  • Moves execute_async() calls in execute_concurrent from the event-loop callback thread to a dedicated submitter thread
  • The event loop callback now only appends to a deque and signals an Event, reducing per-callback overhead from ~27μs to ~100ns
  • The submitter thread drains the deque in batches and calls execute_async(), which includes serialization — keeping that CPU work off the event loop

v2: Reduce per-request lock overhead in ResponseFuture

Second commit (7cae6a14e) reduces lock/synchronization cost per request in the execute_concurrent hot path:

  1. Lazy Event creation: ResponseFuture._event starts as None instead of Event(). The Event is only materialized in result() (the synchronous path). For execute_concurrent, which never calls result() on individual futures, this eliminates ~620ns per request (351ns Event construction + 267ns Event.set()).

  2. Merged add_callbacks(): Registers both callback and errback under a single _callback_lock acquisition instead of two separate lock/unlock cycles. Saves ~80ns per request.

  3. _set_final_result / _set_final_exception: Capture _event reference under _callback_lock before calling .set() outside the lock. Skip .set() when Event was never created. Null-check callback/errback lists before building to_call tuple.

  4. _wait_for_result(): Checks result availability under _callback_lock before creating Event — avoids Event creation entirely when the result arrived before the caller waits.

  5. _on_speculative_execute: Checks _final_result/_final_exception directly instead of Event.is_set(), since Event may be None with lazy creation.

All changes are safe under both GIL and free-threaded (PEP 703) Python. No GIL assumptions.

Benchmark Results

On our vector ingestion benchmark (100K rows, 768-dim float32 vectors, ScyllaDB 2026.1.1):

  • Stock master + execute_concurrent: ~7,500 rows/s
  • Enhanced driver + this change: +6-9% throughput improvement (additive with Cython serializer gains)
  • The improvement is modest because serialization still dominates; with Cython serializers reducing serialization cost, this change becomes more impactful

How It Works

  • _ConcurrentExecutorBase spawns a daemon submitter thread alongside the existing callback mechanism
  • Callbacks do deque.append(1); event.set() — minimal work on the hot path
  • Submitter thread wakes on the event, drains pending count, and calls _execute_next() in a batch
  • Thread-safe via collections.deque (atomic append/popleft in CPython) + threading.Event
  • Graceful shutdown: sentinel None in deque signals the thread to exit; join() in wait()

Testing

  • 642 unit tests pass, 0 failures
  • All 10 existing test_concurrent.py unit tests pass
  • Tested with real ScyllaDB cluster under sustained load (100K+ inserts)

mykaul force-pushed the perf/concurrent-submitter-thread branch 2 times, most recently from fd9be81 to b759d4a on April 21, 2026 at 11:00

mykaul commented Apr 21, 2026

v2 changes: reduce per-request lock overhead in ResponseFuture

A new commit, 7cae6a14e, on top of the submitter-thread change. It focuses on reducing lock/synchronization cost per request in the execute_concurrent hot path.

Changes

  1. Lazy Event creation (cluster.py): ResponseFuture._event starts as None instead of Event(). The Event is only materialized in _wait_for_result() (the synchronous result() path). For execute_concurrent, which never calls result() on individual futures, this eliminates ~620ns per request (351ns Event construction + 267ns Event.set()).

  2. Merged add_callbacks() (cluster.py): Registers both callback and errback under a single _callback_lock acquisition instead of two separate lock/unlock cycles. Saves ~80ns per request.

  3. _set_final_result / _set_final_exception (cluster.py): Capture _event reference under _callback_lock before calling .set() outside the lock. Skips .set() when Event was never created. Null-checks callback/errback lists before building to_call tuple. All safe under free-threaded Python (PEP 703).

  4. _wait_for_result() (cluster.py): New extracted method. Checks result availability under _callback_lock before creating Event — avoids Event creation entirely when the result arrived before the caller waits. Thread-safe under both GIL and no-GIL.

  5. _on_speculative_execute (cluster.py): Checks _final_result/_final_exception directly instead of relying on Event.is_set(), since Event may be None with lazy creation.

  6. Properties / paging (cluster.py): The warnings and custom_payload properties, as well as start_fetching_next_page, now handle the case where _event is None.

Design notes

  • All changes are safe under both GIL and free-threaded (PEP 703) Python. No GIL assumptions.
  • _callback_lock is reused as the synchronization point for lazy Event creation (no new locks).
  • pool.py is not modified — an earlier attempt to optimize _stream_available_condition.notify() was reverted due to lost-wakeup risk under free-threaded Python.

Test results


mykaul commented Apr 21, 2026

Correction on test failures: The 6 "pre-existing failures" reported in the v2 comment were all caused by a stale Cython .so in the working tree taking precedence over the .py source. After rebuilding (uv sync --reinstall-package scylla-driver):

642 passed, 0 failures, 8 skipped

No pre-existing failures. Clean test run.

mykaul force-pushed the perf/concurrent-submitter-thread branch 2 times, most recently from d054886 to 11c0191 on April 22, 2026 at 16:53
perf: move request submission off event loop thread in execute_concurrent

ConcurrentExecutorListResults now uses a dedicated submitter thread
instead of calling _execute_next inline from the event loop callback.
This decouples I/O completion processing from new request serialization
and enqueuing, yielding ~6-9% higher write throughput.

The callback signals a threading.Event; the submitter thread drains a
deque and calls session.execute_async in batches. This avoids blocking
the libev event loop thread with request preparation work (query plan,
serialization, tablet lookup) that takes ~27us per request.

The event-loop callback path is lock-free: it appends to a deque and
sets an Event, with no Condition/Lock acquisition in the hot path.
