
perf: move request submission off event loop thread in execute_concurrent #827

Draft
mykaul wants to merge 1 commit into scylladb:master from mykaul:perf/concurrent-submitter-thread

Conversation


@mykaul mykaul commented Apr 21, 2026

Summary

  • Moves execute_async() calls in execute_concurrent from the event-loop callback thread to a dedicated submitter thread
  • The event loop callback now only appends to a deque and signals an Event, reducing per-callback overhead from ~27μs to ~100ns
  • The submitter thread drains the deque in batches and calls execute_async(), which includes serialization — keeping that CPU work off the event loop

v2: Reduce per-request lock overhead in ResponseFuture

Second commit (7cae6a14e) reduces lock/synchronization cost per request in the execute_concurrent hot path:

  1. Lazy Event creation: ResponseFuture._event starts as None instead of Event(). The Event is only materialized in result() (the synchronous path). For execute_concurrent, which never calls result() on individual futures, this eliminates ~620ns per request (351ns Event construction + 267ns Event.set()).

  2. Merged add_callbacks(): Registers both callback and errback under a single _callback_lock acquisition instead of two separate lock/unlock cycles. Saves ~80ns per request.

  3. _set_final_result / _set_final_exception: Capture _event reference under _callback_lock before calling .set() outside the lock. Skip .set() when Event was never created. Null-check callback/errback lists before building to_call tuple.

  4. _wait_for_result(): Checks result availability under _callback_lock before creating Event — avoids Event creation entirely when the result arrived before the caller waits.

  5. _on_speculative_execute: Checks _final_result/_final_exception directly instead of Event.is_set(), since Event may be None with lazy creation.

All changes are safe under both GIL and free-threaded (PEP 703) Python. No GIL assumptions.

Benchmark Results

On our vector ingestion benchmark (100K rows, 768-dim float32 vectors, ScyllaDB 2026.1.1):

  • Stock master + execute_concurrent: ~7,500 rows/s
  • Enhanced driver + this change: +6-9% throughput improvement (additive with Cython serializer gains)
  • The improvement is modest because serialization still dominates; with Cython serializers reducing serialization cost, this change becomes more impactful

How It Works

  • _ConcurrentExecutorBase spawns a daemon submitter thread alongside the existing callback mechanism
  • Callbacks do deque.append(1); event.set() — minimal work on the hot path
  • Submitter thread wakes on the event, drains pending count, and calls _execute_next() in a batch
  • Thread-safe via collections.deque (atomic append/popleft in CPython) + threading.Event
  • Graceful shutdown: sentinel None in deque signals the thread to exit; join() in wait()

Testing

  • 642 unit tests pass, 0 failures
  • All 10 existing test_concurrent.py unit tests pass
  • Tested with real ScyllaDB cluster under sustained load (100K+ inserts)

mykaul force-pushed the perf/concurrent-submitter-thread branch 2 times, most recently from fd9be81 to b759d4a on April 21, 2026 at 11:00

mykaul commented Apr 21, 2026

v2 changes: reduce per-request lock overhead in ResponseFuture

A new commit, 7cae6a14e, on top of the submitter-thread change. It focuses on reducing lock/synchronization cost per request in the execute_concurrent hot path.

Changes

  1. Lazy Event creation (cluster.py): ResponseFuture._event starts as None instead of Event(). The Event is only materialized in _wait_for_result() (the synchronous result() path). For execute_concurrent, which never calls result() on individual futures, this eliminates ~620ns per request (351ns Event construction + 267ns Event.set()).

  2. Merged add_callbacks() (cluster.py): Registers both callback and errback under a single _callback_lock acquisition instead of two separate lock/unlock cycles. Saves ~80ns per request.

  3. _set_final_result / _set_final_exception (cluster.py): Capture _event reference under _callback_lock before calling .set() outside the lock. Skips .set() when Event was never created. Null-checks callback/errback lists before building to_call tuple. All safe under free-threaded Python (PEP 703).

  4. _wait_for_result() (cluster.py): New extracted method. Checks result availability under _callback_lock before creating Event — avoids Event creation entirely when the result arrived before the caller waits. Thread-safe under both GIL and no-GIL.

  5. _on_speculative_execute (cluster.py): Checks _final_result/_final_exception directly instead of relying on Event.is_set(), since Event may be None with lazy creation.

  6. Properties / paging (cluster.py): The warnings and custom_payload properties, as well as start_fetching_next_page, now handle the case where _event is None.

Design notes

  • All changes are safe under both GIL and free-threaded (PEP 703) Python. No GIL assumptions.
  • _callback_lock is reused as the synchronization point for lazy Event creation (no new locks).
  • pool.py is not modified — an earlier attempt to optimize _stream_available_condition.notify() was reverted due to lost-wakeup risk under free-threaded Python.

Test results


mykaul commented Apr 21, 2026

Correction on test failures: The 6 "pre-existing failures" reported in the v2 comment were all caused by a stale Cython .so in the working tree taking precedence over the .py source. After rebuilding (uv sync --reinstall-package scylla-driver):

642 passed, 0 failures, 8 skipped

No pre-existing failures. Clean test run.

mykaul force-pushed the perf/concurrent-submitter-thread branch 2 times, most recently from d054886 to 11c0191 on April 22, 2026 at 16:53
perf: move request submission off event loop thread in execute_concurrent

ConcurrentExecutorListResults now uses a dedicated submitter thread
instead of calling _execute_next inline from the event loop callback.
This decouples I/O completion processing from new request serialization
and enqueuing, yielding ~6-9% higher write throughput.

The callback signals a threading.Event; the submitter thread drains a
deque and calls session.execute_async in batches. This avoids blocking
the libev event loop thread with request preparation work (query plan,
serialization, tablet lookup) that takes ~27us per request.

The event-loop callback path is lock-free: it appends to a deque and
sets an Event, with no Condition/Lock acquisition in the hot path.
