feat: background workers = non-HTTP workers with shared state #2287
nicolas-grekas wants to merge 12 commits into php:main
Interesting approach to parallelism. What would be a concrete use case for only letting information flow one way, from the sidekick to the HTTP workers? Usually the flow would be inverted: an HTTP worker offloads work to a pool of 'sidekick' workers and can optionally wait for a task to complete. |
Thank you for the contribution. Interesting idea, but I'm thinking we should merge the approach with #1883. The kind of worker is the same, how they are started is but a detail. @nicolas-grekas the Caddyfile setting should likely be per |
@AlliBalliBaba The use case isn't task offloading (HTTP->worker), but out-of-band reconfigurability (environment->worker->HTTP). Sidekicks observe external systems (Redis Sentinel failover, secret rotation, feature flag changes, etc.) and publish updated configuration that HTTP workers pick up on their next request; with per-request consistency guaranteed via
Task offloading (what you describe) is a valid and complementary pattern, but it solves a different problem. The non-HTTP worker foundation here could support both.
@henderkes Agreed that the underlying non-HTTP worker type overlaps with #1883. The foundation (skip HTTP startup/shutdown, immediate readiness, cooperative shutdown) is the same. The difference is the API layer and the DX goals:
Happy to follow up with your proposals now that this is hopefully clarified. |
Great PR! Couldn't we create a single API that covers both use cases? We try to keep the number of public symbols and config options as small as possible! |
Yes, that's why I'd like to unify the two APIs and background implementations into one. Unfortunately the first task worker attempt didn't make it into |
|
The PHP-side API has been significantly reworked since the initial iteration: I replaced The old design used
Key improvements:
Other changes:
|
Thanks @dunglas and @henderkes for the feedback. I share the goal of keeping the API surface minimal. Thinking about it more, the current API is actually quite small and already general:
The name "sidekick" works as a generic concept: a helper running alongside. The current set_vars/get_vars protocol covers the config-publishing use case. For task offloading (HTTP->worker) later, the same sidekick infrastructure could support:
Same worker type, same So the path would be:
The foundation (non-HTTP threads, cooperative shutdown, crash recovery, per-php_server scoping) is shared. Only the communication primitives differ. WDYT? |
Hmm, it seems they are on some versions, for example here: https://github.com/php/frankenphp/actions/runs/23192689128/job/67392820942?pr=2287#step:10:3614 For the cache, I'm not aware of a GitHub feature that allows clearing everything, unfortunately 🙁 |
|
Thanks for the ping, it slipped my mind. I'll catch up with the conversation this week! |
|
I'm in Berlin until Friday BTW! Let's meet at SymfonyLive or around? @AlliBalliBaba also? Please join me on https://symfony.com/slack |
|
After re-reading the thread, here's my position. The target use case feels right, and I don't have a concrete need today that would justify use cases beyond the background-to-HTTP flow, like HTTP workers publishing data between themselves. I also have no objection in principle to lazy-starting from PHP: the DX it provides is valuable, and starting a worker can remain conditional on an app-level trigger. Feels a bit like goroutines, I like it. That said, two points tip me toward an API that cleanly separates the lifecycle from the data store:
Concretely, I'd support an API where the store is exposed in a generic form, and where starting a background worker is a named, explicit operation. Nothing prevents lazy-start via Caddyfile catch-all from still kicking in on a On the other points, I'm aligned with what already seems to be converging in the thread. Sorry if I forgot something, there are quite a few comments 🙂 |
|
Thanks for taking the time @alexandre-daubois. Two things I'd like to push back on, plus one structural argument that I think hasn't been engaged with yet. On "freezes the On the diagnostic chain concern: I offered some improvements in my previous reply, I think we can push this further until errors from workers are surfaced better to callers, all compatible with the API I'm proposing. That's the one aspect I think we can improve further. On the explicit
The current design sidesteps both: one The structural argument I'd like your read on: the current model is one-writer-many-readers by construction. Only the background worker owning |
Not necessarily the public webroot, but a root defined by the Caddyfile for sure. The problem with your approach is that it limits to a single background worker script, which is likely in a framework like Symfony with a central kernel and container, but otherwise not.
It's a fair point, but again, I'm not concerned with security when it's explicitly configurable through some background-worker-directory. We're not giving anyone a gun here; they're stealing it, pointing it at their foot, and firing repeatedly when someone actually manages to run into a security issue with it. And they could do the same with a single entrypoint, too.
It's true that this adds API surface, but it's unified API surface that we'd most likely add at some point anyway. Then it's better to have an explicit API than magic behaviour on one, but not the other.
That's a very fair point that I don't have a perfect solution to. It's actually one where I'm going back and forth between using names at all and just using anonymous lists in another project I'm working on.
See point 1, because at that point it would just be confusing about what issues a lazy start and what doesn't. (And I'm honestly not even sure how useful a lazy start really is, what problem does that solve? The library will have a dependency on the Caddyfile configuration at that point and the worker existing if it's hit once, and if it is, it would never shut down again)
These are all more or less the same point of disagreement, which is that this PR is locking that decision in "forever". No generic KV store, no matter if it would ever make sense (I'd argue it would: how else would you share vars within the same application on different threads, but guard them from being accessed by other, unrelated applications?). Using APCu for this is very dirty and will suffer from heavy fragmentation for a runtime concern. I think @alexandre-daubois essentially has the same considerations that Alex and I do.
It would obviously be blocking until started, but it would be a generic API surface that could be reused for task workers, which we still intend to add. And it would be explicit. And it would solve the inability to reason about what a unified
I just think the actual issue with it is the same as before: worker string names lead to poor reasoning. If library A uses 'redis' and library B uses the same, but both expect different worker scripts, we have the exact same issue that the many-writers, many-readers has. If we don't have conflicting worker names, there's no issue with many-writers-many-readers either. |
|
Marc has already articulated most of where I land, so I'll stay short and add a few angles I don't think have come up yet. On the DNS / Redis / Symfony service name analogy, I think the comparison doesn't hold. Those APIs deliberately separate concerns: DNS has About testability: a global function whose call can spawn a worker process is hostile to unit tests. Libraries adopting this will either need to wrap it in their own abstraction or give up on isolation in tests. A store-shaped API is materially more mockable. Also, about the principle of least surprise: Finally, genuine question about the API: is it possible to unset a key? |
|
Edited: I missed last response by @alexandre-daubois and I agree with him. API updated. Thanks, everyone, for the depth of this one! @nicolas-grekas for the huge amount of work, and @henderkes, @AlliBalliBaba, @alexandre-daubois, @dbu for the careful pushback. I've read through the whole thread, and I think we're close to merging it. Most of the back-and-forth is really Here's my opinion on this: Caddyfile and the whole Go/C runtime stay as Nicolas designed them, but we make small changes to the PHP API:
On unsetting a key: with snapshot semantics it's just set_vars a new array without the key — no dedicated primitive needed. If we ever add per-key writes we'd add a matching unset. We can apply the same logic for #2319, drop the
The fact that a worker picks up the task is an implementation detail. WDYT? |
|
Looks like the best of both worlds @dunglas. Dropping Sorry if this was answered somewhere in the comments: what's the defined behavior when the caller has no worker scope, e.g. called from an HTTP request context, a CLI script, or any non-worker code path? Should it be no-op or throw a |
|
I would throw too |
Why do we need a timeout for the get_vars? Shouldn't that just return immediately since the prior
Perhaps this should return an object on which php can call
You're talking about I'm generally happy with that direction, but I'd still want to argue the case for being able to define multiple background worker scripts. We went out of our way to support non-framework code all the way up until this point, for the gain I see (for a single script would already mostly disappear with an explicit |
|
Thanks @dunglas for the proposal, I think we're very close. Let me suggest a small refinement that I think fully addresses the debuggability objection without giving up anything structural.
Proposal (noted about #2319 also)
frankenphp_require_background_worker(string $name, float $timeout = 30.0): void
frankenphp_set_vars(array $vars): void
frankenphp_get_vars(string|array $name): array
frankenphp_get_worker_handle(): resource
Four functions, same count as your proposal. Two differences:
|
Renames PHP API for forward compat (the API can later serve non-worker use cases):
- frankenphp_set_worker_vars -> frankenphp_set_vars
- frankenphp_get_worker_vars -> frankenphp_get_vars
Also enriches get_vars timeout errors when the worker fails before reaching set_vars: the exception now includes the worker name, resolved entrypoint path, exit status, number of attempts, and the last PHP error (message, file, line) captured from PG(last_error_*).
Makes lifecycle explicit and decouples it from data access:
- Add frankenphp_require_background_worker(string $name, float $timeout = 30.0): lazy-starts the worker and blocks until it has called set_vars once (ready) or the timeout expires. Throws on boot failure with the same rich details.
- frankenphp_get_vars(string|array $name): array is now a pure read. It no longer starts workers or waits for readiness. Throws if the target worker is not running or has not called set_vars yet. The $timeout argument is removed (no blocking).
This eliminates the "sometimes starts, sometimes doesn't" side effect on reads and makes traceability linear: grep for require_background_worker to find where a dependency is declared.
All PHP test scripts are updated to call require_background_worker before get_vars. The Go runtime now exposes two separate exports (go_frankenphp_require_background_worker, go_frankenphp_get_vars) with focused responsibilities.
frankenphp_require_background_worker now behaves according to the caller's context, which makes the call site carry meaningful intent:
- HTTP worker BEFORE frankenphp_handle_request (bootstrap): lazy-start and fail-fast. As soon as a boot attempt fails, the rich error is thrown without waiting for the retry backoff cycle. This turns the bootstrap phase into a strict dependency declaration: broken deps = worker boot fails visibly, not serves traffic degraded.
- HTTP worker INSIDE frankenphp_handle_request (runtime): assert-only. The worker must already be running (num 1 in Caddyfile, or previously required during bootstrap). Throws immediately if not. Never lazy-starts. Runtime require becomes a clean assertion, not a side-effectful call.
- Non-worker mode (classic request, CLI): lazy-start with tolerance. Waits up to timeout, letting the restart/backoff loop recover from transient failures. Matches the existing behavior every request does anyway.
Mode detection uses workerThread.isBootingScript, which is already tracked.
Test PHP files are restructured to call require_background_worker at bootstrap (before frankenphp_handle_request). New tests cover:
- Runtime require asserting on unknown worker
- get_vars throwing on not-running worker
The boot-failure test is moved to non-worker mode so the tolerant path can exercise the rich error reporting. Docs describe the three modes with examples.
In CLI mode (frankenphp php-cli), there is no worker pool and no SAPI request cycle; the background-worker API is meaningless. Rather than throwing at call time, the four functions are now unregistered from the function table in MINIT when the SAPI is "cli":
- frankenphp_require_background_worker
- frankenphp_set_vars
- frankenphp_get_vars
- frankenphp_get_worker_handle
This lets library code detect the mode cleanly via function_exists() and fall back to alternative config sources without try/catch. HTTP-mode set_vars cannot be hidden the same way because the context (HTTP worker vs background worker) is per-thread, not per-process; threads share the PHP module and function table. Runtime throw stays the approach there, matching PHP's convention for pcntl_*/posix_*.
Background workers inherited the HTTP-worker wording which mentioned frankenphp_handle_request, a function that doesn't apply to bg workers. The ready signal for a bg worker is frankenphp_set_vars.
- startupFailChan error: "background worker %s has not reached frankenphp_set_vars()"
- Warn log (watcher): "(watcher enabled) background worker has not reached frankenphp_set_vars()"
- Warn log (normal): "background worker boot failed, restarting"
The log now also carries exit_status and, when captured, the last PHP error message from bootFailureInfo. Tailing the FrankenPHP log at Warn level is enough to see what went wrong without reading the PHP error log separately.
In CLI mode the bg-worker functions are no longer exposed, so listing CLI as a "non-worker mode" case is misleading. Non-worker mode now means classic request-per-process (HTTP requests where no worker is configured), which is the only case where require still runs with tolerant lazy-start semantics.
Was: scope passed around as string (formatted "php_server_N") with "" as the "no scope" sentinel. Every consumer had to agree on the string shape, and changing the representation would be a BC break on the public API (NextBackgroundWorkerScope, WithWorkerBackgroundScope, WithRequestBackgroundScope).
Now: type BackgroundScope int. Opaque wrapper; callers can only obtain one via NextBackgroundWorkerScope(). The zero value (BackgroundScope(0)) keeps the "no scope" semantics, so the embed/non-caddy path is unchanged.
- Type safety: can't accidentally pass an env var or worker name as a scope.
- Forward BC: internal representation can change (int -> struct, different hashing, whatever) without touching consumer code.
- No overhead: int map keys vs string map keys at startup/per-request is noise.
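The opaque-scope idea can be sketched in a few lines of Go. Only `BackgroundScope` and `NextBackgroundWorkerScope` are names taken from the commit message; the atomic counter, the `registry` type, and its `register`/`lookup` helpers are assumptions made for illustration and may differ from the actual implementation.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// BackgroundScope is opaque to callers; the zero value means "no scope".
type BackgroundScope int

// scopeCounter is an assumption: a process-wide counter behind the allocator.
var scopeCounter atomic.Int64

// NextBackgroundWorkerScope hands out a fresh, non-zero scope.
func NextBackgroundWorkerScope() BackgroundScope {
	return BackgroundScope(scopeCounter.Add(1))
}

// registry is a hypothetical map of scope -> worker name -> entrypoint script.
type registry map[BackgroundScope]map[string]string

func (r registry) register(scope BackgroundScope, name, script string) {
	if r[scope] == nil {
		r[scope] = map[string]string{}
	}
	r[scope][name] = script
}

// lookup only sees workers registered under the same scope.
func (r registry) lookup(scope BackgroundScope, name string) (string, bool) {
	script, ok := r[scope][name]
	return script, ok
}

func main() {
	siteA := NextBackgroundWorkerScope()
	siteB := NextBackgroundWorkerScope()

	r := registry{}
	r.register(siteA, "redis", "/srv/a/redis-watcher.php")
	r.register(siteB, "redis", "/srv/b/redis-watcher.php")

	// The same worker name resolves independently in each scope.
	a, _ := r.lookup(siteA, "redis")
	b, _ := r.lookup(siteB, "redis")
	fmt.Println(a)
	fmt.Println(b)

	// Passing a string here would not compile: r.lookup("php_server_1", "redis")
	_, found := r.lookup(BackgroundScope(0), "redis")
	fmt.Println("found without scope:", found)
}
```

Since consumers only ever hold a value returned by the allocator, the underlying representation could later change without touching their code, which is the forward-compatibility argument made above.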
"HTTP workers" implies specifically worker mode; but non-worker classic requests can also read bg worker vars and need signaling. In contexts where both apply, "HTTP threads" is accurate and doesn't suggest the reader needs worker mode. Kept "HTTP worker" where it's specifically about worker-mode behavior (inside frankenphp_handle_request, FRANKENPHP_WORKER_BACKGROUND, etc.).
get_vars is now pure-read; the lazy-start lives in require_background_worker. Updated the How It Works steps, the named/catch-all bullets in Configuration, and the Caddyfile comment that described the catch-all as "handles any unlisted name via get_vars()".
Docs: "same behavior every request" wording was misleading. In non-worker mode, only the first request to a given name pays the lazy-start cost; subsequent calls see the worker already reserved and return almost immediately.
Tests: add coverage for three edge cases of frankenphp_require_background_worker():
- Empty name -> ValueError
- Negative timeout -> ValueError
- timeout=0 -> must not hang (returns promptly, any error flavor)
|
I don't like WDYT about |
|
We wouldn't ensure a background worker, we would ensure a background worker is running. I'm with you though, require feels wrong. I'm still in favour of |
Yes, I think we all meant
We already don't do frankenphp sapi bootup (embed instead) in the cli version. With my proposed php-src change it would use the cli sapi, still without the frankenphp extension.
I was thinking of potential worker orchestration from php side later. But thinking about it again, we could do that with streams too, so it's fine.
Sorry, I should've re-read the current version. We've been through so many iterations, at this point it's all getting a bit fuzzy, haha. No further objections from my side then. |
|
Thanks @henderkes for the follow-up confirming no further objections. On your php-src change proposal: if it lands and makes FrankenPHP CLI use the I pushed a set of refinements on top of the previous round. Summary of what changed and why: 1. Rename
|
|
The last version of the public API sounds good to me! Excellent work. |
I'm sorry, but I strongly disagree here.
When this is a real concern (and I'm not sure it is), I think we should shut down workers that haven't been asked for in a while. I'm all for keeping it as simple as possible: |
|
ensure/start I'll follow your lead - @dunglas any stronger opinion?
I agree, and that's now closer to one behavior! The only special case is failing early when a worker cannot start while HTTP workers haven't called handle_request yet. I think that's a net safety gain that will improve robustness for people who can start things early, because it makes putting FrankenPHP live safer. The backoff mechanism of HTTP workers will help recover from that automatically on startup when possible, while providing quicker feedback. |
|
This PR is absolutely massive. 3k loc change ... I'd argue breaking it down by scope and merge in minimal working systems, iterating as you go and paying attention to related issues so you learn the pain points users experience. For this PR ... There's so much going on, and some of it is not-obviously-wrong. There are at least 3 potential race conditions that jump out at me immediately, double close issues (which can create a security vulnerability or corruption), workers potentially getting stuck in half-started states, caddy file ordering issues, lack of synchronization, etc. Sure, many of these problems "go away" by enforcing exactly one worker thread and assuming users only use caddy to run frankenphp, but it would be a ton of work to remove that constraint if/when we want to. I'd be happy to review the whole diff, but my personal preference is to break it down. Here's where I see some seams:
You could stop here, or keep going. User demand (how can I add more instances?) gives a good reason to continue.
Each of these is independently useful, independently reviewable, and (importantly) independently revertible if a design choice turns out to be wrong. Step 4 alone covers probably 80% of what users will actually reach for. If steps 5–7 take another release or two while patterns emerge from issues, that's fine; the feature is still shipped. The other thing this buys you: each slice lets the next one's API be informed by what users actually do with the previous one. Shipping 3k lines at once locks in set_vars / get_vars / ensure / batch-names / scoping / catch-all semantics before anyone has written a single real background worker against them. This is good work, and I'm excited to see where it goes. |
Collapses the three-mode require semantics into two, moves the batch-declaration affordance to require instead of get_vars, and renames require to ensure per @dunglas's concern that "require" collides with PHP's language keyword (which takes a path).
- frankenphp_require_background_worker -> frankenphp_ensure_background_worker. "Ensure" captures the "make sure this is running, start it if it isn't" semantic without the keyword collision.
- Tolerant lazy-start inside frankenphp_handle_request: runtime ensure now behaves the same as non-worker mode (lazy-start + timeout + backoff tolerance). Bootstrap-before-handle_request keeps its fail-fast discipline. This lets processes start only the workers they actually exercise, instead of over-provisioning by pre-ensuring everything that might be needed.
- Multi-name ensure: frankenphp_ensure_background_worker now accepts string|array. Batch declaration with a shared deadline, fail-fast on any worker's boot failure in bootstrap mode. get_vars loses the array form, becoming single-name pure read.
- globalCtx.Done() wired into ensure's select cases, so in-flight calls unblock cleanly on FrankenPHP shutdown instead of waiting out their timeout.
- Fix: markBackgroundReady() is now called on every set_vars, not just the first. Previously sk.readyOnce gated it, leaving isBootingScript stuck at true after a crash-restart. That misclassified subsequent crashes as boot failures and kept the readyWorkers metric decremented.
Note
Description updated to reflect the latest pushes. API names and semantics are final pending review; see the thread for the back-and-forth that led here.
Summary
Background workers are long-running PHP workers that run outside the HTTP cycle. They observe their environment (Redis, DB, filesystem, etc.) and publish variables that HTTP threads (workers or classic requests) read per-request, enabling real-time reconfiguration without restarts or polling.
PHP API
Four functions:
- frankenphp_ensure_background_worker(string|array $name, float $timeout = 30.0): void - declares a dependency on one or more background workers. Lazy-starts them if needed, blocks until each has called set_vars() at least once or the timeout expires. Two behaviors depending on caller:
  - HTTP worker bootstrap (before frankenphp_handle_request): fail-fast. Any boot failure throws immediately with the captured details instead of waiting for the backoff cycle. Use for strict dependency declaration at boot.
  - Everywhere else (inside frankenphp_handle_request, classic request-per-process): tolerant lazy-start. First caller pays the startup cost; later callers see the worker already reserved. Processes only start workers they actually exercise.
- frankenphp_set_vars(array $vars): void - publishes vars from a background worker script (persistent memory, cross-thread). Skips all work when data is unchanged (=== check).
- frankenphp_get_vars(string $name): array - pure read. Returns the latest published vars. Throws if the worker isn't running or hasn't called set_vars() yet. Generational cache: repeated calls within a single HTTP request return the same array instance (=== is O(1)).
- frankenphp_get_worker_handle(): resource - readable stream for shutdown signaling. Closed on shutdown (EOF).
In CLI mode (frankenphp php-cli), none of these functions are exposed (MINIT-level hiding via zend_hash_str_del). function_exists() returns false, so library code can degrade gracefully.
Caddyfile configuration
- background marks a worker as non-HTTP
- name specifies an exact worker name; workers without name are catch-all for lazy-started names
- max_threads on catch-all sets a safety cap for lazy-started instances (defaults to 16)
- max_consecutive_failures defaults to 6 (same as HTTP workers)
- max_execution_time automatically disabled for background workers
- Each php_server block has its own isolated scope (opaque BackgroundScope type managed by frankenphp.NextBackgroundWorkerScope())
Shutdown
On restart/shutdown, the signaling stream is closed. Workers detect this via fgets() returning false (EOF). Workers have a 5-second grace period. In-flight ensure_background_worker calls unblock on globalCtx.Done() instead of waiting out their timeout.
After the grace period, a best-effort force-kill is attempted:
- max_execution_time timer cross-thread via timer_settime(EG(max_execution_timer_timer))
- CancelSynchronousIo + QueueUserAPC interrupts blocking I/O and alertable waits
During the restart window, get_vars returns the last published data (stale but available, kept in persistent memory across restarts). A warning is logged on crash.
Boot-failure reporting
When a background worker fails before calling set_vars, ensure_background_worker throws a RuntimeException with the captured details: worker name, resolved entrypoint path, exit status, number of attempts, and the last PHP error (message, file, line) captured from PG(last_error_*).
Forward compatibility
The signaling stream is forward-compatible with the PHP 8.6 poll API RFC. Poll::addReadable accepts stream resources directly; code written today with stream_select will work on 8.6 with Poll, no API change needed.
Architecture
- php_server scope isolation via opaque BackgroundScope type. Internal registry is unexported.
- backgroundWorkerThread handler implementing threadHandler interface, decoupled from HTTP worker code paths.
- drain() closes the signaling stream (EOF) for clean shutdown signaling.
- Persistent memory (pemalloc) with RWMutex for safe cross-thread sharing.
- set_vars skip: uses PHP's === (zend_is_identical) to detect unchanged data, skips validation, persistent copy, write lock, and version bump.
- Immutable arrays (IS_ARRAY_IMMUTABLE).
- Interned strings (ZSTR_IS_INTERNED): skip copy/free for shared-memory strings.
- ensure_background_worker accepts a batch of names with a shared deadline; fail-fast in bootstrap mode reports the failing worker's details.
- $_SERVER['FRANKENPHP_WORKER_NAME'] set for background workers.
- $_SERVER['FRANKENPHP_WORKER_BACKGROUND'] set for all workers (true/false).
Example
Test coverage
Unit tests, integration tests, and one Caddy integration test covering: bootstrap fail-fast, runtime tolerant lazy-start, multi-name ensure, get_vars pure read, set_vars validation (types, objects, refs), CLI function hiding, enum support, binary-safe strings, multiple entrypoints, crash-restart reclassification, boot-failure rich errors, signaling stream, worker restart lifecycle, named auto-start with m# prefix, edge cases (empty name, negative timeout, timeout=0).
All tests pass on PHP 8.2, 8.3, 8.4, and 8.5 with -race. Zero memory leaks on PHP debug builds.
Documentation
Full docs at docs/background-workers.md.