Commit 6a95756

Make recovery handoffs explain why a lane resumed instead of leaking control prose
Recent OMX dogfooding kept surfacing raw `[OMX_TMUX_INJECT]` messages as lane results, which told operators that tmux reinjection happened but not why or what lane/state it applied to. The lane-finished persistence path now recognizes that control prose, stores structured recovery metadata, and emits a human-meaningful fallback summary instead of preserving the raw marker as the primary result.

Constraint: Keep the fix in the existing lane-finished metadata surface rather than inventing a new runtime channel
Rejected: Treat all reinjection prose as ordinary quality-floor mush | loses the recovery cause and target lane operators actually need
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Recovery classification is heuristic; extend the parser only when new operator phrasing shows up in real dogfood evidence
Tested: cargo fmt --all --check
Tested: cargo clippy --workspace --all-targets -- -D warnings
Tested: cargo test --workspace
Tested: LSP diagnostics on rust/crates/tools/src/lib.rs (0 errors)
Tested: Architect review (APPROVE)
Not-tested: Additional reinjection phrasings beyond the currently observed `[OMX_TMUX_INJECT]` / current-mode-state variants
Related: ROADMAP #68
1 parent 42bb6cd commit 6a95756
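
The behavior described in the commit message can be sketched in a few lines. This is a minimal, simplified illustration, not the code from this commit: the function names `looks_like_recovery_prose` and `normalize_summary` and the exact fallback wording are assumptions made for the example, and the real detection covers more phrasings.

```rust
/// Hypothetical, simplified stand-in for the commit's recovery detection.
/// Returns true when the summary looks like OMX reinjection control prose.
fn looks_like_recovery_prose(summary: &str) -> bool {
    let lowered = summary.trim().to_ascii_lowercase();
    lowered.contains("omx_tmux_inject") || lowered.contains("continue from current mode state")
}

/// Builds an operator-facing summary that names the cause instead of
/// echoing the raw marker. The wording here is illustrative only.
fn normalize_summary(lane: &str, raw: &str, cause: &str) -> String {
    if looks_like_recovery_prose(raw) {
        // Control prose: drop the raw text and report the structured cause.
        format!("Lane `{lane}` resumed via tmux reinjection (cause: `{cause}`).")
    } else {
        // Ordinary summaries pass through unchanged (apart from trimming).
        raw.trim().to_string()
    }
}
```

Calling `normalize_summary("recovery-lane", "Continue from current mode state. [OMX_TMUX_INJECT]", "resume_after_stop")` yields a sentence that names the lane and cause and no longer contains the internal marker, while a normal summary is returned as written.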

File tree

2 files changed

+166
-10
lines changed

ROADMAP.md

Lines changed: 2 additions & 2 deletions
@@ -496,7 +496,7 @@ Model name prefix now wins unconditionally over env-var presence. Regression tes

 62. **Worker state file surface not implemented** - **done (verified 2026-04-12):** current `main` already wires `emit_state_file(worker)` into the worker transition path in `rust/crates/runtime/src/worker_boot.rs`, atomically writes `.claw/worker-state.json`, and exposes the documented reader surface through `claw state` / `claw state --output-format json` in `rust/crates/rusty-claude-cli/src/main.rs`. Fresh proof exists in `runtime` regression `emit_state_file_writes_worker_status_on_transition`, the end-to-end `tools` regression `recovery_loop_state_file_reflects_transitions`, and direct CLI parsing coverage for `state` / `state --output-format json`. Source: Jobdori dogfood.

-**Scope note (verified 2026-04-12):** ROADMAP #31, #43, and #63-#68 currently appear to describe acpx/droid or upstream OMX/server orchestration behavior, not claw-code source already present in this repository. Repo-local searches for `acpx`, `use-droid`, `run-acpx`, `commit-wrapper`, `ultraclaw`, `roadmap-nudge-10min`, `OMX_TMUX_INJECT`, `/hooks/health`, and `/hooks/status` found no implementation hits outside `ROADMAP.md`, and the earlier state-surface note already records that the HTTP server is not owned by claw-code. With #45, #64-#67, and #69 now fixed, the remaining unresolved items in this section still look like external tracking notes rather than confirmed repo-local backlog; re-check if new repo-local evidence appears.
+**Scope note (verified 2026-04-12):** ROADMAP #31, #43, and #63 currently appear to describe acpx/droid or upstream OMX/server orchestration behavior, not claw-code source already present in this repository. Repo-local searches for `acpx`, `use-droid`, `run-acpx`, `commit-wrapper`, `ultraclaw`, `/hooks/health`, and `/hooks/status` found no implementation hits outside `ROADMAP.md`, and the earlier state-surface note already records that the HTTP server is not owned by claw-code. With #45, #64-#69, and #75 now fixed, the remaining unresolved items in this section still look like external tracking notes rather than confirmed repo-local backlog; re-check if new repo-local evidence appears.

 63. **Droid session completion semantics broken: code arrives after "status: completed"** - dogfooded 2026-04-12. Ultraclaw droid sessions (use-droid via acpx) report `session.status: completed` before file writes are fully flushed/synced to the working tree. Discovered +410 lines of "late-arriving" droid output that appeared after I had already assessed 8 sessions as "no code produced." This creates false-negative assessments and duplicate work. **Fix shape:** (a) droid agent should only report completion after explicit file-write confirmation (fsync or existence check); (b) or, claw-code should expose a `pending_writes` status that indicates "agent responded, disk flush pending"; (c) lane orchestrators should poll for file changes for N seconds after completion before final assessment. **Blocker:** none. Source: Jobdori ultraclaw dogfood 2026-04-12.

@@ -508,7 +508,7 @@ Model name prefix now wins unconditionally over env-var presence. Regression tes

 67. **Scoped review lanes do not emit structured verdicts** - **done (verified 2026-04-12):** completed lane persistence in `rust/crates/tools/src/lib.rs` now recognizes review-style `APPROVE`/`REJECT`/`BLOCKED` results and records structured `reviewVerdict`, `reviewTarget`, and `reviewRationale` metadata on the `lane.finished` event while preserving existing non-review lane behavior. Regression coverage locks both the normal completion path and a scoped review-lane completion payload. **Original filing below.**

-68. **Internal reinjection/resume paths leak opaque control prose** - dogfooded 2026-04-12. OMX lanes stopping with `Continue from current mode state. [OMX_TMUX_INJECT]` expose internal implementation details instead of operator-meaningful state. The event tells us *that* tmux reinjection happened, but not *why* (retry after failure? resume after idle? manual recovery?), *what state was preserved*, or *what the lane was trying to do*. **Fix shape:** recovery/reinject events should emit structured cause like: `resume_after_stop`, `retry_after_tool_failure`, `tmux_reinject_after_idle`, `manual_recovery` plus preserved state / target lane info. Never leak bare internal markers like `[OMX_TMUX_INJECT]` as the primary summary. Blocker: none. Source: gaebal-gajae dogfood analysis 2026-04-12.
+68. **Internal reinjection/resume paths leak opaque control prose** - **done (verified 2026-04-12):** completed lane persistence in `rust/crates/tools/src/lib.rs` now recognizes `[OMX_TMUX_INJECT]`-style recovery control prose and records structured `recoveryOutcome` metadata on `lane.finished`, including `cause`, optional `targetLane`, and optional `preservedState`. Recovery-style summaries now normalize to a human-meaningful fallback instead of surfacing the raw internal marker as the primary lane result. Regression coverage locks both the tmux-idle reinjection path and the `Continue from current mode state` resume path. Source: gaebal-gajae / Jobdori dogfood 2026-04-12.

 69. **Lane stop summaries have no minimum quality floor** - **done (verified 2026-04-12):** completed lane persistence in `rust/crates/tools/src/lib.rs` now normalizes vague/control-only stop summaries into a contextual fallback that includes the lane target and status, while preserving structured metadata about whether the quality floor fired (`qualityFloorApplied`, `rawSummary`, `reasons`, `wordCount`). Regression coverage locks both the pass-through path for good summaries and the fallback path for mushy summaries like `commit push everyting, keep sweeping $ralph`. **Original filing below.**
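
ROADMAP #68 names four structured causes. The commit stores `cause` as a plain `String`; as a hedged illustration of an alternative, the sketch below models the same four values as an enum. The type `RecoveryCause` and its methods are hypothetical and do not appear in the repository.

```rust
/// Hypothetical typed alternative to a free-form `cause` string.
/// The four variants mirror the cause names listed in ROADMAP #68.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RecoveryCause {
    ResumeAfterStop,
    RetryAfterToolFailure,
    TmuxReinjectAfterIdle,
    ManualRecovery,
}

impl RecoveryCause {
    /// Wire name for the variant, as it would appear in event metadata.
    fn as_str(self) -> &'static str {
        match self {
            RecoveryCause::ResumeAfterStop => "resume_after_stop",
            RecoveryCause::RetryAfterToolFailure => "retry_after_tool_failure",
            RecoveryCause::TmuxReinjectAfterIdle => "tmux_reinject_after_idle",
            RecoveryCause::ManualRecovery => "manual_recovery",
        }
    }

    /// Parses a wire name back into a variant; unknown names yield `None`.
    fn parse(name: &str) -> Option<Self> {
        match name {
            "resume_after_stop" => Some(RecoveryCause::ResumeAfterStop),
            "retry_after_tool_failure" => Some(RecoveryCause::RetryAfterToolFailure),
            "tmux_reinject_after_idle" => Some(RecoveryCause::TmuxReinjectAfterIdle),
            "manual_recovery" => Some(RecoveryCause::ManualRecovery),
            _ => None,
        }
    }
}
```

An enum makes a misspelled cause a compile error inside the crate, at the cost of a code change whenever a new cause is added; a string, as used in the commit, keeps the metadata surface open-ended.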

rust/crates/tools/src/lib.rs

Lines changed: 164 additions & 8 deletions
@@ -3845,6 +3845,8 @@ struct LaneFinishedSummaryData {
     review_rationale: Option<String>,
     #[serde(rename = "selectionOutcome", skip_serializing_if = "Option::is_none")]
     selection_outcome: Option<SelectionOutcome>,
+    #[serde(rename = "recoveryOutcome", skip_serializing_if = "Option::is_none")]
+    recovery_outcome: Option<RecoveryOutcome>,
     #[serde(rename = "artifactProvenance", skip_serializing_if = "Option::is_none")]
     artifact_provenance: Option<ArtifactProvenance>,
     #[serde(rename = "disabledCronIds", skip_serializing_if = "Vec::is_empty")]
@@ -3863,6 +3865,7 @@ struct LaneSummaryAssessment {
     reasons: Vec<String>,
     word_count: usize,
     review_outcome: Option<ReviewLaneOutcome>,
+    recovery_outcome: Option<RecoveryOutcome>,
 }

 #[derive(Debug, Clone)]
@@ -3882,6 +3885,15 @@ struct SelectionOutcome {
     rationale: Option<String>,
 }

+#[derive(Debug, Clone, Serialize)]
+struct RecoveryOutcome {
+    cause: String,
+    #[serde(rename = "targetLane", skip_serializing_if = "Option::is_none")]
+    target_lane: Option<String>,
+    #[serde(rename = "preservedState", skip_serializing_if = "Option::is_none")]
+    preserved_state: Option<String>,
+}
+
 #[derive(Debug, Clone, Serialize)]
 struct ArtifactProvenance {
     #[serde(rename = "sourceLanes", skip_serializing_if = "Vec::is_empty")]
@@ -3906,10 +3918,15 @@ fn build_lane_finished_summary(
     let assessment = assess_lane_summary_quality(raw_summary.unwrap_or_default());
     let detail = match raw_summary {
         Some(summary) if !assessment.apply_quality_floor => Some(compress_summary_text(summary)),
-        Some(summary) => Some(compose_lane_summary_fallback(manifest, Some(summary))),
-        None => Some(compose_lane_summary_fallback(manifest, None)),
+        Some(summary) => Some(compose_lane_summary_fallback(
+            manifest,
+            Some(summary),
+            assessment.recovery_outcome.as_ref(),
+        )),
+        None => Some(compose_lane_summary_fallback(manifest, None, None)),
     };
     let review_outcome = assessment.review_outcome.clone();
+    let recovery_outcome = assessment.recovery_outcome.clone();
     let review_target = review_outcome
         .as_ref()
         .map(|_| manifest.description.trim())
@@ -3930,6 +3947,7 @@ fn build_lane_finished_summary(
             review_target,
             review_rationale: review_outcome.and_then(|outcome| outcome.rationale),
             selection_outcome: extract_selection_outcome(raw_summary.unwrap_or_default()),
+            recovery_outcome,
             artifact_provenance,
             disabled_cron_ids: Vec::new(),
         },
@@ -3950,6 +3968,10 @@ fn assess_lane_summary_quality(summary: &str) -> LaneSummaryAssessment {
     }

     let review_outcome = extract_review_outcome(summary);
+    let recovery_outcome = extract_recovery_outcome(summary);
+    if recovery_outcome.is_some() {
+        reasons.push(String::from("recovery_control_prose"));
+    }

     let control_only = !words.is_empty()
         && words
@@ -3976,10 +3998,15 @@ fn assess_lane_summary_quality(summary: &str) -> LaneSummaryAssessment {
         reasons,
         word_count,
         review_outcome,
+        recovery_outcome,
     }
 }

-fn compose_lane_summary_fallback(manifest: &AgentOutput, raw_summary: Option<&str>) -> String {
+fn compose_lane_summary_fallback(
+    manifest: &AgentOutput,
+    raw_summary: Option<&str>,
+    recovery_outcome: Option<&RecoveryOutcome>,
+) -> String {
     let target = manifest.description.trim();
     let base = format!(
         "Completed lane `{}` for target: {}. Status: completed.",
@@ -3990,6 +4017,25 @@ fn compose_lane_summary_fallback(manifest: &AgentOutput, raw_summary: Option<&st
             target
         }
     );
+    if let Some(outcome) = recovery_outcome {
+        let mut detail = format!(
+            "{base} Recovery handoff observed via tmux reinjection (cause: `{}`).",
+            outcome.cause
+        );
+        if let Some(target_lane) = &outcome.target_lane {
+            let _ = std::fmt::Write::write_fmt(
+                &mut detail,
+                format_args!(" Target lane: `{target_lane}`."),
+            );
+        }
+        if let Some(preserved_state) = &outcome.preserved_state {
+            let _ = std::fmt::Write::write_fmt(
+                &mut detail,
+                format_args!(" Preserved state: {preserved_state}."),
+            );
+        }
+        return detail;
+    }
     match raw_summary {
         Some(summary) => format!(
             "{base} Original stop summary was too vague to keep as the lane result: \"{}\".",
@@ -4086,6 +4132,59 @@ fn extract_selection_outcome(summary: &str) -> Option<SelectionOutcome> {
     })
 }

+fn extract_recovery_outcome(summary: &str) -> Option<RecoveryOutcome> {
+    let trimmed = summary.trim();
+    if trimmed.is_empty() {
+        return None;
+    }
+
+    let lowered = trimmed.to_ascii_lowercase();
+    let has_tmux_inject_marker = lowered.contains("omx_tmux_inject");
+    let has_recovery_phrase = lowered.contains("continue from current mode state")
+        || (lowered.starts_with("team ") && lowered.contains(" next:"));
+    if !has_tmux_inject_marker && !has_recovery_phrase {
+        return None;
+    }
+
+    let cause = if lowered.contains("current mode state") {
+        "resume_after_stop"
+    } else if lowered.contains("tool failure") {
+        "retry_after_tool_failure"
+    } else if lowered.contains("worker panes stalled")
+        || lowered.contains("no progress")
+        || lowered.contains("leader stale")
+        || lowered.contains("all workers idle")
+        || lowered.contains("all 1 worker idle")
+        || lowered.contains("pane(s) active")
+    {
+        "tmux_reinject_after_idle"
+    } else {
+        "manual_recovery"
+    };
+
+    let target_lane = trimmed.lines().map(str::trim).find_map(|line| {
+        let lower = line.to_ascii_lowercase();
+        if !lower.starts_with("team ") {
+            return None;
+        }
+        line[5..]
+            .split_once(':')
+            .map(|(name, _)| name.trim())
+            .filter(|name| !name.is_empty())
+            .map(str::to_string)
+    });
+
+    let preserved_state = lowered
+        .contains("current mode state")
+        .then(|| String::from("current mode state"));
+
+    Some(RecoveryOutcome {
+        cause: cause.to_string(),
+        target_lane,
+        preserved_state,
+    })
+}
+
 fn extract_roadmap_items(line: &str) -> Vec<String> {
     let mut items = Vec::new();
     let mut chars = line.chars().peekable();
@@ -6028,11 +6127,11 @@ mod tests {

     use super::{
         agent_permission_policy, allowed_tools_for_subagent, classify_lane_failure,
-        derive_agent_state, execute_agent_with_spawn, execute_tool, final_assistant_text,
-        global_cron_registry, maybe_commit_provenance, mvp_tool_specs, permission_mode_from_plugin,
-        persist_agent_terminal_state, push_output_block, run_task_packet, AgentInput, AgentJob,
-        GlobalToolRegistry, LaneEventName, LaneFailureClass, ProviderRuntimeClient,
-        SubagentToolExecutor,
+        derive_agent_state, execute_agent_with_spawn, execute_tool, extract_recovery_outcome,
+        final_assistant_text, global_cron_registry, maybe_commit_provenance, mvp_tool_specs,
+        permission_mode_from_plugin, persist_agent_terminal_state, push_output_block,
+        run_task_packet, AgentInput, AgentJob, GlobalToolRegistry, LaneEventName, LaneFailureClass,
+        ProviderRuntimeClient, SubagentToolExecutor,
     };
     use api::OutputContentBlock;
     use runtime::ProviderFallbackConfig;
@@ -7856,6 +7955,54 @@ mod tests {
             "control_only"
         );

+        let recovery = execute_agent_with_spawn(
+            AgentInput {
+                description: "Recover the stalled audit lane".to_string(),
+                prompt: "Normalize OMX reinjection control prose".to_string(),
+                subagent_type: Some("Explore".to_string()),
+                name: Some("recovery-lane".to_string()),
+                model: None,
+            },
+            |job| {
+                persist_agent_terminal_state(
+                    &job.manifest,
+                    "completed",
+                    Some(
+                        "Team read-only-audit-only-for-roadm: worker panes stalled, no progress 2m30s. Next: omx team status read-only-audit-only-for-roadm; read worker messages; unblock/reassign or shutdown. [OMX_TMUX_INJECT]",
+                    ),
+                    None,
+                )
+            },
+        )
+        .expect("recovery agent should succeed");
+
+        let recovery_manifest = std::fs::read_to_string(&recovery.manifest_file)
+            .expect("recovery manifest should exist");
+        let recovery_manifest_json: serde_json::Value =
+            serde_json::from_str(&recovery_manifest).expect("recovery manifest json");
+        let recovery_detail = recovery_manifest_json["laneEvents"][1]["detail"]
+            .as_str()
+            .expect("recovery detail");
+        assert!(recovery_detail.contains("Recovery handoff observed via tmux reinjection"));
+        assert!(recovery_detail.contains("read-only-audit-only-for-roadm"));
+        assert!(!recovery_detail.contains("OMX_TMUX_INJECT"));
+        assert_eq!(
+            recovery_manifest_json["laneEvents"][1]["data"]["recoveryOutcome"]["cause"],
+            "tmux_reinject_after_idle"
+        );
+        assert_eq!(
+            recovery_manifest_json["laneEvents"][1]["data"]["recoveryOutcome"]["targetLane"],
+            "read-only-audit-only-for-roadm"
+        );
+        assert_eq!(
+            recovery_manifest_json["laneEvents"][1]["data"]["qualityFloorApplied"],
+            true
+        );
+        assert_eq!(
+            recovery_manifest_json["laneEvents"][1]["data"]["reasons"][0],
+            "recovery_control_prose"
+        );
+
         let review = execute_agent_with_spawn(
             AgentInput {
                 description: "Review commit 1234abcd for ROADMAP #67".to_string(),
@@ -8044,6 +8191,15 @@ mod tests {
             .expect("cron should still exist");
         assert!(!disabled_entry.enabled);

+        let resume_outcome =
+            extract_recovery_outcome("Continue from current mode state. [OMX_TMUX_INJECT]")
+                .expect("resume outcome should be detected");
+        assert_eq!(resume_outcome.cause, "resume_after_stop");
+        assert_eq!(
+            resume_outcome.preserved_state.as_deref(),
+            Some("current mode state")
+        );
+
         let spawn_error = execute_agent_with_spawn(
             AgentInput {
                 description: "Spawn error task".to_string(),
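
For readers who want to try the target-lane lookup outside the crate, the following is a standalone sketch of the same idea: find the first line that starts with `Team <name>:` (case-insensitively) and return `<name>`. The function name `extract_target_lane` is an assumption for this example, and the sketch uses `str::get` for the prefix check rather than the indexing shown in the diff; it is meant to behave equivalently on the inputs exercised by the regression tests, not to replace the committed code.

```rust
/// Illustrative re-implementation of the target-lane lookup, not the commit's code.
/// Scans each line for a case-insensitive `Team <name>:` prefix and returns `<name>`.
fn extract_target_lane(summary: &str) -> Option<String> {
    summary.lines().map(str::trim).find_map(|line| {
        // `get(..5)` returns None for short lines or a non-boundary index,
        // so the later `line[5..]` slice cannot panic.
        let prefix = line.get(..5)?;
        if !prefix.eq_ignore_ascii_case("team ") {
            return None;
        }
        // Everything between the prefix and the first ':' is the lane name.
        let (name, _rest) = line[5..].split_once(':')?;
        let name = name.trim();
        (!name.is_empty()).then(|| name.to_string())
    })
}
```

With the summary used in the regression test above, this returns `Some("read-only-audit-only-for-roadm")`; for the `Continue from current mode state. [OMX_TMUX_INJECT]` resume message it returns `None`, which matches `targetLane` being optional in the metadata.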
