Stop Recording Timeout: Frames Upload After Stop, Double /end Call

Metadata

Date: 2026-05-04
Status: fixed
Severity: high
Related issue/ticket: N/A
Owner: N/A

About

Overview: Three related defects manifest when the user presses the stop button during a glasses-mode recording session with a large frame backlog:

1. Frames keep uploading after stop is pressed: the flush() call in stopCapture() uploads every queued frame before returning. With a 30-second default sweep interval and a 60-second flushTimeoutMs, frames can remain in flight for up to 60 seconds after the stop button is pressed, so the user sees upload activity continuing long after stop.
2. "Recording transition timed out" shown in UI: the orchestrator races stopCapture() against a 30-second timeout (TRANSITION_TIMEOUT_MS = 30_000), while flush() defaults to 60 seconds (DEFAULT_FLUSH_TIMEOUT_MS = 60_000). The timeout fires before flush() returns, so Promise.race rejects with "Recording transition timed out" while stopCapture() continues running detached in the background.
3. /sessions/{id}/end called twice: the first call comes from the detached stopCapture() that keeps running after the orchestrator's race times out. It eventually completes its flush(), sees that currentSessionId (a module-level variable) is still set, and calls /end with the real session ID (logged as "session_ended" with sessionId). Meanwhile, the timeout causes stopRecordingFromControl() to reject and reset state to "recording" (with lastError). The user, or the orchestrator's retry logic, can then invoke stopRecordingFromControl() again, which calls stopCapture() again. By then currentSessionId is already null (cleared by the first stopCapture() run), so session_ended is logged with null as the session ID and the second /end POST is sent with null in the URL path. This is the second /end occurrence seen in the log.
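
The detached-run behaviour in defects 2 and 3 follows from how Promise.race works: it settles with the first promise to settle but does not cancel the others. The sketch below illustrates this with scaled-down timings (30 ms and 60 ms instead of 30 s and 60 s). stopCaptureSketch, timeout, and demo are hypothetical stand-ins, not the real functions in capture-session.ts or recording-control-orchestrator.ts.

```typescript
// Scaled-down stand-ins for the real constants (assumption: ms instead of s).
const TRANSITION_TIMEOUT_MS = 30;
const FLUSH_MS = 60;

const events: string[] = [];

// Hypothetical stand-in for stopCapture(): flush, then POST /end.
async function stopCaptureSketch(): Promise<void> {
  await new Promise<void>((resolve) => setTimeout(resolve, FLUSH_MS));
  // Still runs even though the race below has already rejected.
  events.push("POST /end");
}

// Rejects after `ms`, like the orchestrator's transition timeout.
function timeout(ms: number): Promise<never> {
  return new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("Recording transition timed out")), ms),
  );
}

async function demo(): Promise<string[]> {
  try {
    await Promise.race([stopCaptureSketch(), timeout(TRANSITION_TIMEOUT_MS)]);
  } catch (err) {
    events.push((err as Error).message);
  }
  // Wait long enough for the detached stopCaptureSketch() to finish.
  await new Promise<void>((resolve) => setTimeout(resolve, FLUSH_MS * 2));
  return events;
}
```

Calling demo() records the timeout message first and the /end call second, which matches the order observed in the incident: the UI error appears while the stop is still in progress.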

Secondary symptom: WARN [PersistentUploadQueue] writeManifest failed: "Unable to create file or directory: directory already exists or could not be created". This is a TOCTOU (time-of-check to time-of-use) race in writeManifest(): the if (!dir.exists) check passes, a concurrent call then creates the directory, and dir.create() throws. It is a pre-existing, non-blocking issue surfaced under concurrent upload pressure.

Technical Questions:
- Are we making assumptions? No. Tracing the module-level state in capture-session.ts and the Promise.race in recording-control-orchestrator.ts confirms all three root causes.
- How old is this bug? The 30s timeout vs. 60s flush mismatch is structural and has existed since the TRANSITION_TIMEOUT_MS and DEFAULT_FLUSH_TIMEOUT_MS constants were set independently.
- Is there anything obvious we might have missed? Yes. After the race times out, the orchestrator has no mechanism to signal stopCapture() to abort. stopCapture() runs to completion against stale module state, including clearing currentSessionId after calling /end. If the user retries stop, the second call hits null state.
- Are there specific system states required to reproduce it? Yes. A glasses-mode session with enough queued frames that flush() exceeds 30 seconds. One audio chunk upload is also in the queue, consistent with the observed windowIndex=3 upload.

Resources:
- main/app/lib/capture-session.ts: stopCapture(), cleanupSession(), flush() interaction
- main/app/lib/recording-control-orchestrator.ts: stopRecordingFromControl(), TRANSITION_TIMEOUT_MS
- main/app/lib/persistent-upload-queue.ts: flush(), DEFAULT_FLUSH_TIMEOUT_MS, writeManifest()
- main/app/__tests__/recording-control-orchestrator.test.ts: existing timeout tests
- main/app/__tests__/capture-session.test.ts: existing stop tests

Steps to cause failure

flowchart LR
    User([User presses Stop]) -->|stopRecordingFromControl| Orchestrator[Orchestrator]
    Orchestrator -->|Promise.race stopCapture vs 30s timeout| Race{Race}
    Race -->|flush takes 60s| Timeout([30s timeout fires])
    Race -->|stopCapture continues detached| StopCapture[stopCapture running]
    Timeout -->|rejects orchestrator promise| OrchestratorError([UI shows timed out])
    StopCapture -->|flush completes| EndCall1([POST /end with sessionId])
    OrchestratorError -->|user or retry stops again| StopCapture2[stopCapture called again]
    StopCapture2 -->|currentSessionId is null| EndCall2([POST /end with null])

System

flowchart TD
    Orchestrator[recording-control-orchestrator.ts] -->|stopCapture| CaptureSession[capture-session.ts]
    CaptureSession -->|flush| UploadQueue[PersistentUploadQueue]
    UploadQueue -->|uploadFn per frame| API[POST /frames]
    CaptureSession -->|after flush| EndAPI[POST /sessions/id/end]
    Orchestrator -->|TRANSITION_TIMEOUT_MS=30s| TimeoutRace[Promise.race]
    TimeoutRace -.->|fires before flush| DetachedRun[stopCapture runs detached]
    DetachedRun --> EndAPI

Reproduction Details

Reproduction: a test that verifies the orchestrator's timeout fires while stopCapture() is still running flush, causing /end to be called without a valid session context, and that the timeout message is shown.

Reproduction tests:
- main/app/__tests__/capture-session.test.ts: "stopCapture: does not call /end twice when called concurrently after timeout"
- main/app/__tests__/recording-control-orchestrator.test.ts: "stop transition timeout does not leave stopCapture running detached"

Notes for PR

Root cause 1 — Timeout shorter than flush timeout: TRANSITION_TIMEOUT_MS (30s) is shorter than DEFAULT_FLUSH_TIMEOUT_MS (60s). When flush() takes longer than 30s the orchestrator times out but stopCapture() continues running detached.

Fix: align the timeouts. Either raise TRANSITION_TIMEOUT_MS above the flush timeout (e.g. 90s), or pass the orchestrator's remaining time budget into flush() so the queue respects the same deadline. The cleanest fix is to make stopCapture() accept an AbortSignal or a deadline, so the orchestrator can cancel the in-flight flush on timeout.
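
As a rough illustration of the "remaining time budget" option, the helper below computes how long flush() may run so that the whole stop transition still fits inside the orchestrator's timeout. flushBudgetMs and the reserveForEndMs parameter are hypothetical names introduced for this sketch. The idea of reserving time for the /end call is an assumption, not something the current code does.

```typescript
// Hypothetical helper: time left for flush() inside the transition timeout.
// transitionTimeoutMs - the orchestrator's total budget (e.g. TRANSITION_TIMEOUT_MS)
// elapsedMs           - time already spent since stop was requested
// reserveForEndMs     - time held back for the POST /end call (assumption)
function flushBudgetMs(
  transitionTimeoutMs: number,
  elapsedMs: number,
  reserveForEndMs: number,
): number {
  // Never return a negative budget; 0 means "skip flushing, go straight to /end".
  return Math.max(0, transitionTimeoutMs - elapsedMs - reserveForEndMs);
}

// Example: 30 s budget, 2 s already spent, 5 s reserved for /end.
const budget = flushBudgetMs(30_000, 2_000, 5_000); // 23_000
```

The result would then be passed as the flush timeout instead of the independent 60-second default, so the two limits can no longer drift apart.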

Root cause 2 — No abort/cancellation signal to stopCapture: Once Promise.race rejects, the orchestrator discards the stopCapture() promise. stopCapture() is not aware the caller gave up. It continues to run and mutate the shared module-level state (currentSessionId, uploadQueue, activeRecordingDevice).

Fix: accept an AbortSignal in stopCapture(). When the signal fires, the flush should stop adding new upload attempts and the session end should be skipped (the orchestrator can call /end itself after signaling abort).
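
A minimal sketch of what this could look like, assuming the flush loop uploads items one at a time. flushWithSignal, stopCaptureWithSignal, and the string queue are hypothetical simplifications. The real PersistentUploadQueue and stopCapture() signatures may differ.

```typescript
type UploadFn = (item: string) => Promise<void>;

// Hypothetical flush loop: stops starting new uploads once the signal is aborted.
async function flushWithSignal(
  queue: string[],
  upload: UploadFn,
  signal: AbortSignal,
): Promise<{ uploaded: number; remaining: number }> {
  let uploaded = 0;
  while (queue.length > 0 && !signal.aborted) {
    await upload(queue[0]);
    queue.shift(); // remove the item only after its upload succeeded
    uploaded += 1;
  }
  return { uploaded, remaining: queue.length };
}

// Hypothetical stop routine: skips /end when the caller has given up,
// leaving the orchestrator responsible for ending the session.
async function stopCaptureWithSignal(
  queue: string[],
  upload: UploadFn,
  endSession: () => Promise<void>,
  signal: AbortSignal,
): Promise<"ended" | "aborted"> {
  await flushWithSignal(queue, upload, signal);
  if (signal.aborted) return "aborted";
  await endSession();
  return "ended";
}
```

On timeout the orchestrator would call abort() on its AbortController. An upload that is already in flight still completes, but no further uploads start and the detached run no longer calls /end on its own.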

Root cause 3 — Double /end from concurrent stop calls: After the timeout, the state machine reverts to "recording" (in the orchestrator). A second stop press calls stopCapture() again. At this point currentSessionId may already be null (cleared by the first detached stopCapture()) but it may also still be set if the first call hasn't completed yet, causing a second /end call.

Fix: guard stopCapture() against re-entrant calls with a module-level stopping flag. If a stop is already in progress, return the existing promise (similar to how processNext guards processing).
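
The guard can be sketched as follows. The names stopCapturePromise and doStopCapture match the ones recorded in the audit log, but the bodies here are simplified stand-ins: the timer replaces flush(), the endCalls array replaces the POST to /sessions/{id}/end, and "session-123" is a made-up session ID.

```typescript
let stopCapturePromise: Promise<void> | null = null;
let currentSessionId: string | null = "session-123"; // hypothetical value
const endCalls: Array<string | null> = [];

// Simplified stand-in for the real stop logic.
async function doStopCapture(): Promise<void> {
  await new Promise<void>((resolve) => setTimeout(resolve, 20)); // "flush()"
  endCalls.push(currentSessionId); // "POST /sessions/{id}/end"
  currentSessionId = null;
}

// Re-entrancy guard: concurrent callers share the single in-progress stop.
function stopCapture(): Promise<void> {
  if (stopCapturePromise) return stopCapturePromise;
  stopCapturePromise = doStopCapture().finally(() => {
    stopCapturePromise = null; // allow a later, separate stop
  });
  return stopCapturePromise;
}
```

With this shape, a second stop press during a slow flush awaits the same promise instead of starting a second run, so /end is posted once with the real session ID. A call made after the first run has fully completed would still start a new run, so the real code presumably also needs a check for "no active session".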

Root cause 4 (minor) — writeManifest TOCTOU race: writeManifest checks dir.exists, and if false calls dir.create(). Under concurrent parkItem or flush calls (which are serialized by manifestLock), this can still race with the filesystem if the lock isn't held during the exists+create sequence, or if the expo-file-system exists getter returns a stale value. This is cosmetic — the upload still proceeds from in-memory state — but surfaces noisy warnings.

Fix: wrap the create() in a try/catch that ignores "already exists" errors rather than relying on the exists guard.
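
One way to sketch this without depending on the exact error text is to re-check existence after a failed create(). This is a variation on the fix described above, chosen because the observed message ("directory already exists or could not be created") does not distinguish the two cases. DirLike and ensureDir are hypothetical names, and DirLike models only the exists and create() members mentioned in this log.

```typescript
// Hypothetical minimal shape of the directory object used by writeManifest().
interface DirLike {
  readonly exists: boolean;
  create(): void;
}

// Creates the directory if needed and tolerates losing a race with another creator.
function ensureDir(dir: DirLike): void {
  if (dir.exists) return;
  try {
    dir.create();
  } catch (err) {
    // If the directory exists now, someone else created it first: not an error.
    // Anything else (permissions, disk full, ...) is rethrown.
    if (!dir.exists) throw err;
  }
}
```

The exists check at the top is only a fast path. Correctness comes from the catch block, which closes the window between the check and the create.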

Audit Log

ID	Action	Note	Context
1	Create audit log	Initialize bug investigation	Stop recording shows timeout + frames keep uploading
2	Read capture-session.ts	Traced `stopCapture()` flow: flush then `/end`	`lib/capture-session.ts`
3	Read recording-control-orchestrator.ts	Found 30s TRANSITION_TIMEOUT_MS races against stopCapture	`lib/recording-control-orchestrator.ts`
4	Read persistent-upload-queue.ts	DEFAULT_FLUSH_TIMEOUT_MS = 60s, writeManifest TOCTOU	`lib/persistent-upload-queue.ts`
5	Read existing tests	Verified existing test coverage for timeout, stop paths	`__tests__/capture-session.test.ts`, `__tests__/recording-control-orchestrator.test.ts`
6	Identify root causes	Three root causes, plus one minor secondary issue, identified from code trace	See Notes for PR section
7	Write failing reproduction test	`concurrent stopCapture calls POST /end exactly once` times out/fails before fix	`__tests__/capture-session.test.ts`
8	Confirm test fails without fix	FAIL (timeout) confirmed with `stopCapture` as plain async function	`capture-session.ts` revert confirmed
9	Apply fix — re-entrancy guard	Added `stopCapturePromise` module-level guard; extracted `doStopCapture`	`lib/capture-session.ts`
10	Apply fix — writeManifest TOCTOU	Inner try/catch on `dir.create()` for already-exists errors	`lib/persistent-upload-queue.ts`
11	Confirm tests pass	57/57 pass in `capture-session.test.ts` + `recording-control-orchestrator.test.ts`	All green
12	Causality check	Reverted fix → test fails; restored fix → test passes	Causality confirmed
13	Code review gate	Plan alignment verified (phone-only-recording.md), DRY/YAGNI pass, minor findings noted	See code review section

Verification

[x] Reproduced failure before fix
[x] Reproduction test fails before fix
[x] Root cause identified with evidence
[x] Fix applied at source (no workaround-only patch)
[x] Reproduction test passes after fix
[x] Reproduction path now passes
[x] Regression test added/updated
[x] Verified no duplicate solved-bug log exists for same root cause