Stop Recording Timeout: Frames Upload After Stop, Double /end Call
Metadata
- Date: 2026-05-04
- Status: fixed
- Severity: high
- Related issue/ticket: N/A
- Owner: N/A
About
Overview: Three related defects appear when the user presses the stop button during a glasses-mode recording session with a large frame backlog:
- Frames keep uploading after stop is pressed: the flush() call in stopCapture() uploads every queued frame before returning. With a 30-second default sweep interval and a 60-second flushTimeoutMs, frames can remain in flight for up to 60 seconds after the stop button is pressed, so the user sees upload activity continue long after stop.
- "Recording transition timed out" shown in UI: the orchestrator races stopCapture() against a 30-second timeout (TRANSITION_TIMEOUT_MS = 30_000), while flush() defaults to 60 seconds (DEFAULT_FLUSH_TIMEOUT_MS = 60_000). The timeout fires before flush() returns, so Promise.race rejects with "Recording transition timed out" while stopCapture() keeps running detached in the background.
- /sessions/{id}/end called twice: the first call comes from the detached stopCapture() that keeps running after the orchestrator's race times out. It eventually completes its flush(), sees that currentSessionId (a module-level variable) is still set, and calls /end with the real session ID (logged as "session_ended" with sessionId). Meanwhile, the timeout causes stopRecordingFromControl() to reject and reset state to "recording" (with lastError). The user, or the orchestrator's retry logic, can then invoke stopRecordingFromControl() again, which calls stopCapture() again. At that point currentSessionId is already null (cleared by the first stopCapture() run), so session_ended is logged with null as the session ID and the second /end POST uses a null URL segment. The log records this as the second /end occurrence.
Secondary symptom:
- WARN [PersistentUploadQueue] writeManifest failed: "Unable to create file or directory: directory already exists or could not be created". This is a TOCTOU race in writeManifest(): the if (!dir.exists) check passes, then a concurrent call creates the directory, and dir.create() throws. This is a pre-existing, non-blocking issue surfaced under concurrent upload pressure.
Technical Questions:
- Are we making assumptions? No. Tracing the module-level state in capture-session.ts and the Promise.race in recording-control-orchestrator.ts confirms all three root causes.
- How old is this bug? The 30s timeout vs. 60s flush mismatch is structural and has existed since the TRANSITION_TIMEOUT_MS and DEFAULT_FLUSH_TIMEOUT_MS constants were set independently.
- Is there anything obvious we might have missed? Yes. After the race times out, the orchestrator has no mechanism to signal stopCapture() to abort. stopCapture() runs to completion against stale module state, including clearing currentSessionId after calling /end. If the user retries stop, the second call hits null state.
- Are there specific system states required to reproduce it? Yes: a glasses-mode session with enough queued frames that flush() exceeds 30 seconds. One audio chunk upload is also in the queue, consistent with the observed windowIndex=3 upload.
Resources:
- main/app/lib/capture-session.ts: stopCapture(), cleanupSession(), flush() interaction
- main/app/lib/recording-control-orchestrator.ts: stopRecordingFromControl(), TRANSITION_TIMEOUT_MS
- main/app/lib/persistent-upload-queue.ts: flush(), DEFAULT_FLUSH_TIMEOUT_MS, writeManifest()
- main/app/__tests__/recording-control-orchestrator.test.ts: existing timeout tests
- main/app/__tests__/capture-session.test.ts: existing stop tests
Steps to cause failure
flowchart LR
User([User presses Stop]) -->|stopRecordingFromControl| Orchestrator[Orchestrator]
Orchestrator -->|Promise.race stopCapture vs 30s timeout| Race{Race}
Race -->|flush takes 60s| Timeout([30s timeout fires])
Race -->|stopCapture continues detached| StopCapture[stopCapture running]
Timeout -->|rejects orchestrator promise| OrchestratorError([UI shows timed out])
StopCapture -->|flush completes| EndCall1([POST /end with sessionId])
OrchestratorError -->|user or retry stops again| StopCapture2[stopCapture called again]
StopCapture2 -->|currentSessionId is null| EndCall2([POST /end with null])
System
flowchart TD
Orchestrator[recording-control-orchestrator.ts] -->|stopCapture| CaptureSession[capture-session.ts]
CaptureSession -->|flush| UploadQueue[PersistentUploadQueue]
UploadQueue -->|uploadFn per frame| API[POST /frames]
CaptureSession -->|after flush| EndAPI[POST /sessions/id/end]
Orchestrator -->|TRANSITION_TIMEOUT_MS=30s| TimeoutRace[Promise.race]
TimeoutRace -.->|fires before flush| DetachedRun[stopCapture runs detached]
DetachedRun --> EndAPI
Reproduction Details
Reproduction: a test that verifies the orchestrator's timeout fires while stopCapture() is still running flush, causing /end to be called without a valid session context, and that the timeout message is shown.
Reproduction tests:
- main/app/__tests__/capture-session.test.ts: "stopCapture: does not call /end twice when called concurrently after timeout"
- main/app/__tests__/recording-control-orchestrator.test.ts: "stop transition timeout does not leave stopCapture running detached"
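The detached-run behavior can be illustrated outside the app with a scaled-down sketch. This is not the project's test code: fakeStopCapture, raceStop, and demo are hypothetical stand-ins, and the 20 ms and 80 ms durations only mimic the 30 s vs. 60 s ratio. It shows that a promise losing Promise.race is not cancelled, so the simulated /end call still happens after the caller has already seen the timeout error.

```typescript
// Scaled-down stand-ins for the real 30 s / 60 s constants (assumed values).
const TRANSITION_TIMEOUT_MS = 20;
const FLUSH_DURATION_MS = 80;

// Records every simulated POST /sessions/{id}/end.
const endCalls: string[] = [];

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for stopCapture(): flush first, then call /end.
async function fakeStopCapture(sessionId: string): Promise<void> {
  await sleep(FLUSH_DURATION_MS); // simulated flush()
  endCalls.push(sessionId); // simulated POST /end
}

// Mirrors the orchestrator's race: the losing promise keeps running.
async function raceStop(sessionId: string): Promise<string> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error("Recording transition timed out")),
      TRANSITION_TIMEOUT_MS,
    );
  });
  try {
    await Promise.race([fakeStopCapture(sessionId), timeout]);
    return "stopped";
  } catch (err) {
    return (err as Error).message;
  } finally {
    clearTimeout(timer);
  }
}

async function demo(): Promise<{
  result: string;
  endCallsAtTimeout: number;
  endCallsLater: number;
}> {
  const result = await raceStop("session-1");
  const endCallsAtTimeout = endCalls.length; // nothing sent yet
  await sleep(FLUSH_DURATION_MS); // let the detached run finish
  return { result, endCallsAtTimeout, endCallsLater: endCalls.length };
}
```

In this sketch the caller receives the timeout message while no /end call has been made, and the call appears only later, once the detached flush completes.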
Notes for PR
Root cause 1 — Timeout shorter than flush timeout: TRANSITION_TIMEOUT_MS (30s) is shorter than DEFAULT_FLUSH_TIMEOUT_MS (60s). When flush() takes longer than 30s the orchestrator times out but stopCapture() continues running detached.
Fix: align the timeouts. Either raise TRANSITION_TIMEOUT_MS to exceed the flush timeout (e.g. 90s), or pass the orchestrator's remaining time budget into flush() so the queue respects the same deadline. The cleanest fix is to make stopCapture() accept an AbortSignal or a deadline, so the orchestrator can cancel the in-flight flush on timeout.
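As a rough illustration of the "shared deadline" option, the sketch below passes an absolute deadline into a flush loop. The names flushWithDeadline, flushDeadline, and END_CALL_RESERVE_MS are hypothetical and do not come from persistent-upload-queue.ts; the real flush() signature may differ. The clock is injectable so the behavior can be checked without real timers.

```typescript
interface FlushResult {
  uploaded: number;
  remaining: number;
}

// Hypothetical flush: stops starting new uploads once the deadline has passed.
async function flushWithDeadline(
  queue: string[],
  upload: (item: string) => Promise<void>,
  deadline: number,
  now: () => number = Date.now,
): Promise<FlushResult> {
  let uploaded = 0;
  while (queue.length > 0 && now() < deadline) {
    const item = queue[0];
    if (item === undefined) break;
    await upload(item);
    queue.shift(); // drop the item only after its upload succeeded
    uploaded += 1;
  }
  return { uploaded, remaining: queue.length };
}

const TRANSITION_TIMEOUT_MS = 30_000;
// Assumed headroom kept back for the POST /end call after the flush.
const END_CALL_RESERVE_MS = 5_000;

// The orchestrator derives one budget and hands flush a slice of it.
function flushDeadline(startedAt: number): number {
  return startedAt + TRANSITION_TIMEOUT_MS - END_CALL_RESERVE_MS;
}
```

A deadline checked between uploads does not interrupt an upload that is already running, so a slow upload can still overrun the budget. Covering that case needs a cancellation signal as well.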
Root cause 2 — No abort/cancellation signal to stopCapture: Once Promise.race rejects, the orchestrator discards the stopCapture() promise. stopCapture() is not aware the caller gave up. It continues to run and mutate the shared module-level state (currentSessionId, uploadQueue, activeRecordingDevice).
Fix: accept an AbortSignal in stopCapture(). When the signal fires, the flush should stop adding new upload attempts and the session end should be skipped (the orchestrator can call /end itself after signaling abort).
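A minimal sketch of the AbortSignal approach follows, assuming a simplified stop routine that receives its queue and callbacks as arguments. stopCaptureWithSignal and StopResult are hypothetical names; the real stopCapture() works on module-level state and may be structured differently. AbortController and AbortSignal are the standard web and Node.js APIs.

```typescript
interface StopResult {
  uploaded: number;
  endCalled: boolean;
}

// Hypothetical variant of stopCapture(): dependencies are passed in so the
// sketch stays self-contained.
async function stopCaptureWithSignal(
  queue: string[],
  upload: (item: string) => Promise<void>,
  endSession: () => Promise<void>,
  signal: AbortSignal,
): Promise<StopResult> {
  let uploaded = 0;
  // Stop starting new upload attempts as soon as the caller aborts.
  while (queue.length > 0 && !signal.aborted) {
    const item = queue[0];
    if (item === undefined) break;
    await upload(item);
    queue.shift();
    uploaded += 1;
  }
  if (signal.aborted) {
    // The caller gave up: skip /end and leave it to the orchestrator.
    return { uploaded, endCalled: false };
  }
  await endSession();
  return { uploaded, endCalled: true };
}
```

On the orchestrator side, the timeout branch would call controller.abort() before rejecting, so the stop routine stops mutating shared state once the caller has moved on.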
Root cause 3 — Double /end from concurrent stop calls: After the timeout, the state machine reverts to "recording" (in the orchestrator). A second stop press calls stopCapture() again. At this point currentSessionId may already be null (cleared by the first detached stopCapture()) but it may also still be set if the first call hasn't completed yet, causing a second /end call.
Fix: guard stopCapture() against re-entrant calls with a module-level stopping flag. If a stop is already in progress, return the existing promise (similar to how processNext guards processing).
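The guard can be sketched as below. The names stopCapturePromise and doStopCapture match the audit log, but the bodies are illustrative assumptions: the flush is simulated with a short timer and the /end call with an array push.

```typescript
// Module-level state, as in capture-session.ts (values here are illustrative).
let currentSessionId: string | null = "session-1";
let stopCapturePromise: Promise<void> | null = null;

// Records every simulated POST /sessions/{id}/end.
const endCalls: (string | null)[] = [];

// The actual stop work: flush, call /end, then clear the session.
async function doStopCapture(): Promise<void> {
  await new Promise<void>((resolve) => setTimeout(resolve, 10)); // simulated flush()
  endCalls.push(currentSessionId); // simulated POST /end
  currentSessionId = null;
}

// Re-entrancy guard: concurrent callers share the in-progress promise.
function stopCapture(): Promise<void> {
  if (stopCapturePromise !== null) return stopCapturePromise;
  const run = doStopCapture().finally(() => {
    stopCapturePromise = null; // allow a later session to stop normally
  });
  stopCapturePromise = run;
  return run;
}
```

With this shape, a second stop press during a slow flush awaits the same run instead of starting a new one, so /end is sent once with the real session ID.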
Root cause 4 (minor) — writeManifest TOCTOU race: writeManifest checks dir.exists, and if false calls dir.create(). Under concurrent parkItem or flush calls (which are serialized by manifestLock), this can still race with the filesystem if the lock isn't held during the exists+create sequence, or if the expo-file-system exists getter returns a stale value. This is cosmetic — the upload still proceeds from in-memory state — but surfaces noisy warnings.
Fix: wrap the create() in a try/catch that ignores "already exists" errors rather than relying on the exists guard.
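One way to express this is sketched below against a minimal directory interface. DirectoryLike and ensureDirectory are hypothetical names, and the shape only mirrors the exists and create() members mentioned above rather than the full expo-file-system API. Because the observed message ("directory already exists or could not be created") does not distinguish the two cases, this variant re-checks exists after the failure instead of matching on the error text. That is an assumption, not necessarily what the applied fix does.

```typescript
// Hypothetical minimal shape of a directory handle.
interface DirectoryLike {
  readonly exists: boolean;
  create(): void;
}

// Create the directory, tolerating a concurrent writer that got there first.
function ensureDirectory(dir: DirectoryLike): void {
  try {
    dir.create();
  } catch (err) {
    // If the directory is present now, another caller won the race: fine.
    if (dir.exists) return;
    // Otherwise the failure is real and must not be swallowed.
    throw err;
  }
}
```

Attempting the create and handling the failure removes the window between the check and the create. The re-check still relies on exists returning a fresh value.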
Audit Log
| ID | Action | Note | Context |
|---|---|---|---|
| 1 | Create audit log | Initialize bug investigation | Stop recording shows timeout + frames keep uploading |
| 2 | Read capture-session.ts | Traced stopCapture() flow: flush then /end | lib/capture-session.ts |
| 3 | Read recording-control-orchestrator.ts | Found 30s TRANSITION_TIMEOUT_MS races against stopCapture | lib/recording-control-orchestrator.ts |
| 4 | Read persistent-upload-queue.ts | DEFAULT_FLUSH_TIMEOUT_MS = 60s, writeManifest TOCTOU | lib/persistent-upload-queue.ts |
| 5 | Read existing tests | Verified existing test coverage for timeout, stop paths | __tests__/capture-session.test.ts, __tests__/recording-control-orchestrator.test.ts |
| 6 | Identify root causes | Three root causes identified from code trace | See Notes for PR section |
| 7 | Write failing reproduction test | Test "concurrent stopCapture calls POST /end exactly once" times out/fails before fix | __tests__/capture-session.test.ts |
| 8 | Confirm test fails without fix | FAIL (timeout) confirmed with stopCapture as plain async function | capture-session.ts revert confirmed |
| 9 | Apply fix — re-entrancy guard | Added stopCapturePromise module-level guard; extracted doStopCapture | lib/capture-session.ts |
| 10 | Apply fix — writeManifest TOCTOU | Inner try/catch on dir.create() for already-exists errors | lib/persistent-upload-queue.ts |
| 11 | Confirm tests pass | 57/57 pass in capture-session.test.ts + recording-control-orchestrator.test.ts | All green |
| 12 | Causality check | Reverted fix → test fails; restored fix → test passes | Causality confirmed |
| 13 | Code review gate | Plan alignment verified (phone-only-recording.md), DRY/YAGNI pass, minor findings noted | See code review section |
Verification
- [x] Reproduced failure before fix
- [x] Reproduction test fails before fix
- [x] Root cause identified with evidence
- [x] Fix applied at source (no workaround-only patch)
- [x] Reproduction test passes after fix
- [x] Reproduction path now passes
- [x] Regression test added/updated
- [x] Verified no duplicate solved-bug log exists for same root cause