Sessions Audio Upload — Presigned PUT Migration
Plan Metadata
- Plan type: plan
- Parent plan: N/A
- Depends on: #479 (boto3 client S3_CONFIG/S3_UPLOAD_CONFIG in shared layer), which bridges the audio path until this lands
- Status: draft
Status semantics:
- draft: Plan is being created or updated and is not final.
- approved: Plan is approved but not yet applied in code.
- documentation: Code currently exists and matches the plan contract.
System Intent
- What is being built: Replace the synchronous body-bearing POST /sessions/{id}/audio?windowIndex=N handler with a presigned-PUT URL grant plus an S3 ObjectCreated event-driven completion handler (delivered via EventBridge). This removes the 29s API-Gateway-cap failure class for large audio chunks and splits one synchronous handler into two single-responsibility handlers connected by an S3 event source.
- Primary consumer(s): SessionAudioFunction (rewritten, URL minter only); AudioUploadCompleteFunction (new, S3-event-driven bookkeeping + ingest trigger); IngestWindowFunction (unchanged, receives async invoke from completion handler); React Native client capture-session.ts (rewritten audio uploadFn).
- Boundary (black-box scope only): Frame upload path (sessions/frames) is out of scope because its body is always small enough to fit the 29s cap. IngestWindowFunction internals are out of scope; only its async-invoke entry contract matters here.
Motivation
sessions/audio is the only API-GW-fronted Lambda in the codebase that routes a multi-MB request body through Lambda memory. WAV audio chunks (~2 MB per 60s window for the chunked path; tens of MB for the legacy full-session path) routinely take longer than 8s to transit S3 on cell networks. API Gateway hard-caps requests at 29s and cannot be configured higher. Result: the slowest legitimate uploads silently 504, audio chunks are lost, and the user's session has gaps.
Goals:
- Eliminate the 29s API-Gateway-cap failure class for audio uploads at any realistic body size.
- Preserve atomicity semantics from the client queue's POV — each audio item still either fully uploads or parks for retry, no half-states.
- Preserve the exactly-once ingest trigger guarantee under concurrent frame+audio window completion.
- No orphan S3 objects — if it's in S3, it gets registered.
Non-goals:
- Migrating sessions/frames: body is small (~100 KB JPEG), the current path fits well within the 29s cap, and doubling round trips would hurt without gain.
- Migrating memories/video large multipart I/O: tracked separately in #494.
- Repository-wide enforcement of the boto3.client(..., config=...) convention: tracked in #495.
- Backward compatibility with the legacy _handle_legacy_upload full-session path: there are no users today, so the path is deleted entirely.
- Property-based, real-AWS-integration, and load tests: deferred to #496, #497, #498 with explicit resolution triggers.
Architecture
                                        ┌──────────────────────┐
                                        │ SessionAudioFunction │
┌──────────────┐ POST /sessions/{id}    │ (existing, rewritten)│
│  RN client   │ ──────────────────────▶│                      │
│              │ {sizeBytes}            │ Generates presigned  │
│              │ ◀──────────────────────│ PUT URL              │
│              │ {url, s3Key, ...}      └──────────────────────┘
│              │
│              │ PUT <presigned URL>    ┌──────────────────────┐
│              │ ──────────────────────▶│ S3                   │
│              │ audio bytes            │ sessions/{id}/       │
│              │ ◀─────────── 200 OK    │ window_NNN/audio.wav │
└──────────────┘                        └──────────┬───────────┘
                                                   │ ObjectCreated event
                                                   │ (via EventBridge)
                                                   ▼
                                        ┌──────────────────────┐
                                        │ AudioUploadComplete  │
                                        │ Function (NEW)       │
                                        │                      │
                                        │ - parse S3 key       │
                                        │ - DDB ADD            │
                                        │   completedAudio     │
                                        │   Windows            │
                                        │ - claim+trigger      │
                                        │   ingest             │
                                        └──────────┬───────────┘
                                                   │ invoke async (Event)
                                                   ▼
                                        ┌──────────────────────┐
                                        │ IngestWindowFunction │
                                        │ (unchanged)          │
                                        └──────────────────────┘
Three Lambdas, one new. Two AWS surfaces: presigned URL grant (API GW) + S3 ObjectCreated event (S3 → EventBridge → Lambda). No new HTTP endpoints from the client's POV — same POST /sessions/{id}/audio?windowIndex=N URL, different response shape.
Endpoint contracts
POST /sessions/{id}/audio?windowIndex=N (rewritten SessionAudioFunction)
Request body (JSON):
{
  "sizeBytes": 1923456
}
Response 200:
{
"url": "https://encache-raw-memory.s3.amazonaws.com/sessions/abc/window_007/audio.wav?X-Amz-Algorithm=...",
"s3Key": "sessions/abc/window_007/audio.wav",
"windowIndex": 7,
"expiresIn": 300
}
Response 400:
- sessionId path param missing
- windowIndex missing, negative, or non-integer
- sizeBytes missing, ≤ 0, or > 10 MB cap
Response 404:
- sessionId not in DynamoDB
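The validation rules above can be sketched as a pure function, which keeps them testable without API Gateway or DynamoDB. This is a minimal sketch, not the handler itself: validate_grant_request, the path parameter name "id", the injected session_exists callable, and the reading of "10 MB" as 10 * 1024 * 1024 bytes are all assumptions.

```python
import json
import re
from typing import Callable, Optional, Tuple

# Assumption: the "10 MB cap" is read here as 10 * 1024 * 1024 bytes.
MAX_SIZE_BYTES = 10 * 1024 * 1024
_NON_NEGATIVE_INT = re.compile(r"[0-9]+")


def validate_grant_request(
    path_params: Optional[dict],
    query_params: Optional[dict],
    raw_body: Optional[str],
    session_exists: Callable[[str], bool],
) -> Tuple[int, dict]:
    """Return (status, payload): parsed fields on 200, an error message otherwise."""
    session_id = (path_params or {}).get("id")
    if not session_id:
        return 400, {"error": "sessionId path param missing"}

    raw_window = (query_params or {}).get("windowIndex")
    # fullmatch on ASCII digits rejects missing, negative, and non-integer values.
    if not isinstance(raw_window, str) or not _NON_NEGATIVE_INT.fullmatch(raw_window):
        return 400, {"error": "windowIndex missing, negative, or non-integer"}

    try:
        body = json.loads(raw_body or "{}")
    except ValueError:
        body = {}
    size_bytes = body.get("sizeBytes") if isinstance(body, dict) else None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(size_bytes, int) or isinstance(size_bytes, bool):
        return 400, {"error": "sizeBytes missing or not an integer"}
    if size_bytes <= 0 or size_bytes > MAX_SIZE_BYTES:
        return 400, {"error": "sizeBytes must be > 0 and within the 10 MB cap"}

    # The existence check runs last so malformed requests never reach DynamoDB.
    if not session_exists(session_id):
        return 404, {"error": "session not found"}

    return 200, {
        "sessionId": session_id,
        "windowIndex": int(raw_window),
        "sizeBytes": size_bytes,
    }
```

Ordering the checks from cheapest to most expensive means each 400 row maps to one branch, which lines up with the one-test-per-failure-mode rule in the testing strategy.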
Removed from this handler: legacy full-session branch (_handle_legacy_upload), body decoding, S3 PUT, DDB update, ingest trigger.
PUT <presigned URL> (client → S3, no Lambda)
Headers (all must match URL binding exactly):
- Content-Type: audio/wav
- Content-Length: <sizeBytes>
Body: raw WAV bytes. URL expires after 300s.
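A sketch of how the URL minter might bind those headers when it signs the URL. It assumes s3_client is a boto3 S3 client and uses its generate_presigned_url method with put_object parameters; mint_upload_grant and URL_TTL_SECONDS are hypothetical names. That S3 rejects a mismatched Content-Length end to end is an expectation best confirmed against real S3 rather than taken from this sketch.

```python
from typing import Any

URL_TTL_SECONDS = 300  # matches expiresIn in the 200 response


def mint_upload_grant(
    s3_client: Any,
    bucket: str,
    s3_key: str,
    window_index: int,
    size_bytes: int,
) -> dict:
    """Build the 200 response body for the URL grant."""
    url = s3_client.generate_presigned_url(
        "put_object",
        Params={
            "Bucket": bucket,
            "Key": s3_key,
            # Values passed here are part of what the URL signs, so the
            # client's PUT has to send the same Content-Type and Content-Length.
            "ContentType": "audio/wav",
            "ContentLength": size_bytes,
        },
        ExpiresIn=URL_TTL_SECONDS,
    )
    return {
        "url": url,
        "s3Key": s3_key,
        "windowIndex": window_index,
        "expiresIn": URL_TTL_SECONDS,
    }
```

Passing the client in as an argument keeps the function free of module-level AWS state, so the unit tests can hand it a moto-backed client.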
AudioUploadCompleteFunction (new, S3-event triggered via EventBridge)
Event source: S3 ObjectCreated event on bucket encache-raw-memory, delivered through the default EventBridge bus with a SAM-managed EventBridgeRule filtering on detail.bucket.name = encache-raw-memory, detail.object.key prefix=sessions/, detail.object.key suffix=audio.wav. EventBridge delivers one event per invocation (no Records wrapper); the handler reads event["detail"]["object"]["key"].
Behavior:
- Parse the key against the sessions/{sessionId}/window_{NNN}/audio.wav regex; extract sessionId and windowIndex.
- If parse fails: log step: audio_upload_malformed_key and return success. Do not fail the invocation: a failure makes AWS retry an event that can never parse, then park it in the DLQ as noise.
- DDB update_item ADD completedAudioWindows :win, return ALL_NEW.
- If session row missing: log step: audio_upload_unknown_session and return success (session deleted mid-flight; the orphan stays in S3 as cold storage).
- Read captureMode and completedFrameWindows from the returned attrs.
- Compute should_trigger: captureMode == "audio_only" OR windowIndex in completedFrameWindows.
- If should_trigger: call _claim_and_trigger, which does a DDB conditional ADD ingestTriggeredWindows, then Lambda Invoke (InvocationType="Event"). If the invoke fails, the claim is rolled back via DDB DELETE ingestTriggeredWindows so the AWS async-invoke retry can re-claim and re-invoke (without rollback, the retry's conditional would silently reject and drop the window).
- Log step transitions matching the existing flow: audio_post shape with step: audio_upload_registered.
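The behavior list can be sketched as one decision function with its AWS calls injected. This is an illustration under assumptions, not the handler: handle_object_created, add_completed_audio_window (a stand-in for the DDB ADD that returns the ALL_NEW attributes, or None when the session row is missing), and the three-or-more-digit window pattern are all hypothetical.

```python
import re
from typing import Callable, Optional

# Assumption: window indexes are zero-padded to at least three digits.
KEY_RE = re.compile(
    r"sessions/(?P<session_id>[^/]+)/window_(?P<window>[0-9]{3,})/audio\.wav"
)


def handle_object_created(
    event: dict,
    add_completed_audio_window: Callable[[str, int], Optional[dict]],
    claim_and_trigger: Callable[[str, int], None],
    log: Callable[[str], None],
) -> str:
    """Return the step that was logged; exceptions are left to propagate for AWS retry."""
    key = event.get("detail", {}).get("object", {}).get("key") or ""
    match = KEY_RE.fullmatch(key)
    if match is None:
        log("audio_upload_malformed_key")
        return "audio_upload_malformed_key"  # success: a retry cannot fix the key

    session_id = match["session_id"]
    window_index = int(match["window"])

    attrs = add_completed_audio_window(session_id, window_index)
    if attrs is None:
        log("audio_upload_unknown_session")
        return "audio_upload_unknown_session"  # success: session was deleted mid-flight

    should_trigger = (
        attrs.get("captureMode") == "audio_only"
        or window_index in attrs.get("completedFrameWindows", set())
    )
    if should_trigger:
        claim_and_trigger(session_id, window_index)

    log("audio_upload_registered")
    return "audio_upload_registered"
```

Only the two "return success" branches swallow a problem; a DDB throttle or a failed ingest invoke surfaces as an exception so the async retry and DLQ path still apply.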
DLQ: dedicated SQS queue AudioUploadCompleteDLQ, 2 AWS async-invoke retries before DLQ. Manual replay path documented in runbook.
S3 key shape
sessions/{sessionId}/window_{N:03d}/audio.wav
- Server-determined, never client-supplied. Prevents path traversal into other sessions.
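A small sketch of building the key server-side. build_audio_key is a hypothetical name, and the guard on sessionId is an assumed extra check, not something the plan specifies.

```python
def build_audio_key(session_id: str, window_index: int) -> str:
    """Build the canonical S3 key for one audio window."""
    # Reject anything that could place the object under another session's prefix.
    if not session_id or "/" in session_id or session_id in {".", ".."}:
        raise ValueError(f"unsafe sessionId: {session_id!r}")
    if window_index < 0:
        raise ValueError(f"windowIndex must be >= 0, got {window_index}")
    # :03d zero-pads to at least three digits; larger indexes keep all their digits.
    return f"sessions/{session_id}/window_{window_index:03d}/audio.wav"
```

For example, build_audio_key("abc", 7) gives sessions/abc/window_007/audio.wav, the same key shown in the 200 response. Because :03d is a minimum width, window 1234 yields window_1234, so any parser on the event side has to accept more than three digits.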
Data flow
Happy path (one window)
1. Client queue picks audio item from manifest.
2. POST /sessions/{id}/audio?windowIndex=7 with {sizeBytes: 1923456}. SessionAudioFunction validates that the session exists in DDB, generates the URL, and returns {url, s3Key, windowIndex: 7, expiresIn: 300}.
3. Client PUTs raw bytes to url with Content-Length: 1923456 and Content-Type: audio/wav. S3 responds 200 OK. Client marks item uploaded in dedup index, removes from queue.
4. S3 fires ObjectCreated event to EventBridge (typically <1s, up to 30s). EventBridge routes to AudioUploadCompleteFunction via the SAM-managed S3AudioCreated rule.
5. AudioUploadCompleteFunction:
   - Parses key → (id, 7).
   - DDB ADD 7 to completedAudioWindows.
   - Checks captureMode + completedFrameWindows.
   - If ready: _claim_and_trigger fires IngestWindowFunction async.
Failure modes
| Where | What | Recovery |
|---|---|---|
| Step 2 (POST URL grant) | 5xx, network error | Client queue retries — existing PersistentUploadQueue logic, unchanged |
| Step 2 | 400 invalid sessionId/windowIndex/sizeBytes | Client gives up, item parked, surfaced via parkedCount metric |
| Step 3 (PUT to S3) | 403 SignatureDoesNotMatch | Means client tampered with headers OR clock skew OR Content-Length mismatch — park + retry from step 2 (fresh URL) |
| Step 3 | 5xx from S3 | Retry PUT with same URL until expiry; after expiry restart from step 2 |
| Step 3 | URL expired (>5 min) | Restart from step 2 — client requests fresh URL |
| Step 4 (S3 event) | S3 event fails to deliver to Lambda | AWS retries automatically; after 2 failures → DLQ |
| Step 5 (AudioUploadComplete) | DDB throttle | Lambda exception → AWS async retry (2x) → DLQ |
| Step 5 | sessionId resolves to no DDB row (deleted mid-flight) | Log audio_upload_unknown_session, return success — object stays as cold storage |
| Step 5 | _claim_and_trigger ingest invoke fails | Claim rolled back via DDB DELETE ingestTriggeredWindows. Lambda exception → AWS async retry (now able to re-claim). After 2 retry failures → DLQ. |
| Step 5 | Claim rollback itself fails (DDB throttle during DELETE) | Log ingest_claim_rollback_failed. Original invoke failure still propagates. Manual cleanup: operator deletes window from ingestTriggeredWindows set to unblock retry. |
| Step 5 | Duplicate S3 event (same key fires twice) | DDB ADD is idempotent (set semantics); _claim_and_trigger conditional dedups ingest invoke |
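The claim-rollback rows in the table are easier to follow with a toy model. The sketch below replaces DynamoDB with an in-memory set and the Lambda invoke with an injected callable. claim_and_trigger here is an illustrative stand-in for _claim_and_trigger, not its real signature.

```python
from typing import Callable, Set


def claim_and_trigger(
    claimed: Set[int],
    window_index: int,
    invoke_ingest: Callable[[int], None],
) -> bool:
    """Return True if this call triggered ingest, False if the claim was already taken."""
    # Stand-in for the DDB conditional ADD: the write is refused if already claimed.
    if window_index in claimed:
        return False
    claimed.add(window_index)
    try:
        invoke_ingest(window_index)
    except Exception:
        # Roll the claim back so the AWS async retry can claim again,
        # then re-raise so the invocation still counts as failed.
        claimed.discard(window_index)
        raise
    return True
```

If the rollback were missing, the retry would hit the first branch, return False, and the window would never be ingested. That is the stuck state the ingest_claim_rollback_failed row describes.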
Race: frame vs audio window completion
Unchanged from today. Both AudioUploadCompleteFunction (was SessionAudioFunction) and FramesFunction race to check completedFrameWindows ∩ completedAudioWindows and call _claim_and_trigger. The DDB conditional in _claim_and_trigger ensures exactly one wins. No new race introduced.
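A toy simulation of why the conditional write yields exactly one winner. A lock stands in for DynamoDB's atomic conditional update, and ClaimTable and race are hypothetical names used only for illustration.

```python
import threading
from typing import List, Set


class ClaimTable:
    """In-memory stand-in for the conditional write on ingestTriggeredWindows."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._claimed: Set[int] = set()

    def try_claim(self, window_index: int) -> bool:
        # The lock makes check-and-add one indivisible step, which is the
        # property the DDB conditional expression provides.
        with self._lock:
            if window_index in self._claimed:
                return False
            self._claimed.add(window_index)
            return True


def race(window_index: int, contenders: int = 2) -> List[bool]:
    """Run several threads that all try to claim the same window at once."""
    table = ClaimTable()
    results: List[bool] = []
    barrier = threading.Barrier(contenders)

    def contend() -> None:
        barrier.wait()  # release every thread together to maximise overlap
        results.append(table.try_claim(window_index))

    threads = [threading.Thread(target=contend) for _ in range(contenders)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

With two contenders (the audio completion handler and FramesFunction), race(7) always contains exactly one True, whichever thread gets there first.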
Orphan handling
S3 event is the source of truth, so a successful PUT cannot go unnoticed. Every audio object written to S3 fires an event; every event invokes the Lambda; every invocation either registers the window, logs a terminal skip (malformed key or unknown session), or lands in the DLQ. DLQ contents are operational signal, not data loss (manual replay).
Client queue contract
The queue's uploadFn currently makes one POST per audio item. After migration:
async function uploadAudio(item) {
  // Read the WAV chunk into memory so the exact byte count is known up front.
  const file = new File(toFileUri(item.uri));
  const bytes = await file.bytes();
  const sizeBytes = bytes.byteLength;

  // Step 1: ask the server for a presigned PUT URL bound to this size.
  const { url } = await api.post(
    `/sessions/${sessionId}/audio?windowIndex=${item.windowIndex}`,
    JSON.stringify({ sizeBytes }),
    { headers: { "Content-Type": "application/json" }, timeout: 5000 },
  );

  // Step 2: PUT the bytes straight to S3, aborting after 60s.
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), 60000);
  try {
    const resp = await fetch(url, {
      method: "PUT",
      // Both headers must match the values the URL was signed with.
      headers: { "Content-Type": "audio/wav", "Content-Length": String(sizeBytes) },
      body: bytes,
      signal: controller.signal,
    });
    if (!resp.ok) throw new Error(`S3 PUT failed: ${resp.status}`);
  } finally {
    clearTimeout(timeoutId);
  }
}
The 60s AbortController on the PUT prevents a stalled cell upload from blocking the upload queue indefinitely. Without it, the queue's per-item retry policy never fires because the fetch never resolves or rejects.
If either step throws, the queue parks the item and retries with full reset (fresh URL). uploadFn atomicity preserved.
Testing strategy (no test theater)
Discipline:
- Spec → test → code, in that order. Every test cites a Failure-Modes table row or Happy-Path step as its justification. Tests written before the corresponding production code exists.
- Test the contract, not the call. Don't assert mock_s3.put_object.called_with(...) when the input to that mock is the same constant the production code uses; that's circular. Assert on observable state: object exists in S3 with expected key+size, DDB row contains expected attribute, ingest Lambda received expected payload.
- Mock only at the AWS edge. Use moto for S3+DDB+Lambda, not handcrafted mocks of boto3 clients. moto enforces real S3 semantics (key uniqueness, event payload shape, signature binding) so passing tests are stronger evidence.
- One failure mode = one named test. Test name describes the behavior under test, not the function being tested.
- Red-green-refactor literally. Each test added to the suite runs red, then a minimal code change makes it green, then refactor if needed.
Test inventory
SessionAudioFunction (URL minter) — unit, moto for DDB session lookup:
- test_returns_presigned_url_with_correct_s3_key_for_valid_request
- test_url_binds_content_type_audio_wav
- test_url_binds_exact_content_length_from_size_bytes
- test_url_expires_in_300_seconds
- test_returns_400_when_session_id_path_param_missing
- test_returns_400_when_window_index_missing
- test_returns_400_when_window_index_negative
- test_returns_400_when_window_index_non_integer
- test_returns_400_when_size_bytes_missing
- test_returns_400_when_size_bytes_zero_or_negative
- test_returns_400_when_size_bytes_exceeds_10mb_cap
- test_returns_404_when_session_id_not_in_dynamodb
AudioUploadCompleteFunction (S3 event handler) — unit, moto for DDB+Lambda:
- test_happy_path_audio_only_mode_triggers_ingest_immediately
- test_audio_video_mode_triggers_ingest_only_when_frames_ready
- test_audio_video_mode_does_not_trigger_when_frames_pending
- test_adds_window_index_to_completed_audio_windows_in_ddb
- test_duplicate_s3_event_does_not_double_trigger_ingest
- test_malformed_s3_key_logs_and_returns_success
- test_unknown_session_id_logs_session_not_found_and_returns_success
- test_ddb_throttle_raises_for_aws_retry
- test_ingest_invoke_failure_raises_for_aws_retry
- test_concurrent_audio_and_frame_completion_invoke_ingest_exactly_once
Integration (end-to-end with moto, single test file):
- test_full_upload_flow_writes_audio_and_triggers_ingest
- test_url_signature_rejects_wrong_content_length
- test_url_signature_rejects_wrong_content_type
- test_expired_url_returns_403_on_put
Client (capture-session.ts uploadFn) — Jest:
- uploads_audio_when_both_url_grant_and_put_succeed
- parks_item_when_url_grant_returns_5xx
- parks_item_when_put_to_s3_returns_5xx
- parks_item_when_put_returns_403_signature_mismatch
- refetches_url_after_5xx_retry_does_not_reuse_stale_url
- does_not_remove_item_from_queue_if_put_throws
Deferred test scope
Filed as separate issues with explicit resolution triggers:
- #496: property-based tests (hypothesis + fast-check); trigger: 2 weeks green CI with real users, OR regression slip, OR contract refactor.
- #497: real-AWS integration test (not moto); trigger: boto3 major bump, OR presigned URL param change, OR moto-vs-real divergence.
- #498: load test for AudioUploadCompleteFunction; trigger: ~4000 concurrent recording users projected, OR first DDB throttle alarm, OR first DLQ depth > 0, OR pre-launch event.
Done criterion for testing
- All tests above written and failing before any production code for them exists. Git history shows test commits preceding implementation commits.
- Final suite: 100% green, no skips, no xfails.
- Mutation check on critical paths: reverting the corresponding production logic causes the test to fail (guards against tests that pass for the wrong reason).
Deployment + infrastructure
CloudFormation / SAM changes (main/server/template.yaml)
| Resource | Action | Notes |
|---|---|---|
| SessionAudioFunction (existing) | Modify: keep route, rewrite handler | Smaller code; can drop MemorySize if profile shows reduced footprint |
| AudioUploadCompleteFunction | New | Runtime: python3.12, Timeout: 30, MemorySize: 256, layers: SharedLayer, dedicated DLQ |
| AudioUploadCompleteDLQ | New | Type: AWS::SQS::Queue, redrive from Lambda (2 retries → DLQ) |
| AudioUploadCompletePermission | New | AWS::Lambda::Permission granting s3.amazonaws.com invoke rights on AudioUploadCompleteFunction |
| SessionAudioFunction.Policies | Modify | Drop DDB update_item on completedAudioWindows and lambda:InvokeFunction on IngestWindowFunction; no longer needed |
| AudioUploadCompleteFunction.Policies | New | DatabaseSsmPolicy, DDB update on encache-sessions, lambda:InvokeFunction for IngestWindowFunction |
Note: S3 ObjectCreated notification on encache-raw-memory is wired in Terraform (main/devops/main.tf) because the bucket is Terraform-managed, not CloudFormation. SAM contributes only AudioUploadCompleteFunction + AudioUploadCompleteDLQ + AudioUploadCompletePermission (Lambda permission grant for s3.amazonaws.com).
Terraform changes (main/devops/main.tf)
| Resource | Action | Notes |
|---|---|---|
| data "aws_lambda_function" "audio_upload_complete" | New | Looks up the SAM-deployed Lambda by deterministic name server-AudioUploadCompleteFunction |
| aws_s3_bucket_notification.raw_data_audio | New | Wires s3:ObjectCreated:* on prefix sessions/, suffix audio.wav to the looked-up Lambda |
IAM principle
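SessionAudioFunction loses DDB write and Lambda invoke. It keeps s3:PutObject on the bucket even though it no longer writes objects itself: a presigned URL carries only the permissions of the role that signed it, so without that grant the client's PUT would be rejected with 403. AudioUploadCompleteFunction gets only the DDB write and Lambda invoke that were removed. Net IAM surface is unchanged but cleanly separated by responsibility.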
Deploy order
1. cd main/server && sam deploy: creates the Lambda + permission grant. Lambda name pinned via FunctionName: !Sub "${AWS::StackName}-AudioUploadCompleteFunction" so the Terraform lookup is deterministic.
2. cd main/devops && terraform apply: wires the notification.
Reverse order fails: the data lookup errors if the Lambda doesn't exist.
With zero users today, no canary or phased rollout is required.
Observability
New CloudWatch alarms:
- AudioUploadCompleteFunction Errors > 0 over 5 min → Discord webhook
- AudioUploadCompleteDLQ ApproximateNumberOfMessagesVisible > 0 → Discord webhook
- SessionAudioFunction Errors > 5/min → Discord webhook (URL grant failures are user-facing)
Log grep patterns for runbook:
- "step": "audio_url_granted" - successful URL mint
- "step": "audio_upload_registered" - S3 event handler success
- "step": "audio_upload_unknown_session" - session deleted mid-flight; rare
- "step": "audio_upload_malformed_key" - should never happen; investigate immediately
- "step": "ingest_claim_lost" - race between audio+frame triggers; harmless
- "step": "ingest_claim_rolled_back" - invoke failed; claim removed so AWS retry can re-claim. Expected to be rare; spike indicates ingest Lambda health issues.
- "step": "ingest_claim_rollback_failed" - both invoke AND rollback failed. Investigate immediately; window is stuck (claim recorded but ingest never fired and retry will see the claim).
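For those grep patterns to match, the handlers need to emit the step field with the same spacing. A minimal sketch of such a logger follows. log_step is a hypothetical helper, and the flow value audio_post is taken from the existing log shape this plan references; the exact set of fields is an assumption.

```python
import json
from typing import Any


def log_step(step: str, **fields: Any) -> str:
    """Print one JSON log line and return it."""
    # json.dumps separates items with ", " and keys from values with ": " by
    # default, so the line contains the literal text "step": "<name>".
    line = json.dumps({"flow": "audio_post", "step": step, **fields}, default=str)
    print(line)
    return line
```

A call such as log_step("audio_upload_registered", sessionId="abc", windowIndex=7) produces a single line that both a JSON parser and a plain-text grep for "step": "audio_upload_registered" can match.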
Cutover
No flag, no canary, no gradual rollout (no users). Ship via standard sam deploy.
Post-deploy verification:
- Run integration test suite against deployed stack (not just moto).
- Manually exercise capture flow on a test device → confirm audio chunk lands in S3 + DDB updates + ingest fires.
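- Trigger an artificial failure inside AudioUploadCompleteFunction (for example, a forced handler exception for a test key) → confirm the DLQ receives the event after retries. A bad URL signature cannot exercise this path: S3 rejects the PUT with 403, no object is created, and no event fires.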
Rollback
git revert + sam deploy. Old handler code is restored, and the S3 notification is deleted once the new function is absent. Caveat: any audio chunks PUT to S3 between the rollback decision and rollback completion will not be registered (the handler is gone). Acceptable risk given the zero-user scenario.
Out of scope
- sessions/frames migration: body is small (~100 KB), current path fits the 29s cap, doubling round trips would hurt.
- memories/video large multipart I/O: tracked in #494.
- Repository-wide enforcement of boto3.client(..., config=...): tracked in #495.
- Backward-compatible legacy _handle_legacy_upload path: deleted entirely; no users today.
- Property-based tests: #496.
- Real-AWS integration tests: #497.
- Load tests: #498.
Files affected
Server (Python):
- main/server/api/sessions/audio/app.py: rewrite (URL minter only)
- main/server/events/audio_upload_complete/app.py: new Lambda handler. New top-level events/ directory introduced for S3/SNS/EventBridge-triggered handlers (no existing event-driven Lambda in repo, so no precedent; clean separation from api/ for HTTP and worldmm/ for the ingest pipeline).
- main/server/template.yaml: new function + DLQ + S3 notification + IAM changes
- main/server/tests/unit/test_session_audio_url_minter.py: new unit tests per Test Inventory (follows repo convention of centralized tests/unit/, not per-handler __tests__/)
- main/server/tests/unit/test_audio_upload_complete.py: new unit tests per Test Inventory
- main/server/tests/integration/test_audio_presigned_flow.py: new end-to-end tests
Client (TypeScript):
- main/app/lib/capture-session.ts: rewrite audio branch of uploadFn (lines 103-112)
- main/app/lib/__tests__/capture-session.test.ts: new tests per Test Inventory
Docs:
- docs/plans/2026-05-19-sessions-audio-presigned.md: this file
- docs/docs/: update affected system docs once code lands (separate pass)