GPU Retry & Push Notification
System Intent
When the async worker Lambda hits GPU_MAX_WAIT_MS without a healthy GPU, it currently saves a static failure string as the assistant message (ready=true). The user sees a dead end — the GPU may come up seconds later with no one watching.
Solution: On timeout, keep the placeholder ready=false and enqueue to a new ChatGpuRetryQueue (SQS). A new ChatGpuRetryFunction retries every 2 minutes (up to 30 min total). When the GPU is healthy it runs the full reasoning pipeline, saves the answer (ready=true), and fires an Expo push notification to the user's device. The user can close the app; the retry is entirely backend-driven.
Frontend change: the GpuUnavailableError catch block shows "Starting up the GPU, we will notify you when we have an answer" instead of the current offline message. The existing polling loop already handles the eventual ready=true update with no further changes.
Push token: the app registers an Expo push token on startup and calls PUT /users/push-token. The Sessions table stores push_token keyed by user_id. The retry Lambda reads it when the answer is ready.
Mermaid Diagram
graph TD
App["Chat Screen\nmain/app/app/chat.tsx"]:::updated -->|"POST /memories/chat"| APIGW["API Gateway"]:::unchanged
APIGW --> Dispatcher["Dispatcher Lambda\n(unchanged)"]:::unchanged
Dispatcher -->|"save user msg + placeholder\nready=false\ninvoke worker async"| Worker["Async Worker Lambda\napi/memories/chat/app.py"]:::updated
Dispatcher -->|"chat_id message_id status=thinking"| App
App -->|"show: Starting up the GPU…\n(GpuUnavailableError path)"| App
App -->|"poll GET /chats/messages\nevery 500ms"| APIGW
Worker -->|"wait_for_gpu_health\nGPU_MAX_WAIT_MS"| GPU["GPU Worker EC2"]:::unchanged
GPU -->|"healthy"| Worker
Worker -->|"GPU healthy → save answer ready=true"| DDB["DynamoDB Messages"]:::unchanged
Worker -->|"GPU timeout → mark_gpu_pending\nenqueue retry_count=0 delay=120s"| SQS["ChatGpuRetryQueue\n(SQS)"]:::created
SQS -->|"trigger"| Retry["ChatGpuRetryFunction\napi/memories/chat/retry.py"]:::created
Retry -->|"wait_for_gpu_health\nGPU_RETRY_WAIT_MS=240s"| GPU
GPU -->|"healthy"| Retry
Retry -->|"save answer ready=true"| DDB
Retry -->|"read push_token"| Sessions["Sessions DynamoDB"]:::updated
Retry -->|"POST /push/send"| Expo["Expo Push API"]:::created
Expo -->|"push notification"| App
Retry -->|"GPU still down + retries < MAX\nre-enqueue delay=120s"| SQS
Retry -->|"max retries exceeded\nsave failure msg ready=true\nsend failure notification"| DDB
DDB -->|"ready=true"| App
classDef unchanged fill:#d3d3d3,stroke:#888
classDef updated fill:#ffe58a,stroke:#888
classDef created fill:#a8e6a3,stroke:#888
Black-Box Input/Output Contracts
Flow: worker-timeout — async worker GPU timeout path
Inputs (Lambda event): _async_worker=true, question, chat_id, message_id, _user_id
Behavior (GPU timeout branch, replaces the current failure save):
1. GPU health check exhausts GPU_MAX_WAIT_MS with no healthy response.
2. Call repo.mark_gpu_pending(chat_id, message_id): sets gpu_pending=True, keeps ready=False, content="".
3. Enqueue to ChatGpuRetryQueue: {chat_id, message_id, user_id, question, retry_count: 0}, DelaySeconds=120.
4. Return (Lambda exits; SQS drives the rest).
Behavior (GPU healthy — unchanged): answer saved, ready=True, returns normally.
Side effects:
- gpu_pending=True attribute written to Messages table on timeout
- SQS message enqueued with 2-minute initial delay
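The timeout branch above can be sketched as a small function. This is a minimal illustration, not the actual worker code: `handle_gpu_timeout`, its parameters, and the `False` return for the legacy path are hypothetical names chosen here. `repo` is assumed to expose `mark_gpu_pending`, and `sqs` is assumed to be a boto3 SQS client or any stand-in with a compatible `send_message`.

```python
import json

RETRY_DELAY_SECONDS = 120  # 2-minute initial delay from the contract above


def handle_gpu_timeout(repo, sqs, queue_url, event):
    """Hypothetical helper for the GPU-timeout branch of the async worker.

    Returns True when the retry was enqueued, False when the caller should
    fall back to the legacy failure-string save (no message_id).
    """
    chat_id = event["chat_id"]
    message_id = event.get("message_id")
    if not message_id:
        # Legacy synchronous path: nothing to mark pending.
        return False

    # Keep the placeholder ready=False and flag it as waiting on the GPU.
    repo.mark_gpu_pending(chat_id, message_id)

    body = {
        "chat_id": chat_id,
        "message_id": message_id,
        "user_id": event["_user_id"],
        "question": event["question"],
        "retry_count": 0,
    }
    # Same keyword arguments as boto3's SQS client send_message.
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps(body),
        DelaySeconds=RETRY_DELAY_SECONDS,
    )
    return True
```

Passing `repo` and `sqs` in as arguments keeps the branch easy to exercise in Test 1 with simple fakes, without touching AWS.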
Flow: gpu-retry — ChatGpuRetryFunction
Inputs (SQS record body): chat_id, message_id, user_id, question, retry_count
Outputs (GPU becomes healthy):
- repo.update_message_ready(chat_id, message_id, content=answer): clears gpu_pending, sets ready=True
- Expo push notification sent: title "Your answer is ready", body = first 80 chars of question
Outputs (GPU still unavailable, retry_count < GPU_RETRY_MAX_ATTEMPTS, default 15):
- Re-enqueue {...retry_count+1} with DelaySeconds=120
- Return normally (SQS message deleted; retry is the new message)
Outputs (max retries exceeded):
- repo.update_message_ready(chat_id, message_id, content="Sorry, the GPU took too long to start. Please try again.")
- Expo push notification sent: title "Couldn't answer", body = "The GPU took too long to start. Please open the app and retry."
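The three outcomes of the retry flow reduce to one decision on `gpu_healthy` and `retry_count`. The sketch below isolates that decision as a pure function. `plan_retry` and `RetryOutcome` are hypothetical names introduced for illustration; the strings and thresholds are copied from the contract above.

```python
from dataclasses import dataclass
from typing import Optional

GPU_RETRY_MAX_ATTEMPTS = 15  # default from the contract
FAILURE_BODY = "The GPU took too long to start. Please open the app and retry."


@dataclass
class RetryOutcome:
    action: str  # "answer", "requeue", or "fail"
    next_retry_count: Optional[int] = None
    notify_title: Optional[str] = None
    notify_body: Optional[str] = None


def plan_retry(gpu_healthy, retry_count, question,
               max_attempts=GPU_RETRY_MAX_ATTEMPTS):
    """Decide what the retry Lambda should do for one SQS record."""
    if gpu_healthy:
        # Run the pipeline, save the answer, then notify with a short preview.
        return RetryOutcome("answer",
                            notify_title="Your answer is ready",
                            notify_body=question[:80])
    if retry_count < max_attempts:
        # GPU still down: enqueue a fresh message with the count bumped.
        return RetryOutcome("requeue", next_retry_count=retry_count + 1)
    # Budget exhausted: save the failure text and send the failure push.
    return RetryOutcome("fail",
                        notify_title="Couldn't answer",
                        notify_body=FAILURE_BODY)
```

With the default budget, `plan_retry(False, 3, q)` asks for a requeue with `retry_count=4`, and `plan_retry(False, 15, q)` falls through to the failure outcome, matching Tests 3 and 4.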
Flow: push-token — PUT /users/push-token
Inputs: push_token (string, required)
Outputs: HTTP 200 {success: true}
Validation: token must be a non-empty string. No format enforcement (supports Expo, APNs, FCM).
Side effects:
- Sessions table item for user_id updated: SET push_token = :t
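A possible shape for the push-token handler, under stated assumptions: `put_push_token` is a hypothetical name, `table` is assumed to behave like a boto3 DynamoDB Table resource, the Sessions key is assumed to be `{"user_id": ...}`, and the HTTP 400 response for an invalid token is an assumption, since the contract only specifies the success case.

```python
import json


def put_push_token(table, user_id, raw_body):
    """Validate the request body and store the token on the user's Sessions item."""
    try:
        payload = json.loads(raw_body or "{}")
    except json.JSONDecodeError:
        payload = {}

    token = payload.get("push_token") if isinstance(payload, dict) else None
    # Only "non-empty string" is enforced; the format is deliberately not checked.
    if not isinstance(token, str) or not token.strip():
        return {
            "statusCode": 400,  # assumed error shape, not defined by the contract
            "body": json.dumps({"success": False, "error": "push_token is required"}),
        }

    # Same keyword arguments as boto3's Table.update_item.
    table.update_item(
        Key={"user_id": user_id},
        UpdateExpression="SET push_token = :t",
        ExpressionAttributeValues={":t": token},
    )
    return {"statusCode": 200, "body": json.dumps({"success": True})}
```

In the real Lambda, `user_id` would come from the authenticated request context; that wiring is left out here because the spec does not describe it.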
Acceptance Criteria
Test 1: worker-timeout-enqueues-retry
Given: Worker Lambda, no GPU available, GPU_MAX_WAIT_MS exhausted, message_id set
Then: mark_gpu_pending called on placeholder; SQS send_message called with retry_count=0, DelaySeconds=120; worker returns without saving failure string
Test 2: retry-gpu-healthy-saves-answer-and-notifies
Given: ChatGpuRetryFunction receives SQS message, GPU becomes healthy within GPU_RETRY_WAIT_MS
Then: handle_chat called; update_message_ready called with answer content; Expo Push API called with user's push token
Test 3: retry-gpu-unhealthy-reenqueues
Given: ChatGpuRetryFunction, GPU still unavailable, retry_count=3 (< MAX)
Then: SQS send_message called with retry_count=4, DelaySeconds=120; no answer saved; no push sent
Test 4: retry-max-attempts-saves-failure-and-notifies
Given: ChatGpuRetryFunction, GPU still unavailable, retry_count=15 (= MAX)
Then: update_message_ready called with failure string; Expo Push API called with failure notification; SQS not re-enqueued
Test 5: retry-no-push-token-skips-notification
Given: ChatGpuRetryFunction, GPU healthy, Sessions table has no push_token for user
Then: Answer saved normally; Expo Push API NOT called; no error raised
Test 6: push-token-stored
Given: PUT /users/push-token with {push_token: "ExponentPushToken[xxx]"}, authenticated user
Then: HTTP 200 {success: true}; Sessions table item has push_token set
Test 7: frontend-shows-starting-up-message
Given: chatWithMemory throws GpuUnavailableError
Then: Chat screen appends assistant bubble with "Starting up the GPU, we will notify you when we have an answer"
Implementation Checklist
Files checklist
- [ ] main/server/layers/shared/python/shared/chat/repository.py: add mark_gpu_pending(chat_id, message_id)
- [ ] main/server/api/memories/chat/app.py: replace GPU-timeout failure path: call mark_gpu_pending + enqueue to SQS instead of saving error string
- [ ] main/server/api/memories/chat/retry.py: NEW: retry_handler(event, context): resolve GPU, run chat, save answer, send push notification, re-enqueue or fail
- [ ] main/server/api/users/push_token/__init__.py: NEW: empty init
- [ ] main/server/api/users/push_token/app.py: NEW: PUT /users/push-token handler, updates Sessions table
- [ ] main/server/template.yaml: add ChatGpuRetryQueue, ChatGpuRetryFunction, UsersPushTokenFunction; add SQS send permission + CHAT_GPU_RETRY_QUEUE_URL env var to MemoriesChatFunction
- [ ] main/app/app/chat.tsx: change GpuUnavailableError message to "Starting up the GPU, we will notify you when we have an answer"
- [ ] main/app/lib/push-notifications.ts: NEW: registerForPushNotifications() using expo-notifications
- [ ] main/app/lib/api/users/pushToken.ts: NEW: savePushToken(token) API client
- [ ] main/app/app/_layout.tsx: call registerForPushNotifications() + savePushToken() after auth
Flows checklist
- [ ] worker-timeout: mark_gpu_pending + enqueue on GPU timeout (replaces error-string save)
- [ ] gpu-retry: ChatGpuRetryFunction retries GPU, saves answer, sends push notification
- [ ] push-token: PUT /users/push-token stores token in Sessions table
- [ ] frontend-error-copy: GpuUnavailableError shows "Starting up the GPU…" message
- [ ] push-registration: app registers Expo token on startup and sends to backend
Notes
- mark_gpu_pending only called when message_id is set (async dispatcher path). Legacy synchronous path (no message_id) keeps existing save-failure-string behavior.
- ChatGpuRetryQueue visibility timeout = 300s (longer than GPU_RETRY_WAIT_MS=240s) to prevent double-processing.
- Expo Push API is called with a simple requests.post; no SDK needed. Token validation is Expo's responsibility.
- GPU_RETRY_MAX_ATTEMPTS defaults to 15 (≈ 30 minutes total at 2-min intervals).
- Polling in the frontend is unchanged: it will pick up ready=true whether it arrives from the worker or the retry Lambda.
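As a rough sketch of the SDK-free push call mentioned in the notes: `send_push` and `build_expo_message` are hypothetical helper names, and the endpoint constant is Expo's public push URL as commonly documented, which should be confirmed against the current Expo docs. The `post` argument is expected to be `requests.post` in production; it is injected here so the example runs without network access or the `requests` package.

```python
EXPO_PUSH_URL = "https://exp.host/--/api/v2/push/send"  # assumed endpoint; verify


def build_expo_message(push_token, title, body):
    """Build the JSON payload for a single Expo push message."""
    return {"to": push_token, "title": title, "body": body}


def send_push(push_token, title, body, post):
    """Send one notification; skip quietly when the user has no token (Test 5).

    `post` must accept (url, json=..., timeout=...) like requests.post and
    return an object with a status_code attribute.
    """
    if not push_token:
        return False
    response = post(
        EXPO_PUSH_URL,
        json=build_expo_message(push_token, title, body),
        timeout=10,  # illustrative value; keeps a slow push from stalling the Lambda
    )
    # An HTTP 200 only means the request was accepted; per-message errors,
    # if any, are reported inside the response body and are not inspected here.
    return response.status_code == 200
```

A call from the retry Lambda might look like `send_push(token, "Your answer is ready", question[:80], requests.post)`.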