GPU Retry & Push Notification

System Intent

When the async worker Lambda hits GPU_MAX_WAIT_MS without a healthy GPU, it currently saves a static failure string as the assistant message (ready=true). The user sees a dead end — the GPU may come up seconds later with no one watching.

Solution: On timeout, keep the placeholder ready=false and enqueue to a new ChatGpuRetryQueue (SQS). A new ChatGpuRetryFunction retries every 2 minutes (up to 30 min total). When the GPU is healthy it runs the full reasoning pipeline, saves the answer (ready=true), and fires an Expo push notification to the user's device. The user can close the app; the retry is entirely backend-driven.

Frontend change: the GpuUnavailableError catch block shows "Starting up the GPU, we will notify you when we have an answer" instead of the current offline message. The existing polling loop already handles the eventual ready=true update with no further changes.

Push token: the app registers an Expo push token on startup and calls PUT /users/push-token. The Sessions table stores push_token keyed by user_id. The retry Lambda reads it when the answer is ready.

Mermaid Diagram

graph TD
    App["Chat Screen\nmain/app/app/chat.tsx"]:::updated -->|"POST /memories/chat"| APIGW["API Gateway"]:::unchanged
    APIGW --> Dispatcher["Dispatcher Lambda\n(unchanged)"]:::unchanged
    Dispatcher -->|"save user msg + placeholder\nready=false\ninvoke worker async"| Worker["Async Worker Lambda\napi/memories/chat/app.py"]:::updated
    Dispatcher -->|"chat_id message_id status=thinking"| App
    App -->|"show: Starting up the GPU…\n(GpuUnavailableError path)"| App
    App -->|"poll GET /chats/messages\nevery 500ms"| APIGW

    Worker -->|"wait_for_gpu_health\nGPU_MAX_WAIT_MS"| GPU["GPU Worker EC2"]:::unchanged
    GPU -->|"healthy"| Worker
    Worker -->|"GPU healthy → save answer ready=true"| DDB["DynamoDB Messages"]:::unchanged
    Worker -->|"GPU timeout → mark_gpu_pending\nenqueue retry_count=0 delay=120s"| SQS["ChatGpuRetryQueue\n(SQS)"]:::created

    SQS -->|"trigger"| Retry["ChatGpuRetryFunction\napi/memories/chat/retry.py"]:::created
    Retry -->|"wait_for_gpu_health\nGPU_RETRY_WAIT_MS=240s"| GPU
    GPU -->|"healthy"| Retry
    Retry -->|"save answer ready=true"| DDB
    Retry -->|"read push_token"| Sessions["Sessions DynamoDB"]:::updated
    Retry -->|"POST /push/send"| Expo["Expo Push API"]:::created
    Expo -->|"push notification"| App
    Retry -->|"GPU still down + retries < MAX\nre-enqueue delay=120s"| SQS
    Retry -->|"max retries exceeded\nsave failure msg ready=true\nsend failure notification"| DDB

    DDB -->|"ready=true"| App

    classDef unchanged fill:#d3d3d3,stroke:#888
    classDef updated fill:#ffe58a,stroke:#888
    classDef created fill:#a8e6a3,stroke:#888

Black-Box Input/Output Contracts

Flow: worker-timeout — async worker GPU timeout path

Inputs (Lambda event): _async_worker=true, question, chat_id, message_id, _user_id

Behavior (GPU timeout branch, replaces the current failure save):
1. GPU health check exhausts GPU_MAX_WAIT_MS with no healthy response.
2. Call repo.mark_gpu_pending(chat_id, message_id): sets gpu_pending=True, keeps ready=False, content="".
3. Enqueue to ChatGpuRetryQueue: {chat_id, message_id, user_id, question, retry_count: 0}, DelaySeconds=120.
4. Return (Lambda exits; SQS drives the rest).

Behavior (GPU healthy — unchanged): answer saved, ready=True, returns normally.

Side effects:
- gpu_pending=True attribute written to Messages table on timeout
- SQS message enqueued with 2-minute initial delay
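
The timeout branch above can be sketched as follows. This is a minimal illustration, not the actual app.py code: handle_gpu_timeout and build_retry_body are hypothetical names, and repo and sqs are injected so the sketch runs without AWS. Only sqs.send_message(QueueUrl=..., MessageBody=..., DelaySeconds=...) mirrors a real boto3 call. The early return for a missing message_id reflects the legacy synchronous path, which keeps its existing behavior.

```python
import json
from typing import Any, Optional

RETRY_DELAY_SECONDS = 120  # initial 2-minute delay from the contract above


def build_retry_body(chat_id: str, message_id: str, user_id: str,
                     question: str, retry_count: int = 0) -> str:
    """Serialize the SQS message body described in step 3."""
    return json.dumps({
        "chat_id": chat_id,
        "message_id": message_id,
        "user_id": user_id,
        "question": question,
        "retry_count": retry_count,
    })


def handle_gpu_timeout(repo: Any, sqs: Any, queue_url: str, *, chat_id: str,
                       message_id: Optional[str], user_id: str,
                       question: str) -> bool:
    """Return True if a retry was enqueued, False if the legacy path applies."""
    if not message_id:
        # Legacy synchronous path: the caller keeps saving the failure string.
        return False
    # Step 2: keep the placeholder at ready=False and flag it as pending.
    repo.mark_gpu_pending(chat_id, message_id)
    # Step 3: hand the work to the retry queue with retry_count=0.
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=build_retry_body(chat_id, message_id, user_id, question),
        DelaySeconds=RETRY_DELAY_SECONDS,
    )
    # Step 4: nothing else to do; SQS drives the rest.
    return True
```

Passing repo and sqs in as arguments keeps the branch easy to cover in Test 1 with simple fakes.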

Flow: gpu-retry — ChatGpuRetryFunction

Inputs (SQS record body): chat_id, message_id, user_id, question, retry_count

Outputs (GPU becomes healthy):
- repo.update_message_ready(chat_id, message_id, content=answer): clears gpu_pending, sets ready=True
- Expo push notification sent: title "Your answer is ready", body = first 80 chars of question

Outputs (GPU still unavailable, retry_count < GPU_RETRY_MAX_ATTEMPTS, default 15):
- Re-enqueue {...retry_count+1} with DelaySeconds=120
- Return normally (SQS message deleted; retry is the new message)

Outputs (max retries exceeded):
- repo.update_message_ready(chat_id, message_id, content="Sorry, the GPU took too long to start. Please try again.")
- Expo push notification sent: title "Couldn't answer", body = "The GPU took too long to start. Please open the app and retry."
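
The three outcomes above reduce to one small decision. The sketch below is illustrative only: RetryAction, decide_retry_action, and next_retry_body are hypothetical names, and the default of 15 is the GPU_RETRY_MAX_ATTEMPTS default stated above.

```python
from enum import Enum


class RetryAction(Enum):
    ANSWER = "answer"        # GPU healthy: run the pipeline, save, notify
    REENQUEUE = "reenqueue"  # GPU down, attempts left: send a new delayed message
    FAIL = "fail"            # GPU down, attempts exhausted: save failure, notify


def decide_retry_action(gpu_healthy: bool, retry_count: int,
                        max_attempts: int = 15) -> RetryAction:
    """Map the health-check result and retry counter to one of the outcomes."""
    if gpu_healthy:
        return RetryAction.ANSWER
    if retry_count < max_attempts:
        return RetryAction.REENQUEUE
    return RetryAction.FAIL


def next_retry_body(body: dict) -> dict:
    """Copy the SQS body with retry_count incremented for the re-enqueue."""
    return {**body, "retry_count": body["retry_count"] + 1}
```

With these rules, a message carrying retry_count=3 is re-enqueued with retry_count=4, and one carrying retry_count=15 takes the failure path, matching Tests 3 and 4.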

Flow: push-token — PUT /users/push-token

Inputs: push_token (string, required)

Outputs:

{ "success": true }

HTTP 200.

Validation: token must be non-empty string. No format enforcement (supports Expo, APNs, FCM).

Side effects:
- Sessions table item for user_id updated: SET push_token = :t
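
A minimal sketch of the handler logic for this contract follows. The function names are hypothetical, and table stands in for a boto3 DynamoDB Table resource for Sessions (only update_item is used). The contract does not say what an invalid token returns, so the HTTP 400 response here is an assumption.

```python
import json
from typing import Any, Optional


def parse_push_token(raw_body: Optional[str]) -> Optional[str]:
    """Return the token if the body holds a non-empty string, else None."""
    try:
        data = json.loads(raw_body or "{}")
    except json.JSONDecodeError:
        return None
    token = data.get("push_token") if isinstance(data, dict) else None
    if isinstance(token, str) and token.strip():
        return token  # no format enforcement, per the validation rule
    return None


def handle_put_push_token(table: Any, user_id: str,
                          raw_body: Optional[str]) -> dict:
    """Validate the request body and store the token on the user's item."""
    token = parse_push_token(raw_body)
    if token is None:
        # Assumption: the error shape is not specified in the contract.
        return {"statusCode": 400, "body": json.dumps({"success": False})}
    table.update_item(
        Key={"user_id": user_id},
        UpdateExpression="SET push_token = :t",
        ExpressionAttributeValues={":t": token},
    )
    return {"statusCode": 200, "body": json.dumps({"success": True})}
```

How user_id is derived from the authenticated request is left out, since the source does not describe it.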

Acceptance Criteria

Test 1: worker-timeout-enqueues-retry

Given: Worker Lambda, no GPU available, GPU_MAX_WAIT_MS exhausted, message_id set
Then: mark_gpu_pending called on placeholder; SQS send_message called with retry_count=0, DelaySeconds=120; worker returns without saving failure string

Test 2: retry-gpu-healthy-saves-answer-and-notifies

Given: ChatGpuRetryFunction receives SQS message, GPU becomes healthy within GPU_RETRY_WAIT_MS
Then: handle_chat called; update_message_ready called with answer content; Expo Push API called with user's push token

Test 3: retry-gpu-unhealthy-reenqueues

Given: ChatGpuRetryFunction, GPU still unavailable, retry_count=3 (< MAX)
Then: SQS send_message called with retry_count=4, DelaySeconds=120; no answer saved; no push sent

Test 4: retry-max-attempts-saves-failure-and-notifies

Given: ChatGpuRetryFunction, GPU still unavailable, retry_count=15 (= MAX)
Then: update_message_ready called with failure string; Expo Push API called with failure notification; SQS not re-enqueued

Test 5: retry-no-push-token-skips-notification

Given: ChatGpuRetryFunction, GPU healthy, Sessions table has no push_token for user
Then: Answer saved normally; Expo Push API NOT called; no error raised

Test 6: push-token-stored

Given: PUT /users/push-token with {push_token: "ExponentPushToken[xxx]"}, authenticated user
Then: HTTP 200 {success: true}; Sessions table item has push_token set

Test 7: frontend-shows-starting-up-message

Given: chatWithMemory throws GpuUnavailableError
Then: Chat screen appends assistant bubble with "Starting up the GPU, we will notify you when we have an answer"

Implementation Checklist

Files checklist

[ ] main/server/layers/shared/python/shared/chat/repository.py — add mark_gpu_pending(chat_id, message_id)
[ ] main/server/api/memories/chat/app.py — replace GPU-timeout failure path: call mark_gpu_pending + enqueue to SQS instead of saving error string
[ ] main/server/api/memories/chat/retry.py — NEW: retry_handler(event, context) — resolve GPU, run chat, save answer, send push notification, re-enqueue or fail
[ ] main/server/api/users/push_token/__init__.py — NEW: empty init
[ ] main/server/api/users/push_token/app.py — NEW: PUT /users/push-token handler, updates Sessions table
[ ] main/server/template.yaml — add ChatGpuRetryQueue, ChatGpuRetryFunction, UsersPushTokenFunction; add SQS send permission + CHAT_GPU_RETRY_QUEUE_URL env var to MemoriesChatFunction
[ ] main/app/app/chat.tsx — change GpuUnavailableError message to "Starting up the GPU, we will notify you when we have an answer"
[ ] main/app/lib/push-notifications.ts — NEW: registerForPushNotifications() using expo-notifications
[ ] main/app/lib/api/users/pushToken.ts — NEW: savePushToken(token) API client
[ ] main/app/app/_layout.tsx — call registerForPushNotifications() + savePushToken() after auth
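
For the first checklist item, mark_gpu_pending could be a single DynamoDB update. The sketch below rests on several assumptions: the Messages table key schema (chat_id plus message_id) is a guess, the real method lives on the repository class rather than taking table as an argument, and table stands in for a boto3 Table resource. Attribute names go through ExpressionAttributeNames so the expression stays valid even if one of them collides with a DynamoDB reserved word.

```python
from typing import Any


def mark_gpu_pending(table: Any, chat_id: str, message_id: str) -> None:
    """Flag the placeholder as waiting on the GPU without marking it ready."""
    table.update_item(
        # Assumed key schema; adjust to the actual Messages table.
        Key={"chat_id": chat_id, "message_id": message_id},
        UpdateExpression="SET #p = :p, #r = :r, #c = :c",
        ExpressionAttributeNames={
            "#p": "gpu_pending",
            "#r": "ready",
            "#c": "content",
        },
        ExpressionAttributeValues={":p": True, ":r": False, ":c": ""},
    )
```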

Flows checklist

[ ] worker-timeout: mark_gpu_pending + enqueue on GPU timeout (replaces error-string save)
[ ] gpu-retry: ChatGpuRetryFunction retries GPU, saves answer, sends push notification
[ ] push-token: PUT /users/push-token stores token in Sessions table
[ ] frontend-error-copy: GpuUnavailableError shows "Starting up the GPU…" message
[ ] push-registration: app registers Expo token on startup and sends to backend

Notes

mark_gpu_pending only called when message_id is set (async dispatcher path). Legacy synchronous path (no message_id) keeps existing save-failure-string behavior.
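ChatGpuRetryQueue visibility timeout = 300s, longer than GPU_RETRY_WAIT_MS (240s, i.e. 240000 ms), to prevent double-processing. The remaining 60s also has to cover the reasoning pipeline and the save that follow a successful health check, so the function timeout should stay within 300s.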
Expo Push API is called with a simple requests.post — no SDK needed. Token validation is Expo's responsibility.
GPU_RETRY_MAX_ATTEMPTS defaults to 15 (≈ 30 minutes of queue delay at 2-min intervals; time each attempt spends waiting on GPU health, up to GPU_RETRY_WAIT_MS, adds to that).
Polling in the frontend is unchanged — it will pick up ready=true whether it arrives from the worker or the retry Lambda.
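
The push step from the notes above might look like the sketch below. The function names are hypothetical. The endpoint is Expo's public push send URL and the message uses its to, title, and body fields. The post argument exists only so the sketch can be exercised without network access; when it is omitted, requests.post is used as the note describes. A missing token skips the call without raising, as Test 5 requires.

```python
from typing import Any, Callable, Optional

EXPO_PUSH_URL = "https://exp.host/--/api/v2/push/send"


def build_expo_message(push_token: str, title: str, body: str) -> dict:
    """Build the JSON payload for a single Expo push message."""
    return {"to": push_token, "title": title, "body": body}


def answer_ready_message(push_token: str, question: str) -> dict:
    """Success notification: body is the first 80 characters of the question."""
    return build_expo_message(push_token, "Your answer is ready", question[:80])


def send_push(push_token: Optional[str], title: str, body: str,
              post: Optional[Callable[..., Any]] = None,
              timeout: float = 10.0) -> bool:
    """Send one notification; return False when there is no token to send to."""
    if not push_token:
        return False  # no token stored: skip silently
    if post is None:
        import requests  # imported lazily so tests can inject a fake
        post = requests.post
    post(EXPO_PUSH_URL, json=build_expo_message(push_token, title, body),
         timeout=timeout)
    return True
```

Error handling for a failed POST is left out here; whether a push failure should affect the saved answer is a decision the source does not cover.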