Chat "Sorry I couldn't process that" — GPU_MAX_WAIT_MS Stripped + Slow Inference 504s
Metadata
- Date: 2026-05-15
- Status: in-progress
- Severity: high
- Related issue/ticket: N/A (follow-on from 2026-05-14-chat-504-api-gateway-29s-hard-timeout.md)
- Owner: N/A
About
Overview: The user reports seeing "Sorry, I couldn't process that. Please try again." on chat even though the GPU is believed to be running. Previous debugging established two root causes (see 2026-05-14-chat-504-api-gateway-29s-hard-timeout.md). This session investigates what is still failing after the GPU_MAX_WAIT_MS=15000 fix was applied.
Root Causes Found:
Root Cause 1 (Immediate): GPU_MAX_WAIT_MS was stripped from the deployed Lambda
The env var GPU_MAX_WAIT_MS=15000 that was set in a previous deploy is no longer present in the live Lambda function configuration. The most recent sam deploy did not include GPU_MAX_WAIT_MS in template.yaml's global or function-level environment variables, so the deploy overwrote the Lambda env and the code fell back to its default (120000 ms). With GPU_MAX_WAIT_MS=120000 and a Lambda timeout of 60s, a request that has to wait out a GPU cold start polls past both limits: API Gateway returns 504 at its 29s limit, and the Lambda itself is killed at 60s.
Evidence: aws lambda get-function-configuration shows no GPU_MAX_WAIT_MS key in deployed Variables.
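The same check can be scripted so a stripped variable is caught right after a deploy. The following is a minimal sketch, assuming the response shape of get_function_configuration (variables under Environment.Variables). The helper names missing_env_vars and fetch_config, and the placeholder variable SOME_OTHER_VAR, are hypothetical and do not come from the codebase.

```python
from typing import Any, Iterable, Mapping


def missing_env_vars(config: Mapping[str, Any], required: Iterable[str]) -> list[str]:
    """Return the required keys that are absent from a Lambda configuration.

    `config` is expected to look like a get_function_configuration response:
    {"Environment": {"Variables": {...}}}. A missing "Environment" key is
    treated the same as an empty variable set.
    """
    variables = config.get("Environment", {}).get("Variables", {})
    return [name for name in required if name not in variables]


def fetch_config(function_name: str) -> Mapping[str, Any]:
    """Fetch the live configuration (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the pure check above has no dependencies

    return boto3.client("lambda").get_function_configuration(FunctionName=function_name)


# Hand-written responses that mimic the stripped and the restored deployment.
# SOME_OTHER_VAR is a hypothetical placeholder, not a real variable of this service.
stripped = {"Timeout": 60, "Environment": {"Variables": {"SOME_OTHER_VAR": "x"}}}
restored = {"Timeout": 60, "Environment": {"Variables": {"GPU_MAX_WAIT_MS": "15000"}}}

print(missing_env_vars(stripped, ["GPU_MAX_WAIT_MS"]))  # ['GPU_MAX_WAIT_MS']
print(missing_env_vars(restored, ["GPU_MAX_WAIT_MS"]))  # []
```

Keeping the comparison separate from the AWS call means it can run in CI against a saved response, with fetch_config used only where credentials are available.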
Root Cause 2 (Ongoing): Slow GPU inference exceeds API Gateway 29s hard timeout
Even with a warm GPU (health check passes immediately in 5-150ms), ReasoningAgent.answer() takes 40-44 seconds for complex queries. The API Gateway REST API integration timeout is hard-capped at 29 seconds. These requests always return 504 to the client. The Lambda finishes the work (logs show completed duration_ms=44018) but API Gateway has already closed the connection.
Evidence in CloudWatch logs:
- memories_chat_app completed duration_ms=44018: the Lambda completed successfully
- The API Gateway 29s timeout fired before the Lambda responded, so the client saw 504
This is the same Cause 2 identified on 2026-05-14. It was not fixed — only Cause 1 (GPU health poll overshoot) was fixed.
Resources:
- /home/lewibs/github/encache1/encache1/main/server/template.yaml: GPU_MAX_WAIT_MS absent from MemoriesChatFunction env
- /home/lewibs/github/encache1/encache1/main/server/api/memories/chat/app.py: inference in handle_chat(), no timeout wrapper
- /home/lewibs/github/encache1/encache1/main/app/app/chat.tsx:99: "Sorry, I couldn't process that" on catch
- /home/lewibs/github/encache1/encache1/main/app/lib/api/memory/chatWithMemory.ts: throws Error("No answer received") if no answer; client timeout is 60s but API GW fires first
Failure Timeline (from CloudWatch)
| Time (UTC) | Event | Duration | Outcome |
|---|---|---|---|
| 03:54:00 | GPU stopped → start_instances → no URL | 2s | Fallback "GPU starting" message |
| 03:54:46 | GPU still starting → health poll → exhausted max_wait=20000, elapsed=30261 | 30s | Fallback (overshoot bug) |
| 03:57:36 | GPU healthy (elapsed=152ms, retries=0) → inference | 3s | Success |
| 05:13:44 | GPU stopped → start_instances → no URL | 2s | Fallback |
| 05:16:08 | GPU starting → health poll → exhausted max_wait=15000, elapsed=30255 | 30s | Fallback (overshoot bug still in old deploy) |
| 05:18:39 | GPU healthy → inference | 43s | Lambda completed OK but API GW returned 504 |
| 05:25:23 | GPU healthy → inference | 3s | Success (fast query) |
| 05:25:36 | GPU healthy → inference | 41s | Lambda completed OK but API GW returned 504 |
| 05:26:17 | GPU healthy → inference | 44s | Lambda completed OK but API GW returned 504 |
The 05:16 overshoot event (elapsed_ms=30255 with max_wait_ms=15000) is from before the backoff-cap fix was deployed. The fix IS in the code now (sleep_ms = min(retry_delay_ms, remaining_ms)), but the env var was stripped by the subsequent deploy.
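To show why the cap matters, here is a minimal sketch of a health-poll loop with exponential backoff. Only the line sleep_ms = min(retry_delay_ms, remaining_ms) is taken from the code. The function name wait_for_gpu, the 2000 ms starting delay, the doubling schedule, and the injectable clock are assumptions made for illustration, not the actual implementation in app.py.

```python
import time
from typing import Callable


def wait_for_gpu(
    is_healthy: Callable[[], bool],
    max_wait_ms: float,
    initial_delay_ms: float = 2000,  # assumed starting delay
    now_ms: Callable[[], float] = lambda: time.monotonic() * 1000,
    sleep_fn: Callable[[float], None] = lambda ms: time.sleep(ms / 1000),
) -> bool:
    """Poll is_healthy() until it passes or max_wait_ms has elapsed."""
    start = now_ms()
    retry_delay_ms = initial_delay_ms
    while True:
        if is_healthy():
            return True
        remaining_ms = max_wait_ms - (now_ms() - start)
        if remaining_ms <= 0:
            return False
        # The cap: never sleep past the deadline, even when the backoff
        # delay has grown larger than the time that is left.
        sleep_ms = min(retry_delay_ms, remaining_ms)
        sleep_fn(sleep_ms)
        retry_delay_ms *= 2  # assumed doubling backoff


# Demo with a fake clock so nothing actually sleeps.
clock = [0.0]
ok = wait_for_gpu(
    is_healthy=lambda: False,  # GPU never becomes healthy
    max_wait_ms=15000,
    now_ms=lambda: clock[0],
    sleep_fn=lambda ms: clock.__setitem__(0, clock[0] + ms),
)
print(ok, clock[0])  # False 15000.0
```

Under this assumed schedule the sleeps are 2000, 4000, 8000 and then a capped 1000 ms. Without the cap the last sleep would be 16000 ms and the loop would return at 30000 ms, twice the configured limit. The sketch ignores the time the health check itself takes, which is why a real run lands slightly over the limit (elapsed_ms=15143 in the live test).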
Failing Path: Slow Inference 504
Mobile app → POST /memories/chat → Cloudflare → API Gateway (29s limit)
→ Lambda invokes → GPU health OK (fast) → ReasoningAgent.answer() (40-44s)
→ API Gateway fires 504 at 29s → mobile app catch block → "Sorry, I couldn't process that"
→ Lambda continues running in background → logs "completed duration_ms=44018"
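The interaction of the two timeouts in this path can be stated as a small model. This is only a sketch: the 29s and 60s values are the ones cited in this report, and the function name request_outcome is hypothetical.

```python
API_GATEWAY_TIMEOUT_S = 29  # integration timeout cited in this report
LAMBDA_TIMEOUT_S = 60       # Lambda timeout cited in this report


def request_outcome(work_duration_s: float) -> tuple[int, str]:
    """Return (HTTP status seen by the client, what happened to the Lambda)."""
    lambda_state = "completed" if work_duration_s <= LAMBDA_TIMEOUT_S else "killed"
    if work_duration_s < API_GATEWAY_TIMEOUT_S:
        # The Lambda answers before API Gateway gives up.
        return 200, lambda_state
    # API Gateway has already closed the connection; the Lambda's result
    # (if any) is discarded.
    return 504, lambda_state


# Inference durations taken from the failure timeline, in seconds.
for duration_s in (3, 41, 43, 44):
    print(duration_s, request_outcome(duration_s))
```

The 41 to 44 second requests fall in the band where the Lambda completes and logs success while the client has already received 504, which is why the CloudWatch logs alone look healthy.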
Fix Applied (This Session)
Fix 1: Restore GPU_MAX_WAIT_MS=15000 in template.yaml
Added GPU_MAX_WAIT_MS: "15000" to the MemoriesChatFunction environment variables in template.yaml. This prevents the GPU health-poll path from timing out the Lambda when GPU is cold-starting.
Fix 2 (not applied — architectural): Slow inference 504
The slow inference path cannot be fixed with a timeout change — 29s is the API Gateway hard limit and inference takes 40-44s. The correct fix is async chat (Lambda returns immediately, background worker generates answer, client polls). This requires a larger architectural change and is tracked separately.
Short-term workaround considered but rejected: Adding a requests.post(timeout=25) in the GPU client for inference would cause the Lambda to throw after 25s, returning "Something went wrong — please try again." instead of a 504. However the user still gets an error on long queries, just a more explicit one. Decided not to add this noise — the async architecture is the right fix.
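The async chat shape described in Fix 2 (return immediately, generate in the background, let the client poll) can be sketched in-process as follows. This is an illustration of the request flow only: the class ChatJobs and its methods are hypothetical, and a thread plus an in-memory dict stand in for whatever queue, worker, and durable store the real design would need, since a Lambda cannot rely on background threads surviving after it has responded.

```python
import threading
import time
import uuid
from typing import Any, Callable


class ChatJobs:
    """In-memory stand-in for an async chat job store (hypothetical)."""

    def __init__(self, answer_fn: Callable[[str], str]) -> None:
        self._answer_fn = answer_fn
        self._lock = threading.Lock()
        self._jobs: dict[str, dict[str, Any]] = {}

    def submit(self, question: str) -> str:
        """Register the job and return its id without waiting for the answer."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"status": "pending", "answer": None}
        threading.Thread(target=self._run, args=(job_id, question), daemon=True).start()
        return job_id

    def _run(self, job_id: str, question: str) -> None:
        """Background worker: compute the answer and record the result."""
        try:
            result = {"status": "done", "answer": self._answer_fn(question)}
        except Exception as exc:  # surface failures to the poller
            result = {"status": "failed", "answer": None, "error": str(exc)}
        with self._lock:
            self._jobs[job_id] = result

    def poll(self, job_id: str) -> dict[str, Any]:
        """Return a snapshot of the job state."""
        with self._lock:
            return dict(self._jobs[job_id])


def slow_answer(question: str) -> str:
    time.sleep(0.2)  # stands in for the 40-44s inference call
    return f"answer to: {question}"


jobs = ChatJobs(slow_answer)
job_id = jobs.submit("what did I do last week?")  # returns at once
while jobs.poll(job_id)["status"] == "pending":   # client-side polling
    time.sleep(0.05)
print(jobs.poll(job_id))
```

With this shape, every individual HTTP request (submit or poll) finishes well inside the 29s limit regardless of how long inference takes.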
Audit Log
| ID | Action | Note | Context |
|---|---|---|---|
| 1 | Pull CloudWatch logs | Last 30-60 min from /aws/lambda/server-MemoriesChatFunction-OkmZYszwOXzJ | AWS CLI |
| 2 | Identify failure pattern | GPU healthy, Lambda completes in 40-44s, API GW fires 504 at 29s | Log streams with duration_ms=44018, 41525, 43755, 43804 |
| 3 | Trace "sorry I couldn't process that" to source | chat.tsx:99 — catch block, any thrown error | chatWithMemory.ts throws on network error or missing answer |
| 4 | Check deployed env vars | GPU_MAX_WAIT_MS absent from live Lambda | aws lambda get-function-configuration |
| 5 | Check template.yaml | GPU_MAX_WAIT_MS not in global or function-level env vars | template.yaml lines 44-61 |
| 6 | Confirm overshoot fix is in code | sleep_ms = min(retry_delay_ms, remaining_ms) at line 464 | api/memories/chat/app.py |
| 7 | Restore GPU_MAX_WAIT_MS=15000 in template.yaml | Added to MemoriesChatFunction Environment.Variables | template.yaml |
| 8 | Confirm slow inference is Cause 2 | duration_ms=44018 in logs with gpu_retry_healthy elapsed_ms=151 — GPU was ready, inference was slow | CloudWatch 1cf4b13d... stream |
| 9 | Deploy via sam build && sam deploy | Restores GPU_MAX_WAIT_MS=15000 to live Lambda | main/server/ |
| 10 | Verify deployed env | aws lambda get-function-configuration shows GPU_MAX_WAIT_MS=15000 | AWS CLI |
| 11 | Live test with GPU off | Lambda returned fallback in 16s total (elapsed_ms=15143) — under 29s API GW limit | CloudWatch stream 1c4576d6... |
Verification
- [x] GPU_MAX_WAIT_MS absent from deployed Lambda (confirmed via get-function-configuration)
- [x] Overshoot fix present in code (sleep_ms = min(retry_delay_ms, remaining_ms))
- [x] Slow inference (40-44s) confirmed via CloudWatch — exceeds 29s API GW limit
- [x] GPU_MAX_WAIT_MS=15000 restored in template.yaml
- [x] SAM deploy completed successfully; only MemoriesChatFunction was updated (correct)
- [x] Live test: GPU health-poll path returns fallback in 16s total Lambda duration (elapsed_ms=15143, max_wait_ms=15000), under the 29s API GW limit
- [x] GPU_MAX_WAIT_MS=15000 confirmed present in live Lambda via get-function-configuration
- [ ] Live test: warm GPU fast queries complete in <10s (no 504); GPU currently off
- [ ] Slow inference 504 (Cause 2): async chat architecture needed (not fixed in this session)