
Chat "Sorry I couldn't process that" — GPU_MAX_WAIT_MS Stripped + Slow Inference 504s

Metadata

  • Date: 2026-05-15
  • Status: in-progress
  • Severity: high
  • Related issue/ticket: N/A (follow-on from 2026-05-14-chat-504-api-gateway-29s-hard-timeout.md)
  • Owner: N/A

About

Overview: The user reports seeing "Sorry, I couldn't process that. Please try again." on chat even though the GPU is believed to be running. Previous debugging established two root causes (see 2026-05-14-chat-504-api-gateway-29s-hard-timeout.md). This session investigates what is still failing after the GPU_MAX_WAIT_MS=15000 fix was applied.

Root Causes Found:

Root Cause 1 (Immediate): GPU_MAX_WAIT_MS was stripped from the deployed Lambda

The env var GPU_MAX_WAIT_MS=15000 that was set in a previous deploy is no longer present in the live Lambda function configuration. template.yaml did not declare GPU_MAX_WAIT_MS in either its global or function-level environment variables, so the most recent sam deploy replaced the Lambda environment with the template's contents, leaving the effective value at the default (120000ms). With GPU_MAX_WAIT_MS=120000 and a Lambda timeout of 60s, health-poll requests that wait for a GPU cold start can run until the Lambda is killed at 60s, well past the API Gateway 29s limit, so the client receives a 504.

Evidence: aws lambda get-function-configuration shows no GPU_MAX_WAIT_MS key in deployed Variables.
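
A post-deploy drift check along the following lines could catch a stripped variable before users do. This is a minimal sketch, not the project's tooling: `missing_env_vars` and `REQUIRED_VARS` are hypothetical names, and the sample dicts only mimic the `Environment.Variables` shape that get-function-configuration returns (the `STAGE` key is made up).

```python
"""Sketch: detect env vars missing from a deployed Lambda configuration."""

# Hypothetical list of variables the chat function is expected to carry.
REQUIRED_VARS = ("GPU_MAX_WAIT_MS",)


def missing_env_vars(function_config: dict, required=REQUIRED_VARS) -> list:
    """Return the required keys that are absent from Environment.Variables.

    `function_config` is the dict produced by get-function-configuration
    (for example, parsed from the AWS CLI's JSON output).
    """
    variables = function_config.get("Environment", {}).get("Variables", {})
    return [name for name in required if name not in variables]


# Illustrative inputs shaped like the stripped and restored deployments.
stripped = {"Timeout": 60, "Environment": {"Variables": {"STAGE": "prod"}}}
restored = {"Timeout": 60,
            "Environment": {"Variables": {"GPU_MAX_WAIT_MS": "15000"}}}

print(missing_env_vars(stripped))
print(missing_env_vars(restored))
```

Running a check like this after every sam deploy would turn a silent regression into an explicit failure.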

Root Cause 2 (Ongoing): Slow GPU inference exceeds API Gateway 29s hard timeout

Even with a warm GPU (health check passes immediately in 5-150ms), ReasoningAgent.answer() takes 40-44 seconds for complex queries. The API Gateway REST API integration timeout is hard-capped at 29 seconds. These requests always return 504 to the client. The Lambda finishes the work (logs show completed duration_ms=44018) but API Gateway has already closed the connection.

Evidence in CloudWatch logs:

  • memories_chat_app completed duration_ms=44018: the Lambda completed successfully
  • The API Gateway 29s timeout fired before the Lambda responded, so the client saw a 504

This is the same Cause 2 identified on 2026-05-14. It was not fixed — only Cause 1 (GPU health poll overshoot) was fixed.

Resources:

  • /home/lewibs/github/encache1/encache1/main/server/template.yaml - GPU_MAX_WAIT_MS absent from MemoriesChatFunction env
  • /home/lewibs/github/encache1/encache1/main/server/api/memories/chat/app.py - inference in handle_chat(), no timeout wrapper
  • /home/lewibs/github/encache1/encache1/main/app/app/chat.tsx:99 - "Sorry, I couldn't process that" on catch
  • /home/lewibs/github/encache1/encache1/main/app/lib/api/memory/chatWithMemory.ts - throws Error("No answer received") if no answer; client timeout is 60s but API GW fires first

Failure Timeline (from CloudWatch)

| Time (UTC) | Event | Duration | Outcome |
|---|---|---|---|
| 03:54:00 | GPU stopped → start_instances → no URL | 2s | Fallback "GPU starting" message |
| 03:54:46 | GPU still starting → health poll → exhausted max_wait=20000, elapsed=30261 | 30s | Fallback (overshoot bug) |
| 03:57:36 | GPU healthy (elapsed=152ms, retries=0) → inference | 3s | Success |
| 05:13:44 | GPU stopped → start_instances → no URL | 2s | Fallback |
| 05:16:08 | GPU starting → health poll → exhausted max_wait=15000, elapsed=30255 | 30s | Fallback (overshoot bug still in old deploy) |
| 05:18:39 | GPU healthy → inference | 43s | Lambda completed OK but API GW returned 504 |
| 05:25:23 | GPU healthy → inference | 3s | Success (fast query) |
| 05:25:36 | GPU healthy → inference | 41s | Lambda completed OK but API GW returned 504 |
| 05:26:17 | GPU healthy → inference | 44s | Lambda completed OK but API GW returned 504 |

The 05:16 overshoot event (elapsed_ms=30255 with max_wait_ms=15000) is from before the backoff-cap fix was deployed. The fix IS in the code now (sleep_ms = min(retry_delay_ms, remaining_ms)), but the env var was stripped by the subsequent deploy.
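
To illustrate why the cap matters, here is a minimal sketch of a health-poll loop that uses the same `sleep_ms = min(retry_delay_ms, remaining_ms)` line. The real loop in app.py is not reproduced in this report, so everything else here is an assumption: the function name `poll_until_healthy`, the doubling backoff, the 4000ms starting delay, and the injectable clock used to simulate time.

```python
"""Sketch: health poll whose sleeps are capped by the remaining wait budget."""
import time


def poll_until_healthy(check, max_wait_ms, retry_delay_ms=1000,
                       now_ms=None, sleep_ms_fn=None):
    """Poll `check()` until it returns True or `max_wait_ms` is spent.

    Returns (healthy, elapsed_ms). `now_ms` and `sleep_ms_fn` are injectable
    so the loop can be exercised without real waiting.
    """
    now_ms = now_ms or (lambda: time.monotonic() * 1000)
    sleep_ms_fn = sleep_ms_fn or (lambda ms: time.sleep(ms / 1000))
    start = now_ms()
    while True:
        if check():
            return True, now_ms() - start
        remaining_ms = max_wait_ms - (now_ms() - start)
        if remaining_ms <= 0:
            return False, now_ms() - start
        # The cap: never sleep past the deadline, even when the backoff
        # delay has grown larger than the time that is left.
        sleep_ms = min(retry_delay_ms, remaining_ms)
        sleep_ms_fn(sleep_ms)
        retry_delay_ms *= 2  # assumed exponential backoff


# Simulated clock: "sleeping" just advances a counter.
clock = {"t": 0.0}
healthy, elapsed = poll_until_healthy(
    check=lambda: False,  # GPU never becomes healthy in this simulation
    max_wait_ms=15000,
    retry_delay_ms=4000,
    now_ms=lambda: clock["t"],
    sleep_ms_fn=lambda ms: clock.__setitem__("t", clock["t"] + ms),
)
print(healthy, elapsed)  # False 15000.0
```

In this simulation the sleeps are 4000, 8000, and then 3000 (capped from 16000), so the loop gives up at exactly the 15000ms budget. Without the `min(...)` the last sleep would run the full 16000ms and the loop would exit at 28000ms, the same kind of overshoot seen in the 05:16 event.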

Failing Path: Slow Inference 504

Mobile app → POST /memories/chat → Cloudflare → API Gateway (29s limit)
   → Lambda invokes → GPU health OK (fast) → ReasoningAgent.answer() (40-44s)
   → API Gateway fires 504 at 29s → mobile app catch block → "Sorry, I couldn't process that"
   → Lambda continues running in background → logs "completed duration_ms=44018"
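
The path above comes down to comparing one duration against two limits. The sketch below classifies a request by the first limit it crosses, using the 29s and 60s values stated in this report and durations taken from the logged events. The function name `client_outcome` is hypothetical, and network and gateway overhead are ignored.

```python
"""Sketch: which layer gives up first for a given Lambda duration."""

API_GATEWAY_TIMEOUT_MS = 29_000  # API Gateway integration limit from this report
LAMBDA_TIMEOUT_MS = 60_000       # Lambda timeout from this report


def client_outcome(duration_ms: int) -> str:
    """Classify a request by the first limit its duration crosses."""
    if duration_ms < API_GATEWAY_TIMEOUT_MS:
        return "response delivered"
    if duration_ms < LAMBDA_TIMEOUT_MS:
        return "504 at 29s, Lambda still completes"
    return "504 at 29s, Lambda killed at 60s"


# Durations reported in this document: a fast query and the slow ones.
for duration_ms in (3_000, 41_525, 43_755, 43_804, 44_018):
    print(duration_ms, "->", client_outcome(duration_ms))
```

Every 40-44s inference lands in the middle bucket, which matches the logs: the Lambda reports success while the client has already received a 504.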

Fix Applied (This Session)

Fix 1: Restore GPU_MAX_WAIT_MS=15000 in template.yaml

Added GPU_MAX_WAIT_MS: "15000" to the MemoriesChatFunction environment variables in template.yaml. This prevents the GPU health-poll path from timing out the Lambda when GPU is cold-starting.

Fix 2 (not applied — architectural): Slow inference 504

The slow inference path cannot be fixed with a timeout change — 29s is the API Gateway hard limit and inference takes 40-44s. The correct fix is async chat (Lambda returns immediately, background worker generates answer, client polls). This requires a larger architectural change and is tracked separately.
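
As a rough illustration of the submit-then-poll shape described above, the following in-memory sketch returns a job id immediately and lets the caller poll for the answer. It is not the planned design: `submit_chat`, `poll_chat`, and `slow_answer` are hypothetical names, and the thread and dict stand in for whatever durable store and separate worker a Lambda deployment would need, since a background thread does not keep running after a Lambda returns its response.

```python
"""Sketch: submit-then-poll chat flow (in-memory, single process)."""
import threading
import uuid

_jobs = {}  # job_id -> {"status": ..., "answer": ...}
_jobs_lock = threading.Lock()


def submit_chat(question, answer_fn):
    """Start answering in the background and return a job id immediately."""
    job_id = uuid.uuid4().hex
    with _jobs_lock:
        _jobs[job_id] = {"status": "pending", "answer": None}

    def work():
        try:
            result = {"status": "done", "answer": answer_fn(question)}
        except Exception as exc:  # surface failures to the poller
            result = {"status": "error", "answer": str(exc)}
        with _jobs_lock:
            _jobs[job_id] = result

    worker = threading.Thread(target=work, daemon=True)
    worker.start()
    return job_id, worker


def poll_chat(job_id):
    """Return a snapshot of the job state; each call returns quickly."""
    with _jobs_lock:
        return dict(_jobs[job_id])


def slow_answer(question):
    """Stand-in for the slow ReasoningAgent.answer() call."""
    return f"answer to: {question}"


job_id, worker = submit_chat("what did I do last week?", slow_answer)
worker.join()  # a real client would poll on an interval instead
print(poll_chat(job_id))
```

The point of the shape is that both the submit call and each poll finish well inside the 29s limit, no matter how long the answer itself takes.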

Short-term workaround considered but rejected: Adding a requests.post(timeout=25) in the GPU client for inference would cause the Lambda to throw after 25s, returning "Something went wrong — please try again." instead of a 504. However the user still gets an error on long queries, just a more explicit one. Decided not to add this noise — the async architecture is the right fix.

Audit Log

| ID | Action | Note | Context |
|---|---|---|---|
| 1 | Pull CloudWatch logs | Last 30-60 min from /aws/lambda/server-MemoriesChatFunction-OkmZYszwOXzJ | AWS CLI |
| 2 | Identify failure pattern | GPU healthy, Lambda completes in 40-44s, API GW fires 504 at 29s | Log streams with duration_ms=44018, 41525, 43755, 43804 |
| 3 | Trace "sorry I couldn't process that" to source | chat.tsx:99 catch block (any thrown error) | chatWithMemory.ts throws on network error or missing answer |
| 4 | Check deployed env vars | GPU_MAX_WAIT_MS absent from live Lambda | aws lambda get-function-configuration |
| 5 | Check template.yaml | GPU_MAX_WAIT_MS not in global or function-level env vars | template.yaml lines 44-61 |
| 6 | Confirm overshoot fix is in code | sleep_ms = min(retry_delay_ms, remaining_ms) at line 464 | api/memories/chat/app.py |
| 7 | Restore GPU_MAX_WAIT_MS=15000 in template.yaml | Added to MemoriesChatFunction Environment.Variables | template.yaml |
| 8 | Confirm slow inference is Cause 2 | duration_ms=44018 in logs with gpu_retry_healthy elapsed_ms=151; GPU was ready, inference was slow | CloudWatch 1cf4b13d... stream |
| 9 | Deploy via sam build && sam deploy | Restores GPU_MAX_WAIT_MS=15000 to live Lambda | main/server/ |
| 10 | Verify deployed env | aws lambda get-function-configuration shows GPU_MAX_WAIT_MS=15000 | AWS CLI |
| 11 | Live test with GPU off | Lambda returned fallback in 16s total (elapsed_ms=15143), under 29s API GW limit | CloudWatch stream 1c4576d6... |

Verification

  • [x] GPU_MAX_WAIT_MS absent from deployed Lambda before the fix (confirmed via get-function-configuration)
  • [x] Overshoot fix present in code (sleep_ms = min(retry_delay_ms, remaining_ms))
  • [x] Slow inference (40-44s) confirmed via CloudWatch — exceeds 29s API GW limit
  • [x] GPU_MAX_WAIT_MS=15000 restored in template.yaml
  • [x] SAM deploy completed successfully — only MemoriesChatFunction was updated (correct)
  • [x] Live test: GPU health-poll path returns fallback in 16s total Lambda duration (elapsed_ms=15143, max_wait_ms=15000) — under 29s API GW limit
  • [x] GPU_MAX_WAIT_MS=15000 confirmed present in live Lambda via get-function-configuration
  • [ ] Live test: warm GPU fast queries complete in <10s (no 504) — GPU currently off
  • [ ] Slow inference 504 (Cause 2): async chat architecture needed (not fixed in this session)