Chat "Sorry I couldn't process that" — GPU_MAX_WAIT_MS Stripped + Slow Inference 504s

Metadata

Date: 2026-05-15
Status: in-progress
Severity: high
Related issue/ticket: N/A (follow-on from 2026-05-14-chat-504-api-gateway-29s-hard-timeout.md)
Owner: N/A

About

Overview: The user reports seeing "Sorry, I couldn't process that. Please try again." on chat even though the GPU is believed to be running. Previous debugging established two root causes (see 2026-05-14-chat-504-api-gateway-29s-hard-timeout.md). This session investigates what is still failing after the GPU_MAX_WAIT_MS=15000 fix was applied.

Root Causes Found:

Root Cause 1 (Immediate): GPU_MAX_WAIT_MS was stripped from the deployed Lambda

The env var GPU_MAX_WAIT_MS=15000 set in a previous deploy is no longer present in the live Lambda function configuration. The most recent sam deploy did not include GPU_MAX_WAIT_MS in template.yaml's global or function-level environment variables, so the deploy overwrote the Lambda environment and the effective value reverted to the default (120000 ms). With a 120000 ms wait budget and a Lambda timeout of 60s, a health-poll request waiting on a GPU cold start never reaches its own deadline: it runs past the 29s API Gateway limit and is killed by the Lambda timeout at 60s, so the client receives a 504.

Evidence: aws lambda get-function-configuration shows no GPU_MAX_WAIT_MS key in deployed Variables.
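
The same check can be scripted so it runs after every deploy. This is a minimal sketch, assuming boto3 is installed and AWS credentials are configured. `missing_env_vars` and `fetch_config` are hypothetical helper names, not part of the codebase.

```python
from typing import Any, Iterable


def missing_env_vars(config: dict[str, Any], required: Iterable[str]) -> list[str]:
    """Return the required env var names absent from a Lambda configuration.

    `config` has the shape returned by get_function_configuration:
    {"Environment": {"Variables": {...}}}. Either key may be missing,
    so both lookups default to an empty dict.
    """
    variables = config.get("Environment", {}).get("Variables", {})
    return [name for name in required if name not in variables]


def fetch_config(function_name: str) -> dict[str, Any]:
    """Fetch the live configuration (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the pure check above has no AWS dependency

    return boto3.client("lambda").get_function_configuration(FunctionName=function_name)


# Usage (not executed here because it calls AWS):
#   live = fetch_config("<deployed function name>")
#   print(missing_env_vars(live, ["GPU_MAX_WAIT_MS"]))
```

An empty list means the variable survived the deploy. A non-empty list reproduces the finding above.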

Root Cause 2 (Ongoing): Slow GPU inference exceeds API Gateway 29s hard timeout

Even with a warm GPU (health check passes immediately in 5-150ms), ReasoningAgent.answer() takes 40-44 seconds for complex queries. The API Gateway REST API integration timeout is hard-capped at 29 seconds. These requests always return 504 to the client. The Lambda finishes the work (logs show completed duration_ms=44018) but API Gateway has already closed the connection.

Evidence in CloudWatch logs:
- memories_chat_app completed duration_ms=44018: the Lambda completed successfully
- The API Gateway 29s timeout fired before the Lambda responded, so the client saw 504

This is the same Cause 2 identified on 2026-05-14. It was not fixed — only Cause 1 (GPU health poll overshoot) was fixed.
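
The timing argument reduces to comparing each Lambda duration with the 29 second integration limit. The sketch below uses the durations quoted in this report. `API_GW_LIMIT_MS` and `client_outcome` are illustrative names, not identifiers from the codebase.

```python
API_GW_LIMIT_MS = 29_000  # API Gateway integration limit cited in this report


def client_outcome(lambda_duration_ms: int, limit_ms: int = API_GW_LIMIT_MS) -> str:
    """Return what the client sees for a Lambda that ran for the given duration."""
    # API Gateway gives up once the limit passes, even though the Lambda
    # keeps running and later logs a successful completion.
    return "504" if lambda_duration_ms >= limit_ms else "delivered"


# duration_ms values quoted in the audit log
for duration_ms in (44018, 41525, 43755, 43804):
    print(duration_ms, client_outcome(duration_ms))
```

Every logged slow request lands on the 504 side, so no Lambda-side setting can rescue these queries while the response is returned synchronously.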

Resources:
- /home/lewibs/github/encache1/encache1/main/server/template.yaml: GPU_MAX_WAIT_MS absent from MemoriesChatFunction env
- /home/lewibs/github/encache1/encache1/main/server/api/memories/chat/app.py: inference in handle_chat(), no timeout wrapper
- /home/lewibs/github/encache1/encache1/main/app/app/chat.tsx:99: "Sorry, I couldn't process that" on catch
- /home/lewibs/github/encache1/encache1/main/app/lib/api/memory/chatWithMemory.ts: throws Error("No answer received") if no answer; client timeout is 60s but API GW fires first

Failure Timeline (from CloudWatch)

Time (UTC)	Event	Duration	Outcome
03:54:00	GPU stopped → start_instances → no URL	2s	Fallback "GPU starting" message
03:54:46	GPU still starting → health poll → exhausted max_wait=20000, elapsed=30261	30s	Fallback (overshoot bug)
03:57:36	GPU healthy (elapsed=152ms, retries=0) → inference	3s	Success
05:13:44	GPU stopped → start_instances → no URL	2s	Fallback
05:16:08	GPU starting → health poll → exhausted max_wait=15000, elapsed=30255	30s	Fallback (overshoot bug still in old deploy)
05:18:39	GPU healthy → inference	43s	Lambda completed OK but API GW returned 504
05:25:23	GPU healthy → inference	3s	Success (fast query)
05:25:36	GPU healthy → inference	41s	Lambda completed OK but API GW returned 504
05:26:17	GPU healthy → inference	44s	Lambda completed OK but API GW returned 504

The 05:16 overshoot event (elapsed_ms=30255 with max_wait_ms=15000) is from before the backoff-cap fix was deployed. The fix IS in the code now (sleep_ms = min(retry_delay_ms, remaining_ms)), but the env var was stripped by the subsequent deploy.
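
The capped backoff can be sketched as follows. Only the line `sleep_ms = min(retry_delay_ms, remaining_ms)` comes from the code. The function name `wait_for_gpu`, the doubling backoff, the default delay, and the injected `check`, `sleep`, and `clock` callables are assumptions made so the example runs without a GPU.

```python
import time
from typing import Callable


def wait_for_gpu(
    check: Callable[[], bool],
    max_wait_ms: int,
    initial_delay_ms: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> tuple[bool, int]:
    """Poll `check` until it passes or max_wait_ms has elapsed.

    Returns (healthy, elapsed_ms).
    """
    start = clock()
    retry_delay_ms = initial_delay_ms
    while True:
        if check():
            return True, int((clock() - start) * 1000)
        elapsed_ms = int((clock() - start) * 1000)
        remaining_ms = max_wait_ms - elapsed_ms
        if remaining_ms <= 0:
            return False, elapsed_ms
        # Cap the sleep at the remaining budget so the last backoff step
        # cannot overshoot max_wait_ms.
        sleep_ms = min(retry_delay_ms, remaining_ms)
        sleep(sleep_ms / 1000)
        retry_delay_ms *= 2  # assumed exponential backoff
```

Without the `min`, a doubling delay can sleep through the deadline in a single step, which is the kind of overshoot seen in the 03:54 and 05:16 events.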

Failing Path: Slow Inference 504

Mobile app → POST /memories/chat → Cloudflare → API Gateway (29s limit)
   → Lambda invokes → GPU health OK (fast) → ReasoningAgent.answer() (40-44s)
   → API Gateway fires 504 at 29s → mobile app catch block → "Sorry, I couldn't process that"
   → Lambda continues running in background → logs "completed duration_ms=44018"

Fix Applied (This Session)

Fix 1: Restore GPU_MAX_WAIT_MS=15000 in template.yaml

Added GPU_MAX_WAIT_MS: "15000" to the MemoriesChatFunction environment variables in template.yaml. This prevents the GPU health-poll path from timing out the Lambda when GPU is cold-starting.

Fix 2 (not applied — architectural): Slow inference 504

The slow inference path cannot be fixed with a timeout change — 29s is the API Gateway hard limit and inference takes 40-44s. The correct fix is async chat (Lambda returns immediately, background worker generates answer, client polls). This requires a larger architectural change and is tracked separately.
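
To make the proposed shape concrete, here is a minimal in-process sketch of the submit-then-poll pattern. It is not the planned implementation: the job store, `submit_chat`, `get_chat_result`, and `slow_answer` are hypothetical stand-ins. A real deployment would need a durable store and a worker that runs outside the request Lambda, since a background thread would not reliably survive the Lambda returning.

```python
import threading
import time
import uuid
from typing import Callable

_jobs: dict[str, dict] = {}
_lock = threading.Lock()


def submit_chat(question: str, answer_fn: Callable[[str], str]) -> str:
    """Register a job, start the work in the background, return at once."""
    job_id = uuid.uuid4().hex
    with _lock:
        _jobs[job_id] = {"status": "pending", "answer": None}

    def run() -> None:
        try:
            result = {"status": "done", "answer": answer_fn(question)}
        except Exception as exc:  # surface failures to the poller
            result = {"status": "error", "answer": str(exc)}
        with _lock:
            _jobs[job_id] = result

    threading.Thread(target=run, daemon=True).start()
    return job_id


def get_chat_result(job_id: str) -> dict:
    """Poll endpoint: return a copy of the current job state."""
    with _lock:
        return dict(_jobs.get(job_id, {"status": "unknown", "answer": None}))


def slow_answer(question: str) -> str:
    time.sleep(0.2)  # stand-in for the 40-44s inference
    return f"answer to: {question}"


job = submit_chat("what did I do last week?", slow_answer)
while (state := get_chat_result(job))["status"] == "pending":
    time.sleep(0.05)  # client-side poll interval
print(state)
```

Both calls return quickly, so neither the submit request nor any individual poll comes near the 29s limit regardless of how long inference takes.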

Short-term workaround considered but rejected: Passing timeout=25 to the requests.post call in the GPU client for inference would make the Lambda throw after 25s and return "Something went wrong — please try again." instead of a 504. However, the user would still get an error on long queries, just a more explicit one. Decided not to add this noise; the async architecture is the right fix.

Audit Log

ID	Action	Note	Context
1	Pull CloudWatch logs	Last 30-60 min from `/aws/lambda/server-MemoriesChatFunction-OkmZYszwOXzJ`	AWS CLI
2	Identify failure pattern	GPU healthy, Lambda completes in 40-44s, API GW fires 504 at 29s	Log streams with `duration_ms=44018`, `41525`, `43755`, `43804`
3	Trace "sorry I couldn't process that" to source	`chat.tsx:99` — catch block, any thrown error	`chatWithMemory.ts` throws on network error or missing answer
4	Check deployed env vars	`GPU_MAX_WAIT_MS` absent from live Lambda	`aws lambda get-function-configuration`
5	Check template.yaml	`GPU_MAX_WAIT_MS` not in global or function-level env vars	`template.yaml` lines 44-61
6	Confirm overshoot fix is in code	`sleep_ms = min(retry_delay_ms, remaining_ms)` at line 464	`api/memories/chat/app.py`
7	Restore GPU_MAX_WAIT_MS=15000 in template.yaml	Added to MemoriesChatFunction Environment.Variables	`template.yaml`
8	Confirm slow inference is Cause 2	`duration_ms=44018` in logs with `gpu_retry_healthy elapsed_ms=151` — GPU was ready, inference was slow	CloudWatch `1cf4b13d...` stream
9	Deploy via sam build && sam deploy	Restores GPU_MAX_WAIT_MS=15000 to live Lambda	`main/server/`
10	Verify deployed env	`aws lambda get-function-configuration` shows `GPU_MAX_WAIT_MS=15000`	AWS CLI
11	Live test with GPU off	Lambda returned fallback in 16s total (`elapsed_ms=15143`) — under 29s API GW limit	CloudWatch stream `1c4576d6...`

Verification

[x] GPU_MAX_WAIT_MS absent from deployed Lambda before the fix (confirmed via get-function-configuration)
[x] Overshoot fix present in code (sleep_ms = min(retry_delay_ms, remaining_ms))
[x] Slow inference (40-44s) confirmed via CloudWatch — exceeds 29s API GW limit
[x] GPU_MAX_WAIT_MS=15000 restored in template.yaml
[x] SAM deploy completed successfully — only MemoriesChatFunction was updated (correct)
[x] Live test: GPU health-poll path returns fallback in 16s total Lambda duration (elapsed_ms=15143, max_wait_ms=15000) — under 29s API GW limit
[x] GPU_MAX_WAIT_MS=15000 confirmed present in live Lambda via get-function-configuration
[ ] Live test: warm GPU fast queries complete in <10s (no 504) — GPU currently off
[ ] Slow inference 504 (Cause 2): async chat architecture needed (not fixed in this session)