Chat Requests Return 504 via Cloudflare — API Gateway 29s Hard Timeout

Metadata

Date: 2026-05-14
Status: investigating
Severity: high
Related issue/ticket: N/A
Owner: N/A

About

Overview: Chat requests to /memories/chat that take longer than ~29 seconds return a 504 Gateway Timeout to the client. Cloudflare logs show 504 status codes from the origin (API Gateway). The Lambda itself has a 60-second timeout and can legitimately run for 30–60 seconds during GPU health polling (cold start). The mismatch between the Lambda timeout (60s) and the API Gateway REST API hard integration timeout (29s) means these requests always time out at the API Gateway layer before the Lambda can respond.

Root Cause: AWS API Gateway REST API has a hard maximum integration timeout of 29 seconds. The MemoriesChatFunction Lambda is configured with a 60-second Timeout in template.yaml. When the GPU worker is cold-starting and the Lambda polls /health with exponential backoff (2s → 4s → 8s → ...), the total wait can easily exceed 29 seconds. API Gateway terminates the connection and returns 504 to the caller (Cloudflare), even though the Lambda may still be running. Cloudflare then surfaces a 524 (a timeout occurred) or passes the 504 to the mobile app.
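The backoff arithmetic can be checked with a small sketch. It assumes pure doubling from 2s with no jitter and ignores the time each /health request itself takes; the real wait_for_gpu_health() may differ, and the function names here are hypothetical.

```python
from itertools import accumulate

API_GATEWAY_LIMIT_S = 29  # REST API integration timeout cited in this report


def backoff_schedule(first_sleep_s=2.0, factor=2.0, attempts=5):
    """Return the sleep before each retry: 2, 4, 8, 16, ... seconds (assumed schedule)."""
    return [first_sleep_s * factor**i for i in range(attempts)]


def first_attempt_past_limit(sleeps, limit_s=API_GATEWAY_LIMIT_S):
    """Return (1-based attempt, cumulative seconds) where total sleep first exceeds limit_s."""
    for attempt, total in enumerate(accumulate(sleeps), start=1):
        if total > limit_s:
            return attempt, total
    return None


sleeps = backoff_schedule()
print(list(accumulate(sleeps)))          # cumulative sleep after each attempt
print(first_attempt_past_limit(sleeps))  # the attempt that crosses the 29s limit
```

Under these assumptions the fourth sleep carries the cumulative wait from 14s to 30s, which is already past the API Gateway limit.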

Secondary symptom: The client (chatWithMemory.ts) uses { timeout: 60_000 } (60s). Even though the client would wait 60 seconds, API Gateway cuts off the connection at 29 seconds, so the client sees a network error or 504 well before its own timeout fires.

Resources:
- main/server/template.yaml: MemoriesChatFunction Timeout: 60, Global API Type: REST (not HTTP)
- main/server/api/memories/chat/app.py: wait_for_gpu_health() polls up to GPU_MAX_WAIT_MS=120000 (120s)
- main/app/lib/api/memory/chatWithMemory.ts: client timeout: 60s
- AWS Docs: API Gateway REST API timeout limits (29s max)

Steps to cause failure

flowchart LR
    A["User sends chat\n(GPU cold-starting)"] --> B["Lambda starts\npolls /health"]
    B --> C["29s passes\n(API Gateway limit)"]
    C --> D["API Gateway → 504\nto Cloudflare"]
    D --> E["Cloudflare → 524 or 504\nto mobile app"]
    F["Lambda still running\n(up to 60s)"] -.->|"response too late"| G["API Gateway ignores\nLambda response"]

System

flowchart TD
    App["Mobile App\nchatWithMemory (timeout=60s)"] -->|"POST /memories/chat"| CF["Cloudflare\napi.encache.ai"]
    CF -->|"proxy"| APIGW["API Gateway REST API\nhard limit: 29s"]
    APIGW -->|"invoke"| Lambda["MemoriesChatFunction\nTimeout: 60s"]
    Lambda -->|"health poll loop"| GPU["GPU Worker\n/health endpoint"]
    GPU -->|"GPU cold-starting"| Lambda
    APIGW -->|"504 after 29s"| CF
    CF -->|"524 or 504"| App

Diagnosis Checklist

Step 1: Confirm 504s are appearing in Cloudflare

In the Cloudflare dashboard → Analytics → Errors, filter by status_code:504 or status_code:524. If these spike during periods when chat is failing, the 29s timeout is the likely cause.

Step 2: Confirm API Gateway type is REST (not HTTP)

aws apigateway get-rest-apis --profile encache-workload --region us-east-1

This confirms a REST API (the SAM event type Api always creates a REST API). An HTTP API would not help: its integration timeout is configurable only up to 30s, essentially the same ceiling as the REST API's 29s.

Step 3: Check Lambda duration in CloudWatch

aws logs filter-log-events \
  --log-group-name /aws/lambda/server-MemoriesChatFunction-OkmZYszwOXzJ \
  --filter-pattern "gpu_retry" \
  --profile encache-workload

Look for gpu_retry_exhausted_max_wait or gpu_retry_lambda_timeout_approaching log entries. If elapsed_ms > 29000, those requests definitely timed out at API Gateway before the Lambda responded.
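Once the log messages are exported, the elapsed_ms check can be automated. The sketch below is only illustrative: it assumes each message carries elapsed_ms either as key=value text (as in elapsed_ms=30261) or as a JSON field, and the sample messages other than the 30261 value are made up for the example.

```python
import re

API_GATEWAY_LIMIT_MS = 29_000
# Matches both `elapsed_ms=30261` and `"elapsed_ms": 30261` (assumed layouts).
ELAPSED_RE = re.compile(r'elapsed_ms"?\s*[=:]\s*(\d+)')


def over_gateway_limit(messages, limit_ms=API_GATEWAY_LIMIT_MS):
    """Return (elapsed_ms, message) pairs whose elapsed_ms exceeds limit_ms."""
    hits = []
    for message in messages:
        match = ELAPSED_RE.search(message)
        if match and int(match.group(1)) > limit_ms:
            hits.append((int(match.group(1)), message))
    return hits


# Illustrative messages only; the real log layout may differ.
sample = [
    "gpu_retry_exhausted_max_wait elapsed_ms=30261 max_wait_ms=20000",
    "gpu_retry_healthy elapsed_ms=6120",
    '{"event": "gpu_retry_lambda_timeout_approaching", "elapsed_ms": 41000}',
]
for elapsed_ms, message in over_gateway_limit(sample):
    print(elapsed_ms, message)
```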

Step 4: Check GPU warm-up time

aws logs filter-log-events \
  --log-group-name /aws/lambda/server-MemoriesChatFunction-OkmZYszwOXzJ \
  --filter-pattern "gpu_retry_healthy" \
  --profile encache-workload

If elapsed_ms in successful gpu_retry_healthy events is consistently > 10s, many requests will 504 at API Gateway before the Lambda finishes health polling.

Fix Options

Option 1 (Not viable): Migrate MemoriesChatFunction to API Gateway HTTP API

HTTP API (AWS::Serverless::HttpApi) drops some REST API constraints, but its maximum integration timeout is 30 seconds, essentially the same ceiling as the REST API's 29 seconds. Migrating does NOT solve the problem. Skip to Option 2.

Option 2 (Correct fix): Move GPU health-waiting out of the synchronous request path

Instead of polling GPU health inside the Lambda during the user's request, implement an async chat flow:

1. Lambda immediately saves the user's question to DynamoDB (already done)
2. Lambda returns { "status": "pending", "chat_id": "..." } to the client within 1–2 seconds
3. A background Lambda (triggered by DynamoDB stream or SQS) waits for GPU health and generates the answer
4. Client polls /chats/messages?chat_id=... until the assistant message appears

This decouples GPU cold-start latency from the synchronous API response.
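The flow above can be sketched in a few lines. This is a minimal in-memory model, not the actual implementation: a dict stands in for the DynamoDB table, a deque stands in for the SQS queue or DynamoDB stream, and the function names submit_chat, run_worker and get_messages are hypothetical.

```python
import uuid
from collections import deque

# In-memory stand-ins (assumptions): dict for the table, deque for the queue or stream.
messages = {}
work_queue = deque()


def submit_chat(question):
    """Synchronous handler: persist the question, enqueue work, return immediately."""
    chat_id = str(uuid.uuid4())
    messages[chat_id] = {"question": question, "status": "pending", "answer": None}
    work_queue.append(chat_id)
    return {"status": "pending", "chat_id": chat_id}


def run_worker(generate_answer):
    """Background worker: drain the queue. GPU waiting would happen here, off the request path."""
    processed = 0
    while work_queue:
        chat_id = work_queue.popleft()
        record = messages[chat_id]
        record["answer"] = generate_answer(record["question"])
        record["status"] = "complete"
        processed += 1
    return processed


def get_messages(chat_id):
    """Polling endpoint: report status, plus the answer once it exists."""
    record = messages[chat_id]
    return {"status": record["status"], "answer": record["answer"]}


ticket = submit_chat("What did I note about the trip?")
print(get_messages(ticket["chat_id"]))  # pending until the worker runs
run_worker(lambda question: f"(stub answer to: {question})")
print(get_messages(ticket["chat_id"]))  # complete, with the stub answer
```

The point of the design is that submit_chat never waits on the GPU, so its latency stays far below the 29s limit regardless of cold-start time.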

Option 3 (Short-term mitigation): Cap GPU_MAX_WAIT_MS to 20s

Set GPU_MAX_WAIT_MS=20000 in Lambda environment. The Lambda will give up waiting for GPU at 20s, return the "GPU starting up" fallback message, and API Gateway gets a response before the 29s hard timeout. This eliminates 504s but increases the "GPU starting up" user-facing error rate.

Trade-off: Users see explicit "GPU starting" message instead of a silent 504/network error. The message is more honest.

Option 4: Switch to an API Gateway WebSocket API

With an API Gateway WebSocket API, the backend can push the answer to the client whenever it is ready, so server-to-client pushes are not bound by the 29s timeout. This is complex to implement.

Immediate Recommendation

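Set GPU_MAX_WAIT_MS=20000 as a short-term fix to eliminate 504s. The Lambda will still warm up the GPU for the next request, and the user gets an honest error message instead of a silent connection failure. Note: live testing (see Additional Finding) showed that 20000 still overshoots to ~30s, so GPU_MAX_WAIT_MS=15000 was deployed instead.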

aws lambda update-function-configuration \
  --function-name server-MemoriesChatFunction-OkmZYszwOXzJ \
  --environment "Variables={...,GPU_MAX_WAIT_MS=20000}" \
  --profile encache-workload

Long-term: move to async chat flow (Option 2).

Additional Finding (2026-05-14 live debugging)

The 504 has two separate causes, not one:

Cause 1: GPU health polling overshoot (the documented one)

wait_for_gpu_health() with exponential backoff can overshoot GPU_MAX_WAIT_MS because each time.sleep() call runs to completion even if the budget is already consumed. With max_wait_ms=20000, elapsed_ms after exhaustion was observed at 30261ms (the 16s sleep at iteration 4 ran past the 20s limit). This means GPU_MAX_WAIT_MS=20000 is NOT safe — it still causes 30s Lambda runs.
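The overshoot can be reproduced, and bounded, with a deadline-aware wait. The sketch below is an assumption about the loop's shape (check, then sleep the full backoff interval), not the code in app.py; it uses a fake clock so it runs instantly, and the clamp flag, FakeClock and simulate are hypothetical names.

```python
import time


def wait_for_health(check, max_wait_s, first_sleep_s=2.0, clamp=True,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll check() with doubling sleeps; return (healthy, elapsed_seconds).

    clamp=False lets every sleep run in full (one loop shape that overshoots).
    clamp=True shortens the final sleep so the wait ends at max_wait_s.
    """
    start = clock()
    delay = first_sleep_s
    while True:
        if check():
            return True, clock() - start
        remaining = max_wait_s - (clock() - start)
        if remaining <= 0:
            return False, clock() - start
        sleep(min(delay, remaining) if clamp else delay)
        delay *= 2


class FakeClock:
    """Virtual time so the example does not actually sleep."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def simulate(max_wait_s, clamp):
    """Run the wait against a GPU that never becomes healthy."""
    fake = FakeClock()
    return wait_for_health(lambda: False, max_wait_s, clamp=clamp,
                           clock=fake.time, sleep=fake.sleep)


print(simulate(20, clamp=False))  # unclamped: the 16s sleep runs past the budget
print(simulate(20, clamp=True))   # clamped: the wait stops at the budget
```

In this model any budget between 14s and 30s ends at 30s when sleeps are not clamped, which matches the observed elapsed_ms of 30261. Clamping each sleep to the remaining budget makes GPU_MAX_WAIT_MS an actual upper bound on the polling time.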

Cause 2: Slow inference (separate from GPU health polling)

Even when the GPU is healthy, the reasoning agent (ReasoningAgent.answer()) takes 45–48 seconds for complex queries. These requests were observed at duration_ms=48264 and duration_ms=48867 in CloudWatch. This is a separate timeout problem unrelated to GPU health polling.

Fix applied: GPU_MAX_WAIT_MS=15000

GPU_MAX_WAIT_MS was lowered to 15000 to absorb the exponential backoff overshoot. The backoff arithmetic alone does not guarantee this:
- The first three sleeps total 2s + 4s + 8s = 14s, and the next sleep is 16s.
- At 14s elapsed the 15s budget is not yet spent, so if the loop only checks elapsed time between sleeps, the 16s sleep still starts and the wait ends near 14s + 16s = 30s, the same as with the 20s budget.

In practice, with GPU_MAX_WAIT_MS=15000 the observed Lambda duration dropped to 5-8s (warm GPU). When the GPU was starting, the Lambda returned in ~8s (no URL resolved). Neither observation appears to have run the backoff to exhaustion, so the fix is adequate for the observed GPU-health-polling 504 cases, but the worst-case overshoot at 15000 remains unverified.

The slow-inference 504 (45-48s answers) is a separate problem requiring async chat architecture or a GPU response timeout.
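A GPU response timeout could be derived from the time already spent in the request. The sketch below is one possible shape, not the project's code: the 3s safety margin, the fallback text and the function names are assumptions, and a thread-based timeout only abandons the slow call rather than cancelling it (a timeout on the HTTP call to the GPU worker would be the more direct mechanism).

```python
import concurrent.futures
import time

GATEWAY_LIMIT_S = 29.0  # API Gateway limit from this report
SAFETY_MARGIN_S = 3.0   # assumed headroom for serialization and network


def inference_timeout_s(elapsed_s, limit_s=GATEWAY_LIMIT_S, margin_s=SAFETY_MARGIN_S):
    """Seconds left for inference so the response still fits under the limit."""
    return max(0.0, limit_s - margin_s - elapsed_s)


def answer_with_timeout(answer_fn, question, timeout_s, fallback):
    """Run answer_fn(question); return fallback if it exceeds timeout_s.

    The worker thread is not cancelled, only abandoned.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(answer_fn, question)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback
    finally:
        executor.shutdown(wait=False)


def slow_answer(question):
    time.sleep(0.3)  # stands in for a 45-48s reasoning call
    return "late answer"


print(inference_timeout_s(elapsed_s=8.0))
print(answer_with_timeout(slow_answer, "q", timeout_s=0.05,
                          fallback="Still working, please try again shortly."))
```

A timeout like this turns a silent 504 into an explicit fallback message, but it does not make 45-48s answers deliverable; that still requires the async chat flow.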

Audit Log

ID	Action	Note	Context
1	Create audit log	Initialize investigation: Cloudflare logs showing chat errors	Task description
2	Read chat Lambda source	`wait_for_gpu_health()` polls up to `GPU_MAX_WAIT_MS=120000ms` with exponential backoff	`api/memories/chat/app.py`
3	Check SAM template	`MemoriesChatFunction` Timeout: 60s, REST API type (not HTTP API)	`template.yaml`
4	Check client timeout	`chatWithMemory.ts` uses `{ timeout: 60_000 }`	`chatWithMemory.ts`
5	Identify mismatch	API Gateway REST API hard limit = 29s, Lambda timeout = 60s, GPU polling up to 120s	AWS docs + template.yaml
6	Check Cloudflare role	`api.encache.ai` is Cloudflare-proxied custom domain for API Gateway (confirmed via `.env` and `template.yaml` comment)	`.env`, `template.yaml`
7	Identify fix options	Short-term: cap GPU_MAX_WAIT_MS=20000; Long-term: async chat flow	architecture analysis
8	Live test: invoke Lambda with GPU stopped	Lambda returned gracefully in 6s (GPU state=stopped → no URL → immediate fallback)	Direct Lambda invoke
9	Deploy GPU_MAX_WAIT_MS=20000	SAM deploy succeeded	`template.yaml` + SAM
10	Live test: invoke Lambda with GPU starting	Lambda took 32s — overshoot! `elapsed_ms=30261` despite `max_wait_ms=20000`. Exponential backoff sleep overshoots the limit	CloudWatch logs
11	Update GPU_MAX_WAIT_MS=15000	Reduced to 15000 to keep total Lambda duration under 29s	`template.yaml`
12	Deploy GPU_MAX_WAIT_MS=15000	SAM deploy succeeded, confirmed on Lambda	AWS Lambda console
13	Live test: warm GPU	Lambda returned real chat answer in 5-8s, confirmed 3 consecutive successful requests	Direct Lambda invoke
14	Live test: Cloudflare endpoint	`POST https://api.encache.ai/memories/chat` returns 401 in 0.36s — API Gateway reachable through Cloudflare	curl test
15	Second cause identified	Slow inference (45-48s) is a separate 504 source — async chat architecture needed for long queries	CloudWatch logs analysis

Verification

[x] gpu_retry_exhausted_max_wait elapsed_ms > 29000 confirmed in CloudWatch even with max_wait_ms=20000 (overshoot)
[x] Fix applied: GPU_MAX_WAIT_MS=15000 deployed to production
[x] Post-fix: warm GPU chat requests complete in 5-8s (well under 29s API Gateway limit)
[x] API endpoint reachable through Cloudflare (401 for unauthenticated request in 0.36s)
[ ] Cloudflare logs inspected for 504 reduction post-deploy (no Cloudflare API token available)
[ ] Slow inference (45-48s) addressed — requires async chat architecture (open issue)