ChatsListFunction Returns status:undefined — Cold-Start Timeout
Metadata
- Date: 2026-05-07
- Status: resolved
- Severity: critical
- Related issue/ticket: N/A
- Owner: N/A
About
Overview:
- POST /chats/list returns status: undefined; axios never receives an HTTP response.
- Device logs: request_error {"status": undefined, "url": "/chats/list"} followed by [conversations] failed to load: [Error: Having trouble loading chats. Please try again later.]
- /memories/video succeeds in the same session, ruling out general connectivity failure.
- The error is thrown from fetchChats.ts:89, where the catch block converts any non-403 error to the user-friendly message.
Root Cause (two compounding factors):
- shared/__init__.py eagerly imports shared.orm (SQLAlchemy models, ORM helpers) at package init time. Every Lambda that does from shared.lambda_helpers import invoke_lambda triggers shared/__init__.py, which in turn imports SQLAlchemy, the worldmm ORM models, and all associated metadata. ChatsListFunction only needs boto3 + jose + ChatRepository, but pays the full ORM import cost (~1–3 seconds) on every cold start.
- The axios client has a global 10-second timeout (getApi.ts line 82: timeout: 10_000). There is no per-request timeout override for /chats/list. After a SAM deploy (which cold-starts all Lambda containers), the combined Lambda container startup + Python ORM imports can exceed 10 seconds. The client times out before the Lambda responds → error.response === undefined → status: undefined.
chatWithMemory.ts demonstrates the correct pattern: it passes { timeout: 60_000 } as a per-request override because that endpoint is known to be slow. /chats/list was not given the same treatment.
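The first factor can be reproduced in isolation. The sketch below builds a throwaway package (the hypothetical name demo_shared, with a sleep standing in for the SQLAlchemy import cost) and imports only its lightweight helper submodule, once with an eager __init__.py and once with an empty one. It is an illustration under those assumptions, not the real shared layer.

```python
import importlib
import sys
import tempfile
import time
from pathlib import Path


def build_package(root: Path, init_source: str) -> None:
    """Write a throwaway package named demo_shared (hypothetical) under root."""
    pkg = root / "demo_shared"
    pkg.mkdir()
    (pkg / "__init__.py").write_text(init_source)
    # Stand-in for the ORM module: the sleep imitates a slow import.
    (pkg / "orm.py").write_text("import time\ntime.sleep(0.2)\n")
    # Stand-in for the lightweight helper the Lambda actually needs.
    (pkg / "lambda_helpers.py").write_text(
        "def invoke_lambda():\n    return 'ok'\n"
    )


def import_helper(init_source: str) -> tuple[float, bool]:
    """Import demo_shared.lambda_helpers; return (seconds, orm_was_loaded)."""
    with tempfile.TemporaryDirectory() as tmp:
        build_package(Path(tmp), init_source)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            start = time.perf_counter()
            importlib.import_module("demo_shared.lambda_helpers")
            elapsed = time.perf_counter() - start
            return elapsed, "demo_shared.orm" in sys.modules
        finally:
            sys.path.remove(tmp)
            # Forget the package so the next call starts from a clean state.
            for name in [m for m in sys.modules if m.startswith("demo_shared")]:
                del sys.modules[name]


# Importing a submodule always runs the parent package's __init__.py first.
eager_time, eager_loaded = import_helper("from . import orm\n")
empty_time, empty_loaded = import_helper("")
print(f"eager __init__: {eager_time:.2f}s, orm loaded: {eager_loaded}")
print(f"empty __init__: {empty_time:.2f}s, orm loaded: {empty_loaded}")
```

With the eager __init__.py the helper import pays for the stand-in ORM even though it never uses it; with the empty one it does not.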
Resources:
- main/app/lib/api/chats/fetchChats.ts: throws user-friendly error at line 89; no per-request timeout
- main/app/lib/api/getApi.ts: global timeout: 10_000 on axios instance
- main/app/lib/api/memory/chatWithMemory.ts: shows { timeout: 60_000 } pattern
- main/server/layers/shared/python/shared/__init__.py: eager from .orm import ...
- main/server/layers/shared/python/shared/orm/__init__.py: imports worldmm_orm at module level
- main/server/template.yaml: ChatsListFunction uses global Timeout: 30 (no override), MemorySize: 256 (no override)
Steps to cause failure
flowchart LR
Deploy["SAM deploy\nnew MemoriesVideoFunction"] --> ColdStart["ChatsListFunction\ncold-starts"]
ColdStart --> SharedInit["shared/__init__.py runs\n→ imports ORM, SQLAlchemy"]
SharedInit --> SlowStart["Startup >10s\n(boto3 + SQLAlchemy + container init)"]
SlowStart --> AxiosTimeout["axios 10s timeout fires\nerror.response = undefined"]
AxiosTimeout --> ResponseInterceptor["Response interceptor:\nstatus=undefined → request_error"]
ResponseInterceptor --> FetchChats["fetchChats catch block:\nnon-403 → throw user-friendly error"]
FetchChats --> ReactQuery["React Query retries 3x,\nall fail"]
ReactQuery --> UIError["[conversations] failed to load"]
System
flowchart TD
MobileApp -->|POST /chats/list\ntimeout=10s| APIGW["API Gateway"]
APIGW --> ChatsListFunction["ChatsListFunction\n256MB, no VPC"]
ChatsListFunction -->|cold start| SharedLayer["SharedLayer\nimports SQLAlchemy eagerly"]
ChatsListFunction -->|warm| DynamoDB["DynamoDB\nencache-chats table"]
Reproduction Details
Frontend reproduction test: main/app/__tests__/chats-list-api.test.ts
- Test: "throws user-friendly error on network error (status: undefined)"
- Uses mock.onPost("/chats/list").networkError() to simulate no HTTP response
Diagnostic Steps
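# 1. Check CloudWatch logs for ChatsListFunction cold-start duration.
#    filter-log-events needs the exact log group name (no wildcards), so
#    replace the trailing * with the deployed function's suffix.
#    Cold-start REPORT lines carry an "Init Duration" field.
aws logs filter-log-events \
--log-group-name "/aws/lambda/server-ChatsListFunction-*" \
--filter-pattern '"Init Duration"' \
--profile encache-workload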
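# 2. Verify shared/__init__.py triggers ORM import
python3 -c "
import sys
import time
# Make the layer's python/ directory importable
sys.path.insert(0, 'layers/shared/python')
# Time the import that ChatsListFunction performs
t = time.perf_counter()
import shared.lambda_helpers
print(f'import took {time.perf_counter() - t:.2f}s')
# Check whether SQLAlchemy was pulled in as a side effect
print('sqlalchemy loaded:', 'sqlalchemy' in sys.modules)
"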
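# 3. Confirm ChatsListFunction timeout config
# Resolve the deployed function name from the stack's logical resource ID
FUNCTION_NAME=$(aws cloudformation describe-stack-resource \
--stack-name server --logical-resource-id ChatsListFunction \
--query 'StackResourceDetail.PhysicalResourceId' --output text \
--profile encache-workload)
aws lambda get-function-configuration \
--function-name "$FUNCTION_NAME" \
--query '{Timeout: Timeout, MemorySize: MemorySize}' \
--profile encache-workload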
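Once step 1 returns REPORT lines, the server-side cold-start cost can be compared against the 10-second client timeout. The following is a rough sketch that parses the standard Lambda REPORT layout; the sample line is made up for illustration and is not taken from the real logs, and cold_start_total_ms is a hypothetical helper name.

```python
import re
from typing import Optional

CLIENT_TIMEOUT_MS = 10_000  # global axios timeout from getApi.ts

INIT_RE = re.compile(r"Init Duration: ([\d.]+) ms")
# Plain "Duration", excluding the "Billed Duration" and "Init Duration" fields.
RUN_RE = re.compile(r"(?<!Billed )(?<!Init )Duration: ([\d.]+) ms")


def cold_start_total_ms(report_line: str) -> Optional[float]:
    """Return init + handler time in ms, or None for a warm invocation."""
    init = INIT_RE.search(report_line)
    run = RUN_RE.search(report_line)
    if init is None or run is None:
        return None
    return float(init.group(1)) + float(run.group(1))


# Made-up sample line in the REPORT layout; not taken from the real logs.
SAMPLE = (
    "REPORT RequestId: 00000000-0000-0000-0000-000000000000\t"
    "Duration: 850.00 ms\tBilled Duration: 850 ms\tMemory Size: 256 MB\t"
    "Max Memory Used: 90 MB\tInit Duration: 9600.00 ms"
)

total = cold_start_total_ms(SAMPLE)
print(total, total is not None and total > CLIENT_TIMEOUT_MS)
```

The figure is a lower bound on what the client experiences, since it leaves out network and API Gateway time.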
Fix Applied
Two complementary fixes:
Fix 1: Remove eager ORM import from shared/__init__.py
Removed two import statements from shared/__init__.py: "from .orm import getEngine, getSession, getSessionFactory, initDb" and "from .logger import create_logger". No caller relies on package-level imports such as "from shared import getEngine" or "from shared import invoke_lambda"; they all use submodule imports (from shared.orm import ..., from shared.lambda_helpers import ...). The __init__.py now exports nothing, so Lambdas that do not need the ORM no longer import SQLAlchemy at package load time.
File: main/server/layers/shared/python/shared/__init__.py
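One way to keep this fix from regressing is to check, in a fresh interpreter (which is what a cold start amounts to), which modules a given import pulls in. The helper below is a minimal sketch; modules_loaded_by is a hypothetical name, and the commented-out call assumes the layer path used in the diagnostic steps.

```python
import ast
import subprocess
import sys


def modules_loaded_by(statement: str, extra_path: str = "") -> set[str]:
    """Run `statement` in a fresh interpreter; return the names in sys.modules."""
    program = "\n".join(
        [
            "import sys",
            f"sys.path.insert(0, {extra_path!r})" if extra_path else "pass",
            statement,
            # Printed last so the parent can read it even if the import prints.
            "print(repr(sorted(sys.modules)))",
        ]
    )
    result = subprocess.run(
        [sys.executable, "-c", program],
        capture_output=True,
        text=True,
        check=True,
    )
    last_line = result.stdout.strip().splitlines()[-1]
    return set(ast.literal_eval(last_line))


# Standard-library demonstration: importing csv does not drag in sqlite3.
loaded = modules_loaded_by("import csv")
print("csv" in loaded, "sqlite3" in loaded)

# Assumed usage against the layer (path is an assumption, run from main/server):
# assert "sqlalchemy" not in modules_loaded_by(
#     "import shared.lambda_helpers", "layers/shared/python"
# )
```

Running the import in a subprocess avoids false negatives from modules that the test runner itself has already loaded.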
Fix 2: Add per-request timeout to fetchChats.ts
Added { timeout: 30_000 } as the third argument to api.post("/chats/list", ...) in fetchChats.ts. This matches the Lambda's 30-second Timeout setting, so the client no longer gives up while the Lambda is still within its allowed run time, even on a cold start. This is the same pattern already used in chatWithMemory.ts ({ timeout: 60_000 }).
File: main/app/lib/api/chats/fetchChats.ts
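The effect of the larger client budget can be modeled without any network. The sketch below is a Python stand-in for the axios behavior, not the app's TypeScript code: time is scaled down so it runs quickly, and the 12-second cold start is an assumed value (the report only establishes that startup can exceed 10 seconds).

```python
import asyncio
from typing import Optional

TIME_SCALE = 0.01  # 1 modeled second = 10 ms of real time


async def lambda_response(cold_start_s: float) -> int:
    """Pretend Lambda: sleeps for the modeled cold start, then answers 200."""
    await asyncio.sleep(cold_start_s * TIME_SCALE)
    return 200


async def client_status(timeout_s: float, cold_start_s: float) -> Optional[int]:
    """Return the HTTP status, or None if the client gave up first."""
    try:
        return await asyncio.wait_for(
            lambda_response(cold_start_s), timeout_s * TIME_SCALE
        )
    except asyncio.TimeoutError:
        # No response object exists, which is the status: undefined case.
        return None


COLD_START_S = 12.0  # assumed for illustration; the report only says ">10s"

print(asyncio.run(client_status(10, COLD_START_S)))  # prints None
print(asyncio.run(client_status(30, COLD_START_S)))  # prints 200
```

With a 10-second budget the client sees no status at all; with 30 seconds the same slow response arrives as a normal 200.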
Audit Log
| ID | Action | Note | Context |
|---|---|---|---|
| 1 | Create audit log | Initialize investigation: /chats/list returns status: undefined after SAM deploy | issue created |
| 2 | Trace axios error path | status: undefined → error.response === undefined → no HTTP response received. Response interceptor logs request_error and rejects. fetchChats.ts:84-89 catch block converts to user-friendly error. | getApi.ts, fetchChats.ts |
| 3 | Rule out connectivity | /memories/video works in same session (same device, same network, same API gateway). /chats/list failure is Lambda-specific. | device logs |
| 4 | Check ChatsListFunction config | No VpcConfig (correct — DynamoDB is accessed via public endpoint). Global Timeout: 30s. MemorySize: 256MB. No per-request timeout in fetchChats. | template.yaml, fetchChats.ts |
| 5 | Check axios timeout | Global axios timeout: 10s. chatWithMemory overrides to 60s. No override in fetchChats.ts. If Lambda takes >10s, client times out first with status: undefined. | getApi.ts, chatWithMemory.ts |
| 6 | Identify ORM import chain | ChatsListFunction/app.py does from shared.lambda_helpers import invoke_lambda → Python executes shared/__init__.py → line 3 from .orm import ... → orm/__init__.py imports worldmm_orm → SQLAlchemy models imported. Lambda doesn't need ORM but pays its cold-start cost. | shared/__init__.py, orm/__init__.py, worldmm_orm.py |
| 7 | Confirm no callers use package-level imports | grep -r "from shared import\|^import shared" returns no results in api/ tree. All callers use submodule imports. Removing ORM from __init__.py is safe. | codebase grep |
| 8 | Write failing reproduction test | Added "throws user-friendly error on network error (status: undefined)" test to chats-list-api.test.ts. Uses mock.onPost("/chats/list").networkError(). Correction: the test does not fail before the fix, because the existing catch-all already handles this scenario. It confirms the user-facing error message. | chats-list-api.test.ts |
| 9 | Apply Fix 1 | Removed from .orm import ... and from .logger import create_logger from shared/__init__.py. Set __all__ = []. | shared/__init__.py |
| 10 | Apply Fix 2 | Added { timeout: 30_000 } to api.post("/chats/list", ...) in fetchChats.ts. | fetchChats.ts |
| 11 | Verify tests pass | npx jest __tests__/chats-list-api.test.ts passes including new network-error test. | test run |
Verification
- [x] Root cause identified with evidence (shared/__init__.py ORM import + missing per-request timeout)
- [x] Fix applied at source
- [x] Reproduction test added
- [x] Tests pass after fix
- [ ] Live Lambda cold-start verified (requires AWS access + redeploy)