ChatsListFunction Returns status:undefined — Cold-Start Timeout

Metadata

  • Date: 2026-05-07
  • Status: resolved
  • Severity: critical
  • Related issue/ticket: N/A
  • Owner: N/A

About

Overview:

  • POST /chats/list returns status: undefined; axios never receives an HTTP response.
  • Device logs: request_error {"status": undefined, "url": "/chats/list"} followed by [conversations] failed to load: [Error: Having trouble loading chats. Please try again later.]
  • /memories/video succeeds in the same session, ruling out general connectivity failure.
  • The error is thrown from fetchChats.ts:89, where the catch block converts any non-403 error to the user-friendly message.

Root Cause (two compounding factors):

  1. shared/__init__.py eagerly imports shared.orm (SQLAlchemy models, ORM helpers) at package init time. Every Lambda that does from shared.lambda_helpers import invoke_lambda triggers shared/__init__.py, which in turn imports SQLAlchemy, worldmm ORM models, and all associated metadata. ChatsListFunction only needs boto3 + jose + ChatRepository, but pays the full ORM import cost (~1–3 seconds) on every cold start.

  2. The axios client has a global 10-second timeout (getApi.ts line 82: timeout: 10_000). There is no per-request timeout override for /chats/list. After a SAM deploy (which cold-starts all Lambda containers), the combined Lambda container startup + Python ORM imports can exceed 10 seconds. The client times out before the Lambda responds → error.response === undefined → status: undefined.

chatWithMemory.ts demonstrates the correct pattern: it passes { timeout: 60_000 } as a per-request override because that endpoint is known to be slow. /chats/list was not given the same treatment.
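
The first factor follows from how Python imports packages: importing any submodule first executes the parent package's __init__.py. Below is a minimal, self-contained sketch of that behavior. It uses a throwaway package named demo_shared, which is hypothetical: it stands in for shared, and its orm.py is a trivial placeholder rather than the real SQLAlchemy models.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_package(root: Path, eager: bool) -> None:
    """Write a tiny stand-in package; every name here is hypothetical."""
    pkg = root / "demo_shared"
    pkg.mkdir()
    # Stand-in for the heavy ORM module (the real one imports SQLAlchemy).
    (pkg / "orm.py").write_text("def get_engine():\n    return 'engine'\n")
    # Stand-in for the lightweight helper the Lambda actually needs.
    (pkg / "lambda_helpers.py").write_text("def invoke_lambda():\n    return 'ok'\n")
    # Eager variant mirrors the old __init__.py; lazy variant is empty.
    init_src = "from .orm import get_engine\n" if eager else ""
    (pkg / "__init__.py").write_text(init_src)


def orm_loaded_after_helper_import(eager: bool) -> bool:
    """Import only the helper submodule and report whether orm was loaded too."""
    with tempfile.TemporaryDirectory() as tmp:
        build_package(Path(tmp), eager)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            importlib.import_module("demo_shared.lambda_helpers")
            return "demo_shared.orm" in sys.modules
        finally:
            sys.path.remove(tmp)
            # Forget the package so the next call starts from a clean state.
            for name in [m for m in sys.modules if m.startswith("demo_shared")]:
                del sys.modules[name]


print(orm_loaded_after_helper_import(eager=True))   # True
print(orm_loaded_after_helper_import(eager=False))  # False
```

With the eager __init__.py, asking only for lambda_helpers still loads orm; with an empty __init__.py it does not. In the real layer the cost of that extra load is the SQLAlchemy import rather than a one-line module.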

Resources:

  • main/app/lib/api/chats/fetchChats.ts: throws the user-friendly error at line 89; no per-request timeout
  • main/app/lib/api/getApi.ts: sets the global timeout: 10_000 on the axios instance
  • main/app/lib/api/memory/chatWithMemory.ts: shows the { timeout: 60_000 } pattern
  • main/server/layers/shared/python/shared/__init__.py: eager from .orm import ...
  • main/server/layers/shared/python/shared/orm/__init__.py: imports worldmm_orm at module level
  • main/server/template.yaml: ChatsListFunction uses global Timeout: 30 (no override), MemorySize: 256 (no override)

Steps to cause failure

flowchart LR
  Deploy["SAM deploy\nnew MemoriesVideoFunction"] --> ColdStart["ChatsListFunction\ncold-starts"]
  ColdStart --> SharedInit["shared/__init__.py runs\n→ imports ORM, SQLAlchemy"]
  SharedInit --> SlowStart["Startup >10s\n(boto3 + SQLAlchemy + container init)"]
  SlowStart --> AxiosTimeout["axios 10s timeout fires\nerror.response = undefined"]
  AxiosTimeout --> ResponseInterceptor["Response interceptor:\nstatus=undefined → request_error"]
  ResponseInterceptor --> FetchChats["fetchChats catch block:\nnon-403 → throw user-friendly error"]
  FetchChats --> ReactQuery["React Query retries 3x,\nall fail"]
  ReactQuery --> UIError["[conversations] failed to load"]

System

flowchart TD
  MobileApp -->|POST /chats/list\ntimeout=10s| APIGW["API Gateway"]
  APIGW --> ChatsListFunction["ChatsListFunction\n256MB, no VPC"]
  ChatsListFunction -->|cold start| SharedLayer["SharedLayer\nimports SQLAlchemy eagerly"]
  ChatsListFunction -->|warm| DynamoDB["DynamoDB\nencache-chats table"]

Reproduction Details

Frontend reproduction test: main/app/__tests__/chats-list-api.test.ts

  • Test: "throws user-friendly error on network error (status: undefined)"
  • Uses mock.onPost("/chats/list").networkError() to simulate no HTTP response

Diagnostic Steps

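# 1. Check CloudWatch logs for ChatsListFunction cold-start duration
# filter-log-events needs the exact log group name (no wildcards); replace <suffix>.
# Cold starts appear as REPORT lines that include "Init Duration".
aws logs filter-log-events \
  --log-group-name "/aws/lambda/server-ChatsListFunction-<suffix>" \
  --filter-pattern '"Init Duration"' \
  --profile encache-workload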

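# 2. Verify shared/__init__.py triggers ORM import
# Assumes the current directory is main/server, so the layer path resolves.
python3 -c "
import sys
import time
sys.path.insert(0, 'layers/shared/python')
t = time.perf_counter()
import shared.lambda_helpers  # executes shared/__init__.py first
print(f'import took {time.perf_counter() - t:.2f}s')
# Any entry printed here means the ORM was loaded eagerly
print(sorted(m for m in sys.modules if m.startswith(('sqlalchemy', 'shared.orm'))))
"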

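# 3. Confirm ChatsListFunction timeout config
# Resolve the deployed function name from its logical ID in the stack
FN=$(aws cloudformation describe-stack-resource \
  --stack-name server --logical-resource-id ChatsListFunction \
  --query 'StackResourceDetail.PhysicalResourceId' --output text \
  --profile encache-workload)
aws lambda get-function-configuration \
  --function-name "$FN" \
  --query '{Timeout: Timeout, MemorySize: MemorySize}' \
  --profile encache-workload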

Fix Applied

Two complementary fixes:

Fix 1: Remove eager ORM import from shared/__init__.py

Removed two statements from shared/__init__.py: "from .orm import getEngine, getSession, getSessionFactory, initDb" and "from .logger import create_logger". No caller uses from shared import getEngine or from shared import invoke_lambda; they all use submodule imports (from shared.orm import ..., from shared.lambda_helpers import ...). The __init__.py now exports nothing, which eliminates the SQLAlchemy import at package load time for Lambdas that don't need the ORM.

File: main/server/layers/shared/python/shared/__init__.py
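
One way to guard against this regressing is to check, in a fresh interpreter, which modules an import statement pulls in. The helper below is a sketch, not part of the applied fix: modules_loaded_by is a hypothetical name, and the demonstration uses standard-library modules as stand-ins so that it runs anywhere.

```python
import json
import subprocess
import sys


def modules_loaded_by(statement: str, prefix: str) -> list[str]:
    """Run `statement` in a fresh interpreter and list the loaded modules
    whose names start with `prefix`."""
    probe = (
        "import sys, json\n"
        f"{statement}\n"
        "print(json.dumps(sorted("
        f"m for m in sys.modules if m.startswith({prefix!r}))))\n"
    )
    # A subprocess gives a clean sys.modules, similar to a cold start.
    result = subprocess.run(
        [sys.executable, "-c", probe],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


# Stand-in demonstration with standard-library modules only.
print(modules_loaded_by("import colorsys", "colorsys"))  # ['colorsys']
print(modules_loaded_by("import string", "colorsys"))    # []

# Against the real layer (assumes layers/shared/python is on PYTHONPATH),
# the equivalent check after Fix 1 would expect an empty list:
# modules_loaded_by("from shared.lambda_helpers import invoke_lambda", "sqlalchemy")
```

Running the probe in a subprocess matters: the test runner's own process may already have imported SQLAlchemy, which would mask an eager import.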

Fix 2: Add per-request timeout to fetchChats.ts

Added { timeout: 30_000 } as the third argument to api.post("/chats/list", ...) in fetchChats.ts. This matches the Lambda's 30-second Timeout setting, so the client no longer gives up while the Lambda can still respond, including on a cold start. It is the same pattern already used in chatWithMemory.ts ({ timeout: 60_000 }).

File: main/app/lib/api/chats/fetchChats.ts
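
The effect of the longer per-request timeout can be illustrated with a small stand-in written in Python rather than axios. It is an analogy, not the app's code: SlowHandler and post_status are hypothetical names, the delays are scaled down to fractions of a second, and a local HTTP server plays the role of the cold-starting Lambda.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    """Answers every POST after a fixed delay, like a cold-starting backend."""

    delay_s = 0.5

    def do_POST(self):
        # Drain the request body before responding.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        time.sleep(self.delay_s)
        body = b"[]"
        try:
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            pass  # the client already gave up and closed the connection

    def log_message(self, *args):
        pass  # keep the demo output quiet


def post_status(url, timeout_s):
    """Return the HTTP status, or None when no response arrives in time."""
    req = urllib.request.Request(url, data=b"{}", method="POST")
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return resp.status
    except OSError:  # covers socket timeouts and urllib.error.URLError
        return None


server = ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/chats/list"

short = post_status(url, timeout_s=0.1)  # client timeout < server delay
long = post_status(url, timeout_s=5.0)   # client timeout > server delay
server.shutdown()
server.server_close()
print(short, long)  # None 200
```

The short timeout yields no status at all, which is the same shape as status: undefined on the device; the longer timeout receives the 200 from the same slow server.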

Audit Log

| ID | Action | Note | Context |
|----|--------|------|---------|
| 1 | Create audit log | Initialize investigation: /chats/list returns status: undefined after SAM deploy | issue created |
| 2 | Trace axios error path | status: undefined → error.response === undefined → no HTTP response received. Response interceptor logs request_error and rejects. fetchChats.ts:84-89 catch block converts to user-friendly error. | getApi.ts, fetchChats.ts |
| 3 | Rule out connectivity | /memories/video works in same session (same device, same network, same API gateway). /chats/list failure is Lambda-specific. | device logs |
| 4 | Check ChatsListFunction config | No VpcConfig (correct: DynamoDB is accessed via public endpoint). Global Timeout: 30s. MemorySize: 256MB. No per-request timeout in fetchChats. | template.yaml, fetchChats.ts |
| 5 | Check axios timeout | Global axios timeout: 10s. chatWithMemory overrides to 60s. No override in fetchChats.ts. If Lambda takes >10s, client times out first with status: undefined. | getApi.ts, chatWithMemory.ts |
| 6 | Identify ORM import chain | ChatsListFunction/app.py does from shared.lambda_helpers import invoke_lambda → Python executes shared/__init__.py → line 3 from .orm import ... → orm/__init__.py imports worldmm_orm → SQLAlchemy models imported. Lambda doesn't need ORM but pays its cold-start cost. | shared/__init__.py, orm/__init__.py, worldmm_orm.py |
| 7 | Confirm no callers use package-level imports | grep -r "from shared import\|^import shared" returns no results in api/ tree. All callers use submodule imports. Removing ORM from __init__.py is safe. | codebase grep |
| 8 | Write failing reproduction test | Added "throws user-friendly error on network error (status: undefined)" test to chats-list-api.test.ts. Uses mock.onPost("/chats/list").networkError(). Expected to fail before the fix, but it already passes via the existing catch-all, so it only confirms the user-facing error message. | chats-list-api.test.ts |
| 9 | Apply Fix 1 | Removed from .orm import ... and from .logger import create_logger from shared/__init__.py. Set __all__ = []. | shared/__init__.py |
| 10 | Apply Fix 2 | Added { timeout: 30_000 } to api.post("/chats/list", ...) in fetchChats.ts. | fetchChats.ts |
| 11 | Verify tests pass | npx jest __tests__/chats-list-api.test.ts passes including new network-error test. | test run |

Verification

  • [x] Root cause identified with evidence (shared/__init__.py ORM import + missing per-request timeout)
  • [x] Fix applied at source
  • [x] Reproduction test added
  • [x] Tests pass after fix
  • [ ] Live Lambda cold-start verified (requires AWS access + redeploy)