GPU Chat Blocked: g6e InsufficientCapacity + Stopped Instances Ignored When SSM is None

Metadata

Date: 2026-05-14
Status: fixed
Severity: high
Related issue/ticket: N/A
Owner: N/A

About

Overview: User sends chat messages but always receives "Chat is not available right now — the GPU worker is starting up." The GPU instance is not starting despite stopped instances being available in the account.

Root cause (two compounding bugs):

1. _launch_gpu_instance swallows exceptions with a broad except Exception and does not log the error string, so failures (like InsufficientInstanceCapacity) are invisible in CloudWatch.
2. When SSM holds "none", implementation() goes straight to _launch_gpu_instance and does NOT check for existing stopped instances by tag. If the launch fails (capacity, quota, AMI), there is no fallback to restart a stopped instance.

Why it happened today: SSM was set to "none" at 2026-05-13T20:06:30 PDT. All three GPU instances were stopped: i-094304012bc773d5f and i-0c8b3f59aaeba79db (both g6e.2xlarge) and i-0d8401817dec5d12f (g5.2xlarge). run_instances was failing silently with InsufficientInstanceCapacity for g6e.2xlarge. start_instances on either g6e also returned InsufficientInstanceCapacity. The g5.2xlarge had capacity available, but the code never tried it because its only path was to launch a new instance.

Technical Questions:
- The Lambda GPU_INSTANCE_ID env var is baked in at deploy time via {{resolve:ssm:...}}, so updating SSM live does not update the running Lambda. An aws lambda update-function-configuration call or a redeploy is required.
- The watchdog (runs every ~15 min) will stop a running GPU instance after IDLE_AFTER_MIN=10 minutes if IngestWindowFunction invocations = 0. Chat usage does NOT count toward this metric.
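To make the stale-env-var point concrete, here is a minimal sketch of the two ways the instance ID can be resolved: from the deploy-time snapshot in the environment, or from a live SSM read. The helper names (normalize_instance_id, resolve_gpu_instance_id) are hypothetical and not taken from app.py, and treating "none" case-insensitively is an assumption. The live read is shown only for contrast; it is not part of the fix applied here.

```python
SSM_PARAM = "/encache/gpu/instance_id"


def normalize_instance_id(raw):
    """Map the "none" sentinel (or an empty value) to None."""
    if raw is None:
        return None
    value = raw.strip()
    # Assumption: the sentinel is compared case-insensitively.
    return None if value.lower() in ("", "none") else value


def resolve_gpu_instance_id(environ, ssm=None):
    """Return the GPU instance ID, or None if no instance is recorded.

    With ssm=None this reads the env var, which only reflects the SSM value
    at deploy time. Passing a boto3-style SSM client reads the live value.
    """
    if ssm is not None:
        resp = ssm.get_parameter(Name=SSM_PARAM)
        return normalize_instance_id(resp["Parameter"]["Value"])
    return normalize_instance_id(environ.get("GPU_INSTANCE_ID"))
```

With the env var still at "none" and SSM already updated, the first path returns None while the second returns the new instance ID, which is the mismatch seen during recovery.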

Resources:
- main/server/api/memories/chat/app.py: _launch_gpu_instance, _find_stopped_gpu_instance_by_tag, implementation()
- CloudWatch: /aws/lambda/server-MemoriesChatFunction-OkmZYszwOXzJ
- CloudWatch: /aws/lambda/encache-gpu-watchdog
- SSM: /encache/gpu/instance_id
- Related: docs/bugs/2026-05-09-gpu-stale-ssm-instance-id.md

Steps to cause failure

flowchart LR
    A["User sends chat"] --> B["Lambda: GPU_INSTANCE_ID=none"]
    B --> C["_launch_gpu_instance(lt-0cf39db4cd6dff510)"]
    C --> D["run_instances → InsufficientInstanceCapacity"]
    D --> E["Exception swallowed, logs chat_gpu_launch_failed"]
    E --> F["gpu_worker_url = None"]
    F --> G["'GPU worker is starting up'"]

System

flowchart TD
    SSM[("SSM\n/encache/gpu/instance_id = 'none'")] -->|"Baked at deploy time"| LambdaEnv["Lambda env var\nGPU_INSTANCE_ID = 'none'"]
    LambdaEnv --> Code["implementation()\ngpu_instance_id = None"]
    Code --> Launch["_launch_gpu_instance(template)"]
    Launch -->|"InsufficientInstanceCapacity"| Fail["Exception swallowed\nreturns None"]
    Fail --> Fallback["No stopped-instance fallback\ngpu_worker_url = None"]
    Fallback --> Msg["'GPU worker is starting up'"]

    EC2Stopped["3 stopped g-instances\n(available but not checked)"] -. ignored .-> Code

Reproduction Details

1. Set SSM /encache/gpu/instance_id to "none" (or redeploy the Lambda with SSM at "none").
2. Ensure all GPU instances tagged Name=encache-gpu-worker are stopped, and that g6e.2xlarge has no available capacity.
3. Send a chat POST to /memories/chat.
4. Observe chat_gpu_launch_failed in CloudWatch with no error detail; the user gets the fallback message.

Reproduction test: Requires live AWS environment. Observable via CloudWatch filter "chat_gpu_launch_failed".
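The CloudWatch check can also be scripted. The sketch below assumes a boto3-style CloudWatch Logs client is passed in as logs (for example boto3.client("logs")); the function name recent_launch_failures and the 60-minute default window are illustrative choices, not part of the codebase.

```python
import time

LOG_GROUP = "/aws/lambda/server-MemoriesChatFunction-OkmZYszwOXzJ"


def recent_launch_failures(logs, minutes=60, now=None):
    """Return the messages of chat_gpu_launch_failed events from the last `minutes`."""
    now = time.time() if now is None else now
    kwargs = {
        "logGroupName": LOG_GROUP,
        # Quoted term: match log lines containing this exact string.
        "filterPattern": '"chat_gpu_launch_failed"',
        # CloudWatch Logs timestamps are milliseconds since the epoch.
        "startTime": int((now - minutes * 60) * 1000),
    }
    messages = []
    while True:
        page = logs.filter_log_events(**kwargs)
        messages.extend(event["message"] for event in page.get("events", []))
        token = page.get("nextToken")
        if not token:
            return messages
        kwargs["nextToken"] = token  # follow pagination until exhausted
```

Before Fix 1 the returned messages carry no error detail; after it, each message should include the AWS error text.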

Notes for PR

Fix 1 — Log the actual exception in _launch_gpu_instance: Changed except Exception to except Exception as exc and added "error": str(exc) to the log payload. Future failures will show the actual AWS error code (e.g. InsufficientInstanceCapacity, VcpuLimitExceeded) in CloudWatch.
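A minimal sketch of the shape of Fix 1, not the actual code in app.py. It assumes a boto3-style EC2 client is passed in as ec2; the function name, the logger name "chat", and every log field other than "error" are illustrative.

```python
import json
import logging

logger = logging.getLogger("chat")  # hypothetical logger name


def launch_gpu_instance(ec2, launch_template_id):
    """Launch one instance from a launch template.

    Returns the new instance ID, or None if the launch fails.
    """
    try:
        resp = ec2.run_instances(
            LaunchTemplate={"LaunchTemplateId": launch_template_id},
            MinCount=1,
            MaxCount=1,
        )
        return resp["Instances"][0]["InstanceId"]
    except Exception as exc:
        # Still swallow the failure so chat degrades gracefully, but record
        # the exception text so the AWS error code is visible in CloudWatch.
        logger.error(json.dumps({
            "event": "chat_gpu_launch_failed",
            "launch_template_id": launch_template_id,
            "error": str(exc),
        }))
        return None
```

For a botocore ClientError, str(exc) contains the error code in parentheses, so a CloudWatch filter on InsufficientInstanceCapacity would match these lines.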

Fix 2 — Try stopped instances before launching new ones: Added _find_stopped_gpu_instance_by_tag() (mirrors _find_running_gpu_instance_by_tag but filters stopped state). In implementation(), when SSM is "none", the code now first scans for a stopped instance by tag and calls start_instances on it before falling back to _launch_gpu_instance. Also updates SSM and the in-process env var on success.
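The order of operations in Fix 2 can be sketched as follows. This is an illustration under assumptions, not the code in app.py: the function names and the chat_gpu_start_failed message are hypothetical, ec2 is a boto3-style EC2 client, and launch is any zero-argument callable standing in for _launch_gpu_instance. Trying every stopped candidate in turn is also an assumption; it matters in this incident because both g6e instances refused to start while the g5 succeeded. Pagination of describe_instances and the SSM/env var update are omitted for brevity.

```python
def find_stopped_gpu_instances(ec2, tag_name="encache-gpu-worker"):
    """Return the IDs of stopped instances carrying the GPU worker Name tag."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Name", "Values": [tag_name]},
            {"Name": "instance-state-name", "Values": ["stopped"]},
        ]
    )
    return [
        instance["InstanceId"]
        for reservation in resp.get("Reservations", [])
        for instance in reservation.get("Instances", [])
    ]


def start_or_launch(ec2, launch):
    """Start the first stopped instance that has capacity, else call launch()."""
    for instance_id in find_stopped_gpu_instances(ec2):
        try:
            ec2.start_instances(InstanceIds=[instance_id])
            return instance_id
        except Exception as exc:
            # e.g. InsufficientInstanceCapacity: move on to the next candidate.
            print(f"chat_gpu_start_failed instance={instance_id} error={exc}")
    # No stopped instance could be started; fall back to a fresh launch.
    return launch()
```

A caller would pass something like lambda: launch_from_template(...) as launch, then write the returned ID to SSM and the in-process env var as described above.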

Immediate recovery (performed live):
- Started i-0d8401817dec5d12f (g5.2xlarge, stopped) via aws ec2 start-instances
- Updated SSM to i-0d8401817dec5d12f
- Updated Lambda GPU_INSTANCE_ID env var directly via aws lambda update-function-configuration
- Verified GPU worker healthy: {"status":"ready","model":"VLM2Vec-2B+Qwen2-VL-7B","device":"cuda:0"}

Known remaining risk — watchdog: The watchdog stops the GPU after 10 min with 0 IngestWindowFunction invocations. Chat usage does not count. If no new memories are ingested, the GPU will be stopped again and the Lambda env var will be stale. A full redeploy is needed for the code fix to apply (so the Lambda can self-recover via the new stopped-instance path without a manual env var update).

Audit Log

ID	Action	Note	Context
1	Create audit log	Initialize investigation	User report: chat blocked, GPU not starting
2	Check SSM	SSM = "none" (set 2026-05-13T20:06:30 PDT)	`aws ssm get-parameter`
3	Check EC2 instances by tag	3 stopped instances: 2×g6e.2xlarge, 1×g5.2xlarge	`aws ec2 describe-instances`
4	Check CloudWatch logs	Confirmed `chat_gpu_launch_failed` with no error detail on every invocation	CloudWatch filter `"chat_gpu"`
5	Probe root cause	`start-instances i-094304012bc773d5f` → `InsufficientInstanceCapacity`	`aws ec2 start-instances`
6	Try g5.2xlarge	`start-instances i-0d8401817dec5d12f` → success, now running	`aws ec2 start-instances`
7	Update SSM	Set `/encache/gpu/instance_id = i-0d8401817dec5d12f`	`aws ssm put-parameter`
8	Update Lambda env var	`GPU_INSTANCE_ID = i-0d8401817dec5d12f` via `update-function-configuration`	`aws lambda update-function-configuration`
9	Verify GPU healthy	`{"status":"ready","model":"VLM2Vec-2B+Qwen2-VL-7B","device":"cuda:0"}`	SSM RunShellScript health check
10	Fix code: log exception	Added `str(exc)` to `chat_gpu_launch_failed` log	`main/server/api/memories/chat/app.py`
11	Fix code: stopped-instance fallback	Added `_find_stopped_gpu_instance_by_tag()` + start path in `implementation()`	`main/server/api/memories/chat/app.py`

Verification

[x] Reproduced failure before fix (via CloudWatch: chat_gpu_launch_failed on every invocation)
[ ] Reproduction test fails before fix (N/A — requires live AWS)
[x] Root cause identified with evidence (InsufficientInstanceCapacity confirmed by direct start-instances call)
[x] Fix applied at source (_launch_gpu_instance logs error; implementation() now tries stopped instances first)
[x] GPU worker healthy and chat unblocked (status=ready confirmed via SSM command)
[ ] Regression test added/updated (N/A — behavior requires live AWS EC2; existing watchdog tests in test_gpu_watchdog.py cover adjacent paths)
[x] Verified no duplicate solved-bug log exists for same root cause