GPU Chat Blocked: g6e InsufficientCapacity + Stopped Instances Ignored When SSM is None
Metadata
- Date: 2026-05-14
- Status: fixed
- Severity: high
- Related issue/ticket: N/A
- Owner: N/A
About
Overview: User sends chat messages but always receives "Chat is not available right now — the GPU worker is starting up." The GPU instance is not starting despite stopped instances being available in the account.
Root cause (two compounding bugs):
- _launch_gpu_instance swallows exceptions with a bare except Exception without logging the error string, so failures (like InsufficientInstanceCapacity) are invisible in CloudWatch.
- When SSM holds "none", implementation() goes straight to _launch_gpu_instance; it does NOT check for existing stopped instances by tag. If the launch fails (capacity, quota, AMI), there is no fallback to restart a stopped instance.
Why it happened today: SSM was set to "none" at 2026-05-13T20:06:30 PDT. All three GPU instances were stopped: i-094304012bc773d5f and i-0c8b3f59aaeba79db (both g6e.2xlarge) and i-0d8401817dec5d12f (g5.2xlarge). run_instances was failing silently with InsufficientInstanceCapacity for g6e.2xlarge. start_instances on either g6e also returned InsufficientInstanceCapacity. The g5.2xlarge had capacity available, but the code never tried it because it only attempted to launch new instances.
Technical Questions:
- The Lambda GPU_INSTANCE_ID env var is baked at deploy time via {{resolve:ssm:...}}, so updating SSM live does not update the running Lambda. An aws lambda update-function-configuration call or a redeploy is required.
- The watchdog (runs every ~15min) will stop a running GPU instance after IDLE_AFTER_MIN=10 minutes if IngestWindowFunction invocations = 0. Chat usage does NOT count toward this metric.
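To make the watchdog behavior concrete, here is a minimal sketch of the idle decision as this report describes it. It is not the watchdog's actual code: the function name, its arguments, and the per-minute list representation are assumptions made for illustration. Only IDLE_AFTER_MIN=10 and the rule that chat traffic is ignored come from the report.

```python
IDLE_AFTER_MIN = 10  # idle threshold in minutes, as quoted in this report


def should_stop_gpu(ingest_invocations, chat_invocations):
    """Return True if the GPU would be considered idle.

    Both arguments are hypothetical per-minute invocation counts,
    oldest first and most recent last.
    """
    # Only ingest invocations in the last IDLE_AFTER_MIN minutes matter.
    recent_ingest = ingest_invocations[-IDLE_AFTER_MIN:]
    # chat_invocations is deliberately unused: per the report, chat usage
    # does not count toward the idle metric.
    return sum(recent_ingest) == 0
```

Under this model, ten minutes of steady chat traffic with no ingestion still yields a stop decision, which is the remaining risk this report calls out.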
Resources:
- main/server/api/memories/chat/app.py: _launch_gpu_instance, _find_stopped_gpu_instance_by_tag, implementation()
- CloudWatch: /aws/lambda/server-MemoriesChatFunction-OkmZYszwOXzJ
- CloudWatch: /aws/lambda/encache-gpu-watchdog
- SSM: /encache/gpu/instance_id
- Related: docs/bugs/2026-05-09-gpu-stale-ssm-instance-id.md
Steps to cause failure
flowchart LR
A["User sends chat"] --> B["Lambda: GPU_INSTANCE_ID=none"]
B --> C["_launch_gpu_instance(lt-0cf39db4cd6dff510)"]
C --> D["run_instances → InsufficientInstanceCapacity"]
D --> E["Exception swallowed, logs chat_gpu_launch_failed"]
E --> F["gpu_worker_url = None"]
F --> G["'GPU worker is starting up'"]
System
flowchart TD
SSM[("SSM\n/encache/gpu/instance_id = 'none'")] -->|"Baked at deploy time"| LambdaEnv["Lambda env var\nGPU_INSTANCE_ID = 'none'"]
LambdaEnv --> Code["implementation()\ngpu_instance_id = None"]
Code --> Launch["_launch_gpu_instance(template)"]
Launch -->|"InsufficientInstanceCapacity"| Fail["Exception swallowed\nreturns None"]
Fail --> Fallback["No stopped-instance fallback\ngpu_worker_url = None"]
Fallback --> Msg["'GPU worker is starting up'"]
EC2Stopped["3 stopped g-instances\n(available but not checked)"] -. ignored .-> Code
Reproduction Details
- Set SSM /encache/gpu/instance_id to "none" (or redeploy Lambda with SSM at "none")
- Ensure all GPU instances by tag Name=encache-gpu-worker are stopped, and g6e.2xlarge has no available capacity
- Send a chat POST to /memories/chat
- Observe chat_gpu_launch_failed in CloudWatch with no error detail; user gets fallback message
Reproduction test: Requires live AWS environment. Observable via CloudWatch filter "chat_gpu_launch_failed".
Notes for PR
Fix 1 — Log the actual exception in _launch_gpu_instance: Changed except Exception to except Exception as exc and added "error": str(exc) to the log payload. Future failures will show the actual AWS error code (e.g. InsufficientInstanceCapacity, VcpuLimitExceeded) in CloudWatch.
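A minimal sketch of the logging change, assuming a boto3-style EC2 client is passed in. The function signature, logger name, and JSON log shape here are illustrative assumptions and may differ from the real _launch_gpu_instance in app.py. Only the except Exception as exc pattern, the "error": str(exc) field, and the chat_gpu_launch_failed event name come from this report.

```python
import json
import logging

logger = logging.getLogger("chat")  # logger name is an assumption


def launch_gpu_instance(ec2, launch_template_id):
    """Launch one instance from a launch template; return its ID, or None on failure."""
    try:
        resp = ec2.run_instances(
            LaunchTemplate={"LaunchTemplateId": launch_template_id},
            MinCount=1,
            MaxCount=1,
        )
        return resp["Instances"][0]["InstanceId"]
    except Exception as exc:
        # Record the error string so the AWS error code
        # (e.g. InsufficientInstanceCapacity) is visible in CloudWatch.
        logger.error(
            json.dumps({"event": "chat_gpu_launch_failed", "error": str(exc)})
        )
        return None
```

In a Lambda handler the client would typically come from boto3.client("ec2"). Passing it in as an argument keeps the sketch testable without AWS access.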
Fix 2 — Try stopped instances before launching new ones: Added _find_stopped_gpu_instance_by_tag() (mirrors _find_running_gpu_instance_by_tag but filters stopped state). In implementation(), when SSM is "none", the code now first scans for a stopped instance by tag and calls start_instances on it before falling back to _launch_gpu_instance. Also updates SSM and the in-process env var on success.
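The stopped-instance path could look roughly like the sketch below. It assumes a boto3-style EC2 client and uses hypothetical function names; launch_new stands in for the existing launch path. The report does not say whether the real code tries more than one stopped instance. This sketch tries each in turn, since in this incident two of the three stopped instances had no capacity. It also omits pagination and the SSM and env var updates.

```python
import logging

logger = logging.getLogger("chat")  # logger name is an assumption


def find_stopped_gpu_instances_by_tag(ec2, name_tag):
    """Return IDs of stopped instances whose Name tag equals name_tag."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Name", "Values": [name_tag]},
            {"Name": "instance-state-name", "Values": ["stopped"]},
        ]
    )
    return [
        inst["InstanceId"]
        for reservation in resp.get("Reservations", [])
        for inst in reservation.get("Instances", [])
    ]


def acquire_gpu_instance(ec2, name_tag, launch_new):
    """Start a stopped instance if one can be started, else launch a new one."""
    for instance_id in find_stopped_gpu_instances_by_tag(ec2, name_tag):
        try:
            ec2.start_instances(InstanceIds=[instance_id])
            return instance_id
        except Exception as exc:
            # A stopped instance can also fail with InsufficientInstanceCapacity;
            # log it and move on to the next candidate.
            logger.warning("start failed for %s: %s", instance_id, exc)
    # No stopped instance could be started: fall back to the launch path.
    return launch_new()
```

With the three instances from this incident, the loop would skip both g6e.2xlarge instances and return the g5.2xlarge without reaching the launch fallback.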
Immediate recovery (performed live):
- Started i-0d8401817dec5d12f (g5.2xlarge, stopped) via aws ec2 start-instances
- Updated SSM to i-0d8401817dec5d12f
- Updated Lambda GPU_INSTANCE_ID env var directly via aws lambda update-function-configuration
- Verified GPU worker healthy: {"status":"ready","model":"VLM2Vec-2B+Qwen2-VL-7B","device":"cuda:0"}
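The first three recovery steps above can be sketched as one function. The recovery was performed with the AWS CLI; this is a Python equivalent under assumptions, not the commands that were run. It takes boto3-style EC2, SSM, and Lambda clients as arguments. The function name is hypothetical, and the parameter Type of "String" is an assumption. One design point worth noting: update_function_configuration replaces the whole Environment, so the sketch reads the current variables first and merges the new value in.

```python
def recover_with_instance(ec2, ssm, lambda_client, instance_id, function_name,
                          parameter_name="/encache/gpu/instance_id"):
    """Start a stopped instance, then point SSM and the Lambda env var at it."""
    # Step 1: start the stopped instance.
    ec2.start_instances(InstanceIds=[instance_id])

    # Step 2: record the instance ID in SSM (Type="String" is an assumption).
    ssm.put_parameter(
        Name=parameter_name, Value=instance_id, Type="String", Overwrite=True
    )

    # Step 3: merge the new value into the existing env vars so that
    # unrelated variables are not dropped by the update.
    current = lambda_client.get_function_configuration(FunctionName=function_name)
    variables = dict(current.get("Environment", {}).get("Variables", {}))
    variables["GPU_INSTANCE_ID"] = instance_id
    lambda_client.update_function_configuration(
        FunctionName=function_name, Environment={"Variables": variables}
    )
    return variables
```

The health check from the last step is left out, since it depends on how the worker exposes its status.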
Known remaining risk — watchdog: The watchdog stops the GPU after 10 min with 0 IngestWindowFunction invocations. Chat usage does not count. If no new memories are ingested, the GPU will be stopped again and the Lambda env var will be stale. A full redeploy is needed for the code fix to apply (so the Lambda can self-recover via the new stopped-instance path without a manual env var update).
Audit Log
| ID | Action | Note | Context |
|---|---|---|---|
| 1 | Create audit log | Initialize investigation | User report: chat blocked, GPU not starting |
| 2 | Check SSM | SSM = "none" (set 2026-05-13T20:06:30 PDT) | aws ssm get-parameter |
| 3 | Check EC2 instances by tag | 3 stopped instances: 2×g6e.2xlarge, 1×g5.2xlarge | aws ec2 describe-instances |
| 4 | Check CloudWatch logs | Confirmed chat_gpu_launch_failed with no error detail on every invocation | CloudWatch filter "chat_gpu" |
| 5 | Probe root cause | start-instances i-094304012bc773d5f → InsufficientInstanceCapacity | aws ec2 start-instances |
| 6 | Try g5.2xlarge | start-instances i-0d8401817dec5d12f → success, now running | aws ec2 start-instances |
| 7 | Update SSM | Set /encache/gpu/instance_id = i-0d8401817dec5d12f | aws ssm put-parameter |
| 8 | Update Lambda env var | GPU_INSTANCE_ID = i-0d8401817dec5d12f via update-function-configuration | aws lambda update-function-configuration |
| 9 | Verify GPU healthy | {"status":"ready","model":"VLM2Vec-2B+Qwen2-VL-7B","device":"cuda:0"} | SSM RunShellScript health check |
| 10 | Fix code: log exception | Added str(exc) to chat_gpu_launch_failed log | main/server/api/memories/chat/app.py |
| 11 | Fix code: stopped-instance fallback | Added _find_stopped_gpu_instance_by_tag() + start path in implementation() | main/server/api/memories/chat/app.py |
Verification
- [x] Reproduced failure before fix (via CloudWatch: chat_gpu_launch_failed on every invocation)
- [ ] Reproduction test fails before fix (N/A: requires live AWS)
- [x] Root cause identified with evidence (InsufficientInstanceCapacity confirmed by direct start-instances call)
- [x] Fix applied at source (_launch_gpu_instance logs error; implementation() now tries stopped instances first)
- [x] GPU worker healthy and chat unblocked (status=ready confirmed via SSM command)
- [ ] Regression test added/updated (N/A: behavior requires live AWS EC2; existing watchdog tests in test_gpu_watchdog.py cover adjacent paths)
- [x] Verified no duplicate solved-bug log exists for same root cause