GPU Chat Blocked: g6e InsufficientCapacity + Stopped Instances Ignored When SSM is None

Metadata

  • Date: 2026-05-14
  • Status: fixed
  • Severity: high
  • Related issue/ticket: N/A
  • Owner: N/A

About

Overview: User sends chat messages but always receives "Chat is not available right now — the GPU worker is starting up." The GPU instance is not starting despite stopped instances being available in the account.

Root cause (two compounding bugs):

  1. _launch_gpu_instance catches every failure with a broad except Exception and logs chat_gpu_launch_failed without the error string, so the underlying cause (such as InsufficientInstanceCapacity) is invisible in CloudWatch.
  2. When SSM holds "none", implementation() goes straight to _launch_gpu_instance without checking for existing stopped instances by tag. If the launch fails (capacity, quota, AMI), there is no fallback to restart a stopped instance.

Why it happened today: SSM was set to "none" at 2026-05-13T20:06:30 PDT. All three GPU instances were stopped: i-094304012bc773d5f and i-0c8b3f59aaeba79db (both g6e.2xlarge) and i-0d8401817dec5d12f (g5.2xlarge). run_instances was failing silently with InsufficientInstanceCapacity for g6e.2xlarge, and start_instances on either g6e returned the same error. The g5.2xlarge had capacity available, but the code never tried it because its only path was launching a new instance.

Technical Questions:

  • The Lambda GPU_INSTANCE_ID env var is baked at deploy time via {{resolve:ssm:...}}, so updating SSM live does not update the running Lambda. An aws lambda update-function-configuration call or a redeploy is required.
  • The watchdog (runs every ~15min) stops a running GPU instance after IDLE_AFTER_MIN=10 minutes if IngestWindowFunction invocations = 0. Chat usage does NOT count toward this metric.

Resources:

  • main/server/api/memories/chat/app.py: _launch_gpu_instance, _find_stopped_gpu_instance_by_tag, implementation()
  • CloudWatch: /aws/lambda/server-MemoriesChatFunction-OkmZYszwOXzJ
  • CloudWatch: /aws/lambda/encache-gpu-watchdog
  • SSM: /encache/gpu/instance_id
  • Related: docs/bugs/2026-05-09-gpu-stale-ssm-instance-id.md

Steps to cause failure

flowchart LR
    A["User sends chat"] --> B["Lambda: GPU_INSTANCE_ID=none"]
    B --> C["_launch_gpu_instance(lt-0cf39db4cd6dff510)"]
    C --> D["run_instances → InsufficientInstanceCapacity"]
    D --> E["Exception swallowed, logs chat_gpu_launch_failed"]
    E --> F["gpu_worker_url = None"]
    F --> G["'GPU worker is starting up'"]

System

flowchart TD
    SSM[("SSM\n/encache/gpu/instance_id = 'none'")] -->|"Baked at deploy time"| LambdaEnv["Lambda env var\nGPU_INSTANCE_ID = 'none'"]
    LambdaEnv --> Code["implementation()\ngpu_instance_id = None"]
    Code --> Launch["_launch_gpu_instance(template)"]
    Launch -->|"InsufficientInstanceCapacity"| Fail["Exception swallowed\nreturns None"]
    Fail --> Fallback["No stopped-instance fallback\ngpu_worker_url = None"]
    Fallback --> Msg["'GPU worker is starting up'"]

    EC2Stopped["3 stopped g-instances\n(available but not checked)"] -. ignored .-> Code

Reproduction Details

  1. Set SSM /encache/gpu/instance_id to "none" (or redeploy Lambda with SSM at "none")
  2. Ensure all GPU instances by tag Name=encache-gpu-worker are stopped, and g6e.2xlarge has no available capacity
  3. Send a chat POST to /memories/chat
  4. Observe chat_gpu_launch_failed in CloudWatch with no error detail; user gets fallback message

Reproduction test: Requires live AWS environment. Observable via CloudWatch filter "chat_gpu_launch_failed".
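
The CloudWatch check can also be scripted. The following is a minimal sketch, assuming a boto3 CloudWatch Logs client (boto3.client("logs")) is passed in as logs. The function name find_launch_failures is hypothetical and not part of the codebase; filter_log_events and its logGroupName, filterPattern, startTime, and nextToken parameters are the standard boto3 calls.

```python
def find_launch_failures(logs, log_group, start_time_ms):
    """Return the messages of all chat_gpu_launch_failed events since start_time_ms.

    `logs` is expected to behave like boto3.client("logs"); it is injected so
    the function can be exercised without AWS access.
    """
    kwargs = {
        "logGroupName": log_group,
        # Double quotes make CloudWatch match the exact term.
        "filterPattern": '"chat_gpu_launch_failed"',
        "startTime": start_time_ms,
    }
    messages = []
    while True:
        resp = logs.filter_log_events(**kwargs)
        messages.extend(event["message"] for event in resp.get("events", []))
        token = resp.get("nextToken")
        if not token:
            return messages
        # Results are paginated; keep following nextToken until it is absent.
        kwargs["nextToken"] = token
```

Before the fix, every returned message would lack an error field; after Fix 1, the same query should surface the AWS error text.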

Notes for PR

Fix 1 — Log the actual exception in _launch_gpu_instance: Changed except Exception to except Exception as exc and added "error": str(exc) to the log payload. Future failures will show the actual AWS error code (e.g. InsufficientInstanceCapacity, VcpuLimitExceeded) in CloudWatch.
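
A minimal sketch of the logging change, not the actual code in app.py. It assumes the EC2 client is passed in as ec2 (in production, boto3.client("ec2")) and that the log payload is a JSON dict. The names launch_gpu_instance and describe_error, and the extra error_type and error_code fields, are hypothetical additions for illustration; the fix described above only adds "error": str(exc).

```python
import json
import logging

logger = logging.getLogger(__name__)


def describe_error(exc):
    """Build the extra log fields for a failed launch."""
    fields = {"error": str(exc), "error_type": type(exc).__name__}
    # botocore's ClientError exposes the AWS error code under
    # exc.response["Error"]["Code"]; other exceptions have no such attribute.
    response = getattr(exc, "response", None)
    if isinstance(response, dict):
        code = response.get("Error", {}).get("Code")
        if code:
            fields["error_code"] = code
    return fields


def launch_gpu_instance(ec2, launch_template_id):
    """Hypothetical stand-in for _launch_gpu_instance; returns an instance ID or None."""
    try:
        resp = ec2.run_instances(
            LaunchTemplate={"LaunchTemplateId": launch_template_id},
            MinCount=1,
            MaxCount=1,
        )
        return resp["Instances"][0]["InstanceId"]
    except Exception as exc:
        # Still swallow the failure so the caller can fall back, but record why.
        payload = {"event": "chat_gpu_launch_failed", **describe_error(exc)}
        logger.error(json.dumps(payload))
        return None
```

Keeping the broad except preserves the existing "return None on any failure" contract; the only behavioral change is that the reason now reaches CloudWatch.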

Fix 2 — Try stopped instances before launching new ones: Added _find_stopped_gpu_instance_by_tag() (mirrors _find_running_gpu_instance_by_tag but filters stopped state). In implementation(), when SSM is "none", the code now first scans for a stopped instance by tag and calls start_instances on it before falling back to _launch_gpu_instance. Also updates SSM and the in-process env var on success.
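
The stopped-instance path could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in app.py: the EC2 client is injected as ec2, the function names are hypothetical, and only the first page of describe_instances results is read. Unlike the fix as described, which starts a single stopped instance, this sketch tries each stopped instance in turn, so a g5.2xlarge can still be started when both g6e.2xlarge instances lack capacity.

```python
import logging

logger = logging.getLogger(__name__)


def find_stopped_gpu_instances(ec2, tag_name="encache-gpu-worker"):
    """Return IDs of stopped instances whose Name tag matches tag_name."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Name", "Values": [tag_name]},
            {"Name": "instance-state-name", "Values": ["stopped"]},
        ]
    )
    return [
        inst["InstanceId"]
        for reservation in resp.get("Reservations", [])
        for inst in reservation.get("Instances", [])
    ]


def start_first_available(ec2, instance_ids):
    """Try start_instances on each ID; return the first that starts, else None."""
    for instance_id in instance_ids:
        try:
            ec2.start_instances(InstanceIds=[instance_id])
            return instance_id
        except Exception as exc:
            # For example InsufficientInstanceCapacity; move on to the next one.
            logger.warning("start failed for %s: %s", instance_id, exc)
    return None


def resolve_gpu_instance(ec2, launch_new):
    """Prefer restarting a stopped instance; launch a new one only as a last resort."""
    started = start_first_available(ec2, find_stopped_gpu_instances(ec2))
    if started is not None:
        return started
    return launch_new()
```

Here launch_new is a zero-argument callable standing in for the existing _launch_gpu_instance call. The SSM and env var update that the real fix performs on success is omitted.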

Immediate recovery (performed live):

  • Started i-0d8401817dec5d12f (g5.2xlarge, stopped) via aws ec2 start-instances
  • Updated SSM to i-0d8401817dec5d12f
  • Updated Lambda GPU_INSTANCE_ID env var directly via aws lambda update-function-configuration
  • Verified GPU worker healthy: {"status":"ready","model":"VLM2Vec-2B+Qwen2-VL-7B","device":"cuda:0"}
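
For reference, the same recovery steps can be expressed with boto3. This is a hedged sketch rather than the commands that were actually run: recover_gpu is a hypothetical helper, the three clients (boto3.client("ec2"), boto3.client("ssm"), boto3.client("lambda")) are passed in, and the SSM parameter is assumed to already exist so Type can be omitted on overwrite. One detail worth noting: update_function_configuration replaces the entire environment variable map, so the sketch reads the current variables first and merges the new value in.

```python
def recover_gpu(ec2, ssm, lam, instance_id, function_name,
                parameter_name="/encache/gpu/instance_id"):
    """Start a stopped GPU instance and point SSM and the Lambda env var at it."""
    # 1. Start the stopped instance.
    ec2.start_instances(InstanceIds=[instance_id])

    # 2. Record the instance ID in SSM (parameter assumed to exist already).
    ssm.put_parameter(Name=parameter_name, Value=instance_id, Overwrite=True)

    # 3. Merge GPU_INSTANCE_ID into the existing env vars; passing only the
    #    new key would drop every other variable on the function.
    current = lam.get_function_configuration(FunctionName=function_name)
    variables = dict(current.get("Environment", {}).get("Variables", {}))
    variables["GPU_INSTANCE_ID"] = instance_id
    lam.update_function_configuration(
        FunctionName=function_name,
        Environment={"Variables": variables},
    )
    return variables
```

The health check on the GPU worker is not covered here; it was done separately through an SSM shell command.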

Known remaining risk — watchdog: The watchdog stops the GPU after 10 min with 0 IngestWindowFunction invocations. Chat usage does not count. If no new memories are ingested, the GPU will be stopped again and the Lambda env var will be stale. A full redeploy is needed for the code fix to apply (so the Lambda can self-recover via the new stopped-instance path without a manual env var update).

Audit Log

| ID | Action | Note | Context |
|----|--------|------|---------|
| 1 | Create audit log | Initialize investigation | User report: chat blocked, GPU not starting |
| 2 | Check SSM | SSM = "none" (set 2026-05-13T20:06:30 PDT) | aws ssm get-parameter |
| 3 | Check EC2 instances by tag | 3 stopped instances: 2×g6e.2xlarge, 1×g5.2xlarge | aws ec2 describe-instances |
| 4 | Check CloudWatch logs | Confirmed chat_gpu_launch_failed with no error detail on every invocation | CloudWatch filter "chat_gpu" |
| 5 | Probe root cause | start-instances i-094304012bc773d5f → InsufficientInstanceCapacity | aws ec2 start-instances |
| 6 | Try g5.2xlarge | start-instances i-0d8401817dec5d12f → success, now running | aws ec2 start-instances |
| 7 | Update SSM | Set /encache/gpu/instance_id = i-0d8401817dec5d12f | aws ssm put-parameter |
| 8 | Update Lambda env var | GPU_INSTANCE_ID = i-0d8401817dec5d12f via update-function-configuration | aws lambda update-function-configuration |
| 9 | Verify GPU healthy | {"status":"ready","model":"VLM2Vec-2B+Qwen2-VL-7B","device":"cuda:0"} | SSM RunShellScript health check |
| 10 | Fix code: log exception | Added str(exc) to chat_gpu_launch_failed log | main/server/api/memories/chat/app.py |
| 11 | Fix code: stopped-instance fallback | Added _find_stopped_gpu_instance_by_tag() + start path in implementation() | main/server/api/memories/chat/app.py |

Verification

  • [x] Reproduced failure before fix (via CloudWatch: chat_gpu_launch_failed on every invocation)
  • [ ] Reproduction test fails before fix (N/A — requires live AWS)
  • [x] Root cause identified with evidence (InsufficientInstanceCapacity confirmed by direct start-instances call)
  • [x] Fix applied at source (_launch_gpu_instance logs error; implementation() now tries stopped instances first)
  • [x] GPU worker healthy and chat unblocked (status=ready confirmed via SSM command)
  • [ ] Regression test added/updated (N/A — behavior requires live AWS EC2; existing watchdog tests in test_gpu_watchdog.py cover adjacent paths)
  • [x] Verified no duplicate solved-bug log exists for same root cause