GPU Chat Flow: Stale SSM Instance ID
Date: 2026-05-09
Status: Debugging in progress
Severity: High — Chat is completely unavailable to users
Component: GPU chat flow, SSM parameter storage, EC2 instance lifecycle
Failure Signature
Users see: "Chat is not available right now — the GPU worker is starting up" message repeatedly, even when an EC2 instance IS running.
Observed Pattern (from Discord monitoring)
Instance IDs keep cycling across watchdog cycles:
- i-0181e265a91efe35d → stopped (invocations=0, age=10min)
- i-01418d46682fdd805 → launched → running → stopped (invocations=0, age=21min)
- i-083a2cfcb05c7fd7e → launched → running → stopped (invocations=0, age=15min)
- i-09dd04835da144612 → launched → running
Watchdog Discord messages show: "GPU watchdog stopped — no upstream demand — invocations=0" for EVERY instance.
This means:
1. Instances are launching and running
2. Chat requests NEVER reach the GPU (invocations stay at 0)
3. After 10+ minutes idle, watchdog stops them
4. A new instance is launched
5. Cycle repeats
Root Cause Analysis
The SSM Instance ID is Stale
Flow breakdown:
- Watchdog stops instance I1 (say i-0181...)
  - Logs to Discord, deletes no SSM parameter
- SSM still holds /encache/gpu/instance_id = i-0181... (STALE)
- Chat Lambda calls _resolve_gpu_url() with instance ID from SSM
  - Reads SSM: instance_id = i-0181... (the stopped one)
  - EC2 describe_instances finds it in "stopped" state
  - Calls start_instances([i-0181...])
  - Returns None because instance is not yet running
- Lambda responds to user with unavailability message
- Chat Lambda calls _launch_gpu_instance() if instance_id is empty
  - But instance_id is NOT empty (it's i-0181... from SSM)
  - So _launch_gpu_instance() is never called
- Chat Lambda never even checks if a running instance exists
- Meanwhile in ingest_window.py
  - Also reads the stale SSM instance ID
  - Tries describe_instances([i-0181...]) and gets "stopped"
  - Cannot encode video, request fails
- Ingest Lambda never invokes GPU, so CloudWatch Invocations stay at 0
- Watchdog runs again
  - Sees zero invocations in the last 10 minutes
  - Stops the running instance (because SSM is still pointing to the old one)
  - Cycle repeats
Why the Instance Never Gets Chat Requests
The chat flow is:
1. Read GPU_INSTANCE_ID from SSM (stale)
2. Call _resolve_gpu_url(instance_id) → EC2 lookup
3. If instance is "stopped" or "terminated", try to start/relaunch it
4. Return None if not ready
5. Check if gpu_worker_url is empty → return unavailable response
The code does NOT:
- Clear SSM when an instance is stopped
- Fall back to scanning EC2 by tag when SSM holds a stopped/terminated instance
- Attempt recovery before giving up
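The missing tag-scan fallback could look roughly like the sketch below. It is not the code in chat/app.py: the function names, the "none" sentinel, and the tag key/value are assumptions for illustration, and `ec2` is expected to be a boto3 EC2 client (or anything exposing `describe_instances`). Only the first page of results is inspected.

```python
from typing import Optional

SENTINEL = "none"  # assumed placeholder meaning "no instance recorded"


def instance_state(ec2, instance_id: str) -> Optional[str]:
    """Return the EC2 state name for instance_id, or None if EC2 does not list it."""
    # Filtering on instance-id yields an empty result for unknown IDs,
    # whereas passing InstanceIds=[...] raises an error for them.
    resp = ec2.describe_instances(
        Filters=[{"Name": "instance-id", "Values": [instance_id]}]
    )
    for reservation in resp.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            return instance["State"]["Name"]
    return None


def find_running_by_tag(ec2, tag_key: str, tag_value: str) -> Optional[str]:
    """Return the ID of the first running instance carrying the tag, or None."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    for reservation in resp.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            return instance["InstanceId"]
    return None


def resolve_instance_id(
    ec2, stored_id: str, tag_key: str, tag_value: str
) -> Optional[str]:
    """Use the stored ID only if it is running; otherwise scan EC2 by tag."""
    if stored_id and stored_id != SENTINEL:
        if instance_state(ec2, stored_id) == "running":
            return stored_id
    # Stored ID is empty, the sentinel, stopped, terminated, or unknown.
    return find_running_by_tag(ec2, tag_key, tag_value)
```

With a stale stopped ID in SSM and a running replacement carrying the same tag, `resolve_instance_id` returns the replacement's ID instead of giving up, which is the recovery step the current flow lacks.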
Where SSM Gets Written (Currently)
- chat/app.py _launch_gpu_instance(): writes new instance ID to SSM only if launched from scratch
- visual/encoder.py VLM2VecClient._persist_instance_id_to_ssm(): writes ID after adoption or new creation
- Watchdog: does NOT touch SSM (this is the bug)
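One way the watchdog could touch SSM when it stops an instance is sketched below. This is an assumption about a possible change, not the existing watchdog.py code: `stop_and_clear` and the "none" sentinel are hypothetical, while the parameter name is the one used in this document. `ec2` and `ssm` are expected to be boto3 clients or compatible objects.

```python
SSM_PARAM = "/encache/gpu/instance_id"  # parameter name used in this document
SENTINEL = "none"  # assumed placeholder meaning "no instance recorded"


def stop_and_clear(ec2, ssm, instance_id: str) -> bool:
    """Stop instance_id, then reset the SSM pointer if it still names it.

    Returns True if the parameter was reset to the sentinel.
    """
    ec2.stop_instances(InstanceIds=[instance_id])

    # get_parameter raises if the parameter does not exist; a real watchdog
    # would probably want to catch that and carry on.
    current = ssm.get_parameter(Name=SSM_PARAM)["Parameter"]["Value"]
    if current != instance_id:
        # The pointer already names a different instance; leave it alone.
        return False

    ssm.put_parameter(
        Name=SSM_PARAM, Value=SENTINEL, Type="String", Overwrite=True
    )
    return True
```

The read-then-write is not atomic, so a writer that updates the parameter between the two calls could be overwritten. For a watchdog on a multi-minute schedule that window is small, but it is worth keeping in mind.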
Where SSM Gets Read (Currently)
- chat/app.py implementation(): reads from env var GPU_INSTANCE_ID (comes from SSM via env injection)
- ingest_window.py _read_gpu_instance_id(): reads directly from SSM parameter
- Both assume the ID in SSM is accurate
Initial Observations
Code Evidence
From chat/app.py line 244:
From chat/app.py lines 264-266:
else:
    gpu_worker_url = _resolve_gpu_url(
        gpu_instance_id, gpu_port, gpu_launch_template_id, region
    )
This calls _resolve_gpu_url() with that ID.
From chat/app.py lines 101-109:
if state in ("stopped", "stopping"):
    logger.info(
        "GPU instance %s is %s, starting it for next request",
        instance_id,
        state,
    )
    if state == "stopped":
        ec2.start_instances(InstanceIds=[instance_id])
    return None
This path returns None (unavailable).
From watchdog.py line 92:
From visual/encoder.py lines 93-104:
existing = self._find_existing_by_tag(ec2)
if existing:
    self._instance_id = existing["InstanceId"]
    ...
    self._persist_instance_id_to_ssm()
    logger.info("Adopted existing GPU instance %s", self._instance_id)
    return
Next Steps (Debugging Phase)
- Verify that SSM parameter /encache/gpu/instance_id is indeed holding a stale/stopped instance ID
- Verify that chat requests are being rejected before reaching the GPU (check logs for "starting it for next request" vs. actual invocations)
- Implement the fix: When _resolve_gpu_url() finds a stopped/terminated instance, update SSM to the "none" sentinel and scan EC2 by tag for a running instance before giving up
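For the first verification step, a small diagnostic along these lines may help. It is a sketch, not existing tooling: `check_pointer` is a hypothetical helper, and treating anything other than "running" or "pending" as stale is an assumption. `ssm` and `ec2` are expected to be boto3 clients, e.g. `boto3.client("ssm")` and `boto3.client("ec2")`.

```python
def check_pointer(ssm, ec2, param_name: str = "/encache/gpu/instance_id") -> dict:
    """Report which instance the SSM parameter names and what state it is in."""
    stored = ssm.get_parameter(Name=param_name)["Parameter"]["Value"]

    # Filtering on instance-id returns an empty result for IDs EC2 does not
    # know (including a sentinel such as "none") instead of raising.
    resp = ec2.describe_instances(
        Filters=[{"Name": "instance-id", "Values": [stored]}]
    )
    states = [
        instance["State"]["Name"]
        for reservation in resp.get("Reservations", [])
        for instance in reservation.get("Instances", [])
    ]
    state = states[0] if states else "not-found"

    return {
        "stored_id": stored,
        "state": state,
        # Assumption: only running or pending instances count as usable.
        "stale": state not in ("running", "pending"),
    }
```

If the report shows `stale` as true while the EC2 console lists a different running GPU instance, that supports H1.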
Hypothesis Summary
- H1 (Primary): SSM parameter is stale because watchdog stops instances without clearing the parameter. Chat Lambda tries to start the old instance, gives up when it's not ready, and never discovers the running replacement.
- H2 (Secondary): Ingest Lambda also reads stale SSM, so GPU never receives requests, invocations stay at 0, watchdog kills the running instance as "no demand".
- Fix Strategy: When SSM instance is stopped/terminated, clear SSM to sentinel and scan EC2 by tag as fallback before returning unavailable.