GPU Chat Flow: Stale SSM Instance ID

Date: 2026-05-09
Status: Debugging in progress
Severity: High — Chat is completely unavailable to users
Component: GPU chat flow, SSM parameter storage, EC2 instance lifecycle

Failure Signature

Users repeatedly see the message "Chat is not available right now — the GPU worker is starting up", even when an EC2 instance is running.

Observed Pattern (from Discord monitoring)

Instance IDs keep cycling across watchdog cycles:
- i-0181e265a91efe35d → stopped (invocations=0, age=10min)
- i-01418d46682fdd805 → launched → running → stopped (invocations=0, age=21min)
- i-083a2cfcb05c7fd7e → launched → running → stopped (invocations=0, age=15min)
- i-09dd04835da144612 → launched → running

Watchdog Discord messages show: "GPU watchdog stopped — no upstream demand — invocations=0" for EVERY instance.

This means:
1. Instances are launching and running
2. Chat requests NEVER reach the GPU (invocations stay at 0)
3. After 10+ minutes idle, watchdog stops them
4. A new instance is launched
5. Cycle repeats

Root Cause Analysis

The SSM Instance ID is Stale

Flow breakdown:

1. Watchdog stops instance I1 (say i-0181...)
   - Logs to Discord but does not delete or update the SSM parameter
   - SSM still holds /encache/gpu/instance_id = i-0181... (STALE)
2. Chat Lambda calls _resolve_gpu_url() with instance ID from SSM
   - Reads SSM: instance_id = i-0181... (the stopped one)
   - EC2 describe_instances finds it in "stopped" state
   - Calls start_instances([i-0181...])
   - Returns None because instance is not yet running
   - Lambda responds to user with unavailability message
3. Chat Lambda calls _launch_gpu_instance() only if instance_id is empty
   - But instance_id is NOT empty (it's i-0181... from SSM)
   - So _launch_gpu_instance() is never called
   - Chat Lambda never even checks if a running instance exists
4. Meanwhile in ingest_window.py
   - Also reads the stale SSM instance ID
   - Tries describe_instances([i-0181...]) and gets "stopped"
   - Cannot encode video, request fails
   - Ingest Lambda never invokes GPU, so CloudWatch Invocations stay at 0
5. Watchdog runs again
   - Sees zero invocations in the last 10 minutes
   - Stops the running instance, which received no requests because SSM still points to the old one
   - Cycle repeats

Why the Instance Never Gets Chat Requests

The chat flow is:
1. Read GPU_INSTANCE_ID from SSM (stale)
2. Call _resolve_gpu_url(instance_id) → EC2 lookup
3. If instance is "stopped" or "terminated", try to start/relaunch it
4. Return None if not ready
5. Check if gpu_worker_url is empty → return unavailable response

The code does NOT:
- Clear SSM when an instance is stopped
- Fall back to scanning EC2 by tag when SSM holds a stopped/terminated instance
- Attempt recovery before giving up

Where SSM Gets Written (Currently)

chat/app.py _launch_gpu_instance() — writes new instance ID to SSM only if launched from scratch
visual/encoder.py VLM2VecClient._persist_instance_id_to_ssm() — writes ID after adoption or new creation
Watchdog: does NOT touch SSM (this is the bug)
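
One way to close this gap, sketched under assumptions: after stopping an instance, the watchdog could reset the parameter to the "none" sentinel mentioned under Next Steps, but only if the parameter still names the instance it just stopped. The function name stop_and_clear_pointer and the injected ec2/ssm client arguments are hypothetical and not taken from watchdog.py; the calls themselves are the standard boto3 stop_instances, get_parameter and put_parameter.

```python
SSM_PARAM = "/encache/gpu/instance_id"  # parameter name from the flow breakdown
SENTINEL = "none"  # assumed sentinel value, as named in Next Steps


def stop_and_clear_pointer(ec2, ssm, instance_id, param_name=SSM_PARAM):
    """Stop an instance, then reset the SSM pointer if it still names it.

    ec2 and ssm are boto3 clients (or test doubles with the same methods).
    Returns True if the pointer was reset, False if it was left alone.
    """
    ec2.stop_instances(InstanceIds=[instance_id])

    current = ssm.get_parameter(Name=param_name)["Parameter"]["Value"]
    if current != instance_id:
        # The pointer already names a different instance; do not clobber it.
        return False

    ssm.put_parameter(
        Name=param_name, Value=SENTINEL, Type="String", Overwrite=True
    )
    return True
```

The read-then-write is not atomic, so a concurrent writer could still race with it; whether that matters depends on how often the watchdog and the launch paths overlap.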

Where SSM Gets Read (Currently)

chat/app.py implementation() — reads from env var GPU_INSTANCE_ID (comes from SSM via env injection)
ingest_window.py _read_gpu_instance_id() — reads directly from SSM parameter
Both assume the ID in SSM is accurate

Initial Observations

Code Evidence

From chat/app.py line 244:

gpu_instance_id = normalize_gpu_instance_id(os.environ.get("GPU_INSTANCE_ID"))

The instance ID is injected from the environment (SSM → Lambda env vars at deploy time or via SAM/CDK).

From chat/app.py lines 264-266:

else:
    gpu_worker_url = _resolve_gpu_url(
        gpu_instance_id, gpu_port, gpu_launch_template_id, region
    )

If instance_id is not empty, it calls _resolve_gpu_url() with that ID.

From chat/app.py lines 101-109:

if state in ("stopped", "stopping"):
    logger.info(
        "GPU instance %s is %s, starting it for next request",
        instance_id,
        state,
    )
    if state == "stopped":
        ec2.start_instances(InstanceIds=[instance_id])
    return None

When the instance is stopped, the function requests a start but still returns None, so the current request is answered as unavailable.

From watchdog.py line 92:

ec2.stop_instances(InstanceIds=[instance_id])

Watchdog stops instances but never clears the SSM parameter.

From visual/encoder.py lines 93-104:

existing = self._find_existing_by_tag(ec2)
if existing:
    self._instance_id = existing["InstanceId"]
    ...
    self._persist_instance_id_to_ssm()
    logger.info("Adopted existing GPU instance %s", self._instance_id)
    return

The VLM2VecClient can adopt an existing instance by tag and update SSM, but the chat Lambda does NOT use this fallback.
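
For illustration, a tag-based lookup that the chat Lambda could use as a fallback might look like the sketch below. It is not the body of _find_existing_by_tag, which is not shown here; the function name, the tag key/value parameters, and the choice to prefer the most recently launched instance are all assumptions. The describe_instances filter names (tag:<key> and instance-state-name) are standard EC2 filters.

```python
def find_running_instance_by_tag(ec2, tag_key, tag_value):
    """Return the ID of the newest running instance carrying the tag, or None.

    ec2 is a boto3 EC2 client. Only the first page of results is read,
    which assumes the number of tagged GPU instances is small.
    """
    response = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instances = [
        inst
        for reservation in response.get("Reservations", [])
        for inst in reservation.get("Instances", [])
    ]
    if not instances:
        return None

    # Assumption: if several instances match, the newest launch wins.
    newest = max(instances, key=lambda inst: inst["LaunchTime"])
    return newest["InstanceId"]
```

The actual tag key and value would have to match whatever the launch template applies; none is assumed here.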

Next Steps (Debugging Phase)

Verify that SSM parameter /encache/gpu/instance_id is indeed holding a stale/stopped instance ID
Verify that chat requests are being rejected before reaching the GPU (check logs for "starting it for next request" vs. actual invocations)
Implement the fix: When _resolve_gpu_url() finds a stopped/terminated instance, update SSM to the "none" sentinel and scan EC2 by tag for a running instance before giving up
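
The first verification step could be scripted roughly as follows. This is a diagnostic sketch, not existing project code: check_ssm_pointer is a hypothetical name, and the clients are passed in so the function can be exercised without AWS access. It reads the parameter, then asks EC2 for the state of whatever instance it names.

```python
def check_ssm_pointer(ssm, ec2, param_name="/encache/gpu/instance_id"):
    """Return (value, state) for the instance the SSM parameter names.

    ssm and ec2 are boto3 clients. The state is the EC2 state name
    ("running", "stopped", ...) or a marker string when no lookup is possible.
    """
    value = ssm.get_parameter(Name=param_name)["Parameter"]["Value"].strip()
    if not value.startswith("i-"):
        # Covers a sentinel such as "none" or any other non-ID value.
        return value, "not-an-instance-id"

    response = ec2.describe_instances(InstanceIds=[value])
    instances = [
        inst
        for reservation in response["Reservations"]
        for inst in reservation["Instances"]
    ]
    if not instances:
        return value, "not-found"
    return value, instances[0]["State"]["Name"]


# Example (needs AWS credentials and a region):
#   import boto3
#   print(check_ssm_pointer(boto3.client("ssm"), boto3.client("ec2")))
```

A result such as ("i-0181...", "stopped") while another instance is running would support H1. Note that describe_instances raises a ClientError for an ID EC2 no longer knows about, which this sketch does not catch.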

Hypothesis Summary

H1 (Primary): SSM parameter is stale because watchdog stops instances without clearing the parameter. Chat Lambda tries to start the old instance, gives up when it's not ready, and never discovers the running replacement.
H2 (Secondary): Ingest Lambda also reads stale SSM, so GPU never receives requests, invocations stay at 0, watchdog kills the running instance as "no demand".
Fix Strategy: When SSM instance is stopped/terminated, clear SSM to sentinel and scan EC2 by tag as fallback before returning unavailable.
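
A minimal sketch of the fix strategy, under stated assumptions: resolve_instance_id, STALE_STATES and the find_running callable (for example a tag-based EC2 scan) are hypothetical names, and treating "stopping" and "shutting-down" as stale alongside "stopped" and "terminated" is an assumption rather than a decision recorded above. The sketch also omits the existing start_instances call in _resolve_gpu_url().

```python
SENTINEL = "none"  # sentinel value named in Next Steps
STALE_STATES = {"stopped", "stopping", "shutting-down", "terminated"}


def resolve_instance_id(ec2, ssm, instance_id, find_running,
                        param_name="/encache/gpu/instance_id"):
    """Return a running instance ID, repairing the SSM pointer if it is stale.

    find_running(ec2) must return the ID of a running instance or None.
    """
    response = ec2.describe_instances(InstanceIds=[instance_id])
    instances = [i for r in response["Reservations"] for i in r["Instances"]]
    state = instances[0]["State"]["Name"] if instances else "terminated"

    if state == "running":
        return instance_id
    if state not in STALE_STATES:
        # e.g. "pending": the instance is coming up, so keep the pointer.
        return None

    # Stale pointer: reset it so other readers stop retrying a dead instance.
    ssm.put_parameter(
        Name=param_name, Value=SENTINEL, Type="String", Overwrite=True
    )

    replacement = find_running(ec2)
    if replacement is None:
        return None

    # Persist the adopted instance so the next request skips the scan.
    ssm.put_parameter(
        Name=param_name, Value=replacement, Type="String", Overwrite=True
    )
    return replacement
```

Returning None still maps to the unavailable response, but only after the pointer has been cleared and a replacement has been looked for.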