PR 450: Stale SSM Fix Fails Due to Race Condition on IP Assignment

Date: 2026-05-09
Status: Debugging
Severity: High — GPU chat is completely unavailable despite deployed fix
Component: Chat Lambda GPU resolution, EC2 instance recovery, IP assignment
Related PR: #450 (commit f6c518db)
Related Commits: b8d71880 (exception path fix), c79441e4 (IP order fix)

Failure Signature

Users see: "Chat is not available right now — the GPU worker is starting up"

Despite:

  1. PR 450 code deployed via SAM
  2. Running EC2 instance with tag encache-gpu-worker exists
  3. Lambda has correct IAM permissions (ec2:DescribeInstances, ssm:PutParameter)
  4. Code includes tag-scan fallback

Root Cause Analysis

The Race Condition

When the watchdog stops or terminates a GPU instance and a new one is launched:

  1. T=0s: Watchdog stops instance i-old, SSM still holds i-old
  2. T=0.5s: New instance i-new launched by cloud infrastructure
  3. T=1s: Instance i-new is in EC2, state=running, but VPC hasn't assigned IP yet
  4. T=1.5s: Chat Lambda invoked, calls _resolve_gpu_url(i-old)
  5. T=1.6s: EC2 describe_instances finds i-old is stopped
  6. T=1.7s: Code calls tag-scan, finds i-new (state=running)
  7. T=1.8s: Code tries to get IP: ip = instance.get("PrivateIpAddress") or instance.get("PublicIpAddress")
  8. T=1.9s: ip = None because VPC still assigning it (typical 2-5 sec delay)
  9. T=2.0s: Code checks if ip: → False
  10. T=2.1s: Falls through to line 191: if state == "stopped": (state still refers to i-old)
  11. T=2.2s: Calls start_instances, not on the already-running i-new but on the OLD i-old
  12. T=2.3s: Returns None → user gets "GPU starting up" message

The Code Path (chat/app.py lines 166-199)

if state in ("stopped", "stopping"):
    logger(...)
    running = _find_running_gpu_instance_by_tag(region)
    if running:
        running_id = running["InstanceId"]
        ip = running.get("PrivateIpAddress") or running.get("PublicIpAddress")
        logger(...)
        _update_ssm_gpu_instance_id(running_id, region)
        if ip:
            return f"http://{ip}:{port}"  # <-- ONLY executes if ip is not None
    # FALLS THROUGH HERE if ip is None!
    if state == "stopped":
        logger(...)
        ec2.start_instances(InstanceIds=[instance_id])  # <-- Starts the OLD instance!
    return None  # <-- Always returns None if ip was None

Critical Issue: If tag-scan finds an instance but it has no IP yet, the function:

  1. Updates SSM correctly (instance ID is saved)
  2. BUT falls through and returns None anyway
  3. Even worse, tries to start the OLD instance again
  4. Never returns the new instance's URL, so health check never happens
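The fall-through can be reproduced without AWS. The following is a minimal sketch, not the code in chat/app.py: FakeEC2, resolve_after_stop, the ssm dict, and port 8000 are hypothetical stand-ins chosen for illustration, and the branch is a simplified copy of the excerpt above.

```python
from typing import List, Optional


class FakeEC2:
    """Hypothetical stand-in for the boto3 EC2 client; records start calls."""

    def __init__(self) -> None:
        self.started: List[str] = []

    def start_instances(self, InstanceIds: List[str]) -> None:
        self.started.extend(InstanceIds)


def resolve_after_stop(ec2: FakeEC2, old_id: str, state: str,
                       running: Optional[dict], ssm: dict,
                       port: int = 8000) -> Optional[str]:
    """Simplified copy of the stopped/stopping branch (port is an assumption)."""
    if state in ("stopped", "stopping"):
        if running:
            ip = running.get("PrivateIpAddress") or running.get("PublicIpAddress")
            ssm["instance_id"] = running["InstanceId"]  # SSM is updated
            if ip:
                return f"http://{ip}:{port}"
        # Reached when no instance was found OR when the IP was missing.
        if state == "stopped":
            ec2.start_instances(InstanceIds=[old_id])
        return None
    return None


ec2 = FakeEC2()
ssm = {"instance_id": "i-old"}
# Tag-scan "found" i-new, but the record has no IP fields yet.
url = resolve_after_stop(ec2, "i-old", "stopped", {"InstanceId": "i-new"}, ssm)
print(url, ssm["instance_id"], ec2.started)  # prints: None i-new ['i-old']
```

The printed line shows all three symptoms at once: no URL, SSM pointing at the new instance, and a start call issued for the old one.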

Why This Wasn't Caught

  1. Assumption: Running instances always have an IP immediately. This is false: VPC IP assignment has 1-5 second latency.
  2. Not tested: The tag-scan path with a no-IP instance has no coverage. Tests mock instances with IPs already present.
  3. Silent failure: Returns None instead of raising/logging the IP issue. SSM is updated (good), but URL is not returned (bad).

Verification

CloudWatch Log Evidence

If this bug is happening, logs will show:

step: "gpu_instance_state_issue"
step: "gpu_instance_adopted_by_tag"  <-- Instance WAS found
additional: { old_instance_id: "i-old...", new_instance_id: "i-new..." }
step: "gpu_instance_starting_stopped"  <-- Then we try to start the old one!

This sequence proves the bug: instance adopted, then code tries to start the old one anyway.
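When scanning exported logs, the check for this sequence can be automated. This is a rough sketch that assumes the step values for a single invocation have already been extracted into a list in log order; the function name shows_bug_sequence is hypothetical, and only the three step names come from the logs above.

```python
from typing import Iterable

BUG_SEQUENCE = (
    "gpu_instance_state_issue",
    "gpu_instance_adopted_by_tag",
    "gpu_instance_starting_stopped",
)


def shows_bug_sequence(steps: Iterable[str]) -> bool:
    """Return True if the three steps occur in order.

    Other steps may be interleaved between them.
    """
    remaining = list(BUG_SEQUENCE)
    for step in steps:
        if remaining and step == remaining[0]:
            remaining.pop(0)
    return not remaining


buggy = [
    "gpu_instance_state_issue",
    "gpu_instance_adopted_by_tag",
    "gpu_instance_starting_stopped",
]
healthy = ["gpu_instance_state_issue", "gpu_instance_adopted_by_tag"]
print(shows_bug_sequence(buggy), shows_bug_sequence(healthy))  # prints: True False
```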

EC2 State Evidence

If the bug is active:

  1. Multiple instances with tag encache-gpu-worker will exist
  2. SSM parameter /encache/gpu/instance_id keeps changing, bouncing between instance IDs
  3. Watchdog logs show "no invocations" repeatedly
  4. Instances cycle: launched → stopped → launched → stopped
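To check the first point quickly, the response from an EC2 describe_instances call can be reduced to one row per instance. This sketch assumes the response dict was already fetched (for example with boto3, filtered on the worker tag); summarize_instances is a hypothetical helper, and the sample data is made up for illustration.

```python
from typing import List, Tuple


def summarize_instances(response: dict) -> List[Tuple[str, str, bool]]:
    """Return (instance_id, state, has_ip) for each instance in the response."""
    rows = []
    for reservation in response.get("Reservations", []):
        for inst in reservation.get("Instances", []):
            has_ip = bool(inst.get("PrivateIpAddress") or inst.get("PublicIpAddress"))
            rows.append((inst["InstanceId"], inst["State"]["Name"], has_ip))
    return rows


# Illustrative sample only, shaped like a describe_instances response.
sample = {
    "Reservations": [
        {"Instances": [{"InstanceId": "i-old", "State": {"Name": "stopped"}}]},
        {"Instances": [{"InstanceId": "i-new", "State": {"Name": "running"},
                        "PrivateIpAddress": "10.0.0.5"}]},
    ]
}
for row in summarize_instances(sample):
    print(row)
```

More than one row for the worker tag, or a running row with has_ip set to False, would be consistent with the race described above.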

Impact

  1. User Impact: Chat completely unavailable (100% fallback message rate)
  2. Infrastructure Impact: Watchdog keeps stopping instances, new ones keep launching
  3. Cost Impact: High EC2 launch/stop churn
  4. Reliability: No automatic recovery even though tag-scan logic is present

Fix

Three approaches:

Option A: Wait for IP

Retry the IP lookup a few times, with a short fixed delay between attempts, before giving up:

# Requires `import time` at module level.
if running:
    running_id = running["InstanceId"]
    ip = running.get("PrivateIpAddress") or running.get("PublicIpAddress")

    # Retry IP lookup if not assigned yet (common on fresh launches).
    # Fixed 0.5 s delay, up to 5 attempts (about 2.5 s worst case).
    for _ in range(5):
        if ip:
            break
        time.sleep(0.5)
        desc = ec2.describe_instances(InstanceIds=[running_id])
        inst = desc["Reservations"][0]["Instances"][0]
        ip = inst.get("PrivateIpAddress") or inst.get("PublicIpAddress")

    logger({...})
    _update_ssm_gpu_instance_id(running_id, region)
    if ip:
        return f"http://{ip}:{port}"
    # Still no IP after retries: return here so the code does not fall
    # through and restart the old instance. SSM is updated for next time.
    return None

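Option B: Return early without IP (Quick Fix)

If an instance exists and is running, assume the IP will be assigned soon. Update SSM and return None right away, so the code never falls through to restarting the old instance: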

if running:
    running_id = running["InstanceId"]
    ip = running.get("PrivateIpAddress") or running.get("PublicIpAddress")
    logger({...})
    _update_ssm_gpu_instance_id(running_id, region)

    # Instance is running: return its URL if an IP is already assigned
    if ip:
        return f"http://{ip}:{port}"
    else:
        # No IP yet, but instance is running. Return None here instead of
        # falling through; the next invocation reads the updated SSM value.
        logger({"step": "gpu_instance_ip_pending"})
        return None  # Still None, but SSM is updated for next time

Option C: Separate instance and IP checks

Split the logic to handle instance discovery and IP assignment separately.
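One possible shape for Option C is sketched below. It is an illustration under assumptions, not the planned implementation: wait_for_ip and adopt_running_instance are hypothetical names, the callables stand in for the real EC2 and SSM calls, and the retry count and delay are placeholders.

```python
import time
from typing import Callable, Optional


def wait_for_ip(fetch_instance: Callable[[], dict], attempts: int = 5,
                delay_s: float = 0.5,
                sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """IP step: poll until the instance reports an IP or attempts run out."""
    for attempt in range(attempts):
        inst = fetch_instance()
        ip = inst.get("PrivateIpAddress") or inst.get("PublicIpAddress")
        if ip:
            return ip
        if attempt < attempts - 1:
            sleep(delay_s)
    return None


def adopt_running_instance(running: dict,
                           update_ssm: Callable[[str], None],
                           fetch_instance: Callable[[], dict],
                           port: int,
                           sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Discovery step records the instance; IP step resolves the URL.

    Returns None if no IP appears, and never restarts the old instance.
    """
    update_ssm(running["InstanceId"])
    ip = wait_for_ip(fetch_instance, sleep=sleep)
    return f"http://{ip}:{port}" if ip else None


# Usage with fakes: the first poll has no IP, the second one does.
responses = iter([{}, {"PrivateIpAddress": "10.0.0.5"}])
saved = []
url = adopt_running_instance({"InstanceId": "i-new"}, saved.append,
                             lambda: next(responses), 8000,
                             sleep=lambda s: None)
print(url, saved)  # prints: http://10.0.0.5:8000 ['i-new']
```

Keeping the two steps in separate functions means the caller decides what a missing IP means, and the start_instances call for the old instance is only reachable when no running instance was found at all.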

System Architecture Impact

This bug breaks the entire premise of the PR 450 fix:

  • Goal: Auto-recover when SSM is stale by finding running instance by tag
  • Implementation: Find instance, update SSM, return URL
  • Reality: Find instance, update SSM, but fail to return URL if IP is pending

The second Lambda invocation will succeed (SSM now points to correct instance), but the first request fails.

Affected Components

| Component | Impact |
| --- | --- |
| chat/app.py::_resolve_gpu_url() | Doesn't return URL for found instance without IP |
| ingest_window.py::_resolve_gpu_url() | Same bug present |
| ingest_window.py::_gpu_available() | Returns False even though instance found (only skips return) |

Tests That Would Catch This

  1. Test instance found but no IP: Mock instance with no PrivateIpAddress/PublicIpAddress
  2. Test lambda invocation sequence: First call finds instance (no IP) → should retry or wait
  3. Test SSM update happens even on IP failure: Verify parameter is written
  4. Test second invocation uses updated SSM: After first call updates SSM, next call should succeed
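The first, third, and fourth cases could look roughly like the sketch below. Because the real function is not importable here, handle_adopted_instance is a hypothetical stand-in for the tag-scan branch with the intended behavior; in the actual suite the tests would target _resolve_gpu_url with mocked EC2 and SSM clients. The port value is an assumption.

```python
import unittest
from typing import Callable, Optional
from unittest.mock import MagicMock


def handle_adopted_instance(running: dict, update_ssm: Callable[[str], None],
                            port: int) -> Optional[str]:
    """Hypothetical stand-in: update SSM, return a URL only if an IP exists."""
    update_ssm(running["InstanceId"])
    ip = running.get("PrivateIpAddress") or running.get("PublicIpAddress")
    return f"http://{ip}:{port}" if ip else None


class NoIpTests(unittest.TestCase):
    def test_no_ip_updates_ssm_and_returns_none(self):
        # Mock instance with neither PrivateIpAddress nor PublicIpAddress.
        update_ssm = MagicMock()
        url = handle_adopted_instance({"InstanceId": "i-new"}, update_ssm, 8000)
        self.assertIsNone(url)
        update_ssm.assert_called_once_with("i-new")

    def test_second_call_with_ip_returns_url(self):
        # Simulates the follow-up invocation once the IP is visible.
        update_ssm = MagicMock()
        inst = {"InstanceId": "i-new", "PrivateIpAddress": "10.0.0.5"}
        url = handle_adopted_instance(inst, update_ssm, 8000)
        self.assertEqual(url, "http://10.0.0.5:8000")


if __name__ == "__main__":
    unittest.main()
```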

Next Steps

  1. Examine CloudWatch logs for the sequence: gpu_instance_state_issue → gpu_instance_adopted_by_tag → gpu_instance_starting_stopped
  2. Check EC2 API calls for repeated start_instances on the same old instance ID
  3. Verify SSM parameter /encache/gpu/instance_id is being updated (proves code reaches line 187)
  4. Implement Option A (wait for IP) or Option C (separate checks)
  5. Add test coverage for no-IP scenario
  6. Redeploy and verify chat resolves GPU on first try

Code References

  • main/server/api/memories/chat/app.py lines 166-199 (stopped state)
  • main/server/api/memories/chat/app.py lines 201-236 (terminated state)
  • main/server/worldmm/pipeline/ingest_window.py lines 102-132 (same bug)
  • Tests: main/server/tests/unit/test_memories_chat_app.py needs new test case