PR 450: Stale SSM Fix Fails Due to Race Condition on IP Assignment
Date: 2026-05-09
Status: Debugging
Severity: High — GPU chat is completely unavailable despite deployed fix
Component: Chat Lambda GPU resolution, EC2 instance recovery, IP assignment
Related PR: #450 (commit f6c518db)
Related Commits: b8d71880 (exception path fix), c79441e4 (IP order fix)
Failure Signature
Users see: "Chat is not available right now — the GPU worker is starting up"
Despite:
1. PR 450 code deployed via SAM
2. Running EC2 instance with tag encache-gpu-worker exists
3. Lambda has correct IAM permissions (ec2:DescribeInstances, ssm:PutParameter)
4. Code includes tag-scan fallback
Root Cause Analysis
The Race Condition
When watchdog stops/terminates a GPU instance and a new one is launched:
- T=0s: Watchdog stops instance i-old, SSM still holds i-old
- T=0.5s: New instance i-new launched by cloud infrastructure
- T=1s: Instance i-new is in EC2, state=running, but VPC hasn't assigned IP yet
- T=1.5s: Chat Lambda invoked, calls _resolve_gpu_url(i-old)
- T=1.6s: EC2 describe_instances finds i-old is stopped
- T=1.7s: Code calls tag-scan, finds i-new (state=running)
- T=1.8s: Code tries to get IP: ip = instance.get("PrivateIpAddress") or instance.get("PublicIpAddress")
- T=1.9s: ip = None because VPC still assigning it (typical 2-5 sec delay)
- T=2.0s: Code checks if ip: → FALSE
- T=2.1s: Falls through to line 191: if state == "stopped":
- T=2.2s: Tries to start the already-running i-new? NO, it starts the OLD i-old!
- T=2.3s: Returns None → user gets "GPU starting up" message
The Code Path (chat/app.py lines 166-199)
    if state in ("stopped", "stopping"):
        logger(...)
        running = _find_running_gpu_instance_by_tag(region)
        if running:
            running_id = running["InstanceId"]
            ip = running.get("PrivateIpAddress") or running.get("PublicIpAddress")
            logger(...)
            _update_ssm_gpu_instance_id(running_id, region)
            if ip:
                return f"http://{ip}:{port}"  # <-- ONLY executes if ip is not None
            # FALLS THROUGH HERE if ip is None!
        if state == "stopped":
            logger(...)
            ec2.start_instances(InstanceIds=[instance_id])  # <-- Starts the OLD instance!
        return None  # <-- Always returns None if ip was None
Critical Issue: If tag-scan finds an instance but it has no IP yet, the function:
1. Updates SSM correctly (instance ID is saved)
2. BUT falls through and returns None anyway
3. Even worse, tries to start the OLD instance again
4. Never returns the new instance's URL, so health check never happens
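The fall-through can be reproduced in isolation. The following is a minimal sketch, not the real function: `resolve_stopped_branch` and `FakeEC2` are hypothetical stand-ins that mirror only the control flow of the excerpt (logging and the SSM update are omitted), and port 8000 is an arbitrary placeholder.

```python
from typing import List, Optional


class FakeEC2:
    """Hypothetical stand-in that records start_instances calls instead of calling AWS."""

    def __init__(self) -> None:
        self.started: List[str] = []

    def start_instances(self, InstanceIds: List[str]) -> None:
        self.started.extend(InstanceIds)


def resolve_stopped_branch(
    ec2: FakeEC2,
    instance_id: str,
    state: str,
    running: Optional[dict],
    port: int = 8000,  # placeholder port, not taken from the real service
) -> Optional[str]:
    """Mirror of the stopped/stopping branch shown above, side effects stubbed out."""
    if state in ("stopped", "stopping"):
        if running:
            ip = running.get("PrivateIpAddress") or running.get("PublicIpAddress")
            if ip:
                return f"http://{ip}:{port}"
            # No IP yet: control leaves this block without returning.
        if state == "stopped":
            # instance_id is still the stale ID read from SSM.
            ec2.start_instances(InstanceIds=[instance_id])
    return None


ec2 = FakeEC2()
# The tag scan found i-new running, but its description carries no IP fields yet.
adopted_without_ip = {"InstanceId": "i-new", "State": {"Name": "running"}}
url = resolve_stopped_branch(ec2, "i-old", "stopped", adopted_without_ip)
print(url, ec2.started)  # prints: None ['i-old']
```

With an IP present in the adopted instance, the same stand-in returns the URL and never touches `start_instances`, which is the behaviour the existing tests exercise.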
Why This Wasn't Caught
- Assumption: Running instances always have an IP immediately
- This is false: VPC IP assignment has 1-5 second latency
- Not tested: Tag-scan with no-IP scenario
- Tests mock instances with IPs already present
- Silent failure: Returns None instead of raising/logging the IP issue
- SSM is updated (good), but URL is not returned (bad)
Verification
CloudWatch Log Evidence
If this bug is happening, logs will show:
step: "gpu_instance_state_issue"
step: "gpu_instance_adopted_by_tag" <-- Instance WAS found
additional: { old_instance_id: "i-old...", new_instance_id: "i-new..." }
step: "gpu_instance_starting_stopped" <-- Then we try to start the old one!
This sequence proves the bug: instance adopted, then code tries to start the old one anyway.
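Checking for this signature can be automated once the `step` values of one invocation have been extracted from the logs. A minimal sketch, assuming the steps are already grouped per invocation and in chronological order (how they are pulled from CloudWatch is left out; `shows_fall_through` is a hypothetical helper name):

```python
from typing import Iterable

ADOPTED = "gpu_instance_adopted_by_tag"
STARTING_OLD = "gpu_instance_starting_stopped"


def shows_fall_through(steps: Iterable[str]) -> bool:
    """Return True if an adoption step is later followed by a start of the stopped instance."""
    adopted_seen = False
    for step in steps:
        if step == ADOPTED:
            adopted_seen = True
        elif step == STARTING_OLD and adopted_seen:
            return True
    return False


buggy = ["gpu_instance_state_issue", ADOPTED, STARTING_OLD]
healthy = ["gpu_instance_state_issue", STARTING_OLD]
print(shows_fall_through(buggy), shows_fall_through(healthy))  # prints: True False
```

A start of the stopped instance without a preceding adoption is the normal cold-start path, so the helper deliberately does not flag it.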
EC2 State Evidence
If the bug is active:
1. Multiple instances with tag encache-gpu-worker will exist
2. SSM parameter keeps changing: /encache/gpu/instance_id bounces between IDs
3. Watchdog logs show "no invocations" repeatedly
4. Instances cycle: launched → stopped → launched → stopped
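The first symptom can be checked by counting instances per state in a describe_instances response. The sketch below works on the response dictionary only and assumes the standard `Reservations` / `Instances` / `State.Name` shape; the tag filter used to fetch the response is not shown because the exact tag key is not stated here. `summarize_states` is a hypothetical helper, and the sample data is made up for illustration.

```python
from collections import Counter


def summarize_states(describe_response: dict) -> Counter:
    """Count instances per state in a describe_instances-style response."""
    counts: Counter = Counter()
    for reservation in describe_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            counts[instance["State"]["Name"]] += 1
    return counts


# Hand-written sample, not real output.
sample = {
    "Reservations": [
        {"Instances": [{"InstanceId": "i-old", "State": {"Name": "stopped"}}]},
        {"Instances": [{"InstanceId": "i-older", "State": {"Name": "stopped"}}]},
        {"Instances": [{"InstanceId": "i-new", "State": {"Name": "running"}}]},
    ]
}
states = summarize_states(sample)
print(dict(states))
```

Several stopped instances alongside one running instance would be consistent with the launch/stop churn described above.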
Impact
- User Impact: Chat completely unavailable (100% fallback message rate)
- Infrastructure Impact: Watchdog keeps stopping instances, new ones keep launching
- Cost Impact: High EC2 launch/stop churn
- Reliability: No automatic recovery even though tag-scan logic is present
Fix
Three approaches:
Option A: Wait for IP (Recommended)
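Retry the IP lookup a few times, sleeping between attempts, before giving up. The loop below uses a fixed 0.5 s interval and at most 5 retries, so it adds up to 2.5 s of sleep plus the describe_instances calls to the invocation: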
    # Requires `import time` at module level.
    if running:
        running_id = running["InstanceId"]
        ip = running.get("PrivateIpAddress") or running.get("PublicIpAddress")
        # Retry IP lookup if not assigned yet (common on fresh launches)
        retry_count = 0
        while not ip and retry_count < 5:
            time.sleep(0.5)
            desc = ec2.describe_instances(InstanceIds=[running_id])
            inst = desc["Reservations"][0]["Instances"][0]
            ip = inst.get("PrivateIpAddress") or inst.get("PublicIpAddress")
            retry_count += 1
        logger({...})
        _update_ssm_gpu_instance_id(running_id, region)
        if ip:
            return f"http://{ip}:{port}"
        # If there is still no IP after the retries, skip the return and
        # fall through to starting the old instance for next time
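The retry loop can also be pulled into a small helper so it is testable without sleeping or calling AWS. This is a sketch under assumptions: `wait_for_ip` is a hypothetical name, and `describe` is an injected callable that returns a single instance dictionary for an instance ID.

```python
import time
from typing import Callable, Optional


def wait_for_ip(
    describe: Callable[[str], dict],
    instance_id: str,
    retries: int = 5,
    delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll describe(instance_id) until an IP appears or the retries run out."""
    for _ in range(retries):
        sleep(delay_s)  # fixed interval, as in the loop above
        instance = describe(instance_id)
        ip = instance.get("PrivateIpAddress") or instance.get("PublicIpAddress")
        if ip:
            return ip
    return None


# Fake describe: the IP shows up on the third poll. In production the callable
# could wrap ec2.describe_instances(InstanceIds=[iid])["Reservations"][0]["Instances"][0].
responses = iter([{}, {}, {"PrivateIpAddress": "10.0.0.7"}])
slept = []
ip = wait_for_ip(lambda iid: next(responses), "i-new", sleep=slept.append)
print(ip, slept)  # prints: 10.0.0.7 [0.5, 0.5, 0.5]
```

Injecting `sleep` keeps unit tests instant, and the `retries * delay_s` product makes the worst-case added latency easy to compare against the Lambda timeout.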
Option B: Return even without IP (Quick Fix)
If the instance exists and is running, assume the IP will be assigned soon and return early instead of falling through to start the old instance:
    if running:
        running_id = running["InstanceId"]
        ip = running.get("PrivateIpAddress") or running.get("PublicIpAddress")
        logger({...})
        _update_ssm_gpu_instance_id(running_id, region)
        # Instance is running: return the URL if we already have an IP
        if ip:
            return f"http://{ip}:{port}"
        else:
            # No IP yet, but the instance is running: return early so the
            # old instance is not started. The next invocation's health
            # check will find the new instance through the updated SSM value.
            logger({"step": "gpu_instance_ip_pending"})
            return None  # Still None, but SSM is updated for next time
Option C: Separate instance and IP checks
Split the logic to handle instance discovery and IP assignment separately.
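One possible shape for this split, as a sketch only: a pure decision function reports what was discovered and whether a URL is available, and the caller performs the side effects (SSM update, start_instances). `GpuResolution` and `resolve` are hypothetical names, not existing code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuResolution:
    instance_id: Optional[str]  # instance SSM should point at, if one was found
    url: Optional[str]          # None while the IP is still pending
    start_old: bool             # whether the caller should start the stale instance


def resolve(stale_state: str, running: Optional[dict], port: int) -> GpuResolution:
    """Decide what to do for a stale stopped/stopping instance."""
    # Step 1: instance discovery. Only fall back to the old instance if nothing is running.
    if running is None:
        return GpuResolution(None, None, start_old=(stale_state == "stopped"))
    # Step 2: IP resolution, handled independently of discovery.
    ip = running.get("PrivateIpAddress") or running.get("PublicIpAddress")
    url = f"http://{ip}:{port}" if ip else None
    return GpuResolution(running["InstanceId"], url, start_old=False)


pending = resolve("stopped", {"InstanceId": "i-new"}, port=8000)
print(pending)
```

In this shape an adopted instance with a pending IP can never trigger a start of the old instance, because `start_old` is only set when discovery finds nothing.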
System Architecture Impact
This bug breaks the entire premise of the PR 450 fix:
- Goal: Auto-recover when SSM is stale by finding running instance by tag
- Implementation: Find instance, update SSM, return URL
- Reality: Find instance, update SSM, but fail to return URL if IP is pending
The second Lambda invocation will succeed (SSM now points to correct instance), but the first request fails.
Affected Components
| Component | Impact |
|---|---|
| chat/app.py::_resolve_gpu_url() | Doesn't return URL for found instance without IP |
| ingest_window.py::_resolve_gpu_url() | Same bug present |
| ingest_window.py::_gpu_available() | Returns False even though instance found (only skips return) |
Tests That Would Catch This
- Test instance found but no IP: Mock instance with no PrivateIpAddress/PublicIpAddress
- Test lambda invocation sequence: First call finds instance (no IP) → should retry or wait
- Test SSM update happens even on IP failure: Verify parameter is written
- Test second invocation uses updated SSM: After first call updates SSM, next call should succeed
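The missing no-IP fixtures could be built along these lines. This is a sketch using `unittest.mock` from the standard library; `describe_response` and `make_ec2_mock` are hypothetical helpers, and the real tests would pass the mock to `_resolve_gpu_url` rather than calling it directly as done here.

```python
from typing import Optional
from unittest.mock import MagicMock


def describe_response(instance_id: str, state: str, private_ip: Optional[str] = None) -> dict:
    """Build a describe_instances-shaped response; the IP key is omitted when not assigned."""
    instance = {"InstanceId": instance_id, "State": {"Name": state}}
    if private_ip is not None:
        instance["PrivateIpAddress"] = private_ip
    return {"Reservations": [{"Instances": [instance]}]}


def make_ec2_mock() -> MagicMock:
    """EC2 client mock whose first describe call has no IP and whose second call does."""
    ec2 = MagicMock()
    ec2.describe_instances.side_effect = [
        describe_response("i-new", "running"),
        describe_response("i-new", "running", "10.0.0.7"),
    ]
    return ec2


ec2 = make_ec2_mock()
first = ec2.describe_instances(InstanceIds=["i-new"])["Reservations"][0]["Instances"][0]
second = ec2.describe_instances(InstanceIds=["i-new"])["Reservations"][0]["Instances"][0]
print(first.get("PrivateIpAddress"), second.get("PrivateIpAddress"))  # prints: None 10.0.0.7
```

Because the mock records every call, a test can also assert `ec2.start_instances.assert_not_called()` to confirm that the old instance is never started once a running instance has been adopted.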
Next Steps
- Examine CloudWatch logs for the sequence: gpu_instance_state_issue → gpu_instance_adopted_by_tag → gpu_instance_starting_stopped
- Check EC2 API calls for repeated start_instances on the same old instance ID
- Verify SSM parameter /encache/gpu/instance_id is being updated (proves code reaches line 187)
- Implement Option A (wait for IP) or Option C (separate checks)
- Add test coverage for no-IP scenario
- Redeploy and verify chat resolves GPU on first try
Code References
- main/server/api/memories/chat/app.py lines 166-199 (stopped state)
- main/server/api/memories/chat/app.py lines 201-236 (terminated state)
- main/server/worldmm/pipeline/ingest_window.py lines 102-132 (same bug)
- Tests: main/server/tests/unit/test_memories_chat_app.py needs new test case