GPU Instance Type Fallback
Plan Metadata
- Plan type: plan
- Parent plan: N/A
- Depends on: N/A
- Status: documentation
Status semantics:
- draft: Plan is being created or updated and is not final.
- approved: Plan is approved but not yet applied in code.
- documentation: Code currently exists and matches the plan contract.
Update rule:
- When an existing plan is edited, set status to draft until re-approved.
System Intent
- What is being built: A fallback system so GPU chat continues working when the primary instance type (g6e.2xlarge) has no AWS regional capacity. When launching or restarting a GPU instance fails with InsufficientInstanceCapacity, the system retries with alternative instance types (g5.2xlarge, g4dn.2xlarge) in priority order. The watchdog is also updated to count chat Lambda invocations as demand, so the GPU is not stopped while users are actively chatting.
- Primary consumer(s): Chat Lambda (main/server/api/memories/chat/app.py), GPU watchdog (main/devops/lambdas/discord_alerts/gpu_watchdog.py)
- Boundary (black-box scope only): GPU instance lifecycle (launch, restart, and idle-stop decisions). Does not change chat response format, auth, or memory retrieval.
Stage Gate Tracker
- [x] Stage 1 Mermaid approved
- [x] Stage 2 I/O contracts approved
- [x] Stage 3 pseudocode/technical details approved or skipped
1. Mermaid Diagram
Reference: .agent/skills/create-mermaid-diagram/SKILL.md
flowchart TD
ChatLambda["Chat Lambda\nchat/app.py"] -->|"GPU_INSTANCE_ID=none"| Resolve["_resolve_gpu_with_fallback()"]
Resolve --> ScanRunning["Scan EC2 by tag\nstate=running"]
ScanRunning -->|"found"| AdoptRunning["Adopt running instance\nupdate SSM + env var\nreturn URL"]
ScanRunning -->|"none"| ScanStopped["Scan EC2 by tag\nstate=stopped"]
ScanStopped -->|"found"| TryStart["start_instances(stopped_id)"]
TryStart -->|"success"| UpdateSSM1["Update SSM + env var\nreturn instance_id (cold)"]
TryStart -->|"InsufficientCapacity"| LaunchNew
ScanStopped -->|"none"| LaunchNew["_launch_gpu_with_type_fallback()\ntry types in order"]
LaunchNew -->|"g6e.2xlarge"| Try1["run_instances"]
Try1 -->|"success"| UpdateSSM2["Update SSM + env var\nreturn instance_id (cold)"]
Try1 -->|"InsufficientCapacity"| Try2["g5.2xlarge\nrun_instances"]
Try2 -->|"success"| UpdateSSM2
Try2 -->|"InsufficientCapacity"| Try3["g4dn.2xlarge\nrun_instances"]
Try3 -->|"success"| UpdateSSM2
Try3 -->|"InsufficientCapacity"| AllFailed["Log all_types_exhausted\nreturn None"]
Watchdog["GPU Watchdog Lambda\ngpu_watchdog.py"] --> CheckAge["age > MAX_AGE_MIN?"]
CheckAge -->|"yes"| Stop["stop_instances\npost Discord"]
CheckAge -->|"no, age > IDLE_AFTER_MIN"| CheckDemand["Check demand:\nIngestWindowFunction invocations\nOR MemoriesChatFunction invocations\nin last INVOCATION_WINDOW_MIN"]
CheckDemand -->|"both = 0"| Stop
CheckDemand -->|"either > 0"| Keep["Keep running"]
CheckAge -->|"no, age < IDLE_AFTER_MIN"| Keep
2. Black-Box Inputs and Outputs
Keep this short. Define types in JSON-style blocks and capture each flow with path-level rows.
Global Types
InstanceId {
  value: string (EC2 instance ID, e.g. "i-0abc123")
}
GpuResolutionResult {
  instance_id: InstanceId | None
  worker_url: string | None (e.g. "http://10.0.0.5:8000", set only if instance already running)
}
Flow: gpuResolveWithFallback
- Test files: main/server/tests/unit/test_memories_chat_app.py
- Core files: main/server/api/memories/chat/app.py
Type Definitions
GpuResolveInput {
  region: string (AWS region, default "us-east-1")
  gpu_port: string (default "8000")
  launch_template_id: string (EC2 launch template ID)
  fallback_instance_types: list[string] (ordered, e.g. ["g6e.2xlarge", "g5.2xlarge", "g4dn.2xlarge"])
}
GpuResolveOutput {
  instance_id: InstanceId | None
  worker_url: string | None
}
Paths
| path-name | input | output/expected state change | path-type | notes | updated |
|---|---|---|---|---|---|
| gpuResolve.running-adopted | GpuResolveInput | GpuResolveOutput; instance_id + worker_url set; SSM updated | happy path | Running instance found by tag; no launch needed | Y |
| gpuResolve.stopped-restarted | GpuResolveInput | GpuResolveOutput; instance_id set, worker_url=None (cold); SSM updated | happy path | Stopped instance started; URL not yet available (cold start) | Y |
| gpuResolve.launched-fallback | GpuResolveInput | GpuResolveOutput; instance_id set, worker_url=None (cold); SSM updated | happy path | Primary type unavailable; launched on first available fallback type | Y |
| gpuResolve.all-types-exhausted | GpuResolveInput | GpuResolveOutput; instance_id=None, worker_url=None | error | All instance types returned InsufficientInstanceCapacity; logged | Y |
| gpuResolve.no-template | GpuResolveInput (launch_template_id="") | GpuResolveOutput; instance_id=None, worker_url=None | error | No launch template configured; skip launch attempt | Y |
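The five paths above follow a fixed resolution order: adopt a running instance, else restart a stopped one, else launch a new one. The sketch below illustrates only that ordering. The function name `resolve_gpu`, the injected callables, and the use of a private IP to build `worker_url` are assumptions for illustration, not the contents of chat/app.py. SSM and env var updates are left to the injected callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class GpuResolutionResult:
    instance_id: Optional[str] = None
    worker_url: Optional[str] = None


def resolve_gpu(
    find_running: Callable[[], Optional[Tuple[str, str]]],  # -> (instance_id, ip) or None
    find_stopped: Callable[[], Optional[str]],              # -> instance_id or None
    start_stopped: Callable[[str], bool],                   # False on InsufficientInstanceCapacity
    launch_new: Callable[[str], Optional[str]],             # -> instance_id, or None if exhausted
    launch_template_id: str,
    gpu_port: str = "8000",
) -> GpuResolutionResult:
    # gpuResolve.running-adopted: the only path that returns a worker_url.
    running = find_running()
    if running:
        instance_id, ip = running
        return GpuResolutionResult(instance_id, f"http://{ip}:{gpu_port}")

    # gpuResolve.stopped-restarted: instance is cold, so no URL yet.
    stopped_id = find_stopped()
    if stopped_id and start_stopped(stopped_id):
        return GpuResolutionResult(stopped_id, None)

    # gpuResolve.no-template: nothing to launch from.
    if not launch_template_id:
        return GpuResolutionResult()

    # gpuResolve.launched-fallback / all-types-exhausted.
    return GpuResolutionResult(launch_new(launch_template_id), None)
```

Injecting the EC2 operations as callables keeps the ordering testable without AWS access; a restart that fails for capacity reasons falls through to the launch step, matching the diagram.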
Flow: watchdogDemandCheck
- Test files: main/devops/tests/test_gpu_watchdog.py
- Core files: main/devops/lambdas/discord_alerts/gpu_watchdog.py
Type Definitions
WatchdogDemandInput {
  instance_id: InstanceId
  launch_time: timestamp
  ingest_fn_name: string | None
  chat_fn_name: string | None
}
WatchdogDemandOutput {
  stopped: bool
  reason: string | None
}
Paths
| path-name | input | output/expected state change | path-type | notes | updated |
|---|---|---|---|---|---|
| watchdog.age-cap | instance age > MAX_AGE_MIN | stopped=True; instance stopped; Discord posted | happy path | Hard cap regardless of demand | |
| watchdog.ingest-demand | ingest invocations > 0 in window | stopped=False | happy path | Ingest activity keeps GPU alive (existing) | |
| watchdog.chat-demand | chat invocations > 0, ingest = 0 | stopped=False | happy path | Chat-only activity now keeps GPU alive (new) | Y |
| watchdog.no-demand | both ingest AND chat invocations = 0 in window, age > IDLE_AFTER_MIN | stopped=True; instance stopped; Discord posted | happy path | GPU idle; stop it | Y |
| watchdog.too-young | age < IDLE_AFTER_MIN | stopped=False | subpath | Grace period for cold start | |
| watchdog.fn-unresolved | ingest_fn=None AND chat_fn=None | stopped=False; warning logged | error | Cannot check demand; skip stop to be safe | Y |
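The watchdog paths can be read as one pure decision function, which makes the precedence between them explicit. This is a minimal sketch under assumptions: the name `watchdog_decision`, the reason strings, and the convention that an unresolved function is passed as `None` are illustrative and not taken from gpu_watchdog.py. The table does not say what happens when age equals IDLE_AFTER_MIN exactly; the sketch treats that case as past the grace period.

```python
from typing import Optional, Tuple


def watchdog_decision(
    age_min: float,
    ingest_inv: Optional[float],  # None if the ingest function could not be resolved
    chat_inv: Optional[float],    # None if the chat function could not be resolved
    max_age_min: float,
    idle_after_min: float,
) -> Tuple[bool, Optional[str]]:
    """Return (stopped, reason) for one watchdog tick."""
    # watchdog.age-cap: hard cap regardless of demand.
    if age_min > max_age_min:
        return True, "age cap"
    # watchdog.too-young: grace period for cold start.
    if age_min < idle_after_min:
        return False, None
    # watchdog.fn-unresolved: demand is unknown, so do not stop
    # (the caller is expected to log a warning).
    if ingest_inv is None and chat_inv is None:
        return False, None
    # watchdog.no-demand: an unresolved function counts as zero demand.
    if (ingest_inv or 0) == 0 and (chat_inv or 0) == 0:
        return True, "no demand"
    # watchdog.ingest-demand / watchdog.chat-demand.
    return False, None
```

For example, with illustrative thresholds of 240 and 20 minutes, `watchdog_decision(30, 0, 3, 240, 20)` keeps the instance running because chat traffic alone counts as demand.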
3. Pseudocode / Technical Details for Critical Flows (Optional)
gpuResolve.launched-fallback — instance type iteration
import os

import boto3
from botocore.exceptions import ClientError

FALLBACK_TYPES = ["g6e.2xlarge", "g5.2xlarge", "g4dn.2xlarge"]

def _launch_gpu_with_type_fallback(launch_template_id, region):
    # Read from GPU_FALLBACK_INSTANCE_TYPES env var (comma-separated) if set,
    # otherwise use the constant above.
    raw = os.environ.get("GPU_FALLBACK_INSTANCE_TYPES", "")
    fallback_types = [t.strip() for t in raw.split(",") if t.strip()] or FALLBACK_TYPES
    ec2 = boto3.client("ec2", region_name=region)
    for instance_type in fallback_types:
        try:
            resp = ec2.run_instances(
                LaunchTemplate={"LaunchTemplateId": launch_template_id},
                InstanceType=instance_type,  # override template type
                MinCount=1, MaxCount=1,
                TagSpecifications=[...],
            )
            instance_id = resp["Instances"][0]["InstanceId"]
            _update_ssm_gpu_instance_id(instance_id, region)
            log(step="chat_gpu_instance_launched", instance_type=instance_type)
            return instance_id
        except ClientError as exc:
            if exc.response["Error"]["Code"] == "InsufficientInstanceCapacity":
                log(step="chat_gpu_capacity_unavailable", instance_type=instance_type)
                continue  # try next type
            raise  # unexpected error: do not silently swallow
    # Every type in the list reported InsufficientInstanceCapacity.
    log(step="chat_gpu_all_types_exhausted", types=fallback_types)
    return None
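The iteration above can be exercised without AWS by injecting a fake client. The following sketch assumes a stand-in exception, `FakeClientError`, that mimics only the `.response["Error"]["Code"]` shape of botocore's `ClientError`. `FakeEC2` and `launch_with_fallback` are hypothetical names, and SSM updates and logging are omitted.

```python
class FakeClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError (same .response shape)."""

    def __init__(self, code):
        super().__init__(code)
        self.response = {"Error": {"Code": code}}


class FakeEC2:
    """Fake EC2 client: only the instance types in `available` can launch."""

    def __init__(self, available, error_code="InsufficientInstanceCapacity"):
        self.available = set(available)
        self.error_code = error_code
        self.attempts = []

    def run_instances(self, **kwargs):
        itype = kwargs["InstanceType"]
        self.attempts.append(itype)
        if itype not in self.available:
            raise FakeClientError(self.error_code)
        return {"Instances": [{"InstanceId": f"i-fake-{itype}"}]}


def launch_with_fallback(ec2, launch_template_id, types, error_cls=FakeClientError):
    for itype in types:
        try:
            resp = ec2.run_instances(
                LaunchTemplate={"LaunchTemplateId": launch_template_id},
                InstanceType=itype,
                MinCount=1,
                MaxCount=1,
            )
        except error_cls as exc:
            if exc.response["Error"]["Code"] == "InsufficientInstanceCapacity":
                continue  # capacity problem: try the next type
            raise  # any other error propagates to the caller
        return resp["Instances"][0]["InstanceId"]
    return None  # all types exhausted


TYPES = ["g6e.2xlarge", "g5.2xlarge", "g4dn.2xlarge"]
ec2 = FakeEC2(available={"g5.2xlarge"})
launched = launch_with_fallback(ec2, "lt-example", TYPES)
```

Here the primary type fails for capacity, so the fake records two attempts and the launch lands on g5.2xlarge; g4dn.2xlarge is never tried.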
watchdog.chat-demand — add chat fn lookup alongside ingest
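CHAT_FN_PREFIX = "server-MemoriesChatFunction-"

def _resolve_chat_function():
    # Same pattern as _resolve_ingest_function(): return the first Lambda whose
    # name starts with the prefix, or None if no function matches.
    for fn in list_all_lambdas():
        if fn.startswith(CHAT_FN_PREFIX):
            return fn
    return None

# In lambda_handler demand check:
# _ingest_invocations_sum is reused for chat; it is called with either function name.
ingest_inv = _ingest_invocations_sum(ingest_fn, INVOCATION_WINDOW_MIN) if ingest_fn else 0
chat_inv = _ingest_invocations_sum(chat_fn, INVOCATION_WINDOW_MIN) if chat_fn else 0
# watchdog.fn-unresolved: if neither function resolved, demand is unknown,
# so log a warning and skip the stop instead of treating it as zero demand.
if (ingest_fn or chat_fn) and ingest_inv == 0 and chat_inv == 0:
    stop(instance_id, reason="no demand — ingest=0, chat=0")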
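The demand check depends on summing Lambda invocations over the window. One way such a helper might be written is with CloudWatch's `get_metric_statistics` on the `AWS/Lambda` `Invocations` metric. This is a sketch, not the body of `_ingest_invocations_sum`: the name `invocations_sum`, the injected `cloudwatch` client, and the 60-second period are assumptions.

```python
from datetime import datetime, timedelta, timezone


def invocations_sum(cloudwatch, function_name, window_min):
    """Sum Lambda invocations for function_name over the last window_min minutes.

    `cloudwatch` is expected to behave like boto3.client("cloudwatch").
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=window_min)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName="Invocations",
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        StartTime=start,
        EndTime=end,
        Period=60,  # one datapoint per minute (assumed granularity)
        Statistics=["Sum"],
    )
    # No datapoints means no invocations were recorded in the window.
    return sum(dp["Sum"] for dp in resp.get("Datapoints", []))


class FakeCloudWatch:
    """Test double that returns fixed datapoints and records the request."""

    def __init__(self, sums):
        self.sums = sums
        self.last_kwargs = None

    def get_metric_statistics(self, **kwargs):
        self.last_kwargs = kwargs
        return {"Datapoints": [{"Sum": s} for s in self.sums]}
```

Because the client is passed in, the same helper serves both the ingest and chat function names, and it can be tested with `FakeCloudWatch` instead of live metrics.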
SSM + env var update on stopped-instance restart
When start_instances succeeds, immediately:
1. _update_ssm_gpu_instance_id(stopped_id, region) (durable across Lambda cold starts)
2. os.environ["GPU_INSTANCE_ID"] = stopped_id (warm-container fast path)
worker_url is NOT set (instance is cold starting); the existing wait_for_gpu_health backoff handles this.
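The two writes can be sketched as a single helper. The parameter name `/example/gpu/instance-id`, the function name `record_gpu_instance_id`, and the injected `ssm` client are hypothetical; the plan does not specify how `_update_ssm_gpu_instance_id` is implemented. `put_parameter` with `Overwrite=True` is the standard SSM call for replacing an existing value.

```python
import os

# Hypothetical parameter name; the real one is not given in this plan.
GPU_INSTANCE_ID_PARAM = "/example/gpu/instance-id"


def record_gpu_instance_id(ssm, instance_id, param_name=GPU_INSTANCE_ID_PARAM):
    """Persist the active GPU instance ID in SSM and in this container's env.

    `ssm` is expected to behave like boto3.client("ssm").
    """
    # Durable: survives Lambda cold starts and is visible to other containers.
    ssm.put_parameter(
        Name=param_name,
        Value=instance_id,
        Type="String",
        Overwrite=True,
    )
    # Fast path: later invocations in this warm container skip the SSM read.
    os.environ["GPU_INSTANCE_ID"] = instance_id


class FakeSSM:
    """Test double that records put_parameter calls."""

    def __init__(self):
        self.calls = []

    def put_parameter(self, **kwargs):
        self.calls.append(kwargs)
        return {"Version": len(self.calls)}
```

Writing SSM first means a failed write raises before the env var changes, so a warm container never holds an ID that was not persisted.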
Env var override for fallback types
Set GPU_FALLBACK_INSTANCE_TYPES (comma-separated, in priority order) in the SAM template so ops can adjust the fallback list without a code deploy.
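A minimal sketch of how the override might be parsed, assuming a comma-separated value and a fallback to the built-in list when the variable is unset or blank. The function name `fallback_types_from_env` and the `environ` parameter are illustrative.

```python
import os

DEFAULT_FALLBACK_TYPES = ["g6e.2xlarge", "g5.2xlarge", "g4dn.2xlarge"]


def fallback_types_from_env(environ=None):
    """Return the ordered instance types to try, honoring the env override."""
    environ = os.environ if environ is None else environ
    raw = environ.get("GPU_FALLBACK_INSTANCE_TYPES", "")
    # Tolerate stray whitespace and empty entries such as a trailing comma.
    parsed = [t.strip() for t in raw.split(",") if t.strip()]
    # An unset or blank variable falls back to the default order.
    return parsed or list(DEFAULT_FALLBACK_TYPES)
```

Order in the variable is the priority order, so `"g5.2xlarge, g4dn.2xlarge"` would skip g6e.2xlarge entirely and try g5.2xlarge first.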
4. Handoff to Related Plan Reconciliation
After all stages are approved, apply .agent/skills/reconcile-plans/SKILL.md to propagate contract updates across linked plans.