GPU Instance Type Fallback

Plan Metadata

  • Plan type: plan
  • Parent plan: N/A
  • Depends on: N/A
  • Status: documentation

Status semantics:

  • draft: Plan is being created or updated and is not final.
  • approved: Plan is approved but not yet applied in code.
  • documentation: Code currently exists and matches the plan contract.

Update rule:

  • When an existing plan is edited, set status to draft until re-approved.

System Intent

  • What is being built: A fallback system so GPU chat continues working when the primary instance type (g6e.2xlarge) has no AWS regional capacity. When launching or restarting a GPU instance fails with InsufficientInstanceCapacity, the system retries with alternative instance types (g5.2xlarge, g4dn.2xlarge) in priority order. The watchdog is also updated to count chat Lambda invocations as demand, so the GPU is not stopped while users are actively chatting.
  • Primary consumer(s): Chat Lambda (main/server/api/memories/chat/app.py), GPU watchdog (main/devops/lambdas/discord_alerts/gpu_watchdog.py)
  • Boundary (black-box scope only): GPU instance lifecycle — launch, restart, and idle-stop decisions. Does not change chat response format, auth, or memory retrieval.

Stage Gate Tracker

  • [x] Stage 1 Mermaid approved
  • [x] Stage 2 I/O contracts approved
  • [x] Stage 3 pseudocode/technical details approved or skipped

1. Mermaid Diagram

Reference: .agent/skills/create-mermaid-diagram/SKILL.md

flowchart TD
    ChatLambda["Chat Lambda\nchat/app.py"] -->|"GPU_INSTANCE_ID=none"| Resolve["_resolve_gpu_with_fallback()"]

    Resolve --> ScanRunning["Scan EC2 by tag\nstate=running"]
    ScanRunning -->|"found"| AdoptRunning["Adopt running instance\nupdate SSM + env var\nreturn URL"]
    ScanRunning -->|"none"| ScanStopped["Scan EC2 by tag\nstate=stopped"]

    ScanStopped -->|"found"| TryStart["start_instances(stopped_id)"]
    TryStart -->|"success"| UpdateSSM1["Update SSM + env var\nreturn instance_id (cold)"]
    TryStart -->|"InsufficientCapacity"| LaunchNew

    ScanStopped -->|"none"| LaunchNew["_launch_gpu_with_type_fallback()\ntry types in order"]

    LaunchNew -->|"g6e.2xlarge"| Try1["run_instances"]
    Try1 -->|"success"| UpdateSSM2["Update SSM + env var\nreturn instance_id (cold)"]
    Try1 -->|"InsufficientCapacity"| Try2["g5.2xlarge\nrun_instances"]
    Try2 -->|"success"| UpdateSSM2
    Try2 -->|"InsufficientCapacity"| Try3["g4dn.2xlarge\nrun_instances"]
    Try3 -->|"success"| UpdateSSM2
    Try3 -->|"InsufficientCapacity"| AllFailed["Log all_types_exhausted\nreturn None"]

    Watchdog["GPU Watchdog Lambda\ngpu_watchdog.py"] --> CheckAge["age > MAX_AGE_MIN?"]
    CheckAge -->|"yes"| Stop["stop_instances\npost Discord"]
    CheckAge -->|"no, age > IDLE_AFTER_MIN"| CheckDemand["Check demand:\nIngestWindowFunction invocations\nOR MemoriesChatFunction invocations\nin last INVOCATION_WINDOW_MIN"]
    CheckDemand -->|"both = 0"| Stop
    CheckDemand -->|"either > 0"| Keep["Keep running"]
    CheckAge -->|"no, age < IDLE_AFTER_MIN"| Keep

2. Black-Box Inputs and Outputs

Keep this short. Define types in JSON-style blocks and capture each flow with path-level rows.

Global Types

InstanceId {
  value: string (EC2 instance ID, e.g. "i-0abc123")
}

GpuResolutionResult {
  instance_id: InstanceId | None
  worker_url: string | None  (e.g. "http://10.0.0.5:8000", set only if instance already running)
}

Flow: gpuResolveWithFallback

  • Test files: main/server/tests/unit/test_memories_chat_app.py
  • Core files: main/server/api/memories/chat/app.py

Type Definitions

GpuResolveInput {
  region: string (AWS region, default "us-east-1")
  gpu_port: string (default "8000")
  launch_template_id: string (EC2 launch template ID)
  fallback_instance_types: list[string] (ordered, e.g. ["g6e.2xlarge", "g5.2xlarge", "g4dn.2xlarge"])
}

GpuResolveOutput {
  instance_id: InstanceId | None
  worker_url: string | None
}
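
The two contract types above could be mirrored in code roughly as follows. This is a minimal sketch, not the types used in chat/app.py: the dataclass layout, the DEFAULT_FALLBACK_TYPES name, and the validation in __post_init__ are assumptions made for illustration. Only the field names and defaults come from the contract.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Order matters: the primary type first, then the fallbacks.
DEFAULT_FALLBACK_TYPES = ["g6e.2xlarge", "g5.2xlarge", "g4dn.2xlarge"]


@dataclass(frozen=True)
class GpuResolveInput:
    launch_template_id: str  # "" means no template is configured
    region: str = "us-east-1"
    gpu_port: str = "8000"
    fallback_instance_types: List[str] = field(
        default_factory=lambda: list(DEFAULT_FALLBACK_TYPES)
    )


@dataclass(frozen=True)
class GpuResolveOutput:
    instance_id: Optional[str] = None
    worker_url: Optional[str] = None  # set only when the instance was already running

    def __post_init__(self):
        # Assumed invariant: a worker URL without an instance ID is never valid.
        if self.worker_url is not None and self.instance_id is None:
            raise ValueError("worker_url requires instance_id")
```

A cold start is then `GpuResolveOutput(instance_id="i-0abc123")`, and total failure is `GpuResolveOutput()`.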

Paths

| path-name | input | output/expected state change | path-type | notes | updated |
| --- | --- | --- | --- | --- | --- |
| gpuResolve.running-adopted | GpuResolveInput | GpuResolveOutput; instance_id + worker_url set; SSM updated | happy path | Running instance found by tag; no launch needed | Y |
| gpuResolve.stopped-restarted | GpuResolveInput | GpuResolveOutput; instance_id set, worker_url=None (cold); SSM updated | happy path | Stopped instance started; URL not yet available (cold start) | Y |
| gpuResolve.launched-fallback | GpuResolveInput | GpuResolveOutput; instance_id set, worker_url=None (cold); SSM updated | happy path | Primary type unavailable; launched on first available fallback type | Y |
| gpuResolve.all-types-exhausted | GpuResolveInput | GpuResolveOutput; instance_id=None, worker_url=None | error | All instance types returned InsufficientInstanceCapacity; logged | Y |
| gpuResolve.no-template | GpuResolveInput (launch_template_id="") | GpuResolveOutput; instance_id=None, worker_url=None | error | No launch template configured; skip launch attempt | Y |
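
The five paths above follow one fixed order: adopt a running instance, else restart a stopped one, else launch by type. The sketch below shows that order as a pure function so each path can be exercised without AWS. The function name resolve_gpu and the four injected callables are hypothetical stand-ins for the EC2 calls; SSM and env var updates are left out for brevity.

```python
from typing import Callable, Optional, Sequence, Tuple

Result = Tuple[Optional[str], Optional[str]]  # (instance_id, worker_url)


def resolve_gpu(
    find_running: Callable[[], Optional[Tuple[str, str]]],  # -> (id, url) or None
    find_stopped: Callable[[], Optional[str]],              # -> id or None
    try_start: Callable[[str], bool],       # False on InsufficientInstanceCapacity
    try_launch: Callable[[str], Optional[str]],  # None on InsufficientInstanceCapacity
    launch_template_id: str,
    fallback_types: Sequence[str],
) -> Result:
    running = find_running()
    if running is not None:
        return running  # gpuResolve.running-adopted

    stopped_id = find_stopped()
    if stopped_id is not None and try_start(stopped_id):
        return stopped_id, None  # gpuResolve.stopped-restarted (cold)

    if not launch_template_id:
        return None, None  # gpuResolve.no-template

    for instance_type in fallback_types:
        instance_id = try_launch(instance_type)
        if instance_id is not None:
            return instance_id, None  # gpuResolve.launched-fallback (cold)

    return None, None  # gpuResolve.all-types-exhausted
```

In this sketch the callables report a capacity failure through their return value. Any other error is expected to propagate as an exception, matching the "do not silently swallow" rule in the pseudocode section.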

Flow: watchdogDemandCheck

  • Test files: main/devops/tests/test_gpu_watchdog.py
  • Core files: main/devops/lambdas/discord_alerts/gpu_watchdog.py

Type Definitions

WatchdogDemandInput {
  instance_id: InstanceId
  launch_time: timestamp
  ingest_fn_name: string | None
  chat_fn_name: string | None
}

WatchdogDemandOutput {
  stopped: bool
  reason: string | None
}

Paths

| path-name | input | output/expected state change | path-type | notes | updated |
| --- | --- | --- | --- | --- | --- |
| watchdog.age-cap | instance age > MAX_AGE_MIN | stopped=True; instance stopped; Discord posted | happy path | Hard cap regardless of demand | |
| watchdog.ingest-demand | ingest invocations > 0 in window | stopped=False | happy path | Ingest activity keeps GPU alive (existing) | |
| watchdog.chat-demand | chat invocations > 0, ingest = 0 | stopped=False | happy path | Chat-only activity now keeps GPU alive (new) | Y |
| watchdog.no-demand | both ingest AND chat invocations = 0 in window, age > IDLE_AFTER_MIN | stopped=True; instance stopped; Discord posted | happy path | GPU idle; stop it | Y |
| watchdog.too-young | age < IDLE_AFTER_MIN | stopped=False | subpath | Grace period for cold start | |
| watchdog.fn-unresolved | ingest_fn=None AND chat_fn=None | stopped=False; warning logged | error | Can't check demand; skip stop to be safe | Y |
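
The six watchdog paths reduce to a small decision function, sketched here without any AWS calls. The name watchdog_decision, the reason strings, and the use of None for an unresolved function name are illustrative assumptions. The plan also leaves age == IDLE_AFTER_MIN unspecified, and this sketch treats it as still inside the grace period.

```python
from typing import Optional, Tuple


def watchdog_decision(
    age_min: float,
    max_age_min: float,
    idle_after_min: float,
    ingest_inv: Optional[float],  # None: ingest function name unresolved
    chat_inv: Optional[float],    # None: chat function name unresolved
) -> Tuple[bool, Optional[str]]:
    """Return (stopped, reason) for one watchdog tick."""
    if age_min > max_age_min:
        return True, "age-cap"    # watchdog.age-cap: hard cap regardless of demand
    if age_min <= idle_after_min:
        return False, None        # watchdog.too-young: cold-start grace period
    if ingest_inv is None and chat_inv is None:
        return False, None        # watchdog.fn-unresolved: cannot measure demand
    if (ingest_inv or 0) > 0 or (chat_inv or 0) > 0:
        return False, None        # watchdog.ingest-demand / watchdog.chat-demand
    return True, "no-demand"      # watchdog.no-demand
```

With hypothetical thresholds of 240 and 30 minutes, `watchdog_decision(60, 240, 30, 0, 2)` keeps the instance running on chat demand alone, and `watchdog_decision(60, 240, 30, 0, 0)` stops it.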

3. Pseudocode / Technical Details for Critical Flows (Optional)

gpuResolve.launched-fallback — instance type iteration

import os

import boto3
from botocore.exceptions import ClientError

FALLBACK_TYPES = ["g6e.2xlarge", "g5.2xlarge", "g4dn.2xlarge"]

def _launch_gpu_with_type_fallback(launch_template_id, region):
    # Read from GPU_FALLBACK_INSTANCE_TYPES env var (comma-separated) if set,
    # otherwise use the constant above.
    raw = os.environ.get("GPU_FALLBACK_INSTANCE_TYPES", "")
    fallback_types = [t.strip() for t in raw.split(",") if t.strip()] or FALLBACK_TYPES
    ec2 = boto3.client("ec2", region_name=region)
    for instance_type in fallback_types:
        try:
            resp = ec2.run_instances(
                LaunchTemplate={"LaunchTemplateId": launch_template_id},
                InstanceType=instance_type,  # override template type
                MinCount=1, MaxCount=1,
                TagSpecifications=[...],  # tags elided in this plan
            )
            instance_id = resp["Instances"][0]["InstanceId"]
            _update_ssm_gpu_instance_id(instance_id, region)
            log(step="chat_gpu_instance_launched", instance_type=instance_type)
            return instance_id
        except ClientError as exc:
            if exc.response["Error"]["Code"] == "InsufficientInstanceCapacity":
                log(step="chat_gpu_capacity_unavailable", instance_type=instance_type)
                continue  # try next type
            raise  # unexpected error: do not silently swallow
    # Every type in the list returned InsufficientInstanceCapacity.
    log(step="chat_gpu_all_types_exhausted", types=fallback_types)
    return None

watchdog.chat-demand — add chat fn lookup alongside ingest

CHAT_FN_PREFIX = "server-MemoriesChatFunction-"

def _resolve_chat_function():
    # same pattern as _resolve_ingest_function()
    for fn in list_all_lambdas():
        if fn.startswith(CHAT_FN_PREFIX):
            return fn
    return None

# In lambda_handler demand check:
ingest_inv = _ingest_invocations_sum(ingest_fn, INVOCATION_WINDOW_MIN) if ingest_fn else 0
chat_inv   = _ingest_invocations_sum(chat_fn,   INVOCATION_WINDOW_MIN) if chat_fn else 0
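# Stop only if at least one function resolved; watchdog.fn-unresolved must keep the GPU running.
if (ingest_fn or chat_fn) and ingest_inv == 0 and chat_inv == 0: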
    stop(instance_id, reason="no demand — ingest=0, chat=0")
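
The demand check above depends on summing Lambda invocations over a time window. One way to do that is with the CloudWatch `get_metric_statistics` API, sketched below. This is not the existing `_ingest_invocations_sum`: the helper name invocations_sum, the injected client, and the 60-second period are assumptions. The client is passed in so the function can be exercised with a stub.

```python
from datetime import datetime, timedelta, timezone


def invocations_sum(cloudwatch, function_name, window_min, now=None):
    """Sum the AWS/Lambda Invocations metric over the last window_min minutes."""
    now = now or datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName="Invocations",
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        StartTime=now - timedelta(minutes=window_min),
        EndTime=now,
        Period=60,  # one datapoint per minute; must be a multiple of 60
        Statistics=["Sum"],
    )
    # An empty Datapoints list means no invocations were recorded in the window.
    return sum(dp["Sum"] for dp in resp.get("Datapoints", []))


# Usage with a real client (requires boto3 and AWS credentials):
#   import boto3
#   chat_inv = invocations_sum(boto3.client("cloudwatch"), chat_fn, INVOCATION_WINDOW_MIN)
```

Because the function name is an argument, the same helper would serve both the ingest and chat lookups.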

SSM + env var update on stopped-instance restart

When start_instances succeeds, immediately:

  1. _update_ssm_gpu_instance_id(stopped_id, region) (durable across Lambda cold starts)
  2. os.environ["GPU_INSTANCE_ID"] = stopped_id (warm-container fast path)

worker_url is NOT set (instance is cold starting); the existing wait_for_gpu_health backoff handles this.
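
The two-step update could look roughly like the sketch below. The helper name record_gpu_instance and the parameter name are hypothetical, since the plan does not state the real SSM parameter path. The SSM client and environment mapping are injected so the ordering can be checked without AWS.

```python
import os

# Hypothetical parameter name; the real path is not given in this plan.
GPU_INSTANCE_ID_PARAM = "/example/gpu/instance-id"


def record_gpu_instance(ssm, instance_id, env=os.environ,
                        param_name=GPU_INSTANCE_ID_PARAM):
    """Persist the active GPU instance ID, then cache it for this container."""
    # 1. Durable store: survives Lambda cold starts.
    ssm.put_parameter(
        Name=param_name,
        Value=instance_id,
        Type="String",
        Overwrite=True,  # replace the previous instance ID
    )
    # 2. Warm-container fast path: later invocations skip the SSM read.
    env["GPU_INSTANCE_ID"] = instance_id
```

Writing SSM first means that if `put_parameter` raises, the env var is left untouched, so the warm container does not hold an ID that was never persisted.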

Env var override for fallback types

GPU_FALLBACK_INSTANCE_TYPES=g6e.2xlarge,g5.2xlarge,g4dn.2xlarge

Set in SAM template so ops can adjust without a code deploy.
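
Parsing the override could be sketched as below. The helper name fallback_types_from_env is hypothetical, and the handling of whitespace, empty entries, and duplicates is an assumption about what ops would want; the plan itself only specifies a comma-separated list with a built-in default.

```python
import os

DEFAULT_FALLBACK_TYPES = ["g6e.2xlarge", "g5.2xlarge", "g4dn.2xlarge"]


def fallback_types_from_env(env=None):
    """Return the ordered instance types from GPU_FALLBACK_INSTANCE_TYPES."""
    env = os.environ if env is None else env
    raw = env.get("GPU_FALLBACK_INSTANCE_TYPES", "")
    types = []
    for item in raw.split(","):
        item = item.strip()
        # Skip blanks and repeats while keeping the operator's priority order.
        if item and item not in types:
            types.append(item)
    # An unset or effectively empty variable falls back to the built-in list.
    return types or list(DEFAULT_FALLBACK_TYPES)
```

For example, setting the variable to `g5.2xlarge,g4dn.2xlarge` would skip g6e.2xlarge entirely, which may be useful during a prolonged capacity shortage.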

After all stages are approved, apply .agent/skills/reconcile-plans/SKILL.md to propagate contract updates across linked plans.