
Chat Returns "GPU Worker Is Starting Up" — Spot Instance + Unmerged AMI Fix

Metadata

  • Date: 2026-05-06
  • Status: investigating
  • Severity: high
  • Related issue/ticket: PR #395 (fix(gpu): update launch template to valid Deep Learning AMI with user_data and IAM profile)
  • Owner: N/A

About

Overview: Every chat request returns "Chat is not available right now — the GPU worker is starting up. Please try again in a few minutes." This is the same terminal error path as 2026-04-28-chat-gpu-worker-starting-up-indefinitely.md (PR #395 branch), with one added complication: the user reports that the GPU is currently running as a spot instance, but the Terraform launch template does not configure a spot market. The spot instance must therefore have been launched manually outside Terraform, so the live infrastructure and the Terraform configuration do not match.

Technical Questions:

  • Is the GPU_INSTANCE_ID env var set to a real instance ID or to "none"? Unknown: AWS SSO tokens are expired, so the live value cannot be queried.
  • If it is set to a real instance ID, what state is that spot instance in? Spot instances can be "stopped" (persistent request) or "terminated" (one-time request). _resolve_gpu_url only handles "stopped" → start_instances. If the instance is "terminated", describe_instances returns either an empty Reservations list or the instance in "terminated" state; neither path triggers a re-launch.
  • Is PR #395 merged? No. The fix (valid AMI, IAM role, user_data) exists only on origin/feature/fix-chat-gpu-startup and has NOT been merged into master or the feature/chat-gpu-startup worktree.
  • What AMI does the current launch template use? ami-0cdd9a78901f19368 (encache-gpu-worker-v2-20260401). This AMI was deleted and will cause InvalidAMIID.NotFound on any run_instances attempt.
  • Does the Terraform launch template use spot? No. Neither the current TF nor PR #395 adds instance_market_options { market_type = "spot" }. The user's spot instance was launched manually.
  • Can a spot instance be started with ec2.start_instances? Only if it belongs to a persistent spot request (not one-time). One-time spot instances go directly to the terminated state and cannot be restarted.

Resources:

  • main/server/api/memories/chat/app.py: _resolve_gpu_url implementation (chat Lambda)
  • main/devops/main.tf: aws_launch_template.gpu_worker (still has deleted AMI ami-0cdd9a78901f19368)
  • main/server/layers/shared/python/shared/gpu_utils.py: normalize_gpu_instance_id
  • PR #395: fix branch origin/feature/fix-chat-gpu-startup; valid AMI + IAM + user_data, NOT merged
  • CloudWatch: /aws/lambda/server-MemoriesChatFunction-OkmZYszwOXzJ (unable to access, SSO expired)
  • SSM: /encache/gpu/instance_id (live value unknown, SSO expired)
  • Prior bug file: docs/bugs/2026-04-28-chat-gpu-worker-starting-up-indefinitely.md (on PR #395 branch only, not in master)

Steps to cause failure

flowchart LR
    A["User sends chat message"] --> B["MemoriesChatFunction invoked"]
    B --> C["normalize_gpu_instance_id(GPU_INSTANCE_ID)"]
    C --> D{"GPU_INSTANCE_ID value?"}
    D -- "none sentinel" --> E["gpu_instance_id = None\n_resolve_gpu_url skipped"]
    D -- "real instance ID" --> F["_resolve_gpu_url called"]
    F --> G{"EC2 describe_instances"}
    G -- "terminated spot or\nInvalidInstanceID" --> H["exception caught\nreturns None"]
    G -- "stopped persistent spot" --> I["start_instances called\nreturns None (starting)"]
    G -- "running, but no IP" --> J["ip = None\nreturns None"]
    E --> K["gpu_worker_url = None"]
    H --> K
    I --> K
    J --> K
    K --> L["Fallback: 'GPU worker is starting up'"]

System

flowchart TD
    SSM[("SSM\n/encache/gpu/instance_id\n= 'none' OR spot instance ID")] -->|"GPU_INSTANCE_ID env var"| CHAT["api/memories/chat/app.py\nMemoriesChatFunction"]
    CHAT -->|"normalize → None or instance_id"| RESOLVE["_resolve_gpu_url(instance_id, port)"]
    RESOLVE -->|"EC2 describe_instances"| EC2["AWS EC2"]
    EC2 -->|"terminated / error / no IP"| RESOLVE
    RESOLVE -->|"returns None"| CHAT
    CHAT -->|"gpu_worker_url = None"| FB["'GPU worker is starting up'"]

    LT["aws_launch_template.gpu_worker\nami-0cdd9a78901f19368 DELETED\nno spot config\nno IAM/user_data"] -.->|"Only used if Lambda\nhad auto-launch code\n(reverted at b12a8d49)"| EC2

Notes: The auto-launch code (_launch_gpu_instance) was introduced in PR #391 and reverted in master at b12a8d49. The currently deployed Lambda on master does NOT call run_instances — it only calls ec2.start_instances for stopped instances. If the instance is terminated (common for one-time spot), there is no re-launch path.
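The sentinel branch in the diagrams above can be sketched as follows. This is a minimal illustration, not the code in gpu_utils.py: the function names, the default port, and the treatment of empty strings are assumptions; only the "none" sentinel and the "skip _resolve_gpu_url when the ID is None" behavior come from this document.

```python
from typing import Callable, Optional

# Assumption: only the literal "none" is confirmed as a sentinel. Treating an
# empty or whitespace-only value the same way is a guess for illustration.
_SENTINELS = {"", "none"}


def normalize_instance_id_sketch(raw: Optional[str]) -> Optional[str]:
    """Return a usable instance ID, or None for an unset or sentinel value."""
    if raw is None:
        return None
    value = raw.strip()
    return None if value.lower() in _SENTINELS else value


def gpu_worker_url_sketch(
    raw_id: Optional[str],
    resolve: Callable[[str, int], Optional[str]],
    port: int = 8000,  # hypothetical port; the real value is not in this log
) -> Optional[str]:
    """Mirror the flow: a sentinel ID means the resolver is never called."""
    instance_id = normalize_instance_id_sketch(raw_id)
    if instance_id is None:
        return None  # caller falls back to the "starting up" message
    return resolve(instance_id, port)
```

With a sentinel value the resolver is skipped entirely, so no amount of EC2-side fixing changes the reply until the SSM parameter holds a real instance ID and the Lambda is redeployed.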

Reproduction Details

  1. Confirm SSM: aws ssm get-parameter --name /encache/gpu/instance_id --region us-east-1
    • If "none" → GPU_INSTANCE_ID is the sentinel; _resolve_gpu_url is never called; the reply is always "starting up".
    • If a real ID → proceed to step 2.
  2. Check EC2 state: aws ec2 describe-instances --instance-ids <id> --region us-east-1 --query "Reservations[*].Instances[*].State"
    • If terminated → the one-time spot instance is gone; describe_instances returns the instance in "terminated" state or an empty Reservations list (which raises an exception that is caught); either way the function returns None.
    • If stopped → start_instances is called; the function returns None; the message is shown until the instance is running.
    • If running → check that a public IP is present and that /health returns 200.
  3. Confirm launch template AMI: aws ec2 describe-launch-template-versions --launch-template-id lt-0cf39db4cd6dff510 --region us-east-1 --query "LaunchTemplateVersions[*].LaunchTemplateData.ImageId"
    • Expected to be ami-0cdd9a78901f19368 (deleted) if the live template matches the unmerged git state.
  4. Send an authenticated POST /memories/chat → always returns the "starting up" fallback.

Reproduction test: N/A — behavior requires live AWS environment; observable via CloudWatch.
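Although the end-to-end failure needs live AWS, the resolver's branches can probably be exercised offline with a stub client. The sketch below is an assumption-heavy stand-in: resolve_gpu_url_sketch is reconstructed from the excerpt under Issue 2 and the flowchart (the real function takes (instance_id, port) and its elided lines are not reproduced), and FakeEC2 is a hypothetical test double, not part of the codebase.

```python
class FakeEC2:
    """Hypothetical stand-in for the boto3 EC2 client; records start calls."""

    def __init__(self, instance=None):
        self._instance = instance
        self.started = []

    def describe_instances(self, InstanceIds):
        reservations = [{"Instances": [self._instance]}] if self._instance else []
        return {"Reservations": reservations}

    def start_instances(self, InstanceIds):
        self.started.extend(InstanceIds)


def resolve_gpu_url_sketch(ec2, instance_id, port):
    """Reconstruction of the documented behavior, not the deployed function."""
    try:
        response = ec2.describe_instances(InstanceIds=[instance_id])
        # An empty Reservations list raises IndexError, caught below.
        instance = response["Reservations"][0]["Instances"][0]
        state = instance["State"]["Name"]
        if state in ("stopped", "stopping"):
            ec2.start_instances(InstanceIds=[instance_id])
            return None
        ip = instance.get("PublicIpAddress") or instance.get("PrivateIpAddress")
        if ip:
            return f"http://{ip}:{port}"
        return None  # "terminated" lands here with no action taken
    except Exception:
        return None


def instance(state, ip=None):
    """Build a minimal describe_instances-style instance record."""
    record = {"State": {"Name": state}}
    if ip:
        record["PublicIpAddress"] = ip
    return record
```

Under these assumptions, a terminated instance, an empty Reservations list, and a running instance without an IP all return None without calling start_instances, which matches the "always starting up" symptom.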

Notes for PR

There are three parallel issues:

Issue 1: PR #395 not merged (AMI fix)

The fix for the deleted AMI (ami-0cdd9a78901f19368 → ami-0e72acaa1863957cd) plus IAM role and user_data bootstrapping exists on origin/feature/fix-chat-gpu-startup (PR #395) but is NOT merged. The feature/chat-gpu-startup worktree still has the old broken launch template. Until PR #395 is merged and terraform apply is run, any new instance launch will fail with InvalidAMIID.NotFound.

Action: Merge PR #395. Then run:

cd main/devops && terraform apply

Note: PR #395 was previously terraform apply'd directly (changes already live in AWS per the bug log on the PR branch), but the git state is out of sync — the worktree TF does not reflect what was applied.

Issue 2: Spot instance cannot be started with start_instances

The user reports the GPU is currently a spot instance. The _resolve_gpu_url function calls ec2.start_instances for "stopped" instances, but whether that helps depends on the spot request type:

  • One-time spot instances go to "terminated" when interrupted or stopped by AWS; start_instances is invalid for terminated instances.
  • Persistent spot instances can be "stopped" and restarted with start_instances.

If the spot instance was launched as one-time (the default when launching via CLI with aws ec2 run-instances --instance-market-options MarketType=spot), it will show as "terminated" once AWS interrupts it. describe_instances on a terminated spot instance returns the instance in "terminated" state. The code does NOT handle state == "terminated":

# Current code - missing terminated spot case:
if state in ("stopped", "stopping"):
    ...
    ec2.start_instances(InstanceIds=[instance_id])
    return None

ip = instance.get("PublicIpAddress") or instance.get("PrivateIpAddress")
if ip:
    return f"http://{ip}:{port}"
# Falls through to return None with no action for "terminated"

For terminated instances, the correct action is to launch a new instance from the launch template — but this requires the _launch_gpu_instance auto-launch code that was reverted in master at b12a8d49.
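One way to make the missing branch explicit is to separate the decision from the EC2 calls. The sketch below is illustrative only: GpuAction and plan_gpu_action are hypothetical names, and grouping "stopping" with "stopped" simply follows the excerpt above. The state names themselves are the standard EC2 instance states.

```python
from enum import Enum
from typing import Optional


class GpuAction(Enum):
    USE = "use"            # instance is reachable; build the worker URL
    START = "start"        # call start_instances, reply "starting up"
    WAIT = "wait"          # nothing to do yet, reply "starting up"
    RELAUNCH = "relaunch"  # instance is gone; a new launch is required


def plan_gpu_action(state: Optional[str], has_ip: bool) -> GpuAction:
    """Map an EC2 instance state to the action the resolver would need.

    state is None when describe_instances found no such instance.
    """
    if state is None or state in ("terminated", "shutting-down"):
        # The gap described above: today this case returns None silently.
        return GpuAction.RELAUNCH
    if state in ("stopped", "stopping"):
        return GpuAction.START
    if state == "running" and has_ip:
        return GpuAction.USE
    return GpuAction.WAIT  # "pending", or running without an address yet
```

Keeping the decision pure makes the terminated case testable without AWS; acting on RELAUNCH would still depend on the reverted auto-launch code or a replacement for it.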

Issue 3: vCPU Quota

The prior investigation (PR #395 bug log) confirmed the account has a vCPU quota of 0 for G-instances. This means even if the launch template is fixed (Issue 1) and auto-launch is re-added (Issue 2), run_instances will fail with VcpuLimitExceeded. A quota increase request must be submitted.

Recommended steps:

  1. Merge PR #395 (fix AMI, add IAM profile, user_data).
  2. Run terraform apply in main/devops.
  3. If spot instance is currently in "terminated" state: manually launch a spot instance using the fixed launch template, specifying instance_market_options:
    aws ec2 run-instances \
      --launch-template LaunchTemplateId=lt-0cf39db4cd6dff510,Version=2 \
      --instance-market-options MarketType=spot \
      --region us-east-1
    
    Note: this requires the vCPU quota for G instances to be > 0. Request increase if needed.
  4. Once instance is running, update SSM:
    aws ssm put-parameter --name /encache/gpu/instance_id --value <new-instance-id> --type String --overwrite --region us-east-1
    
  5. Redeploy Lambda (SAM) to pick up the new GPU_INSTANCE_ID.
  6. For the start_instances-on-terminated-spot gap: add "terminated" to the state handling in _resolve_gpu_url to trigger re-launch from template (requires re-adding auto-launch logic or a separate fix).
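If the replacement instance should remain restartable via start_instances, it would need to come from a persistent spot request rather than the one-time default. The sketch below builds the keyword arguments for boto3's ec2.run_instances under that assumption; spot_launch_kwargs is a hypothetical helper, and whether a persistent request with stop-on-interruption suits this workload is not established by this investigation.

```python
def spot_launch_kwargs(launch_template_id: str, version: str,
                       persistent: bool = True) -> dict:
    """Build run_instances arguments for a spot launch from a launch template.

    A persistent request with InstanceInterruptionBehavior="stop" leaves the
    instance in "stopped" after an interruption, which the existing
    start_instances path can handle; a one-time request ends in "terminated".
    """
    spot_options = {"SpotInstanceType": "persistent" if persistent else "one-time"}
    if persistent:
        spot_options["InstanceInterruptionBehavior"] = "stop"
    return {
        "LaunchTemplate": {
            "LaunchTemplateId": launch_template_id,
            "Version": version,  # boto3 expects the version as a string
        },
        "InstanceMarketOptions": {
            "MarketType": "spot",
            "SpotOptions": spot_options,
        },
        "MinCount": 1,
        "MaxCount": 1,
    }


# Usage (requires valid credentials and a non-zero G-instance quota):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   ec2.run_instances(**spot_launch_kwargs("lt-0cf39db4cd6dff510", "2"))
```

The launch template ID and version are the ones quoted in step 3; the launch itself remains blocked by the quota described under Issue 3.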

Audit Log

| ID | Action | Note | Context |
|----|--------|------|---------|
| 1 | Create audit log | Initialize investigation for "GPU starting up" error | User report |
| 2 | Check existing bug files | Found prior bug 2026-04-28-chat-gpu-worker-starting-up-indefinitely.md on PR #395 branch (not in local docs/bugs/) | git show origin/feature/fix-chat-gpu-startup:docs/bugs/... |
| 3 | Check open PRs | Found PR #395 (fix(gpu): update launch template): NOT merged; PR #399 (GPU caption queue): NOT merged | gh pr list |
| 4 | Read PR #395 body | Fix: AMI ami-0cdd9a78901f19368 (deleted) → ami-0e72acaa1863957cd + IAM role + user_data; blocked on vCPU quota = 0 | gh pr view 395 |
| 5 | Read current main.tf | Launch template still uses deleted AMI ami-0cdd9a78901f19368; no spot config; no IAM/user_data | /home/lewibs/github/encache1/encache1-chat-gpu-startup/main/devops/main.tf |
| 6 | Diff current vs PR #395 TF | PR adds IAM role, instance profile, updated AMI, user_data; none of this is in the working tree | git diff feature/chat-gpu-startup..origin/feature/fix-chat-gpu-startup -- main/devops/main.tf |
| 7 | Read chat app.py | _resolve_gpu_url handles "stopped" → start_instances but NOT "terminated"; no run_instances path in master (reverted at b12a8d49) | main/server/api/memories/chat/app.py |
| 8 | Identify spot instance gap | Spot instances cannot be restarted with start_instances if one-time; "terminated" state falls through to return None with no action | app.py:49-57 |
| 9 | Check AWS credentials | SSO tokens expired; cannot query CloudWatch logs, EC2 state, or SSM values live | aws sts get-caller-identity → "Token has expired" |
| 10 | Confirm no duplicate | No local docs/bugs/2026-04-28-* file in docs/bugs/; creating new file for spot-specific investigation | ls docs/bugs/ |

Verification

  • [ ] Reproduced failure before fix (blocked — AWS SSO tokens expired)
  • [ ] Reproduction test fails before fix (N/A — live AWS required)
  • [x] Root cause identified with evidence (deleted AMI in unmerged PR, spot instance terminated state not handled, vCPU quota = 0)
  • [ ] Fix applied at source
  • [ ] Reproduction test passes after fix
  • [ ] Regression test added/updated
  • [x] Verified no duplicate solved-bug log exists for same root cause in local docs/bugs/