Chat Returns "GPU Worker Is Starting Up" — Spot Instance + Unmerged AMI Fix

Metadata

Date: 2026-05-06
Status: investigating
Severity: high
Related issue/ticket: PR #395 (fix(gpu): update launch template to valid Deep Learning AMI with user_data and IAM profile)
Owner: N/A

About

Overview: User receives "Chat is not available right now — the GPU worker is starting up. Please try again in a few minutes." on every chat request. This is the same terminal error path as 2026-04-28-chat-gpu-worker-starting-up-indefinitely.md (PR #395 branch), but with an additional complication: the user reports the GPU is currently running as a spot instance, and the launch template in Terraform does not configure a spot market — meaning the spot instance was launched manually outside Terraform, creating a mismatch.

Technical Questions:
- Is the GPU_INSTANCE_ID env var set to a real instance ID or "none"? Unknown: AWS SSO tokens are expired, so live state cannot be queried.
- If it is set to a real instance ID, what state is that spot instance in? A spot instance from a persistent request can be "stopped"; a one-time spot instance is "terminated" on interruption. _resolve_gpu_url only handles "stopped" → start_instances. For a terminated instance, describe_instances either returns the instance in "terminated" state (for a short time after termination) or fails with InvalidInstanceID.NotFound once the record is gone. Neither path triggers a re-launch.
- Is PR #395 merged? No. The fix (valid AMI, IAM role, user_data) exists only on origin/feature/fix-chat-gpu-startup and has NOT been merged into master or the feature/chat-gpu-startup worktree.
- What AMI does the current launch template use? ami-0cdd9a78901f19368 (encache-gpu-worker-v2-20260401). This AMI was deleted and will cause InvalidAMIID.NotFound on any run_instances attempt.
- Does the Terraform launch template use spot? No. Neither the current TF nor PR #395 adds instance_market_options { market_type = "spot" }. The user's spot instance was launched manually.
- Can a spot instance be started with ec2.start_instances? Only if it belongs to a persistent spot request. One-time spot instances go directly to the terminated state and cannot be restarted.

Resources:
- main/server/api/memories/chat/app.py: _resolve_gpu_url implementation (chat Lambda)
- main/devops/main.tf: aws_launch_template.gpu_worker (still has deleted AMI ami-0cdd9a78901f19368)
- main/server/layers/shared/python/shared/gpu_utils.py: normalize_gpu_instance_id
- PR #395: fix branch origin/feature/fix-chat-gpu-startup, with valid AMI + IAM + user_data, NOT merged
- CloudWatch: /aws/lambda/server-MemoriesChatFunction-OkmZYszwOXzJ, unable to access (SSO expired)
- SSM: /encache/gpu/instance_id, unknown live value (SSO expired)
- Prior bug file: docs/bugs/2026-04-28-chat-gpu-worker-starting-up-indefinitely.md (on PR #395 branch only, not in master)

Steps to cause failure

flowchart LR
    A["User sends chat message"] --> B["MemoriesChatFunction invoked"]
    B --> C["normalize_gpu_instance_id(GPU_INSTANCE_ID)"]
    C --> D{"GPU_INSTANCE_ID value?"}
    D -- "none sentinel" --> E["gpu_instance_id = None\n_resolve_gpu_url skipped"]
    D -- "real instance ID" --> F["_resolve_gpu_url called"]
    F --> G{"EC2 describe_instances"}
    G -- "terminated spot or\nInvalidInstanceID" --> H["exception caught\nreturns None"]
    G -- "stopped persistent spot" --> I["start_instances called\nreturns None (starting)"]
    G -- "running, but no IP" --> J["ip = None\nreturns None"]
    E --> K["gpu_worker_url = None"]
    H --> K
    I --> K
    J --> K
    K --> L["Fallback: 'GPU worker is starting up'"]

System

flowchart TD
    SSM[("SSM\n/encache/gpu/instance_id\n= 'none' OR spot instance ID")] -->|"GPU_INSTANCE_ID env var"| CHAT["api/memories/chat/app.py\nMemoriesChatFunction"]
    CHAT -->|"normalize → None or instance_id"| RESOLVE["_resolve_gpu_url(instance_id, port)"]
    RESOLVE -->|"EC2 describe_instances"| EC2["AWS EC2"]
    EC2 -->|"terminated / error / no IP"| RESOLVE
    RESOLVE -->|"returns None"| CHAT
    CHAT -->|"gpu_worker_url = None"| FB["'GPU worker is starting up'"]

    LT["aws_launch_template.gpu_worker\nami-0cdd9a78901f19368 DELETED\nno spot config\nno IAM/user_data"] -.->|"Only used if Lambda\nhad auto-launch code\n(reverted at b12a8d49)"| EC2

Notes: The auto-launch code (_launch_gpu_instance) was introduced in PR #391 and reverted in master at b12a8d49. The currently deployed Lambda on master does NOT call run_instances — it only calls ec2.start_instances for stopped instances. If the instance is terminated (common for one-time spot), there is no re-launch path.
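The flow in the two diagrams can be traced offline with a small model. The sketch below is not the deployed code: `normalize_id` and `trace_resolution` are hypothetical names, the sentinel set is an assumption (the source only mentions "none"), and the port is left as a required argument because the report does not state it. A `describe_response` of `None` stands in for an exception raised by describe_instances.

```python
from typing import Optional, Tuple

# Assumption: only "none" is confirmed by this report; "" is added defensively.
NONE_SENTINELS = {"", "none"}


def normalize_id(raw: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in for normalize_gpu_instance_id."""
    if raw is None or raw.strip().lower() in NONE_SENTINELS:
        return None
    return raw.strip()


def trace_resolution(raw_id: Optional[str],
                     describe_response: Optional[dict],
                     port: int) -> Tuple[Optional[str], str]:
    """Return (url, reason) for an env var value and a describe_instances response."""
    instance_id = normalize_id(raw_id)
    if instance_id is None:
        return None, "sentinel: _resolve_gpu_url skipped"
    if describe_response is None:
        return None, "describe_instances raised: exception caught"

    instances = [inst
                 for res in describe_response.get("Reservations", [])
                 for inst in res.get("Instances", [])]
    if not instances:
        return None, "no instance in response"

    instance = instances[0]
    state = instance["State"]["Name"]
    if state in ("stopped", "stopping"):
        # The deployed code calls start_instances here and returns None.
        return None, "start_instances requested: still starting"

    ip = instance.get("PublicIpAddress") or instance.get("PrivateIpAddress")
    if ip:
        return f"http://{ip}:{port}", f"{state} with IP"
    # A terminated instance has no IP, so it lands here with no action taken.
    return None, f"{state}, no IP: no action taken"
```

Every branch except "running with an IP" yields `None`, which is why the sentinel, the terminated instance, and the stopped instance all surface to the user as the same fallback message.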

Reproduction Details

1. Confirm SSM: aws ssm get-parameter --name /encache/gpu/instance_id --region us-east-1
   - If "none" → GPU_INSTANCE_ID is the sentinel; _resolve_gpu_url is never called; always "starting up".
   - If real ID → proceed to step 2.
2. Check EC2 state: aws ec2 describe-instances --instance-ids <id> --region us-east-1 --query "Reservations[*].Instances[*].State"
   - If terminated → the one-time spot instance was terminated. describe_instances returns it in "terminated" state with no IP, or raises InvalidInstanceID.NotFound once the record is gone (the exception is caught). Either way _resolve_gpu_url returns None.
   - If stopped → start_instances is called; returns None; the message is shown until the instance is running.
   - If running → check whether a public IP is present and /health returns 200.
3. Confirm launch template AMI: aws ec2 describe-launch-template-versions --launch-template-id lt-0cf39db4cd6dff510 --region us-east-1 --query "LaunchTemplateVersions[*].LaunchTemplateData.ImageId"
   - Expected to be ami-0cdd9a78901f19368 (deleted) if only the unmerged master Terraform was applied. If PR #395 was applied directly (see Notes for PR), a newer version may already show ami-0e72acaa1863957cd.
4. Send authenticated POST /memories/chat → always returns the "starting up" fallback.
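Once SSO credentials are refreshed, the first three checks could be bundled into one helper. This is a minimal sketch, not an existing script in the repo: `diagnose` is a hypothetical name, the clients are passed in so the function can be exercised with stubs, and it inspects only the `$Latest` launch template version (the CLI command above lists all versions). It catches a broad `Exception` to avoid depending on botocore; real code would likely catch `botocore.exceptions.ClientError`.

```python
DELETED_AMI = "ami-0cdd9a78901f19368"  # the deleted AMI named in this report


def diagnose(ssm, ec2, parameter_name, launch_template_id):
    """Collect SSM value, instance state, and launch template AMI.

    `ssm` and `ec2` are boto3-style clients (or stubs with the same methods).
    """
    findings = {}

    # Check 1: what instance ID does the Lambda see?
    value = ssm.get_parameter(Name=parameter_name)["Parameter"]["Value"].strip()
    findings["ssm_value"] = value

    # Check 2: instance state, skipped for the "none" sentinel.
    if value.lower() == "none":
        findings["instance_state"] = "not-checked (sentinel)"
    else:
        try:
            resp = ec2.describe_instances(InstanceIds=[value])
            instances = [inst
                         for res in resp.get("Reservations", [])
                         for inst in res.get("Instances", [])]
            findings["instance_state"] = (
                instances[0]["State"]["Name"] if instances else "not-found")
        except Exception as exc:  # e.g. InvalidInstanceID.NotFound
            findings["instance_state"] = f"error: {exc}"

    # Check 3: which AMI does the latest launch template version reference?
    versions = ec2.describe_launch_template_versions(
        LaunchTemplateId=launch_template_id, Versions=["$Latest"])
    ami = versions["LaunchTemplateVersions"][0]["LaunchTemplateData"].get("ImageId")
    findings["launch_template_ami"] = ami
    findings["ami_is_deleted_one"] = ami == DELETED_AMI
    return findings
```

With valid credentials it could be called as `diagnose(boto3.client("ssm", region_name="us-east-1"), boto3.client("ec2", region_name="us-east-1"), "/encache/gpu/instance_id", "lt-0cf39db4cd6dff510")`.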

Reproduction test: N/A — behavior requires live AWS environment; observable via CloudWatch.

Notes for PR

There are three issues:

Issue 1: PR #395 not merged (AMI fix)

The fix for the deleted AMI (ami-0cdd9a78901f19368 → ami-0e72acaa1863957cd) plus IAM role and user_data bootstrapping exists on origin/feature/fix-chat-gpu-startup (PR #395) but is NOT merged. The feature/chat-gpu-startup worktree still has the old broken launch template. Until PR #395 is merged and terraform apply is run, any new instance launch will fail with InvalidAMIID.NotFound.

Action: Merge PR #395. Then run:

cd main/devops && terraform apply

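Note: per the bug log on the PR branch, the PR #395 changes were already applied to AWS directly with terraform apply, so the live launch template may already be fixed. The git state is out of sync: the worktree TF does not reflect what was applied, and running terraform apply from the unmerged worktree would likely revert the launch template to the deleted AMI.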

Issue 2: Spot instance cannot be started with `start_instances`

The user reports the GPU is currently a spot instance. The _resolve_gpu_url function calls ec2.start_instances for "stopped" instances, but whether that helps depends on the spot request type:
- One-time spot instances go to "terminated" when interrupted by AWS; start_instances is invalid for terminated instances.
- Persistent spot instances can be "stopped" and restarted with start_instances.

If the spot instance was launched as one-time (the default via CLI: aws ec2 run-instances --instance-market-options MarketType=spot), it is terminated rather than stopped on interruption. For a short time after termination, describe_instances still returns the instance in "terminated" state. The code does NOT handle state == "terminated":

# Current code - missing terminated spot case:
if state in ("stopped", "stopping"):
    ...
    ec2.start_instances(InstanceIds=[instance_id])
    return None

ip = instance.get("PublicIpAddress") or instance.get("PrivateIpAddress")
if ip:
    return f"http://{ip}:{port}"
# Falls through to return None with no action for "terminated"

For terminated instances, the correct action is to launch a new instance from the launch template — but this requires the _launch_gpu_instance auto-launch code that was reverted in master at b12a8d49.
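One way to close the gap is to make the state handling explicit, so that every EC2 instance state maps to a named action. The sketch below is an assumption about how that could look, not the reverted `_launch_gpu_instance` code. `next_action` and its return values are hypothetical. It also treats "stopping" as "wait" rather than calling start_instances, on the assumption that start_instances is rejected while an instance is still stopping.

```python
def next_action(instance: dict) -> str:
    """Map one describe_instances instance record to an action.

    Returns one of: "use", "wait", "start", "relaunch".
    """
    state = instance["State"]["Name"]
    has_ip = bool(instance.get("PublicIpAddress") or instance.get("PrivateIpAddress"))

    if state == "running":
        # Running without an address yet: try again on the next request.
        return "use" if has_ip else "wait"
    if state in ("pending", "stopping"):
        # Transitional states: nothing useful to call yet.
        return "wait"
    if state == "stopped":
        # Only reachable for On-Demand or persistent spot instances.
        return "start"
    if state in ("shutting-down", "terminated"):
        # One-time spot after interruption: the instance ID is dead,
        # so a new instance must be launched from the launch template.
        return "relaunch"
    # Unknown state: be conservative.
    return "wait"
```

The "relaunch" branch is the one missing today. Acting on it still needs a run_instances path, a valid AMI (Issue 1), and a non-zero vCPU quota (Issue 3).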

Issue 3: vCPU Quota

The prior investigation (PR #395 bug log) confirmed the account has a vCPU quota of 0 for G-instances. This means even if the launch template is fixed (Issue 1) and auto-launch is re-added (Issue 2), run_instances will fail with VcpuLimitExceeded. A quota increase request must be submitted.
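The quota can be checked before attempting a launch. The sketch below assumes a boto3 `service-quotas` client is passed in; `quota_allows_launch` is a hypothetical helper, and the quota code is left as a caller-supplied argument because the report does not name it. Spot and On-Demand G-instance usage are, to our understanding, tracked under separate quotas, so the relevant code depends on how the instance is launched and should be looked up in the Service Quotas console.

```python
def quota_allows_launch(quotas_client, quota_code, requested_vcpus, in_use_vcpus=0):
    """Return (allowed, limit) for a planned launch.

    quotas_client: boto3 "service-quotas" client, or a stub with the same method.
    quota_code:    the EC2 quota code to check (not given in this report).
    """
    resp = quotas_client.get_service_quota(ServiceCode="ec2", QuotaCode=quota_code)
    limit = resp["Quota"]["Value"]  # applied quota value, expressed in vCPUs
    allowed = in_use_vcpus + requested_vcpus <= limit
    return allowed, limit
```

With a limit of 0, as reported in the prior investigation, this returns `(False, 0.0)` for any instance size, which matches the expectation that run_instances would be rejected.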

Recommended fix sequence

1. Merge PR #395 (fix AMI, add IAM profile, user_data).
2. Run terraform apply in main/devops.
3. If the spot instance is currently in "terminated" state, manually launch a spot instance from the fixed launch template, specifying --instance-market-options:
```
aws ec2 run-instances \
  --launch-template LaunchTemplateId=lt-0cf39db4cd6dff510,Version=2 \
  --instance-market-options MarketType=spot \
  --region us-east-1
```
   Note: this requires the vCPU quota for G instances to be > 0. Request an increase if needed.

4. Once the instance is running, update SSM:

   aws ssm put-parameter --name /encache/gpu/instance_id --value <new-instance-id> --type String --overwrite --region us-east-1

5. Redeploy Lambda (SAM) to pick up the new GPU_INSTANCE_ID.
6. For the start_instances-on-terminated-spot gap: add "terminated" to the state handling in _resolve_gpu_url to trigger re-launch from the template (requires re-adding auto-launch logic or a separate fix).
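The manual launch and the SSM update can also be scripted together. The sketch below is an illustration under stated assumptions, not part of PR #395: `launch_and_register` is a hypothetical helper that takes boto3-style EC2 and SSM clients. Unlike the CLI command in the steps above, it defaults to a persistent spot request with a "stop" interruption behavior, so that an interrupted instance ends up "stopped" and the existing start_instances path in _resolve_gpu_url can restart it.

```python
def launch_and_register(ec2, ssm, launch_template_id, version, parameter_name,
                        persistent=True):
    """Launch one spot instance from a launch template and record its ID in SSM."""
    market = {"MarketType": "spot"}
    if persistent:
        # Persistent request + stop-on-interruption: the instance can later be
        # restarted with start_instances instead of being terminated.
        market["SpotOptions"] = {
            "SpotInstanceType": "persistent",
            "InstanceInterruptionBehavior": "stop",
        }

    resp = ec2.run_instances(
        LaunchTemplate={"LaunchTemplateId": launch_template_id, "Version": version},
        InstanceMarketOptions=market,
        MinCount=1,
        MaxCount=1,
    )
    instance_id = resp["Instances"][0]["InstanceId"]

    # Block until EC2 reports the instance as running.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    # Record the new ID; the Lambda must still be redeployed to pick it up.
    ssm.put_parameter(Name=parameter_name, Value=instance_id,
                      Type="String", Overwrite=True)
    return instance_id
```

A persistent request stays open until it is cancelled, so terminating such an instance without cancelling the request can cause EC2 to launch a replacement. Pass `persistent=False` to reproduce the one-time behavior of the CLI command.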

Audit Log

| ID | Action | Note | Context |
| --- | --- | --- | --- |
| 1 | Create audit log | Initialize investigation for "GPU starting up" error | User report |
| 2 | Check existing bug files | Found prior bug `2026-04-28-chat-gpu-worker-starting-up-indefinitely.md` on PR #395 branch (not in local `docs/bugs/`) | `git show origin/feature/fix-chat-gpu-startup:docs/bugs/...` |
| 3 | Check open PRs | Found PR #395 (`fix(gpu): update launch template`), NOT merged; PR #399 (GPU caption queue), NOT merged | `gh pr list` |
| 4 | Read PR #395 body | Fix: AMI `ami-0cdd9a78901f19368` (deleted) → `ami-0e72acaa1863957cd` + IAM role + user_data; blocked on vCPU quota = 0 | `gh pr view 395` |
| 5 | Read current main.tf | Launch template still uses deleted AMI `ami-0cdd9a78901f19368`; no spot config; no IAM/user_data | `/home/lewibs/github/encache1/encache1-chat-gpu-startup/main/devops/main.tf` |
| 6 | Diff current vs PR #395 TF | PR adds IAM role, instance profile, updated AMI, user_data; none of this is in the working tree | `git diff feature/chat-gpu-startup..origin/feature/fix-chat-gpu-startup -- main/devops/main.tf` |
| 7 | Read chat app.py | `_resolve_gpu_url` handles "stopped"→`start_instances` but NOT "terminated"; no `run_instances` path in master (reverted at `b12a8d49`) | `main/server/api/memories/chat/app.py` |
| 8 | Identify spot instance gap | Spot instances cannot be restarted with `start_instances` if one-time; "terminated" state falls through to `return None` with no action | `app.py:49-57` |
| 9 | Check AWS credentials | SSO tokens expired; cannot query CloudWatch logs, EC2 state, or SSM values live | `aws sts get-caller-identity` → "Token has expired" |
| 10 | Confirm no duplicate | No local `docs/bugs/2026-04-28-*` file in `docs/bugs/`; creating new file for spot-specific investigation | `ls docs/bugs/` |

Verification

[ ] Reproduced failure before fix (blocked — AWS SSO tokens expired)
[ ] Reproduction test fails before fix (N/A — live AWS required)
[x] Root cause identified with evidence (deleted AMI in unmerged PR, spot instance terminated state not handled, vCPU quota = 0)
[ ] Fix applied at source
[ ] Reproduction test passes after fix
[ ] Regression test added/updated
[x] Verified no duplicate solved-bug log exists for same root cause in local docs/bugs/