Chat Returns "GPU Worker Is Starting Up" — Spot Instance + Unmerged AMI Fix
Metadata
- Date: 2026-05-06
- Status: investigating
- Severity: high
- Related issue/ticket: PR #395 (fix(gpu): update launch template to valid Deep Learning AMI with user_data and IAM profile)
- Owner: N/A
About
Overview: User receives "Chat is not available right now — the GPU worker is starting up. Please try again in a few minutes." on every chat request. This is the same terminal error path as 2026-04-28-chat-gpu-worker-starting-up-indefinitely.md (PR #395 branch), but with an additional complication: the user reports the GPU is currently running as a spot instance, and the launch template in Terraform does not configure a spot market — meaning the spot instance was launched manually outside Terraform, creating a mismatch.
Technical Questions:
- Is the GPU_INSTANCE_ID env var set to a real instance ID or "none"? Unknown. AWS SSO tokens are expired, so the live value cannot be queried.
- If it is set to a real instance ID, what state is that spot instance in? Spot instances can be "stopped" (persistent request) or "terminated" (one-time request). _resolve_gpu_url only handles "stopped" → start_instances. If the instance is "terminated", describe_instances either returns it in the "terminated" state (for a short period after termination) or raises InvalidInstanceID.NotFound once the record is gone; neither path triggers a re-launch.
- Is PR #395 merged? No. The fix (valid AMI, IAM role, user_data) exists only on origin/feature/fix-chat-gpu-startup and has NOT been merged into master or the feature/chat-gpu-startup worktree.
- What AMI does the current launch template use? ami-0cdd9a78901f19368 (encache-gpu-worker-v2-20260401). This AMI was deleted, so any run_instances attempt will fail with InvalidAMIID.NotFound.
- Does the Terraform launch template use spot? No. Neither the current TF nor PR #395 adds instance_market_options { market_type = "spot" }. The user's spot instance was launched manually.
- Can a spot instance be started with ec2.start_instances? Only if it belongs to a persistent spot request. One-time spot instances are terminated on interruption and cannot be restarted.
Resources:
- main/server/api/memories/chat/app.py: _resolve_gpu_url implementation (chat Lambda)
- main/devops/main.tf: aws_launch_template.gpu_worker (still has deleted AMI ami-0cdd9a78901f19368)
- main/server/layers/shared/python/shared/gpu_utils.py: normalize_gpu_instance_id
- PR #395 (fix branch origin/feature/fix-chat-gpu-startup): valid AMI + IAM + user_data, NOT merged
- CloudWatch: /aws/lambda/server-MemoriesChatFunction-OkmZYszwOXzJ (unable to access, SSO expired)
- SSM: /encache/gpu/instance_id (live value unknown, SSO expired)
- Prior bug file: docs/bugs/2026-04-28-chat-gpu-worker-starting-up-indefinitely.md (on PR #395 branch only, not in master)
Steps to cause failure
flowchart LR
A["User sends chat message"] --> B["MemoriesChatFunction invoked"]
B --> C["normalize_gpu_instance_id(GPU_INSTANCE_ID)"]
C --> D{"GPU_INSTANCE_ID value?"}
D -- "none sentinel" --> E["gpu_instance_id = None\n_resolve_gpu_url skipped"]
D -- "real instance ID" --> F["_resolve_gpu_url called"]
F --> G{"EC2 describe_instances"}
G -- "terminated spot or\nInvalidInstanceID" --> H["exception caught\nreturns None"]
G -- "stopped persistent spot" --> I["start_instances called\nreturns None (starting)"]
G -- "running, but no IP" --> J["ip = None\nreturns None"]
E --> K["gpu_worker_url = None"]
H --> K
I --> K
J --> K
K --> L["Fallback: 'GPU worker is starting up'"]

System
flowchart TD
SSM[("SSM\n/encache/gpu/instance_id\n= 'none' OR spot instance ID")] -->|"GPU_INSTANCE_ID env var"| CHAT["api/memories/chat/app.py\nMemoriesChatFunction"]
CHAT -->|"normalize → None or instance_id"| RESOLVE["_resolve_gpu_url(instance_id, port)"]
RESOLVE -->|"EC2 describe_instances"| EC2["AWS EC2"]
EC2 -->|"terminated / error / no IP"| RESOLVE
RESOLVE -->|"returns None"| CHAT
CHAT -->|"gpu_worker_url = None"| FB["'GPU worker is starting up'"]
LT["aws_launch_template.gpu_worker\nami-0cdd9a78901f19368 DELETED\nno spot config\nno IAM/user_data"] -.->|"Only used if Lambda\nhad auto-launch code\n(reverted at b12a8d49)"| EC2

Notes: The auto-launch code (_launch_gpu_instance) was introduced in PR #391 and reverted in master at b12a8d49. The currently deployed Lambda on master does NOT call run_instances; it only calls ec2.start_instances for stopped instances. If the instance is terminated (common for one-time spot), there is no re-launch path.
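The sentinel branch of the first flowchart can be sketched as below. This is a minimal illustration, not the contents of shared/gpu_utils.py: the function names are hypothetical, and the assumption that the sentinel is the case-insensitive string "none" (or an empty value) comes only from the flowchart labels above.

```python
from typing import Optional

# Assumption: the sentinel is the literal string "none" (any case) or an
# empty value. The real helper may accept other spellings.
_SENTINELS = {"", "none"}


def normalize_instance_id_sketch(raw: Optional[str]) -> Optional[str]:
    """Return a usable instance ID, or None when the value is unset or a sentinel."""
    if raw is None:
        return None
    value = raw.strip()
    if value.lower() in _SENTINELS:
        return None
    return value


def describe_chat_path(raw: Optional[str]) -> str:
    """Name the flowchart branch a given GPU_INSTANCE_ID value would take."""
    if normalize_instance_id_sketch(raw) is None:
        return "sentinel: _resolve_gpu_url skipped, fallback message returned"
    return "real ID: _resolve_gpu_url called"
```

With the sentinel in place the Lambda never reaches EC2 at all, so no instance state change can clear the fallback message until the parameter holds a real instance ID and the Lambda is redeployed.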
Reproduction Details
1. Confirm SSM: aws ssm get-parameter --name /encache/gpu/instance_id --region us-east-1
   - If "none" → GPU_INSTANCE_ID is the sentinel; _resolve_gpu_url is never called; the response is always "starting up".
   - If a real ID → proceed to step 2.
2. Check EC2 state: aws ec2 describe-instances --instance-ids <id> --region us-east-1 --query "Reservations[*].Instances[*].State"
   - If terminated → the one-time spot instance is gone. For a short period describe_instances still returns it in the "terminated" state; after that the call raises InvalidInstanceID.NotFound, which is caught. Either way _resolve_gpu_url returns None.
   - If stopped → start_instances is called; the function returns None; the message is shown until the instance is running.
   - If running → check that a public IP is present and /health returns 200.
3. Confirm launch template AMI: aws ec2 describe-launch-template-versions --launch-template-id lt-0cf39db4cd6dff510 --region us-east-1 --query "LaunchTemplateVersions[*].LaunchTemplateData.ImageId"
   - Expected to be ami-0cdd9a78901f19368 (deleted), given the un-merged PR state.
4. Send an authenticated POST /memories/chat → always returns the "starting up" fallback.
Reproduction test: N/A — behavior requires live AWS environment; observable via CloudWatch.
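Once credentials are refreshed, the SSM and EC2 checks above could be scripted roughly as follows. This is a sketch under assumptions: diagnose_gpu, aws_error_code, and the verdict strings are hypothetical names, and the clients are passed in so the function can be exercised with stand-ins. In real use they would be boto3.client("ssm") and boto3.client("ec2").

```python
from typing import Any, Dict, Optional


def aws_error_code(exc: Exception) -> Optional[str]:
    """Extract the AWS error code from a botocore ClientError-like exception."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code")


def _report(instance_id, state, ip, verdict) -> Dict[str, Optional[str]]:
    return {"instance_id": instance_id, "state": state, "ip": ip, "verdict": verdict}


def diagnose_gpu(ssm: Any, ec2: Any,
                 param_name: str = "/encache/gpu/instance_id") -> Dict[str, Optional[str]]:
    """Read the SSM value, then the EC2 state, and summarize what the Lambda would see."""
    value = ssm.get_parameter(Name=param_name)["Parameter"]["Value"].strip()
    if value.lower() == "none":
        return _report(None, None, None, "sentinel")

    try:
        reservations = ec2.describe_instances(InstanceIds=[value])["Reservations"]
    except Exception as exc:  # botocore.exceptions.ClientError in practice
        code = aws_error_code(exc)
        if code is None:
            raise  # not an AWS API error, so do not hide it
        return _report(value, None, None, code)

    instances = [inst for res in reservations for inst in res.get("Instances", [])]
    if not instances:
        return _report(value, None, None, "not-found")

    instance = instances[0]
    state = instance["State"]["Name"]
    ip = instance.get("PublicIpAddress") or instance.get("PrivateIpAddress")
    # Only a running instance with an address can serve chat; all else is the fallback.
    verdict = "check-health" if state == "running" and ip else "fallback"
    return _report(value, state, ip, verdict)
```

A "check-health" verdict still requires a manual request to the worker's /health endpoint; the sketch deliberately makes no network call to the instance itself.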
Notes for PR
There are three parallel issues:
Issue 1: PR #395 not merged (AMI fix)
The fix for the deleted AMI (ami-0cdd9a78901f19368 → ami-0e72acaa1863957cd) plus IAM role and user_data bootstrapping exists on origin/feature/fix-chat-gpu-startup (PR #395) but is NOT merged. The feature/chat-gpu-startup worktree still has the old broken launch template. Until PR #395 is merged and terraform apply is run, any new instance launch will fail with InvalidAMIID.NotFound.
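Action: Merge PR #395, then run terraform apply in main/devops.

Note: According to the bug log on the PR branch, the PR #395 changes were previously applied directly with terraform apply and are already live in AWS, but the git state is out of sync: the worktree TF does not reflect what was applied. The live launch template should therefore be verified (see the launch template check under Reproduction Details) rather than assumed to still reference the deleted AMI.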
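Because the git state and the live state may disagree, it can help to check which AMI the live launch template references and whether that AMI still exists. The sketch below is illustrative only: the function names are hypothetical, the ec2 argument is assumed to be a boto3 EC2 client, and checking the "$Default" template version is an assumption about which version a launch would use.

```python
from typing import Any, Optional, Tuple


def launch_template_ami(ec2: Any, launch_template_id: str,
                        version: str = "$Default") -> Optional[str]:
    """Return the ImageId configured on the given launch template version."""
    resp = ec2.describe_launch_template_versions(
        LaunchTemplateId=launch_template_id, Versions=[version]
    )
    versions = resp.get("LaunchTemplateVersions", [])
    if not versions:
        return None
    return versions[0].get("LaunchTemplateData", {}).get("ImageId")


def ami_is_available(ec2: Any, image_id: str) -> bool:
    """True only if EC2 reports the AMI in the 'available' state."""
    try:
        images = ec2.describe_images(ImageIds=[image_id]).get("Images", [])
    except Exception as exc:  # botocore.exceptions.ClientError in practice
        code = (getattr(exc, "response", None) or {}).get("Error", {}).get("Code", "")
        if code.startswith("InvalidAMIID"):
            return False  # deleted or never existed
        raise
    return any(img.get("State") == "available" for img in images)


def launch_template_ami_status(ec2: Any, launch_template_id: str) -> Tuple[Optional[str], bool]:
    """Return (ami_id, usable) for the template's default version."""
    ami = launch_template_ami(ec2, launch_template_id)
    return ami, bool(ami) and ami_is_available(ec2, ami)
```

If this reports the new AMI as usable, merging PR #395 mainly restores agreement between git and AWS; if it reports the old AMI, the apply is still required.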
Issue 2: Spot instance cannot be started with start_instances
The user reports the GPU is currently a spot instance. _resolve_gpu_url calls ec2.start_instances for "stopped" instances, but whether that helps depends on the spot request type:
- One-time spot instances go to "terminated" when interrupted by AWS; start_instances is invalid for terminated instances.
- Persistent spot instances can be "stopped" and restarted with start_instances.

If the spot instance was launched as one-time (the default for aws ec2 run-instances --instance-market-options MarketType=spot), it will show as "terminated" once interrupted. describe_instances on a recently terminated spot instance returns it in the "terminated" state. The code does NOT handle state == "terminated":
    # Current code - missing terminated spot case:
    if state in ("stopped", "stopping"):
        ...
        ec2.start_instances(InstanceIds=[instance_id])
        return None
    ip = instance.get("PublicIpAddress") or instance.get("PrivateIpAddress")
    if ip:
        return f"http://{ip}:{port}"
    # Falls through to return None with no action for "terminated"
For terminated instances, the correct action is to launch a new instance from the launch template — but this requires the _launch_gpu_instance auto-launch code that was reverted in master at b12a8d49.
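One possible shape for that handling is sketched below. It is not the reverted _launch_gpu_instance code: resolve_gpu_url_sketch and the relaunch callback are hypothetical, and the callback stands in for whatever launch path is eventually restored. The sketch also skips start_instances for "stopping" instances, on the assumption that EC2 rejects a start request until the instance has fully stopped.

```python
from typing import Any, Callable, Optional

# States from which the instance can never come back.
RELAUNCH_STATES = {"shutting-down", "terminated"}


def resolve_gpu_url_sketch(ec2: Any, instance_id: str, port: int,
                           relaunch: Callable[[], None]) -> Optional[str]:
    """Return the worker URL if it is reachable; otherwise trigger recovery and return None."""
    try:
        reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    except Exception as exc:  # botocore.exceptions.ClientError in practice
        code = (getattr(exc, "response", None) or {}).get("Error", {}).get("Code")
        if code == "InvalidInstanceID.NotFound":
            relaunch()  # the terminated instance has aged out of the API
            return None
        raise

    instances = [inst for res in reservations for inst in res.get("Instances", [])]
    if not instances:
        relaunch()
        return None

    instance = instances[0]
    state = instance["State"]["Name"]
    if state in RELAUNCH_STATES:
        relaunch()  # one-time spot: start_instances cannot help here
        return None
    if state == "stopped":
        ec2.start_instances(InstanceIds=[instance_id])  # persistent spot or on-demand
        return None
    if state != "running":
        return None  # pending or stopping: wait for the next request

    ip = instance.get("PublicIpAddress") or instance.get("PrivateIpAddress")
    return f"http://{ip}:{port}" if ip else None
```

A relaunch produces a new instance ID, so the callback would also need to record that ID somewhere the Lambda reads it and guard against launching again on every request while the replacement boots.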
Issue 3: vCPU Quota
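The prior investigation (PR #395 bug log) confirmed the account has a vCPU quota of 0 for G-instances. This means even if the launch template is fixed (Issue 1) and auto-launch is re-added (Issue 2), run_instances will fail with VcpuLimitExceeded. A quota increase request must be submitted. On-Demand and Spot G-instance quotas are tracked separately, so the request should cover the purchase option that will actually be used; a spot launch over quota is rejected with MaxSpotInstanceCountExceeded rather than VcpuLimitExceeded.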
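The quota can be read programmatically before attempting a launch. The following is a sketch, assuming a boto3 Service Quotas client is passed in as service_quotas; the helper names are hypothetical, and matching on the substring "G and VT" assumes the EC2 quota names for this instance family contain that text.

```python
from typing import Any, Dict


def g_instance_quotas(service_quotas: Any) -> Dict[str, float]:
    """Collect every EC2 quota whose name mentions the G and VT instance families."""
    found: Dict[str, float] = {}
    kwargs: Dict[str, Any] = {"ServiceCode": "ec2"}
    while True:
        page = service_quotas.list_service_quotas(**kwargs)
        for quota in page.get("Quotas", []):
            name = quota.get("QuotaName", "")
            if "G and VT" in name:
                found[name] = float(quota.get("Value", 0.0))
        token = page.get("NextToken")
        if not token:
            return found
        kwargs["NextToken"] = token  # follow pagination until exhausted


def can_launch(quota_vcpus: float, instance_vcpus: int, running_vcpus: int = 0) -> bool:
    """True if one more instance of the given size fits under the vCPU quota."""
    return running_vcpus + instance_vcpus <= quota_vcpus
```

For example, with a quota of 0 even a 4-vCPU instance does not fit, so can_launch(0, 4) is False; the instance size here is illustrative, since the log does not state which G instance type the launch template uses.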
Recommended fix sequence
1. Merge PR #395 (fix AMI, add IAM profile, user_data).
2. Run terraform apply in main/devops.
3. If the spot instance is currently in the "terminated" state: manually launch a spot instance using the fixed launch template, specifying spot instance market options. Note: this requires the vCPU quota for G instances to be > 0. Request an increase if needed.
4. Once the instance is running, update the SSM parameter /encache/gpu/instance_id with the new instance ID.
5. Redeploy the Lambda (SAM) to pick up the new GPU_INSTANCE_ID.
6. For the start_instances-on-terminated-spot gap: add "terminated" to the state handling in _resolve_gpu_url to trigger a re-launch from the template (requires re-adding auto-launch logic or a separate fix).
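The manual launch and SSM update in the sequence above could look roughly like this in boto3. It is a sketch, not the commands from the original log: the function names are hypothetical, the clients are assumed to be boto3 EC2 and SSM clients, and choosing a persistent request with stop-on-interruption is an assumption made so that the existing start_instances path can restart the instance later.

```python
from typing import Any


def launch_persistent_spot(ec2: Any, launch_template_id: str,
                           version: str = "$Default") -> str:
    """Launch one spot instance from the launch template and return its instance ID."""
    resp = ec2.run_instances(
        LaunchTemplate={"LaunchTemplateId": launch_template_id, "Version": version},
        MinCount=1,
        MaxCount=1,
        InstanceMarketOptions={
            "MarketType": "spot",
            "SpotOptions": {
                # Assumption: persistent + stop, so an interrupted instance ends
                # up "stopped" (restartable) instead of "terminated".
                "SpotInstanceType": "persistent",
                "InstanceInterruptionBehavior": "stop",
            },
        },
    )
    return resp["Instances"][0]["InstanceId"]


def register_instance(ssm: Any, instance_id: str,
                      param_name: str = "/encache/gpu/instance_id") -> None:
    """Point the SSM parameter at the new instance (assumes a plain String parameter)."""
    ssm.put_parameter(Name=param_name, Value=instance_id, Type="String", Overwrite=True)
```

One trade-off of a persistent request is cleanup: terminating the instance without cancelling the spot request lets EC2 launch a replacement, so the request has to be cancelled as well when the worker is retired.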
Audit Log
| ID | Action | Note | Context |
|---|---|---|---|
| 1 | Create audit log | Initialize investigation for "GPU starting up" error | User report |
| 2 | Check existing bug files | Found prior bug 2026-04-28-chat-gpu-worker-starting-up-indefinitely.md on PR #395 branch (not in local docs/bugs/) | git show origin/feature/fix-chat-gpu-startup:docs/bugs/... |
| 3 | Check open PRs | Found PR #395 (fix(gpu): update launch template) — NOT merged; PR #399 (GPU caption queue) — NOT merged | gh pr list |
| 4 | Read PR #395 body | Fix: AMI ami-0cdd9a78901f19368 (deleted) → ami-0e72acaa1863957cd + IAM role + user_data; blocked on vCPU quota = 0 | gh pr view 395 |
| 5 | Read current main.tf | Launch template still uses deleted AMI ami-0cdd9a78901f19368; no spot config; no IAM/user_data | /home/lewibs/github/encache1/encache1-chat-gpu-startup/main/devops/main.tf |
| 6 | Diff current vs PR#395 TF | PR adds IAM role, instance profile, updated AMI, user_data — none of this is in the working tree | git diff feature/chat-gpu-startup..origin/feature/fix-chat-gpu-startup -- main/devops/main.tf |
| 7 | Read chat app.py | _resolve_gpu_url handles "stopped"→start_instances but NOT "terminated"; no run_instances path in master (reverted at b12a8d49) | main/server/api/memories/chat/app.py |
| 8 | Identify spot instance gap | Spot instances cannot be restarted with start_instances if one-time; "terminated" state falls through to return None with no action | app.py:49-57 |
| 9 | Check AWS credentials | SSO tokens expired — cannot query CloudWatch logs, EC2 state, or SSM values live | aws sts get-caller-identity → "Token has expired" |
| 10 | Confirm no duplicate | No local docs/bugs/2026-04-28-* file in docs/bugs/; creating new file for spot-specific investigation | ls docs/bugs/ |
Verification
- [ ] Reproduced failure before fix (blocked — AWS SSO tokens expired)
- [ ] Reproduction test fails before fix (N/A — live AWS required)
- [x] Root cause identified with evidence (deleted AMI in unmerged PR, spot instance terminated state not handled, vCPU quota = 0)
- [ ] Fix applied at source
- [ ] Reproduction test passes after fix
- [ ] Regression test added/updated
- [x] Verified no duplicate solved-bug log exists for same root cause in local docs/bugs/