GPU Ingest Crashes — instance_id="none" Treated as Valid EC2 ID
Metadata
- Date: 2026-04-23
- Status: fixed
- Severity: medium
- Related issue/ticket: N/A
- Owner: N/A
About
Overview:
- The SSM parameter /encache/gpu/instance_id is set to the literal string "none" (a placeholder). Because "none" is truthy in Python, ingest_window.py passes it to _resolve_gpu_url and then to VLM2VecClient(instance_id="none"). The EC2 describe_instances call inside VLM2VecClient._start_or_create() raises InvalidInstanceID.Malformed. The outer try/except in the GPU enrichment block catches it, so segments are still saved, with processing_status=failed. However, two unnecessary EC2 API calls are made (one in _resolve_gpu_url, one in VLM2VecClient) and a misleading error is logged on every ingest.
- Additionally, no GPU EC2 instance exists in the account, so the GPU worker is offline.
Technical Questions:
- os.environ.get("GPU_INSTANCE_ID") returns "none", a non-empty string, so the guard if gpu_instance_id: passes and the GPU path is entered.
- The SSM sentinel value "none" was set as a placeholder when no GPU instance was running.
- The GPU enrichment is already wrapped in try/except, so segments are saved correctly. The fix avoids the unnecessary EC2 calls.
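The truthiness point can be checked in isolation. The snippet below is a minimal illustration, not code from ingest_window.py: the helper name gpu_path_entered is hypothetical and only mirrors the if gpu_instance_id: guard, with a plain dict standing in for os.environ.

```python
def gpu_path_entered(env: dict) -> bool:
    """Mirror the pre-fix guard: any non-empty string enters the GPU path.

    Hypothetical helper for illustration; `env` stands in for os.environ.
    """
    gpu_instance_id = env.get("GPU_INSTANCE_ID")
    return bool(gpu_instance_id)


# The placeholder is a non-empty string, so the guard passes.
print(gpu_path_entered({"GPU_INSTANCE_ID": "none"}))  # True
# An unset or empty variable is falsy, so the GPU path is not entered.
print(gpu_path_entered({}))                           # False
print(gpu_path_entered({"GPU_INSTANCE_ID": ""}))      # False
```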
Resources:
- main/server/worldmm/pipeline/ingest_window.py (GPU instance_id check)
- SSM: /encache/gpu/instance_id = "none"
- Lambda env: GPU_INSTANCE_ID = "none" (sourced from SSM)
- CloudWatch: /aws/lambda/server-IngestWindowFunction-44v8BXyEGwOz
Steps to cause failure
flowchart LR
A["IngestWindowFunction invoked"] --> B["gpu_instance_id = os.environ.get('GPU_INSTANCE_ID')"]
B --> C["gpu_instance_id = 'none' — truthy"]
C --> D["_resolve_gpu_url called with 'none'"]
D --> E["EC2 describe_instances 'none' — InvalidInstanceID.Malformed"]
E --> F["_resolve_gpu_url catches, returns 0.0.0.0"]
F --> G["VLM2VecClient(instance_id='none') created"]
G --> H["caption() calls _ensure_running()"]
H --> I["EC2 describe_instances 'none' again — uncaught inside VLM2VecClient"]
I --> J["Outer try/except catches, segment saved as failed"]
System
flowchart TD
SSM[("SSM\n/encache/gpu/instance_id = 'none'")] -->|env var GPU_INSTANCE_ID=none| IW["ingest_window.py"]
IW -->|"if gpu_instance_id: (truthy)"| RESOLVE["_resolve_gpu_url('none')"]
RESOLVE -->|EC2 error caught| IW
IW -->|"VLM2VecClient(instance_id='none')"| VLM["VLM2VecClient"]
VLM -->|"EC2 describe_instances('none') — crashes"| IW
IW -->|"outer except catches"| DB[("PostgreSQL\nprocessing_status=failed")]
Reproduction Details
- Set SSM /encache/gpu/instance_id to "none".
- Trigger any ingest window.
- Log shows gpu_url_resolve_failed then gpu_enrichment_failed with InvalidInstanceID.Malformed.
- Segment is saved with processing_status=failed.
Reproduction test: N/A — behavior is observable in CloudWatch; the fix is a one-line guard.
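Although no automated reproduction test was written, the failure sequence can be approximated offline with stubs. The sketch below is an assumption-laden model, not the real code: FakeEC2, FakeEC2Error, resolve_gpu_url, and ingest are hypothetical stand-ins for the EC2 client, its InvalidInstanceID.Malformed error, _resolve_gpu_url, and the pre-fix GPU block. The "enriched" status and the 10.0.0.1 address are placeholders.

```python
class FakeEC2Error(Exception):
    """Stand-in for the EC2 error InvalidInstanceID.Malformed."""


class FakeEC2:
    """Hypothetical stub: rejects any ID that does not start with 'i-'."""

    def __init__(self):
        self.calls = 0

    def describe_instances(self, instance_id: str):
        self.calls += 1
        if not instance_id.startswith("i-"):
            raise FakeEC2Error("InvalidInstanceID.Malformed")
        return {"state": "running"}


def resolve_gpu_url(ec2, instance_id, log):
    """Models _resolve_gpu_url: the EC2 error is caught, a fallback is returned."""
    try:
        ec2.describe_instances(instance_id)
        return "10.0.0.1"  # placeholder address for the happy path
    except FakeEC2Error:
        log.append("gpu_url_resolve_failed")
        return "0.0.0.0"


def ingest(env, ec2):
    """Models the pre-fix GPU block; returns (processing_status, log)."""
    log = []
    gpu_instance_id = env.get("GPU_INSTANCE_ID")
    try:
        if gpu_instance_id:  # "none" is truthy, so this branch runs
            resolve_gpu_url(ec2, gpu_instance_id, log)
            # Models the lookup inside VLM2VecClient: not caught locally.
            ec2.describe_instances(gpu_instance_id)
        return "enriched", log  # placeholder success status
    except Exception:
        # Models the outer try/except around GPU enrichment.
        log.append("gpu_enrichment_failed")
        return "failed", log


ec2 = FakeEC2()
status, log = ingest({"GPU_INSTANCE_ID": "none"}, ec2)
print(status, log, ec2.calls)
# prints: failed ['gpu_url_resolve_failed', 'gpu_enrichment_failed'] 2
```

The call counter makes the cost visible: two EC2 lookups happen before the segment is marked failed, matching the log order seen in CloudWatch.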
Notes for PR
Root cause: "none" is a truthy string. The guard if gpu_instance_id: does not exclude the sentinel value.
Fix: normalize the env var at read time — treat "none", "null", and empty string as absent:
_raw = os.environ.get("GPU_INSTANCE_ID", "")
gpu_instance_id = _raw if _raw and _raw.lower() not in ("none", "null") else None
When gpu_instance_id is None, _resolve_gpu_url is skipped entirely and VLM2VecClient is never instantiated with a bad ID. The GPU block instead raises RuntimeError("GPU not configured") immediately, so the outer except marks the segment failed cleanly without any EC2 API calls.
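The normalization and early failure described above can be sketched as two small helpers. This is an illustrative sketch, not the code as merged: the names read_gpu_instance_id and require_gpu are hypothetical, the env mapping stands in for os.environ, and the .strip() call is an extra hardening step that is not part of the original fix.

```python
from typing import Mapping, Optional

# Sentinel strings treated as "no GPU configured" (taken from the fix above).
_ABSENT = ("none", "null")


def read_gpu_instance_id(env: Mapping[str, str]) -> Optional[str]:
    """Return the configured instance ID, or None for empty/sentinel values.

    Hypothetical helper; the .strip() is an assumption beyond the original fix.
    """
    raw = env.get("GPU_INSTANCE_ID", "").strip()
    return raw if raw and raw.lower() not in _ABSENT else None


def require_gpu(gpu_instance_id: Optional[str]) -> str:
    """Fail fast, before any EC2 call, when no GPU is configured."""
    if gpu_instance_id is None:
        raise RuntimeError("GPU not configured")
    return gpu_instance_id


# Sentinels and empty values normalize to None; a real-looking ID passes through.
for value in ("none", "NULL", "", "i-0123456789abcdef0"):
    print(repr(value), "->", read_gpu_instance_id({"GPU_INSTANCE_ID": value}))
```

In the Lambda, os.environ would be passed as env. Keeping the normalization at read time means every later check can rely on a simple is None test rather than re-checking sentinel strings.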
Audit Log
| ID | Action | Note | Context |
|---|---|---|---|
| 1 | Create audit log | Initialize bug investigation | User reports ingest always fails with InvalidInstanceID.Malformed |
| 2 | Read CloudWatch logs | gpu_url_resolve_failed instance_id="none" then gpu_enrichment_failed on every ingest | /aws/lambda/server-IngestWindowFunction-44v8BXyEGwOz |
| 3 | Check SSM | /encache/gpu/instance_id = "none" — literal sentinel string | aws ssm get-parameters-by-path --path /encache/gpu |
| 4 | Check Lambda env | GPU_INSTANCE_ID = "none" sourced from SSM | aws lambda get-function-configuration |
| 5 | Check EC2 | No GPU instances running in account | aws ec2 describe-instances --filters Name=tag:Name,Values=*gpu* |
| 6 | Identify root cause | "none" is truthy — GPU path entered, EC2 called twice with invalid ID | ingest_window.py GPU block |
| 7 | Apply fix | Normalize GPU_INSTANCE_ID at read time; raise early RuntimeError when no GPU configured | ingest_window.py |
Verification
- [x] Reproduced failure before fix (CloudWatch confirms on every ingest)
- [x] Root cause identified with evidence (SSM = "none", truthy guard)
- [x] Fix applied at source (read-time normalization)
- [ ] Reproduction test passes after fix (N/A — requires live AWS environment)
- [x] Regression test added (N/A: one-line guard, covered by existing test_ingest_segment_first.py GPU failure path)
- [x] Verified no duplicate solved-bug log exists for same root cause