GPU Ingest Crashes — `instance_id="none"` Treated as Valid EC2 ID

Metadata

Date: 2026-04-23
Status: fixed
Severity: medium
Related issue/ticket: N/A
Owner: N/A

About

Overview:
- The SSM parameter `/encache/gpu/instance_id` is set to the literal string `"none"` (a placeholder). Because `"none"` is truthy in Python, `ingest_window.py` passes it to `_resolve_gpu_url` and then to `VLM2VecClient(instance_id="none")`.
- The EC2 `describe_instances` call inside `VLM2VecClient._start_or_create()` raises `InvalidInstanceID.Malformed`.
- The outer try/except in the GPU enrichment block catches the error, so segments are still saved with `processing_status=failed`. However, every ingest makes unnecessary EC2 API calls with an invalid ID and logs a misleading error.
- Additionally, no GPU EC2 instance exists in the account; the GPU worker is offline.

Technical Questions:
- `os.environ.get("GPU_INSTANCE_ID")` returns `"none"`, a non-empty string, so the check `if gpu_instance_id:` passes and the GPU path is entered.
- The SSM sentinel value `"none"` was set as a placeholder when no GPU instance was running.
- The GPU enrichment is already wrapped in try/except, so segments are saved correctly. The fix avoids the unnecessary EC2 call.

Resources:
- `main/server/worldmm/pipeline/ingest_window.py` (GPU `instance_id` check)
- SSM: `/encache/gpu/instance_id = "none"`
- Lambda env: `GPU_INSTANCE_ID = "none"` (sourced from SSM)
- CloudWatch: `/aws/lambda/server-IngestWindowFunction-44v8BXyEGwOz`

Steps to cause failure

flowchart LR
    A["IngestWindowFunction invoked"] --> B["gpu_instance_id = os.environ.get(GPU_INSTANCE_ID)"]
    B --> C["gpu_instance_id = 'none' — truthy"]
    C --> D["_resolve_gpu_url called with 'none'"]
    D --> E["EC2 describe_instances 'none' — InvalidInstanceID.Malformed"]
    E --> F["_resolve_gpu_url catches, returns 0.0.0.0"]
    F --> G["VLM2VecClient(instance_id='none') created"]
    G --> H["caption() calls _ensure_running()"]
    H --> I["EC2 describe_instances 'none' again — uncaught inside VLM2VecClient"]
    I --> J["Outer try/except catches — segment saved as failed"]

System

flowchart TD
    SSM[("SSM\n/encache/gpu/instance_id = 'none'")] -->|env var GPU_INSTANCE_ID=none| IW["ingest_window.py"]
    IW -->|"if gpu_instance_id: (truthy)"| RESOLVE["_resolve_gpu_url('none')"]
    RESOLVE -->|EC2 error caught| IW
    IW -->|"VLM2VecClient(instance_id='none')"| VLM["VLM2VecClient"]
    VLM -->|"EC2 describe_instances('none') — crashes"| IW
    IW -->|"outer except catches"| DB[("PostgreSQL\nprocessing_status=failed")]

Reproduction Details

Set SSM /encache/gpu/instance_id to "none".
Trigger any ingest window.
Log shows gpu_url_resolve_failed then gpu_enrichment_failed with InvalidInstanceID.Malformed.
Segment is saved with processing_status=failed.

Reproduction test: N/A — behavior is observable in CloudWatch; the fix is a one-line guard.
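
The sequence above can also be approximated offline. The following is a rough sketch only: every function and class in it is a hypothetical stand-in (the real `_resolve_gpu_url` and `VLM2VecClient` code is not reproduced here), and the EC2 call is replaced by a stub that rejects any ID not starting with `i-`. It is meant to show why the two log events appear in that order, not to replace a test against the live environment.

```python
class FakeMalformedInstanceId(Exception):
    """Stand-in for the EC2 InvalidInstanceID.Malformed error."""


def fake_describe_instances(instance_id):
    # Hypothetical stub: rejects anything that does not start with "i-".
    if not instance_id.startswith("i-"):
        raise FakeMalformedInstanceId(instance_id)


def fake_resolve_gpu_url(instance_id, events):
    # Mirrors the documented behavior: catch the error, log it,
    # and fall back to 0.0.0.0.
    try:
        fake_describe_instances(instance_id)
    except FakeMalformedInstanceId:
        events.append("gpu_url_resolve_failed")
        return "0.0.0.0"
    return "10.0.0.1"  # placeholder address for a valid instance


def ingest_segment(env):
    events = []
    status = None  # the no-GPU path is not modelled in this sketch
    gpu_instance_id = env.get("GPU_INSTANCE_ID")
    try:
        if gpu_instance_id:  # pre-fix guard: "none" passes
            fake_resolve_gpu_url(gpu_instance_id, events)
            # Second lookup, standing in for the one made inside the client.
            fake_describe_instances(gpu_instance_id)
            status = "done"  # placeholder for a successful enrichment
    except Exception:
        # Stand-in for the outer try/except around GPU enrichment.
        events.append("gpu_enrichment_failed")
        status = "failed"
    return status, events


status, events = ingest_segment({"GPU_INSTANCE_ID": "none"})
print(status, events)
```

With the sentinel value, the stub is hit twice: once inside the resolver, where the error is swallowed, and once more on the client path, where it propagates to the outer handler.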

Notes for PR

Root cause: "none" is a truthy string. The guard if gpu_instance_id: does not exclude the sentinel value.
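
To make the truthiness point concrete, here is a small illustration of which candidate values pass a bare `if value:` check. The helper name and the sample instance ID are made up for the example; only the guard expression itself comes from the report.

```python
# Candidate values for GPU_INSTANCE_ID; the instance ID is a made-up example.
candidates = [None, "", "none", "null", "i-0123456789abcdef0"]


def passes_truthy_guard(value):
    """Mirror of the original check: `if gpu_instance_id:`."""
    return bool(value)


results = {repr(v): passes_truthy_guard(v) for v in candidates}
for value, entered in results.items():
    print(f"{value:>24} -> GPU path entered: {entered}")
```

Only `None` and the empty string are rejected; any non-empty string, including a placeholder such as `"none"`, is treated as a configured instance ID.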

Fix: normalize the env var at read time — treat "none", "null", and empty string as absent:

_raw = os.environ.get("GPU_INSTANCE_ID", "")
gpu_instance_id = _raw if _raw and _raw.lower() not in ("none", "null") else None

When gpu_instance_id is None, _resolve_gpu_url is skipped entirely, VLM2VecClient is never instantiated with a bad ID, and the GPU block raises a RuntimeError("GPU not configured") immediately so the outer except marks the segment failed cleanly without any EC2 API calls.
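
A minimal sketch of how the normalization and the early failure might fit together is shown below. The function names `read_gpu_instance_id` and `enrich_segment`, the `ec2_calls` list, and the `"enriched"` status are assumptions made for illustration; only the normalization expression and the `RuntimeError("GPU not configured")` message are taken from the fix described above.

```python
import os

_SENTINELS = ("none", "null")


def read_gpu_instance_id(environ=os.environ):
    """Return the configured instance ID, or None for empty/sentinel values."""
    raw = environ.get("GPU_INSTANCE_ID", "")
    return raw if raw and raw.lower() not in _SENTINELS else None


def enrich_segment(environ, ec2_calls):
    """Hypothetical GPU block; returns the resulting processing status."""
    try:
        gpu_instance_id = read_gpu_instance_id(environ)
        if gpu_instance_id is None:
            # Fail fast, before any EC2 lookup is attempted.
            raise RuntimeError("GPU not configured")
        ec2_calls.append(gpu_instance_id)  # stand-in for the EC2 lookup
        return "enriched"  # placeholder status for the success path
    except Exception:
        # Stand-in for the outer try/except that marks the segment failed.
        return "failed"


calls = []
print(enrich_segment({"GPU_INSTANCE_ID": "none"}, calls), calls)
```

Because the comparison lowercases the raw value, variants such as `"None"` or `"NULL"` are also treated as absent, and the list of recorded EC2 lookups stays empty in each of those cases.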

Audit Log

ID	Action	Note	Context
1	Create audit log	Initialize bug investigation	User reports ingest always fails with `InvalidInstanceID.Malformed`
2	Read CloudWatch logs	`gpu_url_resolve_failed instance_id="none"` then `gpu_enrichment_failed` on every ingest	`/aws/lambda/server-IngestWindowFunction-44v8BXyEGwOz`
3	Check SSM	`/encache/gpu/instance_id = "none"` — literal sentinel string	`aws ssm get-parameters-by-path --path /encache/gpu`
4	Check Lambda env	`GPU_INSTANCE_ID = "none"` sourced from SSM	`aws lambda get-function-configuration`
5	Check EC2	No GPU instances running in account	`aws ec2 describe-instances --filters Name=tag:Name,Values=gpu`
6	Identify root cause	`"none"` is truthy — GPU path entered, EC2 called twice with invalid ID	`ingest_window.py` GPU block
7	Apply fix	Normalize `GPU_INSTANCE_ID` at read time; raise early `RuntimeError` when no GPU configured	`ingest_window.py`

Verification

[x] Reproduced failure before fix (CloudWatch confirms on every ingest)
[x] Root cause identified with evidence (SSM = "none", truthy guard)
[x] Fix applied at source (read-time normalization)
[ ] Reproduction test passes after fix (N/A — requires live AWS environment)
[x] Regression test added (N/A — one-line guard, covered by existing test_ingest_segment_first.py GPU failure path)
[x] Verified no duplicate solved-bug log exists for same root cause
