Skip to content

GPU Ingest Crashes — instance_id="none" Treated as Valid EC2 ID

Metadata

  • Date: 2026-04-23
  • Status: fixed
  • Severity: medium
  • Related issue/ticket: N/A
  • Owner: N/A

About

Overview: - The SSM parameter /encache/gpu/instance_id is set to the literal string "none" (a placeholder). Because "none" is truthy in Python, ingest_window.py passes it to _resolve_gpu_url and then to VLM2VecClient(instance_id="none"). The EC2 describe_instances call inside VLM2VecClient._start_or_create() raises InvalidInstanceID.Malformed. The outer try/except in the GPU enrichment block catches it, so segments are still saved as processing_status=failed — but an unnecessary EC2 API call is made and a misleading error is logged on every ingest. - Additionally, no GPU EC2 instance exists in the account — the GPU worker is offline.

Technical Questions: - os.environ.get("GPU_INSTANCE_ID") returns "none" — a non-empty string — so if gpu_instance_id: is True and the GPU path is entered. - The SSM sentinel value "none" was set as a placeholder when no GPU instance was running. - The GPU enrichment is already wrapped in try/except, so segments are saved correctly. The fix avoids the unnecessary EC2 call.

Resources: - main/server/worldmm/pipeline/ingest_window.py — GPU instance_id check - SSM: /encache/gpu/instance_id = "none" - Lambda env: GPU_INSTANCE_ID = "none" (sourced from SSM) - CloudWatch: /aws/lambda/server-IngestWindowFunction-44v8BXyEGwOz

Steps to cause failure

flowchart LR
    A["IngestWindowFunction invoked"] --> B["gpu_instance_id = os.environ.get(GPU_INSTANCE_ID)"]
    B --> C["gpu_instance_id = 'none' — truthy"]
    C --> D["_resolve_gpu_url called with 'none'"]
    D --> E["EC2 describe_instances 'none' — InvalidInstanceID.Malformed"]
    E --> F["_resolve_gpu_url catches, returns 0.0.0.0"]
    F --> G["VLM2VecClient(instance_id='none') created"]
    G --> H["caption() calls _ensure_running()"]
    H --> I["EC2 describe_instances 'none' again — uncaught inside VLM2VecClient"]
    I --> J["Outer try/except catches — segment saved as failed"]

System

flowchart TD
    SSM[("SSM\n/encache/gpu/instance_id = 'none'")] -->|env var GPU_INSTANCE_ID=none| IW["ingest_window.py"]
    IW -->|"if gpu_instance_id: (truthy)"| RESOLVE["_resolve_gpu_url('none')"]
    RESOLVE -->|EC2 error caught| IW
    IW -->|"VLM2VecClient(instance_id='none')"| VLM["VLM2VecClient"]
    VLM -->|"EC2 describe_instances('none') — crashes"| IW
    IW -->|"outer except catches"| DB[("PostgreSQL\nprocessing_status=failed")]

Reproduction Details

  1. Set SSM /encache/gpu/instance_id to "none".
  2. Trigger any ingest window.
  3. Log shows gpu_url_resolve_failed then gpu_enrichment_failed with InvalidInstanceID.Malformed.
  4. Segment is saved with processing_status=failed.

Reproduction test: N/A — behavior is observable in CloudWatch; the fix is a one-line guard.

Notes for PR

Root cause: "none" is a truthy string. The guard if gpu_instance_id: does not exclude the sentinel value.

Fix: normalize the env var at read time — treat "none", "null", and empty string as absent:

_raw = os.environ.get("GPU_INSTANCE_ID", "")
gpu_instance_id = _raw if _raw and _raw.lower() not in ("none", "null") else None

When gpu_instance_id is None, _resolve_gpu_url is skipped entirely, VLM2VecClient is never instantiated with a bad ID, and the GPU block raises a RuntimeError("GPU not configured") immediately so the outer except marks the segment failed cleanly without any EC2 API calls.

Audit Log

ID Action Note Context
1 Create audit log Initialize bug investigation User reports ingest always fails with InvalidInstanceID.Malformed
2 Read CloudWatch logs gpu_url_resolve_failed instance_id="none" then gpu_enrichment_failed on every ingest /aws/lambda/server-IngestWindowFunction-44v8BXyEGwOz
3 Check SSM /encache/gpu/instance_id = "none" — literal sentinel string aws ssm get-parameters-by-path --path /encache/gpu
4 Check Lambda env GPU_INSTANCE_ID = "none" sourced from SSM aws lambda get-function-configuration
5 Check EC2 No GPU instances running in account aws ec2 describe-instances --filters Name=tag:Name,Values=*gpu*
6 Identify root cause "none" is truthy — GPU path entered, EC2 called twice with invalid ID ingest_window.py GPU block
7 Apply fix Normalize GPU_INSTANCE_ID at read time; raise early RuntimeError when no GPU configured ingest_window.py

Verification

  • [x] Reproduced failure before fix (CloudWatch confirms on every ingest)
  • [x] Root cause identified with evidence (SSM = "none", truthy guard)
  • [x] Fix applied at source (read-time normalization)
  • [ ] Reproduction test passes after fix (N/A — requires live AWS environment)
  • [x] Regression test added (N/A — one-line guard, covered by existing test_ingest_segment_first.py GPU failure path)
  • [x] Verified no duplicate solved-bug log exists for same root cause