GPU Worker (VLM2Vec + Qwen2-VL)

Metadata

System type: service

System Intent

What this is: A FastAPI server that runs on an EC2 GPU instance and exposes two models — a visual embedding model for semantic video search and a captioning/generation model for scene description, NER, and triple extraction. It is the GPU-side counterpart to VLM2VecClient and GPULLMClient.

Mermaid Diagram

flowchart TD
  Caller["API Lambda / Ingest Lambda"] -->|POST /encode-video| EV["encode_video()"]
  Caller -->|POST /encode-text| ET["encode_text()"]
  Caller -->|POST /caption| CAP["caption()"]
  Caller -->|POST /generate| GEN["generate()"]
  Caller -->|GET /health| H["health()"]

  EV --> EM["_embed_model\nQwen2-VL-2B-Instruct\n+ TIGER-Lab/VLM2Vec-Qwen2VL-2B LoRA"]
  ET --> EM
  CAP --> CM["_caption_model\nQwen/Qwen2-VL-7B-Instruct"]
  GEN --> CM

  EM -->|1536-dim float vector| Out1["EmbeddingResponse"]
  CM -->|text| Out2["CaptionResponse / GenerateResponse"]

Flows

Flow: `load_models`

Core files: main/server/worldmm/gpu_worker/server.py

Types

(no HTTP types — called internally at startup)

Paths

path	input	output	path-type	notes
`load_models.success`	—	models in globals	`happy path`	logs GPU memory used
`load_models.cuda-oom`	—	OOM error	`error`	requires g5.xlarge (24 GB VRAM)

Pseudocode

load Qwen/Qwen2-VL-7B-Instruct → _caption_model (fp16, device_map=auto)
load Qwen/Qwen2-VL-2B-Instruct base → base_2b (fp16, device_map=auto)
wrap base_2b with PeftModel.from_pretrained(base_2b, "TIGER-Lab/VLM2Vec-Qwen2VL-2B") → _embed_model
  NOTE: adapter must be VLM2Vec-Qwen2VL-2B (Qwen2-VL-compatible).
        VLM2Vec-LoRA targets Phi-3.5 and must NOT be used here (see bug #403).
load AutoProcessor for 7B → _processor
load AutoProcessor for 2B → _embed_processor
start keepalive thread during load to prevent idle-shutdown

Flow: `encode_video`

Core files: main/server/worldmm/gpu_worker/server.py

Types

EncodeVideoRequest {
  frames: list[string]  (base64-encoded JPEG)
}

EmbeddingResponse {
  embedding: list[float]  (1536-dim, last hidden state of final token)
}

Paths

path	input	output	path-type	notes
`encode_video.success`	`EncodeVideoRequest`	`EmbeddingResponse`	`happy path`	prompt: "Represent the given video clip."
`encode_video.not-loaded`	`EncodeVideoRequest`	`HTTP 503`	`error`	model not yet loaded
`encode_video.no-frames`	`EncodeVideoRequest`	`HTTP 400`	`error`	empty frames list

Flow: `encode_text`

Core files: main/server/worldmm/gpu_worker/server.py

Types

EncodeTextRequest {
  text: string
}

EmbeddingResponse {
  embedding: list[float]  (1536-dim)
}

Paths

path	input	output	path-type	notes
`encode_text.success`	`EncodeTextRequest`	`EmbeddingResponse`	`happy path`	prefix: "Represent the given query for retrieving relevant video clips: {text}"
`encode_text.not-loaded`	`EncodeTextRequest`	`HTTP 503`	`error`
`encode_text.no-text`	`EncodeTextRequest`	`HTTP 400`	`error`

Flow: `caption`

Core files: main/server/worldmm/gpu_worker/server.py

Types

CaptionRequest {
  frames: list[string]  (base64-encoded JPEG)
  transcript: string    (optional, audio transcript for context)
}

CaptionResponse {
  caption: string
}

Paths

path	input	output	path-type	notes
`caption.success`	`CaptionRequest`	`CaptionResponse`	`happy path`	max 512 new tokens
`caption.not-loaded`	`CaptionRequest`	`HTTP 503`	`error`
`caption.no-frames`	`CaptionRequest`	`HTTP 400`	`error`

Flow: `generate`

Core files: main/server/worldmm/gpu_worker/server.py

Types

GenerateRequest {
  messages: list[dict]    (OpenAI-style chat messages)
  max_new_tokens: int     (default 512)
}

GenerateResponse {
  text: string
}

Paths

path	input	output	path-type	notes
`generate.success`	`GenerateRequest`	`GenerateResponse`	`happy path`	used for NER, triple extraction
`generate.not-loaded`	`GenerateRequest`	`HTTP 503`	`error`

Models

Role	Base model	Adapter	Processor
Embedding (`_embed_model`)	`Qwen/Qwen2-VL-2B-Instruct`	`TIGER-Lab/VLM2Vec-Qwen2VL-2B` (PEFT LoRA)	`Qwen/Qwen2-VL-2B-Instruct`
Captioning/generation (`_caption_model`)	`Qwen/Qwen2-VL-7B-Instruct`	none	`Qwen/Qwen2-VL-7B-Instruct`

The embedding LoRA must be VLM2Vec-Qwen2VL-2B. The adapter VLM2Vec-LoRA targets Phi-3.5 and is incompatible with Qwen2-VL-2B-Instruct (issue #403).

Idle Watchdog

The server tracks the last-activity timestamp in /tmp/vlm2vec_last_activity. A background thread checks every 30 seconds; if idle for more than 480 seconds the instance calls sudo shutdown -h now. Activity is touched on every request and during model loading.

Known SSM write gap: The watchdog shuts the instance down without updating the SSM parameter /encache/gpu/instance_id. Callers (chat Lambda, ingest Lambda) that read the stale ID will address traffic to the terminated instance. Both Lambdas include a tag-based fallback — they scan EC2 for a running instance tagged Name=encache-gpu-worker and update SSM when they find one — so the stale ID is corrected on the first invocation that hits this path. See memories-chat.md and ingest-window.md for the full recovery logic.

Logs

Source	Location
GPU worker stdout	systemd journal on the EC2 instance (`journalctl -u vlm2vec -f`)

Deployment

Mechanism: EC2 (g5.xlarge, Deep Learning AMI — PyTorch 2.8, Amazon Linux 2023)
AMI: ami-0e72acaa1863957cd

Deploy command:

# Upload server code to S3, then launch from launch template or run user data manually:
aws s3 cp main/server/worldmm/gpu_worker/server.py s3://encache-raw-memory/gpu-worker/server.py
# EC2 user data at main/server/worldmm/gpu_worker/ec2_user_data.sh installs deps and starts the vlm2vec systemd service automatically.

Notes: Runs on port 8000. Callers resolve the instance IP via EC2 describe_instances using GPU_INSTANCE_ID (or the tag-based fallback if SSM is stale). Private IP is always used — both callers are in the same VPC and must not route through the NAT gateway. The instance is started on demand by the API Lambda if stopped; a new instance is launched from the launch template if terminated.

GPU Worker (VLM2Vec + Qwen2-VL)

Metadata

System Intent

Mermaid Diagram

Flows

Flow: load_models

Types

Paths

Pseudocode

Flow: encode_video

Types

Paths

Flow: encode_text

Types

Paths

Flow: caption

Types

Paths

Flow: generate

Types

Paths

Models

Idle Watchdog

Logs

Deployment

Flow: `load_models`

Flow: `encode_video`

Flow: `encode_text`

Flow: `caption`

Flow: `generate`