GPU Worker (VLM2Vec + Qwen2-VL)
System Intent
- What this is: A FastAPI server that runs on an EC2 GPU instance and exposes two models — a visual embedding model for semantic video search and a captioning/generation model for scene description, NER, and triple extraction. It is the GPU-side counterpart to
VLM2VecClient and GPULLMClient.
Mermaid Diagram
flowchart TD
Caller["API Lambda / Ingest Lambda"] -->|POST /encode-video| EV["encode_video()"]
Caller -->|POST /encode-text| ET["encode_text()"]
Caller -->|POST /caption| CAP["caption()"]
Caller -->|POST /generate| GEN["generate()"]
Caller -->|GET /health| H["health()"]
EV --> EM["_embed_model\nQwen2-VL-2B-Instruct\n+ TIGER-Lab/VLM2Vec-Qwen2VL-2B LoRA"]
ET --> EM
CAP --> CM["_caption_model\nQwen/Qwen2-VL-7B-Instruct"]
GEN --> CM
EM -->|1536-dim float vector| Out1["EmbeddingResponse"]
CM -->|text| Out2["CaptionResponse / GenerateResponse"]
Flows
Flow: load_models
- Core files:
main/server/worldmm/gpu_worker/server.py
Types
(no HTTP types — called internally at startup)
Paths
| path | input | output | path-type | notes |
load_models.success | — | models in globals | happy path | logs GPU memory used |
load_models.cuda-oom | — | OOM error | error | requires g5.xlarge (24 GB VRAM) |
Pseudocode
load Qwen/Qwen2-VL-7B-Instruct → _caption_model (fp16, device_map=auto)
load Qwen/Qwen2-VL-2B-Instruct base → base_2b (fp16, device_map=auto)
wrap base_2b with PeftModel.from_pretrained(base_2b, "TIGER-Lab/VLM2Vec-Qwen2VL-2B") → _embed_model
NOTE: adapter must be VLM2Vec-Qwen2VL-2B (Qwen2-VL-compatible).
VLM2Vec-LoRA targets Phi-3.5 and must NOT be used here (see bug #403).
load AutoProcessor for 7B → _processor
load AutoProcessor for 2B → _embed_processor
start keepalive thread during load to prevent idle-shutdown
Flow: encode_video
- Core files:
main/server/worldmm/gpu_worker/server.py
Types
EncodeVideoRequest {
frames: list[string] (base64-encoded JPEG)
}
EmbeddingResponse {
embedding: list[float] (1536-dim, last hidden state of final token)
}
Paths
| path | input | output | path-type | notes |
encode_video.success | EncodeVideoRequest | EmbeddingResponse | happy path | prompt: "Represent the given video clip." |
encode_video.not-loaded | EncodeVideoRequest | HTTP 503 | error | model not yet loaded |
encode_video.no-frames | EncodeVideoRequest | HTTP 400 | error | empty frames list |
Flow: encode_text
- Core files:
main/server/worldmm/gpu_worker/server.py
Types
EncodeTextRequest {
text: string
}
EmbeddingResponse {
embedding: list[float] (1536-dim)
}
Paths
| path | input | output | path-type | notes |
encode_text.success | EncodeTextRequest | EmbeddingResponse | happy path | prefix: "Represent the given query for retrieving relevant video clips: {text}" |
encode_text.not-loaded | EncodeTextRequest | HTTP 503 | error | |
encode_text.no-text | EncodeTextRequest | HTTP 400 | error | |
Flow: caption
- Core files:
main/server/worldmm/gpu_worker/server.py
Types
CaptionRequest {
frames: list[string] (base64-encoded JPEG)
transcript: string (optional, audio transcript for context)
}
CaptionResponse {
caption: string
}
Paths
| path | input | output | path-type | notes |
caption.success | CaptionRequest | CaptionResponse | happy path | max 512 new tokens |
caption.not-loaded | CaptionRequest | HTTP 503 | error | |
caption.no-frames | CaptionRequest | HTTP 400 | error | |
Flow: generate
- Core files:
main/server/worldmm/gpu_worker/server.py
Types
GenerateRequest {
messages: list[dict] (OpenAI-style chat messages)
max_new_tokens: int (default 512)
}
GenerateResponse {
text: string
}
Paths
| path | input | output | path-type | notes |
generate.success | GenerateRequest | GenerateResponse | happy path | used for NER, triple extraction |
generate.not-loaded | GenerateRequest | HTTP 503 | error | |
Models
| Role | Base model | Adapter | Processor |
Embedding (_embed_model) | Qwen/Qwen2-VL-2B-Instruct | TIGER-Lab/VLM2Vec-Qwen2VL-2B (PEFT LoRA) | Qwen/Qwen2-VL-2B-Instruct |
Captioning/generation (_caption_model) | Qwen/Qwen2-VL-7B-Instruct | none | Qwen/Qwen2-VL-7B-Instruct |
The embedding LoRA must be VLM2Vec-Qwen2VL-2B. The adapter VLM2Vec-LoRA targets Phi-3.5 and is incompatible with Qwen2-VL-2B-Instruct (issue #403).
Idle Watchdog
The server tracks the last-activity timestamp in /tmp/vlm2vec_last_activity. A background thread checks every 30 seconds; if idle for more than 480 seconds the instance calls sudo shutdown -h now. Activity is touched on every request and during model loading.
Known SSM write gap: The watchdog shuts the instance down without updating the SSM parameter /encache/gpu/instance_id. Callers (chat Lambda, ingest Lambda) that read the stale ID will address traffic to the terminated instance. Both Lambdas include a tag-based fallback — they scan EC2 for a running instance tagged Name=encache-gpu-worker and update SSM when they find one — so the stale ID is corrected on the first invocation that hits this path. See memories-chat.md and ingest-window.md for the full recovery logic.
Logs
| Source | Location |
| GPU worker stdout | systemd journal on the EC2 instance (journalctl -u vlm2vec -f) |
Deployment
- Mechanism: EC2 (g5.xlarge, Deep Learning AMI — PyTorch 2.8, Amazon Linux 2023)
- AMI:
ami-0e72acaa1863957cd - Deploy command:
# Upload server code to S3, then launch from launch template or run user data manually:
aws s3 cp main/server/worldmm/gpu_worker/server.py s3://encache-raw-memory/gpu-worker/server.py
# EC2 user data at main/server/worldmm/gpu_worker/ec2_user_data.sh installs deps and starts the vlm2vec systemd service automatically.
- Notes: Runs on port 8000. Callers resolve the instance IP via EC2
describe_instances using GPU_INSTANCE_ID (or the tag-based fallback if SSM is stale). Private IP is always used — both callers are in the same VPC and must not route through the NAT gateway. The instance is started on demand by the API Lambda if stopped; a new instance is launched from the launch template if terminated.