
WorldMM Vertical Slice — EgoLife Validation Design Spec

Date: 2026-03-22
Depends on: docs/superpowers/specs/2026-03-22-worldmm-live-memory-design.md
Location: main/server/worldmm/

Goal

Validate the WorldMM implementation by running the full pipeline against EgoLife data and evaluating accuracy on 500 QA questions. Compare results to the paper's reported 65.6% (GPT-5).

Two Components

1. GPU Worker (AWS g5.xlarge)

VLM2Vec-V2 FastAPI server for visual embedding.

  • Model: TIGER-Lab/VLM2Vec-LoRA + Qwen2-VL base (~16GB total)
  • Instance: g5.xlarge (1x A10G, 24GB VRAM)
  • Setup: download weights to Samsung HDD first, SCP to instance, load into GPU
  • Endpoints: POST /encode-video (16 frames → embedding), POST /encode-text (query → embedding)
  • Lifecycle: manual SSH, start server, stop (not terminate) when done. EBS preserves weights on disk; GPU memory reloads on restart (~1-2 min).

2. Pipeline Orchestrator (local Python script)

Processes EgoLife data through the full WorldMM pipeline, then runs evaluation.

Data Sources (Samsung HDD)

| Data | Path | Format |
|---|---|---|
| Video clips | /Volumes/Samsung 2TB HDD/egolife/A1_JAKE/DAY*/ | ~17s mp4 clips, 828 per day |
| DenseCaption | /Volumes/Samsung 2TB HDD/egolife/EgoLifeCap/DenseCaption/A1_JAKE/DAY*/ | Chinese SRT, needs translation |
| Transcripts | /Volumes/Samsung 2TB HDD/egolife/EgoLifeCap/Transcript/A1_JAKE/DAY*/ | Bilingual SRT (Chinese + English), speaker-tagged |
| QA questions | /Volumes/Samsung 2TB HDD/egolife/EgoLifeQA/EgoLifeQA_A1_JAKE.json | 500 multiple-choice questions with answer key |

Pipeline Steps

Step 1: Translate DenseCaptions (Batch API)

  • Read Chinese SRT files from DenseCaption/A1_JAKE/DAY*/
  • Parse SRT entries (timestamp + Chinese text)
  • Submit batch translation requests to GPT-5-mini: "Translate to English: {chinese_text}"
  • Save translated captions as JSON: {timestamp: english_text}
  • Output: wmu_cache/translated_captions/DAY{N}.json
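The SRT parsing in this step can be sketched as follows (illustrative helper, not the shipped implementation; it assumes standard SRT block layout with an index line, a `-->` timing line, and one or more text lines):

```python
import re
from datetime import timedelta

def _ts(stamp: str) -> timedelta:
    """Convert an SRT timestamp like '00:01:23,456' to a timedelta."""
    h, m, rest = stamp.split(":")
    s, ms = rest.split(",")
    return timedelta(hours=int(h), minutes=int(m), seconds=int(s), milliseconds=int(ms))

def parse_srt(text: str):
    """Parse SRT content into (start, end, text) entries."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks
        start, end = (p.strip() for p in lines[1].split("-->"))
        entries.append((_ts(start), _ts(end), " ".join(lines[2:])))
    return entries
```

The parsed Chinese text is what gets wrapped into the batch translation prompts; the timestamps become the JSON keys.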

Step 2: Generate Sync Data + Fine Captions (Batch API)

  • Merge translated DenseCaptions with bilingual Transcripts by timestamp
  • For each 30s window: collect all caption + transcript entries
  • Submit batch requests to GPT-5-mini: rewrite into first-person (Jake's perspective) narrative
  • Output: wmu_cache/fine_captions/A1_JAKE_30sec.json — array of {start_time, end_time, text, date, video_path}
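The timestamp-based merge into 30-second windows can be sketched like this (illustrative; assumes caption and transcript entries have been flattened into (start_seconds, text) pairs):

```python
def group_into_windows(entries, window_s=30):
    """Group (start_s, text) entries into fixed-size windows keyed by window start time."""
    windows = {}
    for start_s, text in entries:
        key = int(start_s // window_s) * window_s
        windows.setdefault(key, []).append(text)
    return windows
```

Each window's collected texts are then sent in one batch request for first-person rewriting.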

Step 3: Multi-scale Merging (Batch API)

  • Group 30s captions into 3min windows (6 per window) → submit merge prompts → A1_JAKE_3min.json
  • Group 3min into 10min windows → A1_JAKE_10min.json
  • Group 10min into 1hr windows → A1_JAKE_1h.json
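The grouping logic is the same at every scale; a minimal sketch (segment dicts assumed to follow the Step 2 output schema):

```python
def merge_segments(segments, group_size):
    """Merge consecutive caption segments into coarser windows.

    Each merged segment spans from the first child's start to the last child's
    end; the child texts are what gets sent to the LLM merge prompt.
    """
    merged = []
    for i in range(0, len(segments), group_size):
        group = segments[i:i + group_size]
        merged.append({
            "start_time": group[0]["start_time"],
            "end_time": group[-1]["end_time"],
            "texts": [seg["text"] for seg in group],
        })
    return merged
```

Running it with group_size=6 on the 30s captions yields the 3min windows; the same call with the appropriate group sizes produces the 10min and 1hr levels.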

Step 4: Triple Extraction (Batch API)

  • For each caption at each time scale: NER + episodic triple extraction
  • Entity resolution: embed entities → search existing canonicals → LLM confirm
  • Store triples in SQLite (using existing ORM models)
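The embedding search inside entity resolution can be sketched as a cosine-similarity lookup over canonical entities (the 0.85 threshold is illustrative, not from the spec):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest_canonical(entity_vec, canonicals, threshold=0.85):
    """Return the best-matching canonical entity name above threshold, else None.

    Candidates above the threshold still go to an LLM for confirmation,
    per the entity-resolution step described above.
    """
    scored = [(cosine(entity_vec, vec), name) for name, vec in canonicals.items()]
    if not scored:
        return None
    best_score, best_name = max(scored)
    return best_name if best_score >= threshold else None
```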

Step 5: Semantic Memory (Batch API)

  • At 10min boundaries: semantic triple extraction from 10min summaries
  • Consolidation: embedding similarity → LLM merge/update/keep
  • Store in SQLite
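The consolidation routing can be sketched as a threshold check before the LLM decision (both thresholds here are hypothetical placeholders; the spec leaves the actual merge/update/keep choice to the LLM for ambiguous cases):

```python
def consolidation_action(similarity, merge_threshold=0.92, review_threshold=0.75):
    """Route a new semantic triple by its similarity to the nearest stored triple.

    Thresholds are illustrative, not from the spec.
    """
    if similarity >= merge_threshold:
        return "merge"       # near-duplicate of an existing triple
    if similarity >= review_threshold:
        return "llm_review"  # ambiguous: LLM decides merge/update/keep
    return "keep"            # novel fact: store as a new triple
```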

Step 6: Visual Embeddings (GPU Worker)

  • For each 30s video clip: sample 16 frames uniformly
  • Send to GPU worker /encode-video endpoint
  • Store embeddings (VECTOR(1536)) as numpy arrays in JSON format
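Uniform 16-frame sampling can be sketched as follows (illustrative; assumes the clip's total frame count is known, e.g. from its metadata):

```python
def uniform_frame_indices(total_frames: int, n: int = 16) -> list:
    """Pick n frame indices spread uniformly across a clip (center of each of n bins)."""
    if total_frames <= 0:
        return []
    if total_frames <= n:
        return list(range(total_frames))
    step = total_frames / n
    return [min(int(i * step + step / 2), total_frames - 1) for i in range(n)]
```

For a ~17s clip at 30 fps this yields 16 indices spaced roughly one second apart; the decoded frames are then base64-encoded and sent to /encode-video.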

Step 7: Evaluation (Direct API)

  • Load 500 QA questions from EgoLifeQA_A1_JAKE.json
  • For each question:
      • Parse query_time → index memories up to that time (temporal causality)
      • Run reasoning agent (GPT-5, up to 5 rounds) with access to all 3 memory types
      • Extract predicted answer letter using three-strategy parser (matching the reference implementation):
        1. Exact match: normalize response text, compare to each choice's text (e.g., "Alice" matches choice B "Alice")
        2. Letter extraction: regex for patterns like (B), B., B:, **B**, standalone B
        3. Full pattern match: regex for B. Alice, (B) Alice, Answer: B
      • First strategy that yields a match wins. If none match, mark as incorrect.
      • Compare extracted letter to ground truth answer field
  • Compute accuracy: correct / 500
  • Save results to wmu_cache/eval_results.json with per-question detail: {ID, type, question, predicted, correct, num_rounds, round_history}
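The three strategies above can be sketched like this (a simplified sketch, not the reference parser itself; the standalone-letter pattern in strategy 2 is deliberately last, since it is the loosest):

```python
import re

def extract_answer(response, choices):
    """Three-strategy answer extraction sketch.

    `choices` maps letter -> choice text, e.g. {"A": "Bob", "B": "Alice"}.
    """
    text = response.strip()
    # Strategy 1: exact match against a choice's text
    norm = text.lower().rstrip(".")
    for letter, choice_text in choices.items():
        if norm == choice_text.strip().lower():
            return letter
    # Strategy 2: letter patterns -- (B), **B**, B., B:, standalone B
    m = re.search(r"\(([A-D])\)|\*\*([A-D])\*\*|\b([A-D])[.:]|\b([A-D])\b", text)
    if m:
        return next(g for g in m.groups() if g)
    # Strategy 3: full patterns -- "B. Alice", "(B) Alice", "Answer: B"
    m = re.search(r"Answer:\s*([A-D])\b|\(([A-D])\)\s+\S|([A-D])\.\s+\S", text)
    if m:
        return next(g for g in m.groups() if g)
    return None  # no match: the question is marked incorrect
```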

Temporal Causality

The paper enforces that memory is indexed only up to query_time. For each QA question, the pipeline must:

  1. Filter episodic triples to only those from segments with end_time <= query_time
  2. Filter semantic triples to only those consolidated up to query_time
  3. Filter visual embeddings to only those from segments before query_time

This prevents the model from "seeing" future events when answering questions.
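The cutoff is the same filter for all three memory types; a minimal sketch (memory records are assumed to carry an end_time field):

```python
def visible_at(memories, query_time):
    """Return only memories whose segment ended at or before query_time.

    Applied identically to episodic triples, semantic triples, and visual
    embeddings before each QA question is answered.
    """
    return [m for m in memories if m["end_time"] <= query_time]
```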

Cache Strategy

Each pipeline step caches its output so reruns skip completed work:

  • wmu_cache/translated_captions/ — step 1 output
  • wmu_cache/fine_captions/ — step 2 output
  • wmu_cache/multiscale/ — step 3 output
  • wmu_cache/triples/ — step 4 output (SQLite DB)
  • wmu_cache/semantic/ — step 5 output
  • wmu_cache/visual_embeddings/ — step 6 output (JSON)
  • wmu_cache/eval_results.json — step 7 output
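A minimal skip-if-cached wrapper for these steps (illustrative; assumes step outputs are JSON-serializable):

```python
import json
from pathlib import Path

def cached_step(cache_path, compute):
    """Return the cached result if cache_path exists; otherwise run compute() and cache it."""
    path = Path(cache_path)
    if path.exists():
        return json.loads(path.read_text())
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result
```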

File Structure

main/server/worldmm/
├── pipeline/
│   ├── __init__.py
│   ├── translate.py          # Step 1: SRT parsing + batch translation
│   ├── sync.py               # Step 2: merge captions + transcripts, fine caption generation
│   ├── multiscale_runner.py  # Step 3: multi-scale merging orchestration
│   ├── triple_runner.py      # Step 4: NER + triple extraction + entity resolution
│   ├── semantic_runner.py    # Step 5: semantic extraction + consolidation
│   ├── visual_runner.py      # Step 6: VLM2Vec encoding via GPU worker
│   ├── evaluate.py           # Step 7: QA evaluation
│   └── run.py                # Main orchestrator: runs steps 1-7 in sequence
└── gpu_worker/
    ├── server.py             # FastAPI VLM2Vec server (runs on g5.xlarge)
    └── setup.sh              # Instance setup script (install deps, download weights)

GPU Worker Detail

# FastAPI endpoints
POST /encode-video
  Body: {"frames": [base64_str, ...]}  # 16 frames
  Response: {"embedding": [float, ...]}  # 1536-dim vector

POST /encode-text
  Body: {"text": "query string"}
  Response: {"embedding": [float, ...]}  # 1536-dim vector

GET /health
  Response: {"status": "ready", "model": "VLM2Vec-V2"}
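On the client side, the /encode-video request body can be built like this (a sketch; an HTTP client such as requests would POST the resulting dict to the worker):

```python
import base64

def encode_video_payload(frames):
    """Build the /encode-video request body from 16 JPEG-encoded frames (bytes)."""
    if len(frames) != 16:
        raise ValueError(f"worker expects 16 frames, got {len(frames)}")
    return {"frames": [base64.b64encode(f).decode("ascii") for f in frames]}
```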

Expected Results

  • Paper reports 65.6% accuracy with GPT-5 (all 3 memory types)
  • Without visual memory: expect lower (~55-60% estimated)
  • With visual memory via GPU worker: should approach 65.6%
  • A large deviation (accuracy below ~50%) suggests implementation bugs

Cost Estimate

| Step | API | Est. calls | Est. cost |
|---|---|---|---|
| Translation | Batch GPT-5-mini | ~5,000 captions x 7 days | ~$2 |
| Fine captions | Batch GPT-5-mini | ~5,000 | ~$2 |
| Multi-scale merge | Batch GPT-5-mini | ~2,000 | ~$1 |
| NER + triples | Batch GPT-5-mini | ~10,000 | ~$3 |
| Entity resolution | Batch GPT-5-mini + embeddings | ~5,000 | ~$2 |
| Semantic extraction | Batch GPT-5-mini | ~500 | ~$0.50 |
| VLM2Vec encoding | GPU worker (g5.xlarge ~$1/hr) | ~5,000 clips | ~$2-4 (2-4 hrs) |
| Evaluation | Direct GPT-5 | 500 x ~3 rounds | ~$10-15 |
| Total | | | ~$25-30 |