
WorldMM Vertical Slice — EgoLife Validation Design Spec

Date: 2026-03-22
Depends on: docs/superpowers/specs/2026-03-22-worldmm-live-memory-design.md
Location: main/server/worldmm/

Goal

Validate the WorldMM implementation by running the full pipeline against EgoLife data and evaluating accuracy on 500 QA questions. Compare results to the paper's reported 65.6% (GPT-5).

Two Components

1. GPU Worker (AWS g5.xlarge)

VLM2Vec-V2 FastAPI server for visual embedding.

  • Model: TIGER-Lab/VLM2Vec-LoRA + Qwen2-VL base (~16GB total)
  • Instance: g5.xlarge (1x A10G, 24GB VRAM)
  • Setup: download weights to Samsung HDD first, SCP to instance, load into GPU
  • Endpoints: POST /encode-video (16 frames → embedding), POST /encode-text (query → embedding)
  • Lifecycle: manual SSH, start server, stop (not terminate) when done. EBS preserves weights on disk; GPU memory reloads on restart (~1-2 min).

2. Pipeline Orchestrator (local Python script)

Processes EgoLife data through the full WorldMM pipeline, then runs evaluation.

Data Sources (Samsung HDD)

| Data | Path | Format |
|---|---|---|
| Video clips | /Volumes/Samsung 2TB HDD/egolife/A1_JAKE/DAY*/ | ~17s mp4 clips, 828 per day |
| DenseCaption | /Volumes/Samsung 2TB HDD/egolife/EgoLifeCap/DenseCaption/A1_JAKE/DAY*/ | Chinese SRT, needs translation |
| Transcripts | /Volumes/Samsung 2TB HDD/egolife/EgoLifeCap/Transcript/A1_JAKE/DAY*/ | Bilingual SRT (Chinese + English), speaker-tagged |
| QA questions | /Volumes/Samsung 2TB HDD/egolife/EgoLifeQA/EgoLifeQA_A1_JAKE.json | 500 multiple-choice questions with answer key |

Pipeline Steps

Step 1: Translate DenseCaptions (Batch API)

  • Read Chinese SRT files from DenseCaption/A1_JAKE/DAY*/
  • Parse SRT entries (timestamp + Chinese text)
  • Submit batch translation requests to GPT-5-mini: "Translate to English: {chinese_text}"
  • Save translated captions as JSON: {timestamp: english_text}
  • Output: wmu_cache/translated_captions/DAY{N}.json
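The SRT parsing in this step can be sketched as follows (illustrative helper, not the shipped implementation; it assumes standard SRT block layout with an index line, a `-->` timing line, and one or more text lines):

```python
import re
from datetime import timedelta

def _ts(stamp: str) -> timedelta:
    """Convert an SRT timestamp like '00:01:23,456' to a timedelta."""
    h, m, rest = stamp.split(":")
    s, ms = rest.split(",")
    return timedelta(hours=int(h), minutes=int(m), seconds=int(s), milliseconds=int(ms))

def parse_srt(text: str):
    """Parse SRT content into (start, end, text) entries."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks
        start, end = (p.strip() for p in lines[1].split("-->"))
        entries.append((_ts(start), _ts(end), " ".join(lines[2:])))
    return entries
```

The parsed Chinese text is what gets wrapped into the batch translation prompts; the timestamps become the JSON keys.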

Step 2: Generate Sync Data + Fine Captions (Batch API)

  • Merge translated DenseCaptions with bilingual Transcripts by timestamp
  • For each 30s window: collect all caption + transcript entries
  • Submit batch requests to GPT-5-mini: rewrite into first-person (Jake's perspective) narrative
  • Output: wmu_cache/fine_captions/A1_JAKE_30sec.json — array of {start_time, end_time, text, date, video_path}
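The timestamp-based merge into 30-second windows can be sketched like this (illustrative; assumes caption and transcript entries have been flattened into (start_seconds, text) pairs):

```python
def group_into_windows(entries, window_s=30):
    """Group (start_s, text) entries into fixed-size windows keyed by window start time."""
    windows = {}
    for start_s, text in entries:
        key = int(start_s // window_s) * window_s
        windows.setdefault(key, []).append(text)
    return windows
```

Each window's collected texts are then sent in one batch request for first-person rewriting.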

Step 3: Multi-scale Merging (Batch API)

  • Group 30s captions into 3min windows (6 per window) → submit merge prompts → A1_JAKE_3min.json
  • Group 3min into 10min windows → A1_JAKE_10min.json
  • Group 10min into 1hr windows → A1_JAKE_1h.json
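The grouping logic is the same at every scale; a minimal sketch (segment dicts assumed to follow the Step 2 output schema):

```python
def merge_segments(segments, group_size):
    """Merge consecutive caption segments into coarser windows.

    Each merged segment spans from the first child's start to the last child's
    end; the child texts are what gets sent to the LLM merge prompt.
    """
    merged = []
    for i in range(0, len(segments), group_size):
        group = segments[i:i + group_size]
        merged.append({
            "start_time": group[0]["start_time"],
            "end_time": group[-1]["end_time"],
            "texts": [seg["text"] for seg in group],
        })
    return merged
```

Running it with group_size=6 on the 30s captions yields the 3min windows; the same call with the appropriate group sizes produces the 10min and 1hr levels.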

Step 4: Triple Extraction (Batch API)

  • For each caption at each time scale: NER + episodic triple extraction
  • Entity resolution: embed entities → search existing canonicals → LLM confirm
  • Store triples in SQLite (using existing ORM models)
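The embedding search inside entity resolution can be sketched as a cosine-similarity lookup over canonical entities (the 0.85 threshold is illustrative, not from the spec):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest_canonical(entity_vec, canonicals, threshold=0.85):
    """Return the best-matching canonical entity name above threshold, else None.

    Candidates above the threshold still go to an LLM for confirmation,
    per the entity-resolution step described above.
    """
    scored = [(cosine(entity_vec, vec), name) for name, vec in canonicals.items()]
    if not scored:
        return None
    best_score, best_name = max(scored)
    return best_name if best_score >= threshold else None
```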

Step 5: Semantic Memory (Batch API)

  • At 10min boundaries: semantic triple extraction from 10min summaries
  • Consolidation: embedding similarity → LLM merge/update/keep
  • Store in SQLite
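The consolidation routing can be sketched as a threshold check before the LLM decision (both thresholds here are hypothetical placeholders; the spec leaves the actual merge/update/keep choice to the LLM for ambiguous cases):

```python
def consolidation_action(similarity, merge_threshold=0.92, review_threshold=0.75):
    """Route a new semantic triple by its similarity to the nearest stored triple.

    Thresholds are illustrative, not from the spec.
    """
    if similarity >= merge_threshold:
        return "merge"       # near-duplicate of an existing triple
    if similarity >= review_threshold:
        return "llm_review"  # ambiguous: LLM decides merge/update/keep
    return "keep"            # novel fact: store as a new triple
```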

Step 6: Visual Embeddings (GPU Worker)

  • For each 30s video clip: sample 16 frames uniformly
  • Send to GPU worker /encode-video endpoint
  • Store embeddings (VECTOR(1536)) as numpy arrays in JSON format
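Uniform 16-frame sampling can be sketched as follows (illustrative; assumes the clip's total frame count is known, e.g. from its metadata):

```python
def uniform_frame_indices(total_frames: int, n: int = 16) -> list:
    """Pick n frame indices spread uniformly across a clip (center of each of n bins)."""
    if total_frames <= 0:
        return []
    if total_frames <= n:
        return list(range(total_frames))
    step = total_frames / n
    return [min(int(i * step + step / 2), total_frames - 1) for i in range(n)]
```

For a ~17s clip at 30 fps this yields 16 indices spaced roughly one second apart; the decoded frames are then base64-encoded and sent to /encode-video.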

Step 7: Evaluation (Direct API)

  • Load 500 QA questions from EgoLifeQA_A1_JAKE.json
  • For each question:
      • Parse query_time → index memories up to that time (temporal causality)
      • Run reasoning agent (GPT-5, up to 5 rounds) with access to all 3 memory types
      • Extract predicted answer letter using three-strategy parser (matching the reference implementation):
        1. Exact match: normalize response text, compare to each choice's text (e.g., "Alice" matches choice B "Alice")
        2. Letter extraction: regex for patterns like (B), B., B:, **B**, standalone B
        3. Full pattern match: regex for B. Alice, (B) Alice, Answer: B
      • First strategy that yields a match wins. If none match, mark as incorrect.
      • Compare extracted letter to ground truth answer field
  • Compute accuracy: correct / 500
  • Save results to wmu_cache/eval_results.json with per-question detail: {ID, type, question, predicted, correct, num_rounds, round_history}
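The three strategies above can be sketched like this (a simplified sketch, not the reference parser itself; the standalone-letter pattern in strategy 2 is deliberately last, since it is the loosest):

```python
import re

def extract_answer(response, choices):
    """Three-strategy answer extraction sketch.

    `choices` maps letter -> choice text, e.g. {"A": "Bob", "B": "Alice"}.
    """
    text = response.strip()
    # Strategy 1: exact match against a choice's text
    norm = text.lower().rstrip(".")
    for letter, choice_text in choices.items():
        if norm == choice_text.strip().lower():
            return letter
    # Strategy 2: letter patterns -- (B), **B**, B., B:, standalone B
    m = re.search(r"\(([A-D])\)|\*\*([A-D])\*\*|\b([A-D])[.:]|\b([A-D])\b", text)
    if m:
        return next(g for g in m.groups() if g)
    # Strategy 3: full patterns -- "B. Alice", "(B) Alice", "Answer: B"
    m = re.search(r"Answer:\s*([A-D])\b|\(([A-D])\)\s+\S|([A-D])\.\s+\S", text)
    if m:
        return next(g for g in m.groups() if g)
    return None  # no match: the question is marked incorrect
```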

Temporal Causality

The paper enforces that memory is indexed only up to query_time. For each QA question, the pipeline must:

  1. Filter episodic triples to only those from segments with end_time <= query_time
  2. Filter semantic triples to only those consolidated up to query_time
  3. Filter visual embeddings to only those from segments before query_time

This prevents the model from "seeing" future events when answering questions.
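The cutoff is the same filter for all three memory types; a minimal sketch (memory records are assumed to carry an end_time field):

```python
def visible_at(memories, query_time):
    """Return only memories whose segment ended at or before query_time.

    Applied identically to episodic triples, semantic triples, and visual
    embeddings before each QA question is answered.
    """
    return [m for m in memories if m["end_time"] <= query_time]
```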

Cache Strategy

Each pipeline step caches its output so reruns skip completed work:

  • wmu_cache/translated_captions/ — step 1 output
  • wmu_cache/fine_captions/ — step 2 output
  • wmu_cache/multiscale/ — step 3 output
  • wmu_cache/triples/ — step 4 output (SQLite DB)
  • wmu_cache/semantic/ — step 5 output
  • wmu_cache/visual_embeddings/ — step 6 output (JSON)
  • wmu_cache/eval_results.json — step 7 output
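A minimal skip-if-cached wrapper for these steps (illustrative; assumes step outputs are JSON-serializable):

```python
import json
from pathlib import Path

def cached_step(cache_path, compute):
    """Return the cached result if cache_path exists; otherwise run compute() and cache it."""
    path = Path(cache_path)
    if path.exists():
        return json.loads(path.read_text())
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result
```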

File Structure

main/server/worldmm/
├── pipeline/
│   ├── __init__.py
│   ├── translate.py          # Step 1: SRT parsing + batch translation
│   ├── sync.py               # Step 2: merge captions + transcripts, fine caption generation
│   ├── multiscale_runner.py  # Step 3: multi-scale merging orchestration
│   ├── triple_runner.py      # Step 4: NER + triple extraction + entity resolution
│   ├── semantic_runner.py    # Step 5: semantic extraction + consolidation
│   ├── visual_runner.py      # Step 6: VLM2Vec encoding via GPU worker
│   ├── evaluate.py           # Step 7: QA evaluation
│   └── run.py                # Main orchestrator: runs steps 1-7 in sequence
└── gpu_worker/
    ├── server.py             # FastAPI VLM2Vec server (runs on g5.xlarge)
    └── setup.sh              # Instance setup script (install deps, download weights)

GPU Worker Detail

# FastAPI endpoints
POST /encode-video
  Body: {"frames": [base64_str, ...]}  # 16 frames
  Response: {"embedding": [float, ...]}  # 1536-dim vector

POST /encode-text
  Body: {"text": "query string"}
  Response: {"embedding": [float, ...]}  # 1536-dim vector

GET /health
  Response: {"status": "ready", "model": "VLM2Vec-V2"}
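On the client side, the /encode-video request body can be built like this (a sketch; an HTTP client such as requests would POST the resulting dict to the worker):

```python
import base64

def encode_video_payload(frames):
    """Build the /encode-video request body from 16 JPEG-encoded frames (bytes)."""
    if len(frames) != 16:
        raise ValueError(f"worker expects 16 frames, got {len(frames)}")
    return {"frames": [base64.b64encode(f).decode("ascii") for f in frames]}
```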

Expected Results

  • Paper reports 65.6% accuracy with GPT-5 (all 3 memory types)
  • Without visual memory: expect lower (~55-60% estimated)
  • With visual memory via GPU worker: should approach 65.6%
  • A large deviation (accuracy below ~50%) suggests implementation bugs

Cost Estimate

| Step | API | Est. calls | Est. cost |
|---|---|---|---|
| Translation | Batch GPT-5-mini | ~5,000 captions x 7 days | ~$2 |
| Fine captions | Batch GPT-5-mini | ~5,000 | ~$2 |
| Multi-scale merge | Batch GPT-5-mini | ~2,000 | ~$1 |
| NER + triples | Batch GPT-5-mini | ~10,000 | ~$3 |
| Entity resolution | Batch GPT-5-mini + embeddings | ~5,000 | ~$2 |
| Semantic extraction | Batch GPT-5-mini | ~500 | ~$0.50 |
| VLM2Vec encoding | GPU worker (g5.xlarge ~$1/hr) | ~5,000 clips | ~$2-4 (2-4 hrs) |
| Evaluation | Direct GPT-5 | 500 x ~3 rounds | ~$10-15 |
| Total | | | ~$25-30 |