WorldMM Vertical Slice — EgoLife Validation Design Spec
Date: 2026-03-22
Depends on: docs/superpowers/specs/2026-03-22-worldmm-live-memory-design.md
Location: main/server/worldmm/
Goal
Validate the WorldMM implementation by running the full pipeline against EgoLife data and evaluating accuracy on 500 QA questions. Compare results to the paper's reported 65.6% (GPT-5).
Two Components
1. GPU Worker (AWS g5.xlarge)
VLM2Vec-V2 FastAPI server for visual embedding.
- Model: TIGER-Lab/VLM2Vec-LoRA + Qwen2-VL base (~16GB total)
- Instance: g5.xlarge (1x A10G, 24GB VRAM)
- Setup: download weights to Samsung HDD first, SCP to instance, load into GPU
- Endpoints: `POST /encode-video` (16 frames → embedding), `POST /encode-text` (query → embedding)
- Lifecycle: manual SSH, start server, stop (not terminate) when done. EBS preserves weights on disk; GPU memory reloads on restart (~1-2 min).
2. Pipeline Orchestrator (local Python script)
Processes EgoLife data through the full WorldMM pipeline, then runs evaluation.
Data Sources (Samsung HDD)
| Data | Path | Format |
|---|---|---|
| Video clips | /Volumes/Samsung 2TB HDD/egolife/A1_JAKE/DAY*/ | ~17s mp4 clips, 828 per day |
| DenseCaption | /Volumes/Samsung 2TB HDD/egolife/EgoLifeCap/DenseCaption/A1_JAKE/DAY*/ | Chinese SRT, needs translation |
| Transcripts | /Volumes/Samsung 2TB HDD/egolife/EgoLifeCap/Transcript/A1_JAKE/DAY*/ | Bilingual SRT (Chinese + English), speaker-tagged |
| QA questions | /Volumes/Samsung 2TB HDD/egolife/EgoLifeQA/EgoLifeQA_A1_JAKE.json | 500 multiple-choice questions with answer key |
Pipeline Steps
Step 1: Translate DenseCaptions (Batch API)
- Read Chinese SRT files from `DenseCaption/A1_JAKE/DAY*/`
- Parse SRT entries (timestamp + Chinese text)
- Submit batch translation requests to GPT-5-mini: "Translate to English: {chinese_text}"
- Save translated captions as JSON: `{timestamp: english_text}`
- Output: `wmu_cache/translated_captions/DAY{N}.json`
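The SRT parsing in this step can be sketched with the stdlib alone. A minimal sketch; `parse_srt` is a hypothetical helper name, and real SRT files may need more defensive handling (BOMs, malformed blocks):

```python
import re

def parse_srt(srt_text: str) -> list[dict]:
    """Parse SRT entries into {start, end, text} dicts (timestamps kept as strings).

    Chinese caption text passes through unchanged; translation happens later
    via the batch API.
    """
    entries = []
    # SRT blocks are separated by blank lines: index line, timestamp line, text lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue
        m = re.match(r"(\S+)\s*-->\s*(\S+)", lines[1])
        if not m:
            continue
        entries.append({"start": m.group(1), "end": m.group(2),
                        "text": " ".join(lines[2:])})
    return entries
```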
Step 2: Generate Sync Data + Fine Captions (Batch API)
- Merge translated DenseCaptions with bilingual Transcripts by timestamp
- For each 30s window: collect all caption + transcript entries
- Submit batch requests to GPT-5-mini: rewrite into first-person (Jake's perspective) narrative
- Output: `wmu_cache/fine_captions/A1_JAKE_30sec.json` — array of `{start_time, end_time, text, date, video_path}`
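The timestamp-based windowing in this step can be sketched as follows. A sketch under the assumption that merged caption/transcript entries arrive as `(seconds, text)` pairs; `bucket_entries` is a hypothetical name, and the real step sends each window's text to GPT-5-mini for first-person rewriting rather than concatenating it:

```python
def bucket_entries(entries, window_sec=30.0):
    """Group (t_seconds, text) pairs into fixed-size windows keyed by window start."""
    windows = {}
    for t, text in entries:
        start = int(t // window_sec) * window_sec  # snap to window boundary
        windows.setdefault(start, []).append(text)
    # Concatenation stands in for the batch rewrite prompt sent per window.
    return [{"start_time": s, "end_time": s + window_sec, "text": " ".join(txts)}
            for s, txts in sorted(windows.items())]
```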
Step 3: Multi-scale Merging (Batch API)
- Group 30s captions into 3min windows (6 per window) → submit merge prompts → `A1_JAKE_3min.json`
- Group 3min captions into 10min windows → `A1_JAKE_10min.json`
- Group 10min captions into 1hr windows → `A1_JAKE_1h.json`
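The grouping at each scale is the same operation applied repeatedly. A minimal sketch; `chunk` is a hypothetical helper, and the joined text here stands in for the LLM merge prompt's output:

```python
def chunk(captions, group_size):
    """Merge consecutive fine captions into coarser windows of group_size.

    Used as 30s→3min (group_size=6), 3min→10min, 10min→1hr passes.
    """
    merged = []
    for i in range(0, len(captions), group_size):
        grp = captions[i:i + group_size]
        merged.append({"start_time": grp[0]["start_time"],
                       "end_time": grp[-1]["end_time"],
                       "text": " ".join(c["text"] for c in grp)})
    return merged
```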
Step 4: Triple Extraction (Batch API)
- For each caption at each time scale: NER + episodic triple extraction
- Entity resolution: embed entities → search existing canonicals → LLM confirm
- Store triples in SQLite (using existing ORM models)
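The embed-then-search half of entity resolution can be sketched as a nearest-canonical lookup. A sketch only: `nearest_canonical` and the 0.85 threshold are assumptions, and in the real step a below-threshold or ambiguous match goes to an LLM confirm call rather than being decided here:

```python
import math

def nearest_canonical(query_vec, canonicals, threshold=0.85):
    """Return the best-matching canonical entity name by cosine similarity,
    or None if nothing clears the threshold (deferring to LLM confirmation)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0
    best = max(canonicals.items(), key=lambda kv: cos(query_vec, kv[1]), default=None)
    if best and cos(query_vec, best[1]) >= threshold:
        return best[0]
    return None
```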
Step 5: Semantic Memory (Batch API)
- At 10min boundaries: semantic triple extraction from 10min summaries
- Consolidation: embedding similarity → LLM merge/update/keep
- Store in SQLite
Step 6: Visual Embeddings (GPU Worker)
- For each 30s video clip: sample 16 frames uniformly
- Send to GPU worker `/encode-video` endpoint
- Store embeddings (VECTOR(1536)) as numpy arrays in JSON format
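Uniform sampling of 16 frames from a clip reduces to index selection. A minimal sketch (`uniform_frame_indices` is a hypothetical helper; the real step would pass these indices to a video decoder):

```python
def uniform_frame_indices(n_frames_total, n_sample=16):
    """Pick n_sample frame indices spread uniformly across the clip."""
    if n_frames_total <= n_sample:
        return list(range(n_frames_total))
    step = n_frames_total / n_sample
    return [int(i * step) for i in range(n_sample)]
```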
Step 7: Evaluation (Direct API)
- Load 500 QA questions from `EgoLifeQA_A1_JAKE.json`
- For each question:
  - Parse `query_time` → index memories up to that time (temporal causality)
  - Run reasoning agent (GPT-5, up to 5 rounds) with access to all 3 memory types
  - Extract predicted answer letter using three-strategy parser (matching the reference implementation):
    - Exact match: normalize response text, compare to each choice's text (e.g., "Alice" matches choice B "Alice")
    - Letter extraction: regex for patterns like `(B)`, `B.`, `B:`, `**B**`, standalone `B`
    - Full pattern match: regex for `B. Alice`, `(B) Alice`, `Answer: B`
    - First strategy that yields a match wins. If none match, mark as incorrect.
  - Compare extracted letter to ground truth `answer` field
- Compute accuracy: correct / 500
- Save results to `wmu_cache/eval_results.json` with per-question detail: `{ID, type, question, predicted, correct, num_rounds, round_history}`
Temporal Causality
The paper enforces that memory is indexed only up to `query_time`. For each QA question, the pipeline must:
1. Filter episodic triples to only those from segments with `end_time <= query_time`
2. Filter semantic triples to only those consolidated up to `query_time`
3. Filter visual embeddings to only those from segments before `query_time`
This prevents the model from "seeing" future events when answering questions.
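The filter itself is one predicate applied to each memory store. A minimal sketch, assuming each record carries an `end_time` field in the same clock as `query_time` (`visible_memories` is a hypothetical helper):

```python
def visible_memories(memories, query_time):
    """Keep only memory records whose segment ends at or before query_time,
    so the reasoning agent never sees events after the question's timestamp."""
    return [m for m in memories if m["end_time"] <= query_time]
```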
Cache Strategy
Each pipeline step caches its output so reruns skip completed work:
- `wmu_cache/translated_captions/` — step 1 output
- `wmu_cache/fine_captions/` — step 2 output
- `wmu_cache/multiscale/` — step 3 output
- `wmu_cache/triples/` — step 4 output (SQLite DB)
- `wmu_cache/semantic/` — step 5 output
- `wmu_cache/visual_embeddings/` — step 6 output (JSON)
- `wmu_cache/eval_results.json` — step 7 output
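The skip-if-cached behavior can be sketched as a small wrapper around each step. A sketch; `run_step` is a hypothetical helper, and the real orchestrator may want finer-grained (per-day) cache keys:

```python
from pathlib import Path

def run_step(output_path, step_fn):
    """Run a pipeline step only if its cached output is missing."""
    out = Path(output_path)
    if out.exists():
        return out  # rerun skips completed work
    out.parent.mkdir(parents=True, exist_ok=True)
    step_fn(out)  # step is responsible for writing its own output file
    return out
```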
File Structure
main/server/worldmm/
├── pipeline/
│ ├── __init__.py
│ ├── translate.py # Step 1: SRT parsing + batch translation
│ ├── sync.py # Step 2: merge captions + transcripts, fine caption generation
│ ├── multiscale_runner.py # Step 3: multi-scale merging orchestration
│ ├── triple_runner.py # Step 4: NER + triple extraction + entity resolution
│ ├── semantic_runner.py # Step 5: semantic extraction + consolidation
│ ├── visual_runner.py # Step 6: VLM2Vec encoding via GPU worker
│ ├── evaluate.py # Step 7: QA evaluation
│ └── run.py # Main orchestrator: runs steps 1-7 in sequence
├── gpu_worker/
│ ├── server.py # FastAPI VLM2Vec server (runs on g5.xlarge)
│ └── setup.sh # Instance setup script (install deps, download weights)
GPU Worker Detail
# FastAPI endpoints
POST /encode-video
Body: {"frames": [base64_str, ...]} # 16 frames
Response: {"embedding": [float, ...]} # 1536-dim vector
POST /encode-text
Body: {"text": "query string"}
Response: {"embedding": [float, ...]} # 1536-dim vector
GET /health
Response: {"status": "ready", "model": "VLM2Vec-V2"}
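On the client side, building the `/encode-video` request body from raw frame bytes is straightforward. A sketch of the payload construction only (`build_encode_video_request` is a hypothetical name; the actual HTTP call and response handling are omitted):

```python
import base64
import json

def build_encode_video_request(frames: list[bytes]) -> str:
    """Build the JSON body for POST /encode-video from raw frame bytes.

    Each of the 16 sampled frames is base64-encoded per the endpoint's schema.
    """
    return json.dumps(
        {"frames": [base64.b64encode(f).decode("ascii") for f in frames]}
    )
```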
Expected Results
- Paper reports 65.6% accuracy with GPT-5 (all 3 memory types)
- Without visual memory: expect lower (~55-60% estimated)
- With visual memory via GPU worker: should approach 65.6%
- A result well below these estimates (< 50%) suggests implementation bugs
Cost Estimate
| Step | API | Est. calls | Est. cost |
|---|---|---|---|
| Translation | Batch GPT-5-mini | ~5,000 captions across 7 days | ~$2 |
| Fine captions | Batch GPT-5-mini | ~5,000 | ~$2 |
| Multi-scale merge | Batch GPT-5-mini | ~2,000 | ~$1 |
| NER + triples | Batch GPT-5-mini | ~10,000 | ~$3 |
| Entity resolution | Batch GPT-5-mini + embeddings | ~5,000 | ~$2 |
| Semantic extraction | Batch GPT-5-mini | ~500 | ~$0.50 |
| VLM2Vec encoding | GPU worker (g5.xlarge ~$1/hr) | ~5,000 clips | ~$2-4 (2-4 hrs) |
| Evaluation | Direct GPT-5 | 500 x ~3 rounds | ~$10-15 |
| Total | | | ~$25-30 |