WorldMM Live Memory System — Design Spec
Date: 2026-03-22
Paper: https://arxiv.org/abs/2512.02425
Reference impl: https://github.com/wgcyeo/WorldMM
Location: main/server/worldmm/
Overview
A live wearable memory system based on the WorldMM paper's architecture. Video streams from Meta Ray-Ban glasses, audio is transcribed separately, and the system constructs three complementary memory stores (episodic, semantic, visual) that are queried via an adaptive reasoning agent. The system processes asynchronously with 30-second base windows and coarser summaries generated when their full time windows close.
Models
| Role | Model | Dispatch |
|---|---|---|
| Captioning, NER, triple extraction, merging, consolidation, reranking | GPT-5 mini | Batch API (test) / Direct API (prod) |
| Reasoning agent, response generation | GPT-5 | Direct API (always) |
| Visual embedding (video clips + text queries) | VLM2Vec-V2 (~8B) | GPU worker on AWS G5 |
| Text embedding (entity resolution, semantic similarity) | OpenAI embeddings | API |
Architecture: Lambda + GPU Worker Split
Split along the GPU boundary:
- Lambdas (existing server pattern): ingest endpoint, ASR trigger, all GPT-5 mini calls, graph storage, query/reasoning API
- GPU worker (G5 instance): stateless VLM2Vec encoding only — receives video clips or text queries, returns embeddings.
- Storage: PostgreSQL + pgvector (graphs, triples, embeddings), S3 (video segments, frames)
- Graph cache: Redis for materialized igraph per user, invalidated on triple insert (scoped: only the specific user+memory_type key is invalidated)
Data Flow
Ingest Path
Meta Ray-Bans
-> SDK pushes video frames to ingest endpoint (Lambda)
-> Audio captured separately -> Whisper API -> transcript
-> Frames + transcript buffered into 30s windows in S3
Every 30s window close:
|-- Sample 16 frames uniformly from the 30s clip (sampled once, used for both captioning and VLM2Vec)
|-- GPT-5 mini: caption(16 frames as base64 image_url content parts, transcript) -> 30s caption
|-- GPT-5 mini: NER + episodic triple extraction -> triples -> graph update
|-- GPU worker: VLM2Vec(same 16 frames) -> visual embedding (VECTOR(1536)) -> pgvector
|-- Store caption + metadata in PostgreSQL
Every 3min window close:
|-- GPT-5 mini: merge(six 30s captions) -> 3min summary -> episodic triples -> graph
Every 10min window close:
|-- GPT-5 mini: merge(three 3min summaries + the remaining 1min of captions) -> 10min summary -> episodic triples -> graph
|-- GPT-5 mini: semantic triple extraction from 10min summary text only (not episodic triples) -> consolidation against existing semantic graph
Every 1hr window close:
|-- GPT-5 mini: merge(six 10min summaries) -> 1hr summary -> episodic triples -> graph
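The "sampled once, used for both captioning and VLM2Vec" step above can be sketched as a midpoint sampler. This is a hypothetical helper — the spec only says "uniformly", so the exact policy is an assumption:

```python
def sample_frame_indices(total_frames: int, n: int = 16) -> list[int]:
    """Pick n uniformly spaced frame indices from a clip. The same indices
    feed both GPT-5 mini captioning and VLM2Vec encoding, so both memory
    stores describe exactly the same visual evidence."""
    if total_frames <= n:
        return list(range(total_frames))
    # Midpoint of each of n equal-width bins across the clip.
    return [int((i + 0.5) * total_frames / n) for i in range(n)]
```

At 30 fps a 30s clip has ~900 frames, so this yields indices 28, 84, ..., 871.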
Query Path
Mobile app -> query API (Lambda)
-> Adaptive reasoning agent (GPT-5, up to 5 rounds):
Round N: pick memory type + formulate search query
|-- episodic: embed query -> PPR across temporal graphs -> LLM rerank -> captions
|-- semantic: embed query -> PPR on entity graph -> top triples
|-- visual: VLM2Vec encode query text (GPU worker) -> cosine sim -> frames from S3
Accumulate retrieved context
-> GPT-5: generate answer from all accumulated context
-> Response to mobile app
Episodic vs Semantic Triples
Both are [subject, predicate, object] tuples stored in the same table with a memory_type discriminator. They differ in extraction timing, prompt intent, and lifecycle:
Episodic triples — event-specific facts:
- Extracted at: every time scale (30s, 3min, 10min, 1hr) from each caption
- Prompt intent: "Extract factual, event-specific relationships from this caption"
- Examples: ["I", "stand at", "dining table"], ["Katrina", "asks about", "tomorrow's schedule"]
- Lifecycle: immutable once created. Tied to a specific segment. Never consolidated or merged.
- Retrieval: PPR across multi-scale temporal graphs, then LLM cross-scale reranking
Semantic triples — generalized relational knowledge:
- Extracted at: 10-minute summaries only (coarser timescale to capture patterns)
- Prompt intent: "Extract long-term patterns, habits, preferences, and social relationships — not what happened in this specific moment, but what is generally true"
- Examples: ["I", "often eats", "fruits and snacks"], ["Alice", "expresses romantic feelings toward", "I"]
- Lifecycle: undergo a consolidation pipeline. When new semantic triples arrive, they are compared (via embedding similarity) against existing ones, and the LLM decides to merge, update, or remove conflicting triples.
- Retrieval: PPR on a single evolving entity graph (no temporal scales)
This distinction drives two different prompt templates (prompts/episodic_triples.txt, prompts/semantic_triples.txt) and two different pipeline paths after extraction (episodic → direct graph insert, semantic → consolidation → graph update).
Prompt Templates
The prompts/ directory contains all LLM prompt templates:
| Template | Used by | Purpose |
|---|---|---|
| caption.txt | captioner.py | Generate 30s caption from frames + transcript |
| ner.txt | openie.py | Extract named entities from caption |
| episodic_triples.txt | openie.py | Extract event-specific triples from caption |
| semantic_triples.txt | extraction.py | Extract relational/habitual triples from 10min summary |
| merge_captions.txt | multiscale.py | Merge N fine-grained captions into coarser summary |
| consolidation.txt | consolidation.py | Decide merge/update/remove for conflicting semantic triples |
| cross_scale_rerank.txt | episodic_retriever.py | Given candidates from all temporal scales + original query, select the top 3 most relevant |
| reasoning_agent.txt | agent.py | Adaptive retrieval: decide SEARCH or ANSWER each round |
| response.txt | agent.py | Generate final answer from accumulated retrieval context |
| entity_confirm.txt | openie.py | Confirm whether two surface forms refer to same entity |
Reasoning Agent Action Format
The reasoning agent communicates via structured JSON output each round:
    // Search action — agent wants to query a memory type
    {
      "action": "SEARCH",
      "memory_type": "episodic" | "semantic" | "visual",
      "query": "free-form search query for the selected memory type"
    }

    // Answer action — agent has enough context to respond
    {
      "action": "ANSWER"
    }
The agent loop:
1. Send prompt with: original user question, all prior rounds (action + retrieved context)
2. Parse JSON response
3. If SEARCH: dispatch to the corresponding retriever, append results to context, continue loop
4. If ANSWER: exit loop, pass all accumulated context to response agent (GPT-5) for final answer
5. If 5 rounds reached without ANSWER: force exit, pass accumulated context to response agent
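Assuming two hypothetical callables (`llm.decide` / `llm.answer`, wrapping reasoning_agent.txt and response.txt) and one retriever callable per memory type, the loop can be sketched as:

```python
import json

MAX_ROUNDS = 5

def run_agent(question: str, llm, retrievers: dict) -> str:
    """Adaptive retrieval loop: each round the agent emits SEARCH or
    ANSWER as JSON; after MAX_ROUNDS without ANSWER we force the final
    answer from whatever context has accumulated."""
    context = []  # one {"action": ..., "results": ...} entry per round
    for _ in range(MAX_ROUNDS):
        action = json.loads(llm.decide(question, context))
        if action["action"] == "ANSWER":
            break
        results = retrievers[action["memory_type"]](action["query"])
        context.append({"action": action, "results": results})
    return llm.answer(question, context)  # GPT-5 response generation
```

Note that the forced-exit case (rule 5) needs no special branch: falling out of the loop reaches the same answer call as an explicit ANSWER.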
Segment Buffer
Lambdas are stateless. The segment buffer uses S3 + DynamoDB:
- Incoming frames written to S3 with keys like worldmm/{user_id}/buffer/{timestamp}.jpg
- DynamoDB tracks the current window state: {user_id, window_start, frame_count, audio_key}
- When frame_count reaches the expected count for a 30s window (or 30s wall time elapses), the window is "closed" and an SNS notification triggers the processing fan-out
- Audio chunks are similarly buffered in S3 and assembled per window
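The close condition reduces to a single predicate. The frame rate is not pinned down in this spec, so the expected count below assumes 1 fps for illustration:

```python
WINDOW_S = 30
EXPECTED_FRAMES = 30  # assumed 1 fps — adjust to the actual stream rate

def should_close(window: dict, now: float) -> bool:
    """A window closes when the expected frame count has arrived or 30s
    of wall time has elapsed, whichever comes first. `window` mirrors the
    DynamoDB item: {user_id, window_start, frame_count, audio_key}."""
    full = window["frame_count"] >= EXPECTED_FRAMES
    expired = now - window["window_start"] >= WINDOW_S
    return full or expired
```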
Module Structure
main/server/worldmm/
|-- ingest/
| |-- segment_buffer.py # Accumulates frames/audio into 30s windows
| |-- captioner.py # GPT-5 mini: frames + transcript -> caption
| |-- transcriber.py # Whisper API: audio -> transcript
| |-- app.py # Lambda handler for ingest endpoint
|
|-- memory/
| |-- episodic/
| | |-- graph.py # Multi-scale temporal knowledge graphs (PPR via igraph)
| | |-- openie.py # NER + triple extraction (GPT-5 mini)
| | |-- multiscale.py # Merge captions across time windows
| |-- semantic/
| | |-- graph.py # Evolving entity-relationship graph (PPR via igraph)
| | |-- extraction.py # Semantic triple extraction
| | |-- consolidation.py # Embedding similarity + LLM merge/update
| |-- visual/
| |-- encoder.py # VLM2Vec client (calls GPU worker)
| |-- index.py # pgvector similarity search
|
|-- retrieval/
| |-- agent.py # Adaptive reasoning agent (GPT-5, up to 5 rounds)
| |-- episodic_retriever.py # PPR search across temporal scales + rerank
| |-- semantic_retriever.py # PPR search on entity graph
| |-- visual_retriever.py # VLM2Vec cosine sim + frame fetch
|
|-- llm/
| |-- client.py # OpenAI API wrapper (batch vs direct mode)
| |-- batch.py # Batch API request builder + poller
| |-- prompts/ # Prompt templates
|
|-- gpu_worker/
| |-- vlm2vec_server.py # FastAPI on G5, serves VLM2Vec encoding
| |-- Dockerfile # AMI/container with pre-loaded weights
|
|-- tests/
|-- fixtures/ # EgoLife-derived test data
|-- (test files — see TDD Strategy)
Storage Schema
worldmm_segments
| Column | Type | Notes |
|---|---|---|
| id | UUID PK | |
| user_id | FK | |
| start_time | TIMESTAMPTZ | |
| end_time | TIMESTAMPTZ | |
| duration_seconds | INTEGER | 30, 180, 600, 3600 — not an enum |
| caption | TEXT | |
| s3_video_key | VARCHAR | raw video clip |
| s3_frames_key | VARCHAR | sampled frames |
| transcript | TEXT | ASR output (base segments only) |
| parent_segment_id | UUID FK NULL | links 30s -> 3min -> 10min -> 1hr |
worldmm_entities
| Column | Type | Notes |
|---|---|---|
| id | UUID PK | |
| user_id | FK | |
| canonical_entity_id | UUID FK NULL -> self | NULL = this IS the canonical |
| surface_form | VARCHAR | "Kate", "my roommate" |
| canonical_name | VARCHAR | "Katrina" — set on canonical row only |
| embedding | VECTOR(1536) | OpenAI text-embedding-3-large, reduced to 1536 dims via the API's dimensions parameter (native output is 3072) |
Index: HNSW on embedding WHERE canonical_entity_id IS NULL (canonical entities only).
worldmm_triples
| Column | Type | Notes |
|---|---|---|
| id | UUID PK | |
| segment_id | FK -> worldmm_segments | |
| user_id | FK | |
| memory_type | ENUM: episodic, semantic | |
| subject_entity_id | FK -> worldmm_entities | points to canonical entity |
| predicate | VARCHAR | |
| object_entity_id | FK -> worldmm_entities NULL | points to canonical entity (NULL if literal object) |
| object_literal | VARCHAR NULL | literal value when object is not an entity (e.g., "2:30 PM", "123 Main St") |
| created_at | TIMESTAMPTZ | |
| invalidated_at | TIMESTAMPTZ NULL | semantic consolidation soft-delete |
CHECK constraint: exactly one of object_entity_id or object_literal must be non-null.
Triple entity FKs point directly to the canonical entity row. Graph construction is a single-hop lookup. Entity re-resolution only touches the entities table. Literal-object triples (e.g., ["meeting", "started_at", "2:30 PM"]) are included in the graph as leaf nodes — the literal value becomes a node connected only to the subject entity. These nodes participate in PPR traversal but do not undergo entity resolution.
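The single-hop construction above can be sketched with a pure-Python helper. The real graph.py builds an igraph, but the node/edge bookkeeping would look the same; the function name and row shapes here are illustrative:

```python
def triples_to_graph_lists(triples: list[dict]):
    """Turn triple rows into (node_keys, edge_pairs) ready to feed
    igraph.Graph. Entity objects are keyed by canonical entity id, so
    repeated mentions collapse into one node; literal objects become
    per-triple leaf nodes attached only to their subject."""
    node_keys, edges, index = [], [], {}

    def node(key):
        if key not in index:
            index[key] = len(node_keys)
            node_keys.append(key)
        return index[key]

    for t in triples:
        s = node(("entity", t["subject_entity_id"]))
        if t["object_entity_id"] is not None:
            o = node(("entity", t["object_entity_id"]))
        else:
            # Literal leaf: unique per triple, never entity-resolved.
            o = node(("literal", t["id"], t["object_literal"]))
        edges.append((s, o))
    return node_keys, edges
```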
worldmm_visual_embeddings
| Column | Type | Notes |
|---|---|---|
| id | UUID PK | |
| segment_id | FK -> worldmm_segments | |
| user_id | FK | |
| embedding | VECTOR(1536) | VLM2Vec-encoded video clip |
| timestamp | TIMESTAMPTZ | |
Index: HNSW on embedding. Embedding dimension: VECTOR(1536) matching VLM2Vec-V2 output.
PPR Parameters (from paper)
| Parameter | Value | Notes |
|---|---|---|
| Damping factor | 0.85 | Standard PPR damping |
| Implementation | igraph personalized_pagerank(implementation="prpack") | |
| Episodic seed selection | Query entities matched via NER + embedding similarity | |
| Episodic top-K per scale | 10 (30s), 5 (3min), 5 (10min), 3 (1hr) | Then LLM cross-scale reranker filters to top 3 |
| Semantic seed selection | Top-K triples by cosine similarity to query embedding | Entities from those triples become uniform personalization vector |
| Semantic scoring | Edge score = PPR(subject) + PPR(object) | Edge-focused, not node-focused |
| Semantic top-K | 10 triples | |
| Entity match threshold | 0.6 cosine similarity | Tunable via config |
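Production uses igraph's prpack implementation; purely as a reference for the semantic scoring rule, here is a plain power-iteration PPR plus the edge-focused ranking. This is a sketch under the simplification that dangling-node mass is dropped each iteration:

```python
def personalized_pagerank(adj, seeds, damping=0.85, iters=50):
    """Power-iteration PPR. adj: {node: [out-neighbors]} covering every
    node as a key; seeds get a uniform personalization vector, matching
    the semantic seed-selection setup."""
    base = (1.0 - damping) / len(seeds)
    p = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in adj}
    for _ in range(iters):
        nxt = {n: (base if n in seeds else 0.0) for n in adj}
        for n, out in adj.items():
            if not out:
                continue  # dangling node: mass dropped (simplification)
            share = damping * p[n] / len(out)
            for m in out:
                nxt[m] += share
        p = nxt
    return p

def rank_semantic_triples(triples, ppr, k=10):
    """Edge score = PPR(subject) + PPR(object); return top-k triples.
    Triples are (subject, predicate, object) tuples of node keys."""
    key = lambda t: ppr.get(t[0], 0.0) + ppr.get(t[2], 0.0)
    return sorted(triples, key=key, reverse=True)[:k]
```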
Entity Resolution Flow
- NER extracts "Kate" from caption
- Embed "Kate" via text embedding API
- HNSW search against canonical entities for that user (cosine >= 0.6)
- Match found -> LLM confirmation -> create alias row with canonical_entity_id FK
- No match -> create new canonical entity (canonical_entity_id = NULL)
- Triple FK points to canonical entity directly
- Entity resolution is batched per 30s window (all entities from one caption resolved together)
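The match decision in the middle steps can be isolated into a small helper. The shapes are hypothetical: `candidates` are the HNSW hits sorted best-first, and `llm_confirm` wraps entity_confirm.txt:

```python
MATCH_THRESHOLD = 0.6  # cosine similarity, tunable via config

def resolve_entity(surface_form, candidates, llm_confirm):
    """Return the canonical entity id for a surface form, or None if a
    new canonical entity should be created. A match requires both
    similarity >= threshold and LLM confirmation."""
    for entity_id, sim in candidates:
        if sim < MATCH_THRESHOLD:
            break  # sorted best-first: nothing further can match
        if llm_confirm(surface_form, entity_id):
            return entity_id  # caller creates an alias row pointing here
    return None  # caller creates a new canonical entity
```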
Graph Caching
Per-user materialized igraph held in Redis:
- Key: worldmm:graph:{user_id}:{memory_type}
- Invalidation: delete on new triple insert. During active wear, the cache is effectively always cold — accept the 200-500ms rebuild cost per query as steady state. This is acceptable for async mobile queries.
- Fallback: if Redis is unavailable, rebuild from the DB on every query.
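A read-through sketch of that policy, assuming a redis-py-style client (get/set/delete) and pickled graph payloads — the serialization format is an assumption:

```python
import pickle

def cache_key(user_id: str, memory_type: str) -> str:
    return f"worldmm:graph:{user_id}:{memory_type}"

def get_graph(redis, user_id, memory_type, rebuild_from_db):
    """Read-through cache. Any Redis failure degrades to a plain DB
    rebuild (the 200-500ms steady-state cost noted above)."""
    key = cache_key(user_id, memory_type)
    try:
        blob = redis.get(key)
        if blob is not None:
            return pickle.loads(blob)
    except Exception:
        redis = None  # Redis unavailable: skip the write-back too
    graph = rebuild_from_db(user_id, memory_type)
    if redis is not None:
        try:
            redis.set(key, pickle.dumps(graph))
        except Exception:
            pass  # cache is best-effort
    return graph

def on_triple_insert(redis, user_id, memory_type):
    """Scoped invalidation: only the affected user+memory_type key."""
    redis.delete(cache_key(user_id, memory_type))
```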
Visual Query Path
VLM2Vec encodes both video clips (ingest) and text queries (retrieval) into the same embedding space:
- Ingest: GPU worker encodes 30s clips -> embeddings stored in pgvector
- Query: GPU worker encodes query text -> cosine similarity against stored embeddings
- GPU unavailable at query time: visual retrieval returns empty results and the reasoning agent falls back to episodic + semantic memory (graceful degradation)
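The degradation behavior amounts to a try/except around the GPU hop. Both callables below are hypothetical stand-ins for the encoder client and the pgvector search:

```python
def visual_retrieve(query_text, encode_text, pgvector_search, top_k=5):
    """Encode the query on the GPU worker and run cosine-similarity
    search; if the worker is unreachable, return no visual hits so the
    reasoning agent falls back to episodic + semantic memory."""
    try:
        embedding = encode_text(query_text)  # POST /encode-text
    except Exception:
        return []  # graceful degradation: empty visual results
    return pgvector_search(embedding, top_k)
```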
LLM Client
Single request builder, two dispatch modes controlled by config:
    request = build_request(
        model="gpt-5-mini",
        messages=[...],
        max_completion_tokens=256,
        reasoning_effort="low",
    )

    if mode == "batch":
        batch_submit([request, ...])               # Batch API
    else:
        client.chat.completions.create(**request)  # Direct API
- All GPT-5 mini calls go through this client (captioning, NER, triples, merging, consolidation, reranking)
- GPT-5 reasoning agent always uses direct API (sequential loop)
- reasoning_effort="low" on all extraction/structured-output tasks to avoid wasted reasoning tokens
GPU Worker
Minimal stateless FastAPI service on a G5 instance:
- POST /encode-video: 30s video clip -> VLM2Vec embedding vector
- POST /encode-text: query string -> VLM2Vec embedding vector
- VLM2Vec-V2 loaded once at startup, stays in GPU memory
- Weights pre-baked into the AMI or loaded from EFS on boot
- Single instance during active use; can terminate when idle
TDD Strategy
Principle
If the assertion depends on what the LLM says, it's Layer 2 (validation). If it depends on what our code does with what the LLM says, it's Layer 1 (deterministic TDD).
Layer 1 — Deterministic (red/green TDD)
LLM is mocked. Tests verify mechanical correctness: prompt formatting, response parsing, data structure assembly, graph algorithms.
| Test | Asserts |
|---|---|
| test_captioner_prompt | Correct frames selected, transcript included, prompt populated, response parsed to string |
| test_openie_ner_parse | Given canned LLM response, extracts entity list in expected format |
| test_openie_triple_parse | Given canned LLM response, produces [subject, predicate, object] with correct types |
| test_multiscale_merge_prompt | Six captions assembled into merge prompt, response parsed to summary string |
| test_episodic_graph_ppr | Given triples, igraph built, PPR from seed returns expected ranked nodes |
| test_entity_resolution | Given embeddings at known distances, >= 0.6 triggers LLM confirm (mocked), alias created with correct canonical FK |
| test_semantic_consolidation_candidates | Given triple embeddings, correct candidates selected by cosine threshold |
| test_semantic_consolidation_merge | Given canned LLM merge response, old triple invalidated, new merged triple inserted |
| test_visual_encoder | Request sent to GPU worker with correct payload, response parsed to vector of expected dim |
| test_retrieval_agent_dispatch | Given canned SEARCH response, correct retriever called with correct query |
| test_retrieval_agent_termination | Given canned ANSWER response, loop exits and returns answer without another search |
| test_retrieval_agent_max_rounds | After 5 SEARCH responses, loop exits and forces answer generation |
| test_batch_client | Requests batched correctly, poll logic works, results mapped to original request IDs |
Layer 2 — Validation (real API, accepts variance)
Uses EgoLife fixtures with real API calls (Batch API for cost). Not part of red/green cycle.
| Test | Validates |
|---|---|
| test_caption_quality | Caption from real frames/transcript mentions identifiable entities |
| test_triple_extraction_quality | Extracted triples are semantically valid for source caption |
| test_multiscale_coherence | 3min summary preserves key information from 30s captions |
| test_retrieval_agent_e2e | Agent retrieves relevant context and produces reasonable answer for known EgoLife question |