
WorldMM Live Memory System — Design Spec

Date: 2026-03-22
Paper: https://arxiv.org/abs/2512.02425
Reference impl: https://github.com/wgcyeo/WorldMM
Location: main/server/worldmm/

Overview

A live wearable memory system based on the WorldMM paper's architecture. Video streams from Meta Ray-Ban glasses, audio is transcribed separately, and the system constructs three complementary memory stores (episodic, semantic, visual) that are queried via an adaptive reasoning agent. The system processes asynchronously with 30-second base windows and coarser summaries generated when their full time windows close.

Models

| Role | Model | Dispatch |
| --- | --- | --- |
| Captioning, NER, triple extraction, merging, consolidation, reranking | GPT-5 mini | Batch API (test) / Direct API (prod) |
| Reasoning agent, response generation | GPT-5 | Direct API (always) |
| Visual embedding (video clips + text queries) | VLM2Vec-V2 (~8B) | GPU worker on AWS G5 |
| Text embedding (entity resolution, semantic similarity) | OpenAI embeddings API | |

Architecture: Lambda + GPU Worker Split

Split along the GPU boundary:

  • Lambdas (existing server pattern): ingest endpoint, ASR trigger, all GPT-5 mini calls, graph storage, query/reasoning API
  • GPU worker (G5 instance): stateless VLM2Vec encoding only — receives video clips or text queries, returns embeddings.
  • Storage: PostgreSQL + pgvector (graphs, triples, embeddings), S3 (video segments, frames)
  • Graph cache: Redis for materialized igraph per user, invalidated on triple insert (scoped: only the specific user+memory_type key is invalidated)

Data Flow

Ingest Path

```
Meta Ray-Bans
  -> SDK pushes video frames to ingest endpoint (Lambda)
  -> Audio captured separately -> Whisper API -> transcript
  -> Frames + transcript buffered into 30s windows in S3

Every 30s window close:
  |-- Sample 16 frames uniformly from the 30s clip (sampled once, used for both captioning and VLM2Vec)
  |-- GPT-5 mini: caption(16 frames as base64 image_url content parts, transcript) -> 30s caption
  |-- GPT-5 mini: NER + episodic triple extraction -> triples -> graph update
  |-- GPU worker: VLM2Vec(same 16 frames) -> visual embedding (VECTOR(1536)) -> pgvector
  |-- Store caption + metadata in PostgreSQL

Every 3min window close:
  |-- GPT-5 mini: merge(six 30s captions) -> 3min summary -> episodic triples -> graph

Every 10min window close:
  |-- GPT-5 mini: merge(three 3min + one remaining 1min summaries) -> 10min summary -> episodic triples -> graph
  |-- GPT-5 mini: semantic triple extraction from 10min summary text only (not episodic triples) -> consolidation against existing semantic graph

Every 1hr window close:
  |-- GPT-5 mini: merge(six 10min summaries) -> 1hr summary -> episodic triples -> graph
```
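The window hierarchy above implies a simple close schedule: a scale closes whenever the stream offset is a multiple of its duration. A minimal sketch (the function name and the stream-aligned-windows assumption are ours, not from the reference impl) that also reproduces the "three 3min + one remaining 1min" shape at the 10-minute boundary, since 600 = 3 * 180 + 60:

```python
# Hypothetical helper: which summary windows close at a given stream offset?
# Assumes every scale closes on multiples of its duration from stream start.
SCALES = {"30s": 30, "3min": 180, "10min": 600, "1hr": 3600}

def closed_scales(offset_seconds: int) -> list[str]:
    """Return the scales whose window ends exactly at this offset."""
    return [name for name, dur in SCALES.items() if offset_seconds % dur == 0]

print(closed_scales(30))   # only the base window closes
print(closed_scales(180))  # base window + 3min summary
print(closed_scales(600))  # note: no 3min close here (600 % 180 != 0)
```

This is why the 10-minute merge takes three 3min summaries plus a leftover minute of base captions: the 3min and 10min grids do not align.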

Query Path

```
Mobile app -> query API (Lambda)
  -> Adaptive reasoning agent (GPT-5, up to 5 rounds):
      Round N: pick memory type + formulate search query
        |-- episodic: embed query -> PPR across temporal graphs -> LLM rerank -> captions
        |-- semantic: embed query -> PPR on entity graph -> top triples
        |-- visual: VLM2Vec encode query text (GPU worker) -> cosine sim -> frames from S3
      Accumulate retrieved context
  -> GPT-5: generate answer from all accumulated context
  -> Response to mobile app
```

Episodic vs Semantic Triples

Both are [subject, predicate, object] tuples stored in the same table with a memory_type discriminator. They differ in extraction timing, prompt intent, and lifecycle:

Episodic triples — event-specific facts:

  • Extracted at: every time scale (30s, 3min, 10min, 1hr) from each caption
  • Prompt intent: "Extract factual, event-specific relationships from this caption"
  • Examples: ["I", "stand at", "dining table"], ["Katrina", "asks about", "tomorrow's schedule"]
  • Lifecycle: immutable once created. Tied to a specific segment. Never consolidated or merged.
  • Retrieval: PPR across multi-scale temporal graphs, then LLM cross-scale reranking

Semantic triples — generalized relational knowledge:

  • Extracted at: 10-minute summaries only (coarser timescale to capture patterns)
  • Prompt intent: "Extract long-term patterns, habits, preferences, and social relationships — not what happened in this specific moment, but what is generally true"
  • Examples: ["I", "often eats", "fruits and snacks"], ["Alice", "expresses romantic feelings toward", "I"]
  • Lifecycle: undergo a consolidation pipeline. When new semantic triples arrive, they are compared (via embedding similarity) against existing ones. The LLM decides whether to merge, update, or remove conflicting triples.
  • Retrieval: PPR on a single evolving entity graph (no temporal scales)

This distinction drives two different prompt templates (prompts/episodic_triples.txt, prompts/semantic_triples.txt) and two different pipeline paths after extraction (episodic → direct graph insert, semantic → consolidation → graph update).

Prompt Templates

The prompts/ directory contains all LLM prompt templates:

| Template | Used by | Purpose |
| --- | --- | --- |
| caption.txt | captioner.py | Generate 30s caption from frames + transcript |
| ner.txt | openie.py | Extract named entities from caption |
| episodic_triples.txt | openie.py | Extract event-specific triples from caption |
| semantic_triples.txt | extraction.py | Extract relational/habitual triples from 10min summary |
| merge_captions.txt | multiscale.py | Merge N fine-grained captions into coarser summary |
| consolidation.txt | consolidation.py | Decide merge/update/remove for conflicting semantic triples |
| cross_scale_rerank.txt | episodic_retriever.py | Given candidates from all temporal scales + original query, select the top 3 most relevant |
| reasoning_agent.txt | agent.py | Adaptive retrieval: decide SEARCH or ANSWER each round |
| response.txt | agent.py | Generate final answer from accumulated retrieval context |
| entity_confirm.txt | openie.py | Confirm whether two surface forms refer to the same entity |

Reasoning Agent Action Format

The reasoning agent communicates via structured JSON output each round:

```
// Search action — agent wants to query a memory type
{
  "action": "SEARCH",
  "memory_type": "episodic" | "semantic" | "visual",
  "query": "free-form search query for the selected memory type"
}

// Answer action — agent has enough context to respond
{
  "action": "ANSWER"
}
```

The agent loop:

  1. Send prompt with: original user question, all prior rounds (action + retrieved context)
  2. Parse JSON response
  3. If SEARCH: dispatch to the corresponding retriever, append results to context, continue loop
  4. If ANSWER: exit loop, pass all accumulated context to the response agent (GPT-5) for the final answer
  5. If 5 rounds reached without ANSWER: force exit, pass accumulated context to the response agent
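The loop above can be sketched with the LLM and retrievers injected as callables (all names here are hypothetical; the real agent lives in agent.py):

```python
import json

MAX_ROUNDS = 5  # per the spec: force an answer after 5 rounds

def run_agent(question, call_model, retrievers):
    """Minimal sketch of the adaptive retrieval loop.

    call_model(question, context) -> raw JSON action string
    retrievers: {"episodic": fn, "semantic": fn, "visual": fn}, fn(query) -> list
    Returns the accumulated context to hand to the response agent.
    """
    context = []
    for _ in range(MAX_ROUNDS):
        action = json.loads(call_model(question, context))
        if action["action"] == "ANSWER":
            break
        results = retrievers[action["memory_type"]](action["query"])
        context.append({"query": action["query"], "results": results})
    return context

# Usage with a canned model: one SEARCH round, then ANSWER.
scripted = iter([
    '{"action": "SEARCH", "memory_type": "semantic", "query": "who is Kate"}',
    '{"action": "ANSWER"}',
])
ctx = run_agent("Who is Kate?", lambda q, c: next(scripted),
                {"semantic": lambda q: ["Kate is Katrina"]})
print(len(ctx))  # one retrieval round accumulated
```

The same shape handles the forced exit: if the model emits SEARCH five times, the `for` loop simply runs out and the accumulated context goes to the response agent.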

Segment Buffer

Lambdas are stateless. The segment buffer uses S3 + DynamoDB:

  • Incoming frames are written to S3 with keys like worldmm/{user_id}/buffer/{timestamp}.jpg
  • DynamoDB tracks the current window state: {user_id, window_start, frame_count, audio_key}
  • When frame_count reaches the expected count for a 30s window (or 30s of wall time elapses), the window is "closed" and an SNS notification triggers the processing fan-out
  • Audio chunks are similarly buffered in S3 and assembled per window
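The close condition is the only real logic in the buffer. A stand-in for the DynamoDB window record (the class, the frame-per-second assumption, and the field names other than those in the spec are ours):

```python
WINDOW_SECONDS = 30
EXPECTED_FRAMES = 30  # assumption for illustration: one buffered frame per second

class WindowState:
    """In-memory stand-in for the DynamoDB window record."""
    def __init__(self, user_id, window_start):
        self.user_id = user_id
        self.window_start = window_start
        self.frame_count = 0

    def add_frame(self, now):
        """Record one frame arrival; return True if the window should close."""
        self.frame_count += 1
        return self.should_close(now)

    def should_close(self, now):
        # Close on expected frame count OR 30s wall time, whichever comes first.
        return (self.frame_count >= EXPECTED_FRAMES
                or now - self.window_start >= WINDOW_SECONDS)

w = WindowState("u1", window_start=0.0)
for t in range(EXPECTED_FRAMES):
    closed = w.add_frame(now=float(t))
print(closed)  # the final expected frame closes the window
```

In production the close would be detected by a conditional DynamoDB update, with the SNS publish happening only on the request that flips the window to closed.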

Module Structure

```
main/server/worldmm/
|-- ingest/
|   |-- segment_buffer.py      # Accumulates frames/audio into 30s windows
|   |-- captioner.py           # GPT-5 mini: frames + transcript -> caption
|   |-- transcriber.py         # Whisper API: audio -> transcript
|   |-- app.py                 # Lambda handler for ingest endpoint
|
|-- memory/
|   |-- episodic/
|   |   |-- graph.py           # Multi-scale temporal knowledge graphs (PPR via igraph)
|   |   |-- openie.py          # NER + triple extraction (GPT-5 mini)
|   |   |-- multiscale.py      # Merge captions across time windows
|   |-- semantic/
|   |   |-- graph.py           # Evolving entity-relationship graph (PPR via igraph)
|   |   |-- extraction.py      # Semantic triple extraction
|   |   |-- consolidation.py   # Embedding similarity + LLM merge/update
|   |-- visual/
|       |-- encoder.py         # VLM2Vec client (calls GPU worker)
|       |-- index.py           # pgvector similarity search
|
|-- retrieval/
|   |-- agent.py               # Adaptive reasoning agent (GPT-5, up to 5 rounds)
|   |-- episodic_retriever.py  # PPR search across temporal scales + rerank
|   |-- semantic_retriever.py  # PPR search on entity graph
|   |-- visual_retriever.py    # VLM2Vec cosine sim + frame fetch
|
|-- llm/
|   |-- client.py              # OpenAI API wrapper (batch vs direct mode)
|   |-- batch.py               # Batch API request builder + poller
|   |-- prompts/               # Prompt templates
|
|-- gpu_worker/
|   |-- vlm2vec_server.py      # FastAPI on G5, serves VLM2Vec encoding
|   |-- Dockerfile             # AMI/container with pre-loaded weights
|
|-- tests/
    |-- fixtures/              # EgoLife-derived test data
    |-- (test files — see TDD Strategy)
```

Storage Schema

worldmm_segments

| Column | Type | Notes |
| --- | --- | --- |
| id | UUID | PK |
| user_id | FK | |
| start_time | TIMESTAMPTZ | |
| end_time | TIMESTAMPTZ | |
| duration_seconds | INTEGER | 30, 180, 600, 3600 — not an enum |
| caption | TEXT | |
| s3_video_key | VARCHAR | raw video clip |
| s3_frames_key | VARCHAR | sampled frames |
| transcript | TEXT | ASR output (base segments only) |
| parent_segment_id | UUID FK NULL | links 30s -> 3min -> 10min -> 1hr |

worldmm_entities

| Column | Type | Notes |
| --- | --- | --- |
| id | UUID | PK |
| user_id | FK | |
| canonical_entity_id | UUID FK NULL -> self | NULL = this IS the canonical |
| surface_form | VARCHAR | "Kate", "my roommate" |
| canonical_name | VARCHAR | "Katrina" — set on canonical row only |
| embedding | VECTOR(1536) | OpenAI text-embedding-3-large, requested with dimensions=1536 (the model's default output is 3072-dim) |

Index: HNSW on embedding WHERE canonical_entity_id IS NULL (canonical entities only).

worldmm_triples

| Column | Type | Notes |
| --- | --- | --- |
| id | UUID | PK |
| segment_id | FK | -> worldmm_segments |
| user_id | FK | |
| memory_type | ENUM | episodic, semantic |
| subject_entity_id | FK | -> worldmm_entities, points to canonical entity |
| predicate | VARCHAR | |
| object_entity_id | FK NULL | -> worldmm_entities, points to canonical entity (NULL if literal object) |
| object_literal | VARCHAR NULL | literal value when object is not an entity (e.g., "2:30 PM", "123 Main St") |
| created_at | TIMESTAMPTZ | |
| invalidated_at | TIMESTAMPTZ NULL | semantic consolidation soft-delete |

CHECK constraint: exactly one of object_entity_id or object_literal must be non-null.

Triple entity FKs point directly to the canonical entity row. Graph construction is a single-hop lookup. Entity re-resolution only touches the entities table. Literal-object triples (e.g., ["meeting", "started_at", "2:30 PM"]) are included in the graph as leaf nodes — the literal value becomes a node connected only to the subject entity. These nodes participate in PPR traversal but do not undergo entity resolution.
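Graph construction from this table is then a single pass over triple rows; the literal-leaf behavior described above can be sketched as (row shape and the `lit:` prefix are hypothetical):

```python
from collections import defaultdict

def build_graph(triples):
    """Adjacency sets from triple rows. Literal objects become leaf nodes
    attached only to their subject. Hypothetical row shape:
    (subject_entity_id, predicate, object_entity_id, object_literal)."""
    adj = defaultdict(set)
    for subj, pred, obj_entity, obj_literal in triples:
        target = obj_entity if obj_entity is not None else f"lit:{obj_literal}"
        adj[subj].add(target)
        adj[target].add(subj)  # undirected edges for PPR traversal
    return adj

g = build_graph([
    ("e_meeting", "started_at", None, "2:30 PM"),
    ("e_meeting", "attended_by", "e_katrina", None),
])
print(sorted(g["lit:2:30 PM"]))  # literal node links only to its subject
```

Because every FK already points at a canonical entity, no alias chasing happens here; an entity re-resolution only rewrites rows in worldmm_entities and the graph is rebuilt from the same triples.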

worldmm_visual_embeddings

Column Type Notes
id UUID PK
segment_id FK -> worldmm_segments
user_id FK
embedding VECTOR(1536) VLM2Vec-encoded video clip
timestamp TIMESTAMPTZ

Index: HNSW on embedding. Embedding dimension: VECTOR(1536) matching VLM2Vec-V2 output.

PPR Parameters (from paper)

| Parameter | Value | Notes |
| --- | --- | --- |
| Damping factor | 0.85 | Standard PPR damping |
| Implementation | igraph personalized_pagerank(implementation="prpack") | |
| Episodic seed selection | Query entities matched via NER + embedding similarity | |
| Episodic top-K per scale | 10 (30s), 5 (3min), 5 (10min), 3 (1hr) | Then the LLM cross-scale reranker filters to top 3 |
| Semantic seed selection | Top-K triples by cosine similarity to query embedding | Entities from those triples become a uniform personalization vector |
| Semantic scoring | Edge score = PPR(subject) + PPR(object) | Edge-focused, not node-focused |
| Semantic top-K | 10 triples | |
| Entity match threshold | 0.6 cosine similarity | Tunable via config |
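Both retrievers rank with personalized PageRank; a pure-Python power-iteration sketch (the production path uses igraph's prpack, per the table above) makes the seeding and damping concrete:

```python
def personalized_pagerank(adj, seeds, damping=0.85, iters=50):
    """Power-iteration PPR sketch. adj: node -> set of neighbor nodes
    (every node must appear as a key); seeds: the restart/personalization set."""
    nodes = list(adj)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(iters):
        nxt = {n: (1 - damping) * restart[n] for n in nodes}
        for n in nodes:
            share = damping * score[n] / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        score = nxt
    return score

adj = {
    "I": {"kitchen", "Katrina"},
    "kitchen": {"I"},
    "Katrina": {"I", "schedule"},
    "schedule": {"Katrina"},
}
scores = personalized_pagerank(adj, seeds={"Katrina"})
print(max(scores, key=scores.get))  # the seed entity ranks highest
```

For semantic retrieval the edge score in the table is then just scores[subject] + scores[object] per candidate triple.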

Entity Resolution Flow

  1. NER extracts "Kate" from caption
  2. Embed "Kate" via text embedding API
  3. HNSW search against canonical entities for that user (cosine >= 0.6)
  4. Match found -> LLM confirmation -> create alias row with canonical_entity_id FK
  5. No match -> create new canonical entity (canonical_entity_id = NULL)
  6. Triple FK points to canonical entity directly
  7. Entity resolution is batched per 30s window (all entities from one caption resolved together)

Graph Caching

Per-user materialized igraph held in Redis:

  • Key: worldmm:graph:{user_id}:{memory_type}
  • Invalidation: delete on new triple insert. During active wear, the cache is effectively always cold — accept the 200-500ms rebuild cost per query as steady-state. This is acceptable for async mobile queries.
  • Fallback: if Redis is unavailable, rebuild from the DB every time.
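The scoped invalidation (only the touched user+memory_type key is dropped) can be shown with an in-memory stand-in for Redis (class and method names are ours):

```python
def graph_key(user_id, memory_type):
    return f"worldmm:graph:{user_id}:{memory_type}"

class GraphCache:
    """Dict-backed stand-in for the Redis graph cache."""
    def __init__(self):
        self.store = {}

    def get_or_build(self, user_id, memory_type, build):
        key = graph_key(user_id, memory_type)
        if key not in self.store:
            self.store[key] = build()  # the 200-500ms igraph rebuild in prod
        return self.store[key]

    def invalidate(self, user_id, memory_type):
        # Scoped: a new episodic triple never evicts the semantic graph.
        self.store.pop(graph_key(user_id, memory_type), None)

print(graph_key("u1", "episodic"))
```

The Redis fallback behavior maps to `get_or_build` simply always taking the `build()` branch when the store is unreachable.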

Visual Query Path

VLM2Vec encodes both video clips (ingest) and text queries (retrieval) into the same embedding space:

  • Ingest: GPU worker encodes 30s clips -> embeddings stored in pgvector
  • Query: GPU worker encodes query text -> cosine similarity against stored embeddings
  • GPU unavailable at query time: visual retrieval returns empty results, and the reasoning agent falls back to episodic + semantic (graceful degradation)
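The query side with its degradation path can be sketched as follows (the `encode` callback stands in for the GPU worker call, and the in-memory ranking stands in for the pgvector HNSW query; names are hypothetical):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def visual_search(query_text, encode, index, top_k=3):
    """Encode the query via the GPU worker and rank stored clip embeddings.
    If the worker is unreachable, return no results so the agent falls back
    to episodic + semantic retrieval."""
    try:
        q = encode(query_text)
    except ConnectionError:
        return []  # graceful degradation
    ranked = sorted(index, key=lambda row: cosine(q, row["embedding"]),
                    reverse=True)
    return ranked[:top_k]

index = [
    {"id": "seg_a", "embedding": [1.0, 0.0]},
    {"id": "seg_b", "embedding": [0.0, 1.0]},
]
hits = visual_search("red jacket", lambda t: [1.0, 0.0], index, top_k=1)
print(hits[0]["id"])
```

Returning an empty list (rather than raising) is what lets the reasoning agent treat a dead GPU worker as "no visual evidence" and keep going.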

LLM Client

Single request builder, two dispatch modes controlled by config:

```python
request = build_request(
    model="gpt-5-mini",
    messages=[...],
    max_completion_tokens=256,
    reasoning_effort="low"
)

if mode == "batch":
    batch_submit([request, ...])  # Batch API
else:
    client.chat.completions.create(**request)  # Direct API
```
  • All GPT-5 mini calls go through this client (captioning, NER, triples, merging, consolidation, reranking)
  • GPT-5 reasoning agent always uses direct API (sequential loop)
  • reasoning_effort="low" on all extraction/structured-output tasks to avoid wasted reasoning tokens

GPU Worker

Minimal stateless FastAPI service on a G5 instance:

  • POST /encode-video: 30s video clip -> VLM2Vec embedding vector
  • POST /encode-text: query string -> VLM2Vec embedding vector
  • VLM2Vec-V2 loaded once at startup, stays in GPU memory
  • Weights pre-baked into the AMI or loaded from EFS on boot
  • Single instance during active use; can terminate when idle

TDD Strategy

Principle

If the assertion depends on what the LLM says, it's Layer 2 (validation). If it depends on what our code does with what the LLM says, it's Layer 1 (deterministic TDD).

Layer 1 — Deterministic (red/green TDD)

LLM is mocked. Tests verify mechanical correctness: prompt formatting, response parsing, data structure assembly, graph algorithms.

| Test | Asserts |
| --- | --- |
| test_captioner_prompt | Correct frames selected, transcript included, prompt populated, response parsed to string |
| test_openie_ner_parse | Given canned LLM response, extracts entity list in expected format |
| test_openie_triple_parse | Given canned LLM response, produces [subject, predicate, object] with correct types |
| test_multiscale_merge_prompt | Six captions assembled into merge prompt, response parsed to summary string |
| test_episodic_graph_ppr | Given triples, igraph built, PPR from seed returns expected ranked nodes |
| test_entity_resolution | Given embeddings at known distances, >= 0.6 triggers LLM confirm (mocked), alias created with correct canonical FK |
| test_semantic_consolidation_candidates | Given triple embeddings, correct candidates selected by cosine threshold |
| test_semantic_consolidation_merge | Given canned LLM merge response, old triple invalidated, new merged triple inserted |
| test_visual_encoder | Request sent to GPU worker with correct payload, response parsed to vector of expected dim |
| test_retrieval_agent_dispatch | Given canned SEARCH response, correct retriever called with correct query |
| test_retrieval_agent_termination | Given canned ANSWER response, loop exits and returns answer without another search |
| test_retrieval_agent_max_rounds | After 5 SEARCH responses, loop exits and forces answer generation |
| test_batch_client | Requests batched correctly, poll logic works, results mapped to original request IDs |
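As an illustration of the Layer 1 shape, a test like test_openie_triple_parse asserts only on what our code does with a canned response (the parser here is a toy stand-in, and the JSON-list response format is an assumption, not the real prompt contract):

```python
import json

def parse_triples(raw: str):
    """Toy stand-in for the openie.py parsing under test: assumes the LLM
    returns a JSON list of [subject, predicate, object] arrays."""
    return [tuple(t) for t in json.loads(raw)]

def test_openie_triple_parse():
    canned = '[["I", "stand at", "dining table"]]'  # mocked LLM response
    triples = parse_triples(canned)
    assert triples == [("I", "stand at", "dining table")]
    assert all(len(t) == 3 for t in triples)

test_openie_triple_parse()
```

No real API call is made, so the test is deterministic and belongs in the red/green cycle.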

Layer 2 — Validation (real API, accepts variance)

Uses EgoLife fixtures with real API calls (Batch API for cost). Not part of red/green cycle.

| Test | Validates |
| --- | --- |
| test_caption_quality | Caption from real frames/transcript mentions identifiable entities |
| test_triple_extraction_quality | Extracted triples are semantically valid for the source caption |
| test_multiscale_coherence | 3min summary preserves key information from 30s captions |
| test_retrieval_agent_e2e | Agent retrieves relevant context and produces a reasonable answer for a known EgoLife question |