
WorldMM Live Memory System — Design Spec

Date: 2026-03-22
Paper: https://arxiv.org/abs/2512.02425
Reference impl: https://github.com/wgcyeo/WorldMM
Location: main/server/worldmm/

Overview

A live wearable memory system based on the WorldMM paper's architecture. Video streams from Meta Ray-Ban glasses, audio is transcribed separately, and the system constructs three complementary memory stores (episodic, semantic, visual) that are queried via an adaptive reasoning agent. The system processes asynchronously with 30-second base windows and coarser summaries generated when their full time windows close.

Models

| Role | Model | Dispatch |
| --- | --- | --- |
| Captioning, NER, triple extraction, merging, consolidation, reranking | GPT-5 mini | Batch API (test) / Direct API (prod) |
| Reasoning agent, response generation | GPT-5 | Direct API (always) |
| Visual embedding (video clips + text queries) | VLM2Vec-V2 (~8B) | GPU worker on AWS G5 |
| Text embedding (entity resolution, semantic similarity) | OpenAI embeddings API | |

Architecture: Lambda + GPU Worker Split

Split along the GPU boundary:

  • Lambdas (existing server pattern): ingest endpoint, ASR trigger, all GPT-5 mini calls, graph storage, query/reasoning API
  • GPU worker (G5 instance): stateless VLM2Vec encoding only — receives video clips or text queries, returns embeddings.
  • Storage: PostgreSQL + pgvector (graphs, triples, embeddings), S3 (video segments, frames)
  • Graph cache: Redis for materialized igraph per user, invalidated on triple insert (scoped: only the specific user+memory_type key is invalidated)

Data Flow

Ingest Path

```
Meta Ray-Bans
  -> SDK pushes video frames to ingest endpoint (Lambda)
  -> Audio captured separately -> Whisper API -> transcript
  -> Frames + transcript buffered into 30s windows in S3

Every 30s window close:
  |-- Sample 16 frames uniformly from the 30s clip (sampled once, used for both captioning and VLM2Vec)
  |-- GPT-5 mini: caption(16 frames as base64 image_url content parts, transcript) -> 30s caption
  |-- GPT-5 mini: NER + episodic triple extraction -> triples -> graph update
  |-- GPU worker: VLM2Vec(same 16 frames) -> visual embedding (VECTOR(1536)) -> pgvector
  |-- Store caption + metadata in PostgreSQL

Every 3min window close:
  |-- GPT-5 mini: merge(six 30s captions) -> 3min summary -> episodic triples -> graph

Every 10min window close:
  |-- GPT-5 mini: merge(three 3min + one remaining 1min summaries) -> 10min summary -> episodic triples -> graph
  |-- GPT-5 mini: semantic triple extraction from 10min summary text only (not episodic triples) -> consolidation against existing semantic graph

Every 1hr window close:
  |-- GPT-5 mini: merge(six 10min summaries) -> 1hr summary -> episodic triples -> graph
```
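The window hierarchy above implies a simple close schedule: a scale closes whenever the stream offset is a multiple of its duration. A minimal sketch (the function name and the stream-aligned-windows assumption are ours, not from the reference impl) that also reproduces the "three 3min + one remaining 1min" shape at the 10-minute boundary, since 600 = 3 * 180 + 60:

```python
# Hypothetical helper: which summary windows close at a given stream offset?
# Assumes every scale closes on multiples of its duration from stream start.
SCALES = {"30s": 30, "3min": 180, "10min": 600, "1hr": 3600}

def closed_scales(offset_seconds: int) -> list[str]:
    """Return the scales whose window ends exactly at this offset."""
    return [name for name, dur in SCALES.items() if offset_seconds % dur == 0]

print(closed_scales(30))   # only the base window closes
print(closed_scales(180))  # base window + 3min summary
print(closed_scales(600))  # note: no 3min close here (600 % 180 != 0)
```

This is why the 10-minute merge takes three 3min summaries plus a leftover minute of base captions: the 3min and 10min grids do not align.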

Query Path

```
Mobile app -> query API (Lambda)
  -> Adaptive reasoning agent (GPT-5, up to 5 rounds):
      Round N: pick memory type + formulate search query
        |-- episodic: embed query -> PPR across temporal graphs -> LLM rerank -> captions
        |-- semantic: embed query -> PPR on entity graph -> top triples
        |-- visual: VLM2Vec encode query text (GPU worker) -> cosine sim -> frames from S3
      Accumulate retrieved context
  -> GPT-5: generate answer from all accumulated context
  -> Response to mobile app
```

Episodic vs Semantic Triples

Both are [subject, predicate, object] tuples stored in the same table with a memory_type discriminator. They differ in extraction timing, prompt intent, and lifecycle:

Episodic triples — event-specific facts:

  • Extracted at: every time scale (30s, 3min, 10min, 1hr) from each caption
  • Prompt intent: "Extract factual, event-specific relationships from this caption"
  • Examples: ["I", "stand at", "dining table"], ["Katrina", "asks about", "tomorrow's schedule"]
  • Lifecycle: immutable once created. Tied to a specific segment. Never consolidated or merged.
  • Retrieval: PPR across multi-scale temporal graphs, then LLM cross-scale reranking

Semantic triples — generalized relational knowledge:

  • Extracted at: 10-minute summaries only (coarser timescale to capture patterns)
  • Prompt intent: "Extract long-term patterns, habits, preferences, and social relationships — not what happened in this specific moment, but what is generally true"
  • Examples: ["I", "often eats", "fruits and snacks"], ["Alice", "expresses romantic feelings toward", "I"]
  • Lifecycle: undergo a consolidation pipeline. When new semantic triples arrive, they are compared (via embedding similarity) against existing ones. The LLM decides whether to merge, update, or remove conflicting triples.
  • Retrieval: PPR on a single evolving entity graph (no temporal scales)

This distinction drives two different prompt templates (prompts/episodic_triples.txt, prompts/semantic_triples.txt) and two different pipeline paths after extraction (episodic → direct graph insert, semantic → consolidation → graph update).

Prompt Templates

The prompts/ directory contains all LLM prompt templates:

| Template | Used by | Purpose |
| --- | --- | --- |
| caption.txt | captioner.py | Generate 30s caption from frames + transcript |
| ner.txt | openie.py | Extract named entities from caption |
| episodic_triples.txt | openie.py | Extract event-specific triples from caption |
| semantic_triples.txt | extraction.py | Extract relational/habitual triples from 10min summary |
| merge_captions.txt | multiscale.py | Merge N fine-grained captions into coarser summary |
| consolidation.txt | consolidation.py | Decide merge/update/remove for conflicting semantic triples |
| cross_scale_rerank.txt | episodic_retriever.py | Given candidates from all temporal scales + original query, select the top 3 most relevant |
| reasoning_agent.txt | agent.py | Adaptive retrieval: decide SEARCH or ANSWER each round |
| response.txt | agent.py | Generate final answer from accumulated retrieval context |
| entity_confirm.txt | openie.py | Confirm whether two surface forms refer to the same entity |

Reasoning Agent Action Format

The reasoning agent communicates via structured JSON output each round:

```
// Search action — agent wants to query a memory type
{
  "action": "SEARCH",
  "memory_type": "episodic" | "semantic" | "visual",
  "query": "free-form search query for the selected memory type"
}

// Answer action — agent has enough context to respond
{
  "action": "ANSWER"
}
```

The agent loop:

  1. Send prompt with: original user question, all prior rounds (action + retrieved context)
  2. Parse JSON response
  3. If SEARCH: dispatch to the corresponding retriever, append results to context, continue loop
  4. If ANSWER: exit loop, pass all accumulated context to the response agent (GPT-5) for the final answer
  5. If 5 rounds reached without ANSWER: force exit, pass accumulated context to the response agent
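The loop above can be sketched with the LLM and retrievers injected as callables (all names here are hypothetical; the real agent lives in agent.py):

```python
import json

MAX_ROUNDS = 5  # per the spec: force an answer after 5 rounds

def run_agent(question, call_model, retrievers):
    """Minimal sketch of the adaptive retrieval loop.

    call_model(question, context) -> raw JSON action string
    retrievers: {"episodic": fn, "semantic": fn, "visual": fn}, fn(query) -> list
    Returns the accumulated context to hand to the response agent.
    """
    context = []
    for _ in range(MAX_ROUNDS):
        action = json.loads(call_model(question, context))
        if action["action"] == "ANSWER":
            break
        results = retrievers[action["memory_type"]](action["query"])
        context.append({"query": action["query"], "results": results})
    return context

# Usage with a canned model: one SEARCH round, then ANSWER.
scripted = iter([
    '{"action": "SEARCH", "memory_type": "semantic", "query": "who is Kate"}',
    '{"action": "ANSWER"}',
])
ctx = run_agent("Who is Kate?", lambda q, c: next(scripted),
                {"semantic": lambda q: ["Kate is Katrina"]})
print(len(ctx))  # one retrieval round accumulated
```

The same shape handles the forced exit: if the model emits SEARCH five times, the `for` loop simply runs out and the accumulated context goes to the response agent.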

Segment Buffer

Lambdas are stateless. The segment buffer uses S3 + DynamoDB:

  • Incoming frames are written to S3 with keys like worldmm/{user_id}/buffer/{timestamp}.jpg
  • DynamoDB tracks the current window state: {user_id, window_start, frame_count, audio_key}
  • When frame_count reaches the expected count for a 30s window (or 30s of wall time elapses), the window is "closed" and an SNS notification triggers the processing fan-out
  • Audio chunks are similarly buffered in S3 and assembled per window
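The close condition is the only real logic in the buffer. A stand-in for the DynamoDB window record (the class, the frame-per-second assumption, and the field names other than those in the spec are ours):

```python
WINDOW_SECONDS = 30
EXPECTED_FRAMES = 30  # assumption for illustration: one buffered frame per second

class WindowState:
    """In-memory stand-in for the DynamoDB window record."""
    def __init__(self, user_id, window_start):
        self.user_id = user_id
        self.window_start = window_start
        self.frame_count = 0

    def add_frame(self, now):
        """Record one frame arrival; return True if the window should close."""
        self.frame_count += 1
        return self.should_close(now)

    def should_close(self, now):
        # Close on expected frame count OR 30s wall time, whichever comes first.
        return (self.frame_count >= EXPECTED_FRAMES
                or now - self.window_start >= WINDOW_SECONDS)

w = WindowState("u1", window_start=0.0)
for t in range(EXPECTED_FRAMES):
    closed = w.add_frame(now=float(t))
print(closed)  # the final expected frame closes the window
```

In production the close would be detected by a conditional DynamoDB update, with the SNS publish happening only on the request that flips the window to closed.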

Module Structure

```
main/server/worldmm/
|-- ingest/
|   |-- segment_buffer.py      # Accumulates frames/audio into 30s windows
|   |-- captioner.py           # GPT-5 mini: frames + transcript -> caption
|   |-- transcriber.py         # Whisper API: audio -> transcript
|   |-- app.py                 # Lambda handler for ingest endpoint
|
|-- memory/
|   |-- episodic/
|   |   |-- graph.py           # Multi-scale temporal knowledge graphs (PPR via igraph)
|   |   |-- openie.py          # NER + triple extraction (GPT-5 mini)
|   |   |-- multiscale.py      # Merge captions across time windows
|   |-- semantic/
|   |   |-- graph.py           # Evolving entity-relationship graph (PPR via igraph)
|   |   |-- extraction.py      # Semantic triple extraction
|   |   |-- consolidation.py   # Embedding similarity + LLM merge/update
|   |-- visual/
|       |-- encoder.py         # VLM2Vec client (calls GPU worker)
|       |-- index.py           # pgvector similarity search
|
|-- retrieval/
|   |-- agent.py               # Adaptive reasoning agent (GPT-5, up to 5 rounds)
|   |-- episodic_retriever.py  # PPR search across temporal scales + rerank
|   |-- semantic_retriever.py  # PPR search on entity graph
|   |-- visual_retriever.py    # VLM2Vec cosine sim + frame fetch
|
|-- llm/
|   |-- client.py              # OpenAI API wrapper (batch vs direct mode)
|   |-- batch.py               # Batch API request builder + poller
|   |-- prompts/               # Prompt templates
|
|-- gpu_worker/
|   |-- vlm2vec_server.py      # FastAPI on G5, serves VLM2Vec encoding
|   |-- Dockerfile             # AMI/container with pre-loaded weights
|
|-- tests/
    |-- fixtures/              # EgoLife-derived test data
    |-- (test files — see TDD Strategy)
```

Storage Schema

worldmm_segments

| Column | Type | Notes |
| --- | --- | --- |
| id | UUID | PK |
| user_id | FK | |
| start_time | TIMESTAMPTZ | |
| end_time | TIMESTAMPTZ | |
| duration_seconds | INTEGER | 30, 180, 600, 3600 — not an enum |
| caption | TEXT | |
| s3_video_key | VARCHAR | raw video clip |
| s3_frames_key | VARCHAR | sampled frames |
| transcript | TEXT | ASR output (base segments only) |
| parent_segment_id | UUID FK NULL | links 30s -> 3min -> 10min -> 1hr |

worldmm_entities

| Column | Type | Notes |
| --- | --- | --- |
| id | UUID | PK |
| user_id | FK | |
| canonical_entity_id | UUID FK NULL -> self | NULL = this IS the canonical |
| surface_form | VARCHAR | "Kate", "my roommate" |
| canonical_name | VARCHAR | "Katrina" — set on canonical row only |
| embedding | VECTOR(1536) | OpenAI text-embedding-3-large, requested with dimensions=1536 (the model's default output is 3072-dim) |

Index: HNSW on embedding WHERE canonical_entity_id IS NULL (canonical entities only).

worldmm_triples

| Column | Type | Notes |
| --- | --- | --- |
| id | UUID | PK |
| segment_id | FK | -> worldmm_segments |
| user_id | FK | |
| memory_type | ENUM | episodic, semantic |
| subject_entity_id | FK | -> worldmm_entities, points to canonical entity |
| predicate | VARCHAR | |
| object_entity_id | FK NULL | -> worldmm_entities, points to canonical entity (NULL if literal object) |
| object_literal | VARCHAR NULL | literal value when object is not an entity (e.g., "2:30 PM", "123 Main St") |
| created_at | TIMESTAMPTZ | |
| invalidated_at | TIMESTAMPTZ NULL | semantic consolidation soft-delete |

CHECK constraint: exactly one of object_entity_id or object_literal must be non-null.

Triple entity FKs point directly to the canonical entity row. Graph construction is a single-hop lookup. Entity re-resolution only touches the entities table. Literal-object triples (e.g., ["meeting", "started_at", "2:30 PM"]) are included in the graph as leaf nodes — the literal value becomes a node connected only to the subject entity. These nodes participate in PPR traversal but do not undergo entity resolution.
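Graph construction from this table is then a single pass over triple rows; the literal-leaf behavior described above can be sketched as (row shape and the `lit:` prefix are hypothetical):

```python
from collections import defaultdict

def build_graph(triples):
    """Adjacency sets from triple rows. Literal objects become leaf nodes
    attached only to their subject. Hypothetical row shape:
    (subject_entity_id, predicate, object_entity_id, object_literal)."""
    adj = defaultdict(set)
    for subj, pred, obj_entity, obj_literal in triples:
        target = obj_entity if obj_entity is not None else f"lit:{obj_literal}"
        adj[subj].add(target)
        adj[target].add(subj)  # undirected edges for PPR traversal
    return adj

g = build_graph([
    ("e_meeting", "started_at", None, "2:30 PM"),
    ("e_meeting", "attended_by", "e_katrina", None),
])
print(sorted(g["lit:2:30 PM"]))  # literal node links only to its subject
```

Because every FK already points at a canonical entity, no alias chasing happens here; an entity re-resolution only rewrites rows in worldmm_entities and the graph is rebuilt from the same triples.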

worldmm_visual_embeddings

Column Type Notes
id UUID PK
segment_id FK -> worldmm_segments
user_id FK
embedding VECTOR(1536) VLM2Vec-encoded video clip
timestamp TIMESTAMPTZ

Index: HNSW on embedding. Embedding dimension: VECTOR(1536) matching VLM2Vec-V2 output.

PPR Parameters (from paper)

| Parameter | Value | Notes |
| --- | --- | --- |
| Damping factor | 0.85 | Standard PPR damping |
| Implementation | igraph personalized_pagerank(implementation="prpack") | |
| Episodic seed selection | Query entities matched via NER + embedding similarity | |
| Episodic top-K per scale | 10 (30s), 5 (3min), 5 (10min), 3 (1hr) | Then the LLM cross-scale reranker filters to top 3 |
| Semantic seed selection | Top-K triples by cosine similarity to query embedding | Entities from those triples become a uniform personalization vector |
| Semantic scoring | Edge score = PPR(subject) + PPR(object) | Edge-focused, not node-focused |
| Semantic top-K | 10 triples | |
| Entity match threshold | 0.6 cosine similarity | Tunable via config |
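Both retrievers rank with personalized PageRank; a pure-Python power-iteration sketch (the production path uses igraph's prpack, per the table above) makes the seeding and damping concrete:

```python
def personalized_pagerank(adj, seeds, damping=0.85, iters=50):
    """Power-iteration PPR sketch. adj: node -> set of neighbor nodes
    (every node must appear as a key); seeds: the restart/personalization set."""
    nodes = list(adj)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(iters):
        nxt = {n: (1 - damping) * restart[n] for n in nodes}
        for n in nodes:
            share = damping * score[n] / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        score = nxt
    return score

adj = {
    "I": {"kitchen", "Katrina"},
    "kitchen": {"I"},
    "Katrina": {"I", "schedule"},
    "schedule": {"Katrina"},
}
scores = personalized_pagerank(adj, seeds={"Katrina"})
print(max(scores, key=scores.get))  # the seed entity ranks highest
```

For semantic retrieval the edge score in the table is then just scores[subject] + scores[object] per candidate triple.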

Entity Resolution Flow

  1. NER extracts "Kate" from caption
  2. Embed "Kate" via text embedding API
  3. HNSW search against canonical entities for that user (cosine >= 0.6)
  4. Match found -> LLM confirmation -> create alias row with canonical_entity_id FK
  5. No match -> create new canonical entity (canonical_entity_id = NULL)
  6. Triple FK points to canonical entity directly
  7. Entity resolution is batched per 30s window (all entities from one caption resolved together)

Graph Caching

Per-user materialized igraph held in Redis:

  • Key: worldmm:graph:{user_id}:{memory_type}
  • Invalidation: delete on new triple insert. During active wear, the cache is effectively always cold — accept the 200-500ms rebuild cost per query as steady-state. This is acceptable for async mobile queries.
  • Fallback: if Redis is unavailable, rebuild from the DB every time.
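The scoped invalidation (only the touched user+memory_type key is dropped) can be shown with an in-memory stand-in for Redis (class and method names are ours):

```python
def graph_key(user_id, memory_type):
    return f"worldmm:graph:{user_id}:{memory_type}"

class GraphCache:
    """Dict-backed stand-in for the Redis graph cache."""
    def __init__(self):
        self.store = {}

    def get_or_build(self, user_id, memory_type, build):
        key = graph_key(user_id, memory_type)
        if key not in self.store:
            self.store[key] = build()  # the 200-500ms igraph rebuild in prod
        return self.store[key]

    def invalidate(self, user_id, memory_type):
        # Scoped: a new episodic triple never evicts the semantic graph.
        self.store.pop(graph_key(user_id, memory_type), None)

print(graph_key("u1", "episodic"))
```

The Redis fallback behavior maps to `get_or_build` simply always taking the `build()` branch when the store is unreachable.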

Visual Query Path

VLM2Vec encodes both video clips (ingest) and text queries (retrieval) into the same embedding space:

  • Ingest: GPU worker encodes 30s clips -> embeddings stored in pgvector
  • Query: GPU worker encodes query text -> cosine similarity against stored embeddings
  • GPU unavailable at query time: visual retrieval returns empty results, and the reasoning agent falls back to episodic + semantic (graceful degradation)
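The query side with its degradation path can be sketched as follows (the `encode` callback stands in for the GPU worker call, and the in-memory ranking stands in for the pgvector HNSW query; names are hypothetical):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def visual_search(query_text, encode, index, top_k=3):
    """Encode the query via the GPU worker and rank stored clip embeddings.
    If the worker is unreachable, return no results so the agent falls back
    to episodic + semantic retrieval."""
    try:
        q = encode(query_text)
    except ConnectionError:
        return []  # graceful degradation
    ranked = sorted(index, key=lambda row: cosine(q, row["embedding"]),
                    reverse=True)
    return ranked[:top_k]

index = [
    {"id": "seg_a", "embedding": [1.0, 0.0]},
    {"id": "seg_b", "embedding": [0.0, 1.0]},
]
hits = visual_search("red jacket", lambda t: [1.0, 0.0], index, top_k=1)
print(hits[0]["id"])
```

Returning an empty list (rather than raising) is what lets the reasoning agent treat a dead GPU worker as "no visual evidence" and keep going.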

LLM Client

Single request builder, two dispatch modes controlled by config:

```python
request = build_request(
    model="gpt-5-mini",
    messages=[...],
    max_completion_tokens=256,
    reasoning_effort="low"
)

if mode == "batch":
    batch_submit([request, ...])  # Batch API
else:
    client.chat.completions.create(**request)  # Direct API
```
  • All GPT-5 mini calls go through this client (captioning, NER, triples, merging, consolidation, reranking)
  • GPT-5 reasoning agent always uses direct API (sequential loop)
  • reasoning_effort="low" on all extraction/structured-output tasks to avoid wasted reasoning tokens

GPU Worker

Minimal stateless FastAPI service on a G5 instance:

  • POST /encode-video: 30s video clip -> VLM2Vec embedding vector
  • POST /encode-text: query string -> VLM2Vec embedding vector
  • VLM2Vec-V2 loaded once at startup, stays in GPU memory
  • Weights pre-baked into the AMI or loaded from EFS on boot
  • Single instance during active use; can terminate when idle

TDD Strategy

Principle

If the assertion depends on what the LLM says, it's Layer 2 (validation). If it depends on what our code does with what the LLM says, it's Layer 1 (deterministic TDD).

Layer 1 — Deterministic (red/green TDD)

LLM is mocked. Tests verify mechanical correctness: prompt formatting, response parsing, data structure assembly, graph algorithms.

| Test | Asserts |
| --- | --- |
| test_captioner_prompt | Correct frames selected, transcript included, prompt populated, response parsed to string |
| test_openie_ner_parse | Given canned LLM response, extracts entity list in expected format |
| test_openie_triple_parse | Given canned LLM response, produces [subject, predicate, object] with correct types |
| test_multiscale_merge_prompt | Six captions assembled into merge prompt, response parsed to summary string |
| test_episodic_graph_ppr | Given triples, igraph built, PPR from seed returns expected ranked nodes |
| test_entity_resolution | Given embeddings at known distances, >= 0.6 triggers LLM confirm (mocked), alias created with correct canonical FK |
| test_semantic_consolidation_candidates | Given triple embeddings, correct candidates selected by cosine threshold |
| test_semantic_consolidation_merge | Given canned LLM merge response, old triple invalidated, new merged triple inserted |
| test_visual_encoder | Request sent to GPU worker with correct payload, response parsed to vector of expected dim |
| test_retrieval_agent_dispatch | Given canned SEARCH response, correct retriever called with correct query |
| test_retrieval_agent_termination | Given canned ANSWER response, loop exits and returns answer without another search |
| test_retrieval_agent_max_rounds | After 5 SEARCH responses, loop exits and forces answer generation |
| test_batch_client | Requests batched correctly, poll logic works, results mapped to original request IDs |
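As an illustration of the Layer 1 shape, a test like test_openie_triple_parse asserts only on what our code does with a canned response (the parser here is a toy stand-in, and the JSON-list response format is an assumption, not the real prompt contract):

```python
import json

def parse_triples(raw: str):
    """Toy stand-in for the openie.py parsing under test: assumes the LLM
    returns a JSON list of [subject, predicate, object] arrays."""
    return [tuple(t) for t in json.loads(raw)]

def test_openie_triple_parse():
    canned = '[["I", "stand at", "dining table"]]'  # mocked LLM response
    triples = parse_triples(canned)
    assert triples == [("I", "stand at", "dining table")]
    assert all(len(t) == 3 for t in triples)

test_openie_triple_parse()
```

No real API call is made, so the test is deterministic and belongs in the red/green cycle.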

Layer 2 — Validation (real API, accepts variance)

Uses EgoLife fixtures with real API calls (Batch API for cost). Not part of red/green cycle.

| Test | Validates |
| --- | --- |
| test_caption_quality | Caption from real frames/transcript mentions identifiable entities |
| test_triple_extraction_quality | Extracted triples are semantically valid for the source caption |
| test_multiscale_coherence | 3min summary preserves key information from 30s captions |
| test_retrieval_agent_e2e | Agent retrieves relevant context and produces a reasonable answer for a known EgoLife question |