Ingest Session (WorldMM Pipeline)
After a recording session ends, the ingest pipeline transforms raw audio and video frames into a structured knowledge graph stored in PostgreSQL, enabling semantic and episodic memory retrieval.
Flow
The ingest_session function in main/server/worldmm/pipeline/ingest_session.py orchestrates the following stages:
1. Load Session Metadata
The session directory contains metadata.json with session_id, user_id, window_count, video_start_ms, audio_start_ms, fps, and duration_s.
2. Audio Transcription (Whisper)
If audio.m4a exists and is non-empty, OpenAIWhisperASR transcribes it using either whisper-large-v3 (Groq) or whisper-1 (OpenAI). The result is a list of timestamped transcript segments.
3. Transcript Alignment
align_transcript_to_windows maps Whisper segments to 30-second window slots based on video_start_ms and audio_start_ms offsets. Each window receives the text spoken during its time range.
4. Per-Window Processing (repeated for each window)
For each 30-second window directory (windows/window_NNN/):
- Caption generation — JPEG frames are base64-encoded and sent to a vision LLM (
llama-4-scoutvia Groq orgpt-4o-mini). For windows with more than 4 frames,caption_parallelbatches sub-calls and merges results. The transcript for the window is included as context. - Segment storage — A
WorldMMSegmentrecord is written to PostgreSQL withstart_time,end_time,caption,transcript,source_session_id(frommetadata.json), andsource_window_index.source_session_idis required for the memories feed to group all windows of a session into a single feed tile; if it is null the feed falls back to treating each segment as its own session. - Named Entity Recognition — The caption is sent to an extraction LLM to identify named entities (people, places, objects).
- Entity resolution — Each entity is looked up by embedding similarity (HippoRAG-style): the entity is embedded, searched in pgvector, and if a candidate is found, an LLM confirms whether it is the same entity. New entities are created if no match is confirmed.
- Triple extraction — The caption plus entity list is sent to the LLM to extract RDF triples
(subject, predicate, object). Triples are stored asWorldMMTriplerecords withmemory_type="episodic". - Visual embedding (optional) — If a GPU worker URL is configured, the frame batch is encoded by
VLM2VecClientand the resulting embedding is stored inWorldMMSegment.visual_embedding(pgvector).
5. Multiscale Merging
If enough windows are available, captions are hierarchically merged across temporal scales:
- 6 windows → 3-minute summary
- 3 three-minute summaries → 10-minute summary
- 6 ten-minute summaries → 1-hour summary
Each merged summary is stored as an additional WorldMMSegment with source_session_id propagated from the session metadata, so these synthetic segments are grouped with the per-window segments in the feed.
6. Semantic Triple Extraction
Higher-level summaries (10-minute and above) are processed by the extraction LLM to produce memory_type="semantic" triples representing long-term facts rather than episodic events. The synthetic segment created for each semantic summary also carries source_session_id so it is grouped with the parent session in the feed.
Entry Point
- Module:
main/server/worldmm/pipeline/ingest_session.py→ingest_session() - Triggered by: session end Lambda and audio/frame upload Lambdas (via async Lambda invocation)
Note: The cloud Lambda path (
IngestWindowFunction) that processes individual 30-second windows in production — including GPU captioning, DLQ-backed retry, and segment status management — is documented separately in ingest-window.md.
Key Data Models
| Table | Purpose |
|---|---|
WorldMMSegment | Stores per-window and merged captions, transcripts, visual embeddings; source_session_id and source_window_index link each segment back to its originating recording session |
WorldMMEntity | Named entities with canonical resolution |
WorldMMTriple | RDF triples (episodic + semantic memory) |
Dependencies
- PostgreSQL + pgvector (
DATABASE_URLorDB_HOST/DB_NAME/DB_PORT+ SSM secrets) - Groq / OpenAI API (
GROQ_API_KEY) - GPU worker (
GPU_WORKER_URL) — optional, for visual embeddings - S3 raw frames:
sessions/{sessionId}/window_{n:03d}/frame_{n:03d}.jpg