
Ingest Session (WorldMM Pipeline)

After a recording session ends, the ingest pipeline transforms raw audio and video frames into a structured knowledge graph stored in PostgreSQL, enabling semantic and episodic memory retrieval.

Flow

The ingest_session function in main/server/worldmm/pipeline/ingest_session.py orchestrates the following stages:

1. Load Session Metadata

The session directory contains metadata.json with session_id, user_id, window_count, video_start_ms, audio_start_ms, fps, and duration_s.
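As a minimal sketch of this stage, the metadata file could be loaded into a typed record as shown below. The `SessionMetadata` class, the `load_session_metadata` helper, and the field types are assumptions made for illustration; only the field names come from the description above.

```python
import json
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass(frozen=True)
class SessionMetadata:
    # Field names follow metadata.json as described above; the types are assumed.
    session_id: str
    user_id: str
    window_count: int
    video_start_ms: int
    audio_start_ms: int
    fps: float
    duration_s: float


def load_session_metadata(session_dir: Path) -> SessionMetadata:
    """Read metadata.json from a session directory and validate its keys."""
    raw = json.loads((session_dir / "metadata.json").read_text(encoding="utf-8"))
    names = [f.name for f in fields(SessionMetadata)]
    missing = [name for name in names if name not in raw]
    if missing:
        raise ValueError(f"metadata.json is missing fields: {missing}")
    # Ignore any extra keys so unknown additions do not break loading.
    return SessionMetadata(**{name: raw[name] for name in names})
```

Failing early on missing keys keeps later stages (alignment, segment storage) from running with a partially defined session.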

2. Audio Transcription (Whisper)

If audio.m4a exists and is non-empty, OpenAIWhisperASR transcribes it using either whisper-large-v3 (Groq) or whisper-1 (OpenAI). The result is a list of timestamped transcript segments.

3. Transcript Alignment

align_transcript_to_windows maps Whisper segments to 30-second window slots based on video_start_ms and audio_start_ms offsets. Each window receives the text spoken during its time range.
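The alignment idea can be sketched as follows. This is not the signature or logic of the real align_transcript_to_windows; the `TranscriptSegment` type, the function name, and the rule of assigning each segment to the window that contains its start time are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List

WINDOW_S = 30.0  # window length described above


@dataclass
class TranscriptSegment:
    # Hypothetical shape of a Whisper segment: times in seconds on the audio clock.
    start_s: float
    end_s: float
    text: str


def align_segments_to_windows_sketch(
    segments: List[TranscriptSegment],
    window_count: int,
    video_start_ms: int,
    audio_start_ms: int,
) -> List[str]:
    """Return one transcript string per window, indexed by window number."""
    # Shift audio-relative times onto the video timeline.
    offset_s = (audio_start_ms - video_start_ms) / 1000.0
    buckets: List[List[str]] = [[] for _ in range(window_count)]
    for seg in segments:
        index = int((seg.start_s + offset_s) // WINDOW_S)
        # Assumption: speech outside the recorded windows is dropped.
        if 0 <= index < window_count:
            buckets[index].append(seg.text.strip())
    return [" ".join(parts) for parts in buckets]
```

With audio starting 1000 ms after video, a segment that begins at 29.5 s on the audio clock lands at 30.5 s on the video clock and is assigned to the second window.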

4. Per-Window Processing (repeated for each window)

For each 30-second window directory (windows/window_NNN/):

  • Caption generation — JPEG frames are base64-encoded and sent to a vision LLM (llama-4-scout via Groq or gpt-4o-mini). For windows with more than 4 frames, caption_parallel batches sub-calls and merges results. The transcript for the window is included as context.
  • Segment storage — A WorldMMSegment record is written to PostgreSQL with start_time, end_time, caption, transcript, source_session_id (from metadata.json), and source_window_index. source_session_id is required for the memories feed to group all windows of a session into a single feed tile; if it is null the feed falls back to treating each segment as its own session.
  • Named Entity Recognition — The caption is sent to an extraction LLM to identify named entities (people, places, objects).
  • Entity resolution — Each entity is looked up by embedding similarity (HippoRAG-style): the entity is embedded, searched in pgvector, and if a candidate is found, an LLM confirms whether it is the same entity. New entities are created if no match is confirmed.
  • Triple extraction — The caption plus entity list is sent to the LLM to extract RDF triples (subject, predicate, object). Triples are stored as WorldMMTriple records with memory_type="episodic".
  • Visual embedding (optional) — If a GPU worker URL is configured, the frame batch is encoded by VLM2VecClient and the resulting embedding is stored in WorldMMSegment.visual_embedding (pgvector).
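The entity resolution step above (embed, search for a similar candidate, confirm with an LLM, otherwise create) can be sketched in plain Python. Here an in-memory dict stands in for the pgvector search and a callback stands in for the LLM confirmation; the function names and the 0.8 similarity threshold are hypothetical.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def resolve_entity(
    name: str,
    embedding: Sequence[float],
    known: Dict[str, List[float]],
    confirm_same: Callable[[str, str], bool],
    min_similarity: float = 0.8,  # assumed threshold, not taken from the pipeline
) -> str:
    """Return the canonical entity name, registering a new entity if unmatched."""
    best_name: Optional[str] = None
    best_score = min_similarity
    # Stand-in for the vector search: pick the most similar known entity.
    for candidate, candidate_embedding in known.items():
        score = cosine_similarity(embedding, candidate_embedding)
        if score >= best_score:
            best_name, best_score = candidate, score
    # Stand-in for the LLM check: only merge when the match is confirmed.
    if best_name is not None and confirm_same(name, best_name):
        return best_name
    known[name] = list(embedding)
    return name
```

The confirmation step matters because embedding similarity alone can merge distinct entities with similar names or descriptions.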

5. Multiscale Merging

If enough windows are available, captions are hierarchically merged across temporal scales:

  • 6 windows → 3-minute summary
  • 3 three-minute summaries → 10-minute summary (nominal label; three 3-minute summaries span 9 minutes)
  • 6 ten-minute summaries → 1-hour summary

Each merged summary is stored as an additional WorldMMSegment with source_session_id propagated from the session metadata, so these synthetic segments are grouped with the per-window segments in the feed.
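The hierarchy above can be sketched as repeated fixed-size grouping. The scale labels, the `merge_multiscale` name, and the choice to merge only complete groups (dropping leftovers) are assumptions; the `summarize` callback stands in for the LLM merge call.

```python
from typing import Callable, Dict, List

# (scale label, number of lower-level items merged), following the list above.
SCALES = [("3min", 6), ("10min", 3), ("1h", 6)]


def merge_multiscale(
    window_captions: List[str],
    summarize: Callable[[List[str]], str],
) -> Dict[str, List[str]]:
    """Merge captions upward through each scale, stopping when a level is empty."""
    levels: Dict[str, List[str]] = {}
    current = list(window_captions)
    for label, group_size in SCALES:
        # Only complete groups are merged in this sketch.
        merged = [
            summarize(current[i : i + group_size])
            for i in range(0, len(current) - group_size + 1, group_size)
        ]
        if not merged:
            break
        levels[label] = merged
        current = merged
    return levels
```

For example, 18 windows yield three 3-minute summaries and one 10-minute summary, while fewer than 6 windows produce no merged summaries at all.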

6. Semantic Triple Extraction

Higher-level summaries (10-minute and above) are processed by the extraction LLM to produce memory_type="semantic" triples representing long-term facts rather than episodic events. The synthetic segment created for each semantic summary also carries source_session_id so it is grouped with the parent session in the feed.
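A minimal sketch of this stage, assuming summaries are keyed by hypothetical scale labels ("3min", "10min", "1h") and that `extract` is a stand-in for the extraction LLM returning (subject, predicate, object) tuples. The `Triple` class is illustrative and is not the WorldMMTriple model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

# Scales treated as "10-minute and above" in this sketch.
SEMANTIC_SCALES = ("10min", "1h")


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    object: str
    memory_type: str


def extract_semantic_triples(
    summaries_by_scale: Dict[str, List[str]],
    extract: Callable[[str], Iterable[Tuple[str, str, str]]],
) -> List[Triple]:
    """Run extraction on higher-level summaries only and tag results as semantic."""
    triples: List[Triple] = []
    for scale in SEMANTIC_SCALES:
        for summary in summaries_by_scale.get(scale, []):
            for subject, predicate, obj in extract(summary):
                triples.append(Triple(subject, predicate, obj, memory_type="semantic"))
    return triples
```

Summaries below the 10-minute scale are skipped here, which mirrors the split between episodic triples from per-window captions and semantic triples from longer summaries.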

Entry Point

  • Module: main/server/worldmm/pipeline/ingest_session.py, function ingest_session()
  • Triggered by: the session-end Lambda and the audio/frame upload Lambdas (via async Lambda invocation)

Note: The cloud Lambda path (IngestWindowFunction) that processes individual 30-second windows in production — including GPU captioning, DLQ-backed retry, and segment status management — is documented separately in ingest-window.md.

Key Data Models

| Table | Purpose |
| --- | --- |
| WorldMMSegment | Stores per-window and merged captions, transcripts, and visual embeddings; source_session_id and source_window_index link each segment back to its originating recording session |
| WorldMMEntity | Named entities with canonical resolution |
| WorldMMTriple | RDF triples (episodic + semantic memory) |

Dependencies

  • PostgreSQL + pgvector (DATABASE_URL or DB_HOST/DB_NAME/DB_PORT + SSM secrets)
  • Groq / OpenAI API (GROQ_API_KEY)
  • GPU worker (GPU_WORKER_URL) — optional, for visual embeddings
  • S3 raw frames: sessions/{sessionId}/window_{n:03d}/frame_{n:03d}.jpg
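The S3 key template uses Python format-spec syntax, where `:03d` zero-pads an integer to three digits. A small sketch follows; the `frame_key` helper is hypothetical, and it assumes the window and frame numbers are independent indices even though the template writes both as `n`.

```python
def frame_key(session_id: str, window_index: int, frame_index: int) -> str:
    """Build the S3 key for one raw frame, zero-padding both indices to 3 digits."""
    return (
        f"sessions/{session_id}/"
        f"window_{window_index:03d}/frame_{frame_index:03d}.jpg"
    )


# Example: the frame with index 15 in the window with index 2 of session "abc".
print(frame_key("abc", 2, 15))  # sessions/abc/window_002/frame_015.jpg
```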