Memories Chat (AI Q&A)

An authenticated user asks a natural-language question about their past experiences; the system retrieves relevant memory segments from the WorldMM knowledge graph and uses a GPU-hosted LLM to compose a grounded answer.

Flow

  1. User submits a question — The mobile app's chat screen sends a POST request with { "question": "...", "chat_id": "<optional>" }. Setting verbose: true returns the full reasoning trace.

  2. Authentication — The Lambda requires a valid Cognito JWT. user_id is extracted from the auth context.

  3. Database initialization — PostgreSQL connection is configured from SSM secrets and SQLAlchemy ORM is initialized.

  4. Chat persistence (pre-GPU) — The user's message is saved to DynamoDB (Messages table) before the GPU call so that it is never lost on timeout. If chat_id is null, a new chat record is created in the Chats table with the question as the title. Ownership is verified for existing chats (returns 403 on mismatch).

  5. GPU worker resolution — The Lambda reads GPU_INSTANCE_ID from its environment variable (set at deploy time from SSM). It calls _resolve_gpu_url() to look up the instance via ec2.describe_instances. If the instance is stopped or terminated (stale SSM ID), the Lambda performs a tag-based fallback: it scans EC2 for any running instance tagged Name=encache-gpu-worker, adopts that instance, and writes the new ID back to SSM (/encache/gpu/instance_id) so subsequent invocations do not repeat the scan. If no running instance is found, a spot instance is launched from the launch template and a graceful fallback message is returned. Private IP is always preferred over public IP because both the chat Lambda and the GPU worker are in the same VPC; using private IP means SG-to-SG rules apply rather than traffic exiting through NAT.

  5a. GPU readiness polling — After URL resolution, wait_for_gpu_health() polls _gpu_is_healthy() (a GET /health probe with a 3 s socket timeout) with exponential backoff until the worker responds 200 or a termination condition is reached. This replaces the previous single-probe behavior: the Lambda no longer fails immediately on the first unhealthy check.

      Termination conditions, checked after each failed probe:

      • Max-wait timeout: elapsed_ms >= GPU_MAX_WAIT_MS (default 120 000 ms / 2 min). Logs gpu_retry_exhausted_max_wait.
      • Lambda timeout guard: lambda_remaining_ms - elapsed_ms <= 10 000. Stops polling with ≥10 s left for response assembly. Logs gpu_retry_lambda_timeout_approaching.

      Backoff sequence: the delay starts at GPU_RETRY_INITIAL_DELAY_MS (default 2 000 ms), doubles after each failed check, and caps at 60 000 ms.

      If wait_for_gpu_health() returns (False, elapsed_ms), gpu_worker_url is cleared and the fallback message path executes.

  6. Knowledge graph loading — load_episodic_graphs and load_semantic_graph load the user's entity and triple data from PostgreSQL into in-memory graph structures.

  7. Retrieval (multi-modal) — The ReasoningAgent runs a multi-step reasoning loop. In each round it decides whether to search and which memory type to query:

     • Episodic retriever — Embeds the query (via TextEmbedder), searches episodic graphs across all temporal scales, and returns ranked segments.
     • Semantic retriever — Searches semantic triples via pgvector embedding similarity.
     • Visual retriever — Encodes the query as a visual embedding via VLM2VecClient, queries search_visual_embeddings, and resolves captions for matching segments.

  8. Answer generation — The reasoning LLM (GPULLMClient) synthesizes retrieved context into a natural-language answer.

  9. Response persistence — The assistant's answer is saved to DynamoDB and the chat's updatedAt timestamp is touched.

  10. Response — Returns { "answer": "...", "chat_id": "..." }. With verbose: true, also returns a trace object with search rounds, memory types used, and result counts.
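The ordering of the persistence steps in the flow above (save the user's message first, call the slow backend second, save the answer last) can be illustrated with a small in-memory sketch. Everything here is hypothetical: InMemoryChatStore stands in for the Chats and Messages tables, handle_chat and answer_fn are illustrative names, the fallback wording is invented, and treating an unknown chat_id like an ownership mismatch is an assumption. It is not the actual Lambda code.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical wording; the real fallback message is not given in this document.
FALLBACK_MESSAGE = "The assistant is warming up. Please try again shortly."


@dataclass
class InMemoryChatStore:
    """Hypothetical stand-in for the DynamoDB Chats and Messages tables."""
    chats: dict = field(default_factory=dict)     # chat_id -> {"owner", "title"}
    messages: list = field(default_factory=list)  # (chat_id, role, text)

    def ensure_chat(self, user_id: str, chat_id: Optional[str], title: str) -> str:
        if chat_id is None:
            # New chat: the question doubles as the title.
            chat_id = str(uuid.uuid4())
            self.chats[chat_id] = {"owner": user_id, "title": title}
            return chat_id
        chat = self.chats.get(chat_id)
        # Assumption: an unknown chat_id is handled like an ownership mismatch.
        if chat is None or chat["owner"] != user_id:
            raise PermissionError("chat does not belong to user")
        return chat_id

    def save(self, chat_id: str, role: str, text: str) -> None:
        self.messages.append((chat_id, role, text))


def handle_chat(store: InMemoryChatStore, user_id: str, body: dict,
                answer_fn: Callable[[str], Optional[str]]) -> dict:
    """answer_fn returns an answer, or None when the GPU worker is unavailable."""
    question = body["question"]
    try:
        chat_id = store.ensure_chat(user_id, body.get("chat_id"), title=question)
    except PermissionError:
        return {"statusCode": 403, "body": {"error": "forbidden"}}
    # Persist the question before the slow call so a timeout cannot lose it.
    store.save(chat_id, "user", question)
    # Assumption: the fallback message is persisted like a normal answer.
    answer = answer_fn(question) or FALLBACK_MESSAGE
    store.save(chat_id, "assistant", answer)
    return {"statusCode": 200, "body": {"answer": answer, "chat_id": chat_id}}
```

Because the user's message is written before answer_fn runs, a crash or timeout inside answer_fn still leaves the question in the store.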

Frontend: Chat Screen

File: main/app/app/chat.tsx

The chat screen uses a manual Animated.Value (keyboardOffsetValue) to translate the entire content area — messages list and input dock together — when the keyboard appears or disappears. This keeps messages visible above the keyboard without covering them.

KeyboardAvoidingView is present with behavior="position" but the primary keyboard handling is the Animated.View translateY approach, which gives coordinated, smooth motion for the whole chat body.

Platform handling:

  • iOS: listens to keyboardWillChangeFrame and computes nextHeight = windowHeight - event.endCoordinates.screenY. Animates translateY to -nextHeight using the event's own duration.
  • Android: listens to keyboardDidShow / keyboardDidHide. On show, animates translateY to -event.endCoordinates.height; on hide, animates back to 0. Both use 250 ms with Easing.out(Easing.cubic).

InputDock receives disableKeyboardAnimation={true} so it does not create its own keyboard listeners or apply an independent translateY. All motion comes from the parent Animated.View.

Layout hierarchy:

ThemedView (flex: 1, watermark)
  AppHeader
  KeyboardAvoidingView behavior="position" (flex: 1)
    Animated.View (translateY: keyboardOffsetValue)
      FlatList (message list)
      InputDock (text input + send button, disableKeyboardAnimation=true)

The InputDock receives bottomInset from useSafeAreaInsets() to handle notched/gesture-bar devices correctly on both platforms.

The FlatList uses showsVerticalScrollIndicator={false} to hide the native scrollbar while keeping scroll functionality intact.

Entry Point

  • Lambda: main/server/api/memories/chat/app.py → lambda_handler
  • HTTP method: POST /memories/chat (API Gateway, Cognito-authenticated)

Key Components

Component | Location
--- | ---
ReasoningAgent | main/server/worldmm/retrieval/agent.py
TextEmbedder | main/server/worldmm/memory/text_embedder.py
VLM2VecClient | main/server/worldmm/memory/visual/encoder.py
GPULLMClient | main/server/worldmm/llm/client.py
ChatRepository | main/server/layers/shared/python/shared/chat/repository.py

Dependencies

  • PostgreSQL (WorldMMSegment, WorldMMEntity, WorldMMTriple, pgvector)
  • DynamoDB (MESSAGES_TABLE_NAME, CHATS_TABLE_NAME)
  • GPU EC2 worker (GPU_INSTANCE_ID, GPU_WORKER_PORT, GPU_LAUNCH_TEMPLATE_ID, GPU_MAX_WAIT_MS, GPU_RETRY_INITIAL_DELAY_MS)
  • Groq API (GROQ_API_KEY) for text embedding and LLM calls

GPU Instance Discovery

The watchdog on the GPU worker shuts the instance down after 480 s of idle time but does not update SSM. This creates a stale-ID cycle: both the chat Lambda and the ingest Lambda read a terminated instance ID from SSM and address traffic to the dead instance, so any replacement instance receives zero invocations, causing the watchdog to kill it immediately.

The tag-based fallback breaks the cycle:

_resolve_gpu_url(instance_id):
  state = ec2.describe_instances([instance_id]).State.Name
  if state in (stopped, stopping, terminated):
    running = _find_running_gpu_instance_by_tag()   # Filters: tag:Name=encache-gpu-worker, state=running
    if running:
      _update_ssm_gpu_instance_id(running.InstanceId)  # PUT /encache/gpu/instance_id
      return http://{running.PrivateIpAddress}:{port}
    if state == stopped:
      ec2.start_instances([instance_id])
    if state == terminated and launch_template_id:
      ec2.run_instances(LaunchTemplate=..., Tags=[Name=encache-gpu-worker])
      _update_ssm_gpu_instance_id(new_id)
    return None                                     # caller returns fallback message
  return http://{instance.PrivateIpAddress}:{port}

Private IP is used because both Lambdas are inside the VPC. Public IP routes through the NAT gateway and loses security-group identity, causing SG-to-SG rules to fail.
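The tag-based lookup and SSM write-back in the pseudocode above can be sketched with boto3-style client calls. This is a minimal sketch, not the project's implementation: the function names find_running_gpu_instance_by_tag and adopt_running_instance are illustrative, the clients are passed in as parameters so the logic can be exercised without AWS access, and pagination of describe_instances is ignored.

```python
from typing import Any, Optional

GPU_TAG_NAME = "encache-gpu-worker"
SSM_PARAM = "/encache/gpu/instance_id"


def find_running_gpu_instance_by_tag(ec2: Any) -> Optional[dict]:
    """Return the first running instance tagged Name=encache-gpu-worker, or None."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Name", "Values": [GPU_TAG_NAME]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    for reservation in resp.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            return instance
    return None


def adopt_running_instance(ec2: Any, ssm: Any, port: int) -> Optional[str]:
    """Adopt a running tagged instance: record its ID in SSM, return its URL."""
    instance = find_running_gpu_instance_by_tag(ec2)
    if instance is None:
        return None  # caller falls through to start/launch and the fallback message
    # Write the fresh ID back so later invocations skip the tag scan.
    ssm.put_parameter(
        Name=SSM_PARAM,
        Value=instance["InstanceId"],
        Type="String",
        Overwrite=True,
    )
    # Private IP keeps traffic inside the VPC so SG-to-SG rules apply.
    return f"http://{instance['PrivateIpAddress']}:{port}"
```

In a Lambda, ec2 and ssm would typically be boto3.client("ec2") and boto3.client("ssm"), with port taken from GPU_WORKER_PORT.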

GPU Health Retry Configuration

Env Var | Type | Default | Description
--- | --- | --- | ---
GPU_MAX_WAIT_MS | int | 120000 | Maximum total polling duration (ms). Retry loop exits with fallback when elapsed time exceeds this value.
GPU_RETRY_INITIAL_DELAY_MS | int | 2000 | First inter-probe sleep (ms). Doubles after each failed check, capped at 60 000 ms.

Backoff sequence with defaults: 2 s → 4 s → 8 s → 16 s → 32 s → 60 s → 60 s → … until 120 s total elapsed.
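To see how the sleeps add up against the max-wait budget, the schedule can be computed directly. The helper below is a hypothetical illustration (backoff_schedule is not a function from the codebase) and rests on two simplifying assumptions: each health probe takes zero time, and a sleep is never shortened to fit the remaining budget.

```python
def backoff_schedule(initial_ms=2_000, cap_ms=60_000, max_wait_ms=120_000):
    """Return (sleeps, total_elapsed_ms) for a doubling backoff with a cap.

    Assumes probes are instantaneous and sleeps are not clamped to the budget.
    """
    delays = []
    elapsed = 0
    delay = initial_ms
    # The max-wait check runs after each failed probe, i.e. before each sleep.
    while elapsed < max_wait_ms:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * 2, cap_ms)
    return delays, elapsed


delays, total = backoff_schedule()
print(delays)  # [2000, 4000, 8000, 16000, 32000, 60000]
print(total)   # 122000
```

Under these assumptions the first five sleeps total 62 s and the first 60 s sleep brings the total to 122 s, so with the default values the loop would exit before a second 60 s sleep. Real probes take time (up to the 3 s socket timeout each), which only shortens the schedule further.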

The Lambda timeout guard (lambda_remaining_ms - elapsed_ms <= 10 000) always takes priority over GPU_MAX_WAIT_MS — polling stops early so at least 10 s remains for response assembly and DynamoDB writes.
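The interaction between the two termination conditions can be sketched as a polling loop. This is a hedged reconstruction from the description in this document, not the actual wait_for_gpu_health(): the probe, sleep, and now parameters are injected for testability, lambda_remaining_ms is assumed to be measured when polling starts, and shortening each sleep so it cannot eat into the 10 s reserve is an assumption made here to honor the stated guarantee.

```python
import logging
import time
from typing import Callable, Tuple

logger = logging.getLogger(__name__)

LAMBDA_GUARD_MS = 10_000  # time reserved for response assembly and DynamoDB writes
MAX_DELAY_MS = 60_000     # backoff cap


def wait_for_gpu_health(
    probe: Callable[[], bool],
    lambda_remaining_ms: int,
    max_wait_ms: int = 120_000,
    initial_delay_ms: int = 2_000,
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.monotonic,
) -> Tuple[bool, int]:
    """Poll probe() with exponential backoff; return (healthy, elapsed_ms)."""
    start = now()
    delay_ms = initial_delay_ms
    while True:
        healthy = probe()
        elapsed_ms = int((now() - start) * 1000)
        if healthy:
            return True, elapsed_ms
        # The Lambda guard is checked first, so it takes priority over max-wait.
        budget_ms = lambda_remaining_ms - elapsed_ms - LAMBDA_GUARD_MS
        if budget_ms <= 0:
            logger.warning("gpu_retry_lambda_timeout_approaching")
            return False, elapsed_ms
        if elapsed_ms >= max_wait_ms:
            logger.warning("gpu_retry_exhausted_max_wait")
            return False, elapsed_ms
        # Assumption: clamp the sleep so it never cuts into the guard reserve.
        sleep(min(delay_ms, budget_ms) / 1000)
        delay_ms = min(delay_ms * 2, MAX_DELAY_MS)
```

A caller would pass the remaining time from the Lambda context (for example context.get_remaining_time_in_millis()) as lambda_remaining_ms and treat a (False, elapsed_ms) result as the signal to clear the worker URL and return the fallback message.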