Memories Chat (AI Q&A)

An authenticated user asks a natural-language question about their past experiences; the system retrieves relevant memory segments from the WorldMM knowledge graph and uses a GPU-hosted LLM to compose a grounded answer.

Flow

1. User submits a question — The mobile app's chat screen sends a POST request with { "question": "...", "chat_id": "<optional>" }. Setting verbose: true returns the full reasoning trace.
2. Authentication — The Lambda requires a valid Cognito JWT. user_id is extracted from the auth context.
3. Database initialization — The PostgreSQL connection is configured from SSM secrets and the SQLAlchemy ORM is initialized.
4. Chat persistence (pre-GPU) — The user's message is saved to DynamoDB (Messages table) before the GPU call so that it is never lost on timeout. If chat_id is null, a new chat record is created in the Chats table with the question as the title. Ownership is verified for existing chats (returns 403 on mismatch).
5. GPU worker resolution — The Lambda reads GPU_INSTANCE_ID from its environment (set at deploy time from SSM) and calls _resolve_gpu_url() to look up the instance via ec2.describe_instances. If the instance is stopped, stopping, or terminated (a stale SSM ID), the Lambda falls back to a tag-based lookup: it scans EC2 for a running instance tagged Name=encache-gpu-worker, adopts it, and writes the new ID back to SSM (/encache/gpu/instance_id) so subsequent invocations do not repeat the scan. If no running instance is found, a stopped instance is restarted, or a terminated one is replaced by a spot instance launched from the launch template, and a graceful fallback message is returned. The private IP is always preferred over the public IP because the chat Lambda and the GPU worker are in the same VPC: traffic to the private IP stays inside the VPC, so SG-to-SG rules apply, whereas traffic to the public IP exits through NAT.

5a. GPU readiness polling — After URL resolution, wait_for_gpu_health() polls _gpu_is_healthy() (a GET /health probe with a 3 s socket timeout) with exponential backoff until the worker responds 200 or a termination condition is reached.

Termination conditions, checked after each failed probe:
- Max-wait timeout: elapsed_ms >= GPU_MAX_WAIT_MS (default 15 000 ms in the async worker; see deployment notes). Logs gpu_retry_exhausted_max_wait.
- Lambda timeout guard: lambda_remaining_ms - elapsed_ms <= 10 000. Stops polling with ≥10 s left for response assembly. Logs gpu_retry_lambda_timeout_approaching.

Backoff sequence: the delay starts at GPU_RETRY_INITIAL_DELAY_MS (default 2 000 ms), doubles after each failed check, and caps at 60 000 ms.

If wait_for_gpu_health() returns (False, elapsed_ms), gpu_worker_url is cleared and the GPU-timeout path executes (see step 5b).

5b. GPU timeout path (async worker) — When wait_for_gpu_health() returns False and the request is on the async dispatcher path (i.e., message_id is set):
- repo.mark_gpu_pending(chat_id, message_id) sets gpu_pending=True, ready=False, content="" on the placeholder message.
- _enqueue_gpu_retry(chat_id, message_id, user_id, question, retry_count=0, delay_seconds=120) enqueues a message to ChatGpuRetryQueue (SQS).
- The worker returns {status: "gpu_retry_enqueued"} immediately; ChatGpuRetryFunction takes over from here (see chat-flow.md#gpuRetry).

On the legacy synchronous path (no message_id), the old fallback behavior is preserved: a static "GPU starting up" message is saved with ready=true.

6. Knowledge graph loading — load_episodic_graphs and load_semantic_graph load the user's entity and triple data from PostgreSQL into in-memory graph structures.
7. Retrieval (multi-modal) — The ReasoningAgent runs a multi-step reasoning loop. In each round it decides whether to search and which memory type to query:
   - Episodic retriever — Embeds the query (via TextEmbedder), searches episodic graphs across all temporal scales, returns ranked segments.
   - Semantic retriever — Searches semantic triples via pgvector embedding similarity.
   - Visual retriever — Encodes the query as a visual embedding via VLM2VecClient, queries search_visual_embeddings, and resolves captions for matching segments.
8. Answer generation — The reasoning LLM (GPULLMClient) synthesizes retrieved context into a natural-language answer.
9. Response persistence — The assistant's answer is saved to DynamoDB and the chat's updatedAt timestamp is touched.
10. Response — Returns { "answer": "...", "chat_id": "..." }. With verbose=true, also returns a trace object with search rounds, memory types used, and result counts.
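
The GPU-timeout branch from step 5b can be sketched roughly as follows. This is a minimal illustration, not the actual worker code: the SQS message body layout, the save_fallback_message method, and the "gpu_fallback" status are hypothetical names chosen for the example, while mark_gpu_pending, the 120 s delay, and the "gpu_retry_enqueued" status come from the description above. The repo and sqs arguments are injected so the logic can be exercised without AWS; in production sqs would presumably be a boto3 SQS client.

```python
import json
from typing import Any, Optional

RETRY_DELAY_SECONDS = 120  # delay_seconds value quoted in step 5b


def enqueue_gpu_retry(sqs: Any, queue_url: str, chat_id: str, message_id: str,
                      user_id: str, question: str, retry_count: int = 0,
                      delay_seconds: int = RETRY_DELAY_SECONDS) -> None:
    """Send one delayed retry job to the queue (SQS allows at most 900 s of delay)."""
    body = {  # assumed field names; the real message schema is not documented here
        "chat_id": chat_id,
        "message_id": message_id,
        "user_id": user_id,
        "question": question,
        "retry_count": retry_count,
    }
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(body),
                     DelaySeconds=delay_seconds)


def handle_gpu_timeout(repo: Any, sqs: Any, queue_url: str, chat_id: str,
                       message_id: Optional[str], user_id: str,
                       question: str) -> dict:
    """Branch taken when wait_for_gpu_health() reports the worker is not ready."""
    if message_id is None:
        # Legacy synchronous path: persist a static notice instead of retrying.
        repo.save_fallback_message(chat_id, "GPU starting up")  # hypothetical method
        return {"status": "gpu_fallback"}  # hypothetical status value
    # Async path: flag the placeholder message, then hand off to the retry queue.
    repo.mark_gpu_pending(chat_id, message_id)
    enqueue_gpu_retry(sqs, queue_url, chat_id, message_id, user_id, question)
    return {"status": "gpu_retry_enqueued"}
```

Marking the placeholder before enqueueing means a client polling the message sees gpu_pending=True even if the SQS call fails afterwards; whether the real worker orders the two calls this way is an assumption.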

Frontend: Chat Screen

File: main/app/app/chat.tsx

The chat screen uses a manual Animated.Value (keyboardOffsetValue) to translate the entire content area (the messages list and the input dock together) when the keyboard appears or disappears. This keeps the messages visible above the keyboard instead of being covered by it.

KeyboardAvoidingView is present with behavior="position" but the primary keyboard handling is the Animated.View translateY approach, which gives coordinated, smooth motion for the whole chat body.

Platform handling:

iOS: listens to keyboardWillChangeFrame and computes nextHeight = windowHeight - event.endCoordinates.screenY. Animates translateY to -nextHeight using the event's own duration.
Android: listens to keyboardDidShow / keyboardDidHide. On show, animates translateY to -event.endCoordinates.height; on hide, animates back to 0. Both use 250 ms with Easing.out(Easing.cubic).

InputDock receives disableKeyboardAnimation={true} so it does not create its own keyboard listeners or apply an independent translateY. All motion comes from the parent Animated.View.

Layout hierarchy:

ThemedView (flex: 1, watermark)
  AppHeader
  KeyboardAvoidingView behavior="position" (flex: 1)
    Animated.View (translateY: keyboardOffsetValue)
      FlatList (message list)
      InputDock (text input + send button, disableKeyboardAnimation=true)

The InputDock receives bottomInset from useSafeAreaInsets() to handle notched/gesture-bar devices correctly on both platforms.

The FlatList uses showsVerticalScrollIndicator={false} to hide the native scrollbar while keeping scroll functionality intact.

Entry Point

Lambda: main/server/api/memories/chat/app.py → lambda_handler
HTTP method: POST /memories/chat (API Gateway, Cognito-authenticated)
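
As a rough illustration of the entry point, the sketch below parses the request and extracts the caller's identity. It assumes an API Gateway proxy event in which the Cognito authorizer exposes claims under requestContext.authorizer.claims and that user_id is the sub claim; neither detail is confirmed by this document. The handle function and its answer_fn argument are hypothetical stand-ins for lambda_handler and the rest of the pipeline.

```python
import json
from typing import Any, Callable, Dict, Tuple


def _response(status: int, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Build an API Gateway proxy-style response."""
    return {"statusCode": status, "body": json.dumps(payload)}


def parse_chat_request(event: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return (user_id, fields); raise PermissionError or ValueError on bad input."""
    # Assumption: claims location and the use of "sub" as user_id.
    authorizer = (event.get("requestContext") or {}).get("authorizer") or {}
    user_id = (authorizer.get("claims") or {}).get("sub")
    if not user_id:
        raise PermissionError("missing Cognito identity")
    try:
        body = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError as exc:
        raise ValueError("body is not valid JSON") from exc
    if not isinstance(body, dict):
        raise ValueError("body must be a JSON object")
    question = body.get("question")
    if not isinstance(question, str) or not question.strip():
        raise ValueError("question is required")
    fields = {
        "question": question.strip(),
        "chat_id": body.get("chat_id"),          # None means "create a new chat"
        "verbose": body.get("verbose") is True,  # only a literal true enables the trace
    }
    return user_id, fields


def handle(event: Dict[str, Any],
           answer_fn: Callable[[str, Dict[str, Any]], Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical wrapper: validate, then delegate to the answering pipeline."""
    try:
        user_id, fields = parse_chat_request(event)
    except PermissionError as exc:
        return _response(401, {"error": str(exc)})
    except ValueError as exc:
        return _response(400, {"error": str(exc)})
    return _response(200, answer_fn(user_id, fields))
```

Passing the pipeline in as answer_fn keeps the validation logic testable without PostgreSQL, DynamoDB, or the GPU worker.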

Key Components

Component	Location
`ReasoningAgent`	`main/server/worldmm/retrieval/agent.py`
`TextEmbedder`	`main/server/worldmm/memory/text_embedder.py`
`VLM2VecClient`	`main/server/worldmm/memory/visual/encoder.py`
`GPULLMClient`	`main/server/worldmm/llm/client.py`
`ChatRepository`	`main/server/layers/shared/python/shared/chat/repository.py`

Dependencies

PostgreSQL (WorldMMSegment, WorldMMEntity, WorldMMTriple, pgvector)
DynamoDB (MESSAGES_TABLE_NAME, CHATS_TABLE_NAME)
GPU EC2 worker (GPU_INSTANCE_ID, GPU_WORKER_PORT, GPU_LAUNCH_TEMPLATE_ID, GPU_MAX_WAIT_MS, GPU_RETRY_INITIAL_DELAY_MS)
SQS ChatGpuRetryQueue (CHAT_GPU_RETRY_QUEUE_URL) — for GPU timeout retry path
Groq API (GROQ_API_KEY) for text embedding and LLM calls
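
One way to gather these settings at cold start is a small typed config object. The sketch below is illustrative only: the ChatConfig and load_config names are hypothetical, the split between required and optional variables is an assumption, and GROQ_API_KEY is left out so the secret never ends up in a printable object. The two numeric defaults match the GPU Health Retry Configuration table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ChatConfig:
    messages_table: str
    chats_table: str
    gpu_instance_id: str
    gpu_worker_port: int
    gpu_launch_template_id: Optional[str]
    retry_queue_url: Optional[str]
    gpu_max_wait_ms: int
    gpu_retry_initial_delay_ms: int


def load_config(env: Mapping[str, str] = os.environ) -> ChatConfig:
    """Read settings once; a missing required variable raises KeyError early."""
    return ChatConfig(
        messages_table=env["MESSAGES_TABLE_NAME"],
        chats_table=env["CHATS_TABLE_NAME"],
        gpu_instance_id=env["GPU_INSTANCE_ID"],
        gpu_worker_port=int(env["GPU_WORKER_PORT"]),
        # Treated as optional here (assumption): empty or unset becomes None.
        gpu_launch_template_id=env.get("GPU_LAUNCH_TEMPLATE_ID") or None,
        retry_queue_url=env.get("CHAT_GPU_RETRY_QUEUE_URL") or None,
        # Defaults documented for the async worker.
        gpu_max_wait_ms=int(env.get("GPU_MAX_WAIT_MS", "15000")),
        gpu_retry_initial_delay_ms=int(env.get("GPU_RETRY_INITIAL_DELAY_MS", "2000")),
    )
```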

GPU Instance Discovery

The watchdog on the GPU worker shuts the instance down after 480 s of idle time but does not update SSM. This creates a stale-ID cycle: both the chat Lambda and the ingest Lambda read a terminated instance ID from SSM and address traffic to the dead instance, so any replacement instance receives zero invocations and the watchdog kills it in turn.

The tag-based fallback breaks the cycle:

_resolve_gpu_url(instance_id):
  state = ec2.describe_instances([instance_id]).State.Name
  if state in (stopped, stopping, terminated):
    running = _find_running_gpu_instance_by_tag()   # Filters: tag:Name=encache-gpu-worker, state=running
    if running:
      _update_ssm_gpu_instance_id(running.InstanceId)  # PUT /encache/gpu/instance_id
      return http://{running.PrivateIpAddress}:{port}
    if state == stopped:
      ec2.start_instances([instance_id])
    if state == terminated and launch_template_id:
      ec2.run_instances(LaunchTemplate=..., Tags=[Name=encache-gpu-worker])
      _update_ssm_gpu_instance_id(new_id)
    return None                                     # caller returns fallback message
  return http://{instance.PrivateIpAddress}:{port}

Private IP is used because both Lambdas are inside the VPC. Public IP routes through the NAT gateway and loses security-group identity, causing SG-to-SG rules to fail.
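
The pseudocode above could translate into Python along the following lines. This is a sketch under stated assumptions, not the deployed _resolve_gpu_url: the ec2 and ssm arguments are expected to behave like boto3 clients (the keyword arguments and response shapes follow boto3's describe_instances, start_instances, run_instances, and put_parameter), helper names are illustrative, and spot-market options are assumed to live in the launch template. An instance ID that EC2 no longer knows about is treated as terminated when the response is empty; a real client may instead raise an error for unknown IDs, which this sketch does not handle.

```python
from typing import Any, Optional

GPU_TAG_NAME = "encache-gpu-worker"
SSM_INSTANCE_PARAM = "/encache/gpu/instance_id"
DOWN_STATES = {"stopped", "stopping", "terminated"}


def _first_instance(resp: dict) -> Optional[dict]:
    """Return the first instance in a describe_instances response, if any."""
    for reservation in resp.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            return instance
    return None


def _url(instance: dict, port: int) -> Optional[str]:
    """Build the worker URL from the private IP; None if no IP is assigned yet."""
    ip = instance.get("PrivateIpAddress")
    return f"http://{ip}:{port}" if ip else None


def _save_instance_id(ssm: Any, instance_id: str) -> None:
    ssm.put_parameter(Name=SSM_INSTANCE_PARAM, Value=instance_id,
                      Type="String", Overwrite=True)


def resolve_gpu_url(ec2: Any, ssm: Any, instance_id: str, port: int,
                    launch_template_id: Optional[str] = None) -> Optional[str]:
    current = _first_instance(ec2.describe_instances(InstanceIds=[instance_id]))
    state = current["State"]["Name"] if current else "terminated"
    if state not in DOWN_STATES:
        return _url(current, port)

    # Stale ID: look for any running worker by tag and adopt it.
    running = _first_instance(ec2.describe_instances(Filters=[
        {"Name": "tag:Name", "Values": [GPU_TAG_NAME]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]))
    if running:
        _save_instance_id(ssm, running["InstanceId"])
        return _url(running, port)

    # Nothing running: kick off a start or a fresh launch, then report "not ready".
    if state == "stopped":
        ec2.start_instances(InstanceIds=[instance_id])
    elif state == "terminated" and launch_template_id:
        launched = ec2.run_instances(
            LaunchTemplate={"LaunchTemplateId": launch_template_id},
            MinCount=1, MaxCount=1,
            TagSpecifications=[{"ResourceType": "instance",
                                "Tags": [{"Key": "Name", "Value": GPU_TAG_NAME}]}],
        )
        _save_instance_id(ssm, launched["Instances"][0]["InstanceId"])
    return None  # caller returns the fallback message
```

Tagging the new instance at launch is what lets a later invocation find it through the tag filter even if the SSM write is lost.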

GPU Health Retry Configuration

Env Var	Type	Default	Description
`GPU_MAX_WAIT_MS`	int	`15000`	Maximum total polling duration (ms) for the async worker. Kept low to stay within the API Gateway 29 s timeout. When exceeded, triggers `mark_gpu_pending` and an SQS enqueue.
`GPU_RETRY_INITIAL_DELAY_MS`	int	`2000`	First inter-probe sleep (ms). Doubles after each failed check, capped at 60 000 ms.

Backoff sequence with defaults (GPU_MAX_WAIT_MS=15000): 2 s → 4 s → exhaust max-wait.

The Lambda timeout guard (lambda_remaining_ms - elapsed_ms <= 10 000) always takes priority over GPU_MAX_WAIT_MS — polling stops early so at least 10 s remains for response assembly and DynamoDB writes.
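
The polling behavior described in this section can be sketched as a small loop. The function below shares its name with wait_for_gpu_health() but is only an approximation under stated assumptions: the probe, sleep, and clock parameters are injected for testability (they are not part of the documented signature), lambda_remaining_ms is taken to be the remaining Lambda time measured when polling starts, and log calls are reduced to comments. The guard is checked before the max-wait limit, reflecting the priority stated above.

```python
import time
from typing import Callable, Tuple

MAX_DELAY_MS = 60_000     # backoff cap
LAMBDA_GUARD_MS = 10_000  # time reserved for response assembly and DynamoDB writes


def wait_for_gpu_health(probe: Callable[[], bool],
                        lambda_remaining_ms: int,
                        max_wait_ms: int = 15_000,
                        initial_delay_ms: int = 2_000,
                        sleep: Callable[[float], None] = time.sleep,
                        clock: Callable[[], float] = time.monotonic,
                        ) -> Tuple[bool, int]:
    """Poll probe() with exponential backoff; return (healthy, elapsed_ms)."""
    start = clock()
    delay_ms = initial_delay_ms
    while True:
        if probe():
            return True, int((clock() - start) * 1000)
        elapsed_ms = int((clock() - start) * 1000)
        # Termination checks run after each failed probe; the guard comes first.
        if lambda_remaining_ms - elapsed_ms <= LAMBDA_GUARD_MS:
            return False, elapsed_ms  # would log gpu_retry_lambda_timeout_approaching
        if elapsed_ms >= max_wait_ms:
            return False, elapsed_ms  # would log gpu_retry_exhausted_max_wait
        sleep(delay_ms / 1000)
        delay_ms = min(delay_ms * 2, MAX_DELAY_MS)
```

If every failed probe consumes its full 3 s socket timeout, this loop with the defaults finishes its probes at 3 s, 8 s, and 15 s, sleeping 2 s and then 4 s in between, which is consistent with the "2 s → 4 s → exhaust max-wait" sequence listed above. Faster-failing probes (for example, an immediate connection refusal) would allow one more sleep before the limit is reached.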