Memories Chat (AI Q&A)
An authenticated user asks a natural-language question about their past experiences; the system retrieves relevant memory segments from the WorldMM knowledge graph and uses a GPU-hosted LLM to compose a grounded answer.
Flow
1. User submits a question: The mobile app's chat screen sends a POST request with { "question": "...", "chat_id": "<optional>" }. Setting verbose: true returns the full reasoning trace.
2. Authentication: The Lambda requires a valid Cognito JWT. user_id is extracted from the auth context.
3. Database initialization: The PostgreSQL connection is configured from SSM secrets and the SQLAlchemy ORM is initialized.
4. Chat persistence (pre-GPU): The user's message is saved to DynamoDB (Messages table) before the GPU call so that it is never lost on timeout. If chat_id is null, a new chat record is created in the Chats table with the question as the title. Ownership is verified for existing chats (returns 403 on mismatch).
5. GPU worker resolution: The Lambda reads GPU_INSTANCE_ID from its environment (set at deploy time from SSM) and calls _resolve_gpu_url() to look up the instance via ec2.describe_instances. If the instance is stopped, stopping, or terminated (stale SSM ID), the Lambda performs a tag-based fallback: it scans EC2 for any running instance tagged Name=encache-gpu-worker, adopts that instance, and writes the new ID back to SSM (/encache/gpu/instance_id) so subsequent invocations do not repeat the scan. If no running instance is found, a stopped instance is restarted or, for a terminated one, a spot instance is launched from the launch template; in both cases a graceful fallback message is returned. Private IP is always preferred over public IP because both the chat Lambda and the GPU worker are in the same VPC; using private IP means SG-to-SG rules apply rather than traffic exiting through NAT.
   5a. GPU readiness polling: After URL resolution, wait_for_gpu_health() polls _gpu_is_healthy() (a GET /health probe with a 3 s socket timeout) with exponential backoff until the worker responds 200 or a termination condition is reached. This replaces the previous single-probe behavior: the Lambda no longer fails immediately on the first unhealthy check.
       Termination conditions, checked after each failed probe:
       - Max-wait timeout: elapsed_ms >= GPU_MAX_WAIT_MS (default 120 000 ms / 2 min). Logs gpu_retry_exhausted_max_wait.
       - Lambda timeout guard: lambda_remaining_ms - elapsed_ms <= 10 000. Stops polling with ≥10 s left for response assembly. Logs gpu_retry_lambda_timeout_approaching.

       Backoff sequence: the delay starts at GPU_RETRY_INITIAL_DELAY_MS (default 2 000 ms), doubles after each failed check, and caps at 60 000 ms. If wait_for_gpu_health() returns (False, elapsed_ms), gpu_worker_url is cleared and the fallback message path executes.
6. Knowledge graph loading: load_episodic_graphs and load_semantic_graph load the user's entity and triple data from PostgreSQL into in-memory graph structures.
7. Retrieval (multi-modal): The ReasoningAgent runs a multi-step reasoning loop. In each round it decides whether to search and which memory type to query:
   - Episodic retriever: Embeds the query (via TextEmbedder), searches episodic graphs across all temporal scales, and returns ranked segments.
   - Semantic retriever: Searches semantic triples via pgvector embedding similarity.
   - Visual retriever: Encodes the query as a visual embedding via VLM2VecClient, queries search_visual_embeddings, and resolves captions for matching segments.
8. Answer generation: The reasoning LLM (GPULLMClient) synthesizes retrieved context into a natural-language answer.
9. Response persistence: The assistant's answer is saved to DynamoDB and the chat's updatedAt timestamp is touched.
10. Response: Returns { "answer": "...", "chat_id": "..." }. With verbose=true, also returns a trace object with search rounds, memory types used, and result counts.
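The ordering of the persistence steps matters mainly for failure handling, so a small in-memory sketch may help. Everything named here is hypothetical (InMemoryChatStore, handle_chat, FALLBACK_MESSAGE, the answer_fn callable); the real handler uses ChatRepository on DynamoDB, and details such as the fallback wording, whether the fallback text is persisted, and how unknown chat IDs are reported are assumptions.

```python
import uuid
from typing import Callable, Optional

# Hypothetical wording; the real fallback text is not given in this document.
FALLBACK_MESSAGE = "The assistant is warming up, please try again shortly."


class InMemoryChatStore:
    """Stand-in for the Chats and Messages tables (illustration only)."""

    def __init__(self) -> None:
        self.chats: dict[str, dict] = {}
        self.messages: list[dict] = []

    def create_chat(self, user_id: str, title: str) -> str:
        chat_id = str(uuid.uuid4())
        self.chats[chat_id] = {"owner": user_id, "title": title}
        return chat_id

    def save_message(self, chat_id: str, role: str, text: str) -> None:
        self.messages.append({"chat_id": chat_id, "role": role, "text": text})


def handle_chat(
    store: InMemoryChatStore,
    user_id: str,
    question: str,
    chat_id: Optional[str],
    answer_fn: Callable[[str], Optional[str]],
) -> dict:
    """Persist the question first, then try to produce a GPU-backed answer."""
    if chat_id is None:
        # New conversation: the question doubles as the chat title.
        chat_id = store.create_chat(user_id, title=question)
    elif store.chats.get(chat_id, {}).get("owner") != user_id:
        # Assumption: unknown chats are treated like an ownership mismatch.
        return {"statusCode": 403, "body": {"error": "forbidden"}}

    # Saved before the slow call so a timeout cannot lose the user's message.
    store.save_message(chat_id, "user", question)

    # answer_fn returns None when no healthy GPU worker is available.
    answer = answer_fn(question)
    if answer is None:
        answer = FALLBACK_MESSAGE

    store.save_message(chat_id, "assistant", answer)
    return {"statusCode": 200, "body": {"answer": answer, "chat_id": chat_id}}
```

Because the user message is written before answer_fn runs, a crash or timeout inside the GPU call still leaves the question in the store.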
Frontend: Chat Screen
File: main/app/app/chat.tsx
The chat screen uses a manual Animated.Value (keyboardOffsetValue) to translate the entire content area — messages list and input dock together — when the keyboard appears or disappears. This keeps messages visible above the keyboard without covering them.
KeyboardAvoidingView is present with behavior="position" but the primary keyboard handling is the Animated.View translateY approach, which gives coordinated, smooth motion for the whole chat body.
Platform handling:
- iOS: listens to keyboardWillChangeFrame and computes nextHeight = windowHeight - event.endCoordinates.screenY. Animates translateY to -nextHeight using the event's own duration.
- Android: listens to keyboardDidShow / keyboardDidHide. On show, animates translateY to -event.endCoordinates.height; on hide, animates back to 0. Both transitions use 250 ms with Easing.out(Easing.cubic).
InputDock receives disableKeyboardAnimation={true} so it does not create its own keyboard listeners or apply an independent translateY. All motion comes from the parent Animated.View.
Layout hierarchy:
    ThemedView (flex: 1, watermark)
      AppHeader
      KeyboardAvoidingView behavior="position" (flex: 1)
        Animated.View (translateY: keyboardOffsetValue)
          FlatList (message list)
          InputDock (text input + send button, disableKeyboardAnimation=true)
The InputDock receives bottomInset from useSafeAreaInsets() to handle notched/gesture-bar devices correctly on both platforms.
The FlatList uses showsVerticalScrollIndicator={false} to hide the native scrollbar while keeping scroll functionality intact.
Entry Point
- Lambda: main/server/api/memories/chat/app.py → lambda_handler
- HTTP method: POST /memories/chat (API Gateway, Cognito-authenticated)
Key Components
| Component | Location |
|---|---|
| ReasoningAgent | main/server/worldmm/retrieval/agent.py |
| TextEmbedder | main/server/worldmm/memory/text_embedder.py |
| VLM2VecClient | main/server/worldmm/memory/visual/encoder.py |
| GPULLMClient | main/server/worldmm/llm/client.py |
| ChatRepository | main/server/layers/shared/python/shared/chat/repository.py |
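
The multi-round retrieval performed by ReasoningAgent can be pictured with a minimal sketch. This is not the agent's actual code: run_reasoning_loop, the decide callable (standing in for the LLM's per-round decision), the retriever callables, and the trace field names are all hypothetical, chosen only to mirror the search rounds, memory types, and result counts that the verbose trace is described as containing.

```python
from typing import Callable, Optional

Retriever = Callable[[str], list[str]]


def run_reasoning_loop(
    question: str,
    retrievers: dict[str, Retriever],
    decide: Callable[[str, list[str]], Optional[str]],
    max_rounds: int = 3,
) -> tuple[list[str], dict]:
    """Collect context over several rounds and record a trace.

    `decide` returns the memory type to query next ("episodic", "semantic",
    "visual") or None when enough context has been gathered.
    """
    context: list[str] = []
    trace: dict = {"rounds": []}
    for round_no in range(1, max_rounds + 1):
        memory_type = decide(question, context)
        if memory_type is None:
            break  # the agent chose to answer without another search
        results = retrievers[memory_type](question)
        context.extend(results)
        trace["rounds"].append(
            {"round": round_no, "memory_type": memory_type, "result_count": len(results)}
        )
    return context, trace


# Usage with canned retrievers and a scripted decision sequence.
retrievers = {
    "episodic": lambda q: ["segment: lunch at the harbor"],
    "semantic": lambda q: ["(user, likes, seafood)"],
    "visual": lambda q: [],
}
plan = iter(["episodic", "semantic", None])
context, trace = run_reasoning_loop(
    "Where did I eat?", retrievers, lambda q, ctx: next(plan)
)
```

In this run the loop performs two search rounds and stops on the third decision, so the trace lists one episodic and one semantic round.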
Dependencies
- PostgreSQL (WorldMMSegment, WorldMMEntity, WorldMMTriple, pgvector)
- DynamoDB (MESSAGES_TABLE_NAME, CHATS_TABLE_NAME)
- GPU EC2 worker (GPU_INSTANCE_ID, GPU_WORKER_PORT, GPU_LAUNCH_TEMPLATE_ID, GPU_MAX_WAIT_MS, GPU_RETRY_INITIAL_DELAY_MS)
- Groq API (GROQ_API_KEY) for text embedding and LLM calls
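
As an illustration of how the GPU-related variables might be read in one place, here is a minimal sketch. GpuSettings and load_gpu_settings are hypothetical names; only the two retry settings have defaults stated in this document (120000 and 2000), so the other values are left as optional strings rather than given invented defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class GpuSettings:
    instance_id: Optional[str]
    worker_port: Optional[str]
    launch_template_id: Optional[str]
    max_wait_ms: int
    retry_initial_delay_ms: int


def load_gpu_settings(env: Mapping[str, str] = os.environ) -> GpuSettings:
    """Read the GPU worker variables from a mapping (os.environ by default)."""
    return GpuSettings(
        instance_id=env.get("GPU_INSTANCE_ID"),
        worker_port=env.get("GPU_WORKER_PORT"),
        launch_template_id=env.get("GPU_LAUNCH_TEMPLATE_ID"),
        # Documented defaults: 120 000 ms total wait, 2 000 ms first delay.
        max_wait_ms=int(env.get("GPU_MAX_WAIT_MS", "120000")),
        retry_initial_delay_ms=int(env.get("GPU_RETRY_INITIAL_DELAY_MS", "2000")),
    )
```

Passing the mapping explicitly, for example load_gpu_settings({"GPU_INSTANCE_ID": "i-123"}), keeps the loader easy to exercise in unit tests without touching the process environment.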
GPU Instance Discovery
The watchdog on the GPU worker shuts the instance down after 480 s of idle time but does not update SSM. This creates a stale-ID cycle: both the chat Lambda and the ingest Lambda read the dead instance's ID from SSM and address traffic to it, so any replacement instance receives zero requests and the watchdog shuts it down as soon as its idle window elapses.
The tag-based fallback breaks the cycle:
    _resolve_gpu_url(instance_id):
        state = ec2.describe_instances([instance_id]).State.Name
        if state in (stopped, stopping, terminated):
            running = _find_running_gpu_instance_by_tag()   # Filters: tag:Name=encache-gpu-worker, state=running
            if running:
                _update_ssm_gpu_instance_id(running.InstanceId)   # PUT /encache/gpu/instance_id
                return http://{running.PrivateIpAddress}:{port}
            if state == stopped:
                ec2.start_instances([instance_id])
            if state == terminated and launch_template_id:
                ec2.run_instances(LaunchTemplate=..., Tags=[Name=encache-gpu-worker])
                _update_ssm_gpu_instance_id(new_id)
            return None   # caller returns fallback message
        return http://{instance.PrivateIpAddress}:{port}
Private IP is used because both Lambdas are inside the VPC. Public IP routes through the NAT gateway and loses security-group identity, causing SG-to-SG rules to fail.
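
For readers who want something closer to runnable code, the pseudocode above could be written roughly as follows. This is a sketch under assumptions, not the project's implementation: resolve_gpu_url takes the EC2 and SSM clients as arguments (in the Lambda they would presumably be boto3 clients), the helper _first_instance is hypothetical, spot options are assumed to live in the launch template, and error handling is omitted (for example, boto3 raises an error from describe_instances for IDs that no longer exist rather than returning an empty result).

```python
from typing import Any, Optional

GPU_TAG_NAME = "encache-gpu-worker"
SSM_PARAM = "/encache/gpu/instance_id"


def _first_instance(resp: dict) -> Optional[dict]:
    """Return the first instance in a describe_instances response, if any."""
    for reservation in resp.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            return instance
    return None


def resolve_gpu_url(ec2: Any, ssm: Any, instance_id: str, port: int,
                    launch_template_id: Optional[str] = None) -> Optional[str]:
    """Return the worker URL, or None when the caller should send the fallback."""
    instance = _first_instance(ec2.describe_instances(InstanceIds=[instance_id]))
    # Assumption: an ID that yields no instance is treated as terminated.
    state = instance["State"]["Name"] if instance else "terminated"
    if state in ("stopped", "stopping", "terminated"):
        running = _first_instance(ec2.describe_instances(Filters=[
            {"Name": "tag:Name", "Values": [GPU_TAG_NAME]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]))
        if running:
            # Adopt the live worker and repair the stale SSM value.
            ssm.put_parameter(Name=SSM_PARAM, Value=running["InstanceId"],
                              Type="String", Overwrite=True)
            return f"http://{running['PrivateIpAddress']}:{port}"
        if state == "stopped":
            ec2.start_instances(InstanceIds=[instance_id])
        elif state == "terminated" and launch_template_id:
            launched = ec2.run_instances(
                LaunchTemplate={"LaunchTemplateId": launch_template_id},
                MinCount=1, MaxCount=1,
                TagSpecifications=[{
                    "ResourceType": "instance",
                    "Tags": [{"Key": "Name", "Value": GPU_TAG_NAME}],
                }],
            )
            new_id = launched["Instances"][0]["InstanceId"]
            ssm.put_parameter(Name=SSM_PARAM, Value=new_id,
                              Type="String", Overwrite=True)
        return None  # nothing usable yet; caller returns the fallback message
    return f"http://{instance['PrivateIpAddress']}:{port}"
```

Injecting the clients keeps the function testable with simple fakes, which is useful because the interesting behavior is in the branch ordering rather than in the AWS calls themselves.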
GPU Health Retry Configuration
| Env Var | Type | Default | Description |
|---|---|---|---|
| GPU_MAX_WAIT_MS | int | 120000 | Maximum total polling duration (ms). The retry loop exits with the fallback when elapsed time reaches or exceeds this value. |
| GPU_RETRY_INITIAL_DELAY_MS | int | 2000 | First inter-probe sleep (ms). Doubles after each failed check, capped at 60 000 ms. |
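Backoff sequence with defaults: 2 s → 4 s → 8 s → 16 s → 32 s → 60 s (capped). The first five sleeps total 62 s, so the 120 s limit is reached during or right after the first 60 s sleep; further 60 s steps occur only when GPU_MAX_WAIT_MS is raised above its default.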
The Lambda timeout guard (lambda_remaining_ms - elapsed_ms <= 10 000) always takes priority over GPU_MAX_WAIT_MS — polling stops early so at least 10 s remains for response assembly and DynamoDB writes.
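
The retry rules in this section can be sketched as a small polling loop. This is an illustration, not the actual wait_for_gpu_health(): the injectable probe, now_ms, and sleep_ms parameters are hypothetical conveniences for testing, lambda_remaining_ms is taken as the time left when polling starts, and shortening the final sleep so it never overshoots either limit is an assumption of this sketch (the document only says the conditions are checked after each failed probe).

```python
import time
from typing import Callable

LAMBDA_GUARD_MS = 10_000   # time reserved for response assembly
MAX_DELAY_MS = 60_000      # cap on a single inter-probe sleep


def wait_for_gpu_health(
    probe: Callable[[], bool],
    lambda_remaining_ms: int,
    max_wait_ms: int = 120_000,
    initial_delay_ms: int = 2_000,
    now_ms: Callable[[], float] = lambda: time.monotonic() * 1000,
    sleep_ms: Callable[[float], None] = lambda ms: time.sleep(ms / 1000),
) -> tuple[bool, int]:
    """Poll `probe` with exponential backoff; return (healthy, elapsed_ms)."""
    start = now_ms()
    delay = initial_delay_ms
    while True:
        if probe():
            return True, int(now_ms() - start)
        elapsed = now_ms() - start
        # The Lambda guard is checked first because it takes priority.
        if lambda_remaining_ms - elapsed <= LAMBDA_GUARD_MS:
            return False, int(elapsed)  # gpu_retry_lambda_timeout_approaching
        if elapsed >= max_wait_ms:
            return False, int(elapsed)  # gpu_retry_exhausted_max_wait
        # Assumption: clamp the sleep so it cannot run past either limit.
        budget_ms = min(max_wait_ms - elapsed,
                        lambda_remaining_ms - elapsed - LAMBDA_GUARD_MS)
        sleep_ms(min(delay, budget_ms))
        delay = min(delay * 2, MAX_DELAY_MS)
```

With a simulated clock, a probe that never succeeds, and plenty of Lambda time, this sketch sleeps 2, 4, 8, 16, 32 and then 58 s and returns (False, 120000); the last sleep is shorter than 60 s only because of the clamping assumption. With just 30 s of Lambda time remaining it stops at 20 s, leaving the 10 s reserve intact.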