Chat Returns "GPU Starting Up" / "No Context" Despite Videos Uploaded
Metadata
- Date: 2026-05-07
- Status: investigating
- Severity: high
- User: a408c4d8-60d1-7085-0e89-8b28e7102455 (benjaminsl2000@gmail.com)
- Related Issues: PR #402 (GPU cold start fix), #395 (GPU template fix)
- Owner: TBD
Symptom
User reports:
- Uploaded multiple videos to the platform
- When asking questions via chat, receives "Chat is not available right now — the GPU worker is starting up. Please try again in a few minutes."
- OR chat returns "I don't have any information about you" / "I don't know anything about you"
- Expects the chat to have context about the uploaded videos
Root Cause Analysis
The chat system has two separate failure modes that manifest as "no context":
Failure Mode 1: GPU Worker Unavailable (Returns Fixed Fallback Message)
Path: api/memories/chat/app.py line 275-281
If the GPU worker is unavailable, the Lambda returns this exact message on EVERY request:
"Chat is not available right now — the GPU worker is starting up. Please try again in a few minutes."
The code never reaches the database query stage (line 328-329):
if not gpu_worker_url:
    # Early exit: user data is NOT loaded from PostgreSQL
    fallback = "Chat is not available right now..."
    return {"answer": fallback, "chat_id": chat_id}

# Only executed if GPU is healthy:
episodic_graphs = load_episodic_graphs(user_id)  # <-- SKIPPED if GPU unavailable
semantic_graph = load_semantic_graph(user_id)    # <-- SKIPPED if GPU unavailable
Critical point: The chat system requires a healthy GPU worker to proceed — not for the LLM answer generation, but because VLM2VecClient is needed for visual retrieval (line 319-322). Even if the user has no visual memories, the system architecture still requires the GPU endpoint.
Causes of GPU unavailability:
- GPU_INSTANCE_ID = "none" (not configured)
  - SSM parameter /encache/gpu/instance_id is set to sentinel value "none"
  - Lambda treats this as "GPU not configured"
  - Returns fallback immediately (line 251-263)
- EC2 instance is stopped or starting
  - Instance exists but is not yet in "running" state
  - _resolve_gpu_url() calls start_instances() and returns None
  - On next request, check again (line 101-109)
- EC2 instance is terminated (spot interruption)
  - Instance reached "terminated" state (common for one-time spot instances)
  - _resolve_gpu_url() re-launches from template and returns None
  - Slow path: instance must boot and models must load (2-5 min) (line 111-122)
- GPU instance is running but unreachable
  - Instance has no IP address assigned
  - Network connectivity issues
  - Returns None (line 124-126)
- GPU worker health check fails
  - Instance is running but models are still loading
  - GET /health returns 503 (Not Loaded)
  - Takes 2-5 minutes for models to load from pre-baked AMI
  - Takes 10-15 minutes if models must download from HuggingFace (line 269-271)
- vCPU quota exceeded
  - AWS account has vCPU quota = 0 for G-instances
  - run_instances() fails with VcpuLimitExceeded
  - Failure is silent in logs (line 42-69)
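The causes above can be told apart from a handful of observable facts (the SSM value, the EC2 state, whether an IP is assigned, and the /health status). The following is a minimal sketch of that mapping, not the logic in app.py: the function name `classify_gpu_state` and its return labels are hypothetical, and the vCPU quota case is not covered because it only shows up as a failed run_instances() call.

```python
from typing import Optional


def classify_gpu_state(
    instance_id: str,
    ec2_state: Optional[str],
    ip_address: Optional[str],
    health_status: Optional[int],
) -> str:
    """Map observed GPU facts to one of the causes listed above.

    Hypothetical helper for triage; labels are illustrative only.
    """
    # Sentinel value "none" in SSM means no GPU instance is configured.
    if instance_id.strip().lower() == "none":
        return "not_configured"
    # No instance found, or a spot instance that was reclaimed.
    if ec2_state is None or ec2_state == "terminated":
        return "terminated_or_missing"
    # Any other non-running state (stopped, stopping, pending, ...).
    if ec2_state != "running":
        return "stopped_or_starting"
    # Running but no address to reach it on.
    if not ip_address:
        return "running_but_unreachable"
    # Reachable, but /health did not return 200 (e.g. 503 while loading).
    if health_status != 200:
        return "models_still_loading"
    return "healthy"
```

For example, `classify_gpu_state("i-0abc123", "running", "10.0.0.5", 503)` falls through to the "models still loading" case, which matches the 2-5 minute wait described above.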
Failure Mode 2: GPU Available But User Has No Data (LLM Returns "I Don't Know")
Path: api/memories/chat/app.py line 328-329 → db_loader.py line 59-144
If GPU is healthy, Lambda loads user data from PostgreSQL:
episodic_graphs = load_episodic_graphs(user_id) # queries WorldMMTriple, WorldMMEntity, WorldMMSegment
semantic_graph = load_semantic_graph(user_id) # queries WorldMMTriple, WorldMMEntity
If these queries return empty (no rows), the ReasoningAgent gets empty context:
def episodic_retriever(query):
    if not episodic_graphs:
        return []  # <-- Returns empty if no data

def semantic_retriever_fn(query):
    triples = get_triples_for_user(user_id, "semantic")
    if not triples:
        return []  # <-- Returns empty if no data
The LLM then answers the user's question with empty context, resulting in responses like:
- "I don't have any information about you"
- "I don't know anything about you"
- "I don't have memories of that"
Causes of empty data:
- Ingest pipeline never ran
  - Videos were uploaded to S3 but ingest_session() was not triggered
  - Could be session-end Lambda failure, missing S3 event, etc.
- Ingest pipeline failed partway
  - Session metadata loaded, but caption/transcript extraction failed
  - No WorldMMSegment records created for that session
  - No WorldMMTriple records created (depends on segments)
- Data exists but is invalidated
  - WorldMMTriple.invalidated_at is set (soft-delete)
  - load_episodic_graphs() filters: WorldMMTriple.invalidated_at.is_(None)
  - Data is filtered out at query time (line 67)
- Data exists but in wrong user_id
  - Data was ingested under different user ID (auth issue during upload)
  - load_episodic_graphs() filters by user_id == <correct_user> (line 65)
  - Returns nothing
- Visual embeddings are incorrect (separate issue)
  - Embeddings were generated with wrong LoRA adapter (VLM2Vec-LoRA instead of VLM2Vec-Qwen2VL-2B)
  - Visual retrieval returns poor results or empty
  - Fixed in commit 74c19e7 (2026-05-07), but old data remains in wrong vector space
  - See docs/bugs/2026-05-07-wrong-lora-adapter-vlm2vec-embedding.md
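To make the "invalidated" and "wrong user_id" causes concrete, here is a small sketch that applies the same two conditions to in-memory rows. It assumes only what is stated above (a user_id match and invalidated_at being NULL); the `TripleRow` type and `visible_triples` function are hypothetical stand-ins, not the code in db_loader.py.

```python
from datetime import datetime
from typing import List, Optional, TypedDict


class TripleRow(TypedDict):
    """Hypothetical stand-in for the two WorldMMTriple columns that matter here."""
    user_id: str
    invalidated_at: Optional[datetime]


def visible_triples(rows: List[TripleRow], user_id: str) -> List[TripleRow]:
    """Keep rows owned by user_id that have not been soft-deleted."""
    return [
        row
        for row in rows
        if row["user_id"] == user_id and row["invalidated_at"] is None
    ]


rows: List[TripleRow] = [
    {"user_id": "user-a", "invalidated_at": None},                  # visible
    {"user_id": "user-a", "invalidated_at": datetime(2026, 5, 1)},  # soft-deleted
    {"user_id": "user-b", "invalidated_at": None},                  # other user
]
```

With these rows, only the first one survives for "user-a". A user whose data was all ingested under another ID, or all invalidated, gets an empty list even though the table is not empty.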
System Architecture for Debugging
flowchart TD
USER["User asks question"]
CHAT["MemoriesChatFunction<br/>api/memories/chat/app.py"]
SSM["SSM: /encache/gpu/instance_id"]
NORM["normalize_gpu_instance_id"]
RESOLVE["_resolve_gpu_url"]
EC2["EC2 API:<br/>describe_instances"]
HEALTH["GET /health"]
FALLBACK["Return fallback:<br/>GPU starting up"]
DB["PostgreSQL"]
LOAD["load_episodic_graphs<br/>load_semantic_graph"]
EMPTY["Empty graphs?"]
LLM["ReasoningAgent"]
NOCTX["Answer with no context:<br/>I don't know about you"]
GOOD["Answer with context"]
USER --> CHAT
CHAT --> SSM
SSM --> NORM
NORM --> RESOLVE
RESOLVE --> EC2
EC2 -->|running & healthy| HEALTH
HEALTH -->|200| DB
HEALTH -->|503| FALLBACK
EC2 -->|stopped/terminated| FALLBACK
EC2 -->|error| FALLBACK
DB --> LOAD
LOAD --> EMPTY
EMPTY -->|no data| NOCTX
EMPTY -->|has data| LLM
LLM --> GOOD
NOCTX --> USER
GOOD --> USER
FALLBACK --> USER
Diagnosis Checklist
Step 1: Determine GPU Availability
Question: Is the chat returning the exact "GPU is starting up" message, or varied LLM responses?
- If always same "GPU is starting up" message → GPU is unavailable (Failure Mode 1)
- If varied responses ("I don't know", question-dependent) → GPU is available but data may be missing (Failure Mode 2)
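The Step 1 check can be run over a few sampled chat answers. This is a rough sketch under the assumption that the fallback always begins with the fixed text quoted earlier; the function name `classify_chat_answers` and its labels are hypothetical.

```python
from typing import Sequence

# Leading text of the fixed fallback message quoted in Failure Mode 1.
GPU_FALLBACK_PREFIX = "Chat is not available right now"


def classify_chat_answers(answers: Sequence[str]) -> str:
    """Guess the failure mode from a handful of chat answers."""
    if not answers:
        return "no_samples"
    fallback_hits = [a.startswith(GPU_FALLBACK_PREFIX) for a in answers]
    if all(fallback_hits):
        # Identical fixed message every time: GPU worker is unavailable.
        return "failure_mode_1"
    if any(fallback_hits):
        # GPU likely came up (or went down) while sampling; sample again.
        return "mixed"
    # Varied, question-dependent answers: GPU is up, data may be missing.
    return "failure_mode_2"
```

Sampling three or four different questions is usually enough to separate the two modes, since Failure Mode 1 never varies with the question.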
Step 2: If GPU Unavailable — Check GPU Configuration
Command 1: Verify GPU instance is configured
aws ssm get-parameter --name /encache/gpu/instance_id --region us-east-1 \
--query 'Parameter.Value' --output text
- If returns "none": GPU is not configured → skip to Step 3 (Launch GPU)
- If returns instance ID (e.g., i-0abc123...): → continue to Command 2 (Check EC2 instance state)
Command 2: Check EC2 instance state
aws ec2 describe-instances --instance-ids i-0abc123 --region us-east-1 \
--query 'Reservations[*].Instances[*].[State.Name,PrivateIpAddress]' \
--output text
- If running with an IP: → continue to Step 4 (Check GPU Health)
- If stopped, stopping, pending: → continue to Step 4 (Check GPU Health) (will take 1-2 min to start)
- If terminated or no instances: → skip to Step 3 (Launch GPU)
Step 3: If GPU Instance Not Ready — Launch or Restart
If instance is stopped: It will auto-start on next chat request (line 108)
If instance is terminated or missing: Manually launch from template
aws ec2 run-instances \
--launch-template LaunchTemplateId=lt-0cf39db4cd6dff510,Version=\$Latest \
--region us-east-1 \
--query 'Instances[0].InstanceId' --output text
Capture the returned instance ID and update SSM:
aws ssm put-parameter \
--name /encache/gpu/instance_id \
--value i-<new-instance-id> \
--type String \
--overwrite \
--region us-east-1
Redeploy the chat Lambda (SAM stack) so it picks up the new instance ID.
Wait 2-5 minutes for the GPU instance to boot and models to load. Then test chat again.
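Rather than waiting a fixed time, the wait can be expressed as a polling loop with a deadline. The sketch below is an illustration only: `wait_until_healthy` is a hypothetical helper, and the `probe` callable (which should return the HTTP status of the worker's /health endpoint) is supplied by the caller.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], int],
    timeout_s: float = 300.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe() until it returns 200 or timeout_s elapses.

    Returns True once healthy, False if the deadline passes first.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe() == 200:
                return True
        except OSError:
            # Connection refused or timed out while the instance boots.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A real probe could wrap `urllib.request.urlopen` against the worker's health URL. Note that `urlopen` raises `HTTPError` (a subclass of `OSError`) on a 503, so the loop above would treat that as "not ready yet" and keep polling.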
Step 4: If GPU Running — Check Health and Logs
Command 1: Connect to GPU instance and check service status
# SSH into the instance (need EC2 keypair)
ssh -i /path/to/key.pem ec2-user@<instance-public-ip>
# Check service
sudo systemctl status vlm2vec
# View logs
sudo tail -100 /var/log/vlm2vec.log
Expected logs while loading:
Loading Qwen2-VL-7B-Instruct (captioning/NER/triples)...
Loading VLM2Vec-2B (embeddings)...
Total GPU memory used: X.XGB
Expected after ready:
Error patterns:
- torch.cuda.OutOfMemoryError: Instance is too small (need g5.xlarge or larger)
- ModuleNotFoundError: Missing Python package (should be pre-installed)
- No such file or directory: Model cache not initialized (wait 5 more min or SSH and check /home/ec2-user/.cache/huggingface/)
Command 2: Test health endpoint directly from Lambda
# From Lambda logs (CloudWatch)
aws logs tail /aws/lambda/server-MemoriesChatFunction-OkmZYszwOXzJ --follow
Look for:
- chat_gpu_resolved url=http://....:8000 → GPU found and healthy
- Failed to resolve GPU URL → GPU query failed
- GPU not healthy at http://... → Health check returned 503 or error
Step 5: If GPU Healthy — Check User Data in PostgreSQL
Connect to PostgreSQL (requires AWS/VPN access):
psql postgresql://<user>:<pass>@<host>:5432/metadata
-- Check if user has any segments
SELECT COUNT(*) as segment_count FROM "WorldMMSegment"
WHERE user_id = 'a408c4d8-60d1-7085-0e89-8b28e7102455';
-- Check if user has any triples
SELECT COUNT(*) as triple_count FROM "WorldMMTriple"
WHERE user_id = 'a408c4d8-60d1-7085-0e89-8b28e7102455'
AND invalidated_at IS NULL;
-- Check if user has any entities
SELECT COUNT(*) as entity_count FROM "WorldMMEntity"
WHERE user_id = 'a408c4d8-60d1-7085-0e89-8b28e7102455';
- If all counts are 0 → Data was never ingested → Skip to Step 6 (Verify Ingest)
- If counts > 0 → Data exists → Chat should work, but may be retrieving poorly → Check retrieval in CloudWatch logs (Step 7)
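The three counts from the queries above can be turned into a next step mechanically. This is a sketch only: `diagnose_counts` is a hypothetical helper, and the middle branch (segments present but no active triples) is an assumption drawn from the partial-ingest and invalidation causes listed under Failure Mode 2, not a rule stated by the pipeline.

```python
def diagnose_counts(segments: int, triples: int, entities: int) -> str:
    """Suggest a next step from the Step 5 row counts.

    triples is the count of rows with invalidated_at IS NULL.
    """
    if segments == 0 and triples == 0 and entities == 0:
        # Nothing was ever written for this user.
        return "never_ingested: go to Step 6"
    if triples == 0:
        # Assumption: ingest stopped partway, or every triple was invalidated.
        return "no_active_triples: check ingest logs (Step 6)"
    # Data is present; the problem is more likely retrieval quality.
    return "data_exists: go to Step 7"
```

Treating "no active triples" separately is a design choice: the retrievers shown earlier return an empty list when there are no triples, so this case behaves like missing data even though the segment count is non-zero.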
Step 6: If No Data — Verify Ingest Pipeline
Check session end Lambda logs:
Check ingest window Lambda logs:
Check ingest DLQ (if windows failed and retried):
Look for:
- ingest_session_start event logged
- ingest_window_start events for each window
- No error/exception logs
- world_mm_segment_stored events (confirms segments written)
If ingest logs show failures:
- Groq API error: Check GROQ_API_KEY in SSM
- GPU timeout: GPU was starting, ingest timed out; retry ingest
- Database connection error: Check PostgreSQL connectivity from Lambda VPC
Manually re-trigger ingest (if needed):
# Find the session ID from S3
aws s3 ls s3://encache-raw-memory/sessions/ | grep <user-id-prefix>
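# Invoke ingest Lambda manually
# (AWS CLI v2 only: --cli-binary-format lets --payload be passed as raw JSON;
#  drop that flag on CLI v1)
aws lambda invoke \
  --function-name server-RetriggerIngestFunction \
  --cli-binary-format raw-in-base64-out \
  --payload '{"session_id":"<session-uuid>"}' \
  --region us-east-1 \
  /tmp/response.json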
Step 7: If Data Exists — Check Retrieval and LLM Quality
Check chat Lambda verbose logs (set verbose=true in chat request):
Response includes trace object with:
- rounds: Each search round (episodic, semantic, visual)
- memory_types_used: Which retrievers returned results
- total_results_retrieved: Count of results
If retrieval is empty (total_results_retrieved = 0):
- Graphs loaded successfully but returned no results
- Could be:
  - Query embedding mismatch (text embedder using wrong model)
  - Entity resolution failed
  - Segments exist but have no triples
  - Visual embeddings in wrong vector space (wrong adapter at ingest time)
If retrieval has results (total_results_retrieved > 0):
- LLM received context but answered poorly
- Could be:
  - LLM model changed and new model lacks knowledge
  - GPU worker is using wrong model checkpoint
  - Context was semantically irrelevant despite text match
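The Step 7 split can be read directly off the trace object. The sketch below assumes the trace arrives as a JSON object with the field names listed in this step; the value types (an integer count and a list of strings) and the `interpret_trace` helper are assumptions for illustration.

```python
from typing import Any, List, Mapping, Tuple


def interpret_trace(trace: Mapping[str, Any]) -> Tuple[str, List[str]]:
    """Return (verdict, memory types used) for a verbose chat trace.

    Missing fields are treated as zero results and no memory types.
    """
    total = int(trace.get("total_results_retrieved", 0))
    used = list(trace.get("memory_types_used", []))
    if total == 0:
        # Graphs loaded but nothing matched: suspect the retrieval side.
        return ("retrieval_empty", used)
    # Context reached the LLM: suspect answer quality or relevance.
    return ("retrieval_ok_answer_poor", used)
```

Checking which memory types appear in the second element also helps with the visual case: if visual never shows up there for a user with video data, the wrong-vector-space issue noted earlier is worth ruling out.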
Recommended Fix Sequence for This User
- Check if GPU is configured (Step 2, Command 1)
  - If "none": Launch GPU instance manually (Step 3)
  - If instance ID exists: Check EC2 state (Step 2, Command 2)
- Wait for GPU to boot and models to load (2-5 minutes)
- Verify with systemctl status vlm2vec (Step 4, Command 1)
- Verify GPU_INSTANCE_ID is in Lambda env (redeploy SAM if changed)
- Test chat endpoint (should work if GPU is healthy)
- If still no context:
  - Check PostgreSQL for user data (Step 5)
  - If no data: Check ingest logs and re-trigger if needed (Step 6)
  - If data exists: Check retrieval quality with verbose mode (Step 7)
Key Files
| File | Purpose |
|---|---|
| main/server/api/memories/chat/app.py | Chat Lambda, GPU resolution logic, fallback message (line 275) |
| main/server/worldmm/retrieval/db_loader.py | Loads episodic/semantic graphs from PostgreSQL |
| main/server/worldmm/gpu_worker/server.py | GPU worker health endpoint, model loading |
| main/server/worldmm/gpu_worker/ec2_user_data.sh | EC2 startup script, model pre-baking detection |
| main/devops/main.tf | GPU launch template, SSM parameters, IAM roles |
| main/server/template.yaml | SAM template, chat Lambda env vars |
| docs/bugs/2026-05-06-chat-gpu-worker-starting-up-spot-instance.md | Prior GPU startup investigation |
| docs/bugs/2026-05-07-wrong-lora-adapter-vlm2vec-embedding.md | Visual embedding quality issue |
Notes
- The system prioritizes GPU availability over graceful degradation. The chat Lambda does NOT have a fallback LLM on CPU; if the GPU is unavailable, users get the "starting up" message.
- The pre-baked AMI (ami-05bea3b3b7c57278e) has models baked in, reducing cold start from 10-15 min to 2-5 min.
- Spot interruptions are handled via re-launch, but this is slow. Consider switching to on-demand or persistent spot for better availability.
- Visual embeddings can be in wrong vector space if ingested before the LoRA adapter fix (commit 74c19e7). Old embeddings should be re-generated.
See Also
- docs/docs/memories-chat.md: Chat endpoint architecture
- docs/docs/ingest-session.md: Ingest pipeline details
- docs/docs/gpu-worker.md: GPU worker deployment and models