Chat Returns "GPU Starting Up" / "No Context" Despite Videos Uploaded

Metadata

  • Date: 2026-05-07
  • Status: investigating
  • Severity: high
  • User: a408c4d8-60d1-7085-0e89-8b28e7102455 (benjaminsl2000@gmail.com)
  • Related Issues: PR #402 (GPU cold start fix), #395 (GPU template fix)
  • Owner: TBD

Symptom

User reports:

  • Uploaded multiple videos to the platform
  • When asking questions via chat, receives "Chat is not available right now — the GPU worker is starting up. Please try again in a few minutes."
  • OR chat returns "I don't have any information about you" / "I don't know anything about you"
  • Expects the chat to have context about the uploaded videos

Root Cause Analysis

The chat system has two separate failure modes that manifest as "no context":

Failure Mode 1: GPU Worker Unavailable (Returns Fixed Fallback Message)

Path: api/memories/chat/app.py line 275-281

If the GPU worker is unavailable, the Lambda returns this exact message on EVERY request:

"Chat is not available right now — the GPU worker is starting up. Please try again in a few minutes."

In this case the code returns early and never reaches the database query stage (line 328-329):

if not gpu_worker_url:
    # Early exit — user data is NOT loaded from PostgreSQL
    fallback = "Chat is not available right now..."
    return {"answer": fallback, "chat_id": chat_id}

# Only executed if GPU is healthy:
episodic_graphs = load_episodic_graphs(user_id)  # <-- SKIPPED if GPU unavailable
semantic_graph = load_semantic_graph(user_id)     # <-- SKIPPED if GPU unavailable

Critical point: The chat system requires a healthy GPU worker to proceed — not for the LLM answer generation, but because VLM2VecClient is needed for visual retrieval (line 319-322). Even if the user has no visual memories, the system architecture still requires the GPU endpoint.
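The ordering described above can be illustrated with a minimal, self-contained sketch. The handler name `handle_chat`, the injected loader callables, and the `"(LLM answer)"` placeholder are assumptions made for illustration. Only the early-return shape mirrors the excerpt from app.py.

```python
from typing import Callable, Optional

# Truncated on purpose, as in the excerpt above.
FALLBACK = "Chat is not available right now..."


def handle_chat(
    gpu_worker_url: Optional[str],
    user_id: str,
    chat_id: str,
    load_episodic: Callable[[str], list],
    load_semantic: Callable[[str], list],
) -> dict:
    """Hypothetical handler: returns the fallback before any DB access."""
    if not gpu_worker_url:
        # Early exit: neither loader is called, so no user data is read.
        return {"answer": FALLBACK, "chat_id": chat_id}
    episodic = load_episodic(user_id)
    semantic = load_semantic(user_id)
    # Placeholder for retrieval and answer generation.
    return {
        "answer": "(LLM answer)",
        "chat_id": chat_id,
        "context_size": len(episodic) + len(semantic),
    }


calls = []


def fake_loader(user_id: str) -> list:
    calls.append(user_id)  # records that the database would have been queried
    return ["graph"]


result = handle_chat(None, "user-1", "chat-1", fake_loader, fake_loader)
```

With `gpu_worker_url=None` the `calls` list stays empty, which is why the fallback message says nothing about whether the user has ingested data.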

Causes of GPU unavailability:

  1. GPU_INSTANCE_ID = "none" (not configured)
       • SSM parameter /encache/gpu/instance_id is set to sentinel value "none"
       • Lambda treats this as "GPU not configured"
       • Returns fallback immediately (line 251-263)

  2. EC2 instance is stopped or starting
       • Instance exists but is not yet in "running" state
       • _resolve_gpu_url() calls start_instances() and returns None
       • On next request, check again (line 101-109)

  3. EC2 instance is terminated (spot interruption)
       • Instance reached "terminated" state (common for one-time spot instances)
       • _resolve_gpu_url() re-launches from template and returns None
       • Slow path: instance must boot and models must load (2-5 min) (line 111-122)

  4. GPU instance is running but unreachable
       • Instance has no IP address assigned
       • Network connectivity issues
       • Returns None (line 124-126)

  5. GPU worker health check fails
       • Instance is running but models are still loading
       • GET /health returns 503 (Not Loaded)
       • Takes 2-5 minutes for models to load from pre-baked AMI
       • Takes 10-15 minutes if models must download from HuggingFace (line 269-271)

  6. vCPU quota exceeded
       • AWS account has vCPU quota = 0 for G-instances
       • run_instances() fails with VcpuLimitExceeded
       • Failure is silent in logs (line 42-69)
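The causes listed above can be summarized as a small decision table. The following is a sketch under stated assumptions, not the actual `_resolve_gpu_url()` logic: the function name `explain_gpu_unavailable` and its parameters are hypothetical, and the vCPU quota case is left out because it only shows up when `run_instances()` is called, not in a snapshot of instance state.

```python
from typing import Optional, Tuple


def explain_gpu_unavailable(
    instance_id: str, state: Optional[str], ip: Optional[str], healthy: bool
) -> Tuple[bool, str]:
    """Return (available, reason) for one snapshot of the GPU instance."""
    if instance_id == "none":
        return False, "not configured (sentinel value)"
    if state in ("stopped", "stopping", "pending"):
        return False, "instance stopped or starting"
    if state == "terminated" or state is None:
        return False, "instance terminated or missing; relaunch from template"
    if state == "running" and not ip:
        return False, "running but no IP address"
    if state == "running" and not healthy:
        return False, "running but /health is not ready (models loading)"
    if state == "running":
        return True, "healthy"
    return False, f"unexpected state: {state}"
```

Every branch except the last `running` case ends in the same user-visible fallback message, so the reason has to be recovered from logs or from the EC2 API rather than from the chat response.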

Failure Mode 2: GPU Available But User Has No Data (LLM Returns "I Don't Know")

Path: api/memories/chat/app.py line 328-329 → db_loader.py line 59-144

If GPU is healthy, Lambda loads user data from PostgreSQL:

episodic_graphs = load_episodic_graphs(user_id)   # queries WorldMMTriple, WorldMMEntity, WorldMMSegment
semantic_graph = load_semantic_graph(user_id)     # queries WorldMMTriple, WorldMMEntity

If these queries return empty (no rows), the ReasoningAgent gets empty context:

def episodic_retriever(query):
    if not episodic_graphs:
        return []   # <-- Returns empty if no data

def semantic_retriever_fn(query):
    triples = get_triples_for_user(user_id, "semantic")
    if not triples:
        return []   # <-- Returns empty if no data

The LLM then answers the user's question with empty context, resulting in responses like:

  • "I don't have any information about you"
  • "I don't know anything about you"
  • "I don't have memories of that"
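To make the empty-context path concrete, here is a minimal sketch of how retrievers that return nothing leave the agent with no context at all. The helper `gather_context` and the stub retrievers are hypothetical. The real ReasoningAgent wiring is not shown in this report.

```python
from typing import Callable, Dict, List, Tuple


def gather_context(
    query: str, retrievers: Dict[str, Callable[[str], List[str]]]
) -> Tuple[List[str], List[str]]:
    """Run each retriever and report which ones returned anything."""
    results: List[str] = []
    used: List[str] = []
    for name, retrieve in retrievers.items():
        hits = retrieve(query)
        if hits:
            used.append(name)
            results.extend(hits)
    return results, used


def episodic_retriever(query: str) -> List[str]:
    return []  # stands in for "no episodic graphs loaded for this user"


def semantic_retriever(query: str) -> List[str]:
    return []  # stands in for "no semantic triples for this user"


context, used = gather_context(
    "What did I do?",
    {"episodic": episodic_retriever, "semantic": semantic_retriever},
)
context_is_empty = not context
```

Checking a flag like `context_is_empty` before the LLM call would be one way to tell "no data" apart from "weak answer", though whether the production code does this is not established here.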

Causes of empty data:

  1. Ingest pipeline never ran
       • Videos were uploaded to S3 but ingest_session() was not triggered
       • Could be session-end Lambda failure, missing S3 event, etc.

  2. Ingest pipeline failed partway
       • Session metadata loaded, but caption/transcript extraction failed
       • No WorldMMSegment records created for that session
       • No WorldMMTriple records created (depends on segments)

  3. Data exists but is invalidated
       • WorldMMTriple.invalidated_at is set (soft-delete)
       • load_episodic_graphs() filters: WorldMMTriple.invalidated_at.is_(None)
       • Data is filtered out at query time (line 67)

  4. Data exists but in wrong user_id
       • Data was ingested under different user ID (auth issue during upload)
       • load_episodic_graphs() filters by user_id == <correct_user> (line 65)
       • Returns nothing

  5. Visual embeddings are incorrect (separate issue)
       • Embeddings were generated with wrong LoRA adapter (VLM2Vec-LoRA instead of VLM2Vec-Qwen2VL-2B)
       • Visual retrieval returns poor results or empty
       • Fixed in commit 74c19e7 (2026-05-07), but old data remains in wrong vector space
       • See docs/bugs/2026-05-07-wrong-lora-adapter-vlm2vec-embedding.md

System Architecture for Debugging

flowchart TD
    USER["User asks question"]
    CHAT["MemoriesChatFunction<br/>api/memories/chat/app.py"]
    SSM["SSM: /encache/gpu/instance_id"]
    NORM["normalize_gpu_instance_id"]
    RESOLVE["_resolve_gpu_url"]
    EC2["EC2 API:<br/>describe_instances"]
    HEALTH["GET /health"]
    FALLBACK["Return fallback:<br/>GPU starting up"]

    DB["PostgreSQL"]
    LOAD["load_episodic_graphs<br/>load_semantic_graph"]
    EMPTY["Empty graphs?"]
    LLM["ReasoningAgent"]
    NOCTX["Answer with no context:<br/>I don't know about you"]
    GOOD["Answer with context"]

    USER --> CHAT
    CHAT --> SSM
    SSM --> NORM
    NORM --> RESOLVE
    RESOLVE --> EC2
    EC2 -->|running| HEALTH
    HEALTH -->|200| DB
    HEALTH -->|503| FALLBACK
    EC2 -->|stopped/terminated| FALLBACK
    EC2 -->|error| FALLBACK

    DB --> LOAD
    LOAD --> EMPTY
    EMPTY -->|no data| NOCTX
    EMPTY -->|has data| LLM
    LLM --> GOOD
    NOCTX --> USER
    GOOD --> USER
    FALLBACK --> USER

Diagnosis Checklist

Step 1: Determine GPU Availability

Question: Is the chat returning the same fixed "the GPU worker is starting up" message every time, or varied LLM responses?

  • If always the same "GPU worker is starting up" message → GPU is unavailable (Failure Mode 1)
  • If varied responses ("I don't know", question-dependent) → GPU is available but data may be missing (Failure Mode 2)
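The Step 1 distinction can be checked mechanically when several chat answers have been collected. The sketch below assumes the fallback always begins with the prefix quoted in this report. The function name and the "intermittent-gpu" label are hypothetical additions for the case where only some requests hit the fallback.

```python
GPU_FALLBACK_PREFIX = "Chat is not available right now"


def classify_chat_answers(answers):
    """Classify a batch of chat answers collected for different questions."""
    if not answers:
        return "no-samples"
    if all(a.startswith(GPU_FALLBACK_PREFIX) for a in answers):
        return "failure-mode-1"  # fixed fallback on every request
    if any(a.startswith(GPU_FALLBACK_PREFIX) for a in answers):
        return "intermittent-gpu"  # some requests reached the LLM, some did not
    return "gpu-available"  # varied LLM output; check data (Failure Mode 2)
```

Asking two or three different questions before classifying helps, since a single fallback response cannot distinguish a cold start from a persistent outage.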

Step 2: If GPU Unavailable — Check GPU Configuration

Command 1: Verify GPU instance is configured

aws ssm get-parameter --name /encache/gpu/instance_id --region us-east-1 \
  --query 'Parameter.Value' --output text

  • If returns "none": GPU is not configured → skip to Step 3 (Launch GPU)
  • If returns instance ID (e.g., i-0abc123...): → continue to Command 2 (Check EC2 instance state)

Command 2: Check EC2 instance state

aws ec2 describe-instances --instance-ids i-0abc123 --region us-east-1 \
  --query 'Reservations[*].Instances[*].[State.Name,PrivateIpAddress]' \
  --output text

  • If running with an IP: → continue to Step 4 (Check GPU Health)
  • If stopped, stopping, or pending: → the instance is not ready; see Step 3 (Restart), allow 1-2 min for it to start, then continue to Step 4 (Check GPU Health)
  • If terminated or no instances: → skip to Step 3 (Launch GPU)
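When running this check repeatedly, the text output of the describe-instances command above can be mapped to a next step with a small parser. This is a sketch: the function name and the wording of the returned strings are hypothetical, and it relies on the AWS CLI printing `None` for a missing value in `--output text` mode.

```python
def next_step(describe_output: str) -> str:
    """Map `describe-instances ... --output text` output to a next step."""
    rows = [ln.split() for ln in describe_output.splitlines() if ln.strip()]
    if not rows:
        return "Step 3: no instance found, launch from template"
    state = rows[0][0]
    ip = rows[0][1] if len(rows[0]) > 1 else "None"
    if state == "running" and ip != "None":
        return "Step 4: check GPU health"
    if state == "running":
        return "Step 4: running without an IP, check networking"
    if state in ("stopped", "stopping", "pending"):
        return "Not ready: start it or wait 1-2 min (Step 3), then check health (Step 4)"
    return "Step 3: launch from template"  # terminated, shutting-down, ...
```

Only the first row is inspected, which matches the single-instance query used in this checklist.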

Step 3: If GPU Instance Not Ready — Launch or Restart

If instance is stopped: It will auto-start on next chat request (line 108)

# Manual start (optional):
aws ec2 start-instances --instance-ids i-0abc123 --region us-east-1

If instance is terminated or missing: Manually launch from template

aws ec2 run-instances \
  --launch-template LaunchTemplateId=lt-0cf39db4cd6dff510,Version=\$Latest \
  --region us-east-1 \
  --query 'Instances[0].InstanceId' --output text

Capture the returned instance ID and update SSM:

aws ssm put-parameter \
  --name /encache/gpu/instance_id \
  --value i-<new-instance-id> \
  --type String \
  --overwrite \
  --region us-east-1

Redeploy the chat Lambda to pick up the new instance ID:

cd main/server && sam build && sam deploy --guided

Wait 2-5 minutes for the GPU instance to boot and models to load. Then test chat again.
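Instead of waiting a fixed time, the health endpoint can be polled until it reports ready. The sketch below uses only the standard library and assumes, as described earlier in this report, that GET /health returns 503 while models load and 200 once ready. The function names, the 300-second budget, and the 10-second interval are illustrative choices, and the endpoint must be reachable from wherever the script runs.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_health(base_url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of GET /health, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 503 while models are still loading
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure, ...


def wait_until_healthy(
    probe: Callable[[], int],
    max_wait_s: float = 300.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the probe returns 200 or the time budget is spent."""
    waited = 0.0
    while True:
        if probe() == 200:
            return True
        if waited >= max_wait_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

A typical call would be `wait_until_healthy(lambda: probe_health("http://<instance-ip>:8000"))`, with the host left as a placeholder. Passing `sleep` as a parameter keeps the loop testable without real delays.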

Step 4: If GPU Running — Check Health and Logs

Command 1: Connect to GPU instance and check service status

# SSH into the instance (need EC2 keypair)
ssh -i /path/to/key.pem ec2-user@<instance-public-ip>

# Check service
sudo systemctl status vlm2vec

# View logs
sudo tail -100 /var/log/vlm2vec.log

Expected logs while loading:

Loading Qwen2-VL-7B-Instruct (captioning/NER/triples)...
Loading VLM2Vec-2B (embeddings)...
Total GPU memory used: X.XGB

Expected after ready:

Uvicorn running on http://0.0.0.0:8000

Error patterns:

  • torch.cuda.OutOfMemoryError: Instance is too small (need g5.xlarge or larger)
  • ModuleNotFoundError: Missing Python package (should be pre-installed)
  • No such file or directory: Model cache not initialized (wait 5 more min or SSH and check /home/ec2-user/.cache/huggingface/)

Command 2: Check the health-check result in the chat Lambda logs

# From Lambda logs (CloudWatch)
aws logs tail /aws/lambda/server-MemoriesChatFunction-OkmZYszwOXzJ --follow

Look for:

  • chat_gpu_resolved url=http://....:8000 → GPU found and healthy
  • Failed to resolve GPU URL → GPU query failed
  • GPU not healthy at http://... → Health check returned 503 or error
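For longer log windows, the three messages above can be counted with a short script. This is a sketch that matches on the substrings quoted in this report. The surrounding log line format (timestamps, request IDs) is not specified here, so the function only looks for the substrings and ignores everything else.

```python
LOG_PATTERNS = [
    ("chat_gpu_resolved url=", "GPU found and healthy"),
    ("Failed to resolve GPU URL", "GPU query failed"),
    ("GPU not healthy at", "health check returned 503 or error"),
]


def classify_log_lines(lines):
    """Count how often each known GPU-resolution message appears."""
    counts = {meaning: 0 for _, meaning in LOG_PATTERNS}
    for line in lines:
        for needle, meaning in LOG_PATTERNS:
            if needle in line:
                counts[meaning] += 1
                break  # a line is attributed to at most one pattern
    return counts
```

The output of `aws logs tail` can be piped in by passing `sys.stdin` as `lines`. A high count of "health check returned 503 or error" with no "GPU found and healthy" entries points at model loading rather than instance state.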

Step 5: If GPU Healthy — Check User Data in PostgreSQL

Connect to PostgreSQL (requires AWS/VPN access):

psql postgresql://<user>:<pass>@<host>:5432/metadata

-- Check if user has any segments
SELECT COUNT(*) as segment_count FROM "WorldMMSegment"
WHERE user_id = 'a408c4d8-60d1-7085-0e89-8b28e7102455';

-- Check if user has any triples
SELECT COUNT(*) as triple_count FROM "WorldMMTriple"
WHERE user_id = 'a408c4d8-60d1-7085-0e89-8b28e7102455'
AND invalidated_at IS NULL;

-- Check if user has any entities
SELECT COUNT(*) as entity_count FROM "WorldMMEntity"
WHERE user_id = 'a408c4d8-60d1-7085-0e89-8b28e7102455';

  • If all counts are 0 → Data was never ingested. Skip to Step 6 (Verify Ingest)
  • If counts > 0 → Data exists. Chat should work, but may be retrieving poorly → Check retrieval in CloudWatch logs (Step 7)
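The interpretation of the three counts can be written down as a small function. This is a sketch: the function name is hypothetical, and the middle branch (segments present but no valid triples) is an inference from the "Causes of empty data" list, since the triple query above excludes invalidated rows.

```python
def interpret_counts(segments: int, triples: int, entities: int) -> str:
    """Map the Step 5 query results to the next diagnostic step."""
    if segments == 0 and triples == 0 and entities == 0:
        return "Step 6: data was never ingested"
    if segments > 0 and triples == 0:
        # Either triple extraction failed partway or all triples are invalidated.
        return "Step 6: segments exist but no valid triples"
    return "Step 7: data exists, check retrieval"
```

If the middle branch is hit, re-running the triple count without the `invalidated_at IS NULL` filter would show whether the rows exist but were soft-deleted.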

Step 6: If No Data — Verify Ingest Pipeline

Check session end Lambda logs:

aws logs tail /aws/lambda/server-SessionEndFunction-<hash> --since 1h

Check ingest window Lambda logs:

aws logs tail /aws/lambda/server-IngestWindowFunction-<hash> --since 1h

Check ingest DLQ (if windows failed and retried):

aws logs tail /aws/lambda/server-IngestDLQConsumer-<hash> --since 1h

Look for:

  • ingest_session_start event logged
  • ingest_window_start events for each window
  • No error/exception logs
  • world_mm_segment_stored events (confirms segments written)

If ingest logs show failures:

  • Groq API error: Check GROQ_API_KEY in SSM
  • GPU timeout: GPU was starting, ingest timed out; retry ingest
  • Database connection error: Check PostgreSQL connectivity from Lambda VPC
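The log checks above can be scripted against captured log lines. The sketch below looks for the three event names mentioned in this report. The error markers (`ERROR`, `Exception`, `Traceback`) are assumptions about how failures appear in the logs and may need adjusting.

```python
REQUIRED_EVENTS = (
    "ingest_session_start",
    "ingest_window_start",
    "world_mm_segment_stored",
)
ERROR_MARKERS = ("ERROR", "Exception", "Traceback")  # assumed markers


def check_ingest_logs(lines):
    """Report missing ingest events and any lines that look like errors."""
    lines = list(lines)
    missing = [e for e in REQUIRED_EVENTS if not any(e in ln for ln in lines)]
    errors = [ln for ln in lines if any(m in ln for m in ERROR_MARKERS)]
    return {
        "missing_events": missing,
        "error_lines": errors,
        "ok": not missing and not errors,
    }
```

A result with `world_mm_segment_stored` in `missing_events` but the two start events present suggests the pipeline ran and failed partway, which matches cause 2 in the empty-data list.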

Manually re-trigger ingest (if needed):

# Find the session ID from S3
aws s3 ls s3://encache-raw-memory/sessions/ | grep <user-id-prefix>

# Invoke ingest Lambda manually
aws lambda invoke \
  --function-name server-RetriggerIngestFunction \
  --payload '{"session_id":"<session-uuid>"}' \
  --region us-east-1 \
  /tmp/response.json

Step 7: If Data Exists — Check Retrieval and LLM Quality

Check chat Lambda verbose logs (set verbose=true in chat request):

POST /memories/chat
{
  "question": "What did I do?",
  "verbose": true
}

Response includes trace object with:

  • rounds: Each search round (episodic, semantic, visual)
  • memory_types_used: Which retrievers returned results
  • total_results_retrieved: Count of results

If retrieval is empty (total_results_retrieved = 0):

  • Graphs loaded successfully but returned no results
  • Could be:
       • Query embedding mismatch (text embedder using wrong model)
       • Entity resolution failed
       • Segments exist but have no triples
       • Visual embeddings in wrong vector space (wrong adapter at ingest time)

If retrieval has results (total_results_retrieved > 0):

  • LLM received context but answered poorly
  • Could be:
       • LLM model changed and new model lacks knowledge
       • GPU worker is using wrong model checkpoint
       • Context was semantically irrelevant despite text match
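The two branches above can be applied to a verbose response with a small helper. This is a sketch that assumes `rounds` and `memory_types_used` are lists and `total_results_retrieved` is an integer. The exact shape of the trace object is not documented in this report, and the function name is hypothetical.

```python
def summarize_trace(trace: dict) -> str:
    """Turn a verbose-mode trace into a one-line diagnostic hint."""
    total = trace.get("total_results_retrieved", 0)
    used = trace.get("memory_types_used", [])
    rounds = trace.get("rounds", [])
    if total == 0:
        return (
            f"retrieval empty after {len(rounds)} round(s): "
            "investigate embeddings, entity resolution, or missing triples"
        )
    sources = ", ".join(used) or "unknown"
    return f"{total} result(s) from {sources}: retrieval worked, inspect answer quality"
```

Missing keys fall back to empty values, so a response without a trace is reported as empty retrieval rather than raising an error.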

Resolution Steps (Summary)

  1. Check if GPU is configured (Step 2, Command 1)
       • If "none": Launch GPU instance manually (Step 3)
       • If instance ID exists: Check EC2 state (Step 2, Command 2)

  2. Wait for GPU to boot and models to load (2-5 minutes)

  3. Verify with systemctl status vlm2vec (Step 4, Command 1)

  4. Verify GPU_INSTANCE_ID is in Lambda env (redeploy SAM if changed)

  5. Test chat endpoint (should work if GPU is healthy)

  6. If still no context:
       • Check PostgreSQL for user data (Step 5)
       • If no data: Check ingest logs and re-trigger if needed (Step 6)
       • If data exists: Check retrieval quality with verbose mode (Step 7)

Key Files

| File | Purpose |
| --- | --- |
| main/server/api/memories/chat/app.py | Chat Lambda, GPU resolution logic, fallback message (line 275) |
| main/server/worldmm/retrieval/db_loader.py | Loads episodic/semantic graphs from PostgreSQL |
| main/server/worldmm/gpu_worker/server.py | GPU worker health endpoint, model loading |
| main/server/worldmm/gpu_worker/ec2_user_data.sh | EC2 startup script, model pre-baking detection |
| main/devops/main.tf | GPU launch template, SSM parameters, IAM roles |
| main/server/template.yaml | SAM template, chat Lambda env vars |
| docs/bugs/2026-05-06-chat-gpu-worker-starting-up-spot-instance.md | Prior GPU startup investigation |
| docs/bugs/2026-05-07-wrong-lora-adapter-vlm2vec-embedding.md | Visual embedding quality issue |

Notes

  • The system prioritizes GPU availability over graceful degradation. The chat Lambda does NOT have a fallback LLM on CPU; if the GPU is unavailable, users get the "starting up" message.
  • The pre-baked AMI (ami-05bea3b3b7c57278e) has models baked in, reducing cold start from 10-15 min to 2-5 min.
  • Spot interruptions are handled via re-launch, but this is slow. Consider switching to on-demand or persistent spot for better availability.
  • Visual embeddings can be in wrong vector space if ingested before the LoRA adapter fix (commit 74c19e7). Old embeddings should be re-generated.

See Also

  • docs/docs/memories-chat.md — Chat endpoint architecture
  • docs/docs/ingest-session.md — Ingest pipeline details
  • docs/docs/gpu-worker.md — GPU worker deployment and models