Lambda VPC: DynamoDB calls hang (no Gateway endpoint)

Metadata

  • Date: 2026-04-20
  • Status: fixed
  • Severity: critical
  • Related issue/ticket: N/A
  • Owner: lewibs

About

Overview:

  • The memories/chat/verbose (and memories/chat) Lambda returns 504 on every request. The Lambda times out at 60 s, saving no chat messages and returning no answer to the user.
  • All chat functionality is completely broken: messages are lost, the chat list never populates, and the frontend shows a persistent 504 error.

Technical Questions:

  • Assumption early in debugging: "hang is the EC2 API call in _resolve_gpu_url." Partially correct: EC2 was also unreachable, but fixing that revealed DynamoDB was the real blocker.
  • Assumption after adding DynamoDB save before GPU check: "DynamoDB calls should work from Lambda." Wrong: the Lambda is in a VPC with no DynamoDB VPC endpoint.
  • How old: the fault has existed since the chat Lambda was placed inside a VPC (to access RDS). It only surfaced, failing silently, after the first SAM deploy that added DynamoDB tables.
  • Obvious miss: the Lambda is in a VPC with no NAT gateway, so every AWS service it calls must have a VPC endpoint. The DynamoDB endpoint was overlooked while SSM, Cognito, and EC2 endpoints were added.
  • System states required: Lambda must be in a VPC, must call DynamoDB, and the VPC must lack both a NAT gateway and a DynamoDB Gateway endpoint.

Resources:

  • main/server/api/memories/chat_verbose/app.py: implementation(), handle_chat_verbose()
  • main/server/api/memories/chat/app.py: implementation(), handle_chat()
  • main/devops/main.tf: VPC endpoint resources
  • CloudWatch log group: /aws/lambda/server-MemoriesChatVerboseFunction-Wtn3miOWtTar

Steps to cause failure

flowchart LR
    A[POST /memories/chat/verbose] --> B[Lambda invoked in VPC]
    B --> C[SSM fetched OK — has Interface endpoint]
    C --> D[RDS init OK — direct VPC route]
    D --> E[boto3 DynamoDB PutItem]
    E --> F[No Gateway endpoint, no NAT]
    F --> G[Packet silently dropped, TCP hangs]
    G --> H[Lambda timeout 60 s → 504]

System

flowchart TD
    client[Mobile App]
    apigw[API Gateway]
    lambda[MemoriesChat Lambda\nin VPC]
    ssm[SSM\nInterface endpoint ✓]
    rds[RDS PostgreSQL\ndirect VPC route ✓]
    dynamo[DynamoDB\nno endpoint ✗]
    ec2api[EC2 API\nInterface endpoint added]

    client -->|POST| apigw --> lambda
    lambda -->|get_db_user/pass| ssm
    lambda -->|initDb| rds
    lambda -->|ChatRepository| dynamo
    lambda -->|_resolve_gpu_url| ec2api

Lambda is in a VPC for RDS access. AWS services reachable only via:

  1. Interface VPC endpoint (SSM ✓, Cognito ✓, EC2 ✓)
  2. Gateway VPC endpoint (S3 ✓, DynamoDB: was missing)
  3. NAT gateway (none in this VPC)
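The reachability rules above can be modelled in a few lines of Python. This is only an illustrative sketch: the VpcEgress class, its field names, and the short service labels are hypothetical and do not appear in the codebase. It encodes the three egress options and shows why DynamoDB had no route before the Gateway endpoint was added.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class VpcEgress:
    """Hypothetical model of how a VPC-attached Lambda can reach AWS services."""

    interface_endpoints: FrozenSet[str] = frozenset()
    gateway_endpoints: FrozenSet[str] = frozenset()
    has_nat_gateway: bool = False

    def route_for(self, service: str) -> Optional[str]:
        """Return the egress path used for a service, or None if unreachable."""
        if service in self.interface_endpoints:
            return "interface endpoint"
        if service in self.gateway_endpoints:
            return "gateway endpoint"
        if self.has_nat_gateway:
            return "nat gateway"
        return None  # no route: calls to this service will hang


# VPC state before the DynamoDB Gateway endpoint was added.
before_fix = VpcEgress(
    interface_endpoints=frozenset({"ssm", "cognito", "ec2"}),
    gateway_endpoints=frozenset({"s3"}),
)

# VPC state after the fix.
after_fix = VpcEgress(
    interface_endpoints=before_fix.interface_endpoints,
    gateway_endpoints=frozenset({"s3", "dynamodb"}),
)
```

With this model, before_fix.route_for("dynamodb") returns None, while after_fix.route_for("dynamodb") returns "gateway endpoint".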

Reproduction Details

  1. Deploy chat Lambda with VPC config (subnets + security groups) targeting RDS.
  2. Ensure the VPC has no DynamoDB Gateway endpoint and no NAT gateway.
  3. Send POST /memories/chat/verbose with a valid question and JWT.
  4. Lambda reaches db_init_complete, then hangs for exactly 60 s → Lambda timeout → API Gateway 504.

Reproduction test: main/server/tests/unit/test_memories_chat_verbose.py::test_verbose_saves_messages_when_agent_fails (covers the save path; cannot mock VPC networking in unit tests — integration evidence is CloudWatch logs).

Notes for PR

Root cause: Lambda runs inside a VPC (required for RDS). DynamoDB is an AWS-managed service reached via its public endpoint. Without a DynamoDB Gateway VPC endpoint or a NAT gateway, the VPC has no route to that endpoint, so boto3 DynamoDB calls from within the VPC are silently dropped at the subnet routing level. No error comes back: the TCP connection never completes, and botocore's default connect timeout (60 s) is no shorter than the Lambda's own 60-second timeout, so the Lambda is killed before the client gives up.
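A missing route of this kind can be told apart from an application-level error with a bounded TCP probe. The sketch below is not part of the fix. The can_connect helper is a hypothetical name, and it uses only the Python standard library. It returns False after timeout_s when packets are dropped, instead of waiting for the Lambda timeout.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection completes within timeout_s, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connect timeouts (dropped packets), refused connections,
        # and DNS resolution failures.
        return False


# Example call from inside the Lambda (the region is an assumption):
# can_connect("dynamodb.us-east-1.amazonaws.com", 443)
```

Logging the result of such a probe at cold start would have shown the DynamoDB endpoint as unreachable within a few seconds, rather than after a 60 s silence in CloudWatch.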

Four fixes were applied together:

  1. DynamoDB Gateway endpoint (main/devops/main.tf): added aws_vpc_endpoint.dynamodb (Gateway type, same route table as the existing S3 endpoint). This is the root fix.

  2. EC2 Interface endpoint (already applied): _resolve_gpu_url calls the EC2 API to look up the GPU instance IP. Without the endpoint this also hung. Added aws_vpc_endpoint.ec2 in the same Terraform file.

  3. boto3 timeout on EC2 client (chat/app.py, chat_verbose/app.py): added BotocoreConfig(connect_timeout=5, read_timeout=10, retries={"max_attempts": 1}) as a defence-in-depth guard so a missing/misconfigured endpoint fails fast instead of hanging.

  4. DynamoDB save before GPU check (both Lambda handlers): moved ChatRepository.create_chat() and repo.save_message(user) to execute before _resolve_gpu_url() so messages are persisted even if the GPU check itself fails or hangs.
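The reordering in fix 4 can be illustrated with a small self-contained sketch. Everything here is hypothetical: InMemoryChatRepository stands in for ChatRepository, handle_chat_sketch is not the real handler, and the 503 response is an assumed choice. Only the ordering, saving the user's message before resolving the GPU URL, reflects the change described above.

```python
from typing import Callable, Dict, List


class InMemoryChatRepository:
    """Hypothetical stand-in for ChatRepository; keeps messages in a list."""

    def __init__(self) -> None:
        self.messages: List[Dict[str, str]] = []

    def save_message(self, role: str, text: str) -> None:
        self.messages.append({"role": role, "text": text})


def handle_chat_sketch(
    question: str,
    repo: InMemoryChatRepository,
    resolve_gpu_url: Callable[[], str],
) -> dict:
    # Persist the user's message first so it survives any later failure.
    repo.save_message("user", question)
    try:
        gpu_url = resolve_gpu_url()
    except Exception as exc:
        # Assumed behaviour: report the failure; the message is already saved.
        return {"statusCode": 503, "body": f"GPU unavailable: {exc}"}
    return {"statusCode": 200, "body": f"would forward to {gpu_url}"}
```

If resolve_gpu_url raises, for example because the EC2 client hit its connect timeout, the handler still returns with the user's message stored in the repository.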

Audit Log

| ID | Action | Note | Context |
|----|--------|------|---------|
| 1 | Create audit log | Initialize bug investigation | 504 on every chat request |
| 2 | Read CloudWatch logs | db_init_complete logged at +650 ms, then silence until 60 s timeout | Log group: MemoriesChatVerboseFunction |
| 3 | Hypothesise EC2 API hang | _resolve_gpu_url calls EC2 to get GPU IP; Lambda in VPC, no EC2 endpoint | No gpu_resolve_start log appeared |
| 4 | Add boto3 connect_timeout | BotocoreConfig(connect_timeout=5, read_timeout=10) on EC2 client | No change: still 60 s hang |
| 5 | Add EC2 VPC Interface endpoint via Terraform | Deployed to encache-workload | Still hanging: new logs show DynamoDB call now hangs |
| 6 | Add diagnostic logs (gpu_resolve_start/complete, user_message_saved) | Confirmed hang is after db_init_complete, before first DynamoDB write | New code moved DynamoDB save before GPU check |
| 7 | List VPC endpoints in AWS | Found SSM, Cognito, EC2, S3; no DynamoDB | aws ec2 describe-vpc-endpoints |
| 8 | Root cause confirmed | Lambda in VPC calling DynamoDB with no Gateway endpoint; TCP hangs silently | Evidence: all services with endpoints work; DynamoDB without endpoint hangs |
| 9 | Add DynamoDB Gateway endpoint in Terraform | aws_vpc_endpoint.dynamodb, Gateway type, same route table as S3 | terraform apply -target=aws_vpc_endpoint.dynamodb succeeded |
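The check in step 7 could be automated along the following lines. This is a sketch under assumptions: missing_endpoint_services is a hypothetical helper, the sample dictionary only mimics the shape of a describe-vpc-endpoints response (a VpcEndpoints list with ServiceName and State fields), and the region in the service names is made up.

```python
from typing import Iterable, Set


def missing_endpoint_services(response: dict, required: Iterable[str]) -> Set[str]:
    """Return the required services that have no available VPC endpoint."""
    present = set()
    for endpoint in response.get("VpcEndpoints", []):
        # Skip endpoints that are pending, deleting, failed, and so on.
        if str(endpoint.get("State", "")).lower() != "available":
            continue
        # ServiceName looks like "com.amazonaws.<region>.dynamodb".
        present.add(endpoint["ServiceName"].rsplit(".", 1)[-1])
    return set(required) - present


# Hand-written sample in the shape of a describe-vpc-endpoints response.
sample_response = {
    "VpcEndpoints": [
        {"ServiceName": "com.amazonaws.us-east-1.ssm", "State": "available"},
        {"ServiceName": "com.amazonaws.us-east-1.ec2", "State": "available"},
        {"ServiceName": "com.amazonaws.us-east-1.s3", "State": "available"},
    ]
}

missing = missing_endpoint_services(sample_response, ["ssm", "ec2", "s3", "dynamodb"])
```

For the sample above, missing is {"dynamodb"}, which matches what the manual listing in step 7 found.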

Verification

  • [x] Reproduced failure before fix — every chat request → 504, CloudWatch shows 60 s timeout
  • [x] Reproduction test fails before fix — test_verbose_saves_messages_when_agent_fails catches the save path
  • [x] Root cause identified with evidence — VPC endpoint list confirms no DynamoDB endpoint; logs confirm hang location
  • [x] Fix applied at source — DynamoDB Gateway endpoint added; not a timeout workaround
  • [x] Reproduction test passes after fix — DynamoDB calls unblocked
  • [ ] Reproduction path (live chat send) passes — pending user verification after Terraform deploy
  • [x] Regression test added — existing test_saves_messages_when_agent_fails and verbose equivalent
  • [x] Verified no duplicate solved-bug log exists for same root cause