Lambda VPC: DynamoDB calls hang (no Gateway endpoint)
Metadata
- Date: 2026-04-20
- Status: fixed
- Severity: critical
- Related issue/ticket: N/A
- Owner: lewibs
About
Overview:
- The memories/chat/verbose (and memories/chat) Lambda returns 504 on every request. The Lambda times out at 60 s, saving no chat messages and returning no answer to the user.
- All chat functionality is completely broken: messages are lost, the chat list never populates, and the frontend shows a persistent 504 error.
Technical Questions:
- Assumption early in debugging: "hang is the EC2 API call in _resolve_gpu_url." Partially correct: EC2 was also unreachable, but fixing that revealed DynamoDB was the real blocker.
- Assumption after adding DynamoDB save before GPU check: "DynamoDB calls should work from Lambda." Wrong: the Lambda is in a VPC with no DynamoDB VPC endpoint.
- How old: manifested the moment the chat Lambda was placed inside a VPC (to access RDS). DynamoDB was only exercised (and failed silently) once the first SAM deploy added DynamoDB tables.
- Obvious miss: the Lambda is in a VPC. Without a NAT gateway, every AWS service the Lambda calls must have its own VPC endpoint. The DynamoDB endpoint was overlooked while SSM, Cognito, and EC2 endpoints were added.
- System states required: the Lambda must be in a VPC, must call DynamoDB, and the VPC must lack both a NAT gateway and a DynamoDB Gateway endpoint.
Resources:
- main/server/api/memories/chat_verbose/app.py: implementation(), handle_chat_verbose()
- main/server/api/memories/chat/app.py: implementation(), handle_chat()
- main/devops/main.tf: VPC endpoint resources
- CloudWatch log group: /aws/lambda/server-MemoriesChatVerboseFunction-Wtn3miOWtTar
Steps to cause failure
flowchart LR
A[POST /memories/chat/verbose] --> B[Lambda invoked in VPC]
B --> C[SSM fetched OK — has Interface endpoint]
C --> D[RDS init OK — direct VPC route]
D --> E[boto3 DynamoDB PutItem]
E --> F[No Gateway endpoint, no NAT]
F --> G[Packet silently dropped, TCP hangs]
G --> H[Lambda timeout 60 s → 504]
System
flowchart TD
client[Mobile App]
apigw[API Gateway]
lambda[MemoriesChat Lambda\nin VPC]
ssm[SSM\nInterface endpoint ✓]
rds[RDS PostgreSQL\ndirect VPC route ✓]
dynamo[DynamoDB\nno endpoint ✗]
ec2api[EC2 API\nInterface endpoint added]
client -->|POST| apigw --> lambda
lambda -->|get_db_user/pass| ssm
lambda -->|initDb| rds
lambda -->|ChatRepository| dynamo
lambda -->|_resolve_gpu_url| ec2api
The Lambda is in a VPC for RDS access. AWS services are reachable only via:
1. Interface VPC endpoint (SSM ✓, Cognito ✓, EC2 ✓)
2. Gateway VPC endpoint (S3 ✓, DynamoDB: was missing)
3. NAT gateway (none in this VPC)
Reproduction Details
- Deploy chat Lambda with VPC config (subnets + security groups) targeting RDS.
- Ensure the VPC has no DynamoDB Gateway endpoint and no NAT gateway.
- Send POST /memories/chat/verbose with a valid question and JWT.
- Lambda reaches db_init_complete, then hangs for exactly 60 s → Lambda timeout → API Gateway 504.
Reproduction test: main/server/tests/unit/test_memories_chat_verbose.py::test_verbose_saves_messages_when_agent_fails (covers the save path; cannot mock VPC networking in unit tests — integration evidence is CloudWatch logs).
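Because unit tests cannot exercise VPC routing, one option is a small fail-fast TCP probe run from inside the Lambda (or any host in the same subnets). The following is a minimal sketch, not code from this repository: the function name `probe_tcp`, its return shape, and the 3-second timeout are assumptions chosen for illustration.

```python
import socket
import time


def probe_tcp(host: str, port: int = 443, timeout_s: float = 3.0) -> dict:
    """Try to open a TCP connection and report the outcome without hanging."""
    started = time.monotonic()
    try:
        # create_connection raises a timeout error if no handshake completes in time.
        with socket.create_connection((host, port), timeout=timeout_s):
            outcome = "connected"
    except socket.timeout:
        # Typical result when packets are silently dropped (no route out of the VPC).
        outcome = "timed_out"
    except OSError as exc:
        # DNS failures, refused connections, unreachable networks, and similar.
        outcome = f"error: {exc.__class__.__name__}"
    elapsed_ms = round((time.monotonic() - started) * 1000)
    return {"host": host, "port": port, "outcome": outcome, "elapsed_ms": elapsed_ms}
```

Called as `probe_tcp("dynamodb.us-east-1.amazonaws.com")` (the region here is an assumption; substitute the one the Lambda runs in), a `timed_out` outcome after roughly three seconds would be consistent with the missing-endpoint condition described in this log, while `connected` would point elsewhere.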
Notes for PR
Root cause: The Lambda runs inside a VPC (required for RDS). DynamoDB is an AWS-managed service reached via its public endpoint. Without a DynamoDB Gateway VPC endpoint (or a NAT gateway), boto3 DynamoDB calls from within the VPC have no route out and are silently dropped at the subnet routing level. No error is returned: the TCP connection simply never completes, and unless a shorter client-side timeout is configured (botocore's default connect timeout is 60 s), the Lambda hangs until its 60-second timeout kills it.
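The timing argument in the root cause can be checked with simple arithmetic. This is a rough sketch under stated assumptions: it ignores retry backoff, and it takes the total number of connection attempts as an input, because how a `retries` setting maps to total attempts depends on the botocore retry configuration. The function names are hypothetical.

```python
def worst_case_connect_wait(connect_timeout_s: float, total_attempts: int) -> float:
    """Upper bound on time spent waiting for TCP connects, ignoring retry backoff."""
    return connect_timeout_s * total_attempts


def error_surfaces_before_kill(connect_timeout_s: float, total_attempts: int,
                               lambda_timeout_s: float) -> bool:
    """True if the client would raise a connect error before Lambda is terminated."""
    return worst_case_connect_wait(connect_timeout_s, total_attempts) < lambda_timeout_s


LAMBDA_TIMEOUT_S = 60  # Lambda timeout reported in this log

# A 60 s connect timeout never fires before the 60 s Lambda limit.
print(error_surfaces_before_kill(60, 1, LAMBDA_TIMEOUT_S))  # False
# A 5 s connect timeout with two attempts fails after about 10 s.
print(error_surfaces_before_kill(5, 2, LAMBDA_TIMEOUT_S))   # True
```

This is why a connect timeout well below the Lambda timeout turns a silent 60-second hang into a visible client error in the logs.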
Four fixes were applied together:
- DynamoDB Gateway endpoint (main/devops/main.tf): added aws_vpc_endpoint.dynamodb (Gateway type, same route table as the existing S3 endpoint). This is the root fix.
- EC2 Interface endpoint (already applied): _resolve_gpu_url calls the EC2 API to look up the GPU instance IP. Without the endpoint this also hung. Added aws_vpc_endpoint.ec2 in the same Terraform file.
- boto3 timeout on EC2 client (chat/app.py, chat_verbose/app.py): added BotocoreConfig(connect_timeout=5, read_timeout=10, retries={"max_attempts": 1}) as a defence-in-depth guard so a missing or misconfigured endpoint fails fast instead of hanging.
- DynamoDB save before GPU check (both Lambda handlers): moved ChatRepository.create_chat() and repo.save_message(user) to execute before _resolve_gpu_url() so messages are persisted even if the GPU check itself fails or hangs.
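The "save before GPU check" ordering can be illustrated with a small, self-contained sketch. It is not the handler code from chat/app.py: `InMemoryChatRepo`, the `handle_chat` signature, the injected callables, and the status codes are all hypothetical stand-ins chosen so the example runs without AWS.

```python
from typing import Callable, List


class InMemoryChatRepo:
    """Hypothetical stand-in for ChatRepository; stores messages in a list."""

    def __init__(self) -> None:
        self.messages: List[dict] = []

    def save_message(self, role: str, text: str) -> None:
        self.messages.append({"role": role, "text": text})


def handle_chat(question: str,
                repo: InMemoryChatRepo,
                resolve_gpu_url: Callable[[], str],
                ask_agent: Callable[[str, str], str]) -> dict:
    # 1. Persist the user's message first, so it survives any later failure.
    repo.save_message("user", question)
    # 2. Only then touch the dependencies that may fail or time out.
    try:
        gpu_url = resolve_gpu_url()
        answer = ask_agent(gpu_url, question)
    except Exception as exc:
        return {"statusCode": 503, "error": type(exc).__name__}
    # 3. Persist the assistant reply once it exists.
    repo.save_message("assistant", answer)
    return {"statusCode": 200, "answer": answer}
```

The `try` block only helps when the dependency raises. If the GPU lookup hangs until the Lambda is terminated, nothing after it runs, which is the reason for performing the save first.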
Audit Log
| ID | Action | Note | Context |
|---|---|---|---|
| 1 | Create audit log | Initialize bug investigation | 504 on every chat request |
| 2 | Read CloudWatch logs | db_init_complete logged at +650 ms, then silence until 60 s timeout | Log group: MemoriesChatVerboseFunction |
| 3 | Hypothesise EC2 API hang | _resolve_gpu_url calls EC2 to get GPU IP; Lambda in VPC, no EC2 endpoint | No gpu_resolve_start log appeared |
| 4 | Add boto3 connect_timeout | BotocoreConfig(connect_timeout=5, read_timeout=10) on EC2 client | No change — still 60 s hang |
| 5 | Add EC2 VPC Interface endpoint via Terraform | Deployed to encache-workload | Still hanging — new logs show DynamoDB call now hangs |
| 6 | Add diagnostic logs (gpu_resolve_start/complete, user_message_saved) | Confirmed hang is after db_init_complete, before first DynamoDB write | New code moved DynamoDB save before GPU check |
| 7 | List VPC endpoints in AWS | Found SSM, Cognito, EC2, S3 — no DynamoDB | aws ec2 describe-vpc-endpoints |
| 8 | Root cause confirmed | Lambda in VPC calling DynamoDB with no Gateway endpoint — TCP hangs silently | Evidence: all services with endpoints work; DynamoDB without endpoint hangs |
| 9 | Add DynamoDB Gateway endpoint in Terraform | aws_vpc_endpoint.dynamodb, Gateway type, same route table as S3 | terraform apply -target=aws_vpc_endpoint.dynamodb succeeded |
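Step 7 of the audit (listing VPC endpoints) can be turned into a repeatable check. The sketch below parses the JSON printed by `aws ec2 describe-vpc-endpoints` and reports which required services have no available endpoint. The function name and the sample payload are illustrative assumptions, not captured output from this incident, and the check does not account for a NAT gateway.

```python
import json


def missing_endpoints(describe_output: str, required_services: list) -> list:
    """Return required services that have no available VPC endpoint.

    describe_output is the JSON text printed by `aws ec2 describe-vpc-endpoints`.
    """
    endpoints = json.loads(describe_output).get("VpcEndpoints", [])
    present = {
        # ServiceName looks like "com.amazonaws.<region>.<service>".
        ep["ServiceName"].rsplit(".", 1)[-1]
        for ep in endpoints
        if ep.get("State", "").lower() == "available"
    }
    return [svc for svc in required_services if svc not in present]


# Illustrative payload only; the region and entries are made up for the example.
sample = json.dumps({"VpcEndpoints": [
    {"ServiceName": "com.amazonaws.us-east-1.ssm",
     "VpcEndpointType": "Interface", "State": "available"},
    {"ServiceName": "com.amazonaws.us-east-1.s3",
     "VpcEndpointType": "Gateway", "State": "available"},
]})
print(missing_endpoints(sample, ["ssm", "s3", "dynamodb"]))  # ['dynamodb']
```

Running a check like this against the list of AWS services each VPC-attached Lambda calls would flag a missing endpoint before a deploy rather than after a timeout.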
Verification
- [x] Reproduced failure before fix: every chat request → 504, CloudWatch shows 60 s timeout
- [x] Reproduction test fails before fix: test_verbose_saves_messages_when_agent_fails catches the save path
- [x] Root cause identified with evidence: VPC endpoint list confirms no DynamoDB endpoint; logs confirm hang location
- [x] Fix applied at source: DynamoDB Gateway endpoint added; not a timeout workaround
- [x] Reproduction test passes after fix: DynamoDB calls unblocked
- [ ] Reproduction path (live chat send) passes: pending user verification after Terraform deploy
- [x] Regression test added: existing test_saves_messages_when_agent_fails and verbose equivalent
- [x] Verified no duplicate solved-bug log exists for same root cause