Memory Feed "Network Error" After SAM Redeploy
Metadata
- Date: 2026-04-18
- Status: resolved (2026-04-21)
- Severity: critical
- Related issue/ticket: N/A
- Owner: N/A
About
Overview:
- After redeploying the SAM stack (sam build --no-use-container && sam deploy), the memory feed returns feed_loading_failed {"error": "Network Error"}.
- "Network Error" is an axios ERR_NETWORK: the HTTP request is dispatched but no response is received. This is distinct from a 403 (the server rejects the request) or an auth timeout (the interceptor throws before dispatch).
- Related prior bug: 2026-04-18-persistent-403-after-login-android.md (status: root-cause-identified). That bug's prescribed fix was a SAM redeploy. After the redeploy, the symptom changed from 403 → Network Error.

Technical Questions:
- Is api.encache.ai still reachable after the SAM redeploy? The custom domain is manually managed (outside CloudFormation; see template.yaml:597). Does it still map to the correct API e7fi6pcrka and stage Prod?
- Did the SAM redeploy recreate or modify the API Gateway in a way that broke the manually managed base path mapping?
- Is the device/emulator on a network that can reach api.encache.ai? (Previously, curl from the dev machine returned 403, which is expected for unauthenticated requests.)
- Is the request interceptor timing out via withAuthTimeout, with the error message being swallowed somewhere?

Resources:
- main/app/lib/api/getApi.ts: withAuthTimeout, request/response interceptors
- main/app/lib/api/memory/useMemoryApi.ts: useMemoriesFeed queryFn, feed_loading_failed log
- main/app/app/index.tsx:107: feed_loading_failed useEffect
- main/server/template.yaml:597: note that the custom domain is managed outside CloudFormation
- AWS Console: API Gateway → Custom Domain Names → api.encache.ai → Mappings
Steps to cause failure
flowchart LR
AppStart --> FeedQuery["useMemoriesFeed fires"]
FeedQuery --> AxiosPost["axios.post /memories/feed"]
AxiosPost --> NetworkFail["No response received\n(ERR_NETWORK)"]
NetworkFail --> ResponseInterceptor["Response interceptor:\nstatus=undefined → request_error"]
ResponseInterceptor --> QueryRetry["React Query retries 3x"]
QueryRetry --> IsError["isError=true"]
IsError --> FeedLoadingFailed["feed_loading_failed\n{error: 'Network Error'}"]
System
flowchart TD
MobileApp -->|EXPO_PUBLIC_API_BASE_URL=https://api.encache.ai| CustomDomain["api.encache.ai\n→ d-iitdlnbcj3.execute-api.us-east-1.amazonaws.com"]
CustomDomain -->|BasePathMapping\n(manually managed)| APIGW["API Gateway e7fi6pcrka\nStage: Prod"]
APIGW --> Lambda["Lambda functions"]
Reproduction Details
- Deploy SAM stack (sam build --no-use-container && sam deploy).
- Open the app on a device.
- Observe feed_loading_failed {"error": "Network Error"} in Metro/device logs.
Reproduction test: N/A — requires live AWS + device.
Diagnostic Steps
Run these to identify root cause:
# 1. Verify api.encache.ai is still reachable and returns 403 (not connection error)
curl -v https://api.encache.ai/memories/feed -X POST 2>&1 | grep -E "< HTTP|SSL|Connected|curl:"
# 2. Check what API + stage the custom domain currently maps to
aws apigateway get-base-path-mappings --domain-name api.encache.ai
# 3. Confirm the deployed API ID and stage match
# Expected: restApiId=e7fi6pcrka, stage=Prod, basePath=(none)
If get-base-path-mappings shows a different restApiId than e7fi6pcrka, the mapping needs to be updated:
aws apigateway update-base-path-mapping \
--domain-name api.encache.ai \
--base-path "(none)" \
--patch-operations op=replace,path=/restapiId,value=e7fi6pcrka
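
Diagnostic step 1 hinges on telling "an HTTP response came back (even a 403)" apart from "no response at all". A minimal Python sketch of that distinction, using only the standard library, is below. The function names classify_probe and post_feed are hypothetical and not part of the repo; the URL is the one used in the curl command above.

```python
import urllib.error
import urllib.request
from typing import Callable


def classify_probe(send: Callable[[], int]) -> str:
    """Run one request and report whether any HTTP response came back."""
    try:
        status = send()
    except urllib.error.HTTPError as exc:
        # urllib raises on 4xx/5xx, but a response WAS received.
        return f"http_response:{exc.code}"
    except (urllib.error.URLError, TimeoutError, OSError) as exc:
        # DNS failure, refused connection, TLS failure or timeout: no response.
        return f"no_response:{type(exc).__name__}"
    return f"http_response:{status}"


def post_feed(
    url: str = "https://api.encache.ai/memories/feed",
    timeout: float = 10.0,
) -> int:
    """Unauthenticated POST; the log above expects a 403 when the API is reachable."""
    request = urllib.request.Request(
        url,
        data=b"{}",
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Usage (performs a real network call):
#   print(classify_probe(post_feed))
```

A result starting with http_response means the custom domain and API Gateway are reachable; no_response corresponds to what axios reports as ERR_NETWORK.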
Notes for PR
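Initial root cause (2026-04-18, later superseded; see Audit Log entries 7-23): MemoriesFeedFunction in template.yaml was missing VpcConfig. Every other DB-accessing Lambda had VpcConfig with LambdaSecurityGroupId and LambdaSubnetIds, but the memories feed function did not. Without VPC membership, the Lambda cannot reach the private RDS instance at 172.31.0.82:5432. Each invocation took ~15s to time out (DB connection timeout), which exceeded the axios 10s client timeout, and React Native surfaced this as "Network Error".
Initial fix: added VpcConfig to MemoriesFeedFunction in template.yaml matching the other DB-accessing functions, then redeployed with sam deploy.
Final root cause (Audit Log entry 22): once inside the VPC, getAuthContext() called _fetch_jwks() against cognito-idp.us-east-1.amazonaws.com. The VPC has no NAT gateway and no Cognito Interface VPC endpoint, so the call hung until the 30s Lambda timeout, and the axios 10s client timeout fired first. API Gateway already validates the JWT via DefaultAuthorizer: CognitoAuth, so the in-Lambda verification was redundant.
Final fix (Audit Log entry 23): getAuthContext() in lambda_helpers.py now reads event.requestContext.authorizer.claims directly and falls back to _verify_jwt() only for direct invocations.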
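
The VpcConfig omission could in principle be caught before deploy. The sketch below is one possible check, not something that exists in the repo: it walks an already-parsed SAM template (a plain dict) and lists AWS::Serverless::Function resources without a VpcConfig property. The function name, the allowed_without_vpc parameter, and the trimmed example template are all hypothetical. Loading template.yaml itself needs a YAML loader that understands CloudFormation tags such as !Ref, which is outside this sketch.

```python
from typing import Any, Dict, Iterable, List


def functions_missing_vpc_config(
    template: Dict[str, Any],
    allowed_without_vpc: Iterable[str] = (),
) -> List[str]:
    """Return logical IDs of serverless functions that lack VpcConfig."""
    # If Globals.Function sets VpcConfig, every function inherits it.
    if "VpcConfig" in template.get("Globals", {}).get("Function", {}):
        return []
    allowed = set(allowed_without_vpc)
    missing = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::Serverless::Function":
            continue
        if logical_id in allowed:
            continue  # intentionally outside the VPC
        if "VpcConfig" not in resource.get("Properties", {}):
            missing.append(logical_id)
    return sorted(missing)


# Hypothetical, heavily trimmed template for illustration only.
example_template = {
    "Resources": {
        "MemoriesFeedFunction": {
            "Type": "AWS::Serverless::Function",
            "Properties": {"Handler": "app.handler"},
        },
        "ExampleDbFunction": {
            "Type": "AWS::Serverless::Function",
            "Properties": {
                "Handler": "app.handler",
                "VpcConfig": {
                    "SecurityGroupIds": ["sg-example"],
                    "SubnetIds": ["subnet-example"],
                },
            },
        },
        "ChatsListFunction": {
            "Type": "AWS::Serverless::Function",
            "Properties": {"Handler": "app.handler"},
        },
    }
}

print(functions_missing_vpc_config(example_template, ["ChatsListFunction"]))
```

Functions that are meant to stay outside the VPC (ChatsListFunction, per the audit log) go in the allow-list so the check only flags unexpected omissions.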
Audit Log
| ID | Action | Note | Context |
|---|---|---|---|
| 1 | Create audit log | Initialize investigation after SAM redeploy changed symptom from 403 → Network Error | issue created |
| 2 | Trace error path | "Network Error" = axios ERR_NETWORK. Response interceptor logs request_error {status: undefined} — no HTTP response received. Request takes ~10s before axios timeout fires | code trace |
| 3 | Check custom domain | aws apigateway get-base-path-mappings confirmed mapping is correct: restApiId=e7fi6pcrka, stage=Prod. No fix needed | verification |
| 4 | Check CloudWatch logs | Lambda logs show psycopg2.OperationalError: connection to server... Connection timed out on every invocation. DB host 172.31.0.82 is a private VPC IP unreachable without VPC membership | CloudWatch |
| 5 | Identify root cause | MemoriesFeedFunction has no VpcConfig in template.yaml. All other DB-accessing functions (lines 298, 323, 351, 375, 511, 547) have it. This was likely omitted when the function was first added | code trace |
| 6 | Apply fix | Added VpcConfig block to MemoriesFeedFunction in template.yaml matching other functions | template.yaml |
| 7 | Reopen 2026-04-20 | Same symptom recurs. Logs: request_token_fetch_succeeded then immediately request_error {status: undefined} for /memories/feed. React Query retries 3×, all fail. /chats/list succeeds in the same session — API Gateway is reachable. Failure is specific to MemoriesFeedFunction. VpcConfig IS present in template.yaml (confirmed at line 471). Root cause of prior fix no longer applies directly. | user logs + code trace |
| 8 | Pull CloudWatch logs | Every invocation: logs payload_parsed then hangs for exactly 30,000ms → Status: timeout. The hang begins after payload_parsed, meaning the first call in implementation() — _configure_database() — never returns. | CloudWatch /aws/lambda/server-MemoriesFeedFunction-CF18SJCxK9Cv |
| 9 | Confirm Lambda VpcConfig deployed | aws lambda get-function-configuration confirms VpcConfig is present on deployed function (6 subnets, sg-0ff10095a1eeebdae, vpc-04b993322e7d3ded0). DB_HOST is set. Not the same root cause as the 2026-04-18 fix. | AWS Lambda console |
| 10 | List VPC endpoints | aws ec2 describe-vpc-endpoints on vpc-04b993322e7d3ded0 returns ONLY 2 endpoints: s3 (Gateway) and dynamodb (Gateway). No SSM Interface endpoint. No NAT gateway in the VPC. _configure_database() calls get_db_user() + get_db_password() via SSM — those calls silently hang with no route out of the VPC. ec2 endpoint is defined in main.tf but was never applied (not in AWS). | aws ec2 describe-vpc-endpoints |
| 11 | Identify root cause | _configure_database() calls shared.ssm_secrets.get_db_user() and get_db_password() which call ssm.get_parameter(). Lambda is in a VPC with no SSM Interface endpoint and no NAT gateway — SSM calls drop silently at the subnet routing level, hanging until the 30s Lambda timeout. Fix: add aws_vpc_endpoint.ssm (Interface type) to main.tf, then terraform apply. | code trace + VPC endpoint audit |
| 12 | Apply fix | Added aws_vpc_endpoint.ssm block to main/devops/main.tf above the existing ec2 endpoint, using identical Interface endpoint pattern (same vpc_id, subnet_ids, security_group_ids, private_dns_enabled=true). | main/devops/main.tf |
| 13 | terraform apply | terraform apply -target=aws_vpc_endpoint.ssm -target=aws_vpc_endpoint.ec2 succeeded. SSM endpoint vpce-0a0faae4f3fbcdaa1 and EC2 endpoint vpce-0c929530db9bf752e created in vpc-04b993322e7d3ded0. Both available in all 6 Lambda subnets. private_dns_enabled=true means boto3 resolves ssm.us-east-1.amazonaws.com to the private endpoint automatically on next Lambda invocation — no Lambda redeploy needed. | AWS / Terraform |
| 14 | Reopen 2026-04-20 (second) | Same request_error {status: undefined} recurs after endpoint creation. Prior "fix" was incomplete — endpoint exists but Lambda still cannot reach it. | user logs |
| 15 | Identify real root cause | Interface VPC endpoints require their security group to allow inbound HTTPS (port 443) from the caller. Both aws_vpc_endpoint.ssm and aws_vpc_endpoint.ec2 use aws_security_group.lambda as their SG. That SG has no ingress rules — only all-egress. Lambda's TCP SYN to the endpoint private IP is dropped at the SG level, hanging until Lambda's 30 s timeout. Since axios client timeout (10 s) fires first, the app sees "Network Error" (status: undefined). Fix: add self = true ingress rule for HTTPS (443) to Lambda SG so Lambda-to-endpoint connections are accepted. | code trace + Terraform audit |
| 16 | Apply Terraform fix | Added ingress { from_port=443, to_port=443, protocol=tcp, self=true } to aws_security_group.lambda in main/devops/main.tf. self = true means any ENI in the Lambda SG (Lambda functions) can reach any other ENI in the same SG (the endpoint ENIs) on port 443. | main/devops/main.tf |
| 17 | Add SSM boto3 timeout (defence in depth) | Added BotocoreConfig(connect_timeout=5, read_timeout=10, retries={"max_attempts": 1}) to ssm_secrets.py and the raw SSM client in orm._get_db_credentials(). Ensures SSM calls fail fast (5 s) rather than hanging 30 s if the endpoint is ever unreachable again. | ssm_secrets.py, orm.py |
| 18 | Reopen 2026-04-20 (third) | Bug persists. All prior terraform apply runs used exported encache-mgmt env vars which overrode the provider's profile = "encache-workload" — endpoints were created in the wrong AWS account. aws ec2 describe-vpc-endpoints on encache-workload shows SSM endpoint absent. SG ingress rule IS present in encache-workload (added by a prior run that correctly used the workload profile). | aws ec2 describe-security-groups + aws ec2 describe-vpc-endpoints |
| 19 | Identify root cause | All previous Terraform runs that used exported env vars (AWS_ACCESS_KEY_ID etc.) overrode the provider's profile = "encache-workload", applying resources into whichever account the env var creds belonged to (encache-mgmt). The S3 state backend lives in encache-mgmt; the actual VPC resources live in encache-workload. Fix: terraform init -reconfigure -backend-config="profile=encache-mgmt" so the backend uses mgmt creds without env vars; then terraform apply with no env vars so the provider uses its configured encache-workload profile. | Terraform credential resolution audit |
| 20 | Apply fix | terraform init -reconfigure -backend-config="profile=encache-mgmt" then terraform apply -target=aws_vpc_endpoint.ssm -target=aws_security_group.lambda -auto-approve. SSM endpoint vpce-0143587961120c7c2 created in encache-workload (vpc-04b993322e7d3ded0), all 6 Lambda subnets, SG sg-0720d25e1e820a937, private_dns_enabled=true. | main/devops/main.tf + terraform apply |
| 21 | Re-examine CloudWatch after deploy | All invocations time out at exactly 30s with NO error log after payload_parsed. auth_context_resolved and handler_invoking never appear — hang is in getAuthContext(), not SSM. | CloudWatch |
| 22 | Identify real root cause | getAuthContext() calls _fetch_jwks() which calls urllib.request.urlopen("https://cognito-idp.us-east-1.amazonaws.com/...", timeout=5). The timeout=5 is a socket-level timeout; DNS resolution is a blocking syscall that ignores it. Lambda is in a VPC with no NAT gateway and no Cognito Interface VPC endpoint — DNS for cognito-idp.us-east-1.amazonaws.com hangs the full 30s Lambda timeout. Meanwhile axios 10s client timeout fires first → "Network Error". ChatsListFunction has no VpcConfig and works because it has internet access. API Gateway already validates the JWT via DefaultAuthorizer: CognitoAuth and injects claims into requestContext.authorizer.claims — the Lambda's _verify_jwt() was redundant and fatal inside the VPC. | CloudWatch + Lambda VpcConfig audit + template.yaml authorizer |
| 23 | Apply root cause fix | Added fast path to getAuthContext() in lambda_helpers.py: reads event.requestContext.authorizer.claims directly (injected by API Gateway's Cognito authorizer) instead of fetching JWKS. Falls back to _verify_jwt() only for direct invocations. Built and deployed via sam build --no-use-container && sam deploy --profile encache-workload. | lambda_helpers.py |
| 24 | Verify fix | App logs: feed_query_succeeded {"cursor": null, "has_next_cursor": false, "memory_count": 0}. Memory feed loads successfully. | live app test |
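
The fast path from entry 23 can be sketched roughly as follows. This is an illustration under assumptions, not the actual lambda_helpers.py code: the snake_case name get_auth_context, the injected verify_jwt callable, and the Authorization header handling are hypothetical. The claims location (requestContext.authorizer.claims) is the one named in entries 22 and 23.

```python
from typing import Any, Callable, Dict


def get_auth_context(
    event: Dict[str, Any],
    verify_jwt: Callable[[str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Prefer claims injected by the API Gateway Cognito authorizer."""
    request_context = event.get("requestContext") or {}
    authorizer = request_context.get("authorizer") or {}
    claims = authorizer.get("claims")
    if claims:
        # Fast path: API Gateway already validated the token,
        # so no JWKS fetch (and no network call) is needed.
        return dict(claims)

    # Fallback for direct invocations: verify the bearer token ourselves.
    headers = event.get("headers") or {}
    token = headers.get("Authorization", "")
    if token.startswith("Bearer "):
        token = token[len("Bearer "):]
    token = token.strip()
    if not token:
        raise PermissionError("no authorizer claims and no bearer token")
    return verify_jwt(token)
```

Passing the verifier in as a parameter keeps the network-dependent path out of the common case and makes both branches testable without AWS.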
Verification
- [x] Reproduced failure before fix
- [ ] Reproduction test fails before fix
- [x] Root cause identified with evidence (Lambda in VPC can't resolve Cognito DNS; API Gateway claims already present)
- [x] Fix applied at source (no workaround-only patch)
- [x] Reproduction path now passes: feed_query_succeeded observed in app logs
- [ ] Regression test added/updated (N/A: requires live AWS + VPC environment)
- [x] Verified no duplicate solved-bug log exists for same root cause
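
Audit Log entry 22 notes that the socket-level timeout=5 did not bound the blocking call, so the handler hung for the full 30s. One generic way to stop waiting on such a call is to run it in a worker thread and give up after a fixed deadline. The sketch below is an assumption-labeled illustration, not code from the repo; call_with_deadline is a hypothetical helper name.

```python
import concurrent.futures
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_deadline(fn: Callable[[], T], seconds: float) -> T:
    """Run fn in a worker thread and stop waiting after `seconds`."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=seconds)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"call exceeded {seconds}s deadline") from None
    finally:
        # Do not block on a stuck worker. The thread is NOT killed;
        # it keeps running in the background until fn returns.
        executor.shutdown(wait=False)
```

This only bounds how long the caller waits, so the handler can log an explicit error instead of silently hitting the Lambda timeout. It does not cancel the underlying call.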