
Memory Feed "Network Error" After SAM Redeploy

Metadata

  • Date: 2026-04-18
  • Status: resolved (2026-04-21)
  • Severity: critical
  • Related issue/ticket: N/A
  • Owner: N/A

About

Overview:

  • After redeploying the SAM stack (sam build --no-use-container && sam deploy), the memory feed returns feed_loading_failed {"error": "Network Error"}.
  • "Network Error" is an axios ERR_NETWORK: the HTTP request is dispatched but no response is received. This is distinct from a 403 (server rejects) or an auth timeout (interceptor throws before dispatch).
  • Related prior bug: 2026-04-18-persistent-403-after-login-android.md (status: root-cause-identified). That bug's prescribed fix was a SAM redeploy. After the redeploy, the symptom changed from 403 → Network Error.

Technical Questions:

  • Is api.encache.ai still reachable after the SAM redeploy? The custom domain is manually managed (outside CloudFormation; see template.yaml:597). Does it still map to the correct API e7fi6pcrka and stage Prod?
  • Did the SAM redeploy recreate or modify the API Gateway in a way that broke the manually managed base path mapping?
  • Is the device/emulator on a network that can reach api.encache.ai? (Previously, curl from the dev machine returned 403, which is expected for unauthenticated requests.)
  • Is the request interceptor timing out via withAuthTimeout, with the error message being swallowed somewhere?

Resources:

  • main/app/lib/api/getApi.ts: withAuthTimeout, request/response interceptors
  • main/app/lib/api/memory/useMemoryApi.ts: useMemoriesFeed queryFn, feed_loading_failed log
  • main/app/app/index.tsx:107: feed_loading_failed useEffect
  • main/server/template.yaml:597: note that the custom domain is managed outside CloudFormation
  • AWS Console: API Gateway → Custom Domain Names → api.encache.ai → Mappings

Steps to cause failure

flowchart LR
  AppStart --> FeedQuery["useMemoriesFeed fires"]
  FeedQuery --> AxiosPost["axios.post /memories/feed"]
  AxiosPost --> NetworkFail["No response received\n(ERR_NETWORK)"]
  NetworkFail --> ResponseInterceptor["Response interceptor:\nstatus=undefined → request_error"]
  ResponseInterceptor --> QueryRetry["React Query retries 3x"]
  QueryRetry --> IsError["isError=true"]
  IsError --> FeedLoadingFailed["feed_loading_failed\n{error: 'Network Error'}"]

System

flowchart TD
  MobileApp -->|EXPO_PUBLIC_API_BASE_URL=https://api.encache.ai| CustomDomain["api.encache.ai\n→ d-iitdlnbcj3.execute-api.us-east-1.amazonaws.com"]
  CustomDomain -->|BasePathMapping\n(manually managed)| APIGW["API Gateway e7fi6pcrka\nStage: Prod"]
  APIGW --> Lambda["Lambda functions"]

Reproduction Details

  1. Deploy SAM stack (sam build --no-use-container && sam deploy).
  2. Open the app on a device.
  3. Observe feed_loading_failed {"error": "Network Error"} in Metro/device logs.

Reproduction test: N/A — requires live AWS + device.
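The end-to-end reproduction needs live AWS and a device, but the timing mechanism behind the symptom (a backend that answers later than the client is willing to wait, so the client never sees a status code) can be sketched in isolation. The following is a minimal simulation, not a substitute for the live repro. The function names slow_backend and client_request are hypothetical, and the delays are scaled down 100x from the 30 s Lambda timeout and 10 s axios timeout recorded in the audit log.

```python
import asyncio
from typing import Optional


async def slow_backend(latency_s: float) -> int:
    """Hypothetical stand-in for the Lambda: returns HTTP 200 after latency_s seconds."""
    await asyncio.sleep(latency_s)
    return 200


async def client_request(backend_latency_s: float, client_timeout_s: float) -> Optional[int]:
    """Return the HTTP status, or None when the client gives up before any response."""
    try:
        return await asyncio.wait_for(
            slow_backend(backend_latency_s), timeout=client_timeout_s
        )
    except asyncio.TimeoutError:
        # No response object exists here, which mirrors request_error {status: undefined}.
        return None


if __name__ == "__main__":
    # Backend slower than the client timeout: no status is ever observed.
    print(asyncio.run(client_request(0.30, 0.10)))  # prints None
    # Backend faster than the client timeout: the status comes through.
    print(asyncio.run(client_request(0.01, 0.10)))  # prints 200
```

The point of the sketch is that a missing status code does not by itself distinguish "host unreachable" from "host reachable but too slow", which is why the server-side logs were needed to narrow the cause.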

Diagnostic Steps

Run these to identify root cause:

# 1. Verify api.encache.ai is still reachable and returns 403 (not connection error)
curl -v https://api.encache.ai/memories/feed -X POST 2>&1 | grep -E "< HTTP|SSL|Connected|curl:"

# 2. Check what API + stage the custom domain currently maps to
aws apigateway get-base-path-mappings --domain-name api.encache.ai

# 3. Confirm the deployed API ID and stage match
# Expected: restApiId=e7fi6pcrka, stage=Prod, basePath=(none)

If get-base-path-mappings shows a different restApiId than e7fi6pcrka, the mapping needs to be updated:

aws apigateway update-base-path-mapping \
  --domain-name api.encache.ai \
  --base-path "(none)" \
  --patch-operations op=replace,path=/restapiId,value=e7fi6pcrka
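Step 3 above (comparing the mapping against the expected API ID and stage) can also be scripted. The sketch below assumes the JSON shape returned by get-base-path-mappings, a top-level "items" list whose entries carry basePath, restApiId and stage. The helper name find_mapping_problems is hypothetical. The dictionary could come from the CLI output or from boto3's apigateway client (get_base_path_mappings with domainName), but the check itself is pure Python so it runs without AWS credentials.

```python
from typing import Any, Dict, List


def find_mapping_problems(
    response: Dict[str, Any],
    expected_api_id: str,
    expected_stage: str,
    expected_base_path: str = "(none)",
) -> List[str]:
    """Return human-readable mismatches; an empty list means the mapping looks right."""
    for item in response.get("items", []):
        if item.get("basePath") != expected_base_path:
            continue
        problems = []
        if item.get("restApiId") != expected_api_id:
            problems.append(
                f"restApiId is {item.get('restApiId')}, expected {expected_api_id}"
            )
        if item.get("stage") != expected_stage:
            problems.append(f"stage is {item.get('stage')}, expected {expected_stage}")
        return problems
    return [f"no mapping found for base path {expected_base_path}"]


if __name__ == "__main__":
    # Sample response shaped like the CLI output; values taken from this log.
    sample = {"items": [{"basePath": "(none)", "restApiId": "e7fi6pcrka", "stage": "Prod"}]}
    print(find_mapping_problems(sample, "e7fi6pcrka", "Prod"))  # prints []
```

An empty result corresponds to the "No fix needed" outcome recorded in audit log entry 3.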

Notes for PR

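Initial diagnosis (audit log entries 1-6, later found incomplete): MemoriesFeedFunction in template.yaml was missing VpcConfig. Every other DB-accessing Lambda had VpcConfig with LambdaSecurityGroupId and LambdaSubnetIds, but the memories feed function did not. Without VPC membership, the Lambda cannot reach the private RDS instance at 172.31.0.82:5432. Each invocation took ~15s to time out (DB connection timeout), which exceeded the axios 10s client timeout, and React Native surfaced this as "Network Error".

Initial fix: added VpcConfig to MemoriesFeedFunction in template.yaml matching the other DB-accessing functions, then redeployed with sam deploy.

Intermediate fixes (entries 10-20): added an SSM Interface VPC endpoint, a port-443 self-ingress rule on the Lambda security group, and boto3 timeouts for SSM calls, and re-ran Terraform against the correct account (encache-workload).

Final root cause (entries 21-22): once inside the VPC, getAuthContext() hung while fetching the Cognito JWKS, because the VPC has no NAT gateway and no Cognito Interface VPC endpoint. The Lambda ran to its 30s timeout while the axios 10s client timeout fired first, producing the same "Network Error".

Final fix (entry 23): getAuthContext() in lambda_helpers.py now reads the claims that API Gateway's Cognito authorizer injects into requestContext.authorizer.claims, and falls back to _verify_jwt() only for direct invocations.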
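The missing-VpcConfig finding was made by reading template.yaml by hand. A check along the following lines could flag the same omission automatically. This is a sketch under stated assumptions: it works on an already-parsed template dictionary (loading SAM YAML with intrinsic tags such as !Ref needs a CloudFormation-aware loader, which is out of scope here), it treats a DB_HOST environment variable as the marker of a DB-accessing function, and it ignores any VpcConfig set under Globals. The names functions_missing_vpc_config and uses_db_host, and the OtherDbFunction resource, are hypothetical.

```python
from typing import Any, Callable, Dict, List

Props = Dict[str, Any]


def uses_db_host(name: str, props: Props) -> bool:
    """Assumed heuristic: a function with DB_HOST in its environment talks to the DB."""
    variables = (props.get("Environment") or {}).get("Variables") or {}
    return "DB_HOST" in variables


def functions_missing_vpc_config(
    template: Dict[str, Any],
    needs_vpc: Callable[[str, Props], bool] = uses_db_host,
) -> List[str]:
    """List serverless functions that need VPC access but declare no VpcConfig."""
    missing = []
    for name, resource in (template.get("Resources") or {}).items():
        if resource.get("Type") != "AWS::Serverless::Function":
            continue
        props = resource.get("Properties") or {}
        if needs_vpc(name, props) and "VpcConfig" not in props:
            missing.append(name)
    return sorted(missing)


if __name__ == "__main__":
    # Illustrative template fragment, not the real template.yaml.
    sample = {
        "Resources": {
            "MemoriesFeedFunction": {
                "Type": "AWS::Serverless::Function",
                "Properties": {"Environment": {"Variables": {"DB_HOST": "172.31.0.82"}}},
            },
            "OtherDbFunction": {
                "Type": "AWS::Serverless::Function",
                "Properties": {
                    "Environment": {"Variables": {"DB_HOST": "172.31.0.82"}},
                    "VpcConfig": {"SecurityGroupIds": [], "SubnetIds": []},
                },
            },
        }
    }
    print(functions_missing_vpc_config(sample))  # prints ['MemoriesFeedFunction']
```

A check like this only catches the template-level omission. It says nothing about whether the VPC itself has a route to the AWS services the function calls.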

Audit Log

| ID | Action | Note | Context |
|----|--------|------|---------|
| 1 | Create audit log | Initialize investigation after SAM redeploy changed symptom from 403 → Network Error | issue created |
| 2 | Trace error path | "Network Error" = axios ERR_NETWORK. Response interceptor logs request_error {status: undefined}: no HTTP response received. Request takes ~10s before axios timeout fires | code trace |
| 3 | Check custom domain | aws apigateway get-base-path-mappings confirmed mapping is correct: restApiId=e7fi6pcrka, stage=Prod. No fix needed | verification |
| 4 | Check CloudWatch logs | Lambda logs show psycopg2.OperationalError: connection to server... Connection timed out on every invocation. DB host 172.31.0.82 is a private VPC IP unreachable without VPC membership | CloudWatch |
| 5 | Identify root cause | MemoriesFeedFunction has no VpcConfig in template.yaml. All other DB-accessing functions (lines 298, 323, 351, 375, 511, 547) have it. This was likely omitted when the function was first added | code trace |
| 6 | Apply fix | Added VpcConfig block to MemoriesFeedFunction in template.yaml matching other functions | template.yaml |
| 7 | Reopen 2026-04-20 | Same symptom recurs. Logs: request_token_fetch_succeeded then immediately request_error {status: undefined} for /memories/feed. React Query retries 3×, all fail. /chats/list succeeds in the same session, so API Gateway is reachable. Failure is specific to MemoriesFeedFunction. VpcConfig IS present in template.yaml (confirmed at line 471). Root cause of prior fix no longer applies directly. | user logs + code trace |
| 8 | Pull CloudWatch logs | Every invocation: logs payload_parsed then hangs for exactly 30,000ms → Status: timeout. The hang begins after payload_parsed, meaning the first call in implementation(), _configure_database(), never returns. | CloudWatch /aws/lambda/server-MemoriesFeedFunction-CF18SJCxK9Cv |
| 9 | Confirm Lambda VpcConfig deployed | aws lambda get-function-configuration confirms VpcConfig is present on deployed function (6 subnets, sg-0ff10095a1eeebdae, vpc-04b993322e7d3ded0). DB_HOST is set. Not the same root cause as the 2026-04-18 fix. | AWS Lambda console |
| 10 | List VPC endpoints | aws ec2 describe-vpc-endpoints on vpc-04b993322e7d3ded0 returns ONLY 2 endpoints: s3 (Gateway) and dynamodb (Gateway). No SSM Interface endpoint. No NAT gateway in the VPC. _configure_database() calls get_db_user() + get_db_password() via SSM, and those calls silently hang with no route out of the VPC. ec2 endpoint is defined in main.tf but was never applied (not in AWS). | aws ec2 describe-vpc-endpoints |
| 11 | Identify root cause | _configure_database() calls shared.ssm_secrets.get_db_user() and get_db_password() which call ssm.get_parameter(). Lambda is in a VPC with no SSM Interface endpoint and no NAT gateway; SSM calls drop silently at the subnet routing level, hanging until the 30s Lambda timeout. Fix: add aws_vpc_endpoint.ssm (Interface type) to main.tf, then terraform apply. | code trace + VPC endpoint audit |
| 12 | Apply fix | Added aws_vpc_endpoint.ssm block to main/devops/main.tf above the existing ec2 endpoint, using identical Interface endpoint pattern (same vpc_id, subnet_ids, security_group_ids, private_dns_enabled=true). | main/devops/main.tf |
| 13 | terraform apply | terraform apply -target=aws_vpc_endpoint.ssm -target=aws_vpc_endpoint.ec2 succeeded. SSM endpoint vpce-0a0faae4f3fbcdaa1 and EC2 endpoint vpce-0c929530db9bf752e created in vpc-04b993322e7d3ded0. Both available in all 6 Lambda subnets. private_dns_enabled=true means boto3 resolves ssm.us-east-1.amazonaws.com to the private endpoint automatically on next Lambda invocation, so no Lambda redeploy is needed. | AWS / Terraform |
| 14 | Reopen 2026-04-20 (second) | Same request_error {status: undefined} recurs after endpoint creation. Prior "fix" was incomplete: endpoint exists but Lambda still cannot reach it. | user logs |
| 15 | Identify real root cause | Interface VPC endpoints require their security group to allow inbound HTTPS (port 443) from the caller. Both aws_vpc_endpoint.ssm and aws_vpc_endpoint.ec2 use aws_security_group.lambda as their SG. That SG has no ingress rules, only all-egress. Lambda's TCP SYN to the endpoint private IP is dropped at the SG level, hanging until Lambda's 30 s timeout. Since axios client timeout (10 s) fires first, the app sees "Network Error" (status: undefined). Fix: add self = true ingress rule for HTTPS (443) to Lambda SG so Lambda-to-endpoint connections are accepted. | code trace + Terraform audit |
| 16 | Apply Terraform fix | Added ingress { from_port=443, to_port=443, protocol=tcp, self=true } to aws_security_group.lambda in main/devops/main.tf. self = true means any ENI in the Lambda SG (Lambda functions) can reach any other ENI in the same SG (the endpoint ENIs) on port 443. | main/devops/main.tf |
| 17 | Add SSM boto3 timeout (defence in depth) | Added BotocoreConfig(connect_timeout=5, read_timeout=10, retries={"max_attempts": 1}) to ssm_secrets.py and the raw SSM client in orm._get_db_credentials(). Ensures SSM calls fail fast (5 s) rather than hanging 30 s if the endpoint is ever unreachable again. | ssm_secrets.py, orm.py |
| 18 | Reopen 2026-04-20 (third) | Bug persists. All prior terraform apply runs used exported encache-mgmt env vars which overrode the provider's profile = "encache-workload", so endpoints were created in the wrong AWS account. aws ec2 describe-vpc-endpoints on encache-workload shows SSM endpoint absent. SG ingress rule IS present in encache-workload (added by a prior run that correctly used the workload profile). | aws ec2 describe-security-groups + aws ec2 describe-vpc-endpoints |
| 19 | Identify root cause | All previous Terraform runs that used exported env vars (AWS_ACCESS_KEY_ID etc.) overrode the provider's profile = "encache-workload", applying resources into whichever account the env var creds belonged to (encache-mgmt). The S3 state backend lives in encache-mgmt; the actual VPC resources live in encache-workload. Fix: terraform init -reconfigure -backend-config="profile=encache-mgmt" so the backend uses mgmt creds without env vars; then terraform apply with no env vars so the provider uses its configured encache-workload profile. | Terraform credential resolution audit |
| 20 | Apply fix | terraform init -reconfigure -backend-config="profile=encache-mgmt" then terraform apply -target=aws_vpc_endpoint.ssm -target=aws_security_group.lambda -auto-approve. SSM endpoint vpce-0143587961120c7c2 created in encache-workload (vpc-04b993322e7d3ded0), all 6 Lambda subnets, SG sg-0720d25e1e820a937, private_dns_enabled=true. | main/devops/main.tf + terraform apply |
| 21 | Re-examine CloudWatch after deploy | All invocations time out at exactly 30s with NO error log after payload_parsed. auth_context_resolved and handler_invoking never appear: the hang is in getAuthContext(), not SSM. | CloudWatch |
| 22 | Identify real root cause | getAuthContext() calls _fetch_jwks() which calls urllib.request.urlopen("https://cognito-idp.us-east-1.amazonaws.com/...", timeout=5). The timeout=5 is a socket-level timeout; DNS resolution is a blocking syscall that ignores it. Lambda is in a VPC with no NAT gateway and no Cognito Interface VPC endpoint, so DNS for cognito-idp.us-east-1.amazonaws.com hangs the full 30s Lambda timeout. Meanwhile axios 10s client timeout fires first → "Network Error". ChatsListFunction has no VpcConfig and works because it has internet access. API Gateway already validates the JWT via DefaultAuthorizer: CognitoAuth and injects claims into requestContext.authorizer.claims; the Lambda's _verify_jwt() was redundant and fatal inside the VPC. | CloudWatch + Lambda VpcConfig audit + template.yaml authorizer |
| 23 | Apply root cause fix | Added fast path to getAuthContext() in lambda_helpers.py: reads event.requestContext.authorizer.claims directly (injected by API Gateway's Cognito authorizer) instead of fetching JWKS. Falls back to _verify_jwt() only for direct invocations. Built and deployed via sam build --no-use-container && sam deploy --profile encache-workload. | lambda_helpers.py |
| 24 | Verify fix | App logs: feed_query_succeeded {"cursor": null, "has_next_cursor": false, "memory_count": 0}. Memory feed loads successfully. | live app test |
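The fast path described in entry 23 can be sketched roughly as follows. This is not the code in lambda_helpers.py: the function name get_auth_claims is hypothetical, the verify_jwt parameter stands in for the existing _verify_jwt(), and the handling of a "Bearer " prefix on the Authorization header is an assumption about how direct invocations pass the token. What the sketch relies on from the log is only that API Gateway's Cognito authorizer places the validated claims at requestContext.authorizer.claims.

```python
from typing import Any, Callable, Dict, Optional

Claims = Dict[str, Any]


def get_auth_claims(
    event: Dict[str, Any],
    verify_jwt: Callable[[str], Claims],
) -> Optional[Claims]:
    """Prefer claims injected by the API Gateway authorizer; verify locally otherwise."""
    request_context = event.get("requestContext") or {}
    authorizer = request_context.get("authorizer") or {}
    claims = authorizer.get("claims")
    if claims:
        # Fast path: the token was already validated upstream, so no JWKS
        # fetch (and no outbound network call from inside the VPC) is needed.
        return dict(claims)

    # Fallback for direct invocations, where no authorizer ran.
    headers = event.get("headers") or {}
    auth_header = headers.get("Authorization") or headers.get("authorization") or ""
    if not auth_header.startswith("Bearer "):  # assumed header format
        return None
    return verify_jwt(auth_header[len("Bearer "):])


if __name__ == "__main__":
    def never_called(token: str) -> Claims:
        raise RuntimeError("JWKS verification should not run on the fast path")

    # Event shaped like an API Gateway proxy event with authorizer claims.
    event = {"requestContext": {"authorizer": {"claims": {"sub": "user-1"}}}}
    print(get_auth_claims(event, never_called))  # prints {'sub': 'user-1'}
```

Passing the verifier in as a parameter keeps the fast path testable without network access, which may help with the regression test that the Verification checklist currently marks as N/A.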

Verification

  • [x] Reproduced failure before fix
  • [ ] Reproduction test fails before fix
  • [x] Root cause identified with evidence (Lambda in VPC can't resolve Cognito DNS; API Gateway claims already present)
  • [x] Fix applied at source (no workaround-only patch)
  • [x] Reproduction path now passes — feed_query_succeeded observed in app logs
  • [ ] Regression test added/updated (N/A — requires live AWS + VPC environment)
  • [x] Verified no duplicate solved-bug log exists for same root cause