Memory Feed "Network Error" After SAM Redeploy

Metadata

Date: 2026-04-18
Status: resolved (2026-04-21)
Severity: critical
Related issue/ticket: N/A
Owner: N/A

About

Overview: - After redeploying the SAM stack (sam build --no-use-container && sam deploy), the memory feed returns feed_loading_failed {"error": "Network Error"}. - "Network Error" is an axios ERR_NETWORK — the HTTP request is dispatched but no response is received. This is distinct from a 403 (server rejects) or an auth timeout (interceptor throws before dispatch). - Related prior bug: 2026-04-18-persistent-403-after-login-android.md (status: root-cause-identified). That bug's prescribed fix was SAM redeploy. After the redeploy, the symptom changed from 403 → Network Error.

Technical Questions: - Is api.encache.ai still reachable after the SAM redeploy? The custom domain is manually managed (outside CloudFormation — see template.yaml:597). Does it still map to the correct API e7fi6pcrka and stage Prod? - Did the SAM redeploy recreate or modify the API Gateway in a way that broke the manually-managed base path mapping? - Is the device/emulator on a network that can reach api.encache.ai? (Previously curl from dev machine returned 403, which is expected for unauthenticated requests.) - Is the request interceptor timing out via withAuthTimeout and the error message being swallowed somewhere?

Resources: - main/app/lib/api/getApi.ts — withAuthTimeout, request/response interceptors - main/app/lib/api/memory/useMemoryApi.ts — useMemoriesFeed queryFn, feed_loading_failed log - main/app/app/index.tsx:107 — feed_loading_failed useEffect - main/server/template.yaml:597 — note that custom domain is managed outside CloudFormation - AWS Console: API Gateway → Custom Domain Names → api.encache.ai → Mappings

Steps to cause failure

flowchart LR
  AppStart --> FeedQuery["useMemoriesFeed fires"]
  FeedQuery --> AxiosPost["axios.post /memories/feed"]
  AxiosPost --> NetworkFail["No response received\n(ERR_NETWORK)"]
  NetworkFail --> ResponseInterceptor["Response interceptor:\nstatus=undefined → request_error"]
  ResponseInterceptor --> QueryRetry["React Query retries 3x"]
  QueryRetry --> IsError["isError=true"]
  IsError --> FeedLoadingFailed["feed_loading_failed\n{error: 'Network Error'}"]

System

flowchart TD
  MobileApp -->|EXPO_PUBLIC_API_BASE_URL=https://api.encache.ai| CustomDomain["api.encache.ai\n→ d-iitdlnbcj3.execute-api.us-east-1.amazonaws.com"]
  CustomDomain -->|BasePathMapping\n(manually managed)| APIGW["API Gateway e7fi6pcrka\nStage: Prod"]
  APIGW --> Lambda["Lambda functions"]

Reproduction Details

Deploy SAM stack (sam build --no-use-container && sam deploy).
Open the app on a device.
Observe feed_loading_failed {"error": "Network Error"} in Metro/device logs.

Reproduction test: N/A — requires live AWS + device.

Diagnostic Steps

Run these to identify root cause:

# 1. Verify api.encache.ai is still reachable and returns 403 (not connection error)
curl -v https://api.encache.ai/memories/feed -X POST 2>&1 | grep -E "< HTTP|SSL|Connected|curl:"

# 2. Check what API + stage the custom domain currently maps to
aws apigateway get-base-path-mappings --domain-name api.encache.ai

# 3. Confirm the deployed API ID and stage match
# Expected: restApiId=e7fi6pcrka, stage=Prod, basePath=(none)

If get-base-path-mappings shows a different restApiId than e7fi6pcrka, the mapping needs to be updated:

aws apigateway update-base-path-mapping \
  --domain-name api.encache.ai \
  --base-path "(none)" \
  --patch-operations op=replace,path=/restapiId,value=e7fi6pcrka

Notes for PR

Root cause: MemoriesFeedFunction in template.yaml was missing VpcConfig. Every other DB-accessing Lambda had VpcConfig with LambdaSecurityGroupId and LambdaSubnetIds, but the memories feed function did not. Without VPC membership, the Lambda cannot reach the private RDS instance at 172.31.0.82:5432. Each invocation took ~15s to time out (DB connection timeout), which exceeded the axios 10s client timeout — React Native surfaced this as "Network Error".

Fix: added VpcConfig to MemoriesFeedFunction in template.yaml matching the other DB-accessing functions, then redeployed with sam deploy.

Audit Log

ID	Action	Note	Context
1	Create audit log	Initialize investigation after SAM redeploy changed symptom from 403 → Network Error	issue created
2	Trace error path	"Network Error" = axios ERR_NETWORK. Response interceptor logs `request_error {status: undefined}` — no HTTP response received. Request takes ~10s before axios timeout fires	code trace
3	Check custom domain	`aws apigateway get-base-path-mappings` confirmed mapping is correct: `restApiId=e7fi6pcrka, stage=Prod`. No fix needed	verification
4	Check CloudWatch logs	Lambda logs show `psycopg2.OperationalError: connection to server... Connection timed out` on every invocation. DB host `172.31.0.82` is a private VPC IP unreachable without VPC membership	CloudWatch
5	Identify root cause	`MemoriesFeedFunction` has no `VpcConfig` in `template.yaml`. All other DB-accessing functions (lines 298, 323, 351, 375, 511, 547) have it. This was likely omitted when the function was first added	code trace
6	Apply fix	Added `VpcConfig` block to `MemoriesFeedFunction` in `template.yaml` matching other functions	template.yaml
7	Reopen 2026-04-20	Same symptom recurs. Logs: `request_token_fetch_succeeded` then immediately `request_error {status: undefined}` for `/memories/feed`. React Query retries 3×, all fail. `/chats/list` succeeds in the same session — API Gateway is reachable. Failure is specific to `MemoriesFeedFunction`. `VpcConfig` IS present in `template.yaml` (confirmed at line 471). Root cause of prior fix no longer applies directly.	user logs + code trace
8	Pull CloudWatch logs	Every invocation: logs `payload_parsed` then hangs for exactly 30,000ms → `Status: timeout`. The hang begins after `payload_parsed`, meaning the first call in `implementation()` — `_configure_database()` — never returns.	CloudWatch /aws/lambda/server-MemoriesFeedFunction-CF18SJCxK9Cv
9	Confirm Lambda VpcConfig deployed	`aws lambda get-function-configuration` confirms VpcConfig is present on deployed function (6 subnets, sg-0ff10095a1eeebdae, vpc-04b993322e7d3ded0). DB_HOST is set. Not the same root cause as the 2026-04-18 fix.	AWS Lambda console
10	List VPC endpoints	`aws ec2 describe-vpc-endpoints` on vpc-04b993322e7d3ded0 returns ONLY 2 endpoints: `s3` (Gateway) and `dynamodb` (Gateway). No SSM Interface endpoint. No NAT gateway in the VPC. `_configure_database()` calls `get_db_user()` + `get_db_password()` via SSM — those calls silently hang with no route out of the VPC. `ec2` endpoint is defined in `main.tf` but was never applied (not in AWS).	`aws ec2 describe-vpc-endpoints`
11	Identify root cause	`_configure_database()` calls `shared.ssm_secrets.get_db_user()` and `get_db_password()` which call `ssm.get_parameter()`. Lambda is in a VPC with no SSM Interface endpoint and no NAT gateway — SSM calls drop silently at the subnet routing level, hanging until the 30s Lambda timeout. Fix: add `aws_vpc_endpoint.ssm` (Interface type) to `main.tf`, then `terraform apply`.	code trace + VPC endpoint audit
12	Apply fix	Added `aws_vpc_endpoint.ssm` block to `main/devops/main.tf` above the existing `ec2` endpoint, using identical Interface endpoint pattern (same vpc_id, subnet_ids, security_group_ids, private_dns_enabled=true).	main/devops/main.tf
13	terraform apply	`terraform apply -target=aws_vpc_endpoint.ssm -target=aws_vpc_endpoint.ec2` succeeded. SSM endpoint `vpce-0a0faae4f3fbcdaa1` and EC2 endpoint `vpce-0c929530db9bf752e` created in vpc-04b993322e7d3ded0. Both available in all 6 Lambda subnets. `private_dns_enabled=true` means boto3 resolves `ssm.us-east-1.amazonaws.com` to the private endpoint automatically on next Lambda invocation — no Lambda redeploy needed.	AWS / Terraform
14	Reopen 2026-04-20 (second)	Same `request_error {status: undefined}` recurs after endpoint creation. Prior "fix" was incomplete — endpoint exists but Lambda still cannot reach it.	user logs
15	Identify real root cause	Interface VPC endpoints require their security group to allow inbound HTTPS (port 443) from the caller. Both `aws_vpc_endpoint.ssm` and `aws_vpc_endpoint.ec2` use `aws_security_group.lambda` as their SG. That SG has no ingress rules — only all-egress. Lambda's TCP SYN to the endpoint private IP is dropped at the SG level, hanging until Lambda's 30 s timeout. Since axios client timeout (10 s) fires first, the app sees "Network Error" (`status: undefined`). Fix: add `self = true` ingress rule for HTTPS (443) to Lambda SG so Lambda-to-endpoint connections are accepted.	code trace + Terraform audit
16	Apply Terraform fix	Added `ingress { from_port=443, to_port=443, protocol=tcp, self=true }` to `aws_security_group.lambda` in `main/devops/main.tf`. `self = true` means any ENI in the Lambda SG (Lambda functions) can reach any other ENI in the same SG (the endpoint ENIs) on port 443.	main/devops/main.tf
17	Add SSM boto3 timeout (defence in depth)	Added `BotocoreConfig(connect_timeout=5, read_timeout=10, retries={"max_attempts": 1})` to `ssm_secrets.py` and the raw SSM client in `orm._get_db_credentials()`. Ensures SSM calls fail fast (5 s) rather than hanging 30 s if the endpoint is ever unreachable again.	ssm_secrets.py, orm.py
18	Reopen 2026-04-20 (third)	Bug persists. All prior `terraform apply` runs used exported `encache-mgmt` env vars which overrode the provider's `profile = "encache-workload"` — endpoints were created in the wrong AWS account. `aws ec2 describe-vpc-endpoints` on `encache-workload` shows SSM endpoint absent. SG ingress rule IS present in `encache-workload` (added by a prior run that correctly used the workload profile).	`aws ec2 describe-security-groups` + `aws ec2 describe-vpc-endpoints`
19	Identify root cause	All previous Terraform runs that used exported env vars (AWS_ACCESS_KEY_ID etc.) overrode the provider's `profile = "encache-workload"`, applying resources into whichever account the env var creds belonged to (`encache-mgmt`). The S3 state backend lives in `encache-mgmt`; the actual VPC resources live in `encache-workload`. Fix: `terraform init -reconfigure -backend-config="profile=encache-mgmt"` so the backend uses mgmt creds without env vars; then `terraform apply` with no env vars so the provider uses its configured `encache-workload` profile.	Terraform credential resolution audit
20	Apply fix	`terraform init -reconfigure -backend-config="profile=encache-mgmt"` then `terraform apply -target=aws_vpc_endpoint.ssm -target=aws_security_group.lambda -auto-approve`. SSM endpoint `vpce-0143587961120c7c2` created in `encache-workload` (vpc-04b993322e7d3ded0), all 6 Lambda subnets, SG `sg-0720d25e1e820a937`, `private_dns_enabled=true`.	main/devops/main.tf + terraform apply
21	Re-examine CloudWatch after deploy	All invocations time out at exactly 30s with NO error log after `payload_parsed`. `auth_context_resolved` and `handler_invoking` never appear — hang is in `getAuthContext()`, not SSM.	CloudWatch
22	Identify real root cause	`getAuthContext()` calls `_fetch_jwks()` which calls `urllib.request.urlopen("https://cognito-idp.us-east-1.amazonaws.com/...", timeout=5)`. The `timeout=5` is a socket-level timeout; DNS resolution is a blocking syscall that ignores it. Lambda is in a VPC with no NAT gateway and no Cognito Interface VPC endpoint — DNS for `cognito-idp.us-east-1.amazonaws.com` hangs the full 30s Lambda timeout. Meanwhile axios 10s client timeout fires first → "Network Error". `ChatsListFunction` has no VpcConfig and works because it has internet access. API Gateway already validates the JWT via `DefaultAuthorizer: CognitoAuth` and injects claims into `requestContext.authorizer.claims` — the Lambda's `_verify_jwt()` was redundant and fatal inside the VPC.	CloudWatch + Lambda VpcConfig audit + template.yaml authorizer
23	Apply root cause fix	Added fast path to `getAuthContext()` in `lambda_helpers.py`: reads `event.requestContext.authorizer.claims` directly (injected by API Gateway's Cognito authorizer) instead of fetching JWKS. Falls back to `_verify_jwt()` only for direct invocations. Built and deployed via `sam build --no-use-container && sam deploy --profile encache-workload`.	lambda_helpers.py
24	Verify fix	App logs: `feed_query_succeeded {"cursor": null, "has_next_cursor": false, "memory_count": 0}`. Memory feed loads successfully.	live app test

Verification

[x] Reproduced failure before fix
[ ] Reproduction test fails before fix
[x] Root cause identified with evidence (Lambda in VPC can't resolve Cognito DNS; API Gateway claims already present)
[x] Fix applied at source (no workaround-only patch)
[x] Reproduction path now passes — feed_query_succeeded observed in app logs
[ ] Regression test added/updated (N/A — requires live AWS + VPC environment)
[x] Verified no duplicate solved-bug log exists for same root cause