fck-nat Migration

Plan Metadata

Plan type: plan
Parent plan: N/A
Depends on: N/A
Status: documentation

Status semantics:
- draft: Plan is being created or updated and is not final.
- approved: Plan is approved but not yet applied in code.
- documentation: Code currently exists and matches the plan contract.

Update rule:
- When an existing plan is edited, set status to draft until re-approved.

System Intent

What is being built: Replace the managed AWS NAT Gateway (nat-0e1627564e57216bf, ~$32/mo) with a self-hosted NAT instance via the community RaJiska/fck-nat/aws Terraform module on a t4g.nano (~$3/mo). Concurrently delete the SSM and EC2 Interface VPC endpoints (~$14.60/mo combined), since per-call NAT egress for those AWS APIs is cheaper than the flat endpoint cost at current usage. Lambda code, SAM template, RDS, and S3/DynamoDB Gateway endpoints are unchanged. Expected savings: ~$42/mo.
Primary consumer(s): All VPC Lambdas defined in main/server/template.yaml that have VpcConfig (chat handler, ingest, memories feed, auth handoff, etc.) — they require internet egress for OpenAI Whisper, Anthropic, Groq, GPU worker, plus AWS control-plane API calls (SSM GetParameter for the DB password; EC2 DescribeInstances/StartInstances for GPU worker management — see main/server/api/memories/chat/app.py:44,55 and main/server/worldmm/pipeline/ingest_window.py:29). S3/DynamoDB traffic stays on the free Gateway endpoints, never the NAT path.
Boundary (black-box scope only): VPC routing layer in main/devops/main.tf. Egress hop changes from managed NAT Gateway to fck-nat instance. Source IP changes (new EIP). No application code changes.

Stage Gate Tracker

[x] Stage 1 Mermaid approved
[x] Stage 2 I/O contracts approved
[x] Stage 3 pseudocode/technical details approved or skipped

1. Mermaid Diagram

Reference: .agent/skills/create-mermaid-diagram/SKILL.md

flowchart TD
  Lambda["VPC Lambda | main/server/template.yaml"]:::unchanged
  RouteTable["lambda_private route table | main/devops/main.tf"]:::updated
  FckNat["fck-nat instance (t4g.nano, ASG=1) | main/devops/main.tf"]:::new
  EIP["EIP (aws_eip.fck_nat, attached by module userdata) | main/devops/main.tf"]:::new
  IGW["Internet Gateway"]:::unchanged
  Internet["External APIs (OpenAI, Anthropic, Groq, GPU, AWS public endpoints)"]:::unchanged
  S3DDB["S3 / DynamoDB (via Gateway endpoints)"]:::unchanged
  RDS["RDS encache-db (in-VPC)"]:::unchanged
  Alarm["CloudWatch alarm AWS/AutoScaling GroupInServiceInstances < 1 | main/devops/main.tf"]:::new
  SNS["SNS encache-cost-alerts"]:::unchanged
  Discord["encache-discord-alerts Lambda → Discord webhook"]:::unchanged

  Lambda -->|"S3/DynamoDB API call"| S3DDB
  Lambda -->|"PostgreSQL connection"| RDS
  Lambda -->|"0.0.0.0/0 egress"| RouteTable
  RouteTable -->|"NetworkInterfaceId = fck-nat ENI"| FckNat
  FckNat -->|"SNAT to EIP"| EIP
  EIP -->|"public IP egress"| IGW
  IGW -->|"HTTPS"| Internet

  FckNat -.->|"GroupInServiceInstances metric"| Alarm
  Alarm -->|"alarm action"| SNS
  SNS -->|"subscription"| Discord

classDef unchanged fill:#d3d3d3,stroke:#666,stroke-width:1px;
classDef updated fill:#ffe58a,stroke:#666,stroke-width:1px;
classDef new fill:#a4d4a8,stroke:#666,stroke-width:1px;

2. Black-Box Inputs and Outputs

This is an infrastructure plan; the "contracts" are network-level invariants that VPC Lambdas depend on, not REST/RPC payloads. Each flow asserts a routing or recovery property that must hold after migration.

Global Types

LambdaInvocation {
  function_name: string (any function with VpcConfig in template.yaml)
  source_subnet: string (private subnet 1b or 1c)
}

EgressTarget {
  destination: string (FQDN or AWS service name)
  expected_path: string ("via fck-nat" | "via gateway endpoint" | "in-VPC")
}

CutoverEvent {
  triggered_by: string ("terraform apply" | "instance termination")
  expected_outage_window: duration
}

InfraAlert {
  source: string ("CloudWatch alarm name")
  destination: string ("Discord channel")
}
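
When scripting the probes, the global types above can be mirrored as Python dataclasses. This is only a sketch: the allowed string values are copied from the field annotations above, and mapping `duration` to `timedelta` is an assumption.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Literal

# Value sets copied from the annotations in the type definitions above.
ExpectedPath = Literal["via fck-nat", "via gateway endpoint", "in-VPC"]
Trigger = Literal["terraform apply", "instance termination"]


@dataclass(frozen=True)
class LambdaInvocation:
    function_name: str  # any function with VpcConfig in template.yaml
    source_subnet: str  # private subnet 1b or 1c


@dataclass(frozen=True)
class EgressTarget:
    destination: str  # FQDN or AWS service name
    expected_path: ExpectedPath


@dataclass(frozen=True)
class CutoverEvent:
    triggered_by: Trigger
    expected_outage_window: timedelta  # assumption: duration -> timedelta


@dataclass(frozen=True)
class InfraAlert:
    source: str       # CloudWatch alarm name
    destination: str  # Discord channel


# Example: the external-egress happy path expressed with these types.
openai_call = EgressTarget(destination="api.openai.com", expected_path="via fck-nat")
```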

Flow: `lambdaExternalEgress`

Test files: N/A (validated by post-apply smoke probe — see Stage 3 verification table, probe #4)
Core files: main/devops/main.tf (route table + module), main/server/template.yaml (Lambda VpcConfig)

Type Definitions

ExternalEgressInput {
  invocation: LambdaInvocation
  target: EgressTarget destination=external (e.g. api.openai.com, api.anthropic.com)
}

ExternalEgressOutput {
  delivered: boolean (HTTP request reaches destination)
  source_ip: string (visible source IP at destination = fck-nat EIP)
  latency_overhead_ms: number (target ≤ 5ms vs managed NAT)
}

Paths

path-name	input	output/expected state change	path-type	notes	updated
`lambdaExternalEgress.success`	`ExternalEgressInput` for OpenAI/Anthropic/Groq/GPU worker	`ExternalEgressOutput delivered=true; source_ip=new EIP`	`happy path`	Replaces failure mode from pre-#339 (`APIConnectionError` after 36s)	`Y`
`lambdaExternalEgress.during-cutover`	`ExternalEgressInput` issued during ~5–10min apply window	`delivered=false; client retries via PersistentUploadQueue (PR #351) and chatWithMemory retry; eventual delivered=true after route lands`	`error → recovered`	Accepted outage; mitigated by overnight scheduling	`Y`

Flow: `lambdaAwsApiEgress`

Test files: N/A (post-apply smoke probe #5)
Core files: main/devops/main.tf

Type Definitions

AwsApiEgressInput {
  invocation: LambdaInvocation
  target: EgressTarget destination=ssm.us-east-1.amazonaws.com OR ec2.us-east-1.amazonaws.com
}

AwsApiEgressOutput {
  delivered: boolean
  resolved_via: string ("public endpoint via fck-nat" — interface endpoint deleted)
  per_call_cost_usd: number (target ≤ $0.0002)
}

Paths

path-name	input	output/expected state change	path-type	notes	updated
`lambdaAwsApiEgress.success`	`AwsApiEgressInput` for SSM GetParameter `/encache/db/password`	`delivered=true; resolved_via=public via fck-nat`	`happy path`	Replaces private-DNS interface endpoint path; latency +20–50ms cold, ~5ms warm	`Y`
`lambdaAwsApiEgress.endpoint-deletion-regression`	Synthetic SSM call right after endpoint delete	`delivered=true; NOT EndpointConnectionError`	`error guard`	Confirms public DNS resolves correctly post-deletion	`Y`
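
The trade-off behind deleting the interface endpoints can be checked with a short break-even calculation. The sketch below uses only figures quoted in this plan (~$14.60/mo combined endpoint cost, per-call target ≤ $0.0002) and treats the per-call target as a worst-case unit cost, which is an assumption; real egress cost depends on payload size and is not modeled here.

```python
from decimal import Decimal


def break_even_calls_per_month(endpoint_monthly_usd: Decimal, per_call_usd: Decimal) -> Decimal:
    """Monthly call volume at which per-call egress cost equals the flat endpoint fee."""
    if per_call_usd <= 0:
        raise ValueError("per_call_usd must be positive")
    return endpoint_monthly_usd / per_call_usd


# Figures taken from this plan; Decimal avoids float rounding in the division.
ENDPOINT_MONTHLY_USD = Decimal("14.60")
PER_CALL_TARGET_USD = Decimal("0.0002")

break_even = break_even_calls_per_month(ENDPOINT_MONTHLY_USD, PER_CALL_TARGET_USD)
```

Under these assumptions the endpoints only pay for themselves above 73,000 SSM/EC2 API calls per month; below that volume, routing the calls through fck-nat is the cheaper path.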

Flow: `lambdaGatewayEndpointEgress`

Test files: N/A (post-apply smoke probe #6 — route-table prefix-list assertion; VPC Flow Logs are not provisioned by this stack so direct traffic capture is out of scope)
Core files: main/devops/main.tf (aws_vpc_endpoint.s3, aws_vpc_endpoint.dynamodb, aws_route_table.lambda_private)

Type Definitions

GatewayEgressInput {
  invocation: LambdaInvocation
  target: EgressTarget destination=s3.us-east-1.amazonaws.com OR dynamodb.us-east-1.amazonaws.com
}

GatewayEgressOutput {
  delivered: boolean
  resolved_via: string ("VPC Gateway endpoint — never NAT")
}

Paths

path-name	input	output/expected state change	path-type	notes	updated
`lambdaGatewayEndpointEgress.success`	`GatewayEgressInput` to S3 or DynamoDB	`delivered=true; resolved_via=VPC Gateway endpoint`	`happy path`	Confirms route table change did NOT break Gateway endpoint association (S3/DDB endpoint route still present in `lambda_private`)	`Y`
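
The prefix-list assertion behind this flow can be scripted against the JSON returned by `aws ec2 describe-route-tables`. The sketch below is a possible helper, not part of the stack; the function names and the sample IDs are hypothetical, and only the `RouteTables[].Routes[]` fields `DestinationPrefixListId` and `GatewayId` are relied on.

```python
from typing import Any, Dict, List, Set, Tuple


def gateway_prefix_routes(response: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (prefix_list_id, gateway_id) for every prefix-list route in the response."""
    found = []
    for table in response.get("RouteTables", []):
        for route in table.get("Routes", []):
            prefix_list = route.get("DestinationPrefixListId")
            if prefix_list:
                found.append((prefix_list, route.get("GatewayId", "")))
    return found


def gateway_endpoints_preserved(response: Dict[str, Any], expected_endpoint_ids: Set[str]) -> bool:
    """True when each expected gateway endpoint has exactly one prefix-list route."""
    routes = gateway_prefix_routes(response)
    gateway_ids = {gateway_id for _, gateway_id in routes}
    return len(routes) == len(expected_endpoint_ids) and gateway_ids == expected_endpoint_ids


# Hypothetical sample shaped like describe-route-tables output after cutover.
SAMPLE_RESPONSE = {
    "RouteTables": [{
        "Routes": [
            {"DestinationPrefixListId": "pl-s3-example", "GatewayId": "vpce-s3-example", "State": "active"},
            {"DestinationPrefixListId": "pl-ddb-example", "GatewayId": "vpce-ddb-example", "State": "active"},
            {"DestinationCidrBlock": "0.0.0.0/0", "NetworkInterfaceId": "eni-example", "State": "active"},
        ]
    }]
}

preserved = gateway_endpoints_preserved(SAMPLE_RESPONSE, {"vpce-s3-example", "vpce-ddb-example"})
```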

Flow: `fckNatAutoHeal`

Test files: N/A (failure-injection probes #8–10)
Core files: main/devops/main.tf (module + alarm)

Type Definitions

AutoHealInput {
  trigger: CutoverEvent triggered_by="instance termination" (manual: aws ec2 terminate-instances)
}

AutoHealOutput {
  replacement_instance_id: string (different from terminated)
  replacement_within: duration (target ≤ 5min)
  eip_reattached: boolean (true)
  route_table_updated: boolean (true — module lifecycle hook rewrites route to new ENI)
  alert_delivered: boolean (true — Discord channel receives alert)
}

Paths

path-name	input	output/expected state change	path-type	notes	updated
`fckNatAutoHeal.success`	`AutoHealInput`	`AutoHealOutput all true within 5min`	`happy path`	Required by soft-prod posture (Q1=B); module's `update_route_tables=true` is the load-bearing setting	`Y`
`fckNatAutoHeal.az-outage`	trigger=AZ failure (us-east-1a down)	`replacement_within=undefined; alert_delivered=true; CODE CHANGE REQUIRED before re-apply`	`error`	Single-AZ accepted in scope. Recovery is NOT a no-op `terraform apply -var subnet_id=...`: the existing `data.aws_subnet.private_1b` / `private_1c` data sources point at PRIVATE subnets, which cannot host fck-nat (no IGW route). The operator must first add a `data "aws_subnet" "public_1b"` (or `_1c`) data source to `main.tf` (the default VPC has one public subnet per AZ), then point `module.fck_nat.subnet_id` at it and apply.	`Y`
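
The `AutoHealOutput` contract can be evaluated mechanically once the failure-injection probes have been run. The sketch below assumes the operator records the observations by hand or from CLI output; the `AutoHealObservation` class and the instance IDs are hypothetical, and the 5-minute limit is the `replacement_within` target stated above.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional

REPLACEMENT_TARGET = timedelta(minutes=5)  # replacement_within target from AutoHealOutput


@dataclass(frozen=True)
class AutoHealObservation:
    terminated_instance_id: str
    replacement_instance_id: Optional[str]  # None if no replacement was seen
    replacement_after: Optional[timedelta]  # time from terminate to in-service
    eip_reattached: bool
    route_table_updated: bool
    alert_delivered: bool


def auto_heal_failures(obs: AutoHealObservation) -> List[str]:
    """Return the names of AutoHealOutput fields that missed their target."""
    failures = []
    if not obs.replacement_instance_id or obs.replacement_instance_id == obs.terminated_instance_id:
        failures.append("replacement_instance_id")
    if obs.replacement_after is None or obs.replacement_after > REPLACEMENT_TARGET:
        failures.append("replacement_within")
    for field in ("eip_reattached", "route_table_updated", "alert_delivered"):
        if not getattr(obs, field):
            failures.append(field)
    return failures


# Hypothetical observation: everything recovered, but too slowly.
slow = AutoHealObservation("i-old", "i-new", timedelta(minutes=7), True, True, True)
slow_failures = auto_heal_failures(slow)
```

An empty list corresponds to the `fckNatAutoHeal.success` path; any non-empty result names the property that needs investigation.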

Flow: `infraAlertDelivery`

Test files: N/A (failure-injection probe #9)
Core files: main/devops/main.tf (alarm + SNS subscription, both pre-existing)

Type Definitions

AlertInput {
  source: "encache-fck-nat-health alarm"
  trigger: "AWS/AutoScaling GroupInServiceInstances < 1 for 2 evaluation periods (60s each)"
}

AlertOutput {
  destination: "Discord channel via encache-discord-alerts Lambda"
  delivered_within: duration (target ≤ 3min from ASG losing in-service instance)
}

Note: the alarm tracks AWS/AutoScaling GroupInServiceInstances, not AWS/EC2 StatusCheckFailed. Terminating an EC2 instance makes the EC2 status-check metric missing (treated as not-breaching), so a StatusCheckFailed alarm would not fire on a terminate. The ASG metric drops to 0 the instant the instance leaves service and stays at 0 until the replacement is in service, which is the actual signal we care about.
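
The reasoning in the note above can be illustrated with a deliberately simplified model of alarm evaluation: the alarm fires only when the most recent N datapoints are all breaching, and a missing datapoint counts as breaching only if the alarm is configured that way. Real CloudWatch evaluation of missing data is more nuanced; the function names here are hypothetical and the model exists only to show why a metric that disappears with the instance cannot drive this alert.

```python
from typing import Optional, Sequence


def is_breaching(value: Optional[float], threshold: float, treat_missing_as_breaching: bool) -> bool:
    """A datapoint breaches when it is below the threshold; None models a missing datapoint."""
    if value is None:
        return treat_missing_as_breaching
    return value < threshold


def alarm_state(
    datapoints: Sequence[Optional[float]],
    threshold: float = 1,
    evaluation_periods: int = 2,
    treat_missing_as_breaching: bool = True,
) -> str:
    """Simplified model: ALARM only when the last `evaluation_periods` points all breach."""
    recent = list(datapoints)[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    if all(is_breaching(v, threshold, treat_missing_as_breaching) for v in recent):
        return "ALARM"
    return "OK"


# GroupInServiceInstances sampled every 60s: instance terminated before the third sample.
asg_after_terminate = alarm_state([1, 1, 0, 0])
# A per-instance metric that vanishes with the instance, missing data not breaching.
vanished_metric = alarm_state([1, 1, None, None], treat_missing_as_breaching=False)
```

In this model the ASG series reaches ALARM after two 60-second periods at 0, while the vanished per-instance series never does.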

Paths

path-name	input	output/expected state change	path-type	notes	updated
`infraAlertDelivery.success`	`AlertInput` after instance termination	`AlertOutput delivered_within ≤ 3min`	`happy path`	Reuses verified pipeline (validated end-to-end this session — webhook 204, lambda invoke 200, daily report posted)	`Y`

3. Pseudocode / Technical Details for Critical Flows (Optional)

Cutover sequence

Pre-apply:
  1. terraform validate
  2. terraform plan -out=tfplan
  3. Inspect plan: confirm exactly 4 deletes (nat_gateway.main, eip.nat,
                   vpc_endpoint.ssm, vpc_endpoint.ec2);
                   2 creates (aws_eip.fck_nat,
                   aws_cloudwatch_metric_alarm.fck_nat_health)
                   + 1 module add (fck-nat: ASG, launch template, IAM role,
                     SG, static ENI, internal aws_route resource);
                   1 modify (aws_route_table.lambda_private — route block
                   removed, lifecycle.ignore_changes added).
  4. Schedule cutover for overnight low-traffic window (data shows 0 RDS
                   connections / 0 NAT bytes overnight).

Apply:
  5. Open `aws logs tail /aws/lambda/encache-IngestWindowFunction-* --follow`
     in side terminal.
  6. terraform apply tfplan
  7. Wait ~5min for: NAT GW destroy, fck-nat ASG launch, EIP attach via
     userdata, module's aws_route create on lambda_private (0/0 → ENI).

  Stale-route gotcha: the route_table previously held an inline
  `route { 0/0 → nat_gateway_id }`. After NAT GW destroy this entry
  becomes a blackhole entry but still occupies the 0/0 slot, and the
  module's aws_route create fails with `RouteAlreadyExists`. With
  `ignore_changes = [route]` Terraform will NOT remove that entry on
  its own. If `terraform apply` reports `RouteAlreadyExists` on
  `module.fck_nat.aws_route.main[0]`, run:
    aws ec2 delete-route \
      --route-table-id $RT_ID \
      --destination-cidr-block 0.0.0.0/0 \
      --profile encache-workload --region us-east-1
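  then regenerate the plan (`terraform plan -out=tfplan`; the saved
  plan is stale once the partial apply has changed state) and re-run
  `terraform apply tfplan`. The module's aws_route create succeeds on
  retry.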

Post-apply (within 5min):
  8. Run smoke probes 1–7 (instance health, EIP, route, lambda egress,
     SSM call, gateway endpoints preserved, alarm wired).
  9. If any probe fails: git revert + terraform apply (~10–15min rollback,
     NAT GW provision is long pole).

Steady-state validation (ongoing):
 10. Monitor next billing cycle daily cost report — expect VPC line drop
     ~$21 → ~$3.
 11. Run failure injection probes 8–10 within first week to prove auto-heal
     + alert.
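
The plan inspection in step 3 can be partly automated from `terraform show -json tfplan`, whose `resource_changes` entries carry an `address` and a `change.actions` list. The sketch below is one possible guard, not part of the stack: the function names are hypothetical, and the full resource addresses in `ALLOWED_DELETES` are inferred from the abbreviated names in step 3 and should be checked against `main.tf`.

```python
from collections import Counter
from typing import Any, Dict, List, Set

# Assumption: full addresses inferred from the abbreviated names in step 3.
ALLOWED_DELETES = {
    "aws_nat_gateway.main",
    "aws_eip.nat",
    "aws_vpc_endpoint.ssm",
    "aws_vpc_endpoint.ec2",
}


def summarize_actions(plan: Dict[str, Any]) -> Counter:
    """Count creates, updates, deletes, and replaces, ignoring no-op and read entries."""
    counts: Counter = Counter()
    for change in plan.get("resource_changes", []):
        actions = set(change["change"]["actions"])
        if actions <= {"no-op", "read"}:
            continue
        counts["replace" if actions == {"create", "delete"} else next(iter(actions))] += 1
    return counts


def unexpected_deletes(plan: Dict[str, Any], allowed: Set[str]) -> List[str]:
    """Addresses that the plan would delete or replace but that are not on the allow-list."""
    return sorted(
        change["address"]
        for change in plan.get("resource_changes", [])
        if "delete" in change["change"]["actions"] and change["address"] not in allowed
    )


# Hypothetical plan excerpt: one delete that should stop the cutover.
sample_plan = {
    "resource_changes": [
        {"address": "aws_nat_gateway.main", "change": {"actions": ["delete"]}},
        {"address": "aws_eip.fck_nat", "change": {"actions": ["create"]}},
        {"address": "aws_route_table.lambda_private", "change": {"actions": ["update"]}},
        {"address": "aws_db_instance.example", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_vpc_endpoint.s3", "change": {"actions": ["no-op"]}},
    ]
}

blocked = unexpected_deletes(sample_plan, ALLOWED_DELETES)
```

A non-empty result is a reason to stop before step 6; an empty result still leaves the create and modify counts to be reviewed by eye.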

Module wiring (Terraform sketch)

resource "aws_eip" "fck_nat" {
  domain = "vpc"
  tags   = { Name = "encache-fck-nat-eip" }
}

module "fck_nat" {
  # Pinned to v1.3.0 commit SHA for immutability (Checkov CKV_TF_1).
  source = "git::https://github.com/RaJiska/terraform-aws-fck-nat.git?ref=9377bf9247c96318b99273eb2978d1afce8cf0eb"

  name      = "encache-fck-nat"
  vpc_id    = data.aws_vpc.default.id
  subnet_id = data.aws_subnet.public_1a.id

  # ha_mode=true wraps the instance in an ASG of 1 for auto-heal. With a
  # single subnet_id this is still single-AZ — the "HA mode" label refers
  # to the ASG-vs-raw-instance toggle, not multi-AZ topology. ha_mode=false
  # would create a plain aws_instance with no auto-recovery.
  ha_mode            = true
  instance_type      = "t4g.nano"
  use_spot_instances = false

  # Stable source IP across ASG replacements — userdata calls
  # `aws ec2 associate-address` on every instance launch.
  eip_allocation_ids = [aws_eip.fck_nat.id]

  # NB: `use_default_security_group` / `additional_security_group_ids`
  # only affect the launch-template ENI on the running instance — they do
  # NOT touch the static data-path ENI (`aws_network_interface.main` in
  # the module), which Lambda traffic actually hits and which is
  # hard-wired to the module's default SG (VPC-CIDR ingress). Tightening
  # the data-path SG would require a fork or upstream PR. Out of scope.

  update_route_tables = true
  route_tables_ids = {
    lambda_private = aws_route_table.lambda_private.id
  }
}

resource "aws_cloudwatch_metric_alarm" "fck_nat_health" {
  alarm_name        = "encache-fck-nat-health"
  alarm_description = "fck-nat ASG has fewer than one in-service instance; NAT egress for VPC Lambdas is impaired."

  # GroupInServiceInstances drops to 0 the moment the ASG instance is
  # terminated/unhealthy and stays at 0 until the replacement is in service.
  # AWS/EC2 StatusCheckFailed would NOT fire for terminate-instances:
  # the metric goes missing once the instance disappears, and with the
  # default treat_missing_data = "missing" the alarm moves to
  # INSUFFICIENT_DATA rather than ALARM. ASG health is the right signal
  # for "no NAT available," so we alarm on that instead.
  metric_name         = "GroupInServiceInstances"
  namespace           = "AWS/AutoScaling"
  statistic           = "Minimum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 1
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "breaching"

  dimensions = {
    AutoScalingGroupName = module.fck_nat.name
  }

  alarm_actions = [aws_sns_topic.cost_alerts.arn]
  ok_actions    = [aws_sns_topic.cost_alerts.arn]
}
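
After apply, the alarm definition above can be compared with what `aws cloudwatch describe-alarms --alarm-names encache-fck-nat-health` returns. The sketch below works on the parsed JSON (`MetricAlarms[]`); the helper names are hypothetical, and the expected values are copied from the Terraform sketch above.

```python
from typing import Any, Dict, Tuple

# Expected settings, copied from the aws_cloudwatch_metric_alarm sketch above.
EXPECTED_ALARM = {
    "MetricName": "GroupInServiceInstances",
    "Namespace": "AWS/AutoScaling",
    "Statistic": "Minimum",
    "Period": 60,
    "EvaluationPeriods": 2,
    "Threshold": 1,
    "ComparisonOperator": "LessThanThreshold",
    "TreatMissingData": "breaching",
}


def alarm_mismatches(response: Dict[str, Any], alarm_name: str) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing field to (expected, actual); empty dict means the alarm matches."""
    alarms = [a for a in response.get("MetricAlarms", []) if a.get("AlarmName") == alarm_name]
    if not alarms:
        return {"AlarmName": (alarm_name, None)}
    alarm = alarms[0]
    return {
        key: (want, alarm.get(key))
        for key, want in EXPECTED_ALARM.items()
        if alarm.get(key) != want
    }


def notifies_topic(alarm: Dict[str, Any], topic_name: str) -> bool:
    """True when both alarm and OK actions include an SNS ARN ending in the topic name."""
    suffix = ":" + topic_name
    return all(
        any(arn.endswith(suffix) for arn in alarm.get(key, []))
        for key in ("AlarmActions", "OKActions")
    )
```

For example, `alarm_mismatches(response, "encache-fck-nat-health")` returning `{}` together with `notifies_topic(alarm, "encache-cost-alerts")` returning `True` covers the wiring that the health-alarm probe checks.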

Verification probes (run after apply)

#	Probe	Pass criteria
1	`aws ec2 describe-instances --filters "Name=tag:Name,Values=encache-fck-nat*"`	state=running, status checks 2/2
2	`aws ec2 describe-addresses --query "Addresses[?Tags[?Value=='encache-fck-nat-eip']]"`	AssociationId present (proves the dedicated `aws_eip.fck_nat` is attached to the running instance via module userdata)
3	`aws ec2 describe-route-tables --route-table-ids $RT_ID`	NetworkInterfaceId = fck-nat ENI
4	Invoke `IngestWindowFunction` with synthetic payload	response 200, no `APIConnectionError`
5	Invoke chat handler (expect auth-fail not infra-fail)	error type = auth/validation, NOT `EndpointConnectionError`
6	`aws ec2 describe-route-tables --route-table-ids $RT_ID --query 'RouteTables[].Routes[?DestinationPrefixListId][DestinationPrefixListId,GatewayId]'`	Two prefix-list routes returned (S3 + DynamoDB), each with `GatewayId=vpce-*` matching `aws_vpc_endpoint.s3.id` / `aws_vpc_endpoint.dynamodb.id`. Confirms gateway endpoints survived the route-table edit. (VPC Flow Logs are not provisioned by this stack; out of scope to add here.)
7	`aws cloudwatch describe-alarms --alarm-names encache-fck-nat-health`	`MetricName=GroupInServiceInstances`, `Namespace=AWS/AutoScaling`, `ComparisonOperator=LessThanThreshold`, `Threshold=1`, AlarmActions contains `encache-cost-alerts` SNS ARN
8	`aws ec2 terminate-instances --instance-ids $ID`	Within 5min: new instance launched by ASG, EIP re-attached, route table updated to new ENI, Lambda egress works again
9	After probe 8, while ASG count is 0	Alarm transitions OK → ALARM within ~2min; Discord channel receives notification via `encache-cost-alerts` SNS → `encache-discord-alerts` Lambda. After replacement is InService, alarm transitions ALARM → OK and Discord receives a recovery notification.
10	After probe 8	EIP re-associated to new instance
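
The default-route check (probe 3) and the stale-route condition from the cutover sequence can both be read from the same `describe-route-tables` JSON, since a route whose target has been deleted is reported with `State` set to `blackhole`. A minimal sketch, with a hypothetical function name and sample IDs:

```python
from typing import Any, Dict


def default_route_status(response: Dict[str, Any], expected_eni_id: str) -> str:
    """Classify the 0.0.0.0/0 route as ok, blackhole, wrong-target, or missing."""
    for table in response.get("RouteTables", []):
        for route in table.get("Routes", []):
            if route.get("DestinationCidrBlock") != "0.0.0.0/0":
                continue
            if route.get("State") == "blackhole":
                return "blackhole"  # target was deleted; entry still occupies the 0/0 slot
            if route.get("NetworkInterfaceId") == expected_eni_id:
                return "ok"
            return "wrong-target"
    return "missing"


# Hypothetical response right after the NAT Gateway is destroyed.
stale = {"RouteTables": [{"Routes": [
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-example", "State": "blackhole"},
]}]}
stale_status = default_route_status(stale, "eni-example")
```

A `blackhole` result matches the stale-route case described in the cutover sequence; `ok` is the pass criterion for probe 3.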

Implementation notes

Module pin: git source at v1.3.0 commit SHA 9377bf9247c96318b99273eb2978d1afce8cf0eb (immutable; satisfies Checkov CKV_TF_1, prevents accidental upgrades on terraform init -upgrade). v1.4.0 requires AWS provider >= 6.0.0 and this stack pins ~> 5.0, so bumping the module requires a coordinated provider upgrade. Update both source and the SHA when upgrading.
use_spot_instances = false — soft-prod rejects surprise interruption.
lifecycle { ignore_changes = [route] } on aws_route_table.lambda_private is required so the module can mutate the 0/0 route on instance replacement without TF reverting.
Source IP changes to new EIP — confirmed no 3rd party (OpenAI, Anthropic, Groq, GPU worker) allowlists current EIP.
Single-AZ accepted; multi-AZ HA mode (~$6/mo extra) defeats most savings and matches existing PR #339 single-AZ posture. Recovery from a us-east-1a outage requires a code change first (add data "aws_subnet" "public_1b" or _1c and update module.fck_nat.subnet_id to it) — there is no public-subnet data source for those AZs in the current main.tf. This is the operator runbook for the fckNatAutoHeal.az-outage flow.
Health alarm uses AWS/AutoScaling GroupInServiceInstances < 1, not AWS/EC2 StatusCheckFailed. Terminating an EC2 instance makes the EC2 metric missing (treated as not-breaching) — the alarm would not fire. ASG GroupInServiceInstances drops to 0 at terminate and is the correct signal for "no NAT available."
SG scoping limitation: the fck-nat module v1.3.x creates a static data-path ENI (aws_network_interface.main inside the module) that is hard-coded to use the module's default SG (ingress = full VPC CIDR). The module's use_default_security_group and additional_security_group_ids inputs only affect the launch-template ENI, which faces the public internet (no inbound initiation). Tightening the data-path SG to "Lambda SG only" would require forking the module or upstreaming a PR. Accepted for this migration: VPC-CIDR ingress on the data path is safe in this stack because only the Lambda private subnets route 0/0 to fck-nat. Flagged for future hardening.
Rollback: git revert <commit> + terraform apply recreates managed NAT + endpoints in ~10–15min.
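
The module-pin rule above (a full commit SHA rather than a tag) is easy to check in review tooling. The sketch below is a hypothetical helper that extracts the `ref` from a Terraform git source string and accepts it only if it is a 40-character lowercase hex SHA; it does not verify that the commit exists upstream.

```python
import re
from typing import Optional

_REF = re.compile(r"[?&]ref=([^&#]+)")
_SHA40 = re.compile(r"[0-9a-f]{40}")


def pinned_commit(source: str) -> Optional[str]:
    """Return the commit SHA a module source is pinned to, or None if it is not SHA-pinned."""
    match = _REF.search(source)
    if match is None:
        return None
    ref = match.group(1)
    return ref if _SHA40.fullmatch(ref) else None


# Source string from the module wiring sketch in this plan.
FCK_NAT_SOURCE = (
    "git::https://github.com/RaJiska/terraform-aws-fck-nat.git"
    "?ref=9377bf9247c96318b99273eb2978d1afce8cf0eb"
)
pinned = pinned_commit(FCK_NAT_SOURCE)
```

A tag reference such as `?ref=v1.3.0` returns `None`, which is the case the pin is meant to rule out.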

4. Handoff to Related Plan Reconciliation

After all stages are approved, apply .agent/skills/reconcile-plans/SKILL.md to propagate contract updates across linked plans. No existing plan in docs/plans/ references the NAT/endpoint resources directly, so reconciliation is expected to be a no-op other than confirming this plan is linkable.
