
fck-nat Migration

Plan Metadata

  • Plan type: plan
  • Parent plan: N/A
  • Depends on: N/A
  • Status: documentation

Status semantics:

  • draft: Plan is being created or updated and is not final.
  • approved: Plan is approved but not yet applied in code.
  • documentation: Code currently exists and matches the plan contract.

Update rule: When an existing plan is edited, set status to draft until re-approved.

System Intent

  • What is being built: Replace the managed AWS NAT Gateway (nat-0e1627564e57216bf, ~$32/mo) with a self-hosted NAT instance via the community RaJiska/fck-nat/aws Terraform module on a t4g.nano (~$3/mo). Concurrently delete the SSM and EC2 Interface VPC endpoints (~$14.60/mo combined), since per-call NAT egress for those AWS APIs is cheaper than the flat endpoint cost at current usage. Lambda code, SAM template, RDS, and S3/DynamoDB Gateway endpoints are unchanged. Expected savings: ~$42/mo.
  • Primary consumer(s): All VPC Lambdas defined in main/server/template.yaml that have VpcConfig (chat handler, ingest, memories feed, auth handoff, etc.) — they require internet egress for OpenAI Whisper, Anthropic, Groq, GPU worker, plus AWS control-plane API calls (SSM GetParameter for the DB password; EC2 DescribeInstances/StartInstances for GPU worker management — see main/server/api/memories/chat/app.py:44,55 and main/server/worldmm/pipeline/ingest_window.py:29). S3/DynamoDB traffic stays on the free Gateway endpoints, never the NAT path.
  • Boundary (black-box scope only): VPC routing layer in main/devops/main.tf. Egress hop changes from managed NAT Gateway to fck-nat instance. Source IP changes (new EIP). No application code changes.
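The savings figure can be sanity-checked from the line items above. This is a minimal sketch that uses only the approximate monthly prices quoted in this plan; the names are illustrative, and usage-based charges (per-call NAT egress for the SSM/EC2 API traffic) are deliberately left out, which is presumably why the plan's ~$42/mo estimate sits slightly below the fixed-cost delta.

```python
# Approximate monthly prices quoted in this plan (USD).
NAT_GATEWAY = 32.00          # managed NAT Gateway being removed
FCK_NAT_INSTANCE = 3.00      # t4g.nano replacement being added
INTERFACE_ENDPOINTS = 14.60  # SSM + EC2 Interface endpoints being removed


def fixed_cost_savings(nat_gateway: float, fck_nat: float, endpoints: float) -> float:
    """Monthly fixed-cost delta: everything removed minus everything added.

    Assumption: usage-based charges are ignored, so this is an upper bound.
    """
    return (nat_gateway + endpoints) - fck_nat


savings = fixed_cost_savings(NAT_GATEWAY, FCK_NAT_INSTANCE, INTERFACE_ENDPOINTS)
print(f"fixed-cost savings: ~${savings:.2f}/mo")
```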

Stage Gate Tracker

  • [x] Stage 1 Mermaid approved
  • [x] Stage 2 I/O contracts approved
  • [x] Stage 3 pseudocode/technical details approved or skipped

1. Mermaid Diagram

Reference: .agent/skills/create-mermaid-diagram/SKILL.md

flowchart TD
  Lambda["VPC Lambda | main/server/template.yaml"]:::unchanged
  RouteTable["lambda_private route table | main/devops/main.tf"]:::updated
  FckNat["fck-nat instance (t4g.nano, ASG=1) | main/devops/main.tf"]:::new
  EIP["EIP (aws_eip.fck_nat, attached by module userdata) | main/devops/main.tf"]:::new
  IGW["Internet Gateway"]:::unchanged
  Internet["External APIs (OpenAI, Anthropic, Groq, GPU, AWS public endpoints)"]:::unchanged
  S3DDB["S3 / DynamoDB (via Gateway endpoints)"]:::unchanged
  RDS["RDS encache-db (in-VPC)"]:::unchanged
  Alarm["CloudWatch alarm AWS/AutoScaling GroupInServiceInstances < 1 | main/devops/main.tf"]:::new
  SNS["SNS encache-cost-alerts"]:::unchanged
  Discord["encache-discord-alerts Lambda → Discord webhook"]:::unchanged

  Lambda -->|"S3/DynamoDB API call"| S3DDB
  Lambda -->|"PostgreSQL connection"| RDS
  Lambda -->|"0.0.0.0/0 egress"| RouteTable
  RouteTable -->|"NetworkInterfaceId = fck-nat ENI"| FckNat
  FckNat -->|"SNAT to EIP"| EIP
  EIP -->|"public IP egress"| IGW
  IGW -->|"HTTPS"| Internet

  FckNat -.->|"GroupInServiceInstances metric"| Alarm
  Alarm -->|"alarm action"| SNS
  SNS -->|"subscription"| Discord

classDef unchanged fill:#d3d3d3,stroke:#666,stroke-width:1px;
classDef updated fill:#ffe58a,stroke:#666,stroke-width:1px;
classDef new fill:#a4d4a8,stroke:#666,stroke-width:1px;

2. Black-Box Inputs and Outputs

This is an infrastructure plan; the "contracts" are network-level invariants that VPC Lambdas depend on, not REST/RPC payloads. Each flow asserts a routing or recovery property that must hold after migration.

Global Types

LambdaInvocation {
  function_name: string (any function with VpcConfig in template.yaml)
  source_subnet: string (private subnet 1b or 1c)
}

EgressTarget {
  destination: string (FQDN or AWS service name)
  expected_path: string ("via fck-nat" | "via gateway endpoint" | "in-VPC")
}

CutoverEvent {
  triggered_by: string ("terraform apply" | "instance termination")
  expected_outage_window: duration
}

InfraAlert {
  source: string ("CloudWatch alarm name")
  destination: string ("Discord channel")
}
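The expected_path values above encode the routing invariant that the rest of the plan asserts. A minimal sketch of that classification, assuming destinations are plain hostnames or bare AWS service names; `classify` and `in_vpc_hosts` are hypothetical names used for illustration and are not part of the codebase, and the sketch ignores region (Gateway endpoints only cover same-region S3/DynamoDB traffic).

```python
from dataclasses import dataclass
from typing import FrozenSet

# Services reached through the free Gateway endpoints in this plan.
GATEWAY_SERVICES = frozenset({"s3", "dynamodb"})


@dataclass(frozen=True)
class EgressTarget:
    destination: str    # FQDN or AWS service name
    expected_path: str  # "via fck-nat" | "via gateway endpoint" | "in-VPC"


def classify(destination: str, in_vpc_hosts: FrozenSet[str] = frozenset()) -> EgressTarget:
    """Hypothetical helper: map a destination to the path this plan expects."""
    host = destination.lower().rstrip(".")
    if host in in_vpc_hosts:
        # e.g. the RDS endpoint hostname, supplied by the caller
        return EgressTarget(destination, "in-VPC")
    labels = host.split(".")
    # Accept either an AWS FQDN or a bare service name such as "s3".
    is_aws = host.endswith(".amazonaws.com") or len(labels) == 1
    if is_aws and GATEWAY_SERVICES.intersection(labels):
        return EgressTarget(destination, "via gateway endpoint")
    # Everything else, including SSM and EC2 APIs, takes the 0.0.0.0/0 route.
    return EgressTarget(destination, "via fck-nat")
```

Under these assumptions, `ssm.us-east-1.amazonaws.com` classifies as "via fck-nat" once the Interface endpoints are deleted, while `s3.us-east-1.amazonaws.com` stays on the Gateway endpoint.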

Flow: lambdaExternalEgress

  • Test files: N/A (validated by post-apply smoke probe — see Stage 3 verification table, probe #4)
  • Core files: main/devops/main.tf (route table + module), main/server/template.yaml (Lambda VpcConfig)

Type Definitions

ExternalEgressInput {
  invocation: LambdaInvocation
  target: EgressTarget destination=external (e.g. api.openai.com, api.anthropic.com)
}

ExternalEgressOutput {
  delivered: boolean (HTTP request reaches destination)
  source_ip: string (visible source IP at destination = fck-nat EIP)
  latency_overhead_ms: number (target ≤ 5ms vs managed NAT)
}

Paths

| path-name | input | output/expected state change | path-type | notes | updated |
|---|---|---|---|---|---|
| lambdaExternalEgress.success | ExternalEgressInput for OpenAI/Anthropic/Groq/GPU worker | ExternalEgressOutput delivered=true; source_ip=new EIP | happy path | Replaces failure mode from pre-#339 (APIConnectionError after 36s) | Y |
| lambdaExternalEgress.during-cutover | ExternalEgressInput issued during ~5–10min apply window | delivered=false; client retries via PersistentUploadQueue (PR #351) and chatWithMemory retry; eventual delivered=true after route lands | error → recovered | Accepted outage; mitigated by overnight scheduling | Y |

Flow: lambdaAwsApiEgress

  • Test files: N/A (post-apply smoke probe #5)
  • Core files: main/devops/main.tf

Type Definitions

AwsApiEgressInput {
  invocation: LambdaInvocation
  target: EgressTarget destination=ssm.us-east-1.amazonaws.com OR ec2.us-east-1.amazonaws.com
}

AwsApiEgressOutput {
  delivered: boolean
  resolved_via: string ("public endpoint via fck-nat" — interface endpoint deleted)
  per_call_cost_usd: number (target ≤ $0.0002)
}

Paths

| path-name | input | output/expected state change | path-type | notes | updated |
|---|---|---|---|---|---|
| lambdaAwsApiEgress.success | AwsApiEgressInput for SSM GetParameter /encache/db/password | delivered=true; resolved_via=public via fck-nat | happy path | Replaces private-DNS interface endpoint path; latency +20–50ms cold, ~5ms warm | Y |
| lambdaAwsApiEgress.endpoint-deletion-regression | Synthetic SSM call right after endpoint delete | delivered=true; NOT EndpointConnectionError | error guard | Confirms public DNS resolves correctly post-deletion | Y |
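The decision to delete the Interface endpoints rests on a break-even between their flat monthly cost and per-call egress. A rough sketch of that arithmetic, treating the per_call_cost_usd target (≤ $0.0002) as a worst-case per-call cost; that worst-case reading is an assumption, and the actual call volume is not stated in this plan.

```python
ENDPOINT_FLAT_COST_USD = 14.60  # combined SSM + EC2 Interface endpoints, per month
PER_CALL_TARGET_USD = 0.0002    # per_call_cost_usd target from AwsApiEgressOutput


def break_even_calls(flat_monthly_usd: float, per_call_usd: float) -> float:
    """Calls per month at which per-call egress costs as much as the flat fee."""
    if per_call_usd <= 0:
        raise ValueError("per_call_usd must be positive")
    return flat_monthly_usd / per_call_usd


def cheaper_via_nat(
    calls_per_month: int,
    flat_monthly_usd: float = ENDPOINT_FLAT_COST_USD,
    per_call_usd: float = PER_CALL_TARGET_USD,
) -> bool:
    """True when routing the calls through the NAT path beats the flat endpoint fee."""
    return calls_per_month * per_call_usd < flat_monthly_usd


print(f"break-even: ~{break_even_calls(ENDPOINT_FLAT_COST_USD, PER_CALL_TARGET_USD):,.0f} calls/mo")
```

With the plan's numbers the break-even lands at roughly 73,000 SSM/EC2 API calls per month; below that volume, deleting the endpoints is the cheaper option even at the worst-case per-call target.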

Flow: lambdaGatewayEndpointEgress

  • Test files: N/A (post-apply smoke probe #6 — route-table prefix-list assertion; VPC Flow Logs are not provisioned by this stack so direct traffic capture is out of scope)
  • Core files: main/devops/main.tf (aws_vpc_endpoint.s3, aws_vpc_endpoint.dynamodb, aws_route_table.lambda_private)

Type Definitions

GatewayEgressInput {
  invocation: LambdaInvocation
  target: EgressTarget destination=s3.us-east-1.amazonaws.com OR dynamodb.us-east-1.amazonaws.com
}

GatewayEgressOutput {
  delivered: boolean
  resolved_via: string ("VPC Gateway endpoint — never NAT")
}

Paths

| path-name | input | output/expected state change | path-type | notes | updated |
|---|---|---|---|---|---|
| lambdaGatewayEndpointEgress.success | GatewayEgressInput to S3 or DynamoDB | delivered=true; resolved_via=VPC Gateway endpoint | happy path | Confirms route table change did NOT break Gateway endpoint association (S3/DDB endpoint route still present in lambda_private) | Y |

Flow: fckNatAutoHeal

  • Test files: N/A (failure-injection probes #8–10)
  • Core files: main/devops/main.tf (module + alarm)

Type Definitions

AutoHealInput {
  trigger: CutoverEvent triggered_by="instance termination" (manual: aws ec2 terminate-instances)
}

AutoHealOutput {
  replacement_instance_id: string (different from terminated)
  replacement_within: duration (target ≤ 5min)
  eip_reattached: boolean (true)
  route_table_updated: boolean (true — module lifecycle hook rewrites route to new ENI)
  alert_delivered: boolean (true — Discord channel receives alert)
}

Paths

| path-name | input | output/expected state change | path-type | notes | updated |
|---|---|---|---|---|---|
| fckNatAutoHeal.success | AutoHealInput | AutoHealOutput all true within 5min | happy path | Required by soft-prod posture (Q1=B); module's update_route_tables=true is the load-bearing setting | Y |
| fckNatAutoHeal.az-outage | trigger=AZ failure (us-east-1a down) | replacement_within=undefined; alert_delivered=true; CODE CHANGE REQUIRED before re-apply | error | Single-AZ accepted in scope. Recovery is NOT a no-op terraform apply --var subnet_id=...: the existing data.aws_subnet.private_1b / private_1c data sources point at PRIVATE subnets, which cannot host fck-nat (no IGW route). Operator must first add a data "aws_subnet" "public_1b" (or _1c) data source to main.tf (default VPC has one public subnet per AZ), then update module.fck_nat.subnet_id to that data source and apply. | Y |
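The "within 5min" targets in AutoHealOutput imply polling with a deadline during failure injection. A minimal, AWS-agnostic sketch of such a poller; `wait_until` is a hypothetical helper, and the predicate (for example "a new instance is InService and the EIP is associated") is left to the caller because it depends on how the probes are scripted.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float = 300.0,   # replacement_within target: 5 minutes
    interval_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll predicate until it returns True; return the elapsed seconds.

    Raises TimeoutError if the deadline passes first. clock and sleep are
    injectable so the loop can be exercised without real waiting.
    """
    start = clock()
    while True:
        if predicate():
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"condition not met within {timeout_s:.0f}s")
        sleep(interval_s)
```

The returned elapsed time can be recorded as the observed replacement_within value for the probe run.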

Flow: infraAlertDelivery

  • Test files: N/A (failure-injection probe #9)
  • Core files: main/devops/main.tf (alarm + SNS subscription, both pre-existing)

Type Definitions

AlertInput {
  source: "encache-fck-nat-health alarm"
  trigger: "AWS/AutoScaling GroupInServiceInstances < 1 for 2 evaluation periods (60s each)"
}

AlertOutput {
  destination: "Discord channel via encache-discord-alerts Lambda"
  delivered_within: duration (target ≤ 3min from ASG losing in-service instance)
}

Note: the alarm tracks AWS/AutoScaling GroupInServiceInstances, not AWS/EC2 StatusCheckFailed. Terminating an EC2 instance makes the EC2 status-check metric go missing, and with CloudWatch's default missing-data handling the alarm moves to INSUFFICIENT_DATA rather than ALARM, so a StatusCheckFailed alarm would not fire on a terminate. The ASG metric drops to 0 when the instance leaves service and stays at 0 until the replacement is in service, which is the signal we actually care about.
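The effect of the missing-data setting can be illustrated with a deliberately simplified model of alarm evaluation. This sketch only looks at the last N datapoints, whereas CloudWatch's real evaluation considers a longer evaluation range when data is missing; `alarm_state` is a hypothetical function for illustration, not an AWS API.

```python
from typing import Optional, Sequence


def alarm_state(
    datapoints: Sequence[Optional[float]],  # one value per 60s period; None = missing
    threshold: float = 1.0,                 # LessThanThreshold, threshold = 1
    evaluation_periods: int = 2,
    treat_missing_data: str = "breaching",
) -> str:
    """Simplified model: ALARM when every evaluated datapoint is below threshold."""
    if treat_missing_data not in ("breaching", "notBreaching", "missing"):
        raise ValueError(f"unsupported treat_missing_data: {treat_missing_data}")
    window = list(datapoints)[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    verdicts = []
    for value in window:
        if value is None:
            if treat_missing_data == "missing":
                continue  # datapoint is not evaluated at all
            verdicts.append(treat_missing_data == "breaching")
        else:
            verdicts.append(value < threshold)
    if not verdicts:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(verdicts) else "OK"
```

In this model a metric that simply disappears (`[1, None, None]`) reaches ALARM only when missing data is treated as breaching, while a metric that reports 0 (`[1, 1, 0, 0]`) reaches ALARM regardless of the missing-data setting.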

Paths

| path-name | input | output/expected state change | path-type | notes | updated |
|---|---|---|---|---|---|
| infraAlertDelivery.success | AlertInput after instance termination | AlertOutput delivered_within ≤ 3min | happy path | Reuses verified pipeline (validated end-to-end this session: webhook 204, lambda invoke 200, daily report posted) | Y |

3. Pseudocode / Technical Details for Critical Flows (Optional)

Cutover sequence

Pre-apply:
  1. terraform validate
  2. terraform plan -out=tfplan
  3. Inspect plan: confirm exactly 4 deletes (nat_gateway.main, eip.nat,
                   vpc_endpoint.ssm, vpc_endpoint.ec2);
                   2 creates (aws_eip.fck_nat,
                   aws_cloudwatch_metric_alarm.fck_nat_health)
                   + 1 module add (fck-nat: ASG, launch template, IAM role,
                     SG, static ENI, internal aws_route resource);
                   1 modify (aws_route_table.lambda_private — route block
                   removed, lifecycle.ignore_changes added).
  4. Schedule cutover for overnight low-traffic window (data shows 0 RDS
                   connections / 0 NAT bytes overnight).

Apply:
  5. Open `aws logs tail /aws/lambda/encache-IngestWindowFunction-* --follow`
     in side terminal.
  6. terraform apply tfplan
  7. Wait ~5min for: NAT GW destroy, fck-nat ASG launch, EIP attach via
     userdata, module's aws_route create on lambda_private (0/0 → ENI).

  Stale-route gotcha: the route_table previously held an inline
  `route { 0/0 → nat_gateway_id }`. After NAT GW destroy this entry
  becomes a blackhole entry but still occupies the 0/0 slot, and the
  module's aws_route create fails with `RouteAlreadyExists`. With
  `ignore_changes = [route]` Terraform will NOT remove that entry on
  its own. If `terraform apply` reports `RouteAlreadyExists` on
  `module.fck_nat.aws_route.main[0]`, run:
    aws ec2 delete-route \
      --route-table-id $RT_ID \
      --destination-cidr-block 0.0.0.0/0 \
      --profile encache-workload --region us-east-1
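  then re-run `terraform plan -out=tfplan` followed by
  `terraform apply tfplan`. The saved plan is stale after a partial
  apply and Terraform will refuse it; on the fresh plan the module's
  aws_route create succeeds.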

Post-apply (within 5min):
  8. Run smoke probes 1–7 (instance health, EIP, route, lambda egress,
     SSM call, gateway endpoints preserved, alarm wired).
  9. If any probe fails: git revert + terraform apply (~10–15min rollback,
     NAT GW provision is long pole).

Steady-state validation (ongoing):
 10. Monitor next billing cycle daily cost report — expect VPC line drop
     ~$21 → ~$3.
 11. Run failure injection probes 8–10 within first week to prove auto-heal
     + alert.
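The plan inspection in step 3 can be partly automated against the JSON form of the saved plan (`terraform show -json tfplan`). A minimal sketch, assuming the deleted resources have the full addresses listed in EXPECTED_DELETES; the plan text abbreviates them, so the `aws_` prefixes are an inference, and both function names are hypothetical.

```python
from typing import Dict, List

# Assumption: full resource addresses behind the abbreviated names in step 3.
EXPECTED_DELETES = frozenset({
    "aws_nat_gateway.main",
    "aws_eip.nat",
    "aws_vpc_endpoint.ssm",
    "aws_vpc_endpoint.ec2",
})


def group_by_action(plan: dict) -> Dict[str, List[str]]:
    """Group resource addresses by planned action, skipping no-op and read."""
    groups: Dict[str, List[str]] = {}
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue
        # Two actions (delete+create in either order) mean a replacement.
        key = "replace" if len(actions) > 1 else actions[0]
        groups.setdefault(key, []).append(rc["address"])
    return {key: sorted(addresses) for key, addresses in groups.items()}


def delete_mismatches(plan: dict, expected=EXPECTED_DELETES) -> List[str]:
    """Addresses destroyed but not expected, or expected but not destroyed."""
    groups = group_by_action(plan)
    destroyed = set(groups.get("delete", [])) | set(groups.get("replace", []))
    return sorted(destroyed ^ set(expected))
```

To use it, write the plan out with `terraform show -json tfplan > tfplan.json`, load the file with `json.load`, and stop the cutover if `delete_mismatches` returns anything. The create side is better reviewed by eye, since the module adds several resources of its own.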

Module wiring (Terraform sketch)

resource "aws_eip" "fck_nat" {
  domain = "vpc"
  tags   = { Name = "encache-fck-nat-eip" }
}

module "fck_nat" {
  # Pinned to v1.3.0 commit SHA for immutability (Checkov CKV_TF_1).
  source = "git::https://github.com/RaJiska/terraform-aws-fck-nat.git?ref=9377bf9247c96318b99273eb2978d1afce8cf0eb"

  name      = "encache-fck-nat"
  vpc_id    = data.aws_vpc.default.id
  subnet_id = data.aws_subnet.public_1a.id

  # ha_mode=true wraps the instance in an ASG of 1 for auto-heal. With a
  # single subnet_id this is still single-AZ — the "HA mode" label refers
  # to the ASG-vs-raw-instance toggle, not multi-AZ topology. ha_mode=false
  # would create a plain aws_instance with no auto-recovery.
  ha_mode            = true
  instance_type      = "t4g.nano"
  use_spot_instances = false

  # Stable source IP across ASG replacements — userdata calls
  # `aws ec2 associate-address` on every instance launch.
  eip_allocation_ids = [aws_eip.fck_nat.id]

  # NB: `use_default_security_group` / `additional_security_group_ids`
  # only affect the launch-template ENI on the running instance — they do
  # NOT touch the static data-path ENI (`aws_network_interface.main` in
  # the module), which Lambda traffic actually hits and which is
  # hard-wired to the module's default SG (VPC-CIDR ingress). Tightening
  # the data-path SG would require a fork or upstream PR. Out of scope.

  update_route_tables = true
  route_tables_ids = {
    lambda_private = aws_route_table.lambda_private.id
  }
}

resource "aws_cloudwatch_metric_alarm" "fck_nat_health" {
  alarm_name        = "encache-fck-nat-health"
  alarm_description = "fck-nat ASG has fewer than one in-service instance; NAT egress for VPC Lambdas is impaired."

  # GroupInServiceInstances drops to 0 the moment the ASG instance is
  # terminated/unhealthy and stays at 0 until the replacement is in service.
  # AWS/EC2 StatusCheckFailed would NOT fire for terminate-instances:
  # the metric goes missing once the instance disappears, and with the
  # default missing-data handling the alarm moves to INSUFFICIENT_DATA
  # rather than ALARM. ASG health is the right signal for
  # "no NAT available," so we alarm on that instead.
  metric_name         = "GroupInServiceInstances"
  namespace           = "AWS/AutoScaling"
  statistic           = "Minimum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 1
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "breaching"

  dimensions = {
    AutoScalingGroupName = module.fck_nat.name
  }

  alarm_actions = [aws_sns_topic.cost_alerts.arn]
  ok_actions    = [aws_sns_topic.cost_alerts.arn]
}

Verification probes (run after apply)

| # | Probe | Pass criteria |
|---|---|---|
| 1 | aws ec2 describe-instances --filters Name=tag:Name,Values=encache-fck-nat-* | state=running, status checks 2/2 |
| 2 | aws ec2 describe-addresses --query 'Addresses[?Tags[?Value==`encache-fck-nat-eip`]]' | AssociationId present (proves the dedicated aws_eip.fck_nat is attached to the running instance via module userdata) |
| 3 | aws ec2 describe-route-tables --route-table-ids $RT_ID | 0.0.0.0/0 route has NetworkInterfaceId = fck-nat ENI |
| 4 | Invoke IngestWindowFunction with synthetic payload | response 200, no APIConnectionError |
| 5 | Invoke chat handler (expect auth-fail not infra-fail) | error type = auth/validation, NOT EndpointConnectionError |
| 6 | aws ec2 describe-route-tables --route-table-ids $RT_ID --query 'RouteTables[].Routes[?DestinationPrefixListId].[DestinationPrefixListId,GatewayId]' | Two prefix-list routes returned (S3 + DynamoDB), each with GatewayId=vpce-* matching aws_vpc_endpoint.s3.id / aws_vpc_endpoint.dynamodb.id. Confirms gateway endpoints survived the route-table edit. (VPC Flow Logs are not provisioned by this stack; out of scope to add here.) |
| 7 | aws cloudwatch describe-alarms --alarm-names encache-fck-nat-health | MetricName=GroupInServiceInstances, Namespace=AWS/AutoScaling, ComparisonOperator=LessThanThreshold, Threshold=1, AlarmActions contains encache-cost-alerts SNS ARN |
| 8 | aws ec2 terminate-instances --instance-ids $ID | Within 5min: new instance launched by ASG, EIP re-attached, route table updated to new ENI, Lambda egress works again |
| 9 | After probe 8, while ASG count is 0 | Alarm transitions OK → ALARM within ~2min; Discord channel receives notification via encache-cost-alerts SNS → encache-discord-alerts Lambda. After replacement is InService, alarm transitions ALARM → OK and Discord receives a recovery notification. |
| 10 | After probe 8 | EIP re-associated to new instance |
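Probes 3 and 6 both read the same describe-route-tables output, so their pass criteria can be checked together. A sketch that operates on one entry of the `RouteTables` array from the CLI's JSON output; `check_route_table` is a hypothetical helper, and the ENI ID is passed in by the caller.

```python
from typing import List


def check_route_table(route_table: dict, fck_nat_eni_id: str) -> List[str]:
    """Return a list of problems; an empty list means probes 3 and 6 pass."""
    problems: List[str] = []
    routes = route_table.get("Routes", [])

    # Probe 3: exactly one default route, active, pointing at the fck-nat ENI.
    default = [r for r in routes if r.get("DestinationCidrBlock") == "0.0.0.0/0"]
    if len(default) != 1:
        problems.append(f"expected 1 default route, found {len(default)}")
    else:
        route = default[0]
        if route.get("State") == "blackhole":
            problems.append("default route is a blackhole (stale NAT Gateway entry?)")
        elif route.get("NetworkInterfaceId") != fck_nat_eni_id:
            problems.append("default route does not target the fck-nat ENI")

    # Probe 6: the S3 and DynamoDB Gateway endpoint routes are still present.
    gateway = [
        r for r in routes
        if r.get("DestinationPrefixListId")
        and str(r.get("GatewayId", "")).startswith("vpce-")
    ]
    if len(gateway) != 2:
        problems.append(f"expected 2 gateway endpoint routes, found {len(gateway)}")
    return problems
```

The blackhole branch also catches the stale-route condition described in the cutover sequence, where the old NAT Gateway entry still occupies the 0.0.0.0/0 slot.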

Implementation notes

  • Module pin: git source at v1.3.0 commit SHA 9377bf9247c96318b99273eb2978d1afce8cf0eb (immutable; satisfies Checkov CKV_TF_1, prevents accidental upgrades on terraform init -upgrade). v1.4.0 requires AWS provider >= 6.0.0 and this stack pins ~> 5.0, so bumping the module requires a coordinated provider upgrade. When upgrading, update both the ref SHA in source and the version comment above it.
  • use_spot_instances = false — soft-prod rejects surprise interruption.
  • lifecycle { ignore_changes = [route] } on aws_route_table.lambda_private is required so the module can mutate the 0/0 route on instance replacement without TF reverting.
  • Source IP changes to new EIP — confirmed no 3rd party (OpenAI, Anthropic, Groq, GPU worker) allowlists current EIP.
  • Single-AZ accepted; multi-AZ HA mode (~$6/mo extra) defeats most savings and matches existing PR #339 single-AZ posture. Recovery from a us-east-1a outage requires a code change first (add data "aws_subnet" "public_1b" or _1c and update module.fck_nat.subnet_id to it) — there is no public-subnet data source for those AZs in the current main.tf. This is the operator runbook for the fckNatAutoHeal.az-outage flow.
  • Health alarm uses AWS/AutoScaling GroupInServiceInstances < 1, not AWS/EC2 StatusCheckFailed. Terminating an EC2 instance makes the EC2 metric go missing; with default missing-data handling the alarm would move to INSUFFICIENT_DATA rather than ALARM, so it would not fire. ASG GroupInServiceInstances drops to 0 at terminate and is the correct signal for "no NAT available."
  • SG scoping limitation: the fck-nat module v1.3.x creates a static data-path ENI (aws_network_interface.main inside the module) that is hard-coded to use the module's default SG (ingress = full VPC CIDR). The module's use_default_security_group and additional_security_group_ids inputs only affect the launch-template ENI, which faces the public internet (no inbound initiation). Tightening the data-path SG to "Lambda SG only" would require forking the module or upstreaming a PR. Accepted for this migration: VPC-CIDR ingress on the data path is safe in this stack because only the Lambda private subnets route 0/0 to fck-nat. Flagged for future hardening.
  • Rollback: git revert <commit> + terraform apply recreates managed NAT + endpoints in ~10–15min.

After all stages are approved, apply .agent/skills/reconcile-plans/SKILL.md to propagate contract updates across linked plans. No existing plan in docs/plans/ references the NAT/endpoint resources directly, so reconciliation is expected to be a no-op other than confirming this plan is linkable.