Discord Ops

Metadata

  • System type: service

System Intent

  • What this is: Two Lambda functions that deliver AWS operational alerts to Discord and accept Discord slash commands for alert management. DiscordAlertsFunction forwards SNS alarm notifications and EC2 state-change events to a Discord webhook, with mute-gate support backed by an SSM parameter. DiscordInteractionFunction handles Discord slash commands including /cost, /status, /mute-alerts, and /unmute-alerts. A CloudWatch metric alarm on the IngestDLQ depth fires through the existing cost_alerts SNS topic into DiscordAlertsFunction when the ingest backlog exceeds 100 messages for 10 minutes.

Mermaid Diagram

flowchart TD
  DLQAlarm["aws_cloudwatch_metric_alarm\nencache-ingest-dlq-backlog\n(>100 msgs for 10min)"]
  SNS["aws_sns_topic.cost_alerts"]
  AlertsLambda["DiscordAlertsFunction\ndiscord_alerts/handler.py"]
  MuteSSM["/encache/alerts/dlq_muted_until\n(SSM String — unix timestamp)"]
  Webhook["Discord Webhook\nDISCORD_WEBHOOK_URL"]
  InteractionLambda["DiscordInteractionFunction\ndiscord_interaction/handler.py"]
  Operator["Operator (Discord)"]

  DLQAlarm -->|"breach / ok"| SNS
  SNS -->|"SNS record"| AlertsLambda
  AlertsLambda -->|"GetParameter"| MuteSSM
  MuteSSM -->|"muted_until > now: suppress"| AlertsLambda
  AlertsLambda -->|"not muted: POST"| Webhook
  Webhook --> Operator
  Operator -->|"/mute-alerts [duration]"| InteractionLambda
  Operator -->|"/unmute-alerts"| InteractionLambda
  InteractionLambda -->|"PutParameter"| MuteSSM

Flows

Flow: discordAlerts.ingestDlqAlarm

  • Core files: main/devops/lambdas/discord_alerts/handler.py
  • Test files: main/devops/lambdas/discord_interaction/test_handler.py (test_alert_suppressed_when_muted, test_alert_fires_when_unmuted, test_alert_fires_when_param_missing)

Types

SNSRecord {
  EventSource: "aws:sns"
  Sns.Subject: string
  Sns.Message: string
}

EC2StateChangeEvent {
  source: "aws.ec2"
  detail.instance-id: string
  detail.state: "running" | "stopped" | "terminated"
}

Paths

path | input | output | path-type | notes
--- | --- | --- | --- | ---
discordAlerts.snsAlert.muted | SNSRecord | HTTP 200, no webhook POST | happy path | muted_until SSM param holds future unix timestamp
discordAlerts.snsAlert.unmuted | SNSRecord | POST to Discord webhook | happy path | SSM param absent, "0", or past timestamp
discordAlerts.ec2StateChange.muted | EC2StateChangeEvent | HTTP 200, no webhook POST | happy path | mute gate same as SNS path
discordAlerts.ec2StateChange.unmuted | EC2StateChangeEvent | POST to Discord webhook | happy path | emoji based on state (running/stopped/terminated)

Pseudocode

lambda_handler(event):
  if "Records" in event (SNS):
    for record in Records:
      if _is_muted(): return 200
      post(f"**{subject}**\n```{msg}```")
    return 200

  if event.source == "aws.ec2":
    if _is_muted(): return 200
    post(f"{emoji} EC2 {instance_id} is now {state}")
    return 200

_is_muted():
  try:
    muted_until = int(ssm.get_parameter("/encache/alerts/dlq_muted_until"))
    return muted_until > now_unix()
  except ParameterNotFound:
    return False   # default-open: alert fires
  except Exception:
    return False   # fail open on SSM errors
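
The mute gate above can be sketched in runnable form. This is a minimal illustration, not the handler's actual code: the function name is_muted, the injected ssm argument, and the now override are assumptions made so the logic can be exercised without AWS access. The response shape ({"Parameter": {"Value": ...}}) is what boto3's SSM client returns from get_parameter.

```python
import time

# Parameter name taken from this document.
MUTE_PARAM = "/encache/alerts/dlq_muted_until"


def is_muted(ssm, now=None):
    """Return True only when the stored unix timestamp lies in the future.

    `ssm` is any object with a boto3-style get_parameter(Name=...) method
    (hypothetical injection point; the real handler may use a module-level client).
    """
    now = int(time.time()) if now is None else now
    try:
        value = ssm.get_parameter(Name=MUTE_PARAM)["Parameter"]["Value"]
        return int(value) > now
    except Exception:
        # Missing parameter, SSM errors, and unparsable values all fail open,
        # so the alert is still delivered.
        return False
```

Collapsing both except branches into one keeps the fail-open behaviour identical while also covering a non-numeric parameter value, which the pseudocode handles through its generic Exception branch.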

Flow: discordInteraction.muteAlerts

  • Core files: main/devops/lambdas/discord_interaction/handler.py (_handle_mute_alerts)
  • Test files: main/devops/lambdas/discord_interaction/test_handler.py (test_mute_alerts_*)

Types

SlashCommandBody {
  type: 2
  data.name: "mute-alerts" | "unmute-alerts" | "cost" | "status" | "help"
  data.options[0].value: string (optional duration for mute-alerts)
  token: string (interaction token)
}

InteractionResponse {
  statusCode: 200
  body: JSON{ type: 4, data: { content: string } }  // inline response
       JSON{ type: 5 }                               // deferred response
}
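
As a rough sketch of the InteractionResponse shape above, the two response variants could be built as follows. The helper names inline_response and deferred_response are hypothetical and do not appear in the handler as documented; only the status code, type values, and body layout come from the type definition.

```python
import json

INLINE = 4    # respond immediately with message content
DEFERRED = 5  # acknowledge now, edit the message later


def inline_response(content):
    """Lambda return value for an inline (type 4) reply."""
    body = {"type": INLINE, "data": {"content": content}}
    return {"statusCode": 200, "body": json.dumps(body)}


def deferred_response():
    """Lambda return value for a deferred (type 5) acknowledgement."""
    return {"statusCode": 200, "body": json.dumps({"type": DEFERRED})}
```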

Paths

path | input | output | path-type | notes
--- | --- | --- | --- | ---
muteAlerts.defaultDuration | /mute-alerts (no arg) | SSM written now + 86400; respond inline | happy path | Default is 1 day
muteAlerts.customDuration | /mute-alerts 4h | SSM written now + 14400; respond inline | happy path | Accepted units: m, h, d
muteAlerts.invalidDuration | /mute-alerts banana | Error message; SSM not written | error | _parse_duration returns None for invalid format
unmute-alerts | /unmute-alerts | SSM written "0"; respond "Alerts active" | happy path | Immediately re-enables alerts

Pseudocode

_parse_duration(s):
  match = re.match(r'^(\d+)([mhd])$', s.strip())
  if not match: return None
  value, unit = match.groups()
  return int(value) * {"m": 60, "h": 3600, "d": 86400}[unit]

_handle_mute_alerts(body):
  duration_str = options[0].value if options else "1d"
  seconds = _parse_duration(duration_str)
  if seconds is None: return error response (SSM not touched)
  until = now_unix() + seconds
  ssm.put_parameter("/encache/alerts/dlq_muted_until", str(until), Overwrite=True)
  log alert_muted {duration, muted_until}
  return respond(f"Alerts muted until {human_readable_utc} ({duration_str})")

_handle_unmute_alerts(body):
  ssm.put_parameter("/encache/alerts/dlq_muted_until", "0", Overwrite=True)
  log alert_unmuted
  return respond("Alerts active — DLQ backlog alarms will now fire.")
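
The duration arithmetic in the paths table can be checked with a small runnable sketch. The regex and unit multipliers are taken from the pseudocode above; the muted_until helper and its signature are assumptions added for illustration, and the SSM write is left out.

```python
import re

_UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def parse_duration(text):
    """Convert strings such as '30m', '4h', or '1d' to seconds; None if invalid."""
    match = re.match(r"^(\d+)([mhd])$", text.strip())
    if not match:
        return None
    value, unit = match.groups()
    # groups() yields strings, so the count must be converted before multiplying.
    return int(value) * _UNIT_SECONDS[unit]


def muted_until(now, duration_str="1d"):
    """Unix timestamp the mute would expire at, or None for an invalid duration."""
    seconds = parse_duration(duration_str)
    return None if seconds is None else now + seconds
```

For example, parse_duration("4h") gives 14400 and parse_duration("banana") gives None, matching the customDuration and invalidDuration rows. Note that the pattern is case-sensitive and accepts "0d", which would write a timestamp equal to now and therefore not mute anything.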

Flow: discordInteraction.deferredCommands

  • Core files: main/devops/lambdas/discord_interaction/handler.py (_handle_slash_command, _handle_async_followup)

/cost and /status use a deferred interaction pattern to stay within Discord's 3-second response deadline:

  1. Lambda returns type-5 (deferred) immediately.
  2. Lambda self-invokes asynchronously with _async_followup: true.
  3. The async invocation calls build_report() and PATCHes the original interaction via Discord's webhook API (/webhooks/{APP_ID}/{token}/messages/@original), retrying up to 3 times on 404.
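
The follow-up PATCH with retry on 404 might look roughly like the sketch below. Several things here are assumptions: the function names, the injected send callable (standing in for whatever HTTP client the handler uses), the delay between attempts, the v10 API base URL, and the reading of "up to 3 times" as three attempts in total.

```python
import time


def followup_url(app_id, token):
    """URL of the original interaction message (API version is an assumption)."""
    return f"https://discord.com/api/v10/webhooks/{app_id}/{token}/messages/@original"


def patch_with_retry(send, url, payload, attempts=3, delay=0.5):
    """Call send(url, payload) until it stops returning 404, at most `attempts` times.

    `send` is a hypothetical callable that performs the PATCH and returns
    the HTTP status code. The last status seen is returned to the caller.
    """
    status = None
    for attempt in range(attempts):
        status = send(url, payload)
        if status != 404:
            break
        if attempt < attempts - 1:
            time.sleep(delay)  # give Discord time to register the deferred message
    return status
```

Injecting send keeps the retry loop testable without network access; only a 404 triggers another attempt, so other failures surface immediately.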

/help responds inline (type 4) with a command listing including the new /mute-alerts and /unmute-alerts commands.

Logs

Source | Location
--- | ---
DiscordAlertsFunction | CloudWatch: /aws/lambda/encache-discord-alerts
DiscordInteractionFunction | CloudWatch: /aws/lambda/encache-discord-interaction

Structured log steps:

Step key | Where | Key fields
--- | --- | ---
alert_suppressed | discord_alerts._is_muted | muted_until
alert_muted | discord_interaction._handle_mute_alerts | duration, muted_until
alert_unmuted | discord_interaction._handle_unmute_alerts | (none)

CloudWatch Alarm

aws_cloudwatch_metric_alarm.ingest_dlq_backlog (defined in main/devops/main.tf):

Property | Value
--- | ---
Alarm name | encache-ingest-dlq-backlog
Metric | AWS/SQS ApproximateNumberOfMessagesVisible for queue encache-ingest-dlq
Statistic | Maximum
Period | 300s (5 min)
Evaluation periods | 2 (alarm fires after 10 consecutive minutes in breach)
Threshold | > 100 messages
Missing data | treated as not breaching
Alarm + OK actions | aws_sns_topic.cost_alerts

The alarm publishes to the SNS topic on both ALARM and OK transitions, so operators also see the all-clear when the queue drains.
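
To make the "2 periods of 300s above 100" rule concrete, here is a simplified model of the evaluation. It is an illustration only: CloudWatch's real evaluation has more states and rules (for example INSUFFICIENT_DATA), and the function name and argument layout are hypothetical.

```python
def alarm_state(datapoints, threshold=100, evaluation_periods=2):
    """Simplified alarm evaluation.

    datapoints: per-period Maximum values, oldest first; None means the
    period had no data, which is treated as not breaching.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "OK"
    breaching = [value is not None and value > threshold for value in recent]
    return "ALARM" if all(breaching) else "OK"
```

With 5-minute periods, two consecutive breaching datapoints correspond to the 10 minutes stated in the table. The comparison is strict, so a queue depth of exactly 100 does not breach.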

SSM Parameters

Parameter | Type | Purpose | Writer | Reader
--- | --- | --- | --- | ---
/encache/alerts/dlq_muted_until | String (unix timestamp) | Mute gate for Discord alerts | DiscordInteractionFunction | DiscordAlertsFunction
/encache/discord/webhook_url | SecureString | Discord webhook URL | manual | DiscordAlertsFunction (via DISCORD_WEBHOOK_URL env)
/encache/discord/public_key | String | Discord Ed25519 public key for request verification | manual | DiscordInteractionFunction (via DISCORD_PUBLIC_KEY env)
/encache/discord/app_id | String | Discord application ID | manual | DiscordInteractionFunction (via DISCORD_APP_ID env)

The /encache/alerts/dlq_muted_until parameter is absent by default. _is_muted() treats a missing parameter the same as "0" (not muted), so alerts fire.

IAM:

  • DiscordInteractionFunction role: ssm:PutParameter on /encache/alerts/*
  • DiscordAlertsFunction role: ssm:GetParameter on /encache/alerts/*

Deployment

  • Mechanism: Terraform (main/devops/main.tf) packages and deploys the Discord Lambdas (via archive_file + aws_lambda_function) and the DLQ backlog alarm. There is no SAM template in main/devops/.
  • Deploy command:
    cd main/devops && terraform apply
    
  • Notes: Deploy DiscordAlertsFunction (mute-gate change) before the DLQ alarm goes live. The handler is safe when /encache/alerts/dlq_muted_until is absent — it fails open and fires the alert.