Discord Ops
Metadata
- System type: service
System Intent
- What this is: Two Lambda functions that deliver AWS operational alerts to Discord and accept Discord slash commands for alert management.
DiscordAlertsFunction forwards SNS alarm notifications and EC2 state-change events to a Discord webhook, with mute-gate support backed by an SSM parameter. DiscordInteractionFunction handles Discord slash commands including /cost, /status, /mute-alerts, and /unmute-alerts. A CloudWatch metric alarm on the IngestDLQ depth fires through the existing cost_alerts SNS topic into DiscordAlertsFunction when the ingest backlog exceeds 100 messages for 10 minutes.
Mermaid Diagram
flowchart TD
DLQAlarm["aws_cloudwatch_metric_alarm\nencache-ingest-dlq-backlog\n(>100 msgs for 10min)"]
SNS["aws_sns_topic.cost_alerts"]
AlertsLambda["DiscordAlertsFunction\ndiscord_alerts/handler.py"]
MuteSSM["/encache/alerts/dlq_muted_until\n(SSM String — unix timestamp)"]
Webhook["Discord Webhook\nDISCORD_WEBHOOK_URL"]
InteractionLambda["DiscordInteractionFunction\ndiscord_interaction/handler.py"]
Operator["Operator (Discord)"]
DLQAlarm -->|"breach / ok"| SNS
SNS -->|"SNS record"| AlertsLambda
AlertsLambda -->|"GetParameter"| MuteSSM
MuteSSM -->|"muted_until > now: suppress"| AlertsLambda
AlertsLambda -->|"not muted: POST"| Webhook
Webhook --> Operator
Operator -->|"/mute-alerts [duration]"| InteractionLambda
Operator -->|"/unmute-alerts"| InteractionLambda
InteractionLambda -->|"PutParameter"| MuteSSM
Flows
Flow: discordAlerts.ingestDlqAlarm
- Core files: main/devops/lambdas/discord_alerts/handler.py
- Test files: main/devops/lambdas/discord_interaction/test_handler.py (test_alert_suppressed_when_muted, test_alert_fires_when_unmuted, test_alert_fires_when_param_missing)
Types
SNSRecord {
    EventSource: "aws:sns"
    Sns.Subject: string
    Sns.Message: string
}
EC2StateChangeEvent {
    source: "aws.ec2"
    detail.instance-id: string
    detail.state: "running" | "stopped" | "terminated"
}
Paths
| path | input | output | path-type | notes |
|---|---|---|---|---|
| discordAlerts.snsAlert.muted | SNSRecord | HTTP 200, no webhook POST | happy path | muted_until SSM param holds future unix timestamp |
| discordAlerts.snsAlert.unmuted | SNSRecord | POST to Discord webhook | happy path | SSM param absent, "0", or past timestamp |
| discordAlerts.ec2StateChange.muted | EC2StateChangeEvent | HTTP 200, no webhook POST | happy path | mute gate same as SNS path |
| discordAlerts.ec2StateChange.unmuted | EC2StateChangeEvent | POST to Discord webhook | happy path | emoji based on state (running/stopped/terminated) |
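The muted and unmuted rows above come down to one comparison between the stored timestamp and the current time. The following is a minimal sketch of that decision as a pure function; the name `is_muted` and its signature are illustrative assumptions, not the handler's actual code, and `None` stands in for an absent SSM parameter.

```python
import time
from typing import Optional


def is_muted(raw_value: Optional[str], now: Optional[int] = None) -> bool:
    """Return True only when raw_value holds a unix timestamp in the future.

    raw_value is None when the SSM parameter is absent (assumed convention).
    """
    if now is None:
        now = int(time.time())
    if raw_value is None:
        return False  # parameter absent: alert fires
    try:
        return int(raw_value) > now
    except ValueError:
        return False  # unparseable value: fail open, alert fires


now = 1_700_000_000
print(is_muted(None, now))             # parameter absent
print(is_muted("0", now))              # explicitly unmuted
print(is_muted(str(now - 60), now))    # mute window already expired
print(is_muted(str(now + 3600), now))  # muted for another hour
```

Only the last call reports a muted state; every other input lets the alert through.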
Pseudocode
lambda_handler(event):
    if "Records" in event (SNS):
        for record in Records:
            if _is_muted(): return 200
            post(f"**{subject}**\n```{msg}```")
        return 200
    if event.source == "aws.ec2":
        if _is_muted(): return 200
        post(f"{emoji} EC2 {instance_id} is now {state}")
        return 200
_is_muted():
    try:
        muted_until = int(ssm.get_parameter("/encache/alerts/dlq_muted_until"))
        return muted_until > now_unix()
    except ParameterNotFound:
        return False # default-open: alert fires
    except Exception:
        return False # fail open on SSM errors
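As a runnable illustration of the dispatch described by the pseudocode, the sketch below injects the mute check and the webhook call as plain callables so it can run without AWS or Discord. The function name `handle_event`, the `STATE_EMOJI` mapping, and the specific emoji are assumptions made for this example; the real handler's choices are not shown in this document.

```python
from typing import Callable, Dict, List

# Hypothetical emoji mapping, chosen only for illustration.
STATE_EMOJI = {"running": "🟢", "stopped": "🟡", "terminated": "🔴"}

# Built at runtime so the example itself contains no literal code fence.
FENCE = "`" * 3


def handle_event(event: Dict, is_muted: Callable[[], bool], post: Callable[[str], None]) -> int:
    """Route an SNS or EC2 state-change event to `post` unless alerts are muted."""
    if "Records" in event:  # SNS delivery
        for record in event["Records"]:
            if is_muted():
                return 200
            sns = record["Sns"]
            post(f"**{sns['Subject']}**\n{FENCE}{sns['Message']}{FENCE}")
        return 200
    if event.get("source") == "aws.ec2":
        if is_muted():
            return 200
        detail = event["detail"]
        emoji = STATE_EMOJI.get(detail["state"], "")
        post(f"{emoji} EC2 {detail['instance-id']} is now {detail['state']}")
    return 200


sent: List[str] = []
sns_event = {"Records": [{"EventSource": "aws:sns",
                          "Sns": {"Subject": "ALARM", "Message": "backlog"}}]}
handle_event(sns_event, is_muted=lambda: False, post=sent.append)  # posts once
handle_event(sns_event, is_muted=lambda: True, post=sent.append)   # suppressed
print(len(sent))
```

Passing `is_muted` and `post` as parameters mirrors how the three alert tests listed above could exercise both branches without network access.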
Flow: discordInteraction.muteAlerts
- Core files: main/devops/lambdas/discord_interaction/handler.py → _handle_mute_alerts
- Test files: main/devops/lambdas/discord_interaction/test_handler.py (test_mute_alerts_*)
Types
SlashCommandBody {
    type: 2
    data.name: "mute-alerts" | "unmute-alerts" | "cost" | "status" | "help"
    data.options[0].value: string (optional duration for mute-alerts)
    token: string (interaction token)
}
InteractionResponse {
    statusCode: 200
    body: JSON{ type: 4, data: { content: string } } // inline response
        | JSON{ type: 5 } // deferred response
}
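To make the SlashCommandBody shape concrete, here is a small sketch that pulls the command name and the optional first option value out of a parsed body. The helper name `command_and_argument` and the sample token are hypothetical, and the sample payloads carry only the fields listed in the type above.

```python
import json
from typing import Dict, Optional, Tuple


def command_and_argument(body: Dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (data.name, data.options[0].value), with None for a missing option."""
    data = body.get("data") or {}
    options = data.get("options") or []
    argument = options[0].get("value") if options else None
    return data.get("name"), argument


raw = json.dumps({"type": 2,
                  "data": {"name": "mute-alerts", "options": [{"value": "4h"}]},
                  "token": "example-token"})
print(command_and_argument(json.loads(raw)))
print(command_and_argument({"type": 2,
                            "data": {"name": "unmute-alerts"},
                            "token": "example-token"}))
```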
Paths
| path | input | output | path-type | notes |
|---|---|---|---|---|
| muteAlerts.defaultDuration | /mute-alerts (no arg) | SSM written now + 86400; respond inline | happy path | Default is 1 day |
| muteAlerts.customDuration | /mute-alerts 4h | SSM written now + 14400; respond inline | happy path | Accepted units: m, h, d |
| muteAlerts.invalidDuration | /mute-alerts banana | Error message; SSM not written | error | _parse_duration returns None for invalid format |
| unmute-alerts | /unmute-alerts | SSM written "0"; respond "Alerts active" | happy path | Immediately re-enables alerts |
Pseudocode
_parse_duration(s):
    match = re.match(r'^(\d+)([mhd])$', s.strip())
    if not match: return None
    value, unit = match.groups()
    return int(value) * {m:60, h:3600, d:86400}[unit]
_handle_mute_alerts(body):
    duration_str = options[0].value if options else "1d"
    seconds = _parse_duration(duration_str)
    if seconds is None: return error response (SSM not touched)
    until = now_unix() + seconds
    ssm.put_parameter("/encache/alerts/dlq_muted_until", str(until), Overwrite=True)
    log alert_muted {duration, muted_until}
    return respond(f"Alerts muted until {human_readable_utc} ({duration_str})")
_handle_unmute_alerts(body):
    ssm.put_parameter("/encache/alerts/dlq_muted_until", "0", Overwrite=True)
    log alert_unmuted
    return respond("Alerts active — DLQ backlog alarms will now fire.")
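The duration arithmetic in the Paths table can be checked with a runnable version of the parser. This is a sketch that follows the pseudocode's regex and unit table; the names `parse_duration` and `muted_until` are illustrative stand-ins and may not match the handler's private helpers.

```python
import re
from typing import Optional

_UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def parse_duration(text: str) -> Optional[int]:
    """Convert '<digits><m|h|d>' to seconds, or return None for any other input."""
    match = re.match(r"^(\d+)([mhd])$", text.strip())
    if not match:
        return None
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


def muted_until(duration: str, now: int) -> Optional[int]:
    """Timestamp that would be written to the mute parameter, or None if invalid."""
    seconds = parse_duration(duration)
    return None if seconds is None else now + seconds


print(parse_duration("1d"))      # 86400, the default duration
print(parse_duration("4h"))      # 14400
print(parse_duration("banana"))  # None, so nothing is written
print(muted_until("4h", 1_700_000_000))  # 1700014400
```

Returning `None` rather than raising keeps the invalid-duration path a simple early return before any SSM write.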
Flow: discordInteraction.deferredCommands
- Core files: main/devops/lambdas/discord_interaction/handler.py → _handle_slash_command, _handle_async_followup
/cost and /status use a deferred interaction pattern to stay within Discord's 3-second response deadline:
1. Lambda returns type-5 (deferred) immediately.
2. Lambda self-invokes asynchronously with _async_followup: true.
3. The async invocation calls build_report() and PATCHes the original interaction via Discord's webhook API (/webhooks/{APP_ID}/{token}/messages/@original), retrying up to 3 times on 404.
/help responds inline (type 4) with a command listing including the new /mute-alerts and /unmute-alerts commands.
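The deferred pattern for /cost and /status can be sketched as follows. The example builds the type-5 and type-4 response envelopes, derives the follow-up URL, and retries on 404. It makes several assumptions: the base URL and API version in `DISCORD_API`, all function names, and the reading of "up to 3 times" as three attempts in total. The HTTP call is injected as a callable returning a status code, so nothing here contacts Discord.

```python
import json
import time
from typing import Callable, Dict

DISCORD_API = "https://discord.com/api/v10"  # assumed base URL and API version


def deferred_response() -> Dict:
    """Type-5 acknowledgement returned immediately to stay within the deadline."""
    return {"statusCode": 200, "body": json.dumps({"type": 5})}


def inline_response(content: str) -> Dict:
    """Type-4 response carrying the message content directly."""
    return {"statusCode": 200,
            "body": json.dumps({"type": 4, "data": {"content": content}})}


def original_message_url(app_id: str, token: str) -> str:
    """URL the async follow-up PATCHes to fill in the deferred message."""
    return f"{DISCORD_API}/webhooks/{app_id}/{token}/messages/@original"


def patch_with_retry(send: Callable[[], int], attempts: int = 3, delay: float = 0.0) -> int:
    """Call send() until it returns a status other than 404, at most `attempts` times."""
    status = 404
    for attempt in range(attempts):
        status = send()
        if status != 404:
            break
        if attempt < attempts - 1:
            time.sleep(delay)
    return status


statuses = iter([404, 404, 200])  # simulated: message becomes visible on the third try
print(patch_with_retry(lambda: next(statuses)))
print(original_message_url("APP_ID", "TOKEN"))
```

The retry exists because the follow-up can race the deferred acknowledgement: until Discord has registered the original response, a PATCH against it may return 404.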
Logs
| Source | Location |
|---|---|
| DiscordAlertsFunction | CloudWatch: /aws/lambda/encache-discord-alerts |
| DiscordInteractionFunction | CloudWatch: /aws/lambda/encache-discord-interaction |
Structured log steps:
| Step key | Where | Key fields |
|---|---|---|
| alert_suppressed | discord_alerts._is_muted | muted_until |
| alert_muted | discord_interaction._handle_mute_alerts | duration, muted_until |
| alert_unmuted | discord_interaction._handle_unmute_alerts | (none) |
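One possible shape for these structured log lines is a single JSON object per step, which keeps them easy to filter in CloudWatch Logs. The sketch below is an assumption about format only: the `step` key name, the `log_step` helper, and the logger name are hypothetical, while the step keys and field names come from the table above.

```python
import json
import logging

logger = logging.getLogger("discord_ops")  # hypothetical logger name


def log_step(step: str, **fields) -> str:
    """Emit one JSON log line for a step and return it for inspection."""
    line = json.dumps({"step": step, **fields}, sort_keys=True)
    logger.info(line)
    return line


print(log_step("alert_muted", duration="4h", muted_until=1700014400))
print(log_step("alert_suppressed", muted_until=1700014400))
print(log_step("alert_unmuted"))
```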
CloudWatch Alarm
aws_cloudwatch_metric_alarm.ingest_dlq_backlog (defined in main/devops/main.tf):
| Property | Value |
|---|---|
| Alarm name | encache-ingest-dlq-backlog |
| Metric | AWS/SQS ApproximateNumberOfMessagesVisible for queue encache-ingest-dlq |
| Statistic | Maximum |
| Period | 300s (5 min) |
| Evaluation periods | 2 (alarm fires after 10 consecutive minutes in breach) |
| Threshold | > 100 messages |
| Missing data | treated as not breaching |
| Alarm + OK actions | aws_sns_topic.cost_alerts |
The alarm notifies aws_sns_topic.cost_alerts on both ALARM and OK transitions, so operators also see the all-clear when the queue drains.
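The relationship between period, evaluation periods, and threshold can be illustrated with a deliberately simplified model: two consecutive 300-second periods whose Maximum exceeds 100 put the alarm into breach, and a missing datapoint counts as not breaching. This is a sketch for intuition only; CloudWatch's actual evaluation of missing data is more involved, and the function name `in_alarm` is an assumption.

```python
from typing import List, Optional

PERIOD_SECONDS = 300
EVALUATION_PERIODS = 2
THRESHOLD = 100


def in_alarm(period_maximums: List[Optional[int]]) -> bool:
    """True when the most recent EVALUATION_PERIODS datapoints all exceed THRESHOLD.

    None marks a missing datapoint, treated here as not breaching.
    """
    recent = period_maximums[-EVALUATION_PERIODS:]
    if len(recent) < EVALUATION_PERIODS:
        return False
    return all(value is not None and value > THRESHOLD for value in recent)


print(PERIOD_SECONDS * EVALUATION_PERIODS // 60)  # 10 minutes in breach before firing
print(in_alarm([50, 150, 120]))   # True: last two periods both above 100
print(in_alarm([150, 90]))        # False: latest period recovered
print(in_alarm([150, None]))      # False: missing data is not breaching
print(in_alarm([100, 101]))       # False: 100 is not strictly greater than 100
```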
SSM Parameters
| Parameter | Type | Purpose | Writer | Reader |
|---|---|---|---|---|
| /encache/alerts/dlq_muted_until | String (unix timestamp) | Mute gate for Discord alerts | DiscordInteractionFunction | DiscordAlertsFunction |
| /encache/discord/webhook_url | SecureString | Discord webhook URL | manual | DiscordAlertsFunction (via DISCORD_WEBHOOK_URL env) |
| /encache/discord/public_key | String | Discord Ed25519 public key for request verification | manual | DiscordInteractionFunction (via DISCORD_PUBLIC_KEY env) |
| /encache/discord/app_id | String | Discord application ID | manual | DiscordInteractionFunction (via DISCORD_APP_ID env) |
The /encache/alerts/dlq_muted_until parameter defaults to absent, which _is_muted() treats as "0" (never muted).
IAM:
- DiscordInteractionFunction role: ssm:PutParameter on /encache/alerts/*
- DiscordAlertsFunction role: ssm:GetParameter on /encache/alerts/*
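For reference, the two grants above could be expressed as IAM policy documents along these lines. This is a sketch, not the Terraform in main/devops/main.tf: the helper name `ssm_policy` is hypothetical, and the region and account ID in the usage lines are placeholder values.

```python
import json


def ssm_policy(action: str, region: str, account_id: str) -> dict:
    """Build a single-statement policy scoped to the /encache/alerts/* parameters."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": action,
            "Resource": f"arn:aws:ssm:{region}:{account_id}:parameter/encache/alerts/*",
        }],
    }


# Placeholder region and account ID, for illustration only.
writer = ssm_policy("ssm:PutParameter", "us-east-1", "123456789012")  # interaction role
reader = ssm_policy("ssm:GetParameter", "us-east-1", "123456789012")  # alerts role
print(json.dumps(writer, indent=2))
```

Splitting write and read access this way means only the slash-command Lambda can change the mute state, while the alerting Lambda can only observe it.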
Deployment
- Mechanism: Terraform (main/devops/main.tf) packages and deploys the Discord Lambdas (via archive_file + aws_lambda_function) and the DLQ backlog alarm. There is no SAM template in main/devops/.
- Deploy command:
- Notes: Deploy DiscordAlertsFunction (mute-gate change) before the DLQ alarm goes live. The handler is safe when /encache/alerts/dlq_muted_until is absent: it fails open and fires the alert.