If you notice a delay between an event and the first notification, read this post: Understanding the delays on alerting.
Prometheus configuration
Prometheus reads alert rules from YAML files and evaluates them on every evaluation_interval cycle.
Keep scrape_interval and evaluation_interval consistent: if rules are evaluated more often than targets are scraped, successive evaluations just re-read the same samples, and if they are evaluated less often, alerts lag behind the data.
# prometheus.yml
global:
  scrape_interval: 20s
  # A short evaluation_interval will check alerting rules very often.
  # It can be costly if you run Prometheus with 100+ alerts.
  evaluation_interval: 20s
rule_files:
  - 'alerts/*.yml'
scrape_configs:
  # ...
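The comment about evaluation cost is measurable: Prometheus exports metrics about its own rule engine, so you can watch evaluation time grow as rule groups are added. The metric names below come from current Prometheus releases; verify them against the /metrics endpoint of your version.
# Wall-clock time of the most recent evaluation of each rule group
prometheus_rule_group_last_duration_seconds
# Rule evaluations that errored; this should stay at zero
rate(prometheus_rule_evaluation_failures_total[5m])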
# alerts/example-redis.yml
groups:
  - name: ExampleRedisGroup
    rules:
      - alert: ExampleRedisDown
        expr: redis_up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Redis instance down (instance {{ $labels.instance }})
          description: "Redis is unreachable\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ExampleRedisHighMemory
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Redis memory usage above 90% (instance {{ $labels.instance }})
          description: "Redis memory usage is {{ $value | humanizePercentage }}\n LABELS = {{ $labels }}"
AlertManager configuration
AlertManager receives alerts from Prometheus, deduplicates and groups them, then routes them to the right receiver. The three key timing parameters control when notifications are sent:
- group_wait: how long to wait for more alerts to batch into the first notification
- group_interval: how long to wait before sending a follow-up for an ongoing group
- repeat_interval: how often to re-notify if an alert hasn't resolved
# alertmanager.yml
route:
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 4h
  receiver: "slack"
  routes:
    # warnings and criticals → Slack
    - receiver: "slack"
      matchers:
        - severity =~ "critical|warning"
      continue: true
    # criticals also → PagerDuty
    - receiver: "pagerduty"
      matchers:
        - severity = "critical"
receivers:
  - name: "slack"
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/xxxxxxxxxxxxxxxxxxxxxxxxxxx'
        send_resolved: true
        channel: '#monitoring'
        title: '{{ if eq .Status "firing" }}:fire:{{ else }}:white_check_mark:{{ end }} {{ .CommonLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          {{ end }}
  - name: "pagerduty"
    pagerduty_configs:
      - routing_key: '<your-pagerduty-integration-key>'
        send_resolved: true
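Each timing parameter, along with group_by, can also be overridden per child route. If criticals should page faster than the defaults above, a variation like the following would work; the values here are illustrative, not part of the original setup.
# Hypothetical variation: page immediately and re-page hourly for criticals
route:
  receiver: "slack"
  group_by: ['alertname', 'instance']   # batch alerts sharing these labels
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 4h
  routes:
    - receiver: "pagerduty"
      matchers:
        - severity = "critical"
      group_wait: 0s        # don't wait to batch, notify at once
      repeat_interval: 1h   # re-notify every hour while still firing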
Inhibition rules
Inhibition suppresses lower-priority alerts while a higher-priority alert is already firing for the same target. A common pattern: silence warning alerts when a critical alert is active on the same instance.
# alertmanager.yml
inhibit_rules:
  # Suppress warnings when a critical is firing for the same instance
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal:
      - alertname
      - instance
  # Suppress all alerts for a node when NodeDown is firing
  - source_matchers:
      - alertname = "NodeDown"
    target_matchers:
      - job = "node"
    equal:
      - instance
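The second rule assumes a NodeDown alert exists somewhere in your Prometheus rules; it isn't defined in this post, but a typical sketch built on the node exporter's up series could look like this:
# Hypothetical NodeDown rule referenced by the inhibition above
- alert: NodeDown
  expr: up{job="node"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Node exporter down (instance {{ $labels.instance }})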
Reduce Prometheus server load
For expensive or frequently evaluated PromQL queries, use recording rules to precompute results. Alert rules and dashboards then reference the lightweight recorded metric instead of re-evaluating the full expression.
groups:
  # 1. Define the recording rule
  - name: recordings
    rules:
      - record: job:rabbitmq_queue_messages_delivered_total:rate5m
        expr: rate(rabbitmq_queue_messages_delivered_total[5m])
  # 2. Reference it in alert rules
  - name: alerts
    rules:
      - alert: RabbitmqLowMessageDelivery
        expr: sum(job:rabbitmq_queue_messages_delivered_total:rate5m) < 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Low message delivery rate in RabbitMQ
          description: "Delivery rate is {{ $value | humanize }} msg/s\n LABELS = {{ $labels }}"
Troubleshooting alert delays
The total time from an event occurring to a notification being sent is the sum of several independent delays. Work through them in order:
- Scrape delay: up to scrape_interval (20s) before the metric is collected
- Evaluation delay: up to evaluation_interval (20s) before the rule is next evaluated
- Pending duration: the for: 5m window must be satisfied before the alert state changes to firing
- GroupWait: AlertManager waits group_wait (10s) for other alerts to batch
In the worst case with for: 5m: 20s + 20s + 5m + 10s ≈ 6 minutes from event to notification.
For time-sensitive alerts, reduce evaluation_interval and the for: duration, but be careful of false positives from transient spikes.
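As an illustration, a hypothetical faster variant of the Redis rule from earlier trades the 2-minute pending window for quicker paging, at the cost of firing on short connectivity blips:
# Hypothetical time-sensitive variant (not part of the original rule file)
- alert: ExampleRedisDownFast
  expr: redis_up == 0
  for: 30s            # fires roughly 90s sooner than the 2m version
  labels:
    severity: critical
  annotations:
    summary: Redis instance down (instance {{ $labels.instance }})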