Configuration Guide · By Samuel Berthe · 5 min read

AlertManager Configuration

Prometheus and AlertManager configuration examples, recording rules, inhibition, and a troubleshooting guide for alert timing and notification routing.

If you notice a delay between an event and the first notification, read this post: Understanding the delays on alerting.

Prometheus configuration

Prometheus reads alert rules from YAML files and evaluates them on every evaluation_interval cycle. Keep scrape_interval and evaluation_interval consistent: with a large mismatch, rules can be evaluated against samples that have not been refreshed since the previous cycle, and short range queries may operate on stale data.

# prometheus.yml

global:
  scrape_interval: 20s

  # A short evaluation_interval will check alerting rules very often.
  # It can be costly if you run Prometheus with 100+ alerts.
  evaluation_interval: 20s

rule_files:
  - 'alerts/*.yml'

scrape_configs:
  # ...
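
The rule_files globs and the overall file can be validated with promtool, the CLI that ships with Prometheus (the file name is an example):

# Validate prometheus.yml, including the rule files it references
promtool check config prometheus.yml
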
# alerts/example-redis.yml

groups:

- name: ExampleRedisGroup
  rules:
  - alert: ExampleRedisDown
    expr: redis_up == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Redis instance down (instance {{ $labels.instance }})
      description: "Redis is unreachable\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: ExampleRedisHighMemory
    expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Redis memory usage above 90% (instance {{ $labels.instance }})
      description: "Redis memory usage is {{ $value | humanizePercentage }}\n  LABELS = {{ $labels }}"
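
Individual rule files can be validated the same way before reloading Prometheus; a quick sketch, assuming the path from the comment above:

# Check syntax and PromQL expressions in a single rule file
promtool check rules alerts/example-redis.yml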

AlertManager configuration

AlertManager receives alerts from Prometheus, deduplicates and groups them, then routes them to the right receiver. The three key timing parameters control when notifications are sent:

  • group_wait — how long to wait for more alerts to batch into the first notification
  • group_interval — how long to wait before sending a follow-up for an ongoing group
  • repeat_interval — how often to re-notify if an alert hasn't resolved

# alertmanager.yml

route:
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 4h
  receiver: "slack"

  routes:
    # warnings and criticals → Slack
    - receiver: "slack"
      matchers:
        - severity =~ "critical|warning"
      continue: true

    # criticals also → PagerDuty
    - receiver: "pagerduty"
      matchers:
        - severity = "critical"

receivers:
  - name: "slack"
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/xxxxxxxxxxxxxxxxxxxxxxxxxxx'
        send_resolved: true
        channel: '#monitoring'
        title: '{{ if eq .Status "firing" }}:fire:{{ else }}:white_check_mark:{{ end }} {{ .CommonLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          {{ end }}

  - name: "pagerduty"
    pagerduty_configs:
      - routing_key: '<your-pagerduty-integration-key>'
        send_resolved: true
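
AlertManager ships with a companion CLI, amtool, which can validate this file and dry-run the routing tree; a sketch, with the file path and label values as examples:

# Validate the configuration file
amtool check-config alertmanager.yml

# Show which receivers a given label set would be routed to; thanks to
# continue: true above, severity=critical should match both receivers
amtool config routes test --config.file=alertmanager.yml severity=critical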

Inhibition rules

Inhibition suppresses lower-priority alerts when a higher-priority alert is already firing for the same target. A common pattern: silence warning alerts when a critical alert is active on the same instance.

# alertmanager.yml

inhibit_rules:
  # Suppress warnings when a critical is firing for the same instance
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal:
      - alertname
      - instance

  # Suppress all alerts for a node when NodeDown is firing
  - source_matchers:
      - alertname = "NodeDown"
    target_matchers:
      - job = "node"
    equal:
      - instance
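
To verify inhibition end to end, you can push synthetic alerts at a running AlertManager with amtool and confirm that only the critical notification is delivered (the URL, instance, and alert names below are examples):

# Fire a critical NodeDown, then a warning for the same instance;
# the second inhibit rule should suppress the warning's notification
amtool alert add alertname=NodeDown instance=web-1 severity=critical \
  --alertmanager.url=http://localhost:9093
amtool alert add alertname=NodeHighLoad job=node instance=web-1 severity=warning \
  --alertmanager.url=http://localhost:9093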

Reduce Prometheus server load

For expensive or frequently evaluated PromQL queries, use recording rules to precompute results. Alert rules and dashboards then reference the lightweight precomputed series instead of re-evaluating the full expression.

groups:

  # 1. Define the recording rule
  - name: recordings
    rules:
    - record: job:rabbitmq_queue_messages_delivered_total:rate5m
      expr: rate(rabbitmq_queue_messages_delivered_total[5m])

  # 2. Reference it in alert rules
  - name: alerts
    rules:
    - alert: RabbitmqLowMessageDelivery
      expr: sum(job:rabbitmq_queue_messages_delivered_total:rate5m) < 10
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: Low message delivery rate in RabbitMQ
        description: "Delivery rate is {{ $value | humanize }} msg/s\n  LABELS = {{ $labels }}"
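
Once the group has been evaluated, the recorded series behaves like any other metric and can be sanity-checked through the HTTP API (assuming Prometheus listens on localhost:9090):

# /api/v1/query is part of the standard Prometheus HTTP API
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(job:rabbitmq_queue_messages_delivered_total:rate5m)'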

Troubleshooting alert delays

The total time from an event occurring to a notification being sent is the sum of several independent delays. Work through them in order:

  • Scrape delay: up to scrape_interval (20s) before the metric is collected
  • Evaluation delay: up to evaluation_interval (20s) before the rule fires
  • Pending duration: the for: 5m window must be satisfied before the alert state changes to firing
  • GroupWait: AlertManager waits group_wait (10s) for other alerts to batch

In the worst case with for: 5m, that adds up to 20s + 20s + 5m + 10s ≈ 6 minutes from event to notification. Lower evaluation_interval and the for: duration for time-sensitive alerts, but watch for false positives from transient spikes.
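
To see which stage a late alert is stuck in, query the built-in ALERTS and ALERTS_FOR_STATE series that Prometheus maintains for every active alert (the alertname is taken from the Redis example above):

# alertstate is "pending" while the for: window accumulates, then "firing"
ALERTS{alertname="ExampleRedisDown"}

# ALERTS_FOR_STATE stores the timestamp at which the alert became active,
# so this expression is the number of seconds spent pending so far
time() - ALERTS_FOR_STATE{alertname="ExampleRedisDown"}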
