What is the Prometheus alert rule for "Prometheus target missing"?

A Prometheus target has disappeared. An exporter might have crashed. PromQL expression: up == 0 unless on(job) (sum by (job) (up) == 0). Severity: critical. Duration: 1m.

What is the Prometheus alert rule for "Prometheus all targets missing"?

A Prometheus job no longer has any living targets. PromQL expression: sum by (job) (up) == 0. Severity: critical. Duration: 1m.

What is the Prometheus alert rule for "Prometheus target missing with warmup time"?

Allow a job time to start up (10 minutes) before alerting that it's down. PromQL expression: sum by (instance, job) ((up == 0) * on (instance) group_left(__name__) (node_time_seconds - node_boot_time_seconds > 600)). Severity: critical. Duration: 1m.

What is the Prometheus alert rule for "Prometheus not connected to alertmanager"?

Prometheus cannot connect to the Alertmanager. PromQL expression: prometheus_notifications_alertmanagers_discovered < 1. Severity: critical.

What is the Prometheus alert rule for "Prometheus job missing"?

A Prometheus job has disappeared. PromQL expression: absent(up{job="prometheus"}). Severity: critical. Duration: 1m.

What is the Prometheus alert rule for "Prometheus AlertManager job missing"?

A Prometheus AlertManager job has disappeared. PromQL expression: absent(up{job="alertmanager"}). Severity: critical. Duration: 1m.

What is the Prometheus alert rule for "Prometheus rule evaluation failures"?

Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts. PromQL expression: increase(prometheus_rule_evaluation_failures_total[3m]) > 0. Severity: critical.

What is the Prometheus alert rule for "Prometheus template text expansion failures"?

Prometheus encountered {{ $value }} template text expansion failures. PromQL expression: increase(prometheus_template_text_expansion_failures_total[3m]) > 0. Severity: critical.

What is the Prometheus alert rule for "Prometheus target empty"?

Prometheus has no target in service discovery. PromQL expression: prometheus_sd_discovered_targets == 0. Severity: critical.

What is the Prometheus alert rule for "Prometheus TSDB checkpoint creation failures"?

Prometheus encountered {{ $value }} checkpoint creation failures. PromQL expression: increase(prometheus_tsdb_checkpoint_creations_failed_total[1m]) > 0. Severity: critical.

What is the Prometheus alert rule for "Prometheus TSDB checkpoint deletion failures"?

Prometheus encountered {{ $value }} checkpoint deletion failures. PromQL expression: increase(prometheus_tsdb_checkpoint_deletions_failed_total[1m]) > 0. Severity: critical.

What is the Prometheus alert rule for "Prometheus TSDB compactions failed"?

Prometheus encountered {{ $value }} TSDB compaction failures. PromQL expression: increase(prometheus_tsdb_compactions_failed_total[10m]) > 0. Severity: critical. Duration: 30m.

What is the Prometheus alert rule for "Prometheus TSDB head truncations failed"?

Prometheus encountered {{ $value }} TSDB head truncation failures. PromQL expression: increase(prometheus_tsdb_head_truncations_failed_total[1m]) > 0. Severity: critical.

What is the Prometheus alert rule for "Prometheus TSDB reload failures"?

Prometheus encountered {{ $value }} TSDB reload failures. PromQL expression: increase(prometheus_tsdb_reloads_failures_total[10m]) > 0. Severity: critical. Duration: 30m.

What is the Prometheus alert rule for "Prometheus TSDB WAL corruptions"?

Prometheus encountered {{ $value }} TSDB WAL corruptions. PromQL expression: increase(prometheus_tsdb_wal_corruptions_total[1m]) > 0. Severity: critical.

What is the Prometheus alert rule for "Prometheus TSDB WAL truncations failed"?

Prometheus encountered {{ $value }} TSDB WAL truncation failures. PromQL expression: increase(prometheus_tsdb_wal_truncations_failed_total[1m]) > 0. Severity: critical.

What is the Prometheus alert rule for "Prometheus error sending alerts to any AlertManager"?

Prometheus is failing to send at least {{ $value | humanize }}% of alerts to every configured Alertmanager, meaning alert delivery is broadly impaired. PromQL expression: min without (alertmanager) (rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m])) * 100 > 3. Severity: critical. Duration: 15m.

What is the Prometheus alert rule for "Prometheus not ingesting samples"?

Prometheus has stopped appending new samples to its TSDB head although targets or rule groups are still configured, meaning monitoring data may be lost. PromQL expression: sum without(type) (rate(prometheus_tsdb_head_samples_appended_total[5m])) <= 0 and (sum without(scrape_job) (prometheus_target_metadata_cache_entries) > 0 or sum without(rule_group) (prometheus_rule_group_rules) > 0). Severity: critical. Duration: 10m.

What is the Prometheus alert rule for "Prometheus target sync failure"?

{{ $value }} Prometheus targets failed to sync because invalid scrape configuration was supplied, meaning those targets are not being scraped. PromQL expression: increase(prometheus_target_sync_failed_total[10m]) > 0. Severity: critical. Duration: 5m.

What is the Prometheus alert rule for "Prometheus too many restarts"?

Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping. PromQL expression: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2. Severity: warning.

What is the Prometheus alert rule for "Prometheus rule evaluation slow"?

Prometheus rule evaluation took more time than the scheduled interval. This indicates slow storage backend access or an overly complex query. PromQL expression: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds. Severity: warning. Duration: 5m.

What is the Prometheus alert rule for "Prometheus notifications backlog"?

The Prometheus notification queue has not been empty for 10 minutes. PromQL expression: min_over_time(prometheus_notifications_queue_length[10m]) > 0. Severity: warning.

What is the Prometheus alert rule for "Prometheus target scraping slow"?

Prometheus is scraping exporters slowly since it exceeded the requested interval time. Your Prometheus server is under-provisioned. PromQL expression: prometheus_target_interval_length_seconds{quantile="0.9"} / on (interval, instance, job) prometheus_target_interval_length_seconds{quantile="0.5"} > 1.05. Severity: warning. Duration: 5m.

What is the Prometheus alert rule for "Prometheus large scrape"?

Prometheus has many scrapes that exceed the sample limit ({{ $value }} scrapes). PromQL expression: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10. Severity: warning. Duration: 5m.

What is the Prometheus alert rule for "Prometheus target scrape duplicate"?

Prometheus has many samples rejected due to duplicate timestamps but different values ({{ $value }} samples). PromQL expression: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 3. Severity: warning.

What is the Prometheus alert rule for "Prometheus timeseries cardinality"?

The "{{ $labels.name }}" timeseries cardinality is getting very high: {{ $value }}. PromQL expression: label_replace(count by(__name__) ({__name__=~".+"}), "name", "$1", "__name__", "(.+)") > 10000. Severity: warning.

What is the Prometheus alert rule for "Prometheus error sending alerts to AlertManager"?

Prometheus is failing to send {{ $value | humanize }}% of alerts to a specific Alertmanager instance, meaning some alerts may not be delivered. PromQL expression: (rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m])) * 100 > 1. Severity: warning. Duration: 15m.

What is the Prometheus alert rule for "Prometheus service discovery refresh failure"?

Prometheus failed to refresh service discovery targets using mechanism {{ $labels.mechanism }}, meaning target lists may become stale. PromQL expression: increase(prometheus_sd_refresh_failures_total[10m]) > 0. Severity: warning. Duration: 5m.

What is the Prometheus alert rule for "Prometheus configuration reload failure"?

Prometheus configuration reload error. PromQL expression: prometheus_config_last_reload_successful != 1. Severity: warning.

What is the Prometheus alert rule for "Prometheus AlertManager configuration reload failure"?

AlertManager configuration reload error. PromQL expression: alertmanager_config_last_reload_successful != 1. Severity: warning.

What is the Prometheus alert rule for "Prometheus AlertManager config not synced"?

Configurations of AlertManager cluster instances are out of sync. PromQL expression: count(count_values("config_hash", alertmanager_config_hash)) > 1. Severity: warning. Duration: 20m.

Prometheus self-monitoring Prometheus Alert Rules

36 Prometheus alerting rules for Prometheus self-monitoring. These rules cover critical and warning conditions. Copy and paste the YAML into a rule file loaded by your Prometheus configuration.

⚠️ Alert thresholds depend on the nature of your applications. Some queries may have arbitrary tolerance thresholds. Building an efficient monitoring platform takes time. 😉

groups:
- name: EmbeddedExporter
  rules:
      # Only fire if at least one target in the job is still up.
      # If all targets are down, PrometheusJobMissing or PrometheusAllTargetsMissing will fire instead.
    - alert: PrometheusTargetMissing
      expr: up == 0 unless on(job) (sum by (job) (up) == 0)
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: Prometheus target missing (instance {{ $labels.instance }})
        description: "A Prometheus target has disappeared. An exporter might be crashed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusAllTargetsMissing
      expr: sum by (job) (up) == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: Prometheus all targets missing (job {{ $labels.job }})
        description: "A Prometheus job does not have living target anymore.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusTargetMissingWithWarmupTime
      expr: sum by (instance, job) ((up == 0) * on (instance) group_left(__name__) (node_time_seconds - node_boot_time_seconds > 600))
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: Prometheus target missing with warmup time (instance {{ $labels.instance }})
        description: "Allow a job time to start up (10 minutes) before alerting that it's down.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusNotConnectedToAlertmanager
      expr: prometheus_notifications_alertmanagers_discovered < 1
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Prometheus not connected to alertmanager (instance {{ $labels.instance }})
        description: "Prometheus cannot connect the alertmanager\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
      # Replace job="prometheus" with the actual job name in your Prometheus configuration.
    - alert: PrometheusJobMissing
      expr: absent(up{job="prometheus"})
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: Prometheus job missing (instance {{ $labels.instance }})
        description: "A Prometheus job has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
      # Replace job="alertmanager" with the actual job name in your Prometheus configuration.
    - alert: PrometheusAlertManagerJobMissing
      expr: absent(up{job="alertmanager"})
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: Prometheus AlertManager job missing (instance {{ $labels.instance }})
        description: "A Prometheus AlertManager job has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusTargetScrapeOutOfOrder
      expr: increase(prometheus_target_scrapes_sample_out_of_order_total[5m]) > 3
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: Prometheus target scrape out of order (instance {{ $labels.instance }})
        description: "Prometheus is dropping samples because their timestamps arrive out of order ({{ $value }} samples in the last 5 minutes), indicating a misbehaving target or clock skew.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusAlertManagerE2EDeadManSwitch
      expr: vector(1)
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Prometheus AlertManager E2E dead man switch (instance {{ $labels.instance }})
        description: "Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusRuleEvaluationFailures
      expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Prometheus rule evaluation failures (instance {{ $labels.instance }})
        description: "Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusTemplateTextExpansionFailures
      expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Prometheus template text expansion failures (instance {{ $labels.instance }})
        description: "Prometheus encountered {{ $value }} template text expansion failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusAlertManagerNotificationFailing
      expr: (rate(alertmanager_notifications_failed_total[5m]) / ignoring(reason) group_left rate(alertmanager_notifications_total[5m])) > 0.01 and ignoring(reason) rate(alertmanager_notifications_total[5m]) > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Prometheus AlertManager notification failing (instance {{ $labels.instance }})
        description: "Alertmanager is failing to send {{ $value | humanizePercentage }} of notifications\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusTargetEmpty
      expr: prometheus_sd_discovered_targets == 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Prometheus target empty (instance {{ $labels.instance }})
        description: "Prometheus has no target in service discovery\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusTSDBCheckpointCreationFailures
      expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[1m]) > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }})
        description: "Prometheus encountered {{ $value }} checkpoint creation failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusTSDBCheckpointDeletionFailures
      expr: increase(prometheus_tsdb_checkpoint_deletions_failed_total[1m]) > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }})
        description: "Prometheus encountered {{ $value }} checkpoint deletion failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusTSDBCompactionsFailed
      expr: increase(prometheus_tsdb_compactions_failed_total[10m]) > 0
      for: 30m
      labels:
        severity: critical
      annotations:
        summary: Prometheus TSDB compactions failed (instance {{ $labels.instance }})
        description: "Prometheus encountered {{ $value }} TSDB compactions failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusTSDBHeadTruncationsFailed
      expr: increase(prometheus_tsdb_head_truncations_failed_total[1m]) > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Prometheus TSDB head truncations failed (instance {{ $labels.instance }})
        description: "Prometheus encountered {{ $value }} TSDB head truncation failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusTSDBReloadFailures
      expr: increase(prometheus_tsdb_reloads_failures_total[10m]) > 0
      for: 30m
      labels:
        severity: critical
      annotations:
        summary: Prometheus TSDB reload failures (instance {{ $labels.instance }})
        description: "Prometheus encountered {{ $value }} TSDB reload failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusTSDBWALCorruptions
      expr: increase(prometheus_tsdb_wal_corruptions_total[1m]) > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})
        description: "Prometheus encountered {{ $value }} TSDB WAL corruptions\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusTSDBWALTruncationsFailed
      expr: increase(prometheus_tsdb_wal_truncations_failed_total[1m]) > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }})
        description: "Prometheus encountered {{ $value }} TSDB WAL truncation failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusErrorSendingAlertsToAnyAlertManager
      expr: min without (alertmanager) (rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m])) * 100 > 3
      for: 15m
      labels:
        severity: critical
      annotations:
        summary: Prometheus error sending alerts to any AlertManager (instance {{ $labels.instance }})
        description: "Prometheus is failing to send at least {{ $value | humanize }}% of alerts to every configured Alertmanager, meaning alert delivery is broadly impaired.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusNotIngestingSamples
      expr: sum without(type) (rate(prometheus_tsdb_head_samples_appended_total[5m])) <= 0 and (sum without(scrape_job) (prometheus_target_metadata_cache_entries) > 0 or sum without(rule_group) (prometheus_rule_group_rules) > 0)
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: Prometheus not ingesting samples (instance {{ $labels.instance }})
        description: "Prometheus has stopped appending new samples to its TSDB head although targets or rule groups are still configured, meaning monitoring data may be lost.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusTargetSyncFailure
      expr: increase(prometheus_target_sync_failed_total[10m]) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: Prometheus target sync failure (instance {{ $labels.instance }})
        description: "{{ $value }} Prometheus targets failed to sync because invalid scrape configuration was supplied, meaning those targets are not being scraped.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusTooManyRestarts
      expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: Prometheus too many restarts (instance {{ $labels.instance }})
        description: "Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusRuleEvaluationSlow
      expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Prometheus rule evaluation slow (instance {{ $labels.instance }})
        description: "Prometheus rule evaluation took more time than the scheduled interval. It indicates a slower storage backend access or too complex query.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusNotificationsBacklog
      expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: Prometheus notifications backlog (instance {{ $labels.instance }})
        description: "The Prometheus notification queue has not been empty for 10 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusTargetScrapingSlow
      expr: prometheus_target_interval_length_seconds{quantile="0.9"} / on (interval, instance, job) prometheus_target_interval_length_seconds{quantile="0.5"} > 1.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Prometheus target scraping slow (instance {{ $labels.instance }})
        description: "Prometheus is scraping exporters slowly since it exceeded the requested interval time. Your Prometheus server is under-provisioned.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusLargeScrape
      expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Prometheus large scrape (instance {{ $labels.instance }})
        description: "Prometheus has many scrapes that exceed the sample limit ({{ $value }} scrapes)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusTargetScrapeDuplicate
      expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 3
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: Prometheus target scrape duplicate (instance {{ $labels.instance }})
        description: "Prometheus has many samples rejected due to duplicate timestamps but different values ({{ $value }} samples)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
      # The 10000-series threshold is a rough default. Cardinality scales with the number of
      # distinct label combinations your monitored infrastructure produces (dashboards, sharding,
      # high-cardinality labels like user IDs); adjust it based on your metric design, not server size.
    - alert: PrometheusTimeseriesCardinality
      expr: label_replace(count by(__name__) ({__name__=~".+"}), "name", "$1", "__name__", "(.+)") > 10000
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: Prometheus timeseries cardinality (instance {{ $labels.instance }})
        description: "The \"{{ $labels.name }}\" timeseries cardinality is getting very high: {{ $value }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusErrorSendingAlertsToAlertManager
      expr: (rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m])) * 100 > 1
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: Prometheus error sending alerts to AlertManager (instance {{ $labels.instance }})
        description: "Prometheus is failing to send {{ $value | humanize }}% of alerts to a specific Alertmanager instance, meaning some alerts may not be delivered.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusServiceDiscoveryRefreshFailure
      expr: increase(prometheus_sd_refresh_failures_total[10m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Prometheus service discovery refresh failure (instance {{ $labels.instance }})
        description: "Prometheus failed to refresh service discovery targets using mechanism {{ $labels.mechanism }}, meaning target lists may become stale.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusRuleEvaluationsMissed
      expr: increase(prometheus_rule_group_iterations_missed_total[5m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Prometheus rule evaluations missed (instance {{ $labels.instance }})
        description: "Prometheus missed {{ $value }} rule group evaluations in the last 5 minutes, usually caused by rule groups taking longer to evaluate than their configured interval.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusHighQueryLoad
      expr: prometheus_engine_queries / prometheus_engine_queries_concurrent_max > 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Prometheus high query load (instance {{ $labels.instance }})
        description: "Prometheus query engine has less than 20% available capacity for concurrent queries, meaning it is approaching its configured concurrency limit and may start rejecting queries.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusConfigurationReloadFailure
      expr: prometheus_config_last_reload_successful != 1
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: Prometheus configuration reload failure (instance {{ $labels.instance }})
        description: "Prometheus configuration reload error\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PrometheusAlertManagerConfigurationReloadFailure
      expr: alertmanager_config_last_reload_successful != 1
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }})
        description: "AlertManager configuration reload error\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
      # for: 20m tolerates the time a rolling config change takes to propagate across the
      # Alertmanager cluster, avoiding false positives mid-deploy.
    - alert: PrometheusAlertManagerConfigNotSynced
      expr: count(count_values("config_hash", alertmanager_config_hash)) > 1
      for: 20m
      labels:
        severity: warning
      annotations:
        summary: Prometheus AlertManager config not synced (instance {{ $labels.instance }})
        description: "Configurations of AlertManager cluster instances are out of sync\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1. Prometheus self-monitoring (36 rules)

wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/prometheus-self-monitoring/embedded-exporter.yml
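
Once downloaded, the file still has to be referenced from the main Prometheus configuration before any rule is evaluated. A minimal sketch, assuming the file was saved under `/etc/prometheus/rules/` (a hypothetical path; use wherever you keep rule files):

```yaml
# prometheus.yml (excerpt)
rule_files:
  # Hypothetical location of the downloaded rule file; adjust to your layout.
  - /etc/prometheus/rules/embedded-exporter.yml
```

Running `promtool check rules` against the file before reloading Prometheus is a cheap way to catch YAML or PromQL syntax errors.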

critical

1.1.1. Prometheus target missing

A Prometheus target has disappeared. An exporter might have crashed.

  # Only fire if at least one target in the job is still up.
  # If all targets are down, PrometheusJobMissing or PrometheusAllTargetsMissing will fire instead.
- alert: PrometheusTargetMissing
  expr: up == 0 unless on(job) (sum by (job) (up) == 0)
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target missing (instance {{ $labels.instance }})
    description: "A Prometheus target has disappeared. An exporter might be crashed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
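
The `unless on(job)` clause is easier to follow with concrete series. The sketch below is a `promtool` unit test that checks the bare expressions (not the alert annotations) against made-up data. The file name, job names, and instance addresses are all hypothetical.

```yaml
# target-missing-test.yml (hypothetical file name)
# Run with: promtool test rules target-missing-test.yml
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Job "node": one instance stays up, the other goes down after the first scrape.
      - series: 'up{job="node", instance="a:9100"}'
        values: '1 1 1 1'
      - series: 'up{job="node", instance="b:9100"}'
        values: '1 0 0 0'
      # Job "db": every instance is down for the whole test.
      - series: 'up{job="db", instance="c:9187"}'
        values: '0 0 0 0'
      - series: 'up{job="db", instance="d:9187"}'
        values: '0 0 0 0'

    promql_expr_test:
      # Partial outage: only the down instance of a job that still has a live target is returned.
      - expr: up == 0 unless on(job) (sum by (job) (up) == 0)
        eval_time: 2m
        exp_samples:
          - labels: 'up{job="node", instance="b:9100"}'
            value: 0
      # Full outage: the job-level expression catches "db" instead.
      - expr: sum by (job) (up) == 0
        eval_time: 2m
        exp_samples:
          - labels: '{job="db"}'
            value: 0
```

The two expressions are complementary: a fully dead job is reported once at job level rather than once per instance.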

critical

1.1.2. Prometheus all targets missing

A Prometheus job no longer has any living targets.

- alert: PrometheusAllTargetsMissing
  expr: sum by (job) (up) == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Prometheus all targets missing (job {{ $labels.job }})
    description: "A Prometheus job does not have living target anymore.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

critical

1.1.3. Prometheus target missing with warmup time

Allow a job time to start up (10 minutes) before alerting that it's down.

- alert: PrometheusTargetMissingWithWarmupTime
  expr: sum by (instance, job) ((up == 0) * on (instance) group_left(__name__) (node_time_seconds - node_boot_time_seconds > 600))
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target missing with warmup time (instance {{ $labels.instance }})
    description: "Allow a job time to start up (10 minutes) before alerting that it's down.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

critical

1.1.4. Prometheus not connected to alertmanager

Prometheus cannot connect to the Alertmanager.

- alert: PrometheusNotConnectedToAlertmanager
  expr: prometheus_notifications_alertmanagers_discovered < 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus not connected to alertmanager (instance {{ $labels.instance }})
    description: "Prometheus cannot connect the alertmanager\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

critical

1.1.5. Prometheus job missing

A Prometheus job has disappeared.

  # Replace job="prometheus" with the actual job name in your Prometheus configuration.
- alert: PrometheusJobMissing
  expr: absent(up{job="prometheus"})
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Prometheus job missing (instance {{ $labels.instance }})
    description: "A Prometheus job has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

critical

1.1.6. Prometheus AlertManager job missing

A Prometheus AlertManager job has disappeared.

  # Replace job="alertmanager" with the actual job name in your Prometheus configuration.
- alert: PrometheusAlertManagerJobMissing
  expr: absent(up{job="alertmanager"})
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Prometheus AlertManager job missing (instance {{ $labels.instance }})
    description: "A Prometheus AlertManager job has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
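
The `job` label matched by both `absent()` expressions comes from the `job_name` of the corresponding scrape configuration. A minimal sketch of a configuration whose job names match the defaults used above, assuming both services run locally on their default ports (the targets are illustrative):

```yaml
# prometheus.yml (excerpt)
scrape_configs:
  # Produces up{job="prometheus", ...}, matched by PrometheusJobMissing.
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]   # illustrative target

  # Produces up{job="alertmanager", ...}, matched by PrometheusAlertManagerJobMissing.
  - job_name: alertmanager
    static_configs:
      - targets: ["localhost:9093"]   # illustrative target
```

If your jobs are named differently, change the `job="..."` matcher in each rule rather than renaming the scrape jobs, since dashboards and other rules may already depend on the existing names.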

warning

1.1.7. Prometheus target scrape out of order

Prometheus is dropping samples because their timestamps arrive out of order ({{ $value }} samples in the last 5 minutes), indicating a misbehaving target or clock skew.

- alert: PrometheusTargetScrapeOutOfOrder
  expr: increase(prometheus_target_scrapes_sample_out_of_order_total[5m]) > 3
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus target scrape out of order (instance {{ $labels.instance }})
    description: "Prometheus is dropping samples because their timestamps arrive out of order ({{ $value }} samples in the last 5 minutes), indicating a misbehaving target or clock skew.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

critical

1.1.8. Prometheus AlertManager E2E dead man switch

Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.

- alert: PrometheusAlertManagerE2EDeadManSwitch
  expr: vector(1)
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus AlertManager E2E dead man switch (instance {{ $labels.instance }})
    description: "Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
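
An always-firing alert is only useful if something outside the monitoring stack notices when it stops arriving. One common arrangement, sketched here, routes it to a dedicated webhook receiver that pings an external heartbeat service. The receiver names and the URL are hypothetical placeholders, and the intervals are illustrative.

```yaml
# alertmanager.yml (excerpt)
route:
  receiver: default
  routes:
    # Send the dead man's switch to its own receiver and repeat it frequently,
    # so the external service sees a steady heartbeat.
    - matchers:
        - alertname = "PrometheusAlertManagerE2EDeadManSwitch"
      receiver: deadman-heartbeat
      group_wait: 0s
      group_interval: 1m
      repeat_interval: 1m

receivers:
  - name: default
  - name: deadman-heartbeat
    webhook_configs:
      # Hypothetical heartbeat endpoint; replace with your own service.
      - url: https://heartbeat.example.invalid/ping/REPLACE_ME
        send_resolved: false
```

The external service then has to raise its own alarm when pings stop, because a broken Prometheus or Alertmanager cannot report its own failure.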

critical

1.1.9. Prometheus rule evaluation failures

Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.

- alert: PrometheusRuleEvaluationFailures
  expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus rule evaluation failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

critical

1.1.10.Prometheus template text expansion failures

Prometheus encountered {{ $value }} template text expansion failures

- alert: PrometheusTemplateTextExpansionFailures
  expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus template text expansion failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} template text expansion failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

critical

1.1.11.Prometheus AlertManager notification failing

Alertmanager is failing to send {{ $value | humanizePercentage }} of notifications

- alert: PrometheusAlertManagerNotificationFailing
  expr: (rate(alertmanager_notifications_failed_total[5m]) / ignoring(reason) group_left rate(alertmanager_notifications_total[5m])) > 0.01 and ignoring(reason) rate(alertmanager_notifications_total[5m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus AlertManager notification failing (instance {{ $labels.instance }})
    description: "Alertmanager is failing to send {{ $value | humanizePercentage }} of notifications\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

critical

1.1.12.Prometheus target empty

Prometheus has no targets in service discovery

- alert: PrometheusTargetEmpty
  expr: prometheus_sd_discovered_targets == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target empty (instance {{ $labels.instance }})
    description: "Prometheus has no targets in service discovery\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

critical

1.1.13.Prometheus TSDB checkpoint creation failures

Prometheus encountered {{ $value }} checkpoint creation failures

- alert: PrometheusTSDBCheckpointCreationFailures
  expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} checkpoint creation failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

critical

1.1.14.Prometheus TSDB checkpoint deletion failures

Prometheus encountered {{ $value }} checkpoint deletion failures

- alert: PrometheusTSDBCheckpointDeletionFailures
  expr: increase(prometheus_tsdb_checkpoint_deletions_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} checkpoint deletion failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

critical

1.1.15.Prometheus TSDB compactions failed

Prometheus encountered {{ $value }} TSDB compaction failures

- alert: PrometheusTSDBCompactionsFailed
  expr: increase(prometheus_tsdb_compactions_failed_total[10m]) > 0
  for: 30m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB compactions failed (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB compaction failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

critical

1.1.16.Prometheus TSDB head truncations failed

Prometheus encountered {{ $value }} TSDB head truncation failures

- alert: PrometheusTSDBHeadTruncationsFailed
  expr: increase(prometheus_tsdb_head_truncations_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB head truncations failed (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB head truncation failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

critical

1.1.17.Prometheus TSDB reload failures

Prometheus encountered {{ $value }} TSDB reload failures

- alert: PrometheusTSDBReloadFailures
  expr: increase(prometheus_tsdb_reloads_failures_total[10m]) > 0
  for: 30m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB reload failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB reload failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

critical

1.1.18.Prometheus TSDB WAL corruptions

Prometheus encountered {{ $value }} TSDB WAL corruptions

- alert: PrometheusTSDBWALCorruptions
  expr: increase(prometheus_tsdb_wal_corruptions_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB WAL corruptions\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

critical

1.1.19.Prometheus TSDB WAL truncations failed

Prometheus encountered {{ $value }} TSDB WAL truncation failures

- alert: PrometheusTSDBWALTruncationsFailed
  expr: increase(prometheus_tsdb_wal_truncations_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB WAL truncation failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

critical

1.1.20.Prometheus error sending alerts to any AlertManager

Prometheus is failing to send at least {{ $value | humanize }}% of alerts to every configured Alertmanager, meaning alert delivery is broadly impaired.

- alert: PrometheusErrorSendingAlertsToAnyAlertManager
  expr: min without (alertmanager) (rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m])) * 100 > 3
  for: 15m
  labels:
    severity: critical
  annotations:
    summary: Prometheus error sending alerts to any AlertManager (instance {{ $labels.instance }})
    description: "Prometheus is failing to send at least {{ $value | humanize }}% of alerts to every configured Alertmanager, meaning alert delivery is broadly impaired.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
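The phrase "every configured Alertmanager" refers to the endpoints listed under `alerting` in the Prometheus configuration. A minimal sketch with two endpoints follows; the hostnames are hypothetical.

```yaml
# prometheus.yml fragment (sketch)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Hypothetical hostnames: replace with your own Alertmanager instances.
            - alertmanager-0.example.internal:9093
            - alertmanager-1.example.internal:9093
```

Because the rule takes the minimum error ratio across the `alertmanager` label, it fires only when even the healthiest endpoint exceeds the threshold.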

critical

1.1.21.Prometheus not ingesting samples

Prometheus has stopped appending new samples to its TSDB head although targets or rule groups are still configured, meaning monitoring data may be lost.

- alert: PrometheusNotIngestingSamples
  expr: sum without(type) (rate(prometheus_tsdb_head_samples_appended_total[5m])) <= 0 and (sum without(scrape_job) (prometheus_target_metadata_cache_entries) > 0 or sum without(rule_group) (prometheus_rule_group_rules) > 0)
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: Prometheus not ingesting samples (instance {{ $labels.instance }})
    description: "Prometheus has stopped appending new samples to its TSDB head although targets or rule groups are still configured, meaning monitoring data may be lost.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

critical

1.1.22.Prometheus target sync failure

{{ $value }} Prometheus targets failed to sync because invalid scrape configuration was supplied, meaning those targets are not being scraped.

- alert: PrometheusTargetSyncFailure
  expr: increase(prometheus_target_sync_failed_total[10m]) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target sync failure (instance {{ $labels.instance }})
    description: "{{ $value }} Prometheus targets failed to sync because invalid scrape configuration was supplied, meaning those targets are not being scraped.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

warning

1.1.23.Prometheus too many restarts

Prometheus, Pushgateway, or Alertmanager has restarted more than twice in the last 15 minutes. It might be crashlooping.

- alert: PrometheusTooManyRestarts
  expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus too many restarts (instance {{ $labels.instance }})
    description: "Prometheus, Pushgateway, or Alertmanager has restarted more than twice in the last 15 minutes. It might be crashlooping.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

warning

1.1.24.Prometheus rule evaluation slow

Prometheus rule evaluation took longer than the scheduled interval. This indicates slow storage backend access or an overly complex query.

- alert: PrometheusRuleEvaluationSlow
  expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus rule evaluation slow (instance {{ $labels.instance }})
    description: "Prometheus rule evaluation took longer than the scheduled interval. This indicates slow storage backend access or an overly complex query.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
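The interval this rule compares against is set per rule group, or inherited from the global `evaluation_interval` when the group does not set one. As a sketch, a group that is known to be expensive could be given a longer interval of its own. The group name and the recording rule are hypothetical, and 1m is only an illustrative value.

```yaml
# Rule file fragment (sketch)
groups:
  - name: expensive-aggregations   # hypothetical group name
    interval: 1m                   # overrides the global evaluation_interval for this group
    rules:
      - record: job:up:sum         # hypothetical recording rule
        expr: sum by (job) (up)
```

Lengthening the interval only hides the symptom if the queries themselves are too heavy, so simplifying the expressions is usually worth trying first.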

warning

1.1.25.Prometheus notifications backlog

The Prometheus notification queue has not been empty for 10 minutes

- alert: PrometheusNotificationsBacklog
  expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus notifications backlog (instance {{ $labels.instance }})
    description: "The Prometheus notification queue has not been empty for 10 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

warning

1.1.26.Prometheus target scraping slow

Prometheus is scraping exporters slowly: scrapes are exceeding the requested interval. The Prometheus server may be under-provisioned.

- alert: PrometheusTargetScrapingSlow
  expr: prometheus_target_interval_length_seconds{quantile="0.9"} / on (interval, instance, job) prometheus_target_interval_length_seconds{quantile="0.5"} > 1.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus target scraping slow (instance {{ $labels.instance }})
    description: "Prometheus is scraping exporters slowly: scrapes are exceeding the requested interval. The Prometheus server may be under-provisioned.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

warning

1.1.27.Prometheus large scrape

Prometheus has many scrapes that exceed the sample limit ({{ $value }} scrapes)

- alert: PrometheusLargeScrape
  expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus large scrape (instance {{ $labels.instance }})
    description: "Prometheus has many scrapes that exceed the sample limit ({{ $value }} scrapes)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
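The sample limit that this counter tracks is configured per scrape job with `sample_limit`. A minimal sketch, assuming a hypothetical job and target:

```yaml
# prometheus.yml fragment (sketch)
scrape_configs:
  - job_name: example-app                 # hypothetical job name
    # Illustrative value. A scrape returning more samples than this
    # (after metric relabeling) is treated as failed; 0 means no limit.
    sample_limit: 5000
    static_configs:
      - targets:
          - app.example.internal:8080     # hypothetical target
```

If no job sets `sample_limit`, this alert has nothing to count and stays silent.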

warning

1.1.28.Prometheus target scrape duplicate

Prometheus has many samples rejected due to duplicate timestamps but different values ({{ $value }} samples)

- alert: PrometheusTargetScrapeDuplicate
  expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 3
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus target scrape duplicate (instance {{ $labels.instance }})
    description: "Prometheus has many samples rejected due to duplicate timestamps but different values ({{ $value }} samples)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

warning

1.1.29.Prometheus timeseries cardinality

The "{{ $labels.name }}" timeseries cardinality is getting very high: {{ $value }}

  # The 10000-series threshold is a rough default. Cardinality scales with how many distinct
  # label combinations your monitored infrastructure produces (dashboards, sharding,
  # high-cardinality labels like user IDs), so adjust it based on your metric design, not server size.
- alert: PrometheusTimeseriesCardinality
  expr: label_replace(count by(__name__) ({__name__=~".+"}), "name", "$1", "__name__", "(.+)") > 10000
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus timeseries cardinality (instance {{ $labels.instance }})
    description: "The \"{{ $labels.name }}\" timeseries cardinality is getting very high: {{ $value }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
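When one metric dominates the series count, a common mitigation is to drop the offending metric or label at scrape time with `metric_relabel_configs`. The sketch below assumes a hypothetical job, a hypothetical metric prefix, and a hypothetical `user_id` label.

```yaml
# prometheus.yml fragment (sketch)
scrape_configs:
  - job_name: example-app                       # hypothetical job name
    static_configs:
      - targets: ["app.example.internal:8080"]  # hypothetical target
    metric_relabel_configs:
      # Drop every series whose metric name matches the regex.
      - source_labels: [__name__]
        regex: "example_debug_.*"
        action: drop
      # Remove a high-cardinality label from all remaining series.
      - regex: "user_id"
        action: labeldrop
```

Dropping a label is only safe if the remaining labels still identify each series uniquely; otherwise the scrape produces colliding series.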

warning

1.1.30.Prometheus error sending alerts to AlertManager

Prometheus is failing to send {{ $value | humanize }}% of alerts to a specific Alertmanager instance, meaning some alerts may not be delivered.

- alert: PrometheusErrorSendingAlertsToAlertManager
  expr: (rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m])) * 100 > 1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Prometheus error sending alerts to AlertManager (instance {{ $labels.instance }})
    description: "Prometheus is failing to send {{ $value | humanize }}% of alerts to a specific Alertmanager instance, meaning some alerts may not be delivered.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

warning

1.1.31.Prometheus service discovery refresh failure

Prometheus failed to refresh service discovery targets using mechanism {{ $labels.mechanism }}, meaning target lists may become stale.

- alert: PrometheusServiceDiscoveryRefreshFailure
  expr: increase(prometheus_sd_refresh_failures_total[10m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus service discovery refresh failure (instance {{ $labels.instance }})
    description: "Prometheus failed to refresh service discovery targets using mechanism {{ $labels.mechanism }}, meaning target lists may become stale.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

warning

1.1.32.Prometheus rule evaluations missed

Prometheus missed {{ $value }} rule group evaluations in the last 5 minutes, usually caused by rule groups taking longer to evaluate than their configured interval.

- alert: PrometheusRuleEvaluationsMissed
  expr: increase(prometheus_rule_group_iterations_missed_total[5m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus rule evaluations missed (instance {{ $labels.instance }})
    description: "Prometheus missed {{ $value }} rule group evaluations in the last 5 minutes, usually caused by rule groups taking longer to evaluate than their configured interval.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

warning

1.1.33.Prometheus high query load

Prometheus query engine has less than 20% available capacity for concurrent queries, meaning it is approaching its configured concurrency limit and may start rejecting queries.

- alert: PrometheusHighQueryLoad
  expr: prometheus_engine_queries / prometheus_engine_queries_concurrent_max > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus high query load (instance {{ $labels.instance }})
    description: "Prometheus query engine has less than 20% available capacity for concurrent queries, meaning it is approaching its configured concurrency limit and may start rejecting queries.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
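The concurrency limit in the denominator is set by the `--query.max-concurrency` command-line flag. A sketch of raising it in a Docker Compose service follows; the service layout and the value 40 are assumptions for illustration.

```yaml
# docker-compose.yml fragment (sketch)
services:
  prometheus:
    image: prom/prometheus
    command:
      # Overriding command replaces the image's default arguments,
      # so the config file is passed explicitly.
      - --config.file=/etc/prometheus/prometheus.yml
      - --query.max-concurrency=40   # illustrative value
```

A higher limit lets more queries run at once but also increases CPU and memory pressure, so reducing the cost of dashboards and rules may be the better first step.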

warning

1.1.34.Prometheus configuration reload failure

Prometheus configuration reload error

- alert: PrometheusConfigurationReloadFailure
  expr: prometheus_config_last_reload_successful != 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus configuration reload failure (instance {{ $labels.instance }})
    description: "Prometheus configuration reload error\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
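The expression can be checked offline with `promtool test rules`. The sketch below tests only the PromQL expression, not the full alert, using two hypothetical instances, one of which starts failing its reload after two minutes. The file name is also hypothetical.

```yaml
# reload_test.yml; run with: promtool test rules reload_test.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Healthy instance: the last reload always succeeded.
      - series: 'prometheus_config_last_reload_successful{job="prometheus", instance="prom-a:9090"}'
        values: '1 1 1 1 1'
      # Failing instance: the gauge drops to 0 from the third sample onward.
      - series: 'prometheus_config_last_reload_successful{job="prometheus", instance="prom-b:9090"}'
        values: '1 1 0 0 0'
    promql_expr_test:
      - expr: prometheus_config_last_reload_successful != 1
        eval_time: 3m
        # Only the failing instance should be returned by the expression.
        exp_samples:
          - labels: 'prometheus_config_last_reload_successful{job="prometheus", instance="prom-b:9090"}'
            value: 0
```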

warning

1.1.35.Prometheus AlertManager configuration reload failure

AlertManager configuration reload error

- alert: PrometheusAlertManagerConfigurationReloadFailure
  expr: alertmanager_config_last_reload_successful != 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }})
    description: "AlertManager configuration reload error\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

warning

1.1.36.Prometheus AlertManager config not synced

Configurations of AlertManager cluster instances are out of sync

  # for: 20m tolerates the time a rolling config change takes to propagate across the
  # Alertmanager cluster, avoiding false positives mid-deploy.
- alert: PrometheusAlertManagerConfigNotSynced
  expr: count(count_values("config_hash", alertmanager_config_hash)) > 1
  for: 20m
  labels:
    severity: warning
  annotations:
    summary: Prometheus AlertManager config not synced (instance {{ $labels.instance }})
    description: "Configurations of AlertManager cluster instances are out of sync\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
