⚠️ Caution ⚠️
Alert thresholds depend on the nature of your applications.
Some queries on this page may use arbitrary tolerance thresholds.
Building an efficient and battle-tested monitoring platform takes time. 😉
-
# 1.1. Prometheus self-monitoring (28 rules)
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/prometheus-self-monitoring/embedded-exporter.yml
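Once downloaded, the rules file is wired into Prometheus through the `rule_files` stanza. A minimal sketch, assuming the file was saved under `/etc/prometheus/rules/` (adjust the path and Alertmanager address to your setup):

```yaml
# prometheus.yml -- minimal sketch; paths and targets below are assumptions
global:
  scrape_interval: 15s
  evaluation_interval: 15s   # how often alerting rules are evaluated

rule_files:
  - /etc/prometheus/rules/embedded-exporter.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]   # assumed Alertmanager address
```

Reload Prometheus (SIGHUP, or a POST to `/-/reload` when `--web.enable-lifecycle` is set) to pick up the new rules.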
# 1.1.1. Prometheus job missing
A Prometheus job has disappeared.

```yaml
- alert: PrometheusJobMissing
  expr: absent(up{job="prometheus"})
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus job missing (instance {{ $labels.instance }})
    description: "A Prometheus job has disappeared\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.2. Prometheus target missing
A Prometheus target has disappeared. An exporter might have crashed.

```yaml
# Only fire if at least one target in the job is still up.
# If all targets are down, PrometheusJobMissing or PrometheusAllTargetsMissing will fire instead.
- alert: PrometheusTargetMissing
  expr: up == 0 unless on(job) (sum by (job) (up) == 0)
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target missing (instance {{ $labels.instance }})
    description: "A Prometheus target has disappeared. An exporter might have crashed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.3. Prometheus all targets missing
A Prometheus job no longer has any living targets.

```yaml
- alert: PrometheusAllTargetsMissing
  expr: sum by (job) (up) == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus all targets missing (instance {{ $labels.instance }})
    description: "A Prometheus job no longer has any living targets.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.4. Prometheus target missing with warmup time
Allow a job time to start up (10 minutes) before alerting that it's down.

```yaml
- alert: PrometheusTargetMissingWithWarmupTime
  expr: sum by (instance, job) ((up == 0) * on (instance) group_left(__name__) (node_time_seconds - node_boot_time_seconds > 600))
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target missing with warmup time (instance {{ $labels.instance }})
    description: "Allow a job time to start up (10 minutes) before alerting that it's down.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.5. Prometheus configuration reload failure
Prometheus configuration reload error.

```yaml
- alert: PrometheusConfigurationReloadFailure
  expr: prometheus_config_last_reload_successful != 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus configuration reload failure (instance {{ $labels.instance }})
    description: "Prometheus configuration reload error\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.6. Prometheus too many restarts
Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.

```yaml
- alert: PrometheusTooManyRestarts
  expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus too many restarts (instance {{ $labels.instance }})
    description: "Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.7. Prometheus AlertManager job missing
A Prometheus AlertManager job has disappeared.

```yaml
- alert: PrometheusAlertmanagerJobMissing
  expr: absent(up{job="alertmanager"})
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus AlertManager job missing (instance {{ $labels.instance }})
    description: "A Prometheus AlertManager job has disappeared\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.8. Prometheus AlertManager configuration reload failure
AlertManager configuration reload error.

```yaml
- alert: PrometheusAlertmanagerConfigurationReloadFailure
  expr: alertmanager_config_last_reload_successful != 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }})
    description: "AlertManager configuration reload error\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.9. Prometheus AlertManager config not synced
Configurations of AlertManager cluster instances are out of sync.

```yaml
- alert: PrometheusAlertmanagerConfigNotSynced
  expr: count(count_values("config_hash", alertmanager_config_hash)) > 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus AlertManager config not synced (instance {{ $labels.instance }})
    description: "Configurations of AlertManager cluster instances are out of sync\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.10. Prometheus AlertManager E2E dead man switch
Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.

```yaml
- alert: PrometheusAlertmanagerE2eDeadManSwitch
  expr: vector(1)
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus AlertManager E2E dead man switch (instance {{ $labels.instance }})
    description: "Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
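To make the dead man's switch useful, route it to a dedicated receiver (typically an external heartbeat or watchdog service) with a short `repeat_interval`, so the downstream service notices when the heartbeat stops. A minimal Alertmanager sketch; the receiver name and webhook URL are placeholders, not part of the rule set above:

```yaml
# alertmanager.yml fragment -- receiver name and URL are assumptions
route:
  receiver: default
  routes:
    - receiver: deadmansswitch
      matchers:
        - alertname = "PrometheusAlertmanagerE2eDeadManSwitch"
      repeat_interval: 1m          # keep the heartbeat firing frequently

receivers:
  - name: default
  - name: deadmansswitch
    webhook_configs:
      - url: https://example.com/heartbeat   # external watchdog endpoint
```

The external service pages you when the heartbeat *stops* arriving, which catches a dead Prometheus or a broken Prometheus-to-Alertmanager path.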
# 1.1.11. Prometheus not connected to alertmanager
Prometheus cannot connect to the alertmanager.

```yaml
- alert: PrometheusNotConnectedToAlertmanager
  expr: prometheus_notifications_alertmanagers_discovered < 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus not connected to alertmanager (instance {{ $labels.instance }})
    description: "Prometheus cannot connect to the alertmanager\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.12. Prometheus rule evaluation failures
Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.

```yaml
- alert: PrometheusRuleEvaluationFailures
  expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus rule evaluation failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.13. Prometheus template text expansion failures
Prometheus encountered {{ $value }} template text expansion failures.

```yaml
- alert: PrometheusTemplateTextExpansionFailures
  expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus template text expansion failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} template text expansion failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.14. Prometheus rule evaluation slow
Prometheus rule evaluation took more time than the scheduled interval. This indicates slower storage backend access or overly complex queries.

```yaml
- alert: PrometheusRuleEvaluationSlow
  expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus rule evaluation slow (instance {{ $labels.instance }})
    description: "Prometheus rule evaluation took more time than the scheduled interval. This indicates slower storage backend access or overly complex queries.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.15. Prometheus notifications backlog
The Prometheus notification queue has not been empty for 10 minutes.

```yaml
- alert: PrometheusNotificationsBacklog
  expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus notifications backlog (instance {{ $labels.instance }})
    description: "The Prometheus notification queue has not been empty for 10 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.16. Prometheus AlertManager notification failing
Alertmanager is failing to send notifications.

```yaml
- alert: PrometheusAlertmanagerNotificationFailing
  expr: rate(alertmanager_notifications_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus AlertManager notification failing (instance {{ $labels.instance }})
    description: "Alertmanager is failing to send notifications\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.17. Prometheus target empty
Prometheus has no target in service discovery.

```yaml
- alert: PrometheusTargetEmpty
  expr: prometheus_sd_discovered_targets == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target empty (instance {{ $labels.instance }})
    description: "Prometheus has no target in service discovery\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.18. Prometheus target scraping slow
Prometheus is scraping exporters slowly because it exceeded the requested interval time. Your Prometheus server is under-provisioned.

```yaml
- alert: PrometheusTargetScrapingSlow
  expr: prometheus_target_interval_length_seconds{quantile="0.9"} / on (interval, instance, job) prometheus_target_interval_length_seconds{quantile="0.5"} > 1.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus target scraping slow (instance {{ $labels.instance }})
    description: "Prometheus is scraping exporters slowly because it exceeded the requested interval time. Your Prometheus server is under-provisioned.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.19. Prometheus large scrape
Prometheus has many scrapes that exceed the sample limit.

```yaml
- alert: PrometheusLargeScrape
  expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus large scrape (instance {{ $labels.instance }})
    description: "Prometheus has many scrapes that exceed the sample limit\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.20. Prometheus target scrape duplicate
Prometheus has many samples rejected due to duplicate timestamps but different values.

```yaml
- alert: PrometheusTargetScrapeDuplicate
  expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus target scrape duplicate (instance {{ $labels.instance }})
    description: "Prometheus has many samples rejected due to duplicate timestamps but different values\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.21. Prometheus TSDB checkpoint creation failures
Prometheus encountered {{ $value }} checkpoint creation failures.

```yaml
- alert: PrometheusTsdbCheckpointCreationFailures
  expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} checkpoint creation failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.22. Prometheus TSDB checkpoint deletion failures
Prometheus encountered {{ $value }} checkpoint deletion failures.

```yaml
- alert: PrometheusTsdbCheckpointDeletionFailures
  expr: increase(prometheus_tsdb_checkpoint_deletions_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} checkpoint deletion failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.23. Prometheus TSDB compactions failed
Prometheus encountered {{ $value }} TSDB compaction failures.

```yaml
- alert: PrometheusTsdbCompactionsFailed
  expr: increase(prometheus_tsdb_compactions_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB compactions failed (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB compaction failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.24. Prometheus TSDB head truncations failed
Prometheus encountered {{ $value }} TSDB head truncation failures.

```yaml
- alert: PrometheusTsdbHeadTruncationsFailed
  expr: increase(prometheus_tsdb_head_truncations_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB head truncations failed (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB head truncation failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.25. Prometheus TSDB reload failures
Prometheus encountered {{ $value }} TSDB reload failures.

```yaml
- alert: PrometheusTsdbReloadFailures
  expr: increase(prometheus_tsdb_reloads_failures_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB reload failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB reload failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.26. Prometheus TSDB WAL corruptions
Prometheus encountered {{ $value }} TSDB WAL corruptions.

```yaml
- alert: PrometheusTsdbWalCorruptions
  expr: increase(prometheus_tsdb_wal_corruptions_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB WAL corruptions\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.27. Prometheus TSDB WAL truncations failed
Prometheus encountered {{ $value }} TSDB WAL truncation failures.

```yaml
- alert: PrometheusTsdbWalTruncationsFailed
  expr: increase(prometheus_tsdb_wal_truncations_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB WAL truncation failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.28. Prometheus timeseries cardinality
The "{{ $labels.name }}" timeseries cardinality is getting very high: {{ $value }} [copy] - alert: PrometheusTimeseriesCardinality expr: label_replace(count by(__name__) ({__name__=~".+"}), "name", "$1", "__name__", "(.+)") > 10000 for: 0m labels: severity: warning annotations: summary: Prometheus timeseries cardinality (instance {{ $labels.instance }}) description: "The \"{{ $labels.name }}\" timeseries cardinality is getting very high: {{ $value }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 1.2. Host and hardware : node-exporter (35 rules)
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/host-and-hardware/node-exporter.yml
# 1.2.1. Host out of memory
Node memory is filling up (< 10% left).

```yaml
- alert: HostOutOfMemory
  expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < .10)
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host out of memory (instance {{ $labels.instance }})
    description: "Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.2. Host memory under memory pressure
The node is under heavy memory pressure. High rate of loading memory pages from disk.

```yaml
- alert: HostMemoryUnderMemoryPressure
  expr: (rate(node_vmstat_pgmajfault[5m]) > 1000)
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host memory under memory pressure (instance {{ $labels.instance }})
    description: "The node is under heavy memory pressure. High rate of loading memory pages from disk.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.3. Host Memory is underutilized
Node memory usage is < 20% for 1 week. Consider reducing the allocated memory. (instance {{ $labels.instance }})

```yaml
# You may want to increase the alert manager 'repeat_interval' for this type of alert to daily or weekly
- alert: HostMemoryIsUnderutilized
  expr: min_over_time(node_memory_MemFree_bytes[1w]) > node_memory_MemTotal_bytes * .8
  for: 0m
  labels:
    severity: info
  annotations:
    summary: Host Memory is underutilized (instance {{ $labels.instance }})
    description: "Node memory usage is < 20% for 1 week. Consider reducing the allocated memory. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.4. Host unusual network throughput in
Host receive bandwidth is high (>80%).

```yaml
- alert: HostUnusualNetworkThroughputIn
  expr: ((rate(node_network_receive_bytes_total[5m]) / node_network_speed_bytes) > .80)
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host unusual network throughput in (instance {{ $labels.instance }})
    description: "Host receive bandwidth is high (>80%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.5. Host unusual network throughput out
Host transmit bandwidth is high (>80%).

```yaml
- alert: HostUnusualNetworkThroughputOut
  expr: ((rate(node_network_transmit_bytes_total[5m]) / node_network_speed_bytes) > .80)
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host unusual network throughput out (instance {{ $labels.instance }})
    description: "Host transmit bandwidth is high (>80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.6. Host disk IO utilization high
Disk utilization is high (> 80%).

```yaml
- alert: HostDiskIoUtilizationHigh
  expr: (rate(node_disk_io_time_seconds_total[5m]) > .80)
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host disk IO utilization high (instance {{ $labels.instance }})
    description: "Disk utilization is high (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.7. Host out of disk space
Disk is almost full (< 10% left).

```yaml
# Please add ignored mountpoints in node_exporter parameters like
# "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
# Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
- alert: HostOutOfDiskSpace
  expr: (node_filesystem_avail_bytes{fstype!~"^(fuse.*|tmpfs|cifs|nfs)"} / node_filesystem_size_bytes < .10 and on (instance, device, mountpoint) node_filesystem_readonly == 0)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Host out of disk space (instance {{ $labels.instance }})
    description: "Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.8. Host disk may fill in 24 hours
Filesystem will likely run out of space within the next 24 hours.

```yaml
# Please add ignored mountpoints in node_exporter parameters like
# "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
# Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
- alert: HostDiskMayFillIn24Hours
  expr: predict_linear(node_filesystem_avail_bytes{fstype!~"^(fuse.*|tmpfs|cifs|nfs)"}[3h], 86400) <= 0 and node_filesystem_avail_bytes > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host disk may fill in 24 hours (instance {{ $labels.instance }})
    description: "Filesystem will likely run out of space within the next 24 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.9. Host out of inodes
Disk is almost running out of available inodes (< 10% left).

```yaml
- alert: HostOutOfInodes
  expr: (node_filesystem_files_free / node_filesystem_files < .10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Host out of inodes (instance {{ $labels.instance }})
    description: "Disk is almost running out of available inodes (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.10. Host filesystem device error
Error stat-ing the {{ $labels.mountpoint }} filesystem.

```yaml
- alert: HostFilesystemDeviceError
  expr: node_filesystem_device_error{fstype!~"^(fuse.*|tmpfs|cifs|nfs)"} == 1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Host filesystem device error (instance {{ $labels.instance }})
    description: "Error stat-ing the {{ $labels.mountpoint }} filesystem\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.11. Host inodes may fill in 24 hours
Filesystem will likely run out of inodes within the next 24 hours at current write rate.

```yaml
- alert: HostInodesMayFillIn24Hours
  expr: predict_linear(node_filesystem_files_free{fstype!~"^(fuse.*|tmpfs|cifs|nfs)"}[1h], 86400) <= 0 and node_filesystem_files_free > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host inodes may fill in 24 hours (instance {{ $labels.instance }})
    description: "Filesystem will likely run out of inodes within the next 24 hours at current write rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
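Both 24-hour prediction rules rely on `predict_linear()`, which fits a least-squares line through the samples in the range window and extrapolates it into the future. A minimal Python sketch of the underlying math (simplified: it extrapolates from the last sample's timestamp rather than the evaluation time, and the sample data is illustrative):

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares fit over (timestamp, value) pairs, extrapolated
    seconds_ahead past the last sample -- the idea behind PromQL's
    predict_linear(metric[range], t)."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var                      # e.g. free bytes (or inodes) per second
    intercept = mean_v - slope * mean_t
    target_t = samples[-1][0] + seconds_ahead
    return slope * target_t + intercept

# A filesystem steadily losing free space: extrapolated 24h ahead,
# the predicted free space is far below zero, so the alert would fire.
samples = [(0, 10_000), (60, 6_400), (120, 2_800)]
print(predict_linear(samples, 86400))
```

The rules above fire when this extrapolated value drops to or below zero while the current free space is still positive, i.e. the filesystem is trending toward full within 24 hours.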
# 1.2.12. Host unusual disk read latency
Disk latency is growing (read operations > 100ms).

```yaml
- alert: HostUnusualDiskReadLatency
  expr: (rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0)
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host unusual disk read latency (instance {{ $labels.instance }})
    description: "Disk latency is growing (read operations > 100ms)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.13. Host unusual disk write latency
Disk latency is growing (write operations > 100ms).

```yaml
- alert: HostUnusualDiskWriteLatency
  expr: (rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0)
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host unusual disk write latency (instance {{ $labels.instance }})
    description: "Disk latency is growing (write operations > 100ms)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.14. Host high CPU load
CPU load is > 80%.

```yaml
- alert: HostHighCpuLoad
  expr: 1 - (avg without (cpu) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > .80
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Host high CPU load (instance {{ $labels.instance }})
    description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.15. Host CPU is underutilized
CPU load has been < 20% for 1 week. Consider reducing the number of CPUs.

```yaml
# You may want to increase the alert manager 'repeat_interval' for this type of alert to daily or weekly
- alert: HostCpuIsUnderutilized
  expr: (min without (cpu) (rate(node_cpu_seconds_total{mode="idle"}[1h]))) > 0.8
  for: 1w
  labels:
    severity: info
  annotations:
    summary: Host CPU is underutilized (instance {{ $labels.instance }})
    description: "CPU load has been < 20% for 1 week. Consider reducing the number of CPUs.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.16. Host CPU steal noisy neighbor
CPU steal is > 10%. A noisy neighbor is degrading VM performance, or a spot instance may have run out of CPU credits.

```yaml
- alert: HostCpuStealNoisyNeighbor
  expr: avg without (cpu) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host CPU steal noisy neighbor (instance {{ $labels.instance }})
    description: "CPU steal is > 10%. A noisy neighbor is degrading VM performance, or a spot instance may have run out of CPU credits.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.17. Host CPU high iowait
CPU iowait > 10%. Your CPU is idling waiting for storage to respond.

```yaml
- alert: HostCpuHighIowait
  expr: avg without (cpu) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) > .10
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host CPU high iowait (instance {{ $labels.instance }})
    description: "CPU iowait > 10%. Your CPU is idling waiting for storage to respond.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.18. Host unusual disk IO
Disk I/O usage is >80%. Check storage for issues or increase IOPS capacity.

```yaml
- alert: HostUnusualDiskIo
  expr: rate(node_disk_io_time_seconds_total[5m]) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host unusual disk IO (instance {{ $labels.instance }})
    description: "Disk I/O usage is >80%. Check storage for issues or increase IOPS capacity.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.19. Host context switching high
Context switching is growing on the node (twice the daily average during the last 15m).

```yaml
# x2 context switches is an arbitrary number.
# The alert threshold depends on the nature of the application.
# Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58
- alert: HostContextSwitchingHigh
  expr: (rate(node_context_switches_total[15m])/count without(mode,cpu) (node_cpu_seconds_total{mode="idle"})) / (rate(node_context_switches_total[1d])/count without(mode,cpu) (node_cpu_seconds_total{mode="idle"})) > 2
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host context switching high (instance {{ $labels.instance }})
    description: "Context switching is growing on the node (twice the daily average during the last 15m)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.20. Host swap is filling up
Swap is filling up (>80%).

```yaml
- alert: HostSwapIsFillingUp
  expr: ((1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80)
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host swap is filling up (instance {{ $labels.instance }})
    description: "Swap is filling up (>80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.21. Host systemd service crashed
systemd service {{ $labels.name }} crashed.

```yaml
- alert: HostSystemdServiceCrashed
  expr: (node_systemd_unit_state{state="failed"} == 1)
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host systemd service crashed (instance {{ $labels.instance }})
    description: "systemd service {{ $labels.name }} crashed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.22. Host physical component too hot
Physical hardware component too hot.

```yaml
- alert: HostPhysicalComponentTooHot
  expr: node_hwmon_temp_celsius > node_hwmon_temp_max_celsius
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host physical component too hot (instance {{ $labels.instance }})
    description: "Physical hardware component too hot\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.23. Host node overtemperature alarm
Physical node temperature alarm triggered.

```yaml
- alert: HostNodeOvertemperatureAlarm
  expr: ((node_hwmon_temp_crit_alarm_celsius == 1) or (node_hwmon_temp_alarm == 1))
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Host node overtemperature alarm (instance {{ $labels.instance }})
    description: "Physical node temperature alarm triggered\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.24. Host software RAID insufficient drives
MD RAID array {{ $labels.device }} on {{ $labels.instance }} has insufficient drives remaining.

```yaml
- alert: HostSoftwareRaidInsufficientDrives
  expr: ((node_md_disks_required - on(device, instance) node_md_disks{state="active"}) > 0)
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Host software RAID insufficient drives (instance {{ $labels.instance }})
    description: "MD RAID array {{ $labels.device }} on {{ $labels.instance }} has insufficient drives remaining.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.25. Host software RAID disk failure
MD RAID array {{ $labels.device }} on {{ $labels.instance }} needs attention.

```yaml
- alert: HostSoftwareRaidDiskFailure
  expr: (node_md_disks{state="failed"} > 0)
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host software RAID disk failure (instance {{ $labels.instance }})
    description: "MD RAID array {{ $labels.device }} on {{ $labels.instance }} needs attention.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.26. Host kernel version deviations
Kernel version for {{ $labels.instance }} has changed.

```yaml
- alert: HostKernelVersionDeviations
  expr: changes(node_uname_info[1h]) > 0
  for: 0m
  labels:
    severity: info
  annotations:
    summary: Host kernel version deviations (instance {{ $labels.instance }})
    description: "Kernel version for {{ $labels.instance }} has changed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.27. Host OOM kill detected
OOM kill detected.

```yaml
# When a machine runs out of memory, the node exporter can become unresponsive for several minutes.
# Even if the system takes 15-20 minutes to recover, the alert should still trigger.
- alert: HostOomKillDetected
  expr: (increase(node_vmstat_oom_kill[30m]) > 0)
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host OOM kill detected (instance {{ $labels.instance }})
    description: "OOM kill detected\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.28. Host EDAC Correctable Errors detected
Host {{ $labels.instance }} has had {{ printf "%.0f" $value }} correctable memory errors reported by EDAC in the last minute.

```yaml
- alert: HostEdacCorrectableErrorsDetected
  expr: (increase(node_edac_correctable_errors_total[1m]) > 0)
  for: 0m
  labels:
    severity: info
  annotations:
    summary: Host EDAC Correctable Errors detected (instance {{ $labels.instance }})
    description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} correctable memory errors reported by EDAC in the last minute.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.29. Host EDAC Uncorrectable Errors detected
Host {{ $labels.instance }} has had {{ printf "%.0f" $value }} uncorrectable memory errors reported by EDAC.

```yaml
- alert: HostEdacUncorrectableErrorsDetected
  expr: (node_edac_uncorrectable_errors_total > 0)
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host EDAC Uncorrectable Errors detected (instance {{ $labels.instance }})
    description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} uncorrectable memory errors reported by EDAC.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.30. Host Network Receive Errors
Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last two minutes.

```yaml
- alert: HostNetworkReceiveErrors
  expr: (rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01)
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host Network Receive Errors (instance {{ $labels.instance }})
    description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last two minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.31. Host Network Transmit Errors
Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last two minutes.

```yaml
- alert: HostNetworkTransmitErrors
  expr: (rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01)
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host Network Transmit Errors (instance {{ $labels.instance }})
    description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last two minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.32. Host Network Bond Degraded
Bond "{{ $labels.device }}" degraded on "{{ $labels.instance }}". [copy] - alert: HostNetworkBondDegraded expr: ((node_bonding_active - node_bonding_slaves) != 0) for: 2m labels: severity: warning annotations: summary: Host Network Bond Degraded (instance {{ $labels.instance }}) description: "Bond \"{{ $labels.device }}\" degraded on \"{{ $labels.instance }}\".\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.2.33. Host conntrack limit
The number of conntrack entries is approaching the limit.

```yaml
- alert: HostConntrackLimit
  expr: (node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8)
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host conntrack limit (instance {{ $labels.instance }})
    description: "The number of conntrack entries is approaching the limit\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.34. Host clock skew
Clock skew detected. Clock is out of sync. Ensure NTP is configured correctly on this host.

```yaml
- alert: HostClockSkew
  expr: ((node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0))
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Host clock skew (instance {{ $labels.instance }})
    description: "Clock skew detected. Clock is out of sync. Ensure NTP is configured correctly on this host.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.35. Host clock not synchronising
Clock not synchronising. Ensure NTP is configured on this host.

```yaml
- alert: HostClockNotSynchronising
  expr: min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host clock not synchronising (instance {{ $labels.instance }})
    description: "Clock not synchronising. Ensure NTP is configured on this host.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
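Rules like the two clock alerts above can be exercised offline with `promtool test rules` before they reach production. A minimal sketch for HostClockNotSynchronising — the file names and instance label are illustrative:

```yaml
# clock-tests.yml — run with: promtool test rules clock-tests.yml
rule_files:
  - node-exporter.yml        # the file containing the alert rules above
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # sync flag stuck at 0 and maxerror above the 16s threshold
      - series: 'node_timex_sync_status{instance="host:9100"}'
        values: '0x10'
      - series: 'node_timex_maxerror_seconds{instance="host:9100"}'
        values: '20x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: HostClockNotSynchronising
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: host:9100
```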
-
# 1.3. S.M.A.R.T Device Monitoring : smartctl-exporter (8 rules)
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/s.m.a.r.t-device-monitoring/smartctl-exporter.yml
-
# 1.3.1. SMART device temperature warning
Device temperature warning on {{ $labels.instance }} drive {{ $labels.device }} over 60°C

```yaml
- alert: SmartDeviceTemperatureWarning
  expr: (avg_over_time(smartctl_device_temperature{temperature_type="current"}[5m]) unless on (instance, device) smartctl_device_temperature{temperature_type="drive_trip"}) > 60
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: SMART device temperature warning (instance {{ $labels.instance }})
    description: "Device temperature warning on {{ $labels.instance }} drive {{ $labels.device }} over 60°C\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.3.2. SMART device temperature critical
Device temperature critical on {{ $labels.instance }} drive {{ $labels.device }} over 70°C

```yaml
- alert: SmartDeviceTemperatureCritical
  expr: (max_over_time(smartctl_device_temperature{temperature_type="current"}[5m]) unless on (instance, device) smartctl_device_temperature{temperature_type="drive_trip"}) > 70
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: SMART device temperature critical (instance {{ $labels.instance }})
    description: "Device temperature critical on {{ $labels.instance }} drive {{ $labels.device }} over 70°C\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.3.3. SMART device temperature over trip value
Device temperature over trip value on {{ $labels.instance }} drive {{ $labels.device }}

```yaml
- alert: SmartDeviceTemperatureOverTripValue
  expr: max_over_time(smartctl_device_temperature{temperature_type="current"}[10m]) >= on(device, instance) smartctl_device_temperature{temperature_type="drive_trip"}
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: SMART device temperature over trip value (instance {{ $labels.instance }})
    description: "Device temperature over trip value on {{ $labels.instance }} drive {{ $labels.device }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.3.4. SMART device temperature nearing trip value
Device temperature at 80% of trip value on {{ $labels.instance }} drive {{ $labels.device }}

```yaml
- alert: SmartDeviceTemperatureNearingTripValue
  expr: max_over_time(smartctl_device_temperature{temperature_type="current"}[10m]) >= on(device, instance) (smartctl_device_temperature{temperature_type="drive_trip"} * 0.80)
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: SMART device temperature nearing trip value (instance {{ $labels.instance }})
    description: "Device temperature at 80% of trip value on {{ $labels.instance }} drive {{ $labels.device }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.3.5. SMART status
Device has a SMART status failure on {{ $labels.instance }} drive {{ $labels.device }}

```yaml
- alert: SmartStatus
  expr: smartctl_device_smart_status != 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: SMART status (instance {{ $labels.instance }})
    description: "Device has a SMART status failure on {{ $labels.instance }} drive {{ $labels.device }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.3.6. SMART critical warning
Disk controller has critical warning on {{ $labels.instance }} drive {{ $labels.device }}

```yaml
- alert: SmartCriticalWarning
  expr: smartctl_device_critical_warning > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: SMART critical warning (instance {{ $labels.instance }})
    description: "Disk controller has critical warning on {{ $labels.instance }} drive {{ $labels.device }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.3.7. SMART media errors
Disk controller detected media errors on {{ $labels.instance }} drive {{ $labels.device }}

```yaml
- alert: SmartMediaErrors
  expr: smartctl_device_media_errors > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: SMART media errors (instance {{ $labels.instance }})
    description: "Disk controller detected media errors on {{ $labels.instance }} drive {{ $labels.device }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.3.8. SMART Wearout Indicator
Device is wearing out on {{ $labels.instance }} drive {{ $labels.device }}

```yaml
- alert: SmartWearoutIndicator
  expr: smartctl_device_available_spare < smartctl_device_available_spare_threshold
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: SMART Wearout Indicator (instance {{ $labels.instance }})
    description: "Device is wearing out on {{ $labels.instance }} drive {{ $labels.device }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
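Before lowering the fixed 60/70°C thresholds above, it can help to see how much headroom each drive has to its firmware trip point. A query along these lines, run ad hoc in the Prometheus expression browser:

```promql
# Degrees of headroom before each drive reaches its trip temperature
smartctl_device_temperature{temperature_type="drive_trip"}
  - on(instance, device)
    smartctl_device_temperature{temperature_type="current"}
```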
-
# 1.4. IPMI : prometheus-community/ipmi_exporter (17 rules)
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/ipmi/ipmi-exporter.yml
-
# 1.4.1. IPMI collector down
IPMI collector {{ $labels.collector }} on {{ $labels.instance }} failed to scrape sensor data. Check FreeIPMI tools and BMC connectivity.

```yaml
# The ipmi_up metric is per-collector. A value of 0 means the collector
# could not retrieve data from the BMC.
- alert: IpmiCollectorDown
  expr: ipmi_up == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI collector down (instance {{ $labels.instance }})
    description: "IPMI collector {{ $labels.collector }} on {{ $labels.instance }} failed to scrape sensor data. Check FreeIPMI tools and BMC connectivity.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.2. IPMI temperature sensor warning
IPMI temperature sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.

```yaml
# State values: 0=nominal, 1=warning, 2=critical.
# Thresholds are defined in the BMC firmware.
- alert: IpmiTemperatureSensorWarning
  expr: ipmi_temperature_state == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI temperature sensor warning (instance {{ $labels.instance }})
    description: "IPMI temperature sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.3. IPMI temperature sensor critical
IPMI temperature sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. Immediate attention required to prevent hardware damage.

```yaml
- alert: IpmiTemperatureSensorCritical
  expr: ipmi_temperature_state == 2
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI temperature sensor critical (instance {{ $labels.instance }})
    description: "IPMI temperature sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. Immediate attention required to prevent hardware damage.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.4. IPMI fan speed sensor warning
IPMI fan sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.

```yaml
- alert: IpmiFanSpeedSensorWarning
  expr: ipmi_fan_speed_state == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI fan speed sensor warning (instance {{ $labels.instance }})
    description: "IPMI fan sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.5. IPMI fan speed sensor critical
IPMI fan sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. A fan may have failed.

```yaml
- alert: IpmiFanSpeedSensorCritical
  expr: ipmi_fan_speed_state == 2
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI fan speed sensor critical (instance {{ $labels.instance }})
    description: "IPMI fan sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. A fan may have failed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.6. IPMI fan speed zero
IPMI fan {{ $labels.name }} on {{ $labels.instance }} reports 0 RPM. The fan may have failed.

```yaml
- alert: IpmiFanSpeedZero
  expr: ipmi_fan_speed_rpm == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: IPMI fan speed zero (instance {{ $labels.instance }})
    description: "IPMI fan {{ $labels.name }} on {{ $labels.instance }} reports 0 RPM. The fan may have failed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.7. IPMI voltage sensor warning
IPMI voltage sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.

```yaml
- alert: IpmiVoltageSensorWarning
  expr: ipmi_voltage_state == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI voltage sensor warning (instance {{ $labels.instance }})
    description: "IPMI voltage sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.8. IPMI voltage sensor critical
IPMI voltage sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. Power supply or motherboard issue possible.

```yaml
- alert: IpmiVoltageSensorCritical
  expr: ipmi_voltage_state == 2
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI voltage sensor critical (instance {{ $labels.instance }})
    description: "IPMI voltage sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. Power supply or motherboard issue possible.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.9. IPMI current sensor warning
IPMI current sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.

```yaml
- alert: IpmiCurrentSensorWarning
  expr: ipmi_current_state == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI current sensor warning (instance {{ $labels.instance }})
    description: "IPMI current sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.10. IPMI current sensor critical
IPMI current sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state.

```yaml
- alert: IpmiCurrentSensorCritical
  expr: ipmi_current_state == 2
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI current sensor critical (instance {{ $labels.instance }})
    description: "IPMI current sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.11. IPMI power sensor warning
IPMI power sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.

```yaml
- alert: IpmiPowerSensorWarning
  expr: ipmi_power_state == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI power sensor warning (instance {{ $labels.instance }})
    description: "IPMI power sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.12. IPMI power sensor critical
IPMI power sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state.

```yaml
- alert: IpmiPowerSensorCritical
  expr: ipmi_power_state == 2
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI power sensor critical (instance {{ $labels.instance }})
    description: "IPMI power sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.13. IPMI generic sensor critical
IPMI sensor {{ $labels.name }} (type={{ $labels.type }}) on {{ $labels.instance }} is in critical state.

```yaml
# Catches any sensor type not covered by the specific
# temperature/fan/voltage/current/power alerts.
- alert: IpmiGenericSensorCritical
  expr: ipmi_sensor_state == 2
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI generic sensor critical (instance {{ $labels.instance }})
    description: "IPMI sensor {{ $labels.name }} (type={{ $labels.type }}) on {{ $labels.instance }} is in critical state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.14. IPMI chassis power off
IPMI reports chassis power is off on {{ $labels.instance }}. The server may have shut down unexpectedly.

```yaml
- alert: IpmiChassisPowerOff
  expr: ipmi_chassis_power_state == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI chassis power off (instance {{ $labels.instance }})
    description: "IPMI reports chassis power is off on {{ $labels.instance }}. The server may have shut down unexpectedly.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.15. IPMI chassis drive fault
IPMI reports a drive fault on {{ $labels.instance }}. Check disk health.

```yaml
# The metric uses inverted logic: 1=no fault, 0=fault detected.
- alert: IpmiChassisDriveFault
  expr: ipmi_chassis_drive_fault_state == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI chassis drive fault (instance {{ $labels.instance }})
    description: "IPMI reports a drive fault on {{ $labels.instance }}. Check disk health.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.16. IPMI chassis cooling fault
IPMI reports a cooling/fan fault on {{ $labels.instance }}. Check fans and airflow.

```yaml
# The metric uses inverted logic: 1=no fault, 0=fault detected.
- alert: IpmiChassisCoolingFault
  expr: ipmi_chassis_cooling_fault_state == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI chassis cooling fault (instance {{ $labels.instance }})
    description: "IPMI reports a cooling/fan fault on {{ $labels.instance }}. Check fans and airflow.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.17. IPMI SEL almost full
IPMI System Event Log on {{ $labels.instance }} has only {{ printf "%.0f" $value }} bytes free. Clear the SEL to prevent loss of new events.

```yaml
# SEL storage is typically very limited (e.g., 16KB).
# When full, new events may be dropped.
- alert: IpmiSelAlmostFull
  expr: ipmi_sel_free_space_bytes < 512
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI SEL almost full (instance {{ $labels.instance }})
    description: "IPMI System Event Log on {{ $labels.instance }} has only {{ printf \"%.0f\" $value }} bytes free. Clear the SEL to prevent loss of new events.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
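Which sensors the rules above can see depends on the exporter's module configuration. A sketch of an ipmi_exporter config enabling the collectors these rules rely on — credentials are illustrative, and collector availability should be checked against your exporter version:

```yaml
# ipmi_remote.yml (passed to ipmi_exporter via --config.file)
modules:
  default:
    user: "monitor"           # illustrative credentials
    pass: "example-password"
    collectors:
      - ipmi       # temperature/fan/voltage/current/power sensor states
      - chassis    # power, drive-fault and cooling-fault state
      - sel        # SEL free-space metrics
```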
-
# 1.5. Docker containers : google/cAdvisor (9 rules)
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/docker-containers/google-cadvisor.yml
-
# 1.5.1. Container killed
A container has disappeared.

```yaml
# This rule can be very noisy in dynamic infra with legitimate
# container start/stop/deployment.
- alert: ContainerKilled
  expr: time() - container_last_seen > 60
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Container killed (instance {{ $labels.instance }})
    description: "A container has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.5.2. Container absent
A container has been absent for 5 minutes.

```yaml
# This rule can be very noisy in dynamic infra with legitimate
# container start/stop/deployment.
- alert: ContainerAbsent
  expr: absent(container_last_seen)
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Container absent (instance {{ $labels.instance }})
    description: "A container has been absent for 5 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.5.3. Container High CPU utilization
Container CPU utilization is above 80%.

```yaml
- alert: ContainerHighCpuUtilization
  expr: (sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=""} / container_spec_cpu_period{container!=""}) by (pod, container) * 100) > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Container High CPU utilization (instance {{ $labels.instance }})
    description: "Container CPU utilization is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.5.4. Container High Memory usage
Container Memory usage is above 80%.

```yaml
# See https://medium.com/faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
- alert: ContainerHighMemoryUsage
  expr: (sum(container_memory_working_set_bytes{name!=""}) by (instance, name) / sum(container_spec_memory_limit_bytes > 0) by (instance, name) * 100) > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Container High Memory usage (instance {{ $labels.instance }})
    description: "Container Memory usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.5.5. Container Volume usage
Container Volume usage is above 80%.

```yaml
- alert: ContainerVolumeUsage
  expr: (1 - (sum(container_fs_inodes_free{name!=""}) by (instance) / sum(container_fs_inodes_total) by (instance))) * 100 > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Container Volume usage (instance {{ $labels.instance }})
    description: "Container Volume usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.5.6. Container high throttle rate
Container is being throttled.

```yaml
- alert: ContainerHighThrottleRate
  expr: sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace) / sum(increase(container_cpu_cfs_periods_total[5m])) by (container, pod, namespace) > (25 / 100)
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Container high throttle rate (instance {{ $labels.instance }})
    description: "Container is being throttled\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.5.7. Container high low change CPU usage
Monitors the absolute change in container CPU usage between adjacent time windows and fires when the change exceeds 25 percentage points.

```yaml
- alert: ContainerHighLowChangeCpuUsage
  expr: (abs((sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=""}[1m])) * 100) - (sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=""}[1m] offset 1m)) * 100)) or abs((sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=""}[1m])) * 100) - (sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=""}[5m] offset 1m)) * 100))) > 25
  for: 0m
  labels:
    severity: info
  annotations:
    summary: Container high low change CPU usage (instance {{ $labels.instance }})
    description: "Absolute change in container CPU usage between adjacent time windows exceeds 25 percentage points.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.5.8. Container Low CPU utilization
Container CPU utilization is under 20% for 1 week. Consider reducing the allocated CPU.

```yaml
- alert: ContainerLowCpuUtilization
  expr: (sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=""} / container_spec_cpu_period{container!=""}) by (pod, container) * 100) < 20
  for: 7d
  labels:
    severity: info
  annotations:
    summary: Container Low CPU utilization (instance {{ $labels.instance }})
    description: "Container CPU utilization is under 20% for 1 week. Consider reducing the allocated CPU.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.5.9. Container Low Memory usage
Container Memory usage is under 20% for 1 week. Consider reducing the allocated memory.

```yaml
- alert: ContainerLowMemoryUsage
  expr: (sum(container_memory_working_set_bytes{name!=""}) by (instance, name) / sum(container_spec_memory_limit_bytes > 0) by (instance, name) * 100) < 20
  for: 7d
  labels:
    severity: info
  annotations:
    summary: Container Low Memory usage (instance {{ $labels.instance }})
    description: "Container Memory usage is under 20% for 1 week. Consider reducing the allocated memory.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
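Rules 1.5.3 and 1.5.8 evaluate the same CPU-utilization ratio with different thresholds. A recording rule (the record name below is illustrative) computes it once per evaluation and keeps both alert expressions short:

```yaml
groups:
  - name: cadvisor-recording
    rules:
      # Per-container CPU utilization as a percentage of the CFS quota
      - record: container:cpu_utilization:percent
        expr: sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=""} / container_spec_cpu_period{container!=""}) by (pod, container) * 100
```

The two alerts then reduce to `container:cpu_utilization:percent > 80` and `container:cpu_utilization:percent < 20`.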
-
# 1.6. Blackbox : prometheus/blackbox_exporter (9 rules)
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/blackbox/blackbox-exporter.yml
-
# 1.6.1. Blackbox probe failed
Probe failed.

```yaml
- alert: BlackboxProbeFailed
  expr: probe_success == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Blackbox probe failed (instance {{ $labels.instance }})
    description: "Probe failed\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.6.2. Blackbox configuration reload failure
Blackbox configuration reload failure.

```yaml
- alert: BlackboxConfigurationReloadFailure
  expr: blackbox_exporter_config_last_reload_successful != 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Blackbox configuration reload failure (instance {{ $labels.instance }})
    description: "Blackbox configuration reload failure\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.6.3. Blackbox slow probe
Blackbox probe took more than 1s to complete.

```yaml
- alert: BlackboxSlowProbe
  expr: avg_over_time(probe_duration_seconds[1m]) > 1
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: Blackbox slow probe (instance {{ $labels.instance }})
    description: "Blackbox probe took more than 1s to complete\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.6.4. Blackbox probe HTTP failure
HTTP status code is not in the 200-399 range.

```yaml
# Note: PromQL operators are lowercase ("or", not "OR").
- alert: BlackboxProbeHttpFailure
  expr: probe_http_status_code <= 199 or probe_http_status_code >= 400
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Blackbox probe HTTP failure (instance {{ $labels.instance }})
    description: "HTTP status code is not 200-399\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.6.5. Blackbox SSL certificate will expire soon
SSL certificate expires in less than 20 days.

```yaml
- alert: BlackboxSslCertificateWillExpireSoon
  expr: 3 <= round((last_over_time(probe_ssl_earliest_cert_expiry[10m]) - time()) / 86400, 0.1) < 20
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
    description: "SSL certificate expires in less than 20 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.6.6. Blackbox SSL certificate will expire very soon
SSL certificate expires in less than 3 days.

```yaml
- alert: BlackboxSslCertificateWillExpireVerySoon
  expr: 0 <= round((last_over_time(probe_ssl_earliest_cert_expiry[10m]) - time()) / 86400, 0.1) < 3
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Blackbox SSL certificate will expire very soon (instance {{ $labels.instance }})
    description: "SSL certificate expires in less than 3 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.6.7. Blackbox SSL certificate expired
SSL certificate has expired already.

```yaml
# For probe_ssl_earliest_cert_expiry to be exposed after expiration, you
# need to enable insecure_skip_verify. Note that this will disable
# certificate validation.
# See https://github.com/prometheus/blackbox_exporter/blob/master/CONFIGURATION.md#tls_config
- alert: BlackboxSslCertificateExpired
  expr: round((last_over_time(probe_ssl_earliest_cert_expiry[10m]) - time()) / 86400, 0.1) < 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Blackbox SSL certificate expired (instance {{ $labels.instance }})
    description: "SSL certificate has expired already\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
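As the comment on 1.6.7 notes, `probe_ssl_earliest_cert_expiry` is only exposed after expiry when certificate verification is skipped. A blackbox_exporter module doing that might look like this (module name illustrative; use it only for probes where losing certificate validation is acceptable):

```yaml
modules:
  http_2xx_insecure:
    prober: http
    http:
      tls_config:
        insecure_skip_verify: true   # disables certificate validation
```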
# 1.6.8. Blackbox probe slow HTTP
HTTP request took more than 1s.

```yaml
- alert: BlackboxProbeSlowHttp
  expr: avg_over_time(probe_http_duration_seconds[1m]) > 1
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: Blackbox probe slow HTTP (instance {{ $labels.instance }})
    description: "HTTP request took more than 1s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.6.9. Blackbox probe slow ping
Blackbox ping took more than 1s.

```yaml
- alert: BlackboxProbeSlowPing
  expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 1
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: Blackbox probe slow ping (instance {{ $labels.instance }})
    description: "Blackbox ping took more than 1s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
-
# 1.7. Windows Server : prometheus-community/windows_exporter (5 rules)
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/windows-server/windows-exporter.yml
-
# 1.7.1. Windows Server collector Error
Collector {{ $labels.collector }} was not successful.

```yaml
- alert: WindowsServerCollectorError
  expr: windows_exporter_collector_success == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Windows Server collector Error (instance {{ $labels.instance }})
    description: "Collector {{ $labels.collector }} was not successful\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.7.2. Windows Server service Status
Windows Service state is not OK.

```yaml
- alert: WindowsServerServiceStatus
  expr: windows_service_status{status="ok"} != 1
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Windows Server service Status (instance {{ $labels.instance }})
    description: "Windows Service state is not OK\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.7.3. Windows Server CPU Usage
CPU Usage is more than 80%.

```yaml
- alert: WindowsServerCpuUsage
  expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m])) * 100) > 80
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Windows Server CPU Usage (instance {{ $labels.instance }})
    description: "CPU Usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.7.4. Windows Server memory Usage
Memory usage is more than 90%.

```yaml
- alert: WindowsServerMemoryUsage
  expr: 100 - ((windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100) > 90
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Windows Server memory Usage (instance {{ $labels.instance }})
    description: "Memory usage is more than 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.7.5. Windows Server disk Space Usage
Disk usage is more than 80%.

```yaml
- alert: WindowsServerDiskSpaceUsage
  expr: 100.0 - 100 * ((windows_logical_disk_free_bytes / 1024 / 1024) / (windows_logical_disk_size_bytes / 1024 / 1024)) > 80
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Windows Server disk Space Usage (instance {{ $labels.instance }})
    description: "Disk usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
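Note that newer windows_exporter releases replace the `windows_service_status` metric used in 1.7.2 with `windows_service_state`; check which metric your exporter version actually exports. On versions exposing the newer metric, the rule might be adapted roughly as follows (a sketch, not verified against every release):

```yaml
- alert: WindowsServerServiceStatus
  expr: windows_service_state{state="running"} != 1
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Windows Server service Status (instance {{ $labels.instance }})
    description: "Windows Service {{ $labels.name }} is not running\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```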
-
# 1.8. VMware : pryorda/vmware_exporter (4 rules)
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/vmware/pryorda-vmware-exporter.yml
-
# 1.8.1. Virtual Machine Memory Warning
High memory usage on {{ $labels.instance }}: {{ $value | printf "%.2f" }}%

```yaml
- alert: VirtualMachineMemoryWarning
  expr: vmware_vm_mem_usage_average / 100 >= 80 and vmware_vm_mem_usage_average / 100 < 90
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Virtual Machine Memory Warning (instance {{ $labels.instance }})
    description: "High memory usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\" }}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.8.2. Virtual Machine Memory Critical
High memory usage on {{ $labels.instance }}: {{ $value | printf "%.2f" }}%

```yaml
- alert: VirtualMachineMemoryCritical
  expr: vmware_vm_mem_usage_average / 100 >= 90
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Virtual Machine Memory Critical (instance {{ $labels.instance }})
    description: "High memory usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\" }}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.8.3. High Number of Snapshots
High snapshot count on {{ $labels.instance }}: {{ $value }}

```yaml
- alert: HighNumberOfSnapshots
  expr: vmware_vm_snapshots > 3
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: High Number of Snapshots (instance {{ $labels.instance }})
    description: "High snapshot count on {{ $labels.instance }}: {{ $value }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.8.4. Outdated Snapshots
Outdated snapshots on {{ $labels.instance }}: {{ $value | printf "%.0f" }} days

```yaml
- alert: OutdatedSnapshots
  expr: (time() - vmware_vm_snapshot_timestamp_seconds) / (60 * 60 * 24) >= 3
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Outdated Snapshots (instance {{ $labels.instance }})
    description: "Outdated snapshots on {{ $labels.instance }}: {{ $value | printf \"%.0f\" }} days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
-
# 1.9. Proxmox VE : prometheus-pve/prometheus-pve-exporter (9 rules)
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/proxmox-ve/prometheus-pve-exporter.yml
-
# 1.9.1. PVE node down
Proxmox VE node {{ $labels.id }} is down.

```yaml
- alert: PveNodeDown
  expr: pve_up{id=~"node/.*"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: PVE node down (instance {{ $labels.instance }})
    description: "Proxmox VE node {{ $labels.id }} is down.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.9.2. PVE VM/CT down
Proxmox VE guest {{ $labels.id }} is not running.

```yaml
# This alert triggers for all VMs and containers that are not running.
# You may want to filter by specific guests using the `id` label, or exclude
# intentionally stopped guests with additional label matchers.
- alert: PveVm/ctDown
  expr: pve_up{id=~"(qemu|lxc)/.*"} == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: PVE VM/CT down (instance {{ $labels.instance }})
    description: "Proxmox VE guest {{ $labels.id }} is not running.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.9.3. PVE high CPU usage
Proxmox VE CPU usage is above 90% on {{ $labels.id }}. Current value: {{ $value | printf "%.2f" }}%

```yaml
- alert: PveHighCpuUsage
  expr: pve_cpu_usage_ratio * 100 > 90
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: PVE high CPU usage (instance {{ $labels.instance }})
    description: "Proxmox VE CPU usage is above 90% on {{ $labels.id }}. Current value: {{ $value | printf \"%.2f\" }}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.9.4. PVE high memory usage
Proxmox VE memory usage is above 90% on {{ $labels.id }}. Current value: {{ $value | printf "%.2f" }}%

```yaml
- alert: PveHighMemoryUsage
  expr: pve_memory_usage_bytes / pve_memory_size_bytes * 100 > 90
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: PVE high memory usage (instance {{ $labels.instance }})
    description: "Proxmox VE memory usage is above 90% on {{ $labels.id }}. Current value: {{ $value | printf \"%.2f\" }}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.9.5. PVE storage filling up
Proxmox VE storage {{ $labels.id }} is above 80% used. Current value: {{ $value | printf "%.2f" }}%

```yaml
- alert: PveStorageFillingUp
  expr: pve_disk_usage_bytes{id=~"storage/.*"} / pve_disk_size_bytes{id=~"storage/.*"} * 100 > 80 and pve_disk_size_bytes{id=~"storage/.*"} > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: PVE storage filling up (instance {{ $labels.instance }})
    description: "Proxmox VE storage {{ $labels.id }} is above 80% used. Current value: {{ $value | printf \"%.2f\" }}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.9.6. PVE storage almost full
Proxmox VE storage {{ $labels.id }} is above 95% used. Current value: {{ $value | printf "%.2f" }}%

```yaml
- alert: PveStorageAlmostFull
  expr: pve_disk_usage_bytes{id=~"storage/.*"} / pve_disk_size_bytes{id=~"storage/.*"} * 100 > 95 and pve_disk_size_bytes{id=~"storage/.*"} > 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: PVE storage almost full (instance {{ $labels.instance }})
    description: "Proxmox VE storage {{ $labels.id }} is above 95% used. Current value: {{ $value | printf \"%.2f\" }}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.9.7. PVE guest not backed up
{{ $value }} Proxmox VE guest(s) are not covered by any backup job.

```yaml
- alert: PveGuestNotBackedUp
  expr: pve_not_backed_up_total > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: PVE guest not backed up (instance {{ $labels.instance }})
    description: "{{ $value }} Proxmox VE guest(s) are not covered by any backup job.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.9.8. PVE replication failed
Proxmox VE replication for {{ $labels.id }} has {{ $value }} failed sync(s).

```yaml
- alert: PveReplicationFailed
  expr: pve_replication_failed_syncs > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: PVE replication failed (instance {{ $labels.instance }})
    description: "Proxmox VE replication for {{ $labels.id }} has {{ $value }} failed sync(s).\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.9.9. PVE cluster not quorate
Proxmox VE cluster has lost quorum. [copy] # Loss of quorum means the cluster cannot make decisions about VM placement # and fencing. This requires immediate attention. - alert: PveClusterNotQuorate expr: pve_cluster_info{quorate="0"} == 1 for: 0m labels: severity: critical annotations: summary: PVE cluster not quorate (instance {{ $labels.instance }}) description: "Proxmox VE cluster has lost quorum.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 1.10. Netdata : Embedded exporter (9 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/netdata/embedded-exporter.yml
-
# 1.10.1. Netdata high cpu usage
Netdata high CPU usage (> 80%) [copy] # The cpu_percentage_average dimensions are gauges, so usage is 100 - idle; taking rate() of a gauge (or alerting on idle > 80) would be wrong. - alert: NetdataHighCpuUsage expr: (100 - netdata_cpu_cpu_percentage_average{dimension="idle"}) > 80 for: 5m labels: severity: warning annotations: summary: Netdata high cpu usage (instance {{ $labels.instance }}) description: "Netdata high CPU usage (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.10.2. Host CPU steal noisy neighbor
CPU steal is > 10%. A noisy neighbor is killing VM performance, or a spot instance may be out of CPU credits. [copy] # The steal dimension is already a percentage gauge, so it is compared directly (no rate() needed). - alert: HostCpuStealNoisyNeighbor expr: netdata_cpu_cpu_percentage_average{dimension="steal"} > 10 for: 5m labels: severity: warning annotations: summary: Host CPU steal noisy neighbor (instance {{ $labels.instance }}) description: "CPU steal is > 10%. A noisy neighbor is killing VM performance, or a spot instance may be out of CPU credits.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.10.3. Netdata high memory usage
Netdata high memory usage (> 80%) [copy] # Available RAM = free + cached, taken as a share of the total summed across all dimensions. - alert: NetdataHighMemoryUsage expr: 100 * sum without (dimension) (netdata_system_ram_MiB_average{dimension=~"free|cached"}) / sum without (dimension) (netdata_system_ram_MiB_average) < 20 for: 5m labels: severity: warning annotations: summary: Netdata high memory usage (instance {{ $labels.instance }}) description: "Netdata high memory usage (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.10.4. Netdata low disk space
Netdata low disk space (> 80%) [copy] # Available space = avail + cached dimensions, taken as a share of the total summed across all dimensions. - alert: NetdataLowDiskSpace expr: 100 * sum without (dimension) (netdata_disk_space_GB_average{dimension=~"avail|cached"}) / sum without (dimension) (netdata_disk_space_GB_average) < 20 for: 5m labels: severity: warning annotations: summary: Netdata low disk space (instance {{ $labels.instance }}) description: "Netdata low disk space (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
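The two Netdata capacity rules above both express "available as a percentage of total" and alert below 20%. A minimal Python sketch of that arithmetic, using dimension names from Netdata's system.ram chart and hypothetical sample values:

```python
# Sketch of the "available percentage" math behind the Netdata RAM/disk rules.
# Values are hypothetical MiB samples of netdata_system_ram_MiB_average per dimension.
ram = {"free": 512.0, "used": 6144.0, "cached": 1024.0, "buffers": 512.0}

def available_pct(samples: dict, avail_dims: tuple) -> float:
    """Percentage of the summed total that counts as available (e.g. free + cached)."""
    total = sum(samples.values())
    avail = sum(v for d, v in samples.items() if d in avail_dims)
    return 100.0 * avail / total

pct = available_pct(ram, ("free", "cached"))
# (512 + 1024) / 8192 * 100 = 18.75 -> below the 20% threshold, so the alert fires
assert pct < 20
```

The same shape applies to the disk rule, with `avail|cached` dimensions of `netdata_disk_space_GB_average`.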
-
# 1.10.5. Netdata predicted disk full
Netdata predicted disk full in 24 hours [copy] - alert: NetdataPredictedDiskFull expr: predict_linear(netdata_disk_space_GB_average{dimension=~"avail|cached"}[3h], 24 * 3600) < 0 for: 0m labels: severity: warning annotations: summary: Netdata predicted disk full (instance {{ $labels.instance }}) description: "Netdata predicted disk full in 24 hours\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
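`predict_linear()` fits a least-squares line over the range and extrapolates it forward. A rough Python equivalent of the rule's logic, under the simplifying assumption that extrapolation starts from the last sample rather than the evaluation timestamp (sample data is synthetic):

```python
# Least-squares extrapolation, roughly what predict_linear(v[3h], 24 * 3600) computes.
def predict_linear(samples: list, horizon_s: float) -> float:
    """samples: (unix_ts, value) pairs; extrapolate horizon_s seconds past the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + horizon_s)

# Hypothetical disk shrinking by 1 GB per hour with 10 GB left: empty well before 24h.
samples = [(t * 3600.0, 20.0 - t) for t in range(11)]
assert predict_linear(samples, 24 * 3600) < 0  # negative projection -> alert fires
```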
-
# 1.10.6. Netdata MD mismatch cnt unsynchronized blocks
RAID array has unsynchronized blocks [copy] - alert: NetdataMdMismatchCntUnsynchronizedBlocks expr: netdata_md_mismatch_cnt_unsynchronized_blocks_average > 1024 for: 2m labels: severity: warning annotations: summary: Netdata MD mismatch cnt unsynchronized blocks (instance {{ $labels.instance }}) description: "RAID array has unsynchronized blocks\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.10.7. Netdata disk reallocated sectors
Reallocated sectors on disk [copy] - alert: NetdataDiskReallocatedSectors expr: increase(netdata_smartd_log_reallocated_sectors_count_sectors_average[1m]) > 0 for: 0m labels: severity: info annotations: summary: Netdata disk reallocated sectors (instance {{ $labels.instance }}) description: "Reallocated sectors on disk\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.10.8. Netdata disk current pending sector
Disk current pending sector [copy] - alert: NetdataDiskCurrentPendingSector expr: netdata_smartd_log_current_pending_sector_count_sectors_average > 0 for: 0m labels: severity: warning annotations: summary: Netdata disk current pending sector (instance {{ $labels.instance }}) description: "Disk current pending sector\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.10.9. Netdata reported uncorrectable disk sectors
Reported uncorrectable disk sectors [copy] - alert: NetdataReportedUncorrectableDiskSectors expr: increase(netdata_smartd_log_offline_uncorrectable_sector_count_sectors_average[2m]) > 0 for: 0m labels: severity: warning annotations: summary: Netdata reported uncorrectable disk sectors (instance {{ $labels.instance }}) description: "Reported uncorrectable disk sectors\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 1.11. eBPF : cloudflare/ebpf_exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/ebpf/ebpf-exporter.yml
-
# 1.11.1. eBPF exporter program not attached
eBPF program {{ $labels.name }} failed to attach. The program is not collecting data. (instance {{ $labels.instance }}) [copy] # The exporter uses loose attachment: if a program fails to load (missing BTF, kernel incompatibility), it sets this metric to 0 and continues running. - alert: EbpfExporterProgramNotAttached expr: ebpf_exporter_ebpf_program_attached == 0 for: 5m labels: severity: warning annotations: summary: eBPF exporter program not attached (instance {{ $labels.instance }}) description: "eBPF program {{ $labels.name }} failed to attach. The program is not collecting data. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.11.2. eBPF exporter decoder errors
eBPF exporter is experiencing decoder errors for program {{ $labels.name }}. Kernel data is not being correctly transformed into labels. (instance {{ $labels.instance }}) [copy] - alert: EbpfExporterDecoderErrors expr: rate(ebpf_exporter_decoder_errors_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: eBPF exporter decoder errors (instance {{ $labels.instance }}) description: "eBPF exporter is experiencing decoder errors for program {{ $labels.name }}. Kernel data is not being correctly transformed into labels. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.11.3. eBPF exporter no enabled configs
eBPF exporter has no enabled configurations. No eBPF programs are being run. (instance {{ $labels.instance }}) [copy] - alert: EbpfExporterNoEnabledConfigs expr: ebpf_exporter_enabled_configs == 0 for: 5m labels: severity: warning annotations: summary: eBPF exporter no enabled configs (instance {{ $labels.instance }}) description: "eBPF exporter has no enabled configurations. No eBPF programs are being run. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 1.12. Process Exporter : ncabatoff/process-exporter (10 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/process-exporter/process-exporter.yml
-
# 1.12.1. Process exporter group down
No processes found for group {{ $labels.groupname }}. The service may have stopped. (instance {{ $labels.instance }}) [copy] - alert: ProcessExporterGroupDown expr: namedprocess_namegroup_num_procs == 0 for: 2m labels: severity: critical annotations: summary: Process exporter group down (instance {{ $labels.instance }}) description: "No processes found for group {{ $labels.groupname }}. The service may have stopped. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.12.2. Process exporter high memory usage
Process group {{ $labels.groupname }} is using {{ $value | humanize }}B of resident memory. (instance {{ $labels.instance }}) [copy] # Threshold of 4GB is arbitrary and depends on the process being monitored. Adjust per group. - alert: ProcessExporterHighMemoryUsage expr: namedprocess_namegroup_memory_bytes{memtype="resident"} > 4e+09 for: 5m labels: severity: warning annotations: summary: Process exporter high memory usage (instance {{ $labels.instance }}) description: "Process group {{ $labels.groupname }} is using {{ $value | humanize }}B of resident memory. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.12.3. Process exporter high CPU usage
Process group {{ $labels.groupname }} is using {{ $value }}% CPU (core-equivalent). (instance {{ $labels.instance }}) [copy] # Value is core-equivalent %: 100% = 1 full core, 200% = 2 cores, etc. Threshold of 80% is per-core. Adjust based on expected workload. - alert: ProcessExporterHighCpuUsage expr: rate(namedprocess_namegroup_cpu_seconds_total[5m]) * 100 > 80 for: 5m labels: severity: warning annotations: summary: Process exporter high CPU usage (instance {{ $labels.instance }}) description: "Process group {{ $labels.groupname }} is using {{ $value }}% CPU (core-equivalent). (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
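The core-equivalent convention noted in the rule above is worth a worked example: `namedprocess_namegroup_cpu_seconds_total` is a cumulative counter of CPU seconds, so its per-second rate times 100 is the percentage of one core in use. Hypothetical counter samples:

```python
# rate(namedprocess_namegroup_cpu_seconds_total[5m]) * 100, sketched by hand.
# Two counter samples 300s apart; values are cumulative CPU seconds (hypothetical).
t1, v1 = 1000.0, 5000.0
t2, v2 = 1300.0, 5450.0   # consumed 450 CPU-seconds over 300 wall-clock seconds

cpu_pct = (v2 - v1) / (t2 - t1) * 100  # 150% = 1.5 cores busy on average
assert cpu_pct == 150.0
assert cpu_pct > 80  # above the rule's per-core-equivalent threshold
```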
-
# 1.12.4. Process exporter high file descriptor usage
Process group {{ $labels.groupname }} is using more than 80% of its file descriptor limit. (instance {{ $labels.instance }}) [copy] - alert: ProcessExporterHighFileDescriptorUsage expr: namedprocess_namegroup_worst_fd_ratio > 0.8 for: 5m labels: severity: warning annotations: summary: Process exporter high file descriptor usage (instance {{ $labels.instance }}) description: "Process group {{ $labels.groupname }} is using more than 80% of its file descriptor limit. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.12.5. Process exporter file descriptors exhausted
Process group {{ $labels.groupname }} has nearly exhausted its file descriptor limit. (instance {{ $labels.instance }}) [copy] - alert: ProcessExporterFileDescriptorsExhausted expr: namedprocess_namegroup_worst_fd_ratio > 0.95 for: 2m labels: severity: critical annotations: summary: Process exporter file descriptors exhausted (instance {{ $labels.instance }}) description: "Process group {{ $labels.groupname }} has nearly exhausted its file descriptor limit. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.12.6. Process exporter high swap usage
Process group {{ $labels.groupname }} is using {{ $value | humanize }}B of swap. (instance {{ $labels.instance }}) [copy] # Threshold of 512MB is arbitrary. Adjust per group and environment. - alert: ProcessExporterHighSwapUsage expr: namedprocess_namegroup_memory_bytes{memtype="swapped"} > 512e+06 for: 5m labels: severity: warning annotations: summary: Process exporter high swap usage (instance {{ $labels.instance }}) description: "Process group {{ $labels.groupname }} is using {{ $value | humanize }}B of swap. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.12.7. Process exporter zombie processes
Process group {{ $labels.groupname }} has {{ $value }} zombie processes. (instance {{ $labels.instance }}) [copy] - alert: ProcessExporterZombieProcesses expr: namedprocess_namegroup_states{state="Zombie"} > 0 for: 5m labels: severity: warning annotations: summary: Process exporter zombie processes (instance {{ $labels.instance }}) description: "Process group {{ $labels.groupname }} has {{ $value }} zombie processes. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.12.8. Process exporter high context switching
Process group {{ $labels.groupname }} has a high rate of context switches ({{ $value }}/s). (instance {{ $labels.instance }}) [copy] # Threshold of 10000 switches/s is a rough default. Adjust based on the workload profile. - alert: ProcessExporterHighContextSwitching expr: rate(namedprocess_namegroup_context_switches_total[5m]) > 10000 for: 5m labels: severity: warning annotations: summary: Process exporter high context switching (instance {{ $labels.instance }}) description: "Process group {{ $labels.groupname }} has a high rate of context switches ({{ $value }}/s). (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.12.9. Process exporter high disk write IO
Process group {{ $labels.groupname }} is performing {{ $value | humanize }}B/s of disk writes. (instance {{ $labels.instance }}) [copy] # Threshold of 100MB/s is arbitrary. Adjust per group. - alert: ProcessExporterHighDiskWriteIo expr: rate(namedprocess_namegroup_write_bytes_total[5m]) > 100e+06 for: 5m labels: severity: warning annotations: summary: Process exporter high disk write IO (instance {{ $labels.instance }}) description: "Process group {{ $labels.groupname }} is performing {{ $value | humanize }}B/s of disk writes. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.12.10. Process exporter process restarting
Process group {{ $labels.groupname }} has restarted (oldest process start time changed). (instance {{ $labels.instance }}) [copy] # Detects restarts by watching for changes in the oldest process start time within the group. - alert: ProcessExporterProcessRestarting expr: changes(namedprocess_namegroup_oldest_start_time_seconds[5m]) > 0 and namedprocess_namegroup_num_procs > 0 for: 0m labels: severity: info annotations: summary: Process exporter process restarting (instance {{ $labels.instance }}) description: "Process group {{ $labels.groupname }} has restarted (oldest process start time changed). (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
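`changes()` counts how often consecutive samples in the window differ; combined with `num_procs > 0` it distinguishes a restart from a plain stop. A Python emulation of that condition (the sampled start times are hypothetical):

```python
# changes(namedprocess_namegroup_oldest_start_time_seconds[5m]) emulated over samples.
def changes(values: list) -> int:
    """Number of times consecutive samples differ, as PromQL changes() counts them."""
    return sum(1 for a, b in zip(values, values[1:]) if a != b)

# The oldest start time jumps when the oldest process is replaced (group restarted).
start_times = [1700000000.0, 1700000000.0, 1700000251.0, 1700000251.0]
num_procs = 1

assert changes(start_times) == 1
assert changes(start_times) > 0 and num_procs > 0  # restart alert condition met
```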
-
-
# 1.13. Systemd : prometheus-community/systemd_exporter (7 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/systemd/systemd-exporter.yml
-
# 1.13.1. Systemd unit failed
Systemd unit {{ $labels.name }} has entered failed state. (instance {{ $labels.instance }}) [copy] - alert: SystemdUnitFailed expr: systemd_unit_state{state="failed"} == 1 for: 5m labels: severity: warning annotations: summary: Systemd unit failed (instance {{ $labels.instance }}) description: "Systemd unit {{ $labels.name }} has entered failed state. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.13.2. Systemd unit inactive
Systemd unit {{ $labels.name }} is inactive. (instance {{ $labels.instance }}) [copy] # Many units are legitimately inactive. You must adjust the name=~ filter to match your critical services. - alert: SystemdUnitInactive expr: systemd_unit_state{state="inactive", type="service", name=~"your-critical-service.+"} == 1 for: 5m labels: severity: warning annotations: summary: Systemd unit inactive (instance {{ $labels.instance }}) description: "Systemd unit {{ $labels.name }} is inactive. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.13.3. Systemd service crash looping
Systemd service {{ $labels.name }} has restarted {{ $value }} times in the last hour. (instance {{ $labels.instance }}) [copy] - alert: SystemdServiceCrashLooping expr: increase(systemd_service_restart_total[1h]) > 5 for: 5m labels: severity: critical annotations: summary: Systemd service crash looping (instance {{ $labels.instance }}) description: "Systemd service {{ $labels.name }} has restarted {{ $value }} times in the last hour. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.13.4. Systemd unit tasks near limit
Systemd unit {{ $labels.name }} is using {{ $value | humanizePercentage }} of its task limit. (instance {{ $labels.instance }}) [copy] - alert: SystemdUnitTasksNearLimit expr: systemd_unit_tasks_current / systemd_unit_tasks_max > 0.9 and systemd_unit_tasks_max > 0 for: 5m labels: severity: warning annotations: summary: Systemd unit tasks near limit (instance {{ $labels.instance }}) description: "Systemd unit {{ $labels.name }} is using {{ $value | humanizePercentage }} of its task limit. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.13.5. Systemd socket refused connections
Systemd socket {{ $labels.name }} is refusing connections. (instance {{ $labels.instance }}) [copy] - alert: SystemdSocketRefusedConnections expr: increase(systemd_socket_refused_connections_total[5m]) > 0 for: 0m labels: severity: warning annotations: summary: Systemd socket refused connections (instance {{ $labels.instance }}) description: "Systemd socket {{ $labels.name }} is refusing connections. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.13.6. Systemd socket high connections
Systemd socket {{ $labels.name }} has {{ $value }} active connections. (instance {{ $labels.instance }}) [copy] # Threshold of 100 connections is arbitrary. Adjust to your workload. - alert: SystemdSocketHighConnections expr: systemd_socket_current_connections > 100 for: 0m labels: severity: warning annotations: summary: Systemd socket high connections (instance {{ $labels.instance }}) description: "Systemd socket {{ $labels.name }} has {{ $value }} active connections. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.13.7. Systemd timer missed trigger
Systemd timer {{ $labels.name }} has not triggered for over 24 hours. (instance {{ $labels.instance }}) [copy] # Triggers if timer hasn't fired in 24 hours. Adjust threshold per timer schedule. - alert: SystemdTimerMissedTrigger expr: (time() - systemd_timer_last_trigger_seconds) / 3600 > 24 and systemd_timer_last_trigger_seconds > 0 for: 5m labels: severity: warning annotations: summary: Systemd timer missed trigger (instance {{ $labels.instance }}) description: "Systemd timer {{ $labels.name }} has not triggered for over 24 hours. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.1. MySQL : prometheus/mysqld_exporter (14 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/mysql/mysqld-exporter.yml
-
# 2.1.1. MySQL down
MySQL instance is down on {{ $labels.instance }} [copy] # 1m delay allows a restart without triggering an alert. - alert: MysqlDown expr: mysql_up == 0 for: 1m labels: severity: critical annotations: summary: MySQL down (instance {{ $labels.instance }}) description: "MySQL instance is down on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.2. MySQL too many connections (> 80%)
More than 80% of MySQL connections are in use on {{ $labels.instance }} [copy] # Alert names must be valid metric names on pre-3.x Prometheus, so the "(> 80%)" qualifier lives in the summary, not the alert name. - alert: MysqlTooManyConnections expr: max_over_time(mysql_global_status_threads_connected[1m]) / mysql_global_variables_max_connections * 100 > 80 for: 2m labels: severity: warning annotations: summary: MySQL too many connections (> 80%) (instance {{ $labels.instance }}) description: "More than 80% of MySQL connections are in use on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
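`max_over_time(...[1m])` keeps the worst sample in the window, so a short connection spike still counts against the limit. The utilization math, sketched with hypothetical scrape samples and MySQL's default `max_connections` of 151:

```python
# max_over_time(threads_connected[1m]) / max_connections * 100 > 80, sketched.
threads_connected_samples = [120.0, 131.0, 128.0]  # scrapes within the 1m window
max_connections = 151.0                            # MySQL default

utilization_pct = max(threads_connected_samples) / max_connections * 100
assert utilization_pct > 80  # 131 / 151 is roughly 86.8%, so the alert fires
```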
-
# 2.1.3. MySQL high prepared statements utilization (> 80%)
High utilization of prepared statements (> 80%) on {{ $labels.instance }} [copy] - alert: MysqlHighPreparedStatementsUtilization expr: max_over_time(mysql_global_status_prepared_stmt_count[1m]) / mysql_global_variables_max_prepared_stmt_count * 100 > 80 for: 2m labels: severity: warning annotations: summary: MySQL high prepared statements utilization (> 80%) (instance {{ $labels.instance }}) description: "High utilization of prepared statements (> 80%) on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.4. MySQL high threads running
More than 60% of MySQL connections are in running state on {{ $labels.instance }} [copy] - alert: MysqlHighThreadsRunning expr: max_over_time(mysql_global_status_threads_running[1m]) / mysql_global_variables_max_connections * 100 > 60 for: 2m labels: severity: warning annotations: summary: MySQL high threads running (instance {{ $labels.instance }}) description: "More than 60% of MySQL connections are in running state on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.5. MySQL Slave IO thread not running
MySQL Slave IO thread not running on {{ $labels.instance }} [copy] # 1m delay allows a restart without triggering an alert. - alert: MysqlSlaveIoThreadNotRunning expr: ( mysql_slave_status_slave_io_running and ON (instance) mysql_slave_status_master_server_id > 0 ) == 0 for: 1m labels: severity: critical annotations: summary: MySQL Slave IO thread not running (instance {{ $labels.instance }}) description: "MySQL Slave IO thread not running on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.6. MySQL Slave SQL thread not running
MySQL Slave SQL thread not running on {{ $labels.instance }} [copy] # 1m delay allows a restart without triggering an alert. - alert: MysqlSlaveSqlThreadNotRunning expr: ( mysql_slave_status_slave_sql_running and ON (instance) mysql_slave_status_master_server_id > 0) == 0 for: 1m labels: severity: critical annotations: summary: MySQL Slave SQL thread not running (instance {{ $labels.instance }}) description: "MySQL Slave SQL thread not running on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.7. MySQL Slave replication lag
MySQL replication lag on {{ $labels.instance }} [copy] - alert: MysqlSlaveReplicationLag expr: ( (mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay) and ON (instance) mysql_slave_status_master_server_id > 0 ) > 30 for: 1m labels: severity: critical annotations: summary: MySQL Slave replication lag (instance {{ $labels.instance }}) description: "MySQL replication lag on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.8. MySQL slow queries
MySQL server has new slow queries. [copy] - alert: MysqlSlowQueries expr: increase(mysql_global_status_slow_queries[1m]) > 0 for: 2m labels: severity: warning annotations: summary: MySQL slow queries (instance {{ $labels.instance }}) description: "MySQL server has new slow queries.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.9. MySQL InnoDB log waits
MySQL innodb log writes stalling [copy] - alert: MysqlInnodbLogWaits expr: rate(mysql_global_status_innodb_log_waits[15m]) > 10 for: 0m labels: severity: warning annotations: summary: MySQL InnoDB log waits (instance {{ $labels.instance }}) description: "MySQL innodb log writes stalling\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.10. MySQL restarted
MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}. [copy] - alert: MysqlRestarted expr: mysql_global_status_uptime < 60 for: 0m labels: severity: info annotations: summary: MySQL restarted (instance {{ $labels.instance }}) description: "MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.11. MySQL High QPS
MySQL is overloaded with unusually high QPS (> 10k). [copy] - alert: MysqlHighQps expr: irate(mysql_global_status_questions[1m]) > 10000 for: 2m labels: severity: info annotations: summary: MySQL High QPS (instance {{ $labels.instance }}) description: "MySQL is overloaded with unusually high QPS (> 10k).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.12. MySQL too many open files
MySQL has too many open files; consider increasing the open_files_limit variable on {{ $labels.instance }}. [copy] - alert: MysqlTooManyOpenFiles expr: mysql_global_status_innodb_num_open_files / mysql_global_variables_open_files_limit * 100 > 75 for: 2m labels: severity: warning annotations: summary: MySQL too many open files (instance {{ $labels.instance }}) description: "MySQL has too many open files; consider increasing the open_files_limit variable on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.13. MySQL InnoDB Force Recovery is enabled
MySQL InnoDB force recovery is enabled on {{ $labels.instance }} [copy] - alert: MysqlInnodbForceRecoveryIsEnabled expr: mysql_global_variables_innodb_force_recovery != 0 for: 2m labels: severity: warning annotations: summary: MySQL InnoDB Force Recovery is enabled (instance {{ $labels.instance }}) description: "MySQL InnoDB force recovery is enabled on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.14. MySQL InnoDB history_len too long
MySQL history_len (undo log) too long on {{ $labels.instance }} [copy] - alert: MysqlInnodbHistory_lenTooLong expr: mysql_info_schema_innodb_metrics_transaction_trx_rseg_history_len > 50000 for: 2m labels: severity: warning annotations: summary: MySQL InnoDB history_len too long (instance {{ $labels.instance }}) description: "MySQL history_len (undo log) too long on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.2. PostgreSQL : prometheus-community/postgres_exporter (20 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/postgresql/postgres-exporter.yml
-
# 2.2.1. Postgresql down
Postgresql instance is down [copy] # 1m delay allows a restart without triggering an alert. - alert: PostgresqlDown expr: pg_up == 0 for: 1m labels: severity: critical annotations: summary: Postgresql down (instance {{ $labels.instance }}) description: "Postgresql instance is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.2. Postgresql restarted
Postgresql restarted [copy] - alert: PostgresqlRestarted expr: time() - pg_postmaster_start_time_seconds < 60 for: 0m labels: severity: critical annotations: summary: Postgresql restarted (instance {{ $labels.instance }}) description: "Postgresql restarted\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.3. Postgresql exporter error
Postgresql exporter is showing errors. A query may be buggy in query.yaml [copy] - alert: PostgresqlExporterError expr: pg_exporter_last_scrape_error > 0 for: 0m labels: severity: critical annotations: summary: Postgresql exporter error (instance {{ $labels.instance }}) description: "Postgresql exporter is showing errors. A query may be buggy in query.yaml\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.4. Postgresql table not auto vacuumed
Table {{ $labels.relname }} has not been auto vacuumed for 10 days [copy] - alert: PostgresqlTableNotAutoVacuumed expr: ((pg_stat_user_tables_n_tup_del + pg_stat_user_tables_n_tup_upd + pg_stat_user_tables_n_tup_hot_upd) > pg_settings_autovacuum_vacuum_threshold) and (time() - pg_stat_user_tables_last_autovacuum) > 60 * 60 * 24 * 10 for: 0m labels: severity: warning annotations: summary: Postgresql table not auto vacuumed (instance {{ $labels.instance }}) description: "Table {{ $labels.relname }} has not been auto vacuumed for 10 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.5. Postgresql table not auto analyzed
Table {{ $labels.relname }} has not been auto analyzed for 10 days [copy] - alert: PostgresqlTableNotAutoAnalyzed expr: ((pg_stat_user_tables_n_tup_del + pg_stat_user_tables_n_tup_upd + pg_stat_user_tables_n_tup_hot_upd) > pg_settings_autovacuum_analyze_threshold) and (time() - pg_stat_user_tables_last_autoanalyze) > 24 * 60 * 60 * 10 for: 0m labels: severity: warning annotations: summary: Postgresql table not auto analyzed (instance {{ $labels.instance }}) description: "Table {{ $labels.relname }} has not been auto analyzed for 10 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.6. Postgresql too many connections
PostgreSQL instance has too many connections (> 80%). [copy] - alert: PostgresqlTooManyConnections expr: sum by (instance, job, server) (pg_stat_activity_count) > min by (instance, job, server) (pg_settings_max_connections * 0.8) for: 2m labels: severity: warning annotations: summary: Postgresql too many connections (instance {{ $labels.instance }}) description: "PostgreSQL instance has too many connections (> 80%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.7. Postgresql not enough connections
PostgreSQL instance has too few connections (< 5) [copy] - alert: PostgresqlNotEnoughConnections expr: sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) < 5 for: 2m labels: severity: critical annotations: summary: Postgresql not enough connections (instance {{ $labels.instance }}) description: "PostgreSQL instance has too few connections (< 5)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.8. Postgresql dead locks
PostgreSQL has dead-locks [copy] - alert: PostgresqlDeadLocks expr: increase(pg_stat_database_deadlocks{datname!~"template.*|postgres"}[1m]) > 5 for: 0m labels: severity: warning annotations: summary: Postgresql dead locks (instance {{ $labels.instance }}) description: "PostgreSQL has dead-locks\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.9. Postgresql high rollback rate
The ratio of aborted to committed transactions is > 2% [copy] - alert: PostgresqlHighRollbackRate expr: sum by (namespace,datname) ((rate(pg_stat_database_xact_rollback{datname!~"template.*|postgres",datid!="0"}[3m])) / ((rate(pg_stat_database_xact_rollback{datname!~"template.*|postgres",datid!="0"}[3m])) + (rate(pg_stat_database_xact_commit{datname!~"template.*|postgres",datid!="0"}[3m])))) > 0.02 for: 0m labels: severity: warning annotations: summary: Postgresql high rollback rate (instance {{ $labels.instance }}) description: "The ratio of aborted to committed transactions is > 2%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
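The rollback-rate expression reduces to abort ratio = rollbacks / (rollbacks + commits), computed from 3-minute rates. A sketch with hypothetical per-second rates:

```python
# rate(xact_rollback[3m]) / (rate(xact_rollback[3m]) + rate(xact_commit[3m])) > 0.02
rollback_rate = 1.5    # rollbacks per second over the 3m window (hypothetical)
commit_rate = 48.5     # commits per second over the same window

abort_ratio = rollback_rate / (rollback_rate + commit_rate)
assert abs(abort_ratio - 0.03) < 1e-12  # 3% of transactions aborted
assert abort_ratio > 0.02               # above the 2% threshold, the alert fires
```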
-
# 2.2.10. Postgresql commit rate low
Postgresql seems to be processing very few transactions [copy] - alert: PostgresqlCommitRateLow expr: increase(pg_stat_database_xact_commit{datname!~"template.*|postgres",datid!="0"}[5m]) < 5 for: 2m labels: severity: critical annotations: summary: Postgresql commit rate low (instance {{ $labels.instance }}) description: "Postgresql seems to be processing very few transactions\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.11. Postgresql low XID consumption
Postgresql seems to be consuming transaction IDs very slowly [copy] - alert: PostgresqlLowXidConsumption expr: rate(pg_txid_current[1m]) < 5 for: 2m labels: severity: warning annotations: summary: Postgresql low XID consumption (instance {{ $labels.instance }}) description: "Postgresql seems to be consuming transaction IDs very slowly\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.12. Postgresql unused replication slot
Unused Replication Slots [copy] - alert: PostgresqlUnusedReplicationSlot expr: (pg_replication_slots_active == 0) and (pg_replication_is_replica == 0) for: 1m labels: severity: warning annotations: summary: Postgresql unused replication slot (instance {{ $labels.instance }}) description: "Unused Replication Slots\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.13. Postgresql too many dead tuples
PostgreSQL has too many dead tuples (>= 10% of tuples) [copy] - alert: PostgresqlTooManyDeadTuples expr: ((pg_stat_user_tables_n_dead_tup > 10000) / (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup)) >= 0.1 for: 2m labels: severity: warning annotations: summary: Postgresql too many dead tuples (instance {{ $labels.instance }}) description: "PostgreSQL has too many dead tuples (>= 10% of tuples)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.14. Postgresql configuration changed
Postgres Database configuration change has occurred [copy] - alert: PostgresqlConfigurationChanged expr: {__name__=~"pg_settings_.*",__name__!="pg_settings_transaction_read_only"} != ON(__name__, instance) {__name__=~"pg_settings_.*",__name__!="pg_settings_transaction_read_only"} OFFSET 5m for: 0m labels: severity: info annotations: summary: Postgresql configuration changed (instance {{ $labels.instance }}) description: "Postgres Database configuration change has occurred\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
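The `OFFSET 5m` trick compares each `pg_settings_*` series against its own value five minutes earlier; any inequality means a setting changed. Emulated with snapshot dicts (setting names and values are hypothetical):

```python
# {__name__=~"pg_settings_.*"} != ON(__name__, instance) ... OFFSET 5m, emulated.
now = {"pg_settings_work_mem_bytes": 8388608.0, "pg_settings_max_connections": 200.0}
five_min_ago = {"pg_settings_work_mem_bytes": 4194304.0, "pg_settings_max_connections": 200.0}

# A setting absent from the old snapshot also counts as changed here.
changed = {name: v for name, v in now.items() if five_min_ago.get(name) != v}
assert changed == {"pg_settings_work_mem_bytes": 8388608.0}  # only work_mem changed
```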
-
# 2.2.15. Postgresql SSL compression active
Database allows connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`. [copy] - alert: PostgresqlSslCompressionActive expr: sum by (instance) (pg_stat_ssl_compression) > 0 for: 0m labels: severity: warning annotations: summary: Postgresql SSL compression active (instance {{ $labels.instance }}) description: "Database allows connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.16. Postgresql too many locks acquired
Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction. [copy] - alert: PostgresqlTooManyLocksAcquired expr: ((sum by (instance) (pg_locks_count)) / (pg_settings_max_locks_per_transaction * pg_settings_max_connections)) > 0.20 for: 2m labels: severity: critical annotations: summary: Postgresql too many locks acquired (instance {{ $labels.instance }}) description: "Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
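The denominator approximates the server's lock-slot ceiling, which Postgres sizes from `max_locks_per_transaction * max_connections`. As a sketch (the recording-rule name is hypothetical), the ratio can be precomputed so dashboards and the alert share one expression:

```yaml
# Sketch (hypothetical rule name): precompute the lock saturation ratio.
# With the Postgres defaults max_locks_per_transaction=64 and
# max_connections=100, the ceiling is 64 * 100 = 6400 slots, so the
# alert's 0.20 threshold fires above roughly 1280 held locks.
groups:
  - name: postgresql-locks
    rules:
      - record: postgresql:lock_saturation:ratio
        expr: sum by (instance) (pg_locks_count) / (pg_settings_max_locks_per_transaction * pg_settings_max_connections)
```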
-
# 2.2.17. Postgresql bloat index high (> 80%)
The index {{ $labels.idxname }} is bloated. You should execute `REINDEX INDEX CONCURRENTLY {{ $labels.idxname }};` [copy] # See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737 - alert: PostgresqlBloatIndexHigh(>80%) expr: pg_bloat_btree_bloat_pct > 80 and on (idxname) (pg_bloat_btree_real_size > 100000000) for: 1h labels: severity: warning annotations: summary: Postgresql bloat index high (> 80%) (instance {{ $labels.instance }}) description: "The index {{ $labels.idxname }} is bloated. You should execute `REINDEX INDEX CONCURRENTLY {{ $labels.idxname }};`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.18. Postgresql bloat table high (> 80%)
The table {{ $labels.relname }} is bloated. You should execute `VACUUM {{ $labels.relname }};` [copy] # See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737 - alert: PostgresqlBloatTableHigh(>80%) expr: pg_bloat_table_bloat_pct > 80 and on (relname) (pg_bloat_table_real_size > 200000000) for: 1h labels: severity: warning annotations: summary: Postgresql bloat table high (> 80%) (instance {{ $labels.instance }}) description: "The table {{ $labels.relname }} is bloated. You should execute `VACUUM {{ $labels.relname }};`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.19. Postgresql invalid index
The table {{ $labels.relname }} has an invalid index: {{ $labels.indexrelname }}. You should execute `DROP INDEX {{ $labels.indexrelname }};` [copy] # See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737 - alert: PostgresqlInvalidIndex expr: pg_general_index_info_pg_relation_size{indexrelname=~".*ccnew.*"} for: 6h labels: severity: warning annotations: summary: Postgresql invalid index (instance {{ $labels.instance }}) description: "The table {{ $labels.relname }} has an invalid index: {{ $labels.indexrelname }}. You should execute `DROP INDEX {{ $labels.indexrelname }};`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.20. Postgresql replication lag
The PostgreSQL replication lag is high (> 5s) [copy] - alert: PostgresqlReplicationLag expr: pg_replication_lag_seconds > 5 for: 30s labels: severity: warning annotations: summary: Postgresql replication lag (instance {{ $labels.instance }}) description: "The PostgreSQL replication lag is high (> 5s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.3. SQL Server : Ozarklake/prometheus-mssql-exporter (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/sql-server/ozarklake-mssql-exporter.yml
-
# 2.3.1. SQL Server down
SQL Server instance is down [copy] # 1m delay allows a restart without triggering an alert. - alert: SqlServerDown expr: mssql_up == 0 for: 1m labels: severity: critical annotations: summary: SQL Server down (instance {{ $labels.instance }}) description: "SQL Server instance is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.3.2. SQL Server deadlock
SQL Server {{ $labels.instance }} is experiencing deadlocks ({{ $value }}/s) [copy] - alert: SqlServerDeadlock expr: mssql_deadlocks > 5 for: 1m labels: severity: warning annotations: summary: SQL Server deadlock (instance {{ $labels.instance }}) description: "SQL Server {{ $labels.instance }} is experiencing deadlocks ({{ $value }}/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.4. Oracle Database : iamseth/oracledb_exporter (8 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/oracle-database/iamseth-oracledb-exporter.yml
-
# 2.4.1. Oracle DB down
Oracle Database instance is down on {{ $labels.instance }} [copy] # 1m delay allows a restart without triggering an alert. - alert: OracleDbDown expr: oracledb_up == 0 for: 1m labels: severity: critical annotations: summary: Oracle DB down (instance {{ $labels.instance }}) description: "Oracle Database instance is down on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.4.2. Oracle DB sessions reaching limit (> 85%)
Oracle Database session utilization is above 85% on {{ $labels.instance }} (current value: {{ $value }}%) [copy] # Threshold is workload-dependent. Adjust 85% to suit your environment. - alert: OracleDbSessionsReachingLimit(>85%) expr: oracledb_resource_current_utilization{resource_name="sessions"} / oracledb_resource_limit_value{resource_name="sessions"} * 100 > 85 and oracledb_resource_limit_value{resource_name="sessions"} > 0 for: 5m labels: severity: warning annotations: summary: Oracle DB sessions reaching limit (> 85%) (instance {{ $labels.instance }}) description: "Oracle Database session utilization is above 85% on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.4.3. Oracle DB processes reaching limit (> 85%)
Oracle Database process utilization is above 85% on {{ $labels.instance }} (current value: {{ $value }}%) [copy] # Threshold is workload-dependent. Adjust 85% to suit your environment. - alert: OracleDbProcessesReachingLimit(>85%) expr: oracledb_resource_current_utilization{resource_name="processes"} / oracledb_resource_limit_value{resource_name="processes"} * 100 > 85 and oracledb_resource_limit_value{resource_name="processes"} > 0 for: 5m labels: severity: warning annotations: summary: Oracle DB processes reaching limit (> 85%) (instance {{ $labels.instance }}) description: "Oracle Database process utilization is above 85% on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.4.4. Oracle DB tablespace reaching capacity (> 85%)
Oracle Database tablespace {{ $labels.tablespace }} is above 85% usage on {{ $labels.instance }} (current value: {{ $value }}%) [copy] - alert: OracleDbTablespaceReachingCapacity(>85%) expr: oracledb_tablespace_used_percent > 85 for: 5m labels: severity: warning annotations: summary: Oracle DB tablespace reaching capacity (> 85%) (instance {{ $labels.instance }}) description: "Oracle Database tablespace {{ $labels.tablespace }} is above 85% usage on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.4.5. Oracle DB tablespace full (> 95%)
Oracle Database tablespace {{ $labels.tablespace }} is critically full on {{ $labels.instance }} (current value: {{ $value }}%) [copy] - alert: OracleDbTablespaceFull(>95%) expr: oracledb_tablespace_used_percent > 95 for: 5m labels: severity: critical annotations: summary: Oracle DB tablespace full (> 95%) (instance {{ $labels.instance }}) description: "Oracle Database tablespace {{ $labels.tablespace }} is critically full on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.4.6. Oracle DB high user rollbacks
Oracle Database on {{ $labels.instance }} has a high rollback rate ({{ $value }}% of transactions are rolled back) [copy] # A high rollback rate (>20%) often indicates application-level issues such as deadlocks, constraint violations, or poorly designed transactions. - alert: OracleDbHighUserRollbacks expr: rate(oracledb_activity_user_rollbacks[5m]) / (rate(oracledb_activity_user_commits[5m]) + rate(oracledb_activity_user_rollbacks[5m])) * 100 > 20 and (rate(oracledb_activity_user_commits[5m]) + rate(oracledb_activity_user_rollbacks[5m])) > 0 for: 5m labels: severity: warning annotations: summary: Oracle DB high user rollbacks (instance {{ $labels.instance }}) description: "Oracle Database on {{ $labels.instance }} has a high rollback rate ({{ $value }}% of transactions are rolled back)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
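The rollback percentage is rollbacks divided by total transactions: for example, 30 rollbacks/s against 70 commits/s gives 30 / (30 + 70) * 100 = 30%, which crosses the 20% threshold. As a sketch (the recording-rule name is hypothetical), the same formula as a recording rule:

```yaml
# Sketch (hypothetical rule name): rollback percentage over 5m.
# Example: 30 rollbacks/s and 70 commits/s gives 30 / (30 + 70) * 100 = 30%,
# above the alert's 20% threshold.
groups:
  - name: oracledb-rollbacks
    rules:
      - record: oracledb:user_rollback_pct:rate5m
        expr: rate(oracledb_activity_user_rollbacks[5m]) / (rate(oracledb_activity_user_commits[5m]) + rate(oracledb_activity_user_rollbacks[5m])) * 100
```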
-
# 2.4.7. Oracle DB too many active sessions
Oracle Database on {{ $labels.instance }} has too many active user sessions (current value: {{ $value }}) [copy] # Threshold is highly workload-dependent. Adjust 200 to suit your environment. - alert: OracleDbTooManyActiveSessions expr: oracledb_sessions_activity{status="ACTIVE", type="USER"} > 200 for: 5m labels: severity: warning annotations: summary: Oracle DB too many active sessions (instance {{ $labels.instance }}) description: "Oracle Database on {{ $labels.instance }} has too many active user sessions (current value: {{ $value }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.4.8. Oracle DB high wait time (user I/O)
Oracle Database on {{ $labels.instance }} is experiencing high user I/O wait time [copy] # High user I/O wait time indicates storage performance issues (slow disks, SAN latency, etc.). # The metric is in centiseconds per second. Threshold 300 means 3 seconds of I/O wait per second of wall time. - alert: OracleDbHighWaitTime(userI/o) expr: rate(oracledb_wait_time_user_io[5m]) > 300 for: 5m labels: severity: warning annotations: summary: Oracle DB high wait time (user I/O) (instance {{ $labels.instance }}) description: "Oracle Database on {{ $labels.instance }} is experiencing high user I/O wait time\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.5. Patroni : Embedded exporter (Patroni >= 2.1.0) (1 rule) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/patroni/embedded-exporter-patroni.yml
-
# 2.5.1. Patroni has no Leader
No leader node (neither a primary nor a standby leader) can be found in cluster {{ $labels.scope }} [copy] # 1m delay allows a restart without triggering an alert. - alert: PatroniHasNoLeader expr: (max by (scope) (patroni_primary) < 1) and (max by (scope) (patroni_standby_leader) < 1) for: 1m labels: severity: critical annotations: summary: Patroni has no Leader (instance {{ $labels.instance }}) description: "No leader node (neither a primary nor a standby leader) can be found in cluster {{ $labels.scope }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.6. PGBouncer : spreaker/prometheus-pgbouncer-exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/pgbouncer/spreaker-pgbouncer-exporter.yml
-
# 2.6.1. PGBouncer active connections
PGBouncer pools are filling up [copy] - alert: PgbouncerActiveConnections expr: pgbouncer_pools_server_active_connections > 200 for: 2m labels: severity: warning annotations: summary: PGBouncer active connections (instance {{ $labels.instance }}) description: "PGBouncer pools are filling up\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.6.2. PGBouncer errors
PGBouncer is logging errors. This may be due to a server restart or an admin typing commands at the pgbouncer console. [copy] - alert: PgbouncerErrors expr: increase(pgbouncer_errors_count{errmsg!="server conn crashed?"}[1m]) > 10 for: 0m labels: severity: warning annotations: summary: PGBouncer errors (instance {{ $labels.instance }}) description: "PGBouncer is logging errors. This may be due to a server restart or an admin typing commands at the pgbouncer console.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.6.3. PGBouncer max connections
The number of PGBouncer client connections has reached max_client_conn. [copy] - alert: PgbouncerMaxConnections expr: increase(pgbouncer_errors_count{errmsg="no more connections allowed (max_client_conn)"}[2m]) > 0 for: 0m labels: severity: critical annotations: summary: PGBouncer max connections (instance {{ $labels.instance }}) description: "The number of PGBouncer client connections has reached max_client_conn.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.7. Redis : oliver006/redis_exporter (12 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/redis/oliver006-redis-exporter.yml
-
# 2.7.1. Redis down
Redis instance is down [copy] # 1m delay allows a restart without triggering an alert. - alert: RedisDown expr: redis_up == 0 for: 1m labels: severity: critical annotations: summary: Redis down (instance {{ $labels.instance }}) description: "Redis instance is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.2. Redis missing master
Redis cluster has no node marked as master. [copy] - alert: RedisMissingMaster expr: (count(redis_instance_info{role="master"}) or vector(0)) < 1 for: 0m labels: severity: critical annotations: summary: Redis missing master (instance {{ $labels.instance }}) description: "Redis cluster has no node marked as master.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
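The `or vector(0)` wrapper matters here: `count()` over zero matching series returns an empty result rather than 0, so without it the `< 1` comparison would never produce a sample and the alert would silently never fire. The idiom in isolation:

```yaml
# The "or vector(0)" idiom: count() over no matching series yields an
# empty result, not 0. Substituting an explicit vector(0) lets the < 1
# comparison fire when no master exists at all.
- alert: RedisMissingMaster
  expr: (count(redis_instance_info{role="master"}) or vector(0)) < 1
  for: 0m
  labels:
    severity: critical
```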
-
# 2.7.3. Redis too many masters
Redis cluster has too many nodes marked as master. [copy] # 1m delay allows a restart without triggering an alert. - alert: RedisTooManyMasters expr: count(redis_instance_info{role="master"}) > 1 for: 1m labels: severity: critical annotations: summary: Redis too many masters (instance {{ $labels.instance }}) description: "Redis cluster has too many nodes marked as master.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.4. Redis disconnected slaves
Redis is not replicating to all slaves. Consider reviewing the Redis replication status. [copy] - alert: RedisDisconnectedSlaves expr: count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 0 for: 0m labels: severity: critical annotations: summary: Redis disconnected slaves (instance {{ $labels.instance }}) description: "Redis is not replicating to all slaves. Consider reviewing the Redis replication status.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.5. Redis replication broken
Redis instance lost a slave [copy] - alert: RedisReplicationBroken expr: delta(redis_connected_slaves[1m]) < 0 for: 0m labels: severity: critical annotations: summary: Redis replication broken (instance {{ $labels.instance }}) description: "Redis instance lost a slave\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.6. Redis cluster flapping
Changes have been detected in Redis replica connections. This can occur when replica nodes lose connection to the master and reconnect (a.k.a. flapping). [copy] - alert: RedisClusterFlapping expr: changes(redis_connected_slaves[1m]) > 1 for: 2m labels: severity: critical annotations: summary: Redis cluster flapping (instance {{ $labels.instance }}) description: "Changes have been detected in Redis replica connections. This can occur when replica nodes lose connection to the master and reconnect (a.k.a. flapping).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.7. Redis missing backup
Redis has not been backed up for 48 hours [copy] - alert: RedisMissingBackup expr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 48 for: 0m labels: severity: critical annotations: summary: Redis missing backup (instance {{ $labels.instance }}) description: "Redis has not been backed up for 48 hours\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.8. Redis out of system memory
Redis is running out of system memory (> 90%) [copy] # The exporter must be started with --include-system-metrics flag or REDIS_EXPORTER_INCL_SYSTEM_METRICS=true environment variable. - alert: RedisOutOfSystemMemory expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90 for: 2m labels: severity: warning annotations: summary: Redis out of system memory (instance {{ $labels.instance }}) description: "Redis is running out of system memory (> 90%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.9. Redis out of configured maxmemory
Redis is running out of configured maxmemory (> 90%) [copy] - alert: RedisOutOfConfiguredMaxmemory expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 90 and on(instance) redis_memory_max_bytes > 0 for: 2m labels: severity: warning annotations: summary: Redis out of configured maxmemory (instance {{ $labels.instance }}) description: "Redis is running out of configured maxmemory (> 90%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.10. Redis too many connections
Redis is running out of connections (> 90% used) [copy] - alert: RedisTooManyConnections expr: redis_connected_clients / redis_config_maxclients * 100 > 90 for: 2m labels: severity: warning annotations: summary: Redis too many connections (instance {{ $labels.instance }}) description: "Redis is running out of connections (> 90% used)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.11. Redis not enough connections
Redis instance has fewer client connections than expected (< 5) [copy] - alert: RedisNotEnoughConnections expr: redis_connected_clients < 5 for: 2m labels: severity: warning annotations: summary: Redis not enough connections (instance {{ $labels.instance }}) description: "Redis instance has fewer client connections than expected (< 5)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.12. Redis rejected connections
Some connections to Redis have been rejected [copy] - alert: RedisRejectedConnections expr: increase(redis_rejected_connections_total[1m]) > 5 for: 0m labels: severity: warning annotations: summary: Redis rejected connections (instance {{ $labels.instance }}) description: "Some connections to Redis have been rejected\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.8. Memcached : prometheus/memcached_exporter (9 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/memcached/memcached-exporter.yml
-
# 2.8.1. Memcached down
Memcached instance is down on {{ $labels.instance }} [copy] # 1m delay allows a restart without triggering an alert. - alert: MemcachedDown expr: memcached_up == 0 for: 1m labels: severity: critical annotations: summary: Memcached down (instance {{ $labels.instance }}) description: "Memcached instance is down on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.8.2. Memcached connection limit approaching (> 80%)
Memcached connection usage is above 80% on {{ $labels.instance }} (current value: {{ $value }}%) [copy] - alert: MemcachedConnectionLimitApproaching(>80%) expr: (memcached_current_connections / memcached_max_connections * 100) > 80 and memcached_max_connections > 0 for: 2m labels: severity: warning annotations: summary: Memcached connection limit approaching (> 80%) (instance {{ $labels.instance }}) description: "Memcached connection usage is above 80% on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.8.3. Memcached connection limit approaching (> 95%)
Memcached connection usage is above 95% on {{ $labels.instance }} (current value: {{ $value }}%) [copy] - alert: MemcachedConnectionLimitApproaching(>95%) expr: (memcached_current_connections / memcached_max_connections * 100) > 95 and memcached_max_connections > 0 for: 2m labels: severity: critical annotations: summary: Memcached connection limit approaching (> 95%) (instance {{ $labels.instance }}) description: "Memcached connection usage is above 95% on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.8.4. Memcached out of memory errors
Memcached is returning out-of-memory errors on {{ $labels.instance }} [copy] - alert: MemcachedOutOfMemoryErrors expr: sum without (slab) (rate(memcached_slab_items_outofmemory_total[5m])) > 0 for: 5m labels: severity: warning annotations: summary: Memcached out of memory errors (instance {{ $labels.instance }}) description: "Memcached is returning out-of-memory errors on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.8.5. Memcached memory usage high (> 90%)
Memcached memory usage is above 90% on {{ $labels.instance }} (current value: {{ $value }}%) [copy] # High memory usage is expected if the cache is well-utilized. This alert fires when it approaches the configured limit, which may cause evictions. - alert: MemcachedMemoryUsageHigh(>90%) expr: (memcached_current_bytes / memcached_limit_bytes * 100) > 90 and memcached_limit_bytes > 0 for: 5m labels: severity: warning annotations: summary: Memcached memory usage high (> 90%) (instance {{ $labels.instance }}) description: "Memcached memory usage is above 90% on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.8.6. Memcached high eviction rate
Memcached is evicting items at a high rate on {{ $labels.instance }} ({{ $value }} evictions/s) [copy] # A sustained eviction rate indicates memory pressure. Consider increasing memcached memory limit or reducing cache usage. Threshold of 10 evictions/s is a rough default — adjust based on your workload. - alert: MemcachedHighEvictionRate expr: rate(memcached_items_evicted_total[5m]) > 10 for: 5m labels: severity: warning annotations: summary: Memcached high eviction rate (instance {{ $labels.instance }}) description: "Memcached is evicting items at a high rate on {{ $labels.instance }} ({{ $value }} evictions/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.8.7. Memcached low cache hit rate (< 80%)
Memcached cache hit rate is below 80% on {{ $labels.instance }} (current value: {{ $value }}%) [copy] # A low hit rate may indicate poor cache utilization, incorrect cache keys, or TTLs that are too short. Threshold of 80% is a rough default — adjust based on your workload and access patterns. - alert: MemcachedLowCacheHitRate(<80%) expr: (rate(memcached_commands_total{command="get", status="hit"}[5m]) / (rate(memcached_commands_total{command="get", status="hit"}[5m]) + rate(memcached_commands_total{command="get", status="miss"}[5m])) * 100) < 80 and (rate(memcached_commands_total{command="get", status="hit"}[5m]) + rate(memcached_commands_total{command="get", status="miss"}[5m])) > 0 for: 10m labels: severity: warning annotations: summary: Memcached low cache hit rate (< 80%) (instance {{ $labels.instance }}) description: "Memcached cache hit rate is below 80% on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
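The hit rate is hits divided by total gets: for example, 900 hits/s against 300 misses/s gives 900 / (900 + 300) * 100 = 75%, which would fire the < 80% alert. As a sketch (the recording-rule name is hypothetical), the ratio as a recording rule:

```yaml
# Sketch (hypothetical rule name): memcached GET hit rate over 5m.
# Example: 900 hits/s and 300 misses/s gives 900 / (900 + 300) * 100 = 75%,
# below the alert's 80% default.
groups:
  - name: memcached-hit-rate
    rules:
      - record: memcached:hit_rate_pct:rate5m
        expr: rate(memcached_commands_total{command="get", status="hit"}[5m]) / (rate(memcached_commands_total{command="get", status="hit"}[5m]) + rate(memcached_commands_total{command="get", status="miss"}[5m])) * 100
```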
-
# 2.8.8. Memcached connections rejected
Memcached is rejecting connections on {{ $labels.instance }} ({{ $value }} rejections in the last 5m) [copy] - alert: MemcachedConnectionsRejected expr: increase(memcached_connections_rejected_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Memcached connections rejected (instance {{ $labels.instance }}) description: "Memcached is rejecting connections on {{ $labels.instance }} ({{ $value }} rejections in the last 5m)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.8.9. Memcached items too large
Memcached is rejecting items exceeding max-item-size on {{ $labels.instance }} ({{ $value }} rejections in the last 5m) [copy] - alert: MemcachedItemsTooLarge expr: increase(memcached_item_too_large_total[5m]) > 0 for: 5m labels: severity: info annotations: summary: Memcached items too large (instance {{ $labels.instance }}) description: "Memcached is rejecting items exceeding max-item-size on {{ $labels.instance }} ({{ $value }} rejections in the last 5m)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.9.1. MongoDB : percona/mongodb_exporter (7 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/mongodb/percona-mongodb-exporter.yml
-
# 2.9.1.1. MongoDB Down
MongoDB instance is down [copy] # 1m delay allows a restart without triggering an alert. - alert: MongodbDown expr: mongodb_up == 0 for: 1m labels: severity: critical annotations: summary: MongoDB Down (instance {{ $labels.instance }}) description: "MongoDB instance is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.1.2. Mongodb replica member unhealthy
MongoDB replica member is not healthy [copy] # 1m delay allows a restart without triggering an alert. - alert: MongodbReplicaMemberUnhealthy expr: mongodb_rs_members_health == 0 for: 1m labels: severity: critical annotations: summary: Mongodb replica member unhealthy (instance {{ $labels.instance }}) description: "MongoDB replica member is not healthy\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.1.3. MongoDB replication lag
MongoDB replication lag is more than 10s [copy] - alert: MongodbReplicationLag expr: (mongodb_rs_members_optimeDate{member_state="PRIMARY"} - on (set) group_right mongodb_rs_members_optimeDate{member_state="SECONDARY"}) / 1000 > 10 for: 0m labels: severity: critical annotations: summary: MongoDB replication lag (instance {{ $labels.instance }}) description: "MongoDB replication lag is more than 10s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.1.4. MongoDB replication headroom
MongoDB replication headroom is <= 0 [copy] # This query mixes old (mongodb_mongod_*) and new (mongodb_rs_*) metric names. It requires the Percona exporter to run with --compatible-mode to expose both. - alert: MongodbReplicationHeadroom expr: sum(avg(mongodb_mongod_replset_oplog_head_timestamp - mongodb_mongod_replset_oplog_tail_timestamp)) - sum(avg(mongodb_rs_members_optimeDate{member_state="PRIMARY"} - on (set) group_right mongodb_rs_members_optimeDate{member_state="SECONDARY"})) <= 0 for: 0m labels: severity: critical annotations: summary: MongoDB replication headroom (instance {{ $labels.instance }}) description: "MongoDB replication headroom is <= 0\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.1.5. MongoDB number cursors open
Too many cursors opened by MongoDB for clients (> 10k) [copy] - alert: MongodbNumberCursorsOpen expr: mongodb_ss_metrics_cursor_open{csr_type="total"} > 10 * 1000 for: 2m labels: severity: warning annotations: summary: MongoDB number cursors open (instance {{ $labels.instance }}) description: "Too many cursors opened by MongoDB for clients (> 10k)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.1.6. MongoDB cursors timeouts
Too many cursors are timing out [copy] - alert: MongodbCursorsTimeouts expr: increase(mongodb_ss_metrics_cursor_timedOut[1m]) > 100 for: 2m labels: severity: warning annotations: summary: MongoDB cursors timeouts (instance {{ $labels.instance }}) description: "Too many cursors are timing out\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.1.7. MongoDB too many connections
Too many connections (> 80%) [copy] - alert: MongodbTooManyConnections expr: mongodb_ss_connections{conn_type="current"} / (mongodb_ss_connections{conn_type="current"} + mongodb_ss_connections{conn_type="available"}) * 100 > 80 for: 2m labels: severity: warning annotations: summary: MongoDB too many connections (instance {{ $labels.instance }}) description: "Too many connections (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.9.2. MongoDB : dcu/mongodb_exporter (9 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/mongodb/dcu-mongodb-exporter.yml
-
# 2.9.2.1. MongoDB replication lag
MongoDB replication lag is more than 10s [copy] - alert: MongodbReplicationLag expr: avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}) > 10 for: 0m labels: severity: critical annotations: summary: MongoDB replication lag (instance {{ $labels.instance }}) description: "MongoDB replication lag is more than 10s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.2.2. MongoDB replication Status 3
MongoDB replica set member is either performing startup self-checks, or transitioning from completing a rollback or resync [copy] - alert: MongodbReplicationStatus3 expr: mongodb_replset_member_state == 3 for: 0m labels: severity: critical annotations: summary: MongoDB replication Status 3 (instance {{ $labels.instance }}) description: "MongoDB replica set member is either performing startup self-checks, or transitioning from completing a rollback or resync\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.2.3. MongoDB replication Status 6
MongoDB replica set member, as seen from another member of the set, has a state that is not yet known [copy] - alert: MongodbReplicationStatus6 expr: mongodb_replset_member_state == 6 for: 0m labels: severity: critical annotations: summary: MongoDB replication Status 6 (instance {{ $labels.instance }}) description: "MongoDB replica set member, as seen from another member of the set, has a state that is not yet known\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.2.4. MongoDB replication Status 8
MongoDB replica set member, as seen from another member of the set, is unreachable [copy] - alert: MongodbReplicationStatus8 expr: mongodb_replset_member_state == 8 for: 0m labels: severity: critical annotations: summary: MongoDB replication Status 8 (instance {{ $labels.instance }}) description: "MongoDB replica set member, as seen from another member of the set, is unreachable\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.2.5. MongoDB replication Status 9
MongoDB replica set member is actively performing a rollback. Data is not available for reads [copy] - alert: MongodbReplicationStatus9 expr: mongodb_replset_member_state == 9 for: 0m labels: severity: critical annotations: summary: MongoDB replication Status 9 (instance {{ $labels.instance }}) description: "MongoDB replica set member is actively performing a rollback. Data is not available for reads\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.2.6. MongoDB replication Status 10
MongoDB replica set member was once in a replica set but was subsequently removed [copy] - alert: MongodbReplicationStatus10 expr: mongodb_replset_member_state == 10 for: 0m labels: severity: critical annotations: summary: MongoDB replication Status 10 (instance {{ $labels.instance }}) description: "MongoDB replica set member was once in a replica set but was subsequently removed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.2.7. MongoDB number cursors open
Too many cursors opened by MongoDB for clients (> 10k) [copy] - alert: MongodbNumberCursorsOpen expr: mongodb_metrics_cursor_open{state="total_open"} > 10000 for: 2m labels: severity: warning annotations: summary: MongoDB number cursors open (instance {{ $labels.instance }}) description: "Too many cursors opened by MongoDB for clients (> 10k)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.2.8. MongoDB cursors timeouts
Too many cursors are timing out [copy] - alert: MongodbCursorsTimeouts expr: increase(mongodb_metrics_cursor_timed_out_total[1m]) > 100 for: 2m labels: severity: warning annotations: summary: MongoDB cursors timeouts (instance {{ $labels.instance }}) description: "Too many cursors are timing out\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.2.9. MongoDB too many connections
Too many connections (> 80%) [copy] - alert: MongodbTooManyConnections expr: mongodb_connections{state="current"} / (mongodb_connections{state="current"} + mongodb_connections{state="available"}) * 100 > 80 for: 2m labels: severity: warning annotations: summary: MongoDB too many connections (instance {{ $labels.instance }}) description: "Too many connections (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.9.3. MongoDB : stefanprodan/mgob (1 rule) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/mongodb/stefanprodan-mgob-exporter.yml
-
# 2.9.3.1. Mgob backup failed
MongoDB backup has failed [copy] - alert: MgobBackupFailed expr: changes(mgob_scheduler_backup_total{status="500"}[1h]) > 0 for: 0m labels: severity: critical annotations: summary: Mgob backup failed (instance {{ $labels.instance }}) description: "MongoDB backup has failed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.10.1. RabbitMQ : rabbitmq/rabbitmq-prometheus (10 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/rabbitmq/rabbitmq-exporter.yml
-
# 2.10.1.1. RabbitMQ node down
Fewer than 3 nodes running in the RabbitMQ cluster [copy] # 1m delay allows a restart without triggering an alert. - alert: RabbitmqNodeDown expr: sum(rabbitmq_build_info) < 3 for: 1m labels: severity: critical annotations: summary: RabbitMQ node down (instance {{ $labels.instance }}) description: "Fewer than 3 nodes running in the RabbitMQ cluster\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.1.2. RabbitMQ node not distributed
Distribution link state is not 'up' [copy] # 1m delay allows a restart without triggering an alert. - alert: RabbitmqNodeNotDistributed expr: erlang_vm_dist_node_state < 3 for: 1m labels: severity: critical annotations: summary: RabbitMQ node not distributed (instance {{ $labels.instance }}) description: "Distribution link state is not 'up'\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.1.3. RabbitMQ instances different versions
Running different versions of RabbitMQ in the same cluster can lead to failure. [copy] - alert: RabbitmqInstancesDifferentVersions expr: count(count(rabbitmq_build_info) by (rabbitmq_version)) > 1 for: 1h labels: severity: warning annotations: summary: RabbitMQ instances different versions (instance {{ $labels.instance }}) description: "Running different versions of RabbitMQ in the same cluster can lead to failure.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
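The nested `count(count(rabbitmq_build_info) by (rabbitmq_version))` counts distinct version labels across the cluster. In plain terms, sketched here over a list of per-node version strings:

```python
# Illustrative sketch: the RabbitmqInstancesDifferentVersions condition is
# simply "more than one distinct rabbitmq_version label in the cluster".

def mixed_versions(node_versions):
    return len(set(node_versions)) > 1

print(mixed_versions(["3.12.4", "3.12.4", "3.13.0"]))  # True -> skew detected
print(mixed_versions(["3.12.4"] * 3))                  # False -> homogeneous
```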
-
# 2.10.1.4. RabbitMQ memory high
A node uses more than 90% of allocated RAM [copy] - alert: RabbitmqMemoryHigh expr: rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes * 100 > 90 and rabbitmq_resident_memory_limit_bytes > 0 for: 2m labels: severity: warning annotations: summary: RabbitMQ memory high (instance {{ $labels.instance }}) description: "A node uses more than 90% of allocated RAM\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
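The trailing `and rabbitmq_resident_memory_limit_bytes > 0` matters: it drops series whose memory limit is zero or unset, so the division can never fire spuriously. A sketch of the combined condition:

```python
# Illustrative sketch of RabbitmqMemoryHigh: usage percentage check with the
# zero-limit guard mirrored from the PromQL "and ... > 0" clause.

def memory_high(used_bytes, limit_bytes, threshold_pct=90):
    if limit_bytes <= 0:     # mirrors the PromQL guard: no limit, no alert
        return False
    return used_bytes / limit_bytes * 100 > threshold_pct

print(memory_high(950_000_000, 1_000_000_000))  # True  (95% of the limit)
print(memory_high(500_000_000, 0))              # False (guard takes effect)
```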
-
# 2.10.1.5. RabbitMQ file descriptors usage
A node uses more than 90% of file descriptors [copy] - alert: RabbitmqFileDescriptorsUsage expr: rabbitmq_process_open_fds / rabbitmq_process_max_fds * 100 > 90 and rabbitmq_process_max_fds > 0 for: 2m labels: severity: warning annotations: summary: RabbitMQ file descriptors usage (instance {{ $labels.instance }}) description: "A node uses more than 90% of file descriptors\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.1.6. RabbitMQ too many ready messages
RabbitMQ too many ready messages on {{ $labels.instance }} [copy] - alert: RabbitmqTooManyReadyMessages expr: sum(rabbitmq_queue_messages_ready) BY (queue) > 1000 for: 1m labels: severity: warning annotations: summary: RabbitMQ too many ready messages (instance {{ $labels.instance }}) description: "RabbitMQ too many ready messages on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.1.7. RabbitMQ too many unack messages
Too many unacknowledged messages [copy] - alert: RabbitmqTooManyUnackMessages expr: sum(rabbitmq_queue_messages_unacked) BY (queue) > 1000 for: 1m labels: severity: warning annotations: summary: RabbitMQ too many unack messages (instance {{ $labels.instance }}) description: "Too many unacknowledged messages\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.1.8. RabbitMQ too many connections
The total number of connections on a node is too high [copy] - alert: RabbitmqTooManyConnections expr: rabbitmq_connections > 1000 for: 2m labels: severity: warning annotations: summary: RabbitMQ too many connections (instance {{ $labels.instance }}) description: "The total number of connections on a node is too high\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.1.9. RabbitMQ no queue consumer
A queue has less than 1 consumer [copy] - alert: RabbitmqNoQueueConsumer expr: rabbitmq_queue_consumers < 1 for: 1m labels: severity: warning annotations: summary: RabbitMQ no queue consumer (instance {{ $labels.instance }}) description: "A queue has less than 1 consumer\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.1.10. RabbitMQ unroutable messages
A queue has unroutable messages [copy] - alert: RabbitmqUnroutableMessages expr: increase(rabbitmq_channel_messages_unroutable_returned_total[1m]) > 0 or increase(rabbitmq_channel_messages_unroutable_dropped_total[1m]) > 0 for: 2m labels: severity: warning annotations: summary: RabbitMQ unroutable messages (instance {{ $labels.instance }}) description: "A queue has unroutable messages\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.10.2. RabbitMQ : kbudde/rabbitmq-exporter (11 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/rabbitmq/kbudde-rabbitmq-exporter.yml
-
# 2.10.2.1. RabbitMQ down
RabbitMQ node down [copy] # 1m delay allows a restart without triggering an alert. - alert: RabbitmqDown expr: rabbitmq_up == 0 for: 1m labels: severity: critical annotations: summary: RabbitMQ down (instance {{ $labels.instance }}) description: "RabbitMQ node down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.2.2. RabbitMQ cluster down
Fewer than 3 nodes running in the RabbitMQ cluster [copy] # 1m delay allows a restart without triggering an alert. - alert: RabbitmqClusterDown expr: sum(rabbitmq_running) < 3 for: 1m labels: severity: critical annotations: summary: RabbitMQ cluster down (instance {{ $labels.instance }}) description: "Fewer than 3 nodes running in the RabbitMQ cluster\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.2.3. RabbitMQ cluster partition
Cluster partition [copy] - alert: RabbitmqClusterPartition expr: rabbitmq_partitions > 0 for: 0m labels: severity: critical annotations: summary: RabbitMQ cluster partition (instance {{ $labels.instance }}) description: "Cluster partition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.2.4. RabbitMQ out of memory
Memory available for RabbitMQ is low (< 10%) [copy] - alert: RabbitmqOutOfMemory expr: rabbitmq_node_mem_used / rabbitmq_node_mem_limit * 100 > 90 and rabbitmq_node_mem_limit > 0 for: 2m labels: severity: warning annotations: summary: RabbitMQ out of memory (instance {{ $labels.instance }}) description: "Memory available for RabbitMQ is low (< 10%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.2.5. RabbitMQ too many connections
RabbitMQ instance has too many connections (> 1000) [copy] - alert: RabbitmqTooManyConnections expr: rabbitmq_connectionsTotal > 1000 for: 2m labels: severity: warning annotations: summary: RabbitMQ too many connections (instance {{ $labels.instance }}) description: "RabbitMQ instance has too many connections (> 1000)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.2.6. RabbitMQ dead letter queue filling up
Dead letter queue is filling up (> 10 msgs) [copy] # Indicate the queue name in a dedicated label. - alert: RabbitmqDeadLetterQueueFillingUp expr: rabbitmq_queue_messages{queue="my-dead-letter-queue"} > 10 for: 1m labels: severity: warning annotations: summary: RabbitMQ dead letter queue filling up (instance {{ $labels.instance }}) description: "Dead letter queue is filling up (> 10 msgs)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.2.7. RabbitMQ too many messages in queue
Queue is filling up (> 1000 msgs) [copy] # Indicate the queue name in a dedicated label. - alert: RabbitmqTooManyMessagesInQueue expr: rabbitmq_queue_messages_ready{queue="my-queue"} > 1000 for: 2m labels: severity: warning annotations: summary: RabbitMQ too many messages in queue (instance {{ $labels.instance }}) description: "Queue is filling up (> 1000 msgs)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.2.8. RabbitMQ slow queue consuming
Queue messages are consumed slowly (> 60s) [copy] # Indicate the queue name in a dedicated label. - alert: RabbitmqSlowQueueConsuming expr: time() - rabbitmq_queue_head_message_timestamp{queue="my-queue"} > 60 for: 2m labels: severity: warning annotations: summary: RabbitMQ slow queue consuming (instance {{ $labels.instance }}) description: "Queue messages are consumed slowly (> 60s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
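`time() - rabbitmq_queue_head_message_timestamp` yields the age, in seconds, of the oldest message still sitting at the head of the queue; if that age exceeds 60s, consumption is lagging. Sketched:

```python
# Illustrative sketch of RabbitmqSlowQueueConsuming: evaluation time minus
# the head message's publish timestamp gives its age in seconds.

import time

def head_message_age(head_timestamp, now=None):
    now = time.time() if now is None else now
    return now - head_timestamp

def queue_consuming_slowly(head_timestamp, now, threshold_s=60):
    return head_message_age(head_timestamp, now) > threshold_s

# Head message published 90s before evaluation -> slower than the threshold
print(queue_consuming_slowly(head_timestamp=1_000_000, now=1_000_090))  # True
```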
-
# 2.10.2.9. RabbitMQ no consumer
Queue has no consumer [copy] # Allows a short service restart. - alert: RabbitmqNoConsumer expr: rabbitmq_queue_consumers == 0 for: 5m labels: severity: critical annotations: summary: RabbitMQ no consumer (instance {{ $labels.instance }}) description: "Queue has no consumer\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.2.10. RabbitMQ too many consumers
Queue should have only 1 consumer [copy] # Indicate the queue name in a dedicated label. - alert: RabbitmqTooManyConsumers expr: rabbitmq_queue_consumers{queue="my-queue"} > 1 for: 0m labels: severity: critical annotations: summary: RabbitMQ too many consumers (instance {{ $labels.instance }}) description: "Queue should have only 1 consumer\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.2.11. RabbitMQ inactive exchange
Exchange receives fewer than 5 msgs per second [copy] # Indicate the exchange name in a dedicated label. - alert: RabbitmqInactiveExchange expr: rate(rabbitmq_exchange_messages_published_in_total{exchange="my-exchange"}[1m]) < 5 for: 2m labels: severity: warning annotations: summary: RabbitMQ inactive exchange (instance {{ $labels.instance }}) description: "Exchange receives fewer than 5 msgs per second\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.11. Elasticsearch : prometheus-community/elasticsearch_exporter (19 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/elasticsearch/prometheus-community-elasticsearch-exporter.yml
-
# 2.11.1. Elasticsearch Heap Usage Too High
The heap usage is over 90% [copy] - alert: ElasticsearchHeapUsageTooHigh expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90 for: 2m labels: severity: critical annotations: summary: Elasticsearch Heap Usage Too High (instance {{ $labels.instance }}) description: "The heap usage is over 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.2. Elasticsearch Heap Usage warning
The heap usage is over 80% [copy] - alert: ElasticsearchHeapUsageWarning expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 80 for: 2m labels: severity: warning annotations: summary: Elasticsearch Heap Usage warning (instance {{ $labels.instance }}) description: "The heap usage is over 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.3. Elasticsearch disk out of space
The disk usage is over 90% [copy] - alert: ElasticsearchDiskOutOfSpace expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 10 for: 0m labels: severity: critical annotations: summary: Elasticsearch disk out of space (instance {{ $labels.instance }}) description: "The disk usage is over 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
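Note the inversion: the expression checks free space below 10%, while the description speaks of usage above 90%; the two conditions are equivalent. A sketch:

```python
# Illustrative sketch of ElasticsearchDiskOutOfSpace:
# available / size * 100 < 10 is the same as used / size * 100 > 90.

def disk_out_of_space(available_bytes, size_bytes, free_pct_threshold=10):
    if size_bytes == 0:      # hypothetical guard, not part of the PromQL rule
        return False
    free_pct = available_bytes / size_bytes * 100
    return free_pct < free_pct_threshold

print(disk_out_of_space(5, 100))   # True  (5% free, i.e. 95% used)
print(disk_out_of_space(30, 100))  # False (30% free)
```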
-
# 2.11.4. Elasticsearch disk space low
The disk usage is over 80% [copy] - alert: ElasticsearchDiskSpaceLow expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 20 for: 2m labels: severity: warning annotations: summary: Elasticsearch disk space low (instance {{ $labels.instance }}) description: "The disk usage is over 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.5. Elasticsearch Cluster Red
Elastic Cluster Red status [copy] - alert: ElasticsearchClusterRed expr: elasticsearch_cluster_health_status{color="red"} == 1 for: 0m labels: severity: critical annotations: summary: Elasticsearch Cluster Red (instance {{ $labels.instance }}) description: "Elastic Cluster Red status\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.6. Elasticsearch Cluster Yellow
Elastic Cluster Yellow status [copy] - alert: ElasticsearchClusterYellow expr: elasticsearch_cluster_health_status{color="yellow"} == 1 for: 0m labels: severity: warning annotations: summary: Elasticsearch Cluster Yellow (instance {{ $labels.instance }}) description: "Elastic Cluster Yellow status\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.7. Elasticsearch Healthy Nodes
Missing node in Elasticsearch cluster [copy] # 1m delay allows a restart without triggering an alert. - alert: ElasticsearchHealthyNodes expr: elasticsearch_cluster_health_number_of_nodes < 3 for: 1m labels: severity: critical annotations: summary: Elasticsearch Healthy Nodes (instance {{ $labels.instance }}) description: "Missing node in Elasticsearch cluster\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.8. Elasticsearch Healthy Data Nodes
Missing data node in Elasticsearch cluster [copy] # 1m delay allows a restart without triggering an alert. - alert: ElasticsearchHealthyDataNodes expr: elasticsearch_cluster_health_number_of_data_nodes < 3 for: 1m labels: severity: critical annotations: summary: Elasticsearch Healthy Data Nodes (instance {{ $labels.instance }}) description: "Missing data node in Elasticsearch cluster\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.9. Elasticsearch relocating shards
Elasticsearch is relocating shards [copy] - alert: ElasticsearchRelocatingShards expr: elasticsearch_cluster_health_relocating_shards > 0 for: 0m labels: severity: info annotations: summary: Elasticsearch relocating shards (instance {{ $labels.instance }}) description: "Elasticsearch is relocating shards\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.10. Elasticsearch relocating shards too long
Elasticsearch has been relocating shards for 15min [copy] - alert: ElasticsearchRelocatingShardsTooLong expr: elasticsearch_cluster_health_relocating_shards > 0 for: 15m labels: severity: warning annotations: summary: Elasticsearch relocating shards too long (instance {{ $labels.instance }}) description: "Elasticsearch has been relocating shards for 15min\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.11. Elasticsearch initializing shards
Elasticsearch is initializing shards [copy] - alert: ElasticsearchInitializingShards expr: elasticsearch_cluster_health_initializing_shards > 0 for: 0m labels: severity: info annotations: summary: Elasticsearch initializing shards (instance {{ $labels.instance }}) description: "Elasticsearch is initializing shards\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.12. Elasticsearch initializing shards too long
Elasticsearch has been initializing shards for 15 min [copy] - alert: ElasticsearchInitializingShardsTooLong expr: elasticsearch_cluster_health_initializing_shards > 0 for: 15m labels: severity: warning annotations: summary: Elasticsearch initializing shards too long (instance {{ $labels.instance }}) description: "Elasticsearch has been initializing shards for 15 min\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.13. Elasticsearch unassigned shards
Elasticsearch has unassigned shards [copy] - alert: ElasticsearchUnassignedShards expr: elasticsearch_cluster_health_unassigned_shards > 0 for: 2m labels: severity: critical annotations: summary: Elasticsearch unassigned shards (instance {{ $labels.instance }}) description: "Elasticsearch has unassigned shards\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.14. Elasticsearch pending tasks
Elasticsearch has pending tasks. Cluster works slowly. [copy] - alert: ElasticsearchPendingTasks expr: elasticsearch_cluster_health_number_of_pending_tasks > 0 for: 15m labels: severity: warning annotations: summary: Elasticsearch pending tasks (instance {{ $labels.instance }}) description: "Elasticsearch has pending tasks. Cluster works slowly.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.15. Elasticsearch no new documents
No new documents for 10 min! [copy] - alert: ElasticsearchNoNewDocuments expr: increase(elasticsearch_indices_indexing_index_total{es_data_node="true"}[10m]) < 1 for: 0m labels: severity: warning annotations: summary: Elasticsearch no new documents (instance {{ $labels.instance }}) description: "No new documents for 10 min!\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.16. Elasticsearch High Indexing Latency
The indexing latency on Elasticsearch cluster is higher than the threshold. [copy] - alert: ElasticsearchHighIndexingLatency expr: increase(elasticsearch_indices_indexing_index_time_seconds_total[1m]) / increase(elasticsearch_indices_indexing_index_total[1m]) > 0.0005 and increase(elasticsearch_indices_indexing_index_total[1m]) > 0 for: 10m labels: severity: warning annotations: summary: Elasticsearch High Indexing Latency (instance {{ $labels.instance }}) description: "The indexing latency on Elasticsearch cluster is higher than the threshold.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
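Dividing the growth of time-spent by the growth of the operation counter gives average seconds per indexing operation, and the `and increase(...) > 0` clause discards idle windows where the denominator would be zero. A sketch of that ratio-with-guard logic:

```python
# Illustrative sketch of ElasticsearchHighIndexingLatency: ratio of two
# counter deltas with a denominator guard mirroring "and increase(...) > 0".

def avg_op_latency(time_delta_s, ops_delta):
    if ops_delta <= 0:       # idle window -> no latency sample, alert cannot fire
        return None
    return time_delta_s / ops_delta

lat = avg_op_latency(time_delta_s=0.9, ops_delta=1000)
print(lat is not None and lat > 0.0005)  # True -> above the 0.0005 threshold
print(avg_op_latency(0.0, 0))            # None -> window skipped
```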
-
# 2.11.17. Elasticsearch High Indexing Rate
The indexing rate on Elasticsearch cluster is higher than the threshold. [copy] - alert: ElasticsearchHighIndexingRate expr: sum(rate(elasticsearch_indices_indexing_index_total[1m])) > 10000 for: 5m labels: severity: warning annotations: summary: Elasticsearch High Indexing Rate (instance {{ $labels.instance }}) description: "The indexing rate on Elasticsearch cluster is higher than the threshold.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.18. Elasticsearch High Query Rate
The query rate on Elasticsearch cluster is higher than the threshold. [copy] - alert: ElasticsearchHighQueryRate expr: sum(rate(elasticsearch_indices_search_query_total[1m])) > 100 for: 5m labels: severity: warning annotations: summary: Elasticsearch High Query Rate (instance {{ $labels.instance }}) description: "The query rate on Elasticsearch cluster is higher than the threshold.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.19. Elasticsearch High Query Latency
The query latency on Elasticsearch cluster is higher than the threshold. [copy] - alert: ElasticsearchHighQueryLatency expr: increase(elasticsearch_indices_search_query_time_seconds[1m]) / increase(elasticsearch_indices_search_query_total[1m]) > 1 and increase(elasticsearch_indices_search_query_total[1m]) > 0 for: 5m labels: severity: warning annotations: summary: Elasticsearch High Query Latency (instance {{ $labels.instance }}) description: "The query latency on Elasticsearch cluster is higher than the threshold.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.12. Meilisearch : Embedded exporter (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/meilisearch/embedded-exporter.yml
-
# 2.12.1. Meilisearch index is empty
Meilisearch index {{ $labels.index }} has zero documents [copy] - alert: MeilisearchIndexIsEmpty expr: meilisearch_index_docs_count == 0 for: 0m labels: severity: warning annotations: summary: Meilisearch index is empty (instance {{ $labels.instance }}) description: "Meilisearch index {{ $labels.index }} has zero documents\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.12.2. Meilisearch http response time
Meilisearch http response time is too high [copy] - alert: MeilisearchHttpResponseTime expr: meilisearch_http_response_time_seconds > 0.5 for: 0m labels: severity: warning annotations: summary: Meilisearch http response time (instance {{ $labels.instance }}) description: "Meilisearch http response time is too high\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.13.1. Cassandra : instaclustr/cassandra-exporter (12 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/cassandra/instaclustr-cassandra-exporter.yml
-
# 2.13.1.1. Cassandra Node is unavailable
Cassandra Node is unavailable - {{ $labels.cassandra_cluster }} {{ $labels.exported_endpoint }} [copy] # 1m delay allows a restart without triggering an alert. - alert: CassandraNodeIsUnavailable expr: cassandra_endpoint_active < 1 for: 1m labels: severity: critical annotations: summary: Cassandra Node is unavailable (instance {{ $labels.instance }}) description: "Cassandra Node is unavailable - {{ $labels.cassandra_cluster }} {{ $labels.exported_endpoint }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.2. Cassandra many compaction tasks are pending
Many Cassandra compaction tasks are pending - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraManyCompactionTasksArePending expr: cassandra_table_estimated_pending_compactions > 100 for: 0m labels: severity: warning annotations: summary: Cassandra many compaction tasks are pending (instance {{ $labels.instance }}) description: "Many Cassandra compaction tasks are pending - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.3. Cassandra commitlog pending tasks
Cassandra commitlog pending tasks - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraCommitlogPendingTasks expr: cassandra_commit_log_pending_tasks > 15 for: 2m labels: severity: warning annotations: summary: Cassandra commitlog pending tasks (instance {{ $labels.instance }}) description: "Cassandra commitlog pending tasks - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.4. Cassandra compaction executor blocked tasks
Some Cassandra compaction executor tasks are blocked - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraCompactionExecutorBlockedTasks expr: cassandra_thread_pool_blocked_tasks{pool="CompactionExecutor"} > 15 for: 2m labels: severity: warning annotations: summary: Cassandra compaction executor blocked tasks (instance {{ $labels.instance }}) description: "Some Cassandra compaction executor tasks are blocked - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.5. Cassandra flush writer blocked tasks
Some Cassandra flush writer tasks are blocked - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraFlushWriterBlockedTasks expr: cassandra_thread_pool_blocked_tasks{pool="MemtableFlushWriter"} > 15 for: 2m labels: severity: warning annotations: summary: Cassandra flush writer blocked tasks (instance {{ $labels.instance }}) description: "Some Cassandra flush writer tasks are blocked - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.6. Cassandra connection timeouts total
Some connections between nodes are timing out - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraConnectionTimeoutsTotal expr: sum by (cassandra_cluster,instance) (rate(cassandra_client_request_timeouts_total[5m])) > 5 for: 2m labels: severity: critical annotations: summary: Cassandra connection timeouts total (instance {{ $labels.instance }}) description: "Some connections between nodes are timing out - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.7. Cassandra storage exceptions
Something is going wrong with Cassandra storage - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraStorageExceptions expr: changes(cassandra_storage_exceptions_total[1m]) > 1 for: 0m labels: severity: critical annotations: summary: Cassandra storage exceptions (instance {{ $labels.instance }}) description: "Something is going wrong with Cassandra storage - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.8. Cassandra tombstone dump
Cassandra tombstone dump - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraTombstoneDump expr: avg(cassandra_table_tombstones_scanned{quantile="0.99"}) by (instance,cassandra_cluster,keyspace) > 100 for: 2m labels: severity: critical annotations: summary: Cassandra tombstone dump (instance {{ $labels.instance }}) description: "Cassandra tombstone dump - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.9. Cassandra client request unavailable write
Some Cassandra client write requests are failing with unavailable exceptions - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraClientRequestUnavailableWrite expr: changes(cassandra_client_request_unavailable_exceptions_total{operation="write"}[1m]) > 0 for: 2m labels: severity: critical annotations: summary: Cassandra client request unavailable write (instance {{ $labels.instance }}) description: "Some Cassandra client write requests are failing with unavailable exceptions - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.10. Cassandra client request unavailable read
Some Cassandra client read requests are failing with unavailable exceptions - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraClientRequestUnavailableRead expr: changes(cassandra_client_request_unavailable_exceptions_total{operation="read"}[1m]) > 0 for: 2m labels: severity: critical annotations: summary: Cassandra client request unavailable read (instance {{ $labels.instance }}) description: "Some Cassandra client read requests are failing with unavailable exceptions - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.11. Cassandra client request write failure
Write failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraClientRequestWriteFailure expr: increase(cassandra_client_request_failures_total{operation="write"}[1m]) > 0 for: 2m labels: severity: critical annotations: summary: Cassandra client request write failure (instance {{ $labels.instance }}) description: "Write failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.12. Cassandra client request read failure
Read failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraClientRequestReadFailure expr: increase(cassandra_client_request_failures_total{operation="read"}[1m]) > 0 for: 2m labels: severity: critical annotations: summary: Cassandra client request read failure (instance {{ $labels.instance }}) description: "Read failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.13.2. Cassandra : criteo/cassandra_exporter (18 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/cassandra/criteo-cassandra-exporter.yml
-
# 2.13.2.1. Cassandra hints count
Cassandra hints count has changed on {{ $labels.instance }}; some nodes may be down [copy] - alert: CassandraHintsCount expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:totalhints:count"}[1m]) > 3 for: 0m labels: severity: critical annotations: summary: Cassandra hints count (instance {{ $labels.instance }}) description: "Cassandra hints count has changed on {{ $labels.instance }}; some nodes may be down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
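`changes()` counts how often the sampled value differed from the previous sample inside the window (here, more than 3 changes in 1m means hints are actively accumulating). A rough illustration on raw samples:

```python
# Illustrative sketch of a changes()-style evaluation for CassandraHintsCount:
# count sample-to-sample value changes within the window, then apply "> 3".

def value_changes(samples):
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)

hints = [10, 10, 12, 15, 15, 18, 21]   # hint counter sampled over the window
print(value_changes(hints))            # 4 -> alert condition (> 3) met
```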
-
# 2.13.2.2. Cassandra compaction task pending
Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster. [copy] - alert: CassandraCompactionTaskPending expr: cassandra_stats{name="org:apache:cassandra:metrics:compaction:pendingtasks:value"} > 100 for: 2m labels: severity: warning annotations: summary: Cassandra compaction task pending (instance {{ $labels.instance }}) description: "Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.3. Cassandra viewwrite latency
High viewwrite latency on {{ $labels.instance }} cassandra node [copy] - alert: CassandraViewwriteLatency expr: cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:viewwrite:viewwritelatency:99thpercentile"} > 100000 for: 2m labels: severity: warning annotations: summary: Cassandra viewwrite latency (instance {{ $labels.instance }}) description: "High viewwrite latency on {{ $labels.instance }} cassandra node\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.4. Cassandra authentication failures
Increase of Cassandra authentication failures [copy] - alert: CassandraAuthenticationFailures expr: rate(cassandra_stats{name="org:apache:cassandra:metrics:client:authfailure:count"}[1m]) > 5 for: 2m labels: severity: warning annotations: summary: Cassandra authentication failures (instance {{ $labels.instance }}) description: "Increase of Cassandra authentication failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.5. Cassandra node down
Cassandra node down [copy] # 1m delay allows a restart without triggering an alert. - alert: CassandraNodeDown expr: sum(cassandra_stats{name="org:apache:cassandra:net:failuredetector:downendpointcount"}) by (service,group,cluster,env) > 0 for: 1m labels: severity: critical annotations: summary: Cassandra node down (instance {{ $labels.instance }}) description: "Cassandra node down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.6. Cassandra commitlog pending tasks
Unexpected number of Cassandra commitlog pending tasks [copy] - alert: CassandraCommitlogPendingTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:commitlog:pendingtasks:value"} > 15 for: 2m labels: severity: warning annotations: summary: Cassandra commitlog pending tasks (instance {{ $labels.instance }}) description: "Unexpected number of Cassandra commitlog pending tasks\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.7. Cassandra compaction executor blocked tasks
Some Cassandra compaction executor tasks are blocked [copy] - alert: CassandraCompactionExecutorBlockedTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:compactionexecutor:currentlyblockedtasks:count"} > 0 for: 2m labels: severity: warning annotations: summary: Cassandra compaction executor blocked tasks (instance {{ $labels.instance }}) description: "Some Cassandra compaction executor tasks are blocked\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.8. Cassandra flush writer blocked tasks
Some Cassandra flush writer tasks are blocked [copy] - alert: CassandraFlushWriterBlockedTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:memtableflushwriter:currentlyblockedtasks:count"} > 0 for: 2m labels: severity: warning annotations: summary: Cassandra flush writer blocked tasks (instance {{ $labels.instance }}) description: "Some Cassandra flush writer tasks are blocked\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.9. Cassandra repair pending tasks
Some Cassandra repair tasks are pending [copy] - alert: CassandraRepairPendingTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:pendingtasks:value"} > 2 for: 2m labels: severity: warning annotations: summary: Cassandra repair pending tasks (instance {{ $labels.instance }}) description: "Some Cassandra repair tasks are pending\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.10. Cassandra repair blocked tasks
Some Cassandra repair tasks are blocked [copy] - alert: CassandraRepairBlockedTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:currentlyblockedtasks:count"} > 0 for: 2m labels: severity: warning annotations: summary: Cassandra repair blocked tasks (instance {{ $labels.instance }}) description: "Some Cassandra repair tasks are blocked\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.11. Cassandra connection timeouts total
Some connections between nodes are timing out [copy] - alert: CassandraConnectionTimeoutsTotal expr: rate(cassandra_stats{name="org:apache:cassandra:metrics:connection:totaltimeouts:count"}[1m]) > 5 for: 2m labels: severity: critical annotations: summary: Cassandra connection timeouts total (instance {{ $labels.instance }}) description: "Some connections between nodes are timing out\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.12. Cassandra storage exceptions
Something is going wrong with Cassandra storage [copy] - alert: CassandraStorageExceptions expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:exceptions:count"}[1m]) > 1 for: 0m labels: severity: critical annotations: summary: Cassandra storage exceptions (instance {{ $labels.instance }}) description: "Something is going wrong with Cassandra storage\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.13. Cassandra tombstone dump
Too many tombstones scanned in queries [copy] - alert: CassandraTombstoneDump expr: cassandra_stats{name="org:apache:cassandra:metrics:table:tombstonescannedhistogram:99thpercentile"} > 1000 for: 0m labels: severity: critical annotations: summary: Cassandra tombstone dump (instance {{ $labels.instance }}) description: "Too many tombstones scanned in queries\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.14. Cassandra client request unavailable write
Write failures have occurred because too many nodes are unavailable [copy] - alert: CassandraClientRequestUnavailableWrite expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:unavailables:count"}[1m]) > 0 for: 0m labels: severity: critical annotations: summary: Cassandra client request unavailable write (instance {{ $labels.instance }}) description: "Write failures have occurred because too many nodes are unavailable\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.15. Cassandra client request unavailable read
Read failures have occurred because too many nodes are unavailable [copy] - alert: CassandraClientRequestUnavailableRead expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:unavailables:count"}[1m]) > 0 for: 0m labels: severity: critical annotations: summary: Cassandra client request unavailable read (instance {{ $labels.instance }}) description: "Read failures have occurred because too many nodes are unavailable\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.16. Cassandra client request write failure
A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large. [copy] - alert: CassandraClientRequestWriteFailure expr: cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:failures:oneminuterate"} > 0 for: 0m labels: severity: critical annotations: summary: Cassandra client request write failure (instance {{ $labels.instance }}) description: "A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.17. Cassandra client request read failure
A lot of read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large. [copy] - alert: CassandraClientRequestReadFailure expr: cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:failures:oneminuterate"} > 0 for: 0m labels: severity: critical annotations: summary: Cassandra client request read failure (instance {{ $labels.instance }}) description: "A lot of read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.18. Cassandra cache hit rate key cache
Key cache hit rate is below 85% [copy] - alert: CassandraCacheHitRateKeyCache expr: cassandra_stats{name="org:apache:cassandra:metrics:cache:keycache:hitrate:value"} < .85 for: 2m labels: severity: critical annotations: summary: Cassandra cache hit rate key cache (instance {{ $labels.instance }}) description: "Key cache hit rate is below 85%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.14. Clickhouse : Embedded Exporter (19 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/clickhouse/embedded-exporter.yml-
# 2.14.1. ClickHouse node down
No metrics received from ClickHouse exporter for over 2 minutes. [copy] # Adjust the job label to match your Prometheus configuration. - alert: ClickhouseNodeDown expr: up{job="clickhouse"} == 0 for: 2m labels: severity: critical annotations: summary: ClickHouse node down (instance {{ $labels.instance }}) description: "No metrics received from ClickHouse exporter for over 2 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.2. ClickHouse Memory Usage Critical
Memory usage is critically high, over 90%. [copy] - alert: ClickhouseMemoryUsageCritical expr: ClickHouseAsyncMetrics_CGroupMemoryUsed / ClickHouseAsyncMetrics_CGroupMemoryTotal * 100 > 90 and ClickHouseAsyncMetrics_CGroupMemoryTotal > 0 for: 5m labels: severity: critical annotations: summary: ClickHouse Memory Usage Critical (instance {{ $labels.instance }}) description: "Memory usage is critically high, over 90%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
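This rule and the warning rule below share the same ratio with different thresholds; a minimal Python sketch of the classification, including the `and ... CGroupMemoryTotal > 0` guard against division by zero (the function name and shape are illustrative, not part of ClickHouse):

```python
def memory_severity(used_bytes, total_bytes):
    """Mirror the two ClickHouse memory rules: >90% critical, >80% warning.

    Returns None when total is 0, matching the `and ... Total > 0` guard
    that keeps the PromQL expression from dividing by zero.
    """
    if total_bytes <= 0:
        return None
    pct = used_bytes / total_bytes * 100
    if pct > 90:
        return "critical"
    if pct > 80:
        return "warning"
    return None
```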
-
# 2.14.3. ClickHouse Memory Usage Warning
Memory usage is over 80%. [copy] - alert: ClickhouseMemoryUsageWarning expr: ClickHouseAsyncMetrics_CGroupMemoryUsed / ClickHouseAsyncMetrics_CGroupMemoryTotal * 100 > 80 and ClickHouseAsyncMetrics_CGroupMemoryTotal > 0 for: 5m labels: severity: warning annotations: summary: ClickHouse Memory Usage Warning (instance {{ $labels.instance }}) description: "Memory usage is over 80%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.4. ClickHouse Disk Space Low on Default
Disk space on default is below 20%. [copy] - alert: ClickhouseDiskSpaceLowOnDefault expr: ClickHouseAsyncMetrics_DiskAvailable_default / (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) * 100 < 20 for: 2m labels: severity: warning annotations: summary: ClickHouse Disk Space Low on Default (instance {{ $labels.instance }}) description: "Disk space on default is below 20%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.5. ClickHouse Disk Space Critical on Default
Disk space on default disk is critically low, below 10%. [copy] - alert: ClickhouseDiskSpaceCriticalOnDefault expr: ClickHouseAsyncMetrics_DiskAvailable_default / (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) * 100 < 10 for: 2m labels: severity: critical annotations: summary: ClickHouse Disk Space Critical on Default (instance {{ $labels.instance }}) description: "Disk space on default disk is critically low, below 10%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.6. ClickHouse Disk Space Low on Backups
Disk space on backups is below 20%. [copy] - alert: ClickhouseDiskSpaceLowOnBackups expr: ClickHouseAsyncMetrics_DiskAvailable_backups / (ClickHouseAsyncMetrics_DiskAvailable_backups + ClickHouseAsyncMetrics_DiskUsed_backups) * 100 < 20 for: 2m labels: severity: warning annotations: summary: ClickHouse Disk Space Low on Backups (instance {{ $labels.instance }}) description: "Disk space on backups is below 20%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.7. ClickHouse Replica Errors
Critical replica errors detected, either all replicas are stale or lost. [copy] - alert: ClickhouseReplicaErrors expr: ClickHouseErrorMetric_ALL_REPLICAS_ARE_STALE == 1 or ClickHouseErrorMetric_ALL_REPLICAS_LOST == 1 for: 0m labels: severity: critical annotations: summary: ClickHouse Replica Errors (instance {{ $labels.instance }}) description: "Critical replica errors detected, either all replicas are stale or lost.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.8. ClickHouse No Available Replicas
No available replicas in ClickHouse. [copy] - alert: ClickhouseNoAvailableReplicas expr: ClickHouseErrorMetric_NO_AVAILABLE_REPLICA == 1 for: 0m labels: severity: critical annotations: summary: ClickHouse No Available Replicas (instance {{ $labels.instance }}) description: "No available replicas in ClickHouse.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.9. ClickHouse No Live Replicas
There are too few live replicas available, risking data loss and service disruption. [copy] - alert: ClickhouseNoLiveReplicas expr: ClickHouseErrorMetric_TOO_FEW_LIVE_REPLICAS == 1 for: 0m labels: severity: critical annotations: summary: ClickHouse No Live Replicas (instance {{ $labels.instance }}) description: "There are too few live replicas available, risking data loss and service disruption.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.10. ClickHouse High TCP Connections
High number of TCP connections, indicating heavy client or inter-cluster communication. [copy] # Please replace the threshold with an appropriate value - alert: ClickhouseHighTcpConnections expr: ClickHouseMetrics_TCPConnection > 400 for: 5m labels: severity: warning annotations: summary: ClickHouse High TCP Connections (instance {{ $labels.instance }}) description: "High number of TCP connections, indicating heavy client or inter-cluster communication.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.11. ClickHouse Interserver Connection Issues
High number of interserver connections may indicate replication or distributed query handling issues. [copy] # Adjust the threshold based on your cluster size and expected replication traffic. - alert: ClickhouseInterserverConnectionIssues expr: ClickHouseMetrics_InterserverConnection > 50 for: 5m labels: severity: warning annotations: summary: ClickHouse Interserver Connection Issues (instance {{ $labels.instance }}) description: "High number of interserver connections may indicate replication or distributed query handling issues.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.12. ClickHouse ZooKeeper Connection Issues
ClickHouse is experiencing issues with ZooKeeper connections, which may affect cluster state and coordination. [copy] - alert: ClickhouseZookeeperConnectionIssues expr: ClickHouseMetrics_ZooKeeperSession != 1 for: 3m labels: severity: warning annotations: summary: ClickHouse ZooKeeper Connection Issues (instance {{ $labels.instance }}) description: "ClickHouse is experiencing issues with ZooKeeper connections, which may affect cluster state and coordination.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.13. ClickHouse Authentication Failures
Authentication failures detected, indicating potential security issues or misconfiguration. [copy] - alert: ClickhouseAuthenticationFailures expr: increase(ClickHouseErrorMetric_AUTHENTICATION_FAILED[5m]) > 0 for: 0m labels: severity: info annotations: summary: ClickHouse Authentication Failures (instance {{ $labels.instance }}) description: "Authentication failures detected, indicating potential security issues or misconfiguration.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
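The `increase(...[5m])` calls used in this rule and the next compute a counter delta that tolerates counter resets. A rough sketch of that behaviour over ordered samples (Prometheus additionally extrapolates to the window edges, which this sketch omits):

```python
def counter_increase(samples):
    """Approximate increase() over ordered counter samples.

    A drop between consecutive samples is treated as a counter reset,
    so the running total only accumulates non-negative increments.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # On a reset, the counter restarted from 0, so `cur` is the increment.
        total += cur - prev if cur >= prev else cur
    return total
```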
-
# 2.14.14. ClickHouse Access Denied Errors
Access denied errors have been logged, which could indicate permission issues or unauthorized access attempts. [copy] - alert: ClickhouseAccessDeniedErrors expr: increase(ClickHouseErrorMetric_RESOURCE_ACCESS_DENIED[5m]) > 0 for: 0m labels: severity: info annotations: summary: ClickHouse Access Denied Errors (instance {{ $labels.instance }}) description: "Access denied errors have been logged, which could indicate permission issues or unauthorized access attempts.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.15. ClickHouse rejected insert queries
INSERTs rejected due to too many active data parts. Reduce insert frequency. [copy] - alert: ClickhouseRejectedInsertQueries expr: increase(ClickHouseProfileEvents_RejectedInserts[1m]) > 0 for: 1m labels: severity: warning annotations: summary: ClickHouse rejected insert queries (instance {{ $labels.instance }}) description: "INSERTs rejected due to too many active data parts. Reduce insert frequency.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.16. ClickHouse delayed insert queries
INSERTs delayed due to high number of active parts. [copy] - alert: ClickhouseDelayedInsertQueries expr: increase(ClickHouseProfileEvents_DelayedInserts[5m]) > 0 for: 2m labels: severity: warning annotations: summary: ClickHouse delayed insert queries (instance {{ $labels.instance }}) description: "INSERTs delayed due to high number of active parts.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.17. ClickHouse zookeeper hardware exception
Zookeeper hardware exception: network issues communicating with ZooKeeper [copy] - alert: ClickhouseZookeeperHardwareException expr: increase(ClickHouseProfileEvents_ZooKeeperHardwareExceptions[1m]) > 0 for: 1m labels: severity: critical annotations: summary: ClickHouse zookeeper hardware exception (instance {{ $labels.instance }}) description: "Zookeeper hardware exception: network issues communicating with ZooKeeper\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.18. ClickHouse high network usage
High network usage. ClickHouse network usage exceeds 100MB/s. [copy] # Please replace the threshold with an appropriate value - alert: ClickhouseHighNetworkUsage expr: rate(ClickHouseProfileEvents_NetworkSendBytes[1m]) > 100*1024*1024 or rate(ClickHouseProfileEvents_NetworkReceiveBytes[1m]) > 100*1024*1024 for: 2m labels: severity: warning annotations: summary: ClickHouse high network usage (instance {{ $labels.instance }}) description: "High network usage. ClickHouse network usage exceeds 100MB/s.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.19. ClickHouse distributed rejected inserts
INSERTs into Distributed tables rejected due to pending bytes limit. [copy] - alert: ClickhouseDistributedRejectedInserts expr: increase(ClickHouseProfileEvents_DistributedRejectedInserts[5m]) > 0 for: 2m labels: severity: critical annotations: summary: ClickHouse distributed rejected inserts (instance {{ $labels.instance }}) description: "INSERTs into Distributed tables rejected due to pending bytes limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.15. CouchDB : gesellix/couchdb-prometheus-exporter (18 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/couchdb/gesellix-couchdb-prometheus-exporter.yml-
# 2.15.1. CouchDB node down
CouchDB node is not responding (node_up metric is 0) for more than 2 minutes [copy] - alert: CouchdbNodeDown expr: couchdb_httpd_node_up == 0 or couchdb_httpd_up == 0 for: 2m labels: severity: critical annotations: summary: CouchDB node down (instance {{ $labels.instance }}) description: "CouchDB node is not responding (node_up metric is 0) for more than 2 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.2. CouchDB atom memory usage critical
Atom memory usage is above 90% of limit [copy] - alert: CouchdbAtomMemoryUsageCritical expr: couchdb_erlang_memory_atom_used > 0.9 * couchdb_erlang_memory_atom for: 5m labels: severity: critical annotations: summary: CouchDB atom memory usage critical (instance {{ $labels.instance }}) description: "Atom memory usage is above 90% of limit\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.3. CouchDB open databases critical
Number of open databases exceeds 90% of node capacity [copy] - alert: CouchdbOpenDatabasesCritical expr: couchdb_httpd_open_databases > 0.9 * 1000 for: 5m labels: severity: critical annotations: summary: CouchDB open databases critical (instance {{ $labels.instance }}) description: "Number of open databases exceeds 90% of node capacity\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.4. CouchDB open OS files critical
CouchDB is using more than 90% of allowed OS file descriptors, may fail to open new files [copy] - alert: CouchdbOpenOsFilesCritical expr: couchdb_httpd_open_os_files > 0.9 * 65535 for: 5m labels: severity: critical annotations: summary: CouchDB open OS files critical (instance {{ $labels.instance }}) description: "CouchDB is using more than 90% of allowed OS file descriptors, may fail to open new files\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.5. CouchDB 5xx error ratio high
More than 5% of HTTP requests are returning 5xx errors [copy] - alert: Couchdb5xxErrorRatioHigh expr: rate(couchdb_httpd_status_codes{code=~"5.."}[5m]) / rate(couchdb_httpd_requests[5m]) > 0.05 and rate(couchdb_httpd_requests[5m]) > 0 for: 5m labels: severity: critical annotations: summary: CouchDB 5xx error ratio high (instance {{ $labels.instance }}) description: "More than 5% of HTTP requests are returning 5xx errors\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
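The `and rate(couchdb_httpd_requests[5m]) > 0` clause keeps the ratio from firing (or dividing by zero) when there is no traffic at all. A small Python sketch of the same guard, with hypothetical function names:

```python
def error_ratio(err_rate, req_rate):
    """5xx ratio with the same guard as the rule: no traffic, no ratio."""
    if req_rate <= 0:
        return None
    return err_rate / req_rate


def should_alert(err_rate, req_rate, threshold=0.05):
    """Fire only when traffic exists and the error ratio exceeds 5%."""
    r = error_ratio(err_rate, req_rate)
    return r is not None and r > threshold
```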
-
# 2.15.6. CouchDB temporary view read rate critical
Temporary view read rate exceeds 100 reads/sec, high risk of performance degradation [copy] - alert: CouchdbTemporaryViewReadRateCritical expr: rate(couchdb_httpd_temporary_view_reads[5m]) > 100 for: 5m labels: severity: critical annotations: summary: CouchDB temporary view read rate critical (instance {{ $labels.instance }}) description: "Temporary view read rate exceeds 100 reads/sec, high risk of performance degradation\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.7. CouchDB Mango queries scanning too many docs
Some Mango queries are scanning too many documents, consider adding indexes [copy] - alert: CouchdbMangoQueriesScanningTooManyDocs expr: rate(couchdb_mango_too_many_docs_scanned[5m]) > 50 for: 5m labels: severity: warning annotations: summary: CouchDB Mango queries scanning too many docs (instance {{ $labels.instance }}) description: "Some Mango queries are scanning too many documents, consider adding indexes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.8. CouchDB Mango queries failed due to invalid index
Some Mango queries failed to execute because the index was missing or invalid [copy] - alert: CouchdbMangoQueriesFailedDueToInvalidIndex expr: rate(couchdb_mango_query_invalid_index[5m]) > 5 for: 5m labels: severity: warning annotations: summary: CouchDB Mango queries failed due to invalid index (instance {{ $labels.instance }}) description: "Some Mango queries failed to execute because the index was missing or invalid\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.9. CouchDB Mango docs examined high
High number of documents examined per Mango query, consider indexing [copy] - alert: CouchdbMangoDocsExaminedHigh expr: rate(couchdb_mango_docs_examined[5m]) > 1000 for: 5m labels: severity: warning annotations: summary: CouchDB Mango docs examined high (instance {{ $labels.instance }}) description: "High number of documents examined per Mango query, consider indexing\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.10. CouchDB Replicator manager died
Replication manager process has crashed [copy] - alert: CouchdbReplicatorManagerDied expr: increase(couchdb_replicator_changes_manager_deaths[5m]) > 0 for: 1m labels: severity: critical annotations: summary: CouchDB Replicator manager died (instance {{ $labels.instance }}) description: "Replication manager process has crashed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.11. CouchDB Replicator queue process died
Replication queue process has crashed [copy] - alert: CouchdbReplicatorQueueProcessDied expr: increase(couchdb_replicator_changes_queue_deaths[5m]) > 0 for: 1m labels: severity: critical annotations: summary: CouchDB Replicator queue process died (instance {{ $labels.instance }}) description: "Replication queue process has crashed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.12. CouchDB Replicator reader process died
Replication reader process has crashed [copy] - alert: CouchdbReplicatorReaderProcessDied expr: increase(couchdb_replicator_changes_reader_deaths[5m]) > 0 for: 1m labels: severity: critical annotations: summary: CouchDB Replicator reader process died (instance {{ $labels.instance }}) description: "Replication reader process has crashed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.13. CouchDB Replicator failed to start
One or more replication tasks failed to start [copy] - alert: CouchdbReplicatorFailedToStart expr: increase(couchdb_replicator_failed_starts[5m]) > 0 for: 1m labels: severity: critical annotations: summary: CouchDB Replicator failed to start (instance {{ $labels.instance }}) description: "One or more replication tasks failed to start\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.14. CouchDB replication cluster unstable
The replication cluster is unstable, replication may be interrupted [copy] - alert: CouchdbReplicationClusterUnstable expr: couchdb_replicator_cluster_is_stable == 0 for: 2m labels: severity: critical annotations: summary: CouchDB replication cluster unstable (instance {{ $labels.instance }}) description: "The replication cluster is unstable, replication may be interrupted\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.15. CouchDB replication read failures
Replication changes feed has failed reads more than 5 times in 5 minutes [copy] - alert: CouchdbReplicationReadFailures expr: increase(couchdb_replicator_changes_read_failures[5m]) > 5 for: 5m labels: severity: warning annotations: summary: CouchDB replication read failures (instance {{ $labels.instance }}) description: "Replication changes feed has failed reads more than 5 times in 5 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.16. CouchDB file descriptors high
Process is using more than 85% of allowed file descriptors [copy] - alert: CouchdbFileDescriptorsHigh expr: process_open_fds / process_max_fds > 0.85 for: 5m labels: severity: warning annotations: summary: CouchDB file descriptors high (instance {{ $labels.instance }}) description: "Process is using more than 85% of allowed file descriptors\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.17. CouchDB process restarted
CouchDB process has restarted recently [copy] - alert: CouchdbProcessRestarted expr: changes(process_start_time_seconds[1h]) > 0 for: 1m labels: severity: info annotations: summary: CouchDB process restarted (instance {{ $labels.instance }}) description: "CouchDB process has restarted recently\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.18. CouchDB critical log entries
Critical or error log entries detected in the last 5 minutes [copy] - alert: CouchdbCriticalLogEntries expr: increase(couchdb_server_couch_log{level=~"error|critical"}[5m]) > 0 for: 1m labels: severity: critical annotations: summary: CouchDB critical log entries (instance {{ $labels.instance }}) description: "Critical or error log entries detected in the last 5 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.16.1. Zookeeper : cloudflare/kafka_zookeeper_exporter
// @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
-
# 2.16.2. Zookeeper : dabealu/zookeeper-exporter (4 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/zookeeper/dabealu-zookeeper-exporter.yml-
# 2.16.2.1. Zookeeper Down
Zookeeper down on instance {{ $labels.instance }} [copy] # 1m delay allows a restart without triggering an alert. - alert: ZookeeperDown expr: zk_up == 0 for: 1m labels: severity: critical annotations: summary: Zookeeper Down (instance {{ $labels.instance }}) description: "Zookeeper down on instance {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.16.2.2. Zookeeper missing leader
Zookeeper cluster has no node marked as leader [copy] - alert: ZookeeperMissingLeader expr: sum(zk_server_leader) == 0 for: 0m labels: severity: critical annotations: summary: Zookeeper missing leader (instance {{ $labels.instance }}) description: "Zookeeper cluster has no node marked as leader\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.16.2.3. Zookeeper Too Many Leaders
Zookeeper cluster has too many nodes marked as leader [copy] - alert: ZookeeperTooManyLeaders expr: sum(zk_server_leader) > 1 for: 0m labels: severity: critical annotations: summary: Zookeeper Too Many Leaders (instance {{ $labels.instance }}) description: "Zookeeper cluster has too many nodes marked as leader\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
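Taken together, this rule and the previous one assert that `sum(zk_server_leader)` is exactly 1. A sketch of that check, under the assumption that `zk_server_leader` is 1 on the leader and 0 elsewhere:

```python
def leader_status(leader_flags):
    """Classify a ZooKeeper ensemble by its number of leaders."""
    n = sum(leader_flags)
    if n == 0:
        return "missing leader"  # ZookeeperMissingLeader
    if n > 1:
        return "too many leaders"  # ZookeeperTooManyLeaders
    return "ok"
```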
-
# 2.16.2.4. Zookeeper Not Ok
Zookeeper instance is not ok [copy] - alert: ZookeeperNotOk expr: zk_ruok == 0 for: 3m labels: severity: warning annotations: summary: Zookeeper Not Ok (instance {{ $labels.instance }}) description: "Zookeeper instance is not ok\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.17.1. Kafka : danielqsj/kafka_exporter (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/kafka/danielqsj-kafka-exporter.yml-
# 2.17.1.1. Kafka topics replicas
Kafka topic has fewer than 3 in-sync replicas [copy] - alert: KafkaTopicsReplicas expr: min(kafka_topic_partition_in_sync_replica) by (topic) < 3 for: 0m labels: severity: critical annotations: summary: Kafka topics replicas (instance {{ $labels.instance }}) description: "Kafka topic has fewer than 3 in-sync replicas\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
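`min(...) by (topic)` takes the worst partition of each topic, so one under-replicated partition is enough to trigger. A sketch of that aggregation over a hypothetical per-topic map of partition ISR counts:

```python
def under_replicated_topics(isr_by_topic, min_isr=3):
    """Topics whose worst-partition in-sync replica count is below min_isr.

    `isr_by_topic` maps topic name -> list of per-partition ISR counts,
    mirroring min(kafka_topic_partition_in_sync_replica) by (topic) < 3.
    """
    return [topic for topic, isr in isr_by_topic.items() if min(isr) < min_isr]
```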
-
# 2.17.1.2. Kafka consumer group lag
Kafka consumer group {{ $labels.consumergroup }} is lagging behind ({{ $value }} messages) [copy] - alert: KafkaConsumerGroupLag expr: sum(kafka_consumergroup_lag) by (consumergroup) > 10000 for: 1m labels: severity: warning annotations: summary: Kafka consumer group lag (instance {{ $labels.instance }}) description: "Kafka consumer group {{ $labels.consumergroup }} is lagging behind ({{ $value }} messages)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.17.2. Kafka : linkedin/Burrow (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/kafka/linkedin-kafka-exporter.yml-
# 2.17.2.1. Kafka topic offset decreased
Kafka topic offset has decreased [copy] - alert: KafkaTopicOffsetDecreased expr: delta(kafka_burrow_partition_current_offset[1m]) < 0 for: 0m labels: severity: warning annotations: summary: Kafka topic offset decreased (instance {{ $labels.instance }}) description: "Kafka topic offset has decreased\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.17.2.2. Kafka consumer lag
Kafka consumer has a 30-minute and increasing lag [copy] - alert: KafkaConsumerLag expr: kafka_burrow_topic_partition_offset - on(partition, cluster, topic) group_right() kafka_burrow_partition_current_offset >= (kafka_burrow_topic_partition_offset offset 15m - on(partition, cluster, topic) group_right() kafka_burrow_partition_current_offset offset 15m) AND kafka_burrow_topic_partition_offset - on(partition, cluster, topic) group_right() kafka_burrow_partition_current_offset > 0 for: 15m labels: severity: warning annotations: summary: Kafka consumer lag (instance {{ $labels.instance }}) description: "Kafka consumer has a 30-minute and increasing lag\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
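The expression compares the current lag (log-end offset minus committed offset) with its value 15 minutes earlier, and the `for: 15m` clause adds another 15 minutes of sustained violation, hence the 30-minute wording. A hypothetical sketch of the per-partition comparison:

```python
def lag_is_growing(end_now, committed_now, end_then, committed_then):
    """Fire when lag is positive and has not shrunk since the earlier sample."""
    lag_now = end_now - committed_now
    lag_then = end_then - committed_then
    return lag_now >= lag_then and lag_now > 0
```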
-
-
# 2.18. Pulsar : embedded exporter (10 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/pulsar/embedded-exporter.yml-
# 2.18.1. Pulsar subscription high number of backlog entries
The number of subscription backlog entries is over 5k [copy] - alert: PulsarSubscriptionHighNumberOfBacklogEntries expr: sum(pulsar_subscription_back_log) by (subscription) > 5000 for: 1h labels: severity: warning annotations: summary: Pulsar subscription high number of backlog entries (instance {{ $labels.instance }}) description: "The number of subscription backlog entries is over 5k\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.18.2. Pulsar subscription very high number of backlog entries
The number of subscription backlog entries is over 100k [copy] - alert: PulsarSubscriptionVeryHighNumberOfBacklogEntries expr: sum(pulsar_subscription_back_log) by (subscription) > 100000 for: 1h labels: severity: critical annotations: summary: Pulsar subscription very high number of backlog entries (instance {{ $labels.instance }}) description: "The number of subscription backlog entries is over 100k\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.18.3. Pulsar topic large backlog storage size
The topic backlog storage size is over 5 GB [copy] - alert: PulsarTopicLargeBacklogStorageSize expr: sum(pulsar_storage_size) by (topic) > 5*1024*1024*1024 for: 1h labels: severity: warning annotations: summary: Pulsar topic large backlog storage size (instance {{ $labels.instance }}) description: "The topic backlog storage size is over 5 GB\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.18.4. Pulsar topic very large backlog storage size
The topic backlog storage size is over 20 GB [copy] - alert: PulsarTopicVeryLargeBacklogStorageSize expr: sum(pulsar_storage_size) by (topic) > 20*1024*1024*1024 for: 1h labels: severity: critical annotations: summary: Pulsar topic very large backlog storage size (instance {{ $labels.instance }}) description: "The topic backlog storage size is over 20 GB\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
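The two storage-size thresholds in this rule and the previous one are plain byte counts written as `n*1024*1024*1024`. A short sketch of the same classification (names are illustrative):

```python
WARNING_BYTES = 5 * 1024 ** 3    # 5 GiB, PulsarTopicLargeBacklogStorageSize
CRITICAL_BYTES = 20 * 1024 ** 3  # 20 GiB, PulsarTopicVeryLargeBacklogStorageSize


def backlog_severity(storage_bytes):
    """Map a topic's backlog storage size to the matching rule severity."""
    if storage_bytes > CRITICAL_BYTES:
        return "critical"
    if storage_bytes > WARNING_BYTES:
        return "warning"
    return None
```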
-
# 2.18.5. Pulsar high write latency
Messages cannot be written in a timely fashion [copy] - alert: PulsarHighWriteLatency expr: sum(pulsar_storage_write_latency_overflow > 0) by (topic) for: 1h labels: severity: critical annotations: summary: Pulsar high write latency (instance {{ $labels.instance }}) description: "Messages cannot be written in a timely fashion\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.18.6. Pulsar large message payload
Observing large message payload (> 1MB) [copy] - alert: PulsarLargeMessagePayload expr: sum(pulsar_entry_size_overflow > 0) by (topic) for: 1h labels: severity: warning annotations: summary: Pulsar large message payload (instance {{ $labels.instance }}) description: "Observing large message payload (> 1MB)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.18.7. Pulsar high ledger disk usage
Observing Ledger Disk Usage (> 75%) [copy] - alert: PulsarHighLedgerDiskUsage expr: sum(bookie_ledger_dir__pulsar_data_bookkeeper_ledgers_usage) by (kubernetes_pod_name) > 75 for: 1h labels: severity: critical annotations: summary: Pulsar high ledger disk usage (instance {{ $labels.instance }}) description: "Observing Ledger Disk Usage (> 75%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.18.8. Pulsar read only bookies
Observing Readonly Bookies [copy] - alert: PulsarReadOnlyBookies expr: count(bookie_SERVER_STATUS{} == 0) by (pod) for: 5m labels: severity: critical annotations: summary: Pulsar read only bookies (instance {{ $labels.instance }}) description: "Observing Readonly Bookies\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.18.9. Pulsar high number of function errors
Observing more than 10 Function errors per minute [copy] - alert: PulsarHighNumberOfFunctionErrors expr: sum(rate(pulsar_function_user_exceptions_total[1m]) + rate(pulsar_function_system_exceptions_total[1m])) by (name) > 10 for: 1m labels: severity: critical annotations: summary: Pulsar high number of function errors (instance {{ $labels.instance }}) description: "Observing more than 10 Function errors per minute\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.18.10. Pulsar high number of sink errors
Observing more than 10 Sink errors per minute [copy] - alert: PulsarHighNumberOfSinkErrors expr: sum(rate(pulsar_sink_sink_exceptions_total[1m])) by (name) > 10 for: 1m labels: severity: critical annotations: summary: Pulsar high number of sink errors (instance {{ $labels.instance }}) description: "Observing more than 10 Sink errors per minute\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.19. Nats : nats-io/prometheus-nats-exporter (13 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/nats/nats-exporter.yml-
# 2.19.1. Nats high routes count
High number of NATS routes ({{ $value }}) for {{ $labels.instance }} [copy] - alert: NatsHighRoutesCount expr: gnatsd_varz_routes > 10 for: 3m labels: severity: warning annotations: summary: Nats high routes count (instance {{ $labels.instance }}) description: "High number of NATS routes ({{ $value }}) for {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.2. Nats high memory usage
NATS server memory usage is above 200MB for {{ $labels.instance }} [copy] - alert: NatsHighMemoryUsage expr: gnatsd_varz_mem > 200 * 1024 * 1024 for: 5m labels: severity: warning annotations: summary: Nats high memory usage (instance {{ $labels.instance }}) description: "NATS server memory usage is above 200MB for {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.3. Nats slow consumers
There are slow consumers in NATS for {{ $labels.instance }} [copy] - alert: NatsSlowConsumers expr: gnatsd_varz_slow_consumers > 0 for: 3m labels: severity: critical annotations: summary: Nats slow consumers (instance {{ $labels.instance }}) description: "There are slow consumers in NATS for {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.4. Nats server down
NATS server has been down for more than 5 minutes [copy] - alert: NatsServerDown expr: absent(up{job="nats"}) for: 5m labels: severity: critical annotations: summary: Nats server down (instance {{ $labels.instance }}) description: "NATS server has been down for more than 5 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.5. Nats high CPU usage
NATS server is using more than 80% CPU for the last 5 minutes [copy] # gnatsd_varz_cpu is a gauge reporting CPU percentage (0-100 scale). - alert: NatsHighCpuUsage expr: gnatsd_varz_cpu > 80 for: 5m labels: severity: warning annotations: summary: Nats high CPU usage (instance {{ $labels.instance }}) description: "NATS server is using more than 80% CPU for the last 5 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.6. Nats high number of connections
NATS server has more than 1000 active connections [copy] - alert: NatsHighNumberOfConnections expr: gnatsd_connz_num_connections > 1000 for: 5m labels: severity: warning annotations: summary: Nats high number of connections (instance {{ $labels.instance }}) description: "NATS server has more than 1000 active connections\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.7. Nats high JetStream store usage
JetStream store usage is over 80% [copy] - alert: NatsHighJetstreamStoreUsage expr: gnatsd_varz_jetstream_stats_storage / gnatsd_varz_jetstream_config_max_storage > 0.8 and gnatsd_varz_jetstream_config_max_storage > 0 for: 5m labels: severity: warning annotations: summary: Nats high JetStream store usage (instance {{ $labels.instance }}) description: "JetStream store usage is over 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
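The JetStream store/memory rules guard the division with `and ..._config_max_storage > 0`: in PromQL, `and` keeps left-hand samples only where a matching right-hand sample exists, so instances with no configured limit produce no sample at all instead of a bogus ratio. A sketch of that evaluation logic (instance names and values are illustrative):

```python
# Mimic the guarded ratio in NatsHighJetstreamStoreUsage:
#   usage / max > 0.8 and max > 0
# The guard drops instances whose configured limit is zero, so they
# never divide by zero and never alert.
def store_usage_alerts(usage_by_instance, max_by_instance, threshold=0.8):
    alerts = {}
    for instance, used in usage_by_instance.items():
        limit = max_by_instance.get(instance, 0)
        if limit > 0 and used / limit > threshold:  # mirrors "and max > 0"
            alerts[instance] = used / limit
    return alerts

# nats-1 is at 90% of its limit; nats-2 has no limit configured.
alerts = store_usage_alerts({"nats-1": 90, "nats-2": 50},
                            {"nats-1": 100, "nats-2": 0})
```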
-
# 2.19.8. Nats high JetStream memory usage
JetStream memory usage is over 80% [copy] - alert: NatsHighJetstreamMemoryUsage expr: gnatsd_varz_jetstream_stats_memory / gnatsd_varz_jetstream_config_max_memory > 0.8 and gnatsd_varz_jetstream_config_max_memory > 0 for: 5m labels: severity: warning annotations: summary: Nats high JetStream memory usage (instance {{ $labels.instance }}) description: "JetStream memory usage is over 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.9. Nats high number of subscriptions
NATS server has more than 1000 active subscriptions [copy] - alert: NatsHighNumberOfSubscriptions expr: gnatsd_connz_subscriptions > 1000 for: 5m labels: severity: warning annotations: summary: Nats high number of subscriptions (instance {{ $labels.instance }}) description: "NATS server has more than 1000 active subscriptions\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.10. Nats high pending bytes
NATS server has more than 100,000 pending bytes [copy] - alert: NatsHighPendingBytes expr: gnatsd_connz_pending_bytes > 100000 for: 5m labels: severity: warning annotations: summary: Nats high pending bytes (instance {{ $labels.instance }}) description: "NATS server has more than 100,000 pending bytes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.11. Nats too many errors
NATS server has encountered JetStream API errors in the last 5 minutes [copy] - alert: NatsTooManyErrors expr: increase(gnatsd_varz_jetstream_stats_api_errors[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Nats too many errors (instance {{ $labels.instance }}) description: "NATS server has encountered JetStream API errors in the last 5 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.12. Nats JetStream accounts exceeded
JetStream has more than 100 active accounts [copy] - alert: NatsJetstreamAccountsExceeded expr: sum(gnatsd_varz_jetstream_stats_accounts) > 100 for: 5m labels: severity: warning annotations: summary: Nats JetStream accounts exceeded (instance {{ $labels.instance }}) description: "JetStream has more than 100 active accounts\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.13. Nats leaf node connection issue
No leaf node connections on {{ $labels.instance }} [copy] - alert: NatsLeafNodeConnectionIssue expr: gnatsd_varz_leafnodes == 0 for: 5m labels: severity: warning annotations: summary: Nats leaf node connection issue (instance {{ $labels.instance }}) description: "No leaf node connections on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
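Rules like the ones in this section can be unit-tested offline with `promtool test rules` before deploying them. A minimal sketch for the NatsSlowConsumers rule above — file name, instance label, and series values are illustrative:

```yaml
# nats-rules-test.yml -- run with: promtool test rules nats-rules-test.yml
rule_files:
  - nats-exporter.yml        # the rule file downloaded above
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'gnatsd_varz_slow_consumers{instance="nats-0:7777"}'
        values: '0 1 1 1 1'  # a slow consumer appears at t=1m and persists
    alert_rule_test:
      - eval_time: 5m        # past the 3m "for" duration
        alertname: NatsSlowConsumers
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: nats-0:7777
```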
-
-
# 2.20. Solr : embedded exporter (4 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/solr/embedded-exporter.yml
-
# 2.20.1. Solr update errors
Solr collection {{ $labels.collection }} has failed updates for replica {{ $labels.replica }} on {{ $labels.base_url }}. [copy] - alert: SolrUpdateErrors expr: increase(solr_metrics_core_update_handler_errors_total[1m]) > 1 for: 0m labels: severity: critical annotations: summary: Solr update errors (instance {{ $labels.instance }}) description: "Solr collection {{ $labels.collection }} has failed updates for replica {{ $labels.replica }} on {{ $labels.base_url }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.20.2. Solr query errors
Solr has increased query errors in collection {{ $labels.collection }} for replica {{ $labels.replica }} on {{ $labels.base_url }}. [copy] - alert: SolrQueryErrors expr: increase(solr_metrics_core_errors_total{category="QUERY"}[1m]) > 1 for: 5m labels: severity: warning annotations: summary: Solr query errors (instance {{ $labels.instance }}) description: "Solr has increased query errors in collection {{ $labels.collection }} for replica {{ $labels.replica }} on {{ $labels.base_url }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.20.3. Solr replication errors
Solr collection {{ $labels.collection }} has replication errors for replica {{ $labels.replica }} on {{ $labels.base_url }}. [copy] - alert: SolrReplicationErrors expr: increase(solr_metrics_core_errors_total{category="REPLICATION"}[1m]) > 1 for: 0m labels: severity: critical annotations: summary: Solr replication errors (instance {{ $labels.instance }}) description: "Solr collection {{ $labels.collection }} has replication errors for replica {{ $labels.replica }} on {{ $labels.base_url }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.20.4. Solr low live node count
Solr collection {{ $labels.collection }} has less than two live nodes for replica {{ $labels.replica }} on {{ $labels.base_url }}. [copy] - alert: SolrLowLiveNodeCount expr: solr_collections_live_nodes < 2 for: 0m labels: severity: critical annotations: summary: Solr low live node count (instance {{ $labels.instance }}) description: "Solr collection {{ $labels.collection }} has less than two live nodes for replica {{ $labels.replica }} on {{ $labels.base_url }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.21. Hadoop : hadoop/jmx_exporter (10 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/hadoop/jmx_exporter.yml
-
# 2.21.1. Hadoop Name Node Down
The Hadoop NameNode service is unavailable. [copy] - alert: HadoopNameNodeDown expr: up{job="hadoop-namenode"} == 0 for: 5m labels: severity: critical annotations: summary: Hadoop Name Node Down (instance {{ $labels.instance }}) description: "The Hadoop NameNode service is unavailable.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.21.2. Hadoop Resource Manager Down
The Hadoop ResourceManager service is unavailable. [copy] - alert: HadoopResourceManagerDown expr: up{job="hadoop-resourcemanager"} == 0 for: 5m labels: severity: critical annotations: summary: Hadoop Resource Manager Down (instance {{ $labels.instance }}) description: "The Hadoop ResourceManager service is unavailable.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.21.3. Hadoop Data Node Out Of Service
The Hadoop DataNode is not sending heartbeats. [copy] - alert: HadoopDataNodeOutOfService expr: hadoop_datanode_last_heartbeat == 0 for: 10m labels: severity: warning annotations: summary: Hadoop Data Node Out Of Service (instance {{ $labels.instance }}) description: "The Hadoop DataNode is not sending heartbeats.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.21.4. Hadoop HDFS Disk Space Low
Available HDFS disk space is running low. [copy] - alert: HadoopHdfsDiskSpaceLow expr: (hadoop_hdfs_bytes_total - hadoop_hdfs_bytes_used) / hadoop_hdfs_bytes_total < 0.1 for: 15m labels: severity: warning annotations: summary: Hadoop HDFS Disk Space Low (instance {{ $labels.instance }}) description: "Available HDFS disk space is running low.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.21.5. Hadoop Map Reduce Task Failures
There is an unusually high number of MapReduce task failures. [copy] - alert: HadoopMapReduceTaskFailures expr: increase(hadoop_mapreduce_task_failures_total[1h]) > 100 for: 10m labels: severity: critical annotations: summary: Hadoop Map Reduce Task Failures (instance {{ $labels.instance }}) description: "There is an unusually high number of MapReduce task failures.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
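The rule above uses `increase(...[1h]) > 100`. Unlike a naive `last - first`, PromQL's `increase()` is reset-aware: a counter that drops (e.g. after a process restart) is treated as having restarted from zero, not as negative growth. A sketch of that behaviour, ignoring Prometheus's window-edge extrapolation:

```python
# Rough model of PromQL increase(counter[1h]): sum of positive deltas,
# treating any drop as a counter reset (process restarted from 0).
def simple_increase(values):
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        # on a reset, the counter's growth since restart is just `cur`
        total += cur - prev if cur >= prev else cur
    return total

# Failure counter restarts mid-window: 40 -> 90, reset, 0 -> 70.
# The true increase is (90 - 40) + 70 = 120, enough to trip "> 100",
# whereas last - first would report a misleading 70 - 40 = 30.
window_increase = simple_increase([40, 60, 90, 10, 70])
```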
-
# 2.21.6. Hadoop Resource Manager Memory High
The Hadoop ResourceManager is approaching its memory limit. [copy] - alert: HadoopResourceManagerMemoryHigh expr: hadoop_resourcemanager_memory_bytes / hadoop_resourcemanager_memory_max_bytes > 0.8 for: 15m labels: severity: warning annotations: summary: Hadoop Resource Manager Memory High (instance {{ $labels.instance }}) description: "The Hadoop ResourceManager is approaching its memory limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.21.7. Hadoop YARN Container Allocation Failures
There is a significant number of YARN container allocation failures. [copy] - alert: HadoopYarnContainerAllocationFailures expr: increase(hadoop_yarn_container_allocation_failures_total[1h]) > 10 for: 10m labels: severity: warning annotations: summary: Hadoop YARN Container Allocation Failures (instance {{ $labels.instance }}) description: "There is a significant number of YARN container allocation failures.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.21.8. Hadoop HBase Region Count High
The HBase cluster has an unusually high number of regions. [copy] - alert: HadoopHbaseRegionCountHigh expr: hadoop_hbase_region_count > 5000 for: 15m labels: severity: warning annotations: summary: Hadoop HBase Region Count High (instance {{ $labels.instance }}) description: "The HBase cluster has an unusually high number of regions.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.21.9. Hadoop HBase Region Server Heap Low
HBase Region Servers are running low on heap space. [copy] - alert: HadoopHbaseRegionServerHeapLow expr: hadoop_hbase_region_server_heap_bytes / hadoop_hbase_region_server_max_heap_bytes > 0.8 for: 10m labels: severity: warning annotations: summary: Hadoop HBase Region Server Heap Low (instance {{ $labels.instance }}) description: "HBase Region Servers are running low on heap space.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.21.10. Hadoop HBase Write Requests Latency High
HBase Write Requests are experiencing high latency. [copy] - alert: HadoopHbaseWriteRequestsLatencyHigh expr: hadoop_hbase_write_requests_latency_seconds > 0.5 for: 10m labels: severity: warning annotations: summary: Hadoop HBase Write Requests Latency High (instance {{ $labels.instance }}) description: "HBase Write Requests are experiencing high latency.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 3.1. Nginx : knyar/nginx-lua-prometheus (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/nginx/knyar-nginx-exporter.yml
-
# 3.1.1. Nginx high HTTP 4xx error rate
Too many HTTP requests with status 4xx (> 5%) [copy] - alert: NginxHighHttp4xxErrorRate expr: sum(rate(nginx_http_requests_total{status=~"^4.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 for: 1m labels: severity: critical annotations: summary: Nginx high HTTP 4xx error rate (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 4xx (> 5%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.1.2. Nginx high HTTP 5xx error rate
Too many HTTP requests with status 5xx (> 5%) [copy] - alert: NginxHighHttp5xxErrorRate expr: sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 for: 1m labels: severity: critical annotations: summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 5xx (> 5%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.1.3. Nginx latency high
Nginx p99 latency is higher than 3 seconds [copy] - alert: NginxLatencyHigh expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[2m])) by (host, node, le)) > 3 for: 2m labels: severity: warning annotations: summary: Nginx latency high (instance {{ $labels.instance }}) description: "Nginx p99 latency is higher than 3 seconds\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
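The latency rule above estimates the p99 from histogram buckets. `histogram_quantile()` finds the bucket where the target rank falls and interpolates linearly inside it, so the result is an estimate whose precision depends on bucket boundaries. An illustrative re-implementation (bucket layout is an example, not the exporter's actual buckets):

```python
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound_le, cumulative_count) pairs, the last
    being (inf, total). Linear interpolation inside the target bucket,
    like PromQL's histogram_quantile()."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                # rank falls in the +Inf bucket: return the last finite bound
                return prev_le
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

# 100 requests: 10 under 0.1s, 60 under 0.5s, all under 1s.
buckets = [(0.1, 10), (0.5, 60), (1.0, 100), (float("inf"), 100)]
p50 = histogram_quantile(0.50, buckets)  # 0.42s
p99 = histogram_quantile(0.99, buckets)  # 0.9875s
```

Because of this interpolation, an alert on `histogram_quantile(0.99, ...) > 3` needs a bucket boundary near 3s to be meaningful; with coarse buckets the estimate can sit far from the true p99.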
-
-
# 3.2. Apache : Lusitaniae/apache_exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/apache/lusitaniae-apache-exporter.yml
-
# 3.2.1. Apache down
Apache down [copy] - alert: ApacheDown expr: apache_up == 0 for: 0m labels: severity: critical annotations: summary: Apache down (instance {{ $labels.instance }}) description: "Apache down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.2.2. Apache workers load
Apache workers in busy state are approaching the max workers count (> 80% busy) on {{ $labels.instance }} [copy] - alert: ApacheWorkersLoad expr: (sum by (instance) (apache_workers{state="busy"}) / sum by (instance) (apache_scoreboard) ) * 100 > 80 for: 2m labels: severity: warning annotations: summary: Apache workers load (instance {{ $labels.instance }}) description: "Apache workers in busy state are approaching the max workers count (> 80% busy) on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.2.3. Apache restart
Apache has just been restarted. [copy] - alert: ApacheRestart expr: apache_uptime_seconds_total / 60 < 1 for: 0m labels: severity: warning annotations: summary: Apache restart (instance {{ $labels.instance }}) description: "Apache has just been restarted.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 3.3.1. HaProxy : Embedded exporter (HAProxy >= v2) (14 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/haproxy/embedded-exporter-v2.yml
-
# 3.3.1.1. HAProxy high HTTP 4xx error rate backend
Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.proxy }} [copy] - alert: HaproxyHighHttp4xxErrorRateBackend expr: ((sum by (proxy) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (proxy) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 for: 1m labels: severity: critical annotations: summary: HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.proxy }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.2. HAProxy high HTTP 5xx error rate backend
Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.proxy }} [copy] - alert: HaproxyHighHttp5xxErrorRateBackend expr: ((sum by (proxy) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (proxy) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 for: 1m labels: severity: critical annotations: summary: HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.proxy }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.3. HAProxy high HTTP 4xx error rate server
Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }} [copy] - alert: HaproxyHighHttp4xxErrorRateServer expr: ((sum by (server) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 for: 1m labels: severity: critical annotations: summary: HAProxy high HTTP 4xx error rate server (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.4. HAProxy high HTTP 5xx error rate server
Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }} [copy] - alert: HaproxyHighHttp5xxErrorRateServer expr: ((sum by (server) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 for: 1m labels: severity: critical annotations: summary: HAProxy high HTTP 5xx error rate server (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.5. HAProxy server response errors
Too many response errors to {{ $labels.server }} server (> 5%). [copy] - alert: HaproxyServerResponseErrors expr: (sum by (server) (rate(haproxy_server_response_errors_total[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100 > 5 for: 1m labels: severity: critical annotations: summary: HAProxy server response errors (instance {{ $labels.instance }}) description: "Too many response errors to {{ $labels.server }} server (> 5%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.6. HAProxy backend connection errors
Too many connection errors to {{ $labels.proxy }} backend (> 100 req/s). Request throughput may be too high. [copy] - alert: HaproxyBackendConnectionErrors expr: (sum by (proxy) (rate(haproxy_backend_connection_errors_total[1m]))) > 100 for: 1m labels: severity: critical annotations: summary: HAProxy backend connection errors (instance {{ $labels.instance }}) description: "Too many connection errors to {{ $labels.proxy }} backend (> 100 req/s). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.7. HAProxy server connection errors
Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high. [copy] - alert: HaproxyServerConnectionErrors expr: (sum by (server) (rate(haproxy_server_connection_errors_total[1m]))) > 100 for: 0m labels: severity: critical annotations: summary: HAProxy server connection errors (instance {{ $labels.instance }}) description: "Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.8. HAProxy backend max active session > 80%
Session limit from backend {{ $labels.proxy }} reached 80% of limit - {{ $value | printf "%.2f"}}% [copy] - alert: HaproxyBackendMaxActiveSessions expr: ((haproxy_backend_current_sessions > 0) * 100) / (haproxy_backend_limit_sessions > 0) > 80 for: 2m labels: severity: warning annotations: summary: HAProxy backend max active session > 80% (instance {{ $labels.instance }}) description: "Session limit from backend {{ $labels.proxy }} reached 80% of limit - {{ $value | printf \"%.2f\"}}%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
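The session-limit rule above leans on a PromQL subtlety: without the `bool` modifier, a comparison like `haproxy_backend_limit_sessions > 0` is a filter that keeps the original sample values. Backends with a zero/unset limit drop out of the denominator entirely, so the ratio is never computed for them. A sketch of that filter-then-divide behaviour (label sets are illustrative):

```python
# PromQL "vector > 0" (no bool modifier) filters series, keeping their
# original values rather than returning 0/1.
def filter_gt(series, threshold):
    return {labels: v for labels, v in series.items() if v > threshold}

def session_pct(current, limit):
    cur = filter_gt(current, 0)
    lim = filter_gt(limit, 0)
    # binary operators match series with identical label sets on both sides
    return {k: cur[k] * 100 / lim[k] for k in cur.keys() & lim.keys()}

# "idle" has no current sessions, "nolimit" has no configured limit:
# only "web" survives both filters and gets a percentage.
pct = session_pct({"web": 90, "idle": 0, "nolimit": 10},
                  {"web": 100, "idle": 100, "nolimit": 0})
```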
-
# 3.3.1.9. HAProxy pending requests
Some HAProxy requests are pending on {{ $labels.proxy }} - {{ $value | printf "%.2f"}} [copy] - alert: HaproxyPendingRequests expr: sum by (proxy) (rate(haproxy_backend_current_queue[2m])) > 0 for: 2m labels: severity: warning annotations: summary: HAProxy pending requests (instance {{ $labels.instance }}) description: "Some HAProxy requests are pending on {{ $labels.proxy }} - {{ $value | printf \"%.2f\"}}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.10. HAProxy HTTP slowing down
Average request time is above 1s - {{ $value | printf "%.2f"}} [copy] - alert: HaproxyHttpSlowingDown expr: avg by (instance, proxy) (haproxy_backend_max_total_time_seconds) > 1 for: 1m labels: severity: warning annotations: summary: HAProxy HTTP slowing down (instance {{ $labels.instance }}) description: "Average request time is above 1s - {{ $value | printf \"%.2f\"}}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.11. HAProxy retry high
High rate of retry on {{ $labels.proxy }} - {{ $value | printf "%.2f"}} [copy] - alert: HaproxyRetryHigh expr: sum by (proxy) (rate(haproxy_backend_retry_warnings_total[1m])) > 10 for: 2m labels: severity: warning annotations: summary: HAProxy retry high (instance {{ $labels.instance }}) description: "High rate of retry on {{ $labels.proxy }} - {{ $value | printf \"%.2f\"}}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.12. HAProxy has no alive backends
HAProxy has no alive active or backup backends for {{ $labels.proxy }} [copy] - alert: HaproxyHasNoAliveBackends expr: haproxy_backend_active_servers + haproxy_backend_backup_servers == 0 for: 0m labels: severity: critical annotations: summary: HAProxy has no alive backends (instance {{ $labels.instance }}) description: "HAProxy has no alive active or backup backends for {{ $labels.proxy }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.13. HAProxy frontend security blocked requests
HAProxy is blocking requests for security reason [copy] - alert: HaproxyFrontendSecurityBlockedRequests expr: sum by (proxy) (rate(haproxy_frontend_denied_connections_total[2m])) > 10 for: 2m labels: severity: warning annotations: summary: HAProxy frontend security blocked requests (instance {{ $labels.instance }}) description: "HAProxy is blocking requests for security reason\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.14. HAProxy server healthcheck failure
Some server healthchecks are failing on {{ $labels.server }} [copy] - alert: HaproxyServerHealthcheckFailure expr: increase(haproxy_server_check_failures_total[1m]) > 0 for: 1m labels: severity: warning annotations: summary: HAProxy server healthcheck failure (instance {{ $labels.instance }}) description: "Some server healthchecks are failing on {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 3.3.2. HaProxy : prometheus/haproxy_exporter (HAProxy < v2) (16 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/haproxy/haproxy-exporter-v1.yml
-
# 3.3.2.1. HAProxy down
HAProxy down [copy] - alert: HaproxyDown expr: haproxy_up == 0 for: 0m labels: severity: critical annotations: summary: HAProxy down (instance {{ $labels.instance }}) description: "HAProxy down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.2. HAProxy high HTTP 4xx error rate backend
Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }} [copy] - alert: HaproxyHighHttp4xxErrorRateBackend expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) * 100 / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 5 for: 1m labels: severity: critical annotations: summary: HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.3. HAProxy high HTTP 5xx error rate backend
Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }} [copy] - alert: HaproxyHighHttp5xxErrorRateBackend expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) * 100 / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 5 for: 1m labels: severity: critical annotations: summary: HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.4. HAProxy high HTTP 4xx error rate server
Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }} [copy] - alert: HaproxyHighHttp4xxErrorRateServer expr: sum by (server) (rate(haproxy_server_http_responses_total{code="4xx"}[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5 for: 1m labels: severity: critical annotations: summary: HAProxy high HTTP 4xx error rate server (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.5. HAProxy high HTTP 5xx error rate server
Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }} [copy] - alert: HaproxyHighHttp5xxErrorRateServer expr: sum by (server) (rate(haproxy_server_http_responses_total{code="5xx"}[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5 for: 1m labels: severity: critical annotations: summary: HAProxy high HTTP 5xx error rate server (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.6. HAProxy server response errors
Too many response errors to {{ $labels.server }} server (> 5%). [copy] - alert: HaproxyServerResponseErrors expr: sum by (server) (rate(haproxy_server_response_errors_total[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5 for: 1m labels: severity: critical annotations: summary: HAProxy server response errors (instance {{ $labels.instance }}) description: "Too many response errors to {{ $labels.server }} server (> 5%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.7. HAProxy backend connection errors
Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high. [copy] - alert: HaproxyBackendConnectionErrors expr: sum by (backend) (rate(haproxy_backend_connection_errors_total[1m])) > 100 for: 1m labels: severity: critical annotations: summary: HAProxy backend connection errors (instance {{ $labels.instance }}) description: "Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.8. HAProxy server connection errors
Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high. [copy] - alert: HaproxyServerConnectionErrors expr: sum by (server) (rate(haproxy_server_connection_errors_total[1m])) > 100 for: 0m labels: severity: critical annotations: summary: HAProxy server connection errors (instance {{ $labels.instance }}) description: "Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.9. HAProxy backend max active session
HAProxy backend {{ $labels.fqdn }}/{{ $labels.backend }} is reaching session limit (> 80%). [copy] - alert: HaproxyBackendMaxActiveSession expr: ((sum by (backend) (avg_over_time(haproxy_backend_current_sessions[2m]) * 100) / sum by (backend) (avg_over_time(haproxy_backend_limit_sessions[2m])))) > 80 for: 2m labels: severity: warning annotations: summary: HAProxy backend max active session (instance {{ $labels.instance }}) description: "HAProxy backend {{ $labels.fqdn }}/{{ $labels.backend }} is reaching session limit (> 80%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.10. HAProxy pending requests
Some HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend [copy] - alert: HaproxyPendingRequests expr: sum by (backend) (haproxy_backend_current_queue) > 0 for: 2m labels: severity: warning annotations: summary: HAProxy pending requests (instance {{ $labels.instance }}) description: "Some HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.11. HAProxy HTTP slowing down
Average request time is above 1s [copy] - alert: HaproxyHttpSlowingDown expr: avg by (backend) (haproxy_backend_http_total_time_average_seconds) > 1 for: 1m labels: severity: warning annotations: summary: HAProxy HTTP slowing down (instance {{ $labels.instance }}) description: "Average request time is above 1s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.12. HAProxy retry high
High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend [copy] - alert: HaproxyRetryHigh expr: sum by (backend) (rate(haproxy_backend_retry_warnings_total[1m])) > 10 for: 2m labels: severity: warning annotations: summary: HAProxy retry high (instance {{ $labels.instance }}) description: "High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.13. HAProxy backend down
HAProxy backend is down [copy] - alert: HaproxyBackendDown expr: haproxy_backend_up == 0 for: 0m labels: severity: critical annotations: summary: HAProxy backend down (instance {{ $labels.instance }}) description: "HAProxy backend is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.14. HAProxy server down
HAProxy server is down [copy] - alert: HaproxyServerDown expr: haproxy_server_up == 0 for: 0m labels: severity: critical annotations: summary: HAProxy server down (instance {{ $labels.instance }}) description: "HAProxy server is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.15. HAProxy frontend security blocked requests
HAProxy is blocking requests for security reason [copy] - alert: HaproxyFrontendSecurityBlockedRequests expr: sum by (frontend) (rate(haproxy_frontend_requests_denied_total[2m])) > 10 for: 2m labels: severity: warning annotations: summary: HAProxy frontend security blocked requests (instance {{ $labels.instance }}) description: "HAProxy is blocking requests for security reason\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.16. HAProxy server healthcheck failure
Some server healthchecks are failing on {{ $labels.server }} [copy] - alert: HaproxyServerHealthcheckFailure expr: increase(haproxy_server_check_failures_total[1m]) > 0 for: 1m labels: severity: warning annotations: summary: HAProxy server healthcheck failure (instance {{ $labels.instance }}) description: "Some server healthchecks are failing on {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 3.4.1. Traefik : Embedded exporter v2 (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/traefik/embedded-exporter-v2.yml
-
# 3.4.1.1. Traefik service down
A Traefik service has no servers up. Note: `count(traefik_service_server_up) by (service)` counts series regardless of value and can never equal 0, so the sum of the per-server up gauges is used instead. [copy] - alert: TraefikServiceDown expr: sum(traefik_service_server_up) by (service) == 0 for: 0m labels: severity: critical annotations: summary: Traefik service down (service {{ $labels.service }}) description: "All servers of Traefik service {{ $labels.service }} are down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.4.1.2. Traefik high HTTP 4xx error rate service
Traefik service 4xx error rate is above 5% [copy] - alert: TraefikHighHttp4xxErrorRateService expr: sum(rate(traefik_service_requests_total{code=~"4.*"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5 for: 1m labels: severity: critical annotations: summary: Traefik high HTTP 4xx error rate service (service {{ $labels.service }}) description: "Traefik service 4xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
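The 4xx rule above divides the per-second rate of 4xx responses by the rate of all responses and scales to a percentage; the window length cancels out of the ratio. A quick sanity check of that arithmetic with hypothetical counter increases (not real Traefik data):

```python
WINDOW_SECONDS = 180  # the [3m] range used in the rule

def error_rate_percent(errors_delta, total_delta, window=WINDOW_SECONDS):
    """Counter increases over the window -> error percentage.
    Mirrors rate(errors[3m]) / rate(total[3m]) * 100."""
    if total_delta == 0:
        return None  # no traffic: PromQL produces no sample to compare
    return (errors_delta / window) / (total_delta / window) * 100

# Hypothetical: 30 4xx responses out of 400 requests in 3 minutes -> 7.5%
assert error_rate_percent(30, 400) == 7.5
assert error_rate_percent(30, 400) > 5  # the "> 5" threshold would fire
```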
# 3.4.1.3. Traefik high HTTP 5xx error rate service
Traefik service 5xx error rate is above 5% [copy] - alert: TraefikHighHttp5xxErrorRateService expr: sum(rate(traefik_service_requests_total{code=~"5.*"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5 for: 1m labels: severity: critical annotations: summary: Traefik high HTTP 5xx error rate service (service {{ $labels.service }}) description: "Traefik service 5xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 3.4.2. Traefik : Embedded exporter v1 (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/traefik/embedded-exporter-v1.yml
-
# 3.4.2.1. Traefik backend down
A Traefik backend has no servers up. Note: `count(traefik_backend_server_up) by (backend)` counts series regardless of value and can never equal 0, so the sum of the per-server up gauges is used instead. [copy] - alert: TraefikBackendDown expr: sum(traefik_backend_server_up) by (backend) == 0 for: 0m labels: severity: critical annotations: summary: Traefik backend down (backend {{ $labels.backend }}) description: "All servers of Traefik backend {{ $labels.backend }} are down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.4.2.2. Traefik high HTTP 4xx error rate backend
Traefik backend 4xx error rate is above 5% [copy] - alert: TraefikHighHttp4xxErrorRateBackend expr: sum(rate(traefik_backend_requests_total{code=~"4.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5 for: 1m labels: severity: critical annotations: summary: Traefik high HTTP 4xx error rate backend (backend {{ $labels.backend }}) description: "Traefik backend 4xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.4.2.3. Traefik high HTTP 5xx error rate backend
Traefik backend 5xx error rate is above 5% [copy] - alert: TraefikHighHttp5xxErrorRateBackend expr: sum(rate(traefik_backend_requests_total{code=~"5.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5 for: 1m labels: severity: critical annotations: summary: Traefik high HTTP 5xx error rate backend (backend {{ $labels.backend }}) description: "Traefik backend 5xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 3.5. Caddy : Embedded exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/caddy/embedded-exporter.yml
-
# 3.5.1. Caddy Reverse Proxy Down
A Caddy reverse proxy upstream has no healthy backends. Note: `count(caddy_reverse_proxy_upstreams_healthy) by (upstream)` counts series regardless of value and can never equal 0, so the sum of the healthy gauges is used instead. [copy] - alert: CaddyReverseProxyDown expr: sum(caddy_reverse_proxy_upstreams_healthy) by (upstream) == 0 for: 0m labels: severity: critical annotations: summary: Caddy Reverse Proxy Down (upstream {{ $labels.upstream }}) description: "Caddy reverse proxy upstream {{ $labels.upstream }} has no healthy backends\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.5.2. Caddy high HTTP 4xx error rate service
Caddy service 4xx error rate is above 5% [copy] - alert: CaddyHighHttp4xxErrorRateService expr: sum(rate(caddy_http_request_duration_seconds_count{code=~"4.."}[3m])) by (instance) / sum(rate(caddy_http_request_duration_seconds_count[3m])) by (instance) * 100 > 5 for: 1m labels: severity: critical annotations: summary: Caddy high HTTP 4xx error rate service (instance {{ $labels.instance }}) description: "Caddy service 4xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.5.3. Caddy high HTTP 5xx error rate service
Caddy service 5xx error rate is above 5% [copy] - alert: CaddyHighHttp5xxErrorRateService expr: sum(rate(caddy_http_request_duration_seconds_count{code=~"5.."}[3m])) by (instance) / sum(rate(caddy_http_request_duration_seconds_count[3m])) by (instance) * 100 > 5 for: 1m labels: severity: critical annotations: summary: Caddy high HTTP 5xx error rate service (instance {{ $labels.instance }}) description: "Caddy service 5xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 3.6. Envoy : Built-in metrics (19 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/envoy/embedded-exporter.yml
-
# 3.6.1. Envoy server not live
Envoy server is not live (draining or shutting down) on {{ $labels.instance }} [copy] - alert: EnvoyServerNotLive expr: envoy_server_live != 1 for: 1m labels: severity: critical annotations: summary: Envoy server not live (instance {{ $labels.instance }}) description: "Envoy server is not live (draining or shutting down) on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.2. Envoy high memory usage
Envoy memory allocated is above 90% of heap size on {{ $labels.instance }} [copy] - alert: EnvoyHighMemoryUsage expr: envoy_server_memory_allocated / envoy_server_memory_heap_size * 100 > 90 for: 5m labels: severity: warning annotations: summary: Envoy high memory usage (instance {{ $labels.instance }}) description: "Envoy memory allocated is above 90% of heap size on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.3. Envoy high downstream HTTP 5xx error rate
More than 5% of downstream HTTP responses are 5xx on {{ $labels.instance }} ({{ $value | printf "%.1f" }}%) [copy] - alert: EnvoyHighDownstreamHttp5xxErrorRate expr: sum by (instance) (rate(envoy_http_downstream_rq_xx{envoy_response_code_class="5"}[5m])) / sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) * 100 > 5 for: 1m labels: severity: critical annotations: summary: Envoy high downstream HTTP 5xx error rate (instance {{ $labels.instance }}) description: "More than 5% of downstream HTTP responses are 5xx on {{ $labels.instance }} ({{ $value | printf \"%.1f\" }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.4. Envoy high downstream HTTP 4xx error rate
More than 10% of downstream HTTP responses are 4xx on {{ $labels.instance }} ({{ $value | printf "%.1f" }}%) [copy] - alert: EnvoyHighDownstreamHttp4xxErrorRate expr: sum by (instance) (rate(envoy_http_downstream_rq_xx{envoy_response_code_class="4"}[5m])) / sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) * 100 > 10 for: 5m labels: severity: warning annotations: summary: Envoy high downstream HTTP 4xx error rate (instance {{ $labels.instance }}) description: "More than 10% of downstream HTTP responses are 4xx on {{ $labels.instance }} ({{ $value | printf \"%.1f\" }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.5. Envoy downstream connections overflowing
Downstream connections are being rejected due to listener overflow on {{ $labels.instance }} [copy] - alert: EnvoyDownstreamConnectionsOverflowing expr: increase(envoy_listener_downstream_cx_overflow[5m]) > 0 for: 0m labels: severity: warning annotations: summary: Envoy downstream connections overflowing (instance {{ $labels.instance }}) description: "Downstream connections are being rejected due to listener overflow on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.6. Envoy cluster membership empty
Envoy cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} has no healthy members [copy] - alert: EnvoyClusterMembershipEmpty expr: envoy_cluster_membership_healthy == 0 for: 1m labels: severity: critical annotations: summary: Envoy cluster membership empty (instance {{ $labels.instance }}) description: "Envoy cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} has no healthy members\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.7. Envoy cluster membership degraded
More than 25% of members in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} are unhealthy [copy] - alert: EnvoyClusterMembershipDegraded expr: envoy_cluster_membership_healthy / envoy_cluster_membership_total * 100 < 75 and envoy_cluster_membership_total > 0 for: 5m labels: severity: warning annotations: summary: Envoy cluster membership degraded (instance {{ $labels.instance }}) description: "More than 25% of members in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} are unhealthy\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.8. Envoy high cluster upstream connection failures
High rate of upstream connection failures in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} [copy] - alert: EnvoyHighClusterUpstreamConnectionFailures expr: increase(envoy_cluster_upstream_cx_connect_fail[5m]) > 10 for: 5m labels: severity: warning annotations: summary: Envoy high cluster upstream connection failures (instance {{ $labels.instance }}) description: "High rate of upstream connection failures in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.9. Envoy high cluster upstream request timeout rate
More than 5% of upstream requests are timing out in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} [copy] - alert: EnvoyHighClusterUpstreamRequestTimeoutRate expr: increase(envoy_cluster_upstream_rq_timeout[5m]) / increase(envoy_cluster_upstream_rq_completed[5m]) * 100 > 5 and increase(envoy_cluster_upstream_rq_completed[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Envoy high cluster upstream request timeout rate (instance {{ $labels.instance }}) description: "More than 5% of upstream requests are timing out in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.10. Envoy high cluster upstream 5xx error rate
More than 5% of upstream requests return 5xx in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} [copy] - alert: EnvoyHighClusterUpstream5xxErrorRate expr: increase(envoy_cluster_upstream_rq_xx{envoy_response_code_class="5"}[5m]) / increase(envoy_cluster_upstream_rq_completed[5m]) * 100 > 5 and increase(envoy_cluster_upstream_rq_completed[5m]) > 0 for: 1m labels: severity: critical annotations: summary: Envoy high cluster upstream 5xx error rate (instance {{ $labels.instance }}) description: "More than 5% of upstream requests return 5xx in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
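Rules 3.6.9 and 3.6.10 append `and increase(envoy_cluster_upstream_rq_completed[5m]) > 0` to the ratio. Without that guard, a window with timeouts but zero completed requests would evaluate to +Inf, which satisfies `> 5` and fires spuriously; the guard drops those samples entirely. A minimal sketch of the guard logic, using hypothetical counter deltas rather than real Envoy data:

```python
def timeout_rate_alert(timeouts_delta, completed_delta, threshold_pct=5.0):
    """Mirror `increase(timeout)/increase(completed)*100 > T
    and increase(completed) > 0` from the Envoy upstream rules."""
    # The `and` guard: no completed requests -> no sample -> no alert,
    # instead of x/0 = +Inf tripping the threshold.
    if not completed_delta > 0:
        return False
    percent = timeouts_delta / completed_delta * 100
    return percent > threshold_pct

assert timeout_rate_alert(6, 100) is True    # 6% > 5% -> fires
assert timeout_rate_alert(2, 100) is False   # 2% -> quiet
assert timeout_rate_alert(3, 0) is False     # guarded: 3/0 would be +Inf
```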
# 3.6.11. Envoy cluster health check failures
Health checks are consistently failing in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} [copy] - alert: EnvoyClusterHealthCheckFailures expr: increase(envoy_cluster_health_check_failure[5m]) > 5 for: 5m labels: severity: warning annotations: summary: Envoy cluster health check failures (instance {{ $labels.instance }}) description: "Health checks are consistently failing in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.12. Envoy cluster outlier detection ejections active
There are active outlier detection ejections in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} [copy] - alert: EnvoyClusterOutlierDetectionEjectionsActive expr: envoy_cluster_outlier_detection_ejections_active > 0 for: 5m labels: severity: info annotations: summary: Envoy cluster outlier detection ejections active (instance {{ $labels.instance }}) description: "There are active outlier detection ejections in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.13. Envoy listener SSL connection errors
Envoy listener is experiencing SSL/TLS connection errors on {{ $labels.instance }} [copy] - alert: EnvoyListenerSslConnectionErrors expr: increase(envoy_listener_ssl_connection_error[5m]) > 0 for: 0m labels: severity: warning annotations: summary: Envoy listener SSL connection errors (instance {{ $labels.instance }}) description: "Envoy listener is experiencing SSL/TLS connection errors on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.14. Envoy global downstream connections overflowing
Downstream connections are being rejected due to global connection limit on {{ $labels.instance }} [copy] - alert: EnvoyGlobalDownstreamConnectionsOverflowing expr: increase(envoy_listener_downstream_global_cx_overflow[5m]) > 0 for: 0m labels: severity: critical annotations: summary: Envoy global downstream connections overflowing (instance {{ $labels.instance }}) description: "Downstream connections are being rejected due to global connection limit on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.15. Envoy SSL certificate expiring soon
SSL certificate loaded by Envoy on {{ $labels.instance }} expires in less than 7 days [copy] - alert: EnvoySslCertificateExpiringSoon expr: envoy_server_days_until_first_cert_expiring < 7 for: 0m labels: severity: warning annotations: summary: Envoy SSL certificate expiring soon (instance {{ $labels.instance }}) description: "SSL certificate loaded by Envoy on {{ $labels.instance }} expires in less than 7 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.16. Envoy SSL certificate expired
SSL certificate loaded by Envoy on {{ $labels.instance }} has expired [copy] - alert: EnvoySslCertificateExpired expr: envoy_server_days_until_first_cert_expiring < 0 for: 0m labels: severity: critical annotations: summary: Envoy SSL certificate expired (instance {{ $labels.instance }}) description: "SSL certificate loaded by Envoy on {{ $labels.instance }} has expired\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.17. Envoy cluster circuit breaker tripped
Circuit breaker is open for cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} [copy] - alert: EnvoyClusterCircuitBreakerTripped expr: envoy_cluster_circuit_breakers_default_cx_open == 1 or envoy_cluster_circuit_breakers_default_rq_open == 1 for: 0m labels: severity: critical annotations: summary: Envoy cluster circuit breaker tripped (instance {{ $labels.instance }}) description: "Circuit breaker is open for cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.18. Envoy no healthy upstream
Upstream connection attempts failed because no healthy upstream was available in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} [copy] - alert: EnvoyNoHealthyUpstream expr: increase(envoy_cluster_upstream_cx_none_healthy[5m]) > 0 for: 0m labels: severity: critical annotations: summary: Envoy no healthy upstream (instance {{ $labels.instance }}) description: "Upstream connection attempts failed because no healthy upstream was available in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.19. Envoy high downstream request timeout rate
Downstream requests are timing out on {{ $labels.instance }} [copy] - alert: EnvoyHighDownstreamRequestTimeoutRate expr: increase(envoy_http_downstream_rq_timeout[5m]) > 5 for: 5m labels: severity: warning annotations: summary: Envoy high downstream request timeout rate (instance {{ $labels.instance }}) description: "Downstream requests are timing out on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 4.1. PHP-FPM : bakins/php-fpm-exporter (1 rule) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/php-fpm/bakins-fpm-exporter.yml
-
# 4.1.1. PHP-FPM max-children reached
PHP-FPM reached max children - {{ $labels.instance }} [copy] - alert: PhpFpmMaxChildrenReached expr: sum(phpfpm_max_children_reached_total) by (instance) > 0 for: 0m labels: severity: warning annotations: summary: PHP-FPM max-children reached (instance {{ $labels.instance }}) description: "PHP-FPM reached max children - {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 4.2. JVM : java-client (12 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/jvm/jvm-exporter.yml
-
# 4.2.1. JVM memory filling up
JVM memory is filling up (> 80%) [copy] - alert: JvmMemoryFillingUp expr: (sum by (instance)(jvm_memory_used_bytes{area="heap"}) / sum by (instance)(jvm_memory_max_bytes{area="heap"})) * 100 > 80 for: 2m labels: severity: warning annotations: summary: JVM memory filling up (instance {{ $labels.instance }}) description: "JVM memory is filling up (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.2. JVM non-heap memory filling up
JVM non-heap memory (metaspace/code cache) is filling up (> 80%) [copy] # Many JVM configurations leave metaspace unbounded, in which case jvm_memory_max_bytes{area="nonheap"} is -1 and this alert will not fire. # The query filters out max_bytes <= 0 to avoid false negatives. - alert: JvmNonHeapMemoryFillingUp expr: (sum by (instance)(jvm_memory_used_bytes{area="nonheap"}) / (sum by (instance)(jvm_memory_max_bytes{area="nonheap"}) > 0)) * 100 > 80 for: 2m labels: severity: warning annotations: summary: JVM non-heap memory filling up (instance {{ $labels.instance }}) description: "JVM non-heap memory (metaspace/code cache) is filling up (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
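The `> 0` filter inside the denominator of rule 4.2.1's non-heap variant above drops instances whose non-heap max is reported as -1 (unbounded), so no ratio is ever computed for them. The filtering behaviour, sketched with hypothetical byte counts (PromQL drops the series; the sketch returns None instead):

```python
def nonheap_usage_percent(used_bytes, max_bytes):
    """Mirror `used / (max > 0) * 100` from the JVM non-heap rule."""
    if max_bytes <= 0:   # unbounded metaspace is exported as -1
        return None      # PromQL drops the series: no sample, no alert
    return used_bytes / max_bytes * 100

assert nonheap_usage_percent(900, 1000) == 90.0  # bounded: ratio computed
assert nonheap_usage_percent(900, -1) is None    # unbounded: never alerts
```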
# 4.2.3. JVM GC time too high
JVM is spending too much time in garbage collection (> 5% of wall clock time) [copy] - alert: JvmGcTimeTooHigh expr: sum by (instance)(rate(jvm_gc_collection_seconds_sum[5m])) > 0.05 for: 5m labels: severity: warning annotations: summary: JVM GC time too high (instance {{ $labels.instance }}) description: "JVM is spending too much time in garbage collection (> 5% of wall clock time)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.4. JVM threads deadlocked
JVM has deadlocked threads [copy] - alert: JvmThreadsDeadlocked expr: jvm_threads_deadlocked > 0 for: 1m labels: severity: critical annotations: summary: JVM threads deadlocked (instance {{ $labels.instance }}) description: "JVM has deadlocked threads\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.5. JVM thread count high
JVM thread count is high (> 300), potential thread leak [copy] - alert: JvmThreadCountHigh expr: jvm_threads_current > 300 for: 5m labels: severity: warning annotations: summary: JVM thread count high (instance {{ $labels.instance }}) description: "JVM thread count is high (> 300), potential thread leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.6. JVM threads BLOCKED
JVM has a high number of BLOCKED threads, indicating lock contention [copy] - alert: JvmThreadsBlocked expr: jvm_threads_state{state="BLOCKED"} > 50 for: 5m labels: severity: warning annotations: summary: JVM threads BLOCKED (instance {{ $labels.instance }}) description: "JVM has a high number of BLOCKED threads, indicating lock contention\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.7. JVM old gen GC frequency
Frequent old/major GC cycles, indicating memory pressure [copy] # This regex matches CMS, G1, and Parallel collector names. It will not match ZGC or Shenandoah cycle names. # Adjust the gc label filter if you use a different collector. - alert: JvmOldGenGcFrequency expr: rate(jvm_gc_collection_seconds_count{gc=~".*old.*|.*major.*"}[5m]) > 0.3 for: 5m labels: severity: warning annotations: summary: JVM old gen GC frequency (instance {{ $labels.instance }}) description: "Frequent old/major GC cycles, indicating memory pressure\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.8. JVM direct buffer pool filling up
JVM direct buffer pool is filling up (> 90%) [copy] - alert: JvmDirectBufferPoolFillingUp expr: (jvm_buffer_pool_used_bytes / jvm_buffer_pool_capacity_bytes) * 100 > 90 for: 5m labels: severity: warning annotations: summary: JVM direct buffer pool filling up (instance {{ $labels.instance }}) description: "JVM direct buffer pool is filling up (> 90%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.9. JVM objects pending finalization
JVM has objects pending finalization, potential memory leak [copy] - alert: JvmObjectsPendingFinalization expr: jvm_memory_objects_pending_finalization > 1000 for: 5m labels: severity: warning annotations: summary: JVM objects pending finalization (instance {{ $labels.instance }}) description: "JVM has objects pending finalization, potential memory leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.10. JVM file descriptors exhaustion
JVM process is running out of file descriptors (> 90% used) [copy] # process_open_fds and process_max_fds are generic metrics from the Prometheus client library, not JVM-specific. # This alert will also fire for Go, Python, or any process exposing these metrics. - alert: JvmFileDescriptorsExhaustion expr: (process_open_fds / process_max_fds) * 100 > 90 for: 5m labels: severity: warning annotations: summary: JVM file descriptors exhaustion (instance {{ $labels.instance }}) description: "JVM process is running out of file descriptors (> 90% used)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.11. JVM class loading anomaly
Rapid class loading detected, potential classloader leak [copy] - alert: JvmClassLoadingAnomaly expr: rate(jvm_classes_loaded_total[5m]) > 100 for: 5m labels: severity: warning annotations: summary: JVM class loading anomaly (instance {{ $labels.instance }}) description: "Rapid class loading detected, potential classloader leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.12. JVM compilation time spike
Excessive JIT compilation time consuming CPU [copy] - alert: JvmCompilationTimeSpike expr: rate(jvm_compilation_time_seconds_total[5m]) > 0.1 for: 5m labels: severity: warning annotations: summary: JVM compilation time spike (instance {{ $labels.instance }}) description: "Excessive JIT compilation time consuming CPU\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 4.3. Golang : client_golang (10 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/golang/golang-exporter.yml
-
# 4.3.1. Go goroutine count high
Go application has too many goroutines (> 1000), potential goroutine leak [copy] # Threshold is a rough default. High-concurrency servers may legitimately run thousands of goroutines. Adjust to match your baseline. - alert: GoGoroutineCountHigh expr: go_goroutines > 1000 for: 5m labels: severity: warning annotations: summary: Go goroutine count high (instance {{ $labels.instance }}) description: "Go application has too many goroutines (> 1000), potential goroutine leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.3.2. Go GC duration high
Go GC pause duration is too high (max > 1s) [copy] # quantile="1" is the maximum observed GC pause in the current summary window, not p99. # A single outlier pause can push this above 1s. The for: 5m ensures the max stays elevated. - alert: GoGcDurationHigh expr: go_gc_duration_seconds{quantile="1"} > 1 for: 5m labels: severity: warning annotations: summary: Go GC duration high (instance {{ $labels.instance }}) description: "Go GC pause duration is too high (max > 1s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.3.3. Go memory usage high
Go heap allocation is using most of the runtime's reserved memory (> 90%), indicating the process may need more memory or has a leak [copy] # go_memstats_sys_bytes is the total memory obtained from the OS by the Go runtime, not total host memory. # This ratio measures Go-internal memory utilization, not system-level memory pressure. - alert: GoMemoryUsageHigh expr: (go_memstats_heap_alloc_bytes / go_memstats_sys_bytes) * 100 > 90 for: 5m labels: severity: warning annotations: summary: Go memory usage high (instance {{ $labels.instance }}) description: "Go heap allocation is using most of the runtime's reserved memory (> 90%), indicating the process may need more memory or has a leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.3.4. Go thread count high
Go OS thread count is high (> 50), potential blocking syscall or CGo leak [copy] # Threshold is workload-dependent. Applications with heavy CGo or blocking I/O may legitimately use more OS threads. Adjust to match your baseline. - alert: GoThreadCountHigh expr: go_threads > 50 for: 5m labels: severity: warning annotations: summary: Go thread count high (instance {{ $labels.instance }}) description: "Go OS thread count is high (> 50), potential blocking syscall or CGo leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.3.5. Go heap objects count high
Go heap has too many live objects (> 10M), high GC pressure [copy] # Threshold is a rough default. Adjust based on your application's normal object count. - alert: GoHeapObjectsCountHigh expr: go_memstats_heap_objects > 10000000 for: 5m labels: severity: warning annotations: summary: Go heap objects count high (instance {{ $labels.instance }}) description: "Go heap has too many live objects (> 10M), high GC pressure\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.3.6. Go GC CPU fraction high
Go GC is consuming too much CPU (> 5%) [copy] # go_memstats_gc_cpu_fraction is deprecated since Go 1.20 and may return 0 in newer versions. # Consider using runtime/metrics-based alternatives if running Go >= 1.20. - alert: GoGcCpuFractionHigh expr: go_memstats_gc_cpu_fraction > 0.05 for: 5m labels: severity: warning annotations: summary: Go GC CPU fraction high (instance {{ $labels.instance }}) description: "Go GC is consuming too much CPU (> 5%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.3.7. Go goroutine spike
Go goroutine count is growing rapidly [copy] - alert: GoGoroutineSpike expr: deriv(go_goroutines[5m]) > 100 for: 5m labels: severity: warning annotations: summary: Go goroutine spike (instance {{ $labels.instance }}) description: "Go goroutine count is growing rapidly\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
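`deriv()` in the goroutine-spike rule above fits a least-squares line to the gauge samples in the 5-minute window and returns its slope in units per second, so `> 100` means a sustained growth of roughly 100 goroutines per second. A small sketch of that regression on hypothetical samples (simple least squares, not the Prometheus source):

```python
def least_squares_slope(samples):
    """samples: list of (timestamp_seconds, value) pairs.
    Returns the slope in value-units per second, the quantity
    PromQL's deriv() estimates over its range window."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den

# Hypothetical goroutine counts sampled every 15s, growing by 150/s
samples = [(0, 1000), (15, 3250), (30, 5500), (45, 7750), (60, 10000)]
assert abs(least_squares_slope(samples) - 150) < 1e-9  # 150/s > 100 -> fires
```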
# 4.3.8. Go heap fragmentation
Go heap has high idle ratio (> 90%), indicating memory fragmentation [copy] - alert: GoHeapFragmentation expr: go_memstats_heap_idle_bytes / go_memstats_heap_sys_bytes > 0.9 for: 5m labels: severity: warning annotations: summary: Go heap fragmentation (instance {{ $labels.instance }}) description: "Go heap has high idle ratio (> 90%), indicating memory fragmentation\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.3.9. Go memory leak
Go application has sustained high allocation rate (> 1GB/s), potential memory leak [copy] - alert: GoMemoryLeak expr: rate(go_memstats_alloc_bytes_total[5m]) > 1e9 for: 5m labels: severity: warning annotations: summary: Go memory leak (instance {{ $labels.instance }}) description: "Go application has sustained high allocation rate (> 1GB/s), potential memory leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.3.10. Go stack memory high
Go stack memory usage is high (> 1GB), likely excessive goroutines or deep recursion [copy] - alert: GoStackMemoryHigh expr: go_memstats_stack_inuse_bytes > 1e9 for: 5m labels: severity: warning annotations: summary: Go stack memory high (instance {{ $labels.instance }}) description: "Go stack memory usage is high (> 1GB), likely excessive goroutines or deep recursion\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 4.4. Ruby : prometheus_exporter (5 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/ruby/ruby-exporter.yml
-
# 4.4.1. Ruby heap live slots high
Ruby heap has too many live slots (> 500k), heap bloat [copy] # Threshold is a rough default. Adjust based on your application's normal heap size. - alert: RubyHeapLiveSlotsHigh expr: ruby_heap_live_slots > 500000 for: 5m labels: severity: warning annotations: summary: Ruby heap live slots high (instance {{ $labels.instance }}) description: "Ruby heap has too many live slots (> 500k), heap bloat\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.4.2. Ruby heap free slots high
Ruby heap has too many free slots (> 500k), memory fragmentation after large allocations [copy] - alert: RubyHeapFreeSlotsHigh expr: ruby_heap_free_slots > 500000 for: 5m labels: severity: warning annotations: summary: Ruby heap free slots high (instance {{ $labels.instance }}) description: "Ruby heap has too many free slots (> 500k), memory fragmentation after large allocations\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.4.3. Ruby major GC rate high
Ruby is performing too many major GC cycles, indicating memory pressure [copy] - alert: RubyMajorGcRateHigh expr: rate(ruby_major_gc_ops_total[5m]) > 5 for: 5m labels: severity: warning annotations: summary: Ruby major GC rate high (instance {{ $labels.instance }}) description: "Ruby is performing too many major GC cycles, indicating memory pressure\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.4.4. Ruby RSS high
Ruby process RSS is high (> 1GB) [copy] - alert: RubyRssHigh expr: ruby_rss > 1e9 for: 5m labels: severity: warning annotations: summary: Ruby RSS high (instance {{ $labels.instance }}) description: "Ruby process RSS is high (> 1GB)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.4.5. Ruby allocated objects spike
Ruby is allocating objects at a high rate [copy] - alert: RubyAllocatedObjectsSpike expr: rate(ruby_allocated_objects_total[5m]) > 100000 for: 5m labels: severity: warning annotations: summary: Ruby allocated objects spike (instance {{ $labels.instance }}) description: "Ruby is allocating objects at a high rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 4.5. Python : client_python (5 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/python/python-exporter.yml
-
# 4.5.1. Python GC objects uncollectable
Python has uncollectable objects, potential memory leak via reference cycles [copy] - alert: PythonGcObjectsUncollectable expr: increase(python_gc_objects_uncollectable_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Python GC objects uncollectable (instance {{ $labels.instance }}) description: "Python has uncollectable objects, potential memory leak via reference cycles\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.5.2. Python GC collections high
Python GC is collecting too many objects (> 10k/s), high allocation pressure [copy] - alert: PythonGcCollectionsHigh expr: rate(python_gc_objects_collected_total[5m]) > 10000 for: 5m labels: severity: warning annotations: summary: Python GC collections high (instance {{ $labels.instance }}) description: "Python GC is collecting too many objects (> 10k/s), high allocation pressure\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.5.3. Python file descriptors exhaustion
Python process is running out of file descriptors (> 90% used) [copy] # process_open_fds and process_max_fds are generic metrics from the Prometheus client library, not Python-specific. - alert: PythonFileDescriptorsExhaustion expr: (process_open_fds / process_max_fds) * 100 > 90 for: 5m labels: severity: warning annotations: summary: Python file descriptors exhaustion (instance {{ $labels.instance }}) description: "Python process is running out of file descriptors (> 90% used)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.5.4. Python GC generation 2 collections high
Python full GC (generation 2) is running too frequently, indicating memory pressure [copy] - alert: PythonGcGeneration2CollectionsHigh expr: rate(python_gc_collections_total{generation="2"}[5m]) > 1 for: 5m labels: severity: warning annotations: summary: Python GC generation 2 collections high (instance {{ $labels.instance }}) description: "Python full GC (generation 2) is running too frequently, indicating memory pressure\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.5.5. Python virtual memory high
Python process virtual memory is high (> 4GB) [copy] # Threshold is a rough default. Adjust based on your application's expected memory footprint. - alert: PythonVirtualMemoryHigh expr: process_virtual_memory_bytes > 4e9 for: 5m labels: severity: warning annotations: summary: Python virtual memory high (instance {{ $labels.instance }}) description: "Python process virtual memory is high (> 4GB)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
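A single 4GB cutoff rarely fits every service. One common pattern is to duplicate the rule with label matchers so each workload gets its own threshold. A minimal sketch, assuming hypothetical job names (`batch-worker`, `api`) that you would replace with your own:

```yaml
# Sketch: per-job virtual memory thresholds. The job names are examples only.
- alert: PythonVirtualMemoryHighBatch
  expr: process_virtual_memory_bytes{job="batch-worker"} > 8e9  # batch jobs may legitimately use more
  for: 5m
  labels:
    severity: warning
- alert: PythonVirtualMemoryHighApi
  expr: process_virtual_memory_bytes{job="api"} > 2e9           # tighter bound for a lean API service
  for: 5m
  labels:
    severity: warning
```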
-
-
# 4.6. Sidekiq : Strech/sidekiq-prometheus-exporter (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/sidekiq/strech-sidekiq-exporter.yml-
# 4.6.1. Sidekiq queue size
Sidekiq queue {{ $labels.name }} has more than 100 pending jobs [copy] - alert: SidekiqQueueSize expr: sidekiq_queue_size > 100 for: 1m labels: severity: warning annotations: summary: Sidekiq queue size (instance {{ $labels.instance }}) description: "Sidekiq queue {{ $labels.name }} has more than 100 pending jobs\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.6.2. Sidekiq scheduling latency too high
Sidekiq jobs are taking more than 1min to be picked up. Users may be seeing delays in background processing. [copy] - alert: SidekiqSchedulingLatencyTooHigh expr: max(sidekiq_queue_latency) > 60 for: 0m labels: severity: critical annotations: summary: Sidekiq scheduling latency too high (instance {{ $labels.instance }}) description: "Sidekiq jobs are taking more than 1min to be picked up. Users may be seeing delays in background processing.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
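Queues often have very different latency expectations, and the exporter labels each queue via the `name` label (visible in the rule above). A sketch of a stricter per-queue rule, where the queue name `critical` is an example to adapt:

```yaml
# Sketch: tighter latency bound for a latency-sensitive queue.
# The queue name "critical" is an example; sidekiq_queue_latency is in seconds.
- alert: SidekiqCriticalQueueLatency
  expr: sidekiq_queue_latency{name="critical"} > 10
  for: 0m
  labels:
    severity: critical
```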
-
-
# 4.7. Apache Flink : Built-in Prometheus reporter (12 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/apache-flink/flink-prometheus-reporter.yml-
# 4.7.1. Flink job is not running
No Flink jobs are currently running. All jobs may have failed or been cancelled. [copy] - alert: FlinkJobIsNotRunning expr: flink_jobmanager_numRunningJobs == 0 for: 1m labels: severity: critical annotations: summary: Flink job is not running (instance {{ $labels.instance }}) description: "No Flink jobs are currently running. All jobs may have failed or been cancelled.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.7.2. Flink no TaskManagers registered
No TaskManagers are registered with the JobManager. The cluster has no processing capacity. [copy] - alert: FlinkNoTaskmanagersRegistered expr: flink_jobmanager_numRegisteredTaskManagers == 0 for: 1m labels: severity: critical annotations: summary: Flink no TaskManagers registered (instance {{ $labels.instance }}) description: "No TaskManagers are registered with the JobManager. The cluster has no processing capacity.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.7.3. Flink all task slots used
All Flink task slots are in use ({{ $value }} available). New jobs cannot be scheduled. [copy] # This alert fires when there are no available task slots. Adjust the threshold if your cluster is expected to run at full capacity. - alert: FlinkAllTaskSlotsUsed expr: flink_jobmanager_taskSlotsAvailable == 0 for: 5m labels: severity: warning annotations: summary: Flink all task slots used (instance {{ $labels.instance }}) description: "All Flink task slots are in use ({{ $value }} available). New jobs cannot be scheduled.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
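For clusters expected to run near full slot usage, a relative threshold can be more useful than `== 0`. This sketch assumes the JobManager also exports `flink_jobmanager_taskSlotsTotal` (standard in recent Flink versions, but verify against your deployment):

```yaml
# Sketch: fire when fewer than 10% of task slots remain free.
- alert: FlinkTaskSlotsLow
  expr: flink_jobmanager_taskSlotsAvailable / flink_jobmanager_taskSlotsTotal < 0.1
  for: 5m
  labels:
    severity: warning
```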
-
# 4.7.4. Flink job restart increasing
Flink job {{ $labels.job_name }} has restarted {{ $value }} times in the last 5 minutes. [copy] - alert: FlinkJobRestartIncreasing expr: increase(flink_jobmanager_job_numRestarts[5m]) > 0 for: 0m labels: severity: warning annotations: summary: Flink job restart increasing (instance {{ $labels.instance }}) description: "Flink job {{ $labels.job_name }} has restarted {{ $value }} times in the last 5 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.7.5. Flink checkpoint failures
Flink job {{ $labels.job_name }} has {{ $value }} failed checkpoints in the last 10 minutes. [copy] - alert: FlinkCheckpointFailures expr: increase(flink_jobmanager_job_numberOfFailedCheckpoints[10m]) > 0 for: 0m labels: severity: warning annotations: summary: Flink checkpoint failures (instance {{ $labels.instance }}) description: "Flink job {{ $labels.job_name }} has {{ $value }} failed checkpoints in the last 10 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.7.6. Flink checkpoint duration high
Flink job {{ $labels.job_name }} last checkpoint took {{ $value | humanizeDuration }} to complete. [copy] # Threshold is 60 seconds; the metric is reported in milliseconds. Adjust based on your checkpoint interval and state size. - alert: FlinkCheckpointDurationHigh expr: flink_jobmanager_job_lastCheckpointDuration > 60000 for: 5m labels: severity: warning annotations: summary: Flink checkpoint duration high (instance {{ $labels.instance }}) description: "Flink job {{ $labels.job_name }} last checkpoint took {{ $value | humanizeDuration }} to complete.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.7.7. Flink task backpressured
Flink task {{ $labels.task_name }} in job {{ $labels.job_name }} is backpressured. [copy] - alert: FlinkTaskBackpressured expr: flink_taskmanager_job_task_isBackPressured == 1 for: 5m labels: severity: warning annotations: summary: Flink task backpressured (instance {{ $labels.instance }}) description: "Flink task {{ $labels.task_name }} in job {{ $labels.job_name }} is backpressured.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.7.8. Flink task high backpressure time
Flink task {{ $labels.task_name }} is spending {{ $value | humanize }}ms/sec in backpressure. [copy] # Fires when a task spends more than 500ms/sec backpressured. This indicates the task cannot keep up with upstream data rate. - alert: FlinkTaskHighBackpressureTime expr: flink_taskmanager_job_task_backPressuredTimeMsPerSecond > 500 for: 5m labels: severity: warning annotations: summary: Flink task high backpressure time (instance {{ $labels.instance }}) description: "Flink task {{ $labels.task_name }} is spending {{ $value | humanize }}ms/sec in backpressure.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.7.9. Flink TaskManager heap memory high
Flink TaskManager {{ $labels.instance }} heap memory usage is above 90%. [copy] - alert: FlinkTaskmanagerHeapMemoryHigh expr: flink_taskmanager_Status_JVM_Memory_Heap_Used / flink_taskmanager_Status_JVM_Memory_Heap_Max > 0.9 for: 5m labels: severity: warning annotations: summary: Flink TaskManager heap memory high (instance {{ $labels.instance }}) description: "Flink TaskManager {{ $labels.instance }} heap memory usage is above 90%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.7.10. Flink JobManager heap memory high
Flink JobManager {{ $labels.instance }} heap memory usage is above 90%. [copy] - alert: FlinkJobmanagerHeapMemoryHigh expr: flink_jobmanager_Status_JVM_Memory_Heap_Used / flink_jobmanager_Status_JVM_Memory_Heap_Max > 0.9 for: 5m labels: severity: warning annotations: summary: Flink JobManager heap memory high (instance {{ $labels.instance }}) description: "Flink JobManager {{ $labels.instance }} heap memory usage is above 90%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.7.11. Flink TaskManager GC time high
Flink TaskManager {{ $labels.instance }} is spending more than 10% of time in garbage collection. [copy] # Threshold: more than 100ms/sec of GC time (10% of wall clock). Adjust based on your workload. - alert: FlinkTaskmanagerGcTimeHigh expr: rate(flink_taskmanager_Status_JVM_GarbageCollector_All_Time[5m]) > 100 for: 5m labels: severity: warning annotations: summary: Flink TaskManager GC time high (instance {{ $labels.instance }}) description: "Flink TaskManager {{ $labels.instance }} is spending more than 10% of time in garbage collection.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
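Since `..._GarbageCollector_All_Time` is cumulative milliseconds, its per-second rate divided by 1000 is the fraction of wall-clock time spent in GC. A recording rule keeps the alert expression readable; the rule name below is a suggestion, not a convention from this catalog:

```yaml
# Sketch: record the GC time fraction (0-1) per TaskManager.
groups:
  - name: flink-gc
    rules:
      - record: flink:taskmanager_gc_time_ratio
        expr: rate(flink_taskmanager_Status_JVM_GarbageCollector_All_Time[5m]) / 1000
```

The alert above could then be expressed as `flink:taskmanager_gc_time_ratio > 0.1`.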
-
# 4.7.12. Flink no records processed
Flink task {{ $labels.task_name }} has not processed any records in the last 5 minutes. [copy] # Only fires for tasks that have previously received records, to avoid false positives during startup. - alert: FlinkNoRecordsProcessed expr: rate(flink_taskmanager_job_task_numRecordsIn[5m]) == 0 and flink_taskmanager_job_task_numRecordsIn > 0 for: 5m labels: severity: warning annotations: summary: Flink no records processed (instance {{ $labels.instance }}) description: "Flink task {{ $labels.task_name }} has not processed any records in the last 5 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 4.8. Apache Spark : Built-in Prometheus (PrometheusServlet + PrometheusResource) (8 rules) [copy section]
Spark exposes metrics via two built-in endpoints:
- PrometheusServlet: master/worker/driver metrics at /metrics/prometheus/ (ports 8080, 8081, 4040)
- PrometheusResource: executor metrics at /metrics/executors/prometheus/ (port 4040, requires spark.ui.prometheus.enabled=true in Spark 3.x)
Metric names from PrometheusServlet include a dynamic namespace (application ID), making static PromQL queries challenging.
Configuration: spark.metrics.conf.*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/apache-spark/spark-prometheus.yml-
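Putting the endpoints above together, a Prometheus scrape configuration might look like the following sketch. Host names are illustrative placeholders, and the executor job only works with `spark.ui.prometheus.enabled=true`:

```yaml
scrape_configs:
  - job_name: spark-master
    metrics_path: /metrics/prometheus/
    static_configs:
      - targets: ['spark-master:8080']      # hypothetical host
  - job_name: spark-worker
    metrics_path: /metrics/prometheus/
    static_configs:
      - targets: ['spark-worker-1:8081']    # hypothetical host
  - job_name: spark-driver-executors
    metrics_path: /metrics/executors/prometheus/
    static_configs:
      - targets: ['spark-driver:4040']      # requires spark.ui.prometheus.enabled=true
```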
# 4.8.1. Spark no alive workers
No Spark workers are alive. The cluster has no processing capacity. [copy] - alert: SparkNoAliveWorkers expr: metrics_master_aliveWorkers_Value == 0 for: 1m labels: severity: critical annotations: summary: Spark no alive workers (instance {{ $labels.instance }}) description: "No Spark workers are alive. The cluster has no processing capacity.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.8.2. Spark too many waiting apps
Spark has {{ $value }} applications waiting for resources. [copy] # Adjust the threshold based on your cluster's typical queuing behavior. - alert: SparkTooManyWaitingApps expr: metrics_master_waitingApps_Value > 10 for: 5m labels: severity: warning annotations: summary: Spark too many waiting apps (instance {{ $labels.instance }}) description: "Spark has {{ $value }} applications waiting for resources.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.8.3. Spark worker memory exhausted
Spark worker {{ $labels.instance }} has no free memory ({{ $value }}MB free). [copy] - alert: SparkWorkerMemoryExhausted expr: metrics_worker_memFree_MB_Value == 0 for: 2m labels: severity: warning annotations: summary: Spark worker memory exhausted (instance {{ $labels.instance }}) description: "Spark worker {{ $labels.instance }} has no free memory ({{ $value }}MB free).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.8.4. Spark worker cores exhausted
Spark worker {{ $labels.instance }} has no free cores. [copy] # Fires when a worker has no free cores. This may be normal under high load but can indicate capacity issues. - alert: SparkWorkerCoresExhausted expr: metrics_worker_coresFree_Value == 0 for: 5m labels: severity: warning annotations: summary: Spark worker cores exhausted (instance {{ $labels.instance }}) description: "Spark worker {{ $labels.instance }} has no free cores.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.8.5. Spark executor high GC time
Spark executor {{ $labels.executor_id }} in {{ $labels.application_name }} is spending too much time in GC. [copy] # Fires when more than 10% of executor time is spent in garbage collection. # This metric comes from the PrometheusResource endpoint (/metrics/executors/prometheus/). - alert: SparkExecutorHighGcTime expr: metrics_executor_totalGCTime / (metrics_executor_totalDuration > 0) > 0.1 for: 5m labels: severity: warning annotations: summary: Spark executor high GC time (instance {{ $labels.instance }}) description: "Spark executor {{ $labels.executor_id }} in {{ $labels.application_name }} is spending too much time in GC.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.8.6. Spark executor all tasks failing
Spark executor {{ $labels.executor_id }} has only failing tasks ({{ $value }} failed, 0 completed). [copy] - alert: SparkExecutorAllTasksFailing expr: metrics_executor_failedTasks > 0 and metrics_executor_completedTasks == 0 for: 5m labels: severity: critical annotations: summary: Spark executor all tasks failing (instance {{ $labels.instance }}) description: "Spark executor {{ $labels.executor_id }} has only failing tasks ({{ $value }} failed, 0 completed).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.8.7. Spark executor high task failure rate
Spark executor {{ $labels.executor_id }} has a task failure rate above 10%. [copy] - alert: SparkExecutorHighTaskFailureRate expr: metrics_executor_failedTasks / (metrics_executor_totalTasks > 0) > 0.1 for: 5m labels: severity: warning annotations: summary: Spark executor high task failure rate (instance {{ $labels.instance }}) description: "Spark executor {{ $labels.executor_id }} has a task failure rate above 10%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.8.8. Spark executor high disk spill
Spark executor {{ $labels.executor_id }} is spilling data to disk. Consider increasing executor memory. [copy] # Disk spilling indicates insufficient memory for the workload. - alert: SparkExecutorHighDiskSpill expr: rate(metrics_executor_diskUsed_bytes[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Spark executor high disk spill (instance {{ $labels.instance }}) description: "Spark executor {{ $labels.executor_id }} is spilling data to disk. Consider increasing executor memory.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 5.1. Kubernetes : kube-state-metrics (37 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/kubernetes/kubestate-exporter.yml-
# 5.1.1. Kubernetes Node not ready
Node {{ $labels.node }} has been unready for a long time [copy] - alert: KubernetesNodeNotReady expr: kube_node_status_condition{condition="Ready",status="true"} == 0 for: 10m labels: severity: critical annotations: summary: Kubernetes Node not ready (instance {{ $labels.instance }}) description: "Node {{ $labels.node }} has been unready for a long time\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.2. Kubernetes Node scheduling disabled
Node {{ $labels.node }} has been marked as unschedulable for more than 30 minutes. [copy] # Nodes with scheduling disabled are often intentional (e.g. cordoned for maintenance). # This alert is useful to catch nodes that remain unschedulable longer than expected. - alert: KubernetesNodeSchedulingDisabled expr: kube_node_spec_taint{key="node.kubernetes.io/unschedulable"} == 1 for: 30m labels: severity: warning annotations: summary: Kubernetes Node scheduling disabled (instance {{ $labels.instance }}) description: "Node {{ $labels.node }} has been marked as unschedulable for more than 30 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.3. Kubernetes Node memory pressure
Node {{ $labels.node }} has MemoryPressure condition [copy] - alert: KubernetesNodeMemoryPressure expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1 for: 2m labels: severity: critical annotations: summary: Kubernetes Node memory pressure (instance {{ $labels.instance }}) description: "Node {{ $labels.node }} has MemoryPressure condition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.4. Kubernetes Node disk pressure
Node {{ $labels.node }} has DiskPressure condition [copy] - alert: KubernetesNodeDiskPressure expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1 for: 2m labels: severity: critical annotations: summary: Kubernetes Node disk pressure (instance {{ $labels.instance }}) description: "Node {{ $labels.node }} has DiskPressure condition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.5. Kubernetes Node network unavailable
Node {{ $labels.node }} has NetworkUnavailable condition [copy] - alert: KubernetesNodeNetworkUnavailable expr: kube_node_status_condition{condition="NetworkUnavailable",status="true"} == 1 for: 2m labels: severity: critical annotations: summary: Kubernetes Node network unavailable (instance {{ $labels.instance }}) description: "Node {{ $labels.node }} has NetworkUnavailable condition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
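When a node degrades, its condition alerts (5.1.1-5.1.5) tend to arrive together with a flood of lower-severity alerts. The canonical Alertmanager inhibition pattern mutes warnings while a matching critical alert is firing; which labels to put in `equal` depends on your relabeling, so treat this as a sketch:

```yaml
# Sketch: a firing critical alert suppresses the warning-level alert
# with the same alertname and instance.
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']
```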
-
# 5.1.6. Kubernetes Node out of pod capacity
Node {{ $labels.node }} is out of pod capacity [copy] - alert: KubernetesNodeOutOfPodCapacity expr: sum by (node) ((kube_pod_status_phase{phase="Running"} == 1) + on(uid, instance) group_left(node) (0 * kube_pod_info{pod_template_hash=""})) / sum by (node) (kube_node_status_allocatable{resource="pods"}) * 100 > 90 for: 2m labels: severity: warning annotations: summary: Kubernetes Node out of pod capacity (instance {{ $labels.instance }}) description: "Node {{ $labels.node }} is out of pod capacity\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
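The expression above joins Running pods onto their node via `kube_pod_info`. If that join proves brittle in your setup, a coarser sketch is to count pods in every phase per node (which slightly overcounts, since Succeeded/Failed pods no longer consume capacity):

```yaml
# Sketch: coarser variant counting pods in every phase, not just Running.
- alert: KubernetesNodeOutOfPodCapacitySimple
  expr: count by (node) (kube_pod_info) / sum by (node) (kube_node_status_allocatable{resource="pods"}) * 100 > 90
  for: 2m
  labels:
    severity: warning
```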
-
# 5.1.7. Kubernetes Container oom killer
Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes. [copy] - alert: KubernetesContainerOomKiller expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1 for: 0m labels: severity: warning annotations: summary: Kubernetes Container oom killer (instance {{ $labels.instance }}) description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.8. Kubernetes Job failed
Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete [copy] - alert: KubernetesJobFailed expr: kube_job_status_failed > 0 for: 0m labels: severity: warning annotations: summary: Kubernetes Job failed (instance {{ $labels.instance }}) description: "Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.9. Kubernetes Job not starting
Job {{ $labels.namespace }}/{{ $labels.job_name }} did not start for 10 minutes [copy] - alert: KubernetesJobNotStarting expr: kube_job_status_active == 0 and kube_job_status_failed == 0 and kube_job_status_succeeded == 0 and (time() - kube_job_status_start_time) > 600 for: 0m labels: severity: warning annotations: summary: Kubernetes Job not starting (instance {{ $labels.instance }}) description: "Job {{ $labels.namespace }}/{{ $labels.job_name }} did not start for 10 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.10. Kubernetes CronJob failing
CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is failing [copy] - alert: KubernetesCronjobFailing expr: (kube_cronjob_status_last_schedule_time > kube_cronjob_status_last_successful_time) AND (kube_cronjob_status_active == 0) AND (kube_cronjob_spec_suspend == 0) for: 0m labels: severity: critical annotations: summary: Kubernetes CronJob failing (instance {{ $labels.instance }}) description: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is failing\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.11. Kubernetes CronJob suspended
CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended [copy] - alert: KubernetesCronjobSuspended expr: kube_cronjob_spec_suspend != 0 for: 0m labels: severity: warning annotations: summary: Kubernetes CronJob suspended (instance {{ $labels.instance }}) description: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.12. Kubernetes PersistentVolumeClaim pending
PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending [copy] - alert: KubernetesPersistentvolumeclaimPending expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1 for: 2m labels: severity: warning annotations: summary: Kubernetes PersistentVolumeClaim pending (instance {{ $labels.instance }}) description: "PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.13. Kubernetes Volume out of disk space
Volume is almost full (< 10% left) [copy] - alert: KubernetesVolumeOutOfDiskSpace expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10 for: 2m labels: severity: warning annotations: summary: Kubernetes Volume out of disk space (instance {{ $labels.instance }}) description: "Volume is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.14. Kubernetes Volume full in four days
Volume under {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available. [copy] - alert: KubernetesVolumeFullInFourDays expr: predict_linear(kubelet_volume_stats_available_bytes[6h:5m], 4 * 24 * 3600) < 0 for: 0m labels: severity: critical annotations: summary: Kubernetes Volume full in four days (instance {{ $labels.instance }}) description: "Volume under {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
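`predict_linear` can extrapolate aggressively on volumes with bursty usage. A common refinement, sketched here under the assumption that false positives on mostly-empty volumes are your main noise source, is to gate the prediction on the current fill level:

```yaml
# Sketch: only predict exhaustion for volumes already more than half full.
- alert: KubernetesVolumeFullInFourDaysFiltered
  expr: predict_linear(kubelet_volume_stats_available_bytes[6h:5m], 4 * 24 * 3600) < 0 and kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.5
  for: 0m
  labels:
    severity: critical
```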
-
# 5.1.15. Kubernetes PersistentVolume error
Persistent volume {{ $labels.persistentvolume }} is in bad state [copy] - alert: KubernetesPersistentvolumeError expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending", job="kube-state-metrics"} > 0 for: 0m labels: severity: critical annotations: summary: Kubernetes PersistentVolume error (instance {{ $labels.instance }}) description: "Persistent volume {{ $labels.persistentvolume }} is in bad state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.16. Kubernetes StatefulSet down
StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} went down [copy] - alert: KubernetesStatefulsetDown expr: kube_statefulset_replicas != kube_statefulset_status_replicas_ready > 0 for: 1m labels: severity: critical annotations: summary: Kubernetes StatefulSet down (instance {{ $labels.instance }}) description: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} went down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.17. Kubernetes HPA scale inability
HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is unable to scale [copy] - alert: KubernetesHpaScaleInability expr: (kube_horizontalpodautoscaler_spec_max_replicas - kube_horizontalpodautoscaler_status_desired_replicas) * on (horizontalpodautoscaler,namespace) (kube_horizontalpodautoscaler_status_condition{condition="ScalingLimited", status="true"} == 1) == 0 for: 2m labels: severity: warning annotations: summary: Kubernetes HPA scale inability (instance {{ $labels.instance }}) description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is unable to scale\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.18. Kubernetes HPA metrics unavailability
HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is unable to collect metrics [copy] - alert: KubernetesHpaMetricsUnavailability expr: kube_horizontalpodautoscaler_status_condition{status="false", condition="ScalingActive"} == 1 for: 0m labels: severity: warning annotations: summary: Kubernetes HPA metrics unavailability (instance {{ $labels.instance }}) description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is unable to collect metrics\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.19. Kubernetes HPA scale maximum
HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has hit maximum number of desired pods [copy] - alert: KubernetesHpaScaleMaximum expr: (kube_horizontalpodautoscaler_status_desired_replicas >= kube_horizontalpodautoscaler_spec_max_replicas) and (kube_horizontalpodautoscaler_spec_max_replicas > 1) and (kube_horizontalpodautoscaler_spec_min_replicas != kube_horizontalpodautoscaler_spec_max_replicas) for: 2m labels: severity: info annotations: summary: Kubernetes HPA scale maximum (instance {{ $labels.instance }}) description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has hit maximum number of desired pods\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.20. Kubernetes HPA underutilized
HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has sat at its minimum replica count for at least 50% of the last day. Potential cost saving here. [copy] - alert: KubernetesHpaUnderutilized expr: max(quantile_over_time(0.5, kube_horizontalpodautoscaler_status_desired_replicas[1d]) == kube_horizontalpodautoscaler_spec_min_replicas) by (horizontalpodautoscaler) > 3 for: 0m labels: severity: info annotations: summary: Kubernetes HPA underutilized (instance {{ $labels.instance }}) description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has sat at its minimum replica count for at least 50% of the last day. Potential cost saving here.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.21. Kubernetes Pod not healthy
Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 15 minutes. [copy] - alert: KubernetesPodNotHealthy expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0 for: 15m labels: severity: critical annotations: summary: Kubernetes Pod not healthy (instance {{ $labels.instance }}) description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 15 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.22. Kubernetes pod crash looping
Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping [copy] - alert: KubernetesPodCrashLooping expr: increase(kube_pod_container_status_restarts_total[1m]) > 3 for: 2m labels: severity: warning annotations: summary: Kubernetes pod crash looping (instance {{ $labels.instance }}) description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.23. Kubernetes ReplicaSet replicas mismatch
ReplicaSet {{ $labels.namespace }}/{{ $labels.replicaset }} replicas mismatch [copy] - alert: KubernetesReplicasetReplicasMismatch expr: kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas for: 10m labels: severity: warning annotations: summary: Kubernetes ReplicaSet replicas mismatch (instance {{ $labels.instance }}) description: "ReplicaSet {{ $labels.namespace }}/{{ $labels.replicaset }} replicas mismatch\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.24. Kubernetes Deployment replicas mismatch
Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch [copy] - alert: KubernetesDeploymentReplicasMismatch expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available for: 10m labels: severity: warning annotations: summary: Kubernetes Deployment replicas mismatch (instance {{ $labels.instance }}) description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.25. Kubernetes StatefulSet replicas mismatch
StatefulSet does not match the expected number of replicas. [copy] - alert: KubernetesStatefulsetReplicasMismatch expr: kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas for: 10m labels: severity: warning annotations: summary: Kubernetes StatefulSet replicas mismatch (instance {{ $labels.instance }}) description: "StatefulSet does not match the expected number of replicas.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.26. Kubernetes Deployment generation mismatch
Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has failed but has not been rolled back. [copy] - alert: KubernetesDeploymentGenerationMismatch expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation for: 10m labels: severity: critical annotations: summary: Kubernetes Deployment generation mismatch (instance {{ $labels.instance }}) description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has failed but has not been rolled back.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.27. Kubernetes StatefulSet generation mismatch
StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has failed but has not been rolled back. [copy] - alert: KubernetesStatefulsetGenerationMismatch expr: kube_statefulset_status_observed_generation != kube_statefulset_metadata_generation for: 10m labels: severity: critical annotations: summary: Kubernetes StatefulSet generation mismatch (instance {{ $labels.instance }}) description: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has failed but has not been rolled back.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.28. Kubernetes StatefulSet update not rolled out
StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out. [copy] - alert: KubernetesStatefulsetUpdateNotRolledOut expr: max without (revision) (kube_statefulset_status_current_revision unless kube_statefulset_status_update_revision) * (kube_statefulset_replicas != kube_statefulset_status_replicas_updated) for: 10m labels: severity: warning annotations: summary: Kubernetes StatefulSet update not rolled out (instance {{ $labels.instance }}) description: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.29. Kubernetes DaemonSet rollout stuck
Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled or not ready [copy] - alert: KubernetesDaemonsetRolloutStuck expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100 or kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0 for: 10m labels: severity: warning annotations: summary: Kubernetes DaemonSet rollout stuck (instance {{ $labels.instance }}) description: "Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled or not ready\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.30. Kubernetes DaemonSet misscheduled
Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run [copy] - alert: KubernetesDaemonsetMisscheduled expr: kube_daemonset_status_number_misscheduled > 0 for: 1m labels: severity: critical annotations: summary: Kubernetes DaemonSet misscheduled (instance {{ $labels.instance }}) description: "Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.31. Kubernetes CronJob too long
CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete. [copy] # Threshold should be customized for each cronjob name. - alert: KubernetesCronjobTooLong expr: kube_job_status_start_time > 0 and absent(kube_job_status_completion_time) and (time() - kube_job_status_start_time) > 3600 for: 0m labels: severity: warning annotations: summary: Kubernetes CronJob too long (instance {{ $labels.instance }}) description: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
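As the comment above notes, the 1h threshold should really vary per job. One way to do that is a dedicated rule per job-name pattern; the `nightly-backup-.*` pattern and 4h threshold below are examples only:

```yaml
# Sketch: looser threshold for a known long-running CronJob.
# The "nightly-backup-.*" job_name pattern is hypothetical.
- alert: KubernetesCronjobTooLongBackup
  expr: kube_job_status_active{job_name=~"nightly-backup-.*"} > 0 and (time() - kube_job_status_start_time{job_name=~"nightly-backup-.*"}) > 14400
  for: 0m
  labels:
    severity: warning
```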
-
# 5.1.32. Kubernetes Job slow completion
Kubernetes Job {{ $labels.namespace }}/{{ $labels.job_name }} did not complete in time. [copy] - alert: KubernetesJobSlowCompletion expr: kube_job_spec_completions - kube_job_status_succeeded - kube_job_status_failed > 0 for: 12h labels: severity: critical annotations: summary: Kubernetes Job slow completion (instance {{ $labels.instance }}) description: "Kubernetes Job {{ $labels.namespace }}/{{ $labels.job_name }} did not complete in time.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.33. Kubernetes API server errors
Kubernetes API server is experiencing high error rate [copy] - alert: KubernetesApiServerErrors expr: sum(rate(apiserver_request_total{job="apiserver",code=~"(?:5..)"}[1m])) by (instance, job) / sum(rate(apiserver_request_total{job="apiserver"}[1m])) by (instance, job) * 100 > 3 for: 2m labels: severity: critical annotations: summary: Kubernetes API server errors (instance {{ $labels.instance }}) description: "Kubernetes API server is experiencing high error rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.34. Kubernetes API client errors
Kubernetes API client is experiencing high error rate [copy] - alert: KubernetesApiClientErrors expr: (sum(rate(rest_client_requests_total{code=~"(4|5).."}[1m])) by (instance, job) / sum(rate(rest_client_requests_total[1m])) by (instance, job)) * 100 > 1 for: 2m labels: severity: critical annotations: summary: Kubernetes API client errors (instance {{ $labels.instance }}) description: "Kubernetes API client is experiencing high error rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.35. Kubernetes client certificate expires next week
A client certificate used to authenticate to the apiserver is expiring next week. [copy] - alert: KubernetesClientCertificateExpiresNextWeek expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 7*24*60*60 for: 0m labels: severity: warning annotations: summary: Kubernetes client certificate expires next week (instance {{ $labels.instance }}) description: "A client certificate used to authenticate to the apiserver is expiring next week.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.36. Kubernetes client certificate expires soon
A client certificate used to authenticate to the apiserver is expiring in less than 24 hours. [copy] - alert: KubernetesClientCertificateExpiresSoon expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 24*60*60 for: 0m labels: severity: critical annotations: summary: Kubernetes client certificate expires soon (instance {{ $labels.instance }}) description: "A client certificate used to authenticate to the apiserver is expiring in less than 24 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.37. Kubernetes API server latency
Kubernetes API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}. [copy] - alert: KubernetesApiServerLatency expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!~"(?:CONNECT|WATCHLIST|WATCH|PROXY)"} [10m])) WITHOUT (subresource)) > 1 for: 2m labels: severity: warning annotations: summary: Kubernetes API server latency (instance {{ $labels.instance }}) description: "Kubernetes API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 5.2. Nomad : Embedded exporter (4 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/nomad/embedded-exporter.yml-
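Downloading a rule file (as with the wget command above) is not sufficient on its own: Prometheus only evaluates files listed under `rule_files` in its configuration, and a reload is needed after changes (SIGHUP, or POST to `/-/reload` when `--web.enable-lifecycle` is set). A minimal excerpt, with an assumed path:

```yaml
# prometheus.yml (excerpt) — the path is illustrative; point rule_files
# at wherever the downloaded rule files were saved.
rule_files:
  - /etc/prometheus/rules/*.yml
```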
# 5.2.1. Nomad job failed
Nomad job failed [copy] - alert: NomadJobFailed expr: nomad_nomad_job_summary_failed > 0 for: 0m labels: severity: warning annotations: summary: Nomad job failed (instance {{ $labels.instance }}) description: "Nomad job failed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.2.2. Nomad job lost
Nomad job lost [copy] - alert: NomadJobLost expr: nomad_nomad_job_summary_lost > 0 for: 0m labels: severity: warning annotations: summary: Nomad job lost (instance {{ $labels.instance }}) description: "Nomad job lost\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.2.3. Nomad job queued
Nomad job queued [copy] - alert: NomadJobQueued expr: nomad_nomad_job_summary_queued > 0 for: 2m labels: severity: warning annotations: summary: Nomad job queued (instance {{ $labels.instance }}) description: "Nomad job queued\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.2.4. Nomad blocked evaluation
Nomad blocked evaluation [copy] - alert: NomadBlockedEvaluation expr: nomad_nomad_blocked_evals_total_blocked > 0 for: 0m labels: severity: warning annotations: summary: Nomad blocked evaluation (instance {{ $labels.instance }}) description: "Nomad blocked evaluation\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 5.3. Consul : prometheus/consul_exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/consul/consul-exporter.yml-
# 5.3.1. Consul service healthcheck failed
Service: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}` [copy] - alert: ConsulServiceHealthcheckFailed expr: consul_catalog_service_node_healthy == 0 for: 1m labels: severity: critical annotations: summary: Consul service healthcheck failed (instance {{ $labels.instance }}) description: "Service: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.3.2. Consul missing master node
The number of Consul raft peers should be 3 in order to preserve quorum. [copy] - alert: ConsulMissingMasterNode expr: consul_raft_peers < 3 for: 0m labels: severity: critical annotations: summary: Consul missing master node (instance {{ $labels.instance }}) description: "The number of Consul raft peers should be 3 in order to preserve quorum.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
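The `< 3` comparison assumes a 3-server cluster. A cluster of N servers keeps quorum while at least ⌊N/2⌋+1 peers remain, but any missing peer already reduces failure tolerance, so it is usually better to alert as soon as the peer count drops below the expected size. A sketch for a 5-server cluster (the count is an assumption; use your actual server count):

```yaml
# Sketch for a 5-server Consul cluster: quorum needs 3 of 5, but alerting
# below 5 catches the loss of failure tolerance early.
- alert: ConsulMissingServerNode
  expr: consul_raft_peers < 5
  for: 0m
  labels:
    severity: critical
```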
-
# 5.3.3. Consul agent unhealthy
A Consul agent is down [copy] - alert: ConsulAgentUnhealthy expr: consul_health_node_status{status="critical"} == 1 for: 0m labels: severity: critical annotations: summary: Consul agent unhealthy (instance {{ $labels.instance }}) description: "A Consul agent is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 5.4. Etcd : Embedded exporter (13 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/etcd/embedded-exporter.yml-
# 5.4.1. Etcd insufficient Members
Etcd cluster should have an odd number of members [copy] - alert: EtcdInsufficientMembers expr: count(etcd_server_id) % 2 == 0 for: 0m labels: severity: critical annotations: summary: Etcd insufficient Members (instance {{ $labels.instance }}) description: "Etcd cluster should have an odd number of members\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.2. Etcd no Leader
Etcd cluster has no leader [copy] - alert: EtcdNoLeader expr: etcd_server_has_leader == 0 for: 0m labels: severity: critical annotations: summary: Etcd no Leader (instance {{ $labels.instance }}) description: "Etcd cluster has no leader\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.3. Etcd high number of leader changes
Etcd leader changed more than 2 times during 10 minutes [copy] - alert: EtcdHighNumberOfLeaderChanges expr: increase(etcd_server_leader_changes_seen_total[10m]) > 2 for: 0m labels: severity: warning annotations: summary: Etcd high number of leader changes (instance {{ $labels.instance }}) description: "Etcd leader changed more than 2 times during 10 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.4. Etcd high number of failed GRPC requests
More than 1% GRPC request failure detected in Etcd [copy] - alert: EtcdHighNumberOfFailedGrpcRequests expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.01 for: 2m labels: severity: warning annotations: summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }}) description: "More than 1% GRPC request failure detected in Etcd\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.5. Etcd high number of failed GRPC requests
More than 5% GRPC request failure detected in Etcd [copy] - alert: EtcdHighNumberOfFailedGrpcRequests expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.05 for: 2m labels: severity: critical annotations: summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }}) description: "More than 5% GRPC request failure detected in Etcd\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
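The 1% warning and 5% critical alerts above evaluate the same ratio twice. A recording rule can compute it once per interval and let both alerts become simple threshold checks — the rule name below is an assumption, following the conventional `level:metric:operations` naming scheme:

```yaml
# Sketch: record the per-method gRPC failure ratio once.
- record: grpc_method:grpc_server_failed:ratio_rate1m
  expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) by (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) by (grpc_service, grpc_method)
# The warning alert then reduces to a threshold on the recorded series
# (and the critical alert to the same expression with > 0.05).
- alert: EtcdHighNumberOfFailedGrpcRequests
  expr: grpc_method:grpc_server_failed:ratio_rate1m > 0.01
  for: 2m
  labels:
    severity: warning
```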
-
# 5.4.6. Etcd GRPC requests slow
GRPC requests slowing down, 99th percentile is over 0.15s [copy] - alert: EtcdGrpcRequestsSlow expr: histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[1m])) by (grpc_service, grpc_method, le)) > 0.15 for: 2m labels: severity: warning annotations: summary: Etcd GRPC requests slow (instance {{ $labels.instance }}) description: "GRPC requests slowing down, 99th percentile is over 0.15s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.7. Etcd high number of failed HTTP requests
More than 1% HTTP failure detected in Etcd [copy] - alert: EtcdHighNumberOfFailedHttpRequests expr: sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.01 for: 2m labels: severity: warning annotations: summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }}) description: "More than 1% HTTP failure detected in Etcd\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.8. Etcd high number of failed HTTP requests
More than 5% HTTP failure detected in Etcd [copy] - alert: EtcdHighNumberOfFailedHttpRequests expr: sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.05 for: 2m labels: severity: critical annotations: summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }}) description: "More than 5% HTTP failure detected in Etcd\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
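Whenever the 5% critical alert above fires, the 1% warning necessarily fires too. Alertmanager inhibition can suppress the redundant warning; the sketch below assumes the `severity` labels used throughout this page and matches on `method`, the grouping label of these HTTP alerts:

```yaml
# alertmanager.yml (excerpt): a firing critical alert silences the
# warning of the same alert name for the same HTTP method.
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ['alertname', 'method']
```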
-
# 5.4.9. Etcd HTTP requests slow
HTTP requests slowing down, 99th percentile is over 0.15s [copy] - alert: EtcdHttpRequestsSlow expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[1m])) > 0.15 for: 2m labels: severity: warning annotations: summary: Etcd HTTP requests slow (instance {{ $labels.instance }}) description: "HTTP requests slowing down, 99th percentile is over 0.15s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.10. Etcd member communication slow
Etcd member communication slowing down, 99th percentile is over 0.15s [copy] - alert: EtcdMemberCommunicationSlow expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[1m])) > 0.15 for: 2m labels: severity: warning annotations: summary: Etcd member communication slow (instance {{ $labels.instance }}) description: "Etcd member communication slowing down, 99th percentile is over 0.15s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.11. Etcd high number of failed proposals
Etcd server has had more than 5 failed proposals in the past hour [copy] - alert: EtcdHighNumberOfFailedProposals expr: increase(etcd_server_proposals_failed_total[1h]) > 5 for: 2m labels: severity: warning annotations: summary: Etcd high number of failed proposals (instance {{ $labels.instance }}) description: "Etcd server has had more than 5 failed proposals in the past hour\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.12. Etcd high fsync durations
Etcd WAL fsync duration increasing, 99th percentile is over 0.5s [copy] - alert: EtcdHighFsyncDurations expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[1m])) > 0.5 for: 2m labels: severity: warning annotations: summary: Etcd high fsync durations (instance {{ $labels.instance }}) description: "Etcd WAL fsync duration increasing, 99th percentile is over 0.5s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.13. Etcd high commit durations
Etcd commit duration increasing, 99th percentile is over 0.25s [copy] - alert: EtcdHighCommitDurations expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[1m])) > 0.25 for: 2m labels: severity: warning annotations: summary: Etcd high commit durations (instance {{ $labels.instance }}) description: "Etcd commit duration increasing, 99th percentile is over 0.25s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 5.5. Linkerd : Embedded exporter (1 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/linkerd/embedded-exporter.yml-
# 5.5.1. Linkerd high error rate
Linkerd error rate for {{ or $labels.deployment $labels.statefulset $labels.daemonset }} is over 10% [copy] # Go templates do not support piping between plain values, so "or" is used to pick the first non-empty workload label. - alert: LinkerdHighErrorRate expr: sum(rate(request_errors_total[1m])) by (deployment, statefulset, daemonset) / sum(rate(request_total[1m])) by (deployment, statefulset, daemonset) * 100 > 10 for: 1m labels: severity: warning annotations: summary: Linkerd high error rate (instance {{ $labels.instance }}) description: "Linkerd error rate for {{ or $labels.deployment $labels.statefulset $labels.daemonset }} is over 10%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 5.6. Istio : Embedded exporter (10 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/istio/embedded-exporter.yml-
# 5.6.1. Istio Kubernetes gateway availability drop
The number of available gateway pods has dropped. Inbound traffic will likely be affected. [copy] - alert: IstioKubernetesGatewayAvailabilityDrop expr: min(kube_deployment_status_replicas_available{deployment="istio-ingressgateway", namespace="istio-system"}) without (instance, pod) < 2 for: 1m labels: severity: warning annotations: summary: Istio Kubernetes gateway availability drop (instance {{ $labels.instance }}) description: "The number of available gateway pods has dropped. Inbound traffic will likely be affected.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.6.2. Istio Pilot high total request rate
Number of Istio Pilot push errors is too high (> 5%). Envoy sidecars might have outdated configuration. [copy] - alert: IstioPilotHighTotalRequestRate expr: sum(rate(pilot_xds_push_errors[1m])) / sum(rate(pilot_xds_pushes[1m])) * 100 > 5 for: 1m labels: severity: warning annotations: summary: Istio Pilot high total request rate (instance {{ $labels.instance }}) description: "Number of Istio Pilot push errors is too high (> 5%). Envoy sidecars might have outdated configuration.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.6.3. Istio Mixer Prometheus dispatches low
Number of Mixer dispatches to Prometheus is too low. Istio metrics might not be exported properly. [copy] - alert: IstioMixerPrometheusDispatchesLow expr: sum(rate(mixer_runtime_dispatches_total{adapter=~"prometheus"}[1m])) < 180 for: 1m labels: severity: warning annotations: summary: Istio Mixer Prometheus dispatches low (instance {{ $labels.instance }}) description: "Number of Mixer dispatches to Prometheus is too low. Istio metrics might not be exported properly.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.6.4. Istio high total request rate
Global request rate in the service mesh is unusually high. [copy] - alert: IstioHighTotalRequestRate expr: sum(rate(istio_requests_total{reporter="destination"}[5m])) > 1000 for: 2m labels: severity: warning annotations: summary: Istio high total request rate (instance {{ $labels.instance }}) description: "Global request rate in the service mesh is unusually high.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.6.5. Istio low total request rate
Global request rate in the service mesh is unusually low. [copy] - alert: IstioLowTotalRequestRate expr: sum(rate(istio_requests_total{reporter="destination"}[5m])) < 100 for: 2m labels: severity: warning annotations: summary: Istio low total request rate (instance {{ $labels.instance }}) description: "Global request rate in the service mesh is unusually low.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
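The fixed 1000 and 100 req/s thresholds in the two rules above rarely transfer between meshes. A hedged alternative is a relative threshold comparing the current rate against the same time yesterday — the 2x factor, the 1d offset, and the alert name below are illustrative assumptions:

```yaml
# Sketch: fire when mesh-wide traffic is more than double what it was
# at the same time yesterday. Factor and offset are illustrative.
- alert: IstioRequestRateSpike
  expr: sum(rate(istio_requests_total{reporter="destination"}[5m])) > 2 * sum(rate(istio_requests_total{reporter="destination"}[5m] offset 1d))
  for: 10m
  labels:
    severity: warning
```

The same pattern with `<` and a fractional factor covers the unusually-low case; both sides assume traffic has a roughly daily rhythm.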
-
# 5.6.6. Istio high 4xx error rate
High percentage of HTTP 4xx responses in Istio (> 5%). [copy] - alert: IstioHigh4xxErrorRate expr: sum(rate(istio_requests_total{reporter="destination", response_code=~"4.*"}[5m])) / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100 > 5 for: 1m labels: severity: warning annotations: summary: Istio high 4xx error rate (instance {{ $labels.instance }}) description: "High percentage of HTTP 4xx responses in Istio (> 5%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.6.7. Istio high 5xx error rate
High percentage of HTTP 5xx responses in Istio (> 5%). [copy] - alert: IstioHigh5xxErrorRate expr: sum(rate(istio_requests_total{reporter="destination", response_code=~"5.*"}[5m])) / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100 > 5 for: 1m labels: severity: warning annotations: summary: Istio high 5xx error rate (instance {{ $labels.instance }}) description: "High percentage of HTTP 5xx responses in Istio (> 5%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.6.8. Istio high request latency
Istio average request duration is longer than 100ms. [copy] - alert: IstioHighRequestLatency expr: rate(istio_request_duration_milliseconds_sum{reporter="destination"}[1m]) / rate(istio_request_duration_milliseconds_count{reporter="destination"}[1m]) > 100 for: 1m labels: severity: warning annotations: summary: Istio high request latency (instance {{ $labels.instance }}) description: "Istio average request duration is longer than 100ms.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.6.9. Istio latency 99 percentile
The slowest 1% of Istio requests take longer than 1000ms. [copy] - alert: IstioLatency99Percentile expr: histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[1m])) by (destination_canonical_service, destination_workload_namespace, source_canonical_service, source_workload_namespace, le)) > 1000 for: 1m labels: severity: warning annotations: summary: Istio latency 99 percentile (instance {{ $labels.instance }}) description: "The slowest 1% of Istio requests take longer than 1000ms.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.6.10. Istio Pilot Duplicate Entry
Istio pilot duplicate entry error. [copy] - alert: IstioPilotDuplicateEntry expr: sum(rate(pilot_duplicate_envoy_clusters{}[5m])) > 0 for: 0m labels: severity: critical annotations: summary: Istio Pilot Duplicate Entry (instance {{ $labels.instance }}) description: "Istio pilot duplicate entry error.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 5.7. ArgoCD : Embedded exporter (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/argocd/embedded-exporter.yml-
# 5.7.1. ArgoCD service not synced
Service {{ $labels.name }} managed by ArgoCD is currently not in sync. [copy] - alert: ArgocdServiceNotSynced expr: argocd_app_info{sync_status!="Synced"} != 0 for: 15m labels: severity: warning annotations: summary: ArgoCD service not synced (instance {{ $labels.instance }}) description: "Service {{ $labels.name }} managed by ArgoCD is currently not in sync.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.7.2. ArgoCD service unhealthy
Service {{ $labels.name }} managed by ArgoCD is currently not healthy. [copy] - alert: ArgocdServiceUnhealthy expr: argocd_app_info{health_status!="Healthy"} != 0 for: 15m labels: severity: warning annotations: summary: ArgoCD service unhealthy (instance {{ $labels.instance }}) description: "Service {{ $labels.name }} managed by ArgoCD is currently not healthy.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
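Applications that are intentionally left out of sync (paused experiments, apps with auto-sync disabled) can be excluded from the sync alert with a label matcher. The name pattern below is purely illustrative:

```yaml
# Sketch: ignore applications whose names mark them as intentionally
# unsynced. Replace the pattern with whatever convention you use.
- alert: ArgocdServiceNotSynced
  expr: argocd_app_info{sync_status!="Synced", name!~"experimental-.*"} != 0
  for: 15m
  labels:
    severity: warning
```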
-
-
# 5.8. FluxCD : Embedded exporter (4 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/fluxcd/embedded-exporter.yml-
# 5.8.1. Flux Kustomization Failure
The {{ $labels.customresource_kind }} '{{ $labels.name }}' in namespace {{ $labels.exported_namespace }} is marked as not ready. [copy] - alert: FluxKustomizationFailure expr: gotk_resource_info{ready="False", customresource_kind="Kustomization"} > 0 for: 15m labels: severity: warning annotations: summary: Flux Kustomization Failure (instance {{ $labels.instance }}) description: "The {{ $labels.customresource_kind }} '{{ $labels.name }}' in namespace {{ $labels.exported_namespace }} is marked as not ready.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.8.2. Flux HelmRelease Failure
The {{ $labels.customresource_kind }} '{{ $labels.name }}' in namespace {{ $labels.exported_namespace }} is marked as not ready. [copy] - alert: FluxHelmreleaseFailure expr: gotk_resource_info{ready="False", customresource_kind="HelmRelease"} > 0 for: 15m labels: severity: warning annotations: summary: Flux HelmRelease Failure (instance {{ $labels.instance }}) description: "The {{ $labels.customresource_kind }} '{{ $labels.name }}' in namespace {{ $labels.exported_namespace }} is marked as not ready.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.8.3. Flux Source Issue
Flux source {{ $labels.customresource_kind }} '{{ $labels.name }}' has issue(s). [copy] - alert: FluxSourceIssue expr: gotk_resource_info{ready="False", customresource_kind=~"GitRepository|HelmRepository|Bucket|OCIRepository"} > 0 for: 15m labels: severity: warning annotations: summary: Flux Source Issue (instance {{ $labels.instance }}) description: "Flux source {{ $labels.customresource_kind }} '{{ $labels.name }}' has issue(s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.8.4. Flux Image Issue
The {{ $labels.customresource_kind }} '{{ $labels.name }}' is marked as not ready. [copy] - alert: FluxImageIssue expr: gotk_resource_info{ready="False", customresource_kind=~"ImagePolicy|ImageRepository|ImageUpdateAutomation"} > 0 for: 15m labels: severity: warning annotations: summary: Flux Image Issue (instance {{ $labels.instance }}) description: "The {{ $labels.customresource_kind }} '{{ $labels.name }}' is marked as not ready.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 5.9. OpenStack : openstack-exporter/openstack-exporter (20 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/openstack/openstack-exporter.yml-
# 5.9.1. OpenStack exporter down
The OpenStack exporter is down. OpenStack cloud metrics are no longer being collected. [copy] - alert: OpenstackExporterDown expr: up{job=~".*openstack.*"} == 0 for: 2m labels: severity: critical annotations: summary: OpenStack exporter down (instance {{ $labels.instance }}) description: "The OpenStack exporter is down. OpenStack cloud metrics are no longer being collected.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.2. OpenStack Nova agent down
Nova agent {{ $labels.hostname }} ({{ $labels.service }}) is down in zone {{ $labels.zone }} [copy] - alert: OpenstackNovaAgentDown expr: openstack_nova_agent_state{adminState="enabled"} == 0 for: 2m labels: severity: critical annotations: summary: OpenStack Nova agent down (instance {{ $labels.instance }}) description: "Nova agent {{ $labels.hostname }} ({{ $labels.service }}) is down in zone {{ $labels.zone }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.3. OpenStack Neutron agent down
Neutron agent {{ $labels.hostname }} ({{ $labels.service }}) is down [copy] - alert: OpenstackNeutronAgentDown expr: openstack_neutron_agent_state{adminState="enabled"} == 0 for: 2m labels: severity: critical annotations: summary: OpenStack Neutron agent down (instance {{ $labels.instance }}) description: "Neutron agent {{ $labels.hostname }} ({{ $labels.service }}) is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.4. OpenStack Cinder agent down
Cinder agent {{ $labels.hostname }} ({{ $labels.service }}) is down in zone {{ $labels.zone }} [copy] - alert: OpenstackCinderAgentDown expr: openstack_cinder_agent_state{adminState="enabled"} == 0 for: 2m labels: severity: critical annotations: summary: OpenStack Cinder agent down (instance {{ $labels.instance }}) description: "Cinder agent {{ $labels.hostname }} ({{ $labels.service }}) is down in zone {{ $labels.zone }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.5. OpenStack hypervisor high vCPU usage
Hypervisor {{ $labels.hostname }} vCPU usage is above 90% [copy] # The threshold of 90% is a rough default. Adjust based on your overcommit ratio and workload patterns. - alert: OpenstackHypervisorHighVcpuUsage expr: openstack_nova_vcpus_used / openstack_nova_vcpus_available > 0.9 and openstack_nova_vcpus_available > 0 for: 5m labels: severity: warning annotations: summary: OpenStack hypervisor high vCPU usage (instance {{ $labels.instance }}) description: "Hypervisor {{ $labels.hostname }} vCPU usage is above 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.6. OpenStack hypervisor high memory usage
Hypervisor {{ $labels.hostname }} memory usage is above 90% [copy] # The threshold of 90% is a rough default. Adjust based on your overcommit ratio and workload patterns. - alert: OpenstackHypervisorHighMemoryUsage expr: openstack_nova_memory_used_bytes / openstack_nova_memory_available_bytes > 0.9 and openstack_nova_memory_available_bytes > 0 for: 5m labels: severity: warning annotations: summary: OpenStack hypervisor high memory usage (instance {{ $labels.instance }}) description: "Hypervisor {{ $labels.hostname }} memory usage is above 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.7. OpenStack hypervisor high disk usage
Hypervisor {{ $labels.hostname }} local disk usage is above 90% [copy] - alert: OpenstackHypervisorHighDiskUsage expr: openstack_nova_local_storage_used_bytes / openstack_nova_local_storage_available_bytes > 0.9 and openstack_nova_local_storage_available_bytes > 0 for: 5m labels: severity: warning annotations: summary: OpenStack hypervisor high disk usage (instance {{ $labels.instance }}) description: "Hypervisor {{ $labels.hostname }} local disk usage is above 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.8. OpenStack Nova tenant vCPU quota nearly exhausted
Tenant {{ $labels.tenant }} has used over 90% of its vCPU quota [copy] # A value of -1 for limits_vcpus_max means unlimited quota (no limit set). - alert: OpenstackNovaTenantVcpuQuotaNearlyExhausted expr: openstack_nova_limits_vcpus_used / openstack_nova_limits_vcpus_max > 0.9 and openstack_nova_limits_vcpus_max > 0 for: 0m labels: severity: warning annotations: summary: OpenStack Nova tenant vCPU quota nearly exhausted (instance {{ $labels.instance }}) description: "Tenant {{ $labels.tenant }} has used over 90% of its vCPU quota\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.9. OpenStack Nova tenant memory quota nearly exhausted
Tenant {{ $labels.tenant }} has used over 90% of its memory quota [copy] - alert: OpenstackNovaTenantMemoryQuotaNearlyExhausted expr: openstack_nova_limits_memory_used / openstack_nova_limits_memory_max > 0.9 and openstack_nova_limits_memory_max > 0 for: 0m labels: severity: warning annotations: summary: OpenStack Nova tenant memory quota nearly exhausted (instance {{ $labels.instance }}) description: "Tenant {{ $labels.tenant }} has used over 90% of its memory quota\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.10. OpenStack Nova tenant instance quota nearly exhausted
Tenant {{ $labels.tenant }} has used over 90% of its instance quota [copy] - alert: OpenstackNovaTenantInstanceQuotaNearlyExhausted expr: openstack_nova_limits_instances_used / openstack_nova_limits_instances_max > 0.9 and openstack_nova_limits_instances_max > 0 for: 0m labels: severity: warning annotations: summary: OpenStack Nova tenant instance quota nearly exhausted (instance {{ $labels.instance }}) description: "Tenant {{ $labels.tenant }} has used over 90% of its instance quota\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.11. OpenStack Cinder tenant volume quota nearly exhausted
Tenant {{ $labels.tenant }} has used over 90% of its volume storage quota [copy] - alert: OpenstackCinderTenantVolumeQuotaNearlyExhausted expr: openstack_cinder_limits_volume_used_gb / openstack_cinder_limits_volume_max_gb > 0.9 and openstack_cinder_limits_volume_max_gb > 0 for: 0m labels: severity: warning annotations: summary: OpenStack Cinder tenant volume quota nearly exhausted (instance {{ $labels.instance }}) description: "Tenant {{ $labels.tenant }} has used over 90% of its volume storage quota\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.12. OpenStack Cinder pool low free capacity
Cinder storage pool {{ $labels.name }} has less than 10% free capacity [copy] - alert: OpenstackCinderPoolLowFreeCapacity expr: openstack_cinder_pool_capacity_free_gb / openstack_cinder_pool_capacity_total_gb < 0.1 and openstack_cinder_pool_capacity_total_gb > 0 for: 5m labels: severity: warning annotations: summary: OpenStack Cinder pool low free capacity (instance {{ $labels.instance }}) description: "Cinder storage pool {{ $labels.name }} has less than 10% free capacity\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.13. OpenStack Neutron floating IPs associated but not active
{{ $value }} floating IPs are associated with a private IP but are not in ACTIVE state [copy] - alert: OpenstackNeutronFloatingIpsAssociatedButNotActive expr: openstack_neutron_floating_ips_associated_not_active > 0 for: 5m labels: severity: warning annotations: summary: OpenStack Neutron floating IPs associated but not active (instance {{ $labels.instance }}) description: "{{ $value }} floating IPs are associated with a private IP but are not in ACTIVE state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.14. OpenStack Neutron routers not active
{{ $value }} Neutron routers are not in ACTIVE state [copy] - alert: OpenstackNeutronRoutersNotActive expr: openstack_neutron_routers_not_active > 0 for: 5m labels: severity: warning annotations: summary: OpenStack Neutron routers not active (instance {{ $labels.instance }}) description: "{{ $value }} Neutron routers are not in ACTIVE state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.15. OpenStack Neutron subnet IP pool exhaustion
Subnet {{ $labels.subnet_name }} on network {{ $labels.network_name }} has used over 90% of its IP pool [copy] - alert: OpenstackNeutronSubnetIpPoolExhaustion expr: openstack_neutron_network_ip_availabilities_used / openstack_neutron_network_ip_availabilities_total > 0.9 and openstack_neutron_network_ip_availabilities_total > 0 for: 0m labels: severity: warning annotations: summary: OpenStack Neutron subnet IP pool exhaustion (instance {{ $labels.instance }}) description: "Subnet {{ $labels.subnet_name }} on network {{ $labels.network_name }} has used over 90% of its IP pool\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.16. OpenStack Neutron ports without IPs
{{ $value }} active ports have no IP addresses assigned [copy] - alert: OpenstackNeutronPortsWithoutIps expr: openstack_neutron_ports_no_ips > 0 for: 5m labels: severity: warning annotations: summary: OpenStack Neutron ports without IPs (instance {{ $labels.instance }}) description: "{{ $value }} active ports have no IP addresses assigned\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.17. OpenStack load balancer not online
Load balancer {{ $labels.name }} ({{ $labels.id }}) operating status is {{ $labels.operating_status }} [copy] - alert: OpenstackLoadBalancerNotOnline expr: openstack_loadbalancer_loadbalancer_status{operating_status!="ONLINE"} > 0 for: 5m labels: severity: warning annotations: summary: OpenStack load balancer not online (instance {{ $labels.instance }}) description: "Load balancer {{ $labels.name }} ({{ $labels.id }}) operating status is {{ $labels.operating_status }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.18. OpenStack Nova instances in ERROR state
{{ $value }} Nova instances are in ERROR state [copy] - alert: OpenstackNovaInstancesInErrorState expr: sum(openstack_nova_server_status{status="ERROR"}) > 0 for: 5m labels: severity: warning annotations: summary: OpenStack Nova instances in ERROR state (instance {{ $labels.instance }}) description: "{{ $value }} Nova instances are in ERROR state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.19. OpenStack Cinder volumes in error state
{{ $value }} Cinder volumes are in an error state [copy] - alert: OpenstackCinderVolumesInErrorState expr: openstack_cinder_volume_status_counter{status=~"error.*"} > 0 for: 5m labels: severity: warning annotations: summary: OpenStack Cinder volumes in error state (instance {{ $labels.instance }}) description: "{{ $value }} Cinder volumes are in an error state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.20. OpenStack placement resource high usage
Resource {{ $labels.resourcetype }} on host {{ $labels.hostname }} usage exceeds 90% of its allocation [copy] # This alert factors in the allocation ratio to compute effective capacity. # The threshold of 90% is a rough default. Adjust based on your allocation ratios and workload patterns. - alert: OpenstackPlacementResourceHighUsage expr: openstack_placement_resource_usage / (openstack_placement_resource_total * openstack_placement_resource_allocation_ratio) > 0.9 and openstack_placement_resource_total > 0 for: 5m labels: severity: warning annotations: summary: OpenStack placement resource high usage (instance {{ $labels.instance }}) description: "Resource {{ $labels.resourcetype }} on host {{ $labels.hostname }} usage exceeds 90% of its allocation\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
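The expression above scales raw capacity by the allocation ratio before comparing usage against it. A minimal Python sketch of that arithmetic, with illustrative numbers (the function name and figures are ours, not the exporter's):

```python
def placement_usage_ratio(usage: float, total: float, allocation_ratio: float) -> float:
    """Fraction of effective capacity in use, mirroring
    usage / (total * allocation_ratio) from the alert expression."""
    effective_capacity = total * allocation_ratio  # e.g. 16x CPU overcommit
    return usage / effective_capacity

# A 32-core host with a 16x cpu_allocation_ratio can place 512 vCPUs.
# With 480 vCPUs allocated, usage is 480 / 512 = 0.9375 -> above the 0.9 threshold.
ratio = placement_usage_ratio(usage=480, total=32, allocation_ratio=16.0)
```

Without the allocation-ratio factor, the same host would misleadingly report 480/32 = 1500% CPU usage.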
-
-
# 5.10. Spinnaker : Embedded exporter (12 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/spinnaker/embedded-exporter.yml-
# 5.10.1. Spinnaker circuit breaker open
Circuit breaker {{ $labels.name }} is open on {{ $labels.instance }}, indicating repeated downstream failures. [copy] - alert: SpinnakerCircuitBreakerOpen expr: resilience4j_circuitbreaker_state{state="open"} == 1 for: 5m labels: severity: warning annotations: summary: Spinnaker circuit breaker open (instance {{ $labels.instance }}) description: "Circuit breaker {{ $labels.name }} is open on {{ $labels.instance }}, indicating repeated downstream failures.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.10.2. Spinnaker Orca queue backing up
Orca work queue has {{ $value }} messages ready for delivery but not yet picked up. Pipeline executions may be delayed. [copy] # In a healthy Spinnaker, queue_ready_depth should stay at or near 0. # Sustained non-zero values indicate Orca cannot keep up with incoming work. - alert: SpinnakerOrcaQueueBackingUp expr: queue_ready_depth > 0 for: 5m labels: severity: warning annotations: summary: Spinnaker Orca queue backing up (instance {{ $labels.instance }}) description: "Orca work queue has {{ $value }} messages ready for delivery but not yet picked up. Pipeline executions may be delayed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.10.3. Spinnaker Orca queue message lag high
Orca queue message lag is {{ $value }}s. Pipeline stages are waiting too long before being processed. [copy] # The 30s threshold is a rough default. Adjust based on your pipeline SLOs. - alert: SpinnakerOrcaQueueMessageLagHigh expr: rate(queue_message_lag_seconds_sum[5m]) / rate(queue_message_lag_seconds_count[5m]) > 30 for: 5m labels: severity: warning annotations: summary: Spinnaker Orca queue message lag high (instance {{ $labels.instance }}) description: "Orca queue message lag is {{ $value }}s. Pipeline stages are waiting too long before being processed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
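The `rate(..._sum) / rate(..._count)` idiom yields the average lag per message over the window. A rough sketch of that computation from two counter samples (names and numbers are illustrative):

```python
def avg_lag_seconds(sum_t0: float, sum_t1: float,
                    count_t0: float, count_t1: float,
                    window_s: float) -> float:
    """Average per-message lag over a window, computed the way PromQL's
    rate(..._sum[5m]) / rate(..._count[5m]) division does."""
    rate_sum = (sum_t1 - sum_t0) / window_s        # lag-seconds accrued per second
    rate_count = (count_t1 - count_t0) / window_s  # messages handled per second
    return rate_sum / rate_count

# Over a 300s window the lag sum grew by 9000s across 200 messages:
# average lag = 9000 / 200 = 45s -> above the 30s threshold.
lag = avg_lag_seconds(0, 9000, 0, 200, 300)
```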
-
# 5.10.4. Spinnaker dead messages
Orca is producing dead-lettered messages ({{ $value }} per second). These are tasks that exhausted all retries and will not be executed. [copy] - alert: SpinnakerDeadMessages expr: rate(queue_dead_messages_total[5m]) > 0 for: 2m labels: severity: critical annotations: summary: Spinnaker dead messages (instance {{ $labels.instance }}) description: "Orca is producing dead-lettered messages ({{ $value }} per second). These are tasks that exhausted all retries and will not be executed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.10.5. Spinnaker zombie executions
{{ $value }} zombie pipeline executions detected. These are executions with no corresponding queue messages. [copy] # Zombies are pipeline executions that are running but have lost their queue entry. # See https://spinnaker.io/docs/guides/runbooks/orca-zombie-executions/ - alert: SpinnakerZombieExecutions expr: rate(queue_zombies_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Spinnaker zombie executions (instance {{ $labels.instance }}) description: "{{ $value }} zombie pipeline executions detected. These are executions with no corresponding queue messages.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.10.6. Spinnaker thread pool exhaustion
Orca message handler thread pool has {{ $value }} blocked threads on {{ $labels.instance }}. Pipeline execution throughput is degraded. [copy] - alert: SpinnakerThreadPoolExhaustion expr: threadpool_blockingQueueSize > 0 for: 5m labels: severity: warning annotations: summary: Spinnaker thread pool exhaustion (instance {{ $labels.instance }}) description: "Orca message handler thread pool has {{ $value }} blocked threads on {{ $labels.instance }}. Pipeline execution throughput is degraded.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.10.7. Spinnaker polling monitor items over threshold
Igor polling monitor {{ $labels.monitor }} for {{ $labels.partition }} has exceeded its item threshold, preventing pipeline triggers. [copy] # When this threshold is exceeded, Igor stops triggering pipelines for the affected monitor. # See https://kb.armory.io/s/article/Hitting-Igor-s-caching-thresholds - alert: SpinnakerPollingMonitorItemsOverThreshold expr: sum by (monitor, partition) (pollingMonitor_itemsOverThreshold) > 0 for: 5m labels: severity: critical annotations: summary: Spinnaker polling monitor items over threshold (instance {{ $labels.instance }}) description: "Igor polling monitor {{ $labels.monitor }} for {{ $labels.partition }} has exceeded its item threshold, preventing pipeline triggers.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.10.8. Spinnaker polling monitor failures
Igor polling monitor is experiencing failures ({{ $value }} per second). CI/SCM integrations may not trigger pipelines. [copy] - alert: SpinnakerPollingMonitorFailures expr: rate(pollingMonitor_failed_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Spinnaker polling monitor failures (instance {{ $labels.instance }}) description: "Igor polling monitor is experiencing failures ({{ $value }} per second). CI/SCM integrations may not trigger pipelines.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.10.9. Spinnaker high API error rate
Spinnaker API 5xx error rate is {{ $value | humanizePercentage }} on {{ $labels.instance }}. [copy] # The 5% threshold is a rough default. Adjust based on your traffic patterns. - alert: SpinnakerHighApiErrorRate expr: sum by (instance) (rate(controller_invocations_total{status="5xx"}[5m])) / sum by (instance) (rate(controller_invocations_total[5m])) > 0.05 and sum by (instance) (rate(controller_invocations_total[5m])) > 0 for: 5m labels: severity: warning annotations: summary: Spinnaker high API error rate (instance {{ $labels.instance }}) description: "Spinnaker API 5xx error rate is {{ $value | humanizePercentage }} on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
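The trailing `and sum(...) > 0` clause keeps the ratio from evaluating to 0/0 during quiet periods. The same guard, sketched in Python (function name and numbers are illustrative):

```python
def api_error_rate_firing(errors_5xx_per_s: float, total_per_s: float,
                          threshold: float = 0.05) -> bool:
    """True when the 5xx ratio exceeds the threshold and traffic exists,
    mirroring the PromQL `ratio > 0.05 and total_rate > 0` pattern."""
    if total_per_s <= 0:
        return False  # no traffic: the ratio is undefined, stay silent
    return errors_5xx_per_s / total_per_s > threshold

# 3 errors/s out of 40 req/s = 7.5% -> fires; an idle instance never fires.
```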
-
# 5.10.10. Spinnaker API rate limit throttling
Gate is actively throttling API requests on {{ $labels.instance }} ({{ $value }} throttled requests per second). [copy] - alert: SpinnakerApiRateLimitThrottling expr: rate(rateLimitThrottling_total[5m]) > 0 for: 2m labels: severity: warning annotations: summary: Spinnaker API rate limit throttling (instance {{ $labels.instance }}) description: "Gate is actively throttling API requests on {{ $labels.instance }} ({{ $value }} throttled requests per second).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.10.11. Spinnaker Clouddriver high error rate
Clouddriver 5xx error rate is {{ $value | humanizePercentage }} on {{ $labels.instance }}. Cloud operations may be failing. [copy] - alert: SpinnakerClouddriverHighErrorRate expr: sum by (instance) (rate(controller_invocations_total{status="5xx", job=~".*clouddriver.*"}[5m])) / sum by (instance) (rate(controller_invocations_total{job=~".*clouddriver.*"}[5m])) > 0.05 and sum by (instance) (rate(controller_invocations_total{job=~".*clouddriver.*"}[5m])) > 0 for: 5m labels: severity: warning annotations: summary: Spinnaker Clouddriver high error rate (instance {{ $labels.instance }}) description: "Clouddriver 5xx error rate is {{ $value | humanizePercentage }} on {{ $labels.instance }}. Cloud operations may be failing.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.10.12. Spinnaker AWS rate limiting
Clouddriver is being rate-limited by AWS on {{ $labels.instance }} ({{ $value }}ms delay). Cloud operations will be slower. [copy] # This metric is specific to AWS cloud providers in Clouddriver. # The 1000ms threshold is a rough default. Adjust based on your AWS usage patterns. - alert: SpinnakerAwsRateLimiting expr: amazonClientProvider_rateLimitDelayMil > 1000 for: 5m labels: severity: warning annotations: summary: Spinnaker AWS rate limiting (instance {{ $labels.instance }}) description: "Clouddriver is being rate-limited by AWS on {{ $labels.instance }} ({{ $value }}ms delay). Cloud operations will be slower.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.1. Ceph : Embedded exporter (13 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/ceph/embedded-exporter.yml-
# 6.1.1. Ceph State
Ceph instance unhealthy [copy] - alert: CephState expr: ceph_health_status != 0 for: 0m labels: severity: critical annotations: summary: Ceph State (instance {{ $labels.instance }}) description: "Ceph instance unhealthy\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.2. Ceph monitor clock skew
Ceph monitor clock skew detected. Please check ntp and hardware clock settings [copy] - alert: CephMonitorClockSkew expr: abs(ceph_monitor_clock_skew_seconds) > 0.2 for: 2m labels: severity: warning annotations: summary: Ceph monitor clock skew (instance {{ $labels.instance }}) description: "Ceph monitor clock skew detected. Please check ntp and hardware clock settings\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.3. Ceph monitor low space
Ceph monitor storage is low. [copy] - alert: CephMonitorLowSpace expr: ceph_monitor_avail_percent < 10 for: 2m labels: severity: warning annotations: summary: Ceph monitor low space (instance {{ $labels.instance }}) description: "Ceph monitor storage is low.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.4. Ceph OSD Down
Ceph Object Storage Daemon Down [copy] - alert: CephOsdDown expr: ceph_osd_up == 0 for: 0m labels: severity: critical annotations: summary: Ceph OSD Down (instance {{ $labels.instance }}) description: "Ceph Object Storage Daemon Down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.5. Ceph high OSD latency
Ceph Object Storage Daemon latency is high. Please check whether it is stuck in an abnormal state. [copy] - alert: CephHighOsdLatency expr: ceph_osd_perf_apply_latency_seconds > 5 for: 1m labels: severity: warning annotations: summary: Ceph high OSD latency (instance {{ $labels.instance }}) description: "Ceph Object Storage Daemon latency is high. Please check whether it is stuck in an abnormal state.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.6. Ceph OSD low space
Ceph Object Storage Daemon is going out of space. Please add more disks. [copy] - alert: CephOsdLowSpace expr: ceph_osd_utilization > 90 for: 2m labels: severity: warning annotations: summary: Ceph OSD low space (instance {{ $labels.instance }}) description: "Ceph Object Storage Daemon is going out of space. Please add more disks.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.7. Ceph OSD reweighted
A Ceph Object Storage Daemon has been reweighted (weight < 1) and may be taking too long to rebalance. [copy] - alert: CephOsdReweighted expr: ceph_osd_weight < 1 for: 2m labels: severity: warning annotations: summary: Ceph OSD reweighted (instance {{ $labels.instance }}) description: "A Ceph Object Storage Daemon has been reweighted (weight < 1) and may be taking too long to rebalance.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.8. Ceph PG down
Some Ceph placement groups are down. Please ensure that all the data are available. [copy] - alert: CephPgDown expr: ceph_pg_down > 0 for: 0m labels: severity: critical annotations: summary: Ceph PG down (instance {{ $labels.instance }}) description: "Some Ceph placement groups are down. Please ensure that all the data are available.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.9. Ceph PG incomplete
Some Ceph placement groups are incomplete. Please ensure that all the data are available. [copy] - alert: CephPgIncomplete expr: ceph_pg_incomplete > 0 for: 0m labels: severity: critical annotations: summary: Ceph PG incomplete (instance {{ $labels.instance }}) description: "Some Ceph placement groups are incomplete. Please ensure that all the data are available.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.10. Ceph PG inconsistent
Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes. [copy] - alert: CephPgInconsistent expr: ceph_pg_inconsistent > 0 for: 0m labels: severity: warning annotations: summary: Ceph PG inconsistent (instance {{ $labels.instance }}) description: "Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.11. Ceph PG activation long
Some Ceph placement groups are taking too long to activate. [copy] - alert: CephPgActivationLong expr: ceph_pg_activating > 0 for: 2m labels: severity: warning annotations: summary: Ceph PG activation long (instance {{ $labels.instance }}) description: "Some Ceph placement groups are taking too long to activate.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.12. Ceph PG backfill full
Some Ceph placement groups are located on a full Object Storage Daemon. These PGs may become unavailable shortly. Please check the OSDs, adjust weights, or reconfigure the CRUSH rules. [copy] - alert: CephPgBackfillFull expr: ceph_pg_backfill_toofull > 0 for: 2m labels: severity: warning annotations: summary: Ceph PG backfill full (instance {{ $labels.instance }}) description: "Some Ceph placement groups are located on a full Object Storage Daemon. These PGs may become unavailable shortly. Please check the OSDs, adjust weights, or reconfigure the CRUSH rules.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.13. Ceph PG unavailable
Some Ceph placement groups are unavailable. [copy] - alert: CephPgUnavailable expr: ceph_pg_total - ceph_pg_active > 0 for: 0m labels: severity: critical annotations: summary: Ceph PG unavailable (instance {{ $labels.instance }}) description: "Some Ceph placement groups are unavailable.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.2. SpeedTest : Speedtest exporter (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/speedtest/nlamirault-speedtest-exporter.yml-
# 6.2.1. SpeedTest Slow Internet Download
Internet download speed is currently {{humanize $value}} Mbps. [copy] - alert: SpeedtestSlowInternetDownload expr: avg_over_time(speedtest_download[10m]) < 100 for: 0m labels: severity: warning annotations: summary: SpeedTest Slow Internet Download (instance {{ $labels.instance }}) description: "Internet download speed is currently {{humanize $value}} Mbps.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.2.2. SpeedTest Slow Internet Upload
Internet upload speed is currently {{humanize $value}} Mbps. [copy] - alert: SpeedtestSlowInternetUpload expr: avg_over_time(speedtest_upload[10m]) < 20 for: 0m labels: severity: warning annotations: summary: SpeedTest Slow Internet Upload (instance {{ $labels.instance }}) description: "Internet upload speed is currently {{humanize $value}} Mbps.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.3.1. ZFS : node-exporter (1 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/zfs/node-exporter.yml-
# 6.3.1.1. ZFS offline pool
A ZFS zpool is in an unexpected state: {{ $labels.state }}. [copy] - alert: ZfsOfflinePool expr: node_zfs_zpool_state{state!="online"} > 0 for: 1m labels: severity: critical annotations: summary: ZFS offline pool (instance {{ $labels.instance }}) description: "A ZFS zpool is in an unexpected state: {{ $labels.state }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.3.2. ZFS : ZFS exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/zfs/zfs_exporter.yml-
# 6.3.2.1. ZFS pool out of space
Disk is almost full (< 10% left) [copy] - alert: ZfsPoolOutOfSpace expr: zfs_pool_free_bytes * 100 / zfs_pool_size_bytes < 10 and ON (instance, device, mountpoint) zfs_pool_readonly == 0 for: 0m labels: severity: warning annotations: summary: ZFS pool out of space (instance {{ $labels.instance }}) description: "Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.3.2.2. ZFS pool unhealthy
ZFS pool state is {{ $value }}. See comments for more information. [copy] # 0: ONLINE # 1: DEGRADED # 2: FAULTED # 3: OFFLINE # 4: UNAVAIL # 5: REMOVED # 6: SUSPENDED - alert: ZfsPoolUnhealthy expr: zfs_pool_health > 0 for: 0m labels: severity: critical annotations: summary: ZFS pool unhealthy (instance {{ $labels.instance }}) description: "ZFS pool state is {{ $value }}. See comments for more information.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
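For dashboards or enriched notifications, the numeric state listed in the rule's comments can be decoded into its name. A minimal lookup matching that list (the helper itself is ours, not part of the exporter):

```python
# zfs_pool_health states, as enumerated in the rule's comments.
ZFS_POOL_HEALTH = {
    0: "ONLINE",
    1: "DEGRADED",
    2: "FAULTED",
    3: "OFFLINE",
    4: "UNAVAIL",
    5: "REMOVED",
    6: "SUSPENDED",
}

def decode_pool_health(code: int) -> str:
    """Translate a zfs_pool_health sample into its state name.
    Any non-zero code is unhealthy and trips the alert above."""
    return ZFS_POOL_HEALTH.get(code, f"UNKNOWN({code})")
```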
-
# 6.3.2.3. ZFS collector failed
ZFS collector for {{ $labels.instance }} has failed to collect information [copy] - alert: ZfsCollectorFailed expr: zfs_scrape_collector_success != 1 for: 0m labels: severity: warning annotations: summary: ZFS collector failed (instance {{ $labels.instance }}) description: "ZFS collector for {{ $labels.instance }} has failed to collect information\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.4. OpenEBS : Embedded exporter (1 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/openebs/embedded-exporter.yml-
# 6.4.1. OpenEBS used pool capacity
OpenEBS pool uses more than 80% of its capacity [copy] - alert: OpenebsUsedPoolCapacity expr: openebs_used_pool_capacity_percent > 80 for: 2m labels: severity: warning annotations: summary: OpenEBS used pool capacity (instance {{ $labels.instance }}) description: "OpenEBS pool uses more than 80% of its capacity\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.5. Minio : Embedded exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/minio/embedded-exporter.yml-
# 6.5.1. Minio cluster disk offline
Minio cluster disk is offline [copy] - alert: MinioClusterDiskOffline expr: minio_cluster_drive_offline_total > 0 for: 0m labels: severity: critical annotations: summary: Minio cluster disk offline (instance {{ $labels.instance }}) description: "Minio cluster disk is offline\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.5.2. Minio node offline
Minio cluster node is offline [copy] - alert: MinioNodeOffline expr: minio_cluster_nodes_offline_total > 0 for: 0m labels: severity: critical annotations: summary: Minio node offline (instance {{ $labels.instance }}) description: "Minio cluster node is offline\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.5.3. Minio disk space usage
Minio available free space is low (< 10%) [copy] - alert: MinioDiskSpaceUsage expr: minio_cluster_capacity_raw_free_bytes / minio_cluster_capacity_raw_total_bytes * 100 < 10 for: 0m labels: severity: warning annotations: summary: Minio disk space usage (instance {{ $labels.instance }}) description: "Minio available free space is low (< 10%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.6. SSL/TLS : ssl_exporter (4 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/ssl/tls/ribbybibby-ssl-exporter.yml-
# 6.6.1. SSL certificate probe failed
Failed to fetch SSL information {{ $labels.instance }} [copy] - alert: SslCertificateProbeFailed expr: ssl_probe_success == 0 for: 0m labels: severity: critical annotations: summary: SSL certificate probe failed (instance {{ $labels.instance }}) description: "Failed to fetch SSL information {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.6.2. SSL certificate OCSP status unknown
Failed to get the OCSP status {{ $labels.instance }} [copy] - alert: SslCertificateOcspStatusUnknown expr: ssl_ocsp_response_status == 2 for: 0m labels: severity: warning annotations: summary: SSL certificate OCSP status unknown (instance {{ $labels.instance }}) description: "Failed to get the OCSP status {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.6.3. SSL certificate revoked
SSL certificate revoked {{ $labels.instance }} [copy] - alert: SslCertificateRevoked expr: ssl_ocsp_response_status == 1 for: 0m labels: severity: critical annotations: summary: SSL certificate revoked (instance {{ $labels.instance }}) description: "SSL certificate revoked {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.6.4. SSL certificate expiry (< 7 days)
{{ $labels.instance }} Certificate is expiring within 7 days [copy] - alert: SslCertificateExpiryIn7Days expr: ssl_verified_cert_not_after{chain_no="0"} - time() < 86400 * 7 for: 0m labels: severity: warning annotations: summary: SSL certificate expiry (< 7 days) (instance {{ $labels.instance }}) description: "{{ $labels.instance }} Certificate is expiring within 7 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
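The `86400 * 7` factor is simply seconds-per-day times days, compared against the certificate's notAfter timestamp. A hedged sketch of the same check (the function is illustrative, not part of ssl_exporter):

```python
SECONDS_PER_DAY = 86400

def cert_expiring_within(not_after_ts: float, now_ts: float, days: int = 7) -> bool:
    """True when the certificate's notAfter timestamp falls within `days`
    of now, matching `ssl_verified_cert_not_after - time() < 86400 * days`."""
    return not_after_ts - now_ts < days * SECONDS_PER_DAY

# A certificate expiring in 3 days trips the check; one with 30 days left does not.
```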
-
-
# 6.7. cert-manager : Embedded exporter (4 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/cert-manager/embedded-exporter.yml-
# 6.7.1. Cert-Manager absent
Cert-Manager has disappeared from Prometheus service discovery. New certificates cannot be issued and existing ones cannot be renewed until cert-manager is back. [copy] - alert: Cert-managerAbsent expr: absent(up{job="cert-manager"}) for: 10m labels: severity: critical annotations: summary: Cert-Manager absent (instance {{ $labels.instance }}) description: "Cert-Manager has disappeared from Prometheus service discovery. New certificates cannot be issued and existing ones cannot be renewed until cert-manager is back.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.7.2. Cert-Manager certificate expiring soon
The certificate {{ $labels.name }} is expiring in less than 21 days. [copy] # Threshold of 21 days is a rough default. ACME certificates are typically renewed 30 days before expiry, so expiring within 21 days may indicate issuer misconfiguration. - alert: Cert-managerCertificateExpiringSoon expr: avg by (exported_namespace, namespace, name) (certmanager_certificate_expiration_timestamp_seconds - time()) < (21 * 24 * 3600) for: 1h labels: severity: warning annotations: summary: Cert-Manager certificate expiring soon (instance {{ $labels.instance }}) description: "The certificate {{ $labels.name }} is expiring in less than 21 days.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.7.3. Cert-Manager certificate not ready
The certificate {{ $labels.name }} in namespace {{ $labels.exported_namespace }} is not ready to serve traffic. [copy] - alert: Cert-managerCertificateNotReady expr: max by (name, exported_namespace, namespace, condition) (certmanager_certificate_ready_status{condition!="True"} == 1) for: 10m labels: severity: critical annotations: summary: Cert-Manager certificate not ready (instance {{ $labels.instance }}) description: "The certificate {{ $labels.name }} in namespace {{ $labels.exported_namespace }} is not ready to serve traffic.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.7.4. Cert-Manager hitting ACME rate limits
Cert-Manager is being rate-limited by the ACME provider. Certificate issuance and renewal may be blocked for up to a week. [copy] - alert: Cert-managerHittingAcmeRateLimits expr: sum by (host) (rate(certmanager_http_acme_client_request_count{status="429"}[5m])) > 0 for: 5m labels: severity: critical annotations: summary: Cert-Manager hitting ACME rate limits (instance {{ $labels.instance }}) description: "Cert-Manager is being rate-limited by the ACME provider. Certificate issuance and renewal may be blocked for up to a week.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.8. Juniper : czerwonk/junos_exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/juniper/czerwonk-junos-exporter.yml-
# 6.8.1. Juniper switch down
The switch appears to be down [copy] - alert: JuniperSwitchDown expr: junos_up == 0 for: 0m labels: severity: critical annotations: summary: Juniper switch down (instance {{ $labels.instance }}) description: "The switch appears to be down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.8.2. Juniper high bandwidth usage 1Gbps (critical)
Interface is highly saturated (> 0.90Gbps). [copy] - alert: JuniperHighBandwidthUsage1gbps expr: rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.90 for: 1m labels: severity: critical annotations: summary: Juniper high bandwidth usage 1Gbps (instance {{ $labels.instance }}) description: "Interface is highly saturated (> 0.90Gbps).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.8.3. Juniper high bandwidth usage 1Gbps (warning)
Interface is getting saturated (> 0.80Gbps). [copy] - alert: JuniperHighBandwidthUsage1gbps expr: rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.80 for: 1m labels: severity: warning annotations: summary: Juniper high bandwidth usage 1Gbps (instance {{ $labels.instance }}) description: "Interface is getting saturated (> 0.80Gbps).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
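The `* 8` in both expressions converts the byte-rate counter into bits per second before comparing against a fraction of the nominal 1 Gbit/s link speed. The same conversion in a short sketch (names and figures are illustrative):

```python
def link_saturation(bytes_per_s: float, link_bps: float = 1e9) -> float:
    """Fraction of link capacity in use, mirroring
    rate(junos_interface_transmit_bytes[1m]) * 8 compared against 1e9."""
    return bytes_per_s * 8 / link_bps

# 115 MB/s on a 1 Gbit/s link: 115e6 * 8 / 1e9 = 0.92
# -> past the 0.90 critical threshold, and well past the 0.80 warning one.
saturation = link_saturation(115e6)
```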
-
-
# 6.9. CoreDNS : Embedded exporter (1 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/coredns/embedded-exporter.yml-
# 6.9.1. CoreDNS Panic Count
Number of CoreDNS panics encountered [copy] - alert: CorednsPanicCount expr: increase(coredns_panics_total[1m]) > 0 for: 0m labels: severity: critical annotations: summary: CoreDNS Panic Count (instance {{ $labels.instance }}) description: "Number of CoreDNS panics encountered\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.10. Freeswitch : znerol/prometheus-freeswitch-exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/freeswitch/znerol-freeswitch-exporter.yml-
# 6.10.1. Freeswitch down
Freeswitch is unresponsive [copy] - alert: FreeswitchDown expr: freeswitch_up == 0 for: 0m labels: severity: critical annotations: summary: Freeswitch down (instance {{ $labels.instance }}) description: "Freeswitch is unresponsive\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.10.2. Freeswitch Sessions Warning
High sessions usage on {{ $labels.instance }}: {{ $value | printf "%.2f"}}% [copy] - alert: FreeswitchSessionsWarning expr: (freeswitch_session_active * 100 / freeswitch_session_limit) > 80 for: 10m labels: severity: warning annotations: summary: Freeswitch Sessions Warning (instance {{ $labels.instance }}) description: "High sessions usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.10.3. Freeswitch Sessions Critical
High sessions usage on {{ $labels.instance }}: {{ $value | printf "%.2f"}}% [copy] - alert: FreeswitchSessionsCritical expr: (freeswitch_session_active * 100 / freeswitch_session_limit) > 90 for: 5m labels: severity: critical annotations: summary: Freeswitch Sessions Critical (instance {{ $labels.instance }}) description: "High sessions usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.11. Hashicorp Vault : Embedded exporter (4 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/hashicorp-vault/embedded-exporter.yml-
# 6.11.1. Vault sealed
Vault instance is sealed on {{ $labels.instance }} [copy] - alert: VaultSealed expr: vault_core_unsealed == 0 for: 0m labels: severity: critical annotations: summary: Vault sealed (instance {{ $labels.instance }}) description: "Vault instance is sealed on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.11.2. Vault too many pending tokens
Too many pending tokens {{ $labels.instance }}: {{ $value | printf "%.2f"}} [copy] - alert: VaultTooManyPendingTokens expr: avg(vault_token_create_count - vault_token_store_count) > 0 for: 5m labels: severity: warning annotations: summary: Vault too many pending tokens (instance {{ $labels.instance }}) description: "Too many pending tokens {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.11.3. Vault too many infinity tokens
Too many infinity tokens {{ $labels.instance }}: {{ $value | printf "%.2f"}} [copy] - alert: VaultTooManyInfinityTokens expr: vault_token_count_by_ttl{creation_ttl="+Inf"} > 3 for: 5m labels: severity: warning annotations: summary: Vault too many infinity tokens (instance {{ $labels.instance }}) description: "Too many infinity tokens {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.11.4. Vault cluster health
Vault cluster is not healthy {{ $labels.instance }}: {{ $value | printf "%.2f"}}% [copy] - alert: VaultClusterHealth expr: sum(vault_core_active) / count(vault_core_active) <= 0.5 for: 0m labels: severity: critical annotations: summary: Vault cluster health (instance {{ $labels.instance }}) description: "Vault cluster is not healthy {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.12. Keycloak : aerogear/keycloak-metrics-spi (6 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/keycloak/aerogear-keycloak-metrics-spi.yml-
# 6.12.1. Keycloak high login failure rate
More than 5% of login attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf "%.1f" }}%). [copy] # Threshold of 5% is a rough default. Adjust based on your user base and expected error rates. # A spike in failed logins may indicate a brute-force attack or misconfigured client. - alert: KeycloakHighLoginFailureRate expr: (sum by (realm) (rate(keycloak_failed_login_attempts_total[5m])) / (sum by (realm) (rate(keycloak_logins_total[5m])) + sum by (realm) (rate(keycloak_failed_login_attempts_total[5m])))) * 100 > 5 and (sum by (realm) (rate(keycloak_logins_total[5m])) + sum by (realm) (rate(keycloak_failed_login_attempts_total[5m]))) > 0 for: 5m labels: severity: warning annotations: summary: Keycloak high login failure rate (instance {{ $labels.instance }}) description: "More than 5% of login attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \"%.1f\" }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
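The denominator above is successes plus failures, so the ratio is the share of all attempts that failed; the `and ... > 0` clause silences the alert when there are no attempts at all. A small Python sketch of that math (names and rates are illustrative):

```python
def login_failure_pct(failed_per_s: float, success_per_s: float):
    """Failed-login percentage over all attempts, or None with no traffic.

    Mirrors failures / (successes + failures) * 100 combined with the
    PromQL `and (successes + failures) > 0` guard."""
    attempts = failed_per_s + success_per_s
    if attempts == 0:
        return None  # no attempts: ratio undefined, alert stays silent
    return failed_per_s / attempts * 100

# 0.3 failures/s against 3.7 successes/s -> 7.5%, above the 5% threshold.
pct = login_failure_pct(0.3, 3.7)
```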
-
# 6.12.2. Keycloak no successful logins
No successful logins in realm {{ $labels.realm }} for the last 15 minutes. [copy] # Only fires when login attempts exist but none succeed — may indicate an authentication outage. - alert: KeycloakNoSuccessfulLogins expr: sum by (realm) (rate(keycloak_logins_total[15m])) == 0 and (sum by (realm) (rate(keycloak_logins_total[15m])) + sum by (realm) (rate(keycloak_failed_login_attempts_total[15m]))) > 0 for: 5m labels: severity: critical annotations: summary: Keycloak no successful logins (instance {{ $labels.instance }}) description: "No successful logins in realm {{ $labels.realm }} for the last 15 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.12.3. Keycloak high token refresh error rate
More than 10% of token refresh attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf "%.1f" }}%). [copy] # Threshold of 10% is a rough default. High refresh token errors may indicate expired sessions or token store issues. - alert: KeycloakHighTokenRefreshErrorRate expr: (sum by (realm) (rate(keycloak_refresh_tokens_errors_total[5m])) / sum by (realm) (rate(keycloak_refresh_tokens_total[5m]))) * 100 > 10 and sum by (realm) (rate(keycloak_refresh_tokens_total[5m])) > 0 for: 5m labels: severity: warning annotations: summary: Keycloak high token refresh error rate (instance {{ $labels.instance }}) description: "More than 10% of token refresh attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \"%.1f\" }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.12.4. Keycloak high code-to-token exchange error rate
More than 10% of code-to-token exchanges are failing in realm {{ $labels.realm }} (current value: {{ $value | printf "%.1f" }}%). [copy] # Threshold of 10% is a rough default. Code-to-token failures may indicate misconfigured OAuth clients or replay attacks. - alert: KeycloakHighCodeToTokenExchangeErrorRate expr: (sum by (realm) (rate(keycloak_code_to_tokens_errors_total[5m])) / sum by (realm) (rate(keycloak_code_to_tokens_total[5m]))) * 100 > 10 and sum by (realm) (rate(keycloak_code_to_tokens_total[5m])) > 0 for: 5m labels: severity: warning annotations: summary: Keycloak high code-to-token exchange error rate (instance {{ $labels.instance }}) description: "More than 10% of code-to-token exchanges are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \"%.1f\" }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.12.5. Keycloak high registration failure rate
More than 10% of registration attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf "%.1f" }}%). [copy] # Threshold of 10% is a rough default. - alert: KeycloakHighRegistrationFailureRate expr: (sum by (realm) (rate(keycloak_registrations_errors_total[5m])) / sum by (realm) (rate(keycloak_registrations_total[5m]))) * 100 > 10 and sum by (realm) (rate(keycloak_registrations_total[5m])) > 0 for: 5m labels: severity: warning annotations: summary: Keycloak high registration failure rate (instance {{ $labels.instance }}) description: "More than 10% of registration attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \"%.1f\" }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.12.6. Keycloak slow request response time
Keycloak {{ $labels.method }} requests are taking more than 2 seconds on average. [copy] # Threshold of 2 seconds is a rough default. Adjust based on your performance requirements. - alert: KeycloakSlowRequestResponseTime expr: sum by (method) (rate(keycloak_request_duration_sum[5m])) / sum by (method) (rate(keycloak_request_duration_count[5m])) > 2 and sum by (method) (rate(keycloak_request_duration_count[5m])) > 0 for: 5m labels: severity: warning annotations: summary: Keycloak slow request response time (instance {{ $labels.instance }}) description: "Keycloak {{ $labels.method }} requests are taking more than 2 seconds on average.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.13. Cloudflare : lablabs/cloudflare-exporter (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/cloudflare/lablabs-cloudflare-exporter.yml
-
# 6.13.1. Cloudflare http 4xx error rate
Cloudflare high HTTP 4xx error rate (> 5% for domain {{ $labels.zone }}) [copy] - alert: CloudflareHttp4xxErrorRate expr: (sum by(zone) (rate(cloudflare_zone_requests_status{status=~"^4.."}[15m])) / on (zone) sum by (zone) (rate(cloudflare_zone_requests_status[15m]))) * 100 > 5 for: 0m labels: severity: warning annotations: summary: Cloudflare http 4xx error rate (instance {{ $labels.instance }}) description: "Cloudflare high HTTP 4xx error rate (> 5% for domain {{ $labels.zone }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.13.2. Cloudflare http 5xx error rate
Cloudflare high HTTP 5xx error rate (> 5% for domain {{ $labels.zone }}) [copy] - alert: CloudflareHttp5xxErrorRate expr: (sum by (zone) (rate(cloudflare_zone_requests_status{status=~"^5.."}[5m])) / on (zone) sum by (zone) (rate(cloudflare_zone_requests_status[5m]))) * 100 > 5 for: 0m labels: severity: critical annotations: summary: Cloudflare http 5xx error rate (instance {{ $labels.instance }}) description: "Cloudflare high HTTP 5xx error rate (> 5% for domain {{ $labels.zone }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.14. SNMP : prometheus/snmp_exporter (7 rules) [copy section]
These rules use standard IF-MIB and SNMPv2-MIB metrics. Metric names depend on your snmp.yml module configuration.
Thresholds for bandwidth and error rates are rough defaults - adjust to your environment.
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/snmp/snmp-exporter.yml
-
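The `job=~"snmp.*"` selectors in these rules assume a scrape job named accordingly, using the standard snmp_exporter proxy pattern (the exporter probes the device on Prometheus's behalf). A sketch — device and exporter addresses are placeholders:

```yaml
scrape_configs:
  - job_name: snmp  # matches the job=~"snmp.*" selectors below
    metrics_path: /snmp
    params:
      module: [if_mib]  # module must expose the IF-MIB / SNMPv2-MIB metrics used here
    static_configs:
      - targets: ["192.0.2.10"]  # the SNMP device to probe (placeholder)
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target   # pass the device as the ?target= parameter
      - source_labels: [__param_target]
        target_label: instance         # keep the device address as the instance label
      - target_label: __address__
        replacement: snmp-exporter:9116  # actually scrape the exporter (placeholder)
```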
# 6.14.1. SNMP target down
SNMP device {{ $labels.instance }} is unreachable. [copy] # From the official snmp-mixin. - alert: SnmpTargetDown expr: up{job=~"snmp.*"} == 0 for: 5m labels: severity: critical annotations: summary: SNMP target down (instance {{ $labels.instance }}) description: "SNMP device {{ $labels.instance }} is unreachable.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.14.2. SNMP interface down
Interface {{ $labels.ifDescr }} on {{ $labels.instance }} is operationally down while administratively up. [copy] - alert: SnmpInterfaceDown expr: (ifOperStatus{job=~"snmp.*"} == 2) and on(instance, job, ifIndex) (ifAdminStatus{job=~"snmp.*"} == 1) for: 2m labels: severity: critical annotations: summary: SNMP interface down (instance {{ $labels.instance }}) description: "Interface {{ $labels.ifDescr }} on {{ $labels.instance }} is operationally down while administratively up.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.14.3. SNMP interface high inbound error rate
Interface {{ $labels.ifDescr }} on {{ $labels.instance }} has an inbound error rate above 5%. [copy] # Threshold is a rough default. Adjust based on your network environment. - alert: SnmpInterfaceHighInboundErrorRate expr: rate(ifInErrors{job=~"snmp.*"}[5m]) / (rate(ifHCInUcastPkts{job=~"snmp.*"}[5m]) + rate(ifHCInBroadcastPkts{job=~"snmp.*"}[5m]) + rate(ifHCInMulticastPkts{job=~"snmp.*"}[5m])) > 0.05 and (rate(ifHCInUcastPkts{job=~"snmp.*"}[5m]) + rate(ifHCInBroadcastPkts{job=~"snmp.*"}[5m]) + rate(ifHCInMulticastPkts{job=~"snmp.*"}[5m])) > 0 for: 5m labels: severity: warning annotations: summary: SNMP interface high inbound error rate (instance {{ $labels.instance }}) description: "Interface {{ $labels.ifDescr }} on {{ $labels.instance }} has an inbound error rate above 5%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.14.4. SNMP interface high outbound error rate
Interface {{ $labels.ifDescr }} on {{ $labels.instance }} has an outbound error rate above 5%. [copy] # Threshold is a rough default. Adjust based on your network environment. - alert: SnmpInterfaceHighOutboundErrorRate expr: rate(ifOutErrors{job=~"snmp.*"}[5m]) / (rate(ifHCOutUcastPkts{job=~"snmp.*"}[5m]) + rate(ifHCOutBroadcastPkts{job=~"snmp.*"}[5m]) + rate(ifHCOutMulticastPkts{job=~"snmp.*"}[5m])) > 0.05 and (rate(ifHCOutUcastPkts{job=~"snmp.*"}[5m]) + rate(ifHCOutBroadcastPkts{job=~"snmp.*"}[5m]) + rate(ifHCOutMulticastPkts{job=~"snmp.*"}[5m])) > 0 for: 5m labels: severity: warning annotations: summary: SNMP interface high outbound error rate (instance {{ $labels.instance }}) description: "Interface {{ $labels.ifDescr }} on {{ $labels.instance }} has an outbound error rate above 5%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.14.5. SNMP interface high bandwidth usage inbound
Interface {{ $labels.ifDescr }} on {{ $labels.instance }} inbound utilization is above 80%. [copy] # Threshold is a rough default. Adjust based on your link capacity and traffic patterns. - alert: SnmpInterfaceHighBandwidthUsageInbound expr: rate(ifHCInOctets{job=~"snmp.*"}[5m]) * 8 / ifSpeed > 0.80 and ifSpeed > 0 for: 15m labels: severity: warning annotations: summary: SNMP interface high bandwidth usage inbound (instance {{ $labels.instance }}) description: "Interface {{ $labels.ifDescr }} on {{ $labels.instance }} inbound utilization is above 80%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.14.6. SNMP interface high bandwidth usage outbound
Interface {{ $labels.ifDescr }} on {{ $labels.instance }} outbound utilization is above 80%. [copy] # Threshold is a rough default. Adjust based on your link capacity and traffic patterns. - alert: SnmpInterfaceHighBandwidthUsageOutbound expr: rate(ifHCOutOctets{job=~"snmp.*"}[5m]) * 8 / ifSpeed > 0.80 and ifSpeed > 0 for: 15m labels: severity: warning annotations: summary: SNMP interface high bandwidth usage outbound (instance {{ $labels.instance }}) description: "Interface {{ $labels.ifDescr }} on {{ $labels.instance }} outbound utilization is above 80%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.14.7. SNMP device restarted
SNMP device {{ $labels.instance }} has restarted (uptime < 5 minutes). [copy] # sysUpTime is in centiseconds (hundredths of a second). - alert: SnmpDeviceRestarted expr: sysUpTime / 100 < 300 for: 0m labels: severity: info annotations: summary: SNMP device restarted (instance {{ $labels.instance }}) description: "SNMP device {{ $labels.instance }} has restarted (uptime < 5 minutes).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.15. Cilium : Embedded exporter (31 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/cilium/embedded-exporter.yml
-
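Cilium's metrics endpoints are opt-in. A minimal Helm values sketch enabling the agent, operator, and Hubble metrics these rules query — option names follow recent Cilium charts, so verify against your chart version:

```yaml
prometheus:
  enabled: true        # cilium-agent metrics (cilium_*)
operator:
  prometheus:
    enabled: true      # cilium-operator metrics (cilium_operator_*)
hubble:
  metrics:
    enabled:
      - dns            # required for hubble_dns_responses_total
      - drop
```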
# 6.15.1. Cilium agent unreachable nodes
Cilium agent {{ $labels.pod }} cannot reach {{ $value }} node(s). Check network connectivity and node health. [copy] - alert: CiliumAgentUnreachableNodes expr: sum(cilium_unreachable_nodes{}) by (pod) > 0 for: 15m labels: severity: warning annotations: summary: Cilium agent unreachable nodes (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} cannot reach {{ $value }} node(s). Check network connectivity and node health.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.2. Cilium agent unreachable health endpoints
Cilium agent {{ $labels.pod }} cannot reach {{ $value }} health endpoint(s). Node-to-node health probes are failing. [copy] - alert: CiliumAgentUnreachableHealthEndpoints expr: sum(cilium_unreachable_health_endpoints{}) by (pod) > 0 for: 15m labels: severity: warning annotations: summary: Cilium agent unreachable health endpoints (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} cannot reach {{ $value }} health endpoint(s). Node-to-node health probes are failing.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.3. Cilium agent failing controllers
Cilium agent {{ $labels.pod }} has {{ $value }} failing controller(s). Check cilium-agent logs for details. [copy] - alert: CiliumAgentFailingControllers expr: sum(cilium_controllers_failing{}) by (pod) > 0 for: 5m labels: severity: warning annotations: summary: Cilium agent failing controllers (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} has {{ $value }} failing controller(s). Check cilium-agent logs for details.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.4. Cilium agent endpoint failures
Cilium agent {{ $labels.pod }} has {{ $value }} endpoint(s) in invalid state. [copy] - alert: CiliumAgentEndpointFailures expr: sum(cilium_endpoint_state{endpoint_state="invalid"}) by (pod) > 0 for: 5m labels: severity: warning annotations: summary: Cilium agent endpoint failures (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} has {{ $value }} endpoint(s) in invalid state.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.5. Cilium agent endpoint regeneration failures
Cilium agent {{ $labels.pod }} is failing to regenerate endpoints. Network policy enforcement may be stale. [copy] - alert: CiliumAgentEndpointRegenerationFailures expr: sum(rate(cilium_endpoint_regenerations_total{outcome="fail"}[5m])) by (pod) > 0 for: 5m labels: severity: warning annotations: summary: Cilium agent endpoint regeneration failures (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} is failing to regenerate endpoints. Network policy enforcement may be stale.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.6. Cilium agent endpoint update failure
Cilium agent {{ $labels.pod }} is failing K8s endpoint update API calls ({{ $labels.method }} {{ $labels.return_code }}). [copy] - alert: CiliumAgentEndpointUpdateFailure expr: sum(rate(cilium_k8s_client_api_calls_total{method=~"(PUT|POST|PATCH)", endpoint="endpoint", return_code!~"2[0-9][0-9]"}[5m])) by (pod, method, return_code) > 0 for: 5m labels: severity: warning annotations: summary: Cilium agent endpoint update failure (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} is failing K8s endpoint update API calls ({{ $labels.method }} {{ $labels.return_code }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.7. Cilium agent endpoint create failure
Cilium agent {{ $labels.pod }} is failing CNI endpoint-create calls. New pods may fail to get networking. [copy] - alert: CiliumAgentEndpointCreateFailure expr: sum(rate(cilium_api_limiter_processed_requests_total{api_call=~"endpoint-create", outcome="fail"}[1m])) by (pod, api_call) > 0 for: 5m labels: severity: info annotations: summary: Cilium agent endpoint create failure (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} is failing CNI endpoint-create calls. New pods may fail to get networking.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.8. Cilium agent map operation failures
Cilium agent {{ $labels.pod }} has eBPF map operation failures on {{ $labels.map_name }}. Datapath may be degraded. [copy] - alert: CiliumAgentMapOperationFailures expr: sum(rate(cilium_bpf_map_ops_total{outcome="fail"}[5m])) by (map_name, pod) > 0 for: 5m labels: severity: warning annotations: summary: Cilium agent map operation failures (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} has eBPF map operation failures on {{ $labels.map_name }}. Datapath may be degraded.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.9. Cilium agent BPF map pressure
Cilium agent {{ $labels.pod }} eBPF map {{ $labels.map_name }} is above 90% utilization. Map may become full. [copy] # Map pressure is a ratio from 0 to 1. At 1.0, the map is full and new entries will be dropped. - alert: CiliumAgentBpfMapPressure expr: cilium_bpf_map_pressure{} > 0.9 for: 5m labels: severity: warning annotations: summary: Cilium agent BPF map pressure (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} eBPF map {{ $labels.map_name }} is above 90% utilization. Map may become full.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.10. Cilium agent conntrack table full
Cilium agent {{ $labels.pod }} conntrack table is full, causing packet drops. Increase CT map size or investigate connection leaks. [copy] - alert: CiliumAgentConntrackTableFull expr: sum(rate(cilium_drop_count_total{reason="CT: Map insertion failed"}[5m])) by (pod) > 0 for: 5m labels: severity: critical annotations: summary: Cilium agent conntrack table full (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} conntrack table is full, causing packet drops. Increase CT map size or investigate connection leaks.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.11. Cilium agent conntrack failed garbage collection
Cilium agent {{ $labels.pod }} conntrack garbage collection is failing. Stale entries may accumulate. [copy] - alert: CiliumAgentConntrackFailedGarbageCollection expr: sum(rate(cilium_datapath_conntrack_gc_runs_total{status="uncompleted"}[5m])) by (pod) > 0 for: 5m labels: severity: warning annotations: summary: Cilium agent conntrack failed garbage collection (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} conntrack garbage collection is failing. Stale entries may accumulate.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.12. Cilium agent NAT table full
Cilium agent {{ $labels.pod }} NAT table is full, causing masquerade failures. Increase NAT map size or investigate. [copy] - alert: CiliumAgentNatTableFull expr: sum(rate(cilium_drop_count_total{reason="No mapping for NAT masquerade"}[1m])) by (pod) > 0 for: 5m labels: severity: critical annotations: summary: Cilium agent NAT table full (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} NAT table is full, causing masquerade failures. Increase NAT map size or investigate.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.13. Cilium agent high denied rate
Cilium agent {{ $labels.pod }} is dropping packets due to policy denial. Verify network policies are correct. [copy] # Policy denials may be expected behavior. Investigate only if unexpected traffic is being blocked. - alert: CiliumAgentHighDeniedRate expr: sum(rate(cilium_drop_count_total{reason="Policy denied"}[1m])) by (pod) > 0 for: 10m labels: severity: info annotations: summary: Cilium agent high denied rate (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} is dropping packets due to policy denial. Verify network policies are correct.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.14. Cilium agent high drop rate
Cilium agent {{ $labels.pod }} is dropping packets for reason {{ $labels.reason }}. This indicates infrastructure issues. [copy] - alert: CiliumAgentHighDropRate expr: sum(rate(cilium_drop_count_total{reason!~"Policy denied"}[5m])) by (pod, reason) > 0 for: 5m labels: severity: warning annotations: summary: Cilium agent high drop rate (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} is dropping packets for reason {{ $labels.reason }}. This indicates infrastructure issues.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.15. Cilium agent policy map pressure
Cilium agent {{ $labels.pod }} policy BPF map is above 90% utilization. New policies may fail to apply. [copy] - alert: CiliumAgentPolicyMapPressure expr: sum(cilium_bpf_map_pressure{map_name=~"cilium_policy_.*"}) by (pod) > 0.9 for: 5m labels: severity: warning annotations: summary: Cilium agent policy map pressure (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} policy BPF map is above 90% utilization. New policies may fail to apply.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.16. Cilium agent policy import errors
Cilium agent {{ $labels.pod }} is failing to import network policies. Policy enforcement may be incomplete. [copy] - alert: CiliumAgentPolicyImportErrors expr: sum(rate(cilium_policy_change_total{outcome="fail"}[5m])) by (pod) > 0 for: 5m labels: severity: warning annotations: summary: Cilium agent policy import errors (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} is failing to import network policies. Policy enforcement may be incomplete.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.17. Cilium agent policy implementation delay
Cilium agent {{ $labels.pod }} P99 policy deployment latency exceeds 60 seconds. Endpoints may run with stale policies. [copy] # Threshold of 60s is a rough default. Adjust based on cluster size and policy complexity. - alert: CiliumAgentPolicyImplementationDelay expr: histogram_quantile(0.99, sum(rate(cilium_policy_implementation_delay_bucket[5m])) by (le, pod)) > 60 for: 5m labels: severity: warning annotations: summary: Cilium agent policy implementation delay (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} P99 policy deployment latency exceeds 60 seconds. Endpoints may run with stale policies.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.18. Cilium node-local high identity allocation
Cilium agent {{ $labels.pod }} node-local identity allocation is above 80%. Approaching the 65535 identity limit. [copy] - alert: CiliumNodeLocalHighIdentityAllocation expr: (sum(cilium_identity{type="node_local"}) by (pod) / (2^16-1)) > 0.8 for: 5m labels: severity: warning annotations: summary: Cilium node-local high identity allocation (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} node-local identity allocation is above 80%. Approaching the 65535 identity limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.19. Cilium cluster high identity allocation
Cilium cluster-wide identity allocation is above 80%. Approaching the maximum identity limit. [copy] - alert: CiliumClusterHighIdentityAllocation expr: (sum(cilium_identity{type="cluster_local"}) by () / (2^16-256)) > 0.8 for: 5m labels: severity: warning annotations: summary: Cilium cluster high identity allocation (instance {{ $labels.instance }}) description: "Cilium cluster-wide identity allocation is above 80%. Approaching the maximum identity limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.20. Cilium operator exhausted IPAM IPs
Cilium operator has no available IPAM IPs. New pods will fail to schedule networking. [copy] - alert: CiliumOperatorExhaustedIpamIps expr: sum(cilium_operator_ipam_ips{type="available"}) by () <= 0 for: 5m labels: severity: critical annotations: summary: Cilium operator exhausted IPAM IPs (instance {{ $labels.instance }}) description: "Cilium operator has no available IPAM IPs. New pods will fail to schedule networking.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.21. Cilium operator low available IPAM IPs
Cilium operator IPAM IP pool is over 90% utilized. Allocate more IPs to avoid exhaustion. [copy] # Threshold of 90% is a rough default. Adjust based on your pod churn rate and IP pool size. - alert: CiliumOperatorLowAvailableIpamIps expr: sum(cilium_operator_ipam_ips{type!="available"}) by () / sum(cilium_operator_ipam_ips) by () > 0.9 and sum(cilium_operator_ipam_ips) by () > 0 for: 5m labels: severity: warning annotations: summary: Cilium operator low available IPAM IPs (instance {{ $labels.instance }}) description: "Cilium operator IPAM IP pool is over 90% utilized. Allocate more IPs to avoid exhaustion.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.22. Cilium operator IPAM interface creation failures
Cilium operator is failing to create IPAM network interfaces. IP allocation may be impacted. [copy] - alert: CiliumOperatorIpamInterfaceCreationFailures expr: sum(rate(cilium_operator_ipam_interface_creation_ops{status!="success"}[5m])) by () > 0 for: 10m labels: severity: warning annotations: summary: Cilium operator IPAM interface creation failures (instance {{ $labels.instance }}) description: "Cilium operator is failing to create IPAM network interfaces. IP allocation may be impacted.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.23. Cilium agent API errors
Cilium agent {{ $labels.pod }} API is returning 5xx errors ({{ $labels.return_code }}). Agent may be unhealthy. [copy] - alert: CiliumAgentApiErrors expr: sum(rate(cilium_agent_api_process_time_seconds_count{return_code=~"5[0-9][0-9]"}[5m])) by (pod, return_code) > 0 for: 5m labels: severity: warning annotations: summary: Cilium agent API errors (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} API is returning 5xx errors ({{ $labels.return_code }}). Agent may be unhealthy.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.24. Cilium agent Kubernetes client errors
Cilium agent {{ $labels.pod }} is receiving errors from K8s API for endpoint {{ $labels.endpoint }} ({{ $labels.return_code }}). [copy] - alert: CiliumAgentKubernetesClientErrors expr: sum(rate(cilium_k8s_client_api_calls_total{endpoint!="metrics", return_code!~"2[0-9][0-9]"}[5m])) by (pod, endpoint, return_code) > 0 for: 5m labels: severity: info annotations: summary: Cilium agent Kubernetes client errors (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} is receiving errors from K8s API for endpoint {{ $labels.endpoint }} ({{ $labels.return_code }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.25. Cilium ClusterMesh remote cluster not ready
Cilium ClusterMesh remote cluster {{ $labels.target_cluster }} is not ready from {{ $labels.source_cluster }}. [copy] - alert: CiliumClustermeshRemoteClusterNotReady expr: count(cilium_clustermesh_remote_cluster_readiness_status < 1) by (source_cluster, target_cluster) > 0 for: 5m labels: severity: critical annotations: summary: Cilium ClusterMesh remote cluster not ready (instance {{ $labels.instance }}) description: "Cilium ClusterMesh remote cluster {{ $labels.target_cluster }} is not ready from {{ $labels.source_cluster }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.26. Cilium ClusterMesh remote cluster failing
Cilium ClusterMesh connectivity to remote cluster {{ $labels.target_cluster }} from {{ $labels.source_cluster }} is failing. [copy] - alert: CiliumClustermeshRemoteClusterFailing expr: sum(rate(cilium_clustermesh_remote_cluster_failures[5m])) by (source_cluster, target_cluster) > 0 for: 5m labels: severity: critical annotations: summary: Cilium ClusterMesh remote cluster failing (instance {{ $labels.instance }}) description: "Cilium ClusterMesh connectivity to remote cluster {{ $labels.target_cluster }} from {{ $labels.source_cluster }} is failing.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.27. Cilium KVStoreMesh remote cluster not ready
Cilium KVStoreMesh remote cluster {{ $labels.target_cluster }} is not ready from {{ $labels.source_cluster }}. [copy] - alert: CiliumKvstoremeshRemoteClusterNotReady expr: count(cilium_kvstoremesh_remote_cluster_readiness_status < 1) by (source_cluster, target_cluster) > 0 for: 5m labels: severity: critical annotations: summary: Cilium KVStoreMesh remote cluster not ready (instance {{ $labels.instance }}) description: "Cilium KVStoreMesh remote cluster {{ $labels.target_cluster }} is not ready from {{ $labels.source_cluster }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.28. Cilium KVStoreMesh remote cluster failing
Cilium KVStoreMesh remote cluster {{ $labels.target_cluster }} from {{ $labels.source_cluster }} is experiencing failures. [copy] - alert: CiliumKvstoremeshRemoteClusterFailing expr: sum(rate(cilium_kvstoremesh_remote_cluster_failures[5m])) by (source_cluster, target_cluster) > 0 for: 5m labels: severity: critical annotations: summary: Cilium KVStoreMesh remote cluster failing (instance {{ $labels.instance }}) description: "Cilium KVStoreMesh remote cluster {{ $labels.target_cluster }} from {{ $labels.source_cluster }} is experiencing failures.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.29. Cilium KVStoreMesh sync errors
Cilium KVStoreMesh from {{ $labels.source_cluster }} is experiencing kvstore sync errors. [copy] - alert: CiliumKvstoremeshSyncErrors expr: sum(rate(cilium_kvstoremesh_kvstore_sync_errors_total[5m])) by (source_cluster) > 0 for: 5m labels: severity: critical annotations: summary: Cilium KVStoreMesh sync errors (instance {{ $labels.instance }}) description: "Cilium KVStoreMesh from {{ $labels.source_cluster }} is experiencing kvstore sync errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.30. Cilium Hubble lost events
Cilium Hubble on {{ $labels.pod }} is losing flow events. Observability data may be incomplete. [copy] - alert: CiliumHubbleLostEvents expr: sum(rate(hubble_lost_events_total[5m])) by (pod) > 0 for: 5m labels: severity: warning annotations: summary: Cilium Hubble lost events (instance {{ $labels.instance }}) description: "Cilium Hubble on {{ $labels.pod }} is losing flow events. Observability data may be incomplete.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.31. Cilium Hubble high DNS error rate
Cilium Hubble on {{ $labels.pod }} is observing more than 10% DNS error responses. [copy] # Threshold of 10% is a rough default. Some DNS errors may be normal depending on your workload. - alert: CiliumHubbleHighDnsErrorRate expr: sum(rate(hubble_dns_responses_total{rcode!="No Error"}[5m])) by (pod) / sum(rate(hubble_dns_responses_total[5m])) by (pod) > 0.1 and sum(rate(hubble_dns_responses_total[5m])) by (pod) > 0 for: 5m labels: severity: warning annotations: summary: Cilium Hubble high DNS error rate (instance {{ $labels.instance }}) description: "Cilium Hubble on {{ $labels.pod }} is observing more than 10% DNS error responses.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.16. WireGuard : MindFlavor/prometheus_wireguard_exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/wireguard/mindflavor-prometheus-wireguard-exporter.yml
-
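These rules assume the exporter runs on the WireGuard host. A minimal scrape-config sketch — 9586 is the exporter's conventional port, and the hostname is a placeholder:

```yaml
scrape_configs:
  - job_name: wireguard
    static_configs:
      - targets: ["wg-host.example.com:9586"]  # placeholder
```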
# 6.16.1. WireGuard peer handshake too old
WireGuard peer {{ $labels.public_key }} on interface {{ $labels.interface }} has not had a handshake for over 5 minutes. The tunnel may be down. [copy] # The threshold of 300 seconds (5 minutes) is a rough default. WireGuard peers that are idle but reachable # typically re-handshake every 2 minutes. Adjust based on your keepalive interval. # The `> 0` guard excludes peers that have never completed a handshake (covered by a separate rule). - alert: WireguardPeerHandshakeTooOld expr: time() - wireguard_latest_handshake_seconds > 300 and wireguard_latest_handshake_seconds > 0 for: 2m labels: severity: warning annotations: summary: WireGuard peer handshake too old (instance {{ $labels.instance }}) description: "WireGuard peer {{ $labels.public_key }} on interface {{ $labels.interface }} has not had a handshake for over 5 minutes. The tunnel may be down.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.16.2. WireGuard peer handshake never established
WireGuard peer {{ $labels.public_key }} on interface {{ $labels.interface }} has never completed a handshake. Check peer configuration and network connectivity. [copy] - alert: WireguardPeerHandshakeNeverEstablished expr: wireguard_latest_handshake_seconds == 0 for: 5m labels: severity: critical annotations: summary: WireGuard peer handshake never established (instance {{ $labels.instance }}) description: "WireGuard peer {{ $labels.public_key }} on interface {{ $labels.interface }} has never completed a handshake. Check peer configuration and network connectivity.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.16.3. WireGuard no traffic on peer
WireGuard peer {{ $labels.public_key }} on interface {{ $labels.interface }} has had no traffic for 15 minutes despite an active handshake. [copy] # This alert fires when a peer has a recent handshake but zero traffic flow. # May indicate routing issues or a misconfigured allowed-ips. # Only useful if you expect continuous traffic on all peers. - alert: WireguardNoTrafficOnPeer expr: (rate(wireguard_sent_bytes_total[15m]) + rate(wireguard_received_bytes_total[15m])) == 0 and wireguard_latest_handshake_seconds > 0 and (time() - wireguard_latest_handshake_seconds) < 300 for: 15m labels: severity: warning annotations: summary: WireGuard no traffic on peer (instance {{ $labels.instance }}) description: "WireGuard peer {{ $labels.public_key }} on interface {{ $labels.interface }} has had no traffic for 15 minutes despite an active handshake.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 7.1. AWS CloudWatch : prometheus/cloudwatch_exporter (13 rules) [copy section]
CloudWatch metrics are exported as aws_{namespace}_{metric_name}_{statistic} gauges.
The rules below cover both exporter health and common AWS service alerts.
Adjust thresholds and label filters to match your CloudWatch exporter configuration.
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/aws-cloudwatch/prometheus-cloudwatch-exporter.yml
-
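To illustrate the aws_{namespace}_{metric_name}_{statistic} convention: a cloudwatch_exporter config entry like the following exports AWS/EC2 CPUUtilization (Average statistic) as the gauge `aws_ec2_cpuutilization_average`, with dimensions becoming labels (region is a placeholder):

```yaml
region: us-east-1  # placeholder
metrics:
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUUtilization
    aws_dimensions: [InstanceId]   # exported as the instance_id label
    aws_statistics: [Average]
```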
# 7.1.1. CloudWatch exporter scrape error
CloudWatch exporter on {{ $labels.instance }} failed to scrape metrics from AWS CloudWatch API. [copy] - alert: CloudwatchExporterScrapeError expr: cloudwatch_exporter_scrape_error > 0 for: 5m labels: severity: warning annotations: summary: CloudWatch exporter scrape error (instance {{ $labels.instance }}) description: "CloudWatch exporter on {{ $labels.instance }} failed to scrape metrics from AWS CloudWatch API.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.2. CloudWatch exporter slow scrape
CloudWatch exporter on {{ $labels.instance }} scrape is taking more than 5 minutes ({{ $value }}s). Consider reducing the number of metrics or splitting across multiple exporters. [copy] - alert: CloudwatchExporterSlowScrape expr: cloudwatch_exporter_scrape_duration_seconds > 300 for: 5m labels: severity: warning annotations: summary: CloudWatch exporter slow scrape (instance {{ $labels.instance }}) description: "CloudWatch exporter on {{ $labels.instance }} scrape is taking more than 5 minutes ({{ $value }}s). Consider reducing the number of metrics or splitting across multiple exporters.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.3. CloudWatch API high request rate
CloudWatch exporter on {{ $labels.instance }} is making {{ $value }} API calls per minute to namespace {{ $labels.namespace }}. This can lead to high AWS costs. [copy] # CloudWatch API calls cost money (~$0.01 per 1000 GetMetricData requests). # 100 requests/minute ≈ $45/month. Adjust the threshold based on your budget. - alert: CloudwatchApiHighRequestRate expr: sum by (instance, namespace) (rate(cloudwatch_requests_total[5m])) * 60 > 100 for: 0m labels: severity: warning annotations: summary: CloudWatch API high request rate (instance {{ $labels.instance }}) description: "CloudWatch exporter on {{ $labels.instance }} is making {{ $value }} API calls per minute to namespace {{ $labels.namespace }}. This can lead to high AWS costs.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
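The cost figure in the comment above is back-of-the-envelope arithmetic; the ~$0.01 per 1,000 requests price is an assumption that varies by region and API call type:

```python
def monthly_api_cost(requests_per_minute: float,
                     price_per_1000: float = 0.01,
                     days: int = 30) -> float:
    """Rough monthly cost of CloudWatch API calls at a steady request rate."""
    requests_per_month = requests_per_minute * 60 * 24 * days
    return requests_per_month / 1000 * price_per_1000

# At the rule's threshold of 100 requests/minute this comes out to about
# $43/month, in line with the "~$45/month" figure quoted in the comment.
```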
-
# 7.1.4. AWS EC2 high CPU utilization
EC2 instance {{ $labels.instance_id }} CPU utilization is above 90% ({{ $value }}%). [copy] # Requires EC2 CPUUtilization metric configured in the CloudWatch exporter. - alert: AwsEc2HighCpuUtilization expr: aws_ec2_cpuutilization_average > 90 for: 15m labels: severity: warning annotations: summary: AWS EC2 high CPU utilization (instance {{ $labels.instance }}) description: "EC2 instance {{ $labels.instance_id }} CPU utilization is above 90% ({{ $value }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.5. AWS RDS low free storage space
RDS instance {{ $labels.dbinstance_identifier }} has less than 2GB free storage ({{ $value }} bytes remaining). [copy] # Requires RDS FreeStorageSpace metric. The threshold of 2GB is a rough default. # Adjust based on your database size. - alert: AwsRdsLowFreeStorageSpace expr: aws_rds_free_storage_space_average < 2000000000 for: 5m labels: severity: warning annotations: summary: AWS RDS low free storage space (instance {{ $labels.instance }}) description: "RDS instance {{ $labels.dbinstance_identifier }} has less than 2GB free storage ({{ $value }} bytes remaining).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.6. AWS RDS high CPU utilization
RDS instance {{ $labels.dbinstance_identifier }} CPU utilization is above 90% ({{ $value }}%). [copy] # Requires RDS CPUUtilization metric configured in the CloudWatch exporter. - alert: AwsRdsHighCpuUtilization expr: aws_rds_cpuutilization_average > 90 for: 15m labels: severity: warning annotations: summary: AWS RDS high CPU utilization (instance {{ $labels.instance }}) description: "RDS instance {{ $labels.dbinstance_identifier }} CPU utilization is above 90% ({{ $value }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.7. AWS RDS high database connections
RDS instance {{ $labels.dbinstance_identifier }} has {{ $value }} active connections. [copy] # The threshold depends on the RDS instance class. Adjust based on your # instance type's max_connections parameter. - alert: AwsRdsHighDatabaseConnections expr: aws_rds_database_connections_average > 100 for: 5m labels: severity: warning annotations: summary: AWS RDS high database connections (instance {{ $labels.instance }}) description: "RDS instance {{ $labels.dbinstance_identifier }} has {{ $value }} active connections.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.8. AWS SQS queue messages visible
SQS queue {{ $labels.queue_name }} has {{ $value }} messages waiting to be processed. [copy] # Requires SQS ApproximateNumberOfMessagesVisible metric. The threshold of 1000 # is a rough default. Adjust based on your expected queue depth. - alert: AwsSqsQueueMessagesVisible expr: aws_sqs_approximate_number_of_messages_visible_average > 1000 for: 10m labels: severity: warning annotations: summary: AWS SQS queue messages visible (instance {{ $labels.instance }}) description: "SQS queue {{ $labels.queue_name }} has {{ $value }} messages waiting to be processed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.9. AWS SQS message age too old
SQS queue {{ $labels.queue_name }} has messages older than 1 hour ({{ $value }}s). [copy] # Requires SQS ApproximateAgeOfOldestMessage metric. - alert: AwsSqsMessageAgeTooOld expr: aws_sqs_approximate_age_of_oldest_message_maximum > 3600 for: 0m labels: severity: warning annotations: summary: AWS SQS message age too old (instance {{ $labels.instance }}) description: "SQS queue {{ $labels.queue_name }} has messages older than 1 hour ({{ $value }}s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.10. AWS ALB unhealthy targets
ALB {{ $labels.load_balancer }} has {{ $value }} unhealthy target(s) in target group {{ $labels.target_group }}. [copy] # Requires ApplicationELB UnHealthyHostCount metric. - alert: AwsAlbUnhealthyTargets expr: aws_applicationelb_unhealthy_host_count_average > 0 for: 5m labels: severity: critical annotations: summary: AWS ALB unhealthy targets (instance {{ $labels.instance }}) description: "ALB {{ $labels.load_balancer }} has {{ $value }} unhealthy target(s) in target group {{ $labels.target_group }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.11. AWS ALB high 5xx error rate
ALB {{ $labels.load_balancer }} 5xx error rate is above 5% ({{ $value }}%). [copy] # Requires ApplicationELB HTTPCode_ELB_5XX_Count and RequestCount metrics. - alert: AwsAlbHigh5xxErrorRate expr: (aws_applicationelb_httpcode_elb_5_xx_count_sum / aws_applicationelb_request_count_sum) * 100 > 5 for: 5m labels: severity: critical annotations: summary: AWS ALB high 5xx error rate (instance {{ $labels.instance }}) description: "ALB {{ $labels.load_balancer }} 5xx error rate is above 5% ({{ $value }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.12. AWS ALB high target response time
ALB {{ $labels.load_balancer }} average target response time is above 2 seconds ({{ $value }}s). [copy] # Requires ApplicationELB TargetResponseTime metric. - alert: AwsAlbHighTargetResponseTime expr: aws_applicationelb_target_response_time_average > 2 for: 5m labels: severity: warning annotations: summary: AWS ALB high target response time (instance {{ $labels.instance }}) description: "ALB {{ $labels.load_balancer }} average target response time is above 2 seconds ({{ $value }}s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.13. AWS Lambda high error rate
Lambda function {{ $labels.function_name }} error rate is above 5% ({{ $value }}%). [copy] # Requires Lambda Errors and Invocations metrics. - alert: AwsLambdaHighErrorRate expr: (aws_lambda_errors_sum / aws_lambda_invocations_sum) * 100 > 5 for: 5m labels: severity: warning annotations: summary: AWS Lambda high error rate (instance {{ $labels.instance }}) description: "Lambda function {{ $labels.function_name }} error rate is above 5% ({{ $value }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 7.2. Google Cloud Stackdriver : prometheus-community/stackdriver_exporter (5 rules) [copy section]
Self-monitoring metrics use the stackdriver_monitoring_* prefix.
All self-monitoring metrics include a project_id label.
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/google-cloud-stackdriver/stackdriver-exporter.yml-
# 7.2.1. Stackdriver exporter scrape error
Stackdriver exporter failed to scrape metrics from Google Cloud Monitoring API for project {{ $labels.project_id }}. [copy] - alert: StackdriverExporterScrapeError expr: stackdriver_monitoring_last_scrape_error > 0 for: 5m labels: severity: warning annotations: summary: Stackdriver exporter scrape error (instance {{ $labels.instance }}) description: "Stackdriver exporter failed to scrape metrics from Google Cloud Monitoring API for project {{ $labels.project_id }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.2.2. Stackdriver exporter slow scrape
Stackdriver exporter scrape for project {{ $labels.project_id }} is taking more than 5 minutes ({{ $value }}s). [copy] - alert: StackdriverExporterSlowScrape expr: stackdriver_monitoring_last_scrape_duration_seconds > 300 for: 5m labels: severity: warning annotations: summary: Stackdriver exporter slow scrape (instance {{ $labels.instance }}) description: "Stackdriver exporter scrape for project {{ $labels.project_id }} is taking more than 5 minutes ({{ $value }}s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.2.3. Stackdriver exporter scrape errors increasing
Stackdriver exporter has had {{ $value }} scrape errors in the last 15 minutes for project {{ $labels.project_id }}. [copy] - alert: StackdriverExporterScrapeErrorsIncreasing expr: increase(stackdriver_monitoring_scrape_errors_total[15m]) > 5 for: 0m labels: severity: warning annotations: summary: Stackdriver exporter scrape errors increasing (instance {{ $labels.instance }}) description: "Stackdriver exporter has had {{ $value }} scrape errors in the last 15 minutes for project {{ $labels.project_id }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.2.4. Stackdriver exporter high API calls
Stackdriver exporter is making {{ $value }} API calls per minute for project {{ $labels.project_id }}. This may hit Google Cloud Monitoring API quotas. [copy] - alert: StackdriverExporterHighApiCalls expr: rate(stackdriver_monitoring_api_calls_total[5m]) * 60 > 100 for: 0m labels: severity: warning annotations: summary: Stackdriver exporter high API calls (instance {{ $labels.instance }}) description: "Stackdriver exporter is making {{ $value }} API calls per minute for project {{ $labels.project_id }}. This may hit Google Cloud Monitoring API quotas.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.2.5. Stackdriver exporter scrape stale
Stackdriver exporter has not successfully scraped metrics for project {{ $labels.project_id }} in the last 10 minutes. [copy] - alert: StackdriverExporterScrapeStale expr: time() - stackdriver_monitoring_last_scrape_timestamp > 600 for: 0m labels: severity: warning annotations: summary: Stackdriver exporter scrape stale (instance {{ $labels.instance }}) description: "Stackdriver exporter has not successfully scraped metrics for project {{ $labels.project_id }} in the last 10 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 7.3. DigitalOcean : metalmatze/digitalocean_exporter (10 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/digitalocean/digitalocean-exporter.yml-
# 7.3.1. DigitalOcean droplet down
DigitalOcean droplet {{ $labels.name }} ({{ $labels.id }}) in {{ $labels.region }} is not running. [copy] - alert: DigitaloceanDropletDown expr: digitalocean_droplet_up == 0 for: 5m labels: severity: critical annotations: summary: DigitalOcean droplet down (instance {{ $labels.instance }}) description: "DigitalOcean droplet {{ $labels.name }} ({{ $labels.id }}) in {{ $labels.region }} is not running.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.3.2. DigitalOcean account not active
DigitalOcean account is not active. It may be suspended or locked. [copy] - alert: DigitaloceanAccountNotActive expr: digitalocean_account_active != 1 for: 0m labels: severity: critical annotations: summary: DigitalOcean account not active (instance {{ $labels.instance }}) description: "DigitalOcean account is not active. It may be suspended or locked.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.3.3. DigitalOcean database down
DigitalOcean managed database {{ $labels.name }} ({{ $labels.engine }}) in {{ $labels.region }} is offline. [copy] - alert: DigitaloceanDatabaseDown expr: digitalocean_database_status == 0 for: 2m labels: severity: critical annotations: summary: DigitalOcean database down (instance {{ $labels.instance }}) description: "DigitalOcean managed database {{ $labels.name }} ({{ $labels.engine }}) in {{ $labels.region }} is offline.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.3.4. DigitalOcean Kubernetes cluster down
DigitalOcean Kubernetes cluster {{ $labels.name }} ({{ $labels.version }}) in {{ $labels.region }} is not running. [copy] - alert: DigitaloceanKubernetesClusterDown expr: digitalocean_kubernetes_cluster_up == 0 for: 5m labels: severity: critical annotations: summary: DigitalOcean Kubernetes cluster down (instance {{ $labels.instance }}) description: "DigitalOcean Kubernetes cluster {{ $labels.name }} ({{ $labels.version }}) in {{ $labels.region }} is not running.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.3.5. DigitalOcean load balancer down
DigitalOcean load balancer {{ $labels.name }} ({{ $labels.ip }}) is not active. [copy] - alert: DigitaloceanLoadBalancerDown expr: digitalocean_loadbalancer_status == 0 for: 2m labels: severity: critical annotations: summary: DigitalOcean load balancer down (instance {{ $labels.instance }}) description: "DigitalOcean load balancer {{ $labels.name }} ({{ $labels.ip }}) is not active.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.3.6. DigitalOcean load balancer no backends
DigitalOcean load balancer {{ $labels.name }} ({{ $labels.ip }}) has no droplets attached. [copy] - alert: DigitaloceanLoadBalancerNoBackends expr: digitalocean_loadbalancer_droplets == 0 for: 0m labels: severity: warning annotations: summary: DigitalOcean load balancer no backends (instance {{ $labels.instance }}) description: "DigitalOcean load balancer {{ $labels.name }} ({{ $labels.ip }}) has no droplets attached.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.3.7. DigitalOcean floating IP not assigned
DigitalOcean floating IP {{ $labels.ipv4 }} in {{ $labels.region }} is not assigned to any droplet. [copy] - alert: DigitaloceanFloatingIpNotAssigned expr: digitalocean_floating_ipv4_active == 0 for: 0m labels: severity: warning annotations: summary: DigitalOcean floating IP not assigned (instance {{ $labels.instance }}) description: "DigitalOcean floating IP {{ $labels.ipv4 }} in {{ $labels.region }} is not assigned to any droplet.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.3.8. DigitalOcean active incidents
DigitalOcean platform has {{ $value }} active incident(s). [copy] - alert: DigitaloceanActiveIncidents expr: digitalocean_incidents_total > 0 for: 0m labels: severity: warning annotations: summary: DigitalOcean active incidents (instance {{ $labels.instance }}) description: "DigitalOcean platform has {{ $value }} active incident(s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.3.9. DigitalOcean exporter collection errors
DigitalOcean exporter {{ $labels.collector }} collector has {{ $value }} errors. [copy] - alert: DigitaloceanExporterCollectionErrors expr: increase(digitalocean_errors_total[5m]) > 0 for: 0m labels: severity: warning annotations: summary: DigitalOcean exporter collection errors (instance {{ $labels.instance }}) description: "DigitalOcean exporter {{ $labels.collector }} collector has {{ $value }} errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.3.10. DigitalOcean droplet limit approaching
DigitalOcean account is using {{ $value }}% of its droplet quota. [copy] # Fires when more than 80% of the account's droplet limit is in use. - alert: DigitaloceanDropletLimitApproaching expr: (count(digitalocean_droplet_up) / digitalocean_account_droplet_limit) * 100 > 80 for: 0m labels: severity: warning annotations: summary: DigitalOcean droplet limit approaching (instance {{ $labels.instance }}) description: "DigitalOcean account is using {{ $value }}% of its droplet quota.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 7.4. Azure : webdevops/azure-metrics-exporter (5 rules) [copy section]
The exporter uses azurerm_resource_metric as the default metric name for forwarded Azure Monitor metrics.
The metric name can be customized via the name parameter in probe configuration.
Self-monitoring metrics use the azurerm_stats_* and azurerm_api_* prefixes.
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/azure/azure-metrics-exporter.yml-
# 7.4.1. Azure exporter request errors
Azure metrics exporter on {{ $labels.instance }} has {{ $value }} API request errors in the last 15 minutes. [copy] - alert: AzureExporterRequestErrors expr: increase(azurerm_stats_metric_requests{result="error"}[15m]) > 5 for: 0m labels: severity: warning annotations: summary: Azure exporter request errors (instance {{ $labels.instance }}) description: "Azure metrics exporter on {{ $labels.instance }} has {{ $value }} API request errors in the last 15 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.4.2. Azure exporter high error rate
Azure metrics exporter on {{ $labels.instance }} has an error rate above 10% ({{ $value }}%). [copy] - alert: AzureExporterHighErrorRate expr: sum by (instance) (rate(azurerm_stats_metric_requests{result="error"}[5m])) / sum by (instance) (rate(azurerm_stats_metric_requests[5m])) * 100 > 10 for: 5m labels: severity: warning annotations: summary: Azure exporter high error rate (instance {{ $labels.instance }}) description: "Azure metrics exporter on {{ $labels.instance }} has an error rate above 10% ({{ $value }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.4.3. Azure API read rate limit approaching
Azure API read rate limit for subscription {{ $labels.subscriptionID }} is running low ({{ $value }} remaining). [copy] # Azure Resource Manager enforces rate limits per subscription. # The threshold of 100 remaining calls is a rough default. Adjust based on your # scrape interval and number of monitored resources. - alert: AzureApiReadRateLimitApproaching expr: azurerm_api_ratelimit{type="read"} < 100 for: 0m labels: severity: warning annotations: summary: Azure API read rate limit approaching (instance {{ $labels.instance }}) description: "Azure API read rate limit for subscription {{ $labels.subscriptionID }} is running low ({{ $value }} remaining).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
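The comment above says to size the threshold from your scrape interval and number of monitored resources. A rough sketch of that sizing calculation (the call counts and interval below are hypothetical examples, not defaults of the exporter):

```python
def reads_per_hour(api_calls_per_scrape: int, scrape_interval_s: int) -> float:
    """Estimate ARM read calls the exporter consumes per hour."""
    scrapes_per_hour = 3600 / scrape_interval_s
    return api_calls_per_scrape * scrapes_per_hour

# At 50 API calls per scrape on a 60s interval the exporter consumes
# 3,000 reads/hour, so a remaining budget of 100 lasts only ~2 minutes --
# in that setup the alert should fire well before the budget reaches 100.
```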
-
# 7.4.4. Azure API write rate limit approaching
Azure API write rate limit for subscription {{ $labels.subscriptionID }} is running low ({{ $value }} remaining). [copy] - alert: AzureApiWriteRateLimitApproaching expr: azurerm_api_ratelimit{type="write"} < 50 for: 0m labels: severity: warning annotations: summary: Azure API write rate limit approaching (instance {{ $labels.instance }}) description: "Azure API write rate limit for subscription {{ $labels.subscriptionID }} is running low ({{ $value }} remaining).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.4.5. Azure exporter slow collection
Azure metrics exporter on {{ $labels.instance }} metric collection is taking more than 5 minutes ({{ $value }}s). [copy] - alert: AzureExporterSlowCollection expr: azurerm_stats_metric_collecttime > 300 for: 5m labels: severity: warning annotations: summary: Azure exporter slow collection (instance {{ $labels.instance }}) description: "Azure metrics exporter on {{ $labels.instance }} metric collection is taking more than 5 minutes ({{ $value }}s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.1.1. Thanos : Thanos Compactor (5 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-compactor.yml-
# 8.1.1.1. Thanos Compactor Multiple Running
No more than one Thanos Compact instance should be running at once. There are {{$value}} instances running. [copy] - alert: ThanosCompactorMultipleRunning expr: sum by (job) (up{job=~".*thanos-compact.*"}) > 1 for: 5m labels: severity: warning annotations: summary: Thanos Compactor Multiple Running (instance {{ $labels.instance }}) description: "No more than one Thanos Compact instance should be running at once. There are {{$value}} instances running.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.1.2. Thanos Compactor Halted
Thanos Compact {{$labels.job}} has failed to run and now is halted. [copy] - alert: ThanosCompactorHalted expr: thanos_compact_halted{job=~".*thanos-compact.*"} == 1 for: 5m labels: severity: warning annotations: summary: Thanos Compactor Halted (instance {{ $labels.instance }}) description: "Thanos Compact {{$labels.job}} has failed to run and now is halted.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.1.3. Thanos Compactor High Compaction Failures
Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}% of compactions. [copy] - alert: ThanosCompactorHighCompactionFailures expr: (sum by (job) (rate(thanos_compact_group_compactions_failures_total{job=~".*thanos-compact.*"}[5m])) / sum by (job) (rate(thanos_compact_group_compactions_total{job=~".*thanos-compact.*"}[5m])) * 100 > 5) for: 15m labels: severity: warning annotations: summary: Thanos Compactor High Compaction Failures (instance {{ $labels.instance }}) description: "Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}% of compactions.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.1.4. Thanos Compact Bucket High Operation Failures
Thanos Compact {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations. [copy] - alert: ThanosCompactBucketHighOperationFailures expr: (sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-compact.*"}[5m])) / sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-compact.*"}[5m])) * 100 > 5) for: 15m labels: severity: warning annotations: summary: Thanos Compact Bucket High Operation Failures (instance {{ $labels.instance }}) description: "Thanos Compact {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.1.5. Thanos Compact Has Not Run
Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours. [copy] - alert: ThanosCompactHasNotRun expr: (time() - max by (job) (max_over_time(thanos_objstore_bucket_last_successful_upload_time{job=~".*thanos-compact.*"}[24h]))) / 60 / 60 > 24 for: 0m labels: severity: warning annotations: summary: Thanos Compact Has Not Run (instance {{ $labels.instance }}) description: "Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.1.2. Thanos : Thanos Query (8 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-query.yml-
# 8.1.2.1. Thanos Query Http Request Query Error Rate High
Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of "query" requests. [copy] - alert: ThanosQueryHttpRequestQueryErrorRateHigh expr: (sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query"}[5m]))/ sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query"}[5m]))) * 100 > 5 for: 5m labels: severity: critical annotations: summary: Thanos Query Http Request Query Error Rate High (instance {{ $labels.instance }}) description: "Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of \"query\" requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.2.2. Thanos Query Http Request Query Range Error Rate High
Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of "query_range" requests. [copy] - alert: ThanosQueryHttpRequestQueryRangeErrorRateHigh expr: (sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query_range"}[5m]))/ sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query_range"}[5m]))) * 100 > 5 for: 5m labels: severity: critical annotations: summary: Thanos Query Http Request Query Range Error Rate High (instance {{ $labels.instance }}) description: "Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of \"query_range\" requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.2.3. Thanos Query Grpc Server Error Rate
Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of requests. [copy] - alert: ThanosQueryGrpcServerErrorRate expr: (sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-query.*"}[5m]))/ sum by (job) (rate(grpc_server_started_total{job=~".*thanos-query.*"}[5m])) * 100 > 5) for: 5m labels: severity: warning annotations: summary: Thanos Query Grpc Server Error Rate (instance {{ $labels.instance }}) description: "Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.2.4. Thanos Query Grpc Client Error Rate
Thanos Query {{$labels.job}} is failing to send {{$value | humanize}}% of requests. [copy] - alert: ThanosQueryGrpcClientErrorRate expr: (sum by (job) (rate(grpc_client_handled_total{grpc_code!="OK", job=~".*thanos-query.*"}[5m])) / sum by (job) (rate(grpc_client_started_total{job=~".*thanos-query.*"}[5m]))) * 100 > 5 for: 5m labels: severity: warning annotations: summary: Thanos Query Grpc Client Error Rate (instance {{ $labels.instance }}) description: "Thanos Query {{$labels.job}} is failing to send {{$value | humanize}}% of requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.2.5. Thanos Query High DNS Failures
Thanos Query {{$labels.job}} has {{$value | humanize}}% failing DNS queries for store endpoints. [copy] - alert: ThanosQueryHighDNSFailures expr: (sum by (job) (rate(thanos_query_store_apis_dns_failures_total{job=~".*thanos-query.*"}[5m])) / sum by (job) (rate(thanos_query_store_apis_dns_lookups_total{job=~".*thanos-query.*"}[5m]))) * 100 > 1 for: 15m labels: severity: warning annotations: summary: Thanos Query High DNS Failures (instance {{ $labels.instance }}) description: "Thanos Query {{$labels.job}} has {{$value | humanize}}% failing DNS queries for store endpoints.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.2.6. Thanos Query Instant Latency High
Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for instant queries. [copy] - alert: ThanosQueryInstantLatencyHigh expr: (histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query"}[5m]))) > 40 and sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-query.*", handler="query"}[5m])) > 0) for: 10m labels: severity: critical annotations: summary: Thanos Query Instant Latency High (instance {{ $labels.instance }}) description: "Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for instant queries.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.2.7. Thanos Query Range Latency High
Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for range queries. [copy] - alert: ThanosQueryRangeLatencyHigh expr: (histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query_range"}[5m]))) > 90 and sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-query.*", handler="query_range"}[5m])) > 0) for: 10m labels: severity: critical annotations: summary: Thanos Query Range Latency High (instance {{ $labels.instance }}) description: "Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for range queries.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.2.8. Thanos Query Overload
Thanos Query {{$labels.job}} has been overloaded for more than 15 minutes. This may be a symptom of excessive simultaneous complex requests, low performance of the Prometheus API, or failures within these components. Assess the health of the Thanos Query instances and the connected Prometheus instances, look for potential senders of these requests and then contact support. [copy] - alert: ThanosQueryOverload expr: (max_over_time(thanos_query_concurrent_gate_queries_max[5m]) - avg_over_time(thanos_query_concurrent_gate_queries_in_flight[5m]) < 1) for: 15m labels: severity: warning annotations: summary: Thanos Query Overload (instance {{ $labels.instance }}) description: "Thanos Query {{$labels.job}} has been overloaded for more than 15 minutes. This may be a symptom of excessive simultaneous complex requests, low performance of the Prometheus API, or failures within these components. Assess the health of the Thanos Query instances and the connected Prometheus instances, look for potential senders of these requests and then contact support.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
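The overload expression compares the gate's concurrency ceiling with the average number of in-flight queries; the gate is considered saturated when fewer than one slot is free on average. A minimal sketch of that test:

```python
def gate_saturated(max_concurrent: float, avg_in_flight: float) -> bool:
    """Mirror the PromQL: saturated when less than one concurrency slot is free."""
    return (max_concurrent - avg_in_flight) < 1

# With a gate of 20 concurrent queries, an average of 19.5 in flight means new
# queries are queuing behind the gate; 12 in flight leaves plenty of headroom.
```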
-
-
# 8.1.3. Thanos : Thanos Receiver (7 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-receiver.yml-
# 8.1.3.1. Thanos Receive Http Request Error Rate High
Thanos Receive {{$labels.job}} is failing to handle {{$value | humanize}}% of requests. [copy] - alert: ThanosReceiveHttpRequestErrorRateHigh expr: (sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-receive.*", handler="receive"}[5m]))/ sum by (job) (rate(http_requests_total{job=~".*thanos-receive.*", handler="receive"}[5m]))) * 100 > 5 for: 5m labels: severity: critical annotations: summary: Thanos Receive Http Request Error Rate High (instance {{ $labels.instance }}) description: "Thanos Receive {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.3.2. Thanos Receive Http Request Latency High
Thanos Receive {{$labels.job}} has a 99th percentile latency of {{ $value }} seconds for requests. [copy] - alert: ThanosReceiveHttpRequestLatencyHigh expr: (histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-receive.*", handler="receive"}[5m]))) > 10 and sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-receive.*", handler="receive"}[5m])) > 0) for: 10m labels: severity: critical annotations: summary: Thanos Receive Http Request Latency High (instance {{ $labels.instance }}) description: "Thanos Receive {{$labels.job}} has a 99th percentile latency of {{ $value }} seconds for requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.3.3. Thanos Receive High Replication Failures
Thanos Receive {{$labels.job}} is failing to replicate {{$value | humanize}}% of requests. [copy] - alert: ThanosReceiveHighReplicationFailures expr: thanos_receive_replication_factor > 1 and ((sum by (job) (rate(thanos_receive_replications_total{result="error", job=~".*thanos-receive.*"}[5m])) / sum by (job) (rate(thanos_receive_replications_total{job=~".*thanos-receive.*"}[5m]))) > (max by (job) (floor((thanos_receive_replication_factor{job=~".*thanos-receive.*"}+1)/ 2)) / max by (job) (thanos_receive_hashring_nodes{job=~".*thanos-receive.*"}))) * 100 for: 5m labels: severity: warning annotations: summary: Thanos Receive High Replication Failures (instance {{ $labels.instance }}) description: "Thanos Receive {{$labels.job}} is failing to replicate {{$value | humanize}}% of requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
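The tolerance in the expression above is not a fixed percentage: it compares the failure ratio against the write quorum, floor((replication_factor + 1) / 2), taken as a fraction of the hashring size. A sketch of that arithmetic:

```python
import math

def replication_failure_threshold(replication_factor: int, hashring_nodes: int) -> float:
    """Fraction of failed replications the rule tolerates before firing:
    write quorum floor((rf + 1) / 2) divided by the number of hashring nodes."""
    quorum = math.floor((replication_factor + 1) / 2)
    return quorum / hashring_nodes

# With replication_factor=3 on a 3-node hashring the quorum is 2, so the
# alert fires once more than 2/3 of replications fail -- below that, writes
# can still reach quorum.
```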
-
# 8.1.3.4. Thanos Receive High Forward Request Failures
Thanos Receive {{$labels.job}} is failing to forward {{$value | humanize}}% of requests. [copy] - alert: ThanosReceiveHighForwardRequestFailures expr: (sum by (job) (rate(thanos_receive_forward_requests_total{result="error", job=~".*thanos-receive.*"}[5m]))/ sum by (job) (rate(thanos_receive_forward_requests_total{job=~".*thanos-receive.*"}[5m]))) * 100 > 20 for: 5m labels: severity: info annotations: summary: Thanos Receive High Forward Request Failures (instance {{ $labels.instance }}) description: "Thanos Receive {{$labels.job}} is failing to forward {{$value | humanize}}% of requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.3.5. Thanos Receive High Hashring File Refresh Failures
Thanos Receive {{$labels.job}} is failing to refresh the hashring file, {{$value | humanize}} of attempts failed. [copy] - alert: ThanosReceiveHighHashringFileRefreshFailures expr: (sum by (job) (rate(thanos_receive_hashrings_file_errors_total{job=~".*thanos-receive.*"}[5m])) / sum by (job) (rate(thanos_receive_hashrings_file_refreshes_total{job=~".*thanos-receive.*"}[5m])) > 0) for: 15m labels: severity: warning annotations: summary: Thanos Receive High Hashring File Refresh Failures (instance {{ $labels.instance }}) description: "Thanos Receive {{$labels.job}} is failing to refresh the hashring file, {{$value | humanize}} of attempts failed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.3.6. Thanos Receive Config Reload Failure
Thanos Receive {{$labels.job}} has not been able to reload hashring configurations. [copy] - alert: ThanosReceiveConfigReloadFailure expr: avg by (job) (thanos_receive_config_last_reload_successful{job=~".*thanos-receive.*"}) != 1 for: 5m labels: severity: warning annotations: summary: Thanos Receive Config Reload Failure (instance {{ $labels.instance }}) description: "Thanos Receive {{$labels.job}} has not been able to reload hashring configurations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.3.7. Thanos Receive No Upload
Thanos Receive {{$labels.instance}} has not uploaded the latest data to object storage. [copy] - alert: ThanosReceiveNoUpload expr: (up{job=~".*thanos-receive.*"} - 1) + on (job, instance) (sum by (job, instance) (increase(thanos_shipper_uploads_total{job=~".*thanos-receive.*"}[3h])) == 0) for: 3h labels: severity: critical annotations: summary: Thanos Receive No Upload (instance {{ $labels.instance }}) description: "Thanos Receive {{$labels.instance}} has not uploaded the latest data to object storage.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.1.4. Thanos : Thanos Sidecar (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-sidecar.yml
-
# 8.1.4.1. Thanos Sidecar Bucket Operations Failed
Thanos Sidecar {{$labels.instance}} bucket operations are failing [copy] - alert: ThanosSidecarBucketOperationsFailed expr: sum by (job, instance) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-sidecar.*"}[5m])) > 0 for: 5m labels: severity: critical annotations: summary: Thanos Sidecar Bucket Operations Failed (instance {{ $labels.instance }}) description: "Thanos Sidecar {{$labels.instance}} bucket operations are failing\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.4.2. Thanos Sidecar No Connection To Started Prometheus
Thanos Sidecar {{$labels.instance}} is unhealthy. [copy] - alert: ThanosSidecarNoConnectionToStartedPrometheus expr: thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0 and on (namespace, pod) prometheus_tsdb_data_replay_duration_seconds != 0 for: 5m labels: severity: critical annotations: summary: Thanos Sidecar No Connection To Started Prometheus (instance {{ $labels.instance }}) description: "Thanos Sidecar {{$labels.instance}} is unhealthy.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.1.5. Thanos : Thanos Store (4 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-store.yml
-
# 8.1.5.1. Thanos Store Grpc Error Rate
Thanos Store {{$labels.job}} is failing to handle {{$value | humanize}}% of requests. [copy] - alert: ThanosStoreGrpcErrorRate expr: (sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-store.*"}[5m]))/ sum by (job) (rate(grpc_server_started_total{job=~".*thanos-store.*"}[5m])) * 100 > 5) for: 5m labels: severity: warning annotations: summary: Thanos Store Grpc Error Rate (instance {{ $labels.instance }}) description: "Thanos Store {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.5.2. Thanos Store Series Gate Latency High
Thanos Store {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for store series gate requests. [copy] - alert: ThanosStoreSeriesGateLatencyHigh expr: (histogram_quantile(0.99, sum by (job, le) (rate(thanos_bucket_store_series_gate_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))) > 2 and sum by (job) (rate(thanos_bucket_store_series_gate_duration_seconds_count{job=~".*thanos-store.*"}[5m])) > 0) for: 10m labels: severity: warning annotations: summary: Thanos Store Series Gate Latency High (instance {{ $labels.instance }}) description: "Thanos Store {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for store series gate requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.5.3. Thanos Store Bucket High Operation Failures
Thanos Store {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations. [copy] - alert: ThanosStoreBucketHighOperationFailures expr: (sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-store.*"}[5m])) / sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-store.*"}[5m])) * 100 > 5) for: 15m labels: severity: warning annotations: summary: Thanos Store Bucket High Operation Failures (instance {{ $labels.instance }}) description: "Thanos Store {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.5.4. Thanos Store Objstore Operation Latency High
Thanos Store {{$labels.job}} Bucket has a 99th percentile latency of {{$value}} seconds for the bucket operations. [copy] - alert: ThanosStoreObjstoreOperationLatencyHigh expr: (histogram_quantile(0.99, sum by (job, le) (rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))) > 2 and sum by (job) (rate(thanos_objstore_bucket_operation_duration_seconds_count{job=~".*thanos-store.*"}[5m])) > 0) for: 10m labels: severity: warning annotations: summary: Thanos Store Objstore Operation Latency High (instance {{ $labels.instance }}) description: "Thanos Store {{$labels.job}} Bucket has a 99th percentile latency of {{$value}} seconds for the bucket operations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.1.6. Thanos : Thanos Ruler (11 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-ruler.yml
-
# 8.1.6.1. Thanos Rule Queue Is Dropping Alerts
Thanos Rule {{$labels.instance}} is failing to queue alerts. [copy] - alert: ThanosRuleQueueIsDroppingAlerts expr: sum by (job, instance) (rate(thanos_alert_queue_alerts_dropped_total{job=~".*thanos-rule.*"}[5m])) > 0 for: 5m labels: severity: critical annotations: summary: Thanos Rule Queue Is Dropping Alerts (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.instance}} is failing to queue alerts.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.6.2. Thanos Rule Sender Is Failing Alerts
Thanos Rule {{$labels.instance}} is failing to send alerts to Alertmanager. [copy] - alert: ThanosRuleSenderIsFailingAlerts expr: sum by (job, instance) (rate(thanos_alert_sender_alerts_dropped_total{job=~".*thanos-rule.*"}[5m])) > 0 for: 5m labels: severity: critical annotations: summary: Thanos Rule Sender Is Failing Alerts (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.instance}} is failing to send alerts to Alertmanager.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.6.3. Thanos Rule High Rule Evaluation Failures
Thanos Rule {{$labels.instance}} is failing to evaluate rules. [copy] - alert: ThanosRuleHighRuleEvaluationFailures expr: (sum by (job, instance) (rate(prometheus_rule_evaluation_failures_total{job=~".*thanos-rule.*"}[5m])) / sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m])) * 100 > 5) for: 5m labels: severity: critical annotations: summary: Thanos Rule High Rule Evaluation Failures (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.instance}} is failing to evaluate rules.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.6.4. Thanos Rule High Rule Evaluation Warnings
Thanos Rule {{$labels.instance}} has a high number of evaluation warnings. [copy] - alert: ThanosRuleHighRuleEvaluationWarnings expr: sum by (job, instance) (rate(thanos_rule_evaluation_with_warnings_total{job=~".*thanos-rule.*"}[5m])) > 0 for: 15m labels: severity: info annotations: summary: Thanos Rule High Rule Evaluation Warnings (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.instance}} has a high number of evaluation warnings.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.6.5. Thanos Rule Rule Evaluation Latency High
Thanos Rule {{$labels.instance}} has higher evaluation latency than interval for {{$labels.rule_group}}. [copy] - alert: ThanosRuleRuleEvaluationLatencyHigh expr: (sum by (job, instance, rule_group) (prometheus_rule_group_last_duration_seconds{job=~".*thanos-rule.*"}) > sum by (job, instance, rule_group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"})) for: 5m labels: severity: warning annotations: summary: Thanos Rule Rule Evaluation Latency High (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.instance}} has higher evaluation latency than interval for {{$labels.rule_group}}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.6.6. Thanos Rule Grpc Error Rate
Thanos Rule {{$labels.job}} is failing to handle {{$value | humanize}}% of requests. [copy] - alert: ThanosRuleGrpcErrorRate expr: (sum by (job, instance) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-rule.*"}[5m]))/ sum by (job, instance) (rate(grpc_server_started_total{job=~".*thanos-rule.*"}[5m])) * 100 > 5) for: 5m labels: severity: warning annotations: summary: Thanos Rule Grpc Error Rate (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.6.7. Thanos Rule Config Reload Failure
Thanos Rule {{$labels.job}} has not been able to reload its configuration. [copy] - alert: ThanosRuleConfigReloadFailure expr: avg by (job, instance) (thanos_rule_config_last_reload_successful{job=~".*thanos-rule.*"}) != 1 for: 5m labels: severity: info annotations: summary: Thanos Rule Config Reload Failure (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.job}} has not been able to reload its configuration.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.6.8. Thanos Rule Query High DNS Failures
Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for query endpoints. [copy] - alert: ThanosRuleQueryHighDNSFailures expr: (sum by (job, instance) (rate(thanos_rule_query_apis_dns_failures_total{job=~".*thanos-rule.*"}[5m])) / sum by (job, instance) (rate(thanos_rule_query_apis_dns_lookups_total{job=~".*thanos-rule.*"}[5m])) * 100 > 1) for: 15m labels: severity: warning annotations: summary: Thanos Rule Query High DNS Failures (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for query endpoints.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.6.9. Thanos Rule Alertmanager High DNS Failures
Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints. [copy] - alert: ThanosRuleAlertmanagerHighDNSFailures expr: (sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_failures_total{job=~".*thanos-rule.*"}[5m])) / sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total{job=~".*thanos-rule.*"}[5m])) * 100 > 1) for: 15m labels: severity: warning annotations: summary: Thanos Rule Alertmanager High DNS Failures (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.6.10. Thanos Rule No Evaluation For 10 Intervals
Thanos Rule {{$labels.job}} has rule groups that did not evaluate for at least 10x of their expected interval. [copy] - alert: ThanosRuleNoEvaluationFor10Intervals expr: time() - max by (job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job=~".*thanos-rule.*"}) > 10 * max by (job, instance, group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"}) for: 5m labels: severity: info annotations: summary: Thanos Rule No Evaluation For 10 Intervals (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.job}} has rule groups that did not evaluate for at least 10x of their expected interval.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
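The expression measures staleness relative to each group's own interval, so a group scheduled every 30s and one scheduled every 5m are each allowed the same ten missed runs. A minimal sketch of the arithmetic (function and sample timestamps are illustrative):

```python
def missed_intervals(now_ts: float, last_evaluation_ts: float, interval_seconds: float) -> float:
    """Evaluation intervals elapsed since the rule group last ran,
    mirroring: time() - last_evaluation_timestamp > 10 * interval."""
    return (now_ts - last_evaluation_ts) / interval_seconds

# a group scheduled every 30s that last evaluated 6 minutes ago has
# missed 12 intervals, which crosses the 10x threshold
```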
-
# 8.1.6.11. Thanos No Rule Evaluations
Thanos Rule {{$labels.instance}} did not perform any rule evaluations in the past 10 minutes. [copy] - alert: ThanosNoRuleEvaluations expr: sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m])) <= 0 and sum by (job, instance) (thanos_rule_loaded_rules{job=~".*thanos-rule.*"}) > 0 for: 5m labels: severity: critical annotations: summary: Thanos No Rule Evaluations (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.instance}} did not perform any rule evaluations in the past 10 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.1.7. Thanos : Thanos Bucket Replicate (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-bucket-replicate.yml
-
# 8.1.7.1. Thanos Bucket Replicate Error Rate
Thanos Replicate is failing to run, {{$value | humanize}}% of attempts failed. [copy] - alert: ThanosBucketReplicateErrorRate expr: (sum by (job) (rate(thanos_replicate_replication_runs_total{result="error", job=~".*thanos-bucket-replicate.*"}[5m]))/ on (job) group_left sum by (job) (rate(thanos_replicate_replication_runs_total{job=~".*thanos-bucket-replicate.*"}[5m]))) * 100 >= 10 for: 5m labels: severity: critical annotations: summary: Thanos Bucket Replicate Error Rate (instance {{ $labels.instance }}) description: "Thanos Replicate is failing to run, {{$value | humanize}}% of attempts failed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.7.2. Thanos Bucket Replicate Run Latency
Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for the replicate operations. [copy] - alert: ThanosBucketReplicateRunLatency expr: (histogram_quantile(0.99, sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~".*thanos-bucket-replicate.*"}[5m]))) > 20 and sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_count{job=~".*thanos-bucket-replicate.*"}[5m])) > 0) for: 5m labels: severity: critical annotations: summary: Thanos Bucket Replicate Run Latency (instance {{ $labels.instance }}) description: "Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for the replicate operations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.1.8. Thanos : Thanos Component Absent (6 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-component-absent.yml
-
# 8.1.8.1. Thanos Compact Is Down
ThanosCompact has disappeared. Prometheus target for the component cannot be discovered. [copy] - alert: ThanosCompactIsDown expr: absent(up{job=~".*thanos-compact.*"} == 1) for: 5m labels: severity: critical annotations: summary: Thanos Compact Is Down (instance {{ $labels.instance }}) description: "ThanosCompact has disappeared. Prometheus target for the component cannot be discovered.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
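All six rules in this group lean on the same absent(up{...} == 1) pattern: the inner selector returns nothing while no healthy target exists, and absent() turns that emptiness into a single firing sample. A toy Python model of the semantics (this stand-in is illustrative, not the real PromQL engine):

```python
def absent(matching_samples):
    """Toy model of PromQL absent(): emit one sample with value 1 when the
    inner selector matched nothing, otherwise emit an empty vector."""
    return [1.0] if not matching_samples else []

# no series satisfies up{job=~".*thanos-compact.*"} == 1 while the
# component is down, so absent() yields the one sample that fires the alert
```

This is also why such alerts carry no instance label worth templating: the firing sample is synthesized, not scraped.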
-
# 8.1.8.2. Thanos Query Is Down
ThanosQuery has disappeared. Prometheus target for the component cannot be discovered. [copy] - alert: ThanosQueryIsDown expr: absent(up{job=~".*thanos-query.*"} == 1) for: 5m labels: severity: critical annotations: summary: Thanos Query Is Down (instance {{ $labels.instance }}) description: "ThanosQuery has disappeared. Prometheus target for the component cannot be discovered.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.8.3. Thanos Receive Is Down
ThanosReceive has disappeared. Prometheus target for the component cannot be discovered. [copy] - alert: ThanosReceiveIsDown expr: absent(up{job=~".*thanos-receive.*"} == 1) for: 5m labels: severity: critical annotations: summary: Thanos Receive Is Down (instance {{ $labels.instance }}) description: "ThanosReceive has disappeared. Prometheus target for the component cannot be discovered.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.8.4. Thanos Rule Is Down
ThanosRule has disappeared. Prometheus target for the component cannot be discovered. [copy] - alert: ThanosRuleIsDown expr: absent(up{job=~".*thanos-rule.*"} == 1) for: 5m labels: severity: critical annotations: summary: Thanos Rule Is Down (instance {{ $labels.instance }}) description: "ThanosRule has disappeared. Prometheus target for the component cannot be discovered.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.8.5. Thanos Sidecar Is Down
ThanosSidecar has disappeared. Prometheus target for the component cannot be discovered. [copy] - alert: ThanosSidecarIsDown expr: absent(up{job=~".*thanos-sidecar.*"} == 1) for: 5m labels: severity: critical annotations: summary: Thanos Sidecar Is Down (instance {{ $labels.instance }}) description: "ThanosSidecar has disappeared. Prometheus target for the component cannot be discovered.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.8.6. Thanos Store Is Down
ThanosStore has disappeared. Prometheus target for the component cannot be discovered. [copy] - alert: ThanosStoreIsDown expr: absent(up{job=~".*thanos-store.*"} == 1) for: 5m labels: severity: critical annotations: summary: Thanos Store Is Down (instance {{ $labels.instance }}) description: "ThanosStore has disappeared. Prometheus target for the component cannot be discovered.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.2. Loki : Embedded exporter (4 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/loki/embedded-exporter.yml
-
# 8.2.1. Loki process too many restarts
A Loki process had too many restarts (target {{ $labels.instance }}) [copy] - alert: LokiProcessTooManyRestarts expr: changes(process_start_time_seconds{job=~".*loki.*"}[15m]) > 2 for: 0m labels: severity: warning annotations: summary: Loki process too many restarts (instance {{ $labels.instance }}) description: "A Loki process had too many restarts (target {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
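changes(process_start_time_seconds[15m]) counts how often the start-time value moved inside the window; each restart writes a new start timestamp. A toy sketch of the counting (sample values are illustrative Unix timestamps):

```python
def changes(samples):
    """Toy model of PromQL changes(): number of value changes between
    consecutive samples in the range vector."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a != b)

# three distinct restarts inside the window: changes() = 3 > 2, the alert fires
```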
-
# 8.2.2. Loki request errors
The {{ $labels.job }} {{ $labels.route }} is experiencing errors [copy] - alert: LokiRequestErrors expr: 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])) by (namespace, job, route) / sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 10 for: 15m labels: severity: critical annotations: summary: Loki request errors (instance {{ $labels.instance }}) description: "The {{ $labels.job }} {{ $labels.route }} is experiencing errors\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.2.3. Loki request panic
The {{ $labels.job }} is experiencing a {{ printf "%.2f" $value }}% increase in panics [copy] - alert: LokiRequestPanic expr: sum(increase(loki_panic_total[10m])) by (namespace, job) > 0 for: 5m labels: severity: critical annotations: summary: Loki request panic (instance {{ $labels.instance }}) description: "The {{ $labels.job }} is experiencing a {{ printf \"%.2f\" $value }}% increase in panics\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.2.4. Loki request latency
The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency [copy] - alert: LokiRequestLatency expr: (histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket{route!~"(?i).*tail.*"}[5m])) by (le))) > 1 for: 5m labels: severity: critical annotations: summary: Loki request latency (instance {{ $labels.instance }}) description: "The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}s 99th percentile latency\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.3. Promtail : Embedded exporter (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/promtail/embedded-exporter.yml
-
# 8.3.1. Promtail request errors
The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}% errors. [copy] - alert: PromtailRequestErrors expr: 100 * sum(rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[1m])) by (namespace, job, route, instance) / sum(rate(promtail_request_duration_seconds_count[1m])) by (namespace, job, route, instance) > 10 for: 5m labels: severity: critical annotations: summary: Promtail request errors (instance {{ $labels.instance }}) description: "The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}% errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.3.2. Promtail request latency
The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency. [copy] - alert: PromtailRequestLatency expr: histogram_quantile(0.99, sum(rate(promtail_request_duration_seconds_bucket[5m])) by (le)) > 1 for: 5m labels: severity: critical annotations: summary: Promtail request latency (instance {{ $labels.instance }}) description: "The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}s 99th percentile latency.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.4. Cortex : Embedded exporter (6 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/cortex/embedded-exporter.yml
-
# 8.4.1. Cortex ruler configuration reload failure
Cortex ruler configuration reload failure (instance {{ $labels.instance }}) [copy] - alert: CortexRulerConfigurationReloadFailure expr: cortex_ruler_config_last_reload_successful != 1 for: 0m labels: severity: warning annotations: summary: Cortex ruler configuration reload failure (instance {{ $labels.instance }}) description: "Cortex ruler configuration reload failure (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.4.2. Cortex not connected to Alertmanager
Cortex not connected to Alertmanager (instance {{ $labels.instance }}) [copy] - alert: CortexNotConnectedToAlertmanager expr: cortex_prometheus_notifications_alertmanagers_discovered < 1 for: 0m labels: severity: critical annotations: summary: Cortex not connected to Alertmanager (instance {{ $labels.instance }}) description: "Cortex not connected to Alertmanager (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.4.3. Cortex notifications are being dropped
Cortex notifications are being dropped due to errors (instance {{ $labels.instance }}) [copy] - alert: CortexNotificationAreBeingDropped expr: rate(cortex_prometheus_notifications_dropped_total[5m]) > 0 for: 0m labels: severity: critical annotations: summary: Cortex notifications are being dropped (instance {{ $labels.instance }}) description: "Cortex notifications are being dropped due to errors (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.4.4. Cortex notification error
Cortex is failing when sending alert notifications (instance {{ $labels.instance }}) [copy] - alert: CortexNotificationError expr: rate(cortex_prometheus_notifications_errors_total[5m]) > 0 for: 0m labels: severity: critical annotations: summary: Cortex notification error (instance {{ $labels.instance }}) description: "Cortex is failing when sending alert notifications (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.4.5. Cortex ingester unhealthy
Cortex has an unhealthy ingester [copy] - alert: CortexIngesterUnhealthy expr: cortex_ring_members{state="Unhealthy", name="ingester"} > 0 for: 0m labels: severity: critical annotations: summary: Cortex ingester unhealthy (instance {{ $labels.instance }}) description: "Cortex has an unhealthy ingester\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.4.6. Cortex frontend queries stuck
There are queued up queries in query-frontend. [copy] - alert: CortexFrontendQueriesStuck expr: sum by (job) (cortex_query_frontend_queue_length) > 0 for: 5m labels: severity: critical annotations: summary: Cortex frontend queries stuck (instance {{ $labels.instance }}) description: "There are queued up queries in query-frontend.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.5. Grafana Tempo : Embedded exporter (18 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/grafana-tempo/embedded-exporter.yml
-
# 8.5.1. Tempo distributor unhealthy
Tempo has {{ $value }} unhealthy distributor(s). [copy] - alert: TempoDistributorUnhealthy expr: max by (job) (tempo_ring_members{state="Unhealthy", name="distributor"}) > 0 for: 15m labels: severity: warning annotations: summary: Tempo distributor unhealthy (instance {{ $labels.instance }}) description: "Tempo has {{ $value }} unhealthy distributor(s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.2. Tempo live store unhealthy
Tempo has {{ $value }} unhealthy live store(s). [copy] - alert: TempoLiveStoreUnhealthy expr: max by (job) (tempo_ring_members{state="Unhealthy", name="live-store"}) > 0 for: 15m labels: severity: critical annotations: summary: Tempo live store unhealthy (instance {{ $labels.instance }}) description: "Tempo has {{ $value }} unhealthy live store(s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.3. Tempo metrics generator unhealthy
Tempo has {{ $value }} unhealthy metrics generator(s). [copy] - alert: TempoMetricsGeneratorUnhealthy expr: max by (job) (tempo_ring_members{state="Unhealthy", name="metrics-generator"}) > 0 for: 15m labels: severity: critical annotations: summary: Tempo metrics generator unhealthy (instance {{ $labels.instance }}) description: "Tempo has {{ $value }} unhealthy metrics generator(s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.4. Tempo compactions failing
Greater than 2 compactions have failed in the past hour. [copy] # Uses a two-window approach: 1h for historical count and 5m to confirm the issue is ongoing. - alert: TempoCompactionsFailing expr: sum by (job) (increase(tempodb_compaction_errors_total[1h])) > 2 and sum by (job) (increase(tempodb_compaction_errors_total[5m])) > 0 for: 1h labels: severity: critical annotations: summary: Tempo compactions failing (instance {{ $labels.instance }}) description: "Greater than 2 compactions have failed in the past hour.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
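The inline comment describes a pattern that several Tempo rules below also use: the long window establishes that enough failures accumulated, and the short window confirms the problem is still happening, so incidents that already resolved do not keep the alert firing. A sketch of that boolean logic (the function name and thresholds are illustrative):

```python
def two_window_fire(errors_long_window: float, errors_short_window: float,
                    long_threshold: float = 2) -> bool:
    """Fire only when failures exceeded the long-window threshold AND at
    least one failure occurred recently (i.e. the problem is ongoing)."""
    return errors_long_window > long_threshold and errors_short_window > 0

# five failed compactions in the past hour but none in the last 5 minutes:
# the incident is over, so the alert stays quiet
```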
-
# 8.5.5. Tempo polls failing
Greater than 2 blocklist polls have failed in the past hour. [copy] - alert: TempoPollsFailing expr: sum by (job) (increase(tempodb_blocklist_poll_errors_total[1h])) > 2 and sum by (job) (increase(tempodb_blocklist_poll_errors_total[5m])) > 0 for: 0m labels: severity: critical annotations: summary: Tempo polls failing (instance {{ $labels.instance }}) description: "Greater than 2 blocklist polls have failed in the past hour.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.6. Tempo tenant index failures
Greater than 2 tenant index failures in the past hour. [copy] - alert: TempoTenantIndexFailures expr: sum by (job) (increase(tempodb_blocklist_tenant_index_errors_total[1h])) > 2 and sum by (job) (increase(tempodb_blocklist_tenant_index_errors_total[5m])) > 0 for: 0m labels: severity: critical annotations: summary: Tempo tenant index failures (instance {{ $labels.instance }}) description: "Greater than 2 tenant index failures in the past hour.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.7. Tempo no tenant index builders
No tenant index builders for tenant {{ $labels.tenant }}. Tenant index will quickly become stale. [copy] - alert: TempoNoTenantIndexBuilders expr: sum by (tenant) (tempodb_blocklist_tenant_index_builder) == 0 and on() max(tempodb_blocklist_length) > 0 for: 5m labels: severity: critical annotations: summary: Tempo no tenant index builders (instance {{ $labels.instance }}) description: "No tenant index builders for tenant {{ $labels.tenant }}. Tenant index will quickly become stale.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.8. Tempo tenant index too old
Tenant index for {{ $labels.tenant }} is {{ $value }}s old. [copy] # Threshold of 600s (10 minutes). Adjust based on your tenant index build interval. - alert: TempoTenantIndexTooOld expr: max by (tenant) (tempodb_blocklist_tenant_index_age_seconds) > 600 for: 5m labels: severity: critical annotations: summary: Tempo tenant index too old (instance {{ $labels.instance }}) description: "Tenant index for {{ $labels.tenant }} is {{ $value }}s old.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.9. Tempo block list rising quickly
Tempo blocklist length is up {{ printf "%.0f" $value }}% over the last 7 days. Consider scaling compactors. [copy] # Fires when the blocklist grows more than 40% over 7 days. - alert: TempoBlockListRisingQuickly expr: (avg(tempodb_blocklist_length) / avg(tempodb_blocklist_length offset 7d) - 1) * 100 > 40 for: 15m labels: severity: critical annotations: summary: Tempo block list rising quickly (instance {{ $labels.instance }}) description: "Tempo blocklist length is up {{ printf \"%.0f\" $value }}% over the last 7 days. Consider scaling compactors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
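The growth check compares the current average blocklist length against its value one week earlier via offset 7d. A minimal sketch of the percentage math (the 40% cutoff is this rule's arbitrary tolerance; function name and sample counts are illustrative):

```python
def blocklist_growth_pct(current_avg: float, week_ago_avg: float) -> float:
    """Mirrors the PromQL: (avg(length) / avg(length offset 7d) - 1) * 100."""
    return (current_avg / week_ago_avg - 1) * 100

# 10,000 blocks a week ago, 15,000 now: up 50%, above the 40% threshold
```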
-
# 8.5.10. Tempo bad overrides
{{ $labels.job }} failed to reload runtime overrides. [copy] - alert: TempoBadOverrides expr: sum by (job) (tempo_runtime_config_last_reload_successful == 0) > 0 for: 15m labels: severity: critical annotations: summary: Tempo bad overrides (instance {{ $labels.instance }}) description: "{{ $labels.job }} failed to reload runtime overrides.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.11. Tempo user configurable overrides reload failing
Greater than 5 user-configurable overrides reloads have failed in the past hour. [copy] - alert: TempoUserConfigurableOverridesReloadFailing expr: sum by (job) (increase(tempo_overrides_user_configurable_overrides_reload_failed_total[1h])) > 5 and sum by (job) (increase(tempo_overrides_user_configurable_overrides_reload_failed_total[5m])) > 0 for: 0m labels: severity: critical annotations: summary: Tempo user configurable overrides reload failing (instance {{ $labels.instance }}) description: "Greater than 5 user-configurable overrides reloads have failed in the past hour.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.12. Tempo compaction too many outstanding blocks warning
There are too many outstanding compaction blocks for {{ $labels.instance }}. Consider increasing compactor resources. [copy] # Threshold of 100 blocks per compactor instance. Adjust based on your environment. - alert: TempoCompactionTooManyOutstandingBlocksWarning expr: sum by (instance) (tempodb_compaction_outstanding_blocks) > 100 for: 6h labels: severity: warning annotations: summary: Tempo compaction too many outstanding blocks warning (instance {{ $labels.instance }}) description: "There are too many outstanding compaction blocks for {{ $labels.instance }}. Consider increasing compactor resources.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.13. Tempo compaction too many outstanding blocks critical
There are too many outstanding compaction blocks for {{ $labels.instance }}. Increase compactor resources immediately. [copy] - alert: TempoCompactionTooManyOutstandingBlocksCritical expr: sum by (instance) (tempodb_compaction_outstanding_blocks) > 250 for: 24h labels: severity: critical annotations: summary: Tempo compaction too many outstanding blocks critical (instance {{ $labels.instance }}) description: "There are too many outstanding compaction blocks for {{ $labels.instance }}. Increase compactor resources immediately.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.14. Tempo distributor usage tracker errors
Tempo distributor usage tracker errors for {{ $labels.job }} (reason {{ $labels.reason }}). [copy] - alert: TempoDistributorUsageTrackerErrors expr: sum by (job, reason) (rate(tempo_distributor_usage_tracker_errors_total[5m])) > 0 for: 30m labels: severity: critical annotations: summary: Tempo distributor usage tracker errors (instance {{ $labels.instance }}) description: "Tempo distributor usage tracker errors for {{ $labels.job }} (reason {{ $labels.reason }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.15. Tempo metrics generator processor updates failing
Tempo metrics generator processor updates are failing for {{ $labels.job }}. [copy] - alert: TempoMetricsGeneratorProcessorUpdatesFailing expr: sum by (job) (increase(tempo_metrics_generator_active_processors_update_failed_total[5m])) > 0 for: 15m labels: severity: critical annotations: summary: Tempo metrics generator processor updates failing (instance {{ $labels.instance }}) description: "Tempo metrics generator processor updates are failing for {{ $labels.job }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.16. Tempo metrics generator service graphs dropping spans
Tempo metrics generator is dropping {{ printf "%.2f" $value }}% of spans in service graphs for {{ $labels.job }}. [copy] - alert: TempoMetricsGeneratorServiceGraphsDroppingSpans expr: 100 * sum by (job) (rate(tempo_metrics_generator_processor_service_graphs_dropped_spans[5m])) / sum by (job) (rate(tempo_metrics_generator_spans_received_total[5m])) > 0.5 for: 15m labels: severity: warning annotations: summary: Tempo metrics generator service graphs dropping spans (instance {{ $labels.instance }}) description: "Tempo metrics generator is dropping {{ printf \"%.2f\" $value }}% of spans in service graphs for {{ $labels.job }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.17. Tempo metrics generator collections failing
Tempo metrics generator collections are failing for {{ $labels.job }}. [copy] - alert: TempoMetricsGeneratorCollectionsFailing expr: sum by (job) (increase(tempo_metrics_generator_registry_collections_failed_total[5m])) > 2 for: 5m labels: severity: critical annotations: summary: Tempo metrics generator collections failing (instance {{ $labels.instance }}) description: "Tempo metrics generator collections are failing for {{ $labels.job }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.18. Tempo memcached errors elevated
Tempo memcached error rate is {{ printf "%.2f" $value }}% for {{ $labels.name }} in {{ $labels.job }}. [copy] # Fires when the memcached error rate exceeds 20%. Only relevant if Tempo is configured with memcached caching. - alert: TempoMemcachedErrorsElevated expr: 100 * sum by (name, job) (rate(tempo_memcache_request_duration_seconds_count{status_code="500"}[5m])) / sum by (name, job) (rate(tempo_memcache_request_duration_seconds_count[5m])) > 20 for: 10m labels: severity: warning annotations: summary: Tempo memcached errors elevated (instance {{ $labels.instance }}) description: "Tempo memcached error rate is {{ printf \"%.2f\" $value }}% for {{ $labels.name }} in {{ $labels.job }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.6. Grafana Mimir : Embedded exporter (49 rules) [copy section]
Mimir uses the `cortex_` metric prefix for backward compatibility with Cortex. This is intentional and expected.
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/grafana-mimir/embedded-exporter.yml-
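Once downloaded, the rule file is loaded into Prometheus via `rule_files`. A minimal sketch — the on-disk path here is an assumption, adjust to wherever you saved the file:

```yaml
# prometheus.yml fragment — the rules directory path is illustrative.
rule_files:
  - /etc/prometheus/rules/grafana-mimir/embedded-exporter.yml
```

Running `promtool check rules <file>` validates the syntax before reloading Prometheus.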
# 8.6.1. Mimir ingester unhealthy
Mimir has {{ $value }} unhealthy ingester(s) in the ring. [copy] - alert: MimirIngesterUnhealthy expr: min by (job) (cortex_ring_members{state="Unhealthy", name="ingester"}) > 0 for: 15m labels: severity: critical annotations: summary: Mimir ingester unhealthy (instance {{ $labels.instance }}) description: "Mimir has {{ $value }} unhealthy ingester(s) in the ring.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.2. Mimir request errors
Mimir {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}% errors. [copy] - alert: MimirRequestErrors expr: 100 * sum by (job, route) (rate(cortex_request_duration_seconds_count{status_code=~"5..", route!~"ready|debug_pprof"}[5m])) / sum by (job, route) (rate(cortex_request_duration_seconds_count{route!~"ready|debug_pprof"}[5m])) > 1 for: 15m labels: severity: critical annotations: summary: Mimir request errors (instance {{ $labels.instance }}) description: "Mimir {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}% errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.3. Mimir inconsistent runtime config
An inconsistent runtime config file is used across Mimir instances. [copy] - alert: MimirInconsistentRuntimeConfig expr: count(count by (job, sha256) (cortex_runtime_config_hash)) without(sha256) > 1 for: 1h labels: severity: critical annotations: summary: Mimir inconsistent runtime config (instance {{ $labels.instance }}) description: "An inconsistent runtime config file is used across Mimir instances.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.4. Mimir bad runtime config
{{ $labels.job }} failed to reload runtime config. [copy] - alert: MimirBadRuntimeConfig expr: sum by (job) (cortex_runtime_config_last_reload_successful == 0) > 0 for: 5m labels: severity: critical annotations: summary: Mimir bad runtime config (instance {{ $labels.instance }}) description: "{{ $labels.job }} failed to reload runtime config.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.5. Mimir scheduler queries stuck
There are {{ $value }} queued up queries in {{ $labels.job }}. [copy] - alert: MimirSchedulerQueriesStuck expr: sum by (job) (min_over_time(cortex_query_scheduler_queue_length[1m])) > 0 for: 7m labels: severity: critical annotations: summary: Mimir scheduler queries stuck (instance {{ $labels.instance }}) description: "There are {{ $value }} queued up queries in {{ $labels.job }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.6. Mimir cache request errors
Mimir cache {{ $labels.name }} is experiencing {{ printf "%.2f" $value }}% errors for {{ $labels.operation }} operation. [copy] - alert: MimirCacheRequestErrors expr: (sum by (name, operation, job) (rate(thanos_cache_operation_failures_total[5m])) / sum by (name, operation, job) (rate(thanos_cache_operations_total[5m]))) * 100 > 5 for: 5m labels: severity: warning annotations: summary: Mimir cache request errors (instance {{ $labels.instance }}) description: "Mimir cache {{ $labels.name }} is experiencing {{ printf \"%.2f\" $value }}% errors for {{ $labels.operation }} operation.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.7. Mimir KV store failure
Mimir {{ $labels.job }} KV store {{ $labels.kv_name }} is failing with 100% error rate. [copy] - alert: MimirKvStoreFailure expr: (sum by (job, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~"2.."}[5m])) / sum by (job, kv_name) (rate(cortex_kv_request_duration_seconds_count[5m]))) == 1 for: 5m labels: severity: critical annotations: summary: Mimir KV store failure (instance {{ $labels.instance }}) description: "Mimir {{ $labels.job }} KV store {{ $labels.kv_name }} is failing with 100% error rate.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.8. Mimir memory map areas too high
Mimir {{ $labels.job }} is using {{ printf "%.0f" $value }}% of its memory map area limit. [copy] - alert: MimirMemoryMapAreasTooHigh expr: process_memory_map_areas{job=~".*(ingester|cortex|mimir|store-gateway).*"} / process_memory_map_areas_limit{job=~".*(ingester|cortex|mimir|store-gateway).*"} * 100 > 80 for: 5m labels: severity: critical annotations: summary: Mimir memory map areas too high (instance {{ $labels.instance }}) description: "Mimir {{ $labels.job }} is using {{ printf \"%.0f\" $value }}% of its memory map area limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.9. Mimir ingester instance has no tenants
Mimir ingester {{ $labels.instance }} has no tenants assigned. [copy] - alert: MimirIngesterInstanceHasNoTenants expr: (cortex_ingester_memory_users == 0) and on (instance) (cortex_ingester_memory_users offset 1h > 0) for: 1h labels: severity: warning annotations: summary: Mimir ingester instance has no tenants (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} has no tenants assigned.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.10. Mimir ruler instance has no rule groups
Mimir ruler {{ $labels.instance }} has no rule groups assigned. [copy] - alert: MimirRulerInstanceHasNoRuleGroups expr: (cortex_ruler_managers_total == 0) and on (instance) (cortex_ruler_managers_total offset 1h > 0) for: 1h labels: severity: warning annotations: summary: Mimir ruler instance has no rule groups (instance {{ $labels.instance }}) description: "Mimir ruler {{ $labels.instance }} has no rule groups assigned.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.11. Mimir ingested data too far in the future
Mimir ingester {{ $labels.job }} has ingested samples with timestamps more than 1 hour in the future. [copy] - alert: MimirIngestedDataTooFarInTheFuture expr: max by (job) (cortex_ingester_tsdb_head_max_timestamp_seconds - time() and cortex_ingester_tsdb_head_max_timestamp_seconds > 0) > 3600 for: 5m labels: severity: warning annotations: summary: Mimir ingested data too far in the future (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.job }} has ingested samples with timestamps more than 1 hour in the future.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.12. Mimir store gateway too many failed operations
Mimir store-gateway {{ $labels.job }} bucket operations are failing. [copy] - alert: MimirStoreGatewayTooManyFailedOperations expr: sum by (job) (rate(thanos_objstore_bucket_operation_failures_total[5m])) > 0 for: 5m labels: severity: warning annotations: summary: Mimir store gateway too many failed operations (instance {{ $labels.instance }}) description: "Mimir store-gateway {{ $labels.job }} bucket operations are failing.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.13. Mimir ring members mismatch
Mimir {{ $labels.name }} ring has inconsistent member counts across instances. [copy] - alert: MimirRingMembersMismatch expr: max by (name, job) (sum by (name, job, instance) (cortex_ring_members)) != min by (name, job) (sum by (name, job, instance) (cortex_ring_members)) for: 15m labels: severity: warning annotations: summary: Mimir ring members mismatch (instance {{ $labels.instance }}) description: "Mimir {{ $labels.name }} ring has inconsistent member counts across instances.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.14. Mimir ingester reaching series limit warning
Mimir ingester {{ $labels.instance }} has reached {{ printf "%.0f" $value }}% of its series limit. [copy] - alert: MimirIngesterReachingSeriesLimitWarning expr: (cortex_ingester_memory_series / ignoring(limit) cortex_ingester_instance_limits{limit="max_series"} * 100 > 80) and cortex_ingester_instance_limits{limit="max_series"} > 0 for: 3h labels: severity: warning annotations: summary: Mimir ingester reaching series limit warning (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} has reached {{ printf \"%.0f\" $value }}% of its series limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.15. Mimir ingester reaching series limit critical
Mimir ingester {{ $labels.instance }} has reached {{ printf "%.0f" $value }}% of its series limit. [copy] - alert: MimirIngesterReachingSeriesLimitCritical expr: (cortex_ingester_memory_series / ignoring(limit) cortex_ingester_instance_limits{limit="max_series"} * 100 > 90) and cortex_ingester_instance_limits{limit="max_series"} > 0 for: 5m labels: severity: critical annotations: summary: Mimir ingester reaching series limit critical (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} has reached {{ printf \"%.0f\" $value }}% of its series limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.16. Mimir ingester reaching tenants limit warning
Mimir ingester {{ $labels.instance }} has reached {{ printf "%.0f" $value }}% of its tenants limit. [copy] - alert: MimirIngesterReachingTenantsLimitWarning expr: (cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit="max_tenants"} * 100 > 70) and cortex_ingester_instance_limits{limit="max_tenants"} > 0 for: 5m labels: severity: warning annotations: summary: Mimir ingester reaching tenants limit warning (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} has reached {{ printf \"%.0f\" $value }}% of its tenants limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.17. Mimir ingester reaching tenants limit critical
Mimir ingester {{ $labels.instance }} has reached {{ printf "%.0f" $value }}% of its tenants limit. [copy] - alert: MimirIngesterReachingTenantsLimitCritical expr: (cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit="max_tenants"} * 100 > 80) and cortex_ingester_instance_limits{limit="max_tenants"} > 0 for: 5m labels: severity: critical annotations: summary: Mimir ingester reaching tenants limit critical (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} has reached {{ printf \"%.0f\" $value }}% of its tenants limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
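The four limit rules above join the in-memory gauge to its per-instance limit with `ignoring(limit)`, because `cortex_ingester_instance_limits` carries a `limit` label that the gauge lacks. A hypothetical recording rule (the `record` name is illustrative) keeps the utilization percentage as its own series for dashboards, reusing the exact join from the alerts:

```yaml
# Hypothetical recording rule; same vector-matching join as the alerts above.
- record: ingester:tenant_limit_utilization:percent
  expr: cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit="max_tenants"} * 100
```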
-
# 8.6.18. Mimir reaching TCP connections limit
Mimir instance {{ $labels.instance }} is using {{ printf "%.0f" $value }}% of its TCP connections limit. [copy] - alert: MimirReachingTcpConnectionsLimit expr: cortex_tcp_connections / cortex_tcp_connections_limit * 100 > 80 and cortex_tcp_connections_limit > 0 for: 5m labels: severity: critical annotations: summary: Mimir reaching TCP connections limit (instance {{ $labels.instance }}) description: "Mimir instance {{ $labels.instance }} is using {{ printf \"%.0f\" $value }}% of its TCP connections limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.19. Mimir distributor inflight requests high
Mimir distributor {{ $labels.instance }} is using {{ printf "%.0f" $value }}% of its inflight push requests limit. [copy] - alert: MimirDistributorInflightRequestsHigh expr: (cortex_distributor_inflight_push_requests / ignoring(limit) cortex_distributor_instance_limits{limit="max_inflight_push_requests"} * 100 > 80) and cortex_distributor_instance_limits{limit="max_inflight_push_requests"} > 0 for: 5m labels: severity: critical annotations: summary: Mimir distributor inflight requests high (instance {{ $labels.instance }}) description: "Mimir distributor {{ $labels.instance }} is using {{ printf \"%.0f\" $value }}% of its inflight push requests limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.20. Mimir ingester TSDB head compaction failed
Mimir ingester {{ $labels.instance }} is failing to compact TSDB head. [copy] - alert: MimirIngesterTsdbHeadCompactionFailed expr: rate(cortex_ingester_tsdb_compactions_failed_total[5m]) > 0 for: 15m labels: severity: critical annotations: summary: Mimir ingester TSDB head compaction failed (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} is failing to compact TSDB head.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.21. Mimir ingester TSDB head truncation failed
Mimir ingester {{ $labels.instance }} is failing to truncate TSDB head. [copy] - alert: MimirIngesterTsdbHeadTruncationFailed expr: rate(cortex_ingester_tsdb_head_truncations_failed_total[5m]) > 0 for: 0m labels: severity: critical annotations: summary: Mimir ingester TSDB head truncation failed (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} is failing to truncate TSDB head.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.22. Mimir ingester TSDB checkpoint creation failed
Mimir ingester {{ $labels.instance }} is failing to create TSDB checkpoints. [copy] - alert: MimirIngesterTsdbCheckpointCreationFailed expr: rate(cortex_ingester_tsdb_checkpoint_creations_failed_total[5m]) > 0 for: 0m labels: severity: critical annotations: summary: Mimir ingester TSDB checkpoint creation failed (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} is failing to create TSDB checkpoints.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.23. Mimir ingester TSDB checkpoint deletion failed
Mimir ingester {{ $labels.instance }} is failing to delete TSDB checkpoints. [copy] - alert: MimirIngesterTsdbCheckpointDeletionFailed expr: rate(cortex_ingester_tsdb_checkpoint_deletions_failed_total[5m]) > 0 for: 0m labels: severity: critical annotations: summary: Mimir ingester TSDB checkpoint deletion failed (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} is failing to delete TSDB checkpoints.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.24. Mimir ingester TSDB WAL truncation failed
Mimir ingester {{ $labels.instance }} is failing to truncate TSDB WAL. [copy] - alert: MimirIngesterTsdbWalTruncationFailed expr: rate(cortex_ingester_tsdb_wal_truncations_failed_total[5m]) > 0 for: 0m labels: severity: warning annotations: summary: Mimir ingester TSDB WAL truncation failed (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} is failing to truncate TSDB WAL.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.25. Mimir ingester TSDB WAL writes failed
Mimir ingester {{ $labels.instance }} is failing to write to TSDB WAL. [copy] - alert: MimirIngesterTsdbWalWritesFailed expr: rate(cortex_ingester_tsdb_wal_writes_failed_total[1m]) > 0 for: 3m labels: severity: critical annotations: summary: Mimir ingester TSDB WAL writes failed (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} is failing to write to TSDB WAL.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.26. Mimir store gateway has not synced bucket
Mimir store-gateway {{ $labels.instance }} has not synced the bucket for more than 10 minutes. [copy] - alert: MimirStoreGatewayHasNotSyncedBucket expr: (time() - cortex_bucket_stores_blocks_last_successful_sync_timestamp_seconds{component="store-gateway"} > 600) and cortex_bucket_stores_blocks_last_successful_sync_timestamp_seconds{component="store-gateway"} > 0 for: 5m labels: severity: critical annotations: summary: Mimir store gateway has not synced bucket (instance {{ $labels.instance }}) description: "Mimir store-gateway {{ $labels.instance }} has not synced the bucket for more than 10 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.27. Mimir store gateway no synced tenants
Mimir store-gateway {{ $labels.instance }} has no synced tenants. [copy] - alert: MimirStoreGatewayNoSyncedTenants expr: (min by (instance, job) (cortex_bucket_stores_tenants_synced{component="store-gateway"}) == 0) and on (instance) (cortex_bucket_stores_tenants_synced{component="store-gateway"} offset 1h > 0) for: 1h labels: severity: warning annotations: summary: Mimir store gateway no synced tenants (instance {{ $labels.instance }}) description: "Mimir store-gateway {{ $labels.instance }} has no synced tenants.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.28. Mimir bucket index not updated
Mimir bucket index for tenant {{ $labels.user }} has not been updated for more than 35 minutes. [copy] - alert: MimirBucketIndexNotUpdated expr: min by (user, job) (time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > 2100 for: 0m labels: severity: critical annotations: summary: Mimir bucket index not updated (instance {{ $labels.instance }}) description: "Mimir bucket index for tenant {{ $labels.user }} has not been updated for more than 35 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.29. Mimir compactor not cleaning up blocks
Mimir compactor {{ $labels.instance }} has not cleaned up blocks in the last 6 hours. [copy] - alert: MimirCompactorNotCleaningUpBlocks expr: (time() - cortex_compactor_block_cleanup_last_successful_run_timestamp_seconds > 21600) and cortex_compactor_block_cleanup_last_successful_run_timestamp_seconds > 0 for: 1h labels: severity: critical annotations: summary: Mimir compactor not cleaning up blocks (instance {{ $labels.instance }}) description: "Mimir compactor {{ $labels.instance }} has not cleaned up blocks in the last 6 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.30. Mimir compactor not running compaction
Mimir compactor {{ $labels.instance }} has not run compaction in the last 24 hours. [copy] - alert: MimirCompactorNotRunningCompaction expr: (time() - cortex_compactor_last_successful_run_timestamp_seconds > 86400) and cortex_compactor_last_successful_run_timestamp_seconds > 0 for: 15m labels: severity: critical annotations: summary: Mimir compactor not running compaction (instance {{ $labels.instance }}) description: "Mimir compactor {{ $labels.instance }} has not run compaction in the last 24 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.31. Mimir compactor has consecutive failures
Mimir compactor {{ $labels.instance }} has had 2+ compaction failures in the last 2 hours. [copy] - alert: MimirCompactorHasConsecutiveFailures expr: increase(cortex_compactor_runs_failed_total[2h]) > 1 for: 0m labels: severity: critical annotations: summary: Mimir compactor has consecutive failures (instance {{ $labels.instance }}) description: "Mimir compactor {{ $labels.instance }} has had 2+ compaction failures in the last 2 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.32. Mimir compactor has run out of disk space
Mimir compactor {{ $labels.instance }} has run out of disk space. [copy] - alert: MimirCompactorHasRunOutOfDiskSpace expr: increase(cortex_compactor_disk_out_of_space_errors_total[24h]) >= 1 for: 0m labels: severity: critical annotations: summary: Mimir compactor has run out of disk space (instance {{ $labels.instance }}) description: "Mimir compactor {{ $labels.instance }} has run out of disk space.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.33. Mimir compactor has not uploaded blocks
Mimir compactor {{ $labels.instance }} has not uploaded any block in the last 24 hours. [copy] - alert: MimirCompactorHasNotUploadedBlocks expr: (time() - thanos_objstore_bucket_last_successful_upload_time{component="compactor"} > 86400) and thanos_objstore_bucket_last_successful_upload_time{component="compactor"} > 0 for: 15m labels: severity: critical annotations: summary: Mimir compactor has not uploaded blocks (instance {{ $labels.instance }}) description: "Mimir compactor {{ $labels.instance }} has not uploaded any block in the last 24 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.34. Mimir compactor skipped blocks
Mimir compactor has found blocks that cannot be compacted (reason {{ $labels.reason }}). [copy] - alert: MimirCompactorSkippedBlocks expr: increase(cortex_compactor_blocks_marked_for_no_compaction_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Mimir compactor skipped blocks (instance {{ $labels.instance }}) description: "Mimir compactor has found blocks that cannot be compacted (reason {{ $labels.reason }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.35. Mimir ruler too many failed pushes
Mimir ruler {{ $labels.instance }} is failing to push {{ printf "%.2f" $value }}% of write requests. [copy] - alert: MimirRulerTooManyFailedPushes expr: 100 * sum by (instance, job) (rate(cortex_ruler_write_requests_failed_total[5m])) / sum by (instance, job) (rate(cortex_ruler_write_requests_total[5m])) > 1 for: 5m labels: severity: critical annotations: summary: Mimir ruler too many failed pushes (instance {{ $labels.instance }}) description: "Mimir ruler {{ $labels.instance }} is failing to push {{ printf \"%.2f\" $value }}% of write requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.36. Mimir ruler too many failed queries
Mimir ruler {{ $labels.instance }} is failing {{ printf "%.2f" $value }}% of query evaluations. [copy] - alert: MimirRulerTooManyFailedQueries expr: 100 * sum by (instance, job) (rate(cortex_ruler_queries_failed_total[5m])) / sum by (instance, job) (rate(cortex_ruler_queries_total[5m])) > 1 for: 5m labels: severity: critical annotations: summary: Mimir ruler too many failed queries (instance {{ $labels.instance }}) description: "Mimir ruler {{ $labels.instance }} is failing {{ printf \"%.2f\" $value }}% of query evaluations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.37. Mimir ruler missed evaluations
Mimir ruler {{ $labels.instance }} is missing {{ printf "%.2f" $value }}% of rule group evaluations. [copy] - alert: MimirRulerMissedEvaluations expr: 100 * sum by (instance, job) (rate(cortex_prometheus_rule_group_iterations_missed_total[5m])) / sum by (instance, job) (rate(cortex_prometheus_rule_group_iterations_total[5m])) > 1 for: 5m labels: severity: warning annotations: summary: Mimir ruler missed evaluations (instance {{ $labels.instance }}) description: "Mimir ruler {{ $labels.instance }} is missing {{ printf \"%.2f\" $value }}% of rule group evaluations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.38. Mimir ruler failed ring check
Mimir ruler {{ $labels.job }} is failing ring checks. [copy] - alert: MimirRulerFailedRingCheck expr: sum by (job) (rate(cortex_ruler_ring_check_errors_total[5m])) > 0 for: 5m labels: severity: critical annotations: summary: Mimir ruler failed ring check (instance {{ $labels.instance }}) description: "Mimir ruler {{ $labels.job }} is failing ring checks.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.39. Mimir alertmanager sync configs failing
Mimir alertmanager {{ $labels.job }} is failing to sync configs. [copy] - alert: MimirAlertmanagerSyncConfigsFailing expr: rate(cortex_alertmanager_sync_configs_failed_total[5m]) > 0 for: 30m labels: severity: critical annotations: summary: Mimir alertmanager sync configs failing (instance {{ $labels.instance }}) description: "Mimir alertmanager {{ $labels.job }} is failing to sync configs.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.40. Mimir alertmanager ring check failing
Mimir alertmanager {{ $labels.job }} is failing ring checks. [copy] - alert: MimirAlertmanagerRingCheckFailing expr: rate(cortex_alertmanager_ring_check_errors_total[5m]) > 0 for: 10m labels: severity: critical annotations: summary: Mimir alertmanager ring check failing (instance {{ $labels.instance }}) description: "Mimir alertmanager {{ $labels.job }} is failing ring checks.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.41. Mimir alertmanager state merge failing
Mimir alertmanager {{ $labels.job }} is failing to merge state updates. [copy] - alert: MimirAlertmanagerStateMergeFailing expr: rate(cortex_alertmanager_partial_state_merges_failed_total[5m]) > 0 for: 10m labels: severity: critical annotations: summary: Mimir alertmanager state merge failing (instance {{ $labels.instance }}) description: "Mimir alertmanager {{ $labels.job }} is failing to merge state updates.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.42. Mimir alertmanager replication failing
Mimir alertmanager {{ $labels.job }} is failing to replicate state. [copy] - alert: MimirAlertmanagerReplicationFailing expr: rate(cortex_alertmanager_state_replication_failed_total[5m]) > 0 for: 10m labels: severity: critical annotations: summary: Mimir alertmanager replication failing (instance {{ $labels.instance }}) description: "Mimir alertmanager {{ $labels.job }} is failing to replicate state.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.43. Mimir alertmanager persist state failing
Mimir alertmanager {{ $labels.job }} is failing to persist state. [copy] - alert: MimirAlertmanagerPersistStateFailing expr: rate(cortex_alertmanager_state_persist_failed_total[15m]) > 0 for: 1h labels: severity: critical annotations: summary: Mimir alertmanager persist state failing (instance {{ $labels.instance }}) description: "Mimir alertmanager {{ $labels.job }} is failing to persist state.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.44. Mimir alertmanager initial sync failed
Mimir alertmanager {{ $labels.job }} failed initial state sync. [copy] - alert: MimirAlertmanagerInitialSyncFailed expr: increase(cortex_alertmanager_state_initial_sync_completed_total{outcome="failed"}[1m]) > 0 for: 0m labels: severity: warning annotations: summary: Mimir alertmanager initial sync failed (instance {{ $labels.instance }}) description: "Mimir alertmanager {{ $labels.job }} failed initial state sync.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.45. Mimir alertmanager instance has no tenants
Mimir alertmanager {{ $labels.instance }} has no tenants assigned. [copy] - alert: MimirAlertmanagerInstanceHasNoTenants expr: (cortex_alertmanager_tenants_owned == 0) and on (instance) (cortex_alertmanager_tenants_owned offset 1h > 0) for: 1h labels: severity: warning annotations: summary: Mimir alertmanager instance has no tenants (instance {{ $labels.instance }}) description: "Mimir alertmanager {{ $labels.instance }} has no tenants assigned.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.46. Mimir gossip members count too high
Mimir gossip cluster has more members than expected. [copy] - alert: MimirGossipMembersCountTooHigh expr: avg(memberlist_client_cluster_members_count{job=~".*(mimir|cortex).*"}) by (job) * 1.15 + 10 < max(memberlist_client_cluster_members_count{job=~".*(mimir|cortex).*"}) by (job) for: 20m labels: severity: warning annotations: summary: Mimir gossip members count too high (instance {{ $labels.instance }}) description: "Mimir gossip cluster has more members than expected.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.47. Mimir gossip members count too low
Mimir gossip cluster has fewer members than expected. [copy] - alert: MimirGossipMembersCountTooLow expr: avg(memberlist_client_cluster_members_count{job=~".*(mimir|cortex).*"}) by (job) * 0.5 > min(memberlist_client_cluster_members_count{job=~".*(mimir|cortex).*"}) by (job) for: 20m labels: severity: warning annotations: summary: Mimir gossip members count too low (instance {{ $labels.instance }}) description: "Mimir gossip cluster has fewer members than expected.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.48. Mimir go threads too high warning
Mimir {{ $labels.instance }} has {{ $value }} Go threads. [copy] # A high number of Go threads may indicate a goroutine leak. - alert: MimirGoThreadsTooHighWarning expr: go_threads{job=~".*(mimir|cortex).*"} > 5000 for: 15m labels: severity: warning annotations: summary: Mimir go threads too high warning (instance {{ $labels.instance }}) description: "Mimir {{ $labels.instance }} has {{ $value }} Go threads.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.49. Mimir go threads too high critical
Mimir {{ $labels.instance }} has {{ $value }} Go threads. [copy] - alert: MimirGoThreadsTooHighCritical expr: go_threads{job=~".*(mimir|cortex).*"} > 8000 for: 15m labels: severity: critical annotations: summary: Mimir go threads too high critical (instance {{ $labels.instance }}) description: "Mimir {{ $labels.instance }} has {{ $value }} Go threads.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.7. Grafana Alloy (1 rule) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/grafana-alloy/embedded-exporter.yml-
# 8.7.1. Grafana Alloy service down
Alloy on instance {{ $labels.instance }} is not responding or has stopped running. [copy] - alert: GrafanaAlloyServiceDown expr: count by (instance) (alloy_build_info offset 2m) unless count by (instance) (alloy_build_info) for: 0m labels: severity: critical annotations: summary: Grafana Alloy service down (instance {{ $labels.instance }}) description: "Alloy on instance {{ $labels.instance }} is not responding or has stopped running.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
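This rule relies on the `offset`/`unless` idiom for detecting a vanished series: an instance whose `alloy_build_info` existed a couple of minutes ago but is absent now has stopped reporting. A generic sketch of the pattern (the metric name `my_service_build_info` is a placeholder):

```yaml
# Fires for each instance whose info metric was present 2m ago but is gone now.
- alert: ServiceDisappeared
  expr: count by (instance) (my_service_build_info offset 2m) unless count by (instance) (my_service_build_info)
  for: 0m
  labels:
    severity: critical
```

Unlike `up == 0`, this works for metrics pushed via remote write, where a dead process simply stops producing samples instead of being marked down by a scrape.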
-
-
# 8.8. OpenTelemetry Collector : Embedded exporter (12 rules) [copy section]
OpenTelemetry Collector self-monitoring metrics are exposed on port 8888 by default at the /metrics endpoint.
These alerts monitor the collector's health when metrics are ingested via the Prometheus OTLP endpoint or scraped directly.
All collector internal metrics are prefixed with 'otelcol_'.
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/opentelemetry-collector/embedded-exporter.yml-
# 8.8.1. OpenTelemetry Collector down
OpenTelemetry Collector instance has disappeared or is not being scraped [copy] - alert: OpentelemetryCollectorDown expr: up{job=~".*otel.*collector.*"} == 0 for: 1m labels: severity: critical annotations: summary: OpenTelemetry Collector down (instance {{ $labels.instance }}) description: "OpenTelemetry Collector instance has disappeared or is not being scraped\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
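The `up` selector assumes a scrape job whose name matches `.*otel.*collector.*`. A minimal `scrape_configs` sketch satisfying it, using the default self-monitoring port mentioned in the section intro (the target address is a placeholder):

```yaml
scrape_configs:
  - job_name: otel-collector          # matches the job regex used above
    static_configs:
      - targets: ['otel-collector.example.internal:8888']  # default /metrics port
```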
-
# 8.8.2. OpenTelemetry Collector receiver refused spans
OpenTelemetry Collector is refusing spans on {{ $labels.receiver }} [copy] - alert: OpentelemetryCollectorReceiverRefusedSpans expr: rate(otelcol_receiver_refused_spans[5m]) > 0 for: 5m labels: severity: critical annotations: summary: OpenTelemetry Collector receiver refused spans (instance {{ $labels.instance }}) description: "OpenTelemetry Collector is refusing spans on {{ $labels.receiver }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.8.3. OpenTelemetry Collector receiver refused metric points
OpenTelemetry Collector is refusing metric points on {{ $labels.receiver }} [copy] - alert: OpentelemetryCollectorReceiverRefusedMetricPoints expr: rate(otelcol_receiver_refused_metric_points[5m]) > 0 for: 5m labels: severity: critical annotations: summary: OpenTelemetry Collector receiver refused metric points (instance {{ $labels.instance }}) description: "OpenTelemetry Collector is refusing metric points on {{ $labels.receiver }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.8.4. OpenTelemetry Collector receiver refused log records
OpenTelemetry Collector is refusing log records on {{ $labels.receiver }} [copy] - alert: OpentelemetryCollectorReceiverRefusedLogRecords expr: rate(otelcol_receiver_refused_log_records[5m]) > 0 for: 5m labels: severity: critical annotations: summary: OpenTelemetry Collector receiver refused log records (instance {{ $labels.instance }}) description: "OpenTelemetry Collector is refusing log records on {{ $labels.receiver }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.8.5. OpenTelemetry Collector exporter failed spans
OpenTelemetry Collector failing to send spans via {{ $labels.exporter }} [copy] - alert: OpentelemetryCollectorExporterFailedSpans expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0 for: 5m labels: severity: warning annotations: summary: OpenTelemetry Collector exporter failed spans (instance {{ $labels.instance }}) description: "OpenTelemetry Collector failing to send spans via {{ $labels.exporter }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.8.6. OpenTelemetry Collector exporter failed metric points
OpenTelemetry Collector failing to send metric points via {{ $labels.exporter }} [copy] - alert: OpentelemetryCollectorExporterFailedMetricPoints expr: rate(otelcol_exporter_send_failed_metric_points[5m]) > 0 for: 5m labels: severity: warning annotations: summary: OpenTelemetry Collector exporter failed metric points (instance {{ $labels.instance }}) description: "OpenTelemetry Collector failing to send metric points via {{ $labels.exporter }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.8.7. OpenTelemetry Collector exporter failed log records
OpenTelemetry Collector failing to send log records via {{ $labels.exporter }} [copy] - alert: OpentelemetryCollectorExporterFailedLogRecords expr: rate(otelcol_exporter_send_failed_log_records[5m]) > 0 for: 5m labels: severity: warning annotations: summary: OpenTelemetry Collector exporter failed log records (instance {{ $labels.instance }}) description: "OpenTelemetry Collector failing to send log records via {{ $labels.exporter }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.8.8. OpenTelemetry Collector exporter queue nearly full
OpenTelemetry Collector exporter {{ $labels.exporter }} queue is over 80% full [copy] - alert: OpentelemetryCollectorExporterQueueNearlyFull expr: (otelcol_exporter_queue_size / on(instance, job, exporter) otelcol_exporter_queue_capacity) > 0.8 and otelcol_exporter_queue_capacity > 0 for: 0m labels: severity: warning annotations: summary: OpenTelemetry Collector exporter queue nearly full (instance {{ $labels.instance }}) description: "OpenTelemetry Collector exporter {{ $labels.exporter }} queue is over 80% full\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
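If this alert fires persistently, the queue can be enlarged (or drained faster) via the exporter's standard `sending_queue` settings. A sketch for an OTLP exporter; the endpoint and values are illustrative, not recommendations:

```yaml
exporters:
  otlp:
    endpoint: backend.example.internal:4317  # placeholder endpoint
    sending_queue:
      enabled: true
      num_consumers: 10  # parallel senders draining the queue
      queue_size: 5000   # reported as otelcol_exporter_queue_capacity
```

Note that a growing queue usually means the backend is slow or unreachable; enlarging it only buys time.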
-
# 8.8.9. OpenTelemetry Collector processor refused spans
OpenTelemetry Collector processor {{ $labels.processor }} is refusing spans, likely due to backpressure [copy] - alert: OpentelemetryCollectorProcessorRefusedSpans expr: rate(otelcol_processor_refused_spans[5m]) > 0 for: 5m labels: severity: warning annotations: summary: OpenTelemetry Collector processor refused spans (instance {{ $labels.instance }}) description: "OpenTelemetry Collector processor {{ $labels.processor }} is refusing spans, likely due to backpressure\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.8.10. OpenTelemetry Collector processor refused metric points
OpenTelemetry Collector processor {{ $labels.processor }} is refusing metric points, likely due to backpressure [copy] - alert: OpentelemetryCollectorProcessorRefusedMetricPoints expr: rate(otelcol_processor_refused_metric_points[5m]) > 0 for: 5m labels: severity: warning annotations: summary: OpenTelemetry Collector processor refused metric points (instance {{ $labels.instance }}) description: "OpenTelemetry Collector processor {{ $labels.processor }} is refusing metric points, likely due to backpressure\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.8.11. OpenTelemetry Collector high memory usage
OpenTelemetry Collector memory usage is above 90% [copy] - alert: OpentelemetryCollectorHighMemoryUsage expr: (otelcol_process_runtime_heap_alloc_bytes{job=~".*otel.*collector.*"} / on(instance, job) otelcol_process_runtime_total_sys_memory_bytes{job=~".*otel.*collector.*"}) > 0.9 for: 5m labels: severity: warning annotations: summary: OpenTelemetry Collector high memory usage (instance {{ $labels.instance }}) description: "OpenTelemetry Collector memory usage is above 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.8.12. OpenTelemetry Collector OTLP receiver errors
OpenTelemetry Collector OTLP receiver is completely failing - all spans are being refused [copy] - alert: OpentelemetryCollectorOtlpReceiverErrors expr: rate(otelcol_receiver_accepted_spans{receiver=~"otlp"}[5m]) == 0 and rate(otelcol_receiver_refused_spans{receiver=~"otlp"}[5m]) > 0 for: 2m labels: severity: critical annotations: summary: OpenTelemetry Collector OTLP receiver errors (instance {{ $labels.instance }}) description: "OpenTelemetry Collector OTLP receiver is completely failing - all spans are being refused\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.9. Jenkins : Metric plugin (8 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/jenkins/metric-plugin.yml-
# 8.9.1. Jenkins node offline
At least one Jenkins node offline: `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}}) [copy] - alert: JenkinsNodeOffline expr: jenkins_node_offline_value > 0 for: 5m labels: severity: critical annotations: summary: Jenkins node offline (instance {{ $labels.instance }}) description: "At least one Jenkins node offline: `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.9.2. Jenkins no node online
No Jenkins nodes are online: `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}}) [copy] - alert: JenkinsNoNodeOnline expr: jenkins_node_online_value == 0 for: 0m labels: severity: critical annotations: summary: Jenkins no node online (instance {{ $labels.instance }}) description: "No Jenkins nodes are online: `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.9.3. Jenkins healthcheck
Jenkins healthcheck score: {{$value}}. Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}}) [copy] - alert: JenkinsHealthcheck expr: jenkins_health_check_score < 1 for: 0m labels: severity: critical annotations: summary: Jenkins healthcheck (instance {{ $labels.instance }}) description: "Jenkins healthcheck score: {{$value}}. Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.9.4. Jenkins outdated plugins
{{ $value }} plugins need an update [copy] - alert: JenkinsOutdatedPlugins expr: sum(jenkins_plugins_withUpdate) by (instance) > 3 for: 1d labels: severity: warning annotations: summary: Jenkins outdated plugins (instance {{ $labels.instance }}) description: "{{ $value }} plugins need an update\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.9.5. Jenkins builds health score
Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}}) [copy] - alert: JenkinsBuildsHealthScore expr: default_jenkins_builds_health_score < 1 for: 0m labels: severity: critical annotations: summary: Jenkins builds health score (instance {{ $labels.instance }}) description: "Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.9.6. Jenkins run failure total
Job run failures: ({{$value}}) {{$labels.jenkins_job}}. Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}}) [copy] - alert: JenkinsRunFailureTotal expr: delta(jenkins_runs_failure_total[1h]) > 100 for: 0m labels: severity: warning annotations: summary: Jenkins run failure total (instance {{ $labels.instance }}) description: "Job run failures: ({{$value}}) {{$labels.jenkins_job}}. Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.9.7. Jenkins build tests failing
Last build tests failed: {{$labels.jenkins_job}}. Failed build Tests for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}}) [copy] - alert: JenkinsBuildTestsFailing expr: default_jenkins_builds_last_build_tests_failing > 0 for: 0m labels: severity: warning annotations: summary: Jenkins build tests failing (instance {{ $labels.instance }}) description: "Last build tests failed: {{$labels.jenkins_job}}. Failed build Tests for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.9.8. Jenkins last build failed
Last build failed: {{$labels.jenkins_job}}. Failed build for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}}) [copy] # * RUNNING -1 true - The build had no errors. # * SUCCESS 0 true - The build had no errors. # * UNSTABLE 1 true - The build had some errors but they were not fatal. For example, some tests failed. # * FAILURE 2 false - The build had a fatal error. # * NOT_BUILT 3 false - The module was not built. # * ABORTED 4 false - The build was manually aborted. - alert: JenkinsLastBuildFailed expr: default_jenkins_builds_last_build_result_ordinal == 2 for: 0m labels: severity: warning annotations: summary: Jenkins last build failed (instance {{ $labels.instance }}) description: "Last build failed: {{$labels.jenkins_job}}. Failed build for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
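The build-result table in the rule's comments maps Jenkins results to ordinals; the same mapping as a small reference sketch, useful when writing variants of this expression:

```python
# Jenkins build-result ordinals, per the table in the rule comments above.
JENKINS_RESULT_ORDINAL = {
    "RUNNING": -1,   # build in progress
    "SUCCESS": 0,    # no errors
    "UNSTABLE": 1,   # non-fatal errors, e.g. failing tests
    "FAILURE": 2,    # fatal error -- the value this alert matches
    "NOT_BUILT": 3,
    "ABORTED": 4,
}
```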
-
-
# 8.10. APC UPS : mdlayher/apcupsd_exporter (6 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/apc-ups/apcupsd_exporter.yml-
# 8.10.1. APC UPS Battery nearly empty
Battery is almost empty (< 10% left) [copy] - alert: ApcUpsBatteryNearlyEmpty expr: apcupsd_battery_charge_percent < 10 for: 0m labels: severity: critical annotations: summary: APC UPS Battery nearly empty (instance {{ $labels.instance }}) description: "Battery is almost empty (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.10.2. APC UPS Less than 15 Minutes of battery time remaining
Battery is almost empty (< 15 Minutes remaining) [copy] - alert: ApcUpsLessThan15MinutesOfBatteryTimeRemaining expr: apcupsd_battery_time_left_seconds < 900 for: 0m labels: severity: critical annotations: summary: APC UPS Less than 15 Minutes of battery time remaining (instance {{ $labels.instance }}) description: "Battery is almost empty (< 15 Minutes remaining)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.10.3. APC UPS AC input outage
UPS now running on battery (for {{$value | humanizeDuration}}) [copy] - alert: ApcUpsAcInputOutage expr: apcupsd_battery_time_on_seconds > 0 for: 0m labels: severity: warning annotations: summary: APC UPS AC input outage (instance {{ $labels.instance }}) description: "UPS now running on battery (for {{$value | humanizeDuration}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.10.4. APC UPS low battery voltage
Battery voltage is lower than nominal (< 95%) [copy] - alert: ApcUpsLowBatteryVoltage expr: (apcupsd_battery_volts / apcupsd_battery_nominal_volts) < 0.95 for: 0m labels: severity: warning annotations: summary: APC UPS low battery voltage (instance {{ $labels.instance }}) description: "Battery voltage is lower than nominal (< 95%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.10.5. APC UPS high temperature
Internal temperature is high ({{$value}}°C) [copy] - alert: ApcUpsHighTemperature expr: apcupsd_internal_temperature_celsius >= 40 for: 2m labels: severity: warning annotations: summary: APC UPS high temperature (instance {{ $labels.instance }}) description: "Internal temperature is high ({{$value}}°C)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.10.6. APC UPS high load
UPS load is > 80% [copy] - alert: ApcUpsHighLoad expr: apcupsd_ups_load_percent > 80 for: 0m labels: severity: warning annotations: summary: APC UPS high load (instance {{ $labels.instance }}) description: "UPS load is > 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.11. Graph Node : Embedded exporter (6 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/graph-node/embedded-exporter.yml-
# 8.11.1. Provider failed because net_version failed
Failed net_version for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}` [copy] - alert: ProviderFailedBecauseNet_versionFailed expr: eth_rpc_status == 1 for: 0m labels: severity: critical annotations: summary: Provider failed because net_version failed (instance {{ $labels.instance }}) description: "Failed net_version for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.11.2. Provider failed because get genesis failed
Failed to get genesis for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}` [copy] - alert: ProviderFailedBecauseGetGenesisFailed expr: eth_rpc_status == 2 for: 0m labels: severity: critical annotations: summary: Provider failed because get genesis failed (instance {{ $labels.instance }}) description: "Failed to get genesis for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.11.3. Provider failed because net_version timeout
net_version timeout for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}` [copy] - alert: ProviderFailedBecauseNet_versionTimeout expr: eth_rpc_status == 3 for: 0m labels: severity: critical annotations: summary: Provider failed because net_version timeout (instance {{ $labels.instance }}) description: "net_version timeout for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.11.4. Provider failed because get genesis timeout
Timeout to get genesis for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}` [copy] - alert: ProviderFailedBecauseGetGenesisTimeout expr: eth_rpc_status == 4 for: 0m labels: severity: critical annotations: summary: Provider failed because get genesis timeout (instance {{ $labels.instance }}) description: "Timeout to get genesis for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.11.5. Store connection is too slow
Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}` [copy] - alert: StoreConnectionIsTooSlow expr: store_connection_wait_time_ms > 10 for: 0m labels: severity: warning annotations: summary: Store connection is too slow (instance {{ $labels.instance }}) description: "Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.11.6. Store connection is too slow
Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}` [copy] - alert: StoreConnectionIsTooSlow expr: store_connection_wait_time_ms > 20 for: 0m labels: severity: critical annotations: summary: Store connection is too slow (instance {{ $labels.instance }}) description: "Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.12.1. GitLab : GitLab built-in exporter (21 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/gitlab/gitlab-built-in-exporter.yml-
# 8.12.1.1. GitLab Puma high queued connections
GitLab Puma has {{ $value }} queued connections on {{ $labels.instance }}. Requests are waiting for an available worker thread. [copy] # Queued connections indicate Puma workers are saturated. # Consider increasing puma['worker_processes'] or puma['max_threads'] in gitlab.rb. - alert: GitlabPumaHighQueuedConnections expr: avg_over_time(puma_queued_connections[5m]) > 5 for: 5m labels: severity: warning annotations: summary: GitLab Puma high queued connections (instance {{ $labels.instance }}) description: "GitLab Puma has {{ $value }} queued connections on {{ $labels.instance }}. Requests are waiting for an available worker thread.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
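Per the remediation hint in the comments, Puma capacity is raised in `/etc/gitlab/gitlab.rb` and applied with `gitlab-ctl reconfigure`. A sketch with illustrative values (size workers to CPU cores and available memory):

```ruby
# /etc/gitlab/gitlab.rb -- illustrative Puma sizing, not a recommendation
puma['worker_processes'] = 4   # roughly one worker per CPU core
puma['max_threads'] = 4        # threads per worker
```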
-
# 8.12.1.2. GitLab Puma no available pool capacity
GitLab Puma pool capacity on {{ $labels.instance }} has been at 0 for 5 minutes. All threads are busy. [copy] - alert: GitlabPumaNoAvailablePoolCapacity expr: puma_pool_capacity == 0 for: 5m labels: severity: critical annotations: summary: GitLab Puma no available pool capacity (instance {{ $labels.instance }}) description: "GitLab Puma pool capacity on {{ $labels.instance }} has been at 0 for 5 minutes. All threads are busy.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.3. GitLab Puma workers not running
GitLab Puma on {{ $labels.instance }} has {{ $value }} running workers out of expected total. [copy] - alert: GitlabPumaWorkersNotRunning expr: puma_running_workers < puma_workers for: 5m labels: severity: warning annotations: summary: GitLab Puma workers not running (instance {{ $labels.instance }}) description: "GitLab Puma on {{ $labels.instance }} has {{ $value }} running workers out of expected total.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.4. GitLab high HTTP error rate
GitLab is returning more than 5% HTTP 5xx errors on {{ $labels.instance }}. [copy] # Threshold is 5% of all requests returning server errors. # Check GitLab logs at /var/log/gitlab/ for root cause. - alert: GitlabHighHttpErrorRate expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 5 for: 5m labels: severity: critical annotations: summary: GitLab high HTTP error rate (instance {{ $labels.instance }}) description: "GitLab is returning more than 5% HTTP 5xx errors on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.5. GitLab high HTTP request latency
GitLab p95 HTTP request latency on {{ $labels.instance }} is above 10 seconds. [copy] # Threshold of 10s may need adjustment based on your instance size and workload. - alert: GitlabHighHttpRequestLatency expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 10 for: 5m labels: severity: warning annotations: summary: GitLab high HTTP request latency (instance {{ $labels.instance }}) description: "GitLab p95 HTTP request latency on {{ $labels.instance }} is above 10 seconds.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.6. GitLab Sidekiq jobs failing
GitLab Sidekiq jobs are failing at a rate of {{ $value }} per second on {{ $labels.instance }}. [copy] # This metric requires the emit_sidekiq_histogram_metrics feature flag to be enabled. # A sustained failure rate indicates background processing issues. - alert: GitlabSidekiqJobsFailing expr: rate(sidekiq_jobs_failed_total[5m]) > 0 for: 10m labels: severity: warning annotations: summary: GitLab Sidekiq jobs failing (instance {{ $labels.instance }}) description: "GitLab Sidekiq jobs are failing at a rate of {{ $value }} per second on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.7. GitLab Sidekiq queue too large
GitLab Sidekiq has {{ $value }} running jobs, approaching concurrency limit on {{ $labels.instance }}. [copy] # When running jobs approach the concurrency limit, new jobs will queue up. # Consider scaling Sidekiq workers or increasing concurrency. - alert: GitlabSidekiqQueueTooLarge expr: sum(sidekiq_running_jobs) >= sum(sidekiq_concurrency) * 0.9 for: 10m labels: severity: warning annotations: summary: GitLab Sidekiq queue too large (instance {{ $labels.instance }}) description: "GitLab Sidekiq has {{ $value }} running jobs, approaching concurrency limit on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.8. GitLab Sidekiq high job completion time
GitLab Sidekiq p95 job completion time on {{ $labels.instance }} is above 5 minutes. [copy] # This metric requires the emit_sidekiq_histogram_metrics feature flag to be enabled. - alert: GitlabSidekiqHighJobCompletionTime expr: histogram_quantile(0.95, sum(rate(sidekiq_jobs_completion_seconds_bucket[5m])) by (le, worker)) > 300 for: 10m labels: severity: warning annotations: summary: GitLab Sidekiq high job completion time (instance {{ $labels.instance }}) description: "GitLab Sidekiq p95 job completion time on {{ $labels.instance }} is above 5 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.9. GitLab Sidekiq high queue latency
GitLab Sidekiq jobs on {{ $labels.instance }} are waiting more than 60 seconds before being processed. [copy] # This metric requires the emit_sidekiq_histogram_metrics feature flag to be enabled. # High queue latency means jobs are stuck waiting. Check Sidekiq concurrency and queue sizes. - alert: GitlabSidekiqHighQueueLatency expr: histogram_quantile(0.95, sum(rate(sidekiq_jobs_queue_duration_seconds_bucket[5m])) by (le)) > 60 for: 5m labels: severity: warning annotations: summary: GitLab Sidekiq high queue latency (instance {{ $labels.instance }}) description: "GitLab Sidekiq jobs on {{ $labels.instance }} are waiting more than 60 seconds before being processed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.10. GitLab database connection pool saturation
GitLab database connection pool on {{ $labels.instance }} ({{ $labels.class }}) is {{ $value }}% busy. [copy] # When the pool is near saturation, requests may block waiting for a connection. # Increase db_pool_size in gitlab.rb or investigate slow queries. - alert: GitlabDatabaseConnectionPoolSaturation expr: gitlab_database_connection_pool_busy / gitlab_database_connection_pool_size * 100 > 90 for: 5m labels: severity: warning annotations: summary: GitLab database connection pool saturation (instance {{ $labels.instance }}) description: "GitLab database connection pool on {{ $labels.instance }} ({{ $labels.class }}) is {{ $value }}% busy.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.11. GitLab database connection pool dead connections
GitLab database connection pool on {{ $labels.instance }} ({{ $labels.class }}) has {{ $value }} dead connections. [copy] - alert: GitlabDatabaseConnectionPoolDeadConnections expr: gitlab_database_connection_pool_dead > 0 for: 5m labels: severity: warning annotations: summary: GitLab database connection pool dead connections (instance {{ $labels.instance }}) description: "GitLab database connection pool on {{ $labels.instance }} ({{ $labels.class }}) has {{ $value }} dead connections.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.12. GitLab database connection pool waiting
GitLab on {{ $labels.instance }} has {{ $value }} threads waiting for a database connection. [copy] - alert: GitlabDatabaseConnectionPoolWaiting expr: gitlab_database_connection_pool_waiting > 0 for: 5m labels: severity: warning annotations: summary: GitLab database connection pool waiting (instance {{ $labels.instance }}) description: "GitLab on {{ $labels.instance }} has {{ $value }} threads waiting for a database connection.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.13. GitLab CI pipeline creation slow
GitLab CI pipeline creation p95 latency on {{ $labels.instance }} is above 30 seconds. [copy] - alert: GitlabCiPipelineCreationSlow expr: histogram_quantile(0.95, sum(rate(gitlab_ci_pipeline_creation_duration_seconds_bucket[5m])) by (le)) > 30 for: 5m labels: severity: warning annotations: summary: GitLab CI pipeline creation slow (instance {{ $labels.instance }}) description: "GitLab CI pipeline creation p95 latency on {{ $labels.instance }} is above 30 seconds.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.14. GitLab CI pipeline failures increasing
GitLab CI pipeline failures are increasing on {{ $labels.instance }} ({{ $value }}/s). [copy] - alert: GitlabCiPipelineFailuresIncreasing expr: rate(gitlab_ci_pipeline_failure_reasons[5m]) > 0 for: 10m labels: severity: warning annotations: summary: GitLab CI pipeline failures increasing (instance {{ $labels.instance }}) description: "GitLab CI pipeline failures are increasing on {{ $labels.instance }} ({{ $value }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.15. GitLab CI runner authentication failures
GitLab CI runners are experiencing authentication failures on {{ $labels.instance }} ({{ $value }} failures). [copy] # Frequent runner auth failures may indicate expired tokens or misconfigured runners. - alert: GitlabCiRunnerAuthenticationFailures expr: increase(gitlab_ci_runner_authentication_failure_total[5m]) > 5 for: 5m labels: severity: warning annotations: summary: GitLab CI runner authentication failures (instance {{ $labels.instance }}) description: "GitLab CI runners are experiencing authentication failures on {{ $labels.instance }} ({{ $value }} failures).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.16. GitLab high memory usage
GitLab process on {{ $labels.instance }} is using {{ $value | humanize1024 }}B of RSS memory. [copy] # Threshold of 2GB may need adjustment based on your instance size. # High memory usage can lead to OOM kills and service disruptions. - alert: GitlabHighMemoryUsage expr: process_resident_memory_bytes{job=~".*gitlab.*"} > 2e+9 for: 10m labels: severity: warning annotations: summary: GitLab high memory usage (instance {{ $labels.instance }}) description: "GitLab process on {{ $labels.instance }} is using {{ $value | humanize1024 }}B of RSS memory.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.17. GitLab Ruby heap fragmentation
GitLab Ruby heap fragmentation on {{ $labels.instance }} is {{ $value }}. High fragmentation wastes memory. [copy] # Heap fragmentation above 50% means a significant amount of memory is wasted. # A Puma worker restart may help reclaim memory. - alert: GitlabRubyHeapFragmentation expr: ruby_gc_stat_ext_heap_fragmentation{job=~".*gitlab.*"} > 0.5 for: 15m labels: severity: warning annotations: summary: GitLab Ruby heap fragmentation (instance {{ $labels.instance }}) description: "GitLab Ruby heap fragmentation on {{ $labels.instance }} is {{ $value }}. High fragmentation wastes memory.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.18. GitLab rack uncaught errors
GitLab is experiencing uncaught errors in the Rack layer on {{ $labels.instance }} ({{ $value }}/s). [copy] - alert: GitlabRackUncaughtErrors expr: rate(rack_uncaught_errors_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: GitLab rack uncaught errors (instance {{ $labels.instance }}) description: "GitLab is experiencing uncaught errors in the Rack layer on {{ $labels.instance }} ({{ $value }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.19. GitLab version mismatch
Multiple GitLab versions are running across the fleet. [copy] # This may happen during a rolling deployment. If it persists, investigate incomplete upgrades. - alert: GitlabVersionMismatch expr: count(count by (version) (deployments{version!=""})) > 1 for: 0m labels: severity: warning annotations: summary: GitLab version mismatch (instance {{ $labels.instance }}) description: "Multiple GitLab versions are running across the fleet.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.20. GitLab high file descriptor usage
GitLab on {{ $labels.instance }} is using {{ $value }}% of available file descriptors. [copy] - alert: GitlabHighFileDescriptorUsage expr: process_open_fds{job=~".*gitlab.*"} / process_max_fds * 100 > 80 for: 5m labels: severity: warning annotations: summary: GitLab high file descriptor usage (instance {{ $labels.instance }}) description: "GitLab on {{ $labels.instance }} is using {{ $value }}% of available file descriptors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.21. GitLab Ruby threads saturated
GitLab running threads on {{ $labels.instance }} have exceeded the expected maximum ({{ $value }}). [copy] - alert: GitlabRubyThreadsSaturated expr: sum by (instance) (gitlab_ruby_threads_running_threads) > on(instance) gitlab_ruby_threads_max_expected_threads * 1.5 for: 10m labels: severity: warning annotations: summary: GitLab Ruby threads saturated (instance {{ $labels.instance }}) description: "GitLab running threads on {{ $labels.instance }} have exceeded the expected maximum ({{ $value }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.12.2. GitLab : Workhorse (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/gitlab/workhorse.yml
-
# 8.12.2.1. GitLab Workhorse high error rate
GitLab Workhorse on {{ $labels.instance }} is returning more than 10% HTTP 5xx errors. [copy] # Workhorse sits in front of Puma and handles Git HTTP, file uploads, and proxying. # Threshold from GitLab Omnibus default rules: 10% for high-traffic instances. # Aggregate by instance so the {{ $labels.instance }} annotation is populated. - alert: GitlabWorkhorseHighErrorRate expr: sum by (instance) (rate(gitlab_workhorse_http_request_duration_seconds_count{code=~"5.."}[5m])) / sum by (instance) (rate(gitlab_workhorse_http_request_duration_seconds_count[5m])) * 100 > 10 for: 5m labels: severity: critical annotations: summary: GitLab Workhorse high error rate (instance {{ $labels.instance }}) description: "GitLab Workhorse on {{ $labels.instance }} is returning more than 10% HTTP 5xx errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
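The expression is a plain error-rate ratio: the per-second rate of 5xx responses over the rate of all responses, scaled to a percentage. A quick sanity check of the arithmetic with hypothetical rates:

```python
# Hypothetical per-second rates over the rule's 5m window.
rate_5xx = 12.0     # requests/s answered with HTTP 5xx
rate_total = 100.0  # all requests/s

error_pct = rate_5xx / rate_total * 100
alert_fires = error_pct > 10  # the rule's 10% threshold
```

At 12 errors/s against 100 requests/s the ratio is 12%, above the 10% threshold, so after the 5m `for` delay the alert fires.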
-
# 8.12.2.2. GitLab Workhorse high latency
GitLab Workhorse on {{ $labels.instance }} p95 request latency is above 10 seconds. [copy] # Keep the instance label in the aggregation so the annotations can reference it. - alert: GitlabWorkhorseHighLatency expr: histogram_quantile(0.95, sum(rate(gitlab_workhorse_http_request_duration_seconds_bucket[5m])) by (le, instance)) > 10 for: 5m labels: severity: warning annotations: summary: GitLab Workhorse high latency (instance {{ $labels.instance }}) description: "GitLab Workhorse on {{ $labels.instance }} p95 request latency is above 10 seconds.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
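`histogram_quantile` estimates the p95 from cumulative `le` buckets: it finds the bucket containing the target rank, then interpolates linearly within it. An illustrative reimplementation (simplified, with hypothetical bucket data; Prometheus itself handles more edge cases):

```python
# Cumulative buckets: (upper bound in seconds, count of requests <= bound).
buckets = [
    (0.1, 500),
    (1.0, 900),
    (10.0, 990),
    (float("inf"), 1000),
]

def histogram_quantile(q, buckets):
    total = buckets[-1][1]
    rank = q * total  # position of the target observation
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the open-ended bucket
            # Linear interpolation within the bucket, as Prometheus does.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

p95 = histogram_quantile(0.95, buckets)  # 6.0 s with this data
```

Rank 950 of 1000 lands in the (1.0, 10.0] bucket, interpolating to 6.0 s — below the rule's 10 s threshold, so with these numbers the alert would stay quiet. Interpolation also explains why coarse buckets give coarse quantile estimates.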
-
# 8.12.2.3. GitLab Workhorse high in-flight requests
GitLab Workhorse on {{ $labels.instance }} has {{ $value }} in-flight requests. [copy] # Threshold of 100 may need adjustment based on instance size. - alert: GitlabWorkhorseHighInFlightRequests expr: gitlab_workhorse_http_in_flight_requests > 100 for: 5m labels: severity: warning annotations: summary: GitLab Workhorse high in-flight requests (instance {{ $labels.instance }}) description: "GitLab Workhorse on {{ $labels.instance }} has {{ $value }} in-flight requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.12.3. GitLab : Gitaly (6 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/gitlab/gitaly.yml
-
# 8.12.3.1. GitLab Gitaly high gRPC error rate
Gitaly on {{ $labels.instance }} is returning more than 5% gRPC errors. [copy] # Aggregate by instance so the {{ $labels.instance }} annotation is populated. - alert: GitlabGitalyHighGrpcErrorRate expr: sum by (instance) (rate(grpc_server_handled_total{job="gitaly",grpc_code!="OK"}[5m])) / sum by (instance) (rate(grpc_server_handled_total{job="gitaly"}[5m])) * 100 > 5 for: 5m labels: severity: warning annotations: summary: GitLab Gitaly high gRPC error rate (instance {{ $labels.instance }}) description: "Gitaly on {{ $labels.instance }} is returning more than 5% gRPC errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.3.2. GitLab Gitaly resource exhausted
Gitaly on {{ $labels.instance }} is returning ResourceExhausted errors, indicating overload ({{ $value }}%). [copy] # ResourceExhausted errors from Gitaly mean Git operations are being rejected due to # concurrency limits. This directly impacts users trying to push, pull, or clone. # This alert is derived from the GitLab Omnibus default rules. - alert: GitlabGitalyResourceExhausted expr: sum by (instance) (rate(grpc_server_handled_total{job="gitaly",grpc_code="ResourceExhausted"}[5m])) / sum by (instance) (rate(grpc_server_handled_total{job="gitaly"}[5m])) * 100 > 1 for: 5m labels: severity: critical annotations: summary: GitLab Gitaly resource exhausted (instance {{ $labels.instance }}) description: "Gitaly on {{ $labels.instance }} is returning ResourceExhausted errors, indicating overload ({{ $value }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.3.3. GitLab Gitaly high RPC latency
Gitaly on {{ $labels.instance }} p95 unary RPC latency exceeds 1 second ({{ $value }}s). [copy] - alert: GitlabGitalyHighRpcLatency expr: histogram_quantile(0.95, sum(rate(grpc_server_handling_seconds_bucket{job="gitaly",grpc_type="unary"}[5m])) by (le, instance)) > 1 for: 5m labels: severity: warning annotations: summary: GitLab Gitaly high RPC latency (instance {{ $labels.instance }}) description: "Gitaly on {{ $labels.instance }} p95 unary RPC latency exceeds 1 second ({{ $value }}s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.3.4. GitLab Gitaly CPU throttled
Gitaly processes on {{ $labels.instance }} are being CPU throttled by cgroups. [copy] - alert: GitlabGitalyCpuThrottled expr: rate(gitaly_cgroup_cpu_cfs_throttled_seconds_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: GitLab Gitaly CPU throttled (instance {{ $labels.instance }}) description: "Gitaly processes on {{ $labels.instance }} are being CPU throttled by cgroups.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.3.5. GitLab Gitaly authentication failures
Gitaly on {{ $labels.instance }} has authentication failures ({{ $value }}). [copy] - alert: GitlabGitalyAuthenticationFailures expr: increase(gitaly_authentications_total{status="failed"}[5m]) > 0 for: 0m labels: severity: warning annotations: summary: GitLab Gitaly authentication failures (instance {{ $labels.instance }}) description: "Gitaly on {{ $labels.instance }} has authentication failures ({{ $value }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.3.6. GitLab Gitaly circuit breaker tripped
Gitaly circuit breaker has tripped on {{ $labels.instance }}. Git operations are failing. [copy] # When the circuit breaker trips to "open" state, Git operations (push, pull, clone) will fail. # Check Gitaly service health and logs. - alert: GitlabGitalyCircuitBreakerTripped expr: increase(gitaly_circuit_breaker_transitions_total{to_state="open"}[5m]) > 0 for: 0m labels: severity: critical annotations: summary: GitLab Gitaly circuit breaker tripped (instance {{ $labels.instance }}) description: "Gitaly circuit breaker has tripped on {{ $labels.instance }}. Git operations are failing.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
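The metric counts transitions *into* the open state, at which point a breaker rejects operations outright instead of attempting them. A generic circuit-breaker sketch (not Gitaly's actual implementation) showing why each such transition is alert-worthy:

```python
class CircuitBreaker:
    """Minimal failure-count breaker; real ones add timeouts and half-open probes."""

    def __init__(self, failure_threshold=3):
        self.state = "closed"
        self.consecutive_failures = 0
        self.failure_threshold = failure_threshold
        # Analogue of gitaly_circuit_breaker_transitions_total{to_state="open"}.
        self.transitions_to_open = 0

    def record(self, success):
        if success:
            self.consecutive_failures = 0
            self.state = "closed"
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold and self.state != "open":
            self.state = "open"  # start rejecting operations outright
            self.transitions_to_open += 1

cb = CircuitBreaker()
for ok in (True, False, False, False, False):
    cb.record(ok)
# One burst of failures => one transition to "open"; increase(...[5m]) > 0
# in the alert expression catches exactly this event.
```

Note the counter increments once per trip, not once per rejected operation, which is why the alert uses `increase(...) > 0` with `for: 0m`: a single trip already means users are seeing failed pushes, pulls, and clones.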
-
-
# 8.13. Jaeger : Embedded exporter (8 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/jaeger/embedded-exporter.yml
-
# 8.13.1. Jaeger agent HTTP server errors
Jaeger agent on {{ $labels.instance }} is experiencing {{ $value | humanize }}% HTTP server errors. [copy] - alert: JaegerAgentHttpServerErrors expr: 100 * sum(rate(jaeger_agent_http_server_errors_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_agent_http_server_total[1m])) by (instance, job, namespace) > 1 for: 15m labels: severity: warning annotations: summary: Jaeger agent HTTP server errors (instance {{ $labels.instance }}) description: "Jaeger agent on {{ $labels.instance }} is experiencing {{ $value | humanize }}% HTTP server errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
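All eight Jaeger rules share one shape: error and total rates are summed per `(instance, job, namespace)` group, divided, scaled to a percentage, and compared to a 1% threshold. A sketch of that grouped ratio on hypothetical per-second rates:

```python
# (instance, error rate/s, total rate/s) over the 1m window — hypothetical data.
samples = [
    ("agent-1", 0.5, 20.0),   # 2.5% errors
    ("agent-2", 0.0, 30.0),   # healthy
]

error_pct = {inst: err / total * 100 for inst, err, total in samples}
firing = {inst for inst, pct in error_pct.items() if pct > 1}  # the "> 1" threshold
```

Only `agent-1` crosses the 1% line, so only its group produces an alert; grouping by instance/job/namespace is what keeps one noisy agent from firing (or masking) alerts for the whole fleet.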
-
# 8.13.2. Jaeger client RPC request errors
Jaeger client on {{ $labels.instance }} is experiencing {{ $value | humanize }}% RPC HTTP errors. [copy] - alert: JaegerClientRpcRequestErrors expr: 100 * sum(rate(jaeger_client_jaeger_rpc_http_requests{status_code=~"4xx|5xx"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_client_jaeger_rpc_http_requests[1m])) by (instance, job, namespace) > 1 for: 15m labels: severity: warning annotations: summary: Jaeger client RPC request errors (instance {{ $labels.instance }}) description: "Jaeger client on {{ $labels.instance }} is experiencing {{ $value | humanize }}% RPC HTTP errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.13.3. Jaeger client spans dropped
Jaeger client on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans. [copy] - alert: JaegerClientSpansDropped expr: 100 * sum(rate(jaeger_reporter_spans{result=~"dropped|err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_reporter_spans[1m])) by (instance, job, namespace) > 1 for: 15m labels: severity: warning annotations: summary: Jaeger client spans dropped (instance {{ $labels.instance }}) description: "Jaeger client on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.13.4. Jaeger agent spans dropped
Jaeger agent on {{ $labels.instance }} is dropping {{ $value | humanize }}% of span batches. [copy] - alert: JaegerAgentSpansDropped expr: 100 * sum(rate(jaeger_agent_reporter_batches_failures_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_agent_reporter_batches_submitted_total[1m])) by (instance, job, namespace) > 1 for: 15m labels: severity: warning annotations: summary: Jaeger agent spans dropped (instance {{ $labels.instance }}) description: "Jaeger agent on {{ $labels.instance }} is dropping {{ $value | humanize }}% of span batches.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.13.5. Jaeger collector dropping spans
Jaeger collector on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans. [copy] - alert: JaegerCollectorDroppingSpans expr: 100 * sum(rate(jaeger_collector_spans_dropped_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_collector_spans_received_total[1m])) by (instance, job, namespace) > 1 for: 15m labels: severity: warning annotations: summary: Jaeger collector dropping spans (instance {{ $labels.instance }}) description: "Jaeger collector on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.13.6. Jaeger sampling update failing
Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of sampling policy updates. [copy] - alert: JaegerSamplingUpdateFailing expr: 100 * sum(rate(jaeger_sampler_queries{result="err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_sampler_queries[1m])) by (instance, job, namespace) > 1 for: 15m labels: severity: warning annotations: summary: Jaeger sampling update failing (instance {{ $labels.instance }}) description: "Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of sampling policy updates.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.13.7. Jaeger throttling update failing
Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of throttling policy updates. [copy] - alert: JaegerThrottlingUpdateFailing expr: 100 * sum(rate(jaeger_throttler_updates{result="err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_throttler_updates[1m])) by (instance, job, namespace) > 1 for: 15m labels: severity: warning annotations: summary: Jaeger throttling update failing (instance {{ $labels.instance }}) description: "Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of throttling policy updates.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.13.8. Jaeger query request failures
Jaeger query on {{ $labels.instance }} is failing {{ $value | humanize }}% of requests. [copy] - alert: JaegerQueryRequestFailures expr: 100 * sum(rate(jaeger_query_requests_total{result="err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_query_requests_total[1m])) by (instance, job, namespace) > 1 for: 15m labels: severity: warning annotations: summary: Jaeger query request failures (instance {{ $labels.instance }}) description: "Jaeger query on {{ $labels.instance }} is failing {{ $value | humanize }}% of requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-