⚠️ Caution ⚠️
Alert thresholds depend on the nature of your applications.
Some queries on this page may use arbitrary tolerance thresholds.
Building an efficient and battle-tested monitoring platform takes time. 😉
-
# 1.1. Prometheus self-monitoring (28 rules)
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/prometheus-self-monitoring/embedded-exporter.yml
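Once downloaded, the rules file is wired into Prometheus through the `rule_files` stanza. A minimal sketch, assuming the file was saved under `/etc/prometheus/rules/` (adjust the path and Alertmanager address to your setup):

```yaml
# prometheus.yml -- minimal sketch; paths and targets below are assumptions
global:
  scrape_interval: 15s
  evaluation_interval: 15s   # how often alerting rules are evaluated

rule_files:
  - /etc/prometheus/rules/embedded-exporter.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]   # assumed Alertmanager address
```

Reload Prometheus (SIGHUP, or a POST to `/-/reload` when `--web.enable-lifecycle` is set) to pick up the new rules.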
# 1.1.1. Prometheus job missing
A Prometheus job has disappeared.

```yaml
- alert: PrometheusJobMissing
  expr: absent(up{job="prometheus"})
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus job missing (instance {{ $labels.instance }})
    description: "A Prometheus job has disappeared\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.2. Prometheus target missing
A Prometheus target has disappeared. An exporter might have crashed.

```yaml
# Only fire if at least one target in the job is still up.
# If all targets are down, PrometheusJobMissing or PrometheusAllTargetsMissing will fire instead.
- alert: PrometheusTargetMissing
  expr: up == 0 unless on(job) (sum by (job) (up) == 0)
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target missing (instance {{ $labels.instance }})
    description: "A Prometheus target has disappeared. An exporter might have crashed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.3. Prometheus all targets missing
A Prometheus job no longer has any living targets.

```yaml
- alert: PrometheusAllTargetsMissing
  expr: sum by (job) (up) == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus all targets missing (instance {{ $labels.instance }})
    description: "A Prometheus job no longer has any living targets.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.4. Prometheus target missing with warmup time
Allow a job time to start up (10 minutes) before alerting that it's down.

```yaml
- alert: PrometheusTargetMissingWithWarmupTime
  expr: sum by (instance, job) ((up == 0) * on (instance) group_left(__name__) (node_time_seconds - node_boot_time_seconds > 600))
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target missing with warmup time (instance {{ $labels.instance }})
    description: "Allow a job time to start up (10 minutes) before alerting that it's down.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.5. Prometheus configuration reload failure
Prometheus configuration reload error.

```yaml
- alert: PrometheusConfigurationReloadFailure
  expr: prometheus_config_last_reload_successful != 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus configuration reload failure (instance {{ $labels.instance }})
    description: "Prometheus configuration reload error\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.6. Prometheus too many restarts
Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.

```yaml
- alert: PrometheusTooManyRestarts
  expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus too many restarts (instance {{ $labels.instance }})
    description: "Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.7. Prometheus AlertManager job missing
A Prometheus AlertManager job has disappeared.

```yaml
- alert: PrometheusAlertmanagerJobMissing
  expr: absent(up{job="alertmanager"})
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus AlertManager job missing (instance {{ $labels.instance }})
    description: "A Prometheus AlertManager job has disappeared\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.8. Prometheus AlertManager configuration reload failure
AlertManager configuration reload error.

```yaml
- alert: PrometheusAlertmanagerConfigurationReloadFailure
  expr: alertmanager_config_last_reload_successful != 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }})
    description: "AlertManager configuration reload error\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.9. Prometheus AlertManager config not synced
Configurations of AlertManager cluster instances are out of sync.

```yaml
- alert: PrometheusAlertmanagerConfigNotSynced
  expr: count(count_values("config_hash", alertmanager_config_hash)) > 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus AlertManager config not synced (instance {{ $labels.instance }})
    description: "Configurations of AlertManager cluster instances are out of sync\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.10. Prometheus AlertManager E2E dead man switch
Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.

```yaml
- alert: PrometheusAlertmanagerE2eDeadManSwitch
  expr: vector(1)
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus AlertManager E2E dead man switch (instance {{ $labels.instance }})
    description: "Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
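To make the dead man's switch useful, route it to a dedicated receiver (typically an external heartbeat or watchdog service) with a short `repeat_interval`, so the downstream service notices when the heartbeat stops. A minimal Alertmanager sketch; the receiver name and webhook URL are placeholders, not part of the rule set above:

```yaml
# alertmanager.yml fragment -- receiver name and URL are assumptions
route:
  receiver: default
  routes:
    - receiver: deadmansswitch
      matchers:
        - alertname = "PrometheusAlertmanagerE2eDeadManSwitch"
      repeat_interval: 1m          # keep the heartbeat firing frequently

receivers:
  - name: default
  - name: deadmansswitch
    webhook_configs:
      - url: https://example.com/heartbeat   # external watchdog endpoint
```

The external service pages you when the heartbeat *stops* arriving, which catches a dead Prometheus or a broken Prometheus-to-Alertmanager path.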
# 1.1.11. Prometheus not connected to alertmanager
Prometheus cannot connect to the alertmanager.

```yaml
- alert: PrometheusNotConnectedToAlertmanager
  expr: prometheus_notifications_alertmanagers_discovered < 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus not connected to alertmanager (instance {{ $labels.instance }})
    description: "Prometheus cannot connect to the alertmanager\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.12. Prometheus rule evaluation failures
Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.

```yaml
- alert: PrometheusRuleEvaluationFailures
  expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus rule evaluation failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.13. Prometheus template text expansion failures
Prometheus encountered {{ $value }} template text expansion failures.

```yaml
- alert: PrometheusTemplateTextExpansionFailures
  expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus template text expansion failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} template text expansion failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.14. Prometheus rule evaluation slow
Prometheus rule evaluation took more time than the scheduled interval. This indicates slower storage backend access or overly complex queries.

```yaml
- alert: PrometheusRuleEvaluationSlow
  expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus rule evaluation slow (instance {{ $labels.instance }})
    description: "Prometheus rule evaluation took more time than the scheduled interval. This indicates slower storage backend access or overly complex queries.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.15. Prometheus notifications backlog
The Prometheus notification queue has not been empty for 10 minutes.

```yaml
- alert: PrometheusNotificationsBacklog
  expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus notifications backlog (instance {{ $labels.instance }})
    description: "The Prometheus notification queue has not been empty for 10 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.16. Prometheus AlertManager notification failing
Alertmanager is failing to send notifications.

```yaml
- alert: PrometheusAlertmanagerNotificationFailing
  expr: rate(alertmanager_notifications_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus AlertManager notification failing (instance {{ $labels.instance }})
    description: "Alertmanager is failing to send notifications\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.17. Prometheus target empty
Prometheus has no target in service discovery.

```yaml
- alert: PrometheusTargetEmpty
  expr: prometheus_sd_discovered_targets == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus target empty (instance {{ $labels.instance }})
    description: "Prometheus has no target in service discovery\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.18. Prometheus target scraping slow
Prometheus is scraping exporters slowly because it exceeded the requested interval time. Your Prometheus server is under-provisioned.

```yaml
- alert: PrometheusTargetScrapingSlow
  expr: prometheus_target_interval_length_seconds{quantile="0.9"} / on (interval, instance, job) prometheus_target_interval_length_seconds{quantile="0.5"} > 1.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus target scraping slow (instance {{ $labels.instance }})
    description: "Prometheus is scraping exporters slowly because it exceeded the requested interval time. Your Prometheus server is under-provisioned.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.19. Prometheus large scrape
Prometheus has many scrapes that exceed the sample limit.

```yaml
- alert: PrometheusLargeScrape
  expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus large scrape (instance {{ $labels.instance }})
    description: "Prometheus has many scrapes that exceed the sample limit\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.20. Prometheus target scrape duplicate
Prometheus has many samples rejected due to duplicate timestamps but different values.

```yaml
- alert: PrometheusTargetScrapeDuplicate
  expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus target scrape duplicate (instance {{ $labels.instance }})
    description: "Prometheus has many samples rejected due to duplicate timestamps but different values\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.21. Prometheus TSDB checkpoint creation failures
Prometheus encountered {{ $value }} checkpoint creation failures.

```yaml
- alert: PrometheusTsdbCheckpointCreationFailures
  expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} checkpoint creation failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.22. Prometheus TSDB checkpoint deletion failures
Prometheus encountered {{ $value }} checkpoint deletion failures.

```yaml
- alert: PrometheusTsdbCheckpointDeletionFailures
  expr: increase(prometheus_tsdb_checkpoint_deletions_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} checkpoint deletion failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.23. Prometheus TSDB compactions failed
Prometheus encountered {{ $value }} TSDB compaction failures.

```yaml
- alert: PrometheusTsdbCompactionsFailed
  expr: increase(prometheus_tsdb_compactions_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB compactions failed (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB compaction failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.24. Prometheus TSDB head truncations failed
Prometheus encountered {{ $value }} TSDB head truncation failures.

```yaml
- alert: PrometheusTsdbHeadTruncationsFailed
  expr: increase(prometheus_tsdb_head_truncations_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB head truncations failed (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB head truncation failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.25. Prometheus TSDB reload failures
Prometheus encountered {{ $value }} TSDB reload failures.

```yaml
- alert: PrometheusTsdbReloadFailures
  expr: increase(prometheus_tsdb_reloads_failures_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB reload failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB reload failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.26. Prometheus TSDB WAL corruptions
Prometheus encountered {{ $value }} TSDB WAL corruptions.

```yaml
- alert: PrometheusTsdbWalCorruptions
  expr: increase(prometheus_tsdb_wal_corruptions_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB WAL corruptions\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.27. Prometheus TSDB WAL truncations failed
Prometheus encountered {{ $value }} TSDB WAL truncation failures.

```yaml
- alert: PrometheusTsdbWalTruncationsFailed
  expr: increase(prometheus_tsdb_wal_truncations_failed_total[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} TSDB WAL truncation failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.1.28. Prometheus timeseries cardinality
The "{{ $labels.name }}" timeseries cardinality is getting very high: {{ $value }} [copy] - alert: PrometheusTimeseriesCardinality expr: label_replace(count by(__name__) ({__name__=~".+"}), "name", "$1", "__name__", "(.+)") > 10000 for: 0m labels: severity: warning annotations: summary: Prometheus timeseries cardinality (instance {{ $labels.instance }}) description: "The \"{{ $labels.name }}\" timeseries cardinality is getting very high: {{ $value }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 1.2. Host and hardware : node-exporter (35 rules)
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/host-and-hardware/node-exporter.yml
# 1.2.1. Host out of memory
Node memory is filling up (< 10% left).

```yaml
- alert: HostOutOfMemory
  expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < .10)
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host out of memory (instance {{ $labels.instance }})
    description: "Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.2. Host memory under memory pressure
The node is under heavy memory pressure. High rate of loading memory pages from disk.

```yaml
- alert: HostMemoryUnderMemoryPressure
  expr: (rate(node_vmstat_pgmajfault[5m]) > 1000)
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host memory under memory pressure (instance {{ $labels.instance }})
    description: "The node is under heavy memory pressure. High rate of loading memory pages from disk.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.3. Host Memory is underutilized
Node memory usage is < 20% for 1 week. Consider reducing the allocated memory. (instance {{ $labels.instance }})

```yaml
# You may want to increase the alert manager 'repeat_interval' for this type of alert to daily or weekly
- alert: HostMemoryIsUnderutilized
  expr: min_over_time(node_memory_MemFree_bytes[1w]) > node_memory_MemTotal_bytes * .8
  for: 0m
  labels:
    severity: info
  annotations:
    summary: Host Memory is underutilized (instance {{ $labels.instance }})
    description: "Node memory usage is < 20% for 1 week. Consider reducing the allocated memory. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.4. Host unusual network throughput in
Host receive bandwidth is high (>80%).

```yaml
- alert: HostUnusualNetworkThroughputIn
  expr: ((rate(node_network_receive_bytes_total[5m]) / node_network_speed_bytes) > .80)
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host unusual network throughput in (instance {{ $labels.instance }})
    description: "Host receive bandwidth is high (>80%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.5. Host unusual network throughput out
Host transmit bandwidth is high (>80%).

```yaml
- alert: HostUnusualNetworkThroughputOut
  expr: ((rate(node_network_transmit_bytes_total[5m]) / node_network_speed_bytes) > .80)
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host unusual network throughput out (instance {{ $labels.instance }})
    description: "Host transmit bandwidth is high (>80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.6. Host disk IO utilization high
Disk utilization is high (> 80%).

```yaml
- alert: HostDiskIoUtilizationHigh
  expr: (rate(node_disk_io_time_seconds_total[5m]) > .80)
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host disk IO utilization high (instance {{ $labels.instance }})
    description: "Disk utilization is high (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.7. Host out of disk space
Disk is almost full (< 10% left).

```yaml
# Please add ignored mountpoints in node_exporter parameters like
# "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
# Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
- alert: HostOutOfDiskSpace
  expr: (node_filesystem_avail_bytes{fstype!~"^(fuse.*|tmpfs|cifs|nfs)"} / node_filesystem_size_bytes < .10 and on (instance, device, mountpoint) node_filesystem_readonly == 0)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Host out of disk space (instance {{ $labels.instance }})
    description: "Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.8. Host disk may fill in 24 hours
Filesystem will likely run out of space within the next 24 hours.

```yaml
# Please add ignored mountpoints in node_exporter parameters like
# "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
# Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
- alert: HostDiskMayFillIn24Hours
  expr: predict_linear(node_filesystem_avail_bytes{fstype!~"^(fuse.*|tmpfs|cifs|nfs)"}[3h], 86400) <= 0 and node_filesystem_avail_bytes > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host disk may fill in 24 hours (instance {{ $labels.instance }})
    description: "Filesystem will likely run out of space within the next 24 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.9. Host out of inodes
Disk is almost running out of available inodes (< 10% left).

```yaml
- alert: HostOutOfInodes
  expr: (node_filesystem_files_free / node_filesystem_files < .10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Host out of inodes (instance {{ $labels.instance }})
    description: "Disk is almost running out of available inodes (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.10. Host filesystem device error
Error stat-ing the {{ $labels.mountpoint }} filesystem.

```yaml
- alert: HostFilesystemDeviceError
  expr: node_filesystem_device_error{fstype!~"^(fuse.*|tmpfs|cifs|nfs)"} == 1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Host filesystem device error (instance {{ $labels.instance }})
    description: "Error stat-ing the {{ $labels.mountpoint }} filesystem\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.11. Host inodes may fill in 24 hours
Filesystem will likely run out of inodes within the next 24 hours at current write rate.

```yaml
- alert: HostInodesMayFillIn24Hours
  expr: predict_linear(node_filesystem_files_free{fstype!~"^(fuse.*|tmpfs|cifs|nfs)"}[1h], 86400) <= 0 and node_filesystem_files_free > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host inodes may fill in 24 hours (instance {{ $labels.instance }})
    description: "Filesystem will likely run out of inodes within the next 24 hours at current write rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
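Both 24-hour prediction rules rely on `predict_linear()`, which fits a least-squares line through the samples in the range window and extrapolates it into the future. A minimal Python sketch of the underlying math (simplified: it extrapolates from the last sample's timestamp rather than the evaluation time, and the sample data is illustrative):

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares fit over (timestamp, value) pairs, extrapolated
    seconds_ahead past the last sample -- the idea behind PromQL's
    predict_linear(metric[range], t)."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var                      # e.g. free bytes (or inodes) per second
    intercept = mean_v - slope * mean_t
    target_t = samples[-1][0] + seconds_ahead
    return slope * target_t + intercept

# A filesystem steadily losing free space: extrapolated 24h ahead,
# the predicted free space is far below zero, so the alert would fire.
samples = [(0, 10_000), (60, 6_400), (120, 2_800)]
print(predict_linear(samples, 86400))
```

The rules above fire when this extrapolated value drops to or below zero while the current free space is still positive, i.e. the filesystem is trending toward full within 24 hours.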
# 1.2.12. Host unusual disk read latency
Disk latency is growing (read operations > 100ms).

```yaml
- alert: HostUnusualDiskReadLatency
  expr: (rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0)
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host unusual disk read latency (instance {{ $labels.instance }})
    description: "Disk latency is growing (read operations > 100ms)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.13. Host unusual disk write latency
Disk latency is growing (write operations > 100ms).

```yaml
- alert: HostUnusualDiskWriteLatency
  expr: (rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0)
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host unusual disk write latency (instance {{ $labels.instance }})
    description: "Disk latency is growing (write operations > 100ms)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.14. Host high CPU load
CPU load is > 80%.

```yaml
- alert: HostHighCpuLoad
  expr: 1 - (avg without (cpu) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > .80
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Host high CPU load (instance {{ $labels.instance }})
    description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.15. Host CPU is underutilized
CPU load has been < 20% for 1 week. Consider reducing the number of CPUs.

```yaml
# You may want to increase the alert manager 'repeat_interval' for this type of alert to daily or weekly
- alert: HostCpuIsUnderutilized
  expr: (min without (cpu) (rate(node_cpu_seconds_total{mode="idle"}[1h]))) > 0.8
  for: 1w
  labels:
    severity: info
  annotations:
    summary: Host CPU is underutilized (instance {{ $labels.instance }})
    description: "CPU load has been < 20% for 1 week. Consider reducing the number of CPUs.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.16. Host CPU steal noisy neighbor
CPU steal is > 10%. A noisy neighbor is degrading VM performance, or a spot instance may have run out of CPU credits.

```yaml
- alert: HostCpuStealNoisyNeighbor
  expr: avg without (cpu) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host CPU steal noisy neighbor (instance {{ $labels.instance }})
    description: "CPU steal is > 10%. A noisy neighbor is degrading VM performance, or a spot instance may have run out of CPU credits.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.17. Host CPU high iowait
CPU iowait > 10%. Your CPU is idling waiting for storage to respond.

```yaml
- alert: HostCpuHighIowait
  expr: avg without (cpu) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) > .10
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host CPU high iowait (instance {{ $labels.instance }})
    description: "CPU iowait > 10%. Your CPU is idling waiting for storage to respond.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.18. Host unusual disk IO
Disk I/O usage is >80%. Check storage for issues or increase IOPS capacity.

```yaml
- alert: HostUnusualDiskIo
  expr: rate(node_disk_io_time_seconds_total[5m]) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host unusual disk IO (instance {{ $labels.instance }})
    description: "Disk I/O usage is >80%. Check storage for issues or increase IOPS capacity.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.19. Host context switching high
Context switching is growing on the node (twice the daily average during the last 15m).

```yaml
# x2 context switches is an arbitrary number.
# The alert threshold depends on the nature of the application.
# Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58
- alert: HostContextSwitchingHigh
  expr: (rate(node_context_switches_total[15m])/count without(mode,cpu) (node_cpu_seconds_total{mode="idle"})) / (rate(node_context_switches_total[1d])/count without(mode,cpu) (node_cpu_seconds_total{mode="idle"})) > 2
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host context switching high (instance {{ $labels.instance }})
    description: "Context switching is growing on the node (twice the daily average during the last 15m)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.20. Host swap is filling up
Swap is filling up (>80%).

```yaml
- alert: HostSwapIsFillingUp
  expr: ((1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80)
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host swap is filling up (instance {{ $labels.instance }})
    description: "Swap is filling up (>80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.21. Host systemd service crashed
systemd service {{ $labels.name }} crashed.

```yaml
- alert: HostSystemdServiceCrashed
  expr: (node_systemd_unit_state{state="failed"} == 1)
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host systemd service crashed (instance {{ $labels.instance }})
    description: "systemd service {{ $labels.name }} crashed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.22. Host physical component too hot
Physical hardware component too hot.

```yaml
- alert: HostPhysicalComponentTooHot
  expr: node_hwmon_temp_celsius > node_hwmon_temp_max_celsius
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host physical component too hot (instance {{ $labels.instance }})
    description: "Physical hardware component too hot\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.23. Host node overtemperature alarm
Physical node temperature alarm triggered.

```yaml
- alert: HostNodeOvertemperatureAlarm
  expr: ((node_hwmon_temp_crit_alarm_celsius == 1) or (node_hwmon_temp_alarm == 1))
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Host node overtemperature alarm (instance {{ $labels.instance }})
    description: "Physical node temperature alarm triggered\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.24. Host software RAID insufficient drives
MD RAID array {{ $labels.device }} on {{ $labels.instance }} has insufficient drives remaining.

```yaml
- alert: HostSoftwareRaidInsufficientDrives
  expr: ((node_md_disks_required - on(device, instance) node_md_disks{state="active"}) > 0)
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Host software RAID insufficient drives (instance {{ $labels.instance }})
    description: "MD RAID array {{ $labels.device }} on {{ $labels.instance }} has insufficient drives remaining.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.25. Host software RAID disk failure
MD RAID array {{ $labels.device }} on {{ $labels.instance }} needs attention.

```yaml
- alert: HostSoftwareRaidDiskFailure
  expr: (node_md_disks{state="failed"} > 0)
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host software RAID disk failure (instance {{ $labels.instance }})
    description: "MD RAID array {{ $labels.device }} on {{ $labels.instance }} needs attention.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.26. Host kernel version deviations
Kernel version for {{ $labels.instance }} has changed.

```yaml
- alert: HostKernelVersionDeviations
  expr: changes(node_uname_info[1h]) > 0
  for: 0m
  labels:
    severity: info
  annotations:
    summary: Host kernel version deviations (instance {{ $labels.instance }})
    description: "Kernel version for {{ $labels.instance }} has changed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.27. Host OOM kill detected
OOM kill detected.

```yaml
# When a machine runs out of memory, the node exporter can become unresponsive for several minutes.
# Even if the system takes 15-20 minutes to recover, the alert should still trigger.
- alert: HostOomKillDetected
  expr: (increase(node_vmstat_oom_kill[30m]) > 0)
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host OOM kill detected (instance {{ $labels.instance }})
    description: "OOM kill detected\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.28. Host EDAC Correctable Errors detected
Host {{ $labels.instance }} has had {{ printf "%.0f" $value }} correctable memory errors reported by EDAC in the last minute.

```yaml
- alert: HostEdacCorrectableErrorsDetected
  expr: (increase(node_edac_correctable_errors_total[1m]) > 0)
  for: 0m
  labels:
    severity: info
  annotations:
    summary: Host EDAC Correctable Errors detected (instance {{ $labels.instance }})
    description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} correctable memory errors reported by EDAC in the last minute.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.29. Host EDAC Uncorrectable Errors detected
Host {{ $labels.instance }} has had {{ printf "%.0f" $value }} uncorrectable memory errors reported by EDAC.

```yaml
- alert: HostEdacUncorrectableErrorsDetected
  expr: (node_edac_uncorrectable_errors_total > 0)
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host EDAC Uncorrectable Errors detected (instance {{ $labels.instance }})
    description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} uncorrectable memory errors reported by EDAC.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.30. Host Network Receive Errors
Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last two minutes.

```yaml
- alert: HostNetworkReceiveErrors
  expr: (rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01)
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host Network Receive Errors (instance {{ $labels.instance }})
    description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last two minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.31. Host Network Transmit Errors
Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last two minutes.

```yaml
- alert: HostNetworkTransmitErrors
  expr: (rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01)
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host Network Transmit Errors (instance {{ $labels.instance }})
    description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last two minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.32. Host Network Bond Degraded
Bond "{{ $labels.device }}" degraded on "{{ $labels.instance }}". [copy] - alert: HostNetworkBondDegraded expr: ((node_bonding_active - node_bonding_slaves) != 0) for: 2m labels: severity: warning annotations: summary: Host Network Bond Degraded (instance {{ $labels.instance }}) description: "Bond \"{{ $labels.device }}\" degraded on \"{{ $labels.instance }}\".\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.2.33. Host conntrack limit
The number of conntrack entries is approaching the limit.

```yaml
- alert: HostConntrackLimit
  expr: (node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8)
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host conntrack limit (instance {{ $labels.instance }})
    description: "The number of conntrack entries is approaching the limit\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.34. Host clock skew
Clock skew detected. Clock is out of sync. Ensure NTP is configured correctly on this host.

```yaml
- alert: HostClockSkew
  expr: ((node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0))
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Host clock skew (instance {{ $labels.instance }})
    description: "Clock skew detected. Clock is out of sync. Ensure NTP is configured correctly on this host.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
-
# 1.2.35. Host clock not synchronising
Clock not synchronising. Ensure NTP is configured on this host.

```yaml
- alert: HostClockNotSynchronising
  expr: min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host clock not synchronising (instance {{ $labels.instance }})
    description: "Clock not synchronising. Ensure NTP is configured on this host.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
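Rules like the two clock alerts above can be exercised offline with `promtool test rules` before they reach production. A minimal sketch for HostClockNotSynchronising — the file names and instance label are illustrative:

```yaml
# clock-tests.yml — run with: promtool test rules clock-tests.yml
rule_files:
  - node-exporter.yml        # the file containing the alert rules above
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # sync flag stuck at 0 and maxerror above the 16s threshold
      - series: 'node_timex_sync_status{instance="host:9100"}'
        values: '0x10'
      - series: 'node_timex_maxerror_seconds{instance="host:9100"}'
        values: '20x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: HostClockNotSynchronising
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: host:9100
```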
-
# 1.3. S.M.A.R.T Device Monitoring : smartctl-exporter (8 rules)
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/s.m.a.r.t-device-monitoring/smartctl-exporter.yml
-
# 1.3.1. SMART device temperature warning
Device temperature warning on {{ $labels.instance }} drive {{ $labels.device }} over 60°C

```yaml
- alert: SmartDeviceTemperatureWarning
  expr: (avg_over_time(smartctl_device_temperature{temperature_type="current"}[5m]) unless on (instance, device) smartctl_device_temperature{temperature_type="drive_trip"}) > 60
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: SMART device temperature warning (instance {{ $labels.instance }})
    description: "Device temperature warning on {{ $labels.instance }} drive {{ $labels.device }} over 60°C\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.3.2. SMART device temperature critical
Device temperature critical on {{ $labels.instance }} drive {{ $labels.device }} over 70°C

```yaml
- alert: SmartDeviceTemperatureCritical
  expr: (max_over_time(smartctl_device_temperature{temperature_type="current"}[5m]) unless on (instance, device) smartctl_device_temperature{temperature_type="drive_trip"}) > 70
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: SMART device temperature critical (instance {{ $labels.instance }})
    description: "Device temperature critical on {{ $labels.instance }} drive {{ $labels.device }} over 70°C\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.3.3. SMART device temperature over trip value
Device temperature over trip value on {{ $labels.instance }} drive {{ $labels.device }}

```yaml
- alert: SmartDeviceTemperatureOverTripValue
  expr: max_over_time(smartctl_device_temperature{temperature_type="current"}[10m]) >= on(device, instance) smartctl_device_temperature{temperature_type="drive_trip"}
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: SMART device temperature over trip value (instance {{ $labels.instance }})
    description: "Device temperature over trip value on {{ $labels.instance }} drive {{ $labels.device }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.3.4. SMART device temperature nearing trip value
Device temperature at 80% of trip value on {{ $labels.instance }} drive {{ $labels.device }}

```yaml
- alert: SmartDeviceTemperatureNearingTripValue
  expr: max_over_time(smartctl_device_temperature{temperature_type="current"}[10m]) >= on(device, instance) (smartctl_device_temperature{temperature_type="drive_trip"} * 0.80)
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: SMART device temperature nearing trip value (instance {{ $labels.instance }})
    description: "Device temperature at 80% of trip value on {{ $labels.instance }} drive {{ $labels.device }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.3.5. SMART status
Device has a SMART status failure on {{ $labels.instance }} drive {{ $labels.device }}

```yaml
- alert: SmartStatus
  expr: smartctl_device_smart_status != 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: SMART status (instance {{ $labels.instance }})
    description: "Device has a SMART status failure on {{ $labels.instance }} drive {{ $labels.device }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.3.6. SMART critical warning
Disk controller has critical warning on {{ $labels.instance }} drive {{ $labels.device }}

```yaml
- alert: SmartCriticalWarning
  expr: smartctl_device_critical_warning > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: SMART critical warning (instance {{ $labels.instance }})
    description: "Disk controller has critical warning on {{ $labels.instance }} drive {{ $labels.device }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.3.7. SMART media errors
Disk controller detected media errors on {{ $labels.instance }} drive {{ $labels.device }}

```yaml
- alert: SmartMediaErrors
  expr: smartctl_device_media_errors > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: SMART media errors (instance {{ $labels.instance }})
    description: "Disk controller detected media errors on {{ $labels.instance }} drive {{ $labels.device }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.3.8. SMART Wearout Indicator
Device is wearing out on {{ $labels.instance }} drive {{ $labels.device }}

```yaml
- alert: SmartWearoutIndicator
  expr: smartctl_device_available_spare < smartctl_device_available_spare_threshold
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: SMART Wearout Indicator (instance {{ $labels.instance }})
    description: "Device is wearing out on {{ $labels.instance }} drive {{ $labels.device }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
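Before lowering the fixed 60/70°C thresholds above, it can help to see how much headroom each drive has to its firmware trip point. A query along these lines, run ad hoc in the Prometheus expression browser:

```promql
# Degrees of headroom before each drive reaches its trip temperature
smartctl_device_temperature{temperature_type="drive_trip"}
  - on(instance, device)
    smartctl_device_temperature{temperature_type="current"}
```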
-
# 1.4. IPMI : prometheus-community/ipmi_exporter (17 rules)
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/ipmi/ipmi-exporter.yml
-
# 1.4.1. IPMI collector down
IPMI collector {{ $labels.collector }} on {{ $labels.instance }} failed to scrape sensor data. Check FreeIPMI tools and BMC connectivity.

```yaml
# The ipmi_up metric is per-collector. A value of 0 means the collector
# could not retrieve data from the BMC.
- alert: IpmiCollectorDown
  expr: ipmi_up == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI collector down (instance {{ $labels.instance }})
    description: "IPMI collector {{ $labels.collector }} on {{ $labels.instance }} failed to scrape sensor data. Check FreeIPMI tools and BMC connectivity.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.2. IPMI temperature sensor warning
IPMI temperature sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.

```yaml
# State values: 0=nominal, 1=warning, 2=critical.
# Thresholds are defined in the BMC firmware.
- alert: IpmiTemperatureSensorWarning
  expr: ipmi_temperature_state == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI temperature sensor warning (instance {{ $labels.instance }})
    description: "IPMI temperature sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.3. IPMI temperature sensor critical
IPMI temperature sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. Immediate attention required to prevent hardware damage.

```yaml
- alert: IpmiTemperatureSensorCritical
  expr: ipmi_temperature_state == 2
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI temperature sensor critical (instance {{ $labels.instance }})
    description: "IPMI temperature sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. Immediate attention required to prevent hardware damage.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.4. IPMI fan speed sensor warning
IPMI fan sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.

```yaml
- alert: IpmiFanSpeedSensorWarning
  expr: ipmi_fan_speed_state == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI fan speed sensor warning (instance {{ $labels.instance }})
    description: "IPMI fan sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.5. IPMI fan speed sensor critical
IPMI fan sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. A fan may have failed.

```yaml
- alert: IpmiFanSpeedSensorCritical
  expr: ipmi_fan_speed_state == 2
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI fan speed sensor critical (instance {{ $labels.instance }})
    description: "IPMI fan sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. A fan may have failed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.6. IPMI fan speed zero
IPMI fan {{ $labels.name }} on {{ $labels.instance }} reports 0 RPM. The fan may have failed.

```yaml
- alert: IpmiFanSpeedZero
  expr: ipmi_fan_speed_rpm == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: IPMI fan speed zero (instance {{ $labels.instance }})
    description: "IPMI fan {{ $labels.name }} on {{ $labels.instance }} reports 0 RPM. The fan may have failed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.7. IPMI voltage sensor warning
IPMI voltage sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.

```yaml
- alert: IpmiVoltageSensorWarning
  expr: ipmi_voltage_state == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI voltage sensor warning (instance {{ $labels.instance }})
    description: "IPMI voltage sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.8. IPMI voltage sensor critical
IPMI voltage sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. Power supply or motherboard issue possible.

```yaml
- alert: IpmiVoltageSensorCritical
  expr: ipmi_voltage_state == 2
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI voltage sensor critical (instance {{ $labels.instance }})
    description: "IPMI voltage sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. Power supply or motherboard issue possible.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.9. IPMI current sensor warning
IPMI current sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.

```yaml
- alert: IpmiCurrentSensorWarning
  expr: ipmi_current_state == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI current sensor warning (instance {{ $labels.instance }})
    description: "IPMI current sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.10. IPMI current sensor critical
IPMI current sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state.

```yaml
- alert: IpmiCurrentSensorCritical
  expr: ipmi_current_state == 2
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI current sensor critical (instance {{ $labels.instance }})
    description: "IPMI current sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.11. IPMI power sensor warning
IPMI power sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.

```yaml
- alert: IpmiPowerSensorWarning
  expr: ipmi_power_state == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI power sensor warning (instance {{ $labels.instance }})
    description: "IPMI power sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.12. IPMI power sensor critical
IPMI power sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state.

```yaml
- alert: IpmiPowerSensorCritical
  expr: ipmi_power_state == 2
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI power sensor critical (instance {{ $labels.instance }})
    description: "IPMI power sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.13. IPMI generic sensor critical
IPMI sensor {{ $labels.name }} (type={{ $labels.type }}) on {{ $labels.instance }} is in critical state.

```yaml
# Catches any sensor type not covered by the specific
# temperature/fan/voltage/current/power alerts.
- alert: IpmiGenericSensorCritical
  expr: ipmi_sensor_state == 2
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI generic sensor critical (instance {{ $labels.instance }})
    description: "IPMI sensor {{ $labels.name }} (type={{ $labels.type }}) on {{ $labels.instance }} is in critical state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.14. IPMI chassis power off
IPMI reports chassis power is off on {{ $labels.instance }}. The server may have shut down unexpectedly.

```yaml
- alert: IpmiChassisPowerOff
  expr: ipmi_chassis_power_state == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI chassis power off (instance {{ $labels.instance }})
    description: "IPMI reports chassis power is off on {{ $labels.instance }}. The server may have shut down unexpectedly.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.15. IPMI chassis drive fault
IPMI reports a drive fault on {{ $labels.instance }}. Check disk health.

```yaml
# The metric uses inverted logic: 1=no fault, 0=fault detected.
- alert: IpmiChassisDriveFault
  expr: ipmi_chassis_drive_fault_state == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI chassis drive fault (instance {{ $labels.instance }})
    description: "IPMI reports a drive fault on {{ $labels.instance }}. Check disk health.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.16. IPMI chassis cooling fault
IPMI reports a cooling/fan fault on {{ $labels.instance }}. Check fans and airflow.

```yaml
# The metric uses inverted logic: 1=no fault, 0=fault detected.
- alert: IpmiChassisCoolingFault
  expr: ipmi_chassis_cooling_fault_state == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI chassis cooling fault (instance {{ $labels.instance }})
    description: "IPMI reports a cooling/fan fault on {{ $labels.instance }}. Check fans and airflow.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.4.17. IPMI SEL almost full
IPMI System Event Log on {{ $labels.instance }} has only {{ printf "%.0f" $value }} bytes free. Clear the SEL to prevent loss of new events.

```yaml
# SEL storage is typically very limited (e.g., 16KB).
# When full, new events may be dropped.
- alert: IpmiSelAlmostFull
  expr: ipmi_sel_free_space_bytes < 512
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI SEL almost full (instance {{ $labels.instance }})
    description: "IPMI System Event Log on {{ $labels.instance }} has only {{ printf \"%.0f\" $value }} bytes free. Clear the SEL to prevent loss of new events.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
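Which sensors the rules above can see depends on the exporter's module configuration. A sketch of an ipmi_exporter config enabling the collectors these rules rely on — credentials are illustrative, and collector availability should be checked against your exporter version:

```yaml
# ipmi_remote.yml (passed to ipmi_exporter via --config.file)
modules:
  default:
    user: "monitor"           # illustrative credentials
    pass: "example-password"
    collectors:
      - ipmi       # temperature/fan/voltage/current/power sensor states
      - chassis    # power, drive-fault and cooling-fault state
      - sel        # SEL free-space metrics
```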
-
# 1.5. Docker containers : google/cAdvisor (9 rules)
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/docker-containers/google-cadvisor.yml
-
# 1.5.1. Container killed
A container has disappeared.

```yaml
# This rule can be very noisy in dynamic infra with legitimate
# container start/stop/deployment.
- alert: ContainerKilled
  expr: time() - container_last_seen > 60
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Container killed (instance {{ $labels.instance }})
    description: "A container has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.5.2. Container absent
A container has been absent for 5 minutes.

```yaml
# This rule can be very noisy in dynamic infra with legitimate
# container start/stop/deployment.
- alert: ContainerAbsent
  expr: absent(container_last_seen)
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Container absent (instance {{ $labels.instance }})
    description: "A container has been absent for 5 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.5.3. Container High CPU utilization
Container CPU utilization is above 80%.

```yaml
- alert: ContainerHighCpuUtilization
  expr: (sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=""} / container_spec_cpu_period{container!=""}) by (pod, container) * 100) > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Container High CPU utilization (instance {{ $labels.instance }})
    description: "Container CPU utilization is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.5.4. Container High Memory usage
Container Memory usage is above 80%.

```yaml
# See https://medium.com/faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
- alert: ContainerHighMemoryUsage
  expr: (sum(container_memory_working_set_bytes{name!=""}) by (instance, name) / sum(container_spec_memory_limit_bytes > 0) by (instance, name) * 100) > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Container High Memory usage (instance {{ $labels.instance }})
    description: "Container Memory usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.5.5. Container Volume usage
Container Volume usage is above 80%.

```yaml
- alert: ContainerVolumeUsage
  expr: (1 - (sum(container_fs_inodes_free{name!=""}) by (instance) / sum(container_fs_inodes_total) by (instance))) * 100 > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Container Volume usage (instance {{ $labels.instance }})
    description: "Container Volume usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.5.6. Container high throttle rate
Container is being throttled.

```yaml
- alert: ContainerHighThrottleRate
  expr: sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace) / sum(increase(container_cpu_cfs_periods_total[5m])) by (container, pod, namespace) > (25 / 100)
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Container high throttle rate (instance {{ $labels.instance }})
    description: "Container is being throttled\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.5.7. Container high low change CPU usage
Monitors the absolute change in container CPU usage between adjacent time windows and fires when the change exceeds 25 percentage points.

```yaml
- alert: ContainerHighLowChangeCpuUsage
  expr: (abs((sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=""}[1m])) * 100) - (sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=""}[1m] offset 1m)) * 100)) or abs((sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=""}[1m])) * 100) - (sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=""}[5m] offset 1m)) * 100))) > 25
  for: 0m
  labels:
    severity: info
  annotations:
    summary: Container high low change CPU usage (instance {{ $labels.instance }})
    description: "Absolute change in container CPU usage between adjacent time windows exceeds 25 percentage points.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.5.8. Container Low CPU utilization
Container CPU utilization is under 20% for 1 week. Consider reducing the allocated CPU.

```yaml
- alert: ContainerLowCpuUtilization
  expr: (sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=""} / container_spec_cpu_period{container!=""}) by (pod, container) * 100) < 20
  for: 7d
  labels:
    severity: info
  annotations:
    summary: Container Low CPU utilization (instance {{ $labels.instance }})
    description: "Container CPU utilization is under 20% for 1 week. Consider reducing the allocated CPU.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.5.9. Container Low Memory usage
Container Memory usage is under 20% for 1 week. Consider reducing the allocated memory.

```yaml
- alert: ContainerLowMemoryUsage
  expr: (sum(container_memory_working_set_bytes{name!=""}) by (instance, name) / sum(container_spec_memory_limit_bytes > 0) by (instance, name) * 100) < 20
  for: 7d
  labels:
    severity: info
  annotations:
    summary: Container Low Memory usage (instance {{ $labels.instance }})
    description: "Container Memory usage is under 20% for 1 week. Consider reducing the allocated memory.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
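Rules 1.5.3 and 1.5.8 evaluate the same CPU-utilization ratio with different thresholds. A recording rule (the record name below is illustrative) computes it once per evaluation and keeps both alert expressions short:

```yaml
groups:
  - name: cadvisor-recording
    rules:
      # Per-container CPU utilization as a percentage of the CFS quota
      - record: container:cpu_utilization:percent
        expr: sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=""} / container_spec_cpu_period{container!=""}) by (pod, container) * 100
```

The two alerts then reduce to `container:cpu_utilization:percent > 80` and `container:cpu_utilization:percent < 20`.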
-
# 1.6. Blackbox : prometheus/blackbox_exporter (9 rules)
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/blackbox/blackbox-exporter.yml
-
# 1.6.1. Blackbox probe failed
Probe failed.

```yaml
- alert: BlackboxProbeFailed
  expr: probe_success == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Blackbox probe failed (instance {{ $labels.instance }})
    description: "Probe failed\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.6.2. Blackbox configuration reload failure
Blackbox configuration reload failure.

```yaml
- alert: BlackboxConfigurationReloadFailure
  expr: blackbox_exporter_config_last_reload_successful != 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Blackbox configuration reload failure (instance {{ $labels.instance }})
    description: "Blackbox configuration reload failure\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.6.3. Blackbox slow probe
Blackbox probe took more than 1s to complete.

```yaml
- alert: BlackboxSlowProbe
  expr: avg_over_time(probe_duration_seconds[1m]) > 1
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: Blackbox slow probe (instance {{ $labels.instance }})
    description: "Blackbox probe took more than 1s to complete\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.6.4. Blackbox probe HTTP failure
HTTP status code is not in the 200-399 range.

```yaml
# Note: PromQL operators are lowercase ("or", not "OR").
- alert: BlackboxProbeHttpFailure
  expr: probe_http_status_code <= 199 or probe_http_status_code >= 400
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Blackbox probe HTTP failure (instance {{ $labels.instance }})
    description: "HTTP status code is not 200-399\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.6.5. Blackbox SSL certificate will expire soon
SSL certificate expires in less than 20 days.

```yaml
- alert: BlackboxSslCertificateWillExpireSoon
  expr: 3 <= round((last_over_time(probe_ssl_earliest_cert_expiry[10m]) - time()) / 86400, 0.1) < 20
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
    description: "SSL certificate expires in less than 20 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.6.6. Blackbox SSL certificate will expire very soon
SSL certificate expires in less than 3 days.

```yaml
- alert: BlackboxSslCertificateWillExpireVerySoon
  expr: 0 <= round((last_over_time(probe_ssl_earliest_cert_expiry[10m]) - time()) / 86400, 0.1) < 3
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Blackbox SSL certificate will expire very soon (instance {{ $labels.instance }})
    description: "SSL certificate expires in less than 3 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.6.7. Blackbox SSL certificate expired
SSL certificate has expired already.

```yaml
# For probe_ssl_earliest_cert_expiry to be exposed after expiration, you
# need to enable insecure_skip_verify. Note that this will disable
# certificate validation.
# See https://github.com/prometheus/blackbox_exporter/blob/master/CONFIGURATION.md#tls_config
- alert: BlackboxSslCertificateExpired
  expr: round((last_over_time(probe_ssl_earliest_cert_expiry[10m]) - time()) / 86400, 0.1) < 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Blackbox SSL certificate expired (instance {{ $labels.instance }})
    description: "SSL certificate has expired already\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
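As the comment on 1.6.7 notes, `probe_ssl_earliest_cert_expiry` is only exposed after expiry when certificate verification is skipped. A blackbox_exporter module doing that might look like this (module name illustrative; use it only for probes where losing certificate validation is acceptable):

```yaml
modules:
  http_2xx_insecure:
    prober: http
    http:
      tls_config:
        insecure_skip_verify: true   # disables certificate validation
```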
# 1.6.8. Blackbox probe slow HTTP
HTTP request took more than 1s.

```yaml
- alert: BlackboxProbeSlowHttp
  expr: avg_over_time(probe_http_duration_seconds[1m]) > 1
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: Blackbox probe slow HTTP (instance {{ $labels.instance }})
    description: "HTTP request took more than 1s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.6.9. Blackbox probe slow ping
Blackbox ping took more than 1s.

```yaml
- alert: BlackboxProbeSlowPing
  expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 1
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: Blackbox probe slow ping (instance {{ $labels.instance }})
    description: "Blackbox ping took more than 1s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
-
# 1.7. Windows Server : prometheus-community/windows_exporter (5 rules)
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/windows-server/windows-exporter.yml
-
# 1.7.1. Windows Server collector Error
Collector {{ $labels.collector }} was not successful.

```yaml
- alert: WindowsServerCollectorError
  expr: windows_exporter_collector_success == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Windows Server collector Error (instance {{ $labels.instance }})
    description: "Collector {{ $labels.collector }} was not successful\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.7.2. Windows Server service Status
Windows Service state is not OK.

```yaml
- alert: WindowsServerServiceStatus
  expr: windows_service_status{status="ok"} != 1
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Windows Server service Status (instance {{ $labels.instance }})
    description: "Windows Service state is not OK\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.7.3. Windows Server CPU Usage
CPU Usage is more than 80%.

```yaml
- alert: WindowsServerCpuUsage
  expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m])) * 100) > 80
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Windows Server CPU Usage (instance {{ $labels.instance }})
    description: "CPU Usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.7.4. Windows Server memory Usage
Memory usage is more than 90%.

```yaml
- alert: WindowsServerMemoryUsage
  expr: 100 - ((windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100) > 90
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Windows Server memory Usage (instance {{ $labels.instance }})
    description: "Memory usage is more than 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.7.5. Windows Server disk Space Usage
Disk usage is more than 80%.

```yaml
- alert: WindowsServerDiskSpaceUsage
  expr: 100.0 - 100 * ((windows_logical_disk_free_bytes / 1024 / 1024) / (windows_logical_disk_size_bytes / 1024 / 1024)) > 80
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Windows Server disk Space Usage (instance {{ $labels.instance }})
    description: "Disk usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
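Note that newer windows_exporter releases replace the `windows_service_status` metric used in 1.7.2 with `windows_service_state`; check which metric your exporter version actually exports. On versions exposing the newer metric, the rule might be adapted roughly as follows (a sketch, not verified against every release):

```yaml
- alert: WindowsServerServiceStatus
  expr: windows_service_state{state="running"} != 1
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Windows Server service Status (instance {{ $labels.instance }})
    description: "Windows Service {{ $labels.name }} is not running\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```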
-
# 1.8. VMware : pryorda/vmware_exporter (4 rules)
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/vmware/pryorda-vmware-exporter.yml
-
# 1.8.1. Virtual Machine Memory Warning
High memory usage on {{ $labels.instance }}: {{ $value | printf "%.2f" }}%

```yaml
- alert: VirtualMachineMemoryWarning
  expr: vmware_vm_mem_usage_average / 100 >= 80 and vmware_vm_mem_usage_average / 100 < 90
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Virtual Machine Memory Warning (instance {{ $labels.instance }})
    description: "High memory usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\" }}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.8.2. Virtual Machine Memory Critical
High memory usage on {{ $labels.instance }}: {{ $value | printf "%.2f" }}%

```yaml
- alert: VirtualMachineMemoryCritical
  expr: vmware_vm_mem_usage_average / 100 >= 90
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Virtual Machine Memory Critical (instance {{ $labels.instance }})
    description: "High memory usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\" }}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.8.3. High Number of Snapshots
High snapshot count on {{ $labels.instance }}: {{ $value }}

```yaml
- alert: HighNumberOfSnapshots
  expr: vmware_vm_snapshots > 3
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: High Number of Snapshots (instance {{ $labels.instance }})
    description: "High snapshot count on {{ $labels.instance }}: {{ $value }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.8.4. Outdated Snapshots
Outdated snapshots on {{ $labels.instance }}: {{ $value | printf "%.0f" }} days

```yaml
- alert: OutdatedSnapshots
  expr: (time() - vmware_vm_snapshot_timestamp_seconds) / (60 * 60 * 24) >= 3
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Outdated Snapshots (instance {{ $labels.instance }})
    description: "Outdated snapshots on {{ $labels.instance }}: {{ $value | printf \"%.0f\" }} days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
-
# 1.9. Proxmox VE : prometheus-pve/prometheus-pve-exporter (9 rules)
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/proxmox-ve/prometheus-pve-exporter.yml
-
# 1.9.1. PVE node down
Proxmox VE node {{ $labels.id }} is down.

```yaml
- alert: PveNodeDown
  expr: pve_up{id=~"node/.*"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: PVE node down (instance {{ $labels.instance }})
    description: "Proxmox VE node {{ $labels.id }} is down.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.9.2. PVE VM/CT down
Proxmox VE guest {{ $labels.id }} is not running.

```yaml
# This alert triggers for all VMs and containers that are not running.
# You may want to filter by specific guests using the `id` label, or exclude
# intentionally stopped guests with additional label matchers.
- alert: PveVm/ctDown
  expr: pve_up{id=~"(qemu|lxc)/.*"} == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: PVE VM/CT down (instance {{ $labels.instance }})
    description: "Proxmox VE guest {{ $labels.id }} is not running.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.9.3. PVE high CPU usage
Proxmox VE CPU usage is above 90% on {{ $labels.id }}. Current value: {{ $value | printf "%.2f" }}%

```yaml
- alert: PveHighCpuUsage
  expr: pve_cpu_usage_ratio * 100 > 90
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: PVE high CPU usage (instance {{ $labels.instance }})
    description: "Proxmox VE CPU usage is above 90% on {{ $labels.id }}. Current value: {{ $value | printf \"%.2f\" }}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.9.4. PVE high memory usage
Proxmox VE memory usage is above 90% on {{ $labels.id }}. Current value: {{ $value | printf "%.2f" }}%

```yaml
- alert: PveHighMemoryUsage
  expr: pve_memory_usage_bytes / pve_memory_size_bytes * 100 > 90
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: PVE high memory usage (instance {{ $labels.instance }})
    description: "Proxmox VE memory usage is above 90% on {{ $labels.id }}. Current value: {{ $value | printf \"%.2f\" }}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.9.5. PVE storage filling up
Proxmox VE storage {{ $labels.id }} is above 80% used. Current value: {{ $value | printf "%.2f" }}%

```yaml
- alert: PveStorageFillingUp
  expr: pve_disk_usage_bytes{id=~"storage/.*"} / pve_disk_size_bytes{id=~"storage/.*"} * 100 > 80 and pve_disk_size_bytes{id=~"storage/.*"} > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: PVE storage filling up (instance {{ $labels.instance }})
    description: "Proxmox VE storage {{ $labels.id }} is above 80% used. Current value: {{ $value | printf \"%.2f\" }}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.9.6. PVE storage almost full
Proxmox VE storage {{ $labels.id }} is above 95% used. Current value: {{ $value | printf "%.2f" }}%

```yaml
- alert: PveStorageAlmostFull
  expr: pve_disk_usage_bytes{id=~"storage/.*"} / pve_disk_size_bytes{id=~"storage/.*"} * 100 > 95 and pve_disk_size_bytes{id=~"storage/.*"} > 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: PVE storage almost full (instance {{ $labels.instance }})
    description: "Proxmox VE storage {{ $labels.id }} is above 95% used. Current value: {{ $value | printf \"%.2f\" }}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.9.7. PVE guest not backed up
{{ $value }} Proxmox VE guest(s) are not covered by any backup job.

```yaml
- alert: PveGuestNotBackedUp
  expr: pve_not_backed_up_total > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: PVE guest not backed up (instance {{ $labels.instance }})
    description: "{{ $value }} Proxmox VE guest(s) are not covered by any backup job.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.9.8. PVE replication failed
Proxmox VE replication for {{ $labels.id }} has {{ $value }} failed sync(s).

```yaml
- alert: PveReplicationFailed
  expr: pve_replication_failed_syncs > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: PVE replication failed (instance {{ $labels.instance }})
    description: "Proxmox VE replication for {{ $labels.id }} has {{ $value }} failed sync(s).\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
-
# 1.9.9. PVE cluster not quorate
Proxmox VE cluster has lost quorum. [copy] # Loss of quorum means the cluster cannot make decisions about VM placement # and fencing. This requires immediate attention. - alert: PveClusterNotQuorate expr: pve_cluster_info{quorate="0"} == 1 for: 0m labels: severity: critical annotations: summary: PVE cluster not quorate (instance {{ $labels.instance }}) description: "Proxmox VE cluster has lost quorum.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 1.10. Netdata : Embedded exporter (9 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/netdata/embedded-exporter.yml
-
# 1.10.1. Netdata high cpu usage
Netdata high CPU usage (> 80%) [copy] # The cpu_percentage_average dimensions are gauges, so usage is 100 - idle; taking rate() of a gauge (or alerting on idle > 80) would be wrong. - alert: NetdataHighCpuUsage expr: (100 - netdata_cpu_cpu_percentage_average{dimension="idle"}) > 80 for: 5m labels: severity: warning annotations: summary: Netdata high cpu usage (instance {{ $labels.instance }}) description: "Netdata high CPU usage (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.10.2. Host CPU steal noisy neighbor
CPU steal is > 10%. A noisy neighbor is killing VM performance, or a spot instance may be out of CPU credits. [copy] # The steal dimension is already a percentage gauge, so it is compared directly (no rate() needed). - alert: HostCpuStealNoisyNeighbor expr: netdata_cpu_cpu_percentage_average{dimension="steal"} > 10 for: 5m labels: severity: warning annotations: summary: Host CPU steal noisy neighbor (instance {{ $labels.instance }}) description: "CPU steal is > 10%. A noisy neighbor is killing VM performance, or a spot instance may be out of CPU credits.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.10.3. Netdata high memory usage
Netdata high memory usage (> 80%) [copy] # Available RAM = free + cached, taken as a share of the total summed across all dimensions. - alert: NetdataHighMemoryUsage expr: 100 * sum without (dimension) (netdata_system_ram_MiB_average{dimension=~"free|cached"}) / sum without (dimension) (netdata_system_ram_MiB_average) < 20 for: 5m labels: severity: warning annotations: summary: Netdata high memory usage (instance {{ $labels.instance }}) description: "Netdata high memory usage (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.10.4. Netdata low disk space
Netdata low disk space (> 80%) [copy] # Available space = avail + cached dimensions, taken as a share of the total summed across all dimensions. - alert: NetdataLowDiskSpace expr: 100 * sum without (dimension) (netdata_disk_space_GB_average{dimension=~"avail|cached"}) / sum without (dimension) (netdata_disk_space_GB_average) < 20 for: 5m labels: severity: warning annotations: summary: Netdata low disk space (instance {{ $labels.instance }}) description: "Netdata low disk space (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
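The two Netdata capacity rules above both express "available as a percentage of total" and alert below 20%. A minimal Python sketch of that arithmetic, using dimension names from Netdata's system.ram chart and hypothetical sample values:

```python
# Sketch of the "available percentage" math behind the Netdata RAM/disk rules.
# Values are hypothetical MiB samples of netdata_system_ram_MiB_average per dimension.
ram = {"free": 512.0, "used": 6144.0, "cached": 1024.0, "buffers": 512.0}

def available_pct(samples: dict, avail_dims: tuple) -> float:
    """Percentage of the summed total that counts as available (e.g. free + cached)."""
    total = sum(samples.values())
    avail = sum(v for d, v in samples.items() if d in avail_dims)
    return 100.0 * avail / total

pct = available_pct(ram, ("free", "cached"))
# (512 + 1024) / 8192 * 100 = 18.75 -> below the 20% threshold, so the alert fires
assert pct < 20
```

The same shape applies to the disk rule, with `avail|cached` dimensions of `netdata_disk_space_GB_average`.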
-
# 1.10.5. Netdata predicted disk full
Netdata predicted disk full in 24 hours [copy] - alert: NetdataPredictedDiskFull expr: predict_linear(netdata_disk_space_GB_average{dimension=~"avail|cached"}[3h], 24 * 3600) < 0 for: 0m labels: severity: warning annotations: summary: Netdata predicted disk full (instance {{ $labels.instance }}) description: "Netdata predicted disk full in 24 hours\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
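`predict_linear()` fits a least-squares line over the range and extrapolates it forward. A rough Python equivalent of the rule's logic, under the simplifying assumption that extrapolation starts from the last sample rather than the evaluation timestamp (sample data is synthetic):

```python
# Least-squares extrapolation, roughly what predict_linear(v[3h], 24 * 3600) computes.
def predict_linear(samples: list, horizon_s: float) -> float:
    """samples: (unix_ts, value) pairs; extrapolate horizon_s seconds past the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + horizon_s)

# Hypothetical disk shrinking by 1 GB per hour with 10 GB left: empty well before 24h.
samples = [(t * 3600.0, 20.0 - t) for t in range(11)]
assert predict_linear(samples, 24 * 3600) < 0  # negative projection -> alert fires
```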
-
# 1.10.6. Netdata MD mismatch cnt unsynchronized blocks
RAID array has unsynchronized blocks [copy] - alert: NetdataMdMismatchCntUnsynchronizedBlocks expr: netdata_md_mismatch_cnt_unsynchronized_blocks_average > 1024 for: 2m labels: severity: warning annotations: summary: Netdata MD mismatch cnt unsynchronized blocks (instance {{ $labels.instance }}) description: "RAID array has unsynchronized blocks\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.10.7. Netdata disk reallocated sectors
Reallocated sectors on disk [copy] - alert: NetdataDiskReallocatedSectors expr: increase(netdata_smartd_log_reallocated_sectors_count_sectors_average[1m]) > 0 for: 0m labels: severity: info annotations: summary: Netdata disk reallocated sectors (instance {{ $labels.instance }}) description: "Reallocated sectors on disk\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.10.8. Netdata disk current pending sector
Disk current pending sector [copy] - alert: NetdataDiskCurrentPendingSector expr: netdata_smartd_log_current_pending_sector_count_sectors_average > 0 for: 0m labels: severity: warning annotations: summary: Netdata disk current pending sector (instance {{ $labels.instance }}) description: "Disk current pending sector\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.10.9. Netdata reported uncorrectable disk sectors
Reported uncorrectable disk sectors [copy] - alert: NetdataReportedUncorrectableDiskSectors expr: increase(netdata_smartd_log_offline_uncorrectable_sector_count_sectors_average[2m]) > 0 for: 0m labels: severity: warning annotations: summary: Netdata reported uncorrectable disk sectors (instance {{ $labels.instance }}) description: "Reported uncorrectable disk sectors\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 1.11. eBPF : cloudflare/ebpf_exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/ebpf/ebpf-exporter.yml
-
# 1.11.1. eBPF exporter program not attached
eBPF program {{ $labels.name }} failed to attach. The program is not collecting data. (instance {{ $labels.instance }}) [copy] # The exporter uses loose attachment: if a program fails to load (missing BTF, kernel incompatibility), it sets this metric to 0 and continues running. - alert: EbpfExporterProgramNotAttached expr: ebpf_exporter_ebpf_program_attached == 0 for: 5m labels: severity: warning annotations: summary: eBPF exporter program not attached (instance {{ $labels.instance }}) description: "eBPF program {{ $labels.name }} failed to attach. The program is not collecting data. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.11.2. eBPF exporter decoder errors
eBPF exporter is experiencing decoder errors for program {{ $labels.name }}. Kernel data is not being correctly transformed into labels. (instance {{ $labels.instance }}) [copy] - alert: EbpfExporterDecoderErrors expr: rate(ebpf_exporter_decoder_errors_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: eBPF exporter decoder errors (instance {{ $labels.instance }}) description: "eBPF exporter is experiencing decoder errors for program {{ $labels.name }}. Kernel data is not being correctly transformed into labels. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.11.3. eBPF exporter no enabled configs
eBPF exporter has no enabled configurations. No eBPF programs are being run. (instance {{ $labels.instance }}) [copy] - alert: EbpfExporterNoEnabledConfigs expr: ebpf_exporter_enabled_configs == 0 for: 5m labels: severity: warning annotations: summary: eBPF exporter no enabled configs (instance {{ $labels.instance }}) description: "eBPF exporter has no enabled configurations. No eBPF programs are being run. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 1.12. Process Exporter : ncabatoff/process-exporter (10 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/process-exporter/process-exporter.yml
-
# 1.12.1. Process exporter group down
No processes found for group {{ $labels.groupname }}. The service may have stopped. (instance {{ $labels.instance }}) [copy] - alert: ProcessExporterGroupDown expr: namedprocess_namegroup_num_procs == 0 for: 2m labels: severity: critical annotations: summary: Process exporter group down (instance {{ $labels.instance }}) description: "No processes found for group {{ $labels.groupname }}. The service may have stopped. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.12.2. Process exporter high memory usage
Process group {{ $labels.groupname }} is using {{ $value | humanize }}B of resident memory. (instance {{ $labels.instance }}) [copy] # Threshold of 4GB is arbitrary and depends on the process being monitored. Adjust per group. - alert: ProcessExporterHighMemoryUsage expr: namedprocess_namegroup_memory_bytes{memtype="resident"} > 4e+09 for: 5m labels: severity: warning annotations: summary: Process exporter high memory usage (instance {{ $labels.instance }}) description: "Process group {{ $labels.groupname }} is using {{ $value | humanize }}B of resident memory. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.12.3. Process exporter high CPU usage
Process group {{ $labels.groupname }} is using {{ $value }}% CPU (core-equivalent). (instance {{ $labels.instance }}) [copy] # Value is core-equivalent %: 100% = 1 full core, 200% = 2 cores, etc. Threshold of 80% is per-core. Adjust based on expected workload. - alert: ProcessExporterHighCpuUsage expr: rate(namedprocess_namegroup_cpu_seconds_total[5m]) * 100 > 80 for: 5m labels: severity: warning annotations: summary: Process exporter high CPU usage (instance {{ $labels.instance }}) description: "Process group {{ $labels.groupname }} is using {{ $value }}% CPU (core-equivalent). (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
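The core-equivalent convention noted in the rule above is worth a worked example: `namedprocess_namegroup_cpu_seconds_total` is a cumulative counter of CPU seconds, so its per-second rate times 100 is the percentage of one core in use. Hypothetical counter samples:

```python
# rate(namedprocess_namegroup_cpu_seconds_total[5m]) * 100, sketched by hand.
# Two counter samples 300s apart; values are cumulative CPU seconds (hypothetical).
t1, v1 = 1000.0, 5000.0
t2, v2 = 1300.0, 5450.0   # consumed 450 CPU-seconds over 300 wall-clock seconds

cpu_pct = (v2 - v1) / (t2 - t1) * 100  # 150% = 1.5 cores busy on average
assert cpu_pct == 150.0
assert cpu_pct > 80  # above the rule's per-core-equivalent threshold
```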
-
# 1.12.4. Process exporter high file descriptor usage
Process group {{ $labels.groupname }} is using more than 80% of its file descriptor limit. (instance {{ $labels.instance }}) [copy] - alert: ProcessExporterHighFileDescriptorUsage expr: namedprocess_namegroup_worst_fd_ratio > 0.8 for: 5m labels: severity: warning annotations: summary: Process exporter high file descriptor usage (instance {{ $labels.instance }}) description: "Process group {{ $labels.groupname }} is using more than 80% of its file descriptor limit. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.12.5. Process exporter file descriptors exhausted
Process group {{ $labels.groupname }} has nearly exhausted its file descriptor limit. (instance {{ $labels.instance }}) [copy] - alert: ProcessExporterFileDescriptorsExhausted expr: namedprocess_namegroup_worst_fd_ratio > 0.95 for: 2m labels: severity: critical annotations: summary: Process exporter file descriptors exhausted (instance {{ $labels.instance }}) description: "Process group {{ $labels.groupname }} has nearly exhausted its file descriptor limit. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.12.6. Process exporter high swap usage
Process group {{ $labels.groupname }} is using {{ $value | humanize }}B of swap. (instance {{ $labels.instance }}) [copy] # Threshold of 512MB is arbitrary. Adjust per group and environment. - alert: ProcessExporterHighSwapUsage expr: namedprocess_namegroup_memory_bytes{memtype="swapped"} > 512e+06 for: 5m labels: severity: warning annotations: summary: Process exporter high swap usage (instance {{ $labels.instance }}) description: "Process group {{ $labels.groupname }} is using {{ $value | humanize }}B of swap. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.12.7. Process exporter zombie processes
Process group {{ $labels.groupname }} has {{ $value }} zombie processes. (instance {{ $labels.instance }}) [copy] - alert: ProcessExporterZombieProcesses expr: namedprocess_namegroup_states{state="Zombie"} > 0 for: 5m labels: severity: warning annotations: summary: Process exporter zombie processes (instance {{ $labels.instance }}) description: "Process group {{ $labels.groupname }} has {{ $value }} zombie processes. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.12.8. Process exporter high context switching
Process group {{ $labels.groupname }} has a high rate of context switches ({{ $value }}/s). (instance {{ $labels.instance }}) [copy] # Threshold of 10000 switches/s is a rough default. Adjust based on the workload profile. - alert: ProcessExporterHighContextSwitching expr: rate(namedprocess_namegroup_context_switches_total[5m]) > 10000 for: 5m labels: severity: warning annotations: summary: Process exporter high context switching (instance {{ $labels.instance }}) description: "Process group {{ $labels.groupname }} has a high rate of context switches ({{ $value }}/s). (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.12.9. Process exporter high disk write IO
Process group {{ $labels.groupname }} is performing {{ $value | humanize }}B/s of disk writes. (instance {{ $labels.instance }}) [copy] # Threshold of 100MB/s is arbitrary. Adjust per group. - alert: ProcessExporterHighDiskWriteIo expr: rate(namedprocess_namegroup_write_bytes_total[5m]) > 100e+06 for: 5m labels: severity: warning annotations: summary: Process exporter high disk write IO (instance {{ $labels.instance }}) description: "Process group {{ $labels.groupname }} is performing {{ $value | humanize }}B/s of disk writes. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.12.10. Process exporter process restarting
Process group {{ $labels.groupname }} has restarted (oldest process start time changed). (instance {{ $labels.instance }}) [copy] # Detects restarts by watching for changes in the oldest process start time within the group. - alert: ProcessExporterProcessRestarting expr: changes(namedprocess_namegroup_oldest_start_time_seconds[5m]) > 0 and namedprocess_namegroup_num_procs > 0 for: 0m labels: severity: info annotations: summary: Process exporter process restarting (instance {{ $labels.instance }}) description: "Process group {{ $labels.groupname }} has restarted (oldest process start time changed). (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
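`changes()` counts how often consecutive samples in the window differ; combined with `num_procs > 0` it distinguishes a restart from a plain stop. A Python emulation of that condition (the sampled start times are hypothetical):

```python
# changes(namedprocess_namegroup_oldest_start_time_seconds[5m]) emulated over samples.
def changes(values: list) -> int:
    """Number of times consecutive samples differ, as PromQL changes() counts them."""
    return sum(1 for a, b in zip(values, values[1:]) if a != b)

# The oldest start time jumps when the oldest process is replaced (group restarted).
start_times = [1700000000.0, 1700000000.0, 1700000251.0, 1700000251.0]
num_procs = 1

assert changes(start_times) == 1
assert changes(start_times) > 0 and num_procs > 0  # restart alert condition met
```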
-
-
# 1.13. Systemd : prometheus-community/systemd_exporter (7 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/systemd/systemd-exporter.yml
-
# 1.13.1. Systemd unit failed
Systemd unit {{ $labels.name }} has entered failed state. (instance {{ $labels.instance }}) [copy] - alert: SystemdUnitFailed expr: systemd_unit_state{state="failed"} == 1 for: 5m labels: severity: warning annotations: summary: Systemd unit failed (instance {{ $labels.instance }}) description: "Systemd unit {{ $labels.name }} has entered failed state. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.13.2. Systemd unit inactive
Systemd unit {{ $labels.name }} is inactive. (instance {{ $labels.instance }}) [copy] # Many units are legitimately inactive. You must adjust the name=~ filter to match your critical services. - alert: SystemdUnitInactive expr: systemd_unit_state{state="inactive", type="service", name=~"your-critical-service.+"} == 1 for: 5m labels: severity: warning annotations: summary: Systemd unit inactive (instance {{ $labels.instance }}) description: "Systemd unit {{ $labels.name }} is inactive. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.13.3. Systemd service crash looping
Systemd service {{ $labels.name }} has restarted {{ $value }} times in the last hour. (instance {{ $labels.instance }}) [copy] - alert: SystemdServiceCrashLooping expr: increase(systemd_service_restart_total[1h]) > 5 for: 5m labels: severity: critical annotations: summary: Systemd service crash looping (instance {{ $labels.instance }}) description: "Systemd service {{ $labels.name }} has restarted {{ $value }} times in the last hour. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.13.4. Systemd unit tasks near limit
Systemd unit {{ $labels.name }} is using {{ $value | humanizePercentage }} of its task limit. (instance {{ $labels.instance }}) [copy] - alert: SystemdUnitTasksNearLimit expr: systemd_unit_tasks_current / systemd_unit_tasks_max > 0.9 and systemd_unit_tasks_max > 0 for: 5m labels: severity: warning annotations: summary: Systemd unit tasks near limit (instance {{ $labels.instance }}) description: "Systemd unit {{ $labels.name }} is using {{ $value | humanizePercentage }} of its task limit. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.13.5. Systemd socket refused connections
Systemd socket {{ $labels.name }} is refusing connections. (instance {{ $labels.instance }}) [copy] - alert: SystemdSocketRefusedConnections expr: increase(systemd_socket_refused_connections_total[5m]) > 0 for: 0m labels: severity: warning annotations: summary: Systemd socket refused connections (instance {{ $labels.instance }}) description: "Systemd socket {{ $labels.name }} is refusing connections. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.13.6. Systemd socket high connections
Systemd socket {{ $labels.name }} has {{ $value }} active connections. (instance {{ $labels.instance }}) [copy] # Threshold of 100 connections is arbitrary. Adjust to your workload. - alert: SystemdSocketHighConnections expr: systemd_socket_current_connections > 100 for: 0m labels: severity: warning annotations: summary: Systemd socket high connections (instance {{ $labels.instance }}) description: "Systemd socket {{ $labels.name }} has {{ $value }} active connections. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 1.13.7. Systemd timer missed trigger
Systemd timer {{ $labels.name }} has not triggered for over 24 hours. (instance {{ $labels.instance }}) [copy] # Triggers if timer hasn't fired in 24 hours. Adjust threshold per timer schedule. - alert: SystemdTimerMissedTrigger expr: (time() - systemd_timer_last_trigger_seconds) / 3600 > 24 and systemd_timer_last_trigger_seconds > 0 for: 5m labels: severity: warning annotations: summary: Systemd timer missed trigger (instance {{ $labels.instance }}) description: "Systemd timer {{ $labels.name }} has not triggered for over 24 hours. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.1. MySQL : prometheus/mysqld_exporter (14 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/mysql/mysqld-exporter.yml
-
# 2.1.1. MySQL down
MySQL instance is down on {{ $labels.instance }} [copy] # 1m delay allows a restart without triggering an alert. - alert: MysqlDown expr: mysql_up == 0 for: 1m labels: severity: critical annotations: summary: MySQL down (instance {{ $labels.instance }}) description: "MySQL instance is down on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.2. MySQL too many connections (> 80%)
More than 80% of MySQL connections are in use on {{ $labels.instance }} [copy] # Alert names must be valid metric names on pre-3.x Prometheus, so the "(> 80%)" qualifier lives in the summary, not the alert name. - alert: MysqlTooManyConnections expr: max_over_time(mysql_global_status_threads_connected[1m]) / mysql_global_variables_max_connections * 100 > 80 for: 2m labels: severity: warning annotations: summary: MySQL too many connections (> 80%) (instance {{ $labels.instance }}) description: "More than 80% of MySQL connections are in use on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
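`max_over_time(...[1m])` keeps the worst sample in the window, so a short connection spike still counts against the limit. The utilization math, sketched with hypothetical scrape samples and MySQL's default `max_connections` of 151:

```python
# max_over_time(threads_connected[1m]) / max_connections * 100 > 80, sketched.
threads_connected_samples = [120.0, 131.0, 128.0]  # scrapes within the 1m window
max_connections = 151.0                            # MySQL default

utilization_pct = max(threads_connected_samples) / max_connections * 100
assert utilization_pct > 80  # 131 / 151 is roughly 86.8%, so the alert fires
```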
-
# 2.1.3. MySQL high prepared statements utilization (> 80%)
High utilization of prepared statements (> 80%) on {{ $labels.instance }} [copy] - alert: MysqlHighPreparedStatementsUtilization expr: max_over_time(mysql_global_status_prepared_stmt_count[1m]) / mysql_global_variables_max_prepared_stmt_count * 100 > 80 for: 2m labels: severity: warning annotations: summary: MySQL high prepared statements utilization (> 80%) (instance {{ $labels.instance }}) description: "High utilization of prepared statements (> 80%) on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.4. MySQL high threads running
More than 60% of MySQL connections are in running state on {{ $labels.instance }} [copy] - alert: MysqlHighThreadsRunning expr: max_over_time(mysql_global_status_threads_running[1m]) / mysql_global_variables_max_connections * 100 > 60 for: 2m labels: severity: warning annotations: summary: MySQL high threads running (instance {{ $labels.instance }}) description: "More than 60% of MySQL connections are in running state on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.5. MySQL Slave IO thread not running
MySQL Slave IO thread not running on {{ $labels.instance }} [copy] # 1m delay allows a restart without triggering an alert. - alert: MysqlSlaveIoThreadNotRunning expr: ( mysql_slave_status_slave_io_running and ON (instance) mysql_slave_status_master_server_id > 0 ) == 0 for: 1m labels: severity: critical annotations: summary: MySQL Slave IO thread not running (instance {{ $labels.instance }}) description: "MySQL Slave IO thread not running on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.6. MySQL Slave SQL thread not running
MySQL Slave SQL thread not running on {{ $labels.instance }} [copy] # 1m delay allows a restart without triggering an alert. - alert: MysqlSlaveSqlThreadNotRunning expr: ( mysql_slave_status_slave_sql_running and ON (instance) mysql_slave_status_master_server_id > 0) == 0 for: 1m labels: severity: critical annotations: summary: MySQL Slave SQL thread not running (instance {{ $labels.instance }}) description: "MySQL Slave SQL thread not running on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.7. MySQL Slave replication lag
MySQL replication lag on {{ $labels.instance }} [copy] - alert: MysqlSlaveReplicationLag expr: ( (mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay) and ON (instance) mysql_slave_status_master_server_id > 0 ) > 30 for: 1m labels: severity: critical annotations: summary: MySQL Slave replication lag (instance {{ $labels.instance }}) description: "MySQL replication lag on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.8. MySQL slow queries
MySQL server has new slow queries. [copy] - alert: MysqlSlowQueries expr: increase(mysql_global_status_slow_queries[1m]) > 0 for: 2m labels: severity: warning annotations: summary: MySQL slow queries (instance {{ $labels.instance }}) description: "MySQL server has new slow queries.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.9. MySQL InnoDB log waits
MySQL innodb log writes stalling [copy] - alert: MysqlInnodbLogWaits expr: rate(mysql_global_status_innodb_log_waits[15m]) > 10 for: 0m labels: severity: warning annotations: summary: MySQL InnoDB log waits (instance {{ $labels.instance }}) description: "MySQL innodb log writes stalling\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.10. MySQL restarted
MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}. [copy] - alert: MysqlRestarted expr: mysql_global_status_uptime < 60 for: 0m labels: severity: info annotations: summary: MySQL restarted (instance {{ $labels.instance }}) description: "MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.11. MySQL High QPS
MySQL is overloaded with unusually high QPS (> 10k). [copy] - alert: MysqlHighQps expr: irate(mysql_global_status_questions[1m]) > 10000 for: 2m labels: severity: info annotations: summary: MySQL High QPS (instance {{ $labels.instance }}) description: "MySQL is overloaded with unusually high QPS (> 10k).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.12. MySQL too many open files
MySQL has too many open files; consider increasing the open_files_limit variable on {{ $labels.instance }}. [copy] - alert: MysqlTooManyOpenFiles expr: mysql_global_status_innodb_num_open_files / mysql_global_variables_open_files_limit * 100 > 75 for: 2m labels: severity: warning annotations: summary: MySQL too many open files (instance {{ $labels.instance }}) description: "MySQL has too many open files; consider increasing the open_files_limit variable on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.13. MySQL InnoDB Force Recovery is enabled
MySQL InnoDB force recovery is enabled on {{ $labels.instance }} [copy] - alert: MysqlInnodbForceRecoveryIsEnabled expr: mysql_global_variables_innodb_force_recovery != 0 for: 2m labels: severity: warning annotations: summary: MySQL InnoDB Force Recovery is enabled (instance {{ $labels.instance }}) description: "MySQL InnoDB force recovery is enabled on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.1.14. MySQL InnoDB history_len too long
MySQL history_len (undo log) too long on {{ $labels.instance }} [copy] - alert: MysqlInnodbHistory_lenTooLong expr: mysql_info_schema_innodb_metrics_transaction_trx_rseg_history_len > 50000 for: 2m labels: severity: warning annotations: summary: MySQL InnoDB history_len too long (instance {{ $labels.instance }}) description: "MySQL history_len (undo log) too long on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.2. PostgreSQL : prometheus-community/postgres_exporter (20 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/postgresql/postgres-exporter.yml
-
# 2.2.1. Postgresql down
Postgresql instance is down [copy] # 1m delay allows a restart without triggering an alert. - alert: PostgresqlDown expr: pg_up == 0 for: 1m labels: severity: critical annotations: summary: Postgresql down (instance {{ $labels.instance }}) description: "Postgresql instance is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.2. Postgresql restarted
Postgresql restarted [copy] - alert: PostgresqlRestarted expr: time() - pg_postmaster_start_time_seconds < 60 for: 0m labels: severity: critical annotations: summary: Postgresql restarted (instance {{ $labels.instance }}) description: "Postgresql restarted\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.3. Postgresql exporter error
Postgresql exporter is showing errors. A query may be buggy in query.yaml [copy] - alert: PostgresqlExporterError expr: pg_exporter_last_scrape_error > 0 for: 0m labels: severity: critical annotations: summary: Postgresql exporter error (instance {{ $labels.instance }}) description: "Postgresql exporter is showing errors. A query may be buggy in query.yaml\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.4. Postgresql table not auto vacuumed
Table {{ $labels.relname }} has not been auto vacuumed for 10 days [copy] - alert: PostgresqlTableNotAutoVacuumed expr: ((pg_stat_user_tables_n_tup_del + pg_stat_user_tables_n_tup_upd + pg_stat_user_tables_n_tup_hot_upd) > pg_settings_autovacuum_vacuum_threshold) and (time() - pg_stat_user_tables_last_autovacuum) > 60 * 60 * 24 * 10 for: 0m labels: severity: warning annotations: summary: Postgresql table not auto vacuumed (instance {{ $labels.instance }}) description: "Table {{ $labels.relname }} has not been auto vacuumed for 10 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.5. Postgresql table not auto analyzed
Table {{ $labels.relname }} has not been auto analyzed for 10 days [copy] - alert: PostgresqlTableNotAutoAnalyzed expr: ((pg_stat_user_tables_n_tup_del + pg_stat_user_tables_n_tup_upd + pg_stat_user_tables_n_tup_hot_upd) > pg_settings_autovacuum_analyze_threshold) and (time() - pg_stat_user_tables_last_autoanalyze) > 24 * 60 * 60 * 10 for: 0m labels: severity: warning annotations: summary: Postgresql table not auto analyzed (instance {{ $labels.instance }}) description: "Table {{ $labels.relname }} has not been auto analyzed for 10 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.6. Postgresql too many connections
PostgreSQL instance has too many connections (> 80%). [copy] - alert: PostgresqlTooManyConnections expr: sum by (instance, job, server) (pg_stat_activity_count) > min by (instance, job, server) (pg_settings_max_connections * 0.8) for: 2m labels: severity: warning annotations: summary: Postgresql too many connections (instance {{ $labels.instance }}) description: "PostgreSQL instance has too many connections (> 80%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.7. Postgresql not enough connections
PostgreSQL instance has too few connections (< 5) [copy] - alert: PostgresqlNotEnoughConnections expr: sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) < 5 for: 2m labels: severity: critical annotations: summary: Postgresql not enough connections (instance {{ $labels.instance }}) description: "PostgreSQL instance has too few connections (< 5)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.8. Postgresql dead locks
PostgreSQL has dead-locks [copy] - alert: PostgresqlDeadLocks expr: increase(pg_stat_database_deadlocks{datname!~"template.*|postgres"}[1m]) > 5 for: 0m labels: severity: warning annotations: summary: Postgresql dead locks (instance {{ $labels.instance }}) description: "PostgreSQL has dead-locks\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.9. Postgresql high rollback rate
The ratio of aborted to committed transactions is > 2% [copy] - alert: PostgresqlHighRollbackRate expr: sum by (namespace,datname) ((rate(pg_stat_database_xact_rollback{datname!~"template.*|postgres",datid!="0"}[3m])) / ((rate(pg_stat_database_xact_rollback{datname!~"template.*|postgres",datid!="0"}[3m])) + (rate(pg_stat_database_xact_commit{datname!~"template.*|postgres",datid!="0"}[3m])))) > 0.02 for: 0m labels: severity: warning annotations: summary: Postgresql high rollback rate (instance {{ $labels.instance }}) description: "The ratio of aborted to committed transactions is > 2%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
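The rollback-rate expression reduces to abort ratio = rollbacks / (rollbacks + commits), computed from 3-minute rates. A sketch with hypothetical per-second rates:

```python
# rate(xact_rollback[3m]) / (rate(xact_rollback[3m]) + rate(xact_commit[3m])) > 0.02
rollback_rate = 1.5    # rollbacks per second over the 3m window (hypothetical)
commit_rate = 48.5     # commits per second over the same window

abort_ratio = rollback_rate / (rollback_rate + commit_rate)
assert abs(abort_ratio - 0.03) < 1e-12  # 3% of transactions aborted
assert abort_ratio > 0.02               # above the 2% threshold, the alert fires
```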
-
# 2.2.10. Postgresql commit rate low
Postgresql seems to be processing very few transactions [copy] - alert: PostgresqlCommitRateLow expr: increase(pg_stat_database_xact_commit{datname!~"template.*|postgres",datid!="0"}[5m]) < 5 for: 2m labels: severity: critical annotations: summary: Postgresql commit rate low (instance {{ $labels.instance }}) description: "Postgresql seems to be processing very few transactions\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.11. Postgresql low XID consumption
Postgresql seems to be consuming transaction IDs very slowly [copy] - alert: PostgresqlLowXidConsumption expr: rate(pg_txid_current[1m]) < 5 for: 2m labels: severity: warning annotations: summary: Postgresql low XID consumption (instance {{ $labels.instance }}) description: "Postgresql seems to be consuming transaction IDs very slowly\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.12. Postgresql unused replication slot
Unused Replication Slots [copy] - alert: PostgresqlUnusedReplicationSlot expr: (pg_replication_slots_active == 0) and (pg_replication_is_replica == 0) for: 1m labels: severity: warning annotations: summary: Postgresql unused replication slot (instance {{ $labels.instance }}) description: "Unused Replication Slots\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.13. Postgresql too many dead tuples
PostgreSQL has too many dead tuples (>= 10% of tuples) [copy] - alert: PostgresqlTooManyDeadTuples expr: ((pg_stat_user_tables_n_dead_tup > 10000) / (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup)) >= 0.1 for: 2m labels: severity: warning annotations: summary: Postgresql too many dead tuples (instance {{ $labels.instance }}) description: "PostgreSQL has too many dead tuples (>= 10% of tuples)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.14. Postgresql configuration changed
Postgres Database configuration change has occurred [copy] - alert: PostgresqlConfigurationChanged expr: {__name__=~"pg_settings_.*",__name__!="pg_settings_transaction_read_only"} != ON(__name__, instance) {__name__=~"pg_settings_.*",__name__!="pg_settings_transaction_read_only"} OFFSET 5m for: 0m labels: severity: info annotations: summary: Postgresql configuration changed (instance {{ $labels.instance }}) description: "Postgres Database configuration change has occurred\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
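The `OFFSET 5m` trick compares each `pg_settings_*` series against its own value five minutes earlier; any inequality means a setting changed. Emulated with snapshot dicts (setting names and values are hypothetical):

```python
# {__name__=~"pg_settings_.*"} != ON(__name__, instance) ... OFFSET 5m, emulated.
now = {"pg_settings_work_mem_bytes": 8388608.0, "pg_settings_max_connections": 200.0}
five_min_ago = {"pg_settings_work_mem_bytes": 4194304.0, "pg_settings_max_connections": 200.0}

# A setting absent from the old snapshot also counts as changed here.
changed = {name: v for name, v in now.items() if five_min_ago.get(name) != v}
assert changed == {"pg_settings_work_mem_bytes": 8388608.0}  # only work_mem changed
```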
-
# 2.2.15. Postgresql SSL compression active
Database allows connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`. [copy] - alert: PostgresqlSslCompressionActive expr: sum by (instance) (pg_stat_ssl_compression) > 0 for: 0m labels: severity: warning annotations: summary: Postgresql SSL compression active (instance {{ $labels.instance }}) description: "Database allows connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.16. Postgresql too many locks acquired
Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction. [copy] - alert: PostgresqlTooManyLocksAcquired expr: ((sum by (instance) (pg_locks_count)) / (pg_settings_max_locks_per_transaction * pg_settings_max_connections)) > 0.20 for: 2m labels: severity: critical annotations: summary: Postgresql too many locks acquired (instance {{ $labels.instance }}) description: "Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
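The denominator approximates the server's lock-slot ceiling, which Postgres sizes from `max_locks_per_transaction * max_connections`. As a sketch (the recording-rule name is hypothetical), the ratio can be precomputed so dashboards and the alert share one expression:

```yaml
# Sketch (hypothetical rule name): precompute the lock saturation ratio.
# With the Postgres defaults max_locks_per_transaction=64 and
# max_connections=100, the ceiling is 64 * 100 = 6400 slots, so the
# alert's 0.20 threshold fires above roughly 1280 held locks.
groups:
  - name: postgresql-locks
    rules:
      - record: postgresql:lock_saturation:ratio
        expr: sum by (instance) (pg_locks_count) / (pg_settings_max_locks_per_transaction * pg_settings_max_connections)
```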
-
# 2.2.17. Postgresql bloat index high (> 80%)
The index {{ $labels.idxname }} is bloated. You should execute `REINDEX INDEX CONCURRENTLY {{ $labels.idxname }};` [copy] # See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737 - alert: PostgresqlBloatIndexHigh(>80%) expr: pg_bloat_btree_bloat_pct > 80 and on (idxname) (pg_bloat_btree_real_size > 100000000) for: 1h labels: severity: warning annotations: summary: Postgresql bloat index high (> 80%) (instance {{ $labels.instance }}) description: "The index {{ $labels.idxname }} is bloated. You should execute `REINDEX INDEX CONCURRENTLY {{ $labels.idxname }};`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.18. Postgresql bloat table high (> 80%)
The table {{ $labels.relname }} is bloated. You should execute `VACUUM {{ $labels.relname }};` [copy] # See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737 - alert: PostgresqlBloatTableHigh(>80%) expr: pg_bloat_table_bloat_pct > 80 and on (relname) (pg_bloat_table_real_size > 200000000) for: 1h labels: severity: warning annotations: summary: Postgresql bloat table high (> 80%) (instance {{ $labels.instance }}) description: "The table {{ $labels.relname }} is bloated. You should execute `VACUUM {{ $labels.relname }};`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.19. Postgresql invalid index
The table {{ $labels.relname }} has an invalid index: {{ $labels.indexrelname }}. You should execute `DROP INDEX {{ $labels.indexrelname }};` [copy] # See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737 - alert: PostgresqlInvalidIndex expr: pg_general_index_info_pg_relation_size{indexrelname=~".*ccnew.*"} for: 6h labels: severity: warning annotations: summary: Postgresql invalid index (instance {{ $labels.instance }}) description: "The table {{ $labels.relname }} has an invalid index: {{ $labels.indexrelname }}. You should execute `DROP INDEX {{ $labels.indexrelname }};`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.2.20. Postgresql replication lag
The PostgreSQL replication lag is high (> 5s) [copy] - alert: PostgresqlReplicationLag expr: pg_replication_lag_seconds > 5 for: 30s labels: severity: warning annotations: summary: Postgresql replication lag (instance {{ $labels.instance }}) description: "The PostgreSQL replication lag is high (> 5s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.3. SQL Server : Ozarklake/prometheus-mssql-exporter (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/sql-server/ozarklake-mssql-exporter.yml
-
# 2.3.1. SQL Server down
SQL Server instance is down [copy] # 1m delay allows a restart without triggering an alert. - alert: SqlServerDown expr: mssql_up == 0 for: 1m labels: severity: critical annotations: summary: SQL Server down (instance {{ $labels.instance }}) description: "SQL Server instance is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.3.2. SQL Server deadlock
SQL Server {{ $labels.instance }} is experiencing deadlocks ({{ $value }}/s) [copy] - alert: SqlServerDeadlock expr: mssql_deadlocks > 5 for: 1m labels: severity: warning annotations: summary: SQL Server deadlock (instance {{ $labels.instance }}) description: "SQL Server {{ $labels.instance }} is experiencing deadlocks ({{ $value }}/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.4. Oracle Database : iamseth/oracledb_exporter (8 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/oracle-database/iamseth-oracledb-exporter.yml
-
# 2.4.1. Oracle DB down
Oracle Database instance is down on {{ $labels.instance }} [copy] # 1m delay allows a restart without triggering an alert. - alert: OracleDbDown expr: oracledb_up == 0 for: 1m labels: severity: critical annotations: summary: Oracle DB down (instance {{ $labels.instance }}) description: "Oracle Database instance is down on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.4.2. Oracle DB sessions reaching limit (> 85%)
Oracle Database session utilization is above 85% on {{ $labels.instance }} (current value: {{ $value }}%) [copy] # Threshold is workload-dependent. Adjust 85% to suit your environment. - alert: OracleDbSessionsReachingLimit(>85%) expr: oracledb_resource_current_utilization{resource_name="sessions"} / oracledb_resource_limit_value{resource_name="sessions"} * 100 > 85 and oracledb_resource_limit_value{resource_name="sessions"} > 0 for: 5m labels: severity: warning annotations: summary: Oracle DB sessions reaching limit (> 85%) (instance {{ $labels.instance }}) description: "Oracle Database session utilization is above 85% on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.4.3. Oracle DB processes reaching limit (> 85%)
Oracle Database process utilization is above 85% on {{ $labels.instance }} (current value: {{ $value }}%) [copy] # Threshold is workload-dependent. Adjust 85% to suit your environment. - alert: OracleDbProcessesReachingLimit(>85%) expr: oracledb_resource_current_utilization{resource_name="processes"} / oracledb_resource_limit_value{resource_name="processes"} * 100 > 85 and oracledb_resource_limit_value{resource_name="processes"} > 0 for: 5m labels: severity: warning annotations: summary: Oracle DB processes reaching limit (> 85%) (instance {{ $labels.instance }}) description: "Oracle Database process utilization is above 85% on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.4.4. Oracle DB tablespace reaching capacity (> 85%)
Oracle Database tablespace {{ $labels.tablespace }} is above 85% usage on {{ $labels.instance }} (current value: {{ $value }}%) [copy] - alert: OracleDbTablespaceReachingCapacity(>85%) expr: oracledb_tablespace_used_percent > 85 for: 5m labels: severity: warning annotations: summary: Oracle DB tablespace reaching capacity (> 85%) (instance {{ $labels.instance }}) description: "Oracle Database tablespace {{ $labels.tablespace }} is above 85% usage on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.4.5. Oracle DB tablespace full (> 95%)
Oracle Database tablespace {{ $labels.tablespace }} is critically full on {{ $labels.instance }} (current value: {{ $value }}%) [copy] - alert: OracleDbTablespaceFull(>95%) expr: oracledb_tablespace_used_percent > 95 for: 5m labels: severity: critical annotations: summary: Oracle DB tablespace full (> 95%) (instance {{ $labels.instance }}) description: "Oracle Database tablespace {{ $labels.tablespace }} is critically full on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.4.6. Oracle DB high user rollbacks
Oracle Database on {{ $labels.instance }} has a high rollback rate ({{ $value }}% of transactions are rolled back) [copy] # A high rollback rate (>20%) often indicates application-level issues such as deadlocks, constraint violations, or poorly designed transactions. - alert: OracleDbHighUserRollbacks expr: rate(oracledb_activity_user_rollbacks[5m]) / (rate(oracledb_activity_user_commits[5m]) + rate(oracledb_activity_user_rollbacks[5m])) * 100 > 20 and (rate(oracledb_activity_user_commits[5m]) + rate(oracledb_activity_user_rollbacks[5m])) > 0 for: 5m labels: severity: warning annotations: summary: Oracle DB high user rollbacks (instance {{ $labels.instance }}) description: "Oracle Database on {{ $labels.instance }} has a high rollback rate ({{ $value }}% of transactions are rolled back)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
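The rollback percentage is rollbacks divided by total transactions: for example, 30 rollbacks/s against 70 commits/s gives 30 / (30 + 70) * 100 = 30%, which crosses the 20% threshold. As a sketch (the recording-rule name is hypothetical), the same formula as a recording rule:

```yaml
# Sketch (hypothetical rule name): rollback percentage over 5m.
# Example: 30 rollbacks/s and 70 commits/s gives 30 / (30 + 70) * 100 = 30%,
# above the alert's 20% threshold.
groups:
  - name: oracledb-rollbacks
    rules:
      - record: oracledb:user_rollback_pct:rate5m
        expr: rate(oracledb_activity_user_rollbacks[5m]) / (rate(oracledb_activity_user_commits[5m]) + rate(oracledb_activity_user_rollbacks[5m])) * 100
```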
-
# 2.4.7. Oracle DB too many active sessions
Oracle Database on {{ $labels.instance }} has too many active user sessions (current value: {{ $value }}) [copy] # Threshold is highly workload-dependent. Adjust 200 to suit your environment. - alert: OracleDbTooManyActiveSessions expr: oracledb_sessions_activity{status="ACTIVE", type="USER"} > 200 for: 5m labels: severity: warning annotations: summary: Oracle DB too many active sessions (instance {{ $labels.instance }}) description: "Oracle Database on {{ $labels.instance }} has too many active user sessions (current value: {{ $value }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.4.8. Oracle DB high wait time (user I/O)
Oracle Database on {{ $labels.instance }} is experiencing high user I/O wait time [copy] # High user I/O wait time indicates storage performance issues (slow disks, SAN latency, etc.). # The metric is in centiseconds per second. Threshold 300 means 3 seconds of I/O wait per second of wall time. - alert: OracleDbHighWaitTime(userI/o) expr: rate(oracledb_wait_time_user_io[5m]) > 300 for: 5m labels: severity: warning annotations: summary: Oracle DB high wait time (user I/O) (instance {{ $labels.instance }}) description: "Oracle Database on {{ $labels.instance }} is experiencing high user I/O wait time\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.5. Patroni : Embedded exporter (Patroni >= 2.1.0) (1 rule) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/patroni/embedded-exporter-patroni.yml
-
# 2.5.1. Patroni has no Leader
No leader node (neither a primary nor a standby leader) can be found in cluster {{ $labels.scope }} [copy] # 1m delay allows a restart without triggering an alert. - alert: PatroniHasNoLeader expr: (max by (scope) (patroni_primary) < 1) and (max by (scope) (patroni_standby_leader) < 1) for: 1m labels: severity: critical annotations: summary: Patroni has no Leader (instance {{ $labels.instance }}) description: "No leader node (neither a primary nor a standby leader) can be found in cluster {{ $labels.scope }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.6. PGBouncer : spreaker/prometheus-pgbouncer-exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/pgbouncer/spreaker-pgbouncer-exporter.yml
-
# 2.6.1. PGBouncer active connections
PGBouncer pools are filling up [copy] - alert: PgbouncerActiveConnections expr: pgbouncer_pools_server_active_connections > 200 for: 2m labels: severity: warning annotations: summary: PGBouncer active connections (instance {{ $labels.instance }}) description: "PGBouncer pools are filling up\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.6.2. PGBouncer errors
PGBouncer is logging errors. This may be due to a server restart or an admin typing commands at the pgbouncer console. [copy] - alert: PgbouncerErrors expr: increase(pgbouncer_errors_count{errmsg!="server conn crashed?"}[1m]) > 10 for: 0m labels: severity: warning annotations: summary: PGBouncer errors (instance {{ $labels.instance }}) description: "PGBouncer is logging errors. This may be due to a server restart or an admin typing commands at the pgbouncer console.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.6.3. PGBouncer max connections
The number of PGBouncer client connections has reached max_client_conn. [copy] - alert: PgbouncerMaxConnections expr: increase(pgbouncer_errors_count{errmsg="no more connections allowed (max_client_conn)"}[2m]) > 0 for: 0m labels: severity: critical annotations: summary: PGBouncer max connections (instance {{ $labels.instance }}) description: "The number of PGBouncer client connections has reached max_client_conn.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.7. Redis : oliver006/redis_exporter (12 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/redis/oliver006-redis-exporter.yml
-
# 2.7.1. Redis down
Redis instance is down [copy] # 1m delay allows a restart without triggering an alert. - alert: RedisDown expr: redis_up == 0 for: 1m labels: severity: critical annotations: summary: Redis down (instance {{ $labels.instance }}) description: "Redis instance is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.2. Redis missing master
Redis cluster has no node marked as master. [copy] - alert: RedisMissingMaster expr: (count(redis_instance_info{role="master"}) or vector(0)) < 1 for: 0m labels: severity: critical annotations: summary: Redis missing master (instance {{ $labels.instance }}) description: "Redis cluster has no node marked as master.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
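The `or vector(0)` wrapper matters here: `count()` over zero matching series returns an empty result rather than 0, so without it the `< 1` comparison would never produce a sample and the alert would silently never fire. The idiom in isolation:

```yaml
# The "or vector(0)" idiom: count() over no matching series yields an
# empty result, not 0. Substituting an explicit vector(0) lets the < 1
# comparison fire when no master exists at all.
- alert: RedisMissingMaster
  expr: (count(redis_instance_info{role="master"}) or vector(0)) < 1
  for: 0m
  labels:
    severity: critical
```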
-
# 2.7.3. Redis too many masters
Redis cluster has too many nodes marked as master. [copy] # 1m delay allows a restart without triggering an alert. - alert: RedisTooManyMasters expr: count(redis_instance_info{role="master"}) > 1 for: 1m labels: severity: critical annotations: summary: Redis too many masters (instance {{ $labels.instance }}) description: "Redis cluster has too many nodes marked as master.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.4. Redis disconnected slaves
Redis is not replicating to all slaves. Consider reviewing the Redis replication status. [copy] - alert: RedisDisconnectedSlaves expr: count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 0 for: 0m labels: severity: critical annotations: summary: Redis disconnected slaves (instance {{ $labels.instance }}) description: "Redis is not replicating to all slaves. Consider reviewing the Redis replication status.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.5. Redis replication broken
Redis instance lost a slave [copy] - alert: RedisReplicationBroken expr: delta(redis_connected_slaves[1m]) < 0 for: 0m labels: severity: critical annotations: summary: Redis replication broken (instance {{ $labels.instance }}) description: "Redis instance lost a slave\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.6. Redis cluster flapping
Changes have been detected in Redis replica connections. This can occur when replica nodes lose connection to the master and reconnect (a.k.a. flapping). [copy] - alert: RedisClusterFlapping expr: changes(redis_connected_slaves[1m]) > 1 for: 2m labels: severity: critical annotations: summary: Redis cluster flapping (instance {{ $labels.instance }}) description: "Changes have been detected in Redis replica connections. This can occur when replica nodes lose connection to the master and reconnect (a.k.a. flapping).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.7. Redis missing backup
Redis has not been backed up for 48 hours [copy] - alert: RedisMissingBackup expr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 48 for: 0m labels: severity: critical annotations: summary: Redis missing backup (instance {{ $labels.instance }}) description: "Redis has not been backed up for 48 hours\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.8. Redis out of system memory
Redis is running out of system memory (> 90%) [copy] # The exporter must be started with --include-system-metrics flag or REDIS_EXPORTER_INCL_SYSTEM_METRICS=true environment variable. - alert: RedisOutOfSystemMemory expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90 for: 2m labels: severity: warning annotations: summary: Redis out of system memory (instance {{ $labels.instance }}) description: "Redis is running out of system memory (> 90%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.9. Redis out of configured maxmemory
Redis is running out of configured maxmemory (> 90%) [copy] - alert: RedisOutOfConfiguredMaxmemory expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 90 and on(instance) redis_memory_max_bytes > 0 for: 2m labels: severity: warning annotations: summary: Redis out of configured maxmemory (instance {{ $labels.instance }}) description: "Redis is running out of configured maxmemory (> 90%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.10. Redis too many connections
Redis is running out of connections (> 90% used) [copy] - alert: RedisTooManyConnections expr: redis_connected_clients / redis_config_maxclients * 100 > 90 for: 2m labels: severity: warning annotations: summary: Redis too many connections (instance {{ $labels.instance }}) description: "Redis is running out of connections (> 90% used)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.11. Redis not enough connections
Redis instance has fewer client connections than expected (< 5) [copy] - alert: RedisNotEnoughConnections expr: redis_connected_clients < 5 for: 2m labels: severity: warning annotations: summary: Redis not enough connections (instance {{ $labels.instance }}) description: "Redis instance has fewer client connections than expected (< 5)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.7.12. Redis rejected connections
Some connections to Redis have been rejected [copy] - alert: RedisRejectedConnections expr: increase(redis_rejected_connections_total[1m]) > 5 for: 0m labels: severity: warning annotations: summary: Redis rejected connections (instance {{ $labels.instance }}) description: "Some connections to Redis have been rejected\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.8. Memcached : prometheus/memcached_exporter (9 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/memcached/memcached-exporter.yml
-
# 2.8.1. Memcached down
Memcached instance is down on {{ $labels.instance }} [copy] # 1m delay allows a restart without triggering an alert. - alert: MemcachedDown expr: memcached_up == 0 for: 1m labels: severity: critical annotations: summary: Memcached down (instance {{ $labels.instance }}) description: "Memcached instance is down on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.8.2. Memcached connection limit approaching (> 80%)
Memcached connection usage is above 80% on {{ $labels.instance }} (current value: {{ $value }}%) [copy] - alert: MemcachedConnectionLimitApproaching(>80%) expr: (memcached_current_connections / memcached_max_connections * 100) > 80 and memcached_max_connections > 0 for: 2m labels: severity: warning annotations: summary: Memcached connection limit approaching (> 80%) (instance {{ $labels.instance }}) description: "Memcached connection usage is above 80% on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.8.3. Memcached connection limit approaching (> 95%)
Memcached connection usage is above 95% on {{ $labels.instance }} (current value: {{ $value }}%) [copy] - alert: MemcachedConnectionLimitApproaching(>95%) expr: (memcached_current_connections / memcached_max_connections * 100) > 95 and memcached_max_connections > 0 for: 2m labels: severity: critical annotations: summary: Memcached connection limit approaching (> 95%) (instance {{ $labels.instance }}) description: "Memcached connection usage is above 95% on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.8.4. Memcached out of memory errors
Memcached is returning out-of-memory errors on {{ $labels.instance }} [copy] - alert: MemcachedOutOfMemoryErrors expr: sum without (slab) (rate(memcached_slab_items_outofmemory_total[5m])) > 0 for: 5m labels: severity: warning annotations: summary: Memcached out of memory errors (instance {{ $labels.instance }}) description: "Memcached is returning out-of-memory errors on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.8.5. Memcached memory usage high (> 90%)
Memcached memory usage is above 90% on {{ $labels.instance }} (current value: {{ $value }}%) [copy] # High memory usage is expected if the cache is well-utilized. This alert fires when it approaches the configured limit, which may cause evictions. - alert: MemcachedMemoryUsageHigh(>90%) expr: (memcached_current_bytes / memcached_limit_bytes * 100) > 90 and memcached_limit_bytes > 0 for: 5m labels: severity: warning annotations: summary: Memcached memory usage high (> 90%) (instance {{ $labels.instance }}) description: "Memcached memory usage is above 90% on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.8.6. Memcached high eviction rate
Memcached is evicting items at a high rate on {{ $labels.instance }} ({{ $value }} evictions/s) [copy] # A sustained eviction rate indicates memory pressure. Consider increasing memcached memory limit or reducing cache usage. Threshold of 10 evictions/s is a rough default — adjust based on your workload. - alert: MemcachedHighEvictionRate expr: rate(memcached_items_evicted_total[5m]) > 10 for: 5m labels: severity: warning annotations: summary: Memcached high eviction rate (instance {{ $labels.instance }}) description: "Memcached is evicting items at a high rate on {{ $labels.instance }} ({{ $value }} evictions/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.8.7. Memcached low cache hit rate (< 80%)
Memcached cache hit rate is below 80% on {{ $labels.instance }} (current value: {{ $value }}%) [copy] # A low hit rate may indicate poor cache utilization, incorrect cache keys, or TTLs that are too short. Threshold of 80% is a rough default — adjust based on your workload and access patterns. - alert: MemcachedLowCacheHitRate(<80%) expr: (rate(memcached_commands_total{command="get", status="hit"}[5m]) / (rate(memcached_commands_total{command="get", status="hit"}[5m]) + rate(memcached_commands_total{command="get", status="miss"}[5m])) * 100) < 80 and (rate(memcached_commands_total{command="get", status="hit"}[5m]) + rate(memcached_commands_total{command="get", status="miss"}[5m])) > 0 for: 10m labels: severity: warning annotations: summary: Memcached low cache hit rate (< 80%) (instance {{ $labels.instance }}) description: "Memcached cache hit rate is below 80% on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
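The hit rate is hits divided by total gets: for example, 900 hits/s against 300 misses/s gives 900 / (900 + 300) * 100 = 75%, which would fire the < 80% alert. As a sketch (the recording-rule name is hypothetical), the ratio as a recording rule:

```yaml
# Sketch (hypothetical rule name): memcached GET hit rate over 5m.
# Example: 900 hits/s and 300 misses/s gives 900 / (900 + 300) * 100 = 75%,
# below the alert's 80% default.
groups:
  - name: memcached-hit-rate
    rules:
      - record: memcached:hit_rate_pct:rate5m
        expr: rate(memcached_commands_total{command="get", status="hit"}[5m]) / (rate(memcached_commands_total{command="get", status="hit"}[5m]) + rate(memcached_commands_total{command="get", status="miss"}[5m])) * 100
```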
-
# 2.8.8. Memcached connections rejected
Memcached is rejecting connections on {{ $labels.instance }} ({{ $value }} rejections in the last 5m) [copy] - alert: MemcachedConnectionsRejected expr: increase(memcached_connections_rejected_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Memcached connections rejected (instance {{ $labels.instance }}) description: "Memcached is rejecting connections on {{ $labels.instance }} ({{ $value }} rejections in the last 5m)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.8.9. Memcached items too large
Memcached is rejecting items exceeding max-item-size on {{ $labels.instance }} ({{ $value }} rejections in the last 5m) [copy] - alert: MemcachedItemsTooLarge expr: increase(memcached_item_too_large_total[5m]) > 0 for: 5m labels: severity: info annotations: summary: Memcached items too large (instance {{ $labels.instance }}) description: "Memcached is rejecting items exceeding max-item-size on {{ $labels.instance }} ({{ $value }} rejections in the last 5m)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.9.1. MongoDB : percona/mongodb_exporter (7 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/mongodb/percona-mongodb-exporter.yml
-
# 2.9.1.1. MongoDB Down
MongoDB instance is down [copy] # 1m delay allows a restart without triggering an alert. - alert: MongodbDown expr: mongodb_up == 0 for: 1m labels: severity: critical annotations: summary: MongoDB Down (instance {{ $labels.instance }}) description: "MongoDB instance is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.1.2. Mongodb replica member unhealthy
MongoDB replica member is not healthy [copy] # 1m delay allows a restart without triggering an alert. - alert: MongodbReplicaMemberUnhealthy expr: mongodb_rs_members_health == 0 for: 1m labels: severity: critical annotations: summary: Mongodb replica member unhealthy (instance {{ $labels.instance }}) description: "MongoDB replica member is not healthy\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.1.3. MongoDB replication lag
MongoDB replication lag is more than 10s [copy] - alert: MongodbReplicationLag expr: (mongodb_rs_members_optimeDate{member_state="PRIMARY"} - on (set) group_right mongodb_rs_members_optimeDate{member_state="SECONDARY"}) / 1000 > 10 for: 0m labels: severity: critical annotations: summary: MongoDB replication lag (instance {{ $labels.instance }}) description: "MongoDB replication lag is more than 10s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.1.4. MongoDB replication headroom
MongoDB replication headroom is <= 0 [copy] # This query mixes old (mongodb_mongod_*) and new (mongodb_rs_*) metric names. It requires the Percona exporter to run with --compatible-mode to expose both. - alert: MongodbReplicationHeadroom expr: sum(avg(mongodb_mongod_replset_oplog_head_timestamp - mongodb_mongod_replset_oplog_tail_timestamp)) - sum(avg(mongodb_rs_members_optimeDate{member_state="PRIMARY"} - on (set) group_right mongodb_rs_members_optimeDate{member_state="SECONDARY"})) <= 0 for: 0m labels: severity: critical annotations: summary: MongoDB replication headroom (instance {{ $labels.instance }}) description: "MongoDB replication headroom is <= 0\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.1.5. MongoDB number cursors open
Too many cursors opened by MongoDB for clients (> 10k) [copy] - alert: MongodbNumberCursorsOpen expr: mongodb_ss_metrics_cursor_open{csr_type="total"} > 10 * 1000 for: 2m labels: severity: warning annotations: summary: MongoDB number cursors open (instance {{ $labels.instance }}) description: "Too many cursors opened by MongoDB for clients (> 10k)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.1.6. MongoDB cursors timeouts
Too many cursors are timing out [copy] - alert: MongodbCursorsTimeouts expr: increase(mongodb_ss_metrics_cursor_timedOut[1m]) > 100 for: 2m labels: severity: warning annotations: summary: MongoDB cursors timeouts (instance {{ $labels.instance }}) description: "Too many cursors are timing out\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.1.7. MongoDB too many connections
Too many connections (> 80%) [copy] - alert: MongodbTooManyConnections expr: mongodb_ss_connections{conn_type="current"} / (mongodb_ss_connections{conn_type="current"} + mongodb_ss_connections{conn_type="available"}) * 100 > 80 for: 2m labels: severity: warning annotations: summary: MongoDB too many connections (instance {{ $labels.instance }}) description: "Too many connections (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.9.2. MongoDB : dcu/mongodb_exporter (9 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/mongodb/dcu-mongodb-exporter.yml
-
# 2.9.2.1. MongoDB replication lag
MongoDB replication lag is more than 10s [copy] - alert: MongodbReplicationLag expr: avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}) > 10 for: 0m labels: severity: critical annotations: summary: MongoDB replication lag (instance {{ $labels.instance }}) description: "MongoDB replication lag is more than 10s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.2.2. MongoDB replication Status 3
MongoDB replica set member is either performing startup self-checks, or transitioning from completing a rollback or resync [copy] - alert: MongodbReplicationStatus3 expr: mongodb_replset_member_state == 3 for: 0m labels: severity: critical annotations: summary: MongoDB replication Status 3 (instance {{ $labels.instance }}) description: "MongoDB replica set member is either performing startup self-checks, or transitioning from completing a rollback or resync\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.2.3. MongoDB replication Status 6
MongoDB replica set member, as seen from another member of the set, has a state that is not yet known [copy] - alert: MongodbReplicationStatus6 expr: mongodb_replset_member_state == 6 for: 0m labels: severity: critical annotations: summary: MongoDB replication Status 6 (instance {{ $labels.instance }}) description: "MongoDB replica set member, as seen from another member of the set, has a state that is not yet known\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.2.4. MongoDB replication Status 8
MongoDB replica set member, as seen from another member of the set, is unreachable [copy] - alert: MongodbReplicationStatus8 expr: mongodb_replset_member_state == 8 for: 0m labels: severity: critical annotations: summary: MongoDB replication Status 8 (instance {{ $labels.instance }}) description: "MongoDB replica set member, as seen from another member of the set, is unreachable\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.2.5. MongoDB replication Status 9
MongoDB replica set member is actively performing a rollback. Data is not available for reads [copy] - alert: MongodbReplicationStatus9 expr: mongodb_replset_member_state == 9 for: 0m labels: severity: critical annotations: summary: MongoDB replication Status 9 (instance {{ $labels.instance }}) description: "MongoDB replica set member is actively performing a rollback. Data is not available for reads\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.2.6. MongoDB replication Status 10
MongoDB replica set member was once in a replica set but was subsequently removed [copy] - alert: MongodbReplicationStatus10 expr: mongodb_replset_member_state == 10 for: 0m labels: severity: critical annotations: summary: MongoDB replication Status 10 (instance {{ $labels.instance }}) description: "MongoDB replica set member was once in a replica set but was subsequently removed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.2.7. MongoDB number cursors open
Too many cursors opened by MongoDB for clients (> 10k) [copy] - alert: MongodbNumberCursorsOpen expr: mongodb_metrics_cursor_open{state="total_open"} > 10000 for: 2m labels: severity: warning annotations: summary: MongoDB number cursors open (instance {{ $labels.instance }}) description: "Too many cursors opened by MongoDB for clients (> 10k)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.2.8. MongoDB cursors timeouts
Too many cursors are timing out [copy] - alert: MongodbCursorsTimeouts expr: increase(mongodb_metrics_cursor_timed_out_total[1m]) > 100 for: 2m labels: severity: warning annotations: summary: MongoDB cursors timeouts (instance {{ $labels.instance }}) description: "Too many cursors are timing out\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.9.2.9. MongoDB too many connections
Too many connections (> 80%) [copy] - alert: MongodbTooManyConnections expr: mongodb_connections{state="current"} / (mongodb_connections{state="current"} + mongodb_connections{state="available"}) * 100 > 80 for: 2m labels: severity: warning annotations: summary: MongoDB too many connections (instance {{ $labels.instance }}) description: "Too many connections (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.9.3. MongoDB : stefanprodan/mgob (1 rule) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/mongodb/stefanprodan-mgob-exporter.yml
-
# 2.9.3.1. Mgob backup failed
MongoDB backup has failed [copy] - alert: MgobBackupFailed expr: changes(mgob_scheduler_backup_total{status="500"}[1h]) > 0 for: 0m labels: severity: critical annotations: summary: Mgob backup failed (instance {{ $labels.instance }}) description: "MongoDB backup has failed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.10.1. RabbitMQ : rabbitmq/rabbitmq-prometheus (10 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/rabbitmq/rabbitmq-exporter.yml
-
# 2.10.1.1. RabbitMQ node down
Fewer than 3 nodes running in the RabbitMQ cluster [copy] # 1m delay allows a restart without triggering an alert. - alert: RabbitmqNodeDown expr: sum(rabbitmq_build_info) < 3 for: 1m labels: severity: critical annotations: summary: RabbitMQ node down (instance {{ $labels.instance }}) description: "Fewer than 3 nodes running in the RabbitMQ cluster\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.1.2. RabbitMQ node not distributed
Distribution link state is not 'up' [copy] # 1m delay allows a restart without triggering an alert. - alert: RabbitmqNodeNotDistributed expr: erlang_vm_dist_node_state < 3 for: 1m labels: severity: critical annotations: summary: RabbitMQ node not distributed (instance {{ $labels.instance }}) description: "Distribution link state is not 'up'\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.1.3. RabbitMQ instances different versions
Running different versions of RabbitMQ in the same cluster can lead to failure. [copy] - alert: RabbitmqInstancesDifferentVersions expr: count(count(rabbitmq_build_info) by (rabbitmq_version)) > 1 for: 1h labels: severity: warning annotations: summary: RabbitMQ instances different versions (instance {{ $labels.instance }}) description: "Running different versions of RabbitMQ in the same cluster can lead to failure.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
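The nested `count(count(rabbitmq_build_info) by (rabbitmq_version))` counts distinct version labels across the cluster. In plain terms, sketched here over a list of per-node version strings:

```python
# Illustrative sketch: the RabbitmqInstancesDifferentVersions condition is
# simply "more than one distinct rabbitmq_version label in the cluster".

def mixed_versions(node_versions):
    return len(set(node_versions)) > 1

print(mixed_versions(["3.12.4", "3.12.4", "3.13.0"]))  # True -> skew detected
print(mixed_versions(["3.12.4"] * 3))                  # False -> homogeneous
```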
-
# 2.10.1.4. RabbitMQ memory high
A node uses more than 90% of allocated RAM [copy] - alert: RabbitmqMemoryHigh expr: rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes * 100 > 90 and rabbitmq_resident_memory_limit_bytes > 0 for: 2m labels: severity: warning annotations: summary: RabbitMQ memory high (instance {{ $labels.instance }}) description: "A node uses more than 90% of allocated RAM\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
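The trailing `and rabbitmq_resident_memory_limit_bytes > 0` matters: it drops series whose memory limit is zero or unset, so the division can never fire spuriously. A sketch of the combined condition:

```python
# Illustrative sketch of RabbitmqMemoryHigh: usage percentage check with the
# zero-limit guard mirrored from the PromQL "and ... > 0" clause.

def memory_high(used_bytes, limit_bytes, threshold_pct=90):
    if limit_bytes <= 0:     # mirrors the PromQL guard: no limit, no alert
        return False
    return used_bytes / limit_bytes * 100 > threshold_pct

print(memory_high(950_000_000, 1_000_000_000))  # True  (95% of the limit)
print(memory_high(500_000_000, 0))              # False (guard takes effect)
```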
-
# 2.10.1.5. RabbitMQ file descriptors usage
A node uses more than 90% of file descriptors [copy] - alert: RabbitmqFileDescriptorsUsage expr: rabbitmq_process_open_fds / rabbitmq_process_max_fds * 100 > 90 and rabbitmq_process_max_fds > 0 for: 2m labels: severity: warning annotations: summary: RabbitMQ file descriptors usage (instance {{ $labels.instance }}) description: "A node uses more than 90% of file descriptors\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.1.6. RabbitMQ too many ready messages
RabbitMQ too many ready messages on {{ $labels.instance }} [copy] - alert: RabbitmqTooManyReadyMessages expr: sum(rabbitmq_queue_messages_ready) BY (queue) > 1000 for: 1m labels: severity: warning annotations: summary: RabbitMQ too many ready messages (instance {{ $labels.instance }}) description: "RabbitMQ too many ready messages on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.1.7. RabbitMQ too many unack messages
Too many unacknowledged messages [copy] - alert: RabbitmqTooManyUnackMessages expr: sum(rabbitmq_queue_messages_unacked) BY (queue) > 1000 for: 1m labels: severity: warning annotations: summary: RabbitMQ too many unack messages (instance {{ $labels.instance }}) description: "Too many unacknowledged messages\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.1.8. RabbitMQ too many connections
The total number of connections on a node is too high [copy] - alert: RabbitmqTooManyConnections expr: rabbitmq_connections > 1000 for: 2m labels: severity: warning annotations: summary: RabbitMQ too many connections (instance {{ $labels.instance }}) description: "The total number of connections on a node is too high\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.1.9. RabbitMQ no queue consumer
A queue has less than 1 consumer [copy] - alert: RabbitmqNoQueueConsumer expr: rabbitmq_queue_consumers < 1 for: 1m labels: severity: warning annotations: summary: RabbitMQ no queue consumer (instance {{ $labels.instance }}) description: "A queue has less than 1 consumer\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.1.10. RabbitMQ unroutable messages
A queue has unroutable messages [copy] - alert: RabbitmqUnroutableMessages expr: increase(rabbitmq_channel_messages_unroutable_returned_total[1m]) > 0 or increase(rabbitmq_channel_messages_unroutable_dropped_total[1m]) > 0 for: 2m labels: severity: warning annotations: summary: RabbitMQ unroutable messages (instance {{ $labels.instance }}) description: "A queue has unroutable messages\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.10.2. RabbitMQ : kbudde/rabbitmq-exporter (11 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/rabbitmq/kbudde-rabbitmq-exporter.yml
-
# 2.10.2.1. RabbitMQ down
RabbitMQ node down [copy] # 1m delay allows a restart without triggering an alert. - alert: RabbitmqDown expr: rabbitmq_up == 0 for: 1m labels: severity: critical annotations: summary: RabbitMQ down (instance {{ $labels.instance }}) description: "RabbitMQ node down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.2.2. RabbitMQ cluster down
Fewer than 3 nodes running in the RabbitMQ cluster [copy] # 1m delay allows a restart without triggering an alert. - alert: RabbitmqClusterDown expr: sum(rabbitmq_running) < 3 for: 1m labels: severity: critical annotations: summary: RabbitMQ cluster down (instance {{ $labels.instance }}) description: "Fewer than 3 nodes running in the RabbitMQ cluster\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.2.3. RabbitMQ cluster partition
Cluster partition [copy] - alert: RabbitmqClusterPartition expr: rabbitmq_partitions > 0 for: 0m labels: severity: critical annotations: summary: RabbitMQ cluster partition (instance {{ $labels.instance }}) description: "Cluster partition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.2.4. RabbitMQ out of memory
Memory available for RabbitMQ is low (< 10%) [copy] - alert: RabbitmqOutOfMemory expr: rabbitmq_node_mem_used / rabbitmq_node_mem_limit * 100 > 90 and rabbitmq_node_mem_limit > 0 for: 2m labels: severity: warning annotations: summary: RabbitMQ out of memory (instance {{ $labels.instance }}) description: "Memory available for RabbitMQ is low (< 10%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.2.5. RabbitMQ too many connections
RabbitMQ instance has too many connections (> 1000) [copy] - alert: RabbitmqTooManyConnections expr: rabbitmq_connectionsTotal > 1000 for: 2m labels: severity: warning annotations: summary: RabbitMQ too many connections (instance {{ $labels.instance }}) description: "RabbitMQ instance has too many connections (> 1000)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.2.6. RabbitMQ dead letter queue filling up
Dead letter queue is filling up (> 10 msgs) [copy] # Indicate the queue name in a dedicated label. - alert: RabbitmqDeadLetterQueueFillingUp expr: rabbitmq_queue_messages{queue="my-dead-letter-queue"} > 10 for: 1m labels: severity: warning annotations: summary: RabbitMQ dead letter queue filling up (instance {{ $labels.instance }}) description: "Dead letter queue is filling up (> 10 msgs)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.2.7. RabbitMQ too many messages in queue
Queue is filling up (> 1000 msgs) [copy] # Indicate the queue name in a dedicated label. - alert: RabbitmqTooManyMessagesInQueue expr: rabbitmq_queue_messages_ready{queue="my-queue"} > 1000 for: 2m labels: severity: warning annotations: summary: RabbitMQ too many messages in queue (instance {{ $labels.instance }}) description: "Queue is filling up (> 1000 msgs)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.2.8. RabbitMQ slow queue consuming
Queue messages are consumed slowly (> 60s) [copy] # Indicate the queue name in a dedicated label. - alert: RabbitmqSlowQueueConsuming expr: time() - rabbitmq_queue_head_message_timestamp{queue="my-queue"} > 60 for: 2m labels: severity: warning annotations: summary: RabbitMQ slow queue consuming (instance {{ $labels.instance }}) description: "Queue messages are consumed slowly (> 60s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
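`time() - rabbitmq_queue_head_message_timestamp` yields the age, in seconds, of the oldest message still sitting at the head of the queue; if that age exceeds 60s, consumption is lagging. Sketched:

```python
# Illustrative sketch of RabbitmqSlowQueueConsuming: evaluation time minus
# the head message's publish timestamp gives its age in seconds.

import time

def head_message_age(head_timestamp, now=None):
    now = time.time() if now is None else now
    return now - head_timestamp

def queue_consuming_slowly(head_timestamp, now, threshold_s=60):
    return head_message_age(head_timestamp, now) > threshold_s

# Head message published 90s before evaluation -> slower than the threshold
print(queue_consuming_slowly(head_timestamp=1_000_000, now=1_000_090))  # True
```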
-
# 2.10.2.9. RabbitMQ no consumer
Queue has no consumer [copy] # Allows a short service restart. - alert: RabbitmqNoConsumer expr: rabbitmq_queue_consumers == 0 for: 5m labels: severity: critical annotations: summary: RabbitMQ no consumer (instance {{ $labels.instance }}) description: "Queue has no consumer\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.2.10. RabbitMQ too many consumers
Queue should have only 1 consumer [copy] # Indicate the queue name in a dedicated label. - alert: RabbitmqTooManyConsumers expr: rabbitmq_queue_consumers{queue="my-queue"} > 1 for: 0m labels: severity: critical annotations: summary: RabbitMQ too many consumers (instance {{ $labels.instance }}) description: "Queue should have only 1 consumer\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.10.2.11. RabbitMQ inactive exchange
Exchange receives fewer than 5 msgs per second [copy] # Indicate the exchange name in a dedicated label. - alert: RabbitmqInactiveExchange expr: rate(rabbitmq_exchange_messages_published_in_total{exchange="my-exchange"}[1m]) < 5 for: 2m labels: severity: warning annotations: summary: RabbitMQ inactive exchange (instance {{ $labels.instance }}) description: "Exchange receives fewer than 5 msgs per second\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.11. Elasticsearch : prometheus-community/elasticsearch_exporter (19 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/elasticsearch/prometheus-community-elasticsearch-exporter.yml
-
# 2.11.1. Elasticsearch Heap Usage Too High
The heap usage is over 90% [copy] - alert: ElasticsearchHeapUsageTooHigh expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90 for: 2m labels: severity: critical annotations: summary: Elasticsearch Heap Usage Too High (instance {{ $labels.instance }}) description: "The heap usage is over 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.2. Elasticsearch Heap Usage warning
The heap usage is over 80% [copy] - alert: ElasticsearchHeapUsageWarning expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 80 for: 2m labels: severity: warning annotations: summary: Elasticsearch Heap Usage warning (instance {{ $labels.instance }}) description: "The heap usage is over 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.3. Elasticsearch disk out of space
The disk usage is over 90% [copy] - alert: ElasticsearchDiskOutOfSpace expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 10 for: 0m labels: severity: critical annotations: summary: Elasticsearch disk out of space (instance {{ $labels.instance }}) description: "The disk usage is over 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
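Note the inversion: the expression checks free space below 10%, while the description speaks of usage above 90%; the two conditions are equivalent. A sketch:

```python
# Illustrative sketch of ElasticsearchDiskOutOfSpace:
# available / size * 100 < 10 is the same as used / size * 100 > 90.

def disk_out_of_space(available_bytes, size_bytes, free_pct_threshold=10):
    if size_bytes == 0:      # hypothetical guard, not part of the PromQL rule
        return False
    free_pct = available_bytes / size_bytes * 100
    return free_pct < free_pct_threshold

print(disk_out_of_space(5, 100))   # True  (5% free, i.e. 95% used)
print(disk_out_of_space(30, 100))  # False (30% free)
```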
-
# 2.11.4. Elasticsearch disk space low
The disk usage is over 80% [copy] - alert: ElasticsearchDiskSpaceLow expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 20 for: 2m labels: severity: warning annotations: summary: Elasticsearch disk space low (instance {{ $labels.instance }}) description: "The disk usage is over 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.5. Elasticsearch Cluster Red
Elastic Cluster Red status [copy] - alert: ElasticsearchClusterRed expr: elasticsearch_cluster_health_status{color="red"} == 1 for: 0m labels: severity: critical annotations: summary: Elasticsearch Cluster Red (instance {{ $labels.instance }}) description: "Elastic Cluster Red status\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.6. Elasticsearch Cluster Yellow
Elastic Cluster Yellow status [copy] - alert: ElasticsearchClusterYellow expr: elasticsearch_cluster_health_status{color="yellow"} == 1 for: 0m labels: severity: warning annotations: summary: Elasticsearch Cluster Yellow (instance {{ $labels.instance }}) description: "Elastic Cluster Yellow status\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.7. Elasticsearch Healthy Nodes
Missing node in Elasticsearch cluster [copy] # 1m delay allows a restart without triggering an alert. - alert: ElasticsearchHealthyNodes expr: elasticsearch_cluster_health_number_of_nodes < 3 for: 1m labels: severity: critical annotations: summary: Elasticsearch Healthy Nodes (instance {{ $labels.instance }}) description: "Missing node in Elasticsearch cluster\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.8. Elasticsearch Healthy Data Nodes
Missing data node in Elasticsearch cluster [copy] # 1m delay allows a restart without triggering an alert. - alert: ElasticsearchHealthyDataNodes expr: elasticsearch_cluster_health_number_of_data_nodes < 3 for: 1m labels: severity: critical annotations: summary: Elasticsearch Healthy Data Nodes (instance {{ $labels.instance }}) description: "Missing data node in Elasticsearch cluster\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.9. Elasticsearch relocating shards
Elasticsearch is relocating shards [copy] - alert: ElasticsearchRelocatingShards expr: elasticsearch_cluster_health_relocating_shards > 0 for: 0m labels: severity: info annotations: summary: Elasticsearch relocating shards (instance {{ $labels.instance }}) description: "Elasticsearch is relocating shards\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.10. Elasticsearch relocating shards too long
Elasticsearch has been relocating shards for 15min [copy] - alert: ElasticsearchRelocatingShardsTooLong expr: elasticsearch_cluster_health_relocating_shards > 0 for: 15m labels: severity: warning annotations: summary: Elasticsearch relocating shards too long (instance {{ $labels.instance }}) description: "Elasticsearch has been relocating shards for 15min\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.11. Elasticsearch initializing shards
Elasticsearch is initializing shards [copy] - alert: ElasticsearchInitializingShards expr: elasticsearch_cluster_health_initializing_shards > 0 for: 0m labels: severity: info annotations: summary: Elasticsearch initializing shards (instance {{ $labels.instance }}) description: "Elasticsearch is initializing shards\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.12. Elasticsearch initializing shards too long
Elasticsearch has been initializing shards for 15 min [copy] - alert: ElasticsearchInitializingShardsTooLong expr: elasticsearch_cluster_health_initializing_shards > 0 for: 15m labels: severity: warning annotations: summary: Elasticsearch initializing shards too long (instance {{ $labels.instance }}) description: "Elasticsearch has been initializing shards for 15 min\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.13. Elasticsearch unassigned shards
Elasticsearch has unassigned shards [copy] - alert: ElasticsearchUnassignedShards expr: elasticsearch_cluster_health_unassigned_shards > 0 for: 2m labels: severity: critical annotations: summary: Elasticsearch unassigned shards (instance {{ $labels.instance }}) description: "Elasticsearch has unassigned shards\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.14. Elasticsearch pending tasks
Elasticsearch has pending tasks. Cluster works slowly. [copy] - alert: ElasticsearchPendingTasks expr: elasticsearch_cluster_health_number_of_pending_tasks > 0 for: 15m labels: severity: warning annotations: summary: Elasticsearch pending tasks (instance {{ $labels.instance }}) description: "Elasticsearch has pending tasks. Cluster works slowly.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.15. Elasticsearch no new documents
No new documents for 10 min! [copy] - alert: ElasticsearchNoNewDocuments expr: increase(elasticsearch_indices_indexing_index_total{es_data_node="true"}[10m]) < 1 for: 0m labels: severity: warning annotations: summary: Elasticsearch no new documents (instance {{ $labels.instance }}) description: "No new documents for 10 min!\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.16. Elasticsearch High Indexing Latency
The indexing latency on Elasticsearch cluster is higher than the threshold. [copy] - alert: ElasticsearchHighIndexingLatency expr: increase(elasticsearch_indices_indexing_index_time_seconds_total[1m]) / increase(elasticsearch_indices_indexing_index_total[1m]) > 0.0005 and increase(elasticsearch_indices_indexing_index_total[1m]) > 0 for: 10m labels: severity: warning annotations: summary: Elasticsearch High Indexing Latency (instance {{ $labels.instance }}) description: "The indexing latency on Elasticsearch cluster is higher than the threshold.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
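Dividing the growth of time-spent by the growth of the operation counter gives average seconds per indexing operation, and the `and increase(...) > 0` clause discards idle windows where the denominator would be zero. A sketch of that ratio-with-guard logic:

```python
# Illustrative sketch of ElasticsearchHighIndexingLatency: ratio of two
# counter deltas with a denominator guard mirroring "and increase(...) > 0".

def avg_op_latency(time_delta_s, ops_delta):
    if ops_delta <= 0:       # idle window -> no latency sample, alert cannot fire
        return None
    return time_delta_s / ops_delta

lat = avg_op_latency(time_delta_s=0.9, ops_delta=1000)
print(lat is not None and lat > 0.0005)  # True -> above the 0.0005 threshold
print(avg_op_latency(0.0, 0))            # None -> window skipped
```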
-
# 2.11.17. Elasticsearch High Indexing Rate
The indexing rate on Elasticsearch cluster is higher than the threshold. [copy] - alert: ElasticsearchHighIndexingRate expr: sum(rate(elasticsearch_indices_indexing_index_total[1m])) > 10000 for: 5m labels: severity: warning annotations: summary: Elasticsearch High Indexing Rate (instance {{ $labels.instance }}) description: "The indexing rate on Elasticsearch cluster is higher than the threshold.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.18. Elasticsearch High Query Rate
The query rate on Elasticsearch cluster is higher than the threshold. [copy] - alert: ElasticsearchHighQueryRate expr: sum(rate(elasticsearch_indices_search_query_total[1m])) > 100 for: 5m labels: severity: warning annotations: summary: Elasticsearch High Query Rate (instance {{ $labels.instance }}) description: "The query rate on Elasticsearch cluster is higher than the threshold.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.11.19. Elasticsearch High Query Latency
The query latency on Elasticsearch cluster is higher than the threshold. [copy] - alert: ElasticsearchHighQueryLatency expr: increase(elasticsearch_indices_search_query_time_seconds[1m]) / increase(elasticsearch_indices_search_query_total[1m]) > 1 and increase(elasticsearch_indices_search_query_total[1m]) > 0 for: 5m labels: severity: warning annotations: summary: Elasticsearch High Query Latency (instance {{ $labels.instance }}) description: "The query latency on Elasticsearch cluster is higher than the threshold.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.12. Meilisearch : Embedded exporter (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/meilisearch/embedded-exporter.yml
-
# 2.12.1. Meilisearch index is empty
Meilisearch index {{ $labels.index }} has zero documents [copy] - alert: MeilisearchIndexIsEmpty expr: meilisearch_index_docs_count == 0 for: 0m labels: severity: warning annotations: summary: Meilisearch index is empty (instance {{ $labels.instance }}) description: "Meilisearch index {{ $labels.index }} has zero documents\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.12.2. Meilisearch http response time
Meilisearch http response time is too high [copy] - alert: MeilisearchHttpResponseTime expr: meilisearch_http_response_time_seconds > 0.5 for: 0m labels: severity: warning annotations: summary: Meilisearch http response time (instance {{ $labels.instance }}) description: "Meilisearch http response time is too high\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.13.1. Cassandra : instaclustr/cassandra-exporter (12 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/cassandra/instaclustr-cassandra-exporter.yml
-
# 2.13.1.1. Cassandra Node is unavailable
Cassandra Node is unavailable - {{ $labels.cassandra_cluster }} {{ $labels.exported_endpoint }} [copy] # 1m delay allows a restart without triggering an alert. - alert: CassandraNodeIsUnavailable expr: cassandra_endpoint_active < 1 for: 1m labels: severity: critical annotations: summary: Cassandra Node is unavailable (instance {{ $labels.instance }}) description: "Cassandra Node is unavailable - {{ $labels.cassandra_cluster }} {{ $labels.exported_endpoint }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.2. Cassandra many compaction tasks are pending
Many Cassandra compaction tasks are pending - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraManyCompactionTasksArePending expr: cassandra_table_estimated_pending_compactions > 100 for: 0m labels: severity: warning annotations: summary: Cassandra many compaction tasks are pending (instance {{ $labels.instance }}) description: "Many Cassandra compaction tasks are pending - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.3. Cassandra commitlog pending tasks
Cassandra commitlog pending tasks - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraCommitlogPendingTasks expr: cassandra_commit_log_pending_tasks > 15 for: 2m labels: severity: warning annotations: summary: Cassandra commitlog pending tasks (instance {{ $labels.instance }}) description: "Cassandra commitlog pending tasks - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.4. Cassandra compaction executor blocked tasks
Some Cassandra compaction executor tasks are blocked - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraCompactionExecutorBlockedTasks expr: cassandra_thread_pool_blocked_tasks{pool="CompactionExecutor"} > 15 for: 2m labels: severity: warning annotations: summary: Cassandra compaction executor blocked tasks (instance {{ $labels.instance }}) description: "Some Cassandra compaction executor tasks are blocked - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.5. Cassandra flush writer blocked tasks
Some Cassandra flush writer tasks are blocked - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraFlushWriterBlockedTasks expr: cassandra_thread_pool_blocked_tasks{pool="MemtableFlushWriter"} > 15 for: 2m labels: severity: warning annotations: summary: Cassandra flush writer blocked tasks (instance {{ $labels.instance }}) description: "Some Cassandra flush writer tasks are blocked - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.6. Cassandra connection timeouts total
Some connections between nodes are timing out - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraConnectionTimeoutsTotal expr: sum by (cassandra_cluster,instance) (rate(cassandra_client_request_timeouts_total[5m])) > 5 for: 2m labels: severity: critical annotations: summary: Cassandra connection timeouts total (instance {{ $labels.instance }}) description: "Some connections between nodes are timing out - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.7. Cassandra storage exceptions
Something is going wrong with Cassandra storage - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraStorageExceptions expr: changes(cassandra_storage_exceptions_total[1m]) > 1 for: 0m labels: severity: critical annotations: summary: Cassandra storage exceptions (instance {{ $labels.instance }}) description: "Something is going wrong with Cassandra storage - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.8. Cassandra tombstone dump
Cassandra tombstone dump - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraTombstoneDump expr: avg(cassandra_table_tombstones_scanned{quantile="0.99"}) by (instance,cassandra_cluster,keyspace) > 100 for: 2m labels: severity: critical annotations: summary: Cassandra tombstone dump (instance {{ $labels.instance }}) description: "Cassandra tombstone dump - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.9. Cassandra client request unavailable write
Some Cassandra client write requests are failing with unavailable exceptions - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraClientRequestUnavailableWrite expr: changes(cassandra_client_request_unavailable_exceptions_total{operation="write"}[1m]) > 0 for: 2m labels: severity: critical annotations: summary: Cassandra client request unavailable write (instance {{ $labels.instance }}) description: "Some Cassandra client write requests are failing with unavailable exceptions - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.10. Cassandra client request unavailable read
Some Cassandra client read requests are failing with unavailable exceptions - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraClientRequestUnavailableRead expr: changes(cassandra_client_request_unavailable_exceptions_total{operation="read"}[1m]) > 0 for: 2m labels: severity: critical annotations: summary: Cassandra client request unavailable read (instance {{ $labels.instance }}) description: "Some Cassandra client read requests are failing with unavailable exceptions - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.11. Cassandra client request write failure
Write failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraClientRequestWriteFailure expr: increase(cassandra_client_request_failures_total{operation="write"}[1m]) > 0 for: 2m labels: severity: critical annotations: summary: Cassandra client request write failure (instance {{ $labels.instance }}) description: "Write failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.1.12. Cassandra client request read failure
Read failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }} [copy] - alert: CassandraClientRequestReadFailure expr: increase(cassandra_client_request_failures_total{operation="read"}[1m]) > 0 for: 2m labels: severity: critical annotations: summary: Cassandra client request read failure (instance {{ $labels.instance }}) description: "Read failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.13.2. Cassandra : criteo/cassandra_exporter (18 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/cassandra/criteo-cassandra-exporter.yml
-
# 2.13.2.1. Cassandra hints count
Cassandra hints count has changed on {{ $labels.instance }}; some nodes may be down [copy] - alert: CassandraHintsCount expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:totalhints:count"}[1m]) > 3 for: 0m labels: severity: critical annotations: summary: Cassandra hints count (instance {{ $labels.instance }}) description: "Cassandra hints count has changed on {{ $labels.instance }}; some nodes may be down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
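`changes()` counts how often the sampled value differed from the previous sample inside the window (here, more than 3 changes in 1m means hints are actively accumulating). A rough illustration on raw samples:

```python
# Illustrative sketch of a changes()-style evaluation for CassandraHintsCount:
# count sample-to-sample value changes within the window, then apply "> 3".

def value_changes(samples):
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)

hints = [10, 10, 12, 15, 15, 18, 21]   # hint counter sampled over the window
print(value_changes(hints))            # 4 -> alert condition (> 3) met
```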
-
# 2.13.2.2. Cassandra compaction task pending
Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster. [copy] - alert: CassandraCompactionTaskPending expr: cassandra_stats{name="org:apache:cassandra:metrics:compaction:pendingtasks:value"} > 100 for: 2m labels: severity: warning annotations: summary: Cassandra compaction task pending (instance {{ $labels.instance }}) description: "Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.3. Cassandra viewwrite latency
High viewwrite latency on {{ $labels.instance }} cassandra node [copy] - alert: CassandraViewwriteLatency expr: cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:viewwrite:viewwritelatency:99thpercentile"} > 100000 for: 2m labels: severity: warning annotations: summary: Cassandra viewwrite latency (instance {{ $labels.instance }}) description: "High viewwrite latency on {{ $labels.instance }} cassandra node\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.4. Cassandra authentication failures
Increase of Cassandra authentication failures [copy] - alert: CassandraAuthenticationFailures expr: rate(cassandra_stats{name="org:apache:cassandra:metrics:client:authfailure:count"}[1m]) > 5 for: 2m labels: severity: warning annotations: summary: Cassandra authentication failures (instance {{ $labels.instance }}) description: "Increase of Cassandra authentication failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.5. Cassandra node down
Cassandra node down [copy] # 1m delay allows a restart without triggering an alert. - alert: CassandraNodeDown expr: sum(cassandra_stats{name="org:apache:cassandra:net:failuredetector:downendpointcount"}) by (service,group,cluster,env) > 0 for: 1m labels: severity: critical annotations: summary: Cassandra node down (instance {{ $labels.instance }}) description: "Cassandra node down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.6. Cassandra commitlog pending tasks
Unexpected number of Cassandra commitlog pending tasks [copy] - alert: CassandraCommitlogPendingTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:commitlog:pendingtasks:value"} > 15 for: 2m labels: severity: warning annotations: summary: Cassandra commitlog pending tasks (instance {{ $labels.instance }}) description: "Unexpected number of Cassandra commitlog pending tasks\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.7. Cassandra compaction executor blocked tasks
Some Cassandra compaction executor tasks are blocked [copy] - alert: CassandraCompactionExecutorBlockedTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:compactionexecutor:currentlyblockedtasks:count"} > 0 for: 2m labels: severity: warning annotations: summary: Cassandra compaction executor blocked tasks (instance {{ $labels.instance }}) description: "Some Cassandra compaction executor tasks are blocked\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.8. Cassandra flush writer blocked tasks
Some Cassandra flush writer tasks are blocked [copy] - alert: CassandraFlushWriterBlockedTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:memtableflushwriter:currentlyblockedtasks:count"} > 0 for: 2m labels: severity: warning annotations: summary: Cassandra flush writer blocked tasks (instance {{ $labels.instance }}) description: "Some Cassandra flush writer tasks are blocked\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.9. Cassandra repair pending tasks
Some Cassandra repair tasks are pending [copy] - alert: CassandraRepairPendingTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:pendingtasks:value"} > 2 for: 2m labels: severity: warning annotations: summary: Cassandra repair pending tasks (instance {{ $labels.instance }}) description: "Some Cassandra repair tasks are pending\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.10. Cassandra repair blocked tasks
Some Cassandra repair tasks are blocked [copy] - alert: CassandraRepairBlockedTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:currentlyblockedtasks:count"} > 0 for: 2m labels: severity: warning annotations: summary: Cassandra repair blocked tasks (instance {{ $labels.instance }}) description: "Some Cassandra repair tasks are blocked\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.11. Cassandra connection timeouts total
Some connections between nodes are timing out [copy] - alert: CassandraConnectionTimeoutsTotal expr: rate(cassandra_stats{name="org:apache:cassandra:metrics:connection:totaltimeouts:count"}[1m]) > 5 for: 2m labels: severity: critical annotations: summary: Cassandra connection timeouts total (instance {{ $labels.instance }}) description: "Some connections between nodes are timing out\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.12. Cassandra storage exceptions
Something is going wrong with Cassandra storage [copy] - alert: CassandraStorageExceptions expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:exceptions:count"}[1m]) > 1 for: 0m labels: severity: critical annotations: summary: Cassandra storage exceptions (instance {{ $labels.instance }}) description: "Something is going wrong with Cassandra storage\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.13. Cassandra tombstone dump
Too many tombstones scanned in queries [copy] - alert: CassandraTombstoneDump expr: cassandra_stats{name="org:apache:cassandra:metrics:table:tombstonescannedhistogram:99thpercentile"} > 1000 for: 0m labels: severity: critical annotations: summary: Cassandra tombstone dump (instance {{ $labels.instance }}) description: "Too many tombstones scanned in queries\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.14. Cassandra client request unavailable write
Write failures have occurred because too many nodes are unavailable [copy] - alert: CassandraClientRequestUnavailableWrite expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:unavailables:count"}[1m]) > 0 for: 0m labels: severity: critical annotations: summary: Cassandra client request unavailable write (instance {{ $labels.instance }}) description: "Write failures have occurred because too many nodes are unavailable\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.15. Cassandra client request unavailable read
Read failures have occurred because too many nodes are unavailable [copy] - alert: CassandraClientRequestUnavailableRead expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:unavailables:count"}[1m]) > 0 for: 0m labels: severity: critical annotations: summary: Cassandra client request unavailable read (instance {{ $labels.instance }}) description: "Read failures have occurred because too many nodes are unavailable\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.16. Cassandra client request write failure
A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large. [copy] - alert: CassandraClientRequestWriteFailure expr: cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:failures:oneminuterate"} > 0 for: 0m labels: severity: critical annotations: summary: Cassandra client request write failure (instance {{ $labels.instance }}) description: "A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.17. Cassandra client request read failure
A lot of read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large. [copy] - alert: CassandraClientRequestReadFailure expr: cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:failures:oneminuterate"} > 0 for: 0m labels: severity: critical annotations: summary: Cassandra client request read failure (instance {{ $labels.instance }}) description: "A lot of read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.13.2.18. Cassandra cache hit rate key cache
Key cache hit rate is below 85% [copy] - alert: CassandraCacheHitRateKeyCache expr: cassandra_stats{name="org:apache:cassandra:metrics:cache:keycache:hitrate:value"} < .85 for: 2m labels: severity: critical annotations: summary: Cassandra cache hit rate key cache (instance {{ $labels.instance }}) description: "Key cache hit rate is below 85%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.14. Clickhouse : Embedded Exporter (19 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/clickhouse/embedded-exporter.yml-
# 2.14.1. ClickHouse node down
No metrics received from ClickHouse exporter for over 2 minutes. [copy] # Adjust the job label to match your Prometheus configuration. - alert: ClickhouseNodeDown expr: up{job="clickhouse"} == 0 for: 2m labels: severity: critical annotations: summary: ClickHouse node down (instance {{ $labels.instance }}) description: "No metrics received from ClickHouse exporter for over 2 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.2. ClickHouse Memory Usage Critical
Memory usage is critically high, over 90%. [copy] - alert: ClickhouseMemoryUsageCritical expr: ClickHouseAsyncMetrics_CGroupMemoryUsed / ClickHouseAsyncMetrics_CGroupMemoryTotal * 100 > 90 and ClickHouseAsyncMetrics_CGroupMemoryTotal > 0 for: 5m labels: severity: critical annotations: summary: ClickHouse Memory Usage Critical (instance {{ $labels.instance }}) description: "Memory usage is critically high, over 90%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
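This rule and the warning rule below share the same ratio with different thresholds; a minimal Python sketch of the classification, including the `and ... CGroupMemoryTotal > 0` guard against division by zero (the function name and shape are illustrative, not part of ClickHouse):

```python
def memory_severity(used_bytes, total_bytes):
    """Mirror the two ClickHouse memory rules: >90% critical, >80% warning.

    Returns None when total is 0, matching the `and ... Total > 0` guard
    that keeps the PromQL expression from dividing by zero.
    """
    if total_bytes <= 0:
        return None
    pct = used_bytes / total_bytes * 100
    if pct > 90:
        return "critical"
    if pct > 80:
        return "warning"
    return None
```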
-
# 2.14.3. ClickHouse Memory Usage Warning
Memory usage is over 80%. [copy] - alert: ClickhouseMemoryUsageWarning expr: ClickHouseAsyncMetrics_CGroupMemoryUsed / ClickHouseAsyncMetrics_CGroupMemoryTotal * 100 > 80 and ClickHouseAsyncMetrics_CGroupMemoryTotal > 0 for: 5m labels: severity: warning annotations: summary: ClickHouse Memory Usage Warning (instance {{ $labels.instance }}) description: "Memory usage is over 80%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.4. ClickHouse Disk Space Low on Default
Disk space on default is below 20%. [copy] - alert: ClickhouseDiskSpaceLowOnDefault expr: ClickHouseAsyncMetrics_DiskAvailable_default / (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) * 100 < 20 for: 2m labels: severity: warning annotations: summary: ClickHouse Disk Space Low on Default (instance {{ $labels.instance }}) description: "Disk space on default is below 20%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.5. ClickHouse Disk Space Critical on Default
Disk space on default disk is critically low, below 10%. [copy] - alert: ClickhouseDiskSpaceCriticalOnDefault expr: ClickHouseAsyncMetrics_DiskAvailable_default / (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) * 100 < 10 for: 2m labels: severity: critical annotations: summary: ClickHouse Disk Space Critical on Default (instance {{ $labels.instance }}) description: "Disk space on default disk is critically low, below 10%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.6. ClickHouse Disk Space Low on Backups
Disk space on backups is below 20%. [copy] - alert: ClickhouseDiskSpaceLowOnBackups expr: ClickHouseAsyncMetrics_DiskAvailable_backups / (ClickHouseAsyncMetrics_DiskAvailable_backups + ClickHouseAsyncMetrics_DiskUsed_backups) * 100 < 20 for: 2m labels: severity: warning annotations: summary: ClickHouse Disk Space Low on Backups (instance {{ $labels.instance }}) description: "Disk space on backups is below 20%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.7. ClickHouse Replica Errors
Critical replica errors detected, either all replicas are stale or lost. [copy] - alert: ClickhouseReplicaErrors expr: ClickHouseErrorMetric_ALL_REPLICAS_ARE_STALE == 1 or ClickHouseErrorMetric_ALL_REPLICAS_LOST == 1 for: 0m labels: severity: critical annotations: summary: ClickHouse Replica Errors (instance {{ $labels.instance }}) description: "Critical replica errors detected, either all replicas are stale or lost.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.8. ClickHouse No Available Replicas
No available replicas in ClickHouse. [copy] - alert: ClickhouseNoAvailableReplicas expr: ClickHouseErrorMetric_NO_AVAILABLE_REPLICA == 1 for: 0m labels: severity: critical annotations: summary: ClickHouse No Available Replicas (instance {{ $labels.instance }}) description: "No available replicas in ClickHouse.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.9. ClickHouse No Live Replicas
There are too few live replicas available, risking data loss and service disruption. [copy] - alert: ClickhouseNoLiveReplicas expr: ClickHouseErrorMetric_TOO_FEW_LIVE_REPLICAS == 1 for: 0m labels: severity: critical annotations: summary: ClickHouse No Live Replicas (instance {{ $labels.instance }}) description: "There are too few live replicas available, risking data loss and service disruption.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.10. ClickHouse High TCP Connections
High number of TCP connections, indicating heavy client or inter-cluster communication. [copy] # Please replace the threshold with an appropriate value - alert: ClickhouseHighTcpConnections expr: ClickHouseMetrics_TCPConnection > 400 for: 5m labels: severity: warning annotations: summary: ClickHouse High TCP Connections (instance {{ $labels.instance }}) description: "High number of TCP connections, indicating heavy client or inter-cluster communication.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.11. ClickHouse Interserver Connection Issues
High number of interserver connections may indicate replication or distributed query handling issues. [copy] # Adjust the threshold based on your cluster size and expected replication traffic. - alert: ClickhouseInterserverConnectionIssues expr: ClickHouseMetrics_InterserverConnection > 50 for: 5m labels: severity: warning annotations: summary: ClickHouse Interserver Connection Issues (instance {{ $labels.instance }}) description: "High number of interserver connections may indicate replication or distributed query handling issues.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.12. ClickHouse ZooKeeper Connection Issues
ClickHouse is experiencing issues with ZooKeeper connections, which may affect cluster state and coordination. [copy] - alert: ClickhouseZookeeperConnectionIssues expr: ClickHouseMetrics_ZooKeeperSession != 1 for: 3m labels: severity: warning annotations: summary: ClickHouse ZooKeeper Connection Issues (instance {{ $labels.instance }}) description: "ClickHouse is experiencing issues with ZooKeeper connections, which may affect cluster state and coordination.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.13. ClickHouse Authentication Failures
Authentication failures detected, indicating potential security issues or misconfiguration. [copy] - alert: ClickhouseAuthenticationFailures expr: increase(ClickHouseErrorMetric_AUTHENTICATION_FAILED[5m]) > 0 for: 0m labels: severity: info annotations: summary: ClickHouse Authentication Failures (instance {{ $labels.instance }}) description: "Authentication failures detected, indicating potential security issues or misconfiguration.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
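The `increase(...[5m])` calls used in this rule and the next compute a counter delta that tolerates counter resets. A rough sketch of that behaviour over ordered samples (Prometheus additionally extrapolates to the window edges, which this sketch omits):

```python
def counter_increase(samples):
    """Approximate increase() over ordered counter samples.

    A drop between consecutive samples is treated as a counter reset,
    so the running total only accumulates non-negative increments.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # On a reset, the counter restarted from 0, so `cur` is the increment.
        total += cur - prev if cur >= prev else cur
    return total
```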
-
# 2.14.14. ClickHouse Access Denied Errors
Access denied errors have been logged, which could indicate permission issues or unauthorized access attempts. [copy] - alert: ClickhouseAccessDeniedErrors expr: increase(ClickHouseErrorMetric_RESOURCE_ACCESS_DENIED[5m]) > 0 for: 0m labels: severity: info annotations: summary: ClickHouse Access Denied Errors (instance {{ $labels.instance }}) description: "Access denied errors have been logged, which could indicate permission issues or unauthorized access attempts.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.15. ClickHouse rejected insert queries
INSERTs rejected due to too many active data parts. Reduce insert frequency. [copy] - alert: ClickhouseRejectedInsertQueries expr: increase(ClickHouseProfileEvents_RejectedInserts[1m]) > 0 for: 1m labels: severity: warning annotations: summary: ClickHouse rejected insert queries (instance {{ $labels.instance }}) description: "INSERTs rejected due to too many active data parts. Reduce insert frequency.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.16. ClickHouse delayed insert queries
INSERTs delayed due to high number of active parts. [copy] - alert: ClickhouseDelayedInsertQueries expr: increase(ClickHouseProfileEvents_DelayedInserts[5m]) > 0 for: 2m labels: severity: warning annotations: summary: ClickHouse delayed insert queries (instance {{ $labels.instance }}) description: "INSERTs delayed due to high number of active parts.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.17. ClickHouse zookeeper hardware exception
Zookeeper hardware exception: network issues communicating with ZooKeeper [copy] - alert: ClickhouseZookeeperHardwareException expr: increase(ClickHouseProfileEvents_ZooKeeperHardwareExceptions[1m]) > 0 for: 1m labels: severity: critical annotations: summary: ClickHouse zookeeper hardware exception (instance {{ $labels.instance }}) description: "Zookeeper hardware exception: network issues communicating with ZooKeeper\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.18. ClickHouse high network usage
High network usage. ClickHouse network usage exceeds 100MB/s. [copy] # Please replace the threshold with an appropriate value - alert: ClickhouseHighNetworkUsage expr: rate(ClickHouseProfileEvents_NetworkSendBytes[1m]) > 100*1024*1024 or rate(ClickHouseProfileEvents_NetworkReceiveBytes[1m]) > 100*1024*1024 for: 2m labels: severity: warning annotations: summary: ClickHouse high network usage (instance {{ $labels.instance }}) description: "High network usage. ClickHouse network usage exceeds 100MB/s.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.14.19. ClickHouse distributed rejected inserts
INSERTs into Distributed tables rejected due to pending bytes limit. [copy] - alert: ClickhouseDistributedRejectedInserts expr: increase(ClickHouseProfileEvents_DistributedRejectedInserts[5m]) > 0 for: 2m labels: severity: critical annotations: summary: ClickHouse distributed rejected inserts (instance {{ $labels.instance }}) description: "INSERTs into Distributed tables rejected due to pending bytes limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.15. CouchDB : gesellix/couchdb-prometheus-exporter (18 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/couchdb/gesellix-couchdb-prometheus-exporter.yml-
# 2.15.1. CouchDB node down
CouchDB node is not responding (node_up metric is 0) for more than 2 minutes [copy] - alert: CouchdbNodeDown expr: couchdb_httpd_node_up == 0 or couchdb_httpd_up == 0 for: 2m labels: severity: critical annotations: summary: CouchDB node down (instance {{ $labels.instance }}) description: "CouchDB node is not responding (node_up metric is 0) for more than 2 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.2. CouchDB atom memory usage critical
Atom memory usage is above 90% of limit [copy] - alert: CouchdbAtomMemoryUsageCritical expr: couchdb_erlang_memory_atom_used > 0.9 * couchdb_erlang_memory_atom for: 5m labels: severity: critical annotations: summary: CouchDB atom memory usage critical (instance {{ $labels.instance }}) description: "Atom memory usage is above 90% of limit\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.3. CouchDB open databases critical
Number of open databases exceeds 90% of node capacity [copy] - alert: CouchdbOpenDatabasesCritical expr: couchdb_httpd_open_databases > 0.9 * 1000 for: 5m labels: severity: critical annotations: summary: CouchDB open databases critical (instance {{ $labels.instance }}) description: "Number of open databases exceeds 90% of node capacity\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.4. CouchDB open OS files critical
CouchDB is using more than 90% of allowed OS file descriptors, may fail to open new files [copy] - alert: CouchdbOpenOsFilesCritical expr: couchdb_httpd_open_os_files > 0.9 * 65535 for: 5m labels: severity: critical annotations: summary: CouchDB open OS files critical (instance {{ $labels.instance }}) description: "CouchDB is using more than 90% of allowed OS file descriptors, may fail to open new files\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.5. CouchDB 5xx error ratio high
More than 5% of HTTP requests are returning 5xx errors [copy] - alert: Couchdb5xxErrorRatioHigh expr: rate(couchdb_httpd_status_codes{code=~"5.."}[5m]) / rate(couchdb_httpd_requests[5m]) > 0.05 and rate(couchdb_httpd_requests[5m]) > 0 for: 5m labels: severity: critical annotations: summary: CouchDB 5xx error ratio high (instance {{ $labels.instance }}) description: "More than 5% of HTTP requests are returning 5xx errors\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
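The `and rate(couchdb_httpd_requests[5m]) > 0` clause keeps the ratio from firing (or dividing by zero) when there is no traffic at all. A small Python sketch of the same guard, with hypothetical function names:

```python
def error_ratio(err_rate, req_rate):
    """5xx ratio with the same guard as the rule: no traffic, no ratio."""
    if req_rate <= 0:
        return None
    return err_rate / req_rate


def should_alert(err_rate, req_rate, threshold=0.05):
    """Fire only when traffic exists and the error ratio exceeds 5%."""
    r = error_ratio(err_rate, req_rate)
    return r is not None and r > threshold
```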
-
# 2.15.6. CouchDB temporary view read rate critical
Temporary view read rate exceeds 100 reads/sec, high risk of performance degradation [copy] - alert: CouchdbTemporaryViewReadRateCritical expr: rate(couchdb_httpd_temporary_view_reads[5m]) > 100 for: 5m labels: severity: critical annotations: summary: CouchDB temporary view read rate critical (instance {{ $labels.instance }}) description: "Temporary view read rate exceeds 100 reads/sec, high risk of performance degradation\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.7. CouchDB Mango queries scanning too many docs
Some Mango queries are scanning too many documents, consider adding indexes [copy] - alert: CouchdbMangoQueriesScanningTooManyDocs expr: rate(couchdb_mango_too_many_docs_scanned[5m]) > 50 for: 5m labels: severity: warning annotations: summary: CouchDB Mango queries scanning too many docs (instance {{ $labels.instance }}) description: "Some Mango queries are scanning too many documents, consider adding indexes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.8. CouchDB Mango queries failed due to invalid index
Some Mango queries failed to execute because the index was missing or invalid [copy] - alert: CouchdbMangoQueriesFailedDueToInvalidIndex expr: rate(couchdb_mango_query_invalid_index[5m]) > 5 for: 5m labels: severity: warning annotations: summary: CouchDB Mango queries failed due to invalid index (instance {{ $labels.instance }}) description: "Some Mango queries failed to execute because the index was missing or invalid\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.9. CouchDB Mango docs examined high
High number of documents examined per Mango query, consider indexing [copy] - alert: CouchdbMangoDocsExaminedHigh expr: rate(couchdb_mango_docs_examined[5m]) > 1000 for: 5m labels: severity: warning annotations: summary: CouchDB Mango docs examined high (instance {{ $labels.instance }}) description: "High number of documents examined per Mango query, consider indexing\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.10. CouchDB Replicator manager died
Replication manager process has crashed [copy] - alert: CouchdbReplicatorManagerDied expr: increase(couchdb_replicator_changes_manager_deaths[5m]) > 0 for: 1m labels: severity: critical annotations: summary: CouchDB Replicator manager died (instance {{ $labels.instance }}) description: "Replication manager process has crashed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.11. CouchDB Replicator queue process died
Replication queue process has crashed [copy] - alert: CouchdbReplicatorQueueProcessDied expr: increase(couchdb_replicator_changes_queue_deaths[5m]) > 0 for: 1m labels: severity: critical annotations: summary: CouchDB Replicator queue process died (instance {{ $labels.instance }}) description: "Replication queue process has crashed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.12. CouchDB Replicator reader process died
Replication reader process has crashed [copy] - alert: CouchdbReplicatorReaderProcessDied expr: increase(couchdb_replicator_changes_reader_deaths[5m]) > 0 for: 1m labels: severity: critical annotations: summary: CouchDB Replicator reader process died (instance {{ $labels.instance }}) description: "Replication reader process has crashed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.13. CouchDB Replicator failed to start
One or more replication tasks failed to start [copy] - alert: CouchdbReplicatorFailedToStart expr: increase(couchdb_replicator_failed_starts[5m]) > 0 for: 1m labels: severity: critical annotations: summary: CouchDB Replicator failed to start (instance {{ $labels.instance }}) description: "One or more replication tasks failed to start\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.14. CouchDB replication cluster unstable
The replication cluster is unstable, replication may be interrupted [copy] - alert: CouchdbReplicationClusterUnstable expr: couchdb_replicator_cluster_is_stable == 0 for: 2m labels: severity: critical annotations: summary: CouchDB replication cluster unstable (instance {{ $labels.instance }}) description: "The replication cluster is unstable, replication may be interrupted\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.15. CouchDB replication read failures
Replication changes feed has failed reads more than 5 times in 5 minutes [copy] - alert: CouchdbReplicationReadFailures expr: increase(couchdb_replicator_changes_read_failures[5m]) > 5 for: 5m labels: severity: warning annotations: summary: CouchDB replication read failures (instance {{ $labels.instance }}) description: "Replication changes feed has failed reads more than 5 times in 5 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.16. CouchDB file descriptors high
Process is using more than 85% of allowed file descriptors [copy] - alert: CouchdbFileDescriptorsHigh expr: process_open_fds / process_max_fds > 0.85 for: 5m labels: severity: warning annotations: summary: CouchDB file descriptors high (instance {{ $labels.instance }}) description: "Process is using more than 85% of allowed file descriptors\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.17. CouchDB process restarted
CouchDB process has restarted recently [copy] - alert: CouchdbProcessRestarted expr: changes(process_start_time_seconds[1h]) > 0 for: 1m labels: severity: info annotations: summary: CouchDB process restarted (instance {{ $labels.instance }}) description: "CouchDB process has restarted recently\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.15.18. CouchDB critical log entries
Critical or error log entries detected in the last 5 minutes [copy] - alert: CouchdbCriticalLogEntries expr: increase(couchdb_server_couch_log{level=~"error|critical"}[5m]) > 0 for: 1m labels: severity: critical annotations: summary: CouchDB critical log entries (instance {{ $labels.instance }}) description: "Critical or error log entries detected in the last 5 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.16.1. Zookeeper : cloudflare/kafka_zookeeper_exporter
// @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
-
# 2.16.2. Zookeeper : dabealu/zookeeper-exporter (4 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/zookeeper/dabealu-zookeeper-exporter.yml-
# 2.16.2.1. Zookeeper Down
Zookeeper down on instance {{ $labels.instance }} [copy] # 1m delay allows a restart without triggering an alert. - alert: ZookeeperDown expr: zk_up == 0 for: 1m labels: severity: critical annotations: summary: Zookeeper Down (instance {{ $labels.instance }}) description: "Zookeeper down on instance {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.16.2.2. Zookeeper missing leader
Zookeeper cluster has no node marked as leader [copy] - alert: ZookeeperMissingLeader expr: sum(zk_server_leader) == 0 for: 0m labels: severity: critical annotations: summary: Zookeeper missing leader (instance {{ $labels.instance }}) description: "Zookeeper cluster has no node marked as leader\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.16.2.3. Zookeeper Too Many Leaders
Zookeeper cluster has too many nodes marked as leader [copy] - alert: ZookeeperTooManyLeaders expr: sum(zk_server_leader) > 1 for: 0m labels: severity: critical annotations: summary: Zookeeper Too Many Leaders (instance {{ $labels.instance }}) description: "Zookeeper cluster has too many nodes marked as leader\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
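Taken together, this rule and the previous one assert that `sum(zk_server_leader)` is exactly 1. A sketch of that check, under the assumption that `zk_server_leader` is 1 on the leader and 0 elsewhere:

```python
def leader_status(leader_flags):
    """Classify a ZooKeeper ensemble by its number of leaders."""
    n = sum(leader_flags)
    if n == 0:
        return "missing leader"  # ZookeeperMissingLeader
    if n > 1:
        return "too many leaders"  # ZookeeperTooManyLeaders
    return "ok"
```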
-
# 2.16.2.4. Zookeeper Not Ok
Zookeeper instance is not ok [copy] - alert: ZookeeperNotOk expr: zk_ruok == 0 for: 3m labels: severity: warning annotations: summary: Zookeeper Not Ok (instance {{ $labels.instance }}) description: "Zookeeper instance is not ok\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.17.1. Kafka : danielqsj/kafka_exporter (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/kafka/danielqsj-kafka-exporter.yml-
# 2.17.1.1. Kafka topics replicas
Kafka topic has fewer than 3 in-sync replicas [copy] - alert: KafkaTopicsReplicas expr: min(kafka_topic_partition_in_sync_replica) by (topic) < 3 for: 0m labels: severity: critical annotations: summary: Kafka topics replicas (instance {{ $labels.instance }}) description: "Kafka topic has fewer than 3 in-sync replicas\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
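`min(...) by (topic)` takes the worst partition of each topic, so one under-replicated partition is enough to trigger. A sketch of that aggregation over a hypothetical per-topic map of partition ISR counts:

```python
def under_replicated_topics(isr_by_topic, min_isr=3):
    """Topics whose worst-partition in-sync replica count is below min_isr.

    `isr_by_topic` maps topic name -> list of per-partition ISR counts,
    mirroring min(kafka_topic_partition_in_sync_replica) by (topic) < 3.
    """
    return [topic for topic, isr in isr_by_topic.items() if min(isr) < min_isr]
```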
-
# 2.17.1.2. Kafka consumer group lag
Kafka consumer group {{ $labels.consumergroup }} is lagging behind ({{ $value }} messages) [copy] - alert: KafkaConsumerGroupLag expr: sum(kafka_consumergroup_lag) by (consumergroup) > 10000 for: 1m labels: severity: warning annotations: summary: Kafka consumer group lag (instance {{ $labels.instance }}) description: "Kafka consumer group {{ $labels.consumergroup }} is lagging behind ({{ $value }} messages)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.17.2. Kafka : linkedin/Burrow (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/kafka/linkedin-kafka-exporter.yml-
# 2.17.2.1. Kafka topic offset decreased
Kafka topic offset has decreased [copy] - alert: KafkaTopicOffsetDecreased expr: delta(kafka_burrow_partition_current_offset[1m]) < 0 for: 0m labels: severity: warning annotations: summary: Kafka topic offset decreased (instance {{ $labels.instance }}) description: "Kafka topic offset has decreased\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.17.2.2. Kafka consumer lag
Kafka consumer has a 30-minute and increasing lag [copy] - alert: KafkaConsumerLag expr: kafka_burrow_topic_partition_offset - on(partition, cluster, topic) group_right() kafka_burrow_partition_current_offset >= (kafka_burrow_topic_partition_offset offset 15m - on(partition, cluster, topic) group_right() kafka_burrow_partition_current_offset offset 15m) AND kafka_burrow_topic_partition_offset - on(partition, cluster, topic) group_right() kafka_burrow_partition_current_offset > 0 for: 15m labels: severity: warning annotations: summary: Kafka consumer lag (instance {{ $labels.instance }}) description: "Kafka consumer has a 30-minute and increasing lag\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
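The expression compares the current lag (log-end offset minus committed offset) with its value 15 minutes earlier, and the `for: 15m` clause adds another 15 minutes of sustained violation, hence the 30-minute wording. A hypothetical sketch of the per-partition comparison:

```python
def lag_is_growing(end_now, committed_now, end_then, committed_then):
    """Fire when lag is positive and has not shrunk since the earlier sample."""
    lag_now = end_now - committed_now
    lag_then = end_then - committed_then
    return lag_now >= lag_then and lag_now > 0
```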
-
-
# 2.18. Pulsar : embedded exporter (10 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/pulsar/embedded-exporter.yml-
# 2.18.1. Pulsar subscription high number of backlog entries
The number of subscription backlog entries is over 5k [copy] - alert: PulsarSubscriptionHighNumberOfBacklogEntries expr: sum(pulsar_subscription_back_log) by (subscription) > 5000 for: 1h labels: severity: warning annotations: summary: Pulsar subscription high number of backlog entries (instance {{ $labels.instance }}) description: "The number of subscription backlog entries is over 5k\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.18.2. Pulsar subscription very high number of backlog entries
The number of subscription backlog entries is over 100k [copy] - alert: PulsarSubscriptionVeryHighNumberOfBacklogEntries expr: sum(pulsar_subscription_back_log) by (subscription) > 100000 for: 1h labels: severity: critical annotations: summary: Pulsar subscription very high number of backlog entries (instance {{ $labels.instance }}) description: "The number of subscription backlog entries is over 100k\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.18.3. Pulsar topic large backlog storage size
The topic backlog storage size is over 5 GB [copy] - alert: PulsarTopicLargeBacklogStorageSize expr: sum(pulsar_storage_size) by (topic) > 5*1024*1024*1024 for: 1h labels: severity: warning annotations: summary: Pulsar topic large backlog storage size (instance {{ $labels.instance }}) description: "The topic backlog storage size is over 5 GB\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.18.4. Pulsar topic very large backlog storage size
The topic backlog storage size is over 20 GB [copy] - alert: PulsarTopicVeryLargeBacklogStorageSize expr: sum(pulsar_storage_size) by (topic) > 20*1024*1024*1024 for: 1h labels: severity: critical annotations: summary: Pulsar topic very large backlog storage size (instance {{ $labels.instance }}) description: "The topic backlog storage size is over 20 GB\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
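The two storage-size thresholds in this rule and the previous one are plain byte counts written as `n*1024*1024*1024`. A short sketch of the same classification (names are illustrative):

```python
WARNING_BYTES = 5 * 1024 ** 3    # 5 GiB, PulsarTopicLargeBacklogStorageSize
CRITICAL_BYTES = 20 * 1024 ** 3  # 20 GiB, PulsarTopicVeryLargeBacklogStorageSize


def backlog_severity(storage_bytes):
    """Map a topic's backlog storage size to the matching rule severity."""
    if storage_bytes > CRITICAL_BYTES:
        return "critical"
    if storage_bytes > WARNING_BYTES:
        return "warning"
    return None
```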
-
# 2.18.5. Pulsar high write latency
Messages cannot be written in a timely fashion [copy] - alert: PulsarHighWriteLatency expr: sum(pulsar_storage_write_latency_overflow > 0) by (topic) for: 1h labels: severity: critical annotations: summary: Pulsar high write latency (instance {{ $labels.instance }}) description: "Messages cannot be written in a timely fashion\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.18.6. Pulsar large message payload
Observing large message payload (> 1MB) [copy] - alert: PulsarLargeMessagePayload expr: sum(pulsar_entry_size_overflow > 0) by (topic) for: 1h labels: severity: warning annotations: summary: Pulsar large message payload (instance {{ $labels.instance }}) description: "Observing large message payload (> 1MB)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.18.7. Pulsar high ledger disk usage
Observing Ledger Disk Usage (> 75%) [copy] - alert: PulsarHighLedgerDiskUsage expr: sum(bookie_ledger_dir__pulsar_data_bookkeeper_ledgers_usage) by (kubernetes_pod_name) > 75 for: 1h labels: severity: critical annotations: summary: Pulsar high ledger disk usage (instance {{ $labels.instance }}) description: "Observing Ledger Disk Usage (> 75%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.18.8. Pulsar read only bookies
Observing Readonly Bookies [copy] - alert: PulsarReadOnlyBookies expr: count(bookie_SERVER_STATUS{} == 0) by (pod) for: 5m labels: severity: critical annotations: summary: Pulsar read only bookies (instance {{ $labels.instance }}) description: "Observing Readonly Bookies\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.18.9. Pulsar high number of function errors
Observing more than 10 Function errors per minute [copy] - alert: PulsarHighNumberOfFunctionErrors expr: sum(rate(pulsar_function_user_exceptions_total[1m]) + rate(pulsar_function_system_exceptions_total[1m])) by (name) > 10 for: 1m labels: severity: critical annotations: summary: Pulsar high number of function errors (instance {{ $labels.instance }}) description: "Observing more than 10 Function errors per minute\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.18.10. Pulsar high number of sink errors
Observing more than 10 Sink errors per minute [copy] - alert: PulsarHighNumberOfSinkErrors expr: sum(rate(pulsar_sink_sink_exceptions_total[1m])) by (name) > 10 for: 1m labels: severity: critical annotations: summary: Pulsar high number of sink errors (instance {{ $labels.instance }}) description: "Observing more than 10 Sink errors per minute\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.19. Nats : nats-io/prometheus-nats-exporter (13 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/nats/nats-exporter.yml-
# 2.19.1. Nats high routes count
High number of NATS routes ({{ $value }}) for {{ $labels.instance }} [copy] - alert: NatsHighRoutesCount expr: gnatsd_varz_routes > 10 for: 3m labels: severity: warning annotations: summary: Nats high routes count (instance {{ $labels.instance }}) description: "High number of NATS routes ({{ $value }}) for {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.2. Nats high memory usage
NATS server memory usage is above 200MB for {{ $labels.instance }} [copy] - alert: NatsHighMemoryUsage expr: gnatsd_varz_mem > 200 * 1024 * 1024 for: 5m labels: severity: warning annotations: summary: Nats high memory usage (instance {{ $labels.instance }}) description: "NATS server memory usage is above 200MB for {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.3. Nats slow consumers
There are slow consumers in NATS for {{ $labels.instance }} [copy] - alert: NatsSlowConsumers expr: gnatsd_varz_slow_consumers > 0 for: 3m labels: severity: critical annotations: summary: Nats slow consumers (instance {{ $labels.instance }}) description: "There are slow consumers in NATS for {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.4. Nats server down
NATS server has been down for more than 5 minutes [copy] - alert: NatsServerDown expr: absent(up{job="nats"}) for: 5m labels: severity: critical annotations: summary: Nats server down (instance {{ $labels.instance }}) description: "NATS server has been down for more than 5 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.5. Nats high CPU usage
NATS server is using more than 80% CPU for the last 5 minutes [copy] # gnatsd_varz_cpu is a gauge reporting CPU percentage (0-100 scale). - alert: NatsHighCpuUsage expr: gnatsd_varz_cpu > 80 for: 5m labels: severity: warning annotations: summary: Nats high CPU usage (instance {{ $labels.instance }}) description: "NATS server is using more than 80% CPU for the last 5 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.6. Nats high number of connections
NATS server has more than 1000 active connections [copy] - alert: NatsHighNumberOfConnections expr: gnatsd_connz_num_connections > 1000 for: 5m labels: severity: warning annotations: summary: Nats high number of connections (instance {{ $labels.instance }}) description: "NATS server has more than 1000 active connections\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.7. Nats high JetStream store usage
JetStream store usage is over 80% [copy] - alert: NatsHighJetstreamStoreUsage expr: gnatsd_varz_jetstream_stats_storage / gnatsd_varz_jetstream_config_max_storage > 0.8 and gnatsd_varz_jetstream_config_max_storage > 0 for: 5m labels: severity: warning annotations: summary: Nats high JetStream store usage (instance {{ $labels.instance }}) description: "JetStream store usage is over 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
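The JetStream store/memory rules guard the division with `and ..._config_max_storage > 0`: in PromQL, `and` keeps left-hand samples only where a matching right-hand sample exists, so instances with no configured limit produce no sample at all instead of a bogus ratio. A sketch of that evaluation logic (instance names and values are illustrative):

```python
# Mimic the guarded ratio in NatsHighJetstreamStoreUsage:
#   usage / max > 0.8 and max > 0
# The guard drops instances whose configured limit is zero, so they
# never divide by zero and never alert.
def store_usage_alerts(usage_by_instance, max_by_instance, threshold=0.8):
    alerts = {}
    for instance, used in usage_by_instance.items():
        limit = max_by_instance.get(instance, 0)
        if limit > 0 and used / limit > threshold:  # mirrors "and max > 0"
            alerts[instance] = used / limit
    return alerts

# nats-1 is at 90% of its limit; nats-2 has no limit configured.
alerts = store_usage_alerts({"nats-1": 90, "nats-2": 50},
                            {"nats-1": 100, "nats-2": 0})
```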
-
# 2.19.8. Nats high JetStream memory usage
JetStream memory usage is over 80% [copy] - alert: NatsHighJetstreamMemoryUsage expr: gnatsd_varz_jetstream_stats_memory / gnatsd_varz_jetstream_config_max_memory > 0.8 and gnatsd_varz_jetstream_config_max_memory > 0 for: 5m labels: severity: warning annotations: summary: Nats high JetStream memory usage (instance {{ $labels.instance }}) description: "JetStream memory usage is over 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.9. Nats high number of subscriptions
NATS server has more than 1000 active subscriptions [copy] - alert: NatsHighNumberOfSubscriptions expr: gnatsd_connz_subscriptions > 1000 for: 5m labels: severity: warning annotations: summary: Nats high number of subscriptions (instance {{ $labels.instance }}) description: "NATS server has more than 1000 active subscriptions\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.10. Nats high pending bytes
NATS server has more than 100,000 pending bytes [copy] - alert: NatsHighPendingBytes expr: gnatsd_connz_pending_bytes > 100000 for: 5m labels: severity: warning annotations: summary: Nats high pending bytes (instance {{ $labels.instance }}) description: "NATS server has more than 100,000 pending bytes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.11. Nats too many errors
NATS server has encountered JetStream API errors in the last 5 minutes [copy] - alert: NatsTooManyErrors expr: increase(gnatsd_varz_jetstream_stats_api_errors[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Nats too many errors (instance {{ $labels.instance }}) description: "NATS server has encountered JetStream API errors in the last 5 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.12. Nats JetStream accounts exceeded
JetStream has more than 100 active accounts [copy] - alert: NatsJetstreamAccountsExceeded expr: sum(gnatsd_varz_jetstream_stats_accounts) > 100 for: 5m labels: severity: warning annotations: summary: Nats JetStream accounts exceeded (instance {{ $labels.instance }}) description: "JetStream has more than 100 active accounts\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.19.13. Nats leaf node connection issue
No leaf node connections on {{ $labels.instance }} [copy] - alert: NatsLeafNodeConnectionIssue expr: gnatsd_varz_leafnodes == 0 for: 5m labels: severity: warning annotations: summary: Nats leaf node connection issue (instance {{ $labels.instance }}) description: "No leaf node connections on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
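Rules like the ones in this section can be unit-tested offline with `promtool test rules` before deploying them. A minimal sketch for the NatsSlowConsumers rule above — file name, instance label, and series values are illustrative:

```yaml
# nats-rules-test.yml -- run with: promtool test rules nats-rules-test.yml
rule_files:
  - nats-exporter.yml        # the rule file downloaded above
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'gnatsd_varz_slow_consumers{instance="nats-0:7777"}'
        values: '0 1 1 1 1'  # a slow consumer appears at t=1m and persists
    alert_rule_test:
      - eval_time: 5m        # past the 3m "for" duration
        alertname: NatsSlowConsumers
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: nats-0:7777
```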
-
-
# 2.20. Solr : embedded exporter (4 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/solr/embedded-exporter.yml
-
# 2.20.1. Solr update errors
Solr collection {{ $labels.collection }} has failed updates for replica {{ $labels.replica }} on {{ $labels.base_url }}. [copy] - alert: SolrUpdateErrors expr: increase(solr_metrics_core_update_handler_errors_total[1m]) > 1 for: 0m labels: severity: critical annotations: summary: Solr update errors (instance {{ $labels.instance }}) description: "Solr collection {{ $labels.collection }} has failed updates for replica {{ $labels.replica }} on {{ $labels.base_url }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.20.2. Solr query errors
Solr has increased query errors in collection {{ $labels.collection }} for replica {{ $labels.replica }} on {{ $labels.base_url }}. [copy] - alert: SolrQueryErrors expr: increase(solr_metrics_core_errors_total{category="QUERY"}[1m]) > 1 for: 5m labels: severity: warning annotations: summary: Solr query errors (instance {{ $labels.instance }}) description: "Solr has increased query errors in collection {{ $labels.collection }} for replica {{ $labels.replica }} on {{ $labels.base_url }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.20.3. Solr replication errors
Solr collection {{ $labels.collection }} has replication errors for replica {{ $labels.replica }} on {{ $labels.base_url }}. [copy] - alert: SolrReplicationErrors expr: increase(solr_metrics_core_errors_total{category="REPLICATION"}[1m]) > 1 for: 0m labels: severity: critical annotations: summary: Solr replication errors (instance {{ $labels.instance }}) description: "Solr collection {{ $labels.collection }} has replication errors for replica {{ $labels.replica }} on {{ $labels.base_url }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.20.4. Solr low live node count
Solr collection {{ $labels.collection }} has less than two live nodes for replica {{ $labels.replica }} on {{ $labels.base_url }}. [copy] - alert: SolrLowLiveNodeCount expr: solr_collections_live_nodes < 2 for: 0m labels: severity: critical annotations: summary: Solr low live node count (instance {{ $labels.instance }}) description: "Solr collection {{ $labels.collection }} has less than two live nodes for replica {{ $labels.replica }} on {{ $labels.base_url }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 2.21. Hadoop : hadoop/jmx_exporter (10 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/hadoop/jmx_exporter.yml
-
# 2.21.1. Hadoop Name Node Down
The Hadoop NameNode service is unavailable. [copy] - alert: HadoopNameNodeDown expr: up{job="hadoop-namenode"} == 0 for: 5m labels: severity: critical annotations: summary: Hadoop Name Node Down (instance {{ $labels.instance }}) description: "The Hadoop NameNode service is unavailable.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.21.2. Hadoop Resource Manager Down
The Hadoop ResourceManager service is unavailable. [copy] - alert: HadoopResourceManagerDown expr: up{job="hadoop-resourcemanager"} == 0 for: 5m labels: severity: critical annotations: summary: Hadoop Resource Manager Down (instance {{ $labels.instance }}) description: "The Hadoop ResourceManager service is unavailable.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.21.3. Hadoop Data Node Out Of Service
The Hadoop DataNode is not sending heartbeats. [copy] - alert: HadoopDataNodeOutOfService expr: hadoop_datanode_last_heartbeat == 0 for: 10m labels: severity: warning annotations: summary: Hadoop Data Node Out Of Service (instance {{ $labels.instance }}) description: "The Hadoop DataNode is not sending heartbeats.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.21.4. Hadoop HDFS Disk Space Low
Available HDFS disk space is running low. [copy] - alert: HadoopHdfsDiskSpaceLow expr: (hadoop_hdfs_bytes_total - hadoop_hdfs_bytes_used) / hadoop_hdfs_bytes_total < 0.1 for: 15m labels: severity: warning annotations: summary: Hadoop HDFS Disk Space Low (instance {{ $labels.instance }}) description: "Available HDFS disk space is running low.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.21.5. Hadoop Map Reduce Task Failures
There is an unusually high number of MapReduce task failures. [copy] - alert: HadoopMapReduceTaskFailures expr: increase(hadoop_mapreduce_task_failures_total[1h]) > 100 for: 10m labels: severity: critical annotations: summary: Hadoop Map Reduce Task Failures (instance {{ $labels.instance }}) description: "There is an unusually high number of MapReduce task failures.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
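The rule above uses `increase(...[1h]) > 100`. Unlike a naive `last - first`, PromQL's `increase()` is reset-aware: a counter that drops (e.g. after a process restart) is treated as having restarted from zero, not as negative growth. A sketch of that behaviour, ignoring Prometheus's window-edge extrapolation:

```python
# Rough model of PromQL increase(counter[1h]): sum of positive deltas,
# treating any drop as a counter reset (process restarted from 0).
def simple_increase(values):
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        # on a reset, the counter's growth since restart is just `cur`
        total += cur - prev if cur >= prev else cur
    return total

# Failure counter restarts mid-window: 40 -> 90, reset, 0 -> 70.
# The true increase is (90 - 40) + 70 = 120, enough to trip "> 100",
# whereas last - first would report a misleading 70 - 40 = 30.
window_increase = simple_increase([40, 60, 90, 10, 70])
```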
-
# 2.21.6. Hadoop Resource Manager Memory High
The Hadoop ResourceManager is approaching its memory limit. [copy] - alert: HadoopResourceManagerMemoryHigh expr: hadoop_resourcemanager_memory_bytes / hadoop_resourcemanager_memory_max_bytes > 0.8 for: 15m labels: severity: warning annotations: summary: Hadoop Resource Manager Memory High (instance {{ $labels.instance }}) description: "The Hadoop ResourceManager is approaching its memory limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.21.7. Hadoop YARN Container Allocation Failures
There is a significant number of YARN container allocation failures. [copy] - alert: HadoopYarnContainerAllocationFailures expr: increase(hadoop_yarn_container_allocation_failures_total[1h]) > 10 for: 10m labels: severity: warning annotations: summary: Hadoop YARN Container Allocation Failures (instance {{ $labels.instance }}) description: "There is a significant number of YARN container allocation failures.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.21.8. Hadoop HBase Region Count High
The HBase cluster has an unusually high number of regions. [copy] - alert: HadoopHbaseRegionCountHigh expr: hadoop_hbase_region_count > 5000 for: 15m labels: severity: warning annotations: summary: Hadoop HBase Region Count High (instance {{ $labels.instance }}) description: "The HBase cluster has an unusually high number of regions.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.21.9. Hadoop HBase Region Server Heap Low
HBase Region Servers are running low on heap space. [copy] - alert: HadoopHbaseRegionServerHeapLow expr: hadoop_hbase_region_server_heap_bytes / hadoop_hbase_region_server_max_heap_bytes > 0.8 for: 10m labels: severity: warning annotations: summary: Hadoop HBase Region Server Heap Low (instance {{ $labels.instance }}) description: "HBase Region Servers are running low on heap space.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 2.21.10. Hadoop HBase Write Requests Latency High
HBase Write Requests are experiencing high latency. [copy] - alert: HadoopHbaseWriteRequestsLatencyHigh expr: hadoop_hbase_write_requests_latency_seconds > 0.5 for: 10m labels: severity: warning annotations: summary: Hadoop HBase Write Requests Latency High (instance {{ $labels.instance }}) description: "HBase Write Requests are experiencing high latency.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 3.1. Nginx : knyar/nginx-lua-prometheus (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/nginx/knyar-nginx-exporter.yml
-
# 3.1.1. Nginx high HTTP 4xx error rate
Too many HTTP requests with status 4xx (> 5%) [copy] - alert: NginxHighHttp4xxErrorRate expr: sum(rate(nginx_http_requests_total{status=~"^4.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 for: 1m labels: severity: critical annotations: summary: Nginx high HTTP 4xx error rate (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 4xx (> 5%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.1.2. Nginx high HTTP 5xx error rate
Too many HTTP requests with status 5xx (> 5%) [copy] - alert: NginxHighHttp5xxErrorRate expr: sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 for: 1m labels: severity: critical annotations: summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 5xx (> 5%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.1.3. Nginx latency high
Nginx p99 latency is higher than 3 seconds [copy] - alert: NginxLatencyHigh expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[2m])) by (host, node, le)) > 3 for: 2m labels: severity: warning annotations: summary: Nginx latency high (instance {{ $labels.instance }}) description: "Nginx p99 latency is higher than 3 seconds\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
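The latency rule above estimates the p99 from histogram buckets. `histogram_quantile()` finds the bucket where the target rank falls and interpolates linearly inside it, so the result is an estimate whose precision depends on bucket boundaries. An illustrative re-implementation (bucket layout is an example, not the exporter's actual buckets):

```python
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound_le, cumulative_count) pairs, the last
    being (inf, total). Linear interpolation inside the target bucket,
    like PromQL's histogram_quantile()."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                # rank falls in the +Inf bucket: return the last finite bound
                return prev_le
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

# 100 requests: 10 under 0.1s, 60 under 0.5s, all under 1s.
buckets = [(0.1, 10), (0.5, 60), (1.0, 100), (float("inf"), 100)]
p50 = histogram_quantile(0.50, buckets)  # 0.42s
p99 = histogram_quantile(0.99, buckets)  # 0.9875s
```

Because of this interpolation, an alert on `histogram_quantile(0.99, ...) > 3` needs a bucket boundary near 3s to be meaningful; with coarse buckets the estimate can sit far from the true p99.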
-
-
# 3.2. Apache : Lusitaniae/apache_exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/apache/lusitaniae-apache-exporter.yml
-
# 3.2.1. Apache down
Apache down [copy] - alert: ApacheDown expr: apache_up == 0 for: 0m labels: severity: critical annotations: summary: Apache down (instance {{ $labels.instance }}) description: "Apache down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.2.2. Apache workers load
Apache workers in busy state are approaching the max workers count (> 80% busy) on {{ $labels.instance }} [copy] - alert: ApacheWorkersLoad expr: (sum by (instance) (apache_workers{state="busy"}) / sum by (instance) (apache_scoreboard) ) * 100 > 80 for: 2m labels: severity: warning annotations: summary: Apache workers load (instance {{ $labels.instance }}) description: "Apache workers in busy state are approaching the max workers count (> 80% busy) on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.2.3. Apache restart
Apache has just been restarted. [copy] - alert: ApacheRestart expr: apache_uptime_seconds_total / 60 < 1 for: 0m labels: severity: warning annotations: summary: Apache restart (instance {{ $labels.instance }}) description: "Apache has just been restarted.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 3.3.1. HaProxy : Embedded exporter (HAProxy >= v2) (14 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/haproxy/embedded-exporter-v2.yml
-
# 3.3.1.1. HAProxy high HTTP 4xx error rate backend
Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.proxy }} [copy] - alert: HaproxyHighHttp4xxErrorRateBackend expr: ((sum by (proxy) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (proxy) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 for: 1m labels: severity: critical annotations: summary: HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.proxy }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.2. HAProxy high HTTP 5xx error rate backend
Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.proxy }} [copy] - alert: HaproxyHighHttp5xxErrorRateBackend expr: ((sum by (proxy) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (proxy) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 for: 1m labels: severity: critical annotations: summary: HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.proxy }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.3. HAProxy high HTTP 4xx error rate server
Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }} [copy] - alert: HaproxyHighHttp4xxErrorRateServer expr: ((sum by (server) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 for: 1m labels: severity: critical annotations: summary: HAProxy high HTTP 4xx error rate server (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.4. HAProxy high HTTP 5xx error rate server
Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }} [copy] - alert: HaproxyHighHttp5xxErrorRateServer expr: ((sum by (server) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 for: 1m labels: severity: critical annotations: summary: HAProxy high HTTP 5xx error rate server (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.5. HAProxy server response errors
Too many response errors to {{ $labels.server }} server (> 5%). [copy] - alert: HaproxyServerResponseErrors expr: (sum by (server) (rate(haproxy_server_response_errors_total[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100 > 5 for: 1m labels: severity: critical annotations: summary: HAProxy server response errors (instance {{ $labels.instance }}) description: "Too many response errors to {{ $labels.server }} server (> 5%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.6. HAProxy backend connection errors
Too many connection errors to {{ $labels.proxy }} backend (> 100 req/s). Request throughput may be too high. [copy] - alert: HaproxyBackendConnectionErrors expr: (sum by (proxy) (rate(haproxy_backend_connection_errors_total[1m]))) > 100 for: 1m labels: severity: critical annotations: summary: HAProxy backend connection errors (instance {{ $labels.instance }}) description: "Too many connection errors to {{ $labels.proxy }} backend (> 100 req/s). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.7. HAProxy server connection errors
Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high. [copy] - alert: HaproxyServerConnectionErrors expr: (sum by (server) (rate(haproxy_server_connection_errors_total[1m]))) > 100 for: 0m labels: severity: critical annotations: summary: HAProxy server connection errors (instance {{ $labels.instance }}) description: "Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.8. HAProxy backend max active session > 80%
Session limit from backend {{ $labels.proxy }} reached 80% of limit - {{ $value | printf "%.2f"}}% [copy] - alert: HaproxyBackendMaxActiveSessions expr: ((haproxy_backend_current_sessions > 0) * 100) / (haproxy_backend_limit_sessions > 0) > 80 for: 2m labels: severity: warning annotations: summary: HAProxy backend max active session > 80% (instance {{ $labels.instance }}) description: "Session limit from backend {{ $labels.proxy }} reached 80% of limit - {{ $value | printf \"%.2f\"}}%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
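The session-limit rule above leans on a PromQL subtlety: without the `bool` modifier, a comparison like `haproxy_backend_limit_sessions > 0` is a filter that keeps the original sample values. Backends with a zero/unset limit drop out of the denominator entirely, so the ratio is never computed for them. A sketch of that filter-then-divide behaviour (label sets are illustrative):

```python
# PromQL "vector > 0" (no bool modifier) filters series, keeping their
# original values rather than returning 0/1.
def filter_gt(series, threshold):
    return {labels: v for labels, v in series.items() if v > threshold}

def session_pct(current, limit):
    cur = filter_gt(current, 0)
    lim = filter_gt(limit, 0)
    # binary operators match series with identical label sets on both sides
    return {k: cur[k] * 100 / lim[k] for k in cur.keys() & lim.keys()}

# "idle" has no current sessions, "nolimit" has no configured limit:
# only "web" survives both filters and gets a percentage.
pct = session_pct({"web": 90, "idle": 0, "nolimit": 10},
                  {"web": 100, "idle": 100, "nolimit": 0})
```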
-
# 3.3.1.9. HAProxy pending requests
Some HAProxy requests are pending on {{ $labels.proxy }} - {{ $value | printf "%.2f"}} [copy] - alert: HaproxyPendingRequests expr: sum by (proxy) (rate(haproxy_backend_current_queue[2m])) > 0 for: 2m labels: severity: warning annotations: summary: HAProxy pending requests (instance {{ $labels.instance }}) description: "Some HAProxy requests are pending on {{ $labels.proxy }} - {{ $value | printf \"%.2f\"}}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.10. HAProxy HTTP slowing down
Average request time is above 1s - {{ $value | printf "%.2f"}} [copy] - alert: HaproxyHttpSlowingDown expr: avg by (instance, proxy) (haproxy_backend_max_total_time_seconds) > 1 for: 1m labels: severity: warning annotations: summary: HAProxy HTTP slowing down (instance {{ $labels.instance }}) description: "Average request time is above 1s - {{ $value | printf \"%.2f\"}}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.11. HAProxy retry high
High rate of retry on {{ $labels.proxy }} - {{ $value | printf "%.2f"}} [copy] - alert: HaproxyRetryHigh expr: sum by (proxy) (rate(haproxy_backend_retry_warnings_total[1m])) > 10 for: 2m labels: severity: warning annotations: summary: HAProxy retry high (instance {{ $labels.instance }}) description: "High rate of retry on {{ $labels.proxy }} - {{ $value | printf \"%.2f\"}}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.12. HAProxy has no alive backends
HAProxy has no alive active or backup backends for {{ $labels.proxy }} [copy] - alert: HaproxyHasNoAliveBackends expr: haproxy_backend_active_servers + haproxy_backend_backup_servers == 0 for: 0m labels: severity: critical annotations: summary: HAProxy has no alive backends (instance {{ $labels.instance }}) description: "HAProxy has no alive active or backup backends for {{ $labels.proxy }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.13. HAProxy frontend security blocked requests
HAProxy is blocking requests for security reason [copy] - alert: HaproxyFrontendSecurityBlockedRequests expr: sum by (proxy) (rate(haproxy_frontend_denied_connections_total[2m])) > 10 for: 2m labels: severity: warning annotations: summary: HAProxy frontend security blocked requests (instance {{ $labels.instance }}) description: "HAProxy is blocking requests for security reason\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.1.14. HAProxy server healthcheck failure
Some server healthchecks are failing on {{ $labels.server }} [copy] - alert: HaproxyServerHealthcheckFailure expr: increase(haproxy_server_check_failures_total[1m]) > 0 for: 1m labels: severity: warning annotations: summary: HAProxy server healthcheck failure (instance {{ $labels.instance }}) description: "Some server healthchecks are failing on {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 3.3.2. HaProxy : prometheus/haproxy_exporter (HAProxy < v2) (16 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/haproxy/haproxy-exporter-v1.yml
-
# 3.3.2.1. HAProxy down
HAProxy down [copy] - alert: HaproxyDown expr: haproxy_up == 0 for: 0m labels: severity: critical annotations: summary: HAProxy down (instance {{ $labels.instance }}) description: "HAProxy down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.2. HAProxy high HTTP 4xx error rate backend
Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }} [copy] - alert: HaproxyHighHttp4xxErrorRateBackend expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) * 100 / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 5 for: 1m labels: severity: critical annotations: summary: HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.3. HAProxy high HTTP 5xx error rate backend
Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }} [copy] - alert: HaproxyHighHttp5xxErrorRateBackend expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) * 100 / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 5 for: 1m labels: severity: critical annotations: summary: HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.4. HAProxy high HTTP 4xx error rate server
Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }} [copy] - alert: HaproxyHighHttp4xxErrorRateServer expr: sum by (server) (rate(haproxy_server_http_responses_total{code="4xx"}[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5 for: 1m labels: severity: critical annotations: summary: HAProxy high HTTP 4xx error rate server (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.5. HAProxy high HTTP 5xx error rate server
Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }} [copy] - alert: HaproxyHighHttp5xxErrorRateServer expr: sum by (server) (rate(haproxy_server_http_responses_total{code="5xx"}[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5 for: 1m labels: severity: critical annotations: summary: HAProxy high HTTP 5xx error rate server (instance {{ $labels.instance }}) description: "Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.6. HAProxy server response errors
Too many response errors to {{ $labels.server }} server (> 5%). [copy] - alert: HaproxyServerResponseErrors expr: sum by (server) (rate(haproxy_server_response_errors_total[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5 for: 1m labels: severity: critical annotations: summary: HAProxy server response errors (instance {{ $labels.instance }}) description: "Too many response errors to {{ $labels.server }} server (> 5%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.7. HAProxy backend connection errors
Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high. [copy] - alert: HaproxyBackendConnectionErrors expr: sum by (backend) (rate(haproxy_backend_connection_errors_total[1m])) > 100 for: 1m labels: severity: critical annotations: summary: HAProxy backend connection errors (instance {{ $labels.instance }}) description: "Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.8. HAProxy server connection errors
Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high. [copy] - alert: HaproxyServerConnectionErrors expr: sum by (server) (rate(haproxy_server_connection_errors_total[1m])) > 100 for: 0m labels: severity: critical annotations: summary: HAProxy server connection errors (instance {{ $labels.instance }}) description: "Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.9. HAProxy backend max active session
HAProxy backend {{ $labels.fqdn }}/{{ $labels.backend }} is reaching session limit (> 80%). [copy] - alert: HaproxyBackendMaxActiveSession expr: ((sum by (backend) (avg_over_time(haproxy_backend_current_sessions[2m]) * 100) / sum by (backend) (avg_over_time(haproxy_backend_limit_sessions[2m])))) > 80 for: 2m labels: severity: warning annotations: summary: HAProxy backend max active session (instance {{ $labels.instance }}) description: "HAProxy backend {{ $labels.fqdn }}/{{ $labels.backend }} is reaching session limit (> 80%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.10. HAProxy pending requests
Some HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend [copy] - alert: HaproxyPendingRequests expr: sum by (backend) (haproxy_backend_current_queue) > 0 for: 2m labels: severity: warning annotations: summary: HAProxy pending requests (instance {{ $labels.instance }}) description: "Some HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.11. HAProxy HTTP slowing down
Average request time is above 1s [copy] - alert: HaproxyHttpSlowingDown expr: avg by (backend) (haproxy_backend_http_total_time_average_seconds) > 1 for: 1m labels: severity: warning annotations: summary: HAProxy HTTP slowing down (instance {{ $labels.instance }}) description: "Average request time is above 1s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.12. HAProxy retry high
High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend [copy] - alert: HaproxyRetryHigh expr: sum by (backend) (rate(haproxy_backend_retry_warnings_total[1m])) > 10 for: 2m labels: severity: warning annotations: summary: HAProxy retry high (instance {{ $labels.instance }}) description: "High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.13. HAProxy backend down
HAProxy backend is down [copy] - alert: HaproxyBackendDown expr: haproxy_backend_up == 0 for: 0m labels: severity: critical annotations: summary: HAProxy backend down (instance {{ $labels.instance }}) description: "HAProxy backend is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.14. HAProxy server down
HAProxy server is down [copy] - alert: HaproxyServerDown expr: haproxy_server_up == 0 for: 0m labels: severity: critical annotations: summary: HAProxy server down (instance {{ $labels.instance }}) description: "HAProxy server is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.15. HAProxy frontend security blocked requests
HAProxy is blocking requests for security reason [copy] - alert: HaproxyFrontendSecurityBlockedRequests expr: sum by (frontend) (rate(haproxy_frontend_requests_denied_total[2m])) > 10 for: 2m labels: severity: warning annotations: summary: HAProxy frontend security blocked requests (instance {{ $labels.instance }}) description: "HAProxy is blocking requests for security reason\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.3.2.16. HAProxy server healthcheck failure
Some server healthchecks are failing on {{ $labels.server }} [copy] - alert: HaproxyServerHealthcheckFailure expr: increase(haproxy_server_check_failures_total[1m]) > 0 for: 1m labels: severity: warning annotations: summary: HAProxy server healthcheck failure (instance {{ $labels.instance }}) description: "Some server healthchecks are failing on {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 3.4.1. Traefik : Embedded exporter v2 (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/traefik/embedded-exporter-v2.yml
-
# 3.4.1.1. Traefik service down
A Traefik service has no servers up. Note: `count(traefik_service_server_up) by (service)` counts series regardless of value and can never equal 0, so the sum of the per-server up gauges is used instead. [copy] - alert: TraefikServiceDown expr: sum(traefik_service_server_up) by (service) == 0 for: 0m labels: severity: critical annotations: summary: Traefik service down (service {{ $labels.service }}) description: "All servers of Traefik service {{ $labels.service }} are down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.4.1.2. Traefik high HTTP 4xx error rate service
Traefik service 4xx error rate is above 5% [copy] - alert: TraefikHighHttp4xxErrorRateService expr: sum(rate(traefik_service_requests_total{code=~"4.*"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5 for: 1m labels: severity: critical annotations: summary: Traefik high HTTP 4xx error rate service (service {{ $labels.service }}) description: "Traefik service 4xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
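The 4xx rule above divides the per-second rate of 4xx responses by the rate of all responses and scales to a percentage; the window length cancels out of the ratio. A quick sanity check of that arithmetic with hypothetical counter increases (not real Traefik data):

```python
WINDOW_SECONDS = 180  # the [3m] range used in the rule

def error_rate_percent(errors_delta, total_delta, window=WINDOW_SECONDS):
    """Counter increases over the window -> error percentage.
    Mirrors rate(errors[3m]) / rate(total[3m]) * 100."""
    if total_delta == 0:
        return None  # no traffic: PromQL produces no sample to compare
    return (errors_delta / window) / (total_delta / window) * 100

# Hypothetical: 30 4xx responses out of 400 requests in 3 minutes -> 7.5%
assert error_rate_percent(30, 400) == 7.5
assert error_rate_percent(30, 400) > 5  # the "> 5" threshold would fire
```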
# 3.4.1.3. Traefik high HTTP 5xx error rate service
Traefik service 5xx error rate is above 5% [copy] - alert: TraefikHighHttp5xxErrorRateService expr: sum(rate(traefik_service_requests_total{code=~"5.*"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5 for: 1m labels: severity: critical annotations: summary: Traefik high HTTP 5xx error rate service (service {{ $labels.service }}) description: "Traefik service 5xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 3.4.2. Traefik : Embedded exporter v1 (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/traefik/embedded-exporter-v1.yml
-
# 3.4.2.1. Traefik backend down
A Traefik backend has no servers up. Note: `count(traefik_backend_server_up) by (backend)` counts series regardless of value and can never equal 0, so the sum of the per-server up gauges is used instead. [copy] - alert: TraefikBackendDown expr: sum(traefik_backend_server_up) by (backend) == 0 for: 0m labels: severity: critical annotations: summary: Traefik backend down (backend {{ $labels.backend }}) description: "All servers of Traefik backend {{ $labels.backend }} are down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.4.2.2. Traefik high HTTP 4xx error rate backend
Traefik backend 4xx error rate is above 5% [copy] - alert: TraefikHighHttp4xxErrorRateBackend expr: sum(rate(traefik_backend_requests_total{code=~"4.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5 for: 1m labels: severity: critical annotations: summary: Traefik high HTTP 4xx error rate backend (backend {{ $labels.backend }}) description: "Traefik backend 4xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.4.2.3. Traefik high HTTP 5xx error rate backend
Traefik backend 5xx error rate is above 5% [copy] - alert: TraefikHighHttp5xxErrorRateBackend expr: sum(rate(traefik_backend_requests_total{code=~"5.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5 for: 1m labels: severity: critical annotations: summary: Traefik high HTTP 5xx error rate backend (backend {{ $labels.backend }}) description: "Traefik backend 5xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 3.5. Caddy : Embedded exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/caddy/embedded-exporter.yml
-
# 3.5.1. Caddy Reverse Proxy Down
A Caddy reverse proxy upstream has no healthy backends. Note: `count(caddy_reverse_proxy_upstreams_healthy) by (upstream)` counts series regardless of value and can never equal 0, so the sum of the healthy gauges is used instead. [copy] - alert: CaddyReverseProxyDown expr: sum(caddy_reverse_proxy_upstreams_healthy) by (upstream) == 0 for: 0m labels: severity: critical annotations: summary: Caddy Reverse Proxy Down (upstream {{ $labels.upstream }}) description: "Caddy reverse proxy upstream {{ $labels.upstream }} has no healthy backends\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.5.2. Caddy high HTTP 4xx error rate service
Caddy service 4xx error rate is above 5% [copy] - alert: CaddyHighHttp4xxErrorRateService expr: sum(rate(caddy_http_request_duration_seconds_count{code=~"4.."}[3m])) by (instance) / sum(rate(caddy_http_request_duration_seconds_count[3m])) by (instance) * 100 > 5 for: 1m labels: severity: critical annotations: summary: Caddy high HTTP 4xx error rate service (instance {{ $labels.instance }}) description: "Caddy service 4xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.5.3. Caddy high HTTP 5xx error rate service
Caddy service 5xx error rate is above 5% [copy] - alert: CaddyHighHttp5xxErrorRateService expr: sum(rate(caddy_http_request_duration_seconds_count{code=~"5.."}[3m])) by (instance) / sum(rate(caddy_http_request_duration_seconds_count[3m])) by (instance) * 100 > 5 for: 1m labels: severity: critical annotations: summary: Caddy high HTTP 5xx error rate service (instance {{ $labels.instance }}) description: "Caddy service 5xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 3.6. Envoy : Built-in metrics (19 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/envoy/embedded-exporter.yml
-
# 3.6.1. Envoy server not live
Envoy server is not live (draining or shutting down) on {{ $labels.instance }} [copy] - alert: EnvoyServerNotLive expr: envoy_server_live != 1 for: 1m labels: severity: critical annotations: summary: Envoy server not live (instance {{ $labels.instance }}) description: "Envoy server is not live (draining or shutting down) on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.2. Envoy high memory usage
Envoy memory allocated is above 90% of heap size on {{ $labels.instance }} [copy] - alert: EnvoyHighMemoryUsage expr: envoy_server_memory_allocated / envoy_server_memory_heap_size * 100 > 90 for: 5m labels: severity: warning annotations: summary: Envoy high memory usage (instance {{ $labels.instance }}) description: "Envoy memory allocated is above 90% of heap size on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.3. Envoy high downstream HTTP 5xx error rate
More than 5% of downstream HTTP responses are 5xx on {{ $labels.instance }} ({{ $value | printf "%.1f" }}%) [copy] - alert: EnvoyHighDownstreamHttp5xxErrorRate expr: sum by (instance) (rate(envoy_http_downstream_rq_xx{envoy_response_code_class="5"}[5m])) / sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) * 100 > 5 for: 1m labels: severity: critical annotations: summary: Envoy high downstream HTTP 5xx error rate (instance {{ $labels.instance }}) description: "More than 5% of downstream HTTP responses are 5xx on {{ $labels.instance }} ({{ $value | printf \"%.1f\" }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.4. Envoy high downstream HTTP 4xx error rate
More than 10% of downstream HTTP responses are 4xx on {{ $labels.instance }} ({{ $value | printf "%.1f" }}%) [copy] - alert: EnvoyHighDownstreamHttp4xxErrorRate expr: sum by (instance) (rate(envoy_http_downstream_rq_xx{envoy_response_code_class="4"}[5m])) / sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) * 100 > 10 for: 5m labels: severity: warning annotations: summary: Envoy high downstream HTTP 4xx error rate (instance {{ $labels.instance }}) description: "More than 10% of downstream HTTP responses are 4xx on {{ $labels.instance }} ({{ $value | printf \"%.1f\" }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.5. Envoy downstream connections overflowing
Downstream connections are being rejected due to listener overflow on {{ $labels.instance }} [copy] - alert: EnvoyDownstreamConnectionsOverflowing expr: increase(envoy_listener_downstream_cx_overflow[5m]) > 0 for: 0m labels: severity: warning annotations: summary: Envoy downstream connections overflowing (instance {{ $labels.instance }}) description: "Downstream connections are being rejected due to listener overflow on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.6. Envoy cluster membership empty
Envoy cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} has no healthy members [copy] - alert: EnvoyClusterMembershipEmpty expr: envoy_cluster_membership_healthy == 0 for: 1m labels: severity: critical annotations: summary: Envoy cluster membership empty (instance {{ $labels.instance }}) description: "Envoy cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} has no healthy members\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.7. Envoy cluster membership degraded
More than 25% of members in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} are unhealthy [copy] - alert: EnvoyClusterMembershipDegraded expr: envoy_cluster_membership_healthy / envoy_cluster_membership_total * 100 < 75 and envoy_cluster_membership_total > 0 for: 5m labels: severity: warning annotations: summary: Envoy cluster membership degraded (instance {{ $labels.instance }}) description: "More than 25% of members in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} are unhealthy\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.8. Envoy high cluster upstream connection failures
High rate of upstream connection failures in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} [copy] - alert: EnvoyHighClusterUpstreamConnectionFailures expr: increase(envoy_cluster_upstream_cx_connect_fail[5m]) > 10 for: 5m labels: severity: warning annotations: summary: Envoy high cluster upstream connection failures (instance {{ $labels.instance }}) description: "High rate of upstream connection failures in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.9. Envoy high cluster upstream request timeout rate
More than 5% of upstream requests are timing out in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} [copy] - alert: EnvoyHighClusterUpstreamRequestTimeoutRate expr: increase(envoy_cluster_upstream_rq_timeout[5m]) / increase(envoy_cluster_upstream_rq_completed[5m]) * 100 > 5 and increase(envoy_cluster_upstream_rq_completed[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Envoy high cluster upstream request timeout rate (instance {{ $labels.instance }}) description: "More than 5% of upstream requests are timing out in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.10. Envoy high cluster upstream 5xx error rate
More than 5% of upstream requests return 5xx in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} [copy] - alert: EnvoyHighClusterUpstream5xxErrorRate expr: increase(envoy_cluster_upstream_rq_xx{envoy_response_code_class="5"}[5m]) / increase(envoy_cluster_upstream_rq_completed[5m]) * 100 > 5 and increase(envoy_cluster_upstream_rq_completed[5m]) > 0 for: 1m labels: severity: critical annotations: summary: Envoy high cluster upstream 5xx error rate (instance {{ $labels.instance }}) description: "More than 5% of upstream requests return 5xx in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
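Rules 3.6.9 and 3.6.10 append `and increase(envoy_cluster_upstream_rq_completed[5m]) > 0` to the ratio. Without that guard, a window with timeouts but zero completed requests would evaluate to +Inf, which satisfies `> 5` and fires spuriously; the guard drops those samples entirely. A minimal sketch of the guard logic, using hypothetical counter deltas rather than real Envoy data:

```python
def timeout_rate_alert(timeouts_delta, completed_delta, threshold_pct=5.0):
    """Mirror `increase(timeout)/increase(completed)*100 > T
    and increase(completed) > 0` from the Envoy upstream rules."""
    # The `and` guard: no completed requests -> no sample -> no alert,
    # instead of x/0 = +Inf tripping the threshold.
    if not completed_delta > 0:
        return False
    percent = timeouts_delta / completed_delta * 100
    return percent > threshold_pct

assert timeout_rate_alert(6, 100) is True    # 6% > 5% -> fires
assert timeout_rate_alert(2, 100) is False   # 2% -> quiet
assert timeout_rate_alert(3, 0) is False     # guarded: 3/0 would be +Inf
```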
# 3.6.11. Envoy cluster health check failures
Health checks are consistently failing in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} [copy] - alert: EnvoyClusterHealthCheckFailures expr: increase(envoy_cluster_health_check_failure[5m]) > 5 for: 5m labels: severity: warning annotations: summary: Envoy cluster health check failures (instance {{ $labels.instance }}) description: "Health checks are consistently failing in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.12. Envoy cluster outlier detection ejections active
There are active outlier detection ejections in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} [copy] - alert: EnvoyClusterOutlierDetectionEjectionsActive expr: envoy_cluster_outlier_detection_ejections_active > 0 for: 5m labels: severity: info annotations: summary: Envoy cluster outlier detection ejections active (instance {{ $labels.instance }}) description: "There are active outlier detection ejections in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.13. Envoy listener SSL connection errors
Envoy listener is experiencing SSL/TLS connection errors on {{ $labels.instance }} [copy] - alert: EnvoyListenerSslConnectionErrors expr: increase(envoy_listener_ssl_connection_error[5m]) > 0 for: 0m labels: severity: warning annotations: summary: Envoy listener SSL connection errors (instance {{ $labels.instance }}) description: "Envoy listener is experiencing SSL/TLS connection errors on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.14. Envoy global downstream connections overflowing
Downstream connections are being rejected due to global connection limit on {{ $labels.instance }} [copy] - alert: EnvoyGlobalDownstreamConnectionsOverflowing expr: increase(envoy_listener_downstream_global_cx_overflow[5m]) > 0 for: 0m labels: severity: critical annotations: summary: Envoy global downstream connections overflowing (instance {{ $labels.instance }}) description: "Downstream connections are being rejected due to global connection limit on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.15. Envoy SSL certificate expiring soon
SSL certificate loaded by Envoy on {{ $labels.instance }} expires in less than 7 days [copy] - alert: EnvoySslCertificateExpiringSoon expr: envoy_server_days_until_first_cert_expiring < 7 for: 0m labels: severity: warning annotations: summary: Envoy SSL certificate expiring soon (instance {{ $labels.instance }}) description: "SSL certificate loaded by Envoy on {{ $labels.instance }} expires in less than 7 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.16. Envoy SSL certificate expired
SSL certificate loaded by Envoy on {{ $labels.instance }} has expired [copy] - alert: EnvoySslCertificateExpired expr: envoy_server_days_until_first_cert_expiring < 0 for: 0m labels: severity: critical annotations: summary: Envoy SSL certificate expired (instance {{ $labels.instance }}) description: "SSL certificate loaded by Envoy on {{ $labels.instance }} has expired\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.17. Envoy cluster circuit breaker tripped
Circuit breaker is open for cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} [copy] - alert: EnvoyClusterCircuitBreakerTripped expr: envoy_cluster_circuit_breakers_default_cx_open == 1 or envoy_cluster_circuit_breakers_default_rq_open == 1 for: 0m labels: severity: critical annotations: summary: Envoy cluster circuit breaker tripped (instance {{ $labels.instance }}) description: "Circuit breaker is open for cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.18. Envoy no healthy upstream
Upstream connection attempts failed because no healthy upstream was available in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} [copy] - alert: EnvoyNoHealthyUpstream expr: increase(envoy_cluster_upstream_cx_none_healthy[5m]) > 0 for: 0m labels: severity: critical annotations: summary: Envoy no healthy upstream (instance {{ $labels.instance }}) description: "Upstream connection attempts failed because no healthy upstream was available in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 3.6.19. Envoy high downstream request timeout rate
Downstream requests are timing out on {{ $labels.instance }} [copy] - alert: EnvoyHighDownstreamRequestTimeoutRate expr: increase(envoy_http_downstream_rq_timeout[5m]) > 5 for: 5m labels: severity: warning annotations: summary: Envoy high downstream request timeout rate (instance {{ $labels.instance }}) description: "Downstream requests are timing out on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 4.1. PHP-FPM : bakins/php-fpm-exporter (1 rule) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/php-fpm/bakins-fpm-exporter.yml
-
# 4.1.1. PHP-FPM max-children reached
PHP-FPM reached max children - {{ $labels.instance }} [copy] - alert: PhpFpmMaxChildrenReached expr: sum(phpfpm_max_children_reached_total) by (instance) > 0 for: 0m labels: severity: warning annotations: summary: PHP-FPM max-children reached (instance {{ $labels.instance }}) description: "PHP-FPM reached max children - {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 4.2. JVM : java-client (12 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/jvm/jvm-exporter.yml
-
# 4.2.1. JVM memory filling up
JVM memory is filling up (> 80%) [copy] - alert: JvmMemoryFillingUp expr: (sum by (instance)(jvm_memory_used_bytes{area="heap"}) / sum by (instance)(jvm_memory_max_bytes{area="heap"})) * 100 > 80 for: 2m labels: severity: warning annotations: summary: JVM memory filling up (instance {{ $labels.instance }}) description: "JVM memory is filling up (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.2. JVM non-heap memory filling up
JVM non-heap memory (metaspace/code cache) is filling up (> 80%) [copy] # Many JVM configurations leave metaspace unbounded, in which case jvm_memory_max_bytes{area="nonheap"} is -1 and this alert will not fire. # The query filters out max_bytes <= 0 to avoid false negatives. - alert: JvmNonHeapMemoryFillingUp expr: (sum by (instance)(jvm_memory_used_bytes{area="nonheap"}) / (sum by (instance)(jvm_memory_max_bytes{area="nonheap"}) > 0)) * 100 > 80 for: 2m labels: severity: warning annotations: summary: JVM non-heap memory filling up (instance {{ $labels.instance }}) description: "JVM non-heap memory (metaspace/code cache) is filling up (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
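The `> 0` filter inside the denominator of rule 4.2.1's non-heap variant above drops instances whose non-heap max is reported as -1 (unbounded), so no ratio is ever computed for them. The filtering behaviour, sketched with hypothetical byte counts (PromQL drops the series; the sketch returns None instead):

```python
def nonheap_usage_percent(used_bytes, max_bytes):
    """Mirror `used / (max > 0) * 100` from the JVM non-heap rule."""
    if max_bytes <= 0:   # unbounded metaspace is exported as -1
        return None      # PromQL drops the series: no sample, no alert
    return used_bytes / max_bytes * 100

assert nonheap_usage_percent(900, 1000) == 90.0  # bounded: ratio computed
assert nonheap_usage_percent(900, -1) is None    # unbounded: never alerts
```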
# 4.2.3. JVM GC time too high
JVM is spending too much time in garbage collection (> 5% of wall clock time) [copy] - alert: JvmGcTimeTooHigh expr: sum by (instance)(rate(jvm_gc_collection_seconds_sum[5m])) > 0.05 for: 5m labels: severity: warning annotations: summary: JVM GC time too high (instance {{ $labels.instance }}) description: "JVM is spending too much time in garbage collection (> 5% of wall clock time)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.4. JVM threads deadlocked
JVM has deadlocked threads [copy] - alert: JvmThreadsDeadlocked expr: jvm_threads_deadlocked > 0 for: 1m labels: severity: critical annotations: summary: JVM threads deadlocked (instance {{ $labels.instance }}) description: "JVM has deadlocked threads\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.5. JVM thread count high
JVM thread count is high (> 300), potential thread leak [copy] - alert: JvmThreadCountHigh expr: jvm_threads_current > 300 for: 5m labels: severity: warning annotations: summary: JVM thread count high (instance {{ $labels.instance }}) description: "JVM thread count is high (> 300), potential thread leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.6. JVM threads BLOCKED
JVM has a high number of BLOCKED threads, indicating lock contention [copy] - alert: JvmThreadsBlocked expr: jvm_threads_state{state="BLOCKED"} > 50 for: 5m labels: severity: warning annotations: summary: JVM threads BLOCKED (instance {{ $labels.instance }}) description: "JVM has a high number of BLOCKED threads, indicating lock contention\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.7. JVM old gen GC frequency
Frequent old/major GC cycles, indicating memory pressure [copy] # This regex matches CMS, G1, and Parallel collector names. It will not match ZGC or Shenandoah cycle names. # Adjust the gc label filter if you use a different collector. - alert: JvmOldGenGcFrequency expr: rate(jvm_gc_collection_seconds_count{gc=~".*old.*|.*major.*"}[5m]) > 0.3 for: 5m labels: severity: warning annotations: summary: JVM old gen GC frequency (instance {{ $labels.instance }}) description: "Frequent old/major GC cycles, indicating memory pressure\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.8. JVM direct buffer pool filling up
JVM direct buffer pool is filling up (> 90%) [copy] - alert: JvmDirectBufferPoolFillingUp expr: (jvm_buffer_pool_used_bytes / jvm_buffer_pool_capacity_bytes) * 100 > 90 for: 5m labels: severity: warning annotations: summary: JVM direct buffer pool filling up (instance {{ $labels.instance }}) description: "JVM direct buffer pool is filling up (> 90%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.9. JVM objects pending finalization
JVM has objects pending finalization, potential memory leak [copy] - alert: JvmObjectsPendingFinalization expr: jvm_memory_objects_pending_finalization > 1000 for: 5m labels: severity: warning annotations: summary: JVM objects pending finalization (instance {{ $labels.instance }}) description: "JVM has objects pending finalization, potential memory leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.10. JVM file descriptors exhaustion
JVM process is running out of file descriptors (> 90% used) [copy] # process_open_fds and process_max_fds are generic metrics from the Prometheus client library, not JVM-specific. # This alert will also fire for Go, Python, or any process exposing these metrics. - alert: JvmFileDescriptorsExhaustion expr: (process_open_fds / process_max_fds) * 100 > 90 for: 5m labels: severity: warning annotations: summary: JVM file descriptors exhaustion (instance {{ $labels.instance }}) description: "JVM process is running out of file descriptors (> 90% used)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.11. JVM class loading anomaly
Rapid class loading detected, potential classloader leak [copy] - alert: JvmClassLoadingAnomaly expr: rate(jvm_classes_loaded_total[5m]) > 100 for: 5m labels: severity: warning annotations: summary: JVM class loading anomaly (instance {{ $labels.instance }}) description: "Rapid class loading detected, potential classloader leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.2.12. JVM compilation time spike
Excessive JIT compilation time consuming CPU [copy] - alert: JvmCompilationTimeSpike expr: rate(jvm_compilation_time_seconds_total[5m]) > 0.1 for: 5m labels: severity: warning annotations: summary: JVM compilation time spike (instance {{ $labels.instance }}) description: "Excessive JIT compilation time consuming CPU\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 4.3. Golang : client_golang (10 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/golang/golang-exporter.yml
-
# 4.3.1. Go goroutine count high
Go application has too many goroutines (> 1000), potential goroutine leak [copy] # Threshold is a rough default. High-concurrency servers may legitimately run thousands of goroutines. Adjust to match your baseline. - alert: GoGoroutineCountHigh expr: go_goroutines > 1000 for: 5m labels: severity: warning annotations: summary: Go goroutine count high (instance {{ $labels.instance }}) description: "Go application has too many goroutines (> 1000), potential goroutine leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.3.2. Go GC duration high
Go GC pause duration is too high (max > 1s) [copy] # quantile="1" is the maximum observed GC pause in the current summary window, not p99. # A single outlier pause can push this above 1s. The for: 5m ensures the max stays elevated. - alert: GoGcDurationHigh expr: go_gc_duration_seconds{quantile="1"} > 1 for: 5m labels: severity: warning annotations: summary: Go GC duration high (instance {{ $labels.instance }}) description: "Go GC pause duration is too high (max > 1s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.3.3. Go memory usage high
Go heap allocation is using most of the runtime's reserved memory (> 90%), indicating the process may need more memory or has a leak [copy] # go_memstats_sys_bytes is the total memory obtained from the OS by the Go runtime, not total host memory. # This ratio measures Go-internal memory utilization, not system-level memory pressure. - alert: GoMemoryUsageHigh expr: (go_memstats_heap_alloc_bytes / go_memstats_sys_bytes) * 100 > 90 for: 5m labels: severity: warning annotations: summary: Go memory usage high (instance {{ $labels.instance }}) description: "Go heap allocation is using most of the runtime's reserved memory (> 90%), indicating the process may need more memory or has a leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.3.4. Go thread count high
Go OS thread count is high (> 50), potential blocking syscall or CGo leak [copy] # Threshold is workload-dependent. Applications with heavy CGo or blocking I/O may legitimately use more OS threads. Adjust to match your baseline. - alert: GoThreadCountHigh expr: go_threads > 50 for: 5m labels: severity: warning annotations: summary: Go thread count high (instance {{ $labels.instance }}) description: "Go OS thread count is high (> 50), potential blocking syscall or CGo leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.3.5. Go heap objects count high
Go heap has too many live objects (> 10M), high GC pressure [copy] # Threshold is a rough default. Adjust based on your application's normal object count. - alert: GoHeapObjectsCountHigh expr: go_memstats_heap_objects > 10000000 for: 5m labels: severity: warning annotations: summary: Go heap objects count high (instance {{ $labels.instance }}) description: "Go heap has too many live objects (> 10M), high GC pressure\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.3.6. Go GC CPU fraction high
Go GC is consuming too much CPU (> 5%) [copy] # go_memstats_gc_cpu_fraction is deprecated since Go 1.20 and may return 0 in newer versions. # Consider using runtime/metrics-based alternatives if running Go >= 1.20. - alert: GoGcCpuFractionHigh expr: go_memstats_gc_cpu_fraction > 0.05 for: 5m labels: severity: warning annotations: summary: Go GC CPU fraction high (instance {{ $labels.instance }}) description: "Go GC is consuming too much CPU (> 5%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.3.7. Go goroutine spike
Go goroutine count is growing rapidly [copy] - alert: GoGoroutineSpike expr: deriv(go_goroutines[5m]) > 100 for: 5m labels: severity: warning annotations: summary: Go goroutine spike (instance {{ $labels.instance }}) description: "Go goroutine count is growing rapidly\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
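`deriv()` in the goroutine-spike rule above fits a least-squares line to the gauge samples in the 5-minute window and returns its slope in units per second, so `> 100` means a sustained growth of roughly 100 goroutines per second. A small sketch of that regression on hypothetical samples (simple least squares, not the Prometheus source):

```python
def least_squares_slope(samples):
    """samples: list of (timestamp_seconds, value) pairs.
    Returns the slope in value-units per second, the quantity
    PromQL's deriv() estimates over its range window."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den

# Hypothetical goroutine counts sampled every 15s, growing by 150/s
samples = [(0, 1000), (15, 3250), (30, 5500), (45, 7750), (60, 10000)]
assert abs(least_squares_slope(samples) - 150) < 1e-9  # 150/s > 100 -> fires
```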
# 4.3.8. Go heap fragmentation
Go heap has high idle ratio (> 90%), indicating memory fragmentation [copy] - alert: GoHeapFragmentation expr: go_memstats_heap_idle_bytes / go_memstats_heap_sys_bytes > 0.9 for: 5m labels: severity: warning annotations: summary: Go heap fragmentation (instance {{ $labels.instance }}) description: "Go heap has high idle ratio (> 90%), indicating memory fragmentation\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.3.9. Go memory leak
Go application has sustained high allocation rate (> 1GB/s), potential memory leak [copy] - alert: GoMemoryLeak expr: rate(go_memstats_alloc_bytes_total[5m]) > 1e9 for: 5m labels: severity: warning annotations: summary: Go memory leak (instance {{ $labels.instance }}) description: "Go application has sustained high allocation rate (> 1GB/s), potential memory leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.3.10. Go stack memory high
Go stack memory usage is high (> 1GB), likely excessive goroutines or deep recursion [copy] - alert: GoStackMemoryHigh expr: go_memstats_stack_inuse_bytes > 1e9 for: 5m labels: severity: warning annotations: summary: Go stack memory high (instance {{ $labels.instance }}) description: "Go stack memory usage is high (> 1GB), likely excessive goroutines or deep recursion\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 4.4. Ruby : prometheus_exporter (5 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/ruby/ruby-exporter.yml
-
# 4.4.1. Ruby heap live slots high
Ruby heap has too many live slots (> 500k), heap bloat [copy] # Threshold is a rough default. Adjust based on your application's normal heap size. - alert: RubyHeapLiveSlotsHigh expr: ruby_heap_live_slots > 500000 for: 5m labels: severity: warning annotations: summary: Ruby heap live slots high (instance {{ $labels.instance }}) description: "Ruby heap has too many live slots (> 500k), heap bloat\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.4.2. Ruby heap free slots high
Ruby heap has too many free slots (> 500k), memory fragmentation after large allocations [copy] - alert: RubyHeapFreeSlotsHigh expr: ruby_heap_free_slots > 500000 for: 5m labels: severity: warning annotations: summary: Ruby heap free slots high (instance {{ $labels.instance }}) description: "Ruby heap has too many free slots (> 500k), memory fragmentation after large allocations\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.4.3. Ruby major GC rate high
Ruby is performing too many major GC cycles, indicating memory pressure [copy] - alert: RubyMajorGcRateHigh expr: rate(ruby_major_gc_ops_total[5m]) > 5 for: 5m labels: severity: warning annotations: summary: Ruby major GC rate high (instance {{ $labels.instance }}) description: "Ruby is performing too many major GC cycles, indicating memory pressure\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.4.4. Ruby RSS high
Ruby process RSS is high (> 1GB) [copy] - alert: RubyRssHigh expr: ruby_rss > 1e9 for: 5m labels: severity: warning annotations: summary: Ruby RSS high (instance {{ $labels.instance }}) description: "Ruby process RSS is high (> 1GB)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.4.5. Ruby allocated objects spike
Ruby is allocating objects at a high rate [copy] - alert: RubyAllocatedObjectsSpike expr: rate(ruby_allocated_objects_total[5m]) > 100000 for: 5m labels: severity: warning annotations: summary: Ruby allocated objects spike (instance {{ $labels.instance }}) description: "Ruby is allocating objects at a high rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 4.5. Python : client_python (5 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/python/python-exporter.yml
-
# 4.5.1. Python GC objects uncollectable
Python has uncollectable objects, potential memory leak via reference cycles [copy] - alert: PythonGcObjectsUncollectable expr: increase(python_gc_objects_uncollectable_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Python GC objects uncollectable (instance {{ $labels.instance }}) description: "Python has uncollectable objects, potential memory leak via reference cycles\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.5.2. Python GC collections high
Python GC is collecting too many objects (> 10k/s), high allocation pressure [copy] - alert: PythonGcCollectionsHigh expr: rate(python_gc_objects_collected_total[5m]) > 10000 for: 5m labels: severity: warning annotations: summary: Python GC collections high (instance {{ $labels.instance }}) description: "Python GC is collecting too many objects (> 10k/s), high allocation pressure\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.5.3. Python file descriptors exhaustion
Python process is running out of file descriptors (> 90% used) [copy] # process_open_fds and process_max_fds are generic metrics from the Prometheus client library, not Python-specific. - alert: PythonFileDescriptorsExhaustion expr: (process_open_fds / process_max_fds) * 100 > 90 for: 5m labels: severity: warning annotations: summary: Python file descriptors exhaustion (instance {{ $labels.instance }}) description: "Python process is running out of file descriptors (> 90% used)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.5.4. Python GC generation 2 collections high
Python full GC (generation 2) is running too frequently, indicating memory pressure [copy] - alert: PythonGcGeneration2CollectionsHigh expr: rate(python_gc_collections_total{generation="2"}[5m]) > 1 for: 5m labels: severity: warning annotations: summary: Python GC generation 2 collections high (instance {{ $labels.instance }}) description: "Python full GC (generation 2) is running too frequently, indicating memory pressure\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.5.5. Python virtual memory high
Python process virtual memory is high (> 4GB) [copy] # Threshold is a rough default. Adjust based on your application's expected memory footprint. - alert: PythonVirtualMemoryHigh expr: process_virtual_memory_bytes > 4e9 for: 5m labels: severity: warning annotations: summary: Python virtual memory high (instance {{ $labels.instance }}) description: "Python process virtual memory is high (> 4GB)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
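A single 4GB cutoff rarely fits every service. One common pattern is to duplicate the rule with label matchers so each workload gets its own threshold. A minimal sketch, assuming hypothetical job names (`batch-worker`, `api`) that you would replace with your own:

```yaml
# Sketch: per-job virtual memory thresholds. The job names are examples only.
- alert: PythonVirtualMemoryHighBatch
  expr: process_virtual_memory_bytes{job="batch-worker"} > 8e9  # batch jobs may legitimately use more
  for: 5m
  labels:
    severity: warning
- alert: PythonVirtualMemoryHighApi
  expr: process_virtual_memory_bytes{job="api"} > 2e9           # tighter bound for a lean API service
  for: 5m
  labels:
    severity: warning
```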
-
-
# 4.6. Sidekiq : Strech/sidekiq-prometheus-exporter (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/sidekiq/strech-sidekiq-exporter.yml-
# 4.6.1. Sidekiq queue size
Sidekiq queue {{ $labels.name }} has more than 100 pending jobs [copy] - alert: SidekiqQueueSize expr: sidekiq_queue_size > 100 for: 1m labels: severity: warning annotations: summary: Sidekiq queue size (instance {{ $labels.instance }}) description: "Sidekiq queue {{ $labels.name }} has more than 100 pending jobs\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.6.2. Sidekiq scheduling latency too high
Sidekiq jobs are taking more than 1min to be picked up. Users may be seeing delays in background processing. [copy] - alert: SidekiqSchedulingLatencyTooHigh expr: max(sidekiq_queue_latency) > 60 for: 0m labels: severity: critical annotations: summary: Sidekiq scheduling latency too high (instance {{ $labels.instance }}) description: "Sidekiq jobs are taking more than 1min to be picked up. Users may be seeing delays in background processing.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
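Queues often have very different latency expectations, and the exporter labels each queue via the `name` label (visible in the rule above). A sketch of a stricter per-queue rule, where the queue name `critical` is an example to adapt:

```yaml
# Sketch: tighter latency bound for a latency-sensitive queue.
# The queue name "critical" is an example; sidekiq_queue_latency is in seconds.
- alert: SidekiqCriticalQueueLatency
  expr: sidekiq_queue_latency{name="critical"} > 10
  for: 0m
  labels:
    severity: critical
```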
-
-
# 4.7. Apache Flink : Built-in Prometheus reporter (12 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/apache-flink/flink-prometheus-reporter.yml-
# 4.7.1. Flink job is not running
No Flink jobs are currently running. All jobs may have failed or been cancelled. [copy] - alert: FlinkJobIsNotRunning expr: flink_jobmanager_numRunningJobs == 0 for: 1m labels: severity: critical annotations: summary: Flink job is not running (instance {{ $labels.instance }}) description: "No Flink jobs are currently running. All jobs may have failed or been cancelled.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.7.2. Flink no TaskManagers registered
No TaskManagers are registered with the JobManager. The cluster has no processing capacity. [copy] - alert: FlinkNoTaskmanagersRegistered expr: flink_jobmanager_numRegisteredTaskManagers == 0 for: 1m labels: severity: critical annotations: summary: Flink no TaskManagers registered (instance {{ $labels.instance }}) description: "No TaskManagers are registered with the JobManager. The cluster has no processing capacity.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.7.3. Flink all task slots used
All Flink task slots are in use ({{ $value }} available). New jobs cannot be scheduled. [copy] # This alert fires when there are no available task slots. Adjust the threshold if your cluster is expected to run at full capacity. - alert: FlinkAllTaskSlotsUsed expr: flink_jobmanager_taskSlotsAvailable == 0 for: 5m labels: severity: warning annotations: summary: Flink all task slots used (instance {{ $labels.instance }}) description: "All Flink task slots are in use ({{ $value }} available). New jobs cannot be scheduled.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
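For clusters expected to run near full slot usage, a relative threshold can be more useful than `== 0`. This sketch assumes the JobManager also exports `flink_jobmanager_taskSlotsTotal` (standard in recent Flink versions, but verify against your deployment):

```yaml
# Sketch: fire when fewer than 10% of task slots remain free.
- alert: FlinkTaskSlotsLow
  expr: flink_jobmanager_taskSlotsAvailable / flink_jobmanager_taskSlotsTotal < 0.1
  for: 5m
  labels:
    severity: warning
```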
-
# 4.7.4. Flink job restart increasing
Flink job {{ $labels.job_name }} has restarted {{ $value }} times in the last 5 minutes. [copy] - alert: FlinkJobRestartIncreasing expr: increase(flink_jobmanager_job_numRestarts[5m]) > 0 for: 0m labels: severity: warning annotations: summary: Flink job restart increasing (instance {{ $labels.instance }}) description: "Flink job {{ $labels.job_name }} has restarted {{ $value }} times in the last 5 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.7.5. Flink checkpoint failures
Flink job {{ $labels.job_name }} has {{ $value }} failed checkpoints in the last 10 minutes. [copy] - alert: FlinkCheckpointFailures expr: increase(flink_jobmanager_job_numberOfFailedCheckpoints[10m]) > 0 for: 0m labels: severity: warning annotations: summary: Flink checkpoint failures (instance {{ $labels.instance }}) description: "Flink job {{ $labels.job_name }} has {{ $value }} failed checkpoints in the last 10 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.7.6. Flink checkpoint duration high
Flink job {{ $labels.job_name }} last checkpoint took {{ $value | humanizeDuration }} to complete. [copy] # Threshold is 60 seconds; the metric is reported in milliseconds. Adjust based on your checkpoint interval and state size. - alert: FlinkCheckpointDurationHigh expr: flink_jobmanager_job_lastCheckpointDuration > 60000 for: 5m labels: severity: warning annotations: summary: Flink checkpoint duration high (instance {{ $labels.instance }}) description: "Flink job {{ $labels.job_name }} last checkpoint took {{ $value | humanizeDuration }} to complete.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.7.7. Flink task backpressured
Flink task {{ $labels.task_name }} in job {{ $labels.job_name }} is backpressured. [copy] - alert: FlinkTaskBackpressured expr: flink_taskmanager_job_task_isBackPressured == 1 for: 5m labels: severity: warning annotations: summary: Flink task backpressured (instance {{ $labels.instance }}) description: "Flink task {{ $labels.task_name }} in job {{ $labels.job_name }} is backpressured.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.7.8. Flink task high backpressure time
Flink task {{ $labels.task_name }} is spending {{ $value | humanize }}ms/sec in backpressure. [copy] # Fires when a task spends more than 500ms/sec backpressured. This indicates the task cannot keep up with upstream data rate. - alert: FlinkTaskHighBackpressureTime expr: flink_taskmanager_job_task_backPressuredTimeMsPerSecond > 500 for: 5m labels: severity: warning annotations: summary: Flink task high backpressure time (instance {{ $labels.instance }}) description: "Flink task {{ $labels.task_name }} is spending {{ $value | humanize }}ms/sec in backpressure.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.7.9. Flink TaskManager heap memory high
Flink TaskManager {{ $labels.instance }} heap memory usage is above 90%. [copy] - alert: FlinkTaskmanagerHeapMemoryHigh expr: flink_taskmanager_Status_JVM_Memory_Heap_Used / flink_taskmanager_Status_JVM_Memory_Heap_Max > 0.9 for: 5m labels: severity: warning annotations: summary: Flink TaskManager heap memory high (instance {{ $labels.instance }}) description: "Flink TaskManager {{ $labels.instance }} heap memory usage is above 90%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.7.10. Flink JobManager heap memory high
Flink JobManager {{ $labels.instance }} heap memory usage is above 90%. [copy] - alert: FlinkJobmanagerHeapMemoryHigh expr: flink_jobmanager_Status_JVM_Memory_Heap_Used / flink_jobmanager_Status_JVM_Memory_Heap_Max > 0.9 for: 5m labels: severity: warning annotations: summary: Flink JobManager heap memory high (instance {{ $labels.instance }}) description: "Flink JobManager {{ $labels.instance }} heap memory usage is above 90%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.7.11. Flink TaskManager GC time high
Flink TaskManager {{ $labels.instance }} is spending more than 10% of time in garbage collection. [copy] # Threshold: more than 100ms/sec of GC time (10% of wall clock). Adjust based on your workload. - alert: FlinkTaskmanagerGcTimeHigh expr: rate(flink_taskmanager_Status_JVM_GarbageCollector_All_Time[5m]) > 100 for: 5m labels: severity: warning annotations: summary: Flink TaskManager GC time high (instance {{ $labels.instance }}) description: "Flink TaskManager {{ $labels.instance }} is spending more than 10% of time in garbage collection.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
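Since `..._GarbageCollector_All_Time` is cumulative milliseconds, its per-second rate divided by 1000 is the fraction of wall-clock time spent in GC. A recording rule keeps the alert expression readable; the rule name below is a suggestion, not a convention from this catalog:

```yaml
# Sketch: record the GC time fraction (0-1) per TaskManager.
groups:
  - name: flink-gc
    rules:
      - record: flink:taskmanager_gc_time_ratio
        expr: rate(flink_taskmanager_Status_JVM_GarbageCollector_All_Time[5m]) / 1000
```

The alert above could then be expressed as `flink:taskmanager_gc_time_ratio > 0.1`.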
-
# 4.7.12. Flink no records processed
Flink task {{ $labels.task_name }} has not processed any records in the last 5 minutes. [copy] # Only fires for tasks that have previously received records, to avoid false positives during startup. - alert: FlinkNoRecordsProcessed expr: rate(flink_taskmanager_job_task_numRecordsIn[5m]) == 0 and flink_taskmanager_job_task_numRecordsIn > 0 for: 5m labels: severity: warning annotations: summary: Flink no records processed (instance {{ $labels.instance }}) description: "Flink task {{ $labels.task_name }} has not processed any records in the last 5 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 4.8. Apache Spark : Built-in Prometheus (PrometheusServlet + PrometheusResource) (8 rules) [copy section]
Spark exposes metrics via two built-in endpoints:
- PrometheusServlet: master/worker/driver metrics at /metrics/prometheus/ (ports 8080, 8081, 4040)
- PrometheusResource: executor metrics at /metrics/executors/prometheus/ (port 4040, requires spark.ui.prometheus.enabled=true in Spark 3.x)
Metric names from PrometheusServlet include a dynamic namespace (application ID), making static PromQL queries challenging.
Configuration: spark.metrics.conf.*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/apache-spark/spark-prometheus.yml-
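Putting the endpoints above together, a Prometheus scrape configuration might look like the following sketch. Host names are illustrative placeholders, and the executor job only works with `spark.ui.prometheus.enabled=true`:

```yaml
scrape_configs:
  - job_name: spark-master
    metrics_path: /metrics/prometheus/
    static_configs:
      - targets: ['spark-master:8080']      # hypothetical host
  - job_name: spark-worker
    metrics_path: /metrics/prometheus/
    static_configs:
      - targets: ['spark-worker-1:8081']    # hypothetical host
  - job_name: spark-driver-executors
    metrics_path: /metrics/executors/prometheus/
    static_configs:
      - targets: ['spark-driver:4040']      # requires spark.ui.prometheus.enabled=true
```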
# 4.8.1. Spark no alive workers
No Spark workers are alive. The cluster has no processing capacity. [copy] - alert: SparkNoAliveWorkers expr: metrics_master_aliveWorkers_Value == 0 for: 1m labels: severity: critical annotations: summary: Spark no alive workers (instance {{ $labels.instance }}) description: "No Spark workers are alive. The cluster has no processing capacity.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.8.2. Spark too many waiting apps
Spark has {{ $value }} applications waiting for resources. [copy] # Adjust the threshold based on your cluster's typical queuing behavior. - alert: SparkTooManyWaitingApps expr: metrics_master_waitingApps_Value > 10 for: 5m labels: severity: warning annotations: summary: Spark too many waiting apps (instance {{ $labels.instance }}) description: "Spark has {{ $value }} applications waiting for resources.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.8.3. Spark worker memory exhausted
Spark worker {{ $labels.instance }} has no free memory ({{ $value }}MB free). [copy] - alert: SparkWorkerMemoryExhausted expr: metrics_worker_memFree_MB_Value == 0 for: 2m labels: severity: warning annotations: summary: Spark worker memory exhausted (instance {{ $labels.instance }}) description: "Spark worker {{ $labels.instance }} has no free memory ({{ $value }}MB free).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.8.4. Spark worker cores exhausted
Spark worker {{ $labels.instance }} has no free cores. [copy] # Fires when a worker has no free cores. This may be normal under high load but can indicate capacity issues. - alert: SparkWorkerCoresExhausted expr: metrics_worker_coresFree_Value == 0 for: 5m labels: severity: warning annotations: summary: Spark worker cores exhausted (instance {{ $labels.instance }}) description: "Spark worker {{ $labels.instance }} has no free cores.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.8.5. Spark executor high GC time
Spark executor {{ $labels.executor_id }} in {{ $labels.application_name }} is spending too much time in GC. [copy] # Fires when more than 10% of executor time is spent in garbage collection. # This metric comes from the PrometheusResource endpoint (/metrics/executors/prometheus/). - alert: SparkExecutorHighGcTime expr: metrics_executor_totalGCTime / (metrics_executor_totalDuration > 0) > 0.1 for: 5m labels: severity: warning annotations: summary: Spark executor high GC time (instance {{ $labels.instance }}) description: "Spark executor {{ $labels.executor_id }} in {{ $labels.application_name }} is spending too much time in GC.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.8.6. Spark executor all tasks failing
Spark executor {{ $labels.executor_id }} has only failing tasks ({{ $value }} failed, 0 completed). [copy] - alert: SparkExecutorAllTasksFailing expr: metrics_executor_failedTasks > 0 and metrics_executor_completedTasks == 0 for: 5m labels: severity: critical annotations: summary: Spark executor all tasks failing (instance {{ $labels.instance }}) description: "Spark executor {{ $labels.executor_id }} has only failing tasks ({{ $value }} failed, 0 completed).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.8.7. Spark executor high task failure rate
Spark executor {{ $labels.executor_id }} has a task failure rate above 10%. [copy] - alert: SparkExecutorHighTaskFailureRate expr: metrics_executor_failedTasks / (metrics_executor_totalTasks > 0) > 0.1 for: 5m labels: severity: warning annotations: summary: Spark executor high task failure rate (instance {{ $labels.instance }}) description: "Spark executor {{ $labels.executor_id }} has a task failure rate above 10%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 4.8.8. Spark executor high disk spill
Spark executor {{ $labels.executor_id }} is spilling data to disk. Consider increasing executor memory. [copy] # Disk spilling indicates insufficient memory for the workload. - alert: SparkExecutorHighDiskSpill expr: rate(metrics_executor_diskUsed_bytes[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Spark executor high disk spill (instance {{ $labels.instance }}) description: "Spark executor {{ $labels.executor_id }} is spilling data to disk. Consider increasing executor memory.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 5.1. Kubernetes : kube-state-metrics (37 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/kubernetes/kubestate-exporter.yml-
# 5.1.1. Kubernetes Node not ready
Node {{ $labels.node }} has been unready for a long time [copy] - alert: KubernetesNodeNotReady expr: kube_node_status_condition{condition="Ready",status="true"} == 0 for: 10m labels: severity: critical annotations: summary: Kubernetes Node not ready (instance {{ $labels.instance }}) description: "Node {{ $labels.node }} has been unready for a long time\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.2. Kubernetes Node scheduling disabled
Node {{ $labels.node }} has been marked as unschedulable for more than 30 minutes. [copy] # Nodes with scheduling disabled are often intentional (e.g. cordoned for maintenance). # This alert is useful to catch nodes that remain unschedulable longer than expected. - alert: KubernetesNodeSchedulingDisabled expr: kube_node_spec_taint{key="node.kubernetes.io/unschedulable"} == 1 for: 30m labels: severity: warning annotations: summary: Kubernetes Node scheduling disabled (instance {{ $labels.instance }}) description: "Node {{ $labels.node }} has been marked as unschedulable for more than 30 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.3. Kubernetes Node memory pressure
Node {{ $labels.node }} has MemoryPressure condition [copy] - alert: KubernetesNodeMemoryPressure expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1 for: 2m labels: severity: critical annotations: summary: Kubernetes Node memory pressure (instance {{ $labels.instance }}) description: "Node {{ $labels.node }} has MemoryPressure condition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.4. Kubernetes Node disk pressure
Node {{ $labels.node }} has DiskPressure condition [copy] - alert: KubernetesNodeDiskPressure expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1 for: 2m labels: severity: critical annotations: summary: Kubernetes Node disk pressure (instance {{ $labels.instance }}) description: "Node {{ $labels.node }} has DiskPressure condition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.5. Kubernetes Node network unavailable
Node {{ $labels.node }} has NetworkUnavailable condition [copy] - alert: KubernetesNodeNetworkUnavailable expr: kube_node_status_condition{condition="NetworkUnavailable",status="true"} == 1 for: 2m labels: severity: critical annotations: summary: Kubernetes Node network unavailable (instance {{ $labels.instance }}) description: "Node {{ $labels.node }} has NetworkUnavailable condition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
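When a node degrades, its condition alerts (5.1.1-5.1.5) tend to arrive together with a flood of lower-severity alerts. The canonical Alertmanager inhibition pattern mutes warnings while a matching critical alert is firing; which labels to put in `equal` depends on your relabeling, so treat this as a sketch:

```yaml
# Sketch: a firing critical alert suppresses the warning-level alert
# with the same alertname and instance.
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']
```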
-
# 5.1.6. Kubernetes Node out of pod capacity
Node {{ $labels.node }} is out of pod capacity [copy] - alert: KubernetesNodeOutOfPodCapacity expr: sum by (node) ((kube_pod_status_phase{phase="Running"} == 1) + on(uid, instance) group_left(node) (0 * kube_pod_info{pod_template_hash=""})) / sum by (node) (kube_node_status_allocatable{resource="pods"}) * 100 > 90 for: 2m labels: severity: warning annotations: summary: Kubernetes Node out of pod capacity (instance {{ $labels.instance }}) description: "Node {{ $labels.node }} is out of pod capacity\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
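The expression above joins Running pods onto their node via `kube_pod_info`. If that join proves brittle in your setup, a coarser sketch is to count pods in every phase per node (which slightly overcounts, since Succeeded/Failed pods no longer consume capacity):

```yaml
# Sketch: coarser variant counting pods in every phase, not just Running.
- alert: KubernetesNodeOutOfPodCapacitySimple
  expr: count by (node) (kube_pod_info) / sum by (node) (kube_node_status_allocatable{resource="pods"}) * 100 > 90
  for: 2m
  labels:
    severity: warning
```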
-
# 5.1.7. Kubernetes Container oom killer
Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes. [copy] - alert: KubernetesContainerOomKiller expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1 for: 0m labels: severity: warning annotations: summary: Kubernetes Container oom killer (instance {{ $labels.instance }}) description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.8. Kubernetes Job failed
Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete [copy] - alert: KubernetesJobFailed expr: kube_job_status_failed > 0 for: 0m labels: severity: warning annotations: summary: Kubernetes Job failed (instance {{ $labels.instance }}) description: "Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.9. Kubernetes Job not starting
Job {{ $labels.namespace }}/{{ $labels.job_name }} did not start for 10 minutes [copy] - alert: KubernetesJobNotStarting expr: kube_job_status_active == 0 and kube_job_status_failed == 0 and kube_job_status_succeeded == 0 and (time() - kube_job_status_start_time) > 600 for: 0m labels: severity: warning annotations: summary: Kubernetes Job not starting (instance {{ $labels.instance }}) description: "Job {{ $labels.namespace }}/{{ $labels.job_name }} did not start for 10 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.10. Kubernetes CronJob failing
CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is failing [copy] - alert: KubernetesCronjobFailing expr: (kube_cronjob_status_last_schedule_time > kube_cronjob_status_last_successful_time) AND (kube_cronjob_status_active == 0) AND (kube_cronjob_spec_suspend == 0) for: 0m labels: severity: critical annotations: summary: Kubernetes CronJob failing (instance {{ $labels.instance }}) description: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is failing\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.11. Kubernetes CronJob suspended
CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended [copy] - alert: KubernetesCronjobSuspended expr: kube_cronjob_spec_suspend != 0 for: 0m labels: severity: warning annotations: summary: Kubernetes CronJob suspended (instance {{ $labels.instance }}) description: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.12. Kubernetes PersistentVolumeClaim pending
PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending [copy] - alert: KubernetesPersistentvolumeclaimPending expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1 for: 2m labels: severity: warning annotations: summary: Kubernetes PersistentVolumeClaim pending (instance {{ $labels.instance }}) description: "PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.13. Kubernetes Volume out of disk space
Volume is almost full (< 10% left) [copy] - alert: KubernetesVolumeOutOfDiskSpace expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10 for: 2m labels: severity: warning annotations: summary: Kubernetes Volume out of disk space (instance {{ $labels.instance }}) description: "Volume is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.14. Kubernetes Volume full in four days
Volume under {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available. [copy] - alert: KubernetesVolumeFullInFourDays expr: predict_linear(kubelet_volume_stats_available_bytes[6h:5m], 4 * 24 * 3600) < 0 for: 0m labels: severity: critical annotations: summary: Kubernetes Volume full in four days (instance {{ $labels.instance }}) description: "Volume under {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
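`predict_linear` can extrapolate aggressively on volumes with bursty usage. A common refinement, sketched here under the assumption that false positives on mostly-empty volumes are your main noise source, is to gate the prediction on the current fill level:

```yaml
# Sketch: only predict exhaustion for volumes already more than half full.
- alert: KubernetesVolumeFullInFourDaysFiltered
  expr: predict_linear(kubelet_volume_stats_available_bytes[6h:5m], 4 * 24 * 3600) < 0 and kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.5
  for: 0m
  labels:
    severity: critical
```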
-
# 5.1.15. Kubernetes PersistentVolume error
Persistent volume {{ $labels.persistentvolume }} is in bad state [copy] - alert: KubernetesPersistentvolumeError expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending", job="kube-state-metrics"} > 0 for: 0m labels: severity: critical annotations: summary: Kubernetes PersistentVolume error (instance {{ $labels.instance }}) description: "Persistent volume {{ $labels.persistentvolume }} is in bad state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.16. Kubernetes StatefulSet down
StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} went down [copy] - alert: KubernetesStatefulsetDown expr: kube_statefulset_replicas != kube_statefulset_status_replicas_ready > 0 for: 1m labels: severity: critical annotations: summary: Kubernetes StatefulSet down (instance {{ $labels.instance }}) description: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} went down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.17. Kubernetes HPA scale inability
HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is unable to scale [copy] - alert: KubernetesHpaScaleInability expr: (kube_horizontalpodautoscaler_spec_max_replicas - kube_horizontalpodautoscaler_status_desired_replicas) * on (horizontalpodautoscaler,namespace) (kube_horizontalpodautoscaler_status_condition{condition="ScalingLimited", status="true"} == 1) == 0 for: 2m labels: severity: warning annotations: summary: Kubernetes HPA scale inability (instance {{ $labels.instance }}) description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is unable to scale\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.18. Kubernetes HPA metrics unavailability
HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is unable to collect metrics [copy] - alert: KubernetesHpaMetricsUnavailability expr: kube_horizontalpodautoscaler_status_condition{status="false", condition="ScalingActive"} == 1 for: 0m labels: severity: warning annotations: summary: Kubernetes HPA metrics unavailability (instance {{ $labels.instance }}) description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is unable to collect metrics\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.19. Kubernetes HPA scale maximum
HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has hit maximum number of desired pods [copy] - alert: KubernetesHpaScaleMaximum expr: (kube_horizontalpodautoscaler_status_desired_replicas >= kube_horizontalpodautoscaler_spec_max_replicas) and (kube_horizontalpodautoscaler_spec_max_replicas > 1) and (kube_horizontalpodautoscaler_spec_min_replicas != kube_horizontalpodautoscaler_spec_max_replicas) for: 2m labels: severity: info annotations: summary: Kubernetes HPA scale maximum (instance {{ $labels.instance }}) description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has hit maximum number of desired pods\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.20. Kubernetes HPA underutilized
HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has sat at its minimum replica count for at least 50% of the last day. Potential cost saving here. [copy] - alert: KubernetesHpaUnderutilized expr: max(quantile_over_time(0.5, kube_horizontalpodautoscaler_status_desired_replicas[1d]) == kube_horizontalpodautoscaler_spec_min_replicas) by (horizontalpodautoscaler) > 3 for: 0m labels: severity: info annotations: summary: Kubernetes HPA underutilized (instance {{ $labels.instance }}) description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has sat at its minimum replica count for at least 50% of the last day. Potential cost saving here.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.21. Kubernetes Pod not healthy
Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 15 minutes. [copy] - alert: KubernetesPodNotHealthy expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0 for: 15m labels: severity: critical annotations: summary: Kubernetes Pod not healthy (instance {{ $labels.instance }}) description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 15 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.22. Kubernetes pod crash looping
Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping [copy] - alert: KubernetesPodCrashLooping expr: increase(kube_pod_container_status_restarts_total[1m]) > 3 for: 2m labels: severity: warning annotations: summary: Kubernetes pod crash looping (instance {{ $labels.instance }}) description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.23. Kubernetes ReplicaSet replicas mismatch
ReplicaSet {{ $labels.namespace }}/{{ $labels.replicaset }} replicas mismatch [copy] - alert: KubernetesReplicasetReplicasMismatch expr: kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas for: 10m labels: severity: warning annotations: summary: Kubernetes ReplicaSet replicas mismatch (instance {{ $labels.instance }}) description: "ReplicaSet {{ $labels.namespace }}/{{ $labels.replicaset }} replicas mismatch\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.24. Kubernetes Deployment replicas mismatch
Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch [copy] - alert: KubernetesDeploymentReplicasMismatch expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available for: 10m labels: severity: warning annotations: summary: Kubernetes Deployment replicas mismatch (instance {{ $labels.instance }}) description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.25. Kubernetes StatefulSet replicas mismatch
StatefulSet does not match the expected number of replicas. [copy] - alert: KubernetesStatefulsetReplicasMismatch expr: kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas for: 10m labels: severity: warning annotations: summary: Kubernetes StatefulSet replicas mismatch (instance {{ $labels.instance }}) description: "StatefulSet does not match the expected number of replicas.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.26. Kubernetes Deployment generation mismatch
Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has failed but has not been rolled back. [copy] - alert: KubernetesDeploymentGenerationMismatch expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation for: 10m labels: severity: critical annotations: summary: Kubernetes Deployment generation mismatch (instance {{ $labels.instance }}) description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has failed but has not been rolled back.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.27. Kubernetes StatefulSet generation mismatch
StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has failed but has not been rolled back. [copy] - alert: KubernetesStatefulsetGenerationMismatch expr: kube_statefulset_status_observed_generation != kube_statefulset_metadata_generation for: 10m labels: severity: critical annotations: summary: Kubernetes StatefulSet generation mismatch (instance {{ $labels.instance }}) description: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has failed but has not been rolled back.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.28. Kubernetes StatefulSet update not rolled out
StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out. [copy] - alert: KubernetesStatefulsetUpdateNotRolledOut expr: max without (revision) (kube_statefulset_status_current_revision unless kube_statefulset_status_update_revision) * (kube_statefulset_replicas != kube_statefulset_status_replicas_updated) for: 10m labels: severity: warning annotations: summary: Kubernetes StatefulSet update not rolled out (instance {{ $labels.instance }}) description: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.29. Kubernetes DaemonSet rollout stuck
Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled or not ready [copy] - alert: KubernetesDaemonsetRolloutStuck expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100 or kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0 for: 10m labels: severity: warning annotations: summary: Kubernetes DaemonSet rollout stuck (instance {{ $labels.instance }}) description: "Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled or not ready\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.30. Kubernetes DaemonSet misscheduled
Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run [copy] - alert: KubernetesDaemonsetMisscheduled expr: kube_daemonset_status_number_misscheduled > 0 for: 1m labels: severity: critical annotations: summary: Kubernetes DaemonSet misscheduled (instance {{ $labels.instance }}) description: "Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.31. Kubernetes CronJob too long
CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete. [copy] # Threshold should be customized for each cronjob name. - alert: KubernetesCronjobTooLong expr: kube_job_status_start_time > 0 and absent(kube_job_status_completion_time) and (time() - kube_job_status_start_time) > 3600 for: 0m labels: severity: warning annotations: summary: Kubernetes CronJob too long (instance {{ $labels.instance }}) description: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
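As the comment above notes, the 1h threshold should really vary per job. One way to do that is a dedicated rule per job-name pattern; the `nightly-backup-.*` pattern and 4h threshold below are examples only:

```yaml
# Sketch: looser threshold for a known long-running CronJob.
# The "nightly-backup-.*" job_name pattern is hypothetical.
- alert: KubernetesCronjobTooLongBackup
  expr: kube_job_status_active{job_name=~"nightly-backup-.*"} > 0 and (time() - kube_job_status_start_time{job_name=~"nightly-backup-.*"}) > 14400
  for: 0m
  labels:
    severity: warning
```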
-
# 5.1.32. Kubernetes Job slow completion
Kubernetes Job {{ $labels.namespace }}/{{ $labels.job_name }} did not complete in time. [copy] - alert: KubernetesJobSlowCompletion expr: kube_job_spec_completions - kube_job_status_succeeded - kube_job_status_failed > 0 for: 12h labels: severity: critical annotations: summary: Kubernetes Job slow completion (instance {{ $labels.instance }}) description: "Kubernetes Job {{ $labels.namespace }}/{{ $labels.job_name }} did not complete in time.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.33. Kubernetes API server errors
Kubernetes API server is experiencing high error rate [copy] - alert: KubernetesApiServerErrors expr: sum(rate(apiserver_request_total{job="apiserver",code=~"(?:5..)"}[1m])) by (instance, job) / sum(rate(apiserver_request_total{job="apiserver"}[1m])) by (instance, job) * 100 > 3 for: 2m labels: severity: critical annotations: summary: Kubernetes API server errors (instance {{ $labels.instance }}) description: "Kubernetes API server is experiencing high error rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.34. Kubernetes API client errors
Kubernetes API client is experiencing high error rate [copy] - alert: KubernetesApiClientErrors expr: (sum(rate(rest_client_requests_total{code=~"(4|5).."}[1m])) by (instance, job) / sum(rate(rest_client_requests_total[1m])) by (instance, job)) * 100 > 1 for: 2m labels: severity: critical annotations: summary: Kubernetes API client errors (instance {{ $labels.instance }}) description: "Kubernetes API client is experiencing high error rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.35. Kubernetes client certificate expires next week
A client certificate used to authenticate to the apiserver is expiring next week. [copy] - alert: KubernetesClientCertificateExpiresNextWeek expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 7*24*60*60 for: 0m labels: severity: warning annotations: summary: Kubernetes client certificate expires next week (instance {{ $labels.instance }}) description: "A client certificate used to authenticate to the apiserver is expiring next week.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.36. Kubernetes client certificate expires soon
A client certificate used to authenticate to the apiserver is expiring in less than 24 hours. [copy] - alert: KubernetesClientCertificateExpiresSoon expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 24*60*60 for: 0m labels: severity: critical annotations: summary: Kubernetes client certificate expires soon (instance {{ $labels.instance }}) description: "A client certificate used to authenticate to the apiserver is expiring in less than 24 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.1.37. Kubernetes API server latency
Kubernetes API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}. [copy] - alert: KubernetesApiServerLatency expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!~"(?:CONNECT|WATCHLIST|WATCH|PROXY)"} [10m])) WITHOUT (subresource)) > 1 for: 2m labels: severity: warning annotations: summary: Kubernetes API server latency (instance {{ $labels.instance }}) description: "Kubernetes API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 5.2. Nomad : Embedded exporter (4 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/nomad/embedded-exporter.yml-
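Downloading a rule file (as with the wget command above) is not sufficient on its own: Prometheus only evaluates files listed under `rule_files` in its configuration, and a reload is needed after changes (SIGHUP, or POST to `/-/reload` when `--web.enable-lifecycle` is set). A minimal excerpt, with an assumed path:

```yaml
# prometheus.yml (excerpt) — the path is illustrative; point rule_files
# at wherever the downloaded rule files were saved.
rule_files:
  - /etc/prometheus/rules/*.yml
```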
# 5.2.1. Nomad job failed
Nomad job failed [copy] - alert: NomadJobFailed expr: nomad_nomad_job_summary_failed > 0 for: 0m labels: severity: warning annotations: summary: Nomad job failed (instance {{ $labels.instance }}) description: "Nomad job failed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.2.2. Nomad job lost
Nomad job lost [copy] - alert: NomadJobLost expr: nomad_nomad_job_summary_lost > 0 for: 0m labels: severity: warning annotations: summary: Nomad job lost (instance {{ $labels.instance }}) description: "Nomad job lost\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.2.3. Nomad job queued
Nomad job queued [copy] - alert: NomadJobQueued expr: nomad_nomad_job_summary_queued > 0 for: 2m labels: severity: warning annotations: summary: Nomad job queued (instance {{ $labels.instance }}) description: "Nomad job queued\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.2.4. Nomad blocked evaluation
Nomad blocked evaluation [copy] - alert: NomadBlockedEvaluation expr: nomad_nomad_blocked_evals_total_blocked > 0 for: 0m labels: severity: warning annotations: summary: Nomad blocked evaluation (instance {{ $labels.instance }}) description: "Nomad blocked evaluation\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 5.3. Consul : prometheus/consul_exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/consul/consul-exporter.yml-
# 5.3.1. Consul service healthcheck failed
Service: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}` [copy] - alert: ConsulServiceHealthcheckFailed expr: consul_catalog_service_node_healthy == 0 for: 1m labels: severity: critical annotations: summary: Consul service healthcheck failed (instance {{ $labels.instance }}) description: "Service: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.3.2. Consul missing master node
The number of Consul raft peers should be 3 in order to preserve quorum. [copy] - alert: ConsulMissingMasterNode expr: consul_raft_peers < 3 for: 0m labels: severity: critical annotations: summary: Consul missing master node (instance {{ $labels.instance }}) description: "The number of Consul raft peers should be 3 in order to preserve quorum.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
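The `< 3` comparison assumes a 3-server cluster. A cluster of N servers keeps quorum while at least ⌊N/2⌋+1 peers remain, but any missing peer already reduces failure tolerance, so it is usually better to alert as soon as the peer count drops below the expected size. A sketch for a 5-server cluster (the count is an assumption; use your actual server count):

```yaml
# Sketch for a 5-server Consul cluster: quorum needs 3 of 5, but alerting
# below 5 catches the loss of failure tolerance early.
- alert: ConsulMissingServerNode
  expr: consul_raft_peers < 5
  for: 0m
  labels:
    severity: critical
```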
-
# 5.3.3. Consul agent unhealthy
A Consul agent is down [copy] - alert: ConsulAgentUnhealthy expr: consul_health_node_status{status="critical"} == 1 for: 0m labels: severity: critical annotations: summary: Consul agent unhealthy (instance {{ $labels.instance }}) description: "A Consul agent is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 5.4. Etcd : Embedded exporter (13 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/etcd/embedded-exporter.yml-
# 5.4.1. Etcd insufficient Members
Etcd cluster should have an odd number of members [copy] - alert: EtcdInsufficientMembers expr: count(etcd_server_id) % 2 == 0 for: 0m labels: severity: critical annotations: summary: Etcd insufficient Members (instance {{ $labels.instance }}) description: "Etcd cluster should have an odd number of members\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.2. Etcd no Leader
Etcd cluster has no leader [copy] - alert: EtcdNoLeader expr: etcd_server_has_leader == 0 for: 0m labels: severity: critical annotations: summary: Etcd no Leader (instance {{ $labels.instance }}) description: "Etcd cluster has no leader\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.3. Etcd high number of leader changes
Etcd leader changed more than 2 times during 10 minutes [copy] - alert: EtcdHighNumberOfLeaderChanges expr: increase(etcd_server_leader_changes_seen_total[10m]) > 2 for: 0m labels: severity: warning annotations: summary: Etcd high number of leader changes (instance {{ $labels.instance }}) description: "Etcd leader changed more than 2 times during 10 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.4. Etcd high number of failed GRPC requests
More than 1% GRPC request failure detected in Etcd [copy] - alert: EtcdHighNumberOfFailedGrpcRequests expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.01 for: 2m labels: severity: warning annotations: summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }}) description: "More than 1% GRPC request failure detected in Etcd\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.5. Etcd high number of failed GRPC requests
More than 5% GRPC request failure detected in Etcd [copy] - alert: EtcdHighNumberOfFailedGrpcRequests expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.05 for: 2m labels: severity: critical annotations: summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }}) description: "More than 5% GRPC request failure detected in Etcd\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
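The 1% warning and 5% critical alerts above evaluate the same ratio twice. A recording rule can compute it once per interval and let both alerts become simple threshold checks — the rule name below is an assumption, following the conventional `level:metric:operations` naming scheme:

```yaml
# Sketch: record the per-method gRPC failure ratio once.
- record: grpc_method:grpc_server_failed:ratio_rate1m
  expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) by (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) by (grpc_service, grpc_method)
# The warning alert then reduces to a threshold on the recorded series
# (and the critical alert to the same expression with > 0.05).
- alert: EtcdHighNumberOfFailedGrpcRequests
  expr: grpc_method:grpc_server_failed:ratio_rate1m > 0.01
  for: 2m
  labels:
    severity: warning
```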
-
# 5.4.6. Etcd GRPC requests slow
GRPC requests slowing down, 99th percentile is over 0.15s [copy] - alert: EtcdGrpcRequestsSlow expr: histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[1m])) by (grpc_service, grpc_method, le)) > 0.15 for: 2m labels: severity: warning annotations: summary: Etcd GRPC requests slow (instance {{ $labels.instance }}) description: "GRPC requests slowing down, 99th percentile is over 0.15s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.7. Etcd high number of failed HTTP requests
More than 1% HTTP failure detected in Etcd [copy] - alert: EtcdHighNumberOfFailedHttpRequests expr: sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.01 for: 2m labels: severity: warning annotations: summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }}) description: "More than 1% HTTP failure detected in Etcd\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.8. Etcd high number of failed HTTP requests
More than 5% HTTP failure detected in Etcd [copy] - alert: EtcdHighNumberOfFailedHttpRequests expr: sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.05 for: 2m labels: severity: critical annotations: summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }}) description: "More than 5% HTTP failure detected in Etcd\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
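Whenever the 5% critical alert above fires, the 1% warning necessarily fires too. Alertmanager inhibition can suppress the redundant warning; the sketch below assumes the `severity` labels used throughout this page and matches on `method`, the grouping label of these HTTP alerts:

```yaml
# alertmanager.yml (excerpt): a firing critical alert silences the
# warning of the same alert name for the same HTTP method.
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ['alertname', 'method']
```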
-
# 5.4.9. Etcd HTTP requests slow
HTTP requests slowing down, 99th percentile is over 0.15s [copy] - alert: EtcdHttpRequestsSlow expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[1m])) > 0.15 for: 2m labels: severity: warning annotations: summary: Etcd HTTP requests slow (instance {{ $labels.instance }}) description: "HTTP requests slowing down, 99th percentile is over 0.15s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.10. Etcd member communication slow
Etcd member communication slowing down, 99th percentile is over 0.15s [copy] - alert: EtcdMemberCommunicationSlow expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[1m])) > 0.15 for: 2m labels: severity: warning annotations: summary: Etcd member communication slow (instance {{ $labels.instance }}) description: "Etcd member communication slowing down, 99th percentile is over 0.15s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.11. Etcd high number of failed proposals
Etcd server has had more than 5 failed proposals in the past hour [copy] - alert: EtcdHighNumberOfFailedProposals expr: increase(etcd_server_proposals_failed_total[1h]) > 5 for: 2m labels: severity: warning annotations: summary: Etcd high number of failed proposals (instance {{ $labels.instance }}) description: "Etcd server has had more than 5 failed proposals in the past hour\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.12. Etcd high fsync durations
Etcd WAL fsync duration increasing, 99th percentile is over 0.5s [copy] - alert: EtcdHighFsyncDurations expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[1m])) > 0.5 for: 2m labels: severity: warning annotations: summary: Etcd high fsync durations (instance {{ $labels.instance }}) description: "Etcd WAL fsync duration increasing, 99th percentile is over 0.5s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.4.13. Etcd high commit durations
Etcd commit duration increasing, 99th percentile is over 0.25s [copy] - alert: EtcdHighCommitDurations expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[1m])) > 0.25 for: 2m labels: severity: warning annotations: summary: Etcd high commit durations (instance {{ $labels.instance }}) description: "Etcd commit duration increasing, 99th percentile is over 0.25s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 5.5. Linkerd : Embedded exporter (1 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/linkerd/embedded-exporter.yml-
# 5.5.1. Linkerd high error rate
Linkerd error rate for {{ or $labels.deployment $labels.statefulset $labels.daemonset }} is over 10% [copy] # Go templates do not support piping between plain values, so "or" is used to pick the first non-empty workload label. - alert: LinkerdHighErrorRate expr: sum(rate(request_errors_total[1m])) by (deployment, statefulset, daemonset) / sum(rate(request_total[1m])) by (deployment, statefulset, daemonset) * 100 > 10 for: 1m labels: severity: warning annotations: summary: Linkerd high error rate (instance {{ $labels.instance }}) description: "Linkerd error rate for {{ or $labels.deployment $labels.statefulset $labels.daemonset }} is over 10%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 5.6. Istio : Embedded exporter (10 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/istio/embedded-exporter.yml-
# 5.6.1. Istio Kubernetes gateway availability drop
The number of available gateway pods has dropped. Inbound traffic will likely be affected. [copy] - alert: IstioKubernetesGatewayAvailabilityDrop expr: min(kube_deployment_status_replicas_available{deployment="istio-ingressgateway", namespace="istio-system"}) without (instance, pod) < 2 for: 1m labels: severity: warning annotations: summary: Istio Kubernetes gateway availability drop (instance {{ $labels.instance }}) description: "The number of available gateway pods has dropped. Inbound traffic will likely be affected.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.6.2. Istio Pilot high total request rate
Number of Istio Pilot push errors is too high (> 5%). Envoy sidecars might have outdated configuration. [copy] - alert: IstioPilotHighTotalRequestRate expr: sum(rate(pilot_xds_push_errors[1m])) / sum(rate(pilot_xds_pushes[1m])) * 100 > 5 for: 1m labels: severity: warning annotations: summary: Istio Pilot high total request rate (instance {{ $labels.instance }}) description: "Number of Istio Pilot push errors is too high (> 5%). Envoy sidecars might have outdated configuration.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.6.3. Istio Mixer Prometheus dispatches low
Number of Mixer dispatches to Prometheus is too low. Istio metrics might not be exported properly. [copy] - alert: IstioMixerPrometheusDispatchesLow expr: sum(rate(mixer_runtime_dispatches_total{adapter=~"prometheus"}[1m])) < 180 for: 1m labels: severity: warning annotations: summary: Istio Mixer Prometheus dispatches low (instance {{ $labels.instance }}) description: "Number of Mixer dispatches to Prometheus is too low. Istio metrics might not be exported properly.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.6.4. Istio high total request rate
Global request rate in the service mesh is unusually high. [copy] - alert: IstioHighTotalRequestRate expr: sum(rate(istio_requests_total{reporter="destination"}[5m])) > 1000 for: 2m labels: severity: warning annotations: summary: Istio high total request rate (instance {{ $labels.instance }}) description: "Global request rate in the service mesh is unusually high.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.6.5. Istio low total request rate
Global request rate in the service mesh is unusually low. [copy] - alert: IstioLowTotalRequestRate expr: sum(rate(istio_requests_total{reporter="destination"}[5m])) < 100 for: 2m labels: severity: warning annotations: summary: Istio low total request rate (instance {{ $labels.instance }}) description: "Global request rate in the service mesh is unusually low.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
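The fixed 1000 and 100 req/s thresholds in the two rules above rarely transfer between meshes. A hedged alternative is a relative threshold comparing the current rate against the same time yesterday — the 2x factor, the 1d offset, and the alert name below are illustrative assumptions:

```yaml
# Sketch: fire when mesh-wide traffic is more than double what it was
# at the same time yesterday. Factor and offset are illustrative.
- alert: IstioRequestRateSpike
  expr: sum(rate(istio_requests_total{reporter="destination"}[5m])) > 2 * sum(rate(istio_requests_total{reporter="destination"}[5m] offset 1d))
  for: 10m
  labels:
    severity: warning
```

The same pattern with `<` and a fractional factor covers the unusually-low case; both sides assume traffic has a roughly daily rhythm.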
-
# 5.6.6. Istio high 4xx error rate
High percentage of HTTP 4xx responses in Istio (> 5%). [copy] - alert: IstioHigh4xxErrorRate expr: sum(rate(istio_requests_total{reporter="destination", response_code=~"4.*"}[5m])) / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100 > 5 for: 1m labels: severity: warning annotations: summary: Istio high 4xx error rate (instance {{ $labels.instance }}) description: "High percentage of HTTP 4xx responses in Istio (> 5%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.6.7. Istio high 5xx error rate
High percentage of HTTP 5xx responses in Istio (> 5%). [copy] - alert: IstioHigh5xxErrorRate expr: sum(rate(istio_requests_total{reporter="destination", response_code=~"5.*"}[5m])) / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100 > 5 for: 1m labels: severity: warning annotations: summary: Istio high 5xx error rate (instance {{ $labels.instance }}) description: "High percentage of HTTP 5xx responses in Istio (> 5%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.6.8. Istio high request latency
Istio average request duration is longer than 100ms. [copy] - alert: IstioHighRequestLatency expr: rate(istio_request_duration_milliseconds_sum{reporter="destination"}[1m]) / rate(istio_request_duration_milliseconds_count{reporter="destination"}[1m]) > 100 for: 1m labels: severity: warning annotations: summary: Istio high request latency (instance {{ $labels.instance }}) description: "Istio average request duration is longer than 100ms.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.6.9. Istio latency 99 percentile
The slowest 1% of Istio requests take longer than 1000ms. [copy] - alert: IstioLatency99Percentile expr: histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[1m])) by (destination_canonical_service, destination_workload_namespace, source_canonical_service, source_workload_namespace, le)) > 1000 for: 1m labels: severity: warning annotations: summary: Istio latency 99 percentile (instance {{ $labels.instance }}) description: "The slowest 1% of Istio requests take longer than 1000ms.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.6.10. Istio Pilot Duplicate Entry
Istio pilot duplicate entry error. [copy] - alert: IstioPilotDuplicateEntry expr: sum(rate(pilot_duplicate_envoy_clusters{}[5m])) > 0 for: 0m labels: severity: critical annotations: summary: Istio Pilot Duplicate Entry (instance {{ $labels.instance }}) description: "Istio pilot duplicate entry error.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 5.7. ArgoCD : Embedded exporter (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/argocd/embedded-exporter.yml-
# 5.7.1. ArgoCD service not synced
Service {{ $labels.name }} managed by ArgoCD is currently not in sync. [copy] - alert: ArgocdServiceNotSynced expr: argocd_app_info{sync_status!="Synced"} != 0 for: 15m labels: severity: warning annotations: summary: ArgoCD service not synced (instance {{ $labels.instance }}) description: "Service {{ $labels.name }} managed by ArgoCD is currently not in sync.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.7.2. ArgoCD service unhealthy
Service {{ $labels.name }} managed by ArgoCD is currently not healthy. [copy] - alert: ArgocdServiceUnhealthy expr: argocd_app_info{health_status!="Healthy"} != 0 for: 15m labels: severity: warning annotations: summary: ArgoCD service unhealthy (instance {{ $labels.instance }}) description: "Service {{ $labels.name }} managed by ArgoCD is currently not healthy.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
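Applications that are intentionally left out of sync (paused experiments, apps with auto-sync disabled) can be excluded from the sync alert with a label matcher. The name pattern below is purely illustrative:

```yaml
# Sketch: ignore applications whose names mark them as intentionally
# unsynced. Replace the pattern with whatever convention you use.
- alert: ArgocdServiceNotSynced
  expr: argocd_app_info{sync_status!="Synced", name!~"experimental-.*"} != 0
  for: 15m
  labels:
    severity: warning
```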
-
-
# 5.8. FluxCD : Embedded exporter (4 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/fluxcd/embedded-exporter.yml-
# 5.8.1. Flux Kustomization Failure
The {{ $labels.customresource_kind }} '{{ $labels.name }}' in namespace {{ $labels.exported_namespace }} is marked as not ready. [copy] - alert: FluxKustomizationFailure expr: gotk_resource_info{ready="False", customresource_kind="Kustomization"} > 0 for: 15m labels: severity: warning annotations: summary: Flux Kustomization Failure (instance {{ $labels.instance }}) description: "The {{ $labels.customresource_kind }} '{{ $labels.name }}' in namespace {{ $labels.exported_namespace }} is marked as not ready.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.8.2. Flux HelmRelease Failure
The {{ $labels.customresource_kind }} '{{ $labels.name }}' in namespace {{ $labels.exported_namespace }} is marked as not ready. [copy] - alert: FluxHelmreleaseFailure expr: gotk_resource_info{ready="False", customresource_kind="HelmRelease"} > 0 for: 15m labels: severity: warning annotations: summary: Flux HelmRelease Failure (instance {{ $labels.instance }}) description: "The {{ $labels.customresource_kind }} '{{ $labels.name }}' in namespace {{ $labels.exported_namespace }} is marked as not ready.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.8.3. Flux Source Issue
Flux source {{ $labels.customresource_kind }} '{{ $labels.name }}' has issue(s). [copy] - alert: FluxSourceIssue expr: gotk_resource_info{ready="False", customresource_kind=~"GitRepository|HelmRepository|Bucket|OCIRepository"} > 0 for: 15m labels: severity: warning annotations: summary: Flux Source Issue (instance {{ $labels.instance }}) description: "Flux source {{ $labels.customresource_kind }} '{{ $labels.name }}' has issue(s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.8.4. Flux Image Issue
The {{ $labels.customresource_kind }} '{{ $labels.name }}' is marked as not ready. [copy] - alert: FluxImageIssue expr: gotk_resource_info{ready="False", customresource_kind=~"ImagePolicy|ImageRepository|ImageUpdateAutomation"} > 0 for: 15m labels: severity: warning annotations: summary: Flux Image Issue (instance {{ $labels.instance }}) description: "The {{ $labels.customresource_kind }} '{{ $labels.name }}' is marked as not ready.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 5.9. OpenStack : openstack-exporter/openstack-exporter (20 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/openstack/openstack-exporter.yml-
# 5.9.1. OpenStack exporter down
The OpenStack exporter is down. OpenStack cloud metrics are no longer being collected. [copy] - alert: OpenstackExporterDown expr: up{job=~".*openstack.*"} == 0 for: 2m labels: severity: critical annotations: summary: OpenStack exporter down (instance {{ $labels.instance }}) description: "The OpenStack exporter is down. OpenStack cloud metrics are no longer being collected.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.2. OpenStack Nova agent down
Nova agent {{ $labels.hostname }} ({{ $labels.service }}) is down in zone {{ $labels.zone }} [copy] - alert: OpenstackNovaAgentDown expr: openstack_nova_agent_state{adminState="enabled"} == 0 for: 2m labels: severity: critical annotations: summary: OpenStack Nova agent down (instance {{ $labels.instance }}) description: "Nova agent {{ $labels.hostname }} ({{ $labels.service }}) is down in zone {{ $labels.zone }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.3. OpenStack Neutron agent down
Neutron agent {{ $labels.hostname }} ({{ $labels.service }}) is down [copy] - alert: OpenstackNeutronAgentDown expr: openstack_neutron_agent_state{adminState="enabled"} == 0 for: 2m labels: severity: critical annotations: summary: OpenStack Neutron agent down (instance {{ $labels.instance }}) description: "Neutron agent {{ $labels.hostname }} ({{ $labels.service }}) is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.4. OpenStack Cinder agent down
Cinder agent {{ $labels.hostname }} ({{ $labels.service }}) is down in zone {{ $labels.zone }} [copy] - alert: OpenstackCinderAgentDown expr: openstack_cinder_agent_state{adminState="enabled"} == 0 for: 2m labels: severity: critical annotations: summary: OpenStack Cinder agent down (instance {{ $labels.instance }}) description: "Cinder agent {{ $labels.hostname }} ({{ $labels.service }}) is down in zone {{ $labels.zone }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.5. OpenStack hypervisor high vCPU usage
Hypervisor {{ $labels.hostname }} vCPU usage is above 90% [copy] # The threshold of 90% is a rough default. Adjust based on your overcommit ratio and workload patterns. - alert: OpenstackHypervisorHighVcpuUsage expr: openstack_nova_vcpus_used / openstack_nova_vcpus_available > 0.9 and openstack_nova_vcpus_available > 0 for: 5m labels: severity: warning annotations: summary: OpenStack hypervisor high vCPU usage (instance {{ $labels.instance }}) description: "Hypervisor {{ $labels.hostname }} vCPU usage is above 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.6. OpenStack hypervisor high memory usage
Hypervisor {{ $labels.hostname }} memory usage is above 90% [copy] # The threshold of 90% is a rough default. Adjust based on your overcommit ratio and workload patterns. - alert: OpenstackHypervisorHighMemoryUsage expr: openstack_nova_memory_used_bytes / openstack_nova_memory_available_bytes > 0.9 and openstack_nova_memory_available_bytes > 0 for: 5m labels: severity: warning annotations: summary: OpenStack hypervisor high memory usage (instance {{ $labels.instance }}) description: "Hypervisor {{ $labels.hostname }} memory usage is above 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.7. OpenStack hypervisor high disk usage
Hypervisor {{ $labels.hostname }} local disk usage is above 90% [copy] - alert: OpenstackHypervisorHighDiskUsage expr: openstack_nova_local_storage_used_bytes / openstack_nova_local_storage_available_bytes > 0.9 and openstack_nova_local_storage_available_bytes > 0 for: 5m labels: severity: warning annotations: summary: OpenStack hypervisor high disk usage (instance {{ $labels.instance }}) description: "Hypervisor {{ $labels.hostname }} local disk usage is above 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.8. OpenStack Nova tenant vCPU quota nearly exhausted
Tenant {{ $labels.tenant }} has used over 90% of its vCPU quota [copy] # A value of -1 for limits_vcpus_max means unlimited quota (no limit set). - alert: OpenstackNovaTenantVcpuQuotaNearlyExhausted expr: openstack_nova_limits_vcpus_used / openstack_nova_limits_vcpus_max > 0.9 and openstack_nova_limits_vcpus_max > 0 for: 0m labels: severity: warning annotations: summary: OpenStack Nova tenant vCPU quota nearly exhausted (instance {{ $labels.instance }}) description: "Tenant {{ $labels.tenant }} has used over 90% of its vCPU quota\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.9. OpenStack Nova tenant memory quota nearly exhausted
Tenant {{ $labels.tenant }} has used over 90% of its memory quota [copy] - alert: OpenstackNovaTenantMemoryQuotaNearlyExhausted expr: openstack_nova_limits_memory_used / openstack_nova_limits_memory_max > 0.9 and openstack_nova_limits_memory_max > 0 for: 0m labels: severity: warning annotations: summary: OpenStack Nova tenant memory quota nearly exhausted (instance {{ $labels.instance }}) description: "Tenant {{ $labels.tenant }} has used over 90% of its memory quota\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.10. OpenStack Nova tenant instance quota nearly exhausted
Tenant {{ $labels.tenant }} has used over 90% of its instance quota [copy] - alert: OpenstackNovaTenantInstanceQuotaNearlyExhausted expr: openstack_nova_limits_instances_used / openstack_nova_limits_instances_max > 0.9 and openstack_nova_limits_instances_max > 0 for: 0m labels: severity: warning annotations: summary: OpenStack Nova tenant instance quota nearly exhausted (instance {{ $labels.instance }}) description: "Tenant {{ $labels.tenant }} has used over 90% of its instance quota\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.11. OpenStack Cinder tenant volume quota nearly exhausted
Tenant {{ $labels.tenant }} has used over 90% of its volume storage quota [copy] - alert: OpenstackCinderTenantVolumeQuotaNearlyExhausted expr: openstack_cinder_limits_volume_used_gb / openstack_cinder_limits_volume_max_gb > 0.9 and openstack_cinder_limits_volume_max_gb > 0 for: 0m labels: severity: warning annotations: summary: OpenStack Cinder tenant volume quota nearly exhausted (instance {{ $labels.instance }}) description: "Tenant {{ $labels.tenant }} has used over 90% of its volume storage quota\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.12. OpenStack Cinder pool low free capacity
Cinder storage pool {{ $labels.name }} has less than 10% free capacity [copy] - alert: OpenstackCinderPoolLowFreeCapacity expr: openstack_cinder_pool_capacity_free_gb / openstack_cinder_pool_capacity_total_gb < 0.1 and openstack_cinder_pool_capacity_total_gb > 0 for: 5m labels: severity: warning annotations: summary: OpenStack Cinder pool low free capacity (instance {{ $labels.instance }}) description: "Cinder storage pool {{ $labels.name }} has less than 10% free capacity\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.13. OpenStack Neutron floating IPs associated but not active
{{ $value }} floating IPs are associated with a private IP but are not in ACTIVE state [copy] - alert: OpenstackNeutronFloatingIpsAssociatedButNotActive expr: openstack_neutron_floating_ips_associated_not_active > 0 for: 5m labels: severity: warning annotations: summary: OpenStack Neutron floating IPs associated but not active (instance {{ $labels.instance }}) description: "{{ $value }} floating IPs are associated with a private IP but are not in ACTIVE state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.14. OpenStack Neutron routers not active
{{ $value }} Neutron routers are not in ACTIVE state [copy] - alert: OpenstackNeutronRoutersNotActive expr: openstack_neutron_routers_not_active > 0 for: 5m labels: severity: warning annotations: summary: OpenStack Neutron routers not active (instance {{ $labels.instance }}) description: "{{ $value }} Neutron routers are not in ACTIVE state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.15. OpenStack Neutron subnet IP pool exhaustion
Subnet {{ $labels.subnet_name }} on network {{ $labels.network_name }} has used over 90% of its IP pool [copy] - alert: OpenstackNeutronSubnetIpPoolExhaustion expr: openstack_neutron_network_ip_availabilities_used / openstack_neutron_network_ip_availabilities_total > 0.9 and openstack_neutron_network_ip_availabilities_total > 0 for: 0m labels: severity: warning annotations: summary: OpenStack Neutron subnet IP pool exhaustion (instance {{ $labels.instance }}) description: "Subnet {{ $labels.subnet_name }} on network {{ $labels.network_name }} has used over 90% of its IP pool\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.16. OpenStack Neutron ports without IPs
{{ $value }} active ports have no IP addresses assigned [copy] - alert: OpenstackNeutronPortsWithoutIps expr: openstack_neutron_ports_no_ips > 0 for: 5m labels: severity: warning annotations: summary: OpenStack Neutron ports without IPs (instance {{ $labels.instance }}) description: "{{ $value }} active ports have no IP addresses assigned\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.17. OpenStack load balancer not online
Load balancer {{ $labels.name }} ({{ $labels.id }}) operating status is {{ $labels.operating_status }} [copy] - alert: OpenstackLoadBalancerNotOnline expr: openstack_loadbalancer_loadbalancer_status{operating_status!="ONLINE"} > 0 for: 5m labels: severity: warning annotations: summary: OpenStack load balancer not online (instance {{ $labels.instance }}) description: "Load balancer {{ $labels.name }} ({{ $labels.id }}) operating status is {{ $labels.operating_status }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.18. OpenStack Nova instances in ERROR state
{{ $value }} Nova instances are in ERROR state [copy] - alert: OpenstackNovaInstancesInErrorState expr: sum(openstack_nova_server_status{status="ERROR"}) > 0 for: 5m labels: severity: warning annotations: summary: OpenStack Nova instances in ERROR state (instance {{ $labels.instance }}) description: "{{ $value }} Nova instances are in ERROR state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.19. OpenStack Cinder volumes in error state
{{ $value }} Cinder volumes are in an error state [copy] - alert: OpenstackCinderVolumesInErrorState expr: openstack_cinder_volume_status_counter{status=~"error.*"} > 0 for: 5m labels: severity: warning annotations: summary: OpenStack Cinder volumes in error state (instance {{ $labels.instance }}) description: "{{ $value }} Cinder volumes are in an error state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.9.20. OpenStack placement resource high usage
Resource {{ $labels.resourcetype }} on host {{ $labels.hostname }} usage exceeds 90% of its allocation [copy] # This alert factors in the allocation ratio to compute effective capacity. # The threshold of 90% is a rough default. Adjust based on your allocation ratios and workload patterns. - alert: OpenstackPlacementResourceHighUsage expr: openstack_placement_resource_usage / (openstack_placement_resource_total * openstack_placement_resource_allocation_ratio) > 0.9 and openstack_placement_resource_total > 0 for: 5m labels: severity: warning annotations: summary: OpenStack placement resource high usage (instance {{ $labels.instance }}) description: "Resource {{ $labels.resourcetype }} on host {{ $labels.hostname }} usage exceeds 90% of its allocation\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
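The expression above scales raw capacity by the allocation ratio before comparing usage against it. A minimal Python sketch of that arithmetic, with illustrative numbers (the function name and figures are ours, not the exporter's):

```python
def placement_usage_ratio(usage: float, total: float, allocation_ratio: float) -> float:
    """Fraction of effective capacity in use, mirroring
    usage / (total * allocation_ratio) from the alert expression."""
    effective_capacity = total * allocation_ratio  # e.g. 16x CPU overcommit
    return usage / effective_capacity

# A 32-core host with a 16x cpu_allocation_ratio can place 512 vCPUs.
# With 480 vCPUs allocated, usage is 480 / 512 = 0.9375 -> above the 0.9 threshold.
ratio = placement_usage_ratio(usage=480, total=32, allocation_ratio=16.0)
```

Without the allocation-ratio factor, the same host would misleadingly report 480/32 = 1500% CPU usage.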
-
-
# 5.10. Spinnaker : Embedded exporter (12 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/spinnaker/embedded-exporter.yml-
# 5.10.1. Spinnaker circuit breaker open
Circuit breaker {{ $labels.name }} is open on {{ $labels.instance }}, indicating repeated downstream failures. [copy] - alert: SpinnakerCircuitBreakerOpen expr: resilience4j_circuitbreaker_state{state="open"} == 1 for: 5m labels: severity: warning annotations: summary: Spinnaker circuit breaker open (instance {{ $labels.instance }}) description: "Circuit breaker {{ $labels.name }} is open on {{ $labels.instance }}, indicating repeated downstream failures.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.10.2. Spinnaker Orca queue backing up
Orca work queue has {{ $value }} messages ready for delivery but not yet picked up. Pipeline executions may be delayed. [copy] # In a healthy Spinnaker, queue_ready_depth should stay at or near 0. # Sustained non-zero values indicate Orca cannot keep up with incoming work. - alert: SpinnakerOrcaQueueBackingUp expr: queue_ready_depth > 0 for: 5m labels: severity: warning annotations: summary: Spinnaker Orca queue backing up (instance {{ $labels.instance }}) description: "Orca work queue has {{ $value }} messages ready for delivery but not yet picked up. Pipeline executions may be delayed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.10.3. Spinnaker Orca queue message lag high
Orca queue message lag is {{ $value }}s. Pipeline stages are waiting too long before being processed. [copy] # The 30s threshold is a rough default. Adjust based on your pipeline SLOs. - alert: SpinnakerOrcaQueueMessageLagHigh expr: rate(queue_message_lag_seconds_sum[5m]) / rate(queue_message_lag_seconds_count[5m]) > 30 for: 5m labels: severity: warning annotations: summary: Spinnaker Orca queue message lag high (instance {{ $labels.instance }}) description: "Orca queue message lag is {{ $value }}s. Pipeline stages are waiting too long before being processed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
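The `rate(..._sum) / rate(..._count)` idiom yields the average lag per message over the window. A rough sketch of that computation from two counter samples (names and numbers are illustrative):

```python
def avg_lag_seconds(sum_t0: float, sum_t1: float,
                    count_t0: float, count_t1: float,
                    window_s: float) -> float:
    """Average per-message lag over a window, computed the way PromQL's
    rate(..._sum[5m]) / rate(..._count[5m]) division does."""
    rate_sum = (sum_t1 - sum_t0) / window_s        # lag-seconds accrued per second
    rate_count = (count_t1 - count_t0) / window_s  # messages handled per second
    return rate_sum / rate_count

# Over a 300s window the lag sum grew by 9000s across 200 messages:
# average lag = 9000 / 200 = 45s -> above the 30s threshold.
lag = avg_lag_seconds(0, 9000, 0, 200, 300)
```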
-
# 5.10.4. Spinnaker dead messages
Orca is producing dead-lettered messages ({{ $value }} per second). These are tasks that exhausted all retries and will not be executed. [copy] - alert: SpinnakerDeadMessages expr: rate(queue_dead_messages_total[5m]) > 0 for: 2m labels: severity: critical annotations: summary: Spinnaker dead messages (instance {{ $labels.instance }}) description: "Orca is producing dead-lettered messages ({{ $value }} per second). These are tasks that exhausted all retries and will not be executed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.10.5. Spinnaker zombie executions
{{ $value }} zombie pipeline executions detected. These are executions with no corresponding queue messages. [copy] # Zombies are pipeline executions that are running but have lost their queue entry. # See https://spinnaker.io/docs/guides/runbooks/orca-zombie-executions/ - alert: SpinnakerZombieExecutions expr: rate(queue_zombies_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Spinnaker zombie executions (instance {{ $labels.instance }}) description: "{{ $value }} zombie pipeline executions detected. These are executions with no corresponding queue messages.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.10.6. Spinnaker thread pool exhaustion
Orca message handler thread pool has {{ $value }} blocked threads on {{ $labels.instance }}. Pipeline execution throughput is degraded. [copy] - alert: SpinnakerThreadPoolExhaustion expr: threadpool_blockingQueueSize > 0 for: 5m labels: severity: warning annotations: summary: Spinnaker thread pool exhaustion (instance {{ $labels.instance }}) description: "Orca message handler thread pool has {{ $value }} blocked threads on {{ $labels.instance }}. Pipeline execution throughput is degraded.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.10.7. Spinnaker polling monitor items over threshold
Igor polling monitor {{ $labels.monitor }} for {{ $labels.partition }} has exceeded its item threshold, preventing pipeline triggers. [copy] # When this threshold is exceeded, Igor stops triggering pipelines for the affected monitor. # See https://kb.armory.io/s/article/Hitting-Igor-s-caching-thresholds - alert: SpinnakerPollingMonitorItemsOverThreshold expr: sum by (monitor, partition) (pollingMonitor_itemsOverThreshold) > 0 for: 5m labels: severity: critical annotations: summary: Spinnaker polling monitor items over threshold (instance {{ $labels.instance }}) description: "Igor polling monitor {{ $labels.monitor }} for {{ $labels.partition }} has exceeded its item threshold, preventing pipeline triggers.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.10.8. Spinnaker polling monitor failures
Igor polling monitor is experiencing failures ({{ $value }} per second). CI/SCM integrations may not trigger pipelines. [copy] - alert: SpinnakerPollingMonitorFailures expr: rate(pollingMonitor_failed_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Spinnaker polling monitor failures (instance {{ $labels.instance }}) description: "Igor polling monitor is experiencing failures ({{ $value }} per second). CI/SCM integrations may not trigger pipelines.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.10.9. Spinnaker high API error rate
Spinnaker API 5xx error rate is {{ $value | humanizePercentage }} on {{ $labels.instance }}. [copy] # The 5% threshold is a rough default. Adjust based on your traffic patterns. - alert: SpinnakerHighApiErrorRate expr: sum by (instance) (rate(controller_invocations_total{status="5xx"}[5m])) / sum by (instance) (rate(controller_invocations_total[5m])) > 0.05 and sum by (instance) (rate(controller_invocations_total[5m])) > 0 for: 5m labels: severity: warning annotations: summary: Spinnaker high API error rate (instance {{ $labels.instance }}) description: "Spinnaker API 5xx error rate is {{ $value | humanizePercentage }} on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
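The trailing `and sum(...) > 0` clause keeps the ratio from evaluating to 0/0 during quiet periods. The same guard, sketched in Python (function name and numbers are illustrative):

```python
def api_error_rate_firing(errors_5xx_per_s: float, total_per_s: float,
                          threshold: float = 0.05) -> bool:
    """True when the 5xx ratio exceeds the threshold and traffic exists,
    mirroring the PromQL `ratio > 0.05 and total_rate > 0` pattern."""
    if total_per_s <= 0:
        return False  # no traffic: the ratio is undefined, stay silent
    return errors_5xx_per_s / total_per_s > threshold

# 3 errors/s out of 40 req/s = 7.5% -> fires; an idle instance never fires.
```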
-
# 5.10.10. Spinnaker API rate limit throttling
Gate is actively throttling API requests on {{ $labels.instance }} ({{ $value }} throttled requests per second). [copy] - alert: SpinnakerApiRateLimitThrottling expr: rate(rateLimitThrottling_total[5m]) > 0 for: 2m labels: severity: warning annotations: summary: Spinnaker API rate limit throttling (instance {{ $labels.instance }}) description: "Gate is actively throttling API requests on {{ $labels.instance }} ({{ $value }} throttled requests per second).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.10.11. Spinnaker Clouddriver high error rate
Clouddriver 5xx error rate is {{ $value | humanizePercentage }} on {{ $labels.instance }}. Cloud operations may be failing. [copy] - alert: SpinnakerClouddriverHighErrorRate expr: sum by (instance) (rate(controller_invocations_total{status="5xx", job=~".*clouddriver.*"}[5m])) / sum by (instance) (rate(controller_invocations_total{job=~".*clouddriver.*"}[5m])) > 0.05 and sum by (instance) (rate(controller_invocations_total{job=~".*clouddriver.*"}[5m])) > 0 for: 5m labels: severity: warning annotations: summary: Spinnaker Clouddriver high error rate (instance {{ $labels.instance }}) description: "Clouddriver 5xx error rate is {{ $value | humanizePercentage }} on {{ $labels.instance }}. Cloud operations may be failing.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 5.10.12. Spinnaker AWS rate limiting
Clouddriver is being rate-limited by AWS on {{ $labels.instance }} ({{ $value }}ms delay). Cloud operations will be slower. [copy] # This metric is specific to AWS cloud providers in Clouddriver. # The 1000ms threshold is a rough default. Adjust based on your AWS usage patterns. - alert: SpinnakerAwsRateLimiting expr: amazonClientProvider_rateLimitDelayMil > 1000 for: 5m labels: severity: warning annotations: summary: Spinnaker AWS rate limiting (instance {{ $labels.instance }}) description: "Clouddriver is being rate-limited by AWS on {{ $labels.instance }} ({{ $value }}ms delay). Cloud operations will be slower.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.1. Ceph : Embedded exporter (13 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/ceph/embedded-exporter.yml-
# 6.1.1. Ceph State
Ceph instance unhealthy [copy] - alert: CephState expr: ceph_health_status != 0 for: 0m labels: severity: critical annotations: summary: Ceph State (instance {{ $labels.instance }}) description: "Ceph instance unhealthy\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.2. Ceph monitor clock skew
Ceph monitor clock skew detected. Please check ntp and hardware clock settings [copy] - alert: CephMonitorClockSkew expr: abs(ceph_monitor_clock_skew_seconds) > 0.2 for: 2m labels: severity: warning annotations: summary: Ceph monitor clock skew (instance {{ $labels.instance }}) description: "Ceph monitor clock skew detected. Please check ntp and hardware clock settings\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.3. Ceph monitor low space
Ceph monitor storage is low. [copy] - alert: CephMonitorLowSpace expr: ceph_monitor_avail_percent < 10 for: 2m labels: severity: warning annotations: summary: Ceph monitor low space (instance {{ $labels.instance }}) description: "Ceph monitor storage is low.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.4. Ceph OSD Down
Ceph Object Storage Daemon Down [copy] - alert: CephOsdDown expr: ceph_osd_up == 0 for: 0m labels: severity: critical annotations: summary: Ceph OSD Down (instance {{ $labels.instance }}) description: "Ceph Object Storage Daemon Down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.5. Ceph high OSD latency
Ceph Object Storage Daemon latency is high. Please check whether it is stuck in an abnormal state. [copy] - alert: CephHighOsdLatency expr: ceph_osd_perf_apply_latency_seconds > 5 for: 1m labels: severity: warning annotations: summary: Ceph high OSD latency (instance {{ $labels.instance }}) description: "Ceph Object Storage Daemon latency is high. Please check whether it is stuck in an abnormal state.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.6. Ceph OSD low space
Ceph Object Storage Daemon is going out of space. Please add more disks. [copy] - alert: CephOsdLowSpace expr: ceph_osd_utilization > 90 for: 2m labels: severity: warning annotations: summary: Ceph OSD low space (instance {{ $labels.instance }}) description: "Ceph Object Storage Daemon is going out of space. Please add more disks.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.7. Ceph OSD reweighted
A Ceph Object Storage Daemon has been reweighted (weight < 1) and may be taking too long to rebalance. [copy] - alert: CephOsdReweighted expr: ceph_osd_weight < 1 for: 2m labels: severity: warning annotations: summary: Ceph OSD reweighted (instance {{ $labels.instance }}) description: "A Ceph Object Storage Daemon has been reweighted (weight < 1) and may be taking too long to rebalance.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.8. Ceph PG down
Some Ceph placement groups are down. Please ensure that all the data are available. [copy] - alert: CephPgDown expr: ceph_pg_down > 0 for: 0m labels: severity: critical annotations: summary: Ceph PG down (instance {{ $labels.instance }}) description: "Some Ceph placement groups are down. Please ensure that all the data are available.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.9. Ceph PG incomplete
Some Ceph placement groups are incomplete. Please ensure that all the data are available. [copy] - alert: CephPgIncomplete expr: ceph_pg_incomplete > 0 for: 0m labels: severity: critical annotations: summary: Ceph PG incomplete (instance {{ $labels.instance }}) description: "Some Ceph placement groups are incomplete. Please ensure that all the data are available.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.10. Ceph PG inconsistent
Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes. [copy] - alert: CephPgInconsistent expr: ceph_pg_inconsistent > 0 for: 0m labels: severity: warning annotations: summary: Ceph PG inconsistent (instance {{ $labels.instance }}) description: "Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.11. Ceph PG activation long
Some Ceph placement groups are taking too long to activate. [copy] - alert: CephPgActivationLong expr: ceph_pg_activating > 0 for: 2m labels: severity: warning annotations: summary: Ceph PG activation long (instance {{ $labels.instance }}) description: "Some Ceph placement groups are taking too long to activate.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.12. Ceph PG backfill full
Some Ceph placement groups are located on a full Object Storage Daemon. These PGs may become unavailable shortly. Please check the OSDs, adjust weights, or reconfigure the CRUSH rules. [copy] - alert: CephPgBackfillFull expr: ceph_pg_backfill_toofull > 0 for: 2m labels: severity: warning annotations: summary: Ceph PG backfill full (instance {{ $labels.instance }}) description: "Some Ceph placement groups are located on a full Object Storage Daemon. These PGs may become unavailable shortly. Please check the OSDs, adjust weights, or reconfigure the CRUSH rules.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.1.13. Ceph PG unavailable
Some Ceph placement groups are unavailable. [copy] - alert: CephPgUnavailable expr: ceph_pg_total - ceph_pg_active > 0 for: 0m labels: severity: critical annotations: summary: Ceph PG unavailable (instance {{ $labels.instance }}) description: "Some Ceph placement groups are unavailable.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.2. SpeedTest : Speedtest exporter (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/speedtest/nlamirault-speedtest-exporter.yml-
# 6.2.1. SpeedTest Slow Internet Download
Internet download speed is currently {{humanize $value}} Mbps. [copy] - alert: SpeedtestSlowInternetDownload expr: avg_over_time(speedtest_download[10m]) < 100 for: 0m labels: severity: warning annotations: summary: SpeedTest Slow Internet Download (instance {{ $labels.instance }}) description: "Internet download speed is currently {{humanize $value}} Mbps.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.2.2. SpeedTest Slow Internet Upload
Internet upload speed is currently {{humanize $value}} Mbps. [copy] - alert: SpeedtestSlowInternetUpload expr: avg_over_time(speedtest_upload[10m]) < 20 for: 0m labels: severity: warning annotations: summary: SpeedTest Slow Internet Upload (instance {{ $labels.instance }}) description: "Internet upload speed is currently {{humanize $value}} Mbps.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.3.1. ZFS : node-exporter (1 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/zfs/node-exporter.yml-
# 6.3.1.1. ZFS offline pool
A ZFS zpool is in an unexpected state: {{ $labels.state }}. [copy] - alert: ZfsOfflinePool expr: node_zfs_zpool_state{state!="online"} > 0 for: 1m labels: severity: critical annotations: summary: ZFS offline pool (instance {{ $labels.instance }}) description: "A ZFS zpool is in an unexpected state: {{ $labels.state }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.3.2. ZFS : ZFS exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/zfs/zfs_exporter.yml-
# 6.3.2.1. ZFS pool out of space
Disk is almost full (< 10% left) [copy] - alert: ZfsPoolOutOfSpace expr: zfs_pool_free_bytes * 100 / zfs_pool_size_bytes < 10 and ON (instance, device, mountpoint) zfs_pool_readonly == 0 for: 0m labels: severity: warning annotations: summary: ZFS pool out of space (instance {{ $labels.instance }}) description: "Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.3.2.2. ZFS pool unhealthy
ZFS pool state is {{ $value }}. See comments for more information. [copy] # 0: ONLINE # 1: DEGRADED # 2: FAULTED # 3: OFFLINE # 4: UNAVAIL # 5: REMOVED # 6: SUSPENDED - alert: ZfsPoolUnhealthy expr: zfs_pool_health > 0 for: 0m labels: severity: critical annotations: summary: ZFS pool unhealthy (instance {{ $labels.instance }}) description: "ZFS pool state is {{ $value }}. See comments for more information.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
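For dashboards or enriched notifications, the numeric state listed in the rule's comments can be decoded into its name. A minimal lookup matching that list (the helper itself is ours, not part of the exporter):

```python
# zfs_pool_health states, as enumerated in the rule's comments.
ZFS_POOL_HEALTH = {
    0: "ONLINE",
    1: "DEGRADED",
    2: "FAULTED",
    3: "OFFLINE",
    4: "UNAVAIL",
    5: "REMOVED",
    6: "SUSPENDED",
}

def decode_pool_health(code: int) -> str:
    """Translate a zfs_pool_health sample into its state name.
    Any non-zero code is unhealthy and trips the alert above."""
    return ZFS_POOL_HEALTH.get(code, f"UNKNOWN({code})")
```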
-
# 6.3.2.3. ZFS collector failed
ZFS collector for {{ $labels.instance }} has failed to collect information [copy] - alert: ZfsCollectorFailed expr: zfs_scrape_collector_success != 1 for: 0m labels: severity: warning annotations: summary: ZFS collector failed (instance {{ $labels.instance }}) description: "ZFS collector for {{ $labels.instance }} has failed to collect information\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.4. OpenEBS : Embedded exporter (1 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/openebs/embedded-exporter.yml-
# 6.4.1. OpenEBS used pool capacity
OpenEBS pool uses more than 80% of its capacity [copy] - alert: OpenebsUsedPoolCapacity expr: openebs_used_pool_capacity_percent > 80 for: 2m labels: severity: warning annotations: summary: OpenEBS used pool capacity (instance {{ $labels.instance }}) description: "OpenEBS pool uses more than 80% of its capacity\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.5. Minio : Embedded exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/minio/embedded-exporter.yml-
# 6.5.1. Minio cluster disk offline
Minio cluster disk is offline [copy] - alert: MinioClusterDiskOffline expr: minio_cluster_drive_offline_total > 0 for: 0m labels: severity: critical annotations: summary: Minio cluster disk offline (instance {{ $labels.instance }}) description: "Minio cluster disk is offline\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.5.2. Minio node offline
Minio cluster node is offline [copy] - alert: MinioNodeOffline expr: minio_cluster_nodes_offline_total > 0 for: 0m labels: severity: critical annotations: summary: Minio node offline (instance {{ $labels.instance }}) description: "Minio cluster node is offline\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.5.3. Minio disk space usage
Minio available free space is low (< 10%) [copy] - alert: MinioDiskSpaceUsage expr: minio_cluster_capacity_raw_free_bytes / minio_cluster_capacity_raw_total_bytes * 100 < 10 for: 0m labels: severity: warning annotations: summary: Minio disk space usage (instance {{ $labels.instance }}) description: "Minio available free space is low (< 10%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.6. SSL/TLS : ssl_exporter (4 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/ssl/tls/ribbybibby-ssl-exporter.yml-
# 6.6.1. SSL certificate probe failed
Failed to fetch SSL information {{ $labels.instance }} [copy] - alert: SslCertificateProbeFailed expr: ssl_probe_success == 0 for: 0m labels: severity: critical annotations: summary: SSL certificate probe failed (instance {{ $labels.instance }}) description: "Failed to fetch SSL information {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.6.2. SSL certificate OCSP status unknown
Failed to get the OCSP status {{ $labels.instance }} [copy] - alert: SslCertificateOcspStatusUnknown expr: ssl_ocsp_response_status == 2 for: 0m labels: severity: warning annotations: summary: SSL certificate OCSP status unknown (instance {{ $labels.instance }}) description: "Failed to get the OCSP status {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.6.3. SSL certificate revoked
SSL certificate revoked {{ $labels.instance }} [copy] - alert: SslCertificateRevoked expr: ssl_ocsp_response_status == 1 for: 0m labels: severity: critical annotations: summary: SSL certificate revoked (instance {{ $labels.instance }}) description: "SSL certificate revoked {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.6.4. SSL certificate expiry (< 7 days)
{{ $labels.instance }} Certificate is expiring within 7 days [copy] - alert: SslCertificateExpiryIn7Days expr: ssl_verified_cert_not_after{chain_no="0"} - time() < 86400 * 7 for: 0m labels: severity: warning annotations: summary: SSL certificate expiry (< 7 days) (instance {{ $labels.instance }}) description: "{{ $labels.instance }} Certificate is expiring within 7 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
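The `86400 * 7` factor is simply seconds-per-day times days, compared against the certificate's notAfter timestamp. A hedged sketch of the same check (the function is illustrative, not part of ssl_exporter):

```python
SECONDS_PER_DAY = 86400

def cert_expiring_within(not_after_ts: float, now_ts: float, days: int = 7) -> bool:
    """True when the certificate's notAfter timestamp falls within `days`
    of now, matching `ssl_verified_cert_not_after - time() < 86400 * days`."""
    return not_after_ts - now_ts < days * SECONDS_PER_DAY

# A certificate expiring in 3 days trips the check; one with 30 days left does not.
```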
-
-
# 6.7. cert-manager : Embedded exporter (4 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/cert-manager/embedded-exporter.yml-
# 6.7.1. Cert-Manager absent
Cert-Manager has disappeared from Prometheus service discovery. New certificates cannot be issued and existing ones cannot be renewed until cert-manager is back. [copy] - alert: Cert-managerAbsent expr: absent(up{job="cert-manager"}) for: 10m labels: severity: critical annotations: summary: Cert-Manager absent (instance {{ $labels.instance }}) description: "Cert-Manager has disappeared from Prometheus service discovery. New certificates cannot be issued and existing ones cannot be renewed until cert-manager is back.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.7.2. Cert-Manager certificate expiring soon
The certificate {{ $labels.name }} is expiring in less than 21 days. [copy] # Threshold of 21 days is a rough default. ACME certificates are typically renewed 30 days before expiry, so expiring within 21 days may indicate issuer misconfiguration. - alert: Cert-managerCertificateExpiringSoon expr: avg by (exported_namespace, namespace, name) (certmanager_certificate_expiration_timestamp_seconds - time()) < (21 * 24 * 3600) for: 1h labels: severity: warning annotations: summary: Cert-Manager certificate expiring soon (instance {{ $labels.instance }}) description: "The certificate {{ $labels.name }} is expiring in less than 21 days.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.7.3. Cert-Manager certificate not ready
The certificate {{ $labels.name }} in namespace {{ $labels.exported_namespace }} is not ready to serve traffic. [copy] - alert: Cert-managerCertificateNotReady expr: max by (name, exported_namespace, namespace, condition) (certmanager_certificate_ready_status{condition!="True"} == 1) for: 10m labels: severity: critical annotations: summary: Cert-Manager certificate not ready (instance {{ $labels.instance }}) description: "The certificate {{ $labels.name }} in namespace {{ $labels.exported_namespace }} is not ready to serve traffic.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.7.4. Cert-Manager hitting ACME rate limits
Cert-Manager is being rate-limited by the ACME provider. Certificate issuance and renewal may be blocked for up to a week. [copy] - alert: Cert-managerHittingAcmeRateLimits expr: sum by (host) (rate(certmanager_http_acme_client_request_count{status="429"}[5m])) > 0 for: 5m labels: severity: critical annotations: summary: Cert-Manager hitting ACME rate limits (instance {{ $labels.instance }}) description: "Cert-Manager is being rate-limited by the ACME provider. Certificate issuance and renewal may be blocked for up to a week.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.8. Juniper : czerwonk/junos_exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/juniper/czerwonk-junos-exporter.yml-
# 6.8.1. Juniper switch down
The switch appears to be down [copy] - alert: JuniperSwitchDown expr: junos_up == 0 for: 0m labels: severity: critical annotations: summary: Juniper switch down (instance {{ $labels.instance }}) description: "The switch appears to be down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.8.2. Juniper high bandwidth usage 1Gbps (critical)
Interface is highly saturated (> 0.90Gbps). [copy] - alert: JuniperHighBandwidthUsage1gbps expr: rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.90 for: 1m labels: severity: critical annotations: summary: Juniper high bandwidth usage 1Gbps (instance {{ $labels.instance }}) description: "Interface is highly saturated (> 0.90Gbps).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.8.3. Juniper high bandwidth usage 1Gbps (warning)
Interface is getting saturated (> 0.80Gbps). [copy] - alert: JuniperHighBandwidthUsage1gbps expr: rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.80 for: 1m labels: severity: warning annotations: summary: Juniper high bandwidth usage 1Gbps (instance {{ $labels.instance }}) description: "Interface is getting saturated (> 0.80Gbps).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
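The `* 8` in both expressions converts the byte-rate counter into bits per second before comparing against a fraction of the nominal 1 Gbit/s link speed. The same conversion in a short sketch (names and figures are illustrative):

```python
def link_saturation(bytes_per_s: float, link_bps: float = 1e9) -> float:
    """Fraction of link capacity in use, mirroring
    rate(junos_interface_transmit_bytes[1m]) * 8 compared against 1e9."""
    return bytes_per_s * 8 / link_bps

# 115 MB/s on a 1 Gbit/s link: 115e6 * 8 / 1e9 = 0.92
# -> past the 0.90 critical threshold, and well past the 0.80 warning one.
saturation = link_saturation(115e6)
```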
-
-
# 6.9. CoreDNS : Embedded exporter (1 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/coredns/embedded-exporter.yml-
# 6.9.1. CoreDNS Panic Count
Number of CoreDNS panics encountered [copy] - alert: CorednsPanicCount expr: increase(coredns_panics_total[1m]) > 0 for: 0m labels: severity: critical annotations: summary: CoreDNS Panic Count (instance {{ $labels.instance }}) description: "Number of CoreDNS panics encountered\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.10. Freeswitch : znerol/prometheus-freeswitch-exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/freeswitch/znerol-freeswitch-exporter.yml-
# 6.10.1. Freeswitch down
Freeswitch is unresponsive [copy] - alert: FreeswitchDown expr: freeswitch_up == 0 for: 0m labels: severity: critical annotations: summary: Freeswitch down (instance {{ $labels.instance }}) description: "Freeswitch is unresponsive\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.10.2. Freeswitch Sessions Warning
High sessions usage on {{ $labels.instance }}: {{ $value | printf "%.2f"}}% [copy] - alert: FreeswitchSessionsWarning expr: (freeswitch_session_active * 100 / freeswitch_session_limit) > 80 for: 10m labels: severity: warning annotations: summary: Freeswitch Sessions Warning (instance {{ $labels.instance }}) description: "High sessions usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.10.3. Freeswitch Sessions Critical
High sessions usage on {{ $labels.instance }}: {{ $value | printf "%.2f"}}% [copy] - alert: FreeswitchSessionsCritical expr: (freeswitch_session_active * 100 / freeswitch_session_limit) > 90 for: 5m labels: severity: critical annotations: summary: Freeswitch Sessions Critical (instance {{ $labels.instance }}) description: "High sessions usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.11. Hashicorp Vault : Embedded exporter (4 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/hashicorp-vault/embedded-exporter.yml-
# 6.11.1. Vault sealed
Vault instance is sealed on {{ $labels.instance }} [copy] - alert: VaultSealed expr: vault_core_unsealed == 0 for: 0m labels: severity: critical annotations: summary: Vault sealed (instance {{ $labels.instance }}) description: "Vault instance is sealed on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.11.2. Vault too many pending tokens
Too many pending tokens {{ $labels.instance }}: {{ $value | printf "%.2f"}} [copy] - alert: VaultTooManyPendingTokens expr: avg(vault_token_create_count - vault_token_store_count) > 0 for: 5m labels: severity: warning annotations: summary: Vault too many pending tokens (instance {{ $labels.instance }}) description: "Too many pending tokens {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.11.3. Vault too many infinity tokens
Too many infinity tokens {{ $labels.instance }}: {{ $value | printf "%.2f"}} [copy] - alert: VaultTooManyInfinityTokens expr: vault_token_count_by_ttl{creation_ttl="+Inf"} > 3 for: 5m labels: severity: warning annotations: summary: Vault too many infinity tokens (instance {{ $labels.instance }}) description: "Too many infinity tokens {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.11.4. Vault cluster health
Vault cluster is not healthy {{ $labels.instance }}: {{ $value | printf "%.2f"}}% [copy] - alert: VaultClusterHealth expr: sum(vault_core_active) / count(vault_core_active) <= 0.5 for: 0m labels: severity: critical annotations: summary: Vault cluster health (instance {{ $labels.instance }}) description: "Vault cluster is not healthy {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.12. Keycloak : aerogear/keycloak-metrics-spi (6 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/keycloak/aerogear-keycloak-metrics-spi.yml-
# 6.12.1. Keycloak high login failure rate
More than 5% of login attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf "%.1f" }}%). [copy] # Threshold of 5% is a rough default. Adjust based on your user base and expected error rates. # A spike in failed logins may indicate a brute-force attack or misconfigured client. - alert: KeycloakHighLoginFailureRate expr: (sum by (realm) (rate(keycloak_failed_login_attempts_total[5m])) / (sum by (realm) (rate(keycloak_logins_total[5m])) + sum by (realm) (rate(keycloak_failed_login_attempts_total[5m])))) * 100 > 5 and (sum by (realm) (rate(keycloak_logins_total[5m])) + sum by (realm) (rate(keycloak_failed_login_attempts_total[5m]))) > 0 for: 5m labels: severity: warning annotations: summary: Keycloak high login failure rate (instance {{ $labels.instance }}) description: "More than 5% of login attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \"%.1f\" }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
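The denominator above is successes plus failures, so the ratio is the share of all attempts that failed; the `and ... > 0` clause silences the alert when there are no attempts at all. A small Python sketch of that math (names and rates are illustrative):

```python
def login_failure_pct(failed_per_s: float, success_per_s: float):
    """Failed-login percentage over all attempts, or None with no traffic.

    Mirrors failures / (successes + failures) * 100 combined with the
    PromQL `and (successes + failures) > 0` guard."""
    attempts = failed_per_s + success_per_s
    if attempts == 0:
        return None  # no attempts: ratio undefined, alert stays silent
    return failed_per_s / attempts * 100

# 0.3 failures/s against 3.7 successes/s -> 7.5%, above the 5% threshold.
pct = login_failure_pct(0.3, 3.7)
```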
-
# 6.12.2. Keycloak no successful logins
No successful logins in realm {{ $labels.realm }} for the last 15 minutes. [copy] # Only fires when login attempts exist but none succeed — may indicate an authentication outage. - alert: KeycloakNoSuccessfulLogins expr: sum by (realm) (rate(keycloak_logins_total[15m])) == 0 and (sum by (realm) (rate(keycloak_logins_total[15m])) + sum by (realm) (rate(keycloak_failed_login_attempts_total[15m]))) > 0 for: 5m labels: severity: critical annotations: summary: Keycloak no successful logins (instance {{ $labels.instance }}) description: "No successful logins in realm {{ $labels.realm }} for the last 15 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.12.3. Keycloak high token refresh error rate
More than 10% of token refresh attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf "%.1f" }}%). [copy] # Threshold of 10% is a rough default. High refresh token errors may indicate expired sessions or token store issues. - alert: KeycloakHighTokenRefreshErrorRate expr: (sum by (realm) (rate(keycloak_refresh_tokens_errors_total[5m])) / sum by (realm) (rate(keycloak_refresh_tokens_total[5m]))) * 100 > 10 and sum by (realm) (rate(keycloak_refresh_tokens_total[5m])) > 0 for: 5m labels: severity: warning annotations: summary: Keycloak high token refresh error rate (instance {{ $labels.instance }}) description: "More than 10% of token refresh attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \"%.1f\" }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.12.4. Keycloak high code-to-token exchange error rate
More than 10% of code-to-token exchanges are failing in realm {{ $labels.realm }} (current value: {{ $value | printf "%.1f" }}%). [copy] # Threshold of 10% is a rough default. Code-to-token failures may indicate misconfigured OAuth clients or replay attacks. - alert: KeycloakHighCodeToTokenExchangeErrorRate expr: (sum by (realm) (rate(keycloak_code_to_tokens_errors_total[5m])) / sum by (realm) (rate(keycloak_code_to_tokens_total[5m]))) * 100 > 10 and sum by (realm) (rate(keycloak_code_to_tokens_total[5m])) > 0 for: 5m labels: severity: warning annotations: summary: Keycloak high code-to-token exchange error rate (instance {{ $labels.instance }}) description: "More than 10% of code-to-token exchanges are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \"%.1f\" }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.12.5. Keycloak high registration failure rate
More than 10% of registration attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf "%.1f" }}%). [copy] # Threshold of 10% is a rough default. - alert: KeycloakHighRegistrationFailureRate expr: (sum by (realm) (rate(keycloak_registrations_errors_total[5m])) / sum by (realm) (rate(keycloak_registrations_total[5m]))) * 100 > 10 and sum by (realm) (rate(keycloak_registrations_total[5m])) > 0 for: 5m labels: severity: warning annotations: summary: Keycloak high registration failure rate (instance {{ $labels.instance }}) description: "More than 10% of registration attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \"%.1f\" }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.12.6. Keycloak slow request response time
Keycloak {{ $labels.method }} requests are taking more than 2 seconds on average. [copy] # Threshold of 2 seconds is a rough default. Adjust based on your performance requirements. - alert: KeycloakSlowRequestResponseTime expr: sum by (method) (rate(keycloak_request_duration_sum[5m])) / sum by (method) (rate(keycloak_request_duration_count[5m])) > 2 and sum by (method) (rate(keycloak_request_duration_count[5m])) > 0 for: 5m labels: severity: warning annotations: summary: Keycloak slow request response time (instance {{ $labels.instance }}) description: "Keycloak {{ $labels.method }} requests are taking more than 2 seconds on average.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.13. Cloudflare : lablabs/cloudflare-exporter (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/cloudflare/lablabs-cloudflare-exporter.yml
-
# 6.13.1. Cloudflare http 4xx error rate
Cloudflare high HTTP 4xx error rate (> 5% for domain {{ $labels.zone }}) [copy] - alert: CloudflareHttp4xxErrorRate expr: (sum by(zone) (rate(cloudflare_zone_requests_status{status=~"^4.."}[15m])) / on (zone) sum by (zone) (rate(cloudflare_zone_requests_status[15m]))) * 100 > 5 for: 0m labels: severity: warning annotations: summary: Cloudflare http 4xx error rate (instance {{ $labels.instance }}) description: "Cloudflare high HTTP 4xx error rate (> 5% for domain {{ $labels.zone }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.13.2. Cloudflare http 5xx error rate
Cloudflare high HTTP 5xx error rate (> 5% for domain {{ $labels.zone }}) [copy] - alert: CloudflareHttp5xxErrorRate expr: (sum by (zone) (rate(cloudflare_zone_requests_status{status=~"^5.."}[5m])) / on (zone) sum by (zone) (rate(cloudflare_zone_requests_status[5m]))) * 100 > 5 for: 0m labels: severity: critical annotations: summary: Cloudflare http 5xx error rate (instance {{ $labels.instance }}) description: "Cloudflare high HTTP 5xx error rate (> 5% for domain {{ $labels.zone }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.14. SNMP : prometheus/snmp_exporter (7 rules) [copy section]
These rules use standard IF-MIB and SNMPv2-MIB metrics. Metric names depend on your snmp.yml module configuration.
Thresholds for bandwidth and error rates are rough defaults - adjust to your environment.
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/snmp/snmp-exporter.yml
-
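The `job=~"snmp.*"` selectors in these rules assume a scrape job named accordingly, using the standard snmp_exporter proxy pattern (the exporter probes the device on Prometheus's behalf). A sketch — device and exporter addresses are placeholders:

```yaml
scrape_configs:
  - job_name: snmp  # matches the job=~"snmp.*" selectors below
    metrics_path: /snmp
    params:
      module: [if_mib]  # module must expose the IF-MIB / SNMPv2-MIB metrics used here
    static_configs:
      - targets: ["192.0.2.10"]  # the SNMP device to probe (placeholder)
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target   # pass the device as the ?target= parameter
      - source_labels: [__param_target]
        target_label: instance         # keep the device address as the instance label
      - target_label: __address__
        replacement: snmp-exporter:9116  # actually scrape the exporter (placeholder)
```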
# 6.14.1. SNMP target down
SNMP device {{ $labels.instance }} is unreachable. [copy] # From the official snmp-mixin. - alert: SnmpTargetDown expr: up{job=~"snmp.*"} == 0 for: 5m labels: severity: critical annotations: summary: SNMP target down (instance {{ $labels.instance }}) description: "SNMP device {{ $labels.instance }} is unreachable.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.14.2. SNMP interface down
Interface {{ $labels.ifDescr }} on {{ $labels.instance }} is operationally down while administratively up. [copy] - alert: SnmpInterfaceDown expr: (ifOperStatus{job=~"snmp.*"} == 2) and on(instance, job, ifIndex) (ifAdminStatus{job=~"snmp.*"} == 1) for: 2m labels: severity: critical annotations: summary: SNMP interface down (instance {{ $labels.instance }}) description: "Interface {{ $labels.ifDescr }} on {{ $labels.instance }} is operationally down while administratively up.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.14.3. SNMP interface high inbound error rate
Interface {{ $labels.ifDescr }} on {{ $labels.instance }} has an inbound error rate above 5%. [copy] # Threshold is a rough default. Adjust based on your network environment. - alert: SnmpInterfaceHighInboundErrorRate expr: rate(ifInErrors{job=~"snmp.*"}[5m]) / (rate(ifHCInUcastPkts{job=~"snmp.*"}[5m]) + rate(ifHCInBroadcastPkts{job=~"snmp.*"}[5m]) + rate(ifHCInMulticastPkts{job=~"snmp.*"}[5m])) > 0.05 and (rate(ifHCInUcastPkts{job=~"snmp.*"}[5m]) + rate(ifHCInBroadcastPkts{job=~"snmp.*"}[5m]) + rate(ifHCInMulticastPkts{job=~"snmp.*"}[5m])) > 0 for: 5m labels: severity: warning annotations: summary: SNMP interface high inbound error rate (instance {{ $labels.instance }}) description: "Interface {{ $labels.ifDescr }} on {{ $labels.instance }} has an inbound error rate above 5%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.14.4. SNMP interface high outbound error rate
Interface {{ $labels.ifDescr }} on {{ $labels.instance }} has an outbound error rate above 5%. [copy] # Threshold is a rough default. Adjust based on your network environment. - alert: SnmpInterfaceHighOutboundErrorRate expr: rate(ifOutErrors{job=~"snmp.*"}[5m]) / (rate(ifHCOutUcastPkts{job=~"snmp.*"}[5m]) + rate(ifHCOutBroadcastPkts{job=~"snmp.*"}[5m]) + rate(ifHCOutMulticastPkts{job=~"snmp.*"}[5m])) > 0.05 and (rate(ifHCOutUcastPkts{job=~"snmp.*"}[5m]) + rate(ifHCOutBroadcastPkts{job=~"snmp.*"}[5m]) + rate(ifHCOutMulticastPkts{job=~"snmp.*"}[5m])) > 0 for: 5m labels: severity: warning annotations: summary: SNMP interface high outbound error rate (instance {{ $labels.instance }}) description: "Interface {{ $labels.ifDescr }} on {{ $labels.instance }} has an outbound error rate above 5%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.14.5. SNMP interface high bandwidth usage inbound
Interface {{ $labels.ifDescr }} on {{ $labels.instance }} inbound utilization is above 80%. [copy] # Threshold is a rough default. Adjust based on your link capacity and traffic patterns. - alert: SnmpInterfaceHighBandwidthUsageInbound expr: rate(ifHCInOctets{job=~"snmp.*"}[5m]) * 8 / ifSpeed > 0.80 and ifSpeed > 0 for: 15m labels: severity: warning annotations: summary: SNMP interface high bandwidth usage inbound (instance {{ $labels.instance }}) description: "Interface {{ $labels.ifDescr }} on {{ $labels.instance }} inbound utilization is above 80%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.14.6. SNMP interface high bandwidth usage outbound
Interface {{ $labels.ifDescr }} on {{ $labels.instance }} outbound utilization is above 80%. [copy] # Threshold is a rough default. Adjust based on your link capacity and traffic patterns. - alert: SnmpInterfaceHighBandwidthUsageOutbound expr: rate(ifHCOutOctets{job=~"snmp.*"}[5m]) * 8 / ifSpeed > 0.80 and ifSpeed > 0 for: 15m labels: severity: warning annotations: summary: SNMP interface high bandwidth usage outbound (instance {{ $labels.instance }}) description: "Interface {{ $labels.ifDescr }} on {{ $labels.instance }} outbound utilization is above 80%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.14.7. SNMP device restarted
SNMP device {{ $labels.instance }} has restarted (uptime < 5 minutes). [copy] # sysUpTime is in centiseconds (hundredths of a second). - alert: SnmpDeviceRestarted expr: sysUpTime / 100 < 300 for: 0m labels: severity: info annotations: summary: SNMP device restarted (instance {{ $labels.instance }}) description: "SNMP device {{ $labels.instance }} has restarted (uptime < 5 minutes).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.15. Cilium : Embedded exporter (31 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/cilium/embedded-exporter.yml
-
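Cilium's metrics endpoints are opt-in. A minimal Helm values sketch enabling the agent, operator, and Hubble metrics these rules query — option names follow recent Cilium charts, so verify against your chart version:

```yaml
prometheus:
  enabled: true        # cilium-agent metrics (cilium_*)
operator:
  prometheus:
    enabled: true      # cilium-operator metrics (cilium_operator_*)
hubble:
  metrics:
    enabled:
      - dns            # required for hubble_dns_responses_total
      - drop
```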
# 6.15.1. Cilium agent unreachable nodes
Cilium agent {{ $labels.pod }} cannot reach {{ $value }} node(s). Check network connectivity and node health. [copy] - alert: CiliumAgentUnreachableNodes expr: sum(cilium_unreachable_nodes{}) by (pod) > 0 for: 15m labels: severity: warning annotations: summary: Cilium agent unreachable nodes (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} cannot reach {{ $value }} node(s). Check network connectivity and node health.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.2. Cilium agent unreachable health endpoints
Cilium agent {{ $labels.pod }} cannot reach {{ $value }} health endpoint(s). Node-to-node health probes are failing. [copy] - alert: CiliumAgentUnreachableHealthEndpoints expr: sum(cilium_unreachable_health_endpoints{}) by (pod) > 0 for: 15m labels: severity: warning annotations: summary: Cilium agent unreachable health endpoints (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} cannot reach {{ $value }} health endpoint(s). Node-to-node health probes are failing.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.3. Cilium agent failing controllers
Cilium agent {{ $labels.pod }} has {{ $value }} failing controller(s). Check cilium-agent logs for details. [copy] - alert: CiliumAgentFailingControllers expr: sum(cilium_controllers_failing{}) by (pod) > 0 for: 5m labels: severity: warning annotations: summary: Cilium agent failing controllers (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} has {{ $value }} failing controller(s). Check cilium-agent logs for details.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.4. Cilium agent endpoint failures
Cilium agent {{ $labels.pod }} has {{ $value }} endpoint(s) in invalid state. [copy] - alert: CiliumAgentEndpointFailures expr: sum(cilium_endpoint_state{endpoint_state="invalid"}) by (pod) > 0 for: 5m labels: severity: warning annotations: summary: Cilium agent endpoint failures (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} has {{ $value }} endpoint(s) in invalid state.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.5. Cilium agent endpoint regeneration failures
Cilium agent {{ $labels.pod }} is failing to regenerate endpoints. Network policy enforcement may be stale. [copy] - alert: CiliumAgentEndpointRegenerationFailures expr: sum(rate(cilium_endpoint_regenerations_total{outcome="fail"}[5m])) by (pod) > 0 for: 5m labels: severity: warning annotations: summary: Cilium agent endpoint regeneration failures (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} is failing to regenerate endpoints. Network policy enforcement may be stale.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.6. Cilium agent endpoint update failure
Cilium agent {{ $labels.pod }} is failing K8s endpoint update API calls ({{ $labels.method }} {{ $labels.return_code }}). [copy] - alert: CiliumAgentEndpointUpdateFailure expr: sum(rate(cilium_k8s_client_api_calls_total{method=~"(PUT|POST|PATCH)", endpoint="endpoint", return_code!~"2[0-9][0-9]"}[5m])) by (pod, method, return_code) > 0 for: 5m labels: severity: warning annotations: summary: Cilium agent endpoint update failure (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} is failing K8s endpoint update API calls ({{ $labels.method }} {{ $labels.return_code }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.7. Cilium agent endpoint create failure
Cilium agent {{ $labels.pod }} is failing CNI endpoint-create calls. New pods may fail to get networking. [copy] - alert: CiliumAgentEndpointCreateFailure expr: sum(rate(cilium_api_limiter_processed_requests_total{api_call=~"endpoint-create", outcome="fail"}[1m])) by (pod, api_call) > 0 for: 5m labels: severity: info annotations: summary: Cilium agent endpoint create failure (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} is failing CNI endpoint-create calls. New pods may fail to get networking.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.8. Cilium agent map operation failures
Cilium agent {{ $labels.pod }} has eBPF map operation failures on {{ $labels.map_name }}. Datapath may be degraded. [copy] - alert: CiliumAgentMapOperationFailures expr: sum(rate(cilium_bpf_map_ops_total{outcome="fail"}[5m])) by (map_name, pod) > 0 for: 5m labels: severity: warning annotations: summary: Cilium agent map operation failures (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} has eBPF map operation failures on {{ $labels.map_name }}. Datapath may be degraded.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.9. Cilium agent BPF map pressure
Cilium agent {{ $labels.pod }} eBPF map {{ $labels.map_name }} is above 90% utilization. Map may become full. [copy] # Map pressure is a ratio from 0 to 1. At 1.0, the map is full and new entries will be dropped. - alert: CiliumAgentBpfMapPressure expr: cilium_bpf_map_pressure{} > 0.9 for: 5m labels: severity: warning annotations: summary: Cilium agent BPF map pressure (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} eBPF map {{ $labels.map_name }} is above 90% utilization. Map may become full.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.10. Cilium agent conntrack table full
Cilium agent {{ $labels.pod }} conntrack table is full, causing packet drops. Increase CT map size or investigate connection leaks. [copy] - alert: CiliumAgentConntrackTableFull expr: sum(rate(cilium_drop_count_total{reason="CT: Map insertion failed"}[5m])) by (pod) > 0 for: 5m labels: severity: critical annotations: summary: Cilium agent conntrack table full (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} conntrack table is full, causing packet drops. Increase CT map size or investigate connection leaks.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.11. Cilium agent conntrack failed garbage collection
Cilium agent {{ $labels.pod }} conntrack garbage collection is failing. Stale entries may accumulate. [copy] - alert: CiliumAgentConntrackFailedGarbageCollection expr: sum(rate(cilium_datapath_conntrack_gc_runs_total{status="uncompleted"}[5m])) by (pod) > 0 for: 5m labels: severity: warning annotations: summary: Cilium agent conntrack failed garbage collection (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} conntrack garbage collection is failing. Stale entries may accumulate.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.12. Cilium agent NAT table full
Cilium agent {{ $labels.pod }} NAT table is full, causing masquerade failures. Increase NAT map size or investigate. [copy] - alert: CiliumAgentNatTableFull expr: sum(rate(cilium_drop_count_total{reason="No mapping for NAT masquerade"}[1m])) by (pod) > 0 for: 5m labels: severity: critical annotations: summary: Cilium agent NAT table full (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} NAT table is full, causing masquerade failures. Increase NAT map size or investigate.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.13. Cilium agent high denied rate
Cilium agent {{ $labels.pod }} is dropping packets due to policy denial. Verify network policies are correct. [copy] # Policy denials may be expected behavior. Investigate only if unexpected traffic is being blocked. - alert: CiliumAgentHighDeniedRate expr: sum(rate(cilium_drop_count_total{reason="Policy denied"}[1m])) by (pod) > 0 for: 10m labels: severity: info annotations: summary: Cilium agent high denied rate (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} is dropping packets due to policy denial. Verify network policies are correct.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.14. Cilium agent high drop rate
Cilium agent {{ $labels.pod }} is dropping packets for reason {{ $labels.reason }}. This indicates infrastructure issues. [copy] - alert: CiliumAgentHighDropRate expr: sum(rate(cilium_drop_count_total{reason!~"Policy denied"}[5m])) by (pod, reason) > 0 for: 5m labels: severity: warning annotations: summary: Cilium agent high drop rate (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} is dropping packets for reason {{ $labels.reason }}. This indicates infrastructure issues.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.15. Cilium agent policy map pressure
Cilium agent {{ $labels.pod }} policy BPF map is above 90% utilization. New policies may fail to apply. [copy] - alert: CiliumAgentPolicyMapPressure expr: sum(cilium_bpf_map_pressure{map_name=~"cilium_policy_.*"}) by (pod) > 0.9 for: 5m labels: severity: warning annotations: summary: Cilium agent policy map pressure (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} policy BPF map is above 90% utilization. New policies may fail to apply.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.16. Cilium agent policy import errors
Cilium agent {{ $labels.pod }} is failing to import network policies. Policy enforcement may be incomplete. [copy] - alert: CiliumAgentPolicyImportErrors expr: sum(rate(cilium_policy_change_total{outcome="fail"}[5m])) by (pod) > 0 for: 5m labels: severity: warning annotations: summary: Cilium agent policy import errors (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} is failing to import network policies. Policy enforcement may be incomplete.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.17. Cilium agent policy implementation delay
Cilium agent {{ $labels.pod }} P99 policy deployment latency exceeds 60 seconds. Endpoints may run with stale policies. [copy] # Threshold of 60s is a rough default. Adjust based on cluster size and policy complexity. - alert: CiliumAgentPolicyImplementationDelay expr: histogram_quantile(0.99, sum(rate(cilium_policy_implementation_delay_bucket[5m])) by (le, pod)) > 60 for: 5m labels: severity: warning annotations: summary: Cilium agent policy implementation delay (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} P99 policy deployment latency exceeds 60 seconds. Endpoints may run with stale policies.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.18. Cilium node-local high identity allocation
Cilium agent {{ $labels.pod }} node-local identity allocation is above 80%. Approaching the 65535 identity limit. [copy] - alert: CiliumNodeLocalHighIdentityAllocation expr: (sum(cilium_identity{type="node_local"}) by (pod) / (2^16-1)) > 0.8 for: 5m labels: severity: warning annotations: summary: Cilium node-local high identity allocation (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} node-local identity allocation is above 80%. Approaching the 65535 identity limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.19. Cilium cluster high identity allocation
Cilium cluster-wide identity allocation is above 80%. Approaching the maximum identity limit. [copy] - alert: CiliumClusterHighIdentityAllocation expr: (sum(cilium_identity{type="cluster_local"}) by () / (2^16-256)) > 0.8 for: 5m labels: severity: warning annotations: summary: Cilium cluster high identity allocation (instance {{ $labels.instance }}) description: "Cilium cluster-wide identity allocation is above 80%. Approaching the maximum identity limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.20. Cilium operator exhausted IPAM IPs
Cilium operator has no available IPAM IPs. New pods will fail to schedule networking. [copy] - alert: CiliumOperatorExhaustedIpamIps expr: sum(cilium_operator_ipam_ips{type="available"}) by () <= 0 for: 5m labels: severity: critical annotations: summary: Cilium operator exhausted IPAM IPs (instance {{ $labels.instance }}) description: "Cilium operator has no available IPAM IPs. New pods will fail to schedule networking.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.21. Cilium operator low available IPAM IPs
Cilium operator IPAM IP pool is over 90% utilized. Allocate more IPs to avoid exhaustion. [copy] # Threshold of 90% is a rough default. Adjust based on your pod churn rate and IP pool size. - alert: CiliumOperatorLowAvailableIpamIps expr: sum(cilium_operator_ipam_ips{type!="available"}) by () / sum(cilium_operator_ipam_ips) by () > 0.9 and sum(cilium_operator_ipam_ips) by () > 0 for: 5m labels: severity: warning annotations: summary: Cilium operator low available IPAM IPs (instance {{ $labels.instance }}) description: "Cilium operator IPAM IP pool is over 90% utilized. Allocate more IPs to avoid exhaustion.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.22. Cilium operator IPAM interface creation failures
Cilium operator is failing to create IPAM network interfaces. IP allocation may be impacted. [copy] - alert: CiliumOperatorIpamInterfaceCreationFailures expr: sum(rate(cilium_operator_ipam_interface_creation_ops{status!="success"}[5m])) by () > 0 for: 10m labels: severity: warning annotations: summary: Cilium operator IPAM interface creation failures (instance {{ $labels.instance }}) description: "Cilium operator is failing to create IPAM network interfaces. IP allocation may be impacted.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.23. Cilium agent API errors
Cilium agent {{ $labels.pod }} API is returning 5xx errors ({{ $labels.return_code }}). Agent may be unhealthy. [copy] - alert: CiliumAgentApiErrors expr: sum(rate(cilium_agent_api_process_time_seconds_count{return_code=~"5[0-9][0-9]"}[5m])) by (pod, return_code) > 0 for: 5m labels: severity: warning annotations: summary: Cilium agent API errors (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} API is returning 5xx errors ({{ $labels.return_code }}). Agent may be unhealthy.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.24. Cilium agent Kubernetes client errors
Cilium agent {{ $labels.pod }} is receiving errors from K8s API for endpoint {{ $labels.endpoint }} ({{ $labels.return_code }}). [copy] - alert: CiliumAgentKubernetesClientErrors expr: sum(rate(cilium_k8s_client_api_calls_total{endpoint!="metrics", return_code!~"2[0-9][0-9]"}[5m])) by (pod, endpoint, return_code) > 0 for: 5m labels: severity: info annotations: summary: Cilium agent Kubernetes client errors (instance {{ $labels.instance }}) description: "Cilium agent {{ $labels.pod }} is receiving errors from K8s API for endpoint {{ $labels.endpoint }} ({{ $labels.return_code }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.25. Cilium ClusterMesh remote cluster not ready
Cilium ClusterMesh remote cluster {{ $labels.target_cluster }} is not ready from {{ $labels.source_cluster }}. [copy] - alert: CiliumClustermeshRemoteClusterNotReady expr: count(cilium_clustermesh_remote_cluster_readiness_status < 1) by (source_cluster, target_cluster) > 0 for: 5m labels: severity: critical annotations: summary: Cilium ClusterMesh remote cluster not ready (instance {{ $labels.instance }}) description: "Cilium ClusterMesh remote cluster {{ $labels.target_cluster }} is not ready from {{ $labels.source_cluster }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.26. Cilium ClusterMesh remote cluster failing
Cilium ClusterMesh connectivity to remote cluster {{ $labels.target_cluster }} from {{ $labels.source_cluster }} is failing. [copy] - alert: CiliumClustermeshRemoteClusterFailing expr: sum(rate(cilium_clustermesh_remote_cluster_failures[5m])) by (source_cluster, target_cluster) > 0 for: 5m labels: severity: critical annotations: summary: Cilium ClusterMesh remote cluster failing (instance {{ $labels.instance }}) description: "Cilium ClusterMesh connectivity to remote cluster {{ $labels.target_cluster }} from {{ $labels.source_cluster }} is failing.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.27. Cilium KVStoreMesh remote cluster not ready
Cilium KVStoreMesh remote cluster {{ $labels.target_cluster }} is not ready from {{ $labels.source_cluster }}. [copy] - alert: CiliumKvstoremeshRemoteClusterNotReady expr: count(cilium_kvstoremesh_remote_cluster_readiness_status < 1) by (source_cluster, target_cluster) > 0 for: 5m labels: severity: critical annotations: summary: Cilium KVStoreMesh remote cluster not ready (instance {{ $labels.instance }}) description: "Cilium KVStoreMesh remote cluster {{ $labels.target_cluster }} is not ready from {{ $labels.source_cluster }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.28. Cilium KVStoreMesh remote cluster failing
Cilium KVStoreMesh remote cluster {{ $labels.target_cluster }} from {{ $labels.source_cluster }} is experiencing failures. [copy] - alert: CiliumKvstoremeshRemoteClusterFailing expr: sum(rate(cilium_kvstoremesh_remote_cluster_failures[5m])) by (source_cluster, target_cluster) > 0 for: 5m labels: severity: critical annotations: summary: Cilium KVStoreMesh remote cluster failing (instance {{ $labels.instance }}) description: "Cilium KVStoreMesh remote cluster {{ $labels.target_cluster }} from {{ $labels.source_cluster }} is experiencing failures.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.29. Cilium KVStoreMesh sync errors
Cilium KVStoreMesh from {{ $labels.source_cluster }} is experiencing kvstore sync errors. [copy] - alert: CiliumKvstoremeshSyncErrors expr: sum(rate(cilium_kvstoremesh_kvstore_sync_errors_total[5m])) by (source_cluster) > 0 for: 5m labels: severity: critical annotations: summary: Cilium KVStoreMesh sync errors (instance {{ $labels.instance }}) description: "Cilium KVStoreMesh from {{ $labels.source_cluster }} is experiencing kvstore sync errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.30. Cilium Hubble lost events
Cilium Hubble on {{ $labels.pod }} is losing flow events. Observability data may be incomplete. [copy] - alert: CiliumHubbleLostEvents expr: sum(rate(hubble_lost_events_total[5m])) by (pod) > 0 for: 5m labels: severity: warning annotations: summary: Cilium Hubble lost events (instance {{ $labels.instance }}) description: "Cilium Hubble on {{ $labels.pod }} is losing flow events. Observability data may be incomplete.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.15.31. Cilium Hubble high DNS error rate
Cilium Hubble on {{ $labels.pod }} is observing more than 10% DNS error responses. [copy] # Threshold of 10% is a rough default. Some DNS errors may be normal depending on your workload. - alert: CiliumHubbleHighDnsErrorRate expr: sum(rate(hubble_dns_responses_total{rcode!="No Error"}[5m])) by (pod) / sum(rate(hubble_dns_responses_total[5m])) by (pod) > 0.1 and sum(rate(hubble_dns_responses_total[5m])) by (pod) > 0 for: 5m labels: severity: warning annotations: summary: Cilium Hubble high DNS error rate (instance {{ $labels.instance }}) description: "Cilium Hubble on {{ $labels.pod }} is observing more than 10% DNS error responses.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 6.16. WireGuard : MindFlavor/prometheus_wireguard_exporter (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/wireguard/mindflavor-prometheus-wireguard-exporter.yml
-
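These rules assume the exporter runs on the WireGuard host. A minimal scrape-config sketch — 9586 is the exporter's conventional port, and the hostname is a placeholder:

```yaml
scrape_configs:
  - job_name: wireguard
    static_configs:
      - targets: ["wg-host.example.com:9586"]  # placeholder
```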
# 6.16.1. WireGuard peer handshake too old
WireGuard peer {{ $labels.public_key }} on interface {{ $labels.interface }} has not had a handshake for over 5 minutes. The tunnel may be down. [copy] # The threshold of 300 seconds (5 minutes) is a rough default. WireGuard peers that are idle but reachable # typically re-handshake every 2 minutes. Adjust based on your keepalive interval. # The `> 0` guard excludes peers that have never completed a handshake (covered by a separate rule). - alert: WireguardPeerHandshakeTooOld expr: time() - wireguard_latest_handshake_seconds > 300 and wireguard_latest_handshake_seconds > 0 for: 2m labels: severity: warning annotations: summary: WireGuard peer handshake too old (instance {{ $labels.instance }}) description: "WireGuard peer {{ $labels.public_key }} on interface {{ $labels.interface }} has not had a handshake for over 5 minutes. The tunnel may be down.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.16.2. WireGuard peer handshake never established
WireGuard peer {{ $labels.public_key }} on interface {{ $labels.interface }} has never completed a handshake. Check peer configuration and network connectivity. [copy] - alert: WireguardPeerHandshakeNeverEstablished expr: wireguard_latest_handshake_seconds == 0 for: 5m labels: severity: critical annotations: summary: WireGuard peer handshake never established (instance {{ $labels.instance }}) description: "WireGuard peer {{ $labels.public_key }} on interface {{ $labels.interface }} has never completed a handshake. Check peer configuration and network connectivity.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 6.16.3. WireGuard no traffic on peer
WireGuard peer {{ $labels.public_key }} on interface {{ $labels.interface }} has had no traffic for 15 minutes despite an active handshake. [copy] # This alert fires when a peer has a recent handshake but zero traffic flow. # May indicate routing issues or a misconfigured allowed-ips. # Only useful if you expect continuous traffic on all peers. - alert: WireguardNoTrafficOnPeer expr: (rate(wireguard_sent_bytes_total[15m]) + rate(wireguard_received_bytes_total[15m])) == 0 and wireguard_latest_handshake_seconds > 0 and (time() - wireguard_latest_handshake_seconds) < 300 for: 15m labels: severity: warning annotations: summary: WireGuard no traffic on peer (instance {{ $labels.instance }}) description: "WireGuard peer {{ $labels.public_key }} on interface {{ $labels.interface }} has had no traffic for 15 minutes despite an active handshake.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 7.1. AWS CloudWatch : prometheus/cloudwatch_exporter (13 rules) [copy section]
CloudWatch metrics are exported as aws_{namespace}_{metric_name}_{statistic} gauges.
The rules below cover both exporter health and common AWS service alerts.
Adjust thresholds and label filters to match your CloudWatch exporter configuration.
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/aws-cloudwatch/prometheus-cloudwatch-exporter.yml
-
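To illustrate the aws_{namespace}_{metric_name}_{statistic} convention: a cloudwatch_exporter config entry like the following exports AWS/EC2 CPUUtilization (Average statistic) as the gauge `aws_ec2_cpuutilization_average`, with dimensions becoming labels (region is a placeholder):

```yaml
region: us-east-1  # placeholder
metrics:
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUUtilization
    aws_dimensions: [InstanceId]   # exported as the instance_id label
    aws_statistics: [Average]
```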
# 7.1.1. CloudWatch exporter scrape error
CloudWatch exporter on {{ $labels.instance }} failed to scrape metrics from AWS CloudWatch API. [copy] - alert: CloudwatchExporterScrapeError expr: cloudwatch_exporter_scrape_error > 0 for: 5m labels: severity: warning annotations: summary: CloudWatch exporter scrape error (instance {{ $labels.instance }}) description: "CloudWatch exporter on {{ $labels.instance }} failed to scrape metrics from AWS CloudWatch API.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.2. CloudWatch exporter slow scrape
CloudWatch exporter on {{ $labels.instance }} scrape is taking more than 5 minutes ({{ $value }}s). Consider reducing the number of metrics or splitting across multiple exporters. [copy] - alert: CloudwatchExporterSlowScrape expr: cloudwatch_exporter_scrape_duration_seconds > 300 for: 5m labels: severity: warning annotations: summary: CloudWatch exporter slow scrape (instance {{ $labels.instance }}) description: "CloudWatch exporter on {{ $labels.instance }} scrape is taking more than 5 minutes ({{ $value }}s). Consider reducing the number of metrics or splitting across multiple exporters.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.3. CloudWatch API high request rate
CloudWatch exporter on {{ $labels.instance }} is making {{ $value }} API calls per minute to namespace {{ $labels.namespace }}. This can lead to high AWS costs. [copy] # CloudWatch API calls cost money (~$0.01 per 1000 GetMetricData requests). # 100 requests/minute ≈ $45/month. Adjust the threshold based on your budget. - alert: CloudwatchApiHighRequestRate expr: sum by (instance, namespace) (rate(cloudwatch_requests_total[5m])) * 60 > 100 for: 0m labels: severity: warning annotations: summary: CloudWatch API high request rate (instance {{ $labels.instance }}) description: "CloudWatch exporter on {{ $labels.instance }} is making {{ $value }} API calls per minute to namespace {{ $labels.namespace }}. This can lead to high AWS costs.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
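The cost figure in the comment above is back-of-the-envelope arithmetic; the ~$0.01 per 1,000 requests price is an assumption that varies by region and API call type:

```python
def monthly_api_cost(requests_per_minute: float,
                     price_per_1000: float = 0.01,
                     days: int = 30) -> float:
    """Rough monthly cost of CloudWatch API calls at a steady request rate."""
    requests_per_month = requests_per_minute * 60 * 24 * days
    return requests_per_month / 1000 * price_per_1000

# At the rule's threshold of 100 requests/minute this comes out to about
# $43/month, in line with the "~$45/month" figure quoted in the comment.
```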
-
# 7.1.4. AWS EC2 high CPU utilization
EC2 instance {{ $labels.instance_id }} CPU utilization is above 90% ({{ $value }}%). [copy] # Requires EC2 CPUUtilization metric configured in the CloudWatch exporter. - alert: AwsEc2HighCpuUtilization expr: aws_ec2_cpuutilization_average > 90 for: 15m labels: severity: warning annotations: summary: AWS EC2 high CPU utilization (instance {{ $labels.instance }}) description: "EC2 instance {{ $labels.instance_id }} CPU utilization is above 90% ({{ $value }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.5. AWS RDS low free storage space
RDS instance {{ $labels.dbinstance_identifier }} has less than 2GB free storage ({{ $value }} bytes remaining). [copy] # Requires RDS FreeStorageSpace metric. The threshold of 2GB is a rough default. # Adjust based on your database size. - alert: AwsRdsLowFreeStorageSpace expr: aws_rds_free_storage_space_average < 2000000000 for: 5m labels: severity: warning annotations: summary: AWS RDS low free storage space (instance {{ $labels.instance }}) description: "RDS instance {{ $labels.dbinstance_identifier }} has less than 2GB free storage ({{ $value }} bytes remaining).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.6. AWS RDS high CPU utilization
RDS instance {{ $labels.dbinstance_identifier }} CPU utilization is above 90% ({{ $value }}%). [copy] # Requires RDS CPUUtilization metric configured in the CloudWatch exporter. - alert: AwsRdsHighCpuUtilization expr: aws_rds_cpuutilization_average > 90 for: 15m labels: severity: warning annotations: summary: AWS RDS high CPU utilization (instance {{ $labels.instance }}) description: "RDS instance {{ $labels.dbinstance_identifier }} CPU utilization is above 90% ({{ $value }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.7. AWS RDS high database connections
RDS instance {{ $labels.dbinstance_identifier }} has {{ $value }} active connections. [copy] # The threshold depends on the RDS instance class. Adjust based on your # instance type's max_connections parameter. - alert: AwsRdsHighDatabaseConnections expr: aws_rds_database_connections_average > 100 for: 5m labels: severity: warning annotations: summary: AWS RDS high database connections (instance {{ $labels.instance }}) description: "RDS instance {{ $labels.dbinstance_identifier }} has {{ $value }} active connections.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.8. AWS SQS queue messages visible
SQS queue {{ $labels.queue_name }} has {{ $value }} messages waiting to be processed. [copy] # Requires SQS ApproximateNumberOfMessagesVisible metric. The threshold of 1000 # is a rough default. Adjust based on your expected queue depth. - alert: AwsSqsQueueMessagesVisible expr: aws_sqs_approximate_number_of_messages_visible_average > 1000 for: 10m labels: severity: warning annotations: summary: AWS SQS queue messages visible (instance {{ $labels.instance }}) description: "SQS queue {{ $labels.queue_name }} has {{ $value }} messages waiting to be processed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.9. AWS SQS message age too old
SQS queue {{ $labels.queue_name }} has messages older than 1 hour ({{ $value }}s). [copy] # Requires SQS ApproximateAgeOfOldestMessage metric. - alert: AwsSqsMessageAgeTooOld expr: aws_sqs_approximate_age_of_oldest_message_maximum > 3600 for: 0m labels: severity: warning annotations: summary: AWS SQS message age too old (instance {{ $labels.instance }}) description: "SQS queue {{ $labels.queue_name }} has messages older than 1 hour ({{ $value }}s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.10. AWS ALB unhealthy targets
ALB {{ $labels.load_balancer }} has {{ $value }} unhealthy target(s) in target group {{ $labels.target_group }}. [copy] # Requires ApplicationELB UnHealthyHostCount metric. - alert: AwsAlbUnhealthyTargets expr: aws_applicationelb_unhealthy_host_count_average > 0 for: 5m labels: severity: critical annotations: summary: AWS ALB unhealthy targets (instance {{ $labels.instance }}) description: "ALB {{ $labels.load_balancer }} has {{ $value }} unhealthy target(s) in target group {{ $labels.target_group }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.11. AWS ALB high 5xx error rate
ALB {{ $labels.load_balancer }} 5xx error rate is above 5% ({{ $value }}%). [copy] # Requires ApplicationELB HTTPCode_ELB_5XX_Count and RequestCount metrics. - alert: AwsAlbHigh5xxErrorRate expr: (aws_applicationelb_httpcode_elb_5_xx_count_sum / aws_applicationelb_request_count_sum) * 100 > 5 for: 5m labels: severity: critical annotations: summary: AWS ALB high 5xx error rate (instance {{ $labels.instance }}) description: "ALB {{ $labels.load_balancer }} 5xx error rate is above 5% ({{ $value }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.12. AWS ALB high target response time
ALB {{ $labels.load_balancer }} average target response time is above 2 seconds ({{ $value }}s). [copy] # Requires ApplicationELB TargetResponseTime metric. - alert: AwsAlbHighTargetResponseTime expr: aws_applicationelb_target_response_time_average > 2 for: 5m labels: severity: warning annotations: summary: AWS ALB high target response time (instance {{ $labels.instance }}) description: "ALB {{ $labels.load_balancer }} average target response time is above 2 seconds ({{ $value }}s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.1.13. AWS Lambda high error rate
Lambda function {{ $labels.function_name }} error rate is above 5% ({{ $value }}%). [copy] # Requires Lambda Errors and Invocations metrics. - alert: AwsLambdaHighErrorRate expr: (aws_lambda_errors_sum / aws_lambda_invocations_sum) * 100 > 5 for: 5m labels: severity: warning annotations: summary: AWS Lambda high error rate (instance {{ $labels.instance }}) description: "Lambda function {{ $labels.function_name }} error rate is above 5% ({{ $value }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 7.2. Google Cloud Stackdriver : prometheus-community/stackdriver_exporter (5 rules) [copy section]
Self-monitoring metrics use the stackdriver_monitoring_* prefix.
All self-monitoring metrics include a project_id label.
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/google-cloud-stackdriver/stackdriver-exporter.yml-
# 7.2.1. Stackdriver exporter scrape error
Stackdriver exporter failed to scrape metrics from Google Cloud Monitoring API for project {{ $labels.project_id }}. [copy] - alert: StackdriverExporterScrapeError expr: stackdriver_monitoring_last_scrape_error > 0 for: 5m labels: severity: warning annotations: summary: Stackdriver exporter scrape error (instance {{ $labels.instance }}) description: "Stackdriver exporter failed to scrape metrics from Google Cloud Monitoring API for project {{ $labels.project_id }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.2.2. Stackdriver exporter slow scrape
Stackdriver exporter scrape for project {{ $labels.project_id }} is taking more than 5 minutes ({{ $value }}s). [copy] - alert: StackdriverExporterSlowScrape expr: stackdriver_monitoring_last_scrape_duration_seconds > 300 for: 5m labels: severity: warning annotations: summary: Stackdriver exporter slow scrape (instance {{ $labels.instance }}) description: "Stackdriver exporter scrape for project {{ $labels.project_id }} is taking more than 5 minutes ({{ $value }}s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.2.3. Stackdriver exporter scrape errors increasing
Stackdriver exporter has had {{ $value }} scrape errors in the last 15 minutes for project {{ $labels.project_id }}. [copy] - alert: StackdriverExporterScrapeErrorsIncreasing expr: increase(stackdriver_monitoring_scrape_errors_total[15m]) > 5 for: 0m labels: severity: warning annotations: summary: Stackdriver exporter scrape errors increasing (instance {{ $labels.instance }}) description: "Stackdriver exporter has had {{ $value }} scrape errors in the last 15 minutes for project {{ $labels.project_id }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.2.4. Stackdriver exporter high API calls
Stackdriver exporter is making {{ $value }} API calls per minute for project {{ $labels.project_id }}. This may hit Google Cloud Monitoring API quotas. [copy] - alert: StackdriverExporterHighApiCalls expr: rate(stackdriver_monitoring_api_calls_total[5m]) * 60 > 100 for: 0m labels: severity: warning annotations: summary: Stackdriver exporter high API calls (instance {{ $labels.instance }}) description: "Stackdriver exporter is making {{ $value }} API calls per minute for project {{ $labels.project_id }}. This may hit Google Cloud Monitoring API quotas.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.2.5. Stackdriver exporter scrape stale
Stackdriver exporter has not successfully scraped metrics for project {{ $labels.project_id }} in the last 10 minutes. [copy] - alert: StackdriverExporterScrapeStale expr: time() - stackdriver_monitoring_last_scrape_timestamp > 600 for: 0m labels: severity: warning annotations: summary: Stackdriver exporter scrape stale (instance {{ $labels.instance }}) description: "Stackdriver exporter has not successfully scraped metrics for project {{ $labels.project_id }} in the last 10 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 7.3. DigitalOcean : metalmatze/digitalocean_exporter (10 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/digitalocean/digitalocean-exporter.yml-
# 7.3.1. DigitalOcean droplet down
DigitalOcean droplet {{ $labels.name }} ({{ $labels.id }}) in {{ $labels.region }} is not running. [copy] - alert: DigitaloceanDropletDown expr: digitalocean_droplet_up == 0 for: 5m labels: severity: critical annotations: summary: DigitalOcean droplet down (instance {{ $labels.instance }}) description: "DigitalOcean droplet {{ $labels.name }} ({{ $labels.id }}) in {{ $labels.region }} is not running.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.3.2. DigitalOcean account not active
DigitalOcean account is not active. It may be suspended or locked. [copy] - alert: DigitaloceanAccountNotActive expr: digitalocean_account_active != 1 for: 0m labels: severity: critical annotations: summary: DigitalOcean account not active (instance {{ $labels.instance }}) description: "DigitalOcean account is not active. It may be suspended or locked.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.3.3. DigitalOcean database down
DigitalOcean managed database {{ $labels.name }} ({{ $labels.engine }}) in {{ $labels.region }} is offline. [copy] - alert: DigitaloceanDatabaseDown expr: digitalocean_database_status == 0 for: 2m labels: severity: critical annotations: summary: DigitalOcean database down (instance {{ $labels.instance }}) description: "DigitalOcean managed database {{ $labels.name }} ({{ $labels.engine }}) in {{ $labels.region }} is offline.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.3.4. DigitalOcean Kubernetes cluster down
DigitalOcean Kubernetes cluster {{ $labels.name }} ({{ $labels.version }}) in {{ $labels.region }} is not running. [copy] - alert: DigitaloceanKubernetesClusterDown expr: digitalocean_kubernetes_cluster_up == 0 for: 5m labels: severity: critical annotations: summary: DigitalOcean Kubernetes cluster down (instance {{ $labels.instance }}) description: "DigitalOcean Kubernetes cluster {{ $labels.name }} ({{ $labels.version }}) in {{ $labels.region }} is not running.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.3.5. DigitalOcean load balancer down
DigitalOcean load balancer {{ $labels.name }} ({{ $labels.ip }}) is not active. [copy] - alert: DigitaloceanLoadBalancerDown expr: digitalocean_loadbalancer_status == 0 for: 2m labels: severity: critical annotations: summary: DigitalOcean load balancer down (instance {{ $labels.instance }}) description: "DigitalOcean load balancer {{ $labels.name }} ({{ $labels.ip }}) is not active.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.3.6. DigitalOcean load balancer no backends
DigitalOcean load balancer {{ $labels.name }} ({{ $labels.ip }}) has no droplets attached. [copy] - alert: DigitaloceanLoadBalancerNoBackends expr: digitalocean_loadbalancer_droplets == 0 for: 0m labels: severity: warning annotations: summary: DigitalOcean load balancer no backends (instance {{ $labels.instance }}) description: "DigitalOcean load balancer {{ $labels.name }} ({{ $labels.ip }}) has no droplets attached.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.3.7. DigitalOcean floating IP not assigned
DigitalOcean floating IP {{ $labels.ipv4 }} in {{ $labels.region }} is not assigned to any droplet. [copy] - alert: DigitaloceanFloatingIpNotAssigned expr: digitalocean_floating_ipv4_active == 0 for: 0m labels: severity: warning annotations: summary: DigitalOcean floating IP not assigned (instance {{ $labels.instance }}) description: "DigitalOcean floating IP {{ $labels.ipv4 }} in {{ $labels.region }} is not assigned to any droplet.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.3.8. DigitalOcean active incidents
DigitalOcean platform has {{ $value }} active incident(s). [copy] - alert: DigitaloceanActiveIncidents expr: digitalocean_incidents_total > 0 for: 0m labels: severity: warning annotations: summary: DigitalOcean active incidents (instance {{ $labels.instance }}) description: "DigitalOcean platform has {{ $value }} active incident(s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.3.9. DigitalOcean exporter collection errors
DigitalOcean exporter {{ $labels.collector }} collector has {{ $value }} errors. [copy] - alert: DigitaloceanExporterCollectionErrors expr: increase(digitalocean_errors_total[5m]) > 0 for: 0m labels: severity: warning annotations: summary: DigitalOcean exporter collection errors (instance {{ $labels.instance }}) description: "DigitalOcean exporter {{ $labels.collector }} collector has {{ $value }} errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.3.10. DigitalOcean droplet limit approaching
DigitalOcean account is using {{ $value }}% of its droplet quota. [copy] # Fires when more than 80% of the account's droplet limit is in use. - alert: DigitaloceanDropletLimitApproaching expr: (count(digitalocean_droplet_up) / digitalocean_account_droplet_limit) * 100 > 80 for: 0m labels: severity: warning annotations: summary: DigitalOcean droplet limit approaching (instance {{ $labels.instance }}) description: "DigitalOcean account is using {{ $value }}% of its droplet quota.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 7.4. Azure : webdevops/azure-metrics-exporter (5 rules) [copy section]
The exporter uses azurerm_resource_metric as the default metric name for forwarded Azure Monitor metrics.
The metric name can be customized via the name parameter in probe configuration.
Self-monitoring metrics use the azurerm_stats_* and azurerm_api_* prefixes.
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/azure/azure-metrics-exporter.yml-
# 7.4.1. Azure exporter request errors
Azure metrics exporter on {{ $labels.instance }} has {{ $value }} API request errors in the last 15 minutes. [copy] - alert: AzureExporterRequestErrors expr: increase(azurerm_stats_metric_requests{result="error"}[15m]) > 5 for: 0m labels: severity: warning annotations: summary: Azure exporter request errors (instance {{ $labels.instance }}) description: "Azure metrics exporter on {{ $labels.instance }} has {{ $value }} API request errors in the last 15 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.4.2. Azure exporter high error rate
Azure metrics exporter on {{ $labels.instance }} has an error rate above 10% ({{ $value }}%). [copy] - alert: AzureExporterHighErrorRate expr: sum by (instance) (rate(azurerm_stats_metric_requests{result="error"}[5m])) / sum by (instance) (rate(azurerm_stats_metric_requests[5m])) * 100 > 10 for: 5m labels: severity: warning annotations: summary: Azure exporter high error rate (instance {{ $labels.instance }}) description: "Azure metrics exporter on {{ $labels.instance }} has an error rate above 10% ({{ $value }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.4.3. Azure API read rate limit approaching
Azure API read rate limit for subscription {{ $labels.subscriptionID }} is running low ({{ $value }} remaining). [copy] # Azure Resource Manager enforces rate limits per subscription. # The threshold of 100 remaining calls is a rough default. Adjust based on your # scrape interval and number of monitored resources. - alert: AzureApiReadRateLimitApproaching expr: azurerm_api_ratelimit{type="read"} < 100 for: 0m labels: severity: warning annotations: summary: Azure API read rate limit approaching (instance {{ $labels.instance }}) description: "Azure API read rate limit for subscription {{ $labels.subscriptionID }} is running low ({{ $value }} remaining).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
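The comment above says to size the threshold from your scrape interval and number of monitored resources. A rough sketch of that sizing calculation (the call counts and interval below are hypothetical examples, not defaults of the exporter):

```python
def reads_per_hour(api_calls_per_scrape: int, scrape_interval_s: int) -> float:
    """Estimate ARM read calls the exporter consumes per hour."""
    scrapes_per_hour = 3600 / scrape_interval_s
    return api_calls_per_scrape * scrapes_per_hour

# At 50 API calls per scrape on a 60s interval the exporter consumes
# 3,000 reads/hour, so a remaining budget of 100 lasts only ~2 minutes --
# in that setup the alert should fire well before the budget reaches 100.
```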
-
# 7.4.4. Azure API write rate limit approaching
Azure API write rate limit for subscription {{ $labels.subscriptionID }} is running low ({{ $value }} remaining). [copy] - alert: AzureApiWriteRateLimitApproaching expr: azurerm_api_ratelimit{type="write"} < 50 for: 0m labels: severity: warning annotations: summary: Azure API write rate limit approaching (instance {{ $labels.instance }}) description: "Azure API write rate limit for subscription {{ $labels.subscriptionID }} is running low ({{ $value }} remaining).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 7.4.5. Azure exporter slow collection
Azure metrics exporter on {{ $labels.instance }} metric collection is taking more than 5 minutes ({{ $value }}s). [copy] - alert: AzureExporterSlowCollection expr: azurerm_stats_metric_collecttime > 300 for: 5m labels: severity: warning annotations: summary: Azure exporter slow collection (instance {{ $labels.instance }}) description: "Azure metrics exporter on {{ $labels.instance }} metric collection is taking more than 5 minutes ({{ $value }}s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.1.1. Thanos : Thanos Compactor (5 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-compactor.yml-
# 8.1.1.1. Thanos Compactor Multiple Running
No more than one Thanos Compact instance should be running at once. There are {{$value}} instances running. [copy] - alert: ThanosCompactorMultipleRunning expr: sum by (job) (up{job=~".*thanos-compact.*"}) > 1 for: 5m labels: severity: warning annotations: summary: Thanos Compactor Multiple Running (instance {{ $labels.instance }}) description: "No more than one Thanos Compact instance should be running at once. There are {{$value}} instances running.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.1.2. Thanos Compactor Halted
Thanos Compact {{$labels.job}} has failed to run and now is halted. [copy] - alert: ThanosCompactorHalted expr: thanos_compact_halted{job=~".*thanos-compact.*"} == 1 for: 5m labels: severity: warning annotations: summary: Thanos Compactor Halted (instance {{ $labels.instance }}) description: "Thanos Compact {{$labels.job}} has failed to run and now is halted.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.1.3. Thanos Compactor High Compaction Failures
Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}% of compactions. [copy] - alert: ThanosCompactorHighCompactionFailures expr: (sum by (job) (rate(thanos_compact_group_compactions_failures_total{job=~".*thanos-compact.*"}[5m])) / sum by (job) (rate(thanos_compact_group_compactions_total{job=~".*thanos-compact.*"}[5m])) * 100 > 5) for: 15m labels: severity: warning annotations: summary: Thanos Compactor High Compaction Failures (instance {{ $labels.instance }}) description: "Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}% of compactions.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.1.4. Thanos Compact Bucket High Operation Failures
Thanos Compact {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations. [copy] - alert: ThanosCompactBucketHighOperationFailures expr: (sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-compact.*"}[5m])) / sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-compact.*"}[5m])) * 100 > 5) for: 15m labels: severity: warning annotations: summary: Thanos Compact Bucket High Operation Failures (instance {{ $labels.instance }}) description: "Thanos Compact {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.1.5. Thanos Compact Has Not Run
Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours. [copy] - alert: ThanosCompactHasNotRun expr: (time() - max by (job) (max_over_time(thanos_objstore_bucket_last_successful_upload_time{job=~".*thanos-compact.*"}[24h]))) / 60 / 60 > 24 for: 0m labels: severity: warning annotations: summary: Thanos Compact Has Not Run (instance {{ $labels.instance }}) description: "Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.1.2. Thanos : Thanos Query (8 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-query.yml-
# 8.1.2.1. Thanos Query Http Request Query Error Rate High
Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of "query" requests. [copy] - alert: ThanosQueryHttpRequestQueryErrorRateHigh expr: (sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query"}[5m]))/ sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query"}[5m]))) * 100 > 5 for: 5m labels: severity: critical annotations: summary: Thanos Query Http Request Query Error Rate High (instance {{ $labels.instance }}) description: "Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of \"query\" requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.2.2. Thanos Query Http Request Query Range Error Rate High
Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of "query_range" requests. [copy] - alert: ThanosQueryHttpRequestQueryRangeErrorRateHigh expr: (sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query_range"}[5m]))/ sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query_range"}[5m]))) * 100 > 5 for: 5m labels: severity: critical annotations: summary: Thanos Query Http Request Query Range Error Rate High (instance {{ $labels.instance }}) description: "Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of \"query_range\" requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.2.3. Thanos Query Grpc Server Error Rate
Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of requests. [copy] - alert: ThanosQueryGrpcServerErrorRate expr: (sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-query.*"}[5m]))/ sum by (job) (rate(grpc_server_started_total{job=~".*thanos-query.*"}[5m])) * 100 > 5) for: 5m labels: severity: warning annotations: summary: Thanos Query Grpc Server Error Rate (instance {{ $labels.instance }}) description: "Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.2.4. Thanos Query Grpc Client Error Rate
Thanos Query {{$labels.job}} is failing to send {{$value | humanize}}% of requests. [copy] - alert: ThanosQueryGrpcClientErrorRate expr: (sum by (job) (rate(grpc_client_handled_total{grpc_code!="OK", job=~".*thanos-query.*"}[5m])) / sum by (job) (rate(grpc_client_started_total{job=~".*thanos-query.*"}[5m]))) * 100 > 5 for: 5m labels: severity: warning annotations: summary: Thanos Query Grpc Client Error Rate (instance {{ $labels.instance }}) description: "Thanos Query {{$labels.job}} is failing to send {{$value | humanize}}% of requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.2.5. Thanos Query High DNS Failures
Thanos Query {{$labels.job}} has {{$value | humanize}}% failing DNS queries for store endpoints. [copy] - alert: ThanosQueryHighDNSFailures expr: (sum by (job) (rate(thanos_query_store_apis_dns_failures_total{job=~".*thanos-query.*"}[5m])) / sum by (job) (rate(thanos_query_store_apis_dns_lookups_total{job=~".*thanos-query.*"}[5m]))) * 100 > 1 for: 15m labels: severity: warning annotations: summary: Thanos Query High DNS Failures (instance {{ $labels.instance }}) description: "Thanos Query {{$labels.job}} has {{$value | humanize}}% failing DNS queries for store endpoints.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.2.6. Thanos Query Instant Latency High
Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for instant queries. [copy] - alert: ThanosQueryInstantLatencyHigh expr: (histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query"}[5m]))) > 40 and sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-query.*", handler="query"}[5m])) > 0) for: 10m labels: severity: critical annotations: summary: Thanos Query Instant Latency High (instance {{ $labels.instance }}) description: "Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for instant queries.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.2.7. Thanos Query Range Latency High
Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for range queries. [copy] - alert: ThanosQueryRangeLatencyHigh expr: (histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query_range"}[5m]))) > 90 and sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-query.*", handler="query_range"}[5m])) > 0) for: 10m labels: severity: critical annotations: summary: Thanos Query Range Latency High (instance {{ $labels.instance }}) description: "Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for range queries.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.2.8. Thanos Query Overload
Thanos Query {{$labels.job}} has been overloaded for more than 15 minutes. This may be a symptom of excessive simultaneous complex requests, low performance of the Prometheus API, or failures within these components. Assess the health of the Thanos Query instances and the connected Prometheus instances, look for potential senders of these requests and then contact support. [copy] - alert: ThanosQueryOverload expr: (max_over_time(thanos_query_concurrent_gate_queries_max[5m]) - avg_over_time(thanos_query_concurrent_gate_queries_in_flight[5m]) < 1) for: 15m labels: severity: warning annotations: summary: Thanos Query Overload (instance {{ $labels.instance }}) description: "Thanos Query {{$labels.job}} has been overloaded for more than 15 minutes. This may be a symptom of excessive simultaneous complex requests, low performance of the Prometheus API, or failures within these components. Assess the health of the Thanos Query instances and the connected Prometheus instances, look for potential senders of these requests and then contact support.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
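The overload expression compares the gate's concurrency ceiling with the average number of in-flight queries; the gate is considered saturated when fewer than one slot is free on average. A minimal sketch of that test:

```python
def gate_saturated(max_concurrent: float, avg_in_flight: float) -> bool:
    """Mirror the PromQL: saturated when less than one concurrency slot is free."""
    return (max_concurrent - avg_in_flight) < 1

# With a gate of 20 concurrent queries, an average of 19.5 in flight means new
# queries are queuing behind the gate; 12 in flight leaves plenty of headroom.
```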
-
-
# 8.1.3. Thanos : Thanos Receiver (7 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-receiver.yml-
# 8.1.3.1. Thanos Receive Http Request Error Rate High
Thanos Receive {{$labels.job}} is failing to handle {{$value | humanize}}% of requests. [copy] - alert: ThanosReceiveHttpRequestErrorRateHigh expr: (sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-receive.*", handler="receive"}[5m]))/ sum by (job) (rate(http_requests_total{job=~".*thanos-receive.*", handler="receive"}[5m]))) * 100 > 5 for: 5m labels: severity: critical annotations: summary: Thanos Receive Http Request Error Rate High (instance {{ $labels.instance }}) description: "Thanos Receive {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.3.2. Thanos Receive Http Request Latency High
Thanos Receive {{$labels.job}} has a 99th percentile latency of {{ $value }} seconds for requests. [copy] - alert: ThanosReceiveHttpRequestLatencyHigh expr: (histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-receive.*", handler="receive"}[5m]))) > 10 and sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-receive.*", handler="receive"}[5m])) > 0) for: 10m labels: severity: critical annotations: summary: Thanos Receive Http Request Latency High (instance {{ $labels.instance }}) description: "Thanos Receive {{$labels.job}} has a 99th percentile latency of {{ $value }} seconds for requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.3.3. Thanos Receive High Replication Failures
Thanos Receive {{$labels.job}} is failing to replicate {{$value | humanize}}% of requests. [copy] - alert: ThanosReceiveHighReplicationFailures expr: thanos_receive_replication_factor > 1 and ((sum by (job) (rate(thanos_receive_replications_total{result="error", job=~".*thanos-receive.*"}[5m])) / sum by (job) (rate(thanos_receive_replications_total{job=~".*thanos-receive.*"}[5m]))) > (max by (job) (floor((thanos_receive_replication_factor{job=~".*thanos-receive.*"}+1)/ 2)) / max by (job) (thanos_receive_hashring_nodes{job=~".*thanos-receive.*"}))) * 100 for: 5m labels: severity: warning annotations: summary: Thanos Receive High Replication Failures (instance {{ $labels.instance }}) description: "Thanos Receive {{$labels.job}} is failing to replicate {{$value | humanize}}% of requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
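The tolerance in the expression above is not a fixed percentage: it compares the failure ratio against the write quorum, floor((replication_factor + 1) / 2), taken as a fraction of the hashring size. A sketch of that arithmetic:

```python
import math

def replication_failure_threshold(replication_factor: int, hashring_nodes: int) -> float:
    """Fraction of failed replications the rule tolerates before firing:
    write quorum floor((rf + 1) / 2) divided by the number of hashring nodes."""
    quorum = math.floor((replication_factor + 1) / 2)
    return quorum / hashring_nodes

# With replication_factor=3 on a 3-node hashring the quorum is 2, so the
# alert fires once more than 2/3 of replications fail -- below that, writes
# can still reach quorum.
```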
-
# 8.1.3.4. Thanos Receive High Forward Request Failures
Thanos Receive {{$labels.job}} is failing to forward {{$value | humanize}}% of requests. [copy] - alert: ThanosReceiveHighForwardRequestFailures expr: (sum by (job) (rate(thanos_receive_forward_requests_total{result="error", job=~".*thanos-receive.*"}[5m]))/ sum by (job) (rate(thanos_receive_forward_requests_total{job=~".*thanos-receive.*"}[5m]))) * 100 > 20 for: 5m labels: severity: info annotations: summary: Thanos Receive High Forward Request Failures (instance {{ $labels.instance }}) description: "Thanos Receive {{$labels.job}} is failing to forward {{$value | humanize}}% of requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.3.5. Thanos Receive High Hashring File Refresh Failures
Thanos Receive {{$labels.job}} is failing to refresh the hashring file, {{$value | humanize}} of attempts failed. [copy] - alert: ThanosReceiveHighHashringFileRefreshFailures expr: (sum by (job) (rate(thanos_receive_hashrings_file_errors_total{job=~".*thanos-receive.*"}[5m])) / sum by (job) (rate(thanos_receive_hashrings_file_refreshes_total{job=~".*thanos-receive.*"}[5m])) > 0) for: 15m labels: severity: warning annotations: summary: Thanos Receive High Hashring File Refresh Failures (instance {{ $labels.instance }}) description: "Thanos Receive {{$labels.job}} is failing to refresh the hashring file, {{$value | humanize}} of attempts failed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.3.6. Thanos Receive Config Reload Failure
Thanos Receive {{$labels.job}} has not been able to reload hashring configurations. [copy] - alert: ThanosReceiveConfigReloadFailure expr: avg by (job) (thanos_receive_config_last_reload_successful{job=~".*thanos-receive.*"}) != 1 for: 5m labels: severity: warning annotations: summary: Thanos Receive Config Reload Failure (instance {{ $labels.instance }}) description: "Thanos Receive {{$labels.job}} has not been able to reload hashring configurations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.3.7. Thanos Receive No Upload
Thanos Receive {{$labels.instance}} has not uploaded the latest data to object storage. [copy] - alert: ThanosReceiveNoUpload expr: (up{job=~".*thanos-receive.*"} - 1) + on (job, instance) (sum by (job, instance) (increase(thanos_shipper_uploads_total{job=~".*thanos-receive.*"}[3h])) == 0) for: 3h labels: severity: critical annotations: summary: Thanos Receive No Upload (instance {{ $labels.instance }}) description: "Thanos Receive {{$labels.instance}} has not uploaded the latest data to object storage.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.1.4. Thanos : Thanos Sidecar (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-sidecar.yml
-
# 8.1.4.1. Thanos Sidecar Bucket Operations Failed
Thanos Sidecar {{$labels.instance}} bucket operations are failing [copy] - alert: ThanosSidecarBucketOperationsFailed expr: sum by (job, instance) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-sidecar.*"}[5m])) > 0 for: 5m labels: severity: critical annotations: summary: Thanos Sidecar Bucket Operations Failed (instance {{ $labels.instance }}) description: "Thanos Sidecar {{$labels.instance}} bucket operations are failing\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.4.2. Thanos Sidecar No Connection To Started Prometheus
Thanos Sidecar {{$labels.instance}} is unhealthy. [copy] - alert: ThanosSidecarNoConnectionToStartedPrometheus expr: thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0 and on (namespace, pod) prometheus_tsdb_data_replay_duration_seconds != 0 for: 5m labels: severity: critical annotations: summary: Thanos Sidecar No Connection To Started Prometheus (instance {{ $labels.instance }}) description: "Thanos Sidecar {{$labels.instance}} is unhealthy.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.1.5. Thanos : Thanos Store (4 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-store.yml
-
# 8.1.5.1. Thanos Store Grpc Error Rate
Thanos Store {{$labels.job}} is failing to handle {{$value | humanize}}% of requests. [copy] - alert: ThanosStoreGrpcErrorRate expr: (sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-store.*"}[5m]))/ sum by (job) (rate(grpc_server_started_total{job=~".*thanos-store.*"}[5m])) * 100 > 5) for: 5m labels: severity: warning annotations: summary: Thanos Store Grpc Error Rate (instance {{ $labels.instance }}) description: "Thanos Store {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.5.2. Thanos Store Series Gate Latency High
Thanos Store {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for store series gate requests. [copy] - alert: ThanosStoreSeriesGateLatencyHigh expr: (histogram_quantile(0.99, sum by (job, le) (rate(thanos_bucket_store_series_gate_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))) > 2 and sum by (job) (rate(thanos_bucket_store_series_gate_duration_seconds_count{job=~".*thanos-store.*"}[5m])) > 0) for: 10m labels: severity: warning annotations: summary: Thanos Store Series Gate Latency High (instance {{ $labels.instance }}) description: "Thanos Store {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for store series gate requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.5.3. Thanos Store Bucket High Operation Failures
Thanos Store {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations. [copy] - alert: ThanosStoreBucketHighOperationFailures expr: (sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-store.*"}[5m])) / sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-store.*"}[5m])) * 100 > 5) for: 15m labels: severity: warning annotations: summary: Thanos Store Bucket High Operation Failures (instance {{ $labels.instance }}) description: "Thanos Store {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.5.4. Thanos Store Objstore Operation Latency High
Thanos Store {{$labels.job}} Bucket has a 99th percentile latency of {{$value}} seconds for the bucket operations. [copy] - alert: ThanosStoreObjstoreOperationLatencyHigh expr: (histogram_quantile(0.99, sum by (job, le) (rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))) > 2 and sum by (job) (rate(thanos_objstore_bucket_operation_duration_seconds_count{job=~".*thanos-store.*"}[5m])) > 0) for: 10m labels: severity: warning annotations: summary: Thanos Store Objstore Operation Latency High (instance {{ $labels.instance }}) description: "Thanos Store {{$labels.job}} Bucket has a 99th percentile latency of {{$value}} seconds for the bucket operations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.1.6. Thanos : Thanos Ruler (11 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-ruler.yml
-
# 8.1.6.1. Thanos Rule Queue Is Dropping Alerts
Thanos Rule {{$labels.instance}} is failing to queue alerts. [copy] - alert: ThanosRuleQueueIsDroppingAlerts expr: sum by (job, instance) (rate(thanos_alert_queue_alerts_dropped_total{job=~".*thanos-rule.*"}[5m])) > 0 for: 5m labels: severity: critical annotations: summary: Thanos Rule Queue Is Dropping Alerts (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.instance}} is failing to queue alerts.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.6.2. Thanos Rule Sender Is Failing Alerts
Thanos Rule {{$labels.instance}} is failing to send alerts to Alertmanager. [copy] - alert: ThanosRuleSenderIsFailingAlerts expr: sum by (job, instance) (rate(thanos_alert_sender_alerts_dropped_total{job=~".*thanos-rule.*"}[5m])) > 0 for: 5m labels: severity: critical annotations: summary: Thanos Rule Sender Is Failing Alerts (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.instance}} is failing to send alerts to Alertmanager.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.6.3. Thanos Rule High Rule Evaluation Failures
Thanos Rule {{$labels.instance}} is failing to evaluate rules. [copy] - alert: ThanosRuleHighRuleEvaluationFailures expr: (sum by (job, instance) (rate(prometheus_rule_evaluation_failures_total{job=~".*thanos-rule.*"}[5m])) / sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m])) * 100 > 5) for: 5m labels: severity: critical annotations: summary: Thanos Rule High Rule Evaluation Failures (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.instance}} is failing to evaluate rules.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.6.4. Thanos Rule High Rule Evaluation Warnings
Thanos Rule {{$labels.instance}} has a high number of evaluation warnings. [copy] - alert: ThanosRuleHighRuleEvaluationWarnings expr: sum by (job, instance) (rate(thanos_rule_evaluation_with_warnings_total{job=~".*thanos-rule.*"}[5m])) > 0 for: 15m labels: severity: info annotations: summary: Thanos Rule High Rule Evaluation Warnings (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.instance}} has a high number of evaluation warnings.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.6.5. Thanos Rule Rule Evaluation Latency High
Thanos Rule {{$labels.instance}} has higher evaluation latency than interval for {{$labels.rule_group}}. [copy] - alert: ThanosRuleRuleEvaluationLatencyHigh expr: (sum by (job, instance, rule_group) (prometheus_rule_group_last_duration_seconds{job=~".*thanos-rule.*"}) > sum by (job, instance, rule_group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"})) for: 5m labels: severity: warning annotations: summary: Thanos Rule Rule Evaluation Latency High (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.instance}} has higher evaluation latency than interval for {{$labels.rule_group}}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.6.6. Thanos Rule Grpc Error Rate
Thanos Rule {{$labels.job}} is failing to handle {{$value | humanize}}% of requests. [copy] - alert: ThanosRuleGrpcErrorRate expr: (sum by (job, instance) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-rule.*"}[5m]))/ sum by (job, instance) (rate(grpc_server_started_total{job=~".*thanos-rule.*"}[5m])) * 100 > 5) for: 5m labels: severity: warning annotations: summary: Thanos Rule Grpc Error Rate (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.6.7. Thanos Rule Config Reload Failure
Thanos Rule {{$labels.job}} has not been able to reload its configuration. [copy] - alert: ThanosRuleConfigReloadFailure expr: avg by (job, instance) (thanos_rule_config_last_reload_successful{job=~".*thanos-rule.*"}) != 1 for: 5m labels: severity: info annotations: summary: Thanos Rule Config Reload Failure (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.job}} has not been able to reload its configuration.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.6.8. Thanos Rule Query High DNS Failures
Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for query endpoints. [copy] - alert: ThanosRuleQueryHighDNSFailures expr: (sum by (job, instance) (rate(thanos_rule_query_apis_dns_failures_total{job=~".*thanos-rule.*"}[5m])) / sum by (job, instance) (rate(thanos_rule_query_apis_dns_lookups_total{job=~".*thanos-rule.*"}[5m])) * 100 > 1) for: 15m labels: severity: warning annotations: summary: Thanos Rule Query High DNS Failures (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for query endpoints.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.6.9. Thanos Rule Alertmanager High DNS Failures
Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints. [copy] - alert: ThanosRuleAlertmanagerHighDNSFailures expr: (sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_failures_total{job=~".*thanos-rule.*"}[5m])) / sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total{job=~".*thanos-rule.*"}[5m])) * 100 > 1) for: 15m labels: severity: warning annotations: summary: Thanos Rule Alertmanager High DNS Failures (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.6.10. Thanos Rule No Evaluation For 10 Intervals
Thanos Rule {{$labels.job}} has rule groups that did not evaluate for at least 10x of their expected interval. [copy] - alert: ThanosRuleNoEvaluationFor10Intervals expr: time() - max by (job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job=~".*thanos-rule.*"}) > 10 * max by (job, instance, group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"}) for: 5m labels: severity: info annotations: summary: Thanos Rule No Evaluation For 10 Intervals (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.job}} has rule groups that did not evaluate for at least 10x of their expected interval.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
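The expression measures staleness relative to each group's own interval, so a group scheduled every 30s and one scheduled every 5m are each allowed the same ten missed runs. A minimal sketch of the arithmetic (function and sample timestamps are illustrative):

```python
def missed_intervals(now_ts: float, last_evaluation_ts: float, interval_seconds: float) -> float:
    """Evaluation intervals elapsed since the rule group last ran,
    mirroring: time() - last_evaluation_timestamp > 10 * interval."""
    return (now_ts - last_evaluation_ts) / interval_seconds

# a group scheduled every 30s that last evaluated 6 minutes ago has
# missed 12 intervals, which crosses the 10x threshold
```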
-
# 8.1.6.11. Thanos No Rule Evaluations
Thanos Rule {{$labels.instance}} did not perform any rule evaluations in the past 10 minutes. [copy] - alert: ThanosNoRuleEvaluations expr: sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m])) <= 0 and sum by (job, instance) (thanos_rule_loaded_rules{job=~".*thanos-rule.*"}) > 0 for: 5m labels: severity: critical annotations: summary: Thanos No Rule Evaluations (instance {{ $labels.instance }}) description: "Thanos Rule {{$labels.instance}} did not perform any rule evaluations in the past 10 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.1.7. Thanos : Thanos Bucket Replicate (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-bucket-replicate.yml
-
# 8.1.7.1. Thanos Bucket Replicate Error Rate
Thanos Replicate is failing to run, {{$value | humanize}}% of attempts failed. [copy] - alert: ThanosBucketReplicateErrorRate expr: (sum by (job) (rate(thanos_replicate_replication_runs_total{result="error", job=~".*thanos-bucket-replicate.*"}[5m]))/ on (job) group_left sum by (job) (rate(thanos_replicate_replication_runs_total{job=~".*thanos-bucket-replicate.*"}[5m]))) * 100 >= 10 for: 5m labels: severity: critical annotations: summary: Thanos Bucket Replicate Error Rate (instance {{ $labels.instance }}) description: "Thanos Replicate is failing to run, {{$value | humanize}}% of attempts failed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.7.2. Thanos Bucket Replicate Run Latency
Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for the replicate operations. [copy] - alert: ThanosBucketReplicateRunLatency expr: (histogram_quantile(0.99, sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~".*thanos-bucket-replicate.*"}[5m]))) > 20 and sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_count{job=~".*thanos-bucket-replicate.*"}[5m])) > 0) for: 5m labels: severity: critical annotations: summary: Thanos Bucket Replicate Run Latency (instance {{ $labels.instance }}) description: "Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for the replicate operations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.1.8. Thanos : Thanos Component Absent (6 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-component-absent.yml
-
# 8.1.8.1. Thanos Compact Is Down
ThanosCompact has disappeared. Prometheus target for the component cannot be discovered. [copy] - alert: ThanosCompactIsDown expr: absent(up{job=~".*thanos-compact.*"} == 1) for: 5m labels: severity: critical annotations: summary: Thanos Compact Is Down (instance {{ $labels.instance }}) description: "ThanosCompact has disappeared. Prometheus target for the component cannot be discovered.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
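All six rules in this group lean on the same absent(up{...} == 1) pattern: the inner selector returns nothing while no healthy target exists, and absent() turns that emptiness into a single firing sample. A toy Python model of the semantics (this stand-in is illustrative, not the real PromQL engine):

```python
def absent(matching_samples):
    """Toy model of PromQL absent(): emit one sample with value 1 when the
    inner selector matched nothing, otherwise emit an empty vector."""
    return [1.0] if not matching_samples else []

# no series satisfies up{job=~".*thanos-compact.*"} == 1 while the
# component is down, so absent() yields the one sample that fires the alert
```

This is also why such alerts carry no instance label worth templating: the firing sample is synthesized, not scraped.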
-
# 8.1.8.2. Thanos Query Is Down
ThanosQuery has disappeared. Prometheus target for the component cannot be discovered. [copy] - alert: ThanosQueryIsDown expr: absent(up{job=~".*thanos-query.*"} == 1) for: 5m labels: severity: critical annotations: summary: Thanos Query Is Down (instance {{ $labels.instance }}) description: "ThanosQuery has disappeared. Prometheus target for the component cannot be discovered.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.8.3. Thanos Receive Is Down
ThanosReceive has disappeared. Prometheus target for the component cannot be discovered. [copy] - alert: ThanosReceiveIsDown expr: absent(up{job=~".*thanos-receive.*"} == 1) for: 5m labels: severity: critical annotations: summary: Thanos Receive Is Down (instance {{ $labels.instance }}) description: "ThanosReceive has disappeared. Prometheus target for the component cannot be discovered.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.8.4. Thanos Rule Is Down
ThanosRule has disappeared. Prometheus target for the component cannot be discovered. [copy] - alert: ThanosRuleIsDown expr: absent(up{job=~".*thanos-rule.*"} == 1) for: 5m labels: severity: critical annotations: summary: Thanos Rule Is Down (instance {{ $labels.instance }}) description: "ThanosRule has disappeared. Prometheus target for the component cannot be discovered.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.8.5. Thanos Sidecar Is Down
ThanosSidecar has disappeared. Prometheus target for the component cannot be discovered. [copy] - alert: ThanosSidecarIsDown expr: absent(up{job=~".*thanos-sidecar.*"} == 1) for: 5m labels: severity: critical annotations: summary: Thanos Sidecar Is Down (instance {{ $labels.instance }}) description: "ThanosSidecar has disappeared. Prometheus target for the component cannot be discovered.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.1.8.6. Thanos Store Is Down
ThanosStore has disappeared. Prometheus target for the component cannot be discovered. [copy] - alert: ThanosStoreIsDown expr: absent(up{job=~".*thanos-store.*"} == 1) for: 5m labels: severity: critical annotations: summary: Thanos Store Is Down (instance {{ $labels.instance }}) description: "ThanosStore has disappeared. Prometheus target for the component cannot be discovered.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.2. Loki : Embedded exporter (4 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/loki/embedded-exporter.yml
-
# 8.2.1. Loki process too many restarts
A Loki process had too many restarts (target {{ $labels.instance }}) [copy] - alert: LokiProcessTooManyRestarts expr: changes(process_start_time_seconds{job=~".*loki.*"}[15m]) > 2 for: 0m labels: severity: warning annotations: summary: Loki process too many restarts (instance {{ $labels.instance }}) description: "A Loki process had too many restarts (target {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
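changes(process_start_time_seconds[15m]) counts how often the start-time value moved inside the window; each restart writes a new start timestamp. A toy sketch of the counting (sample values are illustrative Unix timestamps):

```python
def changes(samples):
    """Toy model of PromQL changes(): number of value changes between
    consecutive samples in the range vector."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a != b)

# three distinct restarts inside the window: changes() = 3 > 2, the alert fires
```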
-
# 8.2.2. Loki request errors
The {{ $labels.job }} {{ $labels.route }} is experiencing errors [copy] - alert: LokiRequestErrors expr: 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])) by (namespace, job, route) / sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 10 for: 15m labels: severity: critical annotations: summary: Loki request errors (instance {{ $labels.instance }}) description: "The {{ $labels.job }} {{ $labels.route }} is experiencing errors\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.2.3. Loki request panic
The {{ $labels.job }} is experiencing a {{ printf "%.2f" $value }}% increase in panics [copy] - alert: LokiRequestPanic expr: sum(increase(loki_panic_total[10m])) by (namespace, job) > 0 for: 5m labels: severity: critical annotations: summary: Loki request panic (instance {{ $labels.instance }}) description: "The {{ $labels.job }} is experiencing a {{ printf \"%.2f\" $value }}% increase in panics\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.2.4. Loki request latency
The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency [copy] - alert: LokiRequestLatency expr: (histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket{route!~"(?i).*tail.*"}[5m])) by (le))) > 1 for: 5m labels: severity: critical annotations: summary: Loki request latency (instance {{ $labels.instance }}) description: "The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}s 99th percentile latency\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.3. Promtail : Embedded exporter (2 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/promtail/embedded-exporter.yml
-
# 8.3.1. Promtail request errors
The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}% errors. [copy] - alert: PromtailRequestErrors expr: 100 * sum(rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[1m])) by (namespace, job, route, instance) / sum(rate(promtail_request_duration_seconds_count[1m])) by (namespace, job, route, instance) > 10 for: 5m labels: severity: critical annotations: summary: Promtail request errors (instance {{ $labels.instance }}) description: "The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}% errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.3.2. Promtail request latency
The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency. [copy] - alert: PromtailRequestLatency expr: histogram_quantile(0.99, sum(rate(promtail_request_duration_seconds_bucket[5m])) by (le)) > 1 for: 5m labels: severity: critical annotations: summary: Promtail request latency (instance {{ $labels.instance }}) description: "The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}s 99th percentile latency.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.4. Cortex : Embedded exporter (6 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/cortex/embedded-exporter.yml
-
# 8.4.1. Cortex ruler configuration reload failure
Cortex ruler configuration reload failure (instance {{ $labels.instance }}) [copy] - alert: CortexRulerConfigurationReloadFailure expr: cortex_ruler_config_last_reload_successful != 1 for: 0m labels: severity: warning annotations: summary: Cortex ruler configuration reload failure (instance {{ $labels.instance }}) description: "Cortex ruler configuration reload failure (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.4.2. Cortex not connected to Alertmanager
Cortex not connected to Alertmanager (instance {{ $labels.instance }}) [copy] - alert: CortexNotConnectedToAlertmanager expr: cortex_prometheus_notifications_alertmanagers_discovered < 1 for: 0m labels: severity: critical annotations: summary: Cortex not connected to Alertmanager (instance {{ $labels.instance }}) description: "Cortex not connected to Alertmanager (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.4.3. Cortex notifications are being dropped
Cortex notifications are being dropped due to errors (instance {{ $labels.instance }}) [copy] - alert: CortexNotificationAreBeingDropped expr: rate(cortex_prometheus_notifications_dropped_total[5m]) > 0 for: 0m labels: severity: critical annotations: summary: Cortex notifications are being dropped (instance {{ $labels.instance }}) description: "Cortex notifications are being dropped due to errors (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.4.4. Cortex notification error
Cortex is failing when sending alert notifications (instance {{ $labels.instance }}) [copy] - alert: CortexNotificationError expr: rate(cortex_prometheus_notifications_errors_total[5m]) > 0 for: 0m labels: severity: critical annotations: summary: Cortex notification error (instance {{ $labels.instance }}) description: "Cortex is failing when sending alert notifications (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.4.5. Cortex ingester unhealthy
Cortex has an unhealthy ingester [copy] - alert: CortexIngesterUnhealthy expr: cortex_ring_members{state="Unhealthy", name="ingester"} > 0 for: 0m labels: severity: critical annotations: summary: Cortex ingester unhealthy (instance {{ $labels.instance }}) description: "Cortex has an unhealthy ingester\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.4.6. Cortex frontend queries stuck
There are queued up queries in query-frontend. [copy] - alert: CortexFrontendQueriesStuck expr: sum by (job) (cortex_query_frontend_queue_length) > 0 for: 5m labels: severity: critical annotations: summary: Cortex frontend queries stuck (instance {{ $labels.instance }}) description: "There are queued up queries in query-frontend.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.5. Grafana Tempo : Embedded exporter (18 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/grafana-tempo/embedded-exporter.yml
-
# 8.5.1. Tempo distributor unhealthy
Tempo has {{ $value }} unhealthy distributor(s). [copy] - alert: TempoDistributorUnhealthy expr: max by (job) (tempo_ring_members{state="Unhealthy", name="distributor"}) > 0 for: 15m labels: severity: warning annotations: summary: Tempo distributor unhealthy (instance {{ $labels.instance }}) description: "Tempo has {{ $value }} unhealthy distributor(s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.2. Tempo live store unhealthy
Tempo has {{ $value }} unhealthy live store(s). [copy] - alert: TempoLiveStoreUnhealthy expr: max by (job) (tempo_ring_members{state="Unhealthy", name="live-store"}) > 0 for: 15m labels: severity: critical annotations: summary: Tempo live store unhealthy (instance {{ $labels.instance }}) description: "Tempo has {{ $value }} unhealthy live store(s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.3. Tempo metrics generator unhealthy
Tempo has {{ $value }} unhealthy metrics generator(s). [copy] - alert: TempoMetricsGeneratorUnhealthy expr: max by (job) (tempo_ring_members{state="Unhealthy", name="metrics-generator"}) > 0 for: 15m labels: severity: critical annotations: summary: Tempo metrics generator unhealthy (instance {{ $labels.instance }}) description: "Tempo has {{ $value }} unhealthy metrics generator(s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.4. Tempo compactions failing
Greater than 2 compactions have failed in the past hour. [copy] # Uses a two-window approach: 1h for historical count and 5m to confirm the issue is ongoing. - alert: TempoCompactionsFailing expr: sum by (job) (increase(tempodb_compaction_errors_total[1h])) > 2 and sum by (job) (increase(tempodb_compaction_errors_total[5m])) > 0 for: 1h labels: severity: critical annotations: summary: Tempo compactions failing (instance {{ $labels.instance }}) description: "Greater than 2 compactions have failed in the past hour.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
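The inline comment describes a pattern that several Tempo rules below also use: the long window establishes that enough failures accumulated, and the short window confirms the problem is still happening, so incidents that already resolved do not keep the alert firing. A sketch of that boolean logic (the function name and thresholds are illustrative):

```python
def two_window_fire(errors_long_window: float, errors_short_window: float,
                    long_threshold: float = 2) -> bool:
    """Fire only when failures exceeded the long-window threshold AND at
    least one failure occurred recently (i.e. the problem is ongoing)."""
    return errors_long_window > long_threshold and errors_short_window > 0

# five failed compactions in the past hour but none in the last 5 minutes:
# the incident is over, so the alert stays quiet
```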
-
# 8.5.5. Tempo polls failing
Greater than 2 blocklist polls have failed in the past hour. [copy] - alert: TempoPollsFailing expr: sum by (job) (increase(tempodb_blocklist_poll_errors_total[1h])) > 2 and sum by (job) (increase(tempodb_blocklist_poll_errors_total[5m])) > 0 for: 0m labels: severity: critical annotations: summary: Tempo polls failing (instance {{ $labels.instance }}) description: "Greater than 2 blocklist polls have failed in the past hour.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.6. Tempo tenant index failures
Greater than 2 tenant index failures in the past hour. [copy] - alert: TempoTenantIndexFailures expr: sum by (job) (increase(tempodb_blocklist_tenant_index_errors_total[1h])) > 2 and sum by (job) (increase(tempodb_blocklist_tenant_index_errors_total[5m])) > 0 for: 0m labels: severity: critical annotations: summary: Tempo tenant index failures (instance {{ $labels.instance }}) description: "Greater than 2 tenant index failures in the past hour.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.7. Tempo no tenant index builders
No tenant index builders for tenant {{ $labels.tenant }}. Tenant index will quickly become stale. [copy] - alert: TempoNoTenantIndexBuilders expr: sum by (tenant) (tempodb_blocklist_tenant_index_builder) == 0 and on() max(tempodb_blocklist_length) > 0 for: 5m labels: severity: critical annotations: summary: Tempo no tenant index builders (instance {{ $labels.instance }}) description: "No tenant index builders for tenant {{ $labels.tenant }}. Tenant index will quickly become stale.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.8. Tempo tenant index too old
Tenant index for {{ $labels.tenant }} is {{ $value }}s old. [copy] # Threshold of 600s (10 minutes). Adjust based on your tenant index build interval. - alert: TempoTenantIndexTooOld expr: max by (tenant) (tempodb_blocklist_tenant_index_age_seconds) > 600 for: 5m labels: severity: critical annotations: summary: Tempo tenant index too old (instance {{ $labels.instance }}) description: "Tenant index for {{ $labels.tenant }} is {{ $value }}s old.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.9. Tempo block list rising quickly
Tempo blocklist length is up {{ printf "%.0f" $value }}% over the last 7 days. Consider scaling compactors. [copy] # Fires when the blocklist grows more than 40% over 7 days. - alert: TempoBlockListRisingQuickly expr: (avg(tempodb_blocklist_length) / avg(tempodb_blocklist_length offset 7d) - 1) * 100 > 40 for: 15m labels: severity: critical annotations: summary: Tempo block list rising quickly (instance {{ $labels.instance }}) description: "Tempo blocklist length is up {{ printf \"%.0f\" $value }}% over the last 7 days. Consider scaling compactors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
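The growth check compares the current average blocklist length against its value one week earlier via offset 7d. A minimal sketch of the percentage math (the 40% cutoff is this rule's arbitrary tolerance; function name and sample counts are illustrative):

```python
def blocklist_growth_pct(current_avg: float, week_ago_avg: float) -> float:
    """Mirrors the PromQL: (avg(length) / avg(length offset 7d) - 1) * 100."""
    return (current_avg / week_ago_avg - 1) * 100

# 10,000 blocks a week ago, 15,000 now: up 50%, above the 40% threshold
```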
-
# 8.5.10. Tempo bad overrides
{{ $labels.job }} failed to reload runtime overrides. [copy] - alert: TempoBadOverrides expr: sum by (job) (tempo_runtime_config_last_reload_successful == 0) > 0 for: 15m labels: severity: critical annotations: summary: Tempo bad overrides (instance {{ $labels.instance }}) description: "{{ $labels.job }} failed to reload runtime overrides.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.11. Tempo user configurable overrides reload failing
Greater than 5 user-configurable overrides reloads have failed in the past hour. [copy] - alert: TempoUserConfigurableOverridesReloadFailing expr: sum by (job) (increase(tempo_overrides_user_configurable_overrides_reload_failed_total[1h])) > 5 and sum by (job) (increase(tempo_overrides_user_configurable_overrides_reload_failed_total[5m])) > 0 for: 0m labels: severity: critical annotations: summary: Tempo user configurable overrides reload failing (instance {{ $labels.instance }}) description: "Greater than 5 user-configurable overrides reloads have failed in the past hour.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.12. Tempo compaction too many outstanding blocks warning
There are too many outstanding compaction blocks for {{ $labels.instance }}. Consider increasing compactor resources. [copy] # Threshold of 100 blocks per compactor instance. Adjust based on your environment. - alert: TempoCompactionTooManyOutstandingBlocksWarning expr: sum by (instance) (tempodb_compaction_outstanding_blocks) > 100 for: 6h labels: severity: warning annotations: summary: Tempo compaction too many outstanding blocks warning (instance {{ $labels.instance }}) description: "There are too many outstanding compaction blocks for {{ $labels.instance }}. Consider increasing compactor resources.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.13. Tempo compaction too many outstanding blocks critical
There are too many outstanding compaction blocks for {{ $labels.instance }}. Increase compactor resources immediately. [copy] - alert: TempoCompactionTooManyOutstandingBlocksCritical expr: sum by (instance) (tempodb_compaction_outstanding_blocks) > 250 for: 24h labels: severity: critical annotations: summary: Tempo compaction too many outstanding blocks critical (instance {{ $labels.instance }}) description: "There are too many outstanding compaction blocks for {{ $labels.instance }}. Increase compactor resources immediately.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.14. Tempo distributor usage tracker errors
Tempo distributor usage tracker errors for {{ $labels.job }} (reason {{ $labels.reason }}). [copy] - alert: TempoDistributorUsageTrackerErrors expr: sum by (job, reason) (rate(tempo_distributor_usage_tracker_errors_total[5m])) > 0 for: 30m labels: severity: critical annotations: summary: Tempo distributor usage tracker errors (instance {{ $labels.instance }}) description: "Tempo distributor usage tracker errors for {{ $labels.job }} (reason {{ $labels.reason }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.15. Tempo metrics generator processor updates failing
Tempo metrics generator processor updates are failing for {{ $labels.job }}. [copy] - alert: TempoMetricsGeneratorProcessorUpdatesFailing expr: sum by (job) (increase(tempo_metrics_generator_active_processors_update_failed_total[5m])) > 0 for: 15m labels: severity: critical annotations: summary: Tempo metrics generator processor updates failing (instance {{ $labels.instance }}) description: "Tempo metrics generator processor updates are failing for {{ $labels.job }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.16. Tempo metrics generator service graphs dropping spans
Tempo metrics generator is dropping {{ printf "%.2f" $value }}% of spans in service graphs for {{ $labels.job }}. [copy] - alert: TempoMetricsGeneratorServiceGraphsDroppingSpans expr: 100 * sum by (job) (rate(tempo_metrics_generator_processor_service_graphs_dropped_spans[5m])) / sum by (job) (rate(tempo_metrics_generator_spans_received_total[5m])) > 0.5 for: 15m labels: severity: warning annotations: summary: Tempo metrics generator service graphs dropping spans (instance {{ $labels.instance }}) description: "Tempo metrics generator is dropping {{ printf \"%.2f\" $value }}% of spans in service graphs for {{ $labels.job }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.17. Tempo metrics generator collections failing
Tempo metrics generator collections are failing for {{ $labels.job }}. [copy] - alert: TempoMetricsGeneratorCollectionsFailing expr: sum by (job) (increase(tempo_metrics_generator_registry_collections_failed_total[5m])) > 2 for: 5m labels: severity: critical annotations: summary: Tempo metrics generator collections failing (instance {{ $labels.instance }}) description: "Tempo metrics generator collections are failing for {{ $labels.job }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.5.18. Tempo memcached errors elevated
Tempo memcached error rate is {{ printf "%.2f" $value }}% for {{ $labels.name }} in {{ $labels.job }}. [copy] # Fires when the memcached error rate exceeds 20%. Only relevant if Tempo is configured with memcached caching. - alert: TempoMemcachedErrorsElevated expr: 100 * sum by (name, job) (rate(tempo_memcache_request_duration_seconds_count{status_code="500"}[5m])) / sum by (name, job) (rate(tempo_memcache_request_duration_seconds_count[5m])) > 20 for: 10m labels: severity: warning annotations: summary: Tempo memcached errors elevated (instance {{ $labels.instance }}) description: "Tempo memcached error rate is {{ printf \"%.2f\" $value }}% for {{ $labels.name }} in {{ $labels.job }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.6. Grafana Mimir : Embedded exporter (49 rules) [copy section]
Mimir uses the `cortex_` metric prefix for backward compatibility with Cortex. This is intentional and expected.
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/grafana-mimir/embedded-exporter.yml-
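Once downloaded, the rule file is loaded into Prometheus via `rule_files`. A minimal sketch — the on-disk path here is an assumption, adjust to wherever you saved the file:

```yaml
# prometheus.yml fragment — the rules directory path is illustrative.
rule_files:
  - /etc/prometheus/rules/grafana-mimir/embedded-exporter.yml
```

Running `promtool check rules <file>` validates the syntax before reloading Prometheus.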
# 8.6.1. Mimir ingester unhealthy
Mimir has {{ $value }} unhealthy ingester(s) in the ring. [copy] - alert: MimirIngesterUnhealthy expr: min by (job) (cortex_ring_members{state="Unhealthy", name="ingester"}) > 0 for: 15m labels: severity: critical annotations: summary: Mimir ingester unhealthy (instance {{ $labels.instance }}) description: "Mimir has {{ $value }} unhealthy ingester(s) in the ring.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.2. Mimir request errors
Mimir {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}% errors. [copy] - alert: MimirRequestErrors expr: 100 * sum by (job, route) (rate(cortex_request_duration_seconds_count{status_code=~"5..", route!~"ready|debug_pprof"}[5m])) / sum by (job, route) (rate(cortex_request_duration_seconds_count{route!~"ready|debug_pprof"}[5m])) > 1 for: 15m labels: severity: critical annotations: summary: Mimir request errors (instance {{ $labels.instance }}) description: "Mimir {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}% errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.3. Mimir inconsistent runtime config
An inconsistent runtime config file is used across Mimir instances. [copy] - alert: MimirInconsistentRuntimeConfig expr: count(count by (job, sha256) (cortex_runtime_config_hash)) without(sha256) > 1 for: 1h labels: severity: critical annotations: summary: Mimir inconsistent runtime config (instance {{ $labels.instance }}) description: "An inconsistent runtime config file is used across Mimir instances.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.4. Mimir bad runtime config
{{ $labels.job }} failed to reload runtime config. [copy] - alert: MimirBadRuntimeConfig expr: sum by (job) (cortex_runtime_config_last_reload_successful == 0) > 0 for: 5m labels: severity: critical annotations: summary: Mimir bad runtime config (instance {{ $labels.instance }}) description: "{{ $labels.job }} failed to reload runtime config.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.5. Mimir scheduler queries stuck
There are {{ $value }} queued up queries in {{ $labels.job }}. [copy] - alert: MimirSchedulerQueriesStuck expr: sum by (job) (min_over_time(cortex_query_scheduler_queue_length[1m])) > 0 for: 7m labels: severity: critical annotations: summary: Mimir scheduler queries stuck (instance {{ $labels.instance }}) description: "There are {{ $value }} queued up queries in {{ $labels.job }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.6. Mimir cache request errors
Mimir cache {{ $labels.name }} is experiencing {{ printf "%.2f" $value }}% errors for {{ $labels.operation }} operation. [copy] - alert: MimirCacheRequestErrors expr: (sum by (name, operation, job) (rate(thanos_cache_operation_failures_total[5m])) / sum by (name, operation, job) (rate(thanos_cache_operations_total[5m]))) * 100 > 5 for: 5m labels: severity: warning annotations: summary: Mimir cache request errors (instance {{ $labels.instance }}) description: "Mimir cache {{ $labels.name }} is experiencing {{ printf \"%.2f\" $value }}% errors for {{ $labels.operation }} operation.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.7. Mimir KV store failure
Mimir {{ $labels.job }} KV store {{ $labels.kv_name }} is failing with 100% error rate. [copy] - alert: MimirKvStoreFailure expr: (sum by (job, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~"2.."}[5m])) / sum by (job, kv_name) (rate(cortex_kv_request_duration_seconds_count[5m]))) == 1 for: 5m labels: severity: critical annotations: summary: Mimir KV store failure (instance {{ $labels.instance }}) description: "Mimir {{ $labels.job }} KV store {{ $labels.kv_name }} is failing with 100% error rate.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.8. Mimir memory map areas too high
Mimir {{ $labels.job }} is using {{ printf "%.0f" $value }}% of its memory map area limit. [copy] - alert: MimirMemoryMapAreasTooHigh expr: process_memory_map_areas{job=~".*(ingester|cortex|mimir|store-gateway).*"} / process_memory_map_areas_limit{job=~".*(ingester|cortex|mimir|store-gateway).*"} * 100 > 80 for: 5m labels: severity: critical annotations: summary: Mimir memory map areas too high (instance {{ $labels.instance }}) description: "Mimir {{ $labels.job }} is using {{ printf \"%.0f\" $value }}% of its memory map area limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.9. Mimir ingester instance has no tenants
Mimir ingester {{ $labels.instance }} has no tenants assigned. [copy] - alert: MimirIngesterInstanceHasNoTenants expr: (cortex_ingester_memory_users == 0) and on (instance) (cortex_ingester_memory_users offset 1h > 0) for: 1h labels: severity: warning annotations: summary: Mimir ingester instance has no tenants (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} has no tenants assigned.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.10. Mimir ruler instance has no rule groups
Mimir ruler {{ $labels.instance }} has no rule groups assigned. [copy] - alert: MimirRulerInstanceHasNoRuleGroups expr: (cortex_ruler_managers_total == 0) and on (instance) (cortex_ruler_managers_total offset 1h > 0) for: 1h labels: severity: warning annotations: summary: Mimir ruler instance has no rule groups (instance {{ $labels.instance }}) description: "Mimir ruler {{ $labels.instance }} has no rule groups assigned.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.11. Mimir ingested data too far in the future
Mimir ingester {{ $labels.job }} has ingested samples with timestamps more than 1 hour in the future. [copy] - alert: MimirIngestedDataTooFarInTheFuture expr: max by (job) (cortex_ingester_tsdb_head_max_timestamp_seconds - time() and cortex_ingester_tsdb_head_max_timestamp_seconds > 0) > 3600 for: 5m labels: severity: warning annotations: summary: Mimir ingested data too far in the future (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.job }} has ingested samples with timestamps more than 1 hour in the future.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.12. Mimir store gateway too many failed operations
Mimir store-gateway {{ $labels.job }} bucket operations are failing. [copy] - alert: MimirStoreGatewayTooManyFailedOperations expr: sum by (job) (rate(thanos_objstore_bucket_operation_failures_total[5m])) > 0 for: 5m labels: severity: warning annotations: summary: Mimir store gateway too many failed operations (instance {{ $labels.instance }}) description: "Mimir store-gateway {{ $labels.job }} bucket operations are failing.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.13. Mimir ring members mismatch
Mimir {{ $labels.name }} ring has inconsistent member counts across instances. [copy] - alert: MimirRingMembersMismatch expr: max by (name, job) (sum by (name, job, instance) (cortex_ring_members)) != min by (name, job) (sum by (name, job, instance) (cortex_ring_members)) for: 15m labels: severity: warning annotations: summary: Mimir ring members mismatch (instance {{ $labels.instance }}) description: "Mimir {{ $labels.name }} ring has inconsistent member counts across instances.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.14. Mimir ingester reaching series limit warning
Mimir ingester {{ $labels.instance }} has reached {{ printf "%.0f" $value }}% of its series limit. [copy] - alert: MimirIngesterReachingSeriesLimitWarning expr: (cortex_ingester_memory_series / ignoring(limit) cortex_ingester_instance_limits{limit="max_series"} * 100 > 80) and cortex_ingester_instance_limits{limit="max_series"} > 0 for: 3h labels: severity: warning annotations: summary: Mimir ingester reaching series limit warning (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} has reached {{ printf \"%.0f\" $value }}% of its series limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.15. Mimir ingester reaching series limit critical
Mimir ingester {{ $labels.instance }} has reached {{ printf "%.0f" $value }}% of its series limit. [copy] - alert: MimirIngesterReachingSeriesLimitCritical expr: (cortex_ingester_memory_series / ignoring(limit) cortex_ingester_instance_limits{limit="max_series"} * 100 > 90) and cortex_ingester_instance_limits{limit="max_series"} > 0 for: 5m labels: severity: critical annotations: summary: Mimir ingester reaching series limit critical (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} has reached {{ printf \"%.0f\" $value }}% of its series limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.16. Mimir ingester reaching tenants limit warning
Mimir ingester {{ $labels.instance }} has reached {{ printf "%.0f" $value }}% of its tenants limit. [copy] - alert: MimirIngesterReachingTenantsLimitWarning expr: (cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit="max_tenants"} * 100 > 70) and cortex_ingester_instance_limits{limit="max_tenants"} > 0 for: 5m labels: severity: warning annotations: summary: Mimir ingester reaching tenants limit warning (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} has reached {{ printf \"%.0f\" $value }}% of its tenants limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.17. Mimir ingester reaching tenants limit critical
Mimir ingester {{ $labels.instance }} has reached {{ printf "%.0f" $value }}% of its tenants limit. [copy] - alert: MimirIngesterReachingTenantsLimitCritical expr: (cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit="max_tenants"} * 100 > 80) and cortex_ingester_instance_limits{limit="max_tenants"} > 0 for: 5m labels: severity: critical annotations: summary: Mimir ingester reaching tenants limit critical (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} has reached {{ printf \"%.0f\" $value }}% of its tenants limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
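The four limit rules above join the in-memory gauge to its per-instance limit with `ignoring(limit)`, because `cortex_ingester_instance_limits` carries a `limit` label that the gauge lacks. A hypothetical recording rule (the `record` name is illustrative) keeps the utilization percentage as its own series for dashboards, reusing the exact join from the alerts:

```yaml
# Hypothetical recording rule; same vector-matching join as the alerts above.
- record: ingester:tenant_limit_utilization:percent
  expr: cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit="max_tenants"} * 100
```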
-
# 8.6.18. Mimir reaching TCP connections limit
Mimir instance {{ $labels.instance }} is using {{ printf "%.0f" $value }}% of its TCP connections limit. [copy] - alert: MimirReachingTcpConnectionsLimit expr: cortex_tcp_connections / cortex_tcp_connections_limit * 100 > 80 and cortex_tcp_connections_limit > 0 for: 5m labels: severity: critical annotations: summary: Mimir reaching TCP connections limit (instance {{ $labels.instance }}) description: "Mimir instance {{ $labels.instance }} is using {{ printf \"%.0f\" $value }}% of its TCP connections limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.19. Mimir distributor inflight requests high
Mimir distributor {{ $labels.instance }} is using {{ printf "%.0f" $value }}% of its inflight push requests limit. [copy] - alert: MimirDistributorInflightRequestsHigh expr: (cortex_distributor_inflight_push_requests / ignoring(limit) cortex_distributor_instance_limits{limit="max_inflight_push_requests"} * 100 > 80) and cortex_distributor_instance_limits{limit="max_inflight_push_requests"} > 0 for: 5m labels: severity: critical annotations: summary: Mimir distributor inflight requests high (instance {{ $labels.instance }}) description: "Mimir distributor {{ $labels.instance }} is using {{ printf \"%.0f\" $value }}% of its inflight push requests limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.20. Mimir ingester TSDB head compaction failed
Mimir ingester {{ $labels.instance }} is failing to compact TSDB head. [copy] - alert: MimirIngesterTsdbHeadCompactionFailed expr: rate(cortex_ingester_tsdb_compactions_failed_total[5m]) > 0 for: 15m labels: severity: critical annotations: summary: Mimir ingester TSDB head compaction failed (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} is failing to compact TSDB head.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.21. Mimir ingester TSDB head truncation failed
Mimir ingester {{ $labels.instance }} is failing to truncate TSDB head. [copy] - alert: MimirIngesterTsdbHeadTruncationFailed expr: rate(cortex_ingester_tsdb_head_truncations_failed_total[5m]) > 0 for: 0m labels: severity: critical annotations: summary: Mimir ingester TSDB head truncation failed (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} is failing to truncate TSDB head.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.22. Mimir ingester TSDB checkpoint creation failed
Mimir ingester {{ $labels.instance }} is failing to create TSDB checkpoints. [copy] - alert: MimirIngesterTsdbCheckpointCreationFailed expr: rate(cortex_ingester_tsdb_checkpoint_creations_failed_total[5m]) > 0 for: 0m labels: severity: critical annotations: summary: Mimir ingester TSDB checkpoint creation failed (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} is failing to create TSDB checkpoints.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.23. Mimir ingester TSDB checkpoint deletion failed
Mimir ingester {{ $labels.instance }} is failing to delete TSDB checkpoints. [copy] - alert: MimirIngesterTsdbCheckpointDeletionFailed expr: rate(cortex_ingester_tsdb_checkpoint_deletions_failed_total[5m]) > 0 for: 0m labels: severity: critical annotations: summary: Mimir ingester TSDB checkpoint deletion failed (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} is failing to delete TSDB checkpoints.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.24. Mimir ingester TSDB WAL truncation failed
Mimir ingester {{ $labels.instance }} is failing to truncate TSDB WAL. [copy] - alert: MimirIngesterTsdbWalTruncationFailed expr: rate(cortex_ingester_tsdb_wal_truncations_failed_total[5m]) > 0 for: 0m labels: severity: warning annotations: summary: Mimir ingester TSDB WAL truncation failed (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} is failing to truncate TSDB WAL.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.25. Mimir ingester TSDB WAL writes failed
Mimir ingester {{ $labels.instance }} is failing to write to TSDB WAL. [copy] - alert: MimirIngesterTsdbWalWritesFailed expr: rate(cortex_ingester_tsdb_wal_writes_failed_total[1m]) > 0 for: 3m labels: severity: critical annotations: summary: Mimir ingester TSDB WAL writes failed (instance {{ $labels.instance }}) description: "Mimir ingester {{ $labels.instance }} is failing to write to TSDB WAL.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.26. Mimir store gateway has not synced bucket
Mimir store-gateway {{ $labels.instance }} has not synced the bucket for more than 10 minutes. [copy] - alert: MimirStoreGatewayHasNotSyncedBucket expr: (time() - cortex_bucket_stores_blocks_last_successful_sync_timestamp_seconds{component="store-gateway"} > 600) and cortex_bucket_stores_blocks_last_successful_sync_timestamp_seconds{component="store-gateway"} > 0 for: 5m labels: severity: critical annotations: summary: Mimir store gateway has not synced bucket (instance {{ $labels.instance }}) description: "Mimir store-gateway {{ $labels.instance }} has not synced the bucket for more than 10 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.27. Mimir store gateway no synced tenants
Mimir store-gateway {{ $labels.instance }} has no synced tenants. [copy] - alert: MimirStoreGatewayNoSyncedTenants expr: (min by (instance, job) (cortex_bucket_stores_tenants_synced{component="store-gateway"}) == 0) and on (instance) (cortex_bucket_stores_tenants_synced{component="store-gateway"} offset 1h > 0) for: 1h labels: severity: warning annotations: summary: Mimir store gateway no synced tenants (instance {{ $labels.instance }}) description: "Mimir store-gateway {{ $labels.instance }} has no synced tenants.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.28. Mimir bucket index not updated
Mimir bucket index for tenant {{ $labels.user }} has not been updated for more than 35 minutes. [copy] - alert: MimirBucketIndexNotUpdated expr: min by (user, job) (time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > 2100 for: 0m labels: severity: critical annotations: summary: Mimir bucket index not updated (instance {{ $labels.instance }}) description: "Mimir bucket index for tenant {{ $labels.user }} has not been updated for more than 35 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.29. Mimir compactor not cleaning up blocks
Mimir compactor {{ $labels.instance }} has not cleaned up blocks in the last 6 hours. [copy] - alert: MimirCompactorNotCleaningUpBlocks expr: (time() - cortex_compactor_block_cleanup_last_successful_run_timestamp_seconds > 21600) and cortex_compactor_block_cleanup_last_successful_run_timestamp_seconds > 0 for: 1h labels: severity: critical annotations: summary: Mimir compactor not cleaning up blocks (instance {{ $labels.instance }}) description: "Mimir compactor {{ $labels.instance }} has not cleaned up blocks in the last 6 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.30. Mimir compactor not running compaction
Mimir compactor {{ $labels.instance }} has not run compaction in the last 24 hours. [copy] - alert: MimirCompactorNotRunningCompaction expr: (time() - cortex_compactor_last_successful_run_timestamp_seconds > 86400) and cortex_compactor_last_successful_run_timestamp_seconds > 0 for: 15m labels: severity: critical annotations: summary: Mimir compactor not running compaction (instance {{ $labels.instance }}) description: "Mimir compactor {{ $labels.instance }} has not run compaction in the last 24 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.31. Mimir compactor has consecutive failures
Mimir compactor {{ $labels.instance }} has had 2+ compaction failures in the last 2 hours. [copy] - alert: MimirCompactorHasConsecutiveFailures expr: increase(cortex_compactor_runs_failed_total[2h]) > 1 for: 0m labels: severity: critical annotations: summary: Mimir compactor has consecutive failures (instance {{ $labels.instance }}) description: "Mimir compactor {{ $labels.instance }} has had 2+ compaction failures in the last 2 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.32. Mimir compactor has run out of disk space
Mimir compactor {{ $labels.instance }} has run out of disk space. [copy] - alert: MimirCompactorHasRunOutOfDiskSpace expr: increase(cortex_compactor_disk_out_of_space_errors_total[24h]) >= 1 for: 0m labels: severity: critical annotations: summary: Mimir compactor has run out of disk space (instance {{ $labels.instance }}) description: "Mimir compactor {{ $labels.instance }} has run out of disk space.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.33. Mimir compactor has not uploaded blocks
Mimir compactor {{ $labels.instance }} has not uploaded any block in the last 24 hours. [copy] - alert: MimirCompactorHasNotUploadedBlocks expr: (time() - thanos_objstore_bucket_last_successful_upload_time{component="compactor"} > 86400) and thanos_objstore_bucket_last_successful_upload_time{component="compactor"} > 0 for: 15m labels: severity: critical annotations: summary: Mimir compactor has not uploaded blocks (instance {{ $labels.instance }}) description: "Mimir compactor {{ $labels.instance }} has not uploaded any block in the last 24 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.34. Mimir compactor skipped blocks
Mimir compactor has found blocks that cannot be compacted (reason {{ $labels.reason }}). [copy] - alert: MimirCompactorSkippedBlocks expr: increase(cortex_compactor_blocks_marked_for_no_compaction_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Mimir compactor skipped blocks (instance {{ $labels.instance }}) description: "Mimir compactor has found blocks that cannot be compacted (reason {{ $labels.reason }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.35. Mimir ruler too many failed pushes
Mimir ruler {{ $labels.instance }} is failing to push {{ printf "%.2f" $value }}% of write requests. [copy] - alert: MimirRulerTooManyFailedPushes expr: 100 * sum by (instance, job) (rate(cortex_ruler_write_requests_failed_total[5m])) / sum by (instance, job) (rate(cortex_ruler_write_requests_total[5m])) > 1 for: 5m labels: severity: critical annotations: summary: Mimir ruler too many failed pushes (instance {{ $labels.instance }}) description: "Mimir ruler {{ $labels.instance }} is failing to push {{ printf \"%.2f\" $value }}% of write requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.36. Mimir ruler too many failed queries
Mimir ruler {{ $labels.instance }} is failing {{ printf "%.2f" $value }}% of query evaluations. [copy] - alert: MimirRulerTooManyFailedQueries expr: 100 * sum by (instance, job) (rate(cortex_ruler_queries_failed_total[5m])) / sum by (instance, job) (rate(cortex_ruler_queries_total[5m])) > 1 for: 5m labels: severity: critical annotations: summary: Mimir ruler too many failed queries (instance {{ $labels.instance }}) description: "Mimir ruler {{ $labels.instance }} is failing {{ printf \"%.2f\" $value }}% of query evaluations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.37. Mimir ruler missed evaluations
Mimir ruler {{ $labels.instance }} is missing {{ printf "%.2f" $value }}% of rule group evaluations. [copy] - alert: MimirRulerMissedEvaluations expr: 100 * sum by (instance, job) (rate(cortex_prometheus_rule_group_iterations_missed_total[5m])) / sum by (instance, job) (rate(cortex_prometheus_rule_group_iterations_total[5m])) > 1 for: 5m labels: severity: warning annotations: summary: Mimir ruler missed evaluations (instance {{ $labels.instance }}) description: "Mimir ruler {{ $labels.instance }} is missing {{ printf \"%.2f\" $value }}% of rule group evaluations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.38. Mimir ruler failed ring check
Mimir ruler {{ $labels.job }} is failing ring checks. [copy] - alert: MimirRulerFailedRingCheck expr: sum by (job) (rate(cortex_ruler_ring_check_errors_total[5m])) > 0 for: 5m labels: severity: critical annotations: summary: Mimir ruler failed ring check (instance {{ $labels.instance }}) description: "Mimir ruler {{ $labels.job }} is failing ring checks.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.39. Mimir alertmanager sync configs failing
Mimir alertmanager {{ $labels.job }} is failing to sync configs. [copy] - alert: MimirAlertmanagerSyncConfigsFailing expr: rate(cortex_alertmanager_sync_configs_failed_total[5m]) > 0 for: 30m labels: severity: critical annotations: summary: Mimir alertmanager sync configs failing (instance {{ $labels.instance }}) description: "Mimir alertmanager {{ $labels.job }} is failing to sync configs.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.40. Mimir alertmanager ring check failing
Mimir alertmanager {{ $labels.job }} is failing ring checks. [copy] - alert: MimirAlertmanagerRingCheckFailing expr: rate(cortex_alertmanager_ring_check_errors_total[5m]) > 0 for: 10m labels: severity: critical annotations: summary: Mimir alertmanager ring check failing (instance {{ $labels.instance }}) description: "Mimir alertmanager {{ $labels.job }} is failing ring checks.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.41. Mimir alertmanager state merge failing
Mimir alertmanager {{ $labels.job }} is failing to merge state updates. [copy] - alert: MimirAlertmanagerStateMergeFailing expr: rate(cortex_alertmanager_partial_state_merges_failed_total[5m]) > 0 for: 10m labels: severity: critical annotations: summary: Mimir alertmanager state merge failing (instance {{ $labels.instance }}) description: "Mimir alertmanager {{ $labels.job }} is failing to merge state updates.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.42. Mimir alertmanager replication failing
Mimir alertmanager {{ $labels.job }} is failing to replicate state. [copy] - alert: MimirAlertmanagerReplicationFailing expr: rate(cortex_alertmanager_state_replication_failed_total[5m]) > 0 for: 10m labels: severity: critical annotations: summary: Mimir alertmanager replication failing (instance {{ $labels.instance }}) description: "Mimir alertmanager {{ $labels.job }} is failing to replicate state.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.43. Mimir alertmanager persist state failing
Mimir alertmanager {{ $labels.job }} is failing to persist state. [copy] - alert: MimirAlertmanagerPersistStateFailing expr: rate(cortex_alertmanager_state_persist_failed_total[15m]) > 0 for: 1h labels: severity: critical annotations: summary: Mimir alertmanager persist state failing (instance {{ $labels.instance }}) description: "Mimir alertmanager {{ $labels.job }} is failing to persist state.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.44. Mimir alertmanager initial sync failed
Mimir alertmanager {{ $labels.job }} failed initial state sync. [copy] - alert: MimirAlertmanagerInitialSyncFailed expr: increase(cortex_alertmanager_state_initial_sync_completed_total{outcome="failed"}[1m]) > 0 for: 0m labels: severity: warning annotations: summary: Mimir alertmanager initial sync failed (instance {{ $labels.instance }}) description: "Mimir alertmanager {{ $labels.job }} failed initial state sync.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.45. Mimir alertmanager instance has no tenants
Mimir alertmanager {{ $labels.instance }} has no tenants assigned. [copy] - alert: MimirAlertmanagerInstanceHasNoTenants expr: (cortex_alertmanager_tenants_owned == 0) and on (instance) (cortex_alertmanager_tenants_owned offset 1h > 0) for: 1h labels: severity: warning annotations: summary: Mimir alertmanager instance has no tenants (instance {{ $labels.instance }}) description: "Mimir alertmanager {{ $labels.instance }} has no tenants assigned.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.46. Mimir gossip members count too high
Mimir gossip cluster has more members than expected. [copy] - alert: MimirGossipMembersCountTooHigh expr: avg(memberlist_client_cluster_members_count{job=~".*(mimir|cortex).*"}) by (job) * 1.15 + 10 < max(memberlist_client_cluster_members_count{job=~".*(mimir|cortex).*"}) by (job) for: 20m labels: severity: warning annotations: summary: Mimir gossip members count too high (instance {{ $labels.instance }}) description: "Mimir gossip cluster has more members than expected.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.47. Mimir gossip members count too low
Mimir gossip cluster has fewer members than expected. [copy] - alert: MimirGossipMembersCountTooLow expr: avg(memberlist_client_cluster_members_count{job=~".*(mimir|cortex).*"}) by (job) * 0.5 > min(memberlist_client_cluster_members_count{job=~".*(mimir|cortex).*"}) by (job) for: 20m labels: severity: warning annotations: summary: Mimir gossip members count too low (instance {{ $labels.instance }}) description: "Mimir gossip cluster has fewer members than expected.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.48. Mimir go threads too high warning
Mimir {{ $labels.instance }} has {{ $value }} Go threads. [copy] # A high number of Go threads may indicate a goroutine leak. - alert: MimirGoThreadsTooHighWarning expr: go_threads{job=~".*(mimir|cortex).*"} > 5000 for: 15m labels: severity: warning annotations: summary: Mimir go threads too high warning (instance {{ $labels.instance }}) description: "Mimir {{ $labels.instance }} has {{ $value }} Go threads.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.6.49. Mimir go threads too high critical
Mimir {{ $labels.instance }} has {{ $value }} Go threads. [copy] - alert: MimirGoThreadsTooHighCritical expr: go_threads{job=~".*(mimir|cortex).*"} > 8000 for: 15m labels: severity: critical annotations: summary: Mimir go threads too high critical (instance {{ $labels.instance }}) description: "Mimir {{ $labels.instance }} has {{ $value }} Go threads.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.7. Grafana Alloy (1 rule) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/grafana-alloy/embedded-exporter.yml-
# 8.7.1. Grafana Alloy service down
Alloy on instance {{ $labels.instance }} is not responding or has stopped running. [copy] - alert: GrafanaAlloyServiceDown expr: count by (instance) (alloy_build_info offset 2m) unless count by (instance) (alloy_build_info) for: 0m labels: severity: critical annotations: summary: Grafana Alloy service down (instance {{ $labels.instance }}) description: "Alloy on instance {{ $labels.instance }} is not responding or has stopped running.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
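This rule relies on the `offset`/`unless` idiom for detecting a vanished series: an instance whose `alloy_build_info` existed a couple of minutes ago but is absent now has stopped reporting. A generic sketch of the pattern (the metric name `my_service_build_info` is a placeholder):

```yaml
# Fires for each instance whose info metric was present 2m ago but is gone now.
- alert: ServiceDisappeared
  expr: count by (instance) (my_service_build_info offset 2m) unless count by (instance) (my_service_build_info)
  for: 0m
  labels:
    severity: critical
```

Unlike `up == 0`, this works for metrics pushed via remote write, where a dead process simply stops producing samples instead of being marked down by a scrape.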
-
-
# 8.8. OpenTelemetry Collector : Embedded exporter (12 rules) [copy section]
OpenTelemetry Collector self-monitoring metrics are exposed on port 8888 by default at the /metrics endpoint.
These alerts monitor the collector's health when metrics are ingested via the Prometheus OTLP endpoint or scraped directly.
All collector internal metrics are prefixed with 'otelcol_'.
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/opentelemetry-collector/embedded-exporter.yml-
# 8.8.1. OpenTelemetry Collector down
OpenTelemetry Collector instance has disappeared or is not being scraped [copy] - alert: OpentelemetryCollectorDown expr: up{job=~".*otel.*collector.*"} == 0 for: 1m labels: severity: critical annotations: summary: OpenTelemetry Collector down (instance {{ $labels.instance }}) description: "OpenTelemetry Collector instance has disappeared or is not being scraped\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
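The `up` selector assumes a scrape job whose name matches `.*otel.*collector.*`. A minimal `scrape_configs` sketch satisfying it, using the default self-monitoring port mentioned in the section intro (the target address is a placeholder):

```yaml
scrape_configs:
  - job_name: otel-collector          # matches the job regex used above
    static_configs:
      - targets: ['otel-collector.example.internal:8888']  # default /metrics port
```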
-
# 8.8.2. OpenTelemetry Collector receiver refused spans
OpenTelemetry Collector is refusing spans on {{ $labels.receiver }} [copy] - alert: OpentelemetryCollectorReceiverRefusedSpans expr: rate(otelcol_receiver_refused_spans[5m]) > 0 for: 5m labels: severity: critical annotations: summary: OpenTelemetry Collector receiver refused spans (instance {{ $labels.instance }}) description: "OpenTelemetry Collector is refusing spans on {{ $labels.receiver }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.8.3. OpenTelemetry Collector receiver refused metric points
OpenTelemetry Collector is refusing metric points on {{ $labels.receiver }} [copy] - alert: OpentelemetryCollectorReceiverRefusedMetricPoints expr: rate(otelcol_receiver_refused_metric_points[5m]) > 0 for: 5m labels: severity: critical annotations: summary: OpenTelemetry Collector receiver refused metric points (instance {{ $labels.instance }}) description: "OpenTelemetry Collector is refusing metric points on {{ $labels.receiver }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.8.4. OpenTelemetry Collector receiver refused log records
OpenTelemetry Collector is refusing log records on {{ $labels.receiver }} [copy] - alert: OpentelemetryCollectorReceiverRefusedLogRecords expr: rate(otelcol_receiver_refused_log_records[5m]) > 0 for: 5m labels: severity: critical annotations: summary: OpenTelemetry Collector receiver refused log records (instance {{ $labels.instance }}) description: "OpenTelemetry Collector is refusing log records on {{ $labels.receiver }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.8.5. OpenTelemetry Collector exporter failed spans
OpenTelemetry Collector failing to send spans via {{ $labels.exporter }} [copy] - alert: OpentelemetryCollectorExporterFailedSpans expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0 for: 5m labels: severity: warning annotations: summary: OpenTelemetry Collector exporter failed spans (instance {{ $labels.instance }}) description: "OpenTelemetry Collector failing to send spans via {{ $labels.exporter }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.8.6. OpenTelemetry Collector exporter failed metric points
OpenTelemetry Collector failing to send metric points via {{ $labels.exporter }} [copy] - alert: OpentelemetryCollectorExporterFailedMetricPoints expr: rate(otelcol_exporter_send_failed_metric_points[5m]) > 0 for: 5m labels: severity: warning annotations: summary: OpenTelemetry Collector exporter failed metric points (instance {{ $labels.instance }}) description: "OpenTelemetry Collector failing to send metric points via {{ $labels.exporter }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.8.7. OpenTelemetry Collector exporter failed log records
OpenTelemetry Collector failing to send log records via {{ $labels.exporter }} [copy] - alert: OpentelemetryCollectorExporterFailedLogRecords expr: rate(otelcol_exporter_send_failed_log_records[5m]) > 0 for: 5m labels: severity: warning annotations: summary: OpenTelemetry Collector exporter failed log records (instance {{ $labels.instance }}) description: "OpenTelemetry Collector failing to send log records via {{ $labels.exporter }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.8.8. OpenTelemetry Collector exporter queue nearly full
OpenTelemetry Collector exporter {{ $labels.exporter }} queue is over 80% full [copy] - alert: OpentelemetryCollectorExporterQueueNearlyFull expr: (otelcol_exporter_queue_size / on(instance, job, exporter) otelcol_exporter_queue_capacity) > 0.8 and otelcol_exporter_queue_capacity > 0 for: 0m labels: severity: warning annotations: summary: OpenTelemetry Collector exporter queue nearly full (instance {{ $labels.instance }}) description: "OpenTelemetry Collector exporter {{ $labels.exporter }} queue is over 80% full\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
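If this alert fires persistently, the queue can be enlarged (or drained faster) via the exporter's standard `sending_queue` settings. A sketch for an OTLP exporter; the endpoint and values are illustrative, not recommendations:

```yaml
exporters:
  otlp:
    endpoint: backend.example.internal:4317  # placeholder endpoint
    sending_queue:
      enabled: true
      num_consumers: 10  # parallel senders draining the queue
      queue_size: 5000   # reported as otelcol_exporter_queue_capacity
```

Note that a growing queue usually means the backend is slow or unreachable; enlarging it only buys time.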
-
# 8.8.9. OpenTelemetry Collector processor refused spans
OpenTelemetry Collector processor {{ $labels.processor }} is refusing spans, likely due to backpressure [copy] - alert: OpentelemetryCollectorProcessorRefusedSpans expr: rate(otelcol_processor_refused_spans[5m]) > 0 for: 5m labels: severity: warning annotations: summary: OpenTelemetry Collector processor refused spans (instance {{ $labels.instance }}) description: "OpenTelemetry Collector processor {{ $labels.processor }} is refusing spans, likely due to backpressure\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.8.10. OpenTelemetry Collector processor refused metric points
OpenTelemetry Collector processor {{ $labels.processor }} is refusing metric points, likely due to backpressure [copy] - alert: OpentelemetryCollectorProcessorRefusedMetricPoints expr: rate(otelcol_processor_refused_metric_points[5m]) > 0 for: 5m labels: severity: warning annotations: summary: OpenTelemetry Collector processor refused metric points (instance {{ $labels.instance }}) description: "OpenTelemetry Collector processor {{ $labels.processor }} is refusing metric points, likely due to backpressure\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.8.11. OpenTelemetry Collector high memory usage
OpenTelemetry Collector memory usage is above 90% [copy] - alert: OpentelemetryCollectorHighMemoryUsage expr: (otelcol_process_runtime_heap_alloc_bytes{job=~".*otel.*collector.*"} / on(instance, job) otelcol_process_runtime_total_sys_memory_bytes{job=~".*otel.*collector.*"}) > 0.9 for: 5m labels: severity: warning annotations: summary: OpenTelemetry Collector high memory usage (instance {{ $labels.instance }}) description: "OpenTelemetry Collector memory usage is above 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.8.12. OpenTelemetry Collector OTLP receiver errors
OpenTelemetry Collector OTLP receiver is completely failing - all spans are being refused [copy] - alert: OpentelemetryCollectorOtlpReceiverErrors expr: rate(otelcol_receiver_accepted_spans{receiver=~"otlp"}[5m]) == 0 and rate(otelcol_receiver_refused_spans{receiver=~"otlp"}[5m]) > 0 for: 2m labels: severity: critical annotations: summary: OpenTelemetry Collector OTLP receiver errors (instance {{ $labels.instance }}) description: "OpenTelemetry Collector OTLP receiver is completely failing - all spans are being refused\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.9. Jenkins : Metric plugin (8 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/jenkins/metric-plugin.yml-
# 8.9.1. Jenkins node offline
At least one Jenkins node offline: `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}}) [copy] - alert: JenkinsNodeOffline expr: jenkins_node_offline_value > 0 for: 5m labels: severity: critical annotations: summary: Jenkins node offline (instance {{ $labels.instance }}) description: "At least one Jenkins node offline: `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.9.2. Jenkins no node online
No Jenkins nodes are online: `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}}) [copy] - alert: JenkinsNoNodeOnline expr: jenkins_node_online_value == 0 for: 0m labels: severity: critical annotations: summary: Jenkins no node online (instance {{ $labels.instance }}) description: "No Jenkins nodes are online: `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.9.3. Jenkins healthcheck
Jenkins healthcheck score: {{$value}}. Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}}) [copy] - alert: JenkinsHealthcheck expr: jenkins_health_check_score < 1 for: 0m labels: severity: critical annotations: summary: Jenkins healthcheck (instance {{ $labels.instance }}) description: "Jenkins healthcheck score: {{$value}}. Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.9.4. Jenkins outdated plugins
{{ $value }} plugins need an update [copy] - alert: JenkinsOutdatedPlugins expr: sum(jenkins_plugins_withUpdate) by (instance) > 3 for: 1d labels: severity: warning annotations: summary: Jenkins outdated plugins (instance {{ $labels.instance }}) description: "{{ $value }} plugins need an update\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.9.5. Jenkins builds health score
Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}}) [copy] - alert: JenkinsBuildsHealthScore expr: default_jenkins_builds_health_score < 1 for: 0m labels: severity: critical annotations: summary: Jenkins builds health score (instance {{ $labels.instance }}) description: "Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.9.6. Jenkins run failure total
Job run failures: ({{$value}}) {{$labels.jenkins_job}}. Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}}) [copy] - alert: JenkinsRunFailureTotal expr: delta(jenkins_runs_failure_total[1h]) > 100 for: 0m labels: severity: warning annotations: summary: Jenkins run failure total (instance {{ $labels.instance }}) description: "Job run failures: ({{$value}}) {{$labels.jenkins_job}}. Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.9.7. Jenkins build tests failing
Last build tests failed: {{$labels.jenkins_job}}. Failed build Tests for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}}) [copy] - alert: JenkinsBuildTestsFailing expr: default_jenkins_builds_last_build_tests_failing > 0 for: 0m labels: severity: warning annotations: summary: Jenkins build tests failing (instance {{ $labels.instance }}) description: "Last build tests failed: {{$labels.jenkins_job}}. Failed build Tests for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.9.8. Jenkins last build failed
Last build failed: {{$labels.jenkins_job}}. Failed build for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}}) [copy] # * RUNNING -1 true - The build had no errors. # * SUCCESS 0 true - The build had no errors. # * UNSTABLE 1 true - The build had some errors but they were not fatal. For example, some tests failed. # * FAILURE 2 false - The build had a fatal error. # * NOT_BUILT 3 false - The module was not built. # * ABORTED 4 false - The build was manually aborted. - alert: JenkinsLastBuildFailed expr: default_jenkins_builds_last_build_result_ordinal == 2 for: 0m labels: severity: warning annotations: summary: Jenkins last build failed (instance {{ $labels.instance }}) description: "Last build failed: {{$labels.jenkins_job}}. Failed build for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
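The build-result table in the rule's comments maps Jenkins results to ordinals; the same mapping as a small reference sketch, useful when writing variants of this expression:

```python
# Jenkins build-result ordinals, per the table in the rule comments above.
JENKINS_RESULT_ORDINAL = {
    "RUNNING": -1,   # build in progress
    "SUCCESS": 0,    # no errors
    "UNSTABLE": 1,   # non-fatal errors, e.g. failing tests
    "FAILURE": 2,    # fatal error -- the value this alert matches
    "NOT_BUILT": 3,
    "ABORTED": 4,
}
```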
-
-
# 8.10. APC UPS : mdlayher/apcupsd_exporter (6 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/apc-ups/apcupsd_exporter.yml-
# 8.10.1. APC UPS Battery nearly empty
Battery is almost empty (< 10% left) [copy] - alert: ApcUpsBatteryNearlyEmpty expr: apcupsd_battery_charge_percent < 10 for: 0m labels: severity: critical annotations: summary: APC UPS Battery nearly empty (instance {{ $labels.instance }}) description: "Battery is almost empty (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.10.2. APC UPS Less than 15 Minutes of battery time remaining
Battery is almost empty (< 15 Minutes remaining) [copy] - alert: ApcUpsLessThan15MinutesOfBatteryTimeRemaining expr: apcupsd_battery_time_left_seconds < 900 for: 0m labels: severity: critical annotations: summary: APC UPS Less than 15 Minutes of battery time remaining (instance {{ $labels.instance }}) description: "Battery is almost empty (< 15 Minutes remaining)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.10.3. APC UPS AC input outage
UPS now running on battery (for {{$value | humanizeDuration}}) [copy] - alert: ApcUpsAcInputOutage expr: apcupsd_battery_time_on_seconds > 0 for: 0m labels: severity: warning annotations: summary: APC UPS AC input outage (instance {{ $labels.instance }}) description: "UPS now running on battery (for {{$value | humanizeDuration}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.10.4. APC UPS low battery voltage
Battery voltage is lower than nominal (< 95%) [copy] - alert: ApcUpsLowBatteryVoltage expr: (apcupsd_battery_volts / apcupsd_battery_nominal_volts) < 0.95 for: 0m labels: severity: warning annotations: summary: APC UPS low battery voltage (instance {{ $labels.instance }}) description: "Battery voltage is lower than nominal (< 95%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.10.5. APC UPS high temperature
Internal temperature is high ({{$value}}°C) [copy] - alert: ApcUpsHighTemperature expr: apcupsd_internal_temperature_celsius >= 40 for: 2m labels: severity: warning annotations: summary: APC UPS high temperature (instance {{ $labels.instance }}) description: "Internal temperature is high ({{$value}}°C)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.10.6. APC UPS high load
UPS load is > 80% [copy] - alert: ApcUpsHighLoad expr: apcupsd_ups_load_percent > 80 for: 0m labels: severity: warning annotations: summary: APC UPS high load (instance {{ $labels.instance }}) description: "UPS load is > 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.11. Graph Node : Embedded exporter (6 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/graph-node/embedded-exporter.yml-
# 8.11.1. Provider failed because net_version failed
Failed net_version for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}` [copy] - alert: ProviderFailedBecauseNet_versionFailed expr: eth_rpc_status == 1 for: 0m labels: severity: critical annotations: summary: Provider failed because net_version failed (instance {{ $labels.instance }}) description: "Failed net_version for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.11.2. Provider failed because get genesis failed
Failed to get genesis for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}` [copy] - alert: ProviderFailedBecauseGetGenesisFailed expr: eth_rpc_status == 2 for: 0m labels: severity: critical annotations: summary: Provider failed because get genesis failed (instance {{ $labels.instance }}) description: "Failed to get genesis for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.11.3. Provider failed because net_version timeout
net_version timeout for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}` [copy] - alert: ProviderFailedBecauseNet_versionTimeout expr: eth_rpc_status == 3 for: 0m labels: severity: critical annotations: summary: Provider failed because net_version timeout (instance {{ $labels.instance }}) description: "net_version timeout for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.11.4. Provider failed because get genesis timeout
Timeout to get genesis for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}` [copy] - alert: ProviderFailedBecauseGetGenesisTimeout expr: eth_rpc_status == 4 for: 0m labels: severity: critical annotations: summary: Provider failed because get genesis timeout (instance {{ $labels.instance }}) description: "Timeout to get genesis for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.11.5. Store connection is too slow
Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}` [copy] - alert: StoreConnectionIsTooSlow expr: store_connection_wait_time_ms > 10 for: 0m labels: severity: warning annotations: summary: Store connection is too slow (instance {{ $labels.instance }}) description: "Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.11.6. Store connection is too slow
Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}` [copy] - alert: StoreConnectionIsTooSlow expr: store_connection_wait_time_ms > 20 for: 0m labels: severity: critical annotations: summary: Store connection is too slow (instance {{ $labels.instance }}) description: "Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.12.1. GitLab : GitLab built-in exporter (21 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/gitlab/gitlab-built-in-exporter.yml-
# 8.12.1.1. GitLab Puma high queued connections
GitLab Puma has {{ $value }} queued connections on {{ $labels.instance }}. Requests are waiting for an available worker thread. [copy] # Queued connections indicate Puma workers are saturated. # Consider increasing puma['worker_processes'] or puma['max_threads'] in gitlab.rb. - alert: GitlabPumaHighQueuedConnections expr: avg_over_time(puma_queued_connections[5m]) > 5 for: 5m labels: severity: warning annotations: summary: GitLab Puma high queued connections (instance {{ $labels.instance }}) description: "GitLab Puma has {{ $value }} queued connections on {{ $labels.instance }}. Requests are waiting for an available worker thread.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
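Per the remediation hint in the comments, Puma capacity is raised in `/etc/gitlab/gitlab.rb` and applied with `gitlab-ctl reconfigure`. A sketch with illustrative values (size workers to CPU cores and available memory):

```ruby
# /etc/gitlab/gitlab.rb -- illustrative Puma sizing, not a recommendation
puma['worker_processes'] = 4   # roughly one worker per CPU core
puma['max_threads'] = 4        # threads per worker
```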
-
# 8.12.1.2. GitLab Puma no available pool capacity
GitLab Puma pool capacity on {{ $labels.instance }} has been at 0 for 5 minutes. All threads are busy. [copy] - alert: GitlabPumaNoAvailablePoolCapacity expr: puma_pool_capacity == 0 for: 5m labels: severity: critical annotations: summary: GitLab Puma no available pool capacity (instance {{ $labels.instance }}) description: "GitLab Puma pool capacity on {{ $labels.instance }} has been at 0 for 5 minutes. All threads are busy.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.3. GitLab Puma workers not running
GitLab Puma on {{ $labels.instance }} has {{ $value }} running workers out of expected total. [copy] - alert: GitlabPumaWorkersNotRunning expr: puma_running_workers < puma_workers for: 5m labels: severity: warning annotations: summary: GitLab Puma workers not running (instance {{ $labels.instance }}) description: "GitLab Puma on {{ $labels.instance }} has {{ $value }} running workers out of expected total.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.4. GitLab high HTTP error rate
GitLab is returning more than 5% HTTP 5xx errors on {{ $labels.instance }}. [copy] # Threshold is 5% of all requests returning server errors. # Check GitLab logs at /var/log/gitlab/ for root cause. - alert: GitlabHighHttpErrorRate expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 5 for: 5m labels: severity: critical annotations: summary: GitLab high HTTP error rate (instance {{ $labels.instance }}) description: "GitLab is returning more than 5% HTTP 5xx errors on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.5. GitLab high HTTP request latency
GitLab p95 HTTP request latency on {{ $labels.instance }} is above 10 seconds. [copy] # Threshold of 10s may need adjustment based on your instance size and workload. - alert: GitlabHighHttpRequestLatency expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 10 for: 5m labels: severity: warning annotations: summary: GitLab high HTTP request latency (instance {{ $labels.instance }}) description: "GitLab p95 HTTP request latency on {{ $labels.instance }} is above 10 seconds.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.6. GitLab Sidekiq jobs failing
GitLab Sidekiq jobs are failing at a rate of {{ $value }} per second on {{ $labels.instance }}. [copy] # This metric requires the emit_sidekiq_histogram_metrics feature flag to be enabled. # A sustained failure rate indicates background processing issues. - alert: GitlabSidekiqJobsFailing expr: rate(sidekiq_jobs_failed_total[5m]) > 0 for: 10m labels: severity: warning annotations: summary: GitLab Sidekiq jobs failing (instance {{ $labels.instance }}) description: "GitLab Sidekiq jobs are failing at a rate of {{ $value }} per second on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.7. GitLab Sidekiq queue too large
GitLab Sidekiq has {{ $value }} running jobs, approaching concurrency limit on {{ $labels.instance }}. [copy] # When running jobs approach the concurrency limit, new jobs will queue up. # Consider scaling Sidekiq workers or increasing concurrency. - alert: GitlabSidekiqQueueTooLarge expr: sum(sidekiq_running_jobs) >= sum(sidekiq_concurrency) * 0.9 for: 10m labels: severity: warning annotations: summary: GitLab Sidekiq queue too large (instance {{ $labels.instance }}) description: "GitLab Sidekiq has {{ $value }} running jobs, approaching concurrency limit on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.8. GitLab Sidekiq high job completion time
GitLab Sidekiq p95 job completion time on {{ $labels.instance }} is above 5 minutes. [copy] # This metric requires the emit_sidekiq_histogram_metrics feature flag to be enabled. - alert: GitlabSidekiqHighJobCompletionTime expr: histogram_quantile(0.95, sum(rate(sidekiq_jobs_completion_seconds_bucket[5m])) by (le, worker)) > 300 for: 10m labels: severity: warning annotations: summary: GitLab Sidekiq high job completion time (instance {{ $labels.instance }}) description: "GitLab Sidekiq p95 job completion time on {{ $labels.instance }} is above 5 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.9. GitLab Sidekiq high queue latency
GitLab Sidekiq jobs on {{ $labels.instance }} are waiting more than 60 seconds before being processed. [copy] # This metric requires the emit_sidekiq_histogram_metrics feature flag to be enabled. # High queue latency means jobs are stuck waiting. Check Sidekiq concurrency and queue sizes. - alert: GitlabSidekiqHighQueueLatency expr: histogram_quantile(0.95, sum(rate(sidekiq_jobs_queue_duration_seconds_bucket[5m])) by (le)) > 60 for: 5m labels: severity: warning annotations: summary: GitLab Sidekiq high queue latency (instance {{ $labels.instance }}) description: "GitLab Sidekiq jobs on {{ $labels.instance }} are waiting more than 60 seconds before being processed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.10. GitLab database connection pool saturation
GitLab database connection pool on {{ $labels.instance }} ({{ $labels.class }}) is {{ $value }}% busy. [copy] # When the pool is near saturation, requests may block waiting for a connection. # Increase db_pool_size in gitlab.rb or investigate slow queries. - alert: GitlabDatabaseConnectionPoolSaturation expr: gitlab_database_connection_pool_busy / gitlab_database_connection_pool_size * 100 > 90 for: 5m labels: severity: warning annotations: summary: GitLab database connection pool saturation (instance {{ $labels.instance }}) description: "GitLab database connection pool on {{ $labels.instance }} ({{ $labels.class }}) is {{ $value }}% busy.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.11. GitLab database connection pool dead connections
GitLab database connection pool on {{ $labels.instance }} ({{ $labels.class }}) has {{ $value }} dead connections. [copy] - alert: GitlabDatabaseConnectionPoolDeadConnections expr: gitlab_database_connection_pool_dead > 0 for: 5m labels: severity: warning annotations: summary: GitLab database connection pool dead connections (instance {{ $labels.instance }}) description: "GitLab database connection pool on {{ $labels.instance }} ({{ $labels.class }}) has {{ $value }} dead connections.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.12. GitLab database connection pool waiting
GitLab on {{ $labels.instance }} has {{ $value }} threads waiting for a database connection. [copy] - alert: GitlabDatabaseConnectionPoolWaiting expr: gitlab_database_connection_pool_waiting > 0 for: 5m labels: severity: warning annotations: summary: GitLab database connection pool waiting (instance {{ $labels.instance }}) description: "GitLab on {{ $labels.instance }} has {{ $value }} threads waiting for a database connection.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.13. GitLab CI pipeline creation slow
GitLab CI pipeline creation p95 latency on {{ $labels.instance }} is above 30 seconds. [copy] - alert: GitlabCiPipelineCreationSlow expr: histogram_quantile(0.95, sum(rate(gitlab_ci_pipeline_creation_duration_seconds_bucket[5m])) by (le)) > 30 for: 5m labels: severity: warning annotations: summary: GitLab CI pipeline creation slow (instance {{ $labels.instance }}) description: "GitLab CI pipeline creation p95 latency on {{ $labels.instance }} is above 30 seconds.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.14. GitLab CI pipeline failures increasing
GitLab CI pipeline failures are increasing on {{ $labels.instance }} ({{ $value }}/s). [copy] - alert: GitlabCiPipelineFailuresIncreasing expr: rate(gitlab_ci_pipeline_failure_reasons[5m]) > 0 for: 10m labels: severity: warning annotations: summary: GitLab CI pipeline failures increasing (instance {{ $labels.instance }}) description: "GitLab CI pipeline failures are increasing on {{ $labels.instance }} ({{ $value }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.15. GitLab CI runner authentication failures
GitLab CI runners are experiencing authentication failures on {{ $labels.instance }} ({{ $value }} failures). [copy] # Frequent runner auth failures may indicate expired tokens or misconfigured runners. - alert: GitlabCiRunnerAuthenticationFailures expr: increase(gitlab_ci_runner_authentication_failure_total[5m]) > 5 for: 5m labels: severity: warning annotations: summary: GitLab CI runner authentication failures (instance {{ $labels.instance }}) description: "GitLab CI runners are experiencing authentication failures on {{ $labels.instance }} ({{ $value }} failures).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.16. GitLab high memory usage
GitLab process on {{ $labels.instance }} is using {{ $value | humanize1024 }}B of RSS memory. [copy] # Threshold of 2GB may need adjustment based on your instance size. # High memory usage can lead to OOM kills and service disruptions. - alert: GitlabHighMemoryUsage expr: process_resident_memory_bytes{job=~".*gitlab.*"} > 2e+9 for: 10m labels: severity: warning annotations: summary: GitLab high memory usage (instance {{ $labels.instance }}) description: "GitLab process on {{ $labels.instance }} is using {{ $value | humanize1024 }}B of RSS memory.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.17. GitLab Ruby heap fragmentation
GitLab Ruby heap fragmentation on {{ $labels.instance }} is {{ $value }}. High fragmentation wastes memory. [copy] # Heap fragmentation above 50% means a significant amount of memory is wasted. # A Puma worker restart may help reclaim memory. - alert: GitlabRubyHeapFragmentation expr: ruby_gc_stat_ext_heap_fragmentation{job=~".*gitlab.*"} > 0.5 for: 15m labels: severity: warning annotations: summary: GitLab Ruby heap fragmentation (instance {{ $labels.instance }}) description: "GitLab Ruby heap fragmentation on {{ $labels.instance }} is {{ $value }}. High fragmentation wastes memory.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.18. GitLab rack uncaught errors
GitLab is experiencing uncaught errors in the Rack layer on {{ $labels.instance }} ({{ $value }}/s). [copy] - alert: GitlabRackUncaughtErrors expr: rate(rack_uncaught_errors_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: GitLab rack uncaught errors (instance {{ $labels.instance }}) description: "GitLab is experiencing uncaught errors in the Rack layer on {{ $labels.instance }} ({{ $value }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.19. GitLab version mismatch
Multiple GitLab versions are running across the fleet. [copy] # This may happen during a rolling deployment. If it persists, investigate incomplete upgrades. - alert: GitlabVersionMismatch expr: count(count by (version) (deployments{version!=""})) > 1 for: 0m labels: severity: warning annotations: summary: GitLab version mismatch (instance {{ $labels.instance }}) description: "Multiple GitLab versions are running across the fleet.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.20. GitLab high file descriptor usage
GitLab on {{ $labels.instance }} is using {{ $value }}% of available file descriptors. [copy] - alert: GitlabHighFileDescriptorUsage expr: process_open_fds{job=~".*gitlab.*"} / process_max_fds * 100 > 80 for: 5m labels: severity: warning annotations: summary: GitLab high file descriptor usage (instance {{ $labels.instance }}) description: "GitLab on {{ $labels.instance }} is using {{ $value }}% of available file descriptors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.1.21. GitLab Ruby threads saturated
GitLab running threads on {{ $labels.instance }} have exceeded the expected maximum ({{ $value }}). [copy] - alert: GitlabRubyThreadsSaturated expr: sum by (instance) (gitlab_ruby_threads_running_threads) > on(instance) gitlab_ruby_threads_max_expected_threads * 1.5 for: 10m labels: severity: warning annotations: summary: GitLab Ruby threads saturated (instance {{ $labels.instance }}) description: "GitLab running threads on {{ $labels.instance }} have exceeded the expected maximum ({{ $value }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.12.2. GitLab : Workhorse (3 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/gitlab/workhorse.yml
-
# 8.12.2.1. GitLab Workhorse high error rate
GitLab Workhorse on {{ $labels.instance }} is returning more than 10% HTTP 5xx errors. [copy] # Workhorse sits in front of Puma and handles Git HTTP, file uploads, and proxying. # Threshold from GitLab Omnibus default rules: 10% for high-traffic instances. # Aggregate by instance so the {{ $labels.instance }} annotation is populated. - alert: GitlabWorkhorseHighErrorRate expr: sum by (instance) (rate(gitlab_workhorse_http_request_duration_seconds_count{code=~"5.."}[5m])) / sum by (instance) (rate(gitlab_workhorse_http_request_duration_seconds_count[5m])) * 100 > 10 for: 5m labels: severity: critical annotations: summary: GitLab Workhorse high error rate (instance {{ $labels.instance }}) description: "GitLab Workhorse on {{ $labels.instance }} is returning more than 10% HTTP 5xx errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
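The expression is a plain error-rate ratio: the per-second rate of 5xx responses over the rate of all responses, scaled to a percentage. A quick sanity check of the arithmetic with hypothetical rates:

```python
# Hypothetical per-second rates over the rule's 5m window.
rate_5xx = 12.0     # requests/s answered with HTTP 5xx
rate_total = 100.0  # all requests/s

error_pct = rate_5xx / rate_total * 100
alert_fires = error_pct > 10  # the rule's 10% threshold
```

At 12 errors/s against 100 requests/s the ratio is 12%, above the 10% threshold, so after the 5m `for` delay the alert fires.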
-
# 8.12.2.2. GitLab Workhorse high latency
GitLab Workhorse on {{ $labels.instance }} p95 request latency is above 10 seconds. [copy] # Keep the instance label in the aggregation so the annotations can reference it. - alert: GitlabWorkhorseHighLatency expr: histogram_quantile(0.95, sum(rate(gitlab_workhorse_http_request_duration_seconds_bucket[5m])) by (le, instance)) > 10 for: 5m labels: severity: warning annotations: summary: GitLab Workhorse high latency (instance {{ $labels.instance }}) description: "GitLab Workhorse on {{ $labels.instance }} p95 request latency is above 10 seconds.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
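`histogram_quantile` estimates the p95 from cumulative `le` buckets: it finds the bucket containing the target rank, then interpolates linearly within it. An illustrative reimplementation (simplified, with hypothetical bucket data; Prometheus itself handles more edge cases):

```python
# Cumulative buckets: (upper bound in seconds, count of requests <= bound).
buckets = [
    (0.1, 500),
    (1.0, 900),
    (10.0, 990),
    (float("inf"), 1000),
]

def histogram_quantile(q, buckets):
    total = buckets[-1][1]
    rank = q * total  # position of the target observation
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the open-ended bucket
            # Linear interpolation within the bucket, as Prometheus does.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

p95 = histogram_quantile(0.95, buckets)  # 6.0 s with this data
```

Rank 950 of 1000 lands in the (1.0, 10.0] bucket, interpolating to 6.0 s — below the rule's 10 s threshold, so with these numbers the alert would stay quiet. Interpolation also explains why coarse buckets give coarse quantile estimates.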
-
# 8.12.2.3. GitLab Workhorse high in-flight requests
GitLab Workhorse on {{ $labels.instance }} has {{ $value }} in-flight requests. [copy] # Threshold of 100 may need adjustment based on instance size. - alert: GitlabWorkhorseHighInFlightRequests expr: gitlab_workhorse_http_in_flight_requests > 100 for: 5m labels: severity: warning annotations: summary: GitLab Workhorse high in-flight requests (instance {{ $labels.instance }}) description: "GitLab Workhorse on {{ $labels.instance }} has {{ $value }} in-flight requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
-
# 8.12.3. GitLab : Gitaly (6 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/gitlab/gitaly.yml
-
# 8.12.3.1. GitLab Gitaly high gRPC error rate
Gitaly on {{ $labels.instance }} is returning more than 5% gRPC errors. [copy] # Aggregate by instance so the {{ $labels.instance }} annotation is populated. - alert: GitlabGitalyHighGrpcErrorRate expr: sum by (instance) (rate(grpc_server_handled_total{job="gitaly",grpc_code!="OK"}[5m])) / sum by (instance) (rate(grpc_server_handled_total{job="gitaly"}[5m])) * 100 > 5 for: 5m labels: severity: warning annotations: summary: GitLab Gitaly high gRPC error rate (instance {{ $labels.instance }}) description: "Gitaly on {{ $labels.instance }} is returning more than 5% gRPC errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.3.2. GitLab Gitaly resource exhausted
Gitaly on {{ $labels.instance }} is returning ResourceExhausted errors, indicating overload ({{ $value }}%). [copy] # ResourceExhausted errors from Gitaly mean Git operations are being rejected due to # concurrency limits. This directly impacts users trying to push, pull, or clone. # This alert is derived from the GitLab Omnibus default rules. - alert: GitlabGitalyResourceExhausted expr: sum by (instance) (rate(grpc_server_handled_total{job="gitaly",grpc_code="ResourceExhausted"}[5m])) / sum by (instance) (rate(grpc_server_handled_total{job="gitaly"}[5m])) * 100 > 1 for: 5m labels: severity: critical annotations: summary: GitLab Gitaly resource exhausted (instance {{ $labels.instance }}) description: "Gitaly on {{ $labels.instance }} is returning ResourceExhausted errors, indicating overload ({{ $value }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.3.3. GitLab Gitaly high RPC latency
Gitaly on {{ $labels.instance }} p95 unary RPC latency exceeds 1 second ({{ $value }}s). [copy] - alert: GitlabGitalyHighRpcLatency expr: histogram_quantile(0.95, sum(rate(grpc_server_handling_seconds_bucket{job="gitaly",grpc_type="unary"}[5m])) by (le, instance)) > 1 for: 5m labels: severity: warning annotations: summary: GitLab Gitaly high RPC latency (instance {{ $labels.instance }}) description: "Gitaly on {{ $labels.instance }} p95 unary RPC latency exceeds 1 second ({{ $value }}s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.3.4. GitLab Gitaly CPU throttled
Gitaly processes on {{ $labels.instance }} are being CPU throttled by cgroups. [copy] - alert: GitlabGitalyCpuThrottled expr: rate(gitaly_cgroup_cpu_cfs_throttled_seconds_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: GitLab Gitaly CPU throttled (instance {{ $labels.instance }}) description: "Gitaly processes on {{ $labels.instance }} are being CPU throttled by cgroups.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.3.5. GitLab Gitaly authentication failures
Gitaly on {{ $labels.instance }} has authentication failures ({{ $value }}). [copy] - alert: GitlabGitalyAuthenticationFailures expr: increase(gitaly_authentications_total{status="failed"}[5m]) > 0 for: 0m labels: severity: warning annotations: summary: GitLab Gitaly authentication failures (instance {{ $labels.instance }}) description: "Gitaly on {{ $labels.instance }} has authentication failures ({{ $value }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.12.3.6. GitLab Gitaly circuit breaker tripped
Gitaly circuit breaker has tripped on {{ $labels.instance }}. Git operations are failing. [copy] # When the circuit breaker trips to "open" state, Git operations (push, pull, clone) will fail. # Check Gitaly service health and logs. - alert: GitlabGitalyCircuitBreakerTripped expr: increase(gitaly_circuit_breaker_transitions_total{to_state="open"}[5m]) > 0 for: 0m labels: severity: critical annotations: summary: GitLab Gitaly circuit breaker tripped (instance {{ $labels.instance }}) description: "Gitaly circuit breaker has tripped on {{ $labels.instance }}. Git operations are failing.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
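The metric counts transitions *into* the open state, at which point a breaker rejects operations outright instead of attempting them. A generic circuit-breaker sketch (not Gitaly's actual implementation) showing why each such transition is alert-worthy:

```python
class CircuitBreaker:
    """Minimal failure-count breaker; real ones add timeouts and half-open probes."""

    def __init__(self, failure_threshold=3):
        self.state = "closed"
        self.consecutive_failures = 0
        self.failure_threshold = failure_threshold
        # Analogue of gitaly_circuit_breaker_transitions_total{to_state="open"}.
        self.transitions_to_open = 0

    def record(self, success):
        if success:
            self.consecutive_failures = 0
            self.state = "closed"
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold and self.state != "open":
            self.state = "open"  # start rejecting operations outright
            self.transitions_to_open += 1

cb = CircuitBreaker()
for ok in (True, False, False, False, False):
    cb.record(ok)
# One burst of failures => one transition to "open"; increase(...[5m]) > 0
# in the alert expression catches exactly this event.
```

Note the counter increments once per trip, not once per rejected operation, which is why the alert uses `increase(...) > 0` with `for: 0m`: a single trip already means users are seeing failed pushes, pulls, and clones.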
-
-
# 8.13. Jaeger : Embedded exporter (8 rules) [copy section]
$ wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/jaeger/embedded-exporter.yml
-
# 8.13.1. Jaeger agent HTTP server errors
Jaeger agent on {{ $labels.instance }} is experiencing {{ $value | humanize }}% HTTP server errors. [copy] - alert: JaegerAgentHttpServerErrors expr: 100 * sum(rate(jaeger_agent_http_server_errors_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_agent_http_server_total[1m])) by (instance, job, namespace) > 1 for: 15m labels: severity: warning annotations: summary: Jaeger agent HTTP server errors (instance {{ $labels.instance }}) description: "Jaeger agent on {{ $labels.instance }} is experiencing {{ $value | humanize }}% HTTP server errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
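All eight Jaeger rules share one shape: error and total rates are summed per `(instance, job, namespace)` group, divided, scaled to a percentage, and compared to a 1% threshold. A sketch of that grouped ratio on hypothetical per-second rates:

```python
# (instance, error rate/s, total rate/s) over the 1m window — hypothetical data.
samples = [
    ("agent-1", 0.5, 20.0),   # 2.5% errors
    ("agent-2", 0.0, 30.0),   # healthy
]

error_pct = {inst: err / total * 100 for inst, err, total in samples}
firing = {inst for inst, pct in error_pct.items() if pct > 1}  # the "> 1" threshold
```

Only `agent-1` crosses the 1% line, so only its group produces an alert; grouping by instance/job/namespace is what keeps one noisy agent from firing (or masking) alerts for the whole fleet.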
-
# 8.13.2. Jaeger client RPC request errors
Jaeger client on {{ $labels.instance }} is experiencing {{ $value | humanize }}% RPC HTTP errors. [copy] - alert: JaegerClientRpcRequestErrors expr: 100 * sum(rate(jaeger_client_jaeger_rpc_http_requests{status_code=~"4xx|5xx"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_client_jaeger_rpc_http_requests[1m])) by (instance, job, namespace) > 1 for: 15m labels: severity: warning annotations: summary: Jaeger client RPC request errors (instance {{ $labels.instance }}) description: "Jaeger client on {{ $labels.instance }} is experiencing {{ $value | humanize }}% RPC HTTP errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.13.3. Jaeger client spans dropped
Jaeger client on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans. [copy] - alert: JaegerClientSpansDropped expr: 100 * sum(rate(jaeger_reporter_spans{result=~"dropped|err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_reporter_spans[1m])) by (instance, job, namespace) > 1 for: 15m labels: severity: warning annotations: summary: Jaeger client spans dropped (instance {{ $labels.instance }}) description: "Jaeger client on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.13.4. Jaeger agent spans dropped
Jaeger agent on {{ $labels.instance }} is dropping {{ $value | humanize }}% of span batches. [copy] - alert: JaegerAgentSpansDropped expr: 100 * sum(rate(jaeger_agent_reporter_batches_failures_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_agent_reporter_batches_submitted_total[1m])) by (instance, job, namespace) > 1 for: 15m labels: severity: warning annotations: summary: Jaeger agent spans dropped (instance {{ $labels.instance }}) description: "Jaeger agent on {{ $labels.instance }} is dropping {{ $value | humanize }}% of span batches.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.13.5. Jaeger collector dropping spans
Jaeger collector on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans. [copy] - alert: JaegerCollectorDroppingSpans expr: 100 * sum(rate(jaeger_collector_spans_dropped_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_collector_spans_received_total[1m])) by (instance, job, namespace) > 1 for: 15m labels: severity: warning annotations: summary: Jaeger collector dropping spans (instance {{ $labels.instance }}) description: "Jaeger collector on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.13.6. Jaeger sampling update failing
Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of sampling policy updates. [copy] - alert: JaegerSamplingUpdateFailing expr: 100 * sum(rate(jaeger_sampler_queries{result="err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_sampler_queries[1m])) by (instance, job, namespace) > 1 for: 15m labels: severity: warning annotations: summary: Jaeger sampling update failing (instance {{ $labels.instance }}) description: "Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of sampling policy updates.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.13.7. Jaeger throttling update failing
Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of throttling policy updates. [copy] - alert: JaegerThrottlingUpdateFailing expr: 100 * sum(rate(jaeger_throttler_updates{result="err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_throttler_updates[1m])) by (instance, job, namespace) > 1 for: 15m labels: severity: warning annotations: summary: Jaeger throttling update failing (instance {{ $labels.instance }}) description: "Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of throttling policy updates.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-
# 8.13.8. Jaeger query request failures
Jaeger query on {{ $labels.instance }} is failing {{ $value | humanize }}% of requests. [copy] - alert: JaegerQueryRequestFailures expr: 100 * sum(rate(jaeger_query_requests_total{result="err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_query_requests_total[1m])) by (instance, job, namespace) > 1 for: 15m labels: severity: warning annotations: summary: Jaeger query request failures (instance {{ $labels.instance }}) description: "Jaeger query on {{ $labels.instance }} is failing {{ $value | humanize }}% of requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-