⚠️ Alert thresholds depend on the nature of your applications. Some queries use arbitrary tolerance thresholds, so tune them before relying on them. Building an efficient monitoring platform takes time. 😉

Etcd Prometheus Alert Rules

13 Prometheus alerting rules for Etcd, based on the metrics exposed by Etcd's embedded exporter. The rules cover critical and warning conditions. Copy the YAML into a rule file that your Prometheus configuration loads through rule_files.

7.4. Embedded exporter (13 rules)

wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/etcd/embedded-exporter.yml
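
Once downloaded, the file still has to be referenced from the Prometheus configuration. A minimal sketch, assuming the file was saved under /etc/prometheus/rules/ (a hypothetical path; use whatever directory your deployment mounts):

```yaml
# prometheus.yml (fragment)
rule_files:
  # Hypothetical location of the downloaded rule file.
  - /etc/prometheus/rules/embedded-exporter.yml
```

Running `promtool check rules` on the file before reloading Prometheus catches syntax errors early.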

7.4.1. Etcd insufficient Members

Etcd cluster should have an odd number of members

- alert: EtcdInsufficientMembers
  expr: count(etcd_server_id) % 2 == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Etcd insufficient Members
    description: "Etcd cluster should have an odd number of members\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
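
When copying a single rule rather than the whole file, keep in mind that Prometheus expects rules nested under a named group. A minimal sketch of that wrapper, where the group name `etcd` is an arbitrary choice and the annotations are left out for brevity:

```yaml
groups:
  - name: etcd  # arbitrary group name, chosen for illustration
    rules:
      - alert: EtcdInsufficientMembers
        expr: count(etcd_server_id) % 2 == 0
        for: 0m
        labels:
          severity: critical
        # annotations: copy the summary and description from the rule above
```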

7.4.2. Etcd no Leader

Etcd cluster has no leader

- alert: EtcdNoLeader
  expr: etcd_server_has_leader == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Etcd no Leader (instance {{ $labels.instance }})
    description: "Etcd cluster has no leader\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
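
The expression can be exercised offline with `promtool test rules`. The following is a sketch of a unit-test file. The file name, the `instance` and `job` label values, and the sample series are all made up for illustration:

```yaml
# etcd_no_leader_test.yml (hypothetical file name)
# Run with: promtool test rules etcd_no_leader_test.yml
evaluation_interval: 1m

tests:
  - interval: 1m
    # Synthetic samples: the member reports a leader at 0m and 1m, then none from 2m on.
    input_series:
      - series: 'etcd_server_has_leader{instance="etcd-0:2379", job="etcd"}'
        values: '1 1 0 0 0'

    promql_expr_test:
      # At 3m the gauge is 0, so the alert expression returns the series.
      - expr: etcd_server_has_leader == 0
        eval_time: 3m
        exp_samples:
          - labels: 'etcd_server_has_leader{instance="etcd-0:2379", job="etcd"}'
            value: 0
```

This checks the expression only. Testing the full alert, annotations included, would additionally need a `rule_files` entry and an `alert_rule_test` section.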

7.4.3. Etcd high number of leader changes

Etcd leader changed {{ $value }} times in the last 10 minutes

- alert: EtcdHighNumberOfLeaderChanges
  expr: increase(etcd_server_leader_changes_seen_total[10m]) > 2
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Etcd high number of leader changes (instance {{ $labels.instance }})
    description: "Etcd leader changed {{ $value }} times in the last 10 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

7.4.4. Etcd high number of failed GRPC requests warning

More than 1% GRPC request failure detected in Etcd

# Matches only genuine error codes. A grpc_code!="OK" filter would also count benign codes such as NotFound, AlreadyExists, and Cancelled.
- alert: EtcdHighNumberOfFailedGRPCRequestsWarning
  expr: sum(rate(grpc_server_handled_total{grpc_code=~"Internal|Unavailable|DeadlineExceeded|ResourceExhausted|Aborted|Unknown"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.01 and sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Etcd high number of failed GRPC requests warning ({{ $labels.grpc_service }}/{{ $labels.grpc_method }})
    description: "More than 1% GRPC request failure detected in Etcd\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

7.4.5. Etcd high number of failed GRPC requests critical

More than 5% GRPC request failure detected in Etcd

# Matches only genuine error codes. A grpc_code!="OK" filter would also count benign codes such as NotFound, AlreadyExists, and Cancelled.
- alert: EtcdHighNumberOfFailedGRPCRequestsCritical
  expr: sum(rate(grpc_server_handled_total{grpc_code=~"Internal|Unavailable|DeadlineExceeded|ResourceExhausted|Aborted|Unknown"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.05 and sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Etcd high number of failed GRPC requests critical ({{ $labels.grpc_service }}/{{ $labels.grpc_method }})
    description: "More than 5% GRPC request failure detected in Etcd\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
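
Both GRPC failure rules match `grpc_server_handled_total` from every scraped target, and other gRPC servers instrumented with the same library emit that metric name too. Below is a sketch of a recording rule that computes the error ratio once and restricts it to Etcd. It assumes the scrape job is named `etcd` (adjust the selector to your setup); the group name and the recorded series name are invented for illustration:

```yaml
groups:
  - name: etcd-grpc-recording  # illustrative group name
    rules:
      # Error ratio per service and method, limited to targets of the assumed job "etcd".
      - record: grpc_service_method:etcd_grpc_server_handled_errors:ratio_rate1m
        expr: |
          sum by (grpc_service, grpc_method) (
            rate(grpc_server_handled_total{job="etcd", grpc_code=~"Internal|Unavailable|DeadlineExceeded|ResourceExhausted|Aborted|Unknown"}[1m])
          )
          /
          sum by (grpc_service, grpc_method) (
            rate(grpc_server_handled_total{job="etcd"}[1m])
          )
```

The warning and critical alerts could then compare this recorded series against 0.01 and 0.05 respectively, instead of repeating the full expression.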

7.4.6. Etcd GRPC requests slow

GRPC requests slowing down, 99th percentile is over 0.15s

- alert: EtcdGRPCRequestsSlow
  expr: histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[1m])) by (grpc_service, grpc_method, le)) > 0.15
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Etcd GRPC requests slow ({{ $labels.grpc_service }}/{{ $labels.grpc_method }})
    description: "GRPC requests slowing down, 99th percentile is over 0.15s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

7.4.7. Etcd high number of failed HTTP requests warning

More than 1% HTTP failure detected in Etcd

# These etcd_http_* metrics are from the etcd v2 API and do not exist in etcd 3.x. Remove these rules if running etcd 3.x.
- alert: EtcdHighNumberOfFailedHTTPRequestsWarning
  expr: sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.01 and sum(rate(etcd_http_received_total[1m])) BY (method) > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Etcd high number of failed HTTP requests warning (method {{ $labels.method }})
    description: "More than 1% HTTP failure detected in Etcd\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

7.4.8. Etcd high number of failed HTTP requests critical

More than 5% HTTP failure detected in Etcd

# These etcd_http_* metrics are from the etcd v2 API and do not exist in etcd 3.x. Remove these rules if running etcd 3.x.
- alert: EtcdHighNumberOfFailedHTTPRequestsCritical
  expr: sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.05 and sum(rate(etcd_http_received_total[1m])) BY (method) > 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Etcd high number of failed HTTP requests critical (method {{ $labels.method }})
    description: "More than 5% HTTP failure detected in Etcd\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

7.4.9. Etcd HTTP requests slow

HTTP requests slowing down, 99th percentile is over 0.15s

# This etcd_http_* metric is from the etcd v2 API and does not exist in etcd 3.x. Remove this rule if running etcd 3.x.
- alert: EtcdHTTPRequestsSlow
  expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[1m])) > 0.15
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Etcd HTTP requests slow (instance {{ $labels.instance }})
    description: "HTTP requests slowing down, 99th percentile is over 0.15s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

7.4.10. Etcd member communication slow

Etcd member communication slowing down, 99th percentile is over 0.15s

- alert: EtcdMemberCommunicationSlow
  expr: histogram_quantile(0.99, sum(rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) by (instance, le)) > 0.15
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Etcd member communication slow (instance {{ $labels.instance }})
    description: "Etcd member communication slowing down, 99th percentile is over 0.15s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

7.4.11. Etcd high number of failed proposals

Etcd server got {{ $value }} failed proposals in the past hour

- alert: EtcdHighNumberOfFailedProposals
  expr: increase(etcd_server_proposals_failed_total[1h]) > 5
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Etcd high number of failed proposals (instance {{ $labels.instance }})
    description: "Etcd server got {{ $value }} failed proposals in the past hour\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

7.4.12. Etcd high fsync durations

Etcd WAL fsync duration increasing, 99th percentile is over 0.5s

- alert: EtcdHighFsyncDurations
  expr: histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le)) > 0.5
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Etcd high fsync durations (instance {{ $labels.instance }})
    description: "Etcd WAL fsync duration increasing, 99th percentile is over 0.5s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

7.4.13. Etcd high commit durations

Etcd commit duration increasing, 99th percentile is over 0.25s

- alert: EtcdHighCommitDurations
  expr: histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le)) > 0.25
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Etcd high commit durations (instance {{ $labels.instance }})
    description: "Etcd commit duration increasing, 99th percentile is over 0.25s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"