⚠️ Alert thresholds depend on the nature of your applications. Some of the queries here use arbitrary tolerance thresholds, so tune them to your own traffic. Building an efficient monitoring platform takes time. 😉

Envoy Prometheus Alert Rules

19 Prometheus alerting rules for Envoy, based on Envoy's built-in metrics. The rules cover critical, warning, and info conditions. Copy and paste the YAML into a Prometheus rule file.

4.6. Built-in metrics (19 rules)

wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/envoy/embedded-exporter.yml
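
Each rule on this page is a list item that starts with `- alert:`, so a rule pasted individually (rather than loaded from the downloaded file) needs to sit under the `rules:` key of a rule group. A minimal sketch, assuming a hypothetical group name `envoy` and a hypothetical file path `rules/envoy.yml`:

```yaml
# rules/envoy.yml (hypothetical path)
groups:
  - name: envoy   # hypothetical group name
    rules:
      # Paste rules from this page here, indented under `rules:`.
      - alert: EnvoyServerNotLive
        expr: envoy_server_live != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Envoy server not live (instance {{ $labels.instance }})
```

Prometheus loads such a file through the `rule_files` list in `prometheus.yml`, for example `rule_files: ["rules/*.yml"]`. Running `promtool check rules rules/envoy.yml` validates the file before a reload.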

4.6.1. Envoy server not live

Envoy server is not live (draining or shutting down) on {{ $labels.instance }}

- alert: EnvoyServerNotLive
  expr: envoy_server_live != 1
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Envoy server not live (instance {{ $labels.instance }})
    description: "Envoy server is not live (draining or shutting down) on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

4.6.2. Envoy high memory usage

Envoy memory allocated is above 90% of heap size on {{ $labels.instance }}

- alert: EnvoyHighMemoryUsage
  expr: envoy_server_memory_allocated / envoy_server_memory_heap_size * 100 > 90 and envoy_server_memory_heap_size > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Envoy high memory usage (instance {{ $labels.instance }})
    description: "Envoy memory allocated is above 90% of heap size on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

4.6.3. Envoy high downstream HTTP 5xx error rate

More than 5% of downstream HTTP responses are 5xx on {{ $labels.instance }} ({{ $value | printf "%.1f" }}%)

- alert: EnvoyHighDownstreamHTTP5xxErrorRate
  expr: sum by (instance) (rate(envoy_http_downstream_rq_xx{envoy_response_code_class="5"}[5m])) / sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) * 100 > 5 and sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Envoy high downstream HTTP 5xx error rate (instance {{ $labels.instance }})
    description: "More than 5% of downstream HTTP responses are 5xx on {{ $labels.instance }} ({{ $value | printf \"%.1f\" }}%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
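
The expression above computes the completed-request rate twice. One way to avoid that, sketched here on the assumption that you are free to add recording rules, is to record the ratio once and alert on the recorded series. The group name and the recorded series name `instance:envoy_http_downstream_rq_5xx:ratio_rate5m` are hypothetical; only the Envoy metric names come from the rule above.

```yaml
groups:
  - name: envoy-recording   # hypothetical group name
    rules:
      # Fraction (0..1) of downstream responses that are 5xx, per instance.
      # The `> 0` filter drops instances with no completed requests,
      # so no NaN samples are recorded for idle instances.
      - record: instance:envoy_http_downstream_rq_5xx:ratio_rate5m
        expr: |
          sum by (instance) (rate(envoy_http_downstream_rq_xx{envoy_response_code_class="5"}[5m]))
          /
          (sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) > 0)

      # Same threshold as the original rule, expressed on the recorded ratio.
      - alert: EnvoyHighDownstreamHTTP5xxErrorRate
        expr: instance:envoy_http_downstream_rq_5xx:ratio_rate5m * 100 > 5
        for: 1m
        labels:
          severity: critical
```

Rules in a group are evaluated in order, so the alert reads the value recorded in the same evaluation cycle. The recorded series can also back a dashboard panel without repeating the query.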

4.6.4. Envoy high downstream HTTP 4xx error rate

More than 10% of downstream HTTP responses are 4xx on {{ $labels.instance }} ({{ $value | printf "%.1f" }}%)

- alert: EnvoyHighDownstreamHTTP4xxErrorRate
  expr: sum by (instance) (rate(envoy_http_downstream_rq_xx{envoy_response_code_class="4"}[5m])) / sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) * 100 > 10 and sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Envoy high downstream HTTP 4xx error rate (instance {{ $labels.instance }})
    description: "More than 10% of downstream HTTP responses are 4xx on {{ $labels.instance }} ({{ $value | printf \"%.1f\" }}%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

4.6.5. Envoy downstream connections overflowing

Downstream connections are being rejected due to listener overflow on {{ $labels.instance }} ({{ $value }} in the last 5m)

- alert: EnvoyDownstreamConnectionsOverflowing
  expr: increase(envoy_listener_downstream_cx_overflow[5m]) > 5
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Envoy downstream connections overflowing (instance {{ $labels.instance }})
    description: "Downstream connections are being rejected due to listener overflow on {{ $labels.instance }} ({{ $value }} in the last 5m)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

4.6.6. Envoy cluster membership empty

Envoy cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} has no healthy members

- alert: EnvoyClusterMembershipEmpty
  expr: envoy_cluster_membership_healthy == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Envoy cluster membership empty (instance {{ $labels.instance }})
    description: "Envoy cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} has no healthy members\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

4.6.7. Envoy cluster membership degraded

Only {{ $value | printf "%.1f" }}% of members in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} are healthy (threshold: 75%)

- alert: EnvoyClusterMembershipDegraded
  expr: envoy_cluster_membership_healthy / envoy_cluster_membership_total * 100 < 75 and envoy_cluster_membership_total > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Envoy cluster membership degraded (instance {{ $labels.instance }})
    description: "Only {{ $value | printf \"%.1f\" }}% of members in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} are healthy (threshold: 75%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
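
A cluster with zero healthy members also satisfies the `< 75` condition, so it can raise this warning in addition to EnvoyClusterMembershipEmpty. If that overlap is unwanted, one possible variant (a sketch, not part of the published rule set) adds a lower bound so the warning only covers partially healthy clusters:

```yaml
- alert: EnvoyClusterMembershipDegraded
  # Same ratio as the original rule, restricted to clusters that still
  # have at least one healthy member. A healthy count above zero implies
  # a total above zero, so the separate guard on the total is not needed.
  expr: >-
    envoy_cluster_membership_healthy / envoy_cluster_membership_total * 100 < 75
    and envoy_cluster_membership_healthy > 0
  for: 5m
  labels:
    severity: warning
```

With this variant, a fully unhealthy cluster raises only the critical alert, while a cluster at, say, 2 of 4 healthy members still raises the warning.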

4.6.8. Envoy high cluster upstream connection failures

High rate of upstream connection failures in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} ({{ $value }} in the last 5m)

- alert: EnvoyHighClusterUpstreamConnectionFailures
  expr: increase(envoy_cluster_upstream_cx_connect_fail[5m]) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Envoy high cluster upstream connection failures (instance {{ $labels.instance }})
    description: "High rate of upstream connection failures in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} ({{ $value }} in the last 5m)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

4.6.9. Envoy high cluster upstream request timeout rate

More than 5% of upstream requests are timing out in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}

- alert: EnvoyHighClusterUpstreamRequestTimeoutRate
  expr: rate(envoy_cluster_upstream_rq_timeout[5m]) / rate(envoy_cluster_upstream_rq_completed[5m]) * 100 > 5 and rate(envoy_cluster_upstream_rq_completed[5m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Envoy high cluster upstream request timeout rate (instance {{ $labels.instance }})
    description: "More than 5% of upstream requests are timing out in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

4.6.10. Envoy high cluster upstream 5xx error rate

More than 5% of upstream requests return 5xx in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}

- alert: EnvoyHighClusterUpstream5xxErrorRate
  expr: rate(envoy_cluster_upstream_rq_xx{envoy_response_code_class="5"}[5m]) / rate(envoy_cluster_upstream_rq_completed[5m]) * 100 > 5 and rate(envoy_cluster_upstream_rq_completed[5m]) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Envoy high cluster upstream 5xx error rate (instance {{ $labels.instance }})
    description: "More than 5% of upstream requests return 5xx in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

4.6.11. Envoy cluster health check failures

Health checks are consistently failing in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} ({{ $value }} in the last 5m)

- alert: EnvoyClusterHealthCheckFailures
  expr: increase(envoy_cluster_health_check_failure[5m]) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Envoy cluster health check failures (instance {{ $labels.instance }})
    description: "Health checks are consistently failing in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} ({{ $value }} in the last 5m)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

4.6.12. Envoy cluster outlier detection ejections active

There are active outlier detection ejections in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}

- alert: EnvoyClusterOutlierDetectionEjectionsActive
  expr: envoy_cluster_outlier_detection_ejections_active > 0
  for: 5m
  labels:
    severity: info
  annotations:
    summary: Envoy cluster outlier detection ejections active (instance {{ $labels.instance }})
    description: "There are active outlier detection ejections in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

4.6.13. Envoy listener SSL connection errors

Envoy listener is experiencing SSL/TLS connection errors on {{ $labels.instance }} ({{ $value }} in the last 5m)

- alert: EnvoyListenerSSLConnectionErrors
  expr: increase(envoy_listener_ssl_connection_error[5m]) > 5
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Envoy listener SSL connection errors (instance {{ $labels.instance }})
    description: "Envoy listener is experiencing SSL/TLS connection errors on {{ $labels.instance }} ({{ $value }} in the last 5m)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

4.6.14. Envoy global downstream connections overflowing

Downstream connections are being rejected due to global connection limit on {{ $labels.instance }} ({{ $value }} in the last 5m)

- alert: EnvoyGlobalDownstreamConnectionsOverflowing
  expr: increase(envoy_listener_downstream_global_cx_overflow[5m]) > 5
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Envoy global downstream connections overflowing (instance {{ $labels.instance }})
    description: "Downstream connections are being rejected due to global connection limit on {{ $labels.instance }} ({{ $value }} in the last 5m)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

4.6.15. Envoy SSL certificate expiring soon

SSL certificate loaded by Envoy on {{ $labels.instance }} expires in less than 7 days

- alert: EnvoySSLCertificateExpiringSoon
  expr: envoy_server_days_until_first_cert_expiring < 7
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Envoy SSL certificate expiring soon (instance {{ $labels.instance }})
    description: "SSL certificate loaded by Envoy on {{ $labels.instance }} expires in less than 7 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

4.6.16. Envoy SSL certificate expired

SSL certificate loaded by Envoy on {{ $labels.instance }} has expired

- alert: EnvoySSLCertificateExpired
  expr: envoy_server_days_until_first_cert_expiring < 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Envoy SSL certificate expired (instance {{ $labels.instance }})
    description: "SSL certificate loaded by Envoy on {{ $labels.instance }} has expired\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
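
Any value below 0 is also below 7, so whenever EnvoySSLCertificateExpired fires, EnvoySSLCertificateExpiringSoon fires for the same instance as well. If alerts are routed through Alertmanager, an inhibit rule can mute the warning while the critical alert is active. A minimal sketch, assuming Alertmanager 0.22 or later for the `source_matchers` and `target_matchers` syntax:

```yaml
# alertmanager.yml (fragment)
inhibit_rules:
  - source_matchers:
      - alertname = EnvoySSLCertificateExpired
    target_matchers:
      - alertname = EnvoySSLCertificateExpiringSoon
    # Only inhibit when both alerts come from the same Envoy instance.
    equal:
      - instance
```

The `equal` list keeps an expired certificate on one instance from hiding an expiring certificate on another.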

4.6.17. Envoy cluster circuit breaker tripped

Circuit breaker is open for cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}

- alert: EnvoyClusterCircuitBreakerTripped
  expr: envoy_cluster_circuit_breakers_default_cx_open == 1 or envoy_cluster_circuit_breakers_default_rq_open == 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Envoy cluster circuit breaker tripped (instance {{ $labels.instance }})
    description: "Circuit breaker is open for cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
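
The rule above watches the connection (`cx_open`) and request (`rq_open`) breakers at the default priority. Envoy also documents pending-request and retry breakers. The sketch below extends the idea to the pending-request breaker. The Prometheus metric name `envoy_cluster_circuit_breakers_default_rq_pending_open` is an assumption derived from the naming pattern of the two metrics above, and the alert name is hypothetical, so confirm the metric against your Envoy `/stats/prometheus` output before relying on it.

```yaml
- alert: EnvoyClusterPendingRequestCircuitBreakerTripped   # hypothetical alert name
  # Assumed metric name, following the pattern of the cx_open / rq_open gauges.
  expr: envoy_cluster_circuit_breakers_default_rq_pending_open == 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Envoy cluster pending request circuit breaker tripped (instance {{ $labels.instance }})
```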

4.6.18. Envoy no healthy upstream

Upstream connection attempts failed because no healthy upstream was available in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} ({{ $value }} in the last 5m)

- alert: EnvoyNoHealthyUpstream
  expr: increase(envoy_cluster_upstream_cx_none_healthy[5m]) > 3
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Envoy no healthy upstream (instance {{ $labels.instance }})
    description: "Upstream connection attempts failed because no healthy upstream was available in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} ({{ $value }} in the last 5m)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

4.6.19. Envoy high downstream request timeout rate

Downstream requests are timing out on {{ $labels.instance }} ({{ $value }} in the last 5m)

- alert: EnvoyHighDownstreamRequestTimeoutRate
  expr: increase(envoy_http_downstream_rq_timeout[5m]) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Envoy high downstream request timeout rate (instance {{ $labels.instance }})
    description: "Downstream requests are timing out on {{ $labels.instance }} ({{ $value }} in the last 5m)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"