9.11.1. Cilium agent unreachable nodes
Cilium agent {{ $labels.pod }} cannot reach {{ $value }} node(s). Check network connectivity and node health.
# Metric name depends on Cilium version. Use cilium_unreachable_nodes (older) or cilium_node_connectivity_status (1.14+).
- alert: CiliumAgentUnreachableNodes
  expr: sum(cilium_unreachable_nodes{}) by (pod) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Cilium agent unreachable nodes (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} cannot reach {{ $value }} node(s). Check network connectivity and node health.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
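Prometheus loads alerting rules from a rule file in which each `- alert:` item sits under a group's `rules:` key. The following is a minimal sketch of that wrapper around the rule above. The file name `cilium.rules.yml` and the group name `cilium-agent` are assumptions for illustration, and the description annotation is left out for brevity.

```yaml
# Hypothetical rule file (e.g. cilium.rules.yml); the group name is an assumption.
groups:
  - name: cilium-agent
    rules:
      - alert: CiliumAgentUnreachableNodes
        expr: sum(cilium_unreachable_nodes{}) by (pod) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Cilium agent unreachable nodes (instance {{ $labels.instance }})
```

A file laid out this way can be validated with `promtool check rules cilium.rules.yml` before Prometheus loads it.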
9.11.2. Cilium agent unreachable health endpoints
Cilium agent {{ $labels.pod }} cannot reach {{ $value }} health endpoint(s). Node-to-node health probes are failing.
# Metric name depends on Cilium version. Use cilium_unreachable_health_endpoints (older) or cilium_node_connectivity_status (1.14+).
- alert: CiliumAgentUnreachableHealthEndpoints
  expr: sum(cilium_unreachable_health_endpoints{}) by (pod) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Cilium agent unreachable health endpoints (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} cannot reach {{ $value }} health endpoint(s). Node-to-node health probes are failing.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.3. Cilium agent failing controllers
Cilium agent {{ $labels.pod }} has {{ $value }} failing controller(s). Check cilium-agent logs for details.
# Metric name depends on Cilium version. Use cilium_controllers_failing (older) or cilium_controllers_runs_total (1.14+).
- alert: CiliumAgentFailingControllers
  expr: sum(cilium_controllers_failing{}) by (pod) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Cilium agent failing controllers (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} has {{ $value }} failing controller(s). Check cilium-agent logs for details.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.4. Cilium agent endpoint failures
Cilium agent {{ $labels.pod }} has {{ $value }} endpoint(s) in invalid state.
- alert: CiliumAgentEndpointFailures
  expr: sum(cilium_endpoint_state{endpoint_state="invalid"}) by (pod) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Cilium agent endpoint failures (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} has {{ $value }} endpoint(s) in invalid state.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.5. Cilium agent endpoint regeneration failures
Cilium agent {{ $labels.pod }} is failing to regenerate endpoints. Network policy enforcement may be stale.
- alert: CiliumAgentEndpointRegenerationFailures
  expr: sum(rate(cilium_endpoint_regenerations_total{outcome="fail"}[5m])) by (pod) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Cilium agent endpoint regeneration failures (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} is failing to regenerate endpoints. Network policy enforcement may be stale.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.6. Cilium agent endpoint update failure
Cilium agent {{ $labels.pod }} is failing K8s endpoint update API calls ({{ $labels.method }} {{ $labels.return_code }}).
- alert: CiliumAgentEndpointUpdateFailure
  expr: sum(rate(cilium_k8s_client_api_calls_total{method=~"(PUT|POST|PATCH)", endpoint="endpoint", return_code!~"2[0-9][0-9]"}[5m])) by (pod, method, return_code) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Cilium agent endpoint update failure (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} is failing K8s endpoint update API calls ({{ $labels.method }} {{ $labels.return_code }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.7. Cilium agent endpoint create failure
Cilium agent {{ $labels.pod }} is failing CNI endpoint-create calls. New pods may fail to get networking.
- alert: CiliumAgentEndpointCreateFailure
  expr: sum(rate(cilium_api_limiter_processed_requests_total{api_call=~"endpoint-create", outcome="fail"}[1m])) by (pod, api_call) > 0.05
  for: 5m
  labels:
    severity: info
  annotations:
    summary: Cilium agent endpoint create failure (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} is failing CNI endpoint-create calls. New pods may fail to get networking.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.8. Cilium agent map operation failures
Cilium agent {{ $labels.pod }} has eBPF map operation failures on {{ $labels.map_name }}. Datapath may be degraded.
- alert: CiliumAgentMapOperationFailures
  expr: sum(rate(cilium_bpf_map_ops_total{outcome="fail"}[5m])) by (map_name, pod) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Cilium agent map operation failures (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} has eBPF map operation failures on {{ $labels.map_name }}. Datapath may be degraded.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.9. Cilium agent BPF map pressure
Cilium agent {{ $labels.pod }} eBPF map {{ $labels.map_name }} is above 90% utilization. Map may become full.
# Map pressure is a ratio from 0 to 1. At 1.0, the map is full and new entries will be dropped.
- alert: CiliumAgentBPFMapPressure
  expr: cilium_bpf_map_pressure{} > 0.9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Cilium agent BPF map pressure (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} eBPF map {{ $labels.map_name }} is above 90% utilization. Map may become full.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.10. Cilium agent conntrack table full
Cilium agent {{ $labels.pod }} conntrack table is full, causing packet drops. Increase CT map size or investigate connection leaks.
- alert: CiliumAgentConntrackTableFull
  expr: sum(rate(cilium_drop_count_total{reason="CT: Map insertion failed"}[5m])) by (pod) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Cilium agent conntrack table full (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} conntrack table is full, causing packet drops. Increase CT map size or investigate connection leaks.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.11. Cilium agent conntrack failed garbage collection
Cilium agent {{ $labels.pod }} conntrack garbage collection is failing. Stale entries may accumulate.
- alert: CiliumAgentConntrackFailedGarbageCollection
  expr: sum(rate(cilium_datapath_conntrack_gc_runs_total{status="uncompleted"}[5m])) by (pod) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Cilium agent conntrack failed garbage collection (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} conntrack garbage collection is failing. Stale entries may accumulate.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.12. Cilium agent NAT table full
Cilium agent {{ $labels.pod }} NAT table is full, causing masquerade failures. Increase NAT map size or investigate.
- alert: CiliumAgentNATTableFull
  expr: sum(rate(cilium_drop_count_total{reason="No mapping for NAT masquerade"}[1m])) by (pod) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Cilium agent NAT table full (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} NAT table is full, causing masquerade failures. Increase NAT map size or investigate.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.13. Cilium agent high denied rate
Cilium agent {{ $labels.pod }} is dropping packets due to policy denial. Verify network policies are correct.
# Policy denials may be expected behavior. Investigate only if unexpected traffic is being blocked.
- alert: CiliumAgentHighDeniedRate
  expr: sum(rate(cilium_drop_count_total{reason="Policy denied"}[1m])) by (pod) > 0
  for: 10m
  labels:
    severity: info
  annotations:
    summary: Cilium agent high denied rate (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} is dropping packets due to policy denial. Verify network policies are correct.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
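Because some policy denials are expected, a threshold of zero can be noisy on clusters with active network policies. One possible variant of the rule above is sketched here. It assumes an illustrative floor of 1 drop per second (not a recommended value) and that your Cilium version exposes a `direction` label on `cilium_drop_count_total`; verify both against your cluster.

```yaml
- alert: CiliumAgentHighDeniedRate
  # Assumption: 1 drop/s is an illustrative floor; tune it to your baseline.
  # Assumption: the metric carries a direction label in your Cilium version.
  expr: sum(rate(cilium_drop_count_total{reason="Policy denied"}[5m])) by (pod, direction) > 1
  for: 10m
  labels:
    severity: info
  annotations:
    summary: Cilium agent high denied rate ({{ $labels.direction }} on {{ $labels.pod }})
```

Grouping by `direction` makes it easier to tell blocked ingress from blocked egress when the alert fires.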
9.11.14. Cilium agent high drop rate
Cilium agent {{ $labels.pod }} is dropping packets for reason {{ $labels.reason }}. This indicates infrastructure issues.
- alert: CiliumAgentHighDropRate
  expr: sum(rate(cilium_drop_count_total{reason!~"Policy denied"}[5m])) by (pod, reason) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Cilium agent high drop rate (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} is dropping packets for reason {{ $labels.reason }}. This indicates infrastructure issues.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.15. Cilium agent policy map pressure
Cilium agent {{ $labels.pod }} policy BPF map is above 90% utilization. New policies may fail to apply.
- alert: CiliumAgentPolicyMapPressure
  # max, not sum: pressure is a per-map ratio, so adding ratios across maps overstates utilization.
  expr: max(cilium_bpf_map_pressure{map_name=~"cilium_policy_.*"}) by (pod) > 0.9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Cilium agent policy map pressure (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} policy BPF map is above 90% utilization. New policies may fail to apply.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.16. Cilium agent policy import errors
Cilium agent {{ $labels.pod }} is failing to import network policies. Policy enforcement may be incomplete.
- alert: CiliumAgentPolicyImportErrors
  expr: sum(rate(cilium_policy_change_total{outcome="fail"}[5m])) by (pod) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Cilium agent policy import errors (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} is failing to import network policies. Policy enforcement may be incomplete.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.17. Cilium agent policy implementation delay
Cilium agent {{ $labels.pod }} P99 policy deployment latency exceeds 60 seconds. Endpoints may run with stale policies.
# Threshold of 60s is a rough default. Adjust based on cluster size and policy complexity.
- alert: CiliumAgentPolicyImplementationDelay
  expr: histogram_quantile(0.99, sum(rate(cilium_policy_implementation_delay_bucket[5m])) by (le, pod)) > 60
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Cilium agent policy implementation delay (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} P99 policy deployment latency exceeds 60 seconds. Endpoints may run with stale policies.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.18. Cilium node-local high identity allocation
Cilium agent {{ $labels.pod }} node-local identity allocation is above 80%. Approaching the 65535 identity limit.
- alert: CiliumNode-localHighIdentityAllocation
  expr: (sum(cilium_identity{type="node_local"}) by (pod) / (2^16-1)) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Cilium node-local high identity allocation (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} node-local identity allocation is above 80%. Approaching the 65535 identity limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.19. Cilium cluster high identity allocation
Cilium cluster-wide identity allocation is above 80%. Approaching the maximum identity limit.
- alert: CiliumClusterHighIdentityAllocation
  expr: (sum(cilium_identity{type="cluster_local"}) by () / (2^16-256)) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Cilium cluster high identity allocation (instance {{ $labels.instance }})
    description: "Cilium cluster-wide identity allocation is above 80%. Approaching the maximum identity limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
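The 80% thresholds in the two identity rules correspond to fixed identity counts: 0.8 × (2^16 − 1) = 52428 node-local identities per agent, and 0.8 × (2^16 − 256) = 52224 cluster-wide identities. As a sketch, the cluster rule can be written with the count spelled out, which some readers find easier to compare against dashboards. The annotations are omitted here for brevity.

```yaml
- alert: CiliumClusterHighIdentityAllocation
  # 0.8 * (2^16 - 256) = 0.8 * 65280 = 52224
  expr: sum(cilium_identity{type="cluster_local"}) > 52224
  for: 5m
  labels:
    severity: warning
```

The ratio form used in the rules above keeps the percentage visible in `{{ $value }}`, while this form reports the raw identity count.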
9.11.20. Cilium operator exhausted IPAM IPs
Cilium operator has no available IPAM IPs. New pods will fail to get networking.
- alert: CiliumOperatorExhaustedIPAMIPs
  expr: sum(cilium_operator_ipam_ips{type="available"}) by () <= 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Cilium operator exhausted IPAM IPs (instance {{ $labels.instance }})
    description: "Cilium operator has no available IPAM IPs. New pods will fail to get networking.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.21. Cilium operator low available IPAM IPs
Cilium operator IPAM IP pool is over 90% utilized. Allocate more IPs to avoid exhaustion.
# Threshold of 90% is a rough default. Adjust based on your pod churn rate and IP pool size.
- alert: CiliumOperatorLowAvailableIPAMIPs
  expr: sum(cilium_operator_ipam_ips{type!="available"}) by () / sum(cilium_operator_ipam_ips) by () > 0.9 and sum(cilium_operator_ipam_ips) by () > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Cilium operator low available IPAM IPs (instance {{ $labels.instance }})
    description: "Cilium operator IPAM IP pool is over 90% utilized. Allocate more IPs to avoid exhaustion.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.22. Cilium operator IPAM interface creation failures
Cilium operator is failing to create IPAM network interfaces. IP allocation may be impacted.
# Some Cilium versions may not have a status label on this metric. Verify against your Cilium version.
- alert: CiliumOperatorIPAMInterfaceCreationFailures
  expr: sum(rate(cilium_operator_ipam_interface_creation_ops{status!="success"}[5m])) by () > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Cilium operator IPAM interface creation failures (instance {{ $labels.instance }})
    description: "Cilium operator is failing to create IPAM network interfaces. IP allocation may be impacted.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.23. Cilium agent API errors
Cilium agent {{ $labels.pod }} API is returning 5xx errors ({{ $labels.return_code }}). Agent may be unhealthy.
- alert: CiliumAgentAPIErrors
  expr: sum(rate(cilium_agent_api_process_time_seconds_count{return_code=~"5[0-9][0-9]"}[5m])) by (pod, return_code) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Cilium agent API errors (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} API is returning 5xx errors ({{ $labels.return_code }}). Agent may be unhealthy.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.24. Cilium agent Kubernetes client errors
Cilium agent {{ $labels.pod }} is receiving errors from K8s API for endpoint {{ $labels.endpoint }} ({{ $labels.return_code }}).
- alert: CiliumAgentKubernetesClientErrors
  expr: sum(rate(cilium_k8s_client_api_calls_total{endpoint!="metrics", return_code!~"2[0-9][0-9]"}[5m])) by (pod, endpoint, return_code) > 0.05
  for: 5m
  labels:
    severity: info
  annotations:
    summary: Cilium agent Kubernetes client errors (instance {{ $labels.instance }})
    description: "Cilium agent {{ $labels.pod }} is receiving errors from K8s API for endpoint {{ $labels.endpoint }} ({{ $labels.return_code }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.25. Cilium ClusterMesh remote cluster not ready
Cilium ClusterMesh remote cluster {{ $labels.target_cluster }} is not ready from {{ $labels.source_cluster }}.
- alert: CiliumClusterMeshRemoteClusterNotReady
  expr: count(cilium_clustermesh_remote_cluster_readiness_status < 1) by (source_cluster, target_cluster) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Cilium ClusterMesh remote cluster not ready (instance {{ $labels.instance }})
    description: "Cilium ClusterMesh remote cluster {{ $labels.target_cluster }} is not ready from {{ $labels.source_cluster }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.26. Cilium ClusterMesh remote cluster failing
Cilium ClusterMesh connectivity to remote cluster {{ $labels.target_cluster }} from {{ $labels.source_cluster }} is failing ({{ $value }} failures).
- alert: CiliumClusterMeshRemoteClusterFailing
  expr: sum(cilium_clustermesh_remote_cluster_failures) by (source_cluster, target_cluster) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Cilium ClusterMesh remote cluster failing (instance {{ $labels.instance }})
    description: "Cilium ClusterMesh connectivity to remote cluster {{ $labels.target_cluster }} from {{ $labels.source_cluster }} is failing ({{ $value }} failures).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.27. Cilium KVStoreMesh remote cluster not ready
Cilium KVStoreMesh remote cluster {{ $labels.target_cluster }} is not ready from {{ $labels.source_cluster }}.
- alert: CiliumKVStoreMeshRemoteClusterNotReady
  expr: count(cilium_kvstoremesh_remote_cluster_readiness_status < 1) by (source_cluster, target_cluster) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Cilium KVStoreMesh remote cluster not ready (instance {{ $labels.instance }})
    description: "Cilium KVStoreMesh remote cluster {{ $labels.target_cluster }} is not ready from {{ $labels.source_cluster }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.28. Cilium KVStoreMesh remote cluster failing
Cilium KVStoreMesh remote cluster {{ $labels.target_cluster }} from {{ $labels.source_cluster }} is experiencing failures ({{ $value }} failures).
- alert: CiliumKVStoreMeshRemoteClusterFailing
  expr: sum(cilium_kvstoremesh_remote_cluster_failures) by (source_cluster, target_cluster) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Cilium KVStoreMesh remote cluster failing (instance {{ $labels.instance }})
    description: "Cilium KVStoreMesh remote cluster {{ $labels.target_cluster }} from {{ $labels.source_cluster }} is experiencing failures ({{ $value }} failures).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.29. Cilium KVStoreMesh sync errors
Cilium KVStoreMesh from {{ $labels.source_cluster }} is experiencing kvstore sync errors.
- alert: CiliumKVStoreMeshSyncErrors
  expr: sum(rate(cilium_kvstoremesh_kvstore_sync_errors_total[5m])) by (source_cluster) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Cilium KVStoreMesh sync errors (instance {{ $labels.instance }})
    description: "Cilium KVStoreMesh from {{ $labels.source_cluster }} is experiencing kvstore sync errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.30. Cilium Hubble lost events
Cilium Hubble on {{ $labels.pod }} is losing flow events. Observability data may be incomplete.
- alert: CiliumHubbleLostEvents
  expr: sum(rate(hubble_lost_events_total[5m])) by (pod) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Cilium Hubble lost events (instance {{ $labels.instance }})
    description: "Cilium Hubble on {{ $labels.pod }} is losing flow events. Observability data may be incomplete.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11.31. Cilium Hubble high DNS error rate
Cilium Hubble on {{ $labels.pod }} is observing more than 10% DNS error responses.
# Threshold of 10% is a rough default. Some DNS errors may be normal depending on your workload.
- alert: CiliumHubbleHighDNSErrorRate
  expr: sum(rate(hubble_dns_responses_total{rcode!="No Error"}[5m])) by (pod) / sum(rate(hubble_dns_responses_total[5m])) by (pod) > 0.1 and sum(rate(hubble_dns_responses_total[5m])) by (pod) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Cilium Hubble high DNS error rate (instance {{ $labels.instance }})
    description: "Cilium Hubble on {{ $labels.pod }} is observing more than 10% DNS error responses.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
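The comment on this rule notes that some DNS errors may be normal for a given workload. One way to act on that is to leave NXDOMAIN responses out of the numerator. The sketch below assumes Hubble reports them as `rcode="Non-Existent Domain"`; that label value is an assumption, so check the values your Hubble version actually emits before relying on it. The annotations are omitted for brevity.

```yaml
- alert: CiliumHubbleHighDNSErrorRate
  # Assumption: NXDOMAIN responses appear as rcode="Non-Existent Domain".
  expr: sum(rate(hubble_dns_responses_total{rcode!~"No Error|Non-Existent Domain"}[5m])) by (pod) / sum(rate(hubble_dns_responses_total[5m])) by (pod) > 0.1
  for: 5m
  labels:
    severity: warning
```

The denominator still counts every response, so the 10% threshold keeps its meaning as a share of all DNS traffic seen by the pod.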