⚠️

Alert thresholds depend on the nature of your applications. Some queries may have arbitrary tolerance thresholds. Building an efficient monitoring platform takes time. 😉

Thanos Prometheus Alert Rules

45 Prometheus alerting rules for Thanos, grouped by component: Thanos Compactor, Thanos Query, Thanos Receiver, Thanos Sidecar, Thanos Store, Thanos Ruler, Thanos Bucket Replicate, and Thanos Component Absent. These rules cover critical, warning, and info conditions; copy the YAML into your Prometheus rule configuration.
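Prometheus loads these files through the `rule_files` section of its configuration. A minimal sketch, assuming the downloaded YAML files are stored under `/etc/prometheus/rules/thanos/` (the path is an assumption; adjust it to your layout):

```yaml
# prometheus.yml (fragment)
global:
  evaluation_interval: 1m   # how often alerting rules are evaluated

rule_files:
  # Load every Thanos rule file saved in this directory
  - /etc/prometheus/rules/thanos/*.yml
```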

12.1.1. Thanos Compactor (5 rules)

wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-compactor.yml

12.1.1.1. Thanos Compactor Multiple Running

No more than one Thanos Compact instance should be running at once. There are {{$value}} instances running.

- alert: ThanosCompactorMultipleRunning
  expr: sum by (job) (up{job=~".*thanos-compact.*"}) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Thanos Compactor Multiple Running (instance {{ $labels.instance }})
    description: "No more than one Thanos Compact instance should be running at once. There are {{$value}} instances running.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.1.2. Thanos Compactor Halted

Thanos Compact {{$labels.job}} has failed to run and now is halted.

- alert: ThanosCompactorHalted
  expr: thanos_compact_halted == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Thanos Compactor Halted (instance {{ $labels.instance }})
    description: "Thanos Compact {{$labels.job}} has failed to run and now is halted.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.1.3. Thanos Compactor High Compaction Failures

Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}% of compactions.

- alert: ThanosCompactorHighCompactionFailures
  expr: (sum by (job) (rate(thanos_compact_group_compactions_failures_total[5m])) / sum by (job) (rate(thanos_compact_group_compactions_total[5m])) * 100 > 5) and sum by (job) (rate(thanos_compact_group_compactions_total[5m])) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Thanos Compactor High Compaction Failures (instance {{ $labels.instance }})
    description: "Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}% of compactions.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.1.4. Thanos Compact Bucket High Operation Failures

Thanos Compact {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.

- alert: ThanosCompactBucketHighOperationFailures
  expr: (sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-compact.*"}[5m])) / sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-compact.*"}[5m])) * 100 > 5) and sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-compact.*"}[5m])) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Thanos Compact Bucket High Operation Failures (instance {{ $labels.instance }})
    description: "Thanos Compact {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.1.5. Thanos Compact Has Not Run

Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours.

- alert: ThanosCompactHasNotRun
  expr: (time() - max by (job) (max_over_time(thanos_objstore_bucket_last_successful_upload_time{job=~".*thanos-compact.*"}[24h]))) / 60 / 60 > 24
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Thanos Compact Has Not Run (instance {{ $labels.instance }})
    description: "Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
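Before loading a downloaded file, its syntax and expressions can be validated with `promtool`, which ships with Prometheus. A command-line fragment:

```shell
# Validates the YAML structure and parses every PromQL expression in the file
promtool check rules thanos-compactor.yml
```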

12.1.2. Thanos Query (8 rules)

wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-query.yml

12.1.2.1. Thanos Query Http Request Query Error Rate High

Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of "query" requests.

- alert: ThanosQueryHttpRequestQueryErrorRateHigh
  expr: (sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query"}[5m]))/  sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query"}[5m]))) * 100 > 5 and sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query"}[5m])) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Query Http Request Query Error Rate High (instance {{ $labels.instance }})
    description: "Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of \"query\" requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.2.2. Thanos Query Http Request Query Range Error Rate High

Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of "query_range" requests.

- alert: ThanosQueryHttpRequestQueryRangeErrorRateHigh
  expr: (sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query_range"}[5m]))/  sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query_range"}[5m]))) * 100 > 5 and sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query_range"}[5m])) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Query Http Request Query Range Error Rate High (instance {{ $labels.instance }})
    description: "Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of \"query_range\" requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.2.3. Thanos Query Grpc Server Error Rate

Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.

- alert: ThanosQueryGrpcServerErrorRate
  expr: (sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-query.*"}[5m]))/  sum by (job) (rate(grpc_server_started_total{job=~".*thanos-query.*"}[5m])) * 100 > 5) and sum by (job) (rate(grpc_server_started_total{job=~".*thanos-query.*"}[5m])) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Thanos Query Grpc Server Error Rate (instance {{ $labels.instance }})
    description: "Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.2.4. Thanos Query Grpc Client Error Rate

Thanos Query {{$labels.job}} is failing to send {{$value | humanize}}% of requests.

  # Filters to actual error codes only. grpc_code!="OK" would include benign codes like NotFound, AlreadyExists, and Cancelled.
- alert: ThanosQueryGrpcClientErrorRate
  expr: (sum by (job) (rate(grpc_client_handled_total{grpc_code=~"Unknown|Internal|Unavailable|DataLoss|DeadlineExceeded|ResourceExhausted", job=~".*thanos-query.*"}[5m])) / sum by (job) (rate(grpc_client_started_total{job=~".*thanos-query.*"}[5m]))) * 100 > 5 and sum by (job) (rate(grpc_client_started_total{job=~".*thanos-query.*"}[5m])) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Thanos Query Grpc Client Error Rate (instance {{ $labels.instance }})
    description: "Thanos Query {{$labels.job}} is failing to send {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.2.5. Thanos Query High DNS Failures

Thanos Query {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for store endpoints.

- alert: ThanosQueryHighDNSFailures
  expr: (sum by (job) (rate(thanos_query_store_apis_dns_failures_total[5m])) / sum by (job) (rate(thanos_query_store_apis_dns_lookups_total[5m]))) * 100 > 1 and sum by (job) (rate(thanos_query_store_apis_dns_lookups_total[5m])) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Thanos Query High DNS Failures (instance {{ $labels.instance }})
    description: "Thanos Query {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for store endpoints.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.2.6. Thanos Query Instant Latency High

Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for instant queries.

- alert: ThanosQueryInstantLatencyHigh
  expr: (histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query"}[5m]))) > 40 and sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-query.*", handler="query"}[5m])) > 0)
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: Thanos Query Instant Latency High (instance {{ $labels.instance }})
    description: "Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for instant queries.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.2.7. Thanos Query Range Latency High

Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for range queries.

- alert: ThanosQueryRangeLatencyHigh
  expr: (histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query_range"}[5m]))) > 90 and sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-query.*", handler="query_range"}[5m])) > 0)
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: Thanos Query Range Latency High (instance {{ $labels.instance }})
    description: "Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for range queries.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.2.8. Thanos Query Overload

Thanos Query {{$labels.job}} has been overloaded for more than 15 minutes. This may be a symptom of excessive simultaneous complex requests, low performance of the Prometheus API, or failures within these components. Assess the health of the Thanos Query instances and the connected Prometheus instances, look for potential senders of these requests, and then contact support.

- alert: ThanosQueryOverload
  expr: (max_over_time(thanos_query_concurrent_gate_queries_max[5m]) - avg_over_time(thanos_query_concurrent_gate_queries_in_flight[5m]) < 1)
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Thanos Query Overload (instance {{ $labels.instance }})
    description: "Thanos Query {{$labels.job}} has been overloaded for more than 15 minutes. This may be a symptom of excessive simultaneous complex requests, low performance of the Prometheus API, or failures within these components. Assess the health of the Thanos Query instances and the connected Prometheus instances, look for potential senders of these requests, and then contact support.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
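The overload expression is easier to read as headroom: the configured concurrency limit minus the average number of queries in flight. When that headroom stays below one query for 15 minutes, the concurrency gate is effectively saturated. Decomposed as a fragment:

```yaml
# Headroom = gate capacity - average in-flight queries over 5m.
# The alert fires when headroom stays below 1 for 15 minutes.
expr: |
  max_over_time(thanos_query_concurrent_gate_queries_max[5m])
    - avg_over_time(thanos_query_concurrent_gate_queries_in_flight[5m])
  < 1
```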

12.1.3. Thanos Receiver (7 rules)

wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-receiver.yml

12.1.3.1. Thanos Receive Http Request Error Rate High

Thanos Receive {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.

- alert: ThanosReceiveHttpRequestErrorRateHigh
  expr: (sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-receive.*", handler="receive"}[5m]))/  sum by (job) (rate(http_requests_total{job=~".*thanos-receive.*", handler="receive"}[5m]))) * 100 > 5 and sum by (job) (rate(http_requests_total{job=~".*thanos-receive.*", handler="receive"}[5m])) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Receive Http Request Error Rate High (instance {{ $labels.instance }})
    description: "Thanos Receive {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.3.2. Thanos Receive Http Request Latency High

Thanos Receive {{$labels.job}} has a 99th percentile latency of {{ $value }} seconds for requests.

- alert: ThanosReceiveHttpRequestLatencyHigh
  expr: (histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-receive.*", handler="receive"}[5m]))) > 10 and sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-receive.*", handler="receive"}[5m])) > 0)
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: Thanos Receive Http Request Latency High (instance {{ $labels.instance }})
    description: "Thanos Receive {{$labels.job}} has a 99th percentile latency of {{ $value }} seconds for requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.3.3. Thanos Receive High Replication Failures

Thanos Receive {{$labels.job}} is failing to replicate {{$value | humanize}}% of requests.

- alert: ThanosReceiveHighReplicationFailures
  expr: thanos_receive_replication_factor > 1 and ((sum by (job) (rate(thanos_receive_replications_total{result="error"}[5m])) / sum by (job) (rate(thanos_receive_replications_total[5m]))) > (max by (job) (floor((thanos_receive_replication_factor+1)/ 2)) / max by (job) (thanos_receive_hashring_nodes))) * 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Thanos Receive High Replication Failures (instance {{ $labels.instance }})
    description: "Thanos Receive {{$labels.job}} is failing to replicate {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.3.4. Thanos Receive High Forward Request Failures

Thanos Receive {{$labels.job}} is failing to forward {{$value | humanize}}% of requests.

- alert: ThanosReceiveHighForwardRequestFailures
  expr: (sum by (job) (rate(thanos_receive_forward_requests_total{result="error"}[5m]))/  sum by (job) (rate(thanos_receive_forward_requests_total[5m]))) * 100 > 20 and sum by (job) (rate(thanos_receive_forward_requests_total[5m])) > 0
  for: 5m
  labels:
    severity: info
  annotations:
    summary: Thanos Receive High Forward Request Failures (instance {{ $labels.instance }})
    description: "Thanos Receive {{$labels.job}} is failing to forward {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.3.5. Thanos Receive High Hashring File Refresh Failures

Thanos Receive {{$labels.job}} is failing to refresh its hashring file; {{$value | humanize}} of attempts failed.

- alert: ThanosReceiveHighHashringFileRefreshFailures
  expr: (sum by (job) (rate(thanos_receive_hashrings_file_errors_total[5m])) / sum by (job) (rate(thanos_receive_hashrings_file_refreshes_total[5m])) > 0) and sum by (job) (rate(thanos_receive_hashrings_file_refreshes_total[5m])) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Thanos Receive High Hashring File Refresh Failures (instance {{ $labels.instance }})
    description: "Thanos Receive {{$labels.job}} is failing to refresh its hashring file; {{$value | humanize}} of attempts failed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.3.6. Thanos Receive Config Reload Failure

Thanos Receive {{$labels.job}} has not been able to reload hashring configurations.

- alert: ThanosReceiveConfigReloadFailure
  expr: avg by (job) (thanos_receive_config_last_reload_successful) != 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Thanos Receive Config Reload Failure (instance {{ $labels.instance }})
    description: "Thanos Receive {{$labels.job}} has not been able to reload hashring configurations.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.3.7. Thanos Receive No Upload

Thanos Receive {{$labels.instance}} has not uploaded the latest data to object storage.

- alert: ThanosReceiveNoUpload
  expr: (up{job=~".*thanos-receive.*"} - 1) + on (job, instance) (sum by (job, instance) (increase(thanos_shipper_uploads_total{job=~".*thanos-receive.*"}[3h])) == 0)
  for: 3h
  labels:
    severity: critical
  annotations:
    summary: Thanos Receive No Upload (instance {{ $labels.instance }})
    description: "Thanos Receive {{$labels.instance}} has not uploaded the latest data to object storage.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
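This expression returns a series (with value 0) for receivers that are being scraped but whose shipper reported zero block uploads over the last 3 hours; since an alert fires whenever its expression returns any series, those instances trigger the alert. Decomposed as a fragment:

```yaml
# Left side: up{...} - 1 is 0 for targets that are up.
# Right side: present only when increase(uploads[3h]) == 0, i.e. nothing shipped.
# The on (job, instance) match restricts the sum to instances present on both sides.
expr: |
  (up{job=~".*thanos-receive.*"} - 1)
  + on (job, instance)
    (sum by (job, instance) (increase(thanos_shipper_uploads_total{job=~".*thanos-receive.*"}[3h])) == 0)
```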

12.1.4. Thanos Sidecar (2 rules)

wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-sidecar.yml

12.1.4.1. Thanos Sidecar Bucket Operations Failed

Thanos Sidecar {{$labels.instance}} bucket operations are failing ({{ $value | humanize }}/s).

  # Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: ThanosSidecarBucketOperationsFailed
  expr: sum by (job, instance) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-sidecar.*"}[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Sidecar Bucket Operations Failed (instance {{ $labels.instance }})
    description: "Thanos Sidecar {{$labels.instance}} bucket operations are failing ({{ $value | humanize }}/s).\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.4.2. Thanos Sidecar No Connection To Started Prometheus

Thanos Sidecar {{$labels.instance}} is unhealthy.

- alert: ThanosSidecarNoConnectionToStartedPrometheus
  expr: thanos_sidecar_prometheus_up == 0 and on (namespace, pod) prometheus_tsdb_data_replay_duration_seconds != 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Sidecar No Connection To Started Prometheus (instance {{ $labels.instance }})
    description: "Thanos Sidecar {{$labels.instance}} is unhealthy.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.5. Thanos Store (4 rules)

wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-store.yml

12.1.5.1. Thanos Store Grpc Error Rate

Thanos Store {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.

- alert: ThanosStoreGrpcErrorRate
  expr: (sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-store.*"}[5m]))/  sum by (job) (rate(grpc_server_started_total{job=~".*thanos-store.*"}[5m])) * 100 > 5) and sum by (job) (rate(grpc_server_started_total{job=~".*thanos-store.*"}[5m])) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Thanos Store Grpc Error Rate (instance {{ $labels.instance }})
    description: "Thanos Store {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.5.2. Thanos Store Series Gate Latency High

Thanos Store {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for store series gate requests.

- alert: ThanosStoreSeriesGateLatencyHigh
  expr: (histogram_quantile(0.99, sum by (job, le) (rate(thanos_bucket_store_series_gate_duration_seconds_bucket[5m]))) > 2 and sum by (job) (rate(thanos_bucket_store_series_gate_duration_seconds_count[5m])) > 0)
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Thanos Store Series Gate Latency High (instance {{ $labels.instance }})
    description: "Thanos Store {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for store series gate requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.5.3. Thanos Store Bucket High Operation Failures

Thanos Store {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.

- alert: ThanosStoreBucketHighOperationFailures
  expr: (sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-store.*"}[5m])) / sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-store.*"}[5m])) * 100 > 5) and sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-store.*"}[5m])) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Thanos Store Bucket High Operation Failures (instance {{ $labels.instance }})
    description: "Thanos Store {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.5.4. Thanos Store Objstore Operation Latency High

Thanos Store {{$labels.job}} Bucket has a 99th percentile latency of {{$value}} seconds for the bucket operations.

- alert: ThanosStoreObjstoreOperationLatencyHigh
  expr: (histogram_quantile(0.99, sum by (job, le) (rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))) > 2 and  sum by (job) (rate(thanos_objstore_bucket_operation_duration_seconds_count{job=~".*thanos-store.*"}[5m])) > 0)
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Thanos Store Objstore Operation Latency High (instance {{ $labels.instance }})
    description: "Thanos Store {{$labels.job}} Bucket has a 99th percentile latency of {{$value}} seconds for the bucket operations.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.6. Thanos Ruler (11 rules)

wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-ruler.yml

12.1.6.1. Thanos Rule Queue Is Dropping Alerts

Thanos Rule {{$labels.instance}} is failing to queue alerts ({{ $value | humanize }}/s).

- alert: ThanosRuleQueueIsDroppingAlerts
  expr: sum by (job, instance) (rate(thanos_alert_queue_alerts_dropped_total[5m])) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Rule Queue Is Dropping Alerts (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.instance}} is failing to queue alerts ({{ $value | humanize }}/s).\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.6.2. Thanos Rule Sender Is Failing Alerts

Thanos Rule {{$labels.instance}} is failing to send alerts to alertmanager ({{ $value | humanize }}/s).

- alert: ThanosRuleSenderIsFailingAlerts
  expr: sum by (job, instance) (rate(thanos_alert_sender_alerts_dropped_total[5m])) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Rule Sender Is Failing Alerts (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.instance}} is failing to send alerts to alertmanager ({{ $value | humanize }}/s).\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.6.3. Thanos Rule High Rule Evaluation Failures

Thanos Rule {{$labels.instance}} is failing to evaluate {{$value | humanize}}% of rules.

- alert: ThanosRuleHighRuleEvaluationFailures
  expr: (sum by (job, instance) (rate(prometheus_rule_evaluation_failures_total{job=~".*thanos-rule.*"}[5m])) / sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m])) * 100 > 5) and sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m])) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Rule High Rule Evaluation Failures (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.instance}} is failing to evaluate {{$value | humanize}}% of rules.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.6.4. Thanos Rule High Rule Evaluation Warnings

Thanos Rule {{$labels.instance}} has high number of evaluation warnings ({{ $value | humanize }}/s).

  # Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: ThanosRuleHighRuleEvaluationWarnings
  expr: sum by (job, instance) (rate(thanos_rule_evaluation_with_warnings_total[5m])) > 0.05
  for: 15m
  labels:
    severity: info
  annotations:
    summary: Thanos Rule High Rule Evaluation Warnings (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.instance}} has high number of evaluation warnings ({{ $value | humanize }}/s).\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.6.5. Thanos Rule Rule Evaluation Latency High

Thanos Rule {{$labels.instance}} has higher evaluation latency than interval for {{$labels.rule_group}}.

- alert: ThanosRuleRuleEvaluationLatencyHigh
  expr: (sum by (job, instance, rule_group) (prometheus_rule_group_last_duration_seconds{job=~".*thanos-rule.*"}) > sum by (job, instance, rule_group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"}))
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Thanos Rule Rule Evaluation Latency High (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.instance}} has higher evaluation latency than interval for {{$labels.rule_group}}.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.6.6. Thanos Rule Grpc Error Rate

Thanos Rule {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.

- alert: ThanosRuleGrpcErrorRate
  expr: (sum by (job, instance) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-rule.*"}[5m]))/  sum by (job, instance) (rate(grpc_server_started_total{job=~".*thanos-rule.*"}[5m])) * 100 > 5) and sum by (job, instance) (rate(grpc_server_started_total{job=~".*thanos-rule.*"}[5m])) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Thanos Rule Grpc Error Rate (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.6.7. Thanos Rule Config Reload Failure

Thanos Rule {{$labels.job}} has not been able to reload its configuration.

- alert: ThanosRuleConfigReloadFailure
  expr: avg by (job, instance) (thanos_rule_config_last_reload_successful) != 1
  for: 5m
  labels:
    severity: info
  annotations:
    summary: Thanos Rule Config Reload Failure (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.job}} has not been able to reload its configuration.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.6.8. Thanos Rule Query High DNS Failures

Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for query endpoints.

- alert: ThanosRuleQueryHighDNSFailures
  expr: (sum by (job, instance) (rate(thanos_rule_query_apis_dns_failures_total[5m])) / sum by (job, instance) (rate(thanos_rule_query_apis_dns_lookups_total[5m])) * 100 > 1) and sum by (job, instance) (rate(thanos_rule_query_apis_dns_lookups_total[5m])) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Thanos Rule Query High DNS Failures (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for query endpoints.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.6.9. Thanos Rule Alertmanager High DNS Failures

Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints.

- alert: ThanosRuleAlertmanagerHighDNSFailures
  expr: (sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_failures_total[5m])) / sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total[5m])) * 100 > 1) and sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total[5m])) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Thanos Rule Alertmanager High DNS Failures (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.6.10. Thanos Rule No Evaluation For 10 Intervals

Thanos Rule {{$labels.job}} has rule groups that did not evaluate for at least 10x of their expected interval.

- alert: ThanosRuleNoEvaluationFor10Intervals
  expr: time() -  max by (job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job=~".*thanos-rule.*"})>10 * max by (job, instance, group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"})
  for: 5m
  labels:
    severity: info
  annotations:
    summary: Thanos Rule No Evaluation For 10 Intervals (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.job}} has rule groups that did not evaluate for at least 10x of their expected interval.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.6.11. Thanos No Rule Evaluations

Thanos Rule {{$labels.instance}} did not perform any rule evaluations in the past 10 minutes.

- alert: ThanosNoRuleEvaluations
  expr: sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m])) <= 0  and sum by (job, instance) (thanos_rule_loaded_rules) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos No Rule Evaluations (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.instance}} did not perform any rule evaluations in the past 10 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.7. Thanos Bucket Replicate (2 rules)

wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-bucket-replicate.yml

12.1.7.1. Thanos Bucket Replicate Error Rate

Thanos Replicate is failing to run; {{$value | humanize}}% of attempts failed.

- alert: ThanosBucketReplicateErrorRate
  expr: (sum by (job) (rate(thanos_replicate_replication_runs_total{result="error"}[5m])) / on (job) group_left sum by (job) (rate(thanos_replicate_replication_runs_total[5m]))) * 100 >= 10 and sum by (job) (rate(thanos_replicate_replication_runs_total[5m])) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Bucket Replicate Error Rate (instance {{ $labels.instance }})
    description: "Thanos Replicate is failing to run; {{$value | humanize}}% of attempts failed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.7.2. Thanos Bucket Replicate Run Latency

Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for the replicate operations.

- alert: ThanosBucketReplicateRunLatency
  expr: (histogram_quantile(0.99, sum by (job, le) (rate(thanos_replicate_replication_run_duration_seconds_bucket[5m]))) > 20 and  sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_count[5m])) > 0)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Bucket Replicate Run Latency (instance {{ $labels.instance }})
    description: "Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for the replicate operations.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.8. Thanos Component Absent (6 rules)

wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/thanos/thanos-component-absent.yml

12.1.8.1. Thanos Compact Is Down

ThanosCompact has disappeared. Prometheus target for the component cannot be discovered.

- alert: ThanosCompactIsDown
  expr: absent(up{job=~".*thanos-compact.*"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Compact Is Down (instance {{ $labels.instance }})
    description: "ThanosCompact has disappeared. Prometheus target for the component cannot be discovered.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
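A note on `absent()`: it returns a single series carrying only labels taken from equality matchers in the selector, so with a regex matcher like `job=~".*thanos-compact.*"` the resulting series has no `job` or `instance` label, and `{{ $labels.instance }}` in the summary renders empty. If the job name is fixed in your setup, an equality matcher keeps the label on the alert; a sketch (the job name below is an assumption):

```yaml
# With an equality matcher, absent() attaches job="thanos-compact" to the
# synthetic series, so alert templates can reference {{ $labels.job }}.
expr: absent(up{job="thanos-compact"} == 1)
```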

12.1.8.2. Thanos Query Is Down

ThanosQuery has disappeared. Prometheus target for the component cannot be discovered.

- alert: ThanosQueryIsDown
  expr: absent(up{job=~".*thanos-query.*"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Query Is Down (instance {{ $labels.instance }})
    description: "ThanosQuery has disappeared. Prometheus target for the component cannot be discovered.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.8.3. Thanos Receive Is Down

ThanosReceive has disappeared. Prometheus target for the component cannot be discovered.

- alert: ThanosReceiveIsDown
  expr: absent(up{job=~".*thanos-receive.*"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Receive Is Down (instance {{ $labels.instance }})
    description: "ThanosReceive has disappeared. Prometheus target for the component cannot be discovered.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.8.4. Thanos Rule Is Down

ThanosRule has disappeared. Prometheus target for the component cannot be discovered.

- alert: ThanosRuleIsDown
  expr: absent(up{job=~".*thanos-rule.*"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Rule Is Down (instance {{ $labels.instance }})
    description: "ThanosRule has disappeared. Prometheus target for the component cannot be discovered.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.8.5. Thanos Sidecar Is Down

ThanosSidecar has disappeared. Prometheus target for the component cannot be discovered.

- alert: ThanosSidecarIsDown
  expr: absent(up{job=~".*thanos-sidecar.*"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Sidecar Is Down (instance {{ $labels.instance }})
    description: "ThanosSidecar has disappeared. Prometheus target for the component cannot be discovered.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.1.8.6. Thanos Store Is Down

ThanosStore has disappeared. Prometheus target for the component cannot be discovered.

- alert: ThanosStoreIsDown
  expr: absent(up{job=~".*thanos-store.*"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Store Is Down (instance {{ $labels.instance }})
    description: "ThanosStore has disappeared. Prometheus target for the component cannot be discovered.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
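Every rule above sets a `severity` label (`critical`, `warning`, or `info`), which Alertmanager can route on. A minimal routing sketch; the receiver names are assumptions, not part of these rules:

```yaml
# alertmanager.yml (fragment) -- route on the severity label set by the rules above
route:
  receiver: default
  routes:
    - matchers:
        - severity = "critical"
      receiver: oncall-pager     # hypothetical receiver
    - matchers:
        - severity = "warning"
      receiver: team-chat        # hypothetical receiver

receivers:
  - name: default
  - name: oncall-pager
  - name: team-chat
```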