⚠️ Alert thresholds depend on the nature of your applications. Some queries on this page use arbitrary tolerance thresholds, so adjust them to your workload. Building an efficient monitoring platform takes time. 😉

Jaeger Prometheus Alert Rules

8 Prometheus alerting rules for Jaeger, based on the metrics exposed by its embedded exporter. All eight rules in this section have warning severity. Copy and paste the YAML into a Prometheus rule file that your Prometheus configuration loads.
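
The rules on this page are shown as bare list items, so they need a surrounding rule group before Prometheus will load them. A minimal sketch of such a rule file follows, using rule 12.9.8 from this page as the example; the file name and group name are assumptions chosen for illustration.

```yaml
# jaeger-rules.yml (hypothetical file name)
groups:
  - name: jaeger-embedded-exporter   # hypothetical group name
    rules:
      # Paste any of the rules from this page here, indented under `rules:`.
      - alert: JaegerQueryRequestFailures
        expr: 100 * sum(rate(jaeger_query_requests_total{result="err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_query_requests_total[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_query_requests_total[1m])) by (instance, job, namespace) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Jaeger query request failures (instance {{ $labels.instance }})
          description: "Jaeger query on {{ $labels.instance }} is failing {{ $value | humanize }}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```

Running `promtool check rules jaeger-rules.yml` before reloading Prometheus should catch indentation or PromQL syntax mistakes introduced while pasting.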

12.9. Embedded exporter (8 rules)

wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/jaeger/embedded-exporter.yml
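
Once downloaded, the file still has to be referenced from the main Prometheus configuration. The fragment below is a sketch that assumes the downloaded file is already a complete rule file (a top-level `groups:` list) and that it was moved to the hypothetical path shown; adjust the path to your own layout.

```yaml
# prometheus.yml (fragment)
rule_files:
  # Hypothetical location of the downloaded file.
  - /etc/prometheus/rules/jaeger/embedded-exporter.yml
```

Prometheus reads rule files at startup and on configuration reload, so a reload is needed after adding the entry.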

12.9.1. Jaeger agent HTTP server errors

Jaeger agent on {{ $labels.instance }} is experiencing {{ $value | humanize }}% HTTP server errors.

- alert: JaegerAgentHTTPServerErrors
  expr: 100 * sum(rate(jaeger_agent_http_server_errors_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_agent_http_server_total[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_agent_http_server_total[1m])) by (instance, job, namespace) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Jaeger agent HTTP server errors (instance {{ $labels.instance }})
    description: "Jaeger agent on {{ $labels.instance }} is experiencing {{ $value | humanize }}% HTTP server errors.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
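
A rule like this can be exercised offline with `promtool test rules`. The sketch below only asserts cases in which the alert must stay silent (no traffic, and an error ratio below the 1% threshold), because asserting a firing alert would require matching the rendered annotations exactly. The test file name, the label values, and the assumption that the downloaded `embedded-exporter.yml` sits in the same directory are all illustrative.

```yaml
# jaeger_agent_test.yml (hypothetical file name)
# Run with: promtool test rules jaeger_agent_test.yml
rule_files:
  - embedded-exporter.yml   # assumed: the rule file downloaded above

evaluation_interval: 1m

tests:
  # Case 1: both counters are flat, so both rates are 0 and the ratio is NaN.
  # The alert must not fire.
  - interval: 15s
    input_series:
      - series: 'jaeger_agent_http_server_total{instance="agent-1", job="jaeger-agent", namespace="tracing"}'
        values: '500+0x120'
      - series: 'jaeger_agent_http_server_errors_total{instance="agent-1", job="jaeger-agent", namespace="tracing"}'
        values: '3+0x120'
    alert_rule_test:
      - eval_time: 20m
        alertname: JaegerAgentHTTPServerErrors
        exp_alerts: []

  # Case 2: 1 error per 200 requests (0.5%), which is below the 1% threshold.
  # The alert must not fire.
  - interval: 15s
    input_series:
      - series: 'jaeger_agent_http_server_total{instance="agent-1", job="jaeger-agent", namespace="tracing"}'
        values: '0+200x120'
      - series: 'jaeger_agent_http_server_errors_total{instance="agent-1", job="jaeger-agent", namespace="tracing"}'
        values: '0+1x120'
    alert_rule_test:
      - eval_time: 20m
        alertname: JaegerAgentHTTPServerErrors
        exp_alerts: []
```

The 15s sample interval is chosen so that each `[1m]` range contains several samples; with a scrape interval close to one minute, `rate(...[1m])` may not see enough samples to return a value.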

12.9.2. Jaeger client RPC request errors

Jaeger client on {{ $labels.instance }} is experiencing {{ $value | humanize }}% RPC HTTP errors.

- alert: JaegerClientRPCRequestErrors
  expr: 100 * sum(rate(jaeger_client_jaeger_rpc_http_requests{status_code=~"4xx|5xx"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_client_jaeger_rpc_http_requests[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_client_jaeger_rpc_http_requests[1m])) by (instance, job, namespace) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Jaeger client RPC request errors (instance {{ $labels.instance }})
    description: "Jaeger client on {{ $labels.instance }} is experiencing {{ $value | humanize }}% RPC HTTP errors.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.9.3. Jaeger client spans dropped

Jaeger client on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans.

- alert: JaegerClientSpansDropped
  expr: 100 * sum(rate(jaeger_reporter_spans{result=~"dropped|err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_reporter_spans[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_reporter_spans[1m])) by (instance, job, namespace) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Jaeger client spans dropped (instance {{ $labels.instance }})
    description: "Jaeger client on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.9.4. Jaeger agent spans dropped

Jaeger agent on {{ $labels.instance }} is dropping {{ $value | humanize }}% of span batches.

- alert: JaegerAgentSpansDropped
  expr: 100 * sum(rate(jaeger_agent_reporter_batches_failures_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_agent_reporter_batches_submitted_total[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_agent_reporter_batches_submitted_total[1m])) by (instance, job, namespace) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Jaeger agent spans dropped (instance {{ $labels.instance }})
    description: "Jaeger agent on {{ $labels.instance }} is dropping {{ $value | humanize }}% of span batches.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.9.5. Jaeger collector dropping spans

Jaeger collector on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans.

- alert: JaegerCollectorDroppingSpans
  expr: 100 * sum(rate(jaeger_collector_spans_dropped_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_collector_spans_received_total[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_collector_spans_received_total[1m])) by (instance, job, namespace) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Jaeger collector dropping spans (instance {{ $labels.instance }})
    description: "Jaeger collector on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
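
Since thresholds depend on your applications, one option is to pair a warning rule with a stricter companion that escalates when the loss becomes severe. The rule below is a hypothetical sketch and is not part of the published rule set: the alert name, the 10% threshold, the 5m hold time and the `critical` severity are assumptions to tune for your own traffic.

```yaml
# Hypothetical companion rule (not part of the published rule set).
- alert: JaegerCollectorDroppingSpansCritical   # illustrative name
  # Same ratio as the warning rule, with an assumed 10% threshold.
  expr: 100 * sum(rate(jaeger_collector_spans_dropped_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_collector_spans_received_total[1m])) by (instance, job, namespace) > 10 and sum(rate(jaeger_collector_spans_received_total[1m])) by (instance, job, namespace) > 0
  for: 5m   # assumed shorter hold time for faster escalation
  labels:
    severity: critical
  annotations:
    summary: Jaeger collector dropping spans (instance {{ $labels.instance }})
    description: "Jaeger collector on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```

If both rules are loaded, an Alertmanager inhibit rule can suppress the warning while the critical alert is firing, which avoids duplicate notifications for the same instance.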

12.9.6. Jaeger sampling update failing

Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of sampling policy updates.

- alert: JaegerSamplingUpdateFailing
  expr: 100 * sum(rate(jaeger_sampler_queries{result="err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_sampler_queries[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_sampler_queries[1m])) by (instance, job, namespace) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Jaeger sampling update failing (instance {{ $labels.instance }})
    description: "Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of sampling policy updates.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.9.7. Jaeger throttling update failing

Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of throttling policy updates.

- alert: JaegerThrottlingUpdateFailing
  expr: 100 * sum(rate(jaeger_throttler_updates{result="err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_throttler_updates[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_throttler_updates[1m])) by (instance, job, namespace) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Jaeger throttling update failing (instance {{ $labels.instance }})
    description: "Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of throttling policy updates.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

12.9.8. Jaeger query request failures

Jaeger query on {{ $labels.instance }} is failing {{ $value | humanize }}% of requests.

- alert: JaegerQueryRequestFailures
  expr: 100 * sum(rate(jaeger_query_requests_total{result="err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_query_requests_total[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_query_requests_total[1m])) by (instance, job, namespace) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Jaeger query request failures (instance {{ $labels.instance }})
    description: "Jaeger query on {{ $labels.instance }} is failing {{ $value | humanize }}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"