⚠️ Alert thresholds depend on the nature of your applications. Some queries use arbitrary tolerance thresholds; tune them to your environment. Building an efficient monitoring platform takes time. 😉

Loki Prometheus Alert Rules

4 Prometheus alerting rules for Loki, based on metrics from Loki's embedded exporter. These rules cover critical and warning conditions; copy the YAML into your Prometheus rules configuration.

12.2. Embedded exporter (4 rules)

wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/loki/embedded-exporter.yml
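To load the downloaded rules, one option is to reference the file from the rule_files section of your Prometheus configuration (the path below is an example; adjust it to wherever you saved the file):

```yaml
# prometheus.yml (excerpt) -- the rules path below is an example
rule_files:
  - "rules/loki/embedded-exporter.yml"
```

Prometheus picks up rule file changes on a configuration reload (e.g. SIGHUP, or the /-/reload endpoint when --web.enable-lifecycle is set).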
Severity: warning

12.2.1. Loki process too many restarts

A loki process had too many restarts (target {{ $labels.instance }})

- alert: LokiProcessTooManyRestarts
  expr: changes(process_start_time_seconds{job=~".*loki.*"}[15m]) > 2
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Loki process too many restarts (instance {{ $labels.instance }})
    description: "A loki process had too many restarts (target {{ $labels.instance }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Severity: critical

12.2.2. Loki request errors

The {{ $labels.job }} and {{ $labels.route }} are experiencing {{ printf "%.2f" $value }}% errors.

- alert: LokiRequestErrors
  expr: 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])) by (namespace, job, route) / sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 10 and sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 0
  for: 15m
  labels:
    severity: critical
  annotations:
    summary: Loki request errors (instance {{ $labels.instance }})
    description: "The {{ $labels.job }} and {{ $labels.route }} are experiencing {{ printf \"%.2f\" $value }}% errors.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
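The expression above computes, per (namespace, job, route), the percentage of requests with a 5xx status code, and the trailing `and ... > 0` clause keeps the alert from firing on series with no traffic. A minimal sketch of that arithmetic (the rate values are hypothetical, standing in for what the PromQL rate() calls would return):

```python
# Hypothetical per-(namespace, job, route) request rates, in requests/sec,
# as produced by rate(loki_request_duration_seconds_count[1m]).
rate_5xx = 0.3    # requests matching status_code=~"5.."
rate_total = 2.0  # requests across all status codes

def error_percentage(errors_per_sec: float, total_per_sec: float) -> float:
    """Mirror of: 100 * rate(5xx) / rate(total)."""
    if total_per_sec == 0:
        # The `and ... > 0` clause in the rule excludes this case entirely.
        return 0.0
    return 100.0 * errors_per_sec / total_per_sec

pct = error_percentage(rate_5xx, rate_total)
fires = pct > 10 and rate_total > 0  # the rule's two conditions
print(f"{pct:.2f}% errors, alert fires: {fires}")
```

With these sample rates the error ratio is 15%, which exceeds the 10% threshold, so the alert would fire once the condition has held for 15 minutes.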
Severity: critical

12.2.3. Loki request panic

{{ $labels.job }} is experiencing {{ $value | humanize }} panic(s) in the last 5 minutes.

- alert: LokiRequestPanic
  expr: sum(increase(loki_panic_total[5m])) by (namespace, job) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Loki request panic (instance {{ $labels.instance }})
    description: "{{ $labels.job }} is experiencing {{ $value | humanize }} panic(s) in the last 5 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Severity: critical

12.2.4. Loki request latency

The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency.

- alert: LokiRequestLatency
  expr: histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket{route!~"(?i).*tail.*"}[5m])) by (namespace, job, route, le)) > 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Loki request latency (instance {{ $labels.instance }})
    description: "The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}s 99th percentile latency.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
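The rule uses histogram_quantile() to estimate the 99th-percentile latency from Loki's cumulative latency buckets. A simplified sketch of that estimation, assuming hypothetical bucket counts (real PromQL also merges buckets across series and handles more edge cases):

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Simplified sketch of PromQL histogram_quantile(): find the bucket
    containing the q-th rank, then linearly interpolate within it.
    `buckets` is a sorted list of (le upper bound, cumulative count),
    ending with the +Inf bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # Rank falls in the +Inf bucket: cap at the highest finite bound.
                return prev_le
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return float("nan")

# Hypothetical cumulative buckets for loki_request_duration_seconds (seconds):
buckets = [(0.25, 50), (0.5, 80), (1.0, 95), (2.0, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, buckets)
print(f"p99 = {p99:.2f}s, alert fires: {p99 > 1}")
```

Here the 99th rank (99 of 100 observations) falls in the 1.0–2.0s bucket, interpolating to 1.8s, which exceeds the rule's 1-second threshold. Note the rule excludes tail routes (route!~"(?i).*tail.*"), since long-lived tailing requests would otherwise skew the latency percentile.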