⚠️ Alert thresholds depend on the nature of your applications. Some queries may have arbitrary tolerance thresholds. Building an efficient monitoring platform takes time. 😉

Ceph Prometheus Alert Rules

13 Prometheus alerting rules for Ceph, exported via the embedded exporter (the ceph-mgr Prometheus module). These rules cover critical and warning conditions; copy the YAML below into your Prometheus rule files.

10.1. Embedded exporter (13 rules)

wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/ceph/embedded-exporter.yml
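To load the downloaded file, reference it from the `rule_files` section of your Prometheus configuration and reload. A minimal fragment (the path is an assumption; adjust it to your layout), which you can validate with `promtool check rules` before reloading:

```yaml
# prometheus.yml (fragment)
# Path below is illustrative - point it at wherever you saved the file.
rule_files:
  - /etc/prometheus/rules/ceph/embedded-exporter.yml
```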

10.1.1. Ceph State

Ceph instance unhealthy

  # ceph_health_status: 0=HEALTH_OK, 1=HEALTH_WARN, 2=HEALTH_ERR.
  # This rule fires on any non-OK state. Split into ==1 (warning) and ==2 (critical) if you want separate severity levels.
- alert: CephState
  expr: ceph_health_status != 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Ceph State (instance {{ $labels.instance }})
    description: "Ceph instance unhealthy\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
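The split suggested in the comment above could look like the following sketch (rule names and `for` durations are illustrative, not part of the exported rule set):

```yaml
# HEALTH_WARN: tolerate brief warnings before paging.
- alert: CephStateWarning
  expr: ceph_health_status == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Ceph cluster in HEALTH_WARN (instance {{ $labels.instance }})
    description: "VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
# HEALTH_ERR: page immediately.
- alert: CephStateError
  expr: ceph_health_status == 2
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Ceph cluster in HEALTH_ERR (instance {{ $labels.instance }})
    description: "VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```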

10.1.2. Ceph monitor clock skew

Ceph monitor clock skew detected. Please check NTP and hardware clock settings.

- alert: CephMonitorClockSkew
  expr: abs(ceph_monitor_clock_skew_seconds) > 0.2
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Ceph monitor clock skew (instance {{ $labels.instance }})
    description: "Ceph monitor clock skew detected. Please check NTP and hardware clock settings.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.3. Ceph monitor low space

Ceph monitor storage is low.

- alert: CephMonitorLowSpace
  expr: ceph_monitor_avail_percent < 10
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Ceph monitor low space (instance {{ $labels.instance }})
    description: "Ceph monitor storage is low.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.4. Ceph OSD Down

Ceph Object Storage Daemon Down

- alert: CephOSDDown
  expr: ceph_osd_up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Ceph OSD Down (instance {{ $labels.instance }})
    description: "Ceph Object Storage Daemon Down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.5. Ceph high OSD latency

Ceph Object Storage Daemon latency is high. Please check that the OSD is not stuck in an abnormal state.

  # Threshold of 5000ms (5 seconds). Adjust based on your expected OSD performance.
- alert: CephHighOSDLatency
  expr: ceph_osd_apply_latency_ms > 5000
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: Ceph high OSD latency (instance {{ $labels.instance }})
    description: "Ceph Object Storage Daemon latency is high. Please check that the OSD is not stuck in an abnormal state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.6. Ceph OSD near full

A Ceph OSD is dangerously full. Please add more disks.

  # Ceph internally triggers OSD_NEARFULL based on the nearfull_ratio (default 85%).
  # ceph_health_detail exposes named health checks as individual time series.
- alert: CephOSDNearFull
  expr: ceph_health_detail{name="OSD_NEARFULL"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Ceph OSD near full (instance {{ $labels.instance }})
    description: "A Ceph OSD is dangerously full. Please add more disks.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
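If a tunable per-OSD threshold is preferred over Ceph's built-in nearfull check, the raw usage metrics can be queried directly. A sketch, assuming the ceph-mgr Prometheus module's `ceph_osd_stat_bytes_used` and `ceph_osd_stat_bytes` metrics; the 80% threshold is arbitrary:

```yaml
# Fires per OSD rather than on the cluster-wide OSD_NEARFULL health check.
- alert: CephOSDHighUsage
  expr: ceph_osd_stat_bytes_used / ceph_osd_stat_bytes > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Ceph OSD usage above 80% (instance {{ $labels.instance }})
    description: "VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```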

10.1.7. Ceph OSD reweighted

A Ceph OSD has been reweighted (weight < 1), which may indicate that resizing or rebalancing is taking too long.

- alert: CephOSDReweighted
  expr: ceph_osd_weight < 1
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Ceph OSD reweighted (instance {{ $labels.instance }})
    description: "A Ceph OSD has been reweighted (weight < 1), which may indicate that resizing or rebalancing is taking too long.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.8. Ceph PG down

Some Ceph placement groups are down. Please ensure that all data is still available.

- alert: CephPGDown
  expr: ceph_pg_down > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Ceph PG down (instance {{ $labels.instance }})
    description: "Some Ceph placement groups are down. Please ensure that all data is still available.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.9. Ceph PG incomplete

Some Ceph placement groups are incomplete. Please ensure that all data is still available.

- alert: CephPGIncomplete
  expr: ceph_pg_incomplete > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Ceph PG incomplete (instance {{ $labels.instance }})
    description: "Some Ceph placement groups are incomplete. Please ensure that all data is still available.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.10. Ceph PG inconsistent

Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.

- alert: CephPGInconsistent
  expr: ceph_pg_inconsistent > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Ceph PG inconsistent (instance {{ $labels.instance }})
    description: "Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.11. Ceph PG activation long

Some Ceph placement groups are taking too long to activate.

- alert: CephPGActivationLong
  expr: ceph_pg_activating > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Ceph PG activation long (instance {{ $labels.instance }})
    description: "Some Ceph placement groups are taking too long to activate.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.12. Ceph PG backfill full

Some Ceph placement groups are located on a full OSD and may become unavailable soon. Please check the OSDs, adjust weights, or reconfigure CRUSH rules.

- alert: CephPGBackfillFull
  expr: ceph_pg_backfill_toofull > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Ceph PG backfill full (instance {{ $labels.instance }})
    description: "Some Ceph placement groups are located on a full OSD and may become unavailable soon. Please check the OSDs, adjust weights, or reconfigure CRUSH rules.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.13. Ceph PG unavailable

Some Ceph placement groups are unavailable.

- alert: CephPGUnavailable
  expr: ceph_pg_total - ceph_pg_active > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Ceph PG unavailable (instance {{ $labels.instance }})
    description: "Some Ceph placement groups are unavailable.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"