⚠️ Alert thresholds depend on the nature of your applications. Some queries may have arbitrary tolerance thresholds. Building an efficient monitoring platform takes time. 😉

Ceph Prometheus Alert Rules

13 Prometheus alerting rules for Ceph, exported via the embedded exporter (the ceph-mgr Prometheus module). These rules cover critical and warning conditions; copy the YAML below into your Prometheus rule files.

10.1. Embedded exporter (13 rules)

wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/ceph/embedded-exporter.yml
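To load the downloaded file, reference it from the `rule_files` section of your Prometheus configuration and reload. A minimal fragment (the path is an assumption; adjust it to your layout), which you can validate with `promtool check rules` before reloading:

```yaml
# prometheus.yml (fragment)
# Path below is illustrative - point it at wherever you saved the file.
rule_files:
  - /etc/prometheus/rules/ceph/embedded-exporter.yml
```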

10.1.1. Ceph State

Ceph instance unhealthy

  # ceph_health_status: 0=HEALTH_OK, 1=HEALTH_WARN, 2=HEALTH_ERR.
  # This rule fires on any non-OK state. Split into ==1 (warning) and ==2 (critical) if you want separate severity levels.
- alert: CephState
  expr: ceph_health_status != 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Ceph State (instance {{ $labels.instance }})
    description: "Ceph instance unhealthy\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
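The split suggested in the comment above could look like the following sketch (rule names and `for` durations are illustrative, not part of the exported rule set):

```yaml
# HEALTH_WARN: tolerate brief warnings before paging.
- alert: CephStateWarning
  expr: ceph_health_status == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Ceph cluster in HEALTH_WARN (instance {{ $labels.instance }})
    description: "VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
# HEALTH_ERR: page immediately.
- alert: CephStateError
  expr: ceph_health_status == 2
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Ceph cluster in HEALTH_ERR (instance {{ $labels.instance }})
    description: "VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```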

10.1.2. Ceph monitor clock skew

Ceph monitor clock skew detected. Please check NTP and hardware clock settings.

- alert: CephMonitorClockSkew
  expr: abs(ceph_monitor_clock_skew_seconds) > 0.2
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Ceph monitor clock skew (instance {{ $labels.instance }})
    description: "Ceph monitor clock skew detected. Please check NTP and hardware clock settings.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.3. Ceph monitor low space

Ceph monitor storage is low.

- alert: CephMonitorLowSpace
  expr: ceph_monitor_avail_percent < 10
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Ceph monitor low space (instance {{ $labels.instance }})
    description: "Ceph monitor storage is low.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.4. Ceph OSD Down

Ceph Object Storage Daemon Down

- alert: CephOSDDown
  expr: ceph_osd_up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Ceph OSD Down (instance {{ $labels.instance }})
    description: "Ceph Object Storage Daemon Down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.5. Ceph high OSD latency

Ceph Object Storage Daemon latency is high. Please check that the OSD is not stuck in an abnormal state.

  # Threshold of 5000ms (5 seconds). Adjust based on your expected OSD performance.
- alert: CephHighOSDLatency
  expr: ceph_osd_apply_latency_ms > 5000
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: Ceph high OSD latency (instance {{ $labels.instance }})
    description: "Ceph Object Storage Daemon latency is high. Please check that the OSD is not stuck in an abnormal state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.6. Ceph OSD near full

A Ceph OSD is dangerously full. Please add more disks.

  # Ceph internally triggers OSD_NEARFULL based on the nearfull_ratio (default 85%).
  # ceph_health_detail exposes named health checks as individual time series.
- alert: CephOSDNearFull
  expr: ceph_health_detail{name="OSD_NEARFULL"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Ceph OSD near full (instance {{ $labels.instance }})
    description: "A Ceph OSD is dangerously full. Please add more disks.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
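If a tunable per-OSD threshold is preferred over Ceph's built-in nearfull check, the raw usage metrics can be queried directly. A sketch, assuming the ceph-mgr Prometheus module's `ceph_osd_stat_bytes_used` and `ceph_osd_stat_bytes` metrics; the 80% threshold is arbitrary:

```yaml
# Fires per OSD rather than on the cluster-wide OSD_NEARFULL health check.
- alert: CephOSDHighUsage
  expr: ceph_osd_stat_bytes_used / ceph_osd_stat_bytes > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Ceph OSD usage above 80% (instance {{ $labels.instance }})
    description: "VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```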

10.1.7. Ceph OSD reweighted

A Ceph OSD has been reweighted (weight < 1), which may indicate that resizing or rebalancing is taking too long.

- alert: CephOSDReweighted
  expr: ceph_osd_weight < 1
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Ceph OSD reweighted (instance {{ $labels.instance }})
    description: "A Ceph OSD has been reweighted (weight < 1), which may indicate that resizing or rebalancing is taking too long.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.8. Ceph PG down

Some Ceph placement groups are down. Please ensure that all data is still available.

- alert: CephPGDown
  expr: ceph_pg_down > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Ceph PG down (instance {{ $labels.instance }})
    description: "Some Ceph placement groups are down. Please ensure that all data is still available.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.9. Ceph PG incomplete

Some Ceph placement groups are incomplete. Please ensure that all data is still available.

- alert: CephPGIncomplete
  expr: ceph_pg_incomplete > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Ceph PG incomplete (instance {{ $labels.instance }})
    description: "Some Ceph placement groups are incomplete. Please ensure that all data is still available.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.10. Ceph PG inconsistent

Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.

- alert: CephPGInconsistent
  expr: ceph_pg_inconsistent > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Ceph PG inconsistent (instance {{ $labels.instance }})
    description: "Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.11. Ceph PG activation long

Some Ceph placement groups are taking too long to activate.

- alert: CephPGActivationLong
  expr: ceph_pg_activating > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Ceph PG activation long (instance {{ $labels.instance }})
    description: "Some Ceph placement groups are taking too long to activate.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.12. Ceph PG backfill full

Some Ceph placement groups are located on a full OSD and may become unavailable soon. Please check the OSDs, adjust weights, or reconfigure CRUSH rules.

- alert: CephPGBackfillFull
  expr: ceph_pg_backfill_toofull > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Ceph PG backfill full (instance {{ $labels.instance }})
    description: "Some Ceph placement groups are located on a full OSD and may become unavailable soon. Please check the OSDs, adjust weights, or reconfigure CRUSH rules.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

10.1.13. Ceph PG unavailable

Some Ceph placement groups are unavailable.

- alert: CephPGUnavailable
  expr: ceph_pg_total - ceph_pg_active > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Ceph PG unavailable (instance {{ $labels.instance }})
    description: "Some Ceph placement groups are unavailable.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"