⚠️ Alert thresholds depend on the nature of your applications. Some queries use arbitrary tolerance thresholds, so tune them to your environment. Building an efficient monitoring platform takes time. 😉

IPMI Prometheus Alert Rules

17 Prometheus alerting rules for IPMI, using metrics exported by prometheus-community/ipmi_exporter. These rules cover critical and warning conditions. Copy and paste the YAML into a Prometheus rule file, or download the complete set:

wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/ipmi/ipmi-exporter.yml
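Prometheus loads alerting rules from files listed under `rule_files`, and each rule has to sit inside a named group. The following is a minimal sketch of both pieces, assuming a rules directory of `/etc/prometheus/rules/` and a group name of `ipmi`; both are illustrative choices, not something this page prescribes.

```yaml
# prometheus.yml (fragment): point Prometheus at the rule files.
# The directory is an assumption; adjust it to your own layout.
rule_files:
  - /etc/prometheus/rules/*.yml
---
# /etc/prometheus/rules/ipmi-exporter.yml: rules copied by hand from this page
# go under `rules:` of a group. The group name `ipmi` is hypothetical.
groups:
  - name: ipmi
    rules:
      - alert: IPMICollectorDown
        expr: ipmi_up == 0
        for: 5m
        labels:
          severity: warning
```

Running `promtool check rules` against the rule file before reloading Prometheus catches indentation and syntax mistakes early.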
warning

1.4.1. IPMI collector down

IPMI collector {{ $labels.collector }} on {{ $labels.instance }} failed to scrape sensor data. Check FreeIPMI tools and BMC connectivity.

# The ipmi_up metric is per-collector. A value of 0 means the collector could not retrieve data from the BMC.
- alert: IPMICollectorDown
  expr: ipmi_up == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI collector down (instance {{ $labels.instance }})
    description: "IPMI collector {{ $labels.collector }} on {{ $labels.instance }} failed to scrape sensor data. Check FreeIPMI tools and BMC connectivity.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
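`ipmi_up == 0` can only fire while the exporter still answers scrapes. If the exporter process or the scrape target is unreachable, the `ipmi_up` series goes stale and the rule stays silent. A possible companion rule is sketched below, assuming the scrape job is named `ipmi` (a hypothetical name; use the `job_name` from your own scrape configuration). The alert name is also illustrative and not part of the published rule set.

```yaml
# Hypothetical companion rule: fires when Prometheus cannot scrape the exporter at all.
# The job label value "ipmi" is an assumption.
- alert: IPMIExporterTargetDown
  expr: up{job="ipmi"} == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI exporter target down (instance {{ $labels.instance }})
    description: "Prometheus cannot scrape the IPMI exporter target {{ $labels.instance }}.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```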
warning

1.4.2. IPMI temperature sensor warning

IPMI temperature sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.

# State values: 0=nominal, 1=warning, 2=critical. Thresholds are defined in the BMC firmware.
- alert: IPMITemperatureSensorWarning
  expr: ipmi_temperature_state == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI temperature sensor warning (instance {{ $labels.instance }})
    description: "IPMI temperature sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
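Because the state metrics reflect thresholds stored in the BMC firmware, they are only as good as those thresholds. Where the firmware limits are missing or too lax, a rule on the raw reading is one option. The sketch below assumes the exporter's `ipmi_temperature_celsius` metric and uses 80 °C purely as an illustrative limit; the alert name is hypothetical.

```yaml
# Hypothetical rule with an explicit threshold instead of the BMC-defined state.
# 80 is an arbitrary example value; pick a limit that fits your hardware.
- alert: IPMITemperatureHigh
  expr: ipmi_temperature_celsius > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI temperature high (instance {{ $labels.instance }})
    description: "IPMI temperature sensor {{ $labels.name }} on {{ $labels.instance }} reads {{ printf \"%.1f\" $value }} degrees Celsius.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```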
critical

1.4.3. IPMI temperature sensor critical

IPMI temperature sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. Immediate attention required to prevent hardware damage.

- alert: IPMITemperatureSensorCritical
  expr: ipmi_temperature_state == 2
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI temperature sensor critical (instance {{ $labels.instance }})
    description: "IPMI temperature sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. Immediate attention required to prevent hardware damage.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
warning

1.4.4. IPMI fan speed sensor warning

IPMI fan sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.

- alert: IPMIFanSpeedSensorWarning
  expr: ipmi_fan_speed_state == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI fan speed sensor warning (instance {{ $labels.instance }})
    description: "IPMI fan sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
critical

1.4.5. IPMI fan speed sensor critical

IPMI fan sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. A fan may have failed.

- alert: IPMIFanSpeedSensorCritical
  expr: ipmi_fan_speed_state == 2
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI fan speed sensor critical (instance {{ $labels.instance }})
    description: "IPMI fan sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. A fan may have failed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
critical

1.4.6. IPMI fan speed zero

IPMI fan {{ $labels.name }} on {{ $labels.instance }} reports 0 RPM. The fan may have failed.

- alert: IPMIFanSpeedZero
  expr: ipmi_fan_speed_rpm == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: IPMI fan speed zero (instance {{ $labels.instance }})
    description: "IPMI fan {{ $labels.name }} on {{ $labels.instance }} reports 0 RPM. The fan may have failed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
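On some hardware, unpopulated fan slots may also report 0 RPM, which would make this rule fire permanently. One way to handle that is to exclude those sensors by name. The sketch below assumes the empty slots are called `FAN5` through `FAN8`; these names are hypothetical, so check the `name` label values your BMC actually reports.

```yaml
# Variant of IPMIFanSpeedZero that ignores fan slots known to be empty.
# The regex "FAN[5-8]" is an assumption about sensor naming on one chassis.
- alert: IPMIFanSpeedZero
  expr: ipmi_fan_speed_rpm{name!~"FAN[5-8]"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: IPMI fan speed zero (instance {{ $labels.instance }})
    description: "IPMI fan {{ $labels.name }} on {{ $labels.instance }} reports 0 RPM. The fan may have failed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```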
warning

1.4.7. IPMI voltage sensor warning

IPMI voltage sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.

- alert: IPMIVoltageSensorWarning
  expr: ipmi_voltage_state == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI voltage sensor warning (instance {{ $labels.instance }})
    description: "IPMI voltage sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
critical

1.4.8. IPMI voltage sensor critical

IPMI voltage sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. Power supply or motherboard issue possible.

- alert: IPMIVoltageSensorCritical
  expr: ipmi_voltage_state == 2
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI voltage sensor critical (instance {{ $labels.instance }})
    description: "IPMI voltage sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. Power supply or motherboard issue possible.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
warning

1.4.9. IPMI current sensor warning

IPMI current sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.

- alert: IPMICurrentSensorWarning
  expr: ipmi_current_state == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI current sensor warning (instance {{ $labels.instance }})
    description: "IPMI current sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
critical

1.4.10. IPMI current sensor critical

IPMI current sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state.

- alert: IPMICurrentSensorCritical
  expr: ipmi_current_state == 2
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI current sensor critical (instance {{ $labels.instance }})
    description: "IPMI current sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
warning

1.4.11. IPMI power sensor warning

IPMI power sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.

- alert: IPMIPowerSensorWarning
  expr: ipmi_power_state == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI power sensor warning (instance {{ $labels.instance }})
    description: "IPMI power sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
critical

1.4.12. IPMI power sensor critical

IPMI power sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state.

- alert: IPMIPowerSensorCritical
  expr: ipmi_power_state == 2
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI power sensor critical (instance {{ $labels.instance }})
    description: "IPMI power sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
critical

1.4.13. IPMI generic sensor critical

IPMI sensor {{ $labels.name }} (type={{ $labels.type }}) on {{ $labels.instance }} is in critical state.

# Catches any sensor type not covered by the specific temperature/fan/voltage/current/power alerts.
- alert: IPMIGenericSensorCritical
  expr: ipmi_sensor_state == 2
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: IPMI generic sensor critical (instance {{ $labels.instance }})
    description: "IPMI sensor {{ $labels.name }} (type={{ $labels.type }}) on {{ $labels.instance }} is in critical state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
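The rule set pairs every typed sensor with a warning and a critical rule, but the generic sensor only has a critical one. If a warning counterpart is wanted, it could look like the sketch below. This assumes `ipmi_sensor_state` uses the same 0/1/2 encoding as the typed sensor states; the alert name is hypothetical.

```yaml
# Hypothetical warning counterpart to IPMIGenericSensorCritical.
# Assumes state 1 means "warning", as for the typed sensor metrics.
- alert: IPMIGenericSensorWarning
  expr: ipmi_sensor_state == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI generic sensor warning (instance {{ $labels.instance }})
    description: "IPMI sensor {{ $labels.name }} (type={{ $labels.type }}) on {{ $labels.instance }} is in warning state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```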
critical

1.4.14. IPMI chassis power off

IPMI reports chassis power is off on {{ $labels.instance }}. The server may have shut down unexpectedly.

- alert: IPMIChassisPowerOff
  expr: ipmi_chassis_power_state == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI chassis power off (instance {{ $labels.instance }})
    description: "IPMI reports chassis power is off on {{ $labels.instance }}. The server may have shut down unexpectedly.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
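A powered-off chassis may also trip fan, voltage, or other sensor alerts for the same host, which adds noise to an already obvious incident. An Alertmanager inhibit rule can mute those while `IPMIChassisPowerOff` is firing. The fragment below is a sketch that assumes all alerts carry the same `instance` label and that the regex covers the alert names you want to mute.

```yaml
# alertmanager.yml (fragment): mute per-sensor IPMI alerts while the chassis is off.
# The target regex is an assumption about which alerts are redundant in that case.
inhibit_rules:
  - source_matchers:
      - 'alertname = IPMIChassisPowerOff'
    target_matchers:
      - 'alertname =~ "IPMI(Temperature|Fan|Voltage|Current|Power).*"'
    equal:
      - instance
```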
critical

1.4.15. IPMI chassis drive fault

IPMI reports a drive fault on {{ $labels.instance }}. Check disk health.

# The metric uses inverted logic: 1=no fault, 0=fault detected.
- alert: IPMIChassisDriveFault
  expr: ipmi_chassis_drive_fault_state == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI chassis drive fault (instance {{ $labels.instance }})
    description: "IPMI reports a drive fault on {{ $labels.instance }}. Check disk health.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
critical

1.4.16. IPMI chassis cooling fault

IPMI reports a cooling/fan fault on {{ $labels.instance }}. Check fans and airflow.

# The metric uses inverted logic: 1=no fault, 0=fault detected.
- alert: IPMIChassisCoolingFault
  expr: ipmi_chassis_cooling_fault_state == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: IPMI chassis cooling fault (instance {{ $labels.instance }})
    description: "IPMI reports a cooling/fan fault on {{ $labels.instance }}. Check fans and airflow.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
warning

1.4.17. IPMI SEL almost full

IPMI System Event Log on {{ $labels.instance }} has only {{ printf "%.0f" $value }} bytes free. Clear the SEL to prevent loss of new events.

# SEL storage is typically very limited (e.g., 16 KB). When full, new events may be dropped.
- alert: IPMISELAlmostFull
  expr: ipmi_sel_free_space_bytes < 512
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: IPMI SEL almost full (instance {{ $labels.instance }})
    description: "IPMI System Event Log on {{ $labels.instance }} has only {{ printf \"%.0f\" $value }} bytes free. Clear the SEL to prevent loss of new events.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
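A fixed 512-byte threshold only warns once the log is nearly full. For earlier notice, a trend-based rule using PromQL's `predict_linear` is one option. The sketch below extrapolates the last 6 hours of `ipmi_sel_free_space_bytes` one day ahead; the window, horizon, `for` duration, and alert name are all illustrative assumptions.

```yaml
# Hypothetical trend-based rule: fires if the SEL is on course to fill within 24 hours.
# The 6h lookback and 24h horizon are example values, not recommendations.
- alert: IPMISELFillingUp
  expr: predict_linear(ipmi_sel_free_space_bytes[6h], 24 * 3600) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: IPMI SEL filling up (instance {{ $labels.instance }})
    description: "IPMI System Event Log on {{ $labels.instance }} is predicted to run out of space within 24 hours.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```

Since SEL entries tend to arrive in bursts, a linear projection can be noisy; the longer `for` duration is meant to dampen that.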