⚠️ Alert thresholds depend on the nature of your applications. Some queries may have arbitrary tolerance thresholds. Building an efficient monitoring platform takes time. 😉

Hadoop Prometheus Alert Rules

10 Prometheus alerting rules for Hadoop, based on metrics exposed via the JMX exporter (hadoop/jmx_exporter). These rules cover critical and warning conditions; copy and paste the YAML into your Prometheus rules configuration.

6.3. hadoop/jmx_exporter (10 rules)

wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/hadoop/jmx_exporter.yml
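To load the downloaded file, reference it from rule_files in your Prometheus configuration, and make sure the scrape job names match the job="..." selectors used in the rules. A minimal sketch, in which the file path, hostnames, and exporter port are assumptions rather than values from this page:

```yaml
# prometheus.yml (fragment)
rule_files:
  - rules/jmx_exporter.yml   # path where you saved the downloaded rules file

scrape_configs:
  # Job names must match the job="..." selectors in the rules below.
  - job_name: hadoop-namenode
    static_configs:
      - targets: ["namenode.example.com:12345"]          # hypothetical JMX exporter address
  - job_name: hadoop-resourcemanager
    static_configs:
      - targets: ["resourcemanager.example.com:12345"]   # hypothetical JMX exporter address
```

After a reload, the rules appear under the Rules page of the Prometheus UI, where you can confirm each expression returns data before relying on the alerts.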

6.3.1. Hadoop Name Node Down

The Hadoop NameNode service is unavailable.

  # When targets are managed via service discovery, a disappeared target goes stale rather than reporting up==0,
  # so this alert may not fire. Prefer application-level availability metrics if available.
  # Rename job="hadoop-namenode" to match the actual job name in your Prometheus scrape config.
- alert: HadoopNameNodeDown
  expr: up{job="hadoop-namenode"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Hadoop Name Node Down (instance {{ $labels.instance }})
    description: "The Hadoop NameNode service is unavailable.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

6.3.2. Hadoop Resource Manager Down

The Hadoop ResourceManager service is unavailable.

  # When targets are managed via service discovery, a disappeared target goes stale rather than reporting up==0,
  # so this alert may not fire. Prefer application-level availability metrics if available.
  # Rename job="hadoop-resourcemanager" to match the actual job name in your Prometheus scrape config.
- alert: HadoopResourceManagerDown
  expr: up{job="hadoop-resourcemanager"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Hadoop Resource Manager Down (instance {{ $labels.instance }})
    description: "The Hadoop ResourceManager service is unavailable.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
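As the comments in the two rules above note, up == 0 may never fire when a service-discovery-managed target disappears, because its series simply goes stale. One hedged alternative is to alert on the absence of the series itself. This is a sketch, not part of the upstream rule set, and the job name is an assumption that must match your scrape configuration:

```yaml
# Sketch: fires when no up series exists for the job at all
# (e.g. the target was removed from service discovery).
- alert: HadoopNameNodeAbsent
  expr: absent(up{job="hadoop-namenode"})   # job name is an assumption
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Hadoop NameNode target absent (job hadoop-namenode)
    description: "No up series found for job hadoop-namenode; the target may have disappeared from service discovery."
```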

6.3.3. Hadoop Data Node Out Of Service

The Hadoop DataNode is not sending heartbeats.

- alert: HadoopDataNodeOutOfService
  expr: hadoop_datanode_last_heartbeat == 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Hadoop Data Node Out Of Service (instance {{ $labels.instance }})
    description: "The Hadoop DataNode is not sending heartbeats.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

6.3.4. Hadoop HDFS Disk Space Low

Available HDFS disk space is running low.

- alert: HadoopHDFSDiskSpaceLow
  expr: (hadoop_hdfs_bytes_total - hadoop_hdfs_bytes_used) / hadoop_hdfs_bytes_total < 0.1 and hadoop_hdfs_bytes_total > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Hadoop HDFS Disk Space Low (instance {{ $labels.instance }})
    description: "Available HDFS disk space is running low.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

6.3.5. Hadoop Map Reduce Task Failures

There is an unusually high number of MapReduce task failures.

- alert: HadoopMapReduceTaskFailures
  expr: increase(hadoop_mapreduce_task_failures_total[1h]) > 100
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: Hadoop Map Reduce Task Failures (instance {{ $labels.instance }})
    description: "There is an unusually high number of MapReduce task failures.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

6.3.6. Hadoop Resource Manager Memory High

The Hadoop ResourceManager is approaching its memory limit.

- alert: HadoopResourceManagerMemoryHigh
  expr: hadoop_resourcemanager_memory_bytes / hadoop_resourcemanager_memory_max_bytes > 0.8 and hadoop_resourcemanager_memory_max_bytes > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Hadoop Resource Manager Memory High (instance {{ $labels.instance }})
    description: "The Hadoop ResourceManager is approaching its memory limit.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

6.3.7. Hadoop YARN Container Allocation Failures

There is a significant number of YARN container allocation failures.

- alert: HadoopYARNContainerAllocationFailures
  expr: increase(hadoop_yarn_container_allocation_failures_total[1h]) > 10
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Hadoop YARN Container Allocation Failures (instance {{ $labels.instance }})
    description: "There is a significant number of YARN container allocation failures.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

6.3.8. Hadoop HBase Region Count High

The HBase cluster has an unusually high number of regions.

- alert: HadoopHBaseRegionCountHigh
  expr: hadoop_hbase_region_count > 5000
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Hadoop HBase Region Count High (instance {{ $labels.instance }})
    description: "The HBase cluster has an unusually high number of regions.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

6.3.9. Hadoop HBase Region Server Heap Low

HBase Region Servers are running low on heap space.

- alert: HadoopHBaseRegionServerHeapLow
  expr: hadoop_hbase_region_server_heap_bytes / hadoop_hbase_region_server_max_heap_bytes > 0.8 and hadoop_hbase_region_server_max_heap_bytes > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Hadoop HBase Region Server Heap Low (instance {{ $labels.instance }})
    description: "HBase Region Servers are running low on heap space.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

6.3.10. Hadoop HBase Write Requests Latency High

HBase Write Requests are experiencing high latency.

- alert: HadoopHBaseWriteRequestsLatencyHigh
  expr: hadoop_hbase_write_requests_latency_seconds > 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Hadoop HBase Write Requests Latency High (instance {{ $labels.instance }})
    description: "HBase Write Requests are experiencing high latency.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
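The severity labels carried by these rules (critical, warning) can drive notification routing in Alertmanager. A minimal sketch, assuming Alertmanager v0.22+ matcher syntax; the receiver names are hypothetical placeholders for your own integrations:

```yaml
# alertmanager.yml (fragment)
route:
  receiver: default
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall-pager    # hypothetical receiver, e.g. a paging integration
    - matchers:
        - severity="warning"
      receiver: team-chat       # hypothetical receiver, e.g. a chat webhook
receivers:
  - name: default
  - name: oncall-pager
  - name: team-chat
```

Routing on severity keeps page-worthy conditions (NameNode or ResourceManager down, MapReduce task failures) separate from tunable capacity warnings such as HDFS disk space or heap usage.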