⚠️ Alert thresholds depend on the nature of your applications; some queries ship with arbitrary tolerance thresholds, so tune them to your workloads. Building an efficient monitoring platform takes time. 😉

Apache Spark Prometheus Alert Rules

8 Prometheus alerting rules for Apache Spark, exported via the built-in Prometheus endpoints (PrometheusServlet + PrometheusResource). The rules cover critical and warning conditions; copy the YAML into your Prometheus rules configuration.

Spark exposes metrics via two built-in endpoints:
- PrometheusServlet: master/worker/driver metrics at /metrics/prometheus/ (ports 8080, 8081, 4040)
- PrometheusResource: executor metrics at /metrics/executors/prometheus/ (port 4040, requires spark.ui.prometheus.enabled=true in Spark 3.x)
Metric names from PrometheusServlet embed a dynamic namespace (the application ID) for driver metrics, which makes static PromQL queries against them challenging.
Configuration: spark.metrics.conf.*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
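A minimal Prometheus scrape configuration for the master endpoint might look like the sketch below. The job name, target, and relabel regex are illustrative assumptions, not part of the rule set; the relabeling moves the dynamic application ID out of driver metric names and into a label so that PromQL queries can stay static. Check the regex against the metric names your Spark version actually emits.

```yaml
scrape_configs:
  - job_name: spark-master              # hypothetical job name
    metrics_path: /metrics/prometheus/
    static_configs:
      - targets: ['spark-master:8080']  # illustrative target
    metric_relabel_configs:
      # Driver metrics embed the application ID in the metric name, e.g.
      # metrics_app_20240101120000_0001_driver_... . Extract the ID into a
      # label, then rewrite the name to a stable form. The regex below is an
      # assumption; adjust it to your actual metric names.
      - source_labels: [__name__]
        regex: 'metrics_(app_[0-9_]+)_(.+)'
        target_label: application_id
        replacement: '$1'
      - source_labels: [__name__]
        regex: 'metrics_(app_[0-9_]+)_(.+)'
        target_label: __name__
        replacement: 'metrics_$2'
```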
wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/apache-spark/spark-prometheus.yml
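Once downloaded, the file is loaded like any other Prometheus rule file (the path below is illustrative):

```yaml
# prometheus.yml
rule_files:
  - /etc/prometheus/rules/spark-prometheus.yml
```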

6.2.1. Spark no alive workers

No Spark workers are alive. The cluster has no processing capacity.

- alert: SparkNoAliveWorkers
  expr: metrics_master_aliveWorkers_Value == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Spark no alive workers (instance {{ $labels.instance }})
    description: "No Spark workers are alive. The cluster has no processing capacity.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
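The rule above can be sanity-checked offline with promtool's unit-test mode. A minimal test, assuming the rules were saved as spark-prometheus.yml and the instance label is illustrative:

```yaml
# spark-alerts-test.yml -- run with: promtool test rules spark-alerts-test.yml
rule_files:
  - spark-prometheus.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Three consecutive samples at 0 -> pending after the first eval,
      # firing once the 1m "for" duration has elapsed.
      - series: 'metrics_master_aliveWorkers_Value{instance="spark-master:8080"}'
        values: '0 0 0'
    alert_rule_test:
      - eval_time: 2m
        alertname: SparkNoAliveWorkers
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: spark-master:8080
```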

6.2.2. Spark too many waiting apps

Spark has {{ $value }} applications waiting for resources.

  # Adjust the threshold based on your cluster's typical queuing behavior.
- alert: SparkTooManyWaitingApps
  expr: metrics_master_waitingApps_Value > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Spark too many waiting apps (instance {{ $labels.instance }})
    description: "Spark has {{ $value }} applications waiting for resources.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

6.2.3. Spark worker memory exhausted

Spark worker {{ $labels.instance }} has no free memory ({{ $value }}MB free).

- alert: SparkWorkerMemoryExhausted
  expr: metrics_worker_memFree_MB_Value == 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Spark worker memory exhausted (instance {{ $labels.instance }})
    description: "Spark worker {{ $labels.instance }} has no free memory ({{ $value }}MB free).\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

6.2.4. Spark worker cores exhausted

Spark worker {{ $labels.instance }} has no free cores.

  # Fires when a worker has no free cores. This may be normal under high load but can indicate capacity issues.
- alert: SparkWorkerCoresExhausted
  expr: metrics_worker_coresFree_Value == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Spark worker cores exhausted (instance {{ $labels.instance }})
    description: "Spark worker {{ $labels.instance }} has no free cores.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

6.2.5. Spark executor high GC time

Spark executor {{ $labels.executor_id }} in {{ $labels.application_name }} is spending too much time in GC.

  # Fires when more than 10% of executor time is spent in garbage collection.
  # This metric comes from the PrometheusResource endpoint (/metrics/executors/prometheus/).
- alert: SparkExecutorHighGCTime
  expr: metrics_executor_totalGCTime_seconds_total / metrics_executor_totalDuration > 0.1 and metrics_executor_totalDuration > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Spark executor high GC time (instance {{ $labels.instance }})
    description: "Spark executor {{ $labels.executor_id }} in {{ $labels.application_name }} is spending too much time in GC.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

6.2.6. Spark executor all tasks failing

Spark executor {{ $labels.executor_id }} has only failing tasks ({{ $value }} failed, 0 completed).

- alert: SparkExecutorAllTasksFailing
  expr: metrics_executor_failedTasks_total > 0 and metrics_executor_completedTasks_total == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Spark executor all tasks failing (instance {{ $labels.instance }})
    description: "Spark executor {{ $labels.executor_id }} has only failing tasks ({{ $value }} failed, 0 completed).\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

6.2.7. Spark executor high task failure rate

Spark executor {{ $labels.executor_id }} has a task failure rate above 10%.

- alert: SparkExecutorHighTaskFailureRate
  expr: metrics_executor_failedTasks_total / metrics_executor_totalTasks_total > 0.1 and metrics_executor_totalTasks_total > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Spark executor high task failure rate (instance {{ $labels.instance }})
    description: "Spark executor {{ $labels.executor_id }} has a task failure rate above 10%.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
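Because failedTasks and totalTasks are lifetime counters, the ratio above dilutes recent failures on long-lived executors. A windowed variant (a sketch, assuming the same metric names; the alert name and 10m window are illustrative) reacts to recent behavior instead:

```yaml
# Hypothetical variant: failure rate over the last 10 minutes only.
- alert: SparkExecutorHighRecentTaskFailureRate
  expr: increase(metrics_executor_failedTasks_total[10m]) / increase(metrics_executor_totalTasks_total[10m]) > 0.1 and increase(metrics_executor_totalTasks_total[10m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Spark executor high recent task failure rate (instance {{ $labels.instance }})
    description: "More than 10% of tasks started in the last 10 minutes failed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```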

6.2.8. Spark executor high disk spill

Spark executor {{ $labels.executor_id }} is spilling data to disk. Consider increasing executor memory.

  # diskUsed is a gauge, not a counter — do not use rate(). Threshold of 1GB is a rough default.
  # Disk spilling indicates insufficient memory for the workload.
- alert: SparkExecutorHighDiskSpill
  expr: metrics_executor_diskUsed_bytes > 1e9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Spark executor high disk spill (instance {{ $labels.instance }})
    description: "Spark executor {{ $labels.executor_id }} is spilling data to disk. Consider increasing executor memory.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
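Since diskUsed is a gauge that can fluctuate as blocks are written and evicted, smoothing it over a window avoids flapping near the threshold. A hedged variant (same metric name assumed; the alert name and 15m window are illustrative):

```yaml
# Hypothetical variant: only fire on sustained disk usage, not brief spikes.
- alert: SparkExecutorSustainedDiskSpill
  expr: avg_over_time(metrics_executor_diskUsed_bytes[15m]) > 1e9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Spark executor sustained disk spill (instance {{ $labels.instance }})
    description: "Executor disk usage has averaged above 1GB over 15 minutes. Consider increasing executor memory.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```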