What is the Prometheus alert rule for "Flink job is not running"?

No Flink jobs are currently running. All jobs may have failed or been cancelled. PromQL expression: flink_jobmanager_numRunningJobs == 0. Severity: critical. Duration: 1m.

What is the Prometheus alert rule for "Flink no TaskManagers registered"?

No TaskManagers are registered with the JobManager. The cluster has no processing capacity. PromQL expression: flink_jobmanager_numRegisteredTaskManagers == 0. Severity: critical. Duration: 1m.

What is the Prometheus alert rule for "Flink all task slots used"?

All Flink task slots are in use ({{ $value }} available). New jobs cannot be scheduled. PromQL expression: flink_jobmanager_taskSlotsAvailable == 0. Severity: warning. Duration: 5m.

What is the Prometheus alert rule for "Flink job restart increasing"?

Flink job {{ $labels.job_name }} has restarted {{ $value }} times in the last 5 minutes. PromQL expression: delta(flink_jobmanager_job_numRestarts[5m]) > 1. Severity: warning. Duration: 5m.

What is the Prometheus alert rule for "Flink checkpoint failures"?

Flink job {{ $labels.job_name }} has {{ $value }} failed checkpoints in the last 10 minutes. PromQL expression: delta(flink_jobmanager_job_numberOfFailedCheckpoints[10m]) > 1. Severity: warning. Duration: 5m.

What is the Prometheus alert rule for "Flink checkpoint duration high"?

The last checkpoint of Flink job {{ $labels.job_name }} took {{ $value | humanizeDuration }} to complete. PromQL expression: flink_jobmanager_job_lastCheckpointDuration / 1000 > 60. Severity: warning. Duration: 5m.

What is the Prometheus alert rule for "Flink task backpressured"?

Flink task {{ $labels.task_name }} in job {{ $labels.job_name }} is backpressured. PromQL expression: flink_taskmanager_job_task_isBackPressured == 1. Severity: warning. Duration: 5m.

What is the Prometheus alert rule for "Flink task high backpressure time"?

Flink task {{ $labels.task_name }} is spending {{ $value | humanize }}ms/sec in backpressure. PromQL expression: flink_taskmanager_job_task_backPressuredTimeMsPerSecond > 500. Severity: warning. Duration: 5m.

What is the Prometheus alert rule for "Flink TaskManager GC time high"?

Flink TaskManager {{ $labels.instance }} is spending more than 10% of time in garbage collection. PromQL expression: deriv(flink_taskmanager_Status_JVM_GarbageCollector_All_Time[5m]) > 100. Severity: warning. Duration: 5m.

What is the Prometheus alert rule for "Flink no records processed"?

Flink task {{ $labels.task_name }} has not processed any records in the last 5 minutes. PromQL expression: delta(flink_taskmanager_job_task_numRecordsIn[5m]) == 0 and flink_taskmanager_job_task_numRecordsIn > 0. Severity: warning. Duration: 5m.

Apache Flink Prometheus Alert Rules

12 Prometheus alerting rules for Apache Flink, based on metrics exported by Flink's built-in Prometheus reporter. The rules cover critical and warning conditions. Copy and paste the YAML into a rule file that your Prometheus configuration loads.
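The rules only work once Flink exposes its metrics to Prometheus. As a minimal sketch, the built-in reporter can be enabled in the Flink configuration roughly as follows. The reporter name `prom` and the port are assumptions made for this example, and key names can differ between Flink versions, so check the documentation for the release you run.

```yaml
# Flink configuration (flink-conf.yaml): a sketch, not a complete file.
# "prom" is an arbitrary reporter name chosen for this example.
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
# Port on which each JobManager and TaskManager serves its metrics.
# 9249 is the reporter's default; a range such as 9250-9260 helps when
# several Flink processes share one host.
metrics.reporter.prom.port: 9249
```

With this reporter, metric names carry the `flink_` prefix used in the expressions on this page, for example `flink_jobmanager_numRunningJobs`.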

⚠️ Alert thresholds depend on the nature of your applications. Some queries use arbitrary tolerance thresholds, so tune them for your workload. Building an efficient monitoring platform takes time. 😉

groups:
- name: FlinkPrometheusReporter
  rules:
    - alert: FlinkJobIsNotRunning
      expr: flink_jobmanager_numRunningJobs == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: Flink job is not running (instance {{ $labels.instance }})
        description: "No Flink jobs are currently running. All jobs may have failed or been cancelled.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: FlinkNoTaskManagersRegistered
      expr: flink_jobmanager_numRegisteredTaskManagers == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: Flink no TaskManagers registered (instance {{ $labels.instance }})
        description: "No TaskManagers are registered with the JobManager. The cluster has no processing capacity.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
      # This alert fires when there are no available task slots. Adjust the threshold if your cluster is expected to run at full capacity.
    - alert: FlinkAllTaskSlotsUsed
      expr: flink_jobmanager_taskSlotsAvailable == 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Flink all task slots used (instance {{ $labels.instance }})
        description: "All Flink task slots are in use ({{ $value }} available). New jobs cannot be scheduled.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
      # A single restart may be normal during deployments. Adjust threshold based on restart tolerance.
    - alert: FlinkJobRestartIncreasing
      expr: delta(flink_jobmanager_job_numRestarts[5m]) > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Flink job restart increasing (instance {{ $labels.instance }})
        description: "Flink job {{ $labels.job_name }} has restarted {{ $value }} times in the last 5 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: FlinkCheckpointFailures
      expr: delta(flink_jobmanager_job_numberOfFailedCheckpoints[10m]) > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Flink checkpoint failures (instance {{ $labels.instance }})
        description: "Flink job {{ $labels.job_name }} has {{ $value }} failed checkpoints in the last 10 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
      # Value is converted from milliseconds to seconds for correct humanizeDuration display.
      # Threshold is 60 seconds. Adjust based on your checkpoint interval and state size.
    - alert: FlinkCheckpointDurationHigh
      expr: flink_jobmanager_job_lastCheckpointDuration / 1000 > 60
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Flink checkpoint duration high (instance {{ $labels.instance }})
        description: "The last checkpoint of Flink job {{ $labels.job_name }} took {{ $value | humanizeDuration }} to complete.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: FlinkTaskBackpressured
      expr: flink_taskmanager_job_task_isBackPressured == 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Flink task backpressured (instance {{ $labels.instance }})
        description: "Flink task {{ $labels.task_name }} in job {{ $labels.job_name }} is backpressured.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
      # Fires when a task spends more than 500ms/sec backpressured. This indicates the task cannot keep up with upstream data rate.
    - alert: FlinkTaskHighBackpressureTime
      expr: flink_taskmanager_job_task_backPressuredTimeMsPerSecond > 500
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Flink task high backpressure time (instance {{ $labels.instance }})
        description: "Flink task {{ $labels.task_name }} is spending {{ $value | humanize }}ms/sec in backpressure.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
      # Flink TaskManagers manage their own memory pool. High JVM heap usage (outside managed memory) may indicate memory leaks or misconfiguration.
    - alert: FlinkTaskManagerHeapMemoryHigh
      expr: flink_taskmanager_Status_JVM_Memory_Heap_Used / flink_taskmanager_Status_JVM_Memory_Heap_Max > 0.9 and flink_taskmanager_Status_JVM_Memory_Heap_Max > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Flink TaskManager heap memory high (instance {{ $labels.instance }})
        description: "Flink TaskManager {{ $labels.instance }} heap memory usage is above 90%.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: FlinkJobManagerHeapMemoryHigh
      expr: flink_jobmanager_Status_JVM_Memory_Heap_Used / flink_jobmanager_Status_JVM_Memory_Heap_Max > 0.9 and flink_jobmanager_Status_JVM_Memory_Heap_Max > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Flink JobManager heap memory high (instance {{ $labels.instance }})
        description: "Flink JobManager {{ $labels.instance }} heap memory usage is above 90%.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
      # Flink exposes GC time as a gauge (cumulative milliseconds), so deriv() is used instead of rate().
      # Threshold: more than 100ms/sec of GC time (10% of wall clock). Adjust based on your workload.
    - alert: FlinkTaskManagerGCTimeHigh
      expr: deriv(flink_taskmanager_Status_JVM_GarbageCollector_All_Time[5m]) > 100
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Flink TaskManager GC time high (instance {{ $labels.instance }})
        description: "Flink TaskManager {{ $labels.instance }} is spending more than 10% of time in garbage collection.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
      # Only fires for tasks that have previously received records, to avoid false positives during startup.
    - alert: FlinkNoRecordsProcessed
      expr: delta(flink_taskmanager_job_task_numRecordsIn[5m]) == 0 and flink_taskmanager_job_task_numRecordsIn > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Flink no records processed (instance {{ $labels.instance }})
        description: "Flink task {{ $labels.task_name }} has not processed any records in the last 5 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

6.1. Built-in Prometheus reporter (12 rules)

wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/apache-flink/flink-prometheus-reporter.yml
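Once downloaded, the file has to be referenced from the Prometheus configuration, and the Flink processes have to be scraped. The fragment below is a sketch under assumptions: the rule file path, the scrape job name, and the target hostnames are placeholders, and port 9249 assumes the reporter's default port.

```yaml
# prometheus.yml (fragment). Path, job name, and targets are placeholders.
rule_files:
  - /etc/prometheus/rules/flink-prometheus-reporter.yml

scrape_configs:
  - job_name: flink
    static_configs:
      - targets:
          # One entry per JobManager and TaskManager metrics endpoint.
          - jobmanager.example.internal:9249
          - taskmanager-1.example.internal:9249
```

Running `promtool check rules flink-prometheus-reporter.yml` before reloading Prometheus catches syntax errors in the rule file.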

critical

6.1.1. Flink job is not running

No Flink jobs are currently running. All jobs may have failed or been cancelled.

- alert: FlinkJobIsNotRunning
  expr: flink_jobmanager_numRunningJobs == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Flink job is not running (instance {{ $labels.instance }})
    description: "No Flink jobs are currently running. All jobs may have failed or been cancelled.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

critical

6.1.2. Flink no TaskManagers registered

No TaskManagers are registered with the JobManager. The cluster has no processing capacity.

- alert: FlinkNoTaskManagersRegistered
  expr: flink_jobmanager_numRegisteredTaskManagers == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Flink no TaskManagers registered (instance {{ $labels.instance }})
    description: "No TaskManagers are registered with the JobManager. The cluster has no processing capacity.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
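The expression of this rule can be checked against synthetic data with a promtool unit test. The sketch below is illustrative only: the file name and the instance labels are hypothetical, and it tests the PromQL expression rather than the full alert with its annotations.

```yaml
# flink-rules-test.yml (hypothetical file name)
# Run with: promtool test rules flink-rules-test.yml
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # A JobManager that has lost all of its TaskManagers.
      - series: 'flink_jobmanager_numRegisteredTaskManagers{instance="jm-a:9249"}'
        values: '0+0x5'
      # A healthy JobManager with two registered TaskManagers.
      - series: 'flink_jobmanager_numRegisteredTaskManagers{instance="jm-b:9249"}'
        values: '2+0x5'
    promql_expr_test:
      - expr: flink_jobmanager_numRegisteredTaskManagers == 0
        eval_time: 3m
        exp_samples:
          # Only the JobManager without TaskManagers should match.
          - labels: 'flink_jobmanager_numRegisteredTaskManagers{instance="jm-a:9249"}'
            value: 0
```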

warning

6.1.3. Flink all task slots used

All Flink task slots are in use ({{ $value }} available). New jobs cannot be scheduled.

  # This alert fires when there are no available task slots. Adjust the threshold if your cluster is expected to run at full capacity.
- alert: FlinkAllTaskSlotsUsed
  expr: flink_jobmanager_taskSlotsAvailable == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Flink all task slots used (instance {{ $labels.instance }})
    description: "All Flink task slots are in use ({{ $value }} available). New jobs cannot be scheduled.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

warning

6.1.4. Flink job restart increasing

Flink job {{ $labels.job_name }} has restarted {{ $value }} times in the last 5 minutes.

  # A single restart may be normal during deployments. Adjust threshold based on restart tolerance.
- alert: FlinkJobRestartIncreasing
  expr: delta(flink_jobmanager_job_numRestarts[5m]) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Flink job restart increasing (instance {{ $labels.instance }})
    description: "Flink job {{ $labels.job_name }} has restarted {{ $value }} times in the last 5 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

warning

6.1.5. Flink checkpoint failures

Flink job {{ $labels.job_name }} has {{ $value }} failed checkpoints in the last 10 minutes.

- alert: FlinkCheckpointFailures
  expr: delta(flink_jobmanager_job_numberOfFailedCheckpoints[10m]) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Flink checkpoint failures (instance {{ $labels.instance }})
    description: "Flink job {{ $labels.job_name }} has {{ $value }} failed checkpoints in the last 10 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

warning

6.1.6. Flink checkpoint duration high

The last checkpoint of Flink job {{ $labels.job_name }} took {{ $value | humanizeDuration }} to complete.

  # Value is converted from milliseconds to seconds for correct humanizeDuration display.
  # Threshold is 60 seconds. Adjust based on your checkpoint interval and state size.
- alert: FlinkCheckpointDurationHigh
  expr: flink_jobmanager_job_lastCheckpointDuration / 1000 > 60
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Flink checkpoint duration high (instance {{ $labels.instance }})
    description: "The last checkpoint of Flink job {{ $labels.job_name }} took {{ $value | humanizeDuration }} to complete.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

warning

6.1.7. Flink task backpressured

Flink task {{ $labels.task_name }} in job {{ $labels.job_name }} is backpressured.

- alert: FlinkTaskBackpressured
  expr: flink_taskmanager_job_task_isBackPressured == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Flink task backpressured (instance {{ $labels.instance }})
    description: "Flink task {{ $labels.task_name }} in job {{ $labels.job_name }} is backpressured.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

warning

6.1.8. Flink task high backpressure time

Flink task {{ $labels.task_name }} is spending {{ $value | humanize }}ms/sec in backpressure.

  # Fires when a task spends more than 500ms/sec backpressured. This indicates the task cannot keep up with upstream data rate.
- alert: FlinkTaskHighBackpressureTime
  expr: flink_taskmanager_job_task_backPressuredTimeMsPerSecond > 500
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Flink task high backpressure time (instance {{ $labels.instance }})
    description: "Flink task {{ $labels.task_name }} is spending {{ $value | humanize }}ms/sec in backpressure.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

warning

6.1.9. Flink TaskManager heap memory high

Flink TaskManager {{ $labels.instance }} heap memory usage is above 90%.

  # Flink TaskManagers manage their own memory pool. High JVM heap usage (outside managed memory) may indicate memory leaks or misconfiguration.
- alert: FlinkTaskManagerHeapMemoryHigh
  expr: flink_taskmanager_Status_JVM_Memory_Heap_Used / flink_taskmanager_Status_JVM_Memory_Heap_Max > 0.9 and flink_taskmanager_Status_JVM_Memory_Heap_Max > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Flink TaskManager heap memory high (instance {{ $labels.instance }})
    description: "Flink TaskManager {{ $labels.instance }} heap memory usage is above 90%.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

warning

6.1.10. Flink JobManager heap memory high

Flink JobManager {{ $labels.instance }} heap memory usage is above 90%.

- alert: FlinkJobManagerHeapMemoryHigh
  expr: flink_jobmanager_Status_JVM_Memory_Heap_Used / flink_jobmanager_Status_JVM_Memory_Heap_Max > 0.9 and flink_jobmanager_Status_JVM_Memory_Heap_Max > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Flink JobManager heap memory high (instance {{ $labels.instance }})
    description: "Flink JobManager {{ $labels.instance }} heap memory usage is above 90%.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

warning

6.1.11. Flink TaskManager GC time high

Flink TaskManager {{ $labels.instance }} is spending more than 10% of time in garbage collection.

  # Flink exposes GC time as a gauge (cumulative milliseconds), so deriv() is used instead of rate().
  # Threshold: more than 100ms/sec of GC time (10% of wall clock). Adjust based on your workload.
- alert: FlinkTaskManagerGCTimeHigh
  expr: deriv(flink_taskmanager_Status_JVM_GarbageCollector_All_Time[5m]) > 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Flink TaskManager GC time high (instance {{ $labels.instance }})
    description: "Flink TaskManager {{ $labels.instance }} is spending more than 10% of time in garbage collection.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

warning

6.1.12. Flink no records processed

Flink task {{ $labels.task_name }} has not processed any records in the last 5 minutes.

  # Only fires for tasks that have previously received records, to avoid false positives during startup.
- alert: FlinkNoRecordsProcessed
  expr: delta(flink_taskmanager_job_task_numRecordsIn[5m]) == 0 and flink_taskmanager_job_task_numRecordsIn > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Flink no records processed (instance {{ $labels.instance }})
    description: "Flink task {{ $labels.task_name }} has not processed any records in the last 5 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
