What is the Prometheus alert rule for "Pulsar subscription high number of backlog entries"?

The number of subscription backlog entries is over 5k. PromQL expression: sum(pulsar_subscription_back_log) by (subscription) > 5000. Severity: warning. Duration: 1h.

What is the Prometheus alert rule for "Pulsar subscription very high number of backlog entries"?

The number of subscription backlog entries is over 100k. PromQL expression: sum(pulsar_subscription_back_log) by (subscription) > 100000. Severity: critical. Duration: 1h.

What is the Prometheus alert rule for "Pulsar topic large backlog storage size"?

The topic backlog storage size is over 5 GB. PromQL expression: sum(pulsar_storage_size) by (topic) > 5*1024*1024*1024. Severity: warning. Duration: 1h.

What is the Prometheus alert rule for "Pulsar topic very large backlog storage size"?

The topic backlog storage size is over 20 GB. PromQL expression: sum(pulsar_storage_size) by (topic) > 20*1024*1024*1024. Severity: critical. Duration: 1h.

What is the Prometheus alert rule for "Pulsar high write latency"?

Pulsar topic {{ $labels.topic }} has {{ $value }} storage write operations exceeding the maximum latency bucket (> 1000ms). PromQL expression: sum(pulsar_storage_write_latency_le_overflow > 0) by (topic). Severity: critical. Duration: 1h.

What is the Prometheus alert rule for "Pulsar large message payload"?

Pulsar topic {{ $labels.topic }} has {{ $value }} message entries exceeding the maximum size bucket (> 1MB). PromQL expression: sum(pulsar_entry_size_le_overflow > 0) by (topic). Severity: warning. Duration: 1h.

What is the Prometheus alert rule for "Pulsar read only bookies"?

Observing Readonly Bookies. PromQL expression: count(bookie_SERVER_STATUS{} == 0) by (pod). Severity: critical. Duration: 5m.

What is the Prometheus alert rule for "Pulsar high number of function errors"?

Pulsar function {{ $labels.name }} has more than 10 errors per second ({{ $value | printf "%.2f" }}/s). PromQL expression: sum(rate(pulsar_function_user_exceptions_total[1m]) + rate(pulsar_function_system_exceptions_total[1m])) by (name) > 10. Severity: critical. Duration: 1m.

What is the Prometheus alert rule for "Pulsar high number of sink errors"?

Pulsar sink {{ $labels.name }} has more than 10 errors per second ({{ $value | printf "%.2f" }}/s). PromQL expression: sum(rate(pulsar_sink_sink_exceptions_total[1m])) by (name) > 10. Severity: critical. Duration: 1m.

Pulsar Prometheus Alert Rules

What is the Prometheus alert rule for "Pulsar high ledger disk usage"?

Observing Ledger Disk Usage (> 75%). PromQL expression: sum(bookie_ledger_dir__pulsar_data_bookkeeper_ledgers_usage) by (kubernetes_pod_name) > 75. Severity: critical. Duration: 1h.

10 Prometheus alerting rules for Pulsar, based on the metrics exposed by its embedded exporter. The rules cover critical and warning conditions. Copy and paste the YAML into a Prometheus rule file.

⚠️ Alert thresholds depend on the nature of your applications. Some queries may have arbitrary tolerance thresholds. Building an efficient monitoring platform takes time. 😉

groups:
- name: EmbeddedExporter
  rules:
    - alert: PulsarSubscriptionHighNumberOfBacklogEntries
      expr: sum(pulsar_subscription_back_log) by (subscription) > 5000
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: Pulsar subscription high number of backlog entries (instance {{ $labels.instance }})
        description: "The number of subscription backlog entries is over 5k\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PulsarSubscriptionVeryHighNumberOfBacklogEntries
      expr: sum(pulsar_subscription_back_log) by (subscription) > 100000
      for: 1h
      labels:
        severity: critical
      annotations:
        summary: Pulsar subscription very high number of backlog entries (instance {{ $labels.instance }})
        description: "The number of subscription backlog entries is over 100k\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PulsarTopicLargeBacklogStorageSize
      expr: sum(pulsar_storage_size) by (topic) > 5*1024*1024*1024
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: Pulsar topic large backlog storage size (instance {{ $labels.instance }})
        description: "The topic backlog storage size is over 5 GB\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PulsarTopicVeryLargeBacklogStorageSize
      expr: sum(pulsar_storage_size) by (topic) > 20*1024*1024*1024
      for: 1h
      labels:
        severity: critical
      annotations:
        summary: Pulsar topic very large backlog storage size (instance {{ $labels.instance }})
        description: "The topic backlog storage size is over 20 GB\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    # pulsar_storage_write_latency_le_overflow is the overflow bucket of Pulsar's non-standard histogram.
    # It counts write operations exceeding all defined latency bounds (> 1000ms).
    - alert: PulsarHighWriteLatency
      expr: sum(pulsar_storage_write_latency_le_overflow > 0) by (topic)
      for: 1h
      labels:
        severity: critical
      annotations:
        summary: Pulsar high write latency (instance {{ $labels.instance }})
        description: "Pulsar topic {{ $labels.topic }} has {{ $value }} storage write operations exceeding the maximum latency bucket (> 1000ms)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    # pulsar_entry_size_le_overflow is the overflow bucket of Pulsar's non-standard histogram.
    # It counts message entries exceeding all defined size bounds.
    - alert: PulsarLargeMessagePayload
      expr: sum(pulsar_entry_size_le_overflow > 0) by (topic)
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: Pulsar large message payload (instance {{ $labels.instance }})
        description: "Pulsar topic {{ $labels.topic }} has {{ $value }} message entries exceeding the maximum size bucket (> 1MB)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    # This metric name is path-dependent and may differ based on your BookKeeper data directory configuration.
    # Adjust the metric name to match your actual ledger directory path.
    - alert: PulsarHighLedgerDiskUsage
      expr: sum(bookie_ledger_dir__pulsar_data_bookkeeper_ledgers_usage) by (kubernetes_pod_name) > 75
      for: 1h
      labels:
        severity: critical
      annotations:
        summary: Pulsar high ledger disk usage (instance {{ $labels.instance }})
        description: "Observing Ledger Disk Usage (> 75%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PulsarReadOnlyBookies
      expr: count(bookie_SERVER_STATUS{} == 0) by (pod)
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: Pulsar read only bookies (instance {{ $labels.instance }})
        description: "Observing Readonly Bookies\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PulsarHighNumberOfFunctionErrors
      expr: sum(rate(pulsar_function_user_exceptions_total[1m]) + rate(pulsar_function_system_exceptions_total[1m])) by (name) > 10
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: Pulsar high number of function errors (instance {{ $labels.instance }})
        description: "Pulsar function {{ $labels.name }} has more than 10 errors per second ({{ $value | printf \"%.2f\" }}/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    - alert: PulsarHighNumberOfSinkErrors
      expr: sum(rate(pulsar_sink_sink_exceptions_total[1m])) by (name) > 10
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: Pulsar high number of sink errors (instance {{ $labels.instance }})
        description: "Pulsar sink {{ $labels.name }} has more than 10 errors per second ({{ $value | printf \"%.2f\" }}/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
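
Before loading the group, it can be checked offline with promtool's rule unit tests. The following is a minimal sketch, assuming the group above is saved as embedded-exporter.yml next to the test file. The test file name, the subscription and topic label values, and the sample backlog values are all hypothetical.

```yaml
# pulsar-rules-test.yml (hypothetical file name)
# Run with: promtool test rules pulsar-rules-test.yml
rule_files:
  - embedded-exporter.yml   # assumed location of the rule group

evaluation_interval: 1m

tests:
  # Backlog is above 5000, but the 1h "for" window has not elapsed yet:
  # the alert is still pending, so no firing alert is expected.
  - interval: 1m
    input_series:
      - series: 'pulsar_subscription_back_log{subscription="sub-a", topic="topic-a"}'
        values: '6000x30'
    alert_rule_test:
      - eval_time: 30m
        alertname: PulsarSubscriptionHighNumberOfBacklogEntries
        exp_alerts: []

  # Backlog stays below the threshold for two hours: the alert never fires.
  - interval: 1m
    input_series:
      - series: 'pulsar_subscription_back_log{subscription="sub-b", topic="topic-b"}'
        values: '4000x120'
    alert_rule_test:
      - eval_time: 2h
        alertname: PulsarSubscriptionHighNumberOfBacklogEntries
        exp_alerts: []
```

A firing case needs an exp_alerts entry whose exp_labels and exp_annotations match the rendered alert exactly, so it is left out of this sketch.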

3.4. Embedded exporter (10 rules)

wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/refs/heads/master/dist/rules/pulsar/embedded-exporter.yml
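
Once downloaded, the file has to be referenced from the main Prometheus configuration through rule_files. A minimal sketch, assuming the file is stored under /etc/prometheus/rules/ (a hypothetical path, adjust it to your deployment):

```yaml
# prometheus.yml (fragment)
rule_files:
  # Hypothetical location of the downloaded rule file.
  - /etc/prometheus/rules/embedded-exporter.yml
```

Running promtool check rules embedded-exporter.yml validates the file before Prometheus is reloaded.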

3.4.1. Pulsar subscription high number of backlog entries

Severity: warning

The number of subscription backlog entries is over 5k

- alert: PulsarSubscriptionHighNumberOfBacklogEntries
  expr: sum(pulsar_subscription_back_log) by (subscription) > 5000
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: Pulsar subscription high number of backlog entries (instance {{ $labels.instance }})
    description: "The number of subscription backlog entries is over 5k\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
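
Because the expression aggregates with by (subscription), the resulting series carries only the subscription label, so {{ $labels.instance }} in the summary renders as an empty string. A possible variant, shown here as a sketch rather than the upstream rule, references the label that survives the aggregation:

```yaml
- alert: PulsarSubscriptionHighNumberOfBacklogEntries
  expr: sum(pulsar_subscription_back_log) by (subscription) > 5000
  for: 1h
  labels:
    severity: warning
  annotations:
    # Variant: use the label kept by "by (subscription)" instead of instance.
    summary: Pulsar subscription high number of backlog entries (subscription {{ $labels.subscription }})
    description: "The number of subscription backlog entries is over 5k\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```

The same consideration applies to the other rules that aggregate by topic, name, or pod.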

3.4.2. Pulsar subscription very high number of backlog entries

Severity: critical

The number of subscription backlog entries is over 100k

- alert: PulsarSubscriptionVeryHighNumberOfBacklogEntries
  expr: sum(pulsar_subscription_back_log) by (subscription) > 100000
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: Pulsar subscription very high number of backlog entries (instance {{ $labels.instance }})
    description: "The number of subscription backlog entries is over 100k\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

3.4.3. Pulsar topic large backlog storage size

Severity: warning

The topic backlog storage size is over 5 GB

- alert: PulsarTopicLargeBacklogStorageSize
  expr: sum(pulsar_storage_size) by (topic) > 5*1024*1024*1024
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: Pulsar topic large backlog storage size (instance {{ $labels.instance }})
    description: "The topic backlog storage size is over 5 GB\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

3.4.4. Pulsar topic very large backlog storage size

Severity: critical

The topic backlog storage size is over 20 GB

- alert: PulsarTopicVeryLargeBacklogStorageSize
  expr: sum(pulsar_storage_size) by (topic) > 20*1024*1024*1024
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: Pulsar topic very large backlog storage size (instance {{ $labels.instance }})
    description: "The topic backlog storage size is over 20 GB\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

3.4.5. Pulsar high write latency

Severity: critical

Pulsar topic {{ $labels.topic }} has {{ $value }} storage write operations exceeding the maximum latency bucket (> 1000ms)

# pulsar_storage_write_latency_le_overflow is the overflow bucket of Pulsar's non-standard histogram.
# It counts write operations exceeding all defined latency bounds (> 1000ms).
- alert: PulsarHighWriteLatency
  expr: sum(pulsar_storage_write_latency_le_overflow > 0) by (topic)
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: Pulsar high write latency (instance {{ $labels.instance }})
    description: "Pulsar topic {{ $labels.topic }} has {{ $value }} storage write operations exceeding the maximum latency bucket (> 1000ms)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

3.4.6. Pulsar large message payload

Severity: warning

Pulsar topic {{ $labels.topic }} has {{ $value }} message entries exceeding the maximum size bucket (> 1MB)

# pulsar_entry_size_le_overflow is the overflow bucket of Pulsar's non-standard histogram.
# It counts message entries exceeding all defined size bounds.
- alert: PulsarLargeMessagePayload
  expr: sum(pulsar_entry_size_le_overflow > 0) by (topic)
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: Pulsar large message payload (instance {{ $labels.instance }})
    description: "Pulsar topic {{ $labels.topic }} has {{ $value }} message entries exceeding the maximum size bucket (> 1MB)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

3.4.7. Pulsar high ledger disk usage

Severity: critical

Observing Ledger Disk Usage (> 75%)

# This metric name is path-dependent and may differ based on your BookKeeper data directory configuration.
# Adjust the metric name to match your actual ledger directory path.
- alert: PulsarHighLedgerDiskUsage
  expr: sum(bookie_ledger_dir__pulsar_data_bookkeeper_ledgers_usage) by (kubernetes_pod_name) > 75
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: Pulsar high ledger disk usage (instance {{ $labels.instance }})
    description: "Observing Ledger Disk Usage (> 75%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
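
The metric name embeds the ledger directory path, so a deployment with a different path exposes a differently named series. One way to avoid hard-coding the path is to match on __name__ with a regular expression. This is a sketch under the assumption that every ledger directory usage gauge follows the bookie_ledger_dir_<path>_usage naming pattern seen above. The alert name is hypothetical, and max replaces sum so that the percentages of several directories on one pod are not added together.

```yaml
- alert: PulsarHighLedgerDiskUsageAnyDir   # hypothetical name, not an upstream rule
  # Match any ledger directory usage gauge regardless of the configured path.
  expr: max by (kubernetes_pod_name) ({__name__=~"bookie_ledger_dir_.+_usage"}) > 75
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: Pulsar high ledger disk usage (pod {{ $labels.kubernetes_pod_name }})
    description: "Observing Ledger Disk Usage (> 75%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```

Check the series names that your bookies actually expose before relying on the pattern.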

3.4.8. Pulsar read only bookies

Severity: critical

Observing Readonly Bookies

- alert: PulsarReadOnlyBookies
  expr: count(bookie_SERVER_STATUS{} == 0) by (pod)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Pulsar read only bookies (instance {{ $labels.instance }})
    description: "Observing Readonly Bookies\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
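
This rule groups by pod while the ledger disk usage rule groups by kubernetes_pod_name. Which of the two labels exists depends on your scrape and relabeling configuration, and grouping by a label that is absent collapses all bookies into one series. If a single cluster-wide alert is preferred, a variant without grouping can be sketched as follows. The alert name and the wording of the annotations are hypothetical.

```yaml
- alert: PulsarReadOnlyBookiesCount   # hypothetical name, not an upstream rule
  # bookie_SERVER_STATUS is 0 when a bookie runs in read-only mode.
  # count() returns no series when no bookie matches, so the alert stays inactive.
  expr: count(bookie_SERVER_STATUS == 0)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Pulsar read only bookies
    description: "{{ $value }} bookies are in read-only mode"
```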

3.4.9. Pulsar high number of function errors

Severity: critical

Pulsar function {{ $labels.name }} has more than 10 errors per second ({{ $value | printf "%.2f" }}/s)

- alert: PulsarHighNumberOfFunctionErrors
  expr: sum(rate(pulsar_function_user_exceptions_total[1m]) + rate(pulsar_function_system_exceptions_total[1m])) by (name) > 10
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Pulsar high number of function errors (instance {{ $labels.instance }})
    description: "Pulsar function {{ $labels.name }} has more than 10 errors per second ({{ $value | printf \"%.2f\" }}/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

3.4.10. Pulsar high number of sink errors

Severity: critical

Pulsar sink {{ $labels.name }} has more than 10 errors per second ({{ $value | printf "%.2f" }}/s)

- alert: PulsarHighNumberOfSinkErrors
  expr: sum(rate(pulsar_sink_sink_exceptions_total[1m])) by (name) > 10
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Pulsar high number of sink errors (instance {{ $labels.instance }})
    description: "Pulsar sink {{ $labels.name }} has more than 10 errors per second ({{ $value | printf \"%.2f\" }}/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
