Understanding Remote Executor Prometheus Metrics and Monitoring – DataHub Customer Support Portal

Area: Observability Issues

Sub-Area: Metrics and Monitoring

Issue

Remote Executor deployments expose numerous Prometheus metrics beyond the documented ones (ingestion_requests, ingestion_errors, resource metrics). Many of these undocumented metrics appear on the /metrics endpoint with unclear purposes, making it difficult to build comprehensive monitoring and alerting systems for production Remote Executor instances.

You Might Be Asking

What do the undocumented datahub_executor_worker_* metrics represent?
How can I detect if my Remote Executor is alive but application-level processing is stuck?
Why are some metrics always zero in my environment?
Which metrics should I monitor for production alerting?

Solution

Here's a breakdown of the undocumented Remote Executor Prometheus metrics and what they mean:

Assertion-Related Metrics

assertion_requests - Counts assertion jobs received from the queue, analogous to ingestion_requests but for data quality checks
assertion_evaluate_requests_count - Tracks on-demand assertion evaluations (triggered via API/UI for dry runs or manual checks). If this stays at zero while assertion_requests increments, only scheduled assertions are running, which is expected in most production environments

System Health Metrics

credentials_refresh_requests - Increments each time the executor refreshes credentials (database tokens, cloud access tokens). High counts are normal and reflect regular refresh intervals to ensure credentials remain valid
discovery_ping_requests_count - Increments when the executor registers or pings the control plane for discovery, typically at pod startup and during periodic health checks
monitoring_push_requests_count - Counts metrics pushes to the monitoring backend (Prometheus Pushgateway). Usually increments at startup and then periodically, based on the configured metrics push interval

Configuration and Control Metrics

config_fetcher_requests_count - Tracks requests to fetch configuration from the control plane. If always zero, your deployment likely isn't using dynamic config updates or this feature isn't enabled
ingestion_fetch_signal_requests_count - Tracks requests to fetch cancellation or control signals for ingestion jobs. A zero value indicates that no jobs have been cancelled or have required signal fetching
ingestion_fallback_fetch_requests_count - Counts fallback fetches for ingestion jobs when the primary fetch mechanism fails. A zero value indicates that the primary fetch path has been succeeding

Advanced Feature Metrics

monitor_bootstrap_execute_requests_count - Relates to monitor (assertion) bootstrap tasks. Zero values are normal unless using advanced observability features
monitor_training_executor_requests_count - Tracks monitor training tasks for smart assertions. Zero values are normal unless using monitor training features

Accessing Metric Descriptions

To get detailed descriptions for all metrics, exec into your Remote Executor pod and access the /metrics endpoint:

kubectl exec -it <remote-executor-pod> -- curl localhost:8080/metrics

Look for the # HELP comments above each metric for official descriptions.
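
If the output is long, the # HELP lines can be pulled out programmatically. The following is a minimal Python sketch, assuming the endpoint returns the standard Prometheus text exposition format and that you have saved the curl output to a string. The function name parse_help_lines and the sample description text are illustrative only and are not taken from the Remote Executor.

```python
def parse_help_lines(metrics_text):
    """Map each metric name to its HELP description.

    Expects Prometheus text exposition output, where description lines
    look like: "# HELP <metric_name> <description>".
    """
    prefix = "# HELP "
    descriptions = {}
    for line in metrics_text.splitlines():
        if line.startswith(prefix):
            # Split into the metric name and the rest of the line.
            parts = line[len(prefix):].split(" ", 1)
            descriptions[parts[0]] = parts[1] if len(parts) > 1 else ""
    return descriptions


# Made-up sample output; real descriptions come from your own /metrics endpoint.
sample = """\
# HELP discovery_ping_requests_count Placeholder description for illustration
# TYPE discovery_ping_requests_count counter
discovery_ping_requests_count 42.0
"""

for name, text in parse_help_lines(sample).items():
    print(f"{name}: {text}")
```

Only lines starting with "# HELP " are read, so # TYPE lines and sample values are ignored.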

Application Health Monitoring

For detecting "executor alive but application-level processing stuck" scenarios:

Monitor discovery_ping_requests_count - This metric should increment regularly (typically 2,000-3,000 times per day, or roughly one increment every 29 to 43 seconds)
A stall in this counter is a reliable indicator that the application is frozen, even when pod-level health checks still pass
Set up alerting when this metric hasn't incremented within your expected interval
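
The stall check above can be sketched in Python as a comparison of two counter samples. This is a rough illustration under stated assumptions, not the executor's own logic: the function names, the default of 2,000 increments per day (the low end of the typical range above), and the tolerance of five expected intervals are all hypothetical and should be tuned to your environment.

```python
SECONDS_PER_DAY = 86_400


def expected_interval_seconds(increments_per_day):
    """Average gap between increments for a given daily rate."""
    return SECONDS_PER_DAY / increments_per_day


def is_stalled(previous, current, elapsed_seconds,
               increments_per_day=2000, tolerance=5.0):
    """Return True when the counter has not advanced for far longer than expected.

    previous / current: two samples of discovery_ping_requests_count.
    elapsed_seconds: time between the two samples.
    tolerance: assumed number of expected intervals allowed without an increment.
    """
    if current < previous:
        # Counters reset on restart; a decrease means a restart, not a stall.
        return False
    threshold = tolerance * expected_interval_seconds(increments_per_day)
    return current == previous and elapsed_seconds > threshold


# 2,000 per day is one increment every 43.2 s, so the threshold here is 216 s.
print(is_stalled(previous=1500, current=1500, elapsed_seconds=600))  # True
print(is_stalled(previous=1500, current=1512, elapsed_seconds=600))  # False
```

Treating a decrease as a restart avoids false alarms after a pod is rescheduled, since Prometheus counters start again from zero.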

Additional Notes

Metric values depend heavily on which DataHub features are enabled and how your environment is configured. Zero values for many metrics are normal if corresponding features aren't in use. These metrics are designed for internal observability and troubleshooting. When building monitoring dashboards, focus on the documented core metrics first, then add these supplementary metrics based on your specific use cases.
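
To see which metrics are at zero in your environment before deciding what to put on a dashboard, a scrape can be filtered as sketched below. This is a minimal Python sketch that assumes standard Prometheus text exposition output. The function name, the sample values, and the pool label are invented for illustration.

```python
def zero_valued_metrics(metrics_text):
    """Return the sorted names of metrics that have at least one sample equal to zero."""
    zeros = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labelled sample: name{labels} value [timestamp]
            name_part, _, rest = line.rpartition("}")
            name = name_part.split("{", 1)[0]
        else:
            # Unlabelled sample: name value [timestamp]
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        try:
            value = float(fields[0])
        except ValueError:
            continue  # ignore lines that do not parse as samples
        if value == 0.0:
            zeros.add(name)
    return sorted(zeros)


# Made-up sample values; the "pool" label is hypothetical.
sample_scrape = """\
# TYPE config_fetcher_requests_count counter
config_fetcher_requests_count 0.0
discovery_ping_requests_count 2150.0
monitor_training_executor_requests_count{pool="default"} 0.0
"""

print(zero_valued_metrics(sample_scrape))
```

A zero reported here is not a fault in itself. As noted above, it usually means the corresponding feature is not in use.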