Area: Observability Issues
Sub-Area: Metrics and Monitoring
Issue
Remote Executor deployments expose numerous Prometheus metrics beyond the documented ones (ingestion_requests, ingestion_errors, resource metrics). Many of these undocumented metrics appear on the /metrics endpoint with unclear purposes, making it difficult to build comprehensive monitoring and alerting systems for production Remote Executor instances.
You Might Be Asking
- What do the undocumented datahub_executor_worker_* metrics represent?
- How can I detect if my Remote Executor is alive but application-level processing is stuck?
- Why are some metrics always zero in my environment?
- Which metrics should I monitor for production alerting?
Solution
The undocumented Remote Executor Prometheus metrics and their meanings are listed below, grouped by purpose:
Assertion-Related Metrics
- assertion_requests - Counts assertion jobs received from the queue, analogous to ingestion_requests but for data quality checks
- assertion_evaluate_requests_count - Tracks on-demand assertion evaluations (triggered via API/UI for dry runs or manual checks). If this is zero while assertion_requests increments, it means only scheduled assertions are running, which is expected in most production environments
System Health Metrics
- credentials_refresh_requests - Increments each time the executor refreshes credentials (database tokens, cloud access tokens). High counts are normal and reflect regular refresh intervals to ensure credentials remain valid
- discovery_ping_requests_count - Increments when the executor registers or pings the control plane for discovery, typically at pod startup and during periodic health checks
- monitoring_push_requests_count - Counts metrics pushes to the monitoring backend (Prometheus Pushgateway). Usually increments at startup and then periodically, based on the configured metrics push interval
Configuration and Control Metrics
- config_fetcher_requests_count - Tracks requests to fetch configuration from the control plane. If always zero, your deployment likely isn't using dynamic config updates or this feature isn't enabled
- ingestion_fetch_signal_requests_count - Tracks requests to fetch cancellation or control signals for ingestion jobs. Zero values indicate no jobs have been cancelled or required signal fetching
- ingestion_fallback_fetch_requests_count - Counts fallback fetches for ingestion jobs when the primary fetch mechanism fails. Zero values indicate that the primary fetch path is succeeding
Advanced Feature Metrics
- monitor_bootstrap_execute_requests_count - Relates to monitor (assertion) bootstrap tasks. Zero values are normal unless using advanced observability features
- monitor_training_executor_requests_count - Tracks monitor training tasks for smart assertions. Zero values are normal unless using monitor training features
Accessing Metric Descriptions
To get detailed descriptions for all metrics, exec into your Remote Executor pod and access the /metrics endpoint:
kubectl exec -it <remote-executor-pod> -- curl localhost:8080/metrics
Look for the # HELP comments above each metric for official descriptions.
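If you save the /metrics output to a file, the # HELP lines can be collected programmatically. The following is a minimal sketch, not an official tool: the function name parse_help_lines and the sample metric demo_requests_total are hypothetical, and the parser relies only on the standard Prometheus text format, where each description appears as "# HELP <metric_name> <text>".

```python
import re

# Matches lines such as: "# HELP some_metric_name Free-form description"
HELP_LINE = re.compile(r"^# HELP (\S+) ?(.*)$")


def parse_help_lines(metrics_text: str) -> dict:
    """Map each metric name to the text of its '# HELP' comment."""
    descriptions = {}
    for line in metrics_text.splitlines():
        match = HELP_LINE.match(line.strip())
        if match:
            name, help_text = match.groups()
            descriptions[name] = help_text
    return descriptions


if __name__ == "__main__":
    # Hypothetical sample; real output comes from the /metrics endpoint.
    sample = (
        "# HELP demo_requests_total Example counter.\n"
        "# TYPE demo_requests_total counter\n"
        "demo_requests_total 3\n"
    )
    for metric, text in parse_help_lines(sample).items():
        print(f"{metric}: {text}")
```

Metrics that are exposed without a # HELP line will simply be absent from the result.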
Application Health Monitoring
For detecting "executor alive but application-level processing stuck" scenarios:
- Monitor discovery_ping_requests_count. This metric should increment regularly (typically 2,000-3,000 times per day). A stall in this counter is a reliable indicator that the application is frozen, distinct from pod-level health checks
- Set up alerting when this metric hasn't incremented within your expected interval
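As a rough guide to choosing the expected interval: if the 2,000-3,000 daily pings were evenly spaced (an assumption, since actual spacing depends on your deployment), the gap between increments would be about 29-43 seconds, so a threshold of a few minutes leaves some margin. The sketch below illustrates the stall check itself. The function names, the sample format, and the 300-second threshold are all hypothetical; in practice this logic would normally live in your alerting system rather than in a script.

```python
from typing import Sequence, Tuple


def seconds_between_pings(pings_per_day: float) -> float:
    """Average gap between increments, assuming evenly spaced pings."""
    return 86_400 / pings_per_day


def is_stalled(samples: Sequence[Tuple[float, float]], max_gap_seconds: float) -> bool:
    """Report whether the counter has stopped changing.

    samples: (unix_timestamp, counter_value) pairs, oldest first.
    Returns True when, as of the latest sample, the value has not
    changed for longer than max_gap_seconds.
    """
    if len(samples) < 2:
        return False  # not enough data to judge
    last_change_time = samples[0][0]
    for (_, prev_value), (timestamp, value) in zip(samples, samples[1:]):
        # Any change counts as activity, including a drop caused by a
        # counter reset after a process restart.
        if value != prev_value:
            last_change_time = timestamp
    latest_time = samples[-1][0]
    return latest_time - last_change_time > max_gap_seconds


if __name__ == "__main__":
    print(seconds_between_pings(2000), seconds_between_pings(3000))
    # Hypothetical scrape history: the counter stops moving after t=60.
    history = [(0, 10), (60, 11), (120, 11), (480, 11)]
    print(is_stalled(history, max_gap_seconds=300))
```

The check needs scrape history that covers at least the threshold window; with a shorter history it cannot distinguish a stall from a normal gap and returns False.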
Additional Notes
Metric values depend heavily on which DataHub features are enabled and how your environment is configured. Zero values for many metrics are normal if corresponding features aren't in use. These metrics are designed for internal observability and troubleshooting. When building monitoring dashboards, focus on the documented core metrics first, then add these supplementary metrics based on your specific use cases.
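To see which metrics are idle in your own environment before deciding what to put on a dashboard, you can scan a saved /metrics snapshot for metrics whose samples are all zero. This is an illustrative sketch under stated assumptions: the function names and the demo_* metrics are hypothetical, and the parser handles only the common "name{labels} value" line shape of the Prometheus text format.

```python
import re
from collections import defaultdict

# name, optional {labels}, value, optional timestamp
SAMPLE_LINE = re.compile(
    r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$"
)


def sum_by_metric(metrics_text: str) -> dict:
    """Sum sample values per metric name across all label sets."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        try:
            value = float(match.group(3))
        except ValueError:
            continue  # ignore lines whose value is not numeric
        totals[match.group(1)] += value
    return dict(totals)


def zero_valued_metrics(metrics_text: str) -> list:
    """Return the sorted names of metrics whose samples sum to zero."""
    totals = sum_by_metric(metrics_text)
    return sorted(name for name, total in totals.items() if total == 0)


if __name__ == "__main__":
    # Hypothetical snapshot; real input comes from the /metrics endpoint.
    snapshot = (
        'demo_a_total{queue="x"} 0\n'
        'demo_a_total{queue="y"} 0\n'
        "demo_b_total 5\n"
        "# HELP demo_c Unused.\n"
        "demo_c 0\n"
    )
    print(zero_valued_metrics(snapshot))
```

A zero in a single snapshot only shows that the counter has not moved since the process started, so compare snapshots over a representative period before dropping a metric from monitoring.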
Tags: remote-executor, prometheus, metrics, monitoring, observability, alerting, assertions, health-checks, troubleshooting, production