Area: Observability Issues
Sub-Area: Assertion Configuration and Execution
Issue
Users attempting to configure assertions to run via remote executor infrastructure encounter connection errors or routing issues. This typically occurs when assertions fail to execute in the intended remote executor environment due to incorrect source configuration, missing permissions, or executor assignment problems.
Error Messages
- Unable to connect to source
- Unable to fetch a row count for urn:li:dataset:(...) using volume_source_type=DatasetVolumeSourceType.INFORMATION_SCHEMA
- AttributeError("'Connection' object has no attribute 'closed'")
- Unable to find a dataset profile or rowCount for urn:li:dataset:(...)
- Spectrum Scan Error: S3ServiceException: User: ... is not authorized to perform: s3:ListBucket
You Might Be Asking
- How do I configure assertions to use my remote executor instead of the default executor?
- Why are my assertions failing with connection errors when my ingestion source works fine?
- Do I need special configuration for assertions to run via remote executor?
- What permissions are required for assertions to connect to my data sources?
Solution
Assertions in DataHub Cloud automatically run via remote executor when properly configured. Follow these steps to ensure correct setup:
- Create or Configure Remote Executor Pool
  - Navigate to Settings > Executors > Create
  - Set a Pool Identifier (e.g., remote-prod)
  - Deploy the remote executor in your environment following the deployment guide
- Configure Ingestion Source for Remote Executor
  - Navigate to Settings > Ingestion Sources
  - Create or edit an ingestion source for your data platform
  - In the Finish Up step, expand Advanced settings
  - Select your Remote Executor Pool (NOT the CLI executor __datahub_cli_)
  - Ensure the ingestion source has valid credentials for your data platform
- Verify Assertion Routing
  DataHub automatically routes assertions based on this logic:
  - Identifies the dataset's platform (e.g., Snowflake, Redshift, Databricks)
  - Searches for an ingestion source matching the platform type
  - Uses credentials from the ingestion source assigned to a remote executor pool
  - Explicitly filters out CLI executors from assertion execution
- Configure Required Permissions
  For Redshift:
      GRANT USAGE ON SCHEMA your_schema TO datahub_user;
      GRANT SELECT ON ALL TABLES IN SCHEMA your_schema TO datahub_user;
      GRANT SELECT ON pg_catalog.pg_class TO datahub_user;
      GRANT SELECT ON svv_table_info TO datahub_user;
      GRANT SELECT ON svv_external_tables TO datahub_user;
      GRANT SELECT ON svv_external_columns TO datahub_user;
  For Redshift Spectrum tables, also ensure S3 permissions:
      {
        "Effect": "Allow",
        "Action": ["s3:ListBucket"],
        "Resource": "arn:aws:s3:::your-bucket"
      },
      {
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::your-bucket/*"
      }
  For AWS Glue (profiling required for assertions):
      {
        "Effect": "Allow",
        "Action": [
          "glue:GetDatabases",
          "glue:GetTables",
          "glue:GetPartitions"
        ],
        "Resource": [
          "arn:aws:glue:<region>:<account-id>:catalog",
          "arn:aws:glue:<region>:<account-id>:database/*",
          "arn:aws:glue:<region>:<account-id>:table/*"
        ]
      }
  For Databricks:
  - CAN USE on the SQL warehouse
  - USE CATALOG on the parent catalog
  - USE SCHEMA on the parent schema
  - SELECT on the tables, or at the schema level for broader coverage
- Choose Appropriate Volume Source Type
  - Query: Executes SELECT COUNT(*) directly (recommended for most cases)
  - Information Schema: Uses system metadata (may not work with external tables like Spectrum)
  - DataHub Dataset Profile: Uses pre-computed statistics from ingestion profiling
- Troubleshoot Common Issues
  - Verify the ingestion source is assigned to a remote executor pool, not the CLI executor
  - Check that the credentials in the ingestion source have sufficient permissions
  - For external tables (Spectrum, Delta), use the Query volume source type
  - Ensure the remote executor can reach both DataHub Cloud and your data source
  - For SQL assertions, use fully qualified table names (e.g., schema.table_name)
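To make the routing rules from the Verify Assertion Routing step easier to reason about, the sketch below models them in plain Python. It is not DataHub's implementation: the IngestionSource class and the pick_source_for_assertion helper are hypothetical names introduced for illustration. Only the dataset URN layout and the __datahub_cli_ executor id come from DataHub itself.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Executor id of the CLI executor, which is excluded from assertion runs.
CLI_EXECUTOR_ID = "__datahub_cli_"


@dataclass(frozen=True)
class IngestionSource:
    """Hypothetical stand-in for an ingestion source record."""
    name: str
    platform: str      # e.g. "snowflake", "redshift", "databricks"
    executor_id: str   # pool identifier, e.g. "remote-prod"


def platform_from_urn(dataset_urn: str) -> Optional[str]:
    """Extract the platform name from a dataset URN, or None if absent."""
    match = re.search(r"urn:li:dataPlatform:([^,)]+)", dataset_urn)
    return match.group(1) if match else None


def pick_source_for_assertion(
    dataset_urn: str, sources: List[IngestionSource]
) -> Optional[IngestionSource]:
    """Return the first source on the dataset's platform that is not
    assigned to the CLI executor, mirroring the routing rules above."""
    platform = platform_from_urn(dataset_urn)
    for source in sources:
        if source.platform == platform and source.executor_id != CLI_EXECUTOR_ID:
            return source
    return None


sources = [
    IngestionSource("redshift-cli", "redshift", CLI_EXECUTOR_ID),
    IngestionSource("redshift-prod", "redshift", "remote-prod"),
    IngestionSource("snowflake-prod", "snowflake", "remote-prod"),
]
urn = "urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.orders,PROD)"
chosen = pick_source_for_assertion(urn, sources)
print(chosen.name if chosen else "no eligible ingestion source")
```

If this model returns no source for a dataset, the practical check is the same as in the troubleshooting list: confirm that an ingestion source exists for that platform and that it is assigned to a remote executor pool rather than the CLI executor.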
Additional Notes
Assertions automatically inherit the executor assignment from the ingestion source that ingested the dataset; no additional configuration is required on the assertion itself. The remote executor ensures that credentials and data never leave your network: only assertion results are sent back to DataHub Cloud. For AWS Glue, only self-reported assertions through DataHub Dataset Profile are currently supported, which requires profiling to be enabled during ingestion.
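The guidance on volume source types can be condensed into a small decision helper. This is a sketch only: the VolumeSourceType enum is a local stand-in that mirrors the option names in this article, not an import from the DataHub SDK, and the prefer_metadata flag is an assumption added to represent users who want to avoid running a table scan.

```python
from enum import Enum


class VolumeSourceType(Enum):
    """Local stand-in for the three volume source options described above."""
    QUERY = "QUERY"
    INFORMATION_SCHEMA = "INFORMATION_SCHEMA"
    DATAHUB_DATASET_PROFILE = "DATAHUB_DATASET_PROFILE"


def choose_volume_source(
    platform: str,
    is_external_table: bool = False,
    prefer_metadata: bool = False,  # hypothetical preference flag
) -> VolumeSourceType:
    """Pick a volume source type following the rules in this article."""
    # AWS Glue only supports assertions backed by the dataset profile.
    if platform.lower() == "glue":
        return VolumeSourceType.DATAHUB_DATASET_PROFILE
    # External tables (Spectrum, Delta) may be missing from system metadata.
    if is_external_table:
        return VolumeSourceType.QUERY
    # System metadata avoids a SELECT COUNT(*) when the table is native.
    if prefer_metadata:
        return VolumeSourceType.INFORMATION_SCHEMA
    # Query is the recommended default for most cases.
    return VolumeSourceType.QUERY


print(choose_volume_source("redshift", is_external_table=True).value)
```

The ordering of the checks matters: the Glue and external-table rules are hard constraints, so they are evaluated before any preference for metadata-based counts.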
Related Documentation
- Setting Up Remote Ingestion Executor
- DataHub Assertions
- Redshift Connector Documentation
- AWS Glue Connector Documentation
- Operations API Tutorial
Tags: remote-executor, assertions, observability, redshift, databricks, glue, permissions, connection-errors, volume-assertions, executor-pools