Configuring Assertions to Run via Remote Executor – DataHub Customer Support Portal

Area: Observability Issues

Sub-Area: Assertion Configuration and Execution

Issue

Users who configure assertions to run on remote executor infrastructure may encounter connection errors or routing issues. Typically the assertion does not execute in the intended remote executor environment because of an incorrect ingestion source configuration, missing permissions, or executor assignment problems.

Error Messages

Unable to connect to source
Unable to fetch a row count for urn:li:dataset:(...) using volume_source_type=DatasetVolumeSourceType.INFORMATION_SCHEMA
AttributeError("'Connection' object has no attribute 'closed'")
Unable to find a dataset profile or rowCount for urn:li:dataset:(...)
Spectrum Scan Error: S3ServiceException: User: ... is not authorized to perform: s3:ListBucket

You Might Be Asking

How do I configure assertions to use my remote executor instead of the default executor?
Why are my assertions failing with connection errors when my ingestion source works fine?
Do I need special configuration for assertions to run via remote executor?
What permissions are required for assertions to connect to my data sources?

Solution

Assertions in DataHub Cloud automatically run via remote executor when properly configured. Follow these steps to ensure correct setup:

Create or Configure Remote Executor Pool
- Navigate to Settings > Executors > Create
- Set a Pool Identifier (e.g., remote-prod)
- Deploy the remote executor in your environment following the deployment guide

Configure Ingestion Source for Remote Executor
- Navigate to Settings > Ingestion Sources
- Create or edit an ingestion source for your data platform
- In the Finish Up step, expand Advanced settings
- Select your Remote Executor Pool (NOT the CLI executor __datahub_cli_)
- Ensure the ingestion source has valid credentials for your data platform

Verify Assertion Routing
DataHub automatically routes assertions based on this logic:
- Identifies the dataset's platform (e.g., Snowflake, Redshift, Databricks)
- Searches for an ingestion source matching the platform type
- Uses credentials from the ingestion source assigned to a remote executor pool
- Explicitly filters out CLI executors from assertion execution
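
The routing logic above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not DataHub's actual implementation: the IngestionSource class, the pick_source function, and the sample source names are hypothetical, and the sketch assumes the standard dataset URN form urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>).

```python
from dataclasses import dataclass
from typing import List, Optional

# CLI executor identifier as written in this article.
CLI_EXECUTOR_ID = "__datahub_cli_"


@dataclass
class IngestionSource:
    """Hypothetical, simplified view of an ingestion source."""
    name: str
    platform: str
    executor_id: str


def platform_from_urn(dataset_urn: str) -> str:
    """Extract the platform name from a dataset URN."""
    prefix = "urn:li:dataPlatform:"
    start = dataset_urn.index(prefix) + len(prefix)
    return dataset_urn[start:dataset_urn.index(",", start)]


def pick_source(dataset_urn: str, sources: List[IngestionSource]) -> Optional[IngestionSource]:
    """Return the first source on the same platform that is not a CLI executor."""
    platform = platform_from_urn(dataset_urn)
    for source in sources:
        if source.platform == platform and source.executor_id != CLI_EXECUTOR_ID:
            return source
    # No eligible source: the assertion has no credentials to run with.
    return None
```

If pick_source returns None for a dataset in this sketch, that corresponds to the situation where the only matching ingestion source is assigned to the CLI executor, or no source exists for the platform at all.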

Configure Required Permissions

For Redshift:

GRANT USAGE ON SCHEMA your_schema TO datahub_user;
GRANT SELECT ON ALL TABLES IN SCHEMA your_schema TO datahub_user;
GRANT SELECT ON pg_catalog.pg_class TO datahub_user;
GRANT SELECT ON svv_table_info TO datahub_user;
GRANT SELECT ON svv_external_tables TO datahub_user;
GRANT SELECT ON svv_external_columns TO datahub_user;

For Redshift Spectrum tables, also ensure S3 permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::your-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::your-bucket/*"
    }
  ]
}
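
Before retrying a failed Spectrum assertion, it can help to confirm that the policy attached to the Redshift role really contains both S3 statements. The following is a rough pre-flight sketch, assuming the policy is available as a parsed JSON document. The function name missing_spectrum_permissions is hypothetical, and the check is deliberately simplified: it ignores Deny statements, conditions, and NotAction/NotResource, so it is not a substitute for IAM's own policy evaluation.

```python
import fnmatch

# Actions named in this article, each paired with a sample ARN it must cover.
REQUIRED = {
    "s3:ListBucket": "arn:aws:s3:::{bucket}",
    "s3:GetObject": "arn:aws:s3:::{bucket}/example-object",
}


def _as_list(value):
    """IAM allows a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def missing_spectrum_permissions(policy: dict, bucket: str) -> list:
    """Return the required S3 actions that no Allow statement covers."""
    missing = []
    for action, arn_template in REQUIRED.items():
        resource = arn_template.format(bucket=bucket)
        granted = any(
            stmt.get("Effect") == "Allow"
            # Action names are matched case-insensitively, with wildcards.
            and any(
                fnmatch.fnmatchcase(action.lower(), pattern.lower())
                for pattern in _as_list(stmt.get("Action", []))
            )
            # Resource ARNs are matched case-sensitively, with wildcards.
            and any(
                fnmatch.fnmatchcase(resource, pattern)
                for pattern in _as_list(stmt.get("Resource", []))
            )
            for stmt in _as_list(policy.get("Statement", []))
        )
        if not granted:
            missing.append(action)
    return missing
```

An empty list means both actions appear to be allowed for the bucket; a result containing s3:ListBucket matches the Spectrum Scan Error listed under Error Messages.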

For AWS Glue (profiling required for assertions):

{
  "Effect": "Allow",
  "Action": [
    "glue:GetDatabases",
    "glue:GetTables",
    "glue:GetPartitions"
  ],
  "Resource": [
    "arn:aws:glue:your-region:your-account-id:catalog",
    "arn:aws:glue:your-region:your-account-id:database/*",
    "arn:aws:glue:your-region:your-account-id:table/*"
  ]
}

For Databricks:

- CAN USE on the SQL warehouse
- USE CATALOG on the parent catalog
- USE SCHEMA on the parent schema
- SELECT on the individual tables, or at the schema level for broader coverage

Choose Appropriate Volume Source Type
- Query: Executes SELECT COUNT(*) directly (recommended for most cases)
- Information Schema: Uses system metadata (may not work with external tables like Spectrum)
- DataHub Dataset Profile: Uses pre-computed statistics from ingestion profiling
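
The choice between the three volume source types can be written down as a small decision function. This is a sketch of the guidance in this article, not a DataHub API: the VolumeSource enum, the suggest_volume_source function, and the avoid_table_scans flag are hypothetical names, and the idea that Information Schema is the option to pick when you want to avoid running COUNT(*) is an assumption.

```python
from enum import Enum


class VolumeSource(Enum):
    QUERY = "Query"
    INFORMATION_SCHEMA = "Information Schema"
    DATASET_PROFILE = "DataHub Dataset Profile"


def suggest_volume_source(
    platform: str,
    is_external_table: bool,
    avoid_table_scans: bool = False,  # hypothetical preference flag
) -> VolumeSource:
    """Suggest a volume source type following the guidance in this article."""
    # AWS Glue: only profile-based (self-reported) assertions are supported.
    if platform.lower() == "glue":
        return VolumeSource.DATASET_PROFILE
    # External tables (e.g., Spectrum, Delta) may be missing from system metadata.
    if is_external_table:
        return VolumeSource.QUERY
    # Assumption: system metadata is preferred when COUNT(*) should be avoided.
    if avoid_table_scans:
        return VolumeSource.INFORMATION_SCHEMA
    # Default recommended for most cases.
    return VolumeSource.QUERY
```

For example, a Redshift Spectrum table resolves to Query even when avoid_table_scans is set, because Information Schema may not report row counts for external tables.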

Troubleshoot Common Issues
- Verify that the ingestion source is assigned to a remote executor pool, not the CLI executor
- Check that the credentials in the ingestion source have sufficient permissions
- For external tables (Spectrum, Delta), use the Query volume source type
- Ensure the remote executor can reach both DataHub Cloud and your data source
- For SQL assertions, use fully qualified table names (e.g., schema.table_name)
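
To check the network reachability item above, a plain TCP connection test run from the machine hosting the remote executor is often enough to separate network problems from credential problems. The sketch below uses only the Python standard library; the function names can_reach and check_targets are hypothetical, and the hosts and ports you pass in are your own. A successful TCP connection only shows that the port is reachable, not that authentication will succeed.

```python
import socket
from typing import Dict, Tuple


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def check_targets(targets: Dict[str, Tuple[str, int]]) -> Dict[str, bool]:
    """Check several named (host, port) targets and report each result."""
    return {name: can_reach(host, port) for name, (host, port) in targets.items()}
```

A typical call passes one entry for your DataHub Cloud hostname on port 443 and one for your data source endpoint on its database port, then inspects which entries come back False.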

Additional Notes

Assertions automatically inherit the executor assignment from the ingestion source that ingested the dataset, so no additional configuration is required on the assertion itself. The remote executor ensures that credentials and data never leave your network; only assertion results are sent back to DataHub Cloud. For AWS Glue, only self-reported assertions through DataHub Dataset Profile are currently supported, which requires profiling to be enabled during ingestion.
