Resolving Memory Issues in Remote Executor for Large Ingestions – DataHub Customer Support Portal

Area: Deployment Issues

Sub-Area: Remote Executor Configuration

Issue

Remote executor pods experience out-of-memory (OOM) errors during large-scale ingestion jobs, particularly when using BigQuery sources with multiple projects or after upgrading to newer DataHub versions. The default 8GB memory allocation becomes insufficient when ingesting from expanded data sources or when memory overhead increases due to version changes.

Error Messages

Pod failed with OOM exception
Memory limit exceeded during ingestion

You Might Be Asking

Why are my previously working ingestion jobs now failing with OOM errors?
How can I reduce memory consumption in the remote executor?
What memory allocation is recommended for large multi-project BigQuery ingestion?

Solution

To resolve memory issues in remote executor deployments, implement the following configuration optimizations:

Reduce Ingestion Parallelism

Lower the number of concurrent workers in your BigQuery ingestion recipe:

```
source:
  type: bigquery
  config:
    max_workers: 2
    profiling:
      enabled: true
      max_workers: 2
    classification:
      enabled: true
      max_workers: 2
```

Scope Your Ingestion

Explicitly limit the scope of your ingestion to prevent loading excessive metadata:

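```
source:
  type: bigquery
  config:
    project_ids:
      - "project-1"
      - "project-2"
    # Patterns are regular expressions, not shell globs.
    # If match_fully_qualified_names is enabled, dataset names are
    # matched in project.dataset form; adjust the patterns to suit.
    dataset_pattern:
      allow:
        - '^prod_.*'
        - '^analytics_.*'
    # Table names are typically matched in project.dataset.table form;
    # verify against the BigQuery source reference for your version.
    table_pattern:
      deny:
        - '.*\.temp_.*'
        - '.*\.staging_.*'
```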
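
Allow and deny patterns in DataHub recipes are, as far as we understand, regular expressions matched from the start of the name rather than shell globs. The sketch below is a rough approximation of that behavior, not DataHub's actual implementation; the helper `is_allowed` and the sample names are hypothetical. It shows why a glob-style pattern can admit more datasets than intended.

```python
import re


def is_allowed(name, allow, deny):
    """Approximate allow/deny filtering: a deny match wins, otherwise
    at least one allow pattern must match from the start of the name."""
    if any(re.match(p, name, re.IGNORECASE) for p in deny):
        return False
    return any(re.match(p, name, re.IGNORECASE) for p in allow)


# Read as a regex, "prod_*" means "prod" followed by zero or more "_",
# so it also admits names such as "production_copy".
glob_style = is_allowed("production_copy", ["prod_*"], [])

# An explicit regex requires the underscore and rejects that name.
regex_style = is_allowed("production_copy", ["^prod_.*"], [])

# A deny pattern written for project.dataset.table style names.
denied = is_allowed("project-1.prod_sales.temp_load", [".*"], [r".*\.temp_.*"])

print(glob_style, regex_style, denied)
```

Running a handful of real dataset and table names through a check like this before a large ingestion helps confirm that the scope is as narrow as intended.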

Disable Resource-Heavy Features

Turn off optional features if not actively used:

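```
source:
  type: bigquery
  config:
    profiling:
      enabled: false
    classification:
      enabled: false
    # This enables an option rather than disabling a feature; confirm the
    # key against the BigQuery source reference for your DataHub version.
    lineage:
      use_v2_lineage_api: true
```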

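Configure Executor Task Limits

Set memory limits and task weights using environment variables. The sample value 6000000 is about 6 GB if the limit is expressed in kilobytes; confirm the unit against the remote executor documentation for your version:
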
```
{
  "EXECUTOR_TASK_MEMORY_LIMIT": "6000000",
  "EXECUTOR_TASK_WEIGHT": "1.0"
}
```
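
To sanity-check a task memory limit against the pod size, it can help to convert it to GiB. The sketch below assumes the limit is given in kilobytes of 1024 bytes each; that unit is an assumption, not something this article confirms. The helper name `task_limit_gib` is hypothetical.

```python
def task_limit_gib(limit_kb: int, bytes_per_kb: int = 1024) -> float:
    """Convert a task memory limit (assumed to be in kilobytes) to GiB."""
    return limit_kb * bytes_per_kb / 2**30


# The sample value used for EXECUTOR_TASK_MEMORY_LIMIT in this article.
limit = task_limit_gib(6_000_000)
print(f"{limit:.2f} GiB per task")
```

Under that assumption the sample limit is a little under 6 GiB, which leaves some headroom inside an 8 GB pod for the executor process itself.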

Increase Pod Memory Allocation

For large-scale ingestion with multiple projects, increase the pod memory limit:

```
resources:
  limits:
    memory: "12Gi"  # or "16Gi" for very large deployments
  requests:
    memory: "8Gi"
```
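
When adjusting these values, the memory request should not exceed the limit. The following sketch parses Kubernetes-style binary quantities such as "12Gi" and checks that relationship. It is a simplified illustration that handles only the Ki, Mi, Gi, and Ti suffixes; the function names are hypothetical.

```python
# Binary suffixes only; decimal suffixes (k, M, G) are deliberately not handled.
_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a quantity such as "12Gi" to bytes; bare numbers are bytes."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)


def request_fits_limit(request: str, limit: str) -> bool:
    """Return True when the memory request does not exceed the limit."""
    return quantity_to_bytes(request) <= quantity_to_bytes(limit)


print(request_fits_limit("8Gi", "12Gi"))
print(quantity_to_bytes("12Gi"))
```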

Update to Latest CLI Version

If you use bundled images, make sure they include the latest DataHub CLI version. For custom builds, replace the bundled venv creation with direct installation:

```
# Create one virtual environment per connector and install the matching plugin into it.
for x in bigquery looker lookml databricks; do
  uv venv "/opt/datahub/venvs/${x}-bundled"
  # The version after "==" is elided in the source; append the release you want to pin.
  uv pip install --python "/opt/datahub/venvs/${x}-bundled/" "acryl-datahub[${x}]=="
done
```

Additional Notes

Memory usage in BigQuery ingestion grows linearly with the number of projects and tables, because stateful profiling and checkpoint data are held in memory. Newer DataHub versions may have higher baseline memory requirements due to architectural changes in credential management. The default 8GB allocation may be insufficient for multi-project BigQuery deployments and should be adjusted to your specific ingestion scope. Always test memory configuration changes in a non-production environment first.
