Memory Leak in Teradata Ingestion Using Python SDK – DataHub Customer Support Portal

Area: Ingestion Issues

Sub-Area: Memory Management

Issue

When performing large-scale Teradata ingestion using the DataHub Python SDK with multiple recipes running sequentially or concurrently in the same Kubernetes pod, memory consumption continuously increases and is not released between recipe executions. This leads to Out of Memory (OOM) errors and pod crashes, particularly when processing hundreds of thousands of datasets across multiple recipes.

Error Messages

OOMKilled
entity too large exception

You Might Be Asking

Why does memory keep increasing even after recipes complete?
How can I prevent OOM crashes during large Teradata ingestions?
Why doesn't memory get released between sequential recipe runs?

Solution

This issue has several causes: class-level caches that accumulate across runs, memory leaks in sqlglot, and memory allocator behavior that keeps freed memory from being returned to the OS. The solutions below are listed in order of effectiveness:

1. Run Each Recipe in a Separate Process (Recommended)

The most effective solution is to execute each recipe in its own subprocess, which guarantees complete memory release when the process exits:

import subprocess
import concurrent.futures
import glob
import sys

def run_recipe(recipe_file: str) -> None:
    result = subprocess.run(
        [sys.executable, "-m", "datahub", "ingest", "-c", recipe_file],
        capture_output=False,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"Recipe {recipe_file} failed with exit code {result.returncode}"
        )

def main() -> None:
    recipe_files = sorted(glob.glob("/app/recipes/teradata_*.yml"))

    # Run up to 4 recipes at a time; each recipe gets its own `datahub ingest` process
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        futures = {executor.submit(run_recipe, f): f for f in recipe_files}
        for future in concurrent.futures.as_completed(futures):
            recipe_file = futures[future]
            try:
                future.result()
                print(f"Completed: {recipe_file}")
            except Exception as e:
                print(f"Failed: {recipe_file}: {e}")

# The guard keeps pool workers from re-running this block when the
# multiprocessing start method is "spawn" or "forkserver"
if __name__ == "__main__":
    main()

2. Upgrade to Latest SDK Version

Upgrade to DataHub Python SDK version 1.5.0.17 or later, which includes the following fixes:

Class-level cache clearing in TeradataSource.close()
sqlglot memory leak resolution
Improved memory management for large-scale ingestions
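
To confirm which SDK version a pod is actually running, a check along these lines may help. This is a minimal sketch, not part of the SDK: it assumes the SDK is installed under the PyPI distribution name acryl-datahub, and the helper names (parse_version, is_at_least, installed_sdk_version) are hypothetical. It compares only the leading numeric components of each version part, so pre-release suffixes are ignored.

```python
import re
from importlib import metadata

MIN_FIXED_VERSION = "1.5.0.17"  # first version with the fixes, per this article


def parse_version(version: str) -> tuple:
    """Return the leading numeric part of each dot-separated component."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first component with no leading digits
        parts.append(int(match.group()))
    return tuple(parts)


def is_at_least(installed: str, minimum: str = MIN_FIXED_VERSION) -> bool:
    """True if `installed` is the same as or newer than `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


def installed_sdk_version(dist_name: str = "acryl-datahub"):
    """Return the installed version string, or None if the package is absent.

    The distribution name is an assumption; adjust it if your image
    installs the SDK under a different name.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_sdk_version()
    if version is None:
        print("DataHub SDK not found in this environment")
    elif is_at_least(version):
        print(f"{version} includes the memory fixes")
    else:
        print(f"{version} is older than {MIN_FIXED_VERSION}; upgrade recommended")
```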

3. Optimize Recipe Configuration

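Reduce memory pressure by adjusting the recipe parameters below. Each fragment shows one option; merge the ones you need into the recipe's single source.config block: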

# Reduce concurrent workers
source:
  type: teradata
  config:
    max_workers: 2  # Reduced from 4

# Disable lineage if not needed (reduces sqlglot memory usage)
source:
  type: teradata
  config:
    include_view_lineage: false

# Use incremental column extraction
source:
  type: teradata
  config:
    column_extraction_days_back: 7
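
If recipes are generated or loaded as Python dictionaries, the three adjustments above can be applied in one place. The following is a minimal sketch under that assumption: apply_memory_overrides is a hypothetical helper, the option names and values are copied from the fragments above, and the host_port value in the usage example is a placeholder.

```python
import copy

# Option names and values are copied from the recipe fragments above.
MEMORY_FRIENDLY_OVERRIDES = {
    "max_workers": 2,
    "include_view_lineage": False,
    "column_extraction_days_back": 7,
}


def apply_memory_overrides(recipe: dict, overrides: dict = MEMORY_FRIENDLY_OVERRIDES) -> dict:
    """Return a copy of `recipe` with the overrides merged into source.config."""
    updated = copy.deepcopy(recipe)  # leave the caller's recipe untouched
    config = updated.setdefault("source", {}).setdefault("config", {})
    config.update(overrides)
    return updated


if __name__ == "__main__":
    base_recipe = {
        "source": {
            "type": "teradata",
            "config": {"host_port": "teradata.example.com:1025", "max_workers": 4},
        }
    }
    print(apply_memory_overrides(base_recipe)["source"]["config"])
```

Working on a deep copy keeps the original recipe available, which is useful when only some recipes need the reduced settings.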

4. Configure Memory-Friendly Environment Variables

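Add these environment variables to your Kubernetes pod spec to improve memory reclaim. MALLOC_ARENA_MAX=1 limits glibc to a single malloc arena, PYTHONMALLOC=malloc makes Python use the system allocator instead of its built-in pymalloc, and the last variable caps the REST emitter's batch payload at 8388608 bytes (8 MiB):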

env:
- name: MALLOC_ARENA_MAX
  value: "1"
- name: PYTHONMALLOC
  value: "malloc"
- name: DATAHUB_REST_EMITTER_BATCH_MAX_PAYLOAD_BYTES
  value: "8388608"
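
When recipes are launched as subprocesses and the pod spec cannot be changed, the same variables can be passed to each child process instead. MALLOC_ARENA_MAX and PYTHONMALLOC are read when a process starts, so they have to be in the child's environment before it launches. The sketch below assumes the `datahub ingest` command line used in solution 1; build_child_env and run_recipe_with_env are hypothetical helper names.

```python
import os
import subprocess
import sys

# Values copied from the pod spec above.
MEMORY_ENV = {
    "MALLOC_ARENA_MAX": "1",
    "PYTHONMALLOC": "malloc",
    "DATAHUB_REST_EMITTER_BATCH_MAX_PAYLOAD_BYTES": "8388608",
}


def build_child_env(base=None) -> dict:
    """Copy the base environment (default: os.environ) and apply MEMORY_ENV."""
    env = dict(os.environ if base is None else base)
    env.update(MEMORY_ENV)
    return env


def run_recipe_with_env(recipe_file: str) -> int:
    """Run one recipe in a child process with the memory settings applied."""
    result = subprocess.run(
        [sys.executable, "-m", "datahub", "ingest", "-c", recipe_file],
        env=build_child_env(),
    )
    return result.returncode
```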

5. Increase Pod Memory Limits

As a temporary measure, increase memory limits to accommodate the leak until process isolation can be implemented:

resources:
  limits:
    memory: "60Gi"  # Increased from original limits
  requests:
    memory: "30Gi"
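
The memory limit also bounds how many recipes can safely run at once. The sketch below shows the arithmetic only: the 60Gi limit comes from the example above, while the 12Gi per-recipe peak and the 20% headroom are made-up figures that should be replaced with values measured in your environment. Both function names are hypothetical.

```python
def parse_binary_quantity(quantity: str) -> int:
    """Convert a Kubernetes-style binary quantity such as "60Gi" to bytes."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: treat as a plain byte count


def max_parallel_recipes(limit: str, peak_per_recipe: str, headroom_percent: int = 20) -> int:
    """How many recipes fit under `limit` after reserving some headroom.

    Always returns at least 1 so that ingestion can still run sequentially.
    """
    usable = parse_binary_quantity(limit) * (100 - headroom_percent) // 100
    return max(1, usable // parse_binary_quantity(peak_per_recipe))


if __name__ == "__main__":
    # 12Gi is an illustrative per-recipe peak, not a measured value.
    print(max_parallel_recipes("60Gi", "12Gi"))
```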

Additional Notes

The memory leak was caused by multiple factors: class-level caches in TeradataSource that weren't cleared between runs, a confirmed memory leak in the sqlglot[c] C extension, and LRU schema caches that retained references. Even with fixes in 1.5.0.17+, Linux's glibc allocator may not return freed memory pages to the OS between runs, making process isolation the most reliable approach. The issue is most pronounced when processing hundreds of thousands of datasets across multiple recipes in sequence.
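
To check whether memory is actually being released between in-process recipe runs, the resident set size (RSS) of the Python process can be logged before and after each run. The following is a minimal, Linux-only sketch that reads /proc/self/statm; the function names are hypothetical and nothing here is part of the DataHub SDK.

```python
import os


def rss_bytes_from_statm(statm_text: str, page_size: int) -> int:
    """Compute RSS in bytes from the contents of /proc/<pid>/statm.

    The second whitespace-separated field is the number of resident pages.
    """
    resident_pages = int(statm_text.split()[1])
    return resident_pages * page_size


def current_rss_bytes() -> int:
    """Resident set size of the current process (Linux only)."""
    with open("/proc/self/statm") as f:
        return rss_bytes_from_statm(f.read(), os.sysconf("SC_PAGE_SIZE"))
```

If the logged value keeps climbing after each recipe finishes, the process is affected and process isolation is likely the safer option.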