Area: Ingestion Issues
Sub-Area: Memory Management
Issue
When performing large-scale Teradata ingestion using the DataHub Python SDK with multiple recipes running sequentially or concurrently in the same Kubernetes pod, memory consumption continuously increases and is not released between recipe executions. This leads to Out of Memory (OOM) errors and pod crashes, particularly when processing hundreds of thousands of datasets across multiple recipes.
Error Messages
- OOMKilled
- entity too large exception
You Might Be Asking
- Why does memory keep increasing even after recipes complete?
- How can I prevent OOM crashes during large Teradata ingestions?
- Why doesn't memory get released between sequential recipe runs?
Solution
This issue has several causes: class-level caches that accumulate across runs, memory leaks in sqlglot, and memory allocators that hold on to freed memory instead of returning it to the OS. The solutions below are listed in order of effectiveness:
1. Run Each Recipe in a Separate Process (Recommended)
The most effective solution is to execute each recipe in its own subprocess, which guarantees complete memory release when the process exits:
import concurrent.futures
import glob
import subprocess
import sys


def run_recipe(recipe_file: str) -> None:
    # Each recipe runs in its own interpreter; its memory is freed when it exits.
    result = subprocess.run(
        [sys.executable, "-m", "datahub", "ingest", "-c", recipe_file],
        capture_output=False,  # let recipe logs stream to the pod's stdout/stderr
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"Recipe {recipe_file} failed with exit code {result.returncode}"
        )


# The guard keeps worker processes from re-running this block when the pool
# starts them with the spawn or forkserver start method.
if __name__ == "__main__":
    recipe_files = sorted(glob.glob("/app/recipes/teradata_*.yml"))

    # Run with 4 concurrent processes (matching your existing pattern)
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        futures = {executor.submit(run_recipe, f): f for f in recipe_files}
        for future in concurrent.futures.as_completed(futures):
            recipe_file = futures[future]
            try:
                future.result()
                print(f"Completed: {recipe_file}")
            except Exception as e:
                print(f"Failed: {recipe_file}: {e}")
2. Upgrade to Latest SDK Version
Upgrade to DataHub Python SDK version 1.5.0.17 or later, which includes fixes for:
- Class-level cache clearing in TeradataSource.close()
- sqlglot memory leak resolution
- Improved memory management for large-scale ingestions
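Before relying on these fixes, it can help to confirm which SDK version the pod is actually running. The following is a minimal sketch, assuming the SDK is installed under the PyPI distribution name `acryl-datahub` and that the version string consists only of dot-separated integers, as in 1.5.0.17. The helper names `parse_version` and `sdk_is_at_least` are illustrative and not part of the SDK.

```python
from importlib import metadata

MINIMUM_FIXED_VERSION = "1.5.0.17"  # version named in this article


def parse_version(version: str) -> tuple:
    """Turn '1.5.0.17' into (1, 5, 0, 17).

    Assumption: dot-separated integers only; suffixes such as 'rc1' are not handled.
    """
    return tuple(int(part) for part in version.split("."))


def sdk_is_at_least(
    minimum: str = MINIMUM_FIXED_VERSION,
    distribution: str = "acryl-datahub",  # assumed distribution name
) -> bool:
    """Return True if the installed distribution is at or above `minimum`."""
    try:
        installed = metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return False
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    print("Fixes expected to be present:", sdk_is_at_least())
```

Running this check at the start of the ingestion job makes it obvious when an image was built against an older SDK.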
3. Optimize Recipe Configuration
Reduce memory pressure by adjusting these recipe parameters:
# Reduce concurrent workers
source:
  type: teradata
  config:
    max_workers: 2 # Reduced from 4

# Disable lineage if not needed (reduces sqlglot memory usage)
source:
  type: teradata
  config:
    include_view_lineage: false

# Use incremental column extraction
source:
  type: teradata
  config:
    column_extraction_days_back: 7
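The three fragments above are alternatives shown separately. In a real recipe they belong under a single source.config block, since a YAML mapping can hold only one source key. When recipes are generated or loaded programmatically, the same settings could be applied as in the sketch below. It assumes the recipe is already available as a Python dict; the helper name `apply_low_memory_overrides` is hypothetical, and the option names and values are the ones listed above.

```python
import copy

# Option names and values taken from the fragments above.
LOW_MEMORY_OVERRIDES = {
    "max_workers": 2,
    "include_view_lineage": False,
    "column_extraction_days_back": 7,
}


def apply_low_memory_overrides(recipe: dict) -> dict:
    """Return a copy of `recipe` with the low-memory options set on source.config."""
    tuned = copy.deepcopy(recipe)  # leave the caller's recipe untouched
    config = tuned.setdefault("source", {}).setdefault("config", {})
    config.update(LOW_MEMORY_OVERRIDES)
    return tuned


# Placeholder recipe for illustration; a real one also carries connection settings.
recipe = {"source": {"type": "teradata", "config": {"max_workers": 4}}}
tuned = apply_low_memory_overrides(recipe)
print(tuned["source"]["config"])
```

Disabling view lineage changes what metadata is ingested, so that override should be dropped from the dict if lineage is required.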
4. Configure Memory-Friendly Environment Variables
Add these environment variables to your Kubernetes pod spec to improve memory reclaim:
env:
  - name: MALLOC_ARENA_MAX
    value: "1"
  - name: PYTHONMALLOC
    value: "malloc"
  - name: DATAHUB_REST_EMITTER_BATCH_MAX_PAYLOAD_BYTES
    value: "8388608"
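The allocator variables are read when a process starts, so changing os.environ inside an interpreter that is already running does not affect that interpreter. If the pod spec cannot be changed, one option is to pass the variables to each recipe subprocess instead. The sketch below assumes recipes are launched with `python -m datahub ingest`, as in solution 1; the names `build_child_env` and `run_recipe_with_env` are illustrative.

```python
import os
import subprocess
import sys

# Values copied from the pod spec above.
ALLOCATOR_ENV = {
    "MALLOC_ARENA_MAX": "1",
    "PYTHONMALLOC": "malloc",
    "DATAHUB_REST_EMITTER_BATCH_MAX_PAYLOAD_BYTES": "8388608",
}


def build_child_env(base=None) -> dict:
    """Copy the given (or current) environment and add the memory-related variables."""
    env = dict(os.environ if base is None else base)
    env.update(ALLOCATOR_ENV)
    return env


def run_recipe_with_env(recipe_file: str) -> int:
    """Run one recipe in a child interpreter that starts with the tuned environment."""
    result = subprocess.run(
        [sys.executable, "-m", "datahub", "ingest", "-c", recipe_file],
        env=build_child_env(),
    )
    return result.returncode
```

Because the child interpreter is started with the variables already set, they take effect for that recipe even when the parent process was started without them.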
5. Increase Pod Memory Limits
As a temporary measure, increase memory limits to accommodate the leak until process isolation can be implemented:
resources:
  limits:
    memory: "60Gi" # Increased from original limits
  requests:
    memory: "30Gi"
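To choose a limit based on measurement rather than guesswork, the peak memory of a recipe subprocess can be logged after it exits. The following is a rough sketch, assuming a Linux pod (where `ru_maxrss` is reported in KiB) and recipes launched as child processes; the helper name `peak_child_rss_mib` is illustrative, and the child command is a stand-in for the real `datahub ingest` call.

```python
import resource
import subprocess
import sys


def peak_child_rss_mib() -> float:
    """Peak resident set size of the largest waited-for child process, in MiB.

    Assumption: Linux, where ru_maxrss is expressed in KiB.
    """
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss / 1024


# Stand-in workload; replace with the subprocess call that runs a recipe.
subprocess.run(
    [sys.executable, "-c", "data = bytearray(50 * 1024 * 1024)"],
    check=True,
)
print(f"Peak child RSS: {peak_child_rss_mib():.0f} MiB")
```

The value reflects the single largest child, so when several recipes run concurrently the pod needs headroom for roughly the sum of their peaks.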
Additional Notes
The memory leak was caused by multiple factors: class-level caches in TeradataSource that weren't cleared between runs, a confirmed memory leak in the sqlglot[c] C extension, and LRU schema caches that retained references. Even with fixes in 1.5.0.17+, Linux's glibc allocator may not return freed memory pages to the OS between runs, making process isolation the most reliable approach. The issue is most pronounced when processing hundreds of thousands of datasets across multiple recipes in sequence.
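The effect of a class-level cache can be reproduced with a toy example. The class below is purely illustrative and is not the real TeradataSource: it only shows that a dict defined on the class outlives every instance, so entries from one recipe run remain in memory during the next unless close() clears them.

```python
class ToySource:
    # Class-level attribute: one dict shared by all instances, alive as long as the class is.
    _schema_cache: dict = {}

    def __init__(self, name: str) -> None:
        self.name = name

    def ingest(self, n_tables: int) -> None:
        # Each "table" adds an entry to the shared cache.
        for i in range(n_tables):
            ToySource._schema_cache[f"{self.name}.table_{i}"] = "x" * 1024

    def close(self, clear_cache: bool) -> None:
        if clear_cache:
            ToySource._schema_cache.clear()


first = ToySource("recipe_a")
first.ingest(1000)
first.close(clear_cache=False)
del first  # the instance is gone, the cache entries are not
leaked = len(ToySource._schema_cache)

second = ToySource("recipe_b")
second.ingest(1000)
grown = len(ToySource._schema_cache)  # entries from both runs

second.close(clear_cache=True)
after_clear = len(ToySource._schema_cache)

print(leaked, grown, after_clear)  # prints: 1000 2000 0
```

Clearing the cache makes the objects collectable, but as noted above the allocator may still keep the freed pages, which is why a separate process per recipe remains the more reliable option.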
Related Documentation
Tags: teradata, memory-leak, kubernetes, oom, ingestion, python-sdk, cache, sqlglot, large-scale