Area: Ingestion Issues
Sub-Area: Payload Size Limits
Issue
The DataHub CLI can generate aspect payloads larger than 15MB for certain datasets, particularly Iceberg tables, including tables without nested schemas. Ingestion fails when a payload exceeds the size limits enforced by Kafka, databases, or internal data registries.
Error Messages
- Payload size exceeds maximum allowed limit
- 413 Request Entity Too Large
- Ingestion failure due to aspect size limit
You Might Be Asking
- Why are my Iceberg table payloads so large even for simple schemas?
- Is there a CLI flag to reduce aspect payload size?
- How can I optimize DataHub ingestion for large schemas?
Solution
Large aspect payloads typically result from verbose schema metadata serialization. The Iceberg connector converts schemas to DataHub's Avro-based SchemaMetadata format, which includes fully qualified field paths, native type strings, and Avro record wrapping that can consume significant space even for flat schemas.
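To get a feel for how much of a payload is wrapping rather than column names, the overhead can be estimated with a small sketch. It uses only the standard library and a simplified, hypothetical stand-in for a schema field entry (the `make_field` helper and its exact keys are assumptions for illustration, not DataHub's actual SchemaMetadata model):

```python
import json


def make_field(name: str, native_type: str) -> dict:
    """Build a simplified, hypothetical stand-in for one schema field entry."""
    return {
        # Qualified field path: the column name is wrapped in type prefixes.
        "fieldPath": f"[version=2.0].[type=struct].[type={native_type}].{name}",
        "nativeDataType": native_type,
        # Record-style wrapping around the logical type.
        "type": {"type": {"com.linkedin.schema.StringType": {}}},
        "nullable": True,
        "recursive": False,
    }


def serialized_size_bytes(obj) -> int:
    """Size of the object once serialized to UTF-8 JSON."""
    return len(json.dumps(obj).encode("utf-8"))


# A flat schema with 1,000 string columns.
names = [f"column_{i}" for i in range(1000)]
fields = [make_field(name, "string") for name in names]

total_bytes = serialized_size_bytes({"fields": fields})
name_bytes = sum(len(name) for name in names)

print(f"serialized schema: {total_bytes} bytes")
print(f"column names only: {name_bytes} bytes")
print(f"overhead factor:   {total_bytes / name_bytes:.1f}x")
```

Even in this reduced form, the serialized size is many times the size of the raw column names, which is consistent with flat schemas still producing large payloads.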
- Verify profiling is disabled (most effective):
    source:
      type: "iceberg"
      config:
        profiling:
          enabled: false
- Reduce batch payload size for better isolation:
    sink:
      type: "datahub-rest"
      config:
        max_threads: 1
        max_pending_requests: 1
- Filter to essential tables only:
    source:
      config:
        table_pattern:
          allow:
            - "specific_namespace.essential_table_*"
- Partition ingestion jobs by processing fewer tables per run:
    # Run separate ingestion jobs for different namespaces
    table_pattern:
      allow:
        - "namespace1.*"
    # Then run another job for namespace2, etc.
- Lower client-side payload limit to force truncation:
    sink:
      type: "datahub-rest"
      config:
        max_aspect_size_bytes: 10485760  # 10MB instead of 15MB
- Disable unnecessary metadata:
    source:
      config:
        stateful_ingestion:
          enabled: false
        extract_usage_statistics: false
        extract_lineage: false
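The per-namespace partitioning described above can be scripted rather than maintained by hand. The following is a minimal sketch, assuming one recipe per namespace; the `recipes_by_namespace` helper and the server URL are hypothetical, and catalog connection settings for the Iceberg source are deliberately left out:

```python
import json
from typing import Dict, List


def recipes_by_namespace(namespaces: List[str], server: str) -> Dict[str, dict]:
    """Build one recipe per namespace so each ingestion run covers fewer tables."""
    recipes = {}
    for ns in namespaces:
        recipes[ns] = {
            "source": {
                "type": "iceberg",
                "config": {
                    # Same pattern style as the recipe fragments above.
                    "table_pattern": {"allow": [f"{ns}.*"]},
                    "profiling": {"enabled": False},
                    # Catalog settings omitted; they depend on your environment.
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": server}},
        }
    return recipes


# Placeholder server URL; replace with your own DataHub endpoint.
recipes = recipes_by_namespace(["namespace1", "namespace2"], "http://localhost:8080")
for ns, recipe in recipes.items():
    print(f"--- recipe for {ns} ---")
    print(json.dumps(recipe, indent=2))
```

Each generated recipe could then be saved to its own file and run as a separate ingestion job, so that a failure in one namespace does not block the others.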
Additional Notes
This is an architectural characteristic of DataHub's schema serialization, not a bug. The verbose Avro representation includes field path prefixes, native type metadata, and record wrapping, all of which scale with schema complexity. Recent pyiceberg version upgrades may also have increased payload verbosity. DataHub's EnsureAspectSizeProcessor automatically truncates schemas when limits are exceeded and logs a warning when it does so. If failures occur before DataHub processes the payload, check whether your internal registry enforces a lower limit than DataHub's 15MB default.
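When it is unclear which component is rejecting a payload, comparing its size against each known limit can narrow things down. This is a minimal sketch: the 15MB figure comes from this article, while the `internal_registry` limit is a hypothetical value that should be replaced with whatever your infrastructure actually enforces:

```python
from typing import Dict, List

MIB = 1024 * 1024

# Assumed limits for illustration only.
LIMITS_BYTES: Dict[str, int] = {
    "datahub_default": 15 * MIB,    # default mentioned in this article
    "internal_registry": 10 * MIB,  # hypothetical; check your own infrastructure
}


def exceeded_limits(payload_size_bytes: int, limits: Dict[str, int] = LIMITS_BYTES) -> List[str]:
    """Return the names of all limits that the payload size exceeds."""
    return [name for name, limit in limits.items() if payload_size_bytes > limit]


# A 12 MiB payload passes the DataHub default but not the stricter registry limit.
print(exceeded_limits(12 * MIB))
```

A payload that exceeds only the stricter limit points to the intermediate system rather than to DataHub itself.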
Related Documentation
Tags: iceberg, payload-size, aspect-limits, ingestion-failure, schema-metadata, profiling, cli, kafka-limits, avro-serialization, large-schema