Area: Best Practices
Sub-Area: Data Protection and Recovery
Issue
Organizations need to establish comprehensive backup and restore procedures for user-generated metadata in DataHub to protect against accidental data loss and enable selective recovery of specific entities or metadata aspects.
You Might Be Asking
- What metadata aspects should I back up versus leave to ingestion processes?
- Which APIs are recommended for bulk metadata extraction and selective restoration?
- Does DataHub provide built-in backup and restore capabilities?
- How can I implement automated daily backups with selective restore functionality?
Solution
1. Identify User-Edited Metadata to Back Up
Focus your backup strategy on user-edited aspects that cannot be recreated through re-ingestion:
- editableDatasetProperties: Dataset descriptions edited via UI
- editableSchemaMetadata: Column descriptions, tags, and glossary terms
- ownership: Owner assignments made via UI
- globalTags: Dataset-level tags applied via UI
- glossaryTerms: Glossary term associations
- domains: Domain assignments
- documentation: Documentation links and annotations
- structuredProperties: Structured property values
Leave ingestion-sourced aspects to be recreated by your connectors: schemaMetadata, datasetProperties, datasetProfile, datasetUsageStatistics, dataPlatformInstance, browsePathsV2, and status.
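The split above can be captured in a small filter so that a backup job never stores aspects that ingestion will recreate. This is a minimal sketch: the aspect names come from the two lists above, while the function name and the "unclassified" return value are hypothetical choices made for illustration.

```python
from typing import Any, Dict, List, Tuple

# Aspect names copied from the lists above.
USER_EDITED_ASPECTS = frozenset({
    "editableDatasetProperties", "editableSchemaMetadata", "ownership",
    "globalTags", "glossaryTerms", "domains", "documentation",
    "structuredProperties",
})
INGESTION_ASPECTS = frozenset({
    "schemaMetadata", "datasetProperties", "datasetProfile",
    "datasetUsageStatistics", "dataPlatformInstance", "browsePathsV2",
    "status",
})


def classify_aspects(aspects: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Return (aspects worth backing up, names that are in neither list)."""
    to_back_up = {k: v for k, v in aspects.items() if k in USER_EDITED_ASPECTS}
    unclassified = sorted(
        k for k in aspects
        if k not in USER_EDITED_ASPECTS and k not in INGESTION_ASPECTS
    )
    return to_back_up, unclassified
```

Reporting unclassified names, rather than silently dropping them, makes it easier to notice when a new aspect type appears and needs a decision.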
2. Implement Bulk Metadata Export
Use the OpenAPI v3 Scroll API for efficient bulk extraction:
POST /openapi/v3/entity/scroll
{
  "count": 1000,
  "pitKeepAlive": "10m",
  "entities": ["dataset"],
  "aspects": [
    "editableDatasetProperties",
    "editableSchemaMetadata",
    "ownership",
    "globalTags",
    "glossaryTerms",
    "domains",
    "documentation",
    "structuredProperties"
  ],
  "filter": {
    "and": [
      {
        "field": "platform",
        "values": ["snowflake"]
      }
    ]
  }
}
Paginate through results using the returned scrollId until all entities are retrieved. Store the output as date-partitioned NDJSON files with a 30-day lifecycle policy.
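The pagination loop can be sketched independently of the HTTP client. The example below assumes each response is a JSON object with an "entities" list and an optional "scrollId"; those field names, and the hypothetical fetch_page callable that performs the actual POST and passes the scroll ID back to the server, are assumptions to verify against your DataHub version.

```python
import json
from typing import Any, Callable, Dict, Optional, TextIO

Page = Dict[str, Any]


def export_to_ndjson(
    fetch_page: Callable[[Optional[str]], Page],
    out: TextIO,
) -> int:
    """Write one JSON object per entity to `out` and return the entity count.

    `fetch_page(scroll_id)` is a caller-supplied function that requests one
    page; it receives None for the first page.
    """
    written = 0
    scroll_id: Optional[str] = None
    while True:
        page = fetch_page(scroll_id)
        entities = page.get("entities") or []
        for entity in entities:
            out.write(json.dumps(entity, sort_keys=True) + "\n")
            written += 1
        scroll_id = page.get("scrollId")
        # Stop when the server returns no further scroll ID or an empty page.
        if not scroll_id or not entities:
            return written
```

Keeping the HTTP call behind fetch_page means the loop can be tested with canned pages before it is pointed at a real instance.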
3. Implement Selective Restore
For targeted restoration of specific entities, use the Python SDK for batch operations:
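import json

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ASPECT_NAME_MAP

# Initialize emitter
emitter = DatahubRestEmitter(gms_server="https://.datahubproject.io")

# URNs to restore (placeholder; fill in the entities you need)
urns_to_restore = set()

# Load backup data for specific URNs
with open("backup_2024_01_15.ndjson", "r") as f:
    for line in f:
        entity = json.loads(line)
        urn = entity["urn"]
        # Filter to specific URNs you want to restore
        if urn not in urns_to_restore:
            continue
        for aspect_name, aspect_value in entity["aspects"].items():
            # The wrapper expects an aspect object, not a raw dict. This
            # assumes each value is stored in the SDK's JSON object form;
            # adjust the deserialization if your backup layout differs.
            aspect = ASPECT_NAME_MAP[aspect_name].from_obj(aspect_value)
            mcp = MetadataChangeProposalWrapper(entityUrn=urn, aspect=aspect)
            emitter.emit_mcp(mcp)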
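To restore only certain aspects of an entity, the selection step can be separated from the emit step. The following is a minimal sketch that assumes the same backup layout as the script above (one JSON object per line with "urn" and "aspects" keys); the function name and its parameters are hypothetical.

```python
import json
from typing import Any, Iterable, Iterator, Optional, Set, Tuple


def select_for_restore(
    lines: Iterable[str],
    urns: Set[str],
    aspect_names: Optional[Set[str]] = None,
) -> Iterator[Tuple[str, str, Any]]:
    """Yield (urn, aspect_name, aspect_value) for the requested entities.

    When `aspect_names` is None, every backed-up aspect of a matching
    entity is yielded; otherwise only the named aspects are.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the NDJSON file
        entity = json.loads(line)
        urn = entity["urn"]
        if urn not in urns:
            continue
        for name, value in entity.get("aspects", {}).items():
            if aspect_names is None or name in aspect_names:
                yield urn, name, value
```

Because the function only reads and filters, it can be run as a dry run to review what would be restored before anything is emitted.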
4. Set Up Automated Backup Pipeline
- Create a scheduled job (daily) to extract metadata using the scroll API
- Store backups in cloud storage with date partitioning: s3://metadata-backups/year=2024/month=01/day=15/
- Implement retention policies to automatically delete backups older than 30 days
- Create restore scripts that can accept URN lists and target backup dates
- Test your restore procedures regularly with non-production data
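The date-partitioned layout and the 30-day retention rule from the list above can be expressed as two small helpers. This is a sketch only: the function names are hypothetical, and in practice a storage lifecycle policy may replace the expiry check entirely.

```python
from datetime import date, timedelta


def partition_prefix(base: str, day: date) -> str:
    """Build the date-partitioned prefix for one day's backup."""
    return f"{base.rstrip('/')}/year={day:%Y}/month={day:%m}/day={day:%d}/"


def is_expired(backup_day: date, today: date, retention_days: int = 30) -> bool:
    """Return True when a backup is older than the retention window."""
    return backup_day < today - timedelta(days=retention_days)
```

A restore script that accepts a target backup date can reuse partition_prefix to locate the files for that day.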
5. DataHub Cloud Managed Backups
If you use DataHub Cloud, Acryl provides managed daily backups with 30-day retention as part of the service. Contact DataHub Support for selective restoration requests. They can restore specific entities or aspects from managed backups, so you do not need your own backup infrastructure.
Additional Notes
DataHub maintains internal aspect versioning but does not provide user-accessible time-travel functionality for metadata recovery, so external backup pipelines are necessary for comprehensive data protection. The OpenAPI v3 scroll endpoint is the only supported method for bulk metadata export. Avoid using Kafka MCL streams or GraphQL for backup purposes. Always test restore procedures in a non-production environment before implementing them in production.
Tags: backup, restore, metadata-protection, user-metadata, bulk-export, selective-restore, python-sdk, openapi-v3, data-governance, best-practices