Implementing Backup and Restore Strategies for User Metadata – DataHub Customer Support Portal

Area: Best Practices

Sub-Area: Data Protection and Recovery

Issue

Organizations need to establish comprehensive backup and restore procedures for user-generated metadata in DataHub to protect against accidental data loss and enable selective recovery of specific entities or metadata aspects.

You Might Be Asking

What metadata aspects should I back up versus leave to ingestion processes?
Which APIs are recommended for bulk metadata extraction and selective restoration?
Does DataHub provide built-in backup and restore capabilities?
How can I implement automated daily backups with selective restore functionality?

Solution

1. Identify User-Edited Metadata to Back Up

Focus your backup strategy on user-edited aspects that cannot be recreated through re-ingestion:

editableDatasetProperties - Dataset descriptions edited via UI
editableSchemaMetadata - Column descriptions, tags, and glossary terms
ownership - Owner assignments made via UI
globalTags - Dataset-level tags applied via UI
glossaryTerms - Glossary term associations
domains - Domain assignments
documentation - Documentation links and annotations
structuredProperties - Structured property values

Leave ingestion-sourced aspects to be recreated by your connectors: schemaMetadata, datasetProperties, datasetProfile, datasetUsageStatistics, dataPlatformInstance, browsePathsV2, and status.
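The split above can be captured in one place so that export and restore code share the same definition. This is a minimal sketch: the aspect names come from the list in this step, while `USER_EDITED_ASPECTS` and `select_backup_aspects` are illustrative names, not part of DataHub.

```python
# Aspect names are copied from the list above; the constant and helper
# names are hypothetical and exist only for this example.
USER_EDITED_ASPECTS = frozenset({
    "editableDatasetProperties",
    "editableSchemaMetadata",
    "ownership",
    "globalTags",
    "glossaryTerms",
    "domains",
    "documentation",
    "structuredProperties",
})


def select_backup_aspects(aspects: dict) -> dict:
    """Keep only the user-edited aspects from one entity's aspect mapping."""
    return {
        name: value
        for name, value in aspects.items()
        if name in USER_EDITED_ASPECTS
    }
```

Applying the same filter on restore helps ensure an ingestion-sourced aspect such as schemaMetadata is never overwritten from an old backup.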

2. Implement Bulk Metadata Export

Use the OpenAPI v3 Scroll API for efficient bulk extraction:

POST /openapi/v3/entity/scroll
{
  "count": 1000,
  "pitKeepAlive": "10m",
  "entities": ["dataset"],
  "aspects": [
    "editableDatasetProperties",
    "editableSchemaMetadata",
    "ownership",
    "globalTags",
    "glossaryTerms",
    "domains",
    "documentation",
    "structuredProperties"
  ],
  "filter": {
    "and": [
      {
        "field": "platform",
        "values": ["snowflake"]
      }
    ]
  }
}

Paginate by sending the scrollId returned with each page in the next request, and repeat until all entities are retrieved. Store the output as date-partitioned NDJSON files (one entity per line) with a 30-day lifecycle policy.
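The pagination loop and NDJSON output might be sketched as follows. To avoid guessing at HTTP details, the request itself is left to a hypothetical `fetch_page` callable that you supply. The `entities` and `scrollId` response keys are assumptions about the response shape and should be checked against your DataHub version.

```python
import json
from typing import Callable, Iterator, Optional, TextIO

# fetch_page is a hypothetical callable: given the previous scrollId (None for
# the first page), it sends the scroll request and returns the parsed JSON
# response. The "entities" and "scrollId" keys are assumed, not confirmed.
FetchPage = Callable[[Optional[str]], dict]


def scroll_entities(fetch_page: FetchPage) -> Iterator[dict]:
    """Yield every entity, following scrollId until no further page is offered."""
    scroll_id: Optional[str] = None
    while True:
        page = fetch_page(scroll_id)
        entities = page.get("entities") or []
        yield from entities
        scroll_id = page.get("scrollId")
        # Stop when the server returns no continuation token or an empty page.
        if not scroll_id or not entities:
            break


def write_ndjson(entities: Iterator[dict], out: TextIO) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    for entity in entities:
        out.write(json.dumps(entity, sort_keys=True) + "\n")
        count += 1
    return count
```

Because `scroll_entities` is a generator, records are streamed to the file page by page rather than held in memory, which matters for large catalogs.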

3. Implement Selective Restore

For targeted restoration of specific entities, use the Python SDK for batch operations:

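import json

from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    GenericAspectClass,
    MetadataChangeProposalClass,
)

# Initialize emitter
emitter = DatahubRestEmitter(gms_server="https://.datahubproject.io")

# URNs to restore (fill in the entities you want to recover)
urns_to_restore = set()

# Load backup data for specific URNs
with open('backup_2024_01_15.ndjson', 'r') as f:
    for line in f:
        entity = json.loads(line)
        urn = entity['urn']

        # Filter to specific URNs you want to restore
        if urn not in urns_to_restore:
            continue

        # Assumes each record stores aspects as {aspect_name: aspect_json}
        for aspect_name, aspect_value in entity['aspects'].items():
            # Backed-up aspects are plain JSON, so send them as a generic
            # JSON payload rather than as a typed aspect object
            mcp = MetadataChangeProposalClass(
                entityType=urn.split(':')[2],  # e.g. "dataset"
                entityUrn=urn,
                changeType=ChangeTypeClass.UPSERT,
                aspectName=aspect_name,
                aspect=GenericAspectClass(
                    contentType='application/json',
                    value=json.dumps(aspect_value).encode('utf-8'),
                ),
            )
            emitter.emit_mcp(mcp)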
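Since the goal is to recover specific aspects as well as specific entities, it can help to separate record selection from emission so the selection can be reviewed (or dry-run) before anything is written back. A sketch, assuming the same `urn` / `aspects` record layout as the script above; `select_for_restore` and its parameters are hypothetical names.

```python
import json
from typing import Iterable, Iterator, Optional, Set, Tuple


def select_for_restore(
    lines: Iterable[str],
    urns: Set[str],
    aspect_names: Optional[Set[str]] = None,
) -> Iterator[Tuple[str, str, dict]]:
    """Yield (urn, aspect_name, aspect_value) for the requested URNs.

    When aspect_names is None, every backed-up aspect of a matching URN is
    yielded; otherwise only the named aspects are.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the backup file
        entity = json.loads(line)
        if entity["urn"] not in urns:
            continue
        for name, value in entity.get("aspects", {}).items():
            if aspect_names is None or name in aspect_names:
                yield entity["urn"], name, value
```

Passing an open backup file as `lines` and printing the yielded tuples gives a preview of exactly what a restore would emit.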

4. Set Up Automated Backup Pipeline

Create a scheduled job (daily) to extract metadata using the scroll API
Store backups in cloud storage with date partitioning: s3://metadata-backups/year=2024/month=01/day=15/
Implement retention policies to automatically delete backups older than 30 days
Create restore scripts that can accept URN lists and target backup dates
Test your restore procedures regularly with non-production data
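The partition layout and the retention rule above can be expressed as two small helpers. This is a sketch only: the bucket prefix is taken from the example path, and the function names are illustrative. In practice a storage-level lifecycle rule may be preferable to deleting partitions from a script.

```python
from datetime import date, timedelta
from typing import Iterable, List

BUCKET_PREFIX = "s3://metadata-backups"  # from the example path above
RETENTION_DAYS = 30


def partition_path(day: date) -> str:
    """Build the date-partitioned prefix for one day's backup."""
    return (
        f"{BUCKET_PREFIX}/year={day.year:04d}"
        f"/month={day.month:02d}/day={day.day:02d}/"
    )


def expired_days(existing: Iterable[date], today: date) -> List[date]:
    """Return backup dates older than the retention window, oldest first."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return sorted(d for d in existing if d < cutoff)
```

A restore script that accepts a target backup date can reuse `partition_path` to locate the right file, which keeps the writer and the reader in agreement about the layout.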

5. DataHub Cloud Managed Backups

If using DataHub Cloud, Acryl provides managed daily backups with 30-day retention as part of the service. Contact DataHub Support for selective restoration requests - they can restore specific entities or aspects from managed backups without requiring your own backup infrastructure.

Additional Notes

DataHub maintains internal aspect versioning but does not provide user-accessible time-travel functionality for metadata recovery. External backup pipelines are necessary for comprehensive data protection. The OpenAPI v3 scroll endpoint is the only supported method for bulk metadata export - avoid using Kafka MCL streams or GraphQL for backup purposes. Always test restore procedures in a non-production environment before implementing in production.