Area: Deployment Issues
Sub-Area: Remote Executor Network Configuration
Issue
Organizations using AWS PrivateLink or other private network configurations encounter package installation failures when running DataHub ingestion recipes. The Remote Ingestion Executor (RIE) fails to create virtual environments because it cannot access PyPI (pypi.org) to download required Python packages, resulting in blocked ingestion workflows in network-restricted environments.
Error Messages
error sending request for url (https://pypi.org/simple/acryl-datahub/)
Unable to reach PyPI for package installation
You Might Be Asking
- Can PyPI be added to Private DNS resolution for PrivateLink access?
- How can I run ingestion in a network-restricted environment?
- Why does the Remote Executor need internet access?
- Can I disable dynamic virtual environment creation?
Solution
The Remote Executor requires access to a Python package index to install dependencies during virtual environment creation. Here are the recommended solutions for private network environments:
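Before choosing an option, it can help to confirm which package index the executor host can actually reach. The following is a minimal diagnostic sketch, not part of DataHub: the function names `simple_index_url` and `can_reach` are hypothetical, and it assumes the index exposes the standard "simple" layout that appears in the error message above.

```python
import urllib.error
import urllib.request


def simple_index_url(index_url: str, package: str) -> str:
    """Build the "simple" index page URL for a package under an index root."""
    return f"{index_url.rstrip('/')}/{package}/"


def can_reach(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL produces any HTTP response.

    An HTTP error status (401, 404, ...) still proves the network path
    works, so only connection-level failures count as unreachable.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, so the network path is open.
        return True
    except OSError:
        # Covers URLError, DNS failures, refused connections, and timeouts.
        return False
```

Run from the same network as the Remote Executor, `can_reach(simple_index_url("https://pypi.org/simple", "acryl-datahub"))` returning `False` is consistent with the error shown above. The same call with an internal mirror URL shows whether Option 1 is viable.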
Option 1: Configure Internal PyPI Mirror (Recommended)
- Set up an internal PyPI mirror using tools like Artifactory, Nexus, or AWS CodeArtifact
- Configure your Remote Executor to use the internal mirror by setting environment variables:
    PIP_INDEX_URL=https://your-internal-mirror/simple
    PIP_EXTRA_INDEX_URL=https://your-backup-mirror/simple
- Update your Remote Executor deployment configuration to include these environment variables
- Restart the Remote Executor to apply the new configuration
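The environment variables for Option 1 can be generated and sanity-checked before they go into a deployment. This is a small sketch under stated assumptions: `pip_index_env` and `render_env_file` are hypothetical helpers, and the mirror URLs are the placeholders used above. It relies on pip's documented behavior of accepting several space-separated values in `PIP_EXTRA_INDEX_URL`.

```python
from typing import Dict, Sequence
from urllib.parse import urlparse


def pip_index_env(index_url: str, extra_index_urls: Sequence[str] = ()) -> Dict[str, str]:
    """Return the pip environment variables for a private package index."""
    for url in (index_url, *extra_index_urls):
        parsed = urlparse(url)
        # Reject values that are clearly not usable index URLs.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not a valid index URL: {url!r}")
    env = {"PIP_INDEX_URL": index_url}
    if extra_index_urls:
        # pip reads multiple extra indexes as a space-separated list.
        env["PIP_EXTRA_INDEX_URL"] = " ".join(extra_index_urls)
    return env


def render_env_file(env: Dict[str, str]) -> str:
    """Render the variables as KEY=value lines, one per line."""
    return "".join(f"{key}={value}\n" for key, value in env.items())
```

For example, `render_env_file(pip_index_env("https://your-internal-mirror/simple", ["https://your-backup-mirror/simple"]))` produces the two lines shown in the list above. How they are injected (env file, Helm values, task definition) depends on how the Remote Executor is deployed.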
Option 2: Use Pre-Bundled Executor Images
- Create a custom Docker image extending the DataHub Remote Executor base image
- Pre-install required plugins in your custom image:
    FROM acryldata/datahub-ingestion-slim:latest
    RUN pip install --upgrade acryl-datahub[]
- Build and deploy the custom image to your container registry
- Update your Remote Executor deployment to use the custom image
- Configure ingestion recipes to use bundled virtual environments (requires DataHub 0.3.17+)
Option 3: Network-Level PyPI Access
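- Configure your network infrastructure to allow outbound HTTPS access to pypi.org and to files.pythonhosted.org, the host that serves the package files themselves
- Ensure the Remote Executor's network security groups allow outbound connections on port 443
- Prefer domain-based firewall or proxy rules over IP allowlists where possible; PyPI is served through a CDN, so its IP addresses can change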
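After changing security groups or firewall rules, a TCP-level check on port 443 can confirm the rules took effect. The sketch below is illustrative only: `check_outbound` is a hypothetical helper, and the host list assumes the public index, where pip reads index pages from pypi.org and downloads files from files.pythonhosted.org.

```python
import socket
from typing import Callable, Dict, Iterable

# Hosts pip contacts when installing from the public index (assumption:
# no mirror or proxy is configured).
PYPI_HOSTS = ("pypi.org", "files.pythonhosted.org")


def check_outbound(
    hosts: Iterable[str] = PYPI_HOSTS,
    port: int = 443,
    timeout: float = 5.0,
    connect: Callable = socket.create_connection,
) -> Dict[str, bool]:
    """Try to open a TCP connection to each host and report the result."""
    results = {}
    for host in hosts:
        try:
            conn = connect((host, port), timeout)
            conn.close()
            results[host] = True
        except OSError:
            # DNS failure, refused connection, or timeout.
            results[host] = False
    return results
```

Calling `check_outbound()` from the executor's network returns a mapping of host to `True` or `False`. The `connect` parameter exists only so the function can be exercised without real network access. A successful TCP connection does not prove TLS inspection or a proxy will let pip through, so treat this as a first-pass check.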
Configuration for Bundled Virtual Environments (DataHub 0.3.17+)
For organizations requiring fully air-gapped deployments:
- Upgrade your DataHub instance to version 0.3.17 or higher
- Create a Dockerfile with pre-installed dependencies:
    FROM acryldata/datahub-ingestion:0.3.17
    # Add other required plugins as needed
    RUN pip install --upgrade acryl-datahub[glue,snowflake,s3]
- Build and tag the image:
    docker build -t/datahub-executor:bundled .
- Deploy the custom image to your Remote Executor infrastructure
- Update ingestion recipes to reference the bundled virtual environment configuration
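Because an air-gapped executor cannot install anything at run time, it may be worth verifying the custom image before deploying it. The following sketch uses only the standard library to report which expected distributions are absent. `missing_distributions` is a hypothetical helper, and the list of names to check is an assumption you would adapt to the plugins your recipes use.

```python
from importlib import metadata
from typing import Iterable, List


def missing_distributions(names: Iterable[str]) -> List[str]:
    """Return the distribution names that are not installed in this environment."""
    missing = []
    for name in names:
        try:
            # Raises PackageNotFoundError when the distribution is absent.
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

Run inside the built image, `missing_distributions(["acryl-datahub"])` should return an empty list. Adding the distributions that your chosen extras are expected to pull in makes the check stricter; a non-empty result means the image would still need index access at run time.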
Additional Notes
AWS PrivateLink is designed for private connectivity to specific services and cannot proxy arbitrary public internet destinations like PyPI. The bundled virtual environment feature requires DataHub 0.3.17+ and may have limitations on CLI version flexibility. Pre-bundled images will require manual updates when new plugin versions are released. Consider the trade-offs between security isolation and operational flexibility when choosing your approach.
Tags: private-network, privatelink, remote-executor, pypi, package-installation, air-gapped, network-security, ingestion-setup, deployment-configuration, virtual-environment