Network Isolation for Azure Databricks

For the highest level of security in an Azure Databricks deployment, clusters can be deployed in a custom Virtual Network. With the default setup, inbound traffic is locked down, but outbound traffic is unrestricted for ease of use. The network can be configured to restrict outbound traffic.

For data science and exploratory environments, it is usually advisable to use the default configuration, which allows outbound communication to the Internet. This allows users to download any libraries for Python, R and Maven that they may need, as well as the Ubuntu packages that support them.

Locked-down Network Security Group

We will show how to spin up a Databricks network configuration in which traffic is highly restricted. The final Network Security Group configuration will be as follows.

Inbound security rules

  • Inbound traffic from the Databricks control plane must be allowed on ports 22 and 5557. The control plane IP varies depending on the Azure region where the workspace is deployed; see Control Plane IP Addresses to find the IP matching your region.
  • Communication within the VirtualNetwork is unrestricted, to allow workers to communicate with each other.
  • The default rules (with priorities starting with 65000) cannot be modified. The first two have no effect, and the last one will deny any other inbound traffic.
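As a sketch, an inbound NSG rule allowing the control plane to reach worker nodes on port 22 could look like the following (the rule name, priority and the region IP placeholder are illustrative, not copied from the actual template):

```json
{
  "name": "databricks-control-plane-ssh",
  "properties": {
    "access": "Allow",
    "description": "Required for Databricks control plane management of worker nodes.",
    "destinationAddressPrefix": "VirtualNetwork",
    "destinationPortRange": "22",
    "direction": "Inbound",
    "priority": 100,
    "protocol": "Tcp",
    "sourceAddressPrefix": "<control-plane-IP-for-your-region>",
    "sourcePortRange": "*"
  }
}
```

A second rule of the same shape would cover port 5557.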

Outbound security rules

  • Databricks nodes must be allowed to communicate with the control plane.
  • Databricks nodes must be able to communicate with Storage (Blob and Azure Data Lake Storage Gen2 accounts). If you use Azure Data Lake Storage Gen1, add an extra rule with AzureDataLake as the destination service tag.
  • A rule is predefined to allow communication to Azure SQL Database and Azure SQL Data Warehouse. These are often used as part of data pipelines. It’s also advisable to use an external metastore (e.g. in Azure SQL Database) in order to decouple the lifecycle of your metastore from the lifecycle of your Databricks workspace.
  • Communication within the VirtualNetwork is unrestricted, to allow workers to communicate with each other.
  • Here too, the default rules (with priorities starting with 65000) cannot be modified. In this case, they do not apply, as they are superseded with rules of higher priority.
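For illustration, an outbound rule allowing worker traffic to Azure Storage via the Storage service tag might look like this (name and priority are hypothetical):

```json
{
  "name": "databricks-worker-to-storage",
  "properties": {
    "access": "Allow",
    "description": "Required for worker communication with Azure Storage services.",
    "destinationAddressPrefix": "Storage",
    "destinationPortRange": "443",
    "direction": "Outbound",
    "priority": 100,
    "protocol": "Tcp",
    "sourceAddressPrefix": "*",
    "sourcePortRange": "*"
  }
}
```

Service tags such as Storage and Sql resolve to the IP ranges of the corresponding Azure service in the region, so the rules do not need hard-coded addresses.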


To deploy this configuration, start by downloading the Databricks VNET injection Azure Resource Manager template.

Delete the following rule from the Network Security Group definition in the template.

            "name": "databricks-worker-to-any",
            "properties": {
              "access": "Allow",
              "description": "Required for worker nodes communication with any destination.",
              "destinationAddressPrefix": "*",
              "destinationPortRange": "*",
              "direction": "Outbound",
              "priority": 140,
              "protocol": "*",
              "sourceAddressPrefix": "*",
              "sourcePortRange": "*"

Add the following rule instead.

            "name": "deny-outbound",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Deny",
              "priority": 230,
              "direction": "Outbound"

Then deploy the template through the Azure Portal, with PowerShell, or with the Azure CLI.
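For example, with the Azure CLI the deployment could look like the following (the resource group name, location and template file name are placeholders):

```shell
# Create a resource group and deploy the edited template into it.
az group create --name databricks-rg --location westeurope
az group deployment create \
  --resource-group databricks-rg \
  --template-file azuredeploy.json
```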

Service Endpoints for PaaS resources

For additional security, you can configure PaaS resources such as Storage and SQL Database / Data Warehouse to be accessible only from specific Virtual Networks, including the one containing Databricks workers. This is achieved through VNET Service Endpoints deployed in selected Virtual Network subnets.

You can configure Service Endpoints on creation of the resources, or afterwards in the Firewalls and virtual networks tab in the Azure portal. Make sure you select the “public” subnet created with the template. The “private” subnet is exclusively used for internode communication.
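As a sketch with the Azure CLI (all resource names are placeholders), enabling a Storage service endpoint on the public subnet and restricting a storage account to it could look like this:

```shell
# Enable the Microsoft.Storage service endpoint on the Databricks public subnet.
az network vnet subnet update \
  --resource-group databricks-rg --vnet-name databricks-vnet \
  --name public-subnet --service-endpoints Microsoft.Storage

# Allow only that subnet through the storage account firewall.
az storage account network-rule add \
  --resource-group databricks-rg --account-name mystorageaccount \
  --vnet-name databricks-vnet --subnet public-subnet
```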

Note that when using ADLS Gen1 with credential passthrough, you also need to enable Allow all Azure services to access this Data Lake Storage Gen1 account. This is due to the way tokens are generated by Databricks and validated by ADLS Gen1’s firewall. To avoid this, use a service principal instead of passthrough (which will work with service endpoints), or use ADLS Gen2.

HDInsight Kafka clusters

HDInsight Kafka clusters are deployed in a Virtual Network, therefore the connection mechanism is different. Two main options are available:

  • Deploy the Kafka cluster in the same VNET as Databricks. Make sure you use a dedicated subnet for HDInsight Kafka (the Databricks public subnet must be reserved for Databricks, according to best practice).
  • Peer the Kafka cluster VNET with the Databricks VNET.

Key Vault-backed secret scopes

When using a Key Vault-backed secret scope in Azure Databricks, connection to the Key Vault is done by the control plane application, not by the cluster. Therefore, although Key Vault supports service endpoints, that mechanism cannot be used. Instead, toggle the Allow trusted services option on the Key Vault firewall.
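With the Azure CLI, the equivalent of toggling that option could be sketched as follows (the vault name is a placeholder):

```shell
# Deny public access to the vault but let trusted Azure services
# (including the Databricks control plane) through the firewall.
az keyvault update --name my-keyvault \
  --default-action Deny --bypass AzureServices
```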

Access to the Control plane

By nature of the network architecture of Azure Databricks, the Databricks portal and REST API reside within a multitenant application deployed as an Azure Web Site. Therefore, it remains accessible externally to users and orchestrators such as Azure Data Factory, even when the clusters themselves are deployed within a locked-down Virtual Network.

To apply security restrictions to the Databricks portal, set up Azure AD Conditional Access policies.

Connecting to on-premises data

This configuration is particularly suited for enterprises with existing datacenters which want to use Azure Databricks connected to on-premises databases. This secure deployment effectively extends the datacenter into a locked-down private cloud environment.

The diagram at the top of the page shows hybrid networking with the recommended hub-and-spoke model. Connectivity to the on-premises data center can be achieved either through ExpressRoute or with a Site-to-site VPN. User-defined routes should be put in place to ensure Databricks clusters can communicate directly with the control plane. Refer to the Databricks documentation for more information.
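One way to express such a user-defined route with the Azure CLI is sketched below (route table name is hypothetical, and the control plane IP placeholder must be replaced with the address for your region):

```shell
# Route control plane traffic directly out to the Internet rather than
# through the on-premises default route.
az network route-table create \
  --resource-group databricks-rg --name databricks-routes
az network route-table route create \
  --resource-group databricks-rg --route-table-name databricks-routes \
  --name control-plane --address-prefix <control-plane-IP>/32 \
  --next-hop-type Internet
```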

Additional outbound connectivity (optional)

Ubuntu updates

Databricks is configured to automatically perform OS updates. You can check this for yourself by running this cell in a notebook:
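A cell along these lines shows the configuration (the file path assumes the standard Ubuntu image; run it from a %sh notebook cell):

```shell
# Shows whether APT unattended upgrades are enabled: a value of "1" for
# APT::Periodic::Unattended-Upgrade means they run automatically.
cat /etc/apt/apt.conf.d/20auto-upgrades 2>/dev/null \
  || echo "20auto-upgrades not present on this machine"
```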

If you block outgoing connectivity as described above, such updates will not be installed. If you are concerned about getting security fixes, and your clusters are long-lived, you may want to enable outgoing connectivity to the Ubuntu update servers.

What about NTP?

Linux VMs on Azure are automatically configured to use the Precision Time Protocol (PTP) to synchronize their system clocks from the hypervisor. PTP is the successor to the venerable NTP protocol. This can also be checked by running a cell in a notebook:

Therefore, it is not necessary to open additional hosts and ports to ensure precise time synchronization.

Updated 5 June 2019 with section about ADLS Gen1 firewall.

Software Engineer at Microsoft, Data & AI, open source fan