
Securing access to shared metastore with Azure Databricks

Uses for an external metastore

Every Azure Databricks deployment has a central Hive metastore accessible by all clusters to persist table metadata, including table and column names as well as storage location. By default, the metastore is managed by Azure in the shared Databricks control plane. Instead of using the default, you have the option to use your existing external Hive metastore instance. Scenarios for using an external metastore include:

  • Sharing a metastore with other services, such as Hadoop clusters. By defining external storage locations for tables such as wasb://…, abfs://…, adl://…, Hadoop and Databricks clusters can share both metadata and storage (see the sketch after this list).
  • Managing the metastore lifecycle independently of the Databricks workspace lifecycle. For example, to implement a custom Disaster Recovery strategy, independent of the Microsoft-managed one (see Regional disaster recovery for Azure Databricks).
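
As a sketch of the first scenario, a table registered from the Hadoop side with an external storage location might be defined as follows. The beeline connection URL, table, container and account names are all hypothetical placeholders:

# Hypothetical example: register a shared table from the Hadoop cluster.
# Because the LOCATION points to external storage, a Databricks cluster
# attached to the same metastore can query the same table and data.
beeline -u "jdbc:hive2://headnodehost:10001/;transportMode=http" -e "
  CREATE EXTERNAL TABLE clicks (id BIGINT, url STRING)
  LOCATION 'wasb://data@mystorageacct.blob.core.windows.net/clicks';"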

As the default metastore cannot be exported, migrating from it to an external metastore is not straightforward (you would need to reconstruct the metadata using Spark SQL DESCRIBE commands). Even if the scenarios above are not needed today, they might be in the future, so it can be worth planning ahead and starting with an external metastore strategy.

You can use any database supporting JDBC connectivity as a metastore. If you use Hortonworks, note that Azure SQL Database can be used as a metastore but is not officially supported by Hortonworks.

Check the Databricks External Metastore page for a list of supported Hive Metastore versions.

Azure Databricks can also initialize an empty database as a metastore, by setting specific options in the Spark configuration.
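
For example, the DataNucleus options below, added alongside the connection settings in the Spark configuration (in the same format as the init script's config file later in this post), instruct the metastore client to create the schema on first use. These option names are an assumption for Hive 1.2.x; check the Databricks External Metastore page for the names matching your Hive version:

# Assumed DataNucleus option names for Hive 1.2.x; verify against the Databricks docs
"datanucleus.autoCreateSchema" = "true"
"datanucleus.fixedDatastore" = "false"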

Metastore password management

The metastore connection string must be defined in the Spark Context configuration. Therefore, the connection definition, including the password, must be defined either in the cluster properties, or in a cluster initialization script that runs on node creation. It is not possible to use Databricks managed Secrets, as those are only available from Notebooks running with an already created Spark Context.

As an alternative to storing the password in clear in an initialization script, the following solution can be used.

  • Deploy Azure Databricks in your Azure Virtual Network.
  • Create an Azure Key Vault only for storing the metastore password.
  • Store the metastore password as a secret in the Key Vault.
  • Configure the Key Vault firewall to only allow connectivity from your Azure Databricks network.
  • Create a service principal in Azure Active Directory and create an access policy in the Key Vault allowing the service principal to retrieve the secret.
  • Create an initialization script for the Azure Databricks cluster that retrieves the metastore password from Key Vault and sets it in the Spark properties. The script contains the service principal password in clear.

Although the service principal password is now stored in clear, it can only be used to connect to the Key Vault, and only from the Azure Databricks virtual network. Arguably, the attack surface is not increased: an attacker with unimpeded access to the Azure Databricks virtual network and file system would be able to retrieve the metastore password directly anyway. This solution is aimed more at avoiding the possibility of inadvertently exposing the metastore password.

Walkthrough

This walkthrough demonstrates the proposed setup: an Azure Databricks workspace deployed in a Virtual Network, plus a SQL Database metastore and an Azure Key Vault whose firewalls are configured to only allow access from that Virtual Network. An HDInsight cluster is briefly deployed, then deleted, in order to initialize the metastore with some data and demonstrate the metastore sharing capability.

Setting up the metastore

In the Azure portal, create an Azure SQL Database. Create a new Server with Allow Azure services to access server enabled for now. Leave all other options as default.

Create a Storage account with all options as default.
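
If you prefer scripting these portal steps, the rough Azure CLI equivalent is shown below. The resource group, server, database, storage account names and password are placeholders:

# Sketch only: placeholder names throughout
az sql server create -g metastore-rg -n metastoreserver -l westeurope \
    --admin-user dbadmin --admin-password '<strong-password>'
az sql db create -g metastore-rg -s metastoreserver -n metastoredb
# "Allow Azure services to access server" corresponds to the 0.0.0.0 firewall rule
az sql server firewall-rule create -g metastore-rg -s metastoreserver \
    -n AllowAllWindowsAzureIps --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0
az storage account create -g metastore-rg -n metastorestorage --sku Standard_LRS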

Create an HDInsight cluster. In the cluster creation screen, select Cluster Type: Hadoop with HDInsight version 3.6, use the Storage account as primary storage, and use the Azure SQL Database you’ve created as a Hive metastore. HDInsight will create a Hive metastore version 1.2.0. You can continue through the next steps while the cluster is being created.

Deploying Azure Databricks in a VNET

As an alternative to the steps below, you can deploy the all-in-one ARM template at .

Create a Virtual Network with all options as default. Take note of the allocated Address space, in this case 10.3.0.0/16 for the VNET and 10.3.0.0/24 for the first subnet.
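
The CLI equivalent, using the address space from this walkthrough (resource group and network names are placeholders):

# Sketch: create the VNET with the address space used in this walkthrough
az network vnet create -g metastore-rg -n databricks-vnet \
    --address-prefix 10.3.0.0/16 \
    --subnet-name default --subnet-prefix 10.3.0.0/24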

Create a Databricks workspace. Select the Standard tier. Select the option to deploy the Workspace in your Virtual Network, and select the Virtual Network you just created. Enter ranges compatible with the selected Virtual Network. In our case, 10.3.1.0/24 and 10.3.2.0/24 are unallocated ranges in the VNET, which we can use for our subnets.

Setting up the Key Vault

In Azure Active Directory, create a service principal. Create a secret for the service principal, and copy the secret. Also take note of the Application ID (Client ID) and Azure AD Tenant ID.
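
A scripted sketch of this step (the display name is a placeholder, and the exact flags may differ across Azure CLI versions). The output contains the Application ID (appId), the secret (password) and the tenant ID (tenant):

# Sketch: create a service principal with no role assignments
az ad sp create-for-rbac -n metastore-keyvault-reader --skip-assignment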

Create an Azure Key Vault. Create an access policy for the AAD application you have just created, with the Get Secret permission only. Under Virtual Network Access, leave All networks can access for now.

Navigate to the Key Vault. In the Secrets tab, create a new secret containing your metastore password.
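
The equivalent CLI sketch, with placeholder names; the secret name metastorepassword matches the one retrieved by the initialization script below:

# Sketch: create the vault, grant the service principal Get on secrets, store the password
az keyvault create -g metastore-rg -n KEYVAULTNAME
az keyvault set-policy -n KEYVAULTNAME --spn <application-id> --secret-permissions get
az keyvault secret set --vault-name KEYVAULTNAME -n metastorepassword --value '<metastore-password>'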

Configure the Key Vault (in the Firewall tab) to only allow traffic from the public subnet of your Databricks Virtual Network. This will create a service endpoint.
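
Scripted, this amounts to enabling the Microsoft.KeyVault service endpoint on the public Databricks subnet and then restricting the vault. The subnet name public-subnet is a placeholder for whatever you chose at workspace creation:

# Sketch: service endpoint on the subnet, then lock the vault down to that subnet
az network vnet subnet update -g metastore-rg --vnet-name databricks-vnet \
    -n public-subnet --service-endpoints Microsoft.KeyVault
az keyvault network-rule add -n KEYVAULTNAME -g metastore-rg \
    --vnet-name databricks-vnet --subnet public-subnet
az keyvault update -n KEYVAULTNAME -g metastore-rg --default-action Deny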

Navigate to your Databricks workspace and create a cluster. Create a Scala notebook in Databricks, and enter the following into a cell, replacing the username, password and tenant values with the ones for your service principal. Also replace DBSERVER, DBNAME and DBUSER with your metastore database server and user, and KEYVAULTNAME with the name of your Key Vault. Run the cell.

dbutils.fs.put("/databricks/configure-metastore.sh","""
#!/bin/bash

#stop on errors
set -euxo pipefail

#install Azure CLI (from https://docs.microsoft.com/en-us/cli/azure/install-azure-cli-apt)
sudo apt-get install apt-transport-https lsb-release software-properties-common dirmngr -y
AZ_REPO=$(lsb_release -cs)
echo "deb [arch=amd64] https://packages.microsoft.com/repos/azure-cli/ $AZ_REPO main" | \
    sudo tee /etc/apt/sources.list.d/azure-cli.list
sudo apt-key --keyring /etc/apt/trusted.gpg.d/Microsoft.gpg adv \
     --keyserver packages.microsoft.com \
     --recv-keys BC528686B50D79E339D3721CEB3E94ADBE1229CF
sudo apt-get update
sudo apt-get install azure-cli -y

#Login and retrieve Key Vault secret
az login --service-principal --allow-no-subscriptions --username 95123b3b-234de-2521-a294-9143d48cab16 --password 'AwsUHEXf9EB67GdYA4VzMD7OlZ0GiSoPYEFZr10tgGg=' --tenant 41f915bf-12f1-51af-51ab-2d7cd912db21 
METASTORE_PASSWORD=$(az keyvault secret show  --vault-name KEYVAULTNAME -n metastorepassword --query value -o tsv)

# Loads environment variables to determine the correct JDBC driver to use.
source /etc/environment
cat << EOF > /databricks/driver/conf/00-custom-spark.conf
[driver] {
    # Hive specific configuration options.
    # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client.
    # JDBC connect string for a JDBC metastore
    "spark.hadoop.javax.jdo.option.ConnectionURL" = "jdbc:sqlserver://DBSERVER.database.windows.net:1433;database=DBNAME;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"

    # Username to use against metastore database
    "spark.hadoop.javax.jdo.option.ConnectionUserName" = "DBUSER"

    # Password to use against metastore database
    "spark.hadoop.javax.jdo.option.ConnectionPassword" = "$METASTORE_PASSWORD"

    # Driver class name for a JDBC metastore
    "spark.hadoop.javax.jdo.option.ConnectionDriverName" = "com.microsoft.sqlserver.jdbc.SQLServerDriver"

    # Spark specific configuration options
    "spark.sql.hive.metastore.version" = "1.2.1"
    # Skip this one if hive-version is 0.13.x.
    "spark.sql.hive.metastore.jars" = "builtin"
}
EOF

echo Successfully set metastore configuration
""", overwrite = true)

Test the script by running it in a second cell. The output should end with Successfully set metastore configuration.

%sh bash /dbfs/databricks/configure-metastore.sh

Shut down the cluster. Edit the cluster configuration to enable the init script at dbfs:/databricks/configure-metastore.sh, and restart the cluster.

Check that the HDInsight cluster was deployed successfully. As part of its initialization, HDInsight will create a table called hivesampletable. Its metadata is stored in the metastore, and the data itself is stored in the storage account. To test metastore connectivity, we will query that table from Databricks.

Create a SQL notebook in Databricks and enter this command:

SELECT * FROM hivesampletable

The command returns an error, but the error itself shows that Spark successfully connected to the metastore: it retrieved the Azure Storage path where the data is located, and only failed because the cluster has no credentials for that storage account yet. Provide them by running the following cell:

%scala spark.conf.set("fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net", "<your-storage-account-access-key>")

And rerun the SELECT cell. You should see the content of the table.

Delete the HDInsight cluster. You can then configure the metastore SQL Database firewall to only allow traffic from the Databricks virtual network.
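
A CLI sketch of this last step, again with placeholder names (note that --service-endpoints replaces the whole list on the subnet, so the Microsoft.KeyVault entry from earlier must be repeated):

# Sketch: restrict the metastore database to the Databricks VNET
az network vnet subnet update -g metastore-rg --vnet-name databricks-vnet \
    -n public-subnet --service-endpoints Microsoft.KeyVault Microsoft.Sql
az sql server vnet-rule create -g metastore-rg -s metastoreserver \
    -n databricks-vnet-rule --vnet-name databricks-vnet --subnet public-subnet
# Remove the "Allow Azure services" rule created during setup
az sql server firewall-rule delete -g metastore-rg -s metastoreserver -n AllowAllWindowsAzureIps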
