See Part 1, Using Azure AD With The Azure Databricks API, for background on the Azure AD authentication mechanism for Databricks. Here we show how to bootstrap the provisioning of an Azure Databricks workspace and generate a PAT token that can be used by downstream applications. Create a script generate-pat-token.sh with the following content. […]
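The script body itself is elided above; as an illustration of the flow it implements, here is a minimal sketch in Python rather than bash (the workspace URL, token lifetime, and comment are assumptions): obtain an Azure AD access token for the well-known Databricks resource application via the Azure CLI login context, then exchange it for a PAT through the Databricks Token API.

```python
# Sketch only: acquire an Azure AD token for the Azure Databricks resource,
# then exchange it for a Databricks PAT via the Token API.
import json
import subprocess
import requests

# Well-known Azure AD application ID of the Azure Databricks resource.
DATABRICKS_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"

# Reuse the Azure CLI login context to obtain an AAD access token.
aad_token = json.loads(subprocess.check_output(
    ["az", "account", "get-access-token",
     "--resource", DATABRICKS_RESOURCE_ID]))["accessToken"]

# Assumption: substitute your workspace's regional URL.
workspace_url = "https://westeurope.azuredatabricks.net"

resp = requests.post(
    f"{workspace_url}/api/2.0/token/create",
    headers={"Authorization": f"Bearer {aad_token}"},
    json={"lifetime_seconds": 3600, "comment": "bootstrap PAT"})
resp.raise_for_status()
print(resp.json()["token_value"])  # the PAT for downstream applications
```

The returned token_value can then be handed to downstream applications in place of a manually created PAT.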
Articles
Unit testing Databricks notebooks
A simple way to unit test notebooks is to put the logic in a notebook that accepts parameterized inputs, and the assertions in a separate test notebook. The sample project https://github.com/algattik/databricks-unit-tests/ contains two demonstration notebooks: the normalize_orders notebook processes a list of Orders and a list of OrderDetails into a joined list, taking into account […]
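To give a flavour of the pattern (the names below are illustrative, not the repository's exact code), the logic notebook exposes a join over the two inputs, and the test notebook builds tiny DataFrames and asserts on the output:

```python
# Illustrative only; the repository's actual notebooks differ in detail.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# --- logic notebook: join orders with their detail lines ---
def normalize_orders(orders, order_details):
    return orders.join(order_details, on="OrderId", how="inner")

# --- test notebook: build tiny inputs and assert on the result ---
orders = spark.createDataFrame([(1, "alice")], ["OrderId", "Customer"])
details = spark.createDataFrame([(1, "widget", 2)], ["OrderId", "Item", "Qty"])

rows = normalize_orders(orders, details).collect()
assert len(rows) == 1, "expected exactly one joined row"
assert rows[0]["Item"] == "widget"
```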
Exploring stream processing with Flink on Kubernetes
(updated 2019-11-18 with streaming-at-scale repository link) Apache Flink is a popular engine for distributed stream processing. In contrast to Spark Structured Streaming, which processes streams as micro-batches, Flink is a pure streaming engine that processes messages one at a time. Running Flink in a modern cloud deployment on Azure poses some challenges. Flink can […]
Using the TensorFlow Object Detection API on Azure Databricks
The easiest way to train an Object Detection model is to use the Azure Custom Vision cognitive service. That said, Custom Vision is optimized to quickly recognize major differences between images, which means it can be trained with small datasets, but it is not suited to detecting subtle differences in images (for example, detecting […]
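For a flavour of what a trained model looks like at inference time, here is a hedged TF2-style sketch (the exported-model path is a placeholder, and the article itself may target an earlier TensorFlow version): load the exported SavedModel and read the Object Detection API's standard output tensors.

```python
# Hedged TF2-style sketch; the exported model path is a placeholder.
import numpy as np
import tensorflow as tf

# Load a model exported by the TensorFlow Object Detection API.
detect_fn = tf.saved_model.load("exported_model/saved_model")

# A single RGB image as a uint8 tensor with a batch dimension.
image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder image
input_tensor = tf.convert_to_tensor(image)[tf.newaxis, ...]

detections = detect_fn(input_tensor)

# Standard Object Detection API output tensors.
boxes = detections["detection_boxes"][0].numpy()
scores = detections["detection_scores"][0].numpy()
print(boxes[scores > 0.5])  # boxes detected with >50% confidence
```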
PaaS integration testing with Azure DevOps
Using Azure DevOps pipelines, we can easily spin up test environments to run various sorts of integration tests on PaaS resources. Azure DevOps allows powerful scripting and orchestration using familiar CLI commands, and is very useful for automatically spinning up entire environments with Infrastructure as Code, without manual intervention. Sample project: In this example, we looked at […]
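As an illustration of the idea (resource names and the ARM template are assumptions, not the sample project's code), a test job can create a throwaway resource group, deploy a template into it, run its tests, and tear everything down again:

```python
# Illustrative only; resource names and the ARM template are assumptions.
import subprocess
import uuid

def az(*args):
    subprocess.run(["az", *args], check=True)

rg = f"integration-test-{uuid.uuid4().hex[:8]}"  # throwaway resource group

az("group", "create", "--name", rg, "--location", "westeurope")
try:
    az("deployment", "group", "create",
       "--resource-group", rg,
       "--template-file", "azuredeploy.json")  # hypothetical template
    # ... run integration tests against the deployed PaaS resources ...
finally:
    az("group", "delete", "--name", rg, "--yes", "--no-wait")
```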
DevOps in Azure with Databricks and Data Factory
Building simple deployment pipelines to synchronize Databricks notebooks across environments is easy, and such a pipeline could fit the needs of small teams working on simple projects. Yet a more sophisticated application includes other types of resources that need to be provisioned in concert and securely connected, such as Data Factory pipelines, storage accounts and […]
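One building block of such a pipeline is pushing notebooks into a target workspace. As a sketch (the workspace URL, token, and paths below are placeholders), this can be done with the Databricks Workspace API:

```python
# Sketch only; URL, token and paths are placeholders.
import base64
import requests

workspace_url = "https://westeurope.azuredatabricks.net"
token = "<PAT or AAD token for the target workspace>"

with open("notebooks/etl.py", "rb") as f:  # hypothetical notebook source
    content = base64.b64encode(f.read()).decode()

resp = requests.post(
    f"{workspace_url}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "path": "/Shared/etl",  # target path in the workspace
        "format": "SOURCE",
        "language": "PYTHON",
        "content": content,     # base64-encoded notebook source
        "overwrite": True,
    })
resp.raise_for_status()
```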
Embedding Power BI content with a Service Principal
Until now, embedding Power BI reports or dashboards into a web application or automating processes with the Power BI API required a master account. A master account is an actual Power BI account with a username and password that the embedding app uses to connect to the Power BI API. The master account that is […]
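To sketch what the service-principal flow looks like instead (tenant, app, workspace, and report IDs below are placeholders), the app acquires an Azure AD token with the client-credentials grant and then requests an embed token, with no master account involved:

```python
# Sketch only; all IDs and the secret are placeholders.
import requests

tenant_id = "<tenant-id>"
client_id = "<service-principal-app-id>"
client_secret = "<service-principal-secret>"

# Client-credentials grant against Azure AD for the Power BI resource.
token_resp = requests.post(
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "resource": "https://analysis.windows.net/powerbi/api",
    })
access_token = token_resp.json()["access_token"]

# Request an embed token for a report in a workspace the principal can access.
group_id, report_id = "<workspace-id>", "<report-id>"
embed = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/groups/{group_id}"
    f"/reports/{report_id}/GenerateToken",
    headers={"Authorization": f"Bearer {access_token}"},
    json={"accessLevel": "View"})
print(embed.json()["token"])  # hand this to the embedding web app
```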
Managing cost in development subscriptions
The cloud allows development and test teams to be very agile, as they can spin up the resources they need in a matter of minutes, whether for quick prototyping, learning, or scalability tests. That agility can come with headaches if costs are left to spiral out of control. We will investigate what reactive and proactive controls can […]
A New home!
My blog has a new home in WordPress and is now officially known as Cloud Architected! I've migrated relevant past articles to WordPress; please let me know if you notice any glitches with the content.
Choosing a Big Data Environment on Azure
I often get asked which Big Data computing environment should be chosen on Azure. The answer is heavily dependent on the workload, the legacy system (if any), and the skill set of the development and operation teams. Here is a (necessarily heavily simplified) overview of the main options and decision criteria I usually apply. Hadoop […]