
Monitoring and Logging in Azure Databricks with Azure Log Analytics and Grafana

Connecting Azure Databricks with Log Analytics allows monitoring and tracing each layer within Spark workloads, including the performance and resource usage on the host and JVM, as well as Spark metrics and application-level logging.

You can test this integration end-to-end by following the accompanying tutorial on Monitoring Azure Databricks with Azure Log Analytics and Grafana, which automatically deploys a Log Analytics workspace and a Grafana container, configures Databricks, and runs some sample workloads.

Configuration

The Azure Architecture Center hosts a guide on monitoring Azure Databricks that explains the concepts used in this article.

To provide full data collection, we combine the Spark monitoring library with a custom log4j.properties configuration. The build of the monitoring library for Spark 2.4 and its installation in Databricks are automated through the scripts referenced in the tutorial and available at https://github.com/algattik/databricks-monitoring-tutorial/.

Collecting and querying data

Spark metrics

Spark metrics are automatically collected into the SparkMetric_CL Log Analytics custom log. The Log Analytics workspace automatically deployed as part of the tutorial is already configured with dozens of predefined queries for the most common query patterns.
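As an illustration, the following Python sketch builds a Kusto query for executor CPU time over the SparkMetric_CL log. It is not one of the predefined queries from the tutorial: the helper name executor_cpu_time_query and the column names name_s and count_d are assumptions (Log Analytics appends type suffixes such as _s and _d to custom log fields), so check the actual schema in your workspace before relying on it.

```python
import re

# Simple Kusto timespan literals such as 30m, 1h or 7d.
_TIMESPAN = re.compile(r"\d+[dhms]")


def executor_cpu_time_query(lookback="1h", bin_size="1m"):
    """Build a Kusto query charting executor CPU time from SparkMetric_CL.

    The column names (name_s, count_d) are assumed, not taken from the article.
    """
    # Reject anything that is not a plain timespan, so the query stays well-formed.
    for value in (lookback, bin_size):
        if not _TIMESPAN.fullmatch(value):
            raise ValueError(f"not a simple timespan literal: {value!r}")
    return "\n".join([
        "SparkMetric_CL",
        f"| where TimeGenerated > ago({lookback})",
        '| where name_s contains "executor.cpuTime"',
        f"| summarize cpuTime = max(count_d) by bin(TimeGenerated, {bin_size}), name_s",
        "| order by TimeGenerated asc",
    ])


print(executor_cpu_time_query(lookback="6h", bin_size="5m"))
```

The resulting string can be pasted into the Log Analytics query editor or used as the query of a Grafana panel.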

Figure: Executor CPU Time per job over time
Figure: Used Heap Memory by job over time
Figure: Advanced metrics in Grafana

Structured Streaming metrics

Streaming job metrics are automatically collected into the SparkListenerEvent_CL Log Analytics custom log. Predefined queries are available here as well.

Figure: Streaming job latency over time
Figure: Deep-diving into latency by job stage

Spark logs

Spark logs are available in the Databricks UI and can be delivered to a storage account. Log Analytics is a more convenient log store, however, because it indexes the logs at scale and lets you search them with the Kusto query language. Spark logs are automatically collected into the SparkLoggingEvent_CL Log Analytics custom log.

Figure: Spark logs

Application logs

You can extend the org.apache.spark.internal.Logging class to log application messages.
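The Logging class mentioned above is a Scala trait. For PySpark notebooks, a commonly used alternative is to obtain a log4j logger through the driver's JVM gateway, and the sketch below takes that route. It assumes Spark 2.4 (log4j 1.x), relies on the private _jvm attribute of the SparkContext, and uses hypothetical helper and logger names. Whether the messages end up in Log Analytics depends on the log4j.properties configuration in place.

```python
def get_app_logger(spark, name="com.example.myapp"):
    """Return a JVM log4j logger for driver-side application messages.

    `spark` is an active SparkSession; `name` is a hypothetical logger name.
    Uses the private `_jvm` gateway and assumes log4j 1.x (Spark 2.4).
    """
    log4j = spark.sparkContext._jvm.org.apache.log4j
    return log4j.LogManager.getLogger(name)


def log_row_count(spark, df, logger_name="com.example.myapp"):
    """Log the row count of a DataFrame at INFO level and return it."""
    logger = get_app_logger(spark, logger_name)
    count = df.count()
    logger.info("Processed %d rows" % count)
    return count
```

In a Databricks notebook, where spark is predefined, calling log_row_count(spark, spark.range(10)) would emit one INFO message from the driver. The gateway is only available on the driver, so this approach does not cover code running inside executors.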

Figure: Custom application logs

Counters and Gauges

You can create your own Spark metrics, such as counters and gauges.

Figure: Custom counter value over time

Using Grafana

You can deploy Grafana on top of the Log Analytics data to build rich interactive dashboards.

Conclusion

Consider instrumenting your workloads with data collection from the start, so that you can understand and optimize the resource usage of your jobs while they are still in development. In production, a metrics baseline over time makes it much easier to analyze and correct a drop in performance or a job failure.

The small cost of Log Analytics can be quickly offset, since you will be able to optimize the size and number of VMs and fix issues faster.

Based on work by Adam Paternostro and Carlos Farre.

Alexandre Gattiker
Data & AI Architect at Microsoft, open source fan
https://cloudarchitected.com
