5. Compute in Azure Databricks
5.1 The compute browser
Azure Databricks compute refers to the selection of computing resources available in the Azure Databricks workspace. One can create a new compute or connect to an existing one.
The Compute icon is found on the sidebar and clicking on it will bring up the Compute interface.
This interface has the following tabs:
- All-purpose compute
- Job compute
- SQL warehouses
- Vector search
- Pools
- Policies
You can view the compute you have access to under the All-purpose compute tab.
- All-purpose compute: Provisioned compute used to analyze data in notebooks. You can create, terminate, and restart this compute using the UI, CLI, or REST API.
- Job compute: Provisioned compute used to run automated jobs. The Azure Databricks job scheduler automatically creates a job compute whenever a job is configured to run on new compute. The compute terminates when the job is complete. You cannot restart a job compute.
- SQL warehouses: On-demand elastic compute used to run SQL commands on data objects in the SQL editor or interactive notebooks. You can create SQL warehouses using the UI, CLI, or REST API. The serverless compute for running NS SQL queries is `ns-ii-tech-dev-sql`.
- Vector search: Vector Search is a serverless vector database seamlessly integrated into the Data Intelligence Platform.
- Pools: Databricks pools are a set of idle, ready-to-use instances. When cluster nodes are created using the idle instances, cluster start and auto-scaling times are reduced. If the pool has no idle instances, the pool expands by allocating a new instance from the instance provider in order to accommodate the cluster's request.
- Policies: A policy is a tool workspace admins can use to limit a user or group's compute creation permissions based on a set of policy rules (a sample policy definition is sketched after this list). Using policies one can:
  - Limit users to creating clusters with prescribed settings.
  - Limit users to creating a certain number of clusters.
  - Simplify the user interface and enable more users to create their own clusters (by fixing and hiding some values).
  - Control cost by limiting per cluster maximum cost (by setting limits on attributes whose values contribute to hourly price).
  - Enforce cluster-scoped library installations.
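To make the policy concept concrete, here is a minimal sketch of what a policy definition could look like, created with the Databricks SDK for Python. The policy name, runtime version, node types, and limits below are illustrative assumptions, not the actual NS policy, and creating policies requires workspace admin rights.

```python
import json
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # authenticates via your Databricks config/environment

# Illustrative policy: fixes the runtime and caps auto-termination and cluster size.
# Attribute names follow the Databricks cluster-policy definition format;
# the concrete values here are assumptions, not the NS policy.
policy_definition = {
    "spark_version": {"type": "fixed", "value": "15.4.x-cpu-ml-scala2.12", "hidden": True},
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 60},
    "node_type_id": {"type": "allowlist", "values": ["Standard_DS3_v2", "Standard_DS4_v2"]},
    "autoscale.max_workers": {"type": "range", "maxValue": 4},
}

policy = w.cluster_policies.create(
    name="example-restricted-compute",     # hypothetical policy name
    definition=json.dumps(policy_definition),
)
print(policy.policy_id)
```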
Most of the time, when working with notebooks, you will be using your own All-purpose compute, and when running SQL queries, the `ns-ii-tech-dev-sql` SQL warehouse will be used by default. At NS, the convention is to name your created or cloned All-purpose compute cluster `<your-name>'s cluster`.
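As a sketch of how that default SQL warehouse can also be used programmatically, the snippet below runs a statement through the Statement Execution API of the Databricks SDK for Python. The warehouse name comes from the text above; the query itself is only a placeholder.

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Look up the ns-ii-tech-dev-sql warehouse mentioned above by name.
warehouse = next(wh for wh in w.warehouses.list() if wh.name == "ns-ii-tech-dev-sql")

# Run a simple placeholder query; long-running queries may still be PENDING
# when this call returns and would need to be polled for completion.
result = w.statement_execution.execute_statement(
    warehouse_id=warehouse.id,
    statement="SELECT current_catalog(), current_schema()",
)
print(result.status.state)
print(result.result.data_array if result.result else None)
```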
5.2 Practical
In this practical, we shall create a compute. There are two ways to create an **All-purpose compute**: one is to create it from scratch, and the other is by cloning. For the purpose of this tutorial, we shall go with the former.
First click on the Create compute button.
You will see a new interface like the one below.
Ensure the following:
- Name the All-purpose compute using this convention: `<your-name>'s cluster`.
- `Policy` is set to Personal compute.
- Under `Performance > Databricks runtime version`, select the ML type and the latest LTS release with Scala provided.
- The checkbox for `Terminate after...` should be checked and set to 60 minutes.
Once done, click on Create compute. After a while, your all-purpose compute should be created and appear under the **All-purpose compute** tab.
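If you prefer to script this instead of clicking through the UI, the sketch below creates an equivalent cluster with the Databricks SDK for Python. The node type and worker count are assumptions for illustration, not NS-mandated values, and the runtime-selection helper may differ slightly between SDK versions.

```python
from datetime import timedelta
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Pick the latest LTS ML runtime, mirroring the UI choice above.
# select_spark_version is an SDK helper; adjust if your SDK version differs.
runtime = w.clusters.select_spark_version(latest=True, long_term_support=True, ml=True)

cluster = w.clusters.create(
    cluster_name="<your-name>'s cluster",   # NS naming convention from above
    spark_version=runtime,
    node_type_id="Standard_DS3_v2",         # assumption: a small general-purpose node type
    num_workers=1,                          # assumption: one worker for personal use
    autotermination_minutes=60,             # terminate after 60 minutes of inactivity
).result(timeout=timedelta(minutes=40))     # waits for the cluster to reach RUNNING
print(cluster.cluster_id, cluster.state)
```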
Suppose you want to run the notebook `gx_deploy_yml` under the workspace path `/Workspace/Users/<user-email>/.ide/dab-47fc1c58/development/gx_deploy_yml`. The `gx_deploy_yml` notebook is mostly used to load your validation gx yml files. To run this notebook, you will first have to start your cluster under the All-purpose compute tab. Instantiating the cluster takes about 20 minutes; the launch time depends on the configuration.
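A minimal sketch of starting that cluster from code rather than the UI, again with the Databricks SDK for Python; the cluster name follows the NS convention above, and the long timeout reflects the launch time just mentioned.

```python
from datetime import timedelta
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Find your all-purpose cluster by its (conventional) name.
my_cluster = next(c for c in w.clusters.list() if c.cluster_name == "<your-name>'s cluster")

# Start it and wait until it is running; launch can take on the order of 20 minutes.
w.clusters.start(cluster_id=my_cluster.cluster_id).result(timeout=timedelta(minutes=40))
print(w.clusters.get(cluster_id=my_cluster.cluster_id).state)
```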
Thereafter, in the `gx_deploy_yml` notebook, your cluster name will be highlighted in green. You can now run the notebook.
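Alternatively, instead of running it interactively, the notebook could be submitted as a one-time run on that cluster. Below is a sketch using the Databricks SDK jobs API; the run name is arbitrary, `<your-cluster-id>` and `<user-email>` are placeholders you would fill in yourself.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Workspace path from the section above; replace <user-email> with your own user email.
notebook_path = "/Workspace/Users/<user-email>/.ide/dab-47fc1c58/development/gx_deploy_yml"

run = w.jobs.submit(
    run_name="gx_deploy_yml one-time run",
    tasks=[
        jobs.SubmitTask(
            task_key="gx_deploy_yml",
            existing_cluster_id="<your-cluster-id>",  # the all-purpose cluster started above
            notebook_task=jobs.NotebookTask(notebook_path=notebook_path),
        )
    ],
).result()  # waits for the run to finish
print(run.state.result_state)
```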
For the full context of this notebook and what it does, see here.