4. Workflows
4.1 What is a workflow?
The Workflows browser houses the Jobs, Job runs, and Delta Live Tables tabs, and provides a set of tools for orchestrating and scheduling your data processing tasks on Azure Databricks.
- Jobs - A Databricks job allows you to configure tasks to run in a specified compute environment on a specified schedule. Jobs can vary in complexity from a single task running a Databricks notebook to thousands of tasks running with conditional logic and dependencies (a minimal job-definition sketch follows this list).
- Tasks - A task represents a unit of logic in a job. Tasks can range in complexity and include the following:
- A notebook
- A JAR
- A SQL query
- A DLT pipeline
- Another job
- Control flow tasks
You can control the execution order of tasks by specifying dependencies between them. You can configure tasks to run in sequence or parallel.
- Job runs - This tab in the Workflows browser shows recently run jobs. For example, in the screenshot below there are two failed runs and one job still in progress at the time of writing.
- Delta Live Tables - Databricks Delta Live Tables (DLT) enables data engineers to define live data pipelines as a series of Apache Spark tasks. With Delta Live Tables you can schedule and monitor jobs, manage clusters, handle errors, and enforce data quality standards. A Delta Live Table originates from a SQL notebook that is connected to other notebooks or data via a pipeline. You can create a pipeline from the Delta Live Tables interface using the Create Pipeline button (a minimal table-definition sketch also follows this list).
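To make the Jobs and Tasks concepts concrete, below is a minimal sketch using the Databricks SDK for Python (databricks-sdk). It is illustrative only: the job name, notebook paths, cluster ID, and cron expression are placeholders, not values from this project.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import (
    CronSchedule,
    NotebookTask,
    Task,
    TaskDependency,
)

w = WorkspaceClient()  # reads credentials from env vars or ~/.databrickscfg

job = w.jobs.create(
    name="demo-two-task-job",  # placeholder name
    schedule=CronSchedule(
        quartz_cron_expression="0 0 6 * * ?",  # every day at 06:00
        timezone_id="UTC",
    ),
    tasks=[
        # Task 1: run a notebook on an existing all-purpose cluster.
        Task(
            task_key="land_data",
            existing_cluster_id="<cluster-id>",
            notebook_task=NotebookTask(notebook_path="/Workspace/demo/land_data"),
        ),
        # Task 2: starts only after task 1 succeeds, so the two run in sequence.
        Task(
            task_key="move_to_bronze",
            existing_cluster_id="<cluster-id>",
            notebook_task=NotebookTask(notebook_path="/Workspace/demo/move_to_bronze"),
            depends_on=[TaskDependency(task_key="land_data")],
        ),
    ],
)
print(f"Created job {job.job_id}")
```

Likewise, although the pipelines described here originate from SQL notebooks, Delta Live Tables also supports Python. A hypothetical table definition, assuming a made-up landing path:

```python
import dlt  # only available inside a Delta Live Tables pipeline

@dlt.table(comment="Raw records ingested as-is into the bronze layer.")
def bronze_raw():
    # The source path is a placeholder; DLT materialises the returned
    # DataFrame as a managed table in the pipeline's target schema.
    return spark.read.format("json").load("/path/to/landing/")
```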
4.2 Practical
This practical assumes you have already run the gx_deploy_yml notebook, which loads data from AGOL to Azure Data Lake Storage Gen2. For the full context, see here.
Furthermore, the survey form you want to experiment with when running a Databricks job should be searchable from the Jobs search bar.
Go to Workflows in your Azure Databricks workspace. Workflows is the tab where your data processing, machine learning, and analytics pipelines are orchestrated within the Databricks platform.
Within the Workflows interface, select Jobs. Search for the form abbreviation of interest, then select the Run button at the end of the row, next to the ellipsis.
Running jobs can take a while.
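If you prefer to trigger the same run from code rather than the UI, here is a minimal sketch with the Databricks SDK for Python (the job ID is a placeholder you would copy from the job's page):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Equivalent to pressing Run: run_now() returns a waiter, and .result()
# polls the run until it reaches a terminal state.
run = w.jobs.run_now(job_id=123456789).result()
print(run.state.result_state)  # e.g. RunResultState.SUCCESS or RunResultState.FAILED
```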
If you click on the job name, two tabs will appear: Runs and Tasks.
Under the Runs tab, the success or failure of your job run(s) is displayed.
The Tasks tab displays the tasks that make up your job. Think of these tasks as the runnable units of your job. Click on any task and you will receive some metadata about the task.
Go back to your Runs tab and select a particular Job Run under the Start Time column.
Two tabs will appear at the top: Graph and Timeline.
The Graph tab shows the status of each task, i.e. whether it succeeded or failed.
The Timeline tab shows how long each task took to run and whether it was successful: red indicates failure and green indicates success.
Using the above image as an example, we can see that our job didn’t succeed because the first task ran into an error. Clicking on this task will provide more information on the error.
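The same error details can be pulled programmatically. A hedged sketch, assuming the run has already finished (the run ID is a placeholder taken from the Runs tab or the run's URL):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import RunResultState

w = WorkspaceClient()

# Fetch the job run shown in the Graph view and report each task's outcome.
run = w.jobs.get_run(run_id=987654321)
for task in run.tasks:
    print(task.task_key, task.state.result_state)
    if task.state.result_state != RunResultState.SUCCESS:
        # Each task has its own run ID; its output carries the error message.
        output = w.jobs.get_run_output(run_id=task.run_id)
        print(f"  error: {output.error}")
```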
It is highly recommended that you explore the Workflows interface and learn what each tool does, as this will be very helpful during debugging.
The job run is an automated pipeline that sequentially executes two tasks: 1) landing Survey123 data into Azure, and 2) moving the landed Survey123 data to the Bronze stage, all without you manually running any notebook.
If your job run succeeds, your Survey123 data will appear under the bronze folder in the ns-ii-tech-dev-catalog. You can see all of your data in the bronze stage by going to Catalog > ns-ii-tech-dev-catalog > Bronze. The data is broken down into the sublayers and subtables that make up your form, as seen in ArcGIS Online.
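You can also verify the landed data from a notebook rather than the Catalog UI. A minimal sketch (spark and display are notebook built-ins; the sublayer table name is a placeholder, and the backticks are needed because the catalog name contains hyphens):

```python
# List everything that landed in the bronze schema of the catalog.
display(spark.sql("SHOW TABLES IN `ns-ii-tech-dev-catalog`.bronze"))

# Spot-check one sublayer table; <your_sublayer> is a placeholder.
df = spark.table("`ns-ii-tech-dev-catalog`.bronze.<your_sublayer>")
df.show(5)
```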