1. Introduction

1.1. What is Databricks

Databricks is a single, cloud-based platform that can handle all of your data needs, which means it’s also a single platform on which your entire data team can collaborate. Not only does it unify and simplify your data systems, Databricks is fast, cost-effective and inherently scales to very large data. Databricks is available on top of your existing cloud, whether its Amazon Web Services (AWS), Microsoft Azure, Google Cloud or a combination of this. At Natural State (NS) we use Databricks on top of the Microsoft Azure Storage platform.

1.2. What is databricks used for?

Databricks is a platform where we can do one or more of the following:

store our data
handle both batched data and real-time data streams
transform and organise data
perform calculations on data
query data
analyse data
use the data for machine learning and AI
generate reports

At NS, we use Databricks for a whole variety of tasks. We use their notebooks to write code, join the notebooks to form tasks that can be co-joined to form pipelines, to write queries and to a (very) less extent, create visualizations.

When we are working with data using Databricks here at NS, Databricks in essence is reading data from storage and writing data to that storage, in this case the Azure Data Lake Storage Gen2 Storage.

1.3 Concept of data lake, data warehouse and data lakehouse

Data lake

A data lake is a single, centralized repository where you can store all your data, both structured and unstructured. A data lake enables your organization to quickly and more easily store, access and analyze a wide variety of data in a single location.

Data warehouse

A data warehouse is like a well-organized storage room. Data warehouses store structured data from a variety of different sources. Data is stored in a relational structure, meaning that data inside the warehouse is neatly organized into rows, columns and tables.

Data undergoes a process called data ingestion before getting stored in a data warehouse. Data ingestion involves collection, processing and preparing data for storage. Here’s how it works:

Extract data from various sources
Transform the data by cleaning, processing and converting it into the desired format.
Load the newly transformed data into your data warehouse.

Data Lakehouse

A data lakehouse is a data management architecture that combines key capabilities of data lakes and data warehouses. It brings the benefits of a data lake, such as low storage cost and broad data access, and the benefits of a data warehouse, such as data structures and management features.

Delta Lake Architecture

Delta lake architecutre is an advanced and reliable data storage and processing framework built on top of a data lake. It extends the capabilities of traditional data lakes by providing ACID (Atomicity, Consistency, Isolation, Durability) transactional properties, schema enforcement, and data versioning.

In delta lake, data is organized into a set of Parquet files, which are stored in a distributed file system. It maintains metadata about these files, enabling defficient data management and query optimization. Delta lake also offers features like time travel, which allows users to access and revert to previous versions of data, and schema evolution, which enables schema updates without interrupting existing pipelines. This architecture enhances data reliability, data quality, and data governance, making it easier for organizations to maintain data integrity and consistency throughout the data lifecycle.

Delta lake is an extension of the data lakehouse.

Delta Lake Architecture

1.4 The Azure Databricks Workspace UI

The Azure Databricks workspace UI is where an individual can access all their databricks azure objects. To get to the databricks workspace UI, you to to Azure Microsoft.

Once you sign in, you will be brought to an interface that looks like this:

Azure sign-in

Click the ns-ii-tech-dev-db-workspace.

The NS tech dev workspace

Click on Launch Workspace.

Launch workspace

The above Azure Databricks workspace UI consists the following features:

Sidebar

This contains the following tabs:

Workspace
Recents
Data
Workflows
Compute

The SQL section contains the following tabs:

SQL Editor
Queries
Dashboards
Alerts
Query History
SQL Warehouses

Data Engineering

Contains:

delta live tables

Machine Learning

This section contains the following:

Experiments
Feature Store
Models
Serving

NB: Currently at Natural State (NS), we have to workspaces, the ns-ii-tech-dev-workspace and the ns-ii-tech-stg-workspace. These refer to the dev and staging environments respectively. The difference between the former and the latter is that the latter mirrors the production environment more closely. For data ingestion purposes, you will mostly be working at the ns-ii-tech-dev-workspace.