Introduction
What is Data Build Tool (dbt)?
Data Build Tool (dbt) is a command line tool that enables users transform data in their data warehouses. Users can transform data using select statements which are turned into tables and views. A data warehouse is a central repository of information that can be analyzed to make more informed decisions.
dbt does the Transformation (T) in Extract, Load, Transform (ELT) processes, it does not extract or load data, but is designed to be performant at transforming data already inside a warehouse.
dbt uses YAML files to declare properties. seed
is a type of reference table used in dbt for static or infrequently changed data, for example country codes or lookup tables which are CSV based and typically stored in a seeds folder.
Installing dbt
Your supervisor will provide you with the project folder containing the dbt
code, or where dbt operations will be performed.
Go to the path where dbt operations will be conducted, for example:
cd ~/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt
Now create a virtual environment from which you will be conducting your dbt operations.
python3 -m venv nip-dbt-venv
Proceeding onwards, activate the virtual environment.
source nip-dbt-venv/bin/activate
The namespace on your terminal should have (nip-dbt-venv
) appended to it.
Install the dbt-databricks package
To run dbt
from VS Code (and also using the bash terminal since we have WSL2 installed), run the below code to install dbt-databricks.
pip install dbt-databricks
The purpose of the above line is to integrate dbt with Databricks.
Install poetry
The poetry
package helps in installing the required dependencies for your python projects.
pip install poetry
Running the above code may take some time, therefore allow it to run to completion.
The src
folder
The src
is is the folder in which we will be running most of our dbt
code.
cd ~/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/src
Inside the src
folder, run this command which will install any other required packages listed in the packages.yml
file.
dbt deps
Exporting the variables
When we were installing dbt-databricks
we mentioned it would integrate dbt
with Azure Databricks. To do so, we need to export our secrets. Your supervisor will provide the secrets.
These secrets are stored within an environment variables file outside the /dbt
folder but within the scope of the NIP-Lakehouse-Data/
folder.
source /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/vars/.env.example
All dbt commands will mostly be run from within the src/
folder.
dbt commands
When running some dbt commands, you may notice they have -t dev
or -t stg
appended at the end. This refers to the environment dbt is connected where dev
and stg
are the development and staging environments respectively.
-
dbt run
- executest the compiled model files -
dbt --full-refresh
- executes all the models and forces thier creation when they are already created except for the models configured with full_refresh = false -
dbt run -s <model_path>
- executes the models within the path specified by-s
-
dbt test
- runs tests defined on models, sources, snapshots, and seeds. It expects that you have already created those resources through the appropriate commands. -
dbt clean
- delete the directories declared with clean-targets in dbt_project.yml -
dbt docs generate
- generate the project documentation -
dbt docs serve
- starts a webserver on port 8000 to serve your documentation locally. Make sure to point to the project profiles.yml.
Project Structure
When you open the dbt folder it consists several sub-folders and it can be confusing to know what stands for what.
├── analysis <- A way to organize analytical SQL queries in your project such as the general ledger
├── docs <- md files that are referenced in schema.Yml and used
| when generating the documentation.
│ ├── data_category
├── logs
├── macros <- contains the macros functions in sql files that are then
| used through the use of jinja templates.
├── models <- sql files containing the transfromations. These
| transformations are always in the form of SELECT statements
| resulting in the creation of a table, a view, an ephemeral
| table, ...etc. They form a sequence of execution steps by
| referencing other models.
│
│ ├── bronze
│ ├── data_category1
│ ├── bronze_table_name.sql <- ephemarals tables to be referenced
│ ├── schema_table_name.sql <- table's information
│ ├── schema_bronze.sql <- bronze tables' information
│ ├── data_category2
│ ...
│ ├── silver
│ ├── data_category1
│ ...
│
├── tests <- contains sql test queries that can be defined inside the schema.yml.
├── dbt_project.yml <- contains project configuration (default) including .sql models,
| macros, test and .yml files. If the modles have their own
| configuration, there is no need to list them in this file.
├── profiles.yml <- contains the connection details to the lakehouse, where
| several different connections can be defined for different
| targets/lakehouses.
|
└── README.md
Creating the yml
files
In dbt, YAML files (.yml) are at the core of transforming your data models. Models are the sql queries that define how your data will be transformed.
For example, taking the dvc
folder under ~/dbt/src/models/bronze/pitfall-trap
, we can create the yml
file for this form by creating first the yaml file named as: schema_pitfall_trap_sublayer_survey.yml
.
Create the structure for this yml file as below.
Let’s define it step by step.
The name
key under the sources
key defines the source of the tables. They are from the bronze
folder.
description
- in here you can provide a brief description of your data source.
tables
- this defines the data structure in our table.
name: "'pitfall_trap_sublayer_survey_' + env_var('pitfall_trap_sublayer_survey_schema_date')"
- the first bracket refers to the table name and it is followed by a _
suffix – 'pitfall_trap_sublayer_survey_'
. This table name is appended to the date variable pitfall_trap_sublayer_survey_schema_date
which must be automatically retrieved from somewhere.
description: doc(“bronze_pitfall_trap_sublayer_survey”) - this refers to the variable that stores a description of this table. In dbt, one can use jinja templating language to conduct SQL queries and even documentation. In this case, we place the file containing the description for this table, bronze_pitfall_trap_sublayer_survey.md
found in the following path: ~/src/docs/pitfall-trap/pitfall_trap_ref.md
as the description. Inside the pitfall_trap_ref.md
the description for our table is placed within a jinja template like so:
We reference the above jinja block from within our yaml file using the doc(“<name-of-descriptions-from-ref-file”>) function. Don’t forget the single quotation marks.
The columns
key contains the following name
, datatype
and description
keys. Still on the same pitfall_trap_sublayer_survey.yml
, we have the following:
NB: All documentation for our dbt work is found within the dbt/src/docs
folder.
-
name
- refers to a column name -
data_type
- refers to the type of data found within this column -
description
- this is the description you would place for this column. As mentioned earlier, we have used a jinja template doc(“objectid”) to reference the doc block containing the description for theobjectid
column. The doc block containing the description for this column is found within thereferences.md
file in this path~/dbt/src/docs/references.md
. The doc block forobjectid
column in this references file looks like below:
This pattern repeats itself across all columns until they are completed. There can also be an additional tests
key which creates assertions for your data. If your column doesn’t pass the tests defines, the dbt run fails. Tests can be any of the following:
-
unique: the values in the column should be unique
-
not_null: the values in the column should not contain null values
-
accepted_values: the status column should contain some values that you define. These values can be anything but so long as they are found within the column.
-
relationships: the value should match another in a different column or table.
The last block should be ignored. It was initially intended to automatically reference the ever-changing team member’s names.
models:
- name: 'bronze_pitfall_trap_sublayer_survey'
columns:
# - name: recorder
# tests:
# - relationships:
# to: ref("bronze_dev_ns_team_list")
# field: member_name
# - name: recorder_deploy # I added deploy to this one to match with recorder_deploy in S123
# tests:
# - relationships:
# to: ref("bronze_dev_ns_team_list")
# field: member_name
- name: recorder_retrieve # I added deploy to this one to match with recorder_retrieve in S123
tests:
- relationships:
to: ref("bronze_dev_ns_team_list")
field: member_name
Creating the sql
files
A sql files in dbt is a model. The sql file contains a single sql statement. The model name is inherited from the file name.
Depending on the sql statement, the output can be a table or a view.
Here is the sql for pitfall trap sublayer survey called bronze_pitfall_trap_sublayer_survey.sql
. The bronze
prefix refers to the source of the table, in this case the bronze
folder. The sql models are found in the same folder as the yaml files.
In a dbt sql model, the config()
sets the configurations of your model. Think of configurations as the settings that can be set in your model and how you would like them to appear in your data warehouse.
Let’s go line by line.
The config()
encapsulates the settings that define how your data will appear in the data warehouse.
materialized
- defines how you would like your data to persist in the data warehouse. In our case, the data will persist as a view.
alias="pitfall_trap_sublayer_survey"
- this overrides the name of the model, which is essentially the filename. One can use the alias to provide a different label of how the file will appear in the data warehouse.
schema="bronze"
- this is the target environment in which your models will be built. Definitely in this case it will be the bronze
folder in the ns-ii-tech-dev-catalog
in Azure.
tags=['pitfall_trap']
- a tag in dbt is used to categorize a model, test and/or other dbt artifacts. Tags have the following purposes:
-
Organization: Tags can help in organizing and grouping similar models or tests together.
-
Execution Control: You can run specific sets of models or tests based on their tags. This can be especially useful when you have a large number of models and you want to run a subset of them. For example, to only run models that have the tag
pitfall
you can do so through this:dbt run --select tag:pitfall
. dbt will run the models having the tagpitfall
even if they are in their hundreds.
Let’s go to this line:
with table_v1 as (
select * from source('bronze', get_full_table_name('pitfall_trap_sublayer_survey'))
)
The with table_v1 as
is a Common Table Expression (CTE) that creates the expression name table_v1
. This table_v1
expression picks tables from the bronze
schema and the table name defined in get_full_table_name()
block of the select statement.
After the above, the select statement follows.
select
objectid,
globalid,
date_time_details,
project,
area,
survey,
deployment,
transport,
field_team_list,
activity,
date_time_deploy,
recorder_deploy,
date_time_retrieve,
recorder_retrieve,
pitfall_trap_survey_comment,
end_time,
creator,
editor,
x,
y,
source_file_path,
source_file_modification_time,
creationdate,
editdate
from
table_v1
The above select statement picks the columns defined in the get_full_table_name()
block, which are enveloped by the table_v1
statement. The selected columns are the ones that get displayed in the pitfall_trap_sublayer_survey
view in the Azure storage.
Now that we have demonstrated how to create dbt files using just one subtable of the pitfall trap survey form, the same process will be replicated for other subtables such as repeat_deploy
and repeat_retrieve
.
Your endgame of the dbt file creation should look something like this:
Running the dbt
model
The first thing you should do is run the dbt run -t dev
in your terminal.
It should run and generate some output. If there is an error, the process will terminate and display an error similar to this:
(nip-dbt-venv) sammigachuhi@gachuhi:~/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/src$ dbt run -t dev
06:11:52 Running with dbt=1.7.14
06:11:53 Registered adapter: databricks=1.7.14
06:11:54 Encountered an error:
Parsing Error
Env var required but not provided: 'xprize_sens_pkg_sublayer_survey_schema_date'
This means that an environment variable by the name of 'xprize_sens_pkg_sublayer_survey_schema_date'
is missing.
The environment variables files are found in the env_variables
folder. Within this folder are two sub-folders, dev
and stg
.
If an Env var
error appears such as in the above, the first solution would be to create a txt file in the following format schem_edit_date_<name_of_form>
. The above error is related to the xprize_sens_pkg
form abbreviation name. Therefore, our variable text file will be schem_edit_date_xprize_sens_pkg.txt
. Inside it, insert the following:
# xprize sens pkg
xprize_sens_pkg_sublayer_survey_schema_date="2024_06_04__10_05_00"
You may need to iteratively make these insertions for every error similar to the above.
One can also run dbt test
to check whether everything is okay. If the terminal cuts out with an error, then its time for some debugging.
To create documents we run the following:
dbt docs generate -t dev
The above code will generate some outputs. Let it run till it finishes. If it cuts out, then some debugging is required.
Otherwise, if successful, run the following code:
dbt docs serve --port 8080
A prompt will pop up asking you if you would wish to visit the site of localhost:8080
. You can agree. Alternatively, start a new incognito window and type localhost:8080/
on your browser.
The documentation website should show like below:
Once satisfied with your work, you can push the branch(es) you were working on to Github from where you can initiate a Pull Request to merge with say, the dev
branch.