Introduction

What is Data Build Tool (dbt)?

Data Build Tool (dbt) is a command line tool that enables users to transform data in their data warehouses. Users transform data by writing select statements, which dbt turns into tables and views. A data warehouse is a central repository of information that can be analyzed to make more informed decisions.
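For instance, a model can be nothing more than a single select statement; the table and column names below are made up purely for illustration:

select
  id,
  upper(name) as customer_name
from raw_customers

dbt wraps such a statement in the DDL needed to create the corresponding table or view in the warehouse.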

dbt handles the Transformation (T) in Extract, Load, Transform (ELT) processes. It does not extract or load data, but it is designed to be performant at transforming data that is already inside a warehouse.

dbt uses YAML files to declare properties. A seed is a type of reference table used in dbt for static or infrequently changing data, for example country codes or other lookup tables. Seeds are CSV based and are typically stored in a seeds folder.
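For instance, a hypothetical seed file seeds/country_codes.csv would simply contain comma separated values:

country_code,country_name
KE,Kenya
TZ,Tanzania

Running dbt seed loads such files into the warehouse as tables.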

Installing dbt

Your supervisor will provide you with the project folder containing the dbt code, or where dbt operations will be performed.

Go to the path where dbt operations will be conducted, for example:

cd ~/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt

Now create a virtual environment from which you will be conducting your dbt operations.

python3 -m venv nip-dbt-venv

Next, activate the virtual environment.

source nip-dbt-venv/bin/activate

Your terminal prompt should now show (nip-dbt-venv) in front of it.

Install the dbt-databricks package

To run dbt from VS Code (and also from the bash terminal, since we have WSL2 installed), run the command below to install dbt-databricks.

pip install dbt-databricks

The purpose of the above line is to integrate dbt with Databricks.
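To confirm the adapter was installed, you can print the dbt version, which also lists the registered plugins:

dbt --version

The output should list the databricks plugin alongside the core dbt version.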

Install poetry

The poetry package helps install the required dependencies for your Python projects.

pip install poetry

Running the above command may take some time; allow it to run to completion.
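If the project ships a pyproject.toml (check with your supervisor whether it does), the dependencies declared there can then be installed from the folder containing it with:

poetry install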

The src folder

The src folder is where we will be running most of our dbt code.

cd ~/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/src

Inside the src folder, run this command which will install any other required packages listed in the packages.yml file.

dbt deps
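For reference, packages.yml follows the standard dbt format. The package and version below are only an example; the project's actual file may list different packages:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1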

Exporting the variables

When installing dbt-databricks, we mentioned that it integrates dbt with Azure Databricks. For this connection to work, we need to export our secrets. Your supervisor will provide the secrets.

These secrets are stored within an environment variables file outside the /dbt folder but within the scope of the NIP-Lakehouse-Data/ folder.

source /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/vars/.env.example
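The file is essentially a list of export statements. The variable names and values below are placeholders for illustration only, not the project's real secrets:

# hypothetical excerpt of an .env file
export DATABRICKS_HOST="<workspace-url>"
export DATABRICKS_TOKEN="<personal-access-token>"
export pitfall_trap_sublayer_survey_schema_date="2024_06_04__10_05_00"

Sourcing the file makes these variables available to the env_var() calls used in the dbt project.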

All dbt commands will mostly be run from within the src/ folder.

dbt commands

When running some dbt commands, you may notice they have -t dev or -t stg appended at the end. This flag selects the target environment dbt connects to, where dev and stg are the development and staging environments respectively. Some combined examples follow the list below.

  • dbt run - executes the compiled model files

  • dbt run --full-refresh - executes all the models and forces them to be rebuilt even when they already exist, except for the models configured with full_refresh = false

  • dbt run -s <model_path> - executes the models within the path specified by -s

  • dbt test - runs tests defined on models, sources, snapshots, and seeds. It expects that you have already created those resources through the appropriate commands.

  • dbt clean - deletes the directories declared with clean-targets in dbt_project.yml

  • dbt docs generate - generates the project documentation

  • dbt docs serve - starts a webserver on port 8000 to serve your documentation locally. Make sure to point to the project profiles.yml.
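Putting targets and selectors together, typical invocations look like the following (the model path is just an example):

dbt run -t dev
dbt run -s models/bronze/pitfall-trap -t dev
dbt run --full-refresh -t stg
dbt test -t dev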

Project Structure

When you open the dbt folder, it consists of several sub-folders, and it can be confusing to know what each one is for.

├── analysis                            <- A way to organize analytical SQL queries in your project such as the general ledger
├── docs                                <- md files that are referenced in schema.yml and used
|                                          when generating the documentation.
│   ├── data_category
├── logs
├── macros                              <- contains the macro functions in sql files that are then
|                                          used through jinja templates.
├── models                              <- sql files containing the transformations. These
|                                          transformations are always in the form of SELECT statements
|                                          resulting in the creation of a table, a view, an ephemeral
|                                          table, ...etc. They form a sequence of execution steps by
|                                          referencing other models.
│
│   ├── bronze
│       ├── data_category1
│           ├── bronze_table_name.sql   <- ephemeral tables to be referenced
│           ├── schema_table_name.sql   <- table's information
│           ├── schema_bronze.sql       <- bronze tables' information
│       ├── data_category2
│       ...
│   ├── silver
│       ├── data_category1
│       ...
│
├── tests                               <- contains sql test queries that can be defined inside the schema.yml.
├── dbt_project.yml                     <- contains project configuration (default) including .sql models,
|                                          macros, tests and .yml files. If the models have their own
|                                          configuration, there is no need to list them in this file.
├── profiles.yml                        <- contains the connection details to the lakehouse, where
|                                          several different connections can be defined for different
|                                          targets/lakehouses.
|
└── README.md

Creating the yml files

In dbt, YAML files (.yml) declare the properties and configuration of your data models. Models are the sql queries that define how your data will be transformed.

For example, taking the dvc folder under ~/dbt/src/models/bronze/pitfall-trap, we can create the yml file for this form by first creating a yaml file named schema_pitfall_trap_sublayer_survey.yml.

Create the structure for this yml file as below.

Sources
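A minimal sketch of the sources block of schema_pitfall_trap_sublayer_survey.yml is shown below. The structure mirrors the keys explained next, while the exact contents of the project file may differ:

version: 2

sources:
  - name: bronze
    description: "Raw tables landed in the bronze layer"
    tables:
      - name: "{{ 'pitfall_trap_sublayer_survey_' + env_var('pitfall_trap_sublayer_survey_schema_date') }}"
        description: '{{ doc("bronze_pitfall_trap_sublayer_survey") }}'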

Let’s define it step by step.

Sources

The name key under the sources key defines the name of the source the tables come from. Here, the tables come from the bronze layer.

description - in here you can provide a brief description of your data source.

tables - this lists the tables that belong to this source, and each entry defines the table's structure.

name: "'pitfall_trap_sublayer_survey_' + env_var('pitfall_trap_sublayer_survey_schema_date')" - the first bracket refers to the table name and it is followed by a _ suffix – 'pitfall_trap_sublayer_survey_'. This table name is appended to the date variable pitfall_trap_sublayer_survey_schema_date which must be automatically retrieved from somewhere.

description: doc("bronze_pitfall_trap_sublayer_survey") - this refers to the doc block that stores a description of this table. In dbt, one can use the jinja templating language for SQL queries and even documentation. In this case, the doc block named bronze_pitfall_trap_sublayer_survey, found in the following path: ~/src/docs/pitfall-trap/pitfall_trap_ref.md, is used as the description. Inside pitfall_trap_ref.md the description for our table is placed within a jinja docs block like so:
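The block below is only a sketch; the description text is a placeholder, not the project's actual wording:

{% docs bronze_pitfall_trap_sublayer_survey %}
Survey records collected for the pitfall trap sublayer survey form.
{% enddocs %}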

References

We reference the above jinja block from within our yaml file using the doc("<name-of-doc-block-from-ref-file>") function. Don't forget the quotation marks.

Each entry under the columns key contains name, data_type and description keys. Still on the same schema_pitfall_trap_sublayer_survey.yml, we have the following:

Example Keys
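For illustration, a single column entry might look like the sketch below. The data type and description wording are assumptions; only the objectid name and its doc reference come from this guide:

columns:
  - name: objectid
    data_type: int
    description: '{{ doc("objectid") }}'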

NB: All documentation for our dbt work is found within the dbt/src/docs folder.

  • name - refers to a column name

  • data_type - refers to the type of data found within this column

  • description - this is the description you would place for this column. As mentioned earlier, we have used a jinja template doc("objectid") to reference the doc block containing the description for the objectid column. The doc block containing the description for this column is found within the references.md file at this path: ~/dbt/src/docs/references.md. The doc block for the objectid column in this references file looks like below:
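A sketch of that doc block (the description wording is a placeholder):

{% docs objectid %}
Unique identifier of the record.
{% enddocs %}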

This pattern repeats itself across all columns until they are completed. There can also be an additional tests key which creates assertions about your data. If a column doesn't pass the tests defined on it, the dbt test run fails. Tests can be any of the following (a short sketch follows the list):

  • unique: the values in the column should be unique

  • not_null: the values in the column should not contain null values

  • accepted_values: the column should only contain values from a list that you define.

  • relationships: every value in the column should also exist in a specified column of another model or table (a referential integrity check).
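As a sketch only (the column name and accepted values below are made up), tests are attached to a column like this:

columns:
  - name: status
    tests:
      - not_null
      - unique
      - accepted_values:
          values: ['deployed', 'retrieved']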

The block below should be ignored. It was initially intended to automatically reference the ever-changing team members' names.

models:
  - name: 'bronze_pitfall_trap_sublayer_survey'
    columns:
      # - name: recorder 
      #   tests:
      #     - relationships:
      #         to: ref("bronze_dev_ns_team_list")
      #         field: member_name
      # - name: recorder_deploy # I added deploy to this one to match with recorder_deploy in S123
      #   tests:
      #     - relationships:
      #         to: ref("bronze_dev_ns_team_list")
      #         field: member_name
      - name: recorder_retrieve # I added deploy to this one to match with recorder_retrieve in S123
        tests:
          - relationships:
              to: ref("bronze_dev_ns_team_list")
              field: member_name

Creating the sql files

A sql file in dbt is a model. The sql file contains a single sql statement, and the model name is inherited from the file name.

Depending on the sql statement, the output can be a table or a view.

Here is the sql for the pitfall trap sublayer survey, called bronze_pitfall_trap_sublayer_survey.sql. The bronze prefix refers to the source of the table, in this case the bronze folder. The sql models are found in the same folder as the yaml files.

In a dbt sql model, config() sets the configuration of your model. Think of the configuration as the settings that determine how your model will appear in the data warehouse; a sketch of the full config block appears after the tag notes below.

Let’s go line by line.

The config() encapsulates the settings that define how your data will appear in the data warehouse.

materialized - defines how you would like your data to persist in the data warehouse. In our case, the data will persist as a view.

Pitfall trap survey view

alias="pitfall_trap_sublayer_survey" - this overrides the name of the model, which is essentially the filename. One can use the alias to provide a different label of how the file will appear in the data warehouse.

schema="bronze" - this is the target environment in which your models will be built. Definitely in this case it will be the bronze folder in the ns-ii-tech-dev-catalog in Azure.

tags=['pitfall_trap'] - a tag in dbt is used to categorize models, tests and other dbt artifacts. Tags have the following purposes:

  • Organization: Tags can help in organizing and grouping similar models or tests together.

  • Execution Control: You can run specific sets of models or tests based on their tags. This can be especially useful when you have a large number of models and you want to run a subset of them. For example, to only run models that have the tag pitfall you can do so through this: dbt run --select tag:pitfall. dbt will run the models having the tag pitfall even if there are hundreds of them.
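Putting the settings above together, the top of bronze_pitfall_trap_sublayer_survey.sql would contain a config block along these lines (a sketch based on the values discussed, not a verbatim copy of the project file):

{{
  config(
    materialized="view",
    alias="pitfall_trap_sublayer_survey",
    schema="bronze",
    tags=["pitfall_trap"]
  )
}}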

Let’s go to this line:

with table_v1 as (
    select * from {{ source('bronze', get_full_table_name('pitfall_trap_sublayer_survey')) }}
)

The with table_v1 as is a Common Table Expression (CTE) that creates the expression name table_v1. This table_v1 expression selects everything from the bronze source, from the table whose full name is returned by the get_full_table_name() macro inside the select statement.
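get_full_table_name() is a macro defined in the project's macros folder. Its actual implementation is not shown here, but based on how the source table name is built in the yml file, a hypothetical version could look like this:

{% macro get_full_table_name(table_name) %}
    {# hypothetical: append the schema date from the matching environment variable #}
    {{ return(table_name ~ '_' ~ env_var(table_name ~ '_schema_date')) }}
{% endmacro %}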

After the above, the select statement follows.

select
  objectid,
  globalid,
  date_time_details,
  project,
  area,
  survey,
  deployment,
  transport,
  field_team_list,
  activity,
  date_time_deploy,
  recorder_deploy,
  date_time_retrieve,
  recorder_retrieve,
  pitfall_trap_survey_comment,
  end_time,
  creator,
  editor,
  x,
  y,
  source_file_path,
  source_file_modification_time,
  creationdate,
  editdate
from
  table_v1

The above select statement picks the listed columns from the table_v1 expression defined earlier. The selected columns are the ones that get displayed in the pitfall_trap_sublayer_survey view in the Azure storage.

Now that we have demonstrated how to create dbt files using just one subtable of the pitfall trap survey form, the same process will be replicated for other subtables such as repeat_deploy and repeat_retrieve.

The end result of the dbt file creation should look something like this:

DBT files

Running the dbt model

The first thing you should do is run dbt run -t dev in your terminal.

It should run and generate some output. If there is an error, the process will terminate and display an error similar to this:

(nip-dbt-venv) sammigachuhi@gachuhi:~/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/src$ dbt run -t dev
06:11:52  Running with dbt=1.7.14
06:11:53  Registered adapter: databricks=1.7.14
06:11:54  Encountered an error:
Parsing Error
  Env var required but not provided: 'xprize_sens_pkg_sublayer_survey_schema_date'

This means that an environment variable by the name of 'xprize_sens_pkg_sublayer_survey_schema_date' is missing.

The environment variables files are found in the env_variables folder. Within this folder are two sub-folders, dev and stg.

If an Env var error such as the above appears, the first solution is to create a txt file named in the format schem_edit_date_<name_of_form>. The above error relates to the xprize_sens_pkg form abbreviation name; therefore, our variable text file will be schem_edit_date_xprize_sens_pkg.txt. Inside it, insert the following:

# xprize sens pkg
xprize_sens_pkg_sublayer_survey_schema_date="2024_06_04__10_05_00"

You may need to iteratively make these insertions for every error similar to the above.

One can also run dbt test to check whether everything is okay. If the terminal cuts out with an error, then it's time for some debugging.

To create documents we run the following:

dbt docs generate -t dev

The above code will generate some outputs. Let it run till it finishes. If it cuts out, then some debugging is required.

Otherwise, if successful, run the following code:

dbt docs serve --port 8080

A prompt will pop up asking whether you wish to visit localhost:8080. You can agree. Alternatively, open a new incognito window and type localhost:8080/ in your browser.

The documentation website should show like below:

The Documentation Website

Once satisfied with your work, you can push the branch(es) you were working on to GitHub, from where you can initiate a Pull Request to merge with, say, the dev branch.