Data build tool
Data build tool, commonly known as dbt is an open source command line tool that helps analytical engineers transform the data in their data warehouses more effectively.
Components of dbt
The dbt_project.yml
file
This YAML file serves as the project-level configuration file. It contains the name of your project, the version and configurations for models and paths. This is where you change folder names or paths and configure how you want to materialize specific files. There are four materialization types: views, tables, incremental (for fact tables) and ephemeral (for common tables).
-
Incremental models allow dbt to insert or update records into a table since the last time the model was run.
-
Ephemeral models are not directly built into the database. Instead, when a model is configured as ephemeral, dbt will inline its SQL code directly into any model that references it. Because no physical data objects are created, ephemeral models save on storage.
The models
folder
The models folder is the place where you will define your dbt models. A model is just a SQL file that defines how to transform and build your data.
One can create subfolders within the models
folder to organize models based on different logical groupings. For example, our models are primarily arranged under either the bronze and silver folders. Within either of these folders, there exist other sub-folders such as gem and si subfolders which contain the models of the Survey123 forms within that category.
macros
folder
Macros in dbt are what you would call functions in other programming languages such as Python and JavaScript. Any SQL logic placed between the opening and closing tags will be defined as a macro.
For example in our dbt folder for NS, inside the macro folder, there are macros to check table row counts, generate schema names and the like. Macros can thereafter be used in your dbt models, as in this example here.
Snapshots
Snapshots offer a method for capturing and storing the state of your datasets at specific intervals.
dbt Snapshots are models that allow you to track and store changes to a dataset over time. By capturing the state of a dataset at defined intervals, snapshots enable you to query historical data.
Analyses
Any SQL file found in the analyses/
directory of a dbt project will be compiled, but not executed. This means that analysts can use dbt functionality like ref(...)
to select from models in an environment-agnostic way.
If you run dbt compile
the SQL models will be found in the target/
directory.
tests
dbt tests are used to ensure data quality and integrity throughout your data modeling process. They allow you to enforce specific conditions and constraints on your data models
, sources
, seeds
and snapshots
.
By running dbt test
, dbt check the quality and correctness of your data transformations based on the tests you defined.
There are three types of tests in dbt:
-
Build-in column-level tests
-
Package available tests
-
Custom tests
Read about them here. One can also define their tests in a yml file inside the tests/
directory.
Seeds
Within the seeds folder, you can put small data files, e.g. a CSV file, that will be added to the data warehouse. Each seed file typically corresponds to a specific table and contains the data rows to be inserted.
If you want to execute the seeds and populate your tables with initial data, you need to run the dbt seed
command.
packages.yml
This is where you declare the packages that your project depends on. For example, to install a specific version of codegen
we insert it in this file like so.
packages:
- package: dbt-labs/codegen
version: 0.12.1
logs
When you run dbt commands like dbt run
and dbt test
, dbt will create a log file to provide detailed information about the execution process and any errors or warnings encountered.
target
This folder contains all the SQL code compiled when you run the dbt compile
or dbt run
commands.