Introduction
In Data build tool (dbt) yaml files are used to set configurations and define data models. YAML stands for YAML ain’t another markup language and is defined by the .yml
or .yaml
extension.
Think of a yml file as a cooking recipe that describes the processes of how to cook your chicken in the oven. In the recipe are the guidelines on the temperature to set, the cooking duration and how to serve it. Same thing with yml files. The yml files are what will define how our descriptions and subtable views will appear in our data warehouse.
The yml
files
The dbt
structure
Most of our dbt work will be performed inside the src
folder. The folder looks like below:
Here is a description of each folder and file.
├── analysis <- A way to organize analytical SQL queries in your project such as the general ledger from your QuickBooks.
├── docs <- md files that are referenced in schema.Yml and used
| when generating the documentation.
│ ├── data_category
├── logs
├── macros <- contains the macros functions in sql files that are then
| used through the use of jinja templates.
├── models <- sql files containing the transfromations. These
| transformations are always in the form of SELECT statements
| resulting in the creation of a table, a view, an ephemeral
| table, ...etc. They form a sequence of execution steps by
| referencing other models.
│
│ ├── bronze
│ ├── data_category1
│ ├── bronze_table_name.sql <- ephemarals tables to be refrenced
│ ├── schema_table_name.sql <- table's information
│ ├── schema_bronze.sql <- bronze tables' information
│ ├── data_category2
│ ...
│ ├── silver
│ ├── data_category1
│ ...
│
├── tests <- contains sql test queries that can be defined inside the schema.yml.
├── dbt_project.yml <- contains project configuration (default) including .sql models,
| macros, test and .yml files. If the modles have their own
| configuration, there is no need to list them in this file.
├── profiles.yml <- contains the connection details to the lakehouse, where
| several different connections can be defined for different
| targets/lakehouses.
|
└── README.md
Create the yml
file
In dbt, YAML files (.yml) are at the core of transforming your data models. Models are the sql queries that define how your data will be transformed.
For example, taking the dvc
folder under ~/dbt/src/models/bronze/pitfall-trap
, we can create the yml
file for this form by creating first the yaml file named as: schema_pitfall_trap_sublayer_survey.yml
.
Create the structure for this yml file as below.
For a full overview of the definitions in a yml file see this.
Let’s define it step by step.
The name
key under the sources
key defines the source of the tables. They are from the bronze
folder.
description
- in here you can provide a brief description of your data source.
tables
- this defines the data structure in our table.
name: "'pitfall_trap_sublayer_survey_' + env_var('pitfall_trap_sublayer_survey_schema_date')"
- the first bracket refers to the table name and it is followed by a _
suffix – 'pitfall_trap_sublayer_survey_'
. This table name is appended to the date variable pitfall_trap_sublayer_survey_schema_date
which must be automatically retrieved from the text files (.txt
) found in the env-variables
folder.
description: doc(“bronze_pitfall_trap_sublayer_survey”) - this refers to the variable that stores a description of this table. In dbt, one can use jinja templating language to conduct SQL queries and even documentation. In this case, we place the file containing the description for this table, bronze_pitfall_trap_sublayer_survey.md
found in the following path: ~/src/docs/pitfall-trap/pitfall_trap_ref.md
as the description.
The columns
key contains the following name
, datatype
and description
keys. Still on the same pitfall_trap_sublayer_survey.yml
, we have the following:
NB: All documentation for our dbt work is found within the dbt/src/docs
folder.
-
name
- refers to a column name -
data_type
- refers to the type of data found within this column -
description
- this is the description you would place for this column. As mentioned earlier, we have used a jinja template doc(“objectid”) to reference the doc block containing the description for theobjectid
column. The doc block containing the description for this column is found within thereferences.md
file in this path~/dbt/src/docs/references.md
.
This pattern repeats itself across all columns until they are completed. There can also be an additional tests
key which creates assertions for your data. If your column doesn’t pass the tests defines, the dbt run fails. Tests can be any of the following:
-
unique: the values in the column should be unique
-
not_null: the values in the column should not contain null values
-
accepted_values: the status column should contain some values that you define. These values can be anything but so long as they are found within the column.
-
relationships: the value should match another in a different column or table.
The last block should be ignored. It was initially intended to automatically reference the ever-changing team member’s names.
models:
- name: 'bronze_pitfall_trap_sublayer_survey'
columns:
# - name: recorder
# tests:
# - relationships:
# to: ref("bronze_dev_ns_team_list")
# field: member_name
# - name: recorder_deploy # I added deploy to this one to match with recorder_deploy in S123
# tests:
# - relationships:
# to: ref("bronze_dev_ns_team_list")
# field: member_name
- name: recorder_retrieve # I added deploy to this one to match with recorder_retrieve in S123
tests:
- relationships:
to: ref("bronze_dev_ns_team_list")
field: member_name
Having created one yml
file for one subtable, that is the sublayer survey, and pitfall form has five subtables in total, in essence there should be five yml files for each subtable like so:
Layer name | subtable name | yaml name |
---|---|---|
survey | pitfall_trap_sublayer_survey | schema_pitfall_trap_sublayer_survey.yml |
grid_location_repeat_deploy | pitfall_trap_sublayer_grid_location_repeat_deploy | schema_pitfall_trap_sublayer_grid_location_repeat_deploy.yml |
grid_location_repeat_retrieve | pitfall_trap_sublayer_grid_location_repeat_retrieve | schema_pitfall_trap_sublayer_grid_location_repeat_retrieve.yml |
repeat_deploy | pitfall_trap_subtable_repeat_deploy | schema_pitfall_trap_subtable_repeat_deploy.yml |
repeat_retrieve | pitfall_trap_subtable_repeat_retrieve | schema_pitfall_trap_subtable_repeat_retrieve.yml |
As you can see above, for every sublayer or subtable in any Survey123 form there will be an associated schema .yml
with documentation references for each field in the subtable.
Using a function to autogenerate the yml
files
A function has already been created to automatically generate, to a great extent, the redundant work of setting up the yml function files. The function’s name is dbt_yml_func
and is available here.
To generate the yml
configuration template using the dbt_yml_func
function, you will need the path to the Survey123 csv file (s123_path
), the path to the Arcgis excel file (xls_form_path
) and the subtable name. Using the GEM Grass NPP form as an example, the functions’ parameters would like below:
dbt_yml_func(
s123_path="../GEM_Grass_NPP_11092024/cages_repeat_1.csv",
xls_form_path="../GEM_Grass_NPP_11092024/GEM Grass NPP.xlsx",
subtable_name="gem_grass_npp_subtable_cages_repeat"
)
Here is the generated output. Note that only the source_file_path
and the source_file_modification
dbt doc
functions have been inserted, for the rest you will have to insert the name of the docs description such as "objectid"
, "globalid"
and so on. Each field will have a corresponding description from references.md
. At least the bulk of the work has already been done for you!