Introduction

In Data build tool (dbt) yaml files are used to set configurations and define data models. YAML stands for YAML ain’t another markup language and is defined by the .yml or .yaml extension.

Think of a yml file as a cooking recipe that describes the processes of how to cook your chicken in the oven. In the recipe are the guidelines on the temperature to set, the cooking duration and how to serve it. Same thing with yml files. The yml files are what will define how our descriptions and subtable views will appear in our data warehouse.

The yml files

The dbt structure

Most of our dbt work will be performed inside the src folder. The folder looks like below:

The src folder

Here is a description of each folder and file.

├── analysis                            <- A way to organize analytical SQL queries in your project such as the     general ledger from your QuickBooks.
├── docs                                <- md files that are referenced in schema.Yml and used
|                                          when generating the documentation.
│   ├── data_category                 
├── logs
├── macros                              <- contains the macros functions in sql files that are then 
|                                          used through the use of jinja templates.
├── models                              <- sql files containing the transfromations. These 
|                                          transformations are always in the form of SELECT statements
|                                          resulting in the creation of a table, a view, an ephemeral
|                                          table,  ...etc. They form a sequence of execution steps by 
|                                          referencing other models.
│
│   ├── bronze  
│       ├── data_category1
│           ├── bronze_table_name.sql   <- ephemarals tables to be refrenced
│           ├── schema_table_name.sql   <- table's information
│           ├── schema_bronze.sql       <- bronze tables' information
│       ├── data_category2
│       ...
│   ├── silver
│       ├── data_category1
│       ...
│
├── tests                               <- contains sql test queries that can be defined inside the schema.yml.
├── dbt_project.yml                     <- contains project configuration (default) including .sql models,
|                                          macros, test and .yml files.   If the modles have their own 
|                                          configuration, there is no need to list them in this file.
├── profiles.yml                        <- contains the connection details to the lakehouse, where 
|                                          several different connections can be defined for different 
|                                          targets/lakehouses.
|
└── README.md   

Create the yml file

In dbt, YAML files (.yml) are at the core of transforming your data models. Models are the sql queries that define how your data will be transformed.

For example, taking the dvc folder under ~/dbt/src/models/bronze/pitfall-trap, we can create the yml file for this form by creating first the yaml file named as: schema_pitfall_trap_sublayer_survey.yml.

Create the structure for this yml file as below.

Pitfall sources file

Pitfall sources file ending

For a full overview of the definitions in a yml file see this.

Let’s define it step by step.

sources

The name key under the sources key defines the source of the tables. They are from the bronze folder.

description - in here you can provide a brief description of your data source.

tables - this defines the data structure in our table.

name: "'pitfall_trap_sublayer_survey_' + env_var('pitfall_trap_sublayer_survey_schema_date')" - the first bracket refers to the table name and it is followed by a _ suffix – 'pitfall_trap_sublayer_survey_'. This table name is appended to the date variable pitfall_trap_sublayer_survey_schema_date which must be automatically retrieved from the text files (.txt) found in the env-variables folder.

description: doc(“bronze_pitfall_trap_sublayer_survey”) - this refers to the variable that stores a description of this table. In dbt, one can use jinja templating language to conduct SQL queries and even documentation. In this case, we place the file containing the description for this table, bronze_pitfall_trap_sublayer_survey.md found in the following path: ~/src/docs/pitfall-trap/pitfall_trap_ref.md as the description.

The columns key contains the following name, datatype and description keys. Still on the same pitfall_trap_sublayer_survey.yml, we have the following:

sources

NB: All documentation for our dbt work is found within the dbt/src/docs folder.

  • name - refers to a column name

  • data_type - refers to the type of data found within this column

  • description - this is the description you would place for this column. As mentioned earlier, we have used a jinja template doc(“objectid”) to reference the doc block containing the description for the objectid column. The doc block containing the description for this column is found within the references.md file in this path ~/dbt/src/docs/references.md.

This pattern repeats itself across all columns until they are completed. There can also be an additional tests key which creates assertions for your data. If your column doesn’t pass the tests defines, the dbt run fails. Tests can be any of the following:

  • unique: the values in the column should be unique

  • not_null: the values in the column should not contain null values

  • accepted_values: the status column should contain some values that you define. These values can be anything but so long as they are found within the column.

  • relationships: the value should match another in a different column or table.

The last block should be ignored. It was initially intended to automatically reference the ever-changing team member’s names.

models:
  - name: 'bronze_pitfall_trap_sublayer_survey'
    columns:
      # - name: recorder 
      #   tests:
      #     - relationships:
      #         to: ref("bronze_dev_ns_team_list")
      #         field: member_name
      # - name: recorder_deploy # I added deploy to this one to match with recorder_deploy in S123
      #   tests:
      #     - relationships:
      #         to: ref("bronze_dev_ns_team_list")
      #         field: member_name
      - name: recorder_retrieve # I added deploy to this one to match with recorder_retrieve in S123
        tests:
          - relationships:
              to: ref("bronze_dev_ns_team_list")
              field: member_name

Having created one yml file for one subtable, that is the sublayer survey, and pitfall form has five subtables in total, in essence there should be five yml files for each subtable like so:

Layer name subtable name yaml name
survey pitfall_trap_sublayer_survey schema_pitfall_trap_sublayer_survey.yml
grid_location_repeat_deploy pitfall_trap_sublayer_grid_location_repeat_deploy schema_pitfall_trap_sublayer_grid_location_repeat_deploy.yml
grid_location_repeat_retrieve pitfall_trap_sublayer_grid_location_repeat_retrieve schema_pitfall_trap_sublayer_grid_location_repeat_retrieve.yml
repeat_deploy pitfall_trap_subtable_repeat_deploy schema_pitfall_trap_subtable_repeat_deploy.yml
repeat_retrieve pitfall_trap_subtable_repeat_retrieve schema_pitfall_trap_subtable_repeat_retrieve.yml

As you can see above, for every sublayer or subtable in any Survey123 form there will be an associated schema .yml with documentation references for each field in the subtable.

The schema yml for pitfall trap

Using a function to autogenerate the yml files

A function has already been created to automatically generate, to a great extent, the redundant work of setting up the yml function files. The function’s name is dbt_yml_func and is available here.

To generate the yml configuration template using the dbt_yml_func function, you will need the path to the Survey123 csv file (s123_path), the path to the Arcgis excel file (xls_form_path) and the subtable name. Using the GEM Grass NPP form as an example, the functions’ parameters would like below:

dbt_yml_func(
    s123_path="../GEM_Grass_NPP_11092024/cages_repeat_1.csv",
    xls_form_path="../GEM_Grass_NPP_11092024/GEM Grass NPP.xlsx",
    subtable_name="gem_grass_npp_subtable_cages_repeat"
)

Here is the generated output. Note that only the source_file_path and the source_file_modification dbt doc functions have been inserted, for the rest you will have to insert the name of the docs description such as "objectid", "globalid" and so on. Each field will have a corresponding description from references.md. At least the bulk of the work has already been done for you!

Sources