Introduction
A model is just a sql file that tells dbt how to transform your data. At this stage, particularly when working with bronze
files, the sql files will majorly be creating views of their respective subtables in our Azure cloud storage. For example, when running a model bronze_pitfall_trap_sublayer_survey
for our pitfall form, a view by the name of pitfall_trap_sublayer_survey
will be created.
Creating the sql
models
A sql files in dbt is a model. The sql file contains a single sql statement. The model name is inherited from the file name.
Depending on the sql statement, the output can be a table or a view.
Here is the sql for pitfall trap sublayer survey called bronze_pitfall_trap_sublayer_survey.sql
. The bronze
prefix refers to the source of the table, in this case the bronze
folder. The sql models are found in the same folder as the yaml files.
In a dbt sql model, the config()
sets the configurations of your model. Think of configurations as the settings that can be set in your model and how you would like them to appear in your data warehouse.
Let’s go line by line.
The config()
encapsulates the settings that define how your data will appear in the data warehouse.
materialized
- defines how you would like your data to persist in the data warehouse. In our case, the data will persist as a view.
alias="pitfall_trap_sublayer_survey"
- this overrides the name of the model, which is essentially the filename. One can use the alias to provide a different label of how the file will appear in the data warehouse.
schema="bronze"
- this is the target environment in which your models will be built. Definitely in this case it will be the bronze
folder in the ns-ii-tech-dev-catalog
in Azure.
tags=['pitfall_trap']
- a tag in dbt is used to categorize a model, test and/or other dbt artifacts. Tags have the following purposes:
-
Organization: Tags can help in organizing and grouping similar models or tests together.
-
Execution Control: You can run specific sets of models or tests based on their tags. This can be especially useful when you have a large number of models and you want to run a subset of them. For example, to only run models that have the tag
pitfall
you can do so through this:dbt run --select tag:pitfall
. dbt will run the models having the tagpitfall
even if they are in their hundreds.
Let’s go to this line:
with table_v1 as (
select * from source('bronze', get_full_table_name('pitfall_trap_sublayer_survey'))
)
The with table_v1 as
is a Common Table Expression (CTE) that creates the expression name table_v1
. This table_v1
expression picks tables from the bronze
schema and the table name defined in get_full_table_name()
block of the select statement.
After the above, the select statement follows.
select
objectid,
globalid,
date_time_details,
project,
area,
survey,
deployment,
transport,
field_team_list,
activity,
date_time_deploy,
recorder_deploy,
date_time_retrieve,
recorder_retrieve,
pitfall_trap_survey_comment,
end_time,
creator,
editor,
x,
y,
source_file_path,
source_file_modification_time,
creationdate,
editdate
from
table_v1
The above select statement picks the columns defined in the get_full_table_name()
block, which are enveloped by the table_v1
statement. The selected columns are the ones that get displayed in the pitfall_trap_sublayer_survey
view in the Azure storage.
Now that we have demonstrated how to create dbt files using just one subtable of the pitfall trap survey form, the same process will be replicated for other subtables such as repeat_deploy
and repeat_retrieve
.
Your endgame of the dbt file creation should look something like this:
Automating the creation of sql
models
A function by the name of dbt_sql_func
has already been created for you. The function takes the following parameters: path to the Survey123 csv file (s123_path
), the path to the Arcgis excel file (xls_form_path
), the subtable name (subtable_name
) and finally the tags (tags
).
Here is the function in full swing.
dbt_sql_func(
s123_path="../GEM_Leaf_Litter_09092024/leaf_litter_details_1.csv",
xls_form_path="../GEM_Leaf_Litter_09092024/GEM Leaf Litter.xlsx",
subtable_name="gem_leaf_litter_subtable_leaf_litter_details",
tags=["gem", "leaf_litter"]
)
Here is the generated output.
The above function is available from here.