Introduction

In dbt, running a model means executing its SQL files to build and transform the data in your data warehouse. Since we have so far only worked with our bronze models and built up their yml configurations, our SQL models will only build views of each Survey123 subtable.

Configurations

The environment configurations

Before running your models, it is worthwhile to check the health of your environment variable files. These are found at github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/vars/. There are two of them: one for the dev environment and one for staging, abbreviated as stg.

The dev file is at github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/vars/env.example, while the staging file is at github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/vars/env.example_stg.
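A quick way to confirm that both files exist is to list the vars folder. The home path below follows the terminal prompts used later in this section; adjust it to your own checkout:

ls /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/vars/
# expected output: env.example  env.example_stg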

In either of the environment variable files, ensure that the following variables have been set correctly. Your supervisor will guide you appropriately.

# Save environment variables (for development)
export DBT_PROFILES_DIR='.'
# export DATABRICKS_TOKEN='xxx'
# This token is for dbt; it has no expiry on the ns-ii-teck-dev-workspace
export DATABRICKS_TOKEN='xxx' # DBT
export DATABRICKS_WORKSPACE='xxx'
export DATABRICKS_CLUSTER_ID='xxx'
export DATABRICKS_HOST='xxx'
export DATABRICKS_HTTP='xxx'
export DBT_ACCESS_TOKEN='xxx'

# export gx vars
export IS_LOCAL="True"
export AZURE_CLIENT_SECRET="xxx"
export AZURE_CLIENT_ID="xxx"
export AZURE_TENANT_ID="xxx"
export AZURE_STORAGE_ACCOUNT_NAME="nsiitechstgadlsetl"


Finally, while still in the env.example file, the paths to the schema date files for each of our forms are defined. We shall elaborate on schema dates later; the grep/xargs pattern used in the listing below is explained just after it.

export DATABRICKS_AUTH_TYPE='pat'

export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/dev/monitoring.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/dev/schem_edit_date_gem.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/dev/schem_edit_date_cpp.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/dev/schem_edit_date_lgt.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/dev/schem_edit_date_ct.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/dev/schem_edit_date_dvc.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/dev/schem_edit_date_acstc.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/dev/schem_edit_date_bpc.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/dev/schem_edit_date_pitfall_trap.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/dev/schem_edit_date_bareground_dpm.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/dev/schem_edit_date_cvat.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/dev/schem_edit_date_xprize.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/dev/schem_edit_date_lab.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/dev/schem_edit_date_si.txt | xargs)
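Each of these lines reads KEY=value pairs from a plain text file, drops comment lines with grep -v '^#', flattens what remains with xargs, and hands the result to export so the variables land in the current shell. Below is a minimal sketch of how a single line behaves; the variable name and date are hypothetical placeholders, not actual values from the project:

# Suppose dev/monitoring.txt contains (hypothetical contents):
#   # monitoring form
#   monitoring_sublayer_survey_schema_date="2024_01_01__00_00_00"

# grep -v '^#' removes the comment line, xargs flattens the remaining
# KEY=value pair, and export loads it into the current shell.
export $(grep -v '^#' /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/dev/monitoring.txt | xargs)

# Confirm the variable is now set (prints an empty line if it is not):
echo "$monitoring_sublayer_survey_schema_date"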

For the staging environment, whose parameters are defined in the env.example_stg file, each of the above connection variables will have its own value, and the paths to our schema dates are set as follows:

export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/stg/monitoring.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/stg/schem_edit_date_gem.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/stg/schem_edit_date_cpp.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/stg/schem_edit_date_lgt.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/stg/schem_edit_date_ct.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/stg/schem_edit_date_dvc.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/stg/schem_edit_date_acstc.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/stg/schem_edit_date_bpc.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/stg/schem_edit_date_pitfall_trap.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/stg/schem_edit_date_bareground_dpm.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/stg/schem_edit_date_cvat.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/stg/schem_edit_date_xprize.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/stg/schem_edit_date_lab.txt | xargs)
export $(grep -v '^#'  /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/stg/schem_edit_date_si.txt | xargs)

Just ensure the paths in the environment files are pointing to the right destination!

Preparations for model run

Before running the models, ensure that your virtual environment is active. You can activate it by running source nip-dbt-venv/bin/activate. Make sure you are inside the dbt folder when you do this.

Thereafter, move into your src folder via cd src/.
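Put together, the preparation steps look like this; the repository path follows the terminal prompt shown later in this section, so adjust it to your own machine:

cd /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/
source nip-dbt-venv/bin/activate    # activate the project's virtual environment
cd src/                             # the dbt project lives in dbt/src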

Then load the environment variables by sourcing the environment file:

source /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/vars/.env.example
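To confirm that the variables were picked up, echo a few of them. Empty output means the file was not sourced correctly:

echo "$DATABRICKS_HOST"
echo "$DATABRICKS_CLUSTER_ID"
echo "$DBT_PROFILES_DIR"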

Running the models

The first thing you should do is run dbt run -t dev in your terminal.

It should run and generate some output. If there is an error, the process will terminate and display a message similar to this:

(nip-dbt-venv) sammigachuhi@gachuhi:~/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/src$ dbt run -t dev
06:11:52  Running with dbt=1.7.14
06:11:53  Registered adapter: databricks=1.7.14
06:11:54  Encountered an error:
Parsing Error
  Env var required but not provided: 'xprize_sens_pkg_sublayer_survey_schema_date'

This means that an environment variable by the name of 'xprize_sens_pkg_sublayer_survey_schema_date' is missing.

The environment variable files are found in the env_variables folder. Within this folder are two sub-folders, dev and stg.

If an Env var error such as the one above appears, the first solution is to create a txt file named in the format schem_edit_date_<name_of_form>.txt. The above error relates to the xprize_sens_pkg form abbreviation, so our variable text file will be schem_edit_date_xprize_sens_pkg.txt. Inside it, insert the following:

# xprize sens pkg
xprize_sens_pkg_sublayer_survey_schema_date=""
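As a sketch, the file can be created straight from the terminal. The dev path mirrors the export lines shown earlier, and the value is deliberately left empty until we look it up in the next step:

# Create the schema date file for the xprize_sens_pkg form (dev environment)
cat > /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/dev/schem_edit_date_xprize_sens_pkg.txt <<'EOF'
# xprize sens pkg
xprize_sens_pkg_sublayer_survey_schema_date=""
EOF

If env.example does not yet contain an export line pointing at this new file (following the same grep/xargs pattern shown earlier), add one there as well.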

So how do we know which date and time to insert into the schem_edit_... text file?

Go to your Azure Databricks workspace and proceed to the SQL Editor tab.

SQL Editor

Under the ns-ii-tech-dev-catalog > bronze folder, there is the table_metadata table. This table contains the metadata of all the tables ingested so far.

To find the schema version date for our xprize_sens_pkg_sublayer_survey table, use a SQL query like the one below:

SELECT * FROM `ns-ii-tech-dev-catalog`.bronze.table_metadata 
WHERE survey_abbr = 'xprize_sens_pkg';

This will bring up only the row for xprize_sens_pkg_sublayer_survey. Scroll until you see the schema_edit column. Copy the date value in this column; in our case it is 2024_06_05__08_06_10. This value is what gets pasted into the schema text file, like so:

xprize_sens_pkg_sublayer_survey_schema_date="2024_06_05__08_06_10"
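After saving the value, reload the file into your shell using the same grep/xargs pattern and confirm that the variable now resolves, for example:

export $(grep -v '^#' /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/dbt/env_variables/dev/schem_edit_date_xprize_sens_pkg.txt | xargs)
echo "$xprize_sens_pkg_sublayer_survey_schema_date"    # should print 2024_06_05__08_06_10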

If you investigate the ns-ii-tech-dev-catalog, you will notice that this is the same date value appended to the end of the survey sublayer table name.

Schema date

You may need to repeat the same procedure of extracting the schema date for every similar Env var... error you encounter.
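Rather than discovering the missing variables one failed run at a time, you can list every schema date variable the project expects up front. The command below is only a sketch: it assumes the schema dates are referenced through dbt's env_var() function inside the files under models/, so adapt the pattern to how your project actually references them:

# From inside dbt/src: list every *_schema_date variable referenced via env_var()
grep -rhoE "env_var\('[A-Za-z0-9_]+_schema_date'\)" models/ | sort -u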

If you run dbt run -t dev and there are no major errors, the first few lines of our output will look like this:

13:28:39  Running with dbt=1.7.14
13:28:40  Registered adapter: databricks=1.7.14
13:28:41  Found 198 models, 1 seed, 352 tests, 216 sources, 0 exposures, 0 metrics, 684 macros, 0 groups, 0 semantic models
13:28:41 

Thereafter dbt will start performing transformations as per your model definitions.

Running models

A more successful run will produce better-looking output, like the lines below:

13:33:59  123 of 190 START sql view model bronze.gem_soil_respiration_sublayer_survey .... [RUN]
13:34:01  123 of 190 OK created sql view model bronze.gem_soil_respiration_sublayer_survey  [OK in 1.42s]
13:34:01  124 of 190 START sql view model bronze.gem_soil_respiration_subtable_subplot_1_details  [RUN]
13:34:02  124 of 190 OK created sql view model bronze.gem_soil_respiration_subtable_subplot_1_details  [OK in 1.31s]
-- snip --

The final message will be similar to this:

13:35:41  Done. PASS=39 WARN=0 ERROR=142 SKIP=9 TOTAL=190

Afterwards, if you refresh your catalog page in Azure, you will see the views generated by the successful SQL models.

Successful sql queries

You can also run a single model, such as the one for GEM Grass NPP. To do so, simply append the path (or name) of your model after the --select flag. For example:

dbt run -t dev --select models/bronze/gem/grass_npp

Since it's only a single model, the run completes quickly, unlike the earlier run where everything was built.
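You can also select a model by name rather than by path, and pull in its upstream dependencies with dbt's + graph operator. The model name below is illustrative; substitute the name of the model you want to build:

# Run a single model by name
dbt run -t dev --select grass_npp_sublayer_survey

# Run that model plus all of its upstream dependencies
dbt run -t dev --select +grass_npp_sublayer_survey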