Introduction
In January 2025, the data ingestion process was updated. This chapter summarises the steps to follow to ensure a smooth data ingestion process.
The `dev` workspace
The great expectations configuration files
Ensure that all the great expectations configuration files are in proper order and up to date.
For example, for all configuration files that are at version 2, ensure that their individual configurations are in the `gx_development_surveys_v2`, `tasks_surveys_v2`, and `resources_archived/resources_surveys_v2` folders. For the configuration files in the `gx_development_surveys_v2` folder, ensure the version number reads as `2`, or any other updated version number, like so: `version: 2`.
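As a quick sanity check, assuming the v2 files for `gem_metrics` follow the layout sketched below (the exact file names are illustrative, not prescriptive), you can list any file in the v2 gx folder that still lacks the bumped version number:

```bash
# Illustrative layout for a form bumped to version 2 (file names are assumptions):
#   gx_development_surveys_v2/gem_metrics_v2_great_expectations.yml   -> must contain "version: 2"
#   tasks_surveys_v2/gem_metrics_pipeline_config.yml
#   resources_archived/resources_surveys_v2/survey123_gem_metrics.yml
# List any file in the v2 gx folder that does not yet declare version 2:
grep -rL "version: 2" gx_development_surveys_v2/
```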
For the configuration file of, say, `gem_metrics`, which was updated to version 2, the contents should match the example below. Note that a `v2` suffix was added to the job key name, the task keys, and the `conf-file` paths. Also note that the `survey123_version` key under the `tags` key is set to `v2`. The entire file looks as below.
```yaml
resources:
  jobs:
    survey123_gem_metrics_v2:
      name: survey123_gem_metrics_v2
      email_notifications:
        on_failure:
          - databricks-ci
      tasks:
        - task_key: survey123_gem_metrics_v2_ingestion_landing
          job_cluster_key: job_cluster_task
          python_wheel_task:
            package_name: nip_lakehouse
            entry_point: survey123_ingestion_landing
            named_parameters:
              conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks_surveys_v2/gem_metrics_pipeline_config.yml
              data-domain: survey123
              survey-id: 20644bcb6ef94b67a8158f1a810bb547
          libraries:
            - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl
        - task_key: survey123_gem_metrics_v2_landing_bronze
          job_cluster_key: job_cluster_task
          depends_on:
            - task_key: survey123_gem_metrics_v2_ingestion_landing
          python_wheel_task:
            package_name: nip_lakehouse
            entry_point: survey123_landing_bronze
            named_parameters:
              conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks_surveys_v2/gem_metrics_pipeline_config.yml
              gx-file-path: /dbfs/User/${workspace.current_user.userName}/gx/great_expectations.yml
              data-domain: survey123
              survey-id: 20644bcb6ef94b67a8158f1a810bb547
          libraries:
            - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl
      job_clusters:
        - job_cluster_key: job_cluster_task
          new_cluster: ${var.arcgis_cluster}
      tags:
        job_type: ingestion_&_validation
        survey123_group: gem
        survey123_subtype: metrics
        survey123_version: v2
```
The above example serves as a template to follow when faced with updated versions of other forms.
The `resources_merged` folder contains the configuration files of the composite pipelines. When two or more pipelines are combined to run together, say `survey123_gem_transect`, `survey123_gem_cwfl`, `survey123_gem_fdvg`, etc., they are coalesced into a single file called `survey123_gem`, or whatever other name is appropriate. Taking the `survey123_gem` file as an example, ensure that the settings for the individual forms, those in `resources_surveys_v2` or any other updated version folder, correspond to those in `resources_merged`.
For example, in the `resources_merged/survey123_gem` file, the following key values exactly match those in the `resources_surveys_v2` folder.
```yaml
- task_key: survey123_gem_metrics_v2_ingestion_landing
  job_cluster_key: job_cluster_task
  python_wheel_task:
    package_name: nip_lakehouse
    entry_point: survey123_ingestion_landing
    named_parameters:
      conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks_surveys_v2/gem_metrics_pipeline_config.yml
      data-domain: survey123
      survey-id: 20644bcb6ef94b67a8158f1a810bb547
  libraries:
    - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl
- task_key: survey123_gem_metrics_v2_landing_bronze
  job_cluster_key: job_cluster_task
  depends_on:
    - task_key: survey123_gem_metrics_v2_ingestion_landing
  python_wheel_task:
    package_name: nip_lakehouse
    entry_point: survey123_landing_bronze
    named_parameters:
      conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks_surveys_v2/gem_metrics_pipeline_config.yml
      gx-file-path: /dbfs/User/${workspace.current_user.userName}/gx/great_expectations.yml
      data-domain: survey123
      survey-id: 20644bcb6ef94b67a8158f1a810bb547
  libraries:
    - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl
```
The only caveat is to ensure that the geohash block depends on the appropriate task key, also found in `resources_surveys_v2`. In the code below for `survey123_gem`, the `geohash_xy_gem_metrics` task depends on the `survey123_gem_metrics_v2_landing_bronze` task.
```yaml
- task_key: geohash_xy_gem_metrics
  job_cluster_key: job_cluster_task
  depends_on:
    - task_key: survey123_gem_metrics_v2_landing_bronze
  timeout_seconds: 7200
  python_wheel_task:
    package_name: nip_lakehouse
    entry_point: geohash_xy
    named_parameters:
      conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks/geohash_xy_pipeline_config.yml
      dest_geohash_domain: geohash
      dest_geohash_folder: xy
      src_geojson_domain: geojson
      src_geojson_file: NS_LLBN_level7_grid.geojson
      survey_id: 20644bcb6ef94b67a8158f1a810bb547
      survey_abbr: gem_metrics
  libraries:
    - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl
```
For the `ingestion_bronze_...` block, just ensure the required values correspond to the survey name. For example, below you can see that most of the values in the `ingestion_bronze` block are set for the `gem_metrics` survey.
```yaml
- task_key: ingestion_bronze_gem_metrics
  job_cluster_key: job_cluster_task
  depends_on:
    - task_key: geohash_xy_gem_metrics
  timeout_seconds: 7200
  python_wheel_task:
    package_name: nip_lakehouse
    entry_point: ingestion_bronze
    named_parameters:
      conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks/geohash_xy_pipeline_config.yml
      container: geohash
      folder: xy
      subfolder: gem_metrics
  libraries:
    - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl
```
Finally, take the most up-to-date individual pipeline configuration for your form, in this case `gem_metrics`, found in the `resources_surveys_v2` folder, and copy this file into the `resources_dev` folder as well. The configuration file in the `resources_dev` folder should match the one found in the `resources_surveys_v2` folder, or in any other updated version folder.
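A minimal sketch of that copy step (the file name is taken from the example used in this chapter; check your checkout for the real one):

```bash
# Copy the up-to-date v2 resource file into resources_dev and confirm both copies match.
cp resources_surveys_v2/survey123_gem_metrics.yml resources_dev/
diff resources_surveys_v2/survey123_gem_metrics.yml resources_dev/survey123_gem_metrics.yml
```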
Local development
Environment setup
Here we present the steps to follow to set up your local development environment.
Clone the NIP-Lakehouse-Data repository and move into the `dab` directory in a terminal.
- Run `python3 -m venv nip-dab-venv` to create a new virtual environment. A virtual environment is a tool that isolates dependencies for different projects by creating separate Python environments. This ensures that your projects remain distinct from each other, even if they use different package versions, thereby minimizing conflicts.
- Run `source nip-dab-venv/bin/activate` to activate the virtual environment.
- Run `pip install poetry` to install the Poetry package.
- Run `poetry install --with dev` to install the project's packages. Skip this if you have done it in the past, but run it again if your packages need updating.
- Install the Azure CLI, run `az login`, and follow the steps to log in to the Azure CLI.
- Run `databricks auth login --host <url_of_the_dev_databricks_workspace>` and follow the requested actions to log in to the Databricks CLI. For the `ns-ii-tech-dev-workspace`, which is the starting point of our work, the command will be `databricks auth login --host https://adb-5442438122618419.19.azuredatabricks.net/`.
- Duplicate the `.env.example` file and rename it as `.env`. This file contains the environment variables and some secrets used by the project. Ask your supervisor for the values of the different variables. When working across different workspaces, if working in the `dev` workspace, paste the `dev` environment configuration into the `.env` file; when working in the `stg` environment, paste the `stg` environment variables there. It is a good idea to keep the environment variables of both workspaces in different `.env` files, such as `.env-dev` and `.env-stg`. NB: It is not strictly necessary to run this step, since `databricks auth login` does everything for us.
- If you are working in the VS Code editor, duplicate the `.vscode.example` folder and rename it as `.vscode`.
- Install the [VS Code Databricks extension] and connect your Databricks account. This extension will be used to sync your files to the Databricks workspace.
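For reference, here is a minimal sketch of the same setup condensed into one terminal session. The repository URL and local path are placeholders, and the `.env`/`.vscode` steps are optional as noted above:

```bash
# Condensed version of the setup steps above (repository URL and local path are placeholders).
git clone <url-of-the-NIP-Lakehouse-Data-repository>
cd NIP-Lakehouse-Data/dab

python3 -m venv nip-dab-venv          # isolated Python environment for this project
source nip-dab-venv/bin/activate
pip install poetry
poetry install --with dev             # project packages, including dev dependencies

az login                              # Azure CLI authentication
databricks auth login --host https://adb-5442438122618419.19.azuredatabricks.net/   # dev workspace

cp .env.example .env                  # optional: fill in values from your supervisor
cp -r .vscode.example .vscode         # only if you work in VS Code
```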
Apply the great expectations changes
- Click on the Databricks extension; under Remote Folder, ensure it is syncing to this workspace: `/Workspace/Users/<personal-organisation-email>/.bundle/nip_lakehouse/dev/files`. The files will be synced to a `.bundle` folder.
- Sync the files to the `dbfs` folder. Run `source sync.sh dev`.
. -
Run the
gx_deploy_yml
notebook in Databricks. Thegx_deploy_yml
notebook being referred to is the one already synced by the databricks extension. It is found in this path:/Workspace/Users/<organisation-email>/.bundle/nip_lakehouse/dev/files/development/gx_deploy_yml
. Ensure at the last cells all bars are green. This means all the great expectations were applied!
To check whether the new great expectations have been applied, in case significant changes were made to the form, proceed to the `great-expectations / json_files / expectations` path under the `nsiitechdevadlset > Containers` section in Azure. For example, for `gem_metrics` you can see that the JSON for this file was updated to version 2.
In case you made some gx changes, such as adding more expectations or relaxing some, you can click on the JSON, in this case `gem_metrics_v2_sublayer_survey.json`, and a new interface will show up. The Edit menu shows the current expectations for this form.
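If you prefer the terminal to the Azure portal, something like the sketch below can list and download the expectation JSONs. The container name, blob prefix, and required permissions (for example a Storage Blob Data Reader role) are assumptions derived from the portal path above; verify them before use:

```bash
# Assumed storage account, container, and prefix, derived from the portal path above.
az storage blob list \
  --account-name nsiitechdevadlset \
  --container-name great-expectations \
  --prefix json_files/expectations/ \
  --auth-mode login -o table

az storage blob download \
  --account-name nsiitechdevadlset \
  --container-name great-expectations \
  --name json_files/expectations/gem_metrics_v2_sublayer_survey.json \
  --file gem_metrics_v2_sublayer_survey.json \
  --auth-mode login
```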
- Deploy the job of the survey, in this case `survey123_gem_metrics_v2`, into Databricks. This is how to do it: go to the `databricks.yml` file and, under the `include` key, ensure that only the up-to-date pipeline configuration for the individual form is left uncommented. In this case it is our `resources_surveys_v2/survey123_gem_metrics.yml` file.
```yaml
include:
  # - resources/*.yml
  - resources_surveys_v2/survey123_gem_metrics.yml
  # - resources_surveys_v3/*.yml
  # - resources_merged/*.yml
```
The rest should be commented out. Thereafter, run this command: `databricks bundle deploy --force`.
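Before running the deploy command above, you can optionally validate the bundle configuration to catch mistakes early; a minimal sketch, where the `dev` target name is an assumption (check the `targets` section of `databricks.yml`):

```bash
# Optional sanity check before deploying; the "dev" target name is an assumption.
databricks bundle validate -t dev
```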
Wait until it finishes. You will see a new pipeline called `survey123_gem_metrics_v2` in Databricks.
Once done, ensure you return the `databricks.yml` file to its previous state, that is:
```yaml
include:
  - resources/*.yml
  - resources_surveys_v2/*.yml
  - resources_surveys_v3/*.yml
  - resources_merged/*.yml
```
- Run the pipeline. Use the command below to run your pipeline:

```bash
databricks bundle run survey123_gem_metrics_v2 -t dev
```

Alternatively, just click and run the pipeline in Databricks.
If the pipeline succeeds, open a Pull Request (PR) into dev, and once it is approved and merged, run the `Dab: deploy` workflow with everything set to dev. The following should be your `Dab: deploy` parameters:

- Use workflow from: Branch: dev
- The branch to build: dev
- The environment to deploy to: dev

Running `Dab: deploy` will ensure the new pipelines are persisted under the creator name `ns-ii-tech-dev-databricks-ci`.
You can now go to Workflows and see that the new pipeline's Created by value is `ns-ii-tech-dev-databricks-ci`.
This means the pipeline is now persisted permanently.
If the pipeline is now persisted, you are ready to replicate the same in the `stg` workspace!
NB: If the above methods do not work, i.e. the `dev` pipeline failed, then open a PR to dev. When your branch is merged to `dev`, run the `Build GX Conf file` workflow with everything set to dev, the name of the survey as `gem_metrics`, the version number as `2`, and loop as `no` (if you select yes it loops over everything, which is not recommended). Then proceed with the steps below.
Run the `Gx Static website` workflow with everything set to `dev`.
Run the `Dab: deploy` workflow with everything set to `dev`.
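If you prefer the GitHub CLI to the web UI, the same three workflows can be triggered from a terminal. The workflow identifiers and input names below are assumptions; check the files under `.github/workflows/` for the real names before using them:

```bash
# Assumed workflow names and input keys - verify against .github/workflows/ first.
gh workflow run "Build GX Conf file" --ref dev -f survey=gem_metrics -f version=2 -f loopall=no
gh workflow run "Gx Static website" --ref dev
gh workflow run "Dab: deploy" --ref dev
gh run watch   # pick a run to follow from the terminal
```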
Thereafter proceed with step 3 onwards. Step 2 is actually supposed to ensure the great expectations are applied without having to run the above workflows.
The `stg` workspace
First of all, ensure that your pipeline ran successfully in `dev`.
- Request a PR of `dev` to `stg`. Once merged, proceed.
- Run the gx build workflow for the different surveys for which you made changes to the great expectations.
Then run the `Build GX Conf file` workflow. The purpose of this step is to persist the expectations used to validate data. When we validate our data against the expectations we have set, gx checks whether those expectations match those stored in our Azure directory within the `nsiitechdevadlset` container.
When we run this step, we ensure that the expectations in the `nsiitechstgadlset` container are updated with the latest changes.
- Run the `Dab: Deploy` workflow action in staging (`stg`).
- Use workflow from: branch: stg
- The branch to build: stg
- The environment to deploy to: stg
This step is crucial for creating or updating the parameters of a particular job: running it ensures that the job's parameters are updated in Databricks. When you create a new job, running `Dab: deploy` will ensure that the new job also appears in Databricks.
- Run the Deploy GX Azure Static Web Apps workflow.
- Use workflow from: Branch: stg
- The branch to build: stg
- The environment to deploy to: stg
You may have relaxed or added new expectations. Running this step ensures that when a pipeline is run, regardless of the success or failure of the pipeline, the status of the defined expectations (success or failure) will be shown on the gx website.
If the process succeeded in the `dev` workspace, then the next time the respective automated scheduled composite pipelines run, they will also run successfully. But you can try running them manually in `stg` to see whether they succeed, by continuing from Step 3 of the Apply the great expectations changes section.
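A hedged sketch of such a manual run from the terminal; the `stg` target name and the merged job name are assumptions, so check `databricks.yml` and the `resources_merged` folder for the real identifiers:

```bash
# Assumed target and job names - verify against databricks.yml and resources_merged/.
databricks auth login --host <url_of_the_stg_databricks_workspace>
databricks bundle run survey123_gem -t stg
```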
The `prd` workspace
Once data ingestion has been successful in the `ns-ii-tech-stg-db-workspace`, it is time to perform the same process in the production workspace. The production workspace is `nip-prd-db-workspace`.
Follow the steps below.
Step 1: Perform a Pull Request of `stg` into the `main` branch
Once your pipeline(s) in staging (`stg`) have run successfully and you are satisfied with the results, it's time to push the changes into the `main` branch. The production (`prd`) environment uses the files in the `main` branch.
Ask someone within Natural State to review your code before merging with `main`. If there are no problems and your merge has been performed, it is time to proceed to step 2.
Step 2: Update the great expectations variables in main
The great expectations options and variables in the `main` branch are updated by running the `GX: Build gx conf files` GitHub action. Do so by setting your GitHub action parameters as below:
In other words, you are using:
- the branch to use the workflow from: `main`.
- You are building to branch `main` (again).
- The environment being deployed to is `prd`.
- `loopall` - this asks whether you would like to update the changes in all surveys. Ideally this should be `no`, since selecting `yes` loops through all the numerous surveys, which is really time-consuming! It is better to go with one survey at a time, and for this reason the best answer is `no`.
- `survey` - the name of the survey for which to update the great expectations configurations.
- `version` - the version of the survey to update. This is easily retrieved by checking the name of the great expectations file of the survey you want to update. If it has a `v2` in the name, such as `cpp_small_tree_v2_great_expectations.yml`, then this is version `2`. If it has a `v3`, such as `lab_bulk_density_v3_great_expectations.yml`, then the value `3` goes into the `version` field.
Press `Run workflow` to initiate the process of updating the gx configurations. You will have to do this for every form that was updated in `stg`.
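As in the `dev` section, this can also be triggered from the GitHub CLI; the workflow identifier and input names are assumptions to verify against `.github/workflows/`:

```bash
# Assumed workflow and input names - verify before use.
gh workflow run "GX: Build gx conf files" --ref main -f survey=cpp_small_tree -f version=2 -f loopall=no
```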
To check whether your Gx configurations workflow is running, proceed to the `Job runs` tab in the `Workflows` menu in Databricks. Ensure you are in the `nip-prd-db-workspace`!
Your Gx workflows should show as `Succeeded` in the Status column. You can check the name of the workflow in the Run parameters column; it provides not only the name of the survey but also the values of other parameters such as `loopall` and `version`.
Step 3: Deploy the pipeline
Once you are satisfied that the Gx Configurations for all the respective surveys and their appropriate versions have been updated, it’s time to update the pipeline with these changes.
Run the `Dab: Deploy` GitHub action with the following parameters. This is what this GitHub action does:
- It uses the parameters in the `main` branch to perform the workflow.
- It builds on the `main` branch.
- It deploys the updates to the `prd` environment.
Once you've finished with the above three steps, it's time to run the pipeline in the `nip-prd-db-workspace`. In this case, our pipeline of interest is the `survey123_cpp` pipeline, since we updated the configurations for `survey123_cpp_small_tree_v2_great_expectations.yml`. It may be a different pipeline in your case, but it is imperative to ensure the pipeline runs successfully!
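One hedged way to trigger that run from the terminal; the `prd` target name is an assumption, so check the `targets` section of `databricks.yml` first:

```bash
# Assumed "prd" target name - verify against databricks.yml.
databricks auth login --host <url_of_the_prd_databricks_workspace>
databricks bundle run survey123_cpp -t prd
```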
At the time of writing, the GX Static website for `prd` was yet to be released. However, if an error occurs in `prd`, run the GX Static website GitHub action and check on the GX static website whether some validations are failing in `prd`.