Introduction

In January 2025, the data ingestion process was updated. This chapter summarises the steps to follow to ensure a smooth data ingestion process.

The dev workspace

The great expectations configuration files

Ensure that all the great expectations configuration files are in proper order and updated.

For example, for all configuration files that are version 2, ensure that their individual configurations are in the gx_development_surveys_v2, tasks_surveys_v2, and resources_archived/resources_surveys_v2 folders. For the configuration files in the gx_development_surveys_v2 folder, ensure the version number reads as 2, or any other updated version number, like so: version: 2.
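A quick way to spot-check this from a terminal (a sketch only; the file name below is illustrative, so adjust it to the form you are working on):

grep -n "version:" gx_development_surveys_v2/gem_metrics_pipeline_config.yml
# expected output for a v2 form, something like:
# 1:version: 2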

For the configuration file of, say, gem_metrics, which was updated to version 2, the contents should match the example below. Note that a v2 suffix was added to the jobs key name, the task keys, and the conf-file paths, and that the survey123_version key under tags is also set to v2.

resources:
  jobs:
    survey123_gem_metrics_v2:
      name: survey123_gem_metrics_v2
      email_notifications:
        on_failure:
          - databricks-ci

      tasks:
        - task_key: survey123_gem_metrics_v2_ingestion_landing
          job_cluster_key: job_cluster_task
          python_wheel_task:
            package_name: nip_lakehouse
            entry_point: survey123_ingestion_landing
            named_parameters:
              conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks_surveys_v2/gem_metrics_pipeline_config.yml
              data-domain: survey123
              survey-id: 20644bcb6ef94b67a8158f1a810bb547
          libraries:
            - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl

        - task_key: survey123_gem_metrics_v2_landing_bronze
          job_cluster_key: job_cluster_task
          depends_on:
            - task_key: survey123_gem_metrics_v2_ingestion_landing
          python_wheel_task:
            package_name: nip_lakehouse
            entry_point: survey123_landing_bronze
            named_parameters:
              conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks_surveys_v2/gem_metrics_pipeline_config.yml
              gx-file-path: /dbfs/User/${workspace.current_user.userName}/gx/great_expectations.yml
              data-domain: survey123
              survey-id: 20644bcb6ef94b67a8158f1a810bb547
          libraries:
            - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl

      job_clusters:
        - job_cluster_key: job_cluster_task
          new_cluster: ${var.arcgis_cluster}

      tags:
        job_type: ingestion_&_validation
        survey123_group: gem
        survey123_subtype: metrics
        survey123_version: v2

The above serves as a template to follow when faced with updated versions of other forms.

The resources_merged folder contains the configuration files of the composite pipelines. When two or more pipelines are combined into a single file to run together, say survey123_gem_transect, survey123_gem_cwfl, survey123_gem_fdvg, etc., they are coalesced into a single file called survey123_gem (or whatever other name is appropriate). Taking the survey123_gem file as an example, ensure that the settings for the individual forms, i.e. those in resources_surveys_v2 (or any other updated version folder), correspond to those in resources_merged.

For example, in the resources_merged/survey123_gem file, the following blocks must exactly match those in the resources_surveys_v2 folder.

        - task_key: survey123_gem_metrics_v2_ingestion_landing
          job_cluster_key: job_cluster_task
          python_wheel_task:
            package_name: nip_lakehouse
            entry_point: survey123_ingestion_landing
            named_parameters:
              conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks_surveys_v2/gem_metrics_pipeline_config.yml
              data-domain: survey123
              survey-id: 20644bcb6ef94b67a8158f1a810bb547
          libraries:
            - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl

        - task_key: survey123_gem_metrics_v2_landing_bronze
          job_cluster_key: job_cluster_task
          depends_on:
            - task_key: survey123_gem_metrics_v2_ingestion_landing
          python_wheel_task:
            package_name: nip_lakehouse
            entry_point: survey123_landing_bronze
            named_parameters:
              conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks_surveys_v2/gem_metrics_pipeline_config.yml
              gx-file-path: /dbfs/User/${workspace.current_user.userName}/gx/great_expectations.yml
              data-domain: survey123
              survey-id: 20644bcb6ef94b67a8158f1a810bb547
          libraries:
            - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl
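One way to eyeball that the two files stay in sync (a sketch; the merged file is assumed to be named survey123_gem.yml, so adjust the file names to your repository) is to list the gem_metrics task keys in both and compare them:

grep -n "task_key: survey123_gem_metrics" resources_surveys_v2/survey123_gem_metrics.yml resources_merged/survey123_gem.yml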

The only caveat is to ensure that the geohash block depends on the appropriate task key, which is also found in resources_surveys_v2. In the code below for survey123_gem, the geohash_xy_gem_metrics task depends on the survey123_gem_metrics_v2_landing_bronze task.

        - task_key: geohash_xy_gem_metrics
          job_cluster_key: job_cluster_task
          depends_on:
            - task_key: survey123_gem_metrics_v2_landing_bronze
          timeout_seconds: 7200
          python_wheel_task:
            package_name: nip_lakehouse
            entry_point: geohash_xy
            named_parameters:
              conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks/geohash_xy_pipeline_config.yml
              dest_geohash_domain: geohash
              dest_geohash_folder: xy
              src_geojson_domain: geojson
              src_geojson_file: NS_LLBN_level7_grid.geojson
              survey_id: 20644bcb6ef94b67a8158f1a810bb547
              survey_abbr: gem_metrics
          libraries:
            - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl


For the ingestion_bronze_... block, just ensure that the relevant values correspond to the survey name. For example, in the block below the task key, the depends_on task, and the subfolder are all set for the gem_metrics survey.

        - task_key: ingestion_bronze_gem_metrics
          job_cluster_key: job_cluster_task
          depends_on:
            - task_key: geohash_xy_gem_metrics
          timeout_seconds: 7200
          python_wheel_task:
            package_name: nip_lakehouse
            entry_point: ingestion_bronze
            named_parameters:
              conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks/geohash_xy_pipeline_config.yml
              container: geohash
              folder: xy
              subfolder: gem_metrics
          libraries:
            - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl

Finally, take the most up-to-date individual pipeline configuration for your form, in this case gem_metrics, found in the resources_surveys_v2 folder, and copy it into the resources_dev folder as well. The configuration file in the resources_dev folder should match the one found in the resources_surveys_v2 folder (or whichever other updated version folder applies).
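A rough sketch of this copy-and-verify step (assuming the file keeps the same name in resources_dev; adjust the names to your form):

cp resources_surveys_v2/survey123_gem_metrics.yml resources_dev/survey123_gem_metrics.yml
# the diff should print nothing if the two copies match
diff resources_surveys_v2/survey123_gem_metrics.yml resources_dev/survey123_gem_metrics.yml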

Local development

Environment setup

Here we present the steps to follow to set up your local development environment.

Clone the NIP-Lakehouse-Data repository and, in a terminal, move into the dab directory.

  1. Run python3 -m venv nip-dab-venv to create a new virtual environment. A virtual environment is a tool that isolates dependencies for different projects by creating separate Python environments. This ensures that your projects remain distinct from each other, even if they use different package versions, thereby minimizing conflicts.

  2. Run source nip-dab-venv/bin/activate to activate the virtual environment.

  3. Run pip install poetry to install the poetry package.

  4. Run poetry install --with dev to install the project’s packages. Skip this if you have already done it in the past, but run it again if your packages need updating.

  5. Install the Azure CLI, then run az login and follow the steps to log in.

  6. Run databricks auth login --host <url_of_the_dev_databricks_workspace> and follow the requested actions to log in to the Databricks CLI. For the ns-ii-tech-dev-workspace, which is the starting point of our work, the command is databricks auth login --host https://adb-5442438122618419.19.azuredatabricks.net/.

  7. Duplicate the .env.example file and rename it as .env. This file contains the environment variables and some secrets used by the project. Ask your supervisor for the values of the different variables. When working across different workspaces, paste the dev environment configuration into the .env file when working in the dev workspace, and the stg configuration when working in stg. It is good practice to keep the environment variables of both workspaces in separate files, such as .env-dev and .env-stg (see the sketch after this list). NB: it is not necessary to run this step, since databricks auth login does everything for us.

  8. If you are working in the VS Code editor, duplicate the .vscode.example folder and rename it as .vscode.

  9. Install the VS Code Databricks extension and connect your Databricks account. This extension will be used to sync your files to the Databricks workspace.
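For reference, the steps above boil down to roughly the following commands (a sketch only; the workspace URL is the dev one from step 6, and the .env handling follows the per-workspace suggestion in step 7):

python3 -m venv nip-dab-venv
source nip-dab-venv/bin/activate
pip install poetry
poetry install --with dev
az login
databricks auth login --host https://adb-5442438122618419.19.azuredatabricks.net/
# keep one .env per workspace and copy the right one in before working
cp .env.example .env-dev   # fill in the dev values from your supervisor
cp .env-dev .env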

Apply the great expectations changes

  1. Click on the Databricks extension and, under Remote Folder, ensure it is syncing to this workspace: /Workspace/Users/<personal-organisation-email>/.bundle/nip_lakehouse/dev/files. The files will be synced to a .bundle folder.

The databricks extension

Then sync the files to the dbfs folder: run source sync.sh dev.

  2. Run the gx_deploy_yml notebook in Databricks. The notebook being referred to is the one already synced by the Databricks extension; it is found at /Workspace/Users/<organisation-email>/.bundle/nip_lakehouse/dev/files/development/gx_deploy_yml. Ensure that in the last cells all bars are green; this means all the great expectations were applied!

To check whether the new great expectations have been applied, in case significant changes had been made to the form, proceed to the great-expectations / json_files / expectations path under the nsiitechdevadlset > Containers section in Azure. For example, for gem_metrics you can see that the JSON for this form was updated to version 2.

Gem metrics v2 json

In case you had made some gx changes, such as adding more options or relaxing some, you can click on the JSON, in this case gem_metrics_v2_sublayer_survey.json, and a new interface will show up. The Edit menu shows the current expectations for this form.

GEM metrics v2 interface
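If you prefer to confirm this from a terminal rather than the Azure portal, a blob listing works too (a sketch only, assuming great-expectations is the container and json_files/expectations the path as described above, and that you are logged in with az login with read access to the storage account):

az storage blob list \
  --account-name nsiitechdevadlset \
  --container-name great-expectations \
  --prefix json_files/expectations/ \
  --auth-mode login \
  --output table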

  3. Deploy the survey's job, in this case survey123_gem_metrics_v2, into Databricks. To do so, go to the databricks.yml file and, under the include key, ensure that only the up-to-date pipeline configuration for the individual form is left uncommented. In this case it is our resources_surveys_v2/survey123_gem_metrics.yml file:
include:
  # - resources/*.yml
  - resources_surveys_v2/survey123_gem_metrics.yml
  # - resources_surveys_v3/*.yml
  # - resources_merged/*.yml

The rest should be commented out. Thereafter, run this command:

databricks bundle deploy --force

Wait until it finishes. You will see a new pipeline called survey123_gem_metrics_v2 in Databricks.

Gem metrics v2 pipeline

Once done, ensure you return the databricks.yml file back to its previous state, that is:

include:
  - resources/*.yml
  - resources_surveys_v2/*.yml
  - resources_surveys_v3/*.yml
  - resources_merged/*.yml
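Assuming the temporary edit to databricks.yml was never committed, an easy way to restore it is to let git put the file back:

git restore databricks.yml
# or, on older git versions:
git checkout -- databricks.yml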

  4. Run the pipeline. Use the command below to run your pipeline.

databricks bundle run survey123_gem_metrics_v2 -t dev

Alternatively, just click and run the pipeline in the Databricks UI.

If the pipeline succeeds, you are good to replicate the same in the stg workspace!

NB: If the above methods do not work, i.e. the dev pipeline failed, then request a PR to dev. When your branch is merged to dev, run the Build GX Conf file workflow with everything set to dev, the survey name set to gem_metrics, the version number set to 2, and loop set to no (if you select yes it loops over everything, which is not recommended). Then proceed with the steps below.

Run the Gx Static website workflow with everything set to dev.

Run the Dab: deploy workflow with everything set to dev.

Thereafter, proceed from step 3 onwards. Step 2 is actually supposed to ensure the great expectations are applied without having to run the above workflows.

The stg workspace

First of all, ensure that your pipeline ran successfully in dev.

  1. Request a PR from dev to stg. Once merged, proceed.

  2. Run the gx build workflow for each survey whose great expectations you changed.

Build gx conf in stg

That is, run the Build GX Conf file workflow.

  3. Run the Deploy GX Azure Static Web Apps workflow.
  • Use workflow from: Branch: stg
  • The branch to build: stg
  • The environment to deploy to: stg
  4. Run the Dab: Deploy workflow action in stg.
  • Use workflow from: branch: stg
  • The branch to build: stg
  • The environment to deploy to: stg

If the process succeeded in the dev workspace, then the next time the respective scheduled composite pipelines run automatically, they will also run successfully. But you can try running them manually in stg to see whether they succeed, by continuing from step 3 of the Apply the great expectations changes section.