Introduction
In January 2025, the data ingestion process was updated. This chapter summarises the steps to follow to ensure a smooth data ingestion process.
The dev workspace
The great expectations configuration files
Ensure that all the great expectations configuration files are in order and up to date. For example, for all configuration files that are at version 2, make sure their individual configurations sit in the gx_development_surveys_v2, tasks_surveys_v2, and resources_archived/resources_surveys_v2 folders. For the configuration files in the gx_development_surveys_v2 folder, ensure the version number reads 2 (or whatever the updated version number is), like so: version: 2. A quick way to check this across the folder is shown below.
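As a read-only sketch, assuming the gx_development_surveys_v2 folder sits under the dab directory and its files are plain YAML, the following simply lists each version line so you can confirm the numbers by eye:
# List the version declared in each great expectations configuration file.
grep -rn "^version:" gx_development_surveys_v2/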
For the configuration file of, say, gem_metrics, which was updated to version 2, the contents should match the example below. Note that a v2 suffix was added to the job key name, the task keys and the conf-file paths. Also note that the survey123_version key under the tags key is set to v2. The entire file looks as follows.
resources:
  jobs:
    survey123_gem_metrics_v2:
      name: survey123_gem_metrics_v2
      email_notifications:
        on_failure:
          - databricks-ci
      tasks:
        - task_key: survey123_gem_metrics_v2_ingestion_landing
          job_cluster_key: job_cluster_task
          python_wheel_task:
            package_name: nip_lakehouse
            entry_point: survey123_ingestion_landing
            named_parameters:
              conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks_surveys_v2/gem_metrics_pipeline_config.yml
              data-domain: survey123
              survey-id: 20644bcb6ef94b67a8158f1a810bb547
          libraries:
            - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl
        - task_key: survey123_gem_metrics_v2_landing_bronze
          job_cluster_key: job_cluster_task
          depends_on:
            - task_key: survey123_gem_metrics_v2_ingestion_landing
          python_wheel_task:
            package_name: nip_lakehouse
            entry_point: survey123_landing_bronze
            named_parameters:
              conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks_surveys_v2/gem_metrics_pipeline_config.yml
              gx-file-path: /dbfs/User/${workspace.current_user.userName}/gx/great_expectations.yml
              data-domain: survey123
              survey-id: 20644bcb6ef94b67a8158f1a810bb547
          libraries:
            - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl
      job_clusters:
        - job_cluster_key: job_cluster_task
          new_cluster: ${var.arcgis_cluster}
      tags:
        job_type: ingestion_&_validation
        survey123_group: gem
        survey123_subtype: metrics
        survey123_version: v2
The above serves as a template to follow when other forms are updated to new versions; a rough helper for stubbing a new version is sketched below.
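As a rough sketch only: if, hypothetically, gem_metrics were to move to version 3 and the v3 folders mirrored the v2 naming (resources_surveys_v3, tasks_surveys_v3), one could stub the new job config from the v2 one and then review it by hand:
# Hypothetical: copy the v2 job config as a starting point for a v3 one.
cp resources_surveys_v2/survey123_gem_metrics.yml resources_surveys_v3/survey123_gem_metrics.yml
# Bump the version suffixes in job/task names, conf-file paths and tags (GNU sed; review the diff afterwards).
sed -i 's/_v2/_v3/g; s/survey123_version: v2/survey123_version: v3/g' resources_surveys_v3/survey123_gem_metrics.yml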
The resources_merged folder contains the configuration files of the composite pipelines. When two or more pipelines are combined into a single file to run together, say survey123_gem_transect, survey123_gem_cwfl, survey123_gem_fdvg, etc., they are coalesced into a single file called survey123_gem or another appropriate name. Taking the survey123_gem file as an example, ensure that the settings for the individual forms, those in resources_surveys_v2 or any other updated version folder, correspond to those in resources_merged.
For example, in the resources_merged/survey123_gem file, the following key values exactly match those in the resources_surveys_v2 folder (a quick consistency check is sketched after the excerpt).
- task_key: survey123_gem_metrics_v2_ingestion_landing
  job_cluster_key: job_cluster_task
  python_wheel_task:
    package_name: nip_lakehouse
    entry_point: survey123_ingestion_landing
    named_parameters:
      conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks_surveys_v2/gem_metrics_pipeline_config.yml
      data-domain: survey123
      survey-id: 20644bcb6ef94b67a8158f1a810bb547
  libraries:
    - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl
- task_key: survey123_gem_metrics_v2_landing_bronze
  job_cluster_key: job_cluster_task
  depends_on:
    - task_key: survey123_gem_metrics_v2_ingestion_landing
  python_wheel_task:
    package_name: nip_lakehouse
    entry_point: survey123_landing_bronze
    named_parameters:
      conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks_surveys_v2/gem_metrics_pipeline_config.yml
      gx-file-path: /dbfs/User/${workspace.current_user.userName}/gx/great_expectations.yml
      data-domain: survey123
      survey-id: 20644bcb6ef94b67a8158f1a810bb547
  libraries:
    - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl
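To spot mismatches quickly, a rough read-only check such as the following can help; it assumes the merged file is named survey123_gem.yml and that both folders sit under the dab directory:
# The task_key, conf-file and survey-id lines of the two files should line up.
grep -nE "task_key|conf-file|survey-id" resources_surveys_v2/survey123_gem_metrics.yml
grep -nE "task_key|conf-file|survey-id" resources_merged/survey123_gem.yml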
The only caveat is to ensure that the geohash block also depends on the appropriate task key found in resources_surveys_v2. In the code below from survey123_gem, the geohash_xy_gem_metrics task depends on the survey123_gem_metrics_v2_landing_bronze task.
- task_key: geohash_xy_gem_metrics
  job_cluster_key: job_cluster_task
  depends_on:
    - task_key: survey123_gem_metrics_v2_landing_bronze
  timeout_seconds: 7200
  python_wheel_task:
    package_name: nip_lakehouse
    entry_point: geohash_xy
    named_parameters:
      conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks/geohash_xy_pipeline_config.yml
      dest_geohash_domain: geohash
      dest_geohash_folder: xy
      src_geojson_domain: geojson
      src_geojson_file: NS_LLBN_level7_grid.geojson
      survey_id: 20644bcb6ef94b67a8158f1a810bb547
      survey_abbr: gem_metrics
  libraries:
    - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl
For the ingestion_bronze_... block, simply ensure the required values correspond to the survey name. For example, below you can see that most of the values in the ingestion_bronze block are set for the gem_metrics survey.
- task_key: ingestion_bronze_gem_metrics
  job_cluster_key: job_cluster_task
  depends_on:
    - task_key: geohash_xy_gem_metrics
  timeout_seconds: 7200
  python_wheel_task:
    package_name: nip_lakehouse
    entry_point: ingestion_bronze
    named_parameters:
      conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks/geohash_xy_pipeline_config.yml
      container: geohash
      folder: xy
      subfolder: gem_metrics
  libraries:
    - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl
Finally, take the most up-to-date individual pipeline configuration for your form, in this case gem_metrics, found in the resources_surveys_v2 folder, and copy it into the resources_dev folder as well. The configuration file in the resources_dev folder should match the one found in the resources_surveys_v2 folder (or whichever updated version folder applies), as in the sketch below.
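A minimal sketch of that copy, assuming the file keeps the name survey123_gem_metrics.yml and both folders sit under the dab directory:
# Copy the v2 config into resources_dev and confirm the two files are identical.
cp resources_surveys_v2/survey123_gem_metrics.yml resources_dev/survey123_gem_metrics.yml
diff resources_surveys_v2/survey123_gem_metrics.yml resources_dev/survey123_gem_metrics.yml   # no output means they match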
Local development
Environment setup
Here we present the steps to follow to set up your local development environment (the shell commands from these steps are also collected in a short sketch after the list).
Clone the NIP-Lakehouse-Data repository and move into the dab directory in a terminal.
- Run python3 -m venv nip-dab-venv to create a new virtual environment. A virtual environment is a tool that isolates dependencies for different projects by creating separate Python environments. This ensures that your projects remain distinct from each other, even if they use different package versions, thereby minimizing conflicts.
- Run source nip-dab-venv/bin/activate to activate the virtual environment.
- Run pip install poetry to install the poetry package.
- Run poetry install --with dev to install the project's packages. Skip this if you have done it in the past, but run it again if your packages need updating.
- Install the Azure CLI, run az login, and follow the steps to log in to the Azure CLI.
- Run databricks auth login --host <url_of_the_dev_databricks_workspace> and follow the requested actions to log in to the Databricks CLI. For the ns-ii-tech-dev-workspace, which is the starting point of our work, the command will be databricks auth login --host https://adb-5442438122618419.19.azuredatabricks.net/.
- Duplicate the .env.example file and rename it to .env. This file contains the environment variables and some secrets used by the project. Ask your supervisor for the values of the different variables. When working across different workspaces, paste the dev environment configuration into the .env file when working in the dev workspace, and paste the stg environment variables when working in the stg environment. It is convenient to keep the environment variables of both workspaces in separate files, such as .env-dev and .env-stg. NB: This step is not strictly necessary, since the databricks auth login does everything for us.
- If you are working in the VS Code editor, duplicate the .vscode.example folder and rename it to .vscode.
- Install the VS Code Databricks extension and connect your Databricks account. This extension will be used to sync your files to the Databricks workspace.
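For reference, the commands from the steps above gathered in one place; this is only a recap of the list, run from the dab directory, using the dev workspace URL given above:
# Local environment setup (recap of the steps above).
python3 -m venv nip-dab-venv            # create the virtual environment
source nip-dab-venv/bin/activate        # activate it
pip install poetry                      # install poetry
poetry install --with dev               # install the project's packages
az login                                # log in to Azure
databricks auth login --host https://adb-5442438122618419.19.azuredatabricks.net/   # dev workspace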
Apply the great expectations changes
- Click on the Databricks extension and, under Remote Folder, ensure it is syncing to this workspace path: /Workspace/Users/<personal-organisation-email>/.bundle/nip_lakehouse/dev/files. The files will be synced to a .bundle folder.
- Sync the files to the dbfs folder by running source sync.sh dev.
- Run the gx_deploy_yml notebook in Databricks. The gx_deploy_yml notebook referred to here is the one already synced by the Databricks extension; it is found at /Workspace/Users/<organisation-email>/.bundle/nip_lakehouse/dev/files/development/gx_deploy_yml. Ensure that in the last cells all bars are green, which means all the great expectations were applied.
To check whether the new great expectations have been applied, in case significant changes were made to the form, go to the great-expectations/json_files/expectations path under the nsiitechdevadlset > Containers section in Azure. For example, for gem_metrics you can see that the json for this form was updated to version 2.
If you made some gx changes, such as adding more options or relaxing some, you can click on the json, in this case gem_metrics_v2_sublayer_survey.json, and a new interface will show up. The Edit menu shows the current expectations for this form. A command-line alternative for fetching the json is sketched below.
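If you prefer the command line to the Azure portal, a rough equivalent looks like the sketch below; it assumes great-expectations is the container name shown in that path and that your account has read access with Azure AD authentication:
# Download the updated expectations json from the dev storage account and inspect it locally.
az storage blob download \
    --account-name nsiitechdevadlset \
    --container-name great-expectations \
    --name json_files/expectations/gem_metrics_v2_sublayer_survey.json \
    --file gem_metrics_v2_sublayer_survey.json \
    --auth-mode login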
- Deploy the job of the survey, in this case survey123_gem_metrics_v2, into Databricks. To do so, go to the databricks.yml file. Under the include key, ensure that only the up-to-date pipeline configuration for the individual form is left uncommented. In this case it is our resources_surveys_v2/survey123_gem_metrics.yml file.
include:
  # - resources/*.yml
  - resources_surveys_v2/survey123_gem_metrics.yml
  # - resources_surveys_v3/*.yml
  # - resources_merged/*.yml
The rest should be commented out. Thereafter, run this command:
databricks bundle deploy --force
Wait until it finishes. You will see a new pipeline called survey123_gem_metrics_v2 created in Databricks.
Once done, ensure you return the databricks.yml
file back to its previous state, that is:
include:
  - resources/*.yml
  - resources_surveys_v2/*.yml
  - resources_surveys_v3/*.yml
  - resources_merged/*.yml
- Run the pipeline. Use the command below to run it:
databricks bundle run survey123_gem_metrics_v2 -t dev
Alternatively, just click and run the pipeline in Databricks. A short recap of the deploy-and-run sequence follows below.
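Putting the deploy and run steps together, a typical sequence from the dab directory might look like this; the validate line is an optional sanity check not mentioned above:
# Deploy the bundle and trigger the job on the dev target (recap of the steps above).
databricks bundle validate -t dev                      # optional: check the bundle configuration first
databricks bundle deploy --force
databricks bundle run survey123_gem_metrics_v2 -t dev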
If the pipeline succeeds, you are good to replicate the same in the stg workspace!
NB: If the above methods do not work, i.e. the dev pipeline failed, then request a PR to dev. When your branch is merged to dev, run the Build GX Conf file workflow with everything set to dev, the name of the survey as gem_metrics, the version number as 2, and loop as no (if you select yes it loops over everything, which is not recommended). Then proceed with the steps below.
Run the Gx Static website workflow with everything set to dev.
Run the Dab: deploy workflow with everything set to dev.
Thereafter proceed from step 3 onwards. Step 2 is actually supposed to ensure the great expectations are applied without having to run the above workflows. A command-line sketch for triggering these workflows follows.
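If these are GitHub Actions workflows triggered from the Actions tab, a command-line equivalent could look roughly like the sketch below; the workflow display names are taken from above, but the input field names (survey, version, loop) are hypothetical and need to be checked against the actual workflow definitions:
# Hypothetical: trigger the workflows from the dev branch via the GitHub CLI.
gh workflow run "Build GX Conf file" --ref dev -f survey=gem_metrics -f version=2 -f loop=no
gh workflow run "Gx Static website" --ref dev
gh workflow run "Dab: deploy" --ref dev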
The stg workspace
First of all, ensure that your pipeline ran successfully in dev.
- Request a PR from dev to stg. Once merged, proceed.
- Run the gx build workflow for the different surveys whose great expectations you changed; that is, run the Build GX Conf file workflow.
- Run the Deploy GX Azure Static Web Apps workflow.
  - Use workflow from: Branch: stg
  - The branch to build: stg
  - The environment to deploy to: stg
- Run the Dab: Deploy workflow action in stage.
  - Use workflow from: branch: stg
  - The branch to build: stg
  - The environment to deploy to: stg
If the process succeeded in the dev workspace, then the next time the respective automated composite scheduled pipelines run, they will also run successfully. You can also try running them manually in stg to see whether they succeed, by continuing from step 3 of the Apply the great expectations changes section.