Introduction
In January 2025, the data ingestion process was updated. This chapter summarises the steps to follow to ensure a smooth data ingestion process.
The `dev` workspace
The great expectations configuration files
Ensure that all the great expectations configuration files are in proper order and up to date.
For example, for all configuration files that are at version 2, ensure that their individual configurations are in the `gx_development_surveys_v2`, `tasks_surveys_v2`, and `resources_archived/resources_surveys_v2` folders. For the configuration files in the `gx_development_surveys_v2` folder, ensure the version number reads as `2`, or any other updated version number, like so: `version: 2`.
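As a quick sanity check, assuming the v2 files for `gem_metrics` follow the layout sketched below (the exact file names are illustrative, not prescriptive), you can list any file in the v2 gx folder that still lacks the bumped version number:

```bash
# Illustrative layout for a form bumped to version 2 (file names are assumptions):
#   gx_development_surveys_v2/gem_metrics_v2_great_expectations.yml   -> must contain "version: 2"
#   tasks_surveys_v2/gem_metrics_pipeline_config.yml
#   resources_archived/resources_surveys_v2/survey123_gem_metrics.yml
# List any file in the v2 gx folder that does not yet declare version 2:
grep -rL "version: 2" gx_development_surveys_v2/
```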
For the configuration file of, say, `gem_metrics`, which was updated to version 2, the contents should match the example below. Note that a `v2` suffix was added to the job key name, the task keys, and the `conf-file` paths. Also note that the `survey123_version` key under the `tags` key is set to `v2`. The entire file looks as below.
```yaml
resources:
  jobs:
    survey123_gem_metrics_v2:
      name: survey123_gem_metrics_v2
      email_notifications:
        on_failure:
          - databricks-ci
      tasks:
        - task_key: survey123_gem_metrics_v2_ingestion_landing
          job_cluster_key: job_cluster_task
          python_wheel_task:
            package_name: nip_lakehouse
            entry_point: survey123_ingestion_landing
            named_parameters:
              conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks_surveys_v2/gem_metrics_pipeline_config.yml
              data-domain: survey123
              survey-id: 20644bcb6ef94b67a8158f1a810bb547
          libraries:
            - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl
        - task_key: survey123_gem_metrics_v2_landing_bronze
          job_cluster_key: job_cluster_task
          depends_on:
            - task_key: survey123_gem_metrics_v2_ingestion_landing
          python_wheel_task:
            package_name: nip_lakehouse
            entry_point: survey123_landing_bronze
            named_parameters:
              conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks_surveys_v2/gem_metrics_pipeline_config.yml
              gx-file-path: /dbfs/User/${workspace.current_user.userName}/gx/great_expectations.yml
              data-domain: survey123
              survey-id: 20644bcb6ef94b67a8158f1a810bb547
          libraries:
            - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl
      job_clusters:
        - job_cluster_key: job_cluster_task
          new_cluster: ${var.arcgis_cluster}
      tags:
        job_type: ingestion_&_validation
        survey123_group: gem
        survey123_subtype: metrics
        survey123_version: v2
```
The above example serves as a template to follow when faced with updated versions of other forms.
The `resources_merged` folder contains the configuration files of the composite pipelines. When two or more pipelines are combined to run together, say `survey123_gem_transect`, `survey123_gem_cwfl`, `survey123_gem_fdvg`, etc., they are coalesced into a single file called `survey123_gem`, or whatever other name is appropriate. Taking the `survey123_gem` file as an example, ensure that the settings for the individual forms, those in `resources_surveys_v2` or any other updated version folder, correspond to those in `resources_merged`.
For example, in the `resources_merged/survey123_gem` file, the following key values exactly match those in the `resources_surveys_v2` folder.
```yaml
- task_key: survey123_gem_metrics_v2_ingestion_landing
  job_cluster_key: job_cluster_task
  python_wheel_task:
    package_name: nip_lakehouse
    entry_point: survey123_ingestion_landing
    named_parameters:
      conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks_surveys_v2/gem_metrics_pipeline_config.yml
      data-domain: survey123
      survey-id: 20644bcb6ef94b67a8158f1a810bb547
  libraries:
    - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl
- task_key: survey123_gem_metrics_v2_landing_bronze
  job_cluster_key: job_cluster_task
  depends_on:
    - task_key: survey123_gem_metrics_v2_ingestion_landing
  python_wheel_task:
    package_name: nip_lakehouse
    entry_point: survey123_landing_bronze
    named_parameters:
      conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks_surveys_v2/gem_metrics_pipeline_config.yml
      gx-file-path: /dbfs/User/${workspace.current_user.userName}/gx/great_expectations.yml
      data-domain: survey123
      survey-id: 20644bcb6ef94b67a8158f1a810bb547
  libraries:
    - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl
```
The only caveat is to ensure that the geohash block depends on the appropriate task key, also found in `resources_surveys_v2`. In the code below for `survey123_gem`, the `geohash_xy_gem_metrics` task depends on the `survey123_gem_metrics_v2_landing_bronze` task.
```yaml
- task_key: geohash_xy_gem_metrics
  job_cluster_key: job_cluster_task
  depends_on:
    - task_key: survey123_gem_metrics_v2_landing_bronze
  timeout_seconds: 7200
  python_wheel_task:
    package_name: nip_lakehouse
    entry_point: geohash_xy
    named_parameters:
      conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks/geohash_xy_pipeline_config.yml
      dest_geohash_domain: geohash
      dest_geohash_folder: xy
      src_geojson_domain: geojson
      src_geojson_file: NS_LLBN_level7_grid.geojson
      survey_id: 20644bcb6ef94b67a8158f1a810bb547
      survey_abbr: gem_metrics
  libraries:
    - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl
```
For the `ingestion_bronze_...` block, just ensure the required values correspond to the survey name. For example, below you can see that most of the values in the `ingestion_bronze` block are set for the `gem_metrics` survey.
```yaml
- task_key: ingestion_bronze_gem_metrics
  job_cluster_key: job_cluster_task
  depends_on:
    - task_key: geohash_xy_gem_metrics
  timeout_seconds: 7200
  python_wheel_task:
    package_name: nip_lakehouse
    entry_point: ingestion_bronze
    named_parameters:
      conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks/geohash_xy_pipeline_config.yml
      container: geohash
      folder: xy
      subfolder: gem_metrics
  libraries:
    - whl: /Volumes/${var.resource_prefix}-${bundle.target}-catalog/etl/pipelines/py_packages/nip_lakehouse/nip_lakehouse-0.1.0-py3-none-any.whl
```
Finally, take the most up-to-date individual pipeline configuration for your form, in this case `gem_metrics`, found in the `resources_surveys_v2` folder, and copy this file into the `resources_dev` folder as well. The configuration file in the `resources_dev` folder should match the one found in the `resources_surveys_v2` folder, or in any other updated version folder.
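A minimal sketch of that copy step (the file name is taken from the example used in this chapter; check your checkout for the real one):

```bash
# Copy the up-to-date v2 resource file into resources_dev and confirm both copies match.
cp resources_surveys_v2/survey123_gem_metrics.yml resources_dev/
diff resources_surveys_v2/survey123_gem_metrics.yml resources_dev/survey123_gem_metrics.yml
```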
Local development
Environment setup
Here we present the steps to follow to set up your local development environment.
Clone the NIP-Lakehouse-Data repository and move into the `dab` directory in a terminal.
- Run `python3 -m venv nip-dab-venv` to create a new virtual environment. A virtual environment is a tool that isolates dependencies for different projects by creating separate Python environments. This ensures that your projects remain distinct from each other, even if they use different package versions, thereby minimizing conflicts.
- Run `source nip-dab-venv/bin/activate` to activate the virtual environment.
- Run `pip install poetry` to install the Poetry package.
- Run `poetry install --with dev` to install the project's packages. Skip this if you have done it in the past, but run it again if your packages need updating.
- Install the Azure CLI, run `az login`, and follow the steps to log in to the Azure CLI.
- Run `databricks auth login --host <url_of_the_dev_databricks_workspace>` and follow the requested actions to log in to the Databricks CLI. For the `ns-ii-tech-dev-workspace`, which is the starting point of our work, the command will be `databricks auth login --host https://adb-5442438122618419.19.azuredatabricks.net/`.
- Duplicate the `.env.example` file and rename it as `.env`. This file contains the environment variables and some secrets used by the project. Ask your supervisor for the values of the different variables. When working across different workspaces, if working in the `dev` workspace, paste the `dev` environment configuration into the `.env` file; when working in the `stg` environment, paste the `stg` environment variables there. It is a good idea to keep the environment variables of both workspaces in different `.env` files, such as `.env-dev` and `.env-stg`. NB: It is not strictly necessary to run this step, since `databricks auth login` does everything for us.
- If you are working in the VS Code editor, duplicate the `.vscode.example` folder and rename it as `.vscode`.
- Install the [VS Code Databricks extension] and connect your Databricks account. This extension will be used to sync your files to the Databricks workspace.
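For reference, here is a minimal sketch of the same setup condensed into one terminal session. The repository URL and local path are placeholders, and the `.env`/`.vscode` steps are optional as noted above:

```bash
# Condensed version of the setup steps above (repository URL and local path are placeholders).
git clone <url-of-the-NIP-Lakehouse-Data-repository>
cd NIP-Lakehouse-Data/dab

python3 -m venv nip-dab-venv          # isolated Python environment for this project
source nip-dab-venv/bin/activate
pip install poetry
poetry install --with dev             # project packages, including dev dependencies

az login                              # Azure CLI authentication
databricks auth login --host https://adb-5442438122618419.19.azuredatabricks.net/   # dev workspace

cp .env.example .env                  # optional: fill in values from your supervisor
cp -r .vscode.example .vscode         # only if you work in VS Code
```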
Apply the great expectations changes
- Click on the Databricks extension; under Remote Folder, ensure it is syncing to this workspace: `/Workspace/Users/<personal-organisation-email>/.bundle/nip_lakehouse/dev/files`. The files will be synced to a `.bundle` folder.
- Sync the files to the `dbfs` folder. Run `source sync.sh dev`.
. -
Run the
gx_deploy_yml
notebook in Databricks. Thegx_deploy_yml
notebook being referred to is the one already synced by the databricks extension. It is found in this path:/Workspace/Users/<organisation-email>/.bundle/nip_lakehouse/dev/files/development/gx_deploy_yml
. Ensure at the last cells all bars are green. This means all the great expectations were applied!
To check whether the new great expectations have been applied, in case significant changes were made to the form, proceed to the `great-expectations / json_files / expectations` path under the `nsiitechdevadlset > Containers` section in Azure. For example, for `gem_metrics` you can see that the JSON for this file was updated to version 2.
In case you made some gx changes, such as adding more expectations or relaxing some, you can click on the JSON, in this case `gem_metrics_v2_sublayer_survey.json`, and a new interface will show up. The Edit menu shows the current expectations for this form.
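If you prefer the terminal to the Azure portal, something like the sketch below can list and download the expectation JSONs. The container name, blob prefix, and required permissions (for example a Storage Blob Data Reader role) are assumptions derived from the portal path above; verify them before use:

```bash
# Assumed storage account, container, and prefix, derived from the portal path above.
az storage blob list \
  --account-name nsiitechdevadlset \
  --container-name great-expectations \
  --prefix json_files/expectations/ \
  --auth-mode login -o table

az storage blob download \
  --account-name nsiitechdevadlset \
  --container-name great-expectations \
  --name json_files/expectations/gem_metrics_v2_sublayer_survey.json \
  --file gem_metrics_v2_sublayer_survey.json \
  --auth-mode login
```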
- Deploy the job of the survey, in this case `survey123_gem_metrics_v2`, into Databricks. This is how to do it: go to the `databricks.yml` file and, under the `include` key, ensure that only the up-to-date pipeline configuration for the individual form is left uncommented. In this case it is our `resources_surveys_v2/survey123_gem_metrics.yml` file.
```yaml
include:
  # - resources/*.yml
  - resources_surveys_v2/survey123_gem_metrics.yml
  # - resources_surveys_v3/*.yml
  # - resources_merged/*.yml
```
The rest should be commented out. Thereafter, run this command: `databricks bundle deploy --force`.
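Before running the deploy command above, you can optionally validate the bundle configuration to catch mistakes early; a minimal sketch, where the `dev` target name is an assumption (check the `targets` section of `databricks.yml`):

```bash
# Optional sanity check before deploying; the "dev" target name is an assumption.
databricks bundle validate -t dev
```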
Wait until it finishes. You will see a new pipeline called `survey123_gem_metrics_v2` in Databricks.
Once done, ensure you return the `databricks.yml` file to its previous state, that is:
```yaml
include:
  - resources/*.yml
  - resources_surveys_v2/*.yml
  - resources_surveys_v3/*.yml
  - resources_merged/*.yml
```
- Run the pipeline. Use the command below to run your pipeline:

```bash
databricks bundle run survey123_gem_metrics_v2 -t dev
```

Alternatively, just click and run the pipeline in Databricks.
If the pipeline succeeds, open a Pull Request (PR) into dev, and once it is approved and merged, run the `Dab: deploy` workflow with everything set to dev. The following should be your `Dab: deploy` parameters:

- Use workflow from: Branch: dev
- The branch to build: dev
- The environment to deploy to: dev

Running `Dab: deploy` will ensure the new pipelines are persisted under the creator name `ns-ii-tech-dev-databricks-ci`.
You can now go to Workflows and see that the new pipeline's Created by value is `ns-ii-tech-dev-databricks-ci`.
This means the pipeline is now persisted permanently.
If the pipeline is now persisted, you are ready to replicate the same in the `stg` workspace!
NB: If the above methods do not work, i.e. the `dev` pipeline failed, then open a PR to dev. When your branch is merged to `dev`, run the `Build GX Conf file` workflow with everything set to dev, the name of the survey as `gem_metrics`, the version number as `2`, and loop as `no` (if you select yes it loops over everything, which is not recommended). Then proceed with the steps below.
Run the `Gx Static website` workflow with everything set to `dev`.
Run the `Dab: deploy` workflow with everything set to `dev`.
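If you prefer the GitHub CLI to the web UI, the same three workflows can be triggered from a terminal. The workflow identifiers and input names below are assumptions; check the files under `.github/workflows/` for the real names before using them:

```bash
# Assumed workflow names and input keys - verify against .github/workflows/ first.
gh workflow run "Build GX Conf file" --ref dev -f survey=gem_metrics -f version=2 -f loopall=no
gh workflow run "Gx Static website" --ref dev
gh workflow run "Dab: deploy" --ref dev
gh run watch   # pick a run to follow from the terminal
```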
Thereafter proceed with step 3 onwards. Step 2 is actually supposed to ensure the great expectations are applied without having to run the above workflows.
The `stg` workspace
First of all, ensure that your pipeline ran successfully in `dev`.
- Request a PR of `dev` to `stg`. Once merged, proceed.
- Run the gx build workflow for the different surveys for which you made changes to the great expectations.
Then run the `Build GX Conf file` workflow. The purpose of this step is to persist the expectations used to validate data. When we validate our data against the expectations we have set, gx checks whether those expectations match those stored in our Azure directory within the `nsiitechdevadlset` container.
When we run this step, we ensure that the expectations in the `nsiitechstgadlset` container are updated with the latest changes.
- Run the `Dab: Deploy` workflow action in staging (`stg`).
- Use workflow from: branch: stg
- The branch to build: stg
- The environment to deploy to: stg
This step is crucial for creating or updating the parameters of a particular job: running it ensures that the job's parameters are updated in Databricks. When you create a new job, running `Dab: deploy` will ensure that the new job also appears in Databricks.
- Run the Deploy GX Azure Static Web Apps workflow.
- Use workflow from: Branch: stg
- The branch to build: stg
- The environment to deploy to: stg
You may have relaxed or added new expectations. Running this step ensures that when a pipeline is run, regardless of the success or failure of the pipeline, the status of the defined expectations (success or failure) will be shown on the gx website.
If the process succeeded in the `dev` workspace, then the next time the respective automated scheduled composite pipelines run, they will also run successfully. But you can try running them manually in `stg` to see whether they succeed, by continuing from Step 3 of the Apply the great expectations changes section.
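A hedged sketch of such a manual run from the terminal; the `stg` target name and the merged job name are assumptions, so check `databricks.yml` and the `resources_merged` folder for the real identifiers:

```bash
# Assumed target and job names - verify against databricks.yml and resources_merged/.
databricks auth login --host <url_of_the_stg_databricks_workspace>
databricks bundle run survey123_gem -t stg
```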
The `prd` workspace
Once data ingestion has been successful in the `ns-ii-tech-stg-db-workspace`, it is time to perform the same process in the production workspace. The production workspace is `nip-prd-db-workspace`.
Follow the steps below.
Step 1: Perform a Pull Request of `stg` into the `main` branch
Once your pipeline(s) in staging (`stg`) have run successfully and you are satisfied with the results, it's time to push the changes into the `main` branch. The production (`prd`) environment uses the files in the `main` branch.
Ask someone within Natural State to review your code before merging with `main`. If there are no problems and your merge has been performed, it is time to proceed to step 2.
Step 2: Update the great expectations variables in main
The great expectations options and variables in the `main` branch are updated by running the `GX: Build gx conf files` GitHub action. Do so by setting your GitHub action parameters as below:
In other words, you are using:
- the branch to use the workflow from: `main`.
- You are building to branch `main` (again).
- The environment being deployed to is `prd`.
- `loopall` - this asks whether you would like to update the changes in all surveys. Ideally this should be `no`, since selecting `yes` loops through all the numerous surveys, which is really time-consuming! It is better to go with one survey at a time, and for this reason the best answer is `no`.
- `survey` - the name of the survey for which to update the great expectations configurations.
- `version` - the version of the survey to update. This is easily retrieved by checking the name of the great expectations file of the survey you want to update. If it has a `v2` in the name, such as `cpp_small_tree_v2_great_expectations.yml`, then this is version `2`. If it has a `v3`, such as `lab_bulk_density_v3_great_expectations.yml`, then the value `3` goes into the `version` field.
Press `Run workflow` to initiate the process of updating the gx configurations. You will have to do this for every form that was updated in `stg`.
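As in the `dev` section, this can also be triggered from the GitHub CLI; the workflow identifier and input names are assumptions to verify against `.github/workflows/`:

```bash
# Assumed workflow and input names - verify before use.
gh workflow run "GX: Build gx conf files" --ref main -f survey=cpp_small_tree -f version=2 -f loopall=no
```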
To check whether your Gx configurations workflow is running, proceed to the `Job runs` tab in the `Workflows` menu in Databricks. Ensure you are in the `nip-prd-db-workspace`!
Your Gx workflows should show as `Succeeded` in the Status column. You can check the name of the workflow in the Run parameters column; it provides not only the name of the survey but also the values of other parameters such as `loopall` and `version`.
Step 3: Deploy the pipeline
Once you are satisfied that the Gx Configurations for all the respective surveys and their appropriate versions have been updated, it’s time to update the pipeline with these changes.
Run the `Dab: Deploy` GitHub action with the following parameters. This is what this GitHub action does:
- It uses the parameters in the `main` branch to perform the workflow.
- It builds on the `main` branch.
- It deploys the updates to the `prd` environment.
Once you've finished with the above three steps, it's time to run the pipeline in the `nip-prd-db-workspace`. In this case, our pipeline of interest is the `survey123_cpp` pipeline, since we updated the configurations for `survey123_cpp_small_tree_v2_great_expectations.yml`. It may be a different pipeline in your case, but it is imperative to ensure the pipeline runs successfully!
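One hedged way to trigger that run from the terminal; the `prd` target name is an assumption, so check the `targets` section of `databricks.yml` first:

```bash
# Assumed "prd" target name - verify against databricks.yml.
databricks auth login --host <url_of_the_prd_databricks_workspace>
databricks bundle run survey123_cpp -t prd
```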
At the time of writing, the GX Static website for `prd` was yet to be released. However, if an error occurs in `prd`, run the GX Static website GitHub action and check on the GX static website whether some validations are failing in `prd`.