Data ingestion in Staging

Data ingestion in the staging environment (stg) is similar to that in the development (dev) environment.

The process begins by ensuring that all your changes in dev have been pushed to the stg environment. The stg environment should mirror dev: any change present in stg but not in dev will cause several back-and-forth procedures and waste time.

Prerequisites

Your work, inclusive of all the gx validation YAML files, should be in the dev branch. In GitHub, create a pull request to merge the dev branch into the stg branch. Once the merge is complete, create a new branch that pulls from the stg branch.

For example, in this case, we create a new branch called NIP-1439 that will contain all the up-to-date changes from stg.
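As a rough sketch, assuming your Git remote is named origin and the dev-to-stg merge has already completed, the new branch can be created locally like so:

```bash
# Fetch the latest remote branches, including the freshly merged stg
git fetch origin

# Create a working branch (here NIP-1439) from the up-to-date stg branch
git checkout -b NIP-1439 origin/stg

# Publish the branch to GitHub so it can be opened in Databricks later
git push -u origin NIP-1439
```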

Ensure the following are all set.

i. Your virtual environment is already running - to create a virtual environment, run python3 -m venv nip-dab-venv, then activate it using source nip-dab-venv/bin/activate.

ii. An environment variables file.

A virtual environment is a tool that keeps the dependencies required by different projects separate by creating an isolated Python space for each. This ensures that your projects remain distinct from one another even if they use different package versions, thus minimizing conflicts.
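For reference, the virtual environment commands from step i, run from the project folder, are:

```bash
# Create the virtual environment (one-off)
python3 -m venv nip-dab-venv

# Activate it for the current terminal session
source nip-dab-venv/bin/activate
```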

The environment variables are found in the ~NIP-Lakehouse-Data/vars/.env.example path. These will be provided by your supervisor. Source the environment variables file using the command below:

source /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/vars/.env.example_stg

The .env.example_stg file is the environment variables file for the staging environment.

Thereafter run source sync.sh stg.
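Putting the two steps together (the path to the environment variables file will differ on your machine):

```bash
# Load the staging environment variables (file provided by your supervisor)
source /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/vars/.env.example_stg

# Sync your local changes to the staging environment
source sync.sh stg
```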

Sync to staging

The sync.sh file contains the code that will integrate your files into the target environment (stg in this case). Running this ensures that all the changes you have made to your files will land in the Databricks File System (DBFS). As of November 2024, we dropped the Databricks extension for syncing with remote environments, whether dev or stg.

Github

Once your remote branch has been merged to stg, go to the Github Actions menu. GitHub Actions is an automation tool that allows developers to build, test, and deploy their code directly from GitHub.

Under Github Actions, select Dab: Deploy as shown in the figure below.

Running Github Actions in staging

Click the down arrow for Run Workflow and ensure that:

i. Branch is set to stg or whichever branch you were merging into.

ii. The branch to build is stg.

iii. The environment to deploy to is stg.

Thereafter select Run Workflow.

A green tick next to the action indicates it was successful.
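If you prefer the command line, the same workflow can also be triggered with the GitHub CLI. This is only a sketch: the input names branch_to_build and deploy_environment are placeholders, so check the workflow's workflow_dispatch inputs for the real names.

```bash
# Hypothetical example: trigger the "Dab: Deploy" workflow against the stg branch.
# The -f input names below are placeholders, not the workflow's actual input names.
gh workflow run "Dab: Deploy" --ref stg \
  -f branch_to_build=stg \
  -f deploy_environment=stg

# Follow the progress of a run interactively
gh run watch
```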

Databricks

Log in to your Azure portal.

As mentioned earlier, the syncing process no longer requires the Databricks extension. Therefore, you can run the gx_deploy_yml file from your GitHub branch rather than from the .ide folder in Azure Databricks.

After starting your personal compute, go to Workspace. Assuming your GitHub account is properly connected to Azure Databricks, your repo should appear either under *Workspace>Repos>* or *Workspace>Users>*.

Repos in staging

To switch to the branch that you pushed to GitHub, and which contains the up-to-date stg branch contents, click on the button with a branch name, as shown below.

Repos in staging

If you click on the button with a branch name, a new interface will appear with two options. The first option, and the first thing you should do, is to pull your branch. The other option is to push, but this should only happen if you have made changes in Databricks; you can push to the same branch or to a different one. You can also create a new branch, but this is rarely recommended when doing data ingestion operations.

Repos in staging

Once you’ve pulled the latest changes from your branch, go to this path in your workspace: /Workspace/Users/sgachuhi@naturalstate.org/NIP-Lakehouse-Data/dab/development/gx_deploy_yml.

NB This is not the path inside your .ide folder in Databricks. Rather, it is the path to the gx_deploy_yml file in your branch NIP-1439.

Path to the gx_deploy_yml file in stg

This is the file that will load the Great Expectations YAML files that you created. The last cell contains the paths to your Great Expectations .yaml files.

Just a few things to consider:

i. Ensure that the user_name variable corresponds to your email.

ii. Ensure that the is_dev variable is set to False.

iii. Make sure that survey_abbr matches the abbreviation of the form you want to ingest into bronze. For example, if you have created the yml files for a form abbreviated as dvc_register, the survey_abbr value will be dvc_register.

Once you are satisfied every value is okay, click Run all at the top. This should run all the cells in the notebook.

If there is no issue with your yml files, the last cell should display a list of progress bars, all of which should show 100%. This means that your data ingestion into bronze worked perfectly.

In addition, you should also run the correct version of the form. Sometimes, due to various updates on either the Databricks side or the developer's side, it is necessary to update the form's version number to 2, and so on. See the versioning chapter on this. The image below shows how to run a different version of the form, in this case dvc_register.

Device register in staging

Version 2 of dvc_register

Manual Job Run

In Azure Databricks, a job is a series of steps that are run sequentially. In very simple terms, a job is the smallest unit of work that can be scheduled to run.

Go to Workflows in your Azure account. Ensure that you are within the ns-ii-tech-stg-db-workspace workspace. Workflows is the tab where your data processing, machine learning and analytics pipelines are orchestrated within the Databricks platform.

Within the Workflows interface, select Jobs. Search for the form abbreviation of interest. Select the Run button at the very end, next to the ellipsis.

Workflows in staging

Running jobs can take a while.

If you click on the job name, two tabs will appear: Runs and Tasks.

Under the Runs tab, the success or failure of your job run(s) is displayed.

Job runs in staging

The Tasks tab displays the tasks that make up your job. Think of these tasks as the runnable units of your job. Click on any task and you will receive some metadata about the task.

The Tasks tab of your job

Go back to your Runs tab and select a particular Job Run under the Start Time column.

Two tabs will appear at the top: the Graph and Timeline tabs.

The Graph tab shows the status of the tasks, that is, whether they succeeded or failed.

The Timeline tab shows how long each task took to run and whether it was successful, i.e. red indicates failure and green indicates success.

The Graph and Timeline of your Jobs

Using the above image as an example, we can see that our job didn’t succeed because the first task ran into an error. Clicking on this task will provide more information on the error.

Error in a Job run

It is highly recommended that you explore the Workflows interface and learn what each tool does, as this will be very helpful during debugging.

The job run is an automated pipeline that sequentially executes the tasks of 1) landing Survey123 data into Azure and 2) moving the landed Survey123 data to the bronze stage, without you running any notebook.
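If you ever need to trigger the same job from a terminal instead of the Workflows UI, a minimal sketch using the legacy Databricks CLI looks like this, assuming the CLI is installed and configured with a token for the stg workspace:

```bash
# Find the job for the form of interest and note its job ID
databricks jobs list | grep dvc_register

# Trigger a run of that job (replace 123 with the real job ID)
databricks jobs run-now --job-id 123
```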

If your job run succeeds, your Survey123 data will appear under the bronze folder in the ns-ii-tech-stg-catalog. You can see all of your data in the bronze stage by going to Catalog>ns-ii-tech-stg-catalog>Bronze. The data is broken down into the sublayers and subtables that make up your form, as seen in ArcGIS Online. Ideally, all tables in Catalog>ns-ii-tech-dev-catalog>Bronze should also appear in Catalog>ns-ii-tech-stg-catalog>Bronze.