6. The New Data Ingestion Process

This process lands the Survey123 data in Azure storage, validating it with the Great Expectations suites and running it through the Databricks jobs we created.

This new procedure is composed of the following parts:

  • Local development
  • Deployment

Local development

Environment setup

Here are the steps to follow to set up your local development environment; a consolidated command summary follows the list.

Clone the NIP-Lakehouse-Data repository and, in a terminal, move into the dab directory.

  1. Run python3 -m venv nip-dab-venv to create a new virtual environment. A virtual environment is a tool that isolates dependencies for different projects by creating separate Python environments. This ensures that your projects remain distinct from each other, even if they use different package versions, thereby minimizing conflicts.

  2. Run source nip-dab-venv/bin/activate to activate the virtual environment.

  3. Run pip install poetry to install the poetry package.

  4. Run poetry install --with dev to install the project’s packages.

  5. Install the Azure CLI, then run az login and follow the steps to log in to the Azure CLI.

  6. Run databricks auth login --host <url_of_the_dev_databricks_workspace> and follow the requested actions to log in to the Databricks CLI.

  7. Duplicate the .env.example file and rename it as .env. This file contains the environment variables and some secrets used by the project. Ask your supervisor for the values to use for the different variables.

  8. If you are working in the VS Code editor, duplicate the .vscode.example folder and rename it as .vscode.

  9. Install the [VS Code Databricks extension] and connect your Databricks account. This extension will be used to sync your files to the Databricks workspace.
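
For reference, the commands above are summarized below as a single shell session. The repository URL and the Databricks workspace URL are placeholders; use the values provided for your project.

    # Clone the repository and move into the dab directory
    git clone <url_of_the_NIP-Lakehouse-Data_repo>
    cd NIP-Lakehouse-Data/dab

    # Create and activate an isolated Python environment
    python3 -m venv nip-dab-venv
    source nip-dab-venv/bin/activate

    # Install Poetry and the project's dependencies, including the dev group
    pip install poetry
    poetry install --with dev

    # Authenticate against Azure and the dev Databricks workspace
    az login
    databricks auth login --host <url_of_the_dev_databricks_workspace>

    # Create your local configuration files from the provided examples
    cp .env.example .env            # fill in the values given by your supervisor
    cp -r .vscode.example .vscode   # only if you work in VS Code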

Testing data ingestion locally

Assuming that you have correctly created all the expectation suites for the surveys and the Databricks jobs configuration files, you can test whether the data ingestion process works with the following steps.

Apply the expectation changes

If you made changes to the expectation suites of a given survey, this step ensures that the new version of the expectation suite will be used for data validation during the next ingestion.

  1. After updating the Great Expectations YAML file for the survey you are working on, run the ./sync.sh command to upload the changes to Databricks DBFS.

  2. Click on the synchronisation button in the VS Code Databricks extension to upload the changes to your Databricks workspace. Make sure that the selected target is dev.

Databricks sync

You can click the button to the right of the sync button to open the folder to which the files are uploaded in Databricks.

In this folder, open the Great Expectations YAML file of the survey you are working on and make sure that the changes you made are reflected there.
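
If you prefer the terminal to the Databricks UI for this check, the synced file can also be inspected with the Databricks CLI. The DBFS path and file name below are only placeholders; check the sync.sh script or ask your supervisor for the actual location.

    # List the synced Great Expectations files (placeholder path)
    databricks fs ls dbfs:/<dbfs_path_used_by_sync.sh>

    # Print the synced expectation suite to confirm your changes are there
    databricks fs cat dbfs:/<dbfs_path_used_by_sync.sh>/<survey_abbr>_expectations.yml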

  3. Navigate into the development folder and open the gx_deploy_ml notebook. Running this notebook will apply the Great Expectations changes you made. Make the following changes in the notebook:
  • Ensure that the user_name variable corresponds to your email.

  • Ensure that the is_dev variable is set to False.

  • Make sure that survey_abbr matches the abbreviation of the form you want to ingest into bronze. For example, if you created expectations for a survey abbreviated as xprize_sens_reg, the survey_abbr value will be xprize_sens_reg.

Once you are satisfied that every value is correct, click Run all at the top. This should run all the cells in the notebook.

  4. Go to the NIP-Lakehouse-Data repo on GitHub and run the Deploy GX Azure Static Web Apps workflow, setting all parameters to dev, so that the changes you made are reflected on the Great Expectations website.

Test the Databricks job to ingest data

  1. In the databricks.yml file, in the include block, comment out the existing resources and add a line to match only the YAML configuration file of the job that you want to test, as in the example below.
    include:
      # - resources/*.yml
      # - resources_merged/*.yml
      - resources_dev/survey_123_dvc_register.yml
    
  2. Run databricks bundle deploy --force -t dev to deploy the jobs in Databricks.

  3. Run databricks bundle run <the_job_name> -t dev to launch the Databricks job; a concrete example of both commands follows this list.
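
As a concrete example, and assuming the job defined in resources_dev/survey_123_dvc_register.yml is also named survey_123_dvc_register (check the name key inside that YAML file), the two commands would look like this:

    # Deploy only the resources listed in the include block to the dev target
    databricks bundle deploy --force -t dev

    # Launch the deployed job and follow its run from the terminal
    databricks bundle run survey_123_dvc_register -t dev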

In Databricks, a job is a series of tasks that are run one after the other. In very simple terms, a job is the smallest unit of work that can be scheduled to run.

Go to Workflows in the Databricks portal. Workflows is the tab where your data processing, machine learning and analytics pipelines are orchestrated within the Databricks platform.

Within the Workflows interface, select Jobs. Search for the form abbreviation of interest. Select the Run button at the very end, next to the ellipsis.

Workflows Jobs

Running jobs can take a while. If you click on the job name, two tabs will appear: Runs and Tasks. Under the Runs tab, the success or failure of your job run(s) is displayed.

The Runs tab of your job

The Tasks tab displays the tasks that make up your job. Think of these tasks as the runnable units of your job. Click on any task and you will receive some metadata about the task.

The Tasks tab of your job

Go back to your Runs tab and select a particular job run under the Start Time column. Two tabs will appear at the top: Graph and Timeline. The Graph tab shows the status of the tasks, i.e. whether they succeeded or failed. The Timeline tab shows how long each task took to run and whether it was successful: red indicates failure and green indicates success.

The Graph and Timeline of your Jobs

Using the above image as an example, we can see that our job didn’t succeed because the first task ran into an error. Clicking on this task will provide more information on the error.

Error in a Job run

It is highly recommended that you explore the Workflows interface and learn what each tool does, as this will be very helpful during debugging.
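
If you prefer monitoring from the terminal, the Databricks CLI exposes much of the same information as the Workflows UI. A minimal sketch, assuming you are authenticated against the dev workspace as described above:

    # List the jobs in the workspace and note the job ID of your survey's job
    databricks jobs list

    # List recent runs of that job, including whether they succeeded or failed
    databricks jobs list-runs --job-id <job_id>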

The job run is an automated pipeline that sequentially executes the tasks of 1) landing Survey123 data into Azure and 2) moving the landed Survey123 data to the bronze stage, without you having to run any notebook.

If your job run succeeds, your Survey123 data will appear under the bronze schema in the ns-ii-tech-dev-catalog. You can see all of your data in the bronze stage by going to Catalog > ns-ii-tech-dev-catalog > bronze. The data is broken down into the sublayers and subtables that make up your form, as seen in ArcGIS Online.
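
To confirm from the terminal that the bronze tables exist, you can also query Unity Catalog with the Databricks CLI; a minimal sketch:

    # List the schemas in the dev catalog (bronze should be among them)
    databricks schemas list ns-ii-tech-dev-catalog

    # List the bronze tables created for your form's sublayers and subtables
    databricks tables list ns-ii-tech-dev-catalog bronze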

Push modifications to the dev branch on GitHub

Ensure that the branch you are working in has been merged into the dev branch of NIP-Lakehouse-Data. You do this by first opening a Pull Request on GitHub for your remote branch and then merging it. If you don’t have the permissions for merge operations, ask your supervisor to do so on your behalf.
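
If you prefer the GitHub CLI to the web interface, the pull request can also be opened from the terminal. A minimal sketch, assuming your feature branch has already been pushed to the remote:

    # Open a pull request from your current branch into the dev branch
    gh pr create --base dev --fill

    # Merge it once it has been reviewed (requires merge permissions)
    gh pr merge --merge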

Once your remote branch has been merged into dev, go to the GitHub Actions menu. GitHub Actions is an automation tool that allows developers to build, test, and deploy their code directly from GitHub.

Under GitHub Actions, select Dab: Deploy as shown in the figure below.

Running Github Actions

Click the down arrow for Run Workflow and ensure that:

i. Branch is set to dev or whichever branch you were merging into.

ii. The branch to build is dev.

iii. The environment to deploy to is dev.

Thereafter select Run Workflow.
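
The same workflow can also be triggered with the GitHub CLI. The workflow name is taken from the Actions menu above, but the input names below are assumptions and should be checked against the workflow file in the repository:

    # Trigger the Dab: Deploy workflow on the dev branch (input names are hypothetical)
    gh workflow run "Dab: Deploy" --ref dev -f branch=dev -f environment=dev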

A green tick next to the action indicates it was successful.