6. The New Data Ingestion Process
This process involves landing the Survey123 files into Azure storage using the great expectations and resources .yml files we created.
This new procedure is composed of the following parts:
- Prerequisites
- Github
- Databricks
- Manual Job Run
Prerequisites
Assuming that you have created all your .yml files correctly and uploaded them to Github, ensure the following are all set:
i. Your virtual environment is already running. To create a virtual environment, run python3 -m venv nip-dab-venv. Thereafter, activate it using source nip-dab-venv/bin/activate. A virtual environment is a tool that helps to keep dependencies required by different projects separate by creating isolated Python spaces for them. This ensures that your projects are kept distinct from each other even if they use different package versions, thus minimizing conflicts.
ii. An environment variables file.
The environment variables are found in the ~NIP-Lakehouse-Data/vars/.env.example path. These will be provided by your supervisor. Load the environment variables by sourcing the file with the command below:
source /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/vars/.env.example
The .env.example file is the environment variables file.
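For quick reference, the prerequisite steps so far look like this as one copy-pasteable sequence; the repository path is the example path from above and will differ on your machine:

```bash
# Create and activate the project's virtual environment
python3 -m venv nip-dab-venv
source nip-dab-venv/bin/activate

# Load the environment variables provided by your supervisor
# (adjust the path to wherever NIP-Lakehouse-Data is cloned on your machine)
source /home/sammigachuhi/github4/NIP-Lakehouse-Data/NIP-Lakehouse-Data/vars/.env.example
```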
Thereafter run source sync.sh dev.
The sync.sh file contains the code that syncs your files into the dev environment. Running this ensures that all the changes you have made to your files land in the Databricks File System (DBFS). Alternatively, the Sync Destination of the Databricks extension will do the same thing, but it is recommended to run both to prevent errors downstream caused by not running the sync.sh file.
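The sync.sh script in the repository is the source of truth for how this sync happens. Purely as a rough sketch of the idea, and assuming the newer Databricks CLI, a script like this typically wraps the CLI's sync command with the target environment passed as an argument; the workspace path and the USER_EMAIL variable below are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a sync script -- the real logic lives in sync.sh.
# Usage: source sync.sh dev
ENVIRONMENT="${1:-dev}"

# Push the local repository contents to the workspace sync destination.
# The destination path and USER_EMAIL are placeholders, not values from sync.sh.
databricks sync . "/Users/${USER_EMAIL}/.ide/dab-47fc1c58/${ENVIRONMENT}"
```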
On your VS Code, ensure that your Databricks extension is pushing changes to the appropriate sync destination, such as dab-47fc1c58. The dab-... prefix indicates that it has collected only the contents of the dab folder.
Github
Ensure the branches in which you created the great expectations and resources .yml files have been merged into the dev branch of NIP-Lakehouse-Data. You do this by first opening a Pull Request in Github for your remote branch and thereafter merging it. If you don't have the permissions for merge operations, ask your supervisor to do so on your behalf.
Once your remote branch has been merged to dev, go to the Github Actions menu. GitHub Actions is an automation tool that allows developers to build, test, and deploy their code directly from GitHub.
Under Github Actions, select Dab: Deploy as shown in the figure below.
Click the down arrow for Run Workflow and ensure that:
i. Branch is set to dev or whichever branch you were merging into.
ii. The branch to build is dev.
iii. The environment to deploy to is dev.
Thereafter select Run Workflow.
A green tick next to the action indicates it was successful.
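If you prefer working from a terminal, the same workflow can usually be triggered with the GitHub CLI. The workflow name comes from the step above, but the input field names below are assumptions and must match the workflow_dispatch inputs defined in the workflow file:

```bash
# Trigger the Dab: Deploy workflow on the dev branch (requires the gh CLI and repo access).
# The input names (build_branch, environment) are hypothetical; check the workflow's
# workflow_dispatch inputs for the real names before running this.
gh workflow run "Dab: Deploy" --ref dev \
  -f build_branch=dev \
  -f environment=dev
```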
Databricks
Log in to your Azure portal.
In your Workspace tab, go to this path: /Workspace/Users/<user-email>/.ide/dab-47fc1c58/development/gx_deploy_yml.
This is the file that will load the great expectations yaml files that you created. The last cell contains the paths to your great expectations .yaml files.
Just a few things to consider:
i. Ensure that the user_name variable corresponds with your email.
ii. Ensure that the is_dev variable is set to False.
iii. Make sure that survey_abbr matches the abbreviation of the form you want to ingest into bronze. For example, having created the yml files for a form abbreviated as xprize_sens_reg, the survey_abbr value will be xprize_sens_reg.
Once you are satisfied every value is okay, click Run all at the top. This should run all the cells in the notebook.
If there is no issue with your yml files, the last cell should display a list of progress bars, all of which should reach 100%. This means that your data ingestion into bronze worked perfectly.
Manual Job Run
In Azure, a job is a series of steps that are run sequentially. In very simple terms, a job is the smallest unit of execution that can be scheduled to run.
Go to Workflows in your Azure account. Workflows is the tab where your data processing, machine learning and analytics pipelines are orchestrated within the Databricks platform.
Within the Workflows interface, select Jobs. Search for the form abbreviation of interest. Select the Run button at the very end, next to the ellipsis.
Running jobs can take a while.
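As an optional alternative to the UI, a job can also be triggered from a terminal with the Databricks CLI. The job ID below is a placeholder and the exact syntax depends on your CLI version:

```bash
# Find the job whose name contains the form abbreviation (example abbreviation from above).
databricks jobs list | grep xprize_sens_reg

# Trigger a run of that job by its numeric ID (1234 is a placeholder).
# Older CLI versions use `databricks jobs run-now --job-id 1234` instead.
databricks jobs run-now 1234
```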
If you click on the job name, two tabs will appear: Runs and Tasks.
Under the Runs tab, the success or failure of your job run(s) is displayed.
The Tasks tab displays the tasks that make up your job. Think of these tasks as the runnable units of your job. Click on any task and you will receive some metadata about the task.
Go back to your Runs tab and select a particular Job Run under the Start Time column.
Two tabs will appear at the top: the Graph and Timeline tabs.
The Graph tab shows the status of the tasks, that is, whether they succeeded or failed.
The Timeline tab shows how long the task took to run and whether it was successful, i.e. red shows failure and green indicates success.
Using the above image as an example, we can see that our job didn’t succeed because the first task ran into an error. Clicking on this task will provide more information on the error.
It is highly recommended that you explore the Workflows interface and learn what each tool does, as this will be very helpful during debugging.
The job run is an automated pipeline that sequentially executes the tasks of 1) landing Survey123 data into Azure and 2) moving the landed Survey123 data to the Bronze stage, without you having to run any notebook.
If your job run succeeds, your Survey123 data will appear under the bronze folder in the ns-ii-tech-dev-catalog. You can see all of your data in the bronze stage by going to Catalog > ns-ii-tech-dev-catalog > Bronze. The data is broken down into the sublayers and subtables that make up your form as seen in ArcGIS Online.
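If you want to confirm the new bronze tables from a terminal rather than the Catalog Explorer, recent versions of the Databricks CLI expose Unity Catalog commands; this assumes such a version is installed and that the bronze schema name matches the catalog layout described above:

```bash
# List the tables that landed in the bronze schema of the dev catalog.
# Assumes a recent Databricks CLI with Unity Catalog support; if the `tables`
# command is unavailable, use the Catalog Explorer UI instead.
databricks tables list ns-ii-tech-dev-catalog bronze
```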