3. Great Expectations
Accessing the Great Expectations files
To enrich the Great Expectations files, you will primarily be working with two kinds of files: the survey123_<feature_layer>_great_expectations.py files and the <feature_layer>_pipeline_config.yml files. These files can be found in the following respective folders:
~/NIP-Lakehouse-Data/dab/development/gx_development - for the great_expectations.py files
~NIP-Lakehouse-Data/dab/sync/user/conf/tasks - for the .yml files
Configuring the .yml files
To begin enriching the great_expectations.py files, you will first have to configure the .yml files.
A template for configuring .yml files has been provided in the same workspace as the other .yml files; it is called survey_abbr_pipeline_config_template.yml. Use this file as a reference when configuring .yml files.
secret_scope:
  name: "db_ss_jobs_params"
data_domains:
  survey123:
    surveys:
      <survey_id>:
        survey_abbr: <survey_abbr>
        tables:
          <table_name>:
            landing_bronze:
              ids:
                - <column1>
                - <column2>
                - ...
              ordering_columns:
                - <column>
              unix_time_columns:
                - <column1>
                - <column2>
                - ...
To demonstrate how to configure the .yml files, we will use the Acoustic Survey feature layer on the ArcGIS Online (AGOL) website. Go to AGOL and log in using your credentials.
On the Home page, browse to Content > My Groups.
To create a configured .yml file for the Acoustic Survey feature layer, create a new .yml file whose name hints (as much as possible) at the feature layer being worked on and has the suffix ..._pipeline_config.yml appended to it.
As an example, the .yml file for Acoustic is acstc_pipeline_config.yml.
Step 1: Insert the Survey ID
If you browse down the webpage of the Acoustic Survey feature layer, you will notice it has a number of layers and subtables. The .yml template has a <survey_id> key. To get the appropriate value for this field, look at the Details metadata section on the right. Copy the alphanumeric string next to ID and insert it in place of the <survey_id> placeholder.
Step 2: Insert the feature layer abbreviation name
The feature layer abbreviation name is inserted in place of the <survey_abbr> placeholder.
Step 3: Populate the table values
For each feature layer, there can be one or several layers and/or tables. In the configuration .yml files, each of these is referred to as <feature_layer>_sublayer_<name> or <feature_layer>_subtable_<name>, where <name> is the name of the particular sublayer or subtable.
Insert the name of a particular sublayer or subtable as <table_name> (depending on which you began with, though it is recommended you start with the layers). Using the Acoustic Survey feature layer again, the sublayer Survey from AGOL is inserted in the format <survey_abbr>_sublayer_<layer_name>. For example:
acstc_sublayer_survey:
  landing_bronze:
    ids:
      - objectid
      - globalid
    ordering_columns:
      - EditDate
    unix_time_columns:
      - CreationDate
      - EditDate
For the subtables, below is an example of the format:
acstc_subtable_repeat_check:
  landing_bronze:
    ids:
      - globalid
      - acoustic_designation_check
      - parentglobalid
    ordering_columns:
      - EditDate
    unix_time_columns:
      - CreationDate
      - EditDate
The ids sub-section values will mostly be globalid and parentglobalid. For the ordering_columns sub-section, the value will mostly be EditDate, and for unix_time_columns the values will mostly be CreationDate and EditDate. Nevertheless, it is wise to consult your supervisor first.
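Putting the above steps together, below is a sketch of what a completed configuration for the Acoustic Survey could look like. The survey ID shown is a made-up placeholder; copy the real one from the Details section in AGOL.
secret_scope:
  name: "db_ss_jobs_params"
data_domains:
  survey123:
    surveys:
      0123456789abcdef0123456789abcdef: # hypothetical ID - copy the real one from AGOL
        survey_abbr: acstc
        tables:
          acstc_sublayer_survey:
            landing_bronze:
              ids:
                - objectid
                - globalid
              ordering_columns:
                - EditDate
              unix_time_columns:
                - CreationDate
                - EditDate
          acstc_subtable_repeat_check:
            landing_bronze:
              ids:
                - globalid
                - acoustic_designation_check
                - parentglobalid
              ordering_columns:
                - EditDate
              unix_time_columns:
                - CreationDate
                - EditDate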
Step 4: Repeat
Repeat the above procedure for every feature layer.
Configuring the great_expectations.py files
Configuring the great_expectations.py files is a more iterative process than configuring the .yml files.
Great Expectations is a tool used to maintain the quality of data throughout a data workflow and pipeline. Apart from the (rather abstract) official website, Medium articles on the tool are among the best places to build an intuition for it.
The template providing guidance on configuring the great_expectations files has been provided at this path: ~development/gx_development/survey123_XX_great_expectations.py.
Step 1: Create a great_expectations.py file
To create a new great_expectations.py file, create a new Python file in VS Code.
Using the Acoustic Survey feature layer as an example, create a great_expectations file using the following format: survey123_<abbreviation_name>_great_expectations.py. For example, the great_expectations file for our Acoustic Survey feature layer is survey123_acstc_great_expectations.py.
Insert the necessary import statements from the provided .py template file:
# pylint: skip-file
# flake8: noqa
import os
import pandas as pd
from nip_lakehouse.gx_utils import develop_expectations, read_yaml, validate_batch
Step 2: Create the def functions for each sublayer and subtable
The calls inside each def function point to particular functions within the gx_utils.py file that are responsible for checking the data.
For each and every sublayer and/or subtable of a particular feature layer, follow the format below to insert its column structure into Great Expectations.
def dev_gx_<subtable_name>():
    columns = []
    develop_expectations(
        datasource_name="X",
        columns=columns,
        columns_not_null=[""],
        columns_positive=["CreationDate", "EditDate"],
        columns_mapped_values={
            "acoustic_functioning_retrieve": ["Yes", "No"],
            "data_retrieved_retrieve": ["Yes", "No"],
        },
        columns_matching_regex=None,
    )
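As a concrete illustration, here is a sketch of what such a function could look like for the Acoustic sublayer Survey. The column lists, mapped values and datasource_name below are illustrative assumptions; extract the real values as described in Steps 3 and 4.
def dev_gx_acstc_sublayer_survey():
    # Illustrative sketch - the column names and datasource_name are
    # assumptions; derive the real values as described in Steps 3 and 4.
    columns = [
        "objectid",
        "globalid",
        "project",  # hypothetical mapped-values column
        "CreationDate",
        "EditDate",
    ]
    develop_expectations(
        datasource_name="acstc_sublayer_survey",  # assumed name; confirm with your supervisor
        columns=columns,
        columns_not_null=["objectid", "globalid"],
        columns_positive=["CreationDate", "EditDate"],
        columns_mapped_values={
            "project": ["<value1>", "<value2>"],  # filled from the choices list (Step 4)
        },
        columns_matching_regex=None,
    )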
Step 3: Extract column names, non-null and positive only column values
There are two ways to extract column names for the feature layers:
- From the Azure Blob container
- From AGOL
1. From Azure Blob Container
This is the most straightforward approach. However, it requires some hacker attitude.
Go to the Azure Container page.
Browse to the Survey 123 container.
Navigate to the acst_xxx folder. We will use the Acoustic Survey here again.
Opening it will reveal the other Acoustic (sublayer) survey folders. Keep clicking on the most recent folder at each level until you finally reach the blobs of this sublayer.
Click on the last (bottom-most) blob.
A new pane will pop up on the right.
Click on Generate SAS.
For Allowed protocols, select HTTPS and HTTP.
Scroll down and click the Generate SAS Token and URL button.
Copy the URL under Blob SAS URL. This link is valid for 33 hours.
2. From AGOL
This requires going to the Data tab of each feature layer and extracting the backend column name for each column in each table. For example, for the ObjectID column, the backend name is objectid. Alternatively, one can export the tables.
This guide will primarily rely on the first method. The second method is recommended when a particular feature layer or subtable is not available in the blob storage.
To extract the values for the columns, columns_not_null and columns_positive keys, we developed a simple custom Python script that prints out the appropriate values for each of the three fields. The script can be copied into a new VS Code project and run from there, to avoid interfering with the project where you are developing your Great Expectations files.
The custom script is called column_script.py and is found at ~github/NIP-Lakehouse-Data/custom_scripts/.
Run the script in a Linux environment using this syntax: python3 <filename>.py.
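For orientation, below is a minimal sketch of the idea behind such a script, assuming the blob is a CSV export; the real column_script.py lives in the repo and may differ.
# Minimal sketch of the idea behind column_script.py, assuming the blob
# is a CSV export - the real script lives in custom_scripts/ and may differ.
import pandas as pd

# Paste the Blob SAS URL generated in the Azure portal here.
SAS_URL = "https://<storage_account>.blob.core.windows.net/<container>/<blob>?<sas_token>"

df = pd.read_csv(SAS_URL)

# All column names, ready to paste into the `columns` argument.
print("columns =", df.columns.tolist())

# Columns with no missing values are candidates for `columns_not_null`.
print("columns_not_null =", [c for c in df.columns if df[c].notna().all()])

# Numeric columns whose values are all positive are candidates for `columns_positive`.
numeric = df.select_dtypes("number")
print("columns_positive =", [c for c in numeric.columns if (numeric[c] > 0).all()])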
Below is an image of the output of the above script when we inserted the Blob SAS URL for the Acoustic sublayer Survey.
You can see that comment fields such as ct_survey_comment and acoustic_notes_retire haven't been included as non-null columns!
Step 4: Extract values for the columns_mapped_values key
Extracting the values to go into the columns_mapped_values key is a special case, since it requires manually opening the Survey123 Connect form for a particular feature layer.
Open ArcGIS Survey123 Connect using the credentials provided by your supervisor.
Download the particular Survey123 form to your local machine. If it is already downloaded, click the ellipsis on the form you want to work on and select Open Folder.
Go to the media folder. This folder contains the choices list files for the form. An alternative is to use the choices_csv folder from the Field Division group in Teams.
For example, to map the unique values for the project column, open the Survey123 Excel (.xls) form. In this case, we will be working with the Acoustic Survey form.
To the left of the word Project (in the name column), you will see a cell with the text select_one_from_file project_list.csv under the type column. This is special Survey123 formatting which indicates that this field uses a choices list, specifically a single-choice list from a different file. The choices lists are found within the media folder.
The unique values for the project column are found within project_list.csv, specifically under the name column. Unless directed otherwise by your supervisor, use the name column to extract the values to be mapped.
We have developed a script called mapped_values_script.py that extracts the unique values for every choices list, as many as you wish. It is found at ~NIP-Lakehouse-Data/custom_scripts/; there are also other custom scripts inside that folder.
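As with the column script, below is a minimal sketch of the idea, assuming the choices lists are the .csv files in the form's media folder; the real mapped_values_script.py lives in the repo and may differ.
# Minimal sketch of the idea behind mapped_values_script.py - the real
# script lives in custom_scripts/ and may differ. Assumes the choices
# lists are the .csv files in the form's media folder.
import sys
from pathlib import Path

import pandas as pd

# Usage: python3 mapped_values_sketch.py <path_to_media_folder>
media_folder = Path(sys.argv[1])

for csv_file in sorted(media_folder.glob("*.csv")):
    choices = pd.read_csv(csv_file)
    # The values to be mapped live in the `name` column of each choices list.
    values = choices["name"].dropna().unique().tolist()
    # Printed dict-style so the output can be pasted into columns_mapped_values.
    print(f'"{csv_file.stem}": {values},')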
It is recommended that the media folder for a particular Survey123 form be kept in the same project as the custom scripts.
Here are the results.
Remember to format the list extracted from the custom scripts for a neat look in the great expectations files.
Step 5: Insert the final initialization functions
To initialize the functions which already have mapped values, insert the following block. Replace the placeholders with the appropriate values.
def dev_gx_<feature_layer_abbreviation>_all_tables():
    dev_gx_<sublayer1>_t1()
    dev_gx_<subtable1>_t1()


if __name__ == "__main__":
    # dev_gx_X_t1()
    dev_gx_<feature_layer_abbreviation>_all_tables()
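Following the naming used in this guide, the Acoustic file could, for instance, end like this (the function names are illustrative):
# Illustrative names based on the Acoustic example in this guide.
def dev_gx_acstc_all_tables():
    dev_gx_acstc_sublayer_survey()
    dev_gx_acstc_subtable_repeat_check()


if __name__ == "__main__":
    dev_gx_acstc_all_tables()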
Step 6: Push and commit frequently
After you finish developing the expectations for every sublayer or subtable, commit and push to the appropriate branch on GitHub.
Step 7: Repeat
Repeat the above procedure for every feature layer. This may seem a very repetitive and arduous task, but the custom scripts are a time saver. Remember to initiate a pull request seeking the approval of your supervisor once done.