3. Great Expectations
Accessing the Great Expectations files
To enrich the Great Expectations files, you will primarily be working with two kinds of files: the survey123_<feature_layer>_great_expectations.py files and the <feature_layer>_pipeline_config.yml files. These files can be found in the following respective folders:
~/NIP-Lakehouse-Data/dab/development/gx_development - for the great_expectations.py files
~NIP-Lakehouse-Data/dab/sync/user/conf/tasks - for the .yml files
Configuring the .yml files
To begin enriching the great_expectations.py files, you will first have to configure the .yml files.
A template for configuring .yml files has been provided in the same workspace as the other .yml files; it is called survey_abbr_pipeline_config_template.yml. Use this file as a reference when configuring .yml files.
secret_scope:
  name: "db_ss_jobs_params"
data_domains:
  survey123:
    surveys:
      <survey_id>:
        survey_abbr: <survey_abbr>
        tables:
          <table_name>:
            landing_bronze:
              ids:
                - <column1>
                - <column2>
                - ...
              ordering_columns:
                - <column>
              unix_time_columns:
                - <column1>
                - <column2>
                - ...
To demonstrate how to configure the .yml files, we will use the Acoustic Survey feature layer on the ArcGIS Online (AGOL) website. Go to AGOL and log in using your credentials.
On the Home page, browse to Content > My Groups.
To create a configured .yml file for the Acoustic Survey feature layer, create a new .yml file whose name hints (as much as possible) at the feature layer being worked on and has the suffix ..._pipeline_config.yml appended to it.
As an example, the .yml file for Acoustic is acstc_pipeline_config.yml.
Step 1: Insert the Survey ID
If you browse down the webpage of the Acoustic Survey feature layer, you will notice it has a number of layers and subtables. The .yml template has a <survey_id> key. To get the appropriate value for this field, look at the Details metadata section on the right. Copy the alphanumeric string next to ID and insert it in place of the <survey_id> placeholder.
Step 2: Insert the feature layer abbreviation name
The feature layer abbreviation name is inserted in place of the <survey_abbr> placeholder.
Step 3: Populate the table values
For each feature layer, there can be one or several layers and/or tables. In the configuration .yml files, each of these is referred to as <feature_layer>_sublayer_<name> or <feature_layer>_subtable_<name>, where <name> is the name of the particular sublayer or subtable.
Insert the name of a particular sublayer or subtable as <table_name> (depending on which you began with, though it is recommended you start with the layers). Using the Acoustic Survey feature layer again, the sublayer Survey from AGOL is inserted in the format <survey_abbr>_sublayer_<layer_name>. For example:
acstc_sublayer_survey:
  landing_bronze:
    ids:
      - objectid
      - globalid
    ordering_columns:
      - EditDate
    unix_time_columns:
      - CreationDate
      - EditDate
For the subtables, below is an example of the format:
acstc_subtable_repeat_check:
  landing_bronze:
    ids:
      - globalid
      - acoustic_designation_check
      - parentglobalid
    ordering_columns:
      - EditDate
    unix_time_columns:
      - CreationDate
      - EditDate
The ids sub-section values will mostly be globalid and parentglobalid. For the ordering_columns sub-section, the value will mostly be EditDate, and for unix_time_columns the values will mostly be CreationDate and EditDate. Nevertheless, it is wise to consult your supervisor first.
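Putting the above steps together, below is a sketch of what a completed configuration for the Acoustic Survey could look like. The survey ID shown is a made-up placeholder; copy the real one from the Details section in AGOL.
secret_scope:
  name: "db_ss_jobs_params"
data_domains:
  survey123:
    surveys:
      0123456789abcdef0123456789abcdef: # hypothetical ID - copy the real one from AGOL
        survey_abbr: acstc
        tables:
          acstc_sublayer_survey:
            landing_bronze:
              ids:
                - objectid
                - globalid
              ordering_columns:
                - EditDate
              unix_time_columns:
                - CreationDate
                - EditDate
          acstc_subtable_repeat_check:
            landing_bronze:
              ids:
                - globalid
                - acoustic_designation_check
                - parentglobalid
              ordering_columns:
                - EditDate
              unix_time_columns:
                - CreationDate
                - EditDate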
Step 4: Repeat
Repeat the above procedure for every feature layer.
Configuring the great_expectations.py files
Configuring the great_expectations.py files is a more iterative process than configuring the .yml files.
Great Expectations is a tool used to maintain the quality of data throughout a data workflow and pipeline. Apart from the (rather abstract) official website, Medium articles on the tool are among the best places to build an intuition for it.
The template providing guidance on configuring the great_expectations files has been provided at this path: ~development/gx_development/survey123_XX_great_expectations.py.
Step 1: Create a great_expectations.py file
To create a new great_expectations.py file, create a new Python file in VS Code.
Using the Acoustic Survey feature layer as an example, create a great_expectations file using the following format: survey123_<abbreviation_name>_great_expectations.py. For example, the great_expectations file for our Acoustic Survey feature layer is survey123_acstc_great_expectations.py.
Insert the necessary import statements from the provided .py template file:
# pylint: skip-file
# flake8: noqa
import os
import pandas as pd
from nip_lakehouse.gx_utils import develop_expectations, read_yaml, validate_batch
Step 2: Create the def functions for each sublayer and subtable
The calls inside each def function point to particular functions within the gx_utils.py file that are responsible for checking the data.
For each and every sublayer and/or subtable of a particular feature layer, follow the format below to insert its column structure into Great Expectations.
def dev_gx_<subtable_name>():
    columns = []
    develop_expectations(
        datasource_name="X",
        columns=columns,
        columns_not_null=[""],
        columns_positive=["CreationDate", "EditDate"],
        columns_mapped_values={
            "acoustic_functioning_retrieve": ["Yes", "No"],
            "data_retrieved_retrieve": ["Yes", "No"],
        },
        columns_matching_regex=None,
    )
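As a concrete illustration, here is a sketch of what such a function could look like for the Acoustic sublayer Survey. The column lists, mapped values and datasource_name below are illustrative assumptions; extract the real values as described in Steps 3 and 4.
def dev_gx_acstc_sublayer_survey():
    # Illustrative sketch - the column names and datasource_name are
    # assumptions; derive the real values as described in Steps 3 and 4.
    columns = [
        "objectid",
        "globalid",
        "project",  # hypothetical mapped-values column
        "CreationDate",
        "EditDate",
    ]
    develop_expectations(
        datasource_name="acstc_sublayer_survey",  # assumed name; confirm with your supervisor
        columns=columns,
        columns_not_null=["objectid", "globalid"],
        columns_positive=["CreationDate", "EditDate"],
        columns_mapped_values={
            "project": ["<value1>", "<value2>"],  # filled from the choices list (Step 4)
        },
        columns_matching_regex=None,
    )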
Step 3: Extract column names, non-null and positive only column values
There are two ways to extract column names for the feature layers:
- From the Azure Blob container
- From AGOL
1. From Azure Blob Container
This is the most straightforward approach. However, it requires some hacker attitude.
Go to the Azure Container page.
Browse to the Survey 123 container.
Navigate to the acst_xxx folder. We will use the Acoustic Survey here again.
Opening it will reveal the other Acoustic (sublayer) survey folders. Keep clicking on the most recent folder at each level until you finally reach the blobs of this sublayer.
Click on the last (bottom-most) blob.
A new pane will pop up on the right.
Click on Generate SAS.
For Allowed protocols, select HTTPS and HTTP.
Scroll down and click the Generate SAS Token and URL button.
Copy the URL under Blob SAS URL. This link is valid for 33 hours.
2. From AGOL
This requires going to the Data tab of each feature layer and extracting the backend column name for each column in each table. For example, for the ObjectID column, the backend name is objectid. Alternatively, one can export the tables.
This guide will primarily rely on the first method. The second method is recommended when a particular feature layer or subtable is not available in the blob storage.
To extract the values for the columns, columns_not_null and columns_positive keys, we developed a simple custom Python script that prints out the appropriate values for each of the three fields. The script can be copied into a new VS Code project and run from there, to avoid interfering with the project where you are developing your Great Expectations files.
The custom script is called column_script.py and is found at ~github/NIP-Lakehouse-Data/custom_scripts/.
Run the script in a Linux environment using this syntax: python3 <filename>.py.
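For orientation, below is a minimal sketch of the idea behind such a script, assuming the blob is a CSV export; the real column_script.py lives in the repo and may differ.
# Minimal sketch of the idea behind column_script.py, assuming the blob
# is a CSV export - the real script lives in custom_scripts/ and may differ.
import pandas as pd

# Paste the Blob SAS URL generated in the Azure portal here.
SAS_URL = "https://<storage_account>.blob.core.windows.net/<container>/<blob>?<sas_token>"

df = pd.read_csv(SAS_URL)

# All column names, ready to paste into the `columns` argument.
print("columns =", df.columns.tolist())

# Columns with no missing values are candidates for `columns_not_null`.
print("columns_not_null =", [c for c in df.columns if df[c].notna().all()])

# Numeric columns whose values are all positive are candidates for `columns_positive`.
numeric = df.select_dtypes("number")
print("columns_positive =", [c for c in numeric.columns if (numeric[c] > 0).all()])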
Below is an image of the output of the above script when we inserted the Blob SAS URL for the Acoustic sublayer Survey.
You can see that comment fields such as ct_survey_comment and acoustic_notes_retire haven't been included as non-null columns!
Step 4: Extract values for the columns_mapped_values key
Extracting the values to go into the columns_mapped_values key is a special case, since it requires manually opening the Survey123 Connect form for a particular feature layer.
Open ArcGIS Survey123 Connect using the credentials provided by your supervisor.
Download the particular Survey123 form to your local machine. If it is already downloaded, click the ellipsis on the form you want to work on and select Open Folder.
Go to the media folder. This folder contains the choices list files for the form. An alternative is to use the choices_csv folder from the Field Division group in Teams.
For example, to map the unique values for the project column, open the Survey123 Excel (.xls) form. In this case, we will be working with the Acoustic Survey form.
To the left of the word Project (in the name column), you will see a cell with the text select_one_from_file project_list.csv under the type column. This is special Survey123 formatting which indicates that this field uses a choices list, specifically a single-choice list from a different file. The choices lists are found within the media folder.
The unique values for the project column are found within project_list.csv, specifically under the name column. Unless directed otherwise by your supervisor, use the name column to extract the values to be mapped.
We have developed a script called mapped_values_script.py that extracts the unique values for every choices list, as many as you wish. It is found at ~NIP-Lakehouse-Data/custom_scripts/; there are also other custom scripts inside that folder.
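As with the column script, below is a minimal sketch of the idea, assuming the choices lists are the .csv files in the form's media folder; the real mapped_values_script.py lives in the repo and may differ.
# Minimal sketch of the idea behind mapped_values_script.py - the real
# script lives in custom_scripts/ and may differ. Assumes the choices
# lists are the .csv files in the form's media folder.
import sys
from pathlib import Path

import pandas as pd

# Usage: python3 mapped_values_sketch.py <path_to_media_folder>
media_folder = Path(sys.argv[1])

for csv_file in sorted(media_folder.glob("*.csv")):
    choices = pd.read_csv(csv_file)
    # The values to be mapped live in the `name` column of each choices list.
    values = choices["name"].dropna().unique().tolist()
    # Printed dict-style so the output can be pasted into columns_mapped_values.
    print(f'"{csv_file.stem}": {values},')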
It is recommended that the media folder for a particular Survey123 form be kept in the same project as the custom scripts.
Here are the results.
Remember to format the list extracted from the custom scripts for a neat look in the great expectations files.
Step 5: Insert the final initialization functions
To initialize the functions which already have mapped values, insert the following block. Replace the placeholders with the appropriate values.
def dev_gx_<feature_layer_abbreviation>_all_tables():
    dev_gx_<sublayer1>_t1()
    dev_gx_<subtable1>_t1()


if __name__ == "__main__":
    # dev_gx_X_t1()
    dev_gx_<feature_layer_abbreviation>_all_tables()
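Following the naming used in this guide, the Acoustic file could, for instance, end like this (the function names are illustrative):
# Illustrative names based on the Acoustic example in this guide.
def dev_gx_acstc_all_tables():
    dev_gx_acstc_sublayer_survey()
    dev_gx_acstc_subtable_repeat_check()


if __name__ == "__main__":
    dev_gx_acstc_all_tables()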
Step 6: Push and commit frequently
After you finish developing the expectations for every sublayer or subtable, commit and push to the appropriate branch on GitHub.
Step 7: Repeat
Repeat the above procedure for every feature layer. This may seem a very repetitive and arduous task, but the custom scripts are a time saver. Remember to initiate a pull request seeking the approval of your supervisor once done.