5. New Great Expectations Format
Rationale
The main reason for creating a new great expectations format is that this method is neater and is a prerequisite for a new data ingestion process currently under development.
Create a new .yml file
Unlike the previous methodology, which relied on a survey123_<feature_abbreviation>_great_expectations.py file for data ingestion, this one relies on a single .yml file. The .yml files for this new data ingestion methodology are to be saved and/or accessed from the following path: ~github/NIP-Lakehouse-Data/dab/sync/user/gx_development_surveys_v1.
A .yml file, originally known as Yet Another Markup Language before being renamed YAML Ain't Markup Language, is a form of serialization language. Serialization is a process whereby one application or service, with its own data structures and written in a different set of technologies, can transfer data to another application using a standard format. In other words, a .yml file is used as a configuration file, that is, it defines a program's or application's settings when it runs. YAML files have a .yml or .yaml extension.
The new data ingestion process begins with creating a .yml file in the gx_development_surveys_v1 folder. The naming format is shown below:
<survey_abbreviation>_great_expectations.yml
For example: acst_great_expectations.yml.
The template
A template .yml file is provided within the gx_development_surveys_v2 folder. Use this template when designing your .yml files for the data ingestion process.
The template is shown below:
secret_scope:
  name: "db_ss_jobs_params"
survey_abbr:
  - <survey_abbr>
survey_version: <survey_version>
tables:
  table1:
    columns:
      - .
      - ..
    expectations:
      all_columns_not_null: true
      # columns_not_null: to be commented if all_columns_not_null true, otherwise use columns_not_null to list the columns
      columns_positive:
        - .
        - ..
      columns_mapped_values:
      columns_matching_regex:
      columns_between:
        tree_hight:
          - 1
          - 3
        col2:
          - . # min
          - .. # max
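Once a new file has been drafted from this template, a quick structural check can catch missing keys early. The snippet below is a minimal sketch of such a check (it assumes PyYAML is installed; check_gx_yaml is a hypothetical helper, not a script in the repository):
import yaml  # assumption: PyYAML is installed

REQUIRED_TOP_LEVEL_KEYS = {"secret_scope", "survey_abbr", "survey_version", "tables"}

def check_gx_yaml(path):
    """Sanity-check a great expectations .yml against the template's top-level layout."""
    with open(path) as fh:
        config = yaml.safe_load(fh)

    missing = REQUIRED_TOP_LEVEL_KEYS - set(config)
    if missing:
        raise ValueError(f"{path} is missing top-level keys: {sorted(missing)}")

    # Every table should at least declare its columns and expectations.
    for table_name, table in config["tables"].items():
        for key in ("columns", "expectations"):
            if key not in table:
                raise ValueError(f"table '{table_name}' is missing '{key}'")

    print(f"{path}: structure matches the template")

# Example (hypothetical file name):
# check_gx_yaml("acstc_great_expectations.yml")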
The feature layer name
Start by replacing <survey_abbr> with the abbreviation of the feature layer you are working on. Taking the Acoustic Survey feature layer as an example, you will replace it with the word acstc.
The survey version
Considering that this is the first version of data ingestion using .yml files, and that it sits within version 1 of the gx_development_surveys_v1 folder, replace <survey_version> with the value 1. That is:
secret_scope:
  name: "db_ss_jobs_params"
survey_abbr:
  - acstc
survey_version: 1
...
The tables
Remember that in the previous data ingestion process we had to define the subtables with something like dev_gx_acstc_sublayer_survey()? In the new format, specify each table name with the format <survey_abbreviation>_<sublayer>_<name_of_sublayer_or_subtable>. For example, acstc_sublayer_survey.
For the keys associated with a particular sublayer or subtable, taking the survey sublayer as an example, it is far easier to use a script that automates the process. The script is known as yml_columns_script.py. It is found inside the following path: ~NIP-Lakehouse-Data/custom_scripts/.
This is how it works:
Step 1: Azure Blob Storage
Just like in the previous data ingestion methodology, you first choose the particular subtable you want to work with, say the sublayer survey for the Acoustic Survey feature layer.
Go to the Azure Container page.
Browse to the Survey 123 container.
Navigate to the acst_xxx folder. We will use the Acoustic Survey here again.
Navigating to it will reveal other Acoustic (sublayer) survey folders. Keep clicking on the most recent folder in all cases until you finally reach the blobs of this sublayer.
Click on the last blob (the one at the bottom).
A new pane will pop up from the right.
Click on Generate SAS.
For Allowed protocols, select HTTPS and HTTP.
Scroll down and click Generate SAS Token and URL button.
Copy the URL under the Blob SAS URL. This link is valid for 33 hours.
Step 2: Filling the columns key in the .yml file
To fill the columns key in the .yml file, we use the script yml_columns_script.py. This is a custom Python script for extracting the column names from the blob file.
It works as follows. Paste the SAS URL as the value of the sas_url variable, like so:
sas_url = "<place_the_SAS_url_here>"
Now run the file. Remember that since we are working in a Linux environment, we run the custom script with the syntax python3 yml_columns_script.py.
The output should be as follows, with each group of values printed under a header for the .yml key it belongs to. Only if there are no null values in any column will the result be all_columns_not_null: true.
**NB**: The yml_columns_script.py prints out the relevant values for the columns, columns_not_null and columns_positive keys.
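For orientation, the sketch below illustrates the kind of logic the columns script applies; it is not the repository code, and it assumes pandas is installed and that the blob is a csv file:
import pandas as pd  # assumption: pandas is installed

sas_url = "<place_the_SAS_url_here>"  # paste the Blob SAS URL here

df = pd.read_csv(sas_url)

# columns key: every column in the blob
print("columns:")
for col in df.columns:
    print(f"  - {col}")

# columns_not_null key (or all_columns_not_null: true when nothing is null)
not_null_cols = [c for c in df.columns if df[c].notna().all()]
if len(not_null_cols) == len(df.columns):
    print("all_columns_not_null: true")
else:
    print("columns_not_null:")
    for col in not_null_cols:
        print(f"  - {col}")

# columns_positive key: numeric columns whose values are strictly positive
print("columns_positive:")
for col in df.select_dtypes(include="number").columns:
    if (df[col].dropna() > 0).all():
        print(f"  - {col}")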
For the values to be filled in the columns_mapped_values key, a different script is used.
Step 3: Filling the columns_mapped_values key
The columns_mapped_values key refers to the unique values in the choices list. The choices list can be found inside the Files folder of the Survey123 group Field Division in Teams.
Alternatively, every form inside the ArcGIS Survey123 Connect software has a media folder. All the sheets in there are the choices lists for that particular form; the difference is that the Teams folder contains all the choices lists.
In case you are using this new methodology for a feature layer with an existing survey123_<feature_layer>_great_expectations.py file, just copy the entire columns_mapped_values key and paste it at the very top of the yml_keys_script.py file, like so:
# This script prints out in long format all the values allocated to a dictionary key
columns_mapped_values = {
    "activity_check": ['Clear_Vegetation', 'Reposition_Sensor', 'Data_Collection', 'Routine_Check',
                       'Battery_Replacement', 'SD_Card_Malfunction', 'Structural_Change',
                       'Quiet_Period_Change', 'Sensor_Height_Change', 'Sampling_Rate_Change'],
    "acoustic_functioning_check": ['Yes', 'No'],
    "data_retrieved_check": ["Yes", "No"]}
Run the script using the syntax python3 yml_keys_script.py. It will print out the mapped values for every key inside columns_mapped_values, with the exception of some keys related to field team names; this is because, at the time of writing, a new tool is to be developed to extract field teams' names automatically. Other keys, such as those for dates, longitude/latitude, objectid and globalid, contain entirely unique values throughout and thus are not relevant keys for mapping; if present, they are omitted.
Remember that the yml_keys_script.py is run iteratively for each subtable.
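As a rough illustration (a sketch, not the script from custom_scripts), the keys script essentially walks the pasted columns_mapped_values dictionary and prints each key with its allowed values in the long .yml list format:
# Sketch only: print the pasted dictionary in the long .yml format.
columns_mapped_values = {
    "activity_check": ["Clear_Vegetation", "Reposition_Sensor", "Data_Collection"],
    "acoustic_functioning_check": ["Yes", "No"],
    "data_retrieved_check": ["Yes", "No"],
}

print("columns_mapped_values:")
for key, values in columns_mapped_values.items():
    print(f"  {key}:")
    for value in values:
        print(f"    - {value}")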
In case you are working on a new .yml file whose feature layer does not exist inside the development/gx_development folder, you will have to use mapped_values_script.py and yml_keys_script.py, in that order.
The mapped_values_script.py is for printing out the unique values from the choices list, specifically from the name column.
For example, running the mapped_values_script.py as shown below prints out the unique values for the choices list associated with the activity_check key in the columns_mapped_values dictionary.
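The original snippet is not reproduced here; the sketch below only illustrates the idea, assuming pandas is installed and that the choices list is an Excel sheet with a name column (the placeholder path is hypothetical):
import pandas as pd  # assumption: pandas is installed

# Sketch only: read a choices list and print the unique entries of its name column.
choices = pd.read_excel("<path_to_choices_list>.xlsx")  # hypothetical path

unique_names = choices["name"].dropna().unique().tolist()
print(unique_names)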
Copy the list printed out in the terminal and, inside the yml_keys_script.py file, paste it next to its appropriate key. For example, for this list, the appropriate key will be activity_check. Do the same for the other choices lists, such as those for the acoustic_functioning_check and data_retrieved_check keys. Running the yml_keys_script.py file with the filled-up dictionary will print out all the relevant details at once.
A completed .yml file should look like the format below:
secret_scope:
  name: "db_ss_jobs_params"
survey_abbr:
  - acstc
survey_version: 1
tables:
  acstc_sublayer_survey:
    columns:
      - objectid
      - globalid
      - date_time_details
      - project
      ...
    expectations:
      # all_columns_not_null: true
      columns_not_null:
        - objectid
        - globalid
        - date_time_details
        ...
      columns_positive:
        - objectid
        - date_time_details
        - date_time_deploy
        ...
      columns_mapped_values:
        project:
          - Rangeland_Carbon_Project
          - Rangeland_Biodiversity_Project
          - Rangeland_Social_Project
        area:
          - LLBN
        transport:
          ...
      columns_matching_regex:
  acstc_subtable_repeat_deploy:
    columns:
      ...
    expectations:
      # all_columns_not_null: true
      columns_not_null:
        ...
      columns_positive:
        ...
      columns_mapped_values:
        ...
Pushing and committing
After you have finished developing the expectations for every sublayer or subtable, commit and push to the appropriate branch on GitHub.
Repeat
Repeat the above procedure for every feature layer. This may seem a very repetitive and arduous task, but the custom scripts are a time saver. Remember to open a pull request seeking the approval of your supervisor once done.
From csv tables
Sometimes, you may be requested to create great expectations yml files for tables that are yet to be uploaded into Azure storage.
To perform this action, the custom_script folder in the NIP-Lakehouse-Data repository contains columns_script_func.py, which provides the gx_structure function for extracting all the required column names, non-null column names and positive-only columns using just the specific table and the Survey123 XLS form. The Survey123 XLS form mentioned here is the one retrieved from the ArcGIS Survey123 Connect folders.
For example, in the following code we pass the path of the specific table we want to extract the column groupings for (the column names, non-null columns and positive-only columns) to the s123_path parameter. Secondly, we pass the path of the Survey123 Connect XLS form to the xls_form_path parameter. If accessing the Survey123 Connect form is difficult or is not recommended (to avoid data corruption), you can copy the XLS form to another directory and use that path here.
gx_structure(s123_path="../CPP_SP1_Small_Tree_Survey_17-04-2024/large_tree_sp1_1.csv",
xls_form_path="../CPP_SP1_Small_Tree_Survey_17-04-2024/CPP SP1 Small Tree Survey.xlsx")
The result of passing the above parameters will be as follows:
---Check the unique values in every column--
--Column names from survey123 csv files ../CPP_SP1_Small_Tree_Survey_17-04-2024/large_tree_sp1_1.csv---
---------------testing--column_names ----------
columns:
- objectid
- globalid
- tree_species_sp1
- tree_height_sp1
- dead_alive_sp1
- stem_type_sp1
- stem_count_single_sp1
- TotalStemCount.1
- stem_count_index_result_sp1
- stem_mode_sp1
- damage_sp1
- c_cpp_damage_cause_sp1
- mid_point_height_of_damage_cm_sp1
- parentglobalid
- CreationDate
- Creator
- EditDate
- Editor
---This code prints columns that do not have any null values (columns_not_null in gx)---
------------testing for columns not null-------------------
columns_not_null:
- objectid
- globalid
- tree_species_sp1
- tree_height_sp1
- dead_alive_sp1
- stem_type_sp1
- stem_count_index_result_sp1
- stem_mode_sp1
- damage_sp1
- c_cpp_damage_cause_sp1
- mid_point_height_of_damage_cm_sp1
- parentglobalid
- CreationDate
- Creator
- EditDate
- Editor
--This code shall only print the columns with positive values only---
--Works best with csv files from landing container--
---------------the below code is for testing columns_positive---------------
columns positive:
- objectid
- tree_height_sp1
- stem_count_single_sp1
- TotalStemCount.1
- stem_count_index_result_sp1
- mid_point_height_of_damage_cm_sp1
Notice that it returns the values of our three column groupings, that is, columns, columns_not_null and columns_positive.
There is also an additional function within columns_script_func.py called count_duplicate_columns. The intention of this function is to show duplicate columns in your downloaded Survey123 csv file. Using our large_tree_sp1_1.csv table again, we can see it has an extra column called Total Stem Count which has been appended with a .1.
ObjectID: appears 1 times
GlobalID: appears 1 times
Tree Species: appears 1 times
Tree Height (m): appears 1 times
Status: appears 1 times
Stem Type: appears 1 times
Total Stem Count: appears 1 times
Total Stem Count.1: appears 1 times
# Stem Count Metrics: appears 1 times
Stem Mode: appears 1 times
Damage: appears 1 times
Damage Cause: appears 1 times
Mid-Point height of damage (cm): appears 1 times
ParentGlobalID: appears 1 times
CreationDate: appears 1 times
Creator: appears 1 times
EditDate: appears 1 times
Editor: appears 1 times
Note that these are the column labels and not the column names; the latter appear as the schema names.
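One way such a duplicate check can be expressed is sketched below; this is an illustration, not the repository implementation:
from collections import Counter
import csv

def count_duplicate_columns_sketch(csv_path):
    """Sketch only: report how often each column label appears in the csv header."""
    with open(csv_path, newline="") as fh:
        header = next(csv.reader(fh))

    for label, count in Counter(header).items():
        print(f"{label}: appears {count} times")

# Example, using the table from the worked example above:
# count_duplicate_columns_sketch("../CPP_SP1_Small_Tree_Survey_17-04-2024/large_tree_sp1_1.csv")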
Ensure that the great expectations file follows the naming format cpp_small_tree_v1_great_expectations.py, where the version number is the version in use, e.g. v1, v2, v3, etc.
The Resources files
These files are found under the dab/resources/ folder. These .yml files contain the parameters to connect with Azure. They are crucial files, and not writing the values correctly may lead to data ingestion errors.
The changes to be made in these .yml files are as follows:
- Ensure the name key reflects the abbreviated name of your form, prepended with the survey123 string, e.g. survey123_cpp_small_tree_v2.
- Ensure the name of your form (using the format above) is reflected in the task_key and conf-file keys.
- Ensure the survey-id key reflects the ID of your form in ArcGIS Online.
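Before pushing, a quick plain-text scan such as the hypothetical helper sketched below (not part of the repository) can confirm that the form name and survey ID actually appear in the resources file:
def check_resources_file(path, form_name, survey_id):
    """Sketch only: warn if the expected form name or survey ID is missing from the file."""
    with open(path) as fh:
        text = fh.read()

    for expected in (form_name, survey_id):
        status = "OK" if expected in text else "WARNING: missing"
        print(f"{status} - {expected}")

# Example with hypothetical values:
# check_resources_file("dab/resources/survey123_cpp_small_tree_v2.yml",
#                      form_name="survey123_cpp_small_tree_v2",
#                      survey_id="<your_arcgis_form_id>")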
After all the pipeline_config.py, great_expectations and resources .yml files have been created, push your work to GitHub.
Overview of a pipeline config .yml file
Below we provide an overview of the structure of a pipeline config file, using the one for ct as an example.
resources:
This is the top-level section under which the pipeline's resources are defined; in this file it holds the job definitions described below.
jobs:
This section defines the jobs within the pipeline. In this case, there’s one job named survey123_ct.
name:
This specifies the user-friendly name of the job - “survey123_ct”.
email_notifications:
on_failure:
This defines who should receive email notifications if the job fails. Here, it’s set to databricks-ci, likely an email group related to Databricks, a data processing platform.
tasks:
This section defines the tasks executed within the job. Here, there are two tasks:
- task_key: This is an internal reference for the task within the pipeline.
- job_cluster_key: This specifies the job cluster to be used for the task (defined later).
- python_wheel_task: This indicates the task involves running a Python script using a specific wheel file. Here are additional details:
  - package_name: Name of the Python package to be used - "nip_lakehouse".
  - entry_point: The specific Python function/script to execute within the package - "survey123_ingestion_landing" or "survey123_landing_bronze".
  - named_parameters: These are parameters passed to the Python script:
    - conf-file: Path to a configuration file - retrieved dynamically using the user name in the workspace.
    - data-domain: Name of the data domain (likely related to data storage).
    - survey-id: Specific survey identifier - "312815883fbf4549a784eb5ad2e1e644".
- libraries: This specifies the Python wheel file location used for the task. The path includes variables like bundle.target.
The second task (survey123_ct_landing_bronze) is similar, with a different entry_point and an additional named parameter gx-file-path.
depends_on:
This is defined only for the second task (survey123_ct_landing_bronze). It specifies that this task depends on the successful completion of the first task (survey123_ct_ingestion_landing) before it can run.
job_clusters:
This section defines the job cluster configurations used by the tasks. Here, there’s one cluster defined:
- job_cluster_key: This references the key used in the tasks (job_cluster_task).
- new_cluster: This indicates a new cluster should be created for the tasks.
  - spark_version: Spark version for the cluster - "13.3.x-scala2.12" (Spark is a data processing framework).
  - node_type_id: VM type used for the cluster nodes - "Standard_D3_v2".
  - policy_id: This uses the variable ${var.arcgis_job_compute_policy_id}. Likely, this variable is defined elsewhere in Azure Pipelines and specifies a compute policy for the cluster.
  - num_workers: Number of worker nodes in the cluster - 1.
tags:
This section defines custom tags for the job:
- job_type: This specifies the type of job - "ingestion_&_validation".
- survey123_group: This groups the job by survey - "ct".
In summary, this pipeline defines a job with two tasks. Both tasks involve running Python scripts for data processing related to a survey (survey123, group ct). The tasks use a Databricks cluster for execution and rely on configuration files and specific data locations.