5. New Great Expectations Format
Rationale
The main reason for creating a new great expectations format is that this method is neater and is a prerequisite for a new data ingestion process currently under development.
Create a new .yml file
Unlike the previous methodology, which relied on a survey123_<feature_abbreviation>_great_expectations.py file for data ingestion, this one relies on a single .yml file. The .yml files for this new data ingestion methodology are to be saved and/or accessed from the following path: ~github/NIP-Lakehouse-Data/dab/sync/user/gx_development_surveys_v1.
A .yml file, originally known as Yet Another Markup Language before being renamed YAML Ain't Markup Language, is a form of serialization language. Serialization is a process whereby one application or service, with its own data structures and written in a different set of technologies, can transfer data to another application using a standard format. In other words, a .yml file is used as a configuration file, that is, it defines a program's or application's settings when it runs. YAML files have a .yml or .yaml extension.
The new data ingestion process begins with creating a .yml file in the gx_development_surveys_v1 folder. The naming format is shown below:
<survey_abbreviation>_great_expectations.yml
For example: acst_great_expectations.yml.
The template
A template .yml file is provided within the gx_development_surveys_v2 folder. Use this template when designing your .yml files for the data ingestion process.
The template is shown below:
secret_scope:
  name: "db_ss_jobs_params"
survey_abbr:
  - <survey_abbr>
survey_version: <survey_version>
tables:
  table1:
    columns:
      - .
      - ..
    expectations:
      all_columns_not_null: true
      # columns_not_null: to be commented if all_columns_not_null true, otherwise use columns_not_null to list the columns
      columns_positive:
        - .
        - ..
      columns_mapped_values:
      columns_matching_regex:
      columns_between:
        tree_hight:
          - 1
          - 3
        col2:
          - . # min
          - .. # max
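Once a new file has been drafted from this template, a quick structural check can catch missing keys early. The snippet below is a minimal sketch of such a check (it assumes PyYAML is installed; check_gx_yaml is a hypothetical helper, not a script in the repository):
import yaml  # assumption: PyYAML is installed

REQUIRED_TOP_LEVEL_KEYS = {"secret_scope", "survey_abbr", "survey_version", "tables"}

def check_gx_yaml(path):
    """Sanity-check a great expectations .yml against the template's top-level layout."""
    with open(path) as fh:
        config = yaml.safe_load(fh)

    missing = REQUIRED_TOP_LEVEL_KEYS - set(config)
    if missing:
        raise ValueError(f"{path} is missing top-level keys: {sorted(missing)}")

    # Every table should at least declare its columns and expectations.
    for table_name, table in config["tables"].items():
        for key in ("columns", "expectations"):
            if key not in table:
                raise ValueError(f"table '{table_name}' is missing '{key}'")

    print(f"{path}: structure matches the template")

# Example (hypothetical file name):
# check_gx_yaml("acstc_great_expectations.yml")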
The feature layer name
Start by replacing <survey_abbr> with the abbreviation of the feature layer you are working on. Taking the Acoustic Survey feature layer as an example, you will replace it with the word acstc.
The survey version
Considering that this is the first version of data ingestion using .yml files, and that it sits within version 1 of the gx_development_surveys_v1 folder, replace <survey_version> with the value 1. That is:
secret_scope:
  name: "db_ss_jobs_params"
survey_abbr:
  - acstc
survey_version: 1
...
The tables
Remember that in the previous data ingestion process we had to define the subtables with something like dev_gx_acstc_sublayer_survey()? In the new format, specify each table name with the format <survey_abbreviation>_<sublayer>_<name_of_sublayer_or_subtable>. For example, acstc_sublayer_survey.
For the keys associated with a particular sublayer or subtable, taking the survey sublayer as an example, it is far easier to use a script that automates the process. The script is known as yml_columns_script.py. It is found inside the following path: ~NIP-Lakehouse-Data/custom_scripts/.
This is how it works:
Step 1: Azure Blob Storage
Just like in the previous data ingestion methodology, you first choose the particular subtable you want to work with, say the sublayer survey for the Acoustic Survey feature layer.
Go to the Azure Container page.
Browse to the Survey 123 container.
Navigate to the acst_xxx folder. We will use the Acoustic Survey here again.
Navigating to it will reveal other Acoustic (sublayer) survey folders. Keep clicking on the most recent folder in all cases until you finally reach the blobs of this sublayer.
Click on the last blob (the one at the bottom).
A new pane will pop up from the right.
Click on Generate SAS.
For Allowed protocols, select HTTPS and HTTP.
Scroll down and click Generate SAS Token and URL button.
Copy the URL under the Blob SAS URL. This link is valid for 33 hours.
Step 2: Filling the columns key in the .yml file
To fill the columns key in the .yml file, we use the script yml_columns_script.py. This is a custom Python script for extracting the column names from the blob file.
It works as follows. Paste the SAS URL as the value of the sas_url variable, like so:
sas_url = "<place_the_SAS_url_here>"
Now run the file. Remember that since we are working in a Linux environment, we run the custom script with the syntax python3 yml_columns_script.py.
The output should be as follows, with each group of values printed under a header for the .yml key it belongs to. Only if there are no null values in any column will the result be all_columns_not_null: true.
**NB**: The yml_columns_script.py prints out the relevant values for the columns, columns_not_null and columns_positive keys.
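For orientation, the sketch below illustrates the kind of logic the columns script applies; it is not the repository code, and it assumes pandas is installed and that the blob is a csv file:
import pandas as pd  # assumption: pandas is installed

sas_url = "<place_the_SAS_url_here>"  # paste the Blob SAS URL here

df = pd.read_csv(sas_url)

# columns key: every column in the blob
print("columns:")
for col in df.columns:
    print(f"  - {col}")

# columns_not_null key (or all_columns_not_null: true when nothing is null)
not_null_cols = [c for c in df.columns if df[c].notna().all()]
if len(not_null_cols) == len(df.columns):
    print("all_columns_not_null: true")
else:
    print("columns_not_null:")
    for col in not_null_cols:
        print(f"  - {col}")

# columns_positive key: numeric columns whose values are strictly positive
print("columns_positive:")
for col in df.select_dtypes(include="number").columns:
    if (df[col].dropna() > 0).all():
        print(f"  - {col}")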
For the values to be filled in the columns_mapped_values key, a different script is used.
Step 3: Filling the columns_mapped_values key
The columns_mapped_values key refers to the unique values in the choices list. The choices list can be found inside the Files folder of the Survey123 group Field Division in Teams.
Alternatively, every form inside the ArcGIS Survey123 Connect software has a media folder. All the sheets in there are the choices lists for that particular form; the difference is that the Teams folder contains all the choices lists.
In case you are using this new methodology for a feature layer with an existing survey123_<feature_layer>_great_expectations.py file, just copy the entire columns_mapped_values key and paste it at the very top of the yml_keys_script.py file, like so:
# This script prints out in long format all the values allocated to a dictionary key
columns_mapped_values = {
    "activity_check": ['Clear_Vegetation', 'Reposition_Sensor', 'Data_Collection', 'Routine_Check',
                       'Battery_Replacement', 'SD_Card_Malfunction', 'Structural_Change',
                       'Quiet_Period_Change', 'Sensor_Height_Change', 'Sampling_Rate_Change'],
    "acoustic_functioning_check": ['Yes', 'No'],
    "data_retrieved_check": ["Yes", "No"]}
Run the script using the syntax python3 yml_keys_script.py. It will print out the mapped values for every key inside columns_mapped_values, with the exception of some keys related to field team names; this is because, at the time of writing, a new tool is to be developed to extract field teams' names automatically. Other keys, such as those for dates, longitude/latitude, objectid and globalid, contain entirely unique values throughout and thus are not relevant keys for mapping; if present, they are omitted.
Remember that the yml_keys_script.py is run iteratively for each subtable.
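As a rough illustration (a sketch, not the script from custom_scripts), the keys script essentially walks the pasted columns_mapped_values dictionary and prints each key with its allowed values in the long .yml list format:
# Sketch only: print the pasted dictionary in the long .yml format.
columns_mapped_values = {
    "activity_check": ["Clear_Vegetation", "Reposition_Sensor", "Data_Collection"],
    "acoustic_functioning_check": ["Yes", "No"],
    "data_retrieved_check": ["Yes", "No"],
}

print("columns_mapped_values:")
for key, values in columns_mapped_values.items():
    print(f"  {key}:")
    for value in values:
        print(f"    - {value}")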
In case you are working on a new .yml file whose feature layer does not exist inside the development/gx_development folder, you will have to use mapped_values_script.py and yml_keys_script.py, in that order.
The mapped_values_script.py is for printing out the unique values from the choices list, specifically from the name column.
For example, running the mapped_values_script.py as shown below prints out the unique values for the choices list associated with the activity_check key in the columns_mapped_values dictionary.
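The original snippet is not reproduced here; the sketch below only illustrates the idea, assuming pandas is installed and that the choices list is an Excel sheet with a name column (the placeholder path is hypothetical):
import pandas as pd  # assumption: pandas is installed

# Sketch only: read a choices list and print the unique entries of its name column.
choices = pd.read_excel("<path_to_choices_list>.xlsx")  # hypothetical path

unique_names = choices["name"].dropna().unique().tolist()
print(unique_names)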
Copy the list printed out in the terminal and, inside the yml_keys_script.py file, paste it next to its appropriate key. For example, for this list, the appropriate key will be activity_check. Do the same for the other choices lists, such as those for the acoustic_functioning_check and data_retrieved_check keys. Running the yml_keys_script.py file with the filled-up dictionary will print out all the relevant details at once.
A completed .yml file should look like the format below:
secret_scope:
  name: "db_ss_jobs_params"
survey_abbr:
  - acstc
survey_version: 1
tables:
  acstc_sublayer_survey:
    columns:
      - objectid
      - globalid
      - date_time_details
      - project
      ...
    expectations:
      # all_columns_not_null: true
      columns_not_null:
        - objectid
        - globalid
        - date_time_details
        ...
      columns_positive:
        - objectid
        - date_time_details
        - date_time_deploy
        ...
      columns_mapped_values:
        project:
          - Rangeland_Carbon_Project
          - Rangeland_Biodiversity_Project
          - Rangeland_Social_Project
        area:
          - LLBN
        transport:
          ...
      columns_matching_regex:
  acstc_subtable_repeat_deploy:
    columns:
      ...
    expectations:
      # all_columns_not_null: true
      columns_not_null:
        ...
      columns_positive:
        ...
      columns_mapped_values:
        ...
Pushing and committing
After you have finished developing the expectations for every sublayer or subtable, commit and push to the appropriate branch on GitHub.
Repeat
Repeat the above procedure for every feature layer. This may seem a very repetitive and arduous task, but the custom scripts are a time saver. Remember to open a pull request seeking the approval of your supervisor once done.
From csv tables
Sometimes, you may be requested to create great expectations yml files for tables that are yet to be uploaded into Azure storage.
To perform this action, the custom_script folder in the NIP-Lakehouse-Data repository contains columns_script_func.py, which provides the gx_structure function for extracting all the required column names, non-null column names and positive-only columns using just the specific table and the Survey123 XLS form. The Survey123 XLS form mentioned here is the one retrieved from the ArcGIS Survey123 Connect folders.
For example, in the following code we pass the path of the specific table we want to extract the column groupings for (the column names, non-null columns and positive-only columns) to the s123_path parameter. Secondly, we pass the path of the Survey123 Connect XLS form to the xls_form_path parameter. If accessing the Survey123 Connect form is difficult or is not recommended (to avoid data corruption), you can copy the XLS form to another directory and use that path here.
gx_structure(s123_path="../CPP_SP1_Small_Tree_Survey_17-04-2024/large_tree_sp1_1.csv",
xls_form_path="../CPP_SP1_Small_Tree_Survey_17-04-2024/CPP SP1 Small Tree Survey.xlsx")
The result of passing the above parameters will be as follows:
---Check the unique values in every column--
--Column names from survey123 csv files ../CPP_SP1_Small_Tree_Survey_17-04-2024/large_tree_sp1_1.csv---
---------------testing--column_names ----------
columns:
- objectid
- globalid
- tree_species_sp1
- tree_height_sp1
- dead_alive_sp1
- stem_type_sp1
- stem_count_single_sp1
- TotalStemCount.1
- stem_count_index_result_sp1
- stem_mode_sp1
- damage_sp1
- c_cpp_damage_cause_sp1
- mid_point_height_of_damage_cm_sp1
- parentglobalid
- CreationDate
- Creator
- EditDate
- Editor
---This code prints columns that do not have any null values (columns_not_null in gx)---
------------testing for columns not null-------------------
columns_not_null:
- objectid
- globalid
- tree_species_sp1
- tree_height_sp1
- dead_alive_sp1
- stem_type_sp1
- stem_count_index_result_sp1
- stem_mode_sp1
- damage_sp1
- c_cpp_damage_cause_sp1
- mid_point_height_of_damage_cm_sp1
- parentglobalid
- CreationDate
- Creator
- EditDate
- Editor
--This code shall only print the columns with positive values only---
--Works best with csv files from landing container--
---------------the below code is for testing columns_positive---------------
columns positive:
- objectid
- tree_height_sp1
- stem_count_single_sp1
- TotalStemCount.1
- stem_count_index_result_sp1
- mid_point_height_of_damage_cm_sp1
Notice that it returns the values of our three column groupings, that is, columns, columns_not_null and columns_positive.
There is also an additional function within columns_script_func.py called count_duplicate_columns. The intention of this function is to show duplicate columns in your downloaded Survey123 csv file. Using our large_tree_sp1_1.csv table again, we can see it has an extra column called Total Stem Count which has been appended with a .1.
ObjectID: appears 1 times
GlobalID: appears 1 times
Tree Species: appears 1 times
Tree Height (m): appears 1 times
Status: appears 1 times
Stem Type: appears 1 times
Total Stem Count: appears 1 times
Total Stem Count.1: appears 1 times
# Stem Count Metrics: appears 1 times
Stem Mode: appears 1 times
Damage: appears 1 times
Damage Cause: appears 1 times
Mid-Point height of damage (cm): appears 1 times
ParentGlobalID: appears 1 times
CreationDate: appears 1 times
Creator: appears 1 times
EditDate: appears 1 times
Editor: appears 1 times
Note that these are the column labels and not the column names; the latter appear as the schema names.
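One way such a duplicate check can be expressed is sketched below; this is an illustration, not the repository implementation:
from collections import Counter
import csv

def count_duplicate_columns_sketch(csv_path):
    """Sketch only: report how often each column label appears in the csv header."""
    with open(csv_path, newline="") as fh:
        header = next(csv.reader(fh))

    for label, count in Counter(header).items():
        print(f"{label}: appears {count} times")

# Example, using the table from the worked example above:
# count_duplicate_columns_sketch("../CPP_SP1_Small_Tree_Survey_17-04-2024/large_tree_sp1_1.csv")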
Ensure that the great expectations file follows the naming format cpp_small_tree_v1_great_expectations.py, where the version number is the version in use, e.g. v1, v2, v3, etc.
The Resources files
These files are found under the dab/resources/ folder. These .yml files contain the parameters to connect with Azure. They are crucial files, and not writing the values correctly may lead to data ingestion errors.
The changes to be made in these .yml files are as follows:
- Ensure the name key reflects the abbreviated name of your form, prepended with the survey123 string, e.g. survey123_cpp_small_tree_v2.
- Ensure the name of your form (using the format above) is reflected in the task_key and conf-file keys.
- Ensure the survey-id key reflects the ID of your form in ArcGIS Online.
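Before pushing, a quick plain-text scan such as the hypothetical helper sketched below (not part of the repository) can confirm that the form name and survey ID actually appear in the resources file:
def check_resources_file(path, form_name, survey_id):
    """Sketch only: warn if the expected form name or survey ID is missing from the file."""
    with open(path) as fh:
        text = fh.read()

    for expected in (form_name, survey_id):
        status = "OK" if expected in text else "WARNING: missing"
        print(f"{status} - {expected}")

# Example with hypothetical values:
# check_resources_file("dab/resources/survey123_cpp_small_tree_v2.yml",
#                      form_name="survey123_cpp_small_tree_v2",
#                      survey_id="<your_arcgis_form_id>")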
After all the pipeline_config.py, great_expectations and resources .yml files have been created, push your work to GitHub.
Overview of a pipeline config .yml file
Below we provide an overview of the structure of a pipeline config file, using the one for ct as an example.
resources:
This is the top-level section under which the pipeline's resources are defined; in this file it holds the job definitions described below.
jobs:
This section defines the jobs within the pipeline. In this case, there’s one job named survey123_ct.
name:
This specifies the user-friendly name of the job - “survey123_ct”.
email_notifications:
on_failure:
This defines who should receive email notifications if the job fails. Here, it’s set to databricks-ci, likely an email group related to Databricks, a data processing platform.
tasks:
This section defines the tasks executed within the job. Here, there are two tasks:
- task_key: This is an internal reference for the task within the pipeline.
- job_cluster_key: This specifies the job cluster to be used for the task (defined later).
- python_wheel_task: This indicates the task involves running a Python script using a specific wheel file. Here are additional details:
  - package_name: Name of the Python package to be used - "nip_lakehouse".
  - entry_point: The specific Python function/script to execute within the package - "survey123_ingestion_landing" or "survey123_landing_bronze".
  - named_parameters: These are parameters passed to the Python script:
    - conf-file: Path to a configuration file - retrieved dynamically using the user name in the workspace.
    - data-domain: Name of the data domain (likely related to data storage).
    - survey-id: Specific survey identifier - "312815883fbf4549a784eb5ad2e1e644".
- libraries: This specifies the Python wheel file location used for the task. The path includes variables like bundle.target.
The second task (survey123_ct_landing_bronze) is similar, with a different entry_point and an additional named parameter gx-file-path.
depends_on:
This is defined only for the second task (survey123_ct_landing_bronze). It specifies that this task depends on the successful completion of the first task (survey123_ct_ingestion_landing) before it can run.
job_clusters:
This section defines the job cluster configurations used by the tasks. Here, there’s one cluster defined:
- job_cluster_key: This references the key used in the tasks (job_cluster_task).
- new_cluster: This indicates a new cluster should be created for the tasks.
  - spark_version: Spark version for the cluster - "13.3.x-scala2.12" (Spark is a data processing framework).
  - node_type_id: VM type used for the cluster nodes - "Standard_D3_v2".
  - policy_id: This uses the variable ${var.arcgis_job_compute_policy_id}. Likely, this variable is defined elsewhere in Azure Pipelines and specifies a compute policy for the cluster.
  - num_workers: Number of worker nodes in the cluster - 1.
tags:
This section defines custom tags for the job:
- job_type: This specifies the type of job - "ingestion_&_validation".
- survey123_group: This groups the job by survey - "ct".
In summary, this pipeline defines a job with two tasks. Both tasks involve running Python scripts for data processing related to a survey (survey123, group ct). The tasks use a Databricks cluster for execution and rely on configuration files and specific data locations.