Versioning

Versioning is the management of multiple releases of a product, whether improved or updated. In the data lakehouse, the tables we create through the data ingestion process are also versioned; that is, we track their changes as well. For example, suppose we have an S123 form named soil_spectroscopy. When this form is ingested for the first time, it is designated as version 1 in the Great Expectations (gx) files.

Version 1

The survey123_soil_spectroscopy.yml file within the resources folder remains structured like the other YAML files. However, if an error occurs during the landing-to-bronze stage of the Azure pipeline, or the same form has been republished several times, the pipeline files may end up reading a wrong or outdated version of the form. In that case, it is prudent to check the version of the table.

Table metadata

The version of each form run in a pipeline is recorded in the `ns-ii-tech-dev-catalog`.bronze.table_metadata table. To access it, open the SQL editor in Azure and run the following query:

SELECT * FROM `ns-ii-tech-dev-catalog`.bronze.table_metadata

Table metadata

Now, say you want to access the metadata of a particular form; for this tutorial’s purposes, that is the soil_spectroscopy form. We would do so via this query:

SELECT * FROM `ns-ii-tech-dev-catalog`.bronze.table_metadata
WHERE survey_abbr = 'soil_spectroscopy';
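If you only need the most recent version rather than the full history, the same table can be aggregated. This is a sketch that assumes schema_version is the numeric column shown in table_metadata:

```sql
-- Latest recorded schema version for the soil_spectroscopy form
SELECT MAX(schema_version) AS latest_version
FROM `ns-ii-tech-dev-catalog`.bronze.table_metadata
WHERE survey_abbr = 'soil_spectroscopy';
```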

Soil spectroscopy versioning

We can see in the schema_version column that the soil_spectroscopy form began at version 2 and then proceeded to version 3. Version 2 was valid for only one day before version 3 took over.

NB: All forms will begin from version 1, but this form was a special case and problematic to create, so it began from version 2. To repeat: all tables must begin from version 1 except in very special cases.

Now that the table metadata shows the most recent version is version 3, let’s proceed to create the gx files for versions 2 and 3. You may ask, “Why create files for version 2 when version 3 is the latest?” Based on experience, it is best to keep version 2 on record, so as to monitor historical changes if version 3 gets more updates in the future.

Creating new versions of the <survey-abbreviation>_pipeline_config YAML files

The pipeline config files remain the same; they are simply placed in the respective tasks_surveys_v<number> directories. For example, for version 2 of the soil_spectroscopy form, the pipeline config file (soil_spectroscopy_pipeline_config.yml) will be placed in the tasks_surveys_v2 folder, while that for version 3 will go into the tasks_surveys_v3 folder.
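The folder-naming rule above can be sketched as a small helper. This is a minimal sketch; the function name is hypothetical, and the directory names follow the `<family>_surveys_v<number>` convention used throughout this page:

```python
def versioned_dir(family: str, version: int) -> str:
    """Return the folder name for a given file family and form version.

    family is one of "tasks", "resources", or "gx_development";
    version 1 files stay in the original base folders (e.g. plain tasks).
    """
    if family not in ("tasks", "resources", "gx_development"):
        raise ValueError(f"unknown file family: {family}")
    if version < 2:
        raise ValueError("version 1 files live in the base folders")
    # e.g. tasks_surveys_v2, resources_surveys_v3, gx_development_surveys_v2
    return f"{family}_surveys_v{version}"

print(versioned_dir("tasks", 2))           # tasks_surveys_v2
print(versioned_dir("gx_development", 3))  # gx_development_surveys_v3
```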

Pipeline configuration file

All the contents of the pipeline configuration file remain the same as those within the tasks folder reserved for version 1.

Creating new versions of the survey123_<survey-abbreviation> files

The process of creating new versions of the gx files essentially involves creating the same files (survey123_<survey-abbreviation>.yml, <survey-abbreviation>_pipeline_config.yml and <survey-abbreviation>-<version-number>.great_expectations.yml), but now in the respective versioned folders. Starting with the survey123_<survey-abbreviation>.yml file, any version 2 or version 3 file will go into the resources_surveys_v2 or resources_surveys_v3 folder respectively. Below you can see the survey123_soil_spectroscopy YAML files for version 2 and version 3 in their respective resources directories.

Folders for versions in resources

Within each of the survey123_soil_spectroscopy YAML files for version 2 and version 3, the following should be updated:

  1. The job key under jobs: and the name: key in the resources tree should have v2 appended to their names, like so:
resources:
  jobs:
    survey123_soil_spectroscopy_v2:
      name: survey123_soil_spectroscopy_v2
      email_notifications:
        on_failure:
          - databricks-ci

  2. The conf-file key should be updated to point to tasks_surveys_v2 or tasks_surveys_v3, depending on the version number.
conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks_surveys_v2/soil_spectroscopy_pipeline_config.yml

  3. Under the tags tree, the survey123_version key should be updated to the value v2, like so:
tags:
  job_type: ingestion_&_validation
  survey123_group: soil_spectroscopy
  survey123_version: v2
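All three updates must agree on the same v<number> suffix, and a small consistency check can catch a missed spot before committing. This is a minimal sketch: the function name is hypothetical, and the dictionary mimics the shape of the parsed resource YAML above:

```python
def check_version_suffix(resource: dict, expected: int) -> list:
    """Check that a parsed survey123_* resource YAML uses a consistent
    v<number> suffix in the job key, the name, the conf-file path and
    the survey123_version tag. Returns a list of mismatch messages."""
    problems = []
    suffix = f"v{expected}"
    for job_key, job in resource["resources"]["jobs"].items():
        if not job_key.endswith(f"_{suffix}"):
            problems.append(f"job key {job_key!r} lacks _{suffix}")
        if not job["name"].endswith(f"_{suffix}"):
            problems.append(f"name {job['name']!r} lacks _{suffix}")
        if f"tasks_surveys_{suffix}/" not in job.get("conf-file", ""):
            problems.append(f"conf-file does not point at tasks_surveys_{suffix}")
        if job.get("tags", {}).get("survey123_version") != suffix:
            problems.append(f"survey123_version tag is not {suffix}")
    return problems

# Dict mirroring the version 2 YAML shown above (conf-file path shortened)
resource = {
    "resources": {"jobs": {"survey123_soil_spectroscopy_v2": {
        "name": "survey123_soil_spectroscopy_v2",
        "conf-file": "/dbfs/User/me/conf/tasks_surveys_v2/soil_spectroscopy_pipeline_config.yml",
        "tags": {"survey123_version": "v2"},
    }}}
}
print(check_version_suffix(resource, 2))  # [] -> everything consistent
```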

The images below demonstrate this for version 2.

Configuration files for version 2

Configuration files for version 2

Only the pipeline config files remain unchanged during versioning; they are copied as-is into the respective tasks_surveys_<version-number> directories.

The updates done for the version 2 files will also be replicated for version 3, or any other future version for that matter, but with the appropriate v<number>. For example, version 3 files will use v3 in their names.

Configuration files for version 3

Configuration files for version 3

Creating new versions of the <survey-abbreviation>-<version-number>.great_expectations YAML files

Likewise, the gx files, named in the format <survey-abbreviation>-<version-number>.great_expectations.yml, will go into their respective gx_development_surveys_v<number> directories. For example, version 2 gx files will go into the gx_development_surveys_v2 folder, and version 3 files into the gx_development_surveys_v3 folder.

The key thing to note is that the survey_version key must be updated to match the correct version number. For example, for version 2 the gx file will look like this:

secret_scope:
  name: "db_ss_jobs_params"
survey_abbr:
  - soil_spectroscopy
survey_version: 2
tables:
  soil_spectroscopy_sublayer_survey:
    columns:
      - objectid
      - globalid
-- snip --

And for version 3:

secret_scope:
  name: "db_ss_jobs_params"
survey_abbr:
  - soil_spectroscopy
survey_version: 3
tables:
  soil_spectroscopy_sublayer_survey:
    columns:
      - objectid
      - globalid

Everything else remains the same.
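Since the survey_version key must track the gx_development_surveys_v<number> folder the file lives in, a small guard can catch mismatches. This is a minimal sketch; the function name and the example file path are hypothetical:

```python
import re

def gx_folder_matches(path: str, survey_version: int) -> bool:
    """True if the gx_development_surveys_v<number> folder in `path`
    agrees with the survey_version value inside the gx YAML file."""
    m = re.search(r"gx_development_surveys_v(\d+)/", path)
    return bool(m) and int(m.group(1)) == survey_version

# Illustrative path following the naming convention on this page
path = "gx_development_surveys_v3/soil_spectroscopy-3.great_expectations.yml"
print(gx_folder_matches(path, 3))  # True
print(gx_folder_matches(path, 2))  # False
```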

NB: During versioning, only the pipeline config files remain the same throughout, even when copied into their respective tasks_surveys_v<number> directories!

Merging to dev branch

After performing the necessary updates, commit the files and push them to your remote Git branch. From there, they are merged into the dev branch and the entire data ingestion process is repeated! See this section for the data ingestion process.