Versioning
Versioning is the management of multiple releases of a product as it is improved or updated. In the data lakehouse, the tables we create through the data ingestion process are also versioned; that is, we track their changes. For example, suppose there is a Survey123 (S123) form named soil_spectroscopy. When this form is ingested for the first time, it is designated as version 1 in the Great Expectations (gx) files.
The survey123_soil_spectroscopy.yml file within the resources folder follows the same layout as the other YAML files. However, suppose there is an error during the landing bronze stage in the Azure pipeline, or the same form has been republished several times. The pipeline may then be reading a wrong or outdated version of the form, so it is prudent to check the version of the table.
Table metadata
The version of each form run in a pipeline is recorded in the `ns-ii-tech-dev-catalog`.bronze.table_metadata table. To access it, open the SQL editor in Azure and run the following query:
SELECT * FROM `ns-ii-tech-dev-catalog`.bronze.table_metadata
The same table holds the metadata of every form. For this tutorial, we want the rows for the soil_spectroscopy form, which the following query returns:
SELECT * FROM `ns-ii-tech-dev-catalog`.bronze.table_metadata
WHERE survey_abbr = 'soil_spectroscopy';
In the schema_version column, we can see that the soil_spectroscopy form began at version 2 and then proceeded to version 3. Version 2 was valid for only one day before version 3 took over.
NB: All forms normally begin from version 1. The soil_spectroscopy form was a special case: it was a problematic form to create, so it began from version 2. To repeat: all tables must begin from version 1 except in very special cases.
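To make the reading of the schema_version column concrete, here is a minimal Python sketch that lists the versions recorded for one form and picks the latest. It assumes the query result has been exported as a list of dictionaries. Only the column names survey_abbr and schema_version come from the table above; the sample rows and the other_form entry are hypothetical.

```python
def schema_versions(rows, survey_abbr):
    """Return the distinct schema versions recorded for one survey, ascending."""
    return sorted(
        {int(row["schema_version"]) for row in rows if row["survey_abbr"] == survey_abbr}
    )


# Hypothetical rows: only the column names survey_abbr and schema_version
# come from the table; the values and "other_form" are made up.
rows = [
    {"survey_abbr": "soil_spectroscopy", "schema_version": 2},
    {"survey_abbr": "soil_spectroscopy", "schema_version": 3},
    {"survey_abbr": "other_form", "schema_version": 1},
]

versions = schema_versions(rows, "soil_spectroscopy")
latest = versions[-1]
print(versions, latest)  # [2, 3] 3
```

Every version in the returned list, not just the latest, is a candidate for its own set of versioned files.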
Now that the table metadata shows that the most recent version is version 3, let’s proceed to create the gx files for versions 2 and 3. But you may ask, “Why create files for version 2 when version 3 is the latest?” Based on experience, it is best to keep version 2 on record, so that historical changes can still be traced if version 3 receives further updates in the future.
Creating new versions of the <survey-abbreviation>_pipeline_config YAML files
The pipeline config files themselves do not change; they are simply placed in the respective tasks_surveys_v<number> directories. For example, for version 2 of the soil_spectroscopy form, the pipeline config file, soil_spectroscopy_pipeline_config.yml, is placed in the tasks_surveys_v2 folder, while that of version 3 goes into the tasks_surveys_v3 folder.
The contents of the pipeline configuration file are identical to those of the file in the tasks folder, which is reserved for version 1.
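The placement rule above can be sketched in Python as follows. This is an illustration under assumptions, not the project's tooling: the helper names are hypothetical, the source file is passed in explicitly, and the demo runs in a temporary directory with a placeholder file instead of a real conf folder.

```python
import shutil
import tempfile
from pathlib import Path


def tasks_dir_name(version: int) -> str:
    """Version 1 lives in the plain tasks folder; later versions get their own."""
    return "tasks" if version == 1 else f"tasks_surveys_v{version}"


def copy_pipeline_config(source: Path, conf_root: Path, version: int) -> Path:
    """Copy a pipeline config, unchanged, into the folder for the given version."""
    target_dir = conf_root / tasks_dir_name(version)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    if target.resolve() != source.resolve():  # nothing to do if already in place
        shutil.copyfile(source, target)
    return target


# Demo in a throwaway directory; the file content is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    conf_root = Path(tmp)
    (conf_root / "tasks").mkdir()
    source = conf_root / "tasks" / "soil_spectroscopy_pipeline_config.yml"
    source.write_text("placeholder: true\n")

    copied = copy_pipeline_config(source, conf_root, 2)
    copied_relative = copied.relative_to(conf_root).as_posix()
    contents_match = copied.read_text() == source.read_text()

print(copied_relative)  # tasks_surveys_v2/soil_spectroscopy_pipeline_config.yml
```

The copy is byte-for-byte, which matches the rule that the pipeline config content does not change between versions.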
Creating new versions of the survey123_<survey-abbreviation> files
Creating a new version essentially means creating the same set of files (survey123_<survey-abbreviation>.yml, <survey-abbreviation>_pipeline_config.yml and <survey-abbreviation>-<version-number>.great_expectations.yml), but in the folders for that version. Starting with the survey123_<survey-abbreviation>.yml file, the files for version 2 and version 3 go into the resources_surveys_v2 and resources_surveys_v3 folders respectively. Below you can see the survey123_soil_spectroscopy YAML files for version 2 and version 3 in their respective resources directories.
Within each of the survey123_soil_spectroscopy YAML files for version 2 and version 3, the following should be updated:
- The key under jobs: and the name: value in the resources tree should have the version suffix (here _v2) appended, like so:
resources:
  jobs:
    survey123_soil_spectroscopy_v2:
      name: survey123_soil_spectroscopy_v2
      email_notifications:
        on_failure:
          - databricks-ci
- The conf-file key should be updated to read tasks_surveys_v2 or tasks_surveys_v3, depending on the version number:
conf-file: /dbfs/User/${workspace.current_user.userName}/conf/tasks_surveys_v2/soil_spectroscopy_pipeline_config.yml
- Under the tags tree, the survey123_version key should be set to the value v2, like so:
tags:
  job_type: ingestion_&_validation
  survey123_group: soil_spectroscopy
  survey123_version: v2
The images below demonstrate this for version 2.
Only the pipeline config files remain unchanged during versioning. They are copied and pasted into the respective tasks_surveys_v<number> directories.
The updates made to the version 2 files are replicated for version 3, and for any future version for that matter, with the appropriate v<number>. For example, version 3 files will have v3 wherever the version 2 files have v2.
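The three version-dependent values in the survey123 resources file can be derived from the survey abbreviation and the version number. The Python sketch below shows one way to compute them. The function name and the dictionary keys are hypothetical; the naming patterns are taken from the version 2 example above. Version 1 is rejected because its naming is not shown in this section.

```python
def versioned_job_settings(survey_abbr: str, version: int) -> dict:
    """Return the values that change in the survey123 resources YAML per version."""
    if version < 2:
        raise ValueError("this sketch only covers version 2 and later")
    suffix = f"v{version}"
    job_name = f"survey123_{survey_abbr}_{suffix}"
    return {
        # Key under jobs: and the name: value share the same string.
        "job_key": job_name,
        "name": job_name,
        # ${workspace.current_user.userName} is left literal for the bundle to resolve.
        "conf_file": (
            "/dbfs/User/${workspace.current_user.userName}/conf/"
            f"tasks_surveys_{suffix}/{survey_abbr}_pipeline_config.yml"
        ),
        "survey123_version": suffix,
    }


settings_v3 = versioned_job_settings("soil_spectroscopy", 3)
print(settings_v3["name"])  # survey123_soil_spectroscopy_v3
```

Comparing these values against a hand-edited file is a quick way to spot a leftover v2 in a version 3 file.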
Creating new versions of the <survey-abbreviation>-<version-number>.great_expectations YAML files
Likewise, the gx files, named in the format <survey-abbreviation>-<version-number>.great_expectations.yml, go into their respective gx_development_surveys_v<number> directories. For example, version 2 gx files go into the gx_development_surveys_v2 folder and version 3 gx files into the gx_development_surveys_v3 folder.
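The folder naming used for the three file types follows one pattern, which can be summarized in a short Python sketch. The function name and the dictionary keys are hypothetical labels; only the folder name patterns come from the text. Version 1 is excluded because its gx folder is not named in this section.

```python
def version_directories(version: int) -> dict:
    """Map each file type to the folder that holds it for a given version."""
    if version < 2:
        raise ValueError("this sketch only covers version 2 and later")
    return {
        "pipeline_config": f"tasks_surveys_v{version}",
        "resources": f"resources_surveys_v{version}",
        "great_expectations": f"gx_development_surveys_v{version}",
    }


dirs_v3 = version_directories(3)
for file_type, folder in dirs_v3.items():
    print(f"{file_type}: {folder}")
```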
The key thing to note is that the survey_version key must be updated to match the version number. For example, for version 2 the gx file will look like this:
secret_scope:
  name: "db_ss_jobs_params"
survey_abbr:
  - soil_spectroscopy
survey_version: 2
tables:
  soil_spectroscopy_sublayer_survey:
    columns:
      - objectid
      - globalid
      # -- snip --
And for version 3:
secret_scope:
  name: "db_ss_jobs_params"
survey_abbr:
  - soil_spectroscopy
survey_version: 3
tables:
  soil_spectroscopy_sublayer_survey:
    columns:
      - objectid
      - globalid
Everything else remains the same.
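Since survey_version is the only line that differs between the two gx files, a new version can be derived from an existing one by rewriting that single line. The Python sketch below does this with a regular expression on the file text. The function name is hypothetical, and it assumes survey_version appears exactly once as a top-level key holding an integer.

```python
import re


def set_survey_version(gx_yaml_text: str, version: int) -> str:
    """Rewrite the top-level survey_version line and leave everything else untouched."""
    updated, count = re.subn(
        r"^survey_version:[ \t]*\d+[ \t]*$",
        f"survey_version: {version}",
        gx_yaml_text,
        flags=re.MULTILINE,
    )
    if count != 1:
        raise ValueError(f"expected exactly one survey_version line, found {count}")
    return updated


# Shortened stand-in for the version 2 gx file shown above.
v2_text = (
    "survey_abbr:\n"
    "  - soil_spectroscopy\n"
    "survey_version: 2\n"
    "tables:\n"
)
v3_text = set_survey_version(v2_text, 3)
print(v3_text)
```

Raising an error when the line is missing or duplicated avoids silently producing a file that still carries the old version.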
NB: During versioning, only the pipeline config files remain the same throughout, even when pasted into their respective tasks_surveys_v<number> directories!
Merging to the dev branch
After making the necessary updates, commit the files and push them to your remote Git branch. From there, merge them into the dev branch and repeat the entire data ingestion process. See this section for the data ingestion process.