8. Significance of ArcGIS Online in the Data Ingestion process
Sometimes, errors in an ingestion job run are caused by settings in ArcGIS Online (AGOL). To put this in context, consider the job runs below: in the first run, both tasks were unsuccessful, while in the subsequent run the first task succeeded.
In a data ingestion run (for NS), there are two tasks:
- ingestion run: this loads data from AGOL into the NS Azure Data Lake Storage Gen2.
- bronze run: this extracts the data from Azure Data Lake Storage Gen2 into the bronze folders. The bronze folders represent the bronze stage, where the data is still raw, exactly as it came from AGOL without any modifications.
In the first job run, the first task failed, so the second task did not run at all.
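The dependency between the two tasks can be illustrated with a minimal sketch. It is not the actual orchestration code: `run_pipeline`, `failing_ingestion` and `bronze` are hypothetical names used only to show how a failure in the first task leaves the second task unstarted.

```python
from typing import Callable


def run_pipeline(tasks: list[tuple[str, Callable[[], None]]]) -> dict[str, str]:
    """Run tasks in order; once one fails, mark every later task as skipped."""
    statuses: dict[str, str] = {}
    upstream_failed = False
    for name, task in tasks:
        if upstream_failed:
            # A downstream task never starts when an upstream task has failed.
            statuses[name] = "skipped"
            continue
        try:
            task()
            statuses[name] = "succeeded"
        except Exception:
            statuses[name] = "failed"
            upstream_failed = True
    return statuses


def failing_ingestion() -> None:
    # Hypothetical stand-in for an ingestion run that hits an error.
    raise RuntimeError("simulated ingestion failure")


def bronze() -> None:
    # Hypothetical stand-in for the bronze run; does nothing here.
    pass


first_run = run_pipeline([("ingestion run", failing_ingestion), ("bronze run", bronze)])
print(first_run)
```

This is why fixing the ingestion run comes first: the bronze run cannot be diagnosed until its upstream task has succeeded.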
If you click on the first task, the one labelled survey123_gem_soil_respiration_ingestion_landing run, you will get an Error code: 400. In HTTP terms, a 400 status (Bad Request) means the server rejected the request as invalid, rather than the server itself failing.
If you go back one step and open the second task of the pipeline, the one labelled survey123_gem_soil_respiration_landing_bronze run, you will see that this task did not even start because the upstream task was unsuccessful. Although not always the case, the Error code: 400 is often associated with some settings not being activated in ArcGIS Online.
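To place Error code: 400 in context, the short sketch below uses Python's standard `http.HTTPStatus` enum to show which family a status code belongs to. The helper `describe_status` is a hypothetical name introduced for illustration; it is not part of the pipeline.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the standard phrase for an HTTP status code and its error family."""
    status = HTTPStatus(code)
    if 400 <= code < 500:
        family = "client error (the request was rejected)"
    elif 500 <= code < 600:
        family = "server error (the server failed)"
    else:
        family = "not an error"
    return f"{code} {status.phrase}: {family}"


print(describe_status(400))  # 400 Bad Request: client error (the request was rejected)
print(describe_status(500))  # 500 Internal Server Error: server error (the server failed)
```

A 400 therefore points at what was requested from AGOL, or at how the layer is configured to answer, which is consistent with the settings-related cause described here.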
8.1. Optimizing AGOL settings for data ingestion
To avoid such frustrating errors, we will use the example of the Gem Soil Respiration form, where at least the first task (ingestion run) was successful. We will use its AGOL settings as a template for the AGOL settings of all other forms.
Step 1: Signing in to AGOL
Sign in to your AGOL account using your provided NS credentials.
Step 2: Locating the feature layer
Go to the Content > My Groups tab of AGOL. You can only access the forms within the group you have been placed in. Type ‘GEM Soil Respiration’ and the Feature layer, Web map and Form links should appear.
Step 3: Settings
Click on the Gem Soil Respiration Feature layer link and go to the settings tab.
Under the General tab, ensure your settings are as in the below image.
Under the Feature layer (hosted) tab, ensure the settings are as below.
For the Manage indexes subsection, the settings should match as seen below.
Leave the Field indexes subsection untouched.
Once the above settings were checked and the Gem Soil Respiration pipeline was rerun, the first task (ingestion run) was successful.
The error that caused the second task (bronze landing run) to fail is related to some validation checks. More often than not, the ValueError: Encountered error … is related to some Great Expectations (GX) expectations not being met.
A look at the Great Expectations webpage shows that multiple values per record in some of the subtables are the likely suspect for breaking the pipeline.
From experience, one should not provide the values for columns that accept multiple choices. For example, in the gx yml files, the column with the schema name reasons_sp1 and its corresponding values have been commented out under the columns_mapped_values key. If it were an S123 field that accepts only one answer, rather than multiple, the pipeline would have worked.
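The sketch below illustrates why a multiple-choice field can break a value check of this kind. It is plain Python that mimics a "value must be in an allowed set" validation, not the actual GX API or the pipeline's code. It assumes that a multiple-choice S123 answer is stored as a single comma-separated string; the allowed values and the function names are hypothetical and do not come from the real gx yml files.

```python
# Hypothetical allowed values, for illustration only.
ALLOWED_REASONS = {"rain", "equipment_failure", "other"}


def unexpected_values(records: list[str], allowed: set[str]) -> list[str]:
    """Return the records that are not, as a whole string, in the allowed set."""
    return [value for value in records if value not in allowed]


def unexpected_choices(records: list[str], allowed: set[str]) -> list[str]:
    """Split each record on commas first, then check every individual choice."""
    return [
        choice
        for value in records
        for choice in value.split(",")
        if choice not in allowed
    ]


single_choice = ["rain", "other"]
multiple_choice = ["rain,other", "equipment_failure"]

# Single-choice answers pass the whole-string check.
print(unexpected_values(single_choice, ALLOWED_REASONS))  # []
# A combined answer is flagged even though each choice on its own is valid.
print(unexpected_values(multiple_choice, ALLOWED_REASONS))  # ['rain,other']
# Checking each choice separately would accept the same records.
print(unexpected_choices(multiple_choice, ALLOWED_REASONS))  # []
```

Under that assumption, listing allowed values for a multiple-choice column makes every combined answer look invalid, which matches the experience that leaving such columns out of columns_mapped_values lets the run pass.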