Species Classification Model Training

Data Preparation

The first step in training is to prepare the data: download it from the relevant data stream, then split it into training, validation, and testing sets. The training set is used to fit the model, the validation set is used to evaluate the model during training, and the testing set is used only once training is complete. Keeping the testing set out of training gives an unbiased estimate of how well the model generalises, which is how overfitting to the training data is detected.

The data should be split in a 70:15:15 ratio for training, validation, and testing, respectively. The split should be random and stratified by species, so that each species makes up the same proportion of the training, validation, and testing sets. This ensures that every set is a representative sample of the data.
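The stratified 70:15:15 split described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual pipeline: the function name `stratified_split`, the `seed` parameter, and the representation of the data as a list of species labels are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) lists of indices, stratified by label.

    `labels` holds one species label per image. The name, signature and
    default seed are hypothetical, chosen only for this sketch.
    """
    # Group image indices by species so each species is split separately.
    by_species = defaultdict(list)
    for index, species in enumerate(labels):
        by_species[species].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val, test = [], [], []
    for species in sorted(by_species):
        indices = by_species[species]
        rng.shuffle(indices)  # random assignment within the species
        n_train = round(len(indices) * ratios[0])
        n_val = round(len(indices) * ratios[1])
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        # Whatever remains goes to the testing set.
        test.extend(indices[n_train + n_val:])
    return train, val, test


# Toy example: 100 "lion" images followed by 20 "zebra" images.
labels = ["lion"] * 100 + ["zebra"] * 20
train, val, test = stratified_split(labels)
```

Because the split is done per species, very rare species may end up with no validation or testing images after rounding, so per-species counts are worth checking before training.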

Our data is stored in the following location on Azure cloud storage:

Subscription: natural-state-technical-sub
Storage account: nsiiaitraindatasa
Container: nsiiaiafricadatatrainsc
URL: https://nsiiaitraindatasa.blob.core.windows.net/nsiiaiafricadatatrainsc

We also have access to a very large dataset of labelled camera trap images from the Snapshot Serengeti and Panthera projects. This dataset is stored in the following location on Azure cloud storage:

Subscription: natural-state-technical-sub
Storage account: nsiiafricactdatasa
Container: nsiiafricactdatasc
URL: https://nsiiafricactdatasa.blob.core.windows.net/nsiiafricactdatasc

Together, these datasets comprise approximately 12 million images from the following range states: Angola, Benin, Gabon, Ghana, Kenya, Mozambique, Namibia, Senegal, South Africa, Tanzania, Zambia, and Zimbabwe.
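As a starting point for downloading, the two containers listed above could be browsed with the `azure-storage-blob` and `azure-identity` packages, roughly as sketched below. The container URLs are copied from this document. Everything else is an assumption: the helper names, the dictionary keys, the use of `DefaultAzureCredential` for authentication, and the `prefix` argument, since the folder layout inside the containers is not described here.

```python
from urllib.parse import urlparse

# Container URLs copied from the storage locations listed above.
# The dictionary keys are hypothetical labels used only in this sketch.
CONTAINER_URLS = {
    "training": "https://nsiiaitraindatasa.blob.core.windows.net/nsiiaiafricadatatrainsc",
    "camera_trap": "https://nsiiafricactdatasa.blob.core.windows.net/nsiiafricactdatasc",
}


def split_container_url(url):
    """Return (storage_account, container) parsed from a container URL."""
    parsed = urlparse(url)
    account = parsed.netloc.split(".")[0]
    container = parsed.path.strip("/")
    return account, container


def list_blob_names(container_url, prefix=None, limit=10):
    """Return up to `limit` blob names from a container.

    Needs network access and read permission on the container.
    Authentication via DefaultAzureCredential is an assumption; a SAS
    token or account key may be what is actually used.
    """
    # Imported here so the URL helper above works without the Azure SDK.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import ContainerClient

    client = ContainerClient.from_container_url(
        container_url, credential=DefaultAzureCredential()
    )
    names = []
    for blob in client.list_blobs(name_starts_with=prefix):
        names.append(blob.name)
        if len(names) >= limit:
            break
    return names
```

For example, `list_blob_names(CONTAINER_URLS["training"])` would return the first ten blob names in the training container, provided the caller is signed in with an identity that has read access.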