Often overlooked, the very first step in AI model development is determining a combination of model and data that fulfills our objectives. This involves specifying exactly what the model will output and which objective will be optimized.
With the model's inputs and outputs in mind, we have an idea of the dataset required to train it successfully. The first step is to take stock of how much data we have now and how much more we will need to collect. If we do need additional data, there are trade-offs to consider: how much it costs to acquire additional data points, and how much it costs to label them (which can vary with the level of expertise required of the human labelers).
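The trade-off above can be made concrete with a back-of-the-envelope calculation. This is a hypothetical sketch; the function name and all cost figures are illustrative assumptions, not figures from the text.

```python
# Hypothetical sketch: estimating the cost of growing a labeled dataset.
# All numbers below are illustrative assumptions.

def dataset_cost(existing: int, target: int,
                 acquire_cost: float, label_cost: float) -> float:
    """Cost of acquiring and labeling enough new points to reach `target`."""
    needed = max(0, target - existing)          # points still to collect
    return needed * (acquire_cost + label_cost)  # per-point acquisition + labeling

# e.g. 10k points on hand, 50k needed; $0.05 to collect, $0.30 for an expert label
print(dataset_cost(10_000, 50_000, 0.05, 0.30))  # 14000.0
```

Varying `label_cost` in such a sketch shows how quickly expert labeling dominates the budget.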
Depending on these variables we can estimate the size of the initial training dataset, and compare its cost against the cost of developing training techniques that require more or less data. For example, data augmentation, regularization, and transfer learning make it feasible to train on smaller datasets, while techniques such as self-supervised learning may make it possible to train on unlabeled data. All of these techniques, however, increase the time needed to develop the model.
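To make the data-augmentation idea concrete, here is a minimal sketch that expands a small labeled dataset by adding jittered copies of each example. The function and noise level are illustrative assumptions; real augmentation pipelines are domain-specific (flips and crops for images, synonym swaps for text, and so on).

```python
import random

# Minimal data-augmentation sketch: grow a small labeled dataset by
# adding noisy variants of each example, keeping the label unchanged.

def augment(dataset, copies=3, noise=0.05, seed=0):
    """Return the original examples plus `copies` jittered variants of each."""
    rng = random.Random(seed)
    out = []
    for features, label in dataset:
        out.append((features, label))
        for _ in range(copies):
            jittered = [x + rng.uniform(-noise, noise) for x in features]
            out.append((jittered, label))  # label is preserved
    return out

data = [([1.0, 2.0], "a"), ([3.0, 4.0], "b")]
print(len(augment(data)))  # 8 examples grown from 2 originals
```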
After considering the situation with the data and having decided on a modelling technique to handle it, engineers can start implementing the model architecture. That is, an algorithm that takes in data and produces an output, which could be a prediction or a decision. To train it, we run an iterative learning process on the computer: the algorithm is fed the training data, its predictions are compared to the desired outputs, and an error signal is calculated. From this error signal, we update the model's parameters until the error is minimized.
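The learning loop described above can be sketched in a few lines. This is a deliberately minimal example, assuming the simplest possible "model" (a single weight fit by gradient descent on squared error); the data and learning rate are illustrative.

```python
# Minimal sketch of the iterative learning loop: feed data through the
# model, compare predictions to desired outputs, compute an error signal,
# and update the parameters to reduce that error.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, desired output) pairs
w = 0.0    # the model's single parameter
lr = 0.05  # learning rate

for epoch in range(200):
    for x, y in data:
        pred = w * x          # the model's prediction
        error = pred - y      # error signal vs. desired output
        w -= lr * error * x   # parameter update from the error signal

print(round(w, 3))  # converges toward 2.0, the underlying relationship
```

Real models have millions of parameters and use automatic differentiation, but the loop has the same shape: predict, compare, update.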
Once the model is trained, it is important to check that it has not simply memorized the training data and that it generalizes in a way that is going to be helpful. We therefore need an analysis step in which we evaluate the model on a test set of data it has never seen before, and measure how robust it is to perturbations. We may then need to optimize it further by tuning hyperparameters or by retraining with a different architecture.
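The analysis step can be sketched as follows. The threshold classifier and the tiny test set here are illustrative stand-ins for any trained model and held-out data; the perturbation check simply re-scores the model on noised inputs.

```python
import random

# Sketch of the analysis step: score a model on held-out data, then
# check how accuracy changes under small input perturbations.

def model(x):                       # stand-in for a trained classifier
    return 1 if x > 0.5 else 0

test_set = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]  # unseen (input, label) pairs

def accuracy(examples, perturb=0.0, seed=0):
    """Fraction of correct predictions, optionally on perturbed inputs."""
    rng = random.Random(seed)
    correct = 0
    for x, y in examples:
        noisy = x + rng.uniform(-perturb, perturb)  # simulated perturbation
        correct += model(noisy) == y
    return correct / len(examples)

print(accuracy(test_set))               # clean test-set accuracy
print(accuracy(test_set, perturb=0.2))  # accuracy under perturbation
```

A large gap between the two numbers is a signal that the model is brittle, even if its clean test accuracy looks good.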
During this iterative process we may realize that some of our prior assumptions were incorrect. At that point, we might decide to change either the training data or the algorithm and train again. That means we have another, higher-level learning loop at the human stage. This is very different from a traditional software development process, where we can write a test, write the code, and immediately check whether the code is correct.
It is then crucial to understand whether we have budgeted enough time for this iterative development loop. That depends on whether this kind of model has been built before, so we can take the learnings from another team, or whether we are developing something from scratch; and on the tolerances for accuracy and robustness, which can drive development time and cost much higher. In certain cases, if the way the model interfaces with the environment is controlled by a human or involves some kind of feedback loop, it may be acceptable to trade some accuracy for a faster turnaround time in development.
From the compute perspective, this iterative process can take a long time, and training the model multiple times for different architectures, hyperparameters, and so on consumes a very large amount of computational power. Additionally, the machine learning engineer will often need to wait for an experiment's results before proceeding with the learning loop. So when budgeting for the development process, we need to account both for the longer research and development time and for the compute cost of training the model multiple times.
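A rough compute budget for this loop can be estimated with simple arithmetic. Every figure below is an illustrative assumption; the point is that the cost of training once must be multiplied by the number of architectures and hyperparameter settings tried, and again by the number of human-level iterations.

```python
# Back-of-the-envelope compute budgeting for the iterative loop.
# All figures are illustrative assumptions, not real prices.

runs_per_sweep = 12      # architectures x hyperparameter settings tried
sweeps = 4               # iterations of the higher-level, human learning loop
hours_per_run = 6.0      # wall-clock training time per run
cost_per_gpu_hour = 2.5  # assumed cloud price, USD

total_runs = runs_per_sweep * sweeps
compute_cost = total_runs * hours_per_run * cost_per_gpu_hour
wait_time = sweeps * hours_per_run  # engineer waits on at least one run per sweep

print(total_runs, compute_cost, wait_time)  # 48 720.0 24.0
```

Even with these modest assumptions the budget is dominated by the multiplier on runs, which is why both compute cost and engineer waiting time need to appear in the plan.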