When training a model, a machine learning engineer optimizes the model's parameters to minimize the error on the collected training dataset. Success at this task, however, does not mean the model will do well on unseen data. That is why it is common practice to evaluate the model on a test dataset containing data that the model has never seen before. Yet even a good result on the test dataset doesn't necessarily mean that the model will do well in production.
Data in production can differ from the data the model was trained on in a number of ways. Different pre-processing techniques, temporal drift, and subtle differences between data sources are just a few of the reasons why the data we've collected might not exactly match production data.
For a successful deployment, there are therefore a number of additional checks and tests we should perform to make sure we are deploying a robust system to the field. Here are a few things to consider:
- A system to detect degradation in our model performance in real time and log data with high error or model uncertainty.
- A system that can compare features in production data with expected data features to detect whether the environment has changed and we risk making incorrect predictions.
- A system to capture novel data points that could be interesting to label and add to the training dataset.
- Traditional rule-based systems that block outputs significantly out of bounds for the problem at hand, as a safeguard against catastrophic failures.
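As a rough illustration of the feature-comparison idea in the list above, here is a minimal sketch that scores how far a production feature has moved from its reference distribution using the Population Stability Index (PSI). PSI is only one of several possible drift measures, and the function name `psi`, the bin count, and the 0.2 alert threshold are assumptions made for this example; the threshold is a commonly quoted rule of thumb that should be tuned per feature.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index of `actual` relative to `expected`.

    Bins are equal-width over the range of the reference (expected) sample.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical data: a reference feature and a production feature that has shifted.
reference = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]

DRIFT_THRESHOLD = 0.2  # assumed alert level, not a universal constant
drift_detected = psi(reference, shifted) > DRIFT_THRESHOLD
```

In a monitoring job, a score above the threshold would typically raise an alert and flag the affected records for review rather than block predictions outright.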
With these systems in place, we can implement a continuous monitoring and retraining system that updates the model as we obtain more data. In this setting, it is crucial to keep the data under version control so that we know what data each model was trained on. This helps us debug issues with model performance by checking whether a given model was trained on the correct data.
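One lightweight way to record which data a model was trained on is to store a content hash of the training set next to the model. The sketch below is only an illustration under assumed conventions: records are JSON-serializable dictionaries, and the names `dataset_fingerprint` and `training_manifest` are hypothetical. Dedicated data versioning tools offer far more than this.

```python
import hashlib
import json
from typing import Dict, List


def dataset_fingerprint(records: List[dict]) -> str:
    """SHA-256 hex digest of a dataset, independent of record order."""
    # Hash each record on its own, with sorted keys so key order does not matter.
    row_hashes = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode("utf-8")).hexdigest()
        for r in records
    )
    # Combine the sorted per-record hashes into one digest.
    digest = hashlib.sha256()
    for h in row_hashes:
        digest.update(h.encode("ascii"))
    return digest.hexdigest()


def training_manifest(model_name: str, records: List[dict]) -> Dict[str, object]:
    """Metadata to save alongside a trained model (field names are assumptions)."""
    return {
        "model": model_name,
        "n_records": len(records),
        "data_sha256": dataset_fingerprint(records),
    }


# Hypothetical example records.
data = [{"image": "a.png", "label": "dog"}, {"image": "b.png", "label": "wolf"}]
manifest = training_manifest("classifier-v1", data)
```

When a model misbehaves, recomputing the fingerprint of the dataset we believe it was trained on and comparing it with the stored value tells us quickly whether the two match.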
If human labeling is necessary, we need a process for labeling new data efficiently. This includes a fast labeling platform with a way to track which data has and has not been labeled; a set of guidelines for human labelers that clarify vague definitions and say what to do with ambiguous cases; a verification procedure to catch mistakes made by the labelers themselves; and data analytics so that our researchers can check that dataset-level statistics make sense.
In realistic scenarios, certain labels are less frequent than others, and we need to make sure that we balance our datasets. That means choosing which data we present to labelers so that they spend their time on the instances we are most concerned with. For example, if we have a dataset with 99% dogs and only 1% wolves, we might not want to label 99 dog images for every wolf image we label. It may be better to label only 10 dogs for every wolf to maximize the return on our labeling effort.
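The 10-to-1 idea can be sketched as a small queue builder. Since true labels are unknown before labeling, this example assumes that a provisional class guess (for instance from the current model) is available for each unlabeled item; that assumption, the function name `build_labeling_queue`, and the parameter names are all hypothetical.

```python
import random
from typing import List, Sequence


def build_labeling_queue(
    items: Sequence[str],
    predicted: Sequence[str],
    rare_class: str,
    max_ratio: int = 10,
    seed: int = 0,
) -> List[str]:
    """Keep every item guessed to be rare and at most `max_ratio` common items per rare one."""
    rng = random.Random(seed)
    rare = [x for x, p in zip(items, predicted) if p == rare_class]
    common = [x for x, p in zip(items, predicted) if p != rare_class]
    # Cap the number of common items; if no rare items are found, the queue is empty.
    keep = min(len(common), max_ratio * len(rare))
    queue = rare + rng.sample(common, keep)
    rng.shuffle(queue)  # avoid showing labelers all rare items first
    return queue


# Hypothetical pool: 1000 images, 1% of them provisionally guessed to be wolves.
pool = [f"img_{i}" for i in range(1000)]
guesses = ["wolf" if i % 100 == 0 else "dog" for i in range(1000)]
queue = build_labeling_queue(pool, guesses, rare_class="wolf")
```

With this pool the queue holds 110 images (10 guessed wolves and 100 guessed dogs) instead of 1000. Because the provisional guesses can be wrong, it is worth sending a small random sample of the discarded items to labelers as well, so that missed wolves are not lost systematically.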
Finally, when deploying an AI system, we need to consider the risk profile of mistakes. Mistakes are unavoidable, and even humans make them. In fact, we've observed that, in general, a well-trained AI model makes fewer mistakes on average than a human. However, when an AI model does make a mistake, it can miss the mark by much more than a human would, and that is a problem for risk mitigation.
So it is important to determine the level of autonomy the system will have, how many safeguards it needs, and how extensively it must be tested before deployment. For example, a recommendation system for a video website can probably be deployed more quickly and iterated on much faster than an industrial control system, where a catastrophic error could cause a physical malfunction.
Recently there has been a great push towards "explainable AI" systems. However, in our experience these systems don't really work as advertised. State-of-the-art AI models work precisely by performing calculations in a high-dimensional space, and projecting those calculations back into a low-dimensional space for a human to visualize oversimplifies the model's workings. Such projections are not enough to tell whether the model will generalize correctly when it encounters very unfamiliar data. While visualization techniques have value for debugging and for getting an idea of what the model is doing, we must do more if we want to prevent catastrophic failures.
Instead, it is necessary to test the model extensively in unfamiliar situations and to present it with "edge cases": artificially constructed data points meant to cause the model to malfunction. For example, in image classification, researchers have created a dataset of visually ambiguous images to determine whether classification models fail gracefully or just make random predictions. This type of testing helped shape a new generation of more robust algorithms.
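One simple way to make "failing gracefully" testable is to let the classifier abstain when its predicted class probabilities are too spread out, and then check that it abstains on ambiguous inputs. The sketch below is an assumption-laden illustration, not the method used in the research mentioned above: it uses normalized entropy as the uncertainty score, and the function names and the 0.5 threshold are hypothetical.

```python
import math
from typing import Optional, Sequence


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of a probability vector, scaled to the range [0, 1]."""
    if len(probs) < 2:
        raise ValueError("need at least two classes")
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def predict_or_abstain(
    probs: Sequence[float],
    labels: Sequence[str],
    max_entropy: float = 0.5,  # assumed threshold; tune on held-out data
) -> Optional[str]:
    """Return the most likely label, or None when the model is too uncertain."""
    if normalized_entropy(probs) > max_entropy:
        return None
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best]


classes = ["dog", "wolf", "cat"]
clear_case = predict_or_abstain([0.97, 0.02, 0.01], classes)      # confident input
ambiguous_case = predict_or_abstain([0.34, 0.33, 0.33], classes)  # near-uniform input
```

An edge-case test suite can then assert that the clear input yields a label while the ambiguous one yields `None`. Note that a model can also be confidently wrong on unfamiliar data, so an entropy check of this kind complements, rather than replaces, testing on deliberately constructed edge cases.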