Restoring Credibility Of Machine Learning Pipeline Output Through Blockchain Data

For the past few years we have kept hearing the same story: machine learning (ML) is going to turn every domain upside down. Outside of practitioners and a few geeks, most people are not aware of the nuances of ML. ML is definitely related to Artificial Intelligence (AI); whether it is a pure subset or a closely related area depends on who you ask. The dream of general AI, in which machines use cognitive skills to solve previously unseen problems in all domains, turned into an AI winter because the approach did not yield results for more than forty or fifty years. The resurgence of ML turned the field around. ML became tractable as the horsepower of computers increased and much more data about different domains became available to train models. ML turned the focus away from trying to model the whole world using data and symbolic logic, and toward making predictions using statistical methods on narrow domains.

In general, there are three separate approaches in ML: supervised learning, semi-supervised learning and unsupervised learning. Their differences stem from the degree of human involvement in guiding the learning process. Deep learning, which can be combined with any of these approaches, is characterized by neural networks with multiple layers. The success of ML comes from the ability of models trained on data from a particular domain (the training set) to make predictions in situations they have not seen before. In any ML pipeline, a number of candidate models are trained using data. At the end of the training, an essential amount of the basic structure of the domain is encoded in the model. This allows the ML model to generalize and make predictions in the real world. For example, a large number of cat videos and non-cat videos can be fed in to train a model to recognize cat videos. At the end of the training, a certain amount of cat-videoness is encoded in successful predictors.

ML is used in many familiar systems, including movie recommendations based on viewing data and market basket analysis, which suggests new products based on the current contents of shopping carts. Facial recognition, skin cancer prediction from clinical images, identification of diabetic retinopathy from retinal scans and prediction of cancer from MRI scans are all in the domain of ML. Of course, recommender systems for movies are vastly different in scope and importance from systems that predict skin cancer or the beginnings of retinopathy and blindness.

After training, the key step is an independent and identically distributed (iid) evaluation: candidates are scored on data drawn from the same distribution as the training data, but which the predictors have not yet encountered. This evaluation is used to choose the candidate for deployment in the real world. Many candidates can perform similarly during this phase, even though there are subtle differences between them due to starting assumptions, the number of runs, the data they were trained on, etc.
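The evaluation step described above can be sketched in a few lines. This is a minimal illustration, not the procedure of any particular pipeline: the synthetic two-class data, the nearest-centroid model, and the helper names `iid_split`, `fit_nearest_centroid`, `predict` and `accuracy` are all assumptions made for the example. Each candidate is trained on a different bootstrap resample of the same training data, then scored on a held-out slice drawn from the same distribution.

```python
import numpy as np

def iid_split(X, y, holdout_fraction=0.2, seed=0):
    """Shuffle once, then hold out a slice that no candidate trains on."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_holdout = int(len(X) * holdout_fraction)
    held, train = order[:n_holdout], order[n_holdout:]
    return X[train], y[train], X[held], y[held]

def fit_nearest_centroid(X, y):
    """Toy model (an assumption for this sketch): one centroid per class."""
    labels = np.unique(y)
    centroids = np.stack([X[y == label].mean(axis=0) for label in labels])
    return labels, centroids

def predict(model, X):
    labels, centroids = model
    # Squared distance from every sample to every centroid.
    distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return labels[distances.argmin(axis=1)]

def accuracy(model, X, y):
    return float((predict(model, X) == y).mean())

# Synthetic, well-separated two-class data (purely illustrative).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(-3.0, 1.0, size=(200, 2)),
    rng.normal(3.0, 1.0, size=(200, 2)),
])
y = np.array([0] * 200 + [1] * 200)

X_train, y_train, X_held, y_held = iid_split(X, y)

# Candidates differ only in which bootstrap resample they were trained on.
iid_scores = {}
for candidate_seed in range(5):
    boot = np.random.default_rng(candidate_seed).integers(
        0, len(X_train), size=len(X_train)
    )
    model = fit_nearest_centroid(X_train[boot], y_train[boot])
    iid_scores[candidate_seed] = accuracy(model, X_held, y_held)

# The iid evaluation picks a "winner", even when the scores barely differ.
best_candidate = max(iid_scores, key=iid_scores.get)
print(iid_scores, best_candidate)
```

On data this clean the candidates score almost identically, which is exactly the situation the article describes: the held-out score alone gives little basis for preferring one candidate over another.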

Ideally, the iid evaluation is a proxy for the expected performance of the model. It helps separate the wheat from the chaff, the duds from the iid-optimal models. That there would be some structural misalignment between the training sets and the real world is obvious. The real world is messy and chaotic: images are blurry, operators are not trained to capture pristine images, and equipment breaks down. Even so, all predictors deemed equivalent at the evaluation phase should have shown similar defects in the real world. A paper written by three principal authors and backed by about thirty other researchers, all from Google, probes this assumption to explain many high-profile failures of ML models in the real world. These include the highly publicized Google Health fiasco, in which a model did not perform well in field tests in Thailand aimed at diagnosing diabetic retinopathy from retinal scans.

The paper notes that predictors that performed similarly during the evaluation phase did not perform equally in the real world. Uh oh: this means that the duds and the good performers could not be told apart at the end of the pipeline. The paper takes a sledgehammer to the process of choosing a predictor and to current practices for implementing an ML pipeline.

The paper identifies the root cause of this behavior as underspecification in ML pipelines. Underspecification is a well understood and well documented phenomenon in ML; it arises when a training set expresses fewer independent linear equations than there are unknowns, so many different solutions fit the training data equally well. The first claim in the paper is that underspecification in ML pipelines is a key obstacle to reliably training models that behave as expected in deployment. The second claim is that underspecification is ubiquitous in modern applications of ML, and has substantial practical implications. There is no easy cure for underspecification. Any ML predictor deployed using the current pipeline is potentially tainted.
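The "more unknowns than equations" picture can be made concrete with a toy linear model. The sketch below is an illustration under stated assumptions, not a reproduction of the paper's experiments: the sizes (5 training samples, 8 weights), the random data and all variable names are made up for the example. It builds two weight vectors that fit the training set exactly, yet disagree on a new input.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 5, 8  # fewer equations than unknowns

X_train = rng.normal(size=(n_samples, n_features))
w_true = rng.normal(size=n_features)
y_train = X_train @ w_true

# Solution A: the minimum-norm weights that fit the training data.
w_min_norm = np.linalg.pinv(X_train) @ y_train

# Directions the training data says nothing about: the null space of X_train.
# With full_matrices=True (the default), the last rows of vt span it.
_, _, vt = np.linalg.svd(X_train)
null_basis = vt[n_samples:]

# Solution B: move along a null-space direction; training fit is unchanged.
w_other = w_min_norm + 2.0 * null_basis[0]

train_err_a = np.abs(X_train @ w_min_norm - y_train).max()
train_err_b = np.abs(X_train @ w_other - y_train).max()

# A new input that points along the unconstrained direction.
x_new = null_basis[0]
pred_a = x_new @ w_min_norm
pred_b = x_new @ w_other

print(train_err_a, train_err_b)  # both at floating-point noise level
print(pred_a, pred_b)            # the two "equivalent" models disagree
```

Both solutions are indistinguishable on the training equations, so any evaluation that only probes directions covered by the training data will score them the same. They only come apart on inputs that exercise the unconstrained directions, which is the linear analogue of deployment data that differs from the training distribution.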

The solution is to be aware of the perils of underspecification: choose multiple predictors, subject them to stress tests using more real-world data, and pick the best performer. In other words, expand the testing regime. All this points to the need for better quality data in both the training and evaluation sets, which brings us to the use of blockchains and smart contracts to implement solutions. Access to higher quality and more varied training data may reduce underspecification and hence create a pathway to better ML models, faster.
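One way to picture the expanded testing regime is sketched below. It is a hedged illustration only: the function `select_by_stress_test`, the `tolerance` parameter, the two hand-written candidates and the synthetic data are all hypothetical, and ranking by worst-case stress accuracy is just one reasonable selection rule among several. Candidates that tie on the iid evaluation are shortlisted, then ranked by how they hold up on stress sets that break a shortcut present in the iid data.

```python
import numpy as np

def accuracy(predict, X, y):
    return float((predict(X) == y).mean())

def select_by_stress_test(candidates, iid_set, stress_sets, tolerance=0.02):
    """Shortlist candidates that are near-best on iid data, then pick the
    one with the best worst-case accuracy across the stress sets."""
    X_iid, y_iid = iid_set
    iid_scores = {n: accuracy(p, X_iid, y_iid) for n, p in candidates.items()}
    best_iid = max(iid_scores.values())
    shortlist = [n for n, s in iid_scores.items() if best_iid - s <= tolerance]
    worst_case = {
        n: min(accuracy(candidates[n], X, y) for X, y in stress_sets)
        for n in shortlist
    }
    winner = max(worst_case, key=worst_case.get)
    return winner, iid_scores, worst_case

# Synthetic setup: feature 0 is the real signal; feature 1 is a shortcut
# that happens to match the label in iid data but is noise under stress.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
signal = np.where(y == 1, 1.0, -1.0)
X_iid = np.column_stack([signal, signal])
X_stress = np.column_stack([signal, rng.normal(size=200)])

# Two hypothetical predictors that are indistinguishable on iid data.
candidates = {
    "uses_signal": lambda X: (X[:, 0] > 0).astype(int),
    "uses_shortcut": lambda X: (X[:, 1] > 0).astype(int),
}

winner, iid_scores, worst_case = select_by_stress_test(
    candidates, (X_iid, y), [(X_stress, y)]
)
print(iid_scores)
print(worst_case)
print(winner)
```

The iid scores are identical for both candidates, so only the stress set separates them. The quality of that separation depends entirely on how well the stress data reflects real deployment conditions, which is why the article's argument turns to the sourcing of better data.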