By Marla Rosner and Keith Moore
Previously, we’ve covered the basics of machine learning, including AI, deep learning, and neural networks. This time, we’re going to take a deep dive into another term that takes a little longer to explain, but is creating an exciting new field within modern machine learning: automated model building.
Automated model building, often also referred to as meta learning, is AI that designs AI systems. Humans can design AI models, of course, but it’s a lengthy and tricky process. These man-made models often struggle to scale across large operations and cannot effectively handle edge cases that occur under extreme or unusual operating parameters.
Thus, automated model building has become increasingly important to machine learning, as it creates dynamic, accurate models that take less time to develop and can adapt to changing conditions without needing a human in the loop every step of the way.
There are four major steps in the process of automated model building: cleaning, feature generation, feature selection, and the construction of either a supervised or unsupervised model.
Not all data sets are created equal. To make a data set algorithmically usable, automated model building software needs to fill in missing data points, convert all data into a usable format, scale the data, and in some cases rebalance sample sizes.
Imputation, or the process of filling in missing data, can be handled in a variety of ways depending on the needs of the given data set. Options include filling in the missing points with the mean of other data points in the set, predicting the missing point based on other variables, and carrying the last seen observation forward, among numerous other techniques.
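To make two of these imputation strategies concrete, here is a minimal sketch in plain Python. The function names and the sample readings are our own illustrative assumptions, not part of any particular product, and missing values are assumed to be represented as None.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_forward_fill(values):
    """Replace None entries with the last observed value.

    Gaps at the very start have nothing to carry forward and stay None.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# Hypothetical sensor readings with gaps.
readings = [10.0, None, 14.0, None, None, 18.0]
mean_filled = impute_mean(readings)
ffill_filled = impute_forward_fill(readings)
```

Mean filling ignores ordering, while carrying the last observation forward assumes the rows are in time order, which is why the right choice depends on the data set.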
Data also comes in a number of formats, such as categorical, numerical, and date/time. The software must convert all data into a usable format, so it can be used together in a single model. Often this means converting items like categorical data to numeric data, or recognizing a date/time feature and ensuring that the data is treated as a time series.
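As a rough illustration of these conversions, the sketch below one-hot encodes a categorical column and parses timestamp strings so the rows can be ordered as a time series. One-hot encoding is only one of several ways to turn categories into numbers, and the helper names, the timestamp format, and the sample values are assumptions made for this example.

```python
from datetime import datetime

def one_hot_encode(categories):
    """Map each categorical value to a 0/1 vector with one slot per distinct value."""
    levels = sorted(set(categories))
    vectors = [[1 if c == level else 0 for level in levels] for c in categories]
    return levels, vectors

def sort_by_timestamp(rows, fmt="%Y-%m-%d %H:%M"):
    """Parse (timestamp string, value) pairs and return them in chronological order."""
    parsed = [(datetime.strptime(ts, fmt), value) for ts, value in rows]
    return sorted(parsed, key=lambda row: row[0])

# Hypothetical asset types and readings.
levels, encoded = one_hot_encode(["pump", "valve", "pump", "fan"])
series = sort_by_timestamp([("2020-01-01 12:00", 3.2), ("2020-01-01 08:00", 2.9)])
```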
Finally, data needs to be scaled and rebalanced so that all features share a single scale, without too wide a variation in the range of values, and so that the groups being compared are represented by similar sample sizes. This ensures that all of the data can be meaningfully compared and used together.
Once data has been cleaned, it must be manipulated to generate more appropriate features for solving a particular problem. A feature is a piece of measurable information used in machine learning. For instance, the age, weight, and yearly income for a set of people would be three possible features of that set.
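A simple way to picture scaling and rebalancing is sketched below. It uses min-max scaling and random oversampling of the smaller class, which are common choices but only two of many. The function names and sample data are hypothetical.

```python
import random

def min_max_scale(values):
    """Rescale values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in values]

def oversample_minority(samples, labels, seed=0):
    """Duplicate random samples of smaller classes until every class
    has as many samples as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend((s, label) for s in group + extra)
    return balanced

scaled = min_max_scale([20.0, 35.0, 50.0])
# Four "ok" samples and one "fail" sample become four of each.
balanced = oversample_minority([1, 2, 3, 4, 5], ["ok", "ok", "ok", "ok", "fail"])
```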
The processes by which automated model building approaches can generate features include automated windowing and automated risk generation.
Automated risk generation refers to the construction of a risk index, a derived feature that increases over time leading up to a specific event, whether it be an asset failure or a cyber attack. This helps to improve event predictability, and can be generated using examples of the event type the model needs to predict and a lead warning time the user needs.
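To show what such a derived feature might look like, here is a minimal sketch that builds a risk index from known event times and a lead warning time. The linear ramp from 0 to 1 is purely our assumption for illustration, as are the function name and the sample numbers. A real system could shape the index very differently.

```python
def risk_index(timestamps, event_times, lead_time):
    """Return one risk value per timestamp, between 0 and 1.

    The value ramps up linearly over the `lead_time` window before each
    known event and is 0 everywhere else.
    """
    risks = []
    for t in timestamps:
        risk = 0.0
        for event in event_times:
            time_to_event = event - t
            if 0 <= time_to_event <= lead_time:
                risk = max(risk, 1.0 - time_to_event / lead_time)
        risks.append(risk)
    return risks

# Hypothetical example: a failure at hour 8 with a 4-hour warning window.
hours = list(range(0, 11))
risk = risk_index(hours, event_times=[8], lead_time=4)
```

A model trained to predict this index, rather than the event itself, has a target that rises as the event approaches, which is what gives the user advance warning.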
Automated windowing is a method for dealing with time series data. In time series data, current data points are dependent on previous data points within a time window. Before it can work with time series data, automated model building software must select the optimal time window to use. Only then can it extract features from the data in the frequency and time domains.
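The sketch below illustrates the feature extraction half of this step: it slices a series into sliding windows and computes a few time-domain and frequency-domain features for each. The window size is fixed by hand here, whereas automated software would search for the best one. The function names and the synthetic sine-wave signal are assumptions for the example.

```python
import cmath
import math
from statistics import mean, pstdev

def sliding_windows(series, size, step=1):
    """Split a series into overlapping windows of `size` points."""
    return [series[i:i + size] for i in range(0, len(series) - size + 1, step)]

def dominant_frequency_bin(window):
    """Index of the strongest non-constant component of a naive DFT."""
    n = len(window)
    magnitudes = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(window))
        magnitudes.append((abs(coeff), k))
    return max(magnitudes)[1]

def window_features(window):
    """Time-domain (mean, spread) and frequency-domain (dominant bin) features."""
    return {
        "mean": mean(window),
        "std": pstdev(window),
        "dominant_bin": dominant_frequency_bin(window),
    }

# Synthetic signal: a sine wave completing 2 cycles every 16 samples.
signal = [math.sin(2 * math.pi * 2 * i / 16) for i in range(32)]
features = [window_features(w) for w in sliding_windows(signal, size=16, step=8)]
```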
Once automated cleaning and feature generation have taken place, the data set is ready to be used to build a model. There are two types of learning models: supervised and unsupervised. The process of feature selection and model building will differ based on whether the resulting model is supervised or unsupervised.
Supervised Learning Models
Supervised learning models are created by feeding them pre-labeled training data, from which the model learns how to label new data points.
There are a variety of ways for automated model building software to create a supervised model. One such method uses an evolutionary process that begins by generating thousands of neural network models and scoring them on their performance. After creation, the first generation of models is speciated, or clustered, based on shared characteristics. Within each species, the software identifies the elite models that performed best at solving the problem. These elite models can then be genetically mutated and optimized by deep learning-based backpropagation.
Once this is done, the altered elites are reintroduced to the general population of models. Over a number of generations, the models are refined until they are complex and sophisticated enough to accurately solve the problem at hand.
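The loop described above can be sketched in miniature. This toy version is heavily simplified and rests on several assumptions of ours: the "models" are two-weight linear functions rather than neural networks, species are defined by the sign pattern of the weights, mutation is random Gaussian noise, and the backpropagation step is left out entirely. It is meant only to show the generate, speciate, select, mutate, and reintroduce cycle.

```python
import random

def fitness(weights, data):
    """Negative squared error of a linear model y = w0 + w1 * x (higher is better)."""
    return -sum((weights[0] + weights[1] * x - y) ** 2 for x, y in data)

def species_of(weights):
    """Toy speciation: cluster candidates by the sign pattern of their weights."""
    return tuple(w >= 0 for w in weights)

def evolve(data, population_size=60, generations=40, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5), rng.uniform(-5, 5)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Group the population into species, then pick the best model in each.
        species = {}
        for candidate in population:
            species.setdefault(species_of(candidate), []).append(candidate)
        elites = [max(group, key=lambda w: fitness(w, data))
                  for group in species.values()]
        # Mutate each elite several times and reintroduce the mutants,
        # keeping only the fittest candidates overall.
        mutants = [[w + rng.gauss(0, 0.3) for w in elite]
                   for elite in elites for _ in range(10)]
        ranked = sorted(population + mutants,
                        key=lambda w: fitness(w, data), reverse=True)
        population = ranked[:population_size]
    return population[0]

# Hypothetical target relationship: y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in range(-3, 4)]
best = evolve(data)
```

Because the fittest candidates always survive into the next generation, the best score can never get worse from one generation to the next.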
Unsupervised Learning Models
Unsupervised models do not use labeled training data. Instead, they are generated by feeding an algorithm unlabeled data and allowing it to determine independently how best the data should be grouped.
As with supervised models, there are a variety of approaches that can be used to create an unsupervised model. A method SparkCognition has used involves a technology with a name that’s a bit of a mouthful: Gaussian mixture model variational autoencoder, or GMMVAE.
The best way to understand this term is to work backwards. An autoencoder is a neural network-based approach for compressing a feature set to the smallest size possible and then decompressing that small feature set with as little loss as possible. Think of this as a neural-network-based way to “zip” and “unzip” data like you would on a computer.
A variational autoencoder specifically tries to zip the data up in a way that makes the distribution of the compressed data as close as possible to a Gaussian, or normal, distribution.
A Gaussian mixture model variational autoencoder distinguishes itself further by encoding the data to fit not just one, but multiple Gaussian distributions. These multiple Gaussian distributions can then be used to define clusters for the data. In a model intended to cluster types of fruit, for example, the data might be separated into a Gaussian distribution of fruit with banana-like qualities, a Gaussian distribution of fruit with apple or peach-like qualities, and so on.
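To illustrate just the clustering idea, here is a minimal sketch that fits a plain one-dimensional Gaussian mixture using expectation-maximization. It is not a GMMVAE: there is no neural network and no encoding step, only the "multiple Gaussians define the clusters" part. The fruit lengths, starting means, and function names are made up for the example.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and spread sigma at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fit_gmm_1d(points, initial_means, iterations=50):
    """Fit a 1-D Gaussian mixture with EM, starting from the given means."""
    means = list(initial_means)
    k = len(means)
    sigmas = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: how responsible is each Gaussian for each point?
        resp = []
        for x in points:
            likes = [w * gaussian_pdf(x, m, s)
                     for w, m, s in zip(weights, means, sigmas)]
            total = sum(likes)
            resp.append([like / total for like in likes])
        # M-step: re-estimate the weight, mean, and spread of each Gaussian.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(points)
            means[j] = sum(r[j] * x for r, x in zip(resp, points)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, points)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # guard against zero spread
    return weights, means, sigmas

def assign_cluster(x, weights, means, sigmas):
    """Index of the Gaussian that best explains the point x."""
    scores = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas)]
    return scores.index(max(scores))

# Hypothetical fruit lengths in centimeters: five bananas, then five apples.
lengths = [17.5, 18.0, 19.2, 18.6, 17.9, 7.1, 7.8, 8.3, 7.5, 8.0]
weights, means, sigmas = fit_gmm_1d(lengths, initial_means=[5.0, 20.0])
labels = [assign_cluster(x, weights, means, sigmas) for x in lengths]
```

Each point ends up in the cluster whose Gaussian explains it best, without any labels having been supplied.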
All of these processes are what occur inside automated model building software when it creates a new model. A human can perform these tasks as well, but with so many complex steps needed to create a quality model, it’s easy to see why it often takes a team of data scientists months or longer to build a model. Automated model building, on the other hand, can shorten this process to weeks or even days. It’s time to let data scientists handle the human tasks that require critical and creative thinking, and let AI handle building more AI.