How I learned to stop worrying and let the script do the heavy work

reuglewicz jean-edouard
11 min read · Mar 15, 2020


Or how to automatically create machine learning models via a personal autoML

Machine learning is great, but it is also a difficult and somewhat cryptic task that can be hard to democratize. On the other hand, even taking the no free lunch theorem into account, there are techniques and processes to automate the creation of optimized models. This article presents easyML, a script that automates feature selection, feature engineering and model training.

Automated Machine Learning, or AutoML, is the process of automating the creation of a machine learning model. With more and more data produced every day and more computing power available for training, it is possible to solve a broader range of real-life problems via machine learning. AutoML enables non-experts to apply machine learning models and techniques.
It also offers the advantages of:

· Producing simpler solutions
· Creating models faster
· Outperforming hand-designed models

General overview of the machine learning process

https://www.slideshare.net/sklavit1/modern-frameworks-for-machine-learning

Feature selection and cleaning

After exploring and analyzing the dataset, a first important task is to identify the features related to the target, keep the relevant ones and remove the irrelevant or less important features that do not contribute to the target. This allows the model to achieve better performance.
Feature selection is the process of automatically or manually selecting the features that contribute the most to the label, in order to get better model accuracy.
It offers the following advantages:
· Reduces overfitting: removing redundant features also removes some noise from the dataset, lowering the likelihood of learning that noise and overfitting.
· Improves accuracy: less misleading data improves modeling accuracy.
· Reduces training time: fewer data points reduce algorithm complexity, so algorithms train faster.

The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data.
http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf

The script easyML works as follows:

1: Dropping duplicates in the dataset so that duplicated data points do not carry more importance than they should. If the same data points are repeated too often, they may introduce a bias in the dataset that transfers to the weights and biases of the trained model, which may then prefer one class over another.

df = df.drop_duplicates()

2: Removing columns containing only one value, which bring useless noise to the dataset. If a column contains only one value, it won't bring any additional information about the target and won't help discriminate between the classes. Moreover, it adds more dimensions to the training dataset, making computations longer and more difficult.

def remove_unique_feature(df):
    # drop every column that contains a single unique value
    i = 0
    features_list = df.columns
    while i < len(features_list):
        if len(df[features_list[i]].unique()) == 1:
            print('dropping: ', features_list[i])
            df.drop(features_list[i], axis=1, inplace=True)
        i += 1
    return df

3: Removing proper nouns from the dataset. Maybe one of your columns contains words, sentences or nouns. This can sometimes be useful, for instance if it corresponds to a category, gender, city or skill. But if it is a proper noun, apart from the confidentiality issues it implies, it once again adds unneeded dimensions to the dataset, making computations longer, and adds noise, as those data don't bring any discriminative power.

def remove_name(nlp, df):
    # flag object columns whose first value looks like a proper noun, using spaCy POS tagging
    # hasNumbers() is a helper from the easyML script that checks whether a string contains digits
    columns_to_remove = []
    for column in df.columns:
        print(column)
        if df[column].dtypes == 'object':
            if hasNumbers(str(df[column].values.tolist()[0]).lower()) == True:
                pass
            else:
                doc = nlp(re.sub("[^a-z]", " ", str(df[column].values.tolist()[0]).lower()))
                for token in doc:
                    if token.pos_ == 'PROPN' and token.tag_ == 'NNP' and token.dep_ == 'compound':
                        columns_to_remove.append(column)
    return list(set(columns_to_remove))

4: One-hot encoding of categorical data. A model needs numerical data to train on, so if the dataset contains categorical data stored as words, the model won't be able to train on it. Those data points have to be translated into numerical values. Classic label encoding can be misleading for training because it introduces an artificial ordering between categories, so one-hot encoding is a better option for columns with more than two categories.

def one_hot_encoder(df):
    # label encode binary categorical columns, one-hot encode the remaining ones
    le = LabelEncoder()
    le_count = 0
    for col in df:
        if df[col].dtype == 'object':
            if len(list(df[col].unique())) <= 2:
                le.fit(df[col])
                df[col] = le.transform(df[col])
                le_count += 1
    print('%d columns were label encoded.' % le_count)
    df = pd.get_dummies(df)
    print('Training Features shape: ', df.shape)
    return df

5: Processing of NaN values. Those values represent missing data points for some features. They can be treated in different ways: drop, mean fill, forward fill or backward fill.

def missing_values(df, policy):
    # treat missing values according to the chosen policy:
    # 'drop', 'forwardfill', 'backwardfill' or 'median_fill' (which fills with the column mean)
    i = 0
    features_list = df.columns
    print(len(features_list))
    while i < len(features_list):
        if df[features_list[i]].isnull().sum() != 0:
            print('treating feature: ', features_list[i])
            if policy == 'drop':
                df = df[pd.notnull(df[features_list[i]])]
            elif policy == 'forwardfill':
                df[features_list[i]] = df[features_list[i]].fillna(method='ffill')
            elif policy == 'backwardfill':
                df[features_list[i]] = df[features_list[i]].fillna(method='bfill')
            elif policy == 'median_fill':
                df[features_list[i]] = df[features_list[i]].fillna(df[features_list[i]].mean())
        i += 1
    print('treatment is over')
    return df

6: Keeping the most correlated features. Correlation shows how two features are correlated with the target or with each other. Correlation can either be positive (an increase in the value of one feature increases the value of the target variable) or negative (an increase in the value of one feature decreases the value of the target variable). The closer to 1 (in absolute value) the correlation between two variables is, the more correlated they are.
easyML keeps the features that are the most correlated with the target but not correlated with each other (given the threshold specified by the user), once again to keep the data that discriminates the most between the different classes without introducing bias.

Disclaimer: the script easyML uses the Pearson correlation, which only measures linear correlation and may therefore miss nonlinear relationships.

Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the correlation reflects the strength and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0 but in that case the correlation coefficient is undefined because the variance of Y is zero.
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

Removing features that are correlated with each other will also make the Principal Component Analysis more reliable.
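
As an illustration of this step, here is a minimal sketch of correlation-based selection with pandas. It is not the exact easyML implementation; the function name and both threshold values are assumptions made for the example.

import pandas as pd

def select_by_correlation(df, target, target_threshold=0.1, feature_threshold=0.9):
    # hypothetical sketch: keep features sufficiently correlated with the target,
    # then drop one feature of every pair that is too correlated with an already kept one
    corr = df.corr(method='pearson')  # Pearson: linear correlation only
    relevant = corr[target][corr[target].abs() >= target_threshold].index.drop(target)
    kept = []
    for feature in relevant:
        if all(abs(corr.loc[feature, other]) < feature_threshold for other in kept):
            kept.append(feature)
    return df[kept + [target]]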

Feature engineering

1: Converting skewed float values with a shifted log transform. The script measures the skewness of each feature's distribution and, if the skewness is not null, corrects the distribution towards a Gaussian shape with a log transformation, which makes the subsequent outlier removal more reliable.
The script shifts each data point by the minimum value of the column and adds 1 before taking the log, roughly log(x - min(x) + 1).
Adding this constant inside the log (which does not modify the direction and trend of the transformation) avoids the null and negative values that the logarithm function cannot handle.
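
As a rough sketch of this step (not the exact easyML code; the skewness threshold and function name are assumptions), the transformation could look like this:

import numpy as np
from scipy.stats import skew

def log_transform_skewed(df, skew_threshold=0.5):
    # hypothetical sketch: shift each skewed float column so it is strictly positive,
    # then apply the natural logarithm
    for col in df.select_dtypes(include='float64').columns:
        if abs(skew(df[col].dropna())) > skew_threshold:
            # shifting by the column minimum plus one keeps the argument of log() above zero
            df[col] = np.log(df[col] - df[col].min() + 1)
    return df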

https://upload.wikimedia.org/wikipedia/commons/8/81/Logarithm_plots.png

The normal distribution is widely used in basic and clinical research studies to model continuous outcomes. Unfortunately, the symmetric bell-shaped distribution often does not adequately describe the observed data from research projects. Quite often data arising in real studies are so skewed that standard statistical analyses of these data yield invalid results. Many methods have been developed to test the normality assumption of observed data. When the distribution of the continuous data is non-normal, transformations of data are applied to make the data as “normal” as possible and, thus, increase the validity of the associated statistical analyses. The log transformation is, arguably, the most popular among the different types of transformations used to transform skewed data to approximately conform to normality.

Another popular use of the log transformation is to reduce the variability of data, especially in data sets that include outlying observations.

Disclaimer

If the original data follows a log-normal distribution or approximately so, then the log-transformed data follows a normal or near normal distribution. In this case, the log-transformation does remove or reduce skewness. Unfortunately, data arising from many studies do not approximate the log-normal distribution so applying this transformation does not reduce the skewness of the distribution. In fact, in some cases applying the transformation can make the distribution more skewed than the original data.

Again, contrary to this popular belief, log transformation can often increase — not reduce — the variability of data whether or not there are outliers.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4120293/

Keeping that in mind, I still decided to go with this technique.

2: Outlier removal. Once the distribution is normalized, it is easier to remove the outliers. Outliers are interpreted here as data points that lie more than 1.5 times the interquartile range below the first quartile or above the third quartile. Outliers have to be removed in order to get a more general and representative dataset, which in turn produces a more representative model.

def outliers_removal(data):
    # remove rows that fall outside the Tukey fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    good_data = data.copy()
    for feature in data.keys():
        if data[feature].dtype == 'float64':
            Q1 = np.percentile(data[feature], q=25)
            Q3 = np.percentile(data[feature], q=75)
            interquartile_range = Q3 - Q1
            step = 1.5 * interquartile_range
            print("Data points considered outliers for the feature '{}':".format(feature))
            print(data[~((data[feature] >= Q1 - step) & (data[feature] <= Q3 + step))])
            # keep only the rows inside the fences for this feature
            good_data = good_data[(good_data[feature] >= Q1 - step) & (good_data[feature] <= Q3 + step)]
    return good_data.reset_index(drop=True)

What makes the normal distribution a good candidate for this is the fact that its median, mode and mean are equal and its skewness and excess kurtosis are equal to 0. The distribution is therefore symmetric and most of the cases are located in the interval [Q1 - 1.5 x IQR; Q3 + 1.5 x IQR]. Taking only the data in this interval removes outliers that could make the model overfit on non-relevant data and fail to generalize well.

3: Principal Component Analysis, or PCA, is a linear dimensionality reduction method that takes information from a high-dimensional space and projects it onto a lower-dimensional subspace. This transformation keeps the essential variations of the dataset and removes the non-essential parts with less variation.
PCA uses orthogonal transformations to convert distinct observations into a set of linearly uncorrelated variables, the principal components. They represent the underlying structure of the data: the components that capture most of the variance and therefore carry the most information.
Those components have a direction and a magnitude. The direction stands for the axis along which the data spreads the most and has the most variance, and the magnitude is the amount of variance the component captures when the data is projected onto that axis. The principal components are straight lines; the first principal component holds the most variance in the data, and each subsequent principal component is orthogonal to the previous one and captures less variance.
In this way, given a set of x correlated variables over y samples, you obtain a set of u uncorrelated principal components over the same y samples. The components are uncorrelated because correlated features contribute to the same principal component: they are projected in the same direction, so the initial features are reduced to uncorrelated principal components, each representing a different set of correlated features with different amounts of variation. This also helps reduce noise in the dataset. Moreover, having already removed features that were strongly correlated with each other, the discriminating power of the PCA will be greater.

illustration of the projection of the PCA

Finally, PCA's dimensionality reduction properties speed up the modeling by leaving fewer features to compute on, making training and testing faster and less computation-heavy.

This function takes the previously cleaned dataset and a variance threshold indicating how much information to retain from the dataset, then projects the data points onto the selected number of components.

nb_components = utils.PCA_generator(df,threshold) 
utils.pca_components(df, nb_components)
illustration of the component projection and explained and cumulated variance of each component
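
For reference, here is a minimal sketch of what such a helper can look like with scikit-learn (this is an assumption about the approach, not the actual utils implementation from the repository):

import numpy as np
from sklearn.decomposition import PCA

def pca_project(df, threshold=0.95):
    # hypothetical sketch: fit a full PCA, pick the smallest number of components whose
    # cumulative explained variance reaches the threshold, then project onto them
    pca = PCA().fit(df)
    cumulated = np.cumsum(pca.explained_variance_ratio_)
    nb_components = int(np.argmax(cumulated >= threshold)) + 1
    print('keeping', nb_components, 'components')
    return PCA(n_components=nb_components).fit_transform(df)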

Modeling

Once the data is cleaned and denoised so that the resulting dataset and features are as representative as possible of the phenomenon to model, the data can be fit to a model.
To get the best model possible, it must be fine-tuned to the given situation and dataset. The no free lunch theorem tells us that there is no one-size-fits-all model and that each model is linked to a specific phenomenon. Therefore, to get a good model, one must tune its parameters and hyperparameters.

A model hyperparameter is a characteristic of a model that is external to the model and whose value cannot be estimated from data. The value of the hyperparameter has to be set before the learning process begins. For example, c in Support Vector Machines, k in k-Nearest Neighbors, the number of hidden layers in Neural Networks.

In contrast, a parameter is an internal characteristic of the model and its value can be estimated from data. Example, beta coefficients of linear/logistic regression or support vectors in Support Vector Machines.
https://towardsdatascience.com/grid-search-for-model-tuning-3319b259367e

A traditional method of hyperparameter optimization is grid search. Classic grid search exhaustively considers all parameter combinations, while random search samples a given number of candidates from a parameter space with a specified distribution. This approach works but can be very slow to converge and does not always give the best results.
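
Both approaches are available out of the box in scikit-learn; here is a minimal sketch (the dataset and parameter grid are placeholders for the example):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {'C': [0.1, 1, 10, 100], 'kernel': ['linear', 'rbf']}

# exhaustive grid search: tries every combination in param_grid
grid_search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

# random search: samples a fixed number of candidates from the same space
rand_search = RandomizedSearchCV(SVC(), param_grid, n_iter=4, cv=5, random_state=0).fit(X, y)

print(grid_search.best_params_, rand_search.best_params_)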
On the other hand, evolutionary algorithms seem to provide better results.

One of the main applications of Evolutionary Algorithms in Machine Learning is Hyperparameters Optimization. For example, let’s imagine we create a population of N Machine Learning models with some predefined Hyperparameters. We can then calculate the accuracy of each model and decide to keep just half of the models (the ones that perform best). We can now generate some offsprings having similar Hyperparameters to the ones of the best models so that to get again a population of N models. At this point, we can again calculate the accuracy of each model and repeat the cycle for a defined number of generations. In this way, just the best models will survive at the end of the process.
https://towardsdatascience.com/introduction-to-evolutionary-algorithms-1278f335ead6

Evolutionary algorithm workflow

Using TPOT, a library built on top of scikit-learn, it is easy to use evolutionary algorithms for hyperparameter and pipeline optimization.

TPOT workflow
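
A minimal TPOT usage sketch (the dataset and settings here are illustrative choices, not the easyML defaults):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# evolve pipelines over a few generations; candidates are scored with accuracy by default
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# export the best pipeline found as a standalone scikit-learn script
tpot.export('best_pipeline.py')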

Conclusion

This method is still a work in progress and still has drawbacks (when the input space is too big, the code sometimes crashes, but that is probably a code issue). Moreover, it does not yet propose any image recognition, text analysis or time series capabilities, only basic classification and regression. Finally, the script relies only on accuracy (TPOT's default scoring metric for classification) to measure the efficiency of the models it creates, which can be a misleading metric.
Anyway, for situations where a little finesse is required, such techniques are still too blunt to be really efficient and will still require field knowledge and human tinkering. But they can be a useful tool to speed up the process.

If you want to dig deeper into the code, the repository of the project is available here: https://github.com/elBichon/easyML

Sources

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4120293/
http://scaryscientist.blogspot.com/2015/02/what-is-logarithm.html
https://www.datacamp.com/community/tutorials/principal-component-analysis-in-python
https://www.datacamp.com/community/tutorials/probability-distributions-python
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
https://www.slideshare.net/sklavit1/modern-frameworks-for-machine-learning
https://www.datacamp.com/community/tutorials/tpot-machine-learning-python
https://www.datacamp.com/community/tutorials/feature-selection-python
https://towardsdatascience.com/introduction-to-evolutionary-algorithms-1278f335ead6
https://towardsdatascience.com/grid-search-for-model-tuning-3319b259367e

https://epistasislab.github.io/tpot/

https://www.youtube.com/watch?v=jn-22XyKsgo
https://www.youtube.com/watch?v=TFUysIR5AB0


reuglewicz jean-edouard

Engineer passionate about technology, data processing and AI at large, doing my best to help in the machine uprising https://elbichon.github.io/