Code Walkthrough in an Anonymous Machine Learning Hiring Hackathon — Part 2

Raghuvansh Tahlan
Published in Analytics Vidhya
7 min read · Nov 19, 2020


Results Compared from all Models

In Part 1 of this article, we looked at how to import the data and libraries, preprocess the data, train a CatBoost classifier model with Grid Search to find optimal parameters, and tune the classifier cutoff by maximizing the sum of Sensitivity and Specificity and by maximizing the ROC_AUC score.

In this part, we will focus on Feature Engineering, Feature Selection and Automated Machine Learning (AutoML).

Feature Engineering

Feature Engineering is an important step in an ML pipeline because a Machine Learning model can only be as good as its data. Feature Engineering is learned through experience, domain knowledge and some trial and error. We will use techniques that generate new features from combinations of columns or from mathematical transformations, because they can be applied in any ML competition without much brainstorming about the features.

It is carried out after pre-processing of the data but before splitting it for model training. We will branch off from the previous article at the point where we have recovered the training and testing data from the combined data.

Combination of Columns

Most of the columns are of ‘categorical’ or ‘object’ type, so new features are created by combining pairs of columns.

Let’s consider a case where one column is ‘Office_PIN’ and another is ‘Manager_Num_Products’: the combination of the two is called ‘Office_PIN_and_Manager_Num_Products’, and the values are joined using an ‘_’, so many new categories are created.
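The original code gist is not reproduced here, so the following is a minimal sketch of the idea, assuming the preprocessed training and testing dataframes are named train and test (those names, and the use of itertools, are assumptions):

from itertools import combinations

# columns stored as 'object' or 'category' dtype
cat_cols = train.select_dtypes(include=['object', 'category']).columns

for col1, col2 in combinations(cat_cols, 2):
    new_col = col1 + '_and_' + col2
    # join the two values with '_' to form a new categorical feature
    train[new_col] = train[col1].astype(str) + '_' + train[col2].astype(str)
    test[new_col] = test[col1].astype(str) + '_' + test[col2].astype(str)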


Log of Odds

It is a technique used in classification problems to convert a categorical variable into a numerical one. The numerical variable can then be used for further feature engineering.

The ‘column_defaultratio_cut’ function takes the dataframe, the name of the column on which the log of odds transformation is to be performed and the target variable, and returns a dictionary.
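The exact implementation is in the accompanying notebook; below is a hedged sketch of what such a function could look like, presumably mapping each category to its log-of-odds value. The clipping constant, the placeholder target column name 'target' and the way the mapping is applied back to the dataframe are all assumptions:

import numpy as np

def column_defaultratio_cut(df, column, target):
    # fraction of positive targets within each category of the column
    ratios = df.groupby(column)[target].mean()
    # clip away 0 and 1 so the log of odds stays finite
    ratios = ratios.clip(1e-4, 1 - 1e-4)
    # log of odds: log(p / (1 - p))
    log_odds = np.log(ratios / (1 - ratios))
    return log_odds.to_dict()

# usage: encode a categorical column with its numeric log-of-odds value
mapping = column_defaultratio_cut(train, 'Office_PIN', 'target')
train['Office_PIN_log_odds'] = train['Office_PIN'].map(mapping)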

Square, Square root and Log Transformation

After performing the log of odds transformation, all columns are numerical, so we can apply Square, Square root and Log transformations, because these transformed variables sometimes have better predictive power.

One column has a negative minimum value, so the square root and log transformations may raise errors on it; it is advisable to add a constant value before transforming to avoid the error.

All transformations are calculated, merged with the original dataframes, and checked for any null values created in the process.
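A sketch of these transformations under the same assumptions as above; the offset used to keep the square root and log defined, and the exclusion of the placeholder target column, are illustrative:

import numpy as np

# numeric feature columns, excluding the (assumed) target column
num_cols = train.select_dtypes(include=[np.number]).columns.drop('target', errors='ignore')

for col in num_cols:
    # shift the column so its minimum is at least 1 before sqrt/log
    offset = 1 - min(train[col].min(), 0)
    train[col + '_sq'] = train[col] ** 2
    train[col + '_sqrt'] = np.sqrt(train[col] + offset)
    train[col + '_log'] = np.log(train[col] + offset)

# confirm the transformations did not introduce any null values
assert train.isnull().sum().sum() == 0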

After these transformations, there are 1768 columns in our dataframes, so we need to perform feature selection to keep only the features with the most predictive power.

Feature Selection

There are many techniques for selecting the best features. Still, I have used a CatBoost Classifier, sorted the features by feature importance and selected features until their cumulative importance is around 98%. This process is repeated many times to reduce the number of variables. Other techniques, such as VIF or correlation, can also be used for feature selection.

This is a time-consuming process, so a loop to automate it is very beneficial. I selected 175 features after repeating this process 5–6 times.
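A hedged sketch of one way to automate that loop, assuming the engineered training features and target are available as X_train and y_train (placeholder names) and that all features are numeric at this point; the CatBoost settings and the number of rounds are illustrative, and only the ~98% cumulative-importance rule comes from the text:

import pandas as pd
from catboost import CatBoostClassifier

def select_by_importance(X, y, threshold=0.98):
    model = CatBoostClassifier(verbose=0)
    model.fit(X, y)
    # sort features by importance and keep them until ~98% cumulative importance
    imp = pd.Series(model.get_feature_importance(), index=X.columns).sort_values(ascending=False)
    cumulative = imp.cumsum() / imp.sum()
    return cumulative[cumulative <= threshold].index.tolist()

# repeat the selection a few times, refitting on the reduced feature set each round
selected = list(X_train.columns)
for _ in range(5):
    selected = select_by_importance(X_train[selected], y_train)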

Difference between ROC_AUC before and after Feature Engineering

Initially, before we performed cutoff tuning, our testing ROC_AUC score was 0.51; it has now increased to 0.61. It can be increased further by finding optimal parameters, tuning the cutoff and using an AutoML technique.

Randomised Search for Optimal Parameters

It is similar to grid search in that both work over a range of parameters. In grid search, we define the grid of parameter values and all combinations are tried. In contrast, in randomised search only a limited number of combinations, defined by the ‘n_iter’ parameter, are sampled and tried. Grid search is effective but slow; randomised search is fast but not always as effective.
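A sketch of a randomised search over CatBoost parameters with scikit-learn; the parameter ranges, number of iterations and cross-validation folds below are assumptions, not the values used in the notebook:

from catboost import CatBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'depth': [4, 6, 8, 10],
    'learning_rate': [0.01, 0.03, 0.1],
    'iterations': [200, 500, 1000],
}

search = RandomizedSearchCV(
    estimator=CatBoostClassifier(verbose=0),
    param_distributions=param_distributions,
    n_iter=20,          # number of random parameter combinations to try
    scoring='roc_auc',
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_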

ROC_AUC score after Randomised Search

Cutoff Tuning by Maximizing ROC_AUC Score

Using the ‘predict_proba’ function, we predict the probabilities on the training and validation sets and then try different cutoff values to find the cutoff where the ROC_AUC score is maximum.
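A minimal sketch of this cutoff scan on a held-out validation set; the names X_valid and y_valid and the 0.01 step size are assumptions:

import numpy as np
from sklearn.metrics import roc_auc_score

probs = best_model.predict_proba(X_valid)[:, 1]

best_cutoff, best_score = 0.5, 0.0
for cutoff in np.arange(0.01, 1.0, 0.01):
    # binarise the probabilities at this cutoff and score the hard predictions
    score = roc_auc_score(y_valid, (probs >= cutoff).astype(int))
    if score > best_score:
        best_cutoff, best_score = cutoff, score

print(best_cutoff, best_score)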

Training and Testing ROC_AUC Score

Using the tuned cutoff of 0.65, we get training and testing ROC_AUC scores of 0.74 and 0.66 respectively.

We have seen quite some improvement in our testing ROC_AUC score so far, but we still have something up our sleeve, and that is AutoML. Some people believe AutoML will eat up jobs; others say it still has a long way to go. Let us see it as a tool and use it if it gives good results. There are many flavours of AutoML, such as Google, AWS, Azure, H2O, AutoKeras, Auto-sklearn, TPOT and PyCaret, some paid and some open-sourced, but I keep away from paid tools in Hackathons. I have found MLJAR quite useful and easy to use. It has open-sourced code as well as paid web services. I recommend using MLJAR on Google Colab or in a virtual environment.

So, we save our dataset for use on Colab.
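For example (the file and column names are illustrative), the selected features can be written out and then uploaded to Google Drive:

# 'selected' is the feature list from the selection loop; 'target' is a placeholder column name
train[selected + ['target']].to_csv('train_selected.csv', index=False)
test[selected].to_csv('test_selected.csv', index=False)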

MLJAR

Most of the libraries we need are pre-installed on Google Colab, but we have to install MLJAR. It is quite simple using the command below; after the installation is done, there will be a prompt to restart the runtime, which completes our installation.

!pip install mljar-supervised

We will have to import some libraries to get started.

from google.colab import drive
drive.mount('/content/drive')

We can upload our dataset to Colab directly or to our Google Drive. I have used the latter, and we can then mount the drive using the command above.

After successfully mounting the drive, we can read the data files.
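For instance (the paths below are illustrative and depend on where the files were uploaded in Drive):

import pandas as pd

train = pd.read_csv('/content/drive/MyDrive/train_selected.csv')
test = pd.read_csv('/content/drive/MyDrive/test_selected.csv')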

MLJAR has many modes; since we are using it for a Hackathon/Competition, we will use the ‘Compete’ mode. Information on the different modes is available in their GitHub repository.
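A sketch of running MLJAR in ‘Compete’ mode; the time limit, the evaluation metric and the ‘target’ column name are assumptions:

from supervised.automl import AutoML

X = train.drop('target', axis=1)   # 'target' is a placeholder column name
y = train['target']

automl = AutoML(mode='Compete', eval_metric='auc', total_time_limit=3600)
automl.fit(X, y)

# probabilities for the positive class on the test set
test_probs = automl.predict_proba(test)[:, 1]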

It will take some time depending upon the size of the dataset. There is a significant improvement in the testing ROC_AUC score, and we will check whether tuning the cutoff can yield further improvement.

Fine Tuning Cutoff on AutoML Model

Both of the methods used earlier for tuning the cutoff were run in the Jupyter Notebook, and the results were similar: both improved the testing ROC_AUC score slightly, to 0.693.

Results

Results Compared from all Models

Over the two articles, we have built many models, starting from our first model using Grid Search with CatBoost and ending with AutoML with cutoff fine-tuning. This journey gave us around a 30% improvement in the testing score, which increased from 0.53 to 0.69.

Last Words

Although I participated in this Hackathon, I could not achieve a result this good at the time. I followed the same steps except for Feature Engineering, where I added a step while compiling this article, and that did make a difference. All the data files and the Jupyter Notebook are available in the GitHub repository. Comments and suggestions are welcome.


Raghuvansh Tahlan

Passionate about Data Science. Stock Market and Sports Analytics is what keeps me going. Writer at Analytics Vidhya Publication. https://github.com/rvt123