Employee Attrition Factors Prediction

University of Toronto Machine Intelligence Team

Jun 12, 2022

A UTMIST Project by Maliha Lodi, Afnan Rahman, Kevin Qu, Omer Raza Khan, Yinan Zhao, and Matthew Zhu.

Project Overview

An organization is only as good as its employees, as they provide the company with its competitive advantage. As a result, it is imperative that companies understand how to retain their top talent. Using machine learning and data analytics, this project aims to identify areas of improvement across all departments of a company by determining factors that might be responsible for employee attrition in different teams.

To our knowledge, no company is known to use an automated process for predicting employee attrition, yet organizations incur large tangible and intangible costs from high employee turnover. We therefore built a machine learning pipeline that predicts employee attrition and identifies the combinations of factors that may be responsible for significant turnover. These results give Human Resources teams insight into which improvements could increase employee retention.

Dataset

The data used for this project is from the IBM HR Analytics Employee Attrition & Performance Dataset. It contains 35 categorical and numerical features for 1470 unique employees who work in one of nine different job roles, along with a label indicating whether each employee has quit. Because the project relies on supervised learning, these labels are crucial.

Data Preprocessing

Before inputting our data into our models, we preprocessed it to allow our models to properly learn the patterns present in the data and enhance the overall performance. Our dataset contained many categorical variables, which we transformed into numerical data using one-hot encoding. We chose this specific encoding method to avoid introducing a relationship between the values of the variables; we simply wanted to convert them to a numerical format.
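The encoding step described above can be sketched in a few lines. This is only an illustration in plain Python, not the project's code: the helper `one_hot_encode` is hypothetical and the rows are toy values. It shows why one-hot encoding avoids implying an order between categories: each category gets its own 0/1 column instead of an integer code such as 0, 1, 2.

```python
def one_hot_encode(rows, column):
    """Replace `column` in each row (a dict) with one 0/1 indicator per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


# Toy rows for illustration only; not taken from the project's data.
employees = [
    {"Age": 41, "Department": "Sales"},
    {"Age": 29, "Department": "Research & Development"},
    {"Age": 35, "Department": "Human Resources"},
]

encoded = one_hot_encode(employees, "Department")
for row in encoded:
    print(row)
```

In practice, a library routine such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would normally do this work; the article does not say which one was used.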

While initially exploring the data, we also found that our dataset was imbalanced: there were drastically more employees who did not quit their job than those who did. We used the Synthetic Minority Oversampling Technique (SMOTE) to fix this problem by generating synthetic examples of employees who quit. This technique was chosen because the research papers we referenced also used it. We also did not want to balance the classes by deleting rows for employees who stayed with the company, since that would have decreased the size of the dataset.
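The core idea behind SMOTE can be sketched as follows. This is a simplified, standard-library illustration and not the implementation used in the project: the function `smote_sketch`, its parameters, and the toy points are all assumptions made for the example. Each synthetic point is placed at a random position on the line segment between a minority-class sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples
    and one of their k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy two-feature minority class; values are made up for illustration.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2), (1.8, 2.1)]
synthetic = smote_sketch(minority, n_new=4)
print(synthetic)
```

A maintained implementation, such as the `SMOTE` class in the imbalanced-learn package, is the usual choice in practice. Note that oversampling is normally applied to the training split only, so that synthetic points do not leak into the evaluation data.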

The final preprocessing step we took included removing irrelevant variables, such as employee number, and aggregating common variables like daily_rate, hourly_rate, and monthly_rate into one column.

Classifiers

The project begins by using five binary classifiers that classify employees based on whether they would quit their job or not. The models are Logistic Regression, Decision Tree, Random Forest, XGBoost, and Support Vector Machine (SVM). These models were chosen since past research has used them to address similar classification problems.

Logistic Regression

Logistic Regression is a machine learning model that estimates the probability of each of the two possible outcomes in a binary classification setting. For feature selection, an automatic wrapper method was used. In addition, the hyperparameters of the logistic regression models were tuned using random search. The final model achieved an accuracy of 91.89%, which is better than the research paper’s accuracy of 85.48%.
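Random search over logistic regression hyperparameters might look like the sketch below. It assumes scikit-learn and SciPy, uses synthetic data from `make_classification` rather than the IBM dataset, and tunes only the regularization strength `C`; the search space, fold count, and number of iterations are illustrative choices, not the project's settings.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded employee features and attrition labels.
X, y = make_classification(n_samples=400, n_features=20, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale the features, then fit logistic regression.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Sample 20 candidate values of C from a log-uniform range and
# score each one with 5-fold cross-validation.
search = RandomizedSearchCV(
    pipeline,
    param_distributions={"logisticregression__C": loguniform(1e-3, 1e2)},
    n_iter=20,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)

best_c = search.best_params_["logisticregression__C"]
test_accuracy = search.score(X_test, y_test)
print("best C:", best_c)
print("held-out accuracy:", test_accuracy)
```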

Support Vector Machine (SVM)

The Support Vector Machine (SVM) classifier maps data points into a high-dimensional space and looks for the hyperplane that best separates the two classes; the data points closest to that hyperplane are called the “support vectors.” To refine our model, we used a Sequential Forward Selector (a wrapper method), which cut the data down to around 20 features and achieved an accuracy of 88%. To further improve the model, we tuned its hyperparameters with grid search and random search. The maximum accuracy we reached across all validation and test runs was 92%, a 7% increase over the reference paper.
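Forward feature selection wrapped around an SVM can be sketched with scikit-learn's `SequentialFeatureSelector`. The data here is synthetic, and the choice of four features, an RBF kernel, and three folds is an assumption for a quick example; the project itself kept around 20 features, and the article does not state which library it used.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 10 features, 4 of them informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Starting from no features, repeatedly add the feature whose inclusion
# gives the best cross-validated score until four are selected.
selector = SequentialFeatureSelector(
    SVC(kernel="rbf"), n_features_to_select=4, direction="forward", cv=3
)

# Scale, select features, then train the final SVM on the selected columns.
model = make_pipeline(StandardScaler(), selector, SVC(kernel="rbf"))
model.fit(X, y)

kept = model.named_steps["sequentialfeatureselector"].get_support(indices=True)
predictions = model.predict(X)
print("selected feature indices:", kept)
```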

Decision Tree

Decision Trees are widely used in both regression and classification problems. They use a tree-like structure in which each node splits the input data into two subsets, and the splitting is repeated on each subset. A Sequential Forward Selector chose seventeen features in the feature selection stage, which increased the baseline model accuracy by about 6%. Hyperparameter tuning did not improve the model metrics, leaving the best model with a final accuracy of 88.8%, about 9% better than the referenced research paper.

Random Forest

The random forest model is an ensemble of multiple decision trees: each tree makes its own prediction, and the prediction with the most votes becomes the model’s final prediction. The model was trained with 15-fold cross-validation, with k=15 chosen because of the small size of the dataset, and reached an accuracy of 93%, almost a 10% increase over the reference paper. Because feature selection and hyperparameter tuning decreased the accuracy, neither was included in the final model training.
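A 15-fold cross-validated random forest can be sketched as follows, assuming scikit-learn and synthetic data in place of the employee dataset. The number of trees and the use of stratified, shuffled folds are assumptions for this example, not details given in the article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the preprocessed employee data.
X, y = make_classification(n_samples=300, n_features=15, n_informative=5, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# 15 folds: each model trains on 14/15 of the data and is scored on the rest.
folds = StratifiedKFold(n_splits=15, shuffle=True, random_state=0)
scores = cross_val_score(forest, X, y, cv=folds, scoring="accuracy")

print(f"mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```

With many folds, each training set contains almost all of the data, which helps when the dataset is small, at the cost of fitting the model more times.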

XGBoost

The XGBoost (eXtreme Gradient Boosting) model is also a tree-based ensemble method. Unlike a random forest, which trains its trees independently, it builds decision trees sequentially, with each new tree correcting the errors of the trees before it, and combines these weaker trees into a collectively stronger model. For feature selection, XGBoost’s feature importance technique was used, which discarded 18 features. Using the selected features, the model was trained with 15-fold cross-validation and fine-tuned with Random Search CV. The final model yielded 92.73% accuracy, around 3% higher than the reference paper.
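Importance-based feature selection with a boosted tree model can be sketched as below. To keep the example dependent on scikit-learn only, it substitutes `GradientBoostingClassifier` for XGBoost; this is a stand-in, not the model used in the project. The synthetic data, the cutoff of six features, and the 5-fold evaluation are likewise assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 12 features, 4 of them informative.
X, y = make_classification(n_samples=300, n_features=12, n_informative=4, random_state=0)

# Fit a boosted ensemble once to obtain per-feature importance scores.
booster = GradientBoostingClassifier(n_estimators=100, random_state=0)
booster.fit(X, y)

# Rank features from most to least important and keep the top six
# (the cutoff is arbitrary here).
top = np.argsort(booster.feature_importances_)[::-1][:6]
X_selected = X[:, top]

# Re-evaluate a fresh model on the reduced feature set.
scores = cross_val_score(
    GradientBoostingClassifier(n_estimators=100, random_state=0),
    X_selected,
    y,
    cv=5,
    scoring="accuracy",
)
print("kept feature indices:", sorted(top.tolist()))
print(f"mean accuracy on selected features: {scores.mean():.3f}")
```

For brevity, the importances here are computed on the full dataset before cross-validation, which can make the scores optimistic; a stricter setup would perform the selection inside each fold.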

The table below summarizes the steps taken to produce the best results:

Clustering

Through clustering, we identified common trends within groups of employees, in particular which distinct “clusters” of employees are likely to quit their jobs.

K-Means

As our data was a mix of both categorical and continuous data with a high number of features, we used K-Means clustering on our one-hot encoded data.

T-SNE

To visualize the clusters, we reduced the number of features using t-SNE and then grouped the data using K-Means. t-SNE (t-distributed Stochastic Neighbour Embedding) is an unsupervised dimensionality-reduction algorithm for visualizing high-dimensional data. It is similar to PCA, but while PCA seeks to maximize variance by preserving large pairwise distances among data points, t-SNE only preserves small pairwise distances as a measure of similarity. In this way, t-SNE uses a probabilistic approach to capture complex, non-linear relationships among features, while PCA uses a linear approach. As such, t-SNE was able to capture similarities among data points and reduce computation costs for K-Means, which was used to group the employees predicted to quit. The figure below shows the results of running t-SNE and K-Means with three clusters for all models.
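The two-step procedure described above, t-SNE followed by K-Means with three clusters, can be sketched with scikit-learn. The data is synthetic (`make_blobs`), and the perplexity, initialization, and random seeds are assumptions for the example rather than the project's settings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in for the one-hot encoded rows of employees predicted to quit.
X, _ = make_blobs(n_samples=150, n_features=20, centers=3, random_state=0)

# Step 1: embed the 20-dimensional points in 2 dimensions.
embedding = TSNE(
    n_components=2, perplexity=30, init="pca", random_state=0
).fit_transform(X)

# Step 2: group the embedded points into three clusters.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding)

print("embedding shape:", embedding.shape)
print("cluster sizes:", [int((labels == c).sum()) for c in range(3)])
```

The two columns of `embedding` can be plotted as a scatter plot coloured by `labels` to produce a figure of the kind the article refers to.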

Comparison among clusters

The cluster analysis was done to find the combinations of dominant features within each cluster that would lead to employee attrition. Below is a general summary of all the clusters identified from the employees that our classifiers predicted would quit.

Figure 2: Visualizing how the numerical features (Age, Distance from Home, and Average Remuneration Rate) differ across the three clusters

Cluster 0

This cluster consists primarily of older, experienced male employees who are single. These employees have worked with the company the longest and do the most overtime work, yet they are paid the least. They are also the least satisfied with their working environment of all the clusters. Additionally, these employees are largely in the R&D department. Compared to the other clusters, they have the lowest job level and the poorest job satisfaction. Overall, these employees can be characterized as those who work the most but get paid the least.

Cluster 1

This cluster mostly contains males in their mid-20s to mid-30s who are paid the most compared to the other clusters and have the lowest number of years in their current roles. They tend to work overtime as well. Similar to Cluster 0, most of these employees are part of the R&D department, with job titles such as ‘Research Scientist’ and ‘Laboratory Technician’. These employees also have the highest level of environmental satisfaction and job involvement of all the clusters. Even though these employees chose to leave, they were quite satisfied with their jobs. This cluster also contained the greatest number of divorced employees.

Cluster 2

This cluster is comparatively the most balanced in terms of gender, with a male-to-female ratio of 1.8. Of all the clusters, the employees here switch roles the most, and they have the lowest number of years in their current roles with their current managers. They score surprisingly high in work-life balance, and most of them are married. This cluster also has a roughly equal number of employees who work overtime and those who do not.

Figure 3: Visualizing how the categorical features (Gender, Relationship Status, Whether they do overtime or not, Department) differ across the three clusters
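Per-cluster comparisons like those summarized above can be computed with a group-by over the cluster labels. The sketch below assumes pandas; the column names mirror features mentioned in the article, but every value is made up for illustration and none of it is a result from the project.

```python
import pandas as pd

# Toy table of employees predicted to quit, with their assigned cluster.
# All values are invented for illustration only.
quitters = pd.DataFrame(
    {
        "cluster": [0, 0, 1, 1, 2, 2],
        "Age": [45, 50, 27, 31, 36, 38],
        "OverTime": ["Yes", "Yes", "Yes", "No", "Yes", "No"],
    }
)

# Numerical feature: average value per cluster.
age_by_cluster = quitters.groupby("cluster")["Age"].mean()

# Categorical feature: share of each category within every cluster.
overtime_share = pd.crosstab(
    quitters["cluster"], quitters["OverTime"], normalize="index"
)

print(age_by_cluster)
print(overtime_share)
```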

Future Steps

Large corporations and small businesses can utilize our models’ results to understand their employees better, thus mitigating employee turnover. Companies can pass their employee data through the trained models, find common factors between employees projected to leave, and then make the necessary changes to ensure a lower employee turnover rate. In the future, it may be interesting to experiment with more advanced machine learning models, such as a neural network, to see whether such a model’s predictions are more accurate than the traditional machine learning models used in this study.