Employee Attrition Factors Prediction

Project Overview

An organization is only as good as its employees, as they provide the company with its competitive advantage. As a result, it is imperative that companies understand how to retain their top talent. Using machine learning and data analytics, this project aims to identify areas of improvement across all departments of a company by determining factors that might be responsible for employee attrition in different teams.


The data used for this project comes from the IBM HR Analytics Employee Attrition & Performance Dataset. It contains 35 categorical and numerical features for 1470 unique employees, each working in one of nine job roles, along with a label indicating whether the employee has quit. Because the project relies largely on supervised learning, these labels are crucial.

Data Preprocessing

Before inputting our data into our models, we preprocessed it to allow our models to properly learn the patterns present in the data and enhance the overall performance. Our dataset contained many categorical variables, which we transformed into numerical data using one-hot encoding. We chose this specific encoding method to avoid introducing a relationship between the values of the variables; we simply wanted to convert them to a numerical format.
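To make the encoding step concrete, here is a minimal sketch of one-hot encoding in plain Python. The helper name `one_hot_encode` and the three sample rows are purely illustrative assumptions, not the team's actual preprocessing code; in practice a library routine would likely be used instead.

```python
def one_hot_encode(rows, column):
    """Replace `column` in each row with one 0/1 indicator per category."""
    # Collect the distinct categories; sorting keeps the output deterministic.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical column being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator field per category: 1 for the match, 0 otherwise.
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical sample rows; the values are illustrative only.
employees = [
    {"Age": 41, "Department": "Sales"},
    {"Age": 49, "Department": "Research & Development"},
    {"Age": 37, "Department": "Human Resources"},
]

encoded = one_hot_encode(employees, "Department")
for row in encoded:
    print(row)
```

Each department becomes its own 0/1 column, so no department is treated as "larger" or "closer" to another, which is the property the paragraph above describes. Mapping the departments to 0, 1, 2 instead would imply an ordering that does not exist.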


We began with five binary classifiers that predict whether an employee will quit their job: Logistic Regression, Decision Tree, Random Forest, XGBoost, and Support Vector Machine (SVM). We chose these models because past research has used them to address similar classification problems.
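A comparison of this kind might be set up roughly as follows with scikit-learn. This is a sketch under stated assumptions: the data is synthetic (generated with `make_classification`), the hyperparameters are mostly library defaults, the function name `compare_classifiers` is hypothetical, and XGBoost is left out to keep the example to a single dependency. None of it reflects the team's actual configuration or results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(random_state=0):
    """Fit several binary classifiers on synthetic data and return test accuracy."""
    # Synthetic, imbalanced stand-in for encoded employee features and labels.
    X, y = make_classification(
        n_samples=600, n_features=12, weights=[0.8, 0.2], random_state=random_state
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state
    )

    # Scale inputs for the models that are sensitive to feature magnitude.
    models = {
        "Logistic Regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "Decision Tree": DecisionTreeClassifier(random_state=random_state),
        "Random Forest": RandomForestClassifier(random_state=random_state),
        "SVM": make_pipeline(StandardScaler(), SVC()),
    }

    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


scores = compare_classifiers()
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.3f}")
```

Because attrition labels are typically imbalanced, accuracy alone can be misleading; metrics such as precision, recall, or F1 on the "quit" class are worth reporting alongside it.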

Table 1: Classifier Results.


Through clustering, we identified common trends within groups of employees, in particular which distinct “clusters” of employees are likely to quit their jobs.

Figure 1: t-SNE plots for all models
Figure 2: How the numerical features (Age, Distance from Home, and Average Remuneration Rate) differ across the three clusters
Figure 3: How the categorical features (Gender, Relationship Status, Overtime, and Department) differ across the three clusters

Future Steps

Large corporations and small businesses can utilize our models’ results to understand their employees better, thus mitigating employee turnover. Companies can pass their employee data through the trained models, find common factors between employees projected to leave, and then make the necessary changes to ensure a lower employee turnover rate. In the future, it may be interesting to experiment with more advanced machine learning models, such as a neural network, to see whether such a model’s predictions are more accurate than the traditional machine learning models used in this study.



University of Toronto Machine Intelligence Team

UTMIST’s Technical Writing Team publishes articles on topics within machine learning to our official publication: https://medium.com/demistify