RealTime — Real Estate Price Prediction

Motivation & Goal

In Toronto, sold home prices rose explosively from January 2021 to January 2022, increasing by over 24% year-over-year. Although recent interest rate increases have slowed this rise, affording a good-quality home is now harder than ever.

Dataset

One of the most prominent improvements over last year’s project is the dataset, which has been drastically restructured in RealTime. Last year’s project, RealValue, used a manually constructed dataset of only 157 homes with very limited attributes. This year, RealTime’s dataset comprises over 30,000 homes across Toronto, giving us a much larger and more diverse set of data. Each home also has several more attributes, both numerical and categorical, as well as over 40 amenities, which further strengthen the dataset. Amenities include nearby universities, public transit, restaurants, and schools. New attributes include parking space, number of bedrooms, number of bathrooms, and type of home (detached, condo, etc.).

Workflow

Cleaning/Feature Engineering

Normalization

Example plots of analysis tools.

Training

Upon further inspection of our own dataset, we realized that a significant portion (approximately 65%) of our houses didn’t have stored images. For that reason, we had to evaluate the dataset with that issue in mind.

  • Method (Support Vector Regression, SVR): Extends support vector machines to regression. A kernel transforms the data into a high-dimensional space, in which a linear regression can be applied to generate a hyperplane that represents the data. The objective function of SVR also allows for some error tolerance.
  • Benefits: This regressor was chosen due to its strength in working with higher-dimensional data and reducing prediction errors to a certain range. In addition, SVR has been used in the literature for housing price prediction.
  • Method (Multilayer Perceptron, MLP): A fully connected feedforward artificial neural network, consisting of layers of neurons with an activation function between each layer.
  • Benefits: MLPs are a basic type of neural network widely used in machine learning, with the capacity to solve complex problems.
  • Method (Extra Trees): An ensemble method that uses many decision trees to make a prediction. In a regression problem, the predictions of the individual trees are averaged. Generally, each decision tree is fitted to the entire dataset.
  • Benefits: Ensembles of regression trees have been used in the literature for housing price prediction. Ensemble methods using decision trees are useful in that they create a more diverse set of regressors, which can then be averaged. Outliers do not have a significant impact, and decision trees can be easily interpreted and visualized. In addition, assumptions about the data distribution are not required.
  • Method (Gradient Boosting): Gradient boosting is usually used with an ensemble of decision trees (as in the case of Scikit-Learn) and allows for the optimization of loss functions.
  • Benefits: The optimization of loss functions usually allows this method to perform better than randomized tree ensembles.
  • Method (K-Nearest Neighbours Regression): This algorithm extends K-Nearest Neighbours classification to regression. To make a prediction for an input data point, it finds the K closest points to that data point (based on some distance metric) and uses the average of their labels as the prediction output. No explicit model is built.
  • Benefits: This algorithm is simple to implement, and no assumptions need to be made about the data. No model needs to be built, because this algorithm just relies on computing distances between points. K-Nearest Neighbours has been used in the literature before for housing price prediction.
  • Method (Random Forest): This method is similar to Extra Trees, except that the decision trees are fitted to random subsets of the dataset.
  • Benefits: Since Random Forest fits decision trees to random subsets of the data, overfitting is controlled and accuracy can be improved. Ensemble methods using decision trees are useful in creating a more diverse set of regressors, which can then be averaged. Outliers do not have a significant impact, and decision trees can be easily interpreted and visualized. In addition, assumptions about the data distribution are not required.
  • Method (Huber Regression): A linear regression model that uses the Huber loss function instead of squared error loss. The Huber loss penalizes outliers less than squared error loss does.
  • Benefits: This model is less sensitive to outliers because of its loss function, making it useful for a large housing price dataset in which outliers are present.
  • Method (Ridge Regression): A linear regression model that uses a linear least squares loss function with an additional L2 regularization term in order to penalize overfitting.
  • Benefits: Due to the regularization term, this model is effective at limiting overfitting. This is particularly desirable for housing price prediction, in which we wish to obtain a generalizable model.
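
As a rough illustration of the K-Nearest Neighbours regression procedure described in the list above, here is a minimal Python sketch. It is not RealTime's actual implementation: the function name knn_regress, the toy features, and the prices are hypothetical, and Euclidean distance is assumed as the distance metric.

```python
import math


def knn_regress(train_X, train_y, query, k=3):
    """Predict a value for `query` as the mean label of its k nearest training points."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training points")
    # Euclidean distance from the query to every training point (assumed metric).
    distances = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    # Keep the k closest points and average their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    return sum(y for _, y in nearest) / k


# Hypothetical toy data: (bedrooms, bathrooms) -> sold price in CAD.
homes = [(1, 1), (2, 1), (2, 2), (3, 3), (4, 2), (5, 4)]
prices = [450_000, 560_000, 610_000, 820_000, 940_000, 1_400_000]

# The three nearest homes to (3, 2) are (2, 2), (3, 3), and (4, 2).
prediction = knn_regress(homes, prices, query=(3, 2), k=3)
print(prediction)  # 790000.0
```

Note that no training step appears anywhere: the prediction is computed directly from stored points. With real attributes on very different scales, the features would normally be normalized first so that no single attribute dominates the distance.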

Takeaways

Creating a self-updating database requires APIs that are mostly unchanging, free of usage limits, and in need of little-to-no maintenance. Using APIs in a way that satisfies all three qualities was our biggest challenge.

Conclusion

Altogether, we realized that high-quality data is arguably the most important component of a strong machine learning model. Despite the interesting architectures we created in last year’s project, our large dataset this year led to a 6% MAPE improvement. As our best MAPE is less than 10%, we conclude that our real estate price estimator is highly accurate.
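
For readers unfamiliar with the metric, MAPE (mean absolute percentage error) averages the absolute prediction error as a percentage of the true price. The sketch below shows the standard definition in Python. The function name mape and the sample prices are hypothetical, and the exact evaluation code used for RealTime is not shown in this article.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    # Average of |actual - predicted| / |actual| over all homes, times 100.
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100 * total / len(actual)


# Hypothetical sold prices and model estimates (CAD).
sold = [500_000, 800_000, 1_000_000]
estimated = [450_000, 880_000, 1_000_000]

# Per-home errors are 10%, 10%, and 0%, so the MAPE is about 6.67%.
error = mape(sold, estimated)
print(f"MAPE: {error:.2f}%")
```

Because each error is scaled by the true price, a MAPE below 10% means the estimates are, on average, within 10% of the actual sold price.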

University of Toronto Machine Intelligence Team


UTMIST’s Technical Writing Team publishes articles on topics within machine learning to our official publication: https://medium.com/demistify