RealTime — Real Estate Price Prediction

University of Toronto Machine Intelligence Team

Jun 12, 2022

A UTMIST Project by Arsh Kadakia, Sepehr Hosseini, Matthew Leung, Arahan Kadakia, and Leo Li

RealTime is a machine learning project for predicting home prices in the GTA. It is a significant improvement on last year’s project, RealValue, by focusing on more data than ever before. Thanks to the effective use of data from a variety of demographic and public sources and an iterative database update process, RealTime is able to provide highly accurate home pricing decisions for brokerages and regular homeowners alike.

Motivation & Goal

In Toronto, sold home prices continued to rise at an explosive growth rate from January 2021 to January 2022, increasing by over 24% year-over-year. Despite interest rate increases slowing this rise recently, being able to afford a good-quality home is now harder than ever.

In this environment, finding value and having full autonomy and understanding about your housing decisions are critical. However, for different stakeholders, this experience is more challenging in various aspects.

For brokerages, the fast-moving nature of home sales and an ever-changing housing market lead to significant negotiation uncertainty. The large discrepancy between new and old home prices leads to decisions being made on the “hunch” or “intuition” of realtors who may not have the best grasp of the market. A data-driven solution that adapts to changing times and considers different types of information can give a brokerage a distinct market advantage, as it can offer better advice to its clients. In both the short and long term, this stronger advice will lead to better client relationships and, consequently, success.

For clients, the experience is even more challenging, as they want to stay “in budget” for a dream house or a first investment property. The volatile market leads people to make once-in-a-lifetime home decisions on a “hunch” and deal with the consequences later. In many cases, clients overpay for homes, putting further financial strain on themselves and hamstringing their ability to grow their wealth later on. A solution driven by real-time insights and accurate predictions can help clients make more confident decisions.

RealTime addresses the problem of unclear market changes by giving brokerages and clients access to an ML-backed real estate solution that draws upon different data sources, each kept as current as possible.

The goal of RealTime is to achieve 8% mean absolute percentage error (MAPE) on a final set of attributes determined by the available data sources.
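For reference, MAPE averages the absolute prediction error expressed as a fraction of the true price. The following is a minimal sketch in Python; the function name and the sample prices are illustrative assumptions and are not taken from the project’s code.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned as a percentage."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Each term is the absolute error relative to the true price.
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Illustrative sold prices and predictions (made-up numbers).
sold = [800_000, 1_200_000, 650_000]
predicted = [760_000, 1_260_000, 702_000]
print(f"MAPE: {mape(sold, predicted):.2f}%")  # prints "MAPE: 6.00%"
```

Under this definition, a MAPE of 8% means that predictions are off by 8% of the sale price on average.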

A notable subgoal is to create a real-time, extensive database that is up-to-date with information from many data sources. Without such a database, the promise of a data-driven solution that is aware of current market developments doesn’t quite exist.

Dataset

One of the most prominent improvements over last year’s project is the dataset, which has been drastically restructured in RealTime. In last year’s project, RealValue, the dataset was manually constructed and consisted of only 157 homes with very limited attributes. This year, RealTime’s dataset is composed of over 30,000 homes across Toronto, giving us a much larger and more diverse set of data. Each house also has several more attributes, both numerical and categorical, as well as over 40 amenities, which further strengthen the dataset. Amenities include nearby universities, public transit, restaurants, and schools. New attributes include parking space, number of bedrooms, number of bathrooms, and type of home (detached, condo, etc.).

Another major enhancement in RealTime is that, unlike last year’s approach of manually adding to the dataset, our new dataset follows an automated and iterative process. Regular updates keep the dataset up-to-date and the home prices reliable and accurate. This also means that our database is constantly growing, which plays a major role in improving our performance.

Workflow

Data collection is an iterative process that involves fetching house data through an API request, saving this data into our database, and periodically updating it with more detailed house information and amenities. The process is automated and executes every 12 hours so that our database always provides up-to-date and reliable information. To automate it, we use GitHub Actions with Docker container actions to create the builds. We run the actions on a Google Cloud Virtual Machine, ensuring that our hardware is consistent and can reliably update our database. Our database of choice is MongoDB due to its fast data access for read/write operations.
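The fetch, save, and enrich cycle described here can be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch_listings` and `fetch_details` are hypothetical stand-ins for the API requests, the field names are invented, and a plain dictionary takes the place of the MongoDB collection.

```python
from datetime import datetime, timezone


def fetch_listings():
    """Hypothetical stand-in for the listings API request."""
    return [
        {"listing_id": "A1", "price": 850_000, "bedrooms": 3},
        {"listing_id": "B2", "price": 1_100_000, "bedrooms": 4},
    ]


def fetch_details(listing_id):
    """Hypothetical stand-in for the slower detail and amenity lookup."""
    return {"amenities": {"schools": 3, "subway_stations": 1}}


def run_update_cycle(db, now=None):
    """One pass of fetch -> save -> enrich. `db` maps listing_id to a record."""
    now = now or datetime.now(timezone.utc)
    for listing in fetch_listings():
        # Upsert: merge new fields into the existing record, if any.
        record = db.setdefault(listing["listing_id"], {})
        record.update(listing)
        record["updated_at"] = now
    for listing_id, record in db.items():
        # Enrich only records that have no detailed information yet.
        if "amenities" not in record:
            record.update(fetch_details(listing_id))
    return db


db = {}  # a plain dict stands in for the database collection (assumption)
run_update_cycle(db)
run_update_cycle(db)  # a second run updates records instead of duplicating them
print(len(db))  # prints 2
```

Keying every record by a stable listing identifier is what lets a scheduled job run repeatedly without creating duplicates.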

Cleaning/Feature Engineering

Normalization

The most complicated part of creating such a solution is appropriately pre-processing the different sources of received data.

Our group classified received features/attributes into three types: numerical, “easy” categorical, and “hard” categorical.

Attributes such as bedrooms, bathrooms, and longitude/latitude are numerical examples. For numerical data, two types of normalization were used: max normalization and min-max normalization. For numerical attributes with more unique data points, min-max normalization was the preferred choice. Otherwise, numerical attributes with fewer unique points were treated with max normalization. These decisions were taken by testing various configurations and noting patterns between the most accurate trials.
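To make the two schemes concrete, here is a minimal sketch in plain Python. The function names and sample values are assumptions for illustration; the project’s actual preprocessing code is not shown in this article.

```python
def max_normalize(values):
    """Divide by the largest absolute value, so zero stays at zero."""
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0.0 for _ in values]
    return [v / peak for v in values]


def min_max_normalize(values):
    """Rescale so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


bedrooms = [1, 2, 3, 4]                  # few unique values
latitude = [43.60, 43.65, 43.70, 43.80]  # many unique values in a narrow band

print(max_normalize(bedrooms))  # prints [0.25, 0.5, 0.75, 1.0]
print(min_max_normalize(latitude))
```

Min-max normalization spreads a narrow band of values, such as Toronto latitudes, over the full 0 to 1 range, whereas max normalization preserves ratios between values.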

“Easy” categorical features consist of discrete features which could be directly one-hot encoded without further data processing. Examples include the house attachment style (e.g. attached, detached, semi-detached) and basement development (e.g. finished, unfinished, no basement).
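As a quick illustration of one-hot encoding, the sketch below turns attachment styles into indicator columns. The helper name `one_hot` is an assumption; in practice a library encoder would typically be used.

```python
def one_hot(values, categories=None):
    """Return (categories, rows), where each row contains a single 1."""
    if categories is None:
        categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column for this value's category
        rows.append(row)
    return categories, rows


styles = ["detached", "attached", "semi-detached", "detached"]
categories, encoded = one_hot(styles)
print(categories)  # prints ['attached', 'detached', 'semi-detached']
print(encoded[0])  # prints [0, 1, 0]
```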

“Hard” categorical features consist of features that need to be specially handled in order to be useful. Examples of these include a dictionary of amenities near a house (e.g., a certain house has 3 subway stations, 1 university, 3 schools, and 4 supermarkets within a 2km radius) and a dictionary of room areas.
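The article does not spell out how these dictionaries were handled. One common approach, shown here purely as an assumption, is to expand each dictionary into a fixed set of count columns, filling in zero where a house has no entry for an amenity.

```python
def flatten_amenities(houses, amenity_keys=None):
    """Turn per-house amenity dictionaries into fixed-width count rows."""
    if amenity_keys is None:
        # Use every amenity seen in any house, in a stable order.
        amenity_keys = sorted({key for house in houses for key in house})
    rows = [[house.get(key, 0) for key in amenity_keys] for house in houses]
    return amenity_keys, rows


# Illustrative amenity counts within a fixed radius (made-up data).
houses = [
    {"subway_stations": 3, "universities": 1, "schools": 3, "supermarkets": 4},
    {"schools": 2, "supermarkets": 1},
]
keys, rows = flatten_amenities(houses)
print(keys)  # prints ['schools', 'subway_stations', 'supermarkets', 'universities']
print(rows)  # prints [[3, 3, 4, 1], [2, 0, 1, 0]]
```

Once flattened, these columns can be normalized like any other numerical attribute.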

Specific Choices

Once we incorporated data from different APIs, we realized that the best feature space involved significantly fewer attributes.

In order to reduce the number of attributes, we used a variety of assessment plots, including box-and-whisker plots and correlation plots, as shown above.

Box-and-whisker plots were used to examine the general trend that each value of a single attribute presented, and how similar that trend was to those of the other values of the same attribute. If the values behaved “too” similarly, the attribute was removed from later training.

Correlation plots were used similarly. Again, if the overall attribute slope was considered negligible (close to zero), it was removed from later training.
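A numerical analogue of reading a correlation plot is to compute each attribute’s Pearson correlation with price and drop attributes whose correlation is close to zero. The sketch below assumes a cutoff of 0.1 and uses made-up data; the article describes a visual judgment and does not state a threshold.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, prices, threshold=0.1):
    """Keep attributes whose absolute correlation with price meets the cutoff."""
    kept = {}
    for name, values in columns.items():
        r = pearson(values, prices)
        if abs(r) >= threshold:
            kept[name] = r
    return kept


prices = [500, 600, 700, 800, 900, 1000]  # in thousands, illustrative
columns = {
    "bedrooms": [1, 2, 2, 3, 4, 4],
    "corner_lot": [1, 0, 0, 0, 0, 1],  # hypothetical attribute unrelated to price
}
print(sorted(select_features(columns, prices)))  # prints ['bedrooms']
```

Correlation only captures linear relationships, so an attribute with a strong non-linear effect could be discarded by mistake. That is one reason to pair this check with plots.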

Using these analysis tools, we reduced our feature space, which increased training accuracy and, more importantly, reduced overfitting, leading to higher test accuracy.

Training

Upon further inspection of our own dataset, we realized that a significant portion (approximately 65%) of our houses didn’t have stored images. For that reason, we had to evaluate the dataset with that issue in mind.

The training was conducted in two parts:

1. Training Without Images

In this part, we had access to approximately 17,000 fully complete homes. However, we could not utilize the notable attribute of images.

2. Training With Images

In this part, we had access to approximately 6,000 homes with photos. While this is a lower number of homes, we were able to include photos as a feature in this subset.

Training Without Images

We chose a set of regressors to test our dataset upon, with justifications provided as follows:

1. Support Vector Regressor

Method: Extends support vector machines to regression. A kernel transforms the data into a higher-dimensional space, in which a linear regression can be applied to generate a hyperplane that represents the data. The objective function of SVR also allows for some error tolerance.
Benefits: This regressor was chosen due to its strength in working with higher-dimensional data and reducing prediction errors to a certain range. In addition, SVR has been used in the literature for housing price prediction.

2. Multi-layer Perceptron

Method: Fully connected feedforward artificial neural network, consisting of layers of neurons with an activation function between each layer.
Benefits: MLPs are a basic neural network widely used in machine learning, with the capacity to solve complex problems.

3. Ensemble of Regression Trees (Extra Trees Ensemble Regressor)

Method: An ensemble method that uses many decision trees to make a prediction. In a regression problem, an average is taken of the decision tree results. Generally, the decision trees are fitted to the entire dataset.
Benefits: Ensemble of Regression Trees has been used in the literature for housing price prediction. Ensemble methods using decision trees are useful in that they create a more diverse set of regressors, which can then be averaged. Outliers do not have a significant impact, and decision trees can be easily interpreted and visualized. In addition, assumptions about the data distribution are not required.

4. Gradient Boosting (Gradient Boosted Decision Trees)

Method: Gradient boosting is usually used with an ensemble of decision trees (as in the case of Scikit-Learn) and allows for the optimization of loss functions.
Benefits: The optimization of loss functions allows this method to usually perform better than random trees.

5. K-Nearest Neighbours Regression

Method: This algorithm extends K-Nearest Neighbours classification to regression. In order to make a prediction for some input data point, this algorithm takes the K closest points to that data point (based on some distance metric). It then averages the labels of those K closest points and uses that average as the prediction. No explicit model is built.
Benefits: This algorithm is simple to implement, and no assumptions need to be made about the data. No model needs to be built, because this algorithm just relies on computing distances between points. K-Nearest Neighbours has been used in the literature before for housing price prediction.

6. Random-Forest Model

Method: This method is similar to Extra Trees, except that the decision trees are fitted to random subsets of the dataset.
Benefits: Since Random-Forest fits decision trees to random subsets of the data, overfitting is controlled and accuracy can be improved. Ensemble methods using decision trees are useful in creating a more diverse set of regressors, which can then be averaged. Outliers do not have a significant impact, and decision trees can be easily interpreted and visualized. In addition, assumptions about the data distribution are not required.

7. Huber Regressor

Method: A linear regression model that uses the Huber loss function instead of squared error loss. The Huber loss penalizes outliers less heavily than squared error loss does.
Benefits: This model is less sensitive to outliers because of its loss function, making it useful for a large housing price dataset in which outliers are present.

8. Ridge Regressor

Method: A linear regression model that uses a linear least squares loss function with an additional L2 regularization loss term in order to penalize overfitting.
Benefits: Due to the regularization loss term, this model is effective at preventing overfitting in data. This is particularly desirable for housing price prediction, in which we wish to obtain a generalizable model.
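A comparison of this kind can be set up in Scikit-Learn along the following lines. This is a sketch under assumptions: the data is synthetic, the hyperparameters are mostly library defaults, and none of it reproduces the project’s actual configuration or results.

```python
import numpy as np
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import HuberRegressor, Ridge
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR


def compare_regressors(seed=0):
    """Fit each regressor on synthetic data and return test MAPE in percent."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(400, 5))  # stand-in for normalized features
    # Synthetic "price" in units of $100k, kept positive so MAPE is defined.
    y = 5.0 + 4.0 * X[:, 0] + 2.0 * X[:, 1] * X[:, 2] + rng.normal(0.0, 0.2, 400)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    models = {
        "SVR": SVR(),
        "MLP": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=seed),
        "Extra Trees": ExtraTreesRegressor(n_estimators=100, random_state=seed),
        "Gradient Boosting": GradientBoostingRegressor(random_state=seed),
        "KNN": KNeighborsRegressor(n_neighbors=5),
        "Random Forest": RandomForestRegressor(n_estimators=100, random_state=seed),
        "Huber": HuberRegressor(),
        "Ridge": Ridge(alpha=1.0),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = 100.0 * mean_absolute_percentage_error(
            y_test, model.predict(X_test)
        )
    return results


# Print the models from lowest to highest test MAPE.
for name, score in sorted(compare_regressors().items(), key=lambda kv: kv[1]):
    print(f"{name:18s} {score:5.2f}%")
```

The ranking on synthetic data says nothing about the ranking on real housing data; the point is only that all eight models share the same fit and predict interface, which makes such a comparison cheap to run.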

Eventually, our best model was the ensemble of regression trees, with the rest of the results declared below.

Training With Images

Following our determination of the regression tree ensemble as the best model, we decided to utilize two architectures to train on the homes with photos.

The first architecture was the same architecture utilized in last year’s RealValue project as follows:

In the above architecture, the CNN network was chosen to be EfficientNetB0 due to its high accuracy on image classification problems.

The second architecture involved incorporating photos into the training of a traditional model such as a regression tree ensemble.

We did so by “flattening” the photos into a high-dimensional vector. This was done with the use of an autoencoder. The overall design looks like the following:

The input photo is passed through an encoder, and the consequent embedding is combined with the statistical input and eventually passed into the ensemble for inference.
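The data flow of this second architecture can be sketched as follows. Note the substitution: PCA is used here only as a simple stand-in for a trained autoencoder’s encoder, and the photos, attributes, and prices are all synthetic. The embedding size and the helper name are likewise assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesRegressor


def build_combined_features(images, tabular, embedding_dim=8):
    """Encode flattened images and append the embedding to tabular features."""
    flat = images.reshape(len(images), -1)
    # PCA is only a stand-in here for a trained autoencoder's encoder.
    encoder = PCA(n_components=embedding_dim, random_state=0).fit(flat)
    embedding = encoder.transform(flat)
    return np.hstack([tabular, embedding]), encoder


rng = np.random.default_rng(0)
images = rng.uniform(0.0, 1.0, size=(200, 16, 16, 3))  # synthetic "photos"
tabular = rng.uniform(0.0, 1.0, size=(200, 5))         # synthetic attributes
prices = 5.0 + 4.0 * tabular[:, 0] + rng.normal(0.0, 0.2, 200)

features, encoder = build_combined_features(images, tabular)
model = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(features, prices)
print(features.shape)  # prints (200, 13)
```

The key idea is that the tree ensemble never sees raw pixels. It receives a short embedding vector per photo, placed alongside the statistical attributes.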

Altogether, after considering both inputs, the following statistics emerge.

Among the 6,000 houses with photos, it is clear that photos had a positive impact on test accuracy, improving MAPE by nearly 1.5%.

However, it is also evident that the smaller number of houses limits the effectiveness of this subset. When the full dataset of 17,000 houses is used instead, test MAPE improves by 1%.

For that reason, we can conclude that dataset length (the number of homes) matters more than its width (the number of attributes) for this project. Ideally, we would have both in future projects, as our results clearly show that having more attributes increases accuracy.

Takeaways

Creating a self-updating database requires APIs that are mostly unchanging, unrestricted in usage, and in need of little-to-no maintenance. Using APIs in a way that satisfies all three qualities was our biggest challenge.

For example, real estate APIs are now either rate-limited or provide few detailed attributes.

Our next steps now include making our request allocations more efficient so we can serve more homes. In addition, we want to draw upon further data sources and integrate more APIs so that we can build a wider dataset while maintaining a high length.

Conclusion

Altogether, we realized that high-quality data is arguably the most important component of an elite machine learning algorithm. Although last year’s project explored interesting architectures, it was this year’s larger dataset that led to a 6% MAPE improvement. As our best MAPE is less than 10%, we can conclude that our real estate price estimator is highly accurate.

You can check out our work on the real-time.ca website and try our price estimator for yourself!