Wall Street Bots 2: Crypto Price Prediction Using Machine Learning

University of Toronto Machine Intelligence Team

Jun 1, 2023

A UTMIST Project by: Lisa Yu, Fernando Assad, Andrew Huang, Eliss Hui, Anand Karki, Sujit Magesh, Geting (Janice) Qin, Peter Shi, Dav Vrat, Nick Wood, Randolph Zhang.

WallStreetBots2 is a 6-month project to predict cryptocurrency prices from Twitter tweets and market data by applying machine learning techniques.

1. Background

1.1. Motivation

There is currently a high level of interest and investment in cryptocurrency markets, but the complex and ever-changing nature of these markets has made it challenging for investors to accurately predict price movements. In comparison to other financial instruments like stocks, bonds, and options, cryptocurrencies also have high quality market data that are freely accessible to the public.

As a result, there has been a growing interest in applying machine learning and natural language processing techniques to cryptocurrency price prediction. Machine learning algorithms are designed to learn patterns and relationships in data, which can be useful in predicting the direction and magnitude of future cryptocurrency price movements. On the other hand, natural language processing can be used to analyze social media sentiment and news articles pertaining to cryptocurrencies, which can help gain insight into market trends.

Our models are trained on data scraped from Twitter as well as high frequency market data from CoinAPI, Alpaca, and other sources.

This article will highlight our research on forecasting short and medium term returns in cryptocurrency markets using machine learning and natural language processing techniques.

1.2. Previous Iteration of WSB

WallStreetBots-Crypto is a continuation of last year’s iteration of the WallStreetBots project. Previously, WSB built a stock-trading AI that attempted to predict the next-minute price of common “meme stocks” like GME based on real-time Reddit sentiment, and implemented several portfolio optimization techniques to periodically rebalance the portfolio for risk-return optimization. The entire pipeline was implemented on the WallStreetBots terminal (http://www.wallstreetbots.org/). See a screenshot of the terminal dashboard below.

Although last year’s WSB project was able to achieve over 66% directional prediction accuracy, we found that the greatest limitation was the lack of publicly available high-quality market data for stocks. These data, such as the exchange trades data, are only accessible to professional trading firms. Hence in this iteration of the project, we shift our focus to predicting cryptocurrency prices given their transparent nature.

1.3. Previous Academic Work

1.3.1 Twitter Sentiment Analysis

In recent years, researchers have started to notice the power of Natural Language Processing (NLP) and text mining for predicting the financial market. Using user sentiment on social media platforms to predict various financial assets has become an active area of research. One of the most widely used social media platforms in this research is Twitter. On this platform, users can post short texts called “tweets”, which contain sentiments and moods. Since Twitter was founded in 2006, it has become increasingly popular; as of 2022, it was reported to have roughly 368 million monthly active users. Due to the popularity of Twitter, there has been a lot of research into detecting sentiment in tweets. For example, the Valence Aware Dictionary and sEntiment Reasoner (VADER) is a rule-based model for analyzing sentiment in social media text that achieves an F1 score of 0.96, outperforming individual human raters, who achieve an F1 score of 0.84. There is, however, research showing that around 14% of Twitter content for Bitcoin is sent from bots (Kraaijeveld and Smedt, 2020). Nevertheless, previous research from Antweiler and Frank (2004) and Bollen et al. (2011) demonstrates that using social media texts can help in predicting the market.

1.3.2 Bitcoin Price Prediction and Forecasting

In 2017, Stenqvist and Lönnö explored the use of sentiment analysis on Twitter data related to Bitcoin to predict its price fluctuations. Their naive prediction model was most accurate when sentiment was aggregated over 1-hour periods and used to predict the Bitcoin price change 4 hours into the future. Moreover, Mallqui and Fernandes (2019) examined various machine learning models to predict price direction, with their best model achieving a directional accuracy of 62.91%. Chen et al. (2020) also compared multiple machine learning models and found that the Long Short-Term Memory (LSTM) model achieves better performance than other methods using the previous exchange rate, such as autoregressive integrated moving average (ARIMA) and support vector regression (SVR). In the same year, Kraaijeveld and Smedt (2020) showed through Granger causality testing that Twitter sentiment can be used to predict the prices of Bitcoin, Bitcoin Cash, and Litecoin.

1.4. Trading Intuition

In addition to relevant academic research, it is also important to consider the trading intuition that explains the fundamental crypto price movements driven by market supply-and-demand. Throughout the project, we ensure that our results are aligned with trading intuitions.

2. Data Collection

2.1. Tweets for NLP

2.1.1. Scraping and Cleaning

A tweet scraping script was developed and executed using Snscrape to collect data on 14 different cryptocurrencies (although this project only explored price predictions for Bitcoin). The script required a specified date range and retrieved 100 tweets per hour within that range. It incorporated a set of heuristics to filter out spam accounts. The heuristics approved accounts that were verified and rejected accounts with low follower-to-following ratios (specifically, those with fewer than 10% as many followers as accounts they follow), accounts that tweet over 200 times daily on average, and accounts that follow the maximum number of other accounts allowed by Twitter.
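The spam heuristics described above can be sketched as a small filter. This is an illustration only, not the project's actual script: the `Account` fields, the function name, and especially the `MAX_FOLLOWING` value are assumptions, since the article does not state the exact following cap it used.

```python
from dataclasses import dataclass

# Thresholds taken from the text above, except MAX_FOLLOWING, which is an
# assumed placeholder (the article does not give the actual cap).
MIN_FOLLOWER_RATIO = 0.10   # followers / following
MAX_TWEETS_PER_DAY = 200
MAX_FOLLOWING = 5000        # assumption


@dataclass
class Account:
    verified: bool
    followers: int
    following: int
    avg_tweets_per_day: float


def is_probably_spam(acct: Account, max_following: int = MAX_FOLLOWING) -> bool:
    """Return True if the account should be rejected by the heuristics."""
    if acct.verified:
        return False  # verified accounts are always approved
    if acct.following > 0 and acct.followers / acct.following < MIN_FOLLOWER_RATIO:
        return True   # very low follower-to-following ratio
    if acct.avg_tweets_per_day > MAX_TWEETS_PER_DAY:
        return True   # tweets suspiciously often
    if acct.following >= max_following:
        return True   # follows the maximum number of accounts
    return False
```

Checking the verified flag first means a verified account is never rejected, which matches the "approve verified accounts" rule in the text.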

2.1.2. NLP Preprocessing — Sentiment Labeling of Tweets

The raw dataset of Tweets was processed using two pre-trained NLP sentiment analysis models in Python. The models used were VADER (Valence Aware Dictionary and sEntiment Reasoner) and an implementation of Google’s T5 model fine-tuned for emotion recognition.

VADER provides a single feature (“sentiment_score”) that measures the overall positivity/negativity of a Tweet, represented as a floating point number in the range of (-1, +1), where -1 indicates high negativity and +1 represents high positivity.

The fine-tuned Google T5 model provides five features that measure the intensity of specific emotions: happy, sad, anger, fear, and surprise. These features are represented as floating point numbers in the range of (0, 1), where 0 indicates no detection and 1 indicates strong detection.

Multiple cloud computing instances from Google Cloud Platform were used to apply the two selected models to the full dataset of Tweets.

2.1.3. Processed Dataset Creation

Datasets with averaged NLP features were created for a variety of frequency intervals (1-hour, 4-hour, 12-hour, 1-day), merged with the corresponding frequency of Bitcoin price data on the Binance exchange provided by CoinAPI. These datasets provide measurements for the average intensity of features for each given time interval, allowing for NLP features to be correlated with the actual log return of Bitcoin across each period. Below is a table containing the features produced through manipulation of raw sentiment metrics across specific intervals.
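A minimal sketch of the interval averaging, assuming each labeled tweet is a pair of a UNIX timestamp and a dictionary of feature values. The function name and data layout are hypothetical; the project's own pipeline may differ.

```python
from collections import defaultdict
from statistics import fmean


def average_by_interval(rows, interval_seconds):
    """Average tweet-level features within fixed time buckets.

    rows: iterable of (unix_timestamp, {feature_name: value}).
    Returns {bucket_start_timestamp: {feature_name: mean_value}}.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for ts, feats in rows:
        start = ts - ts % interval_seconds  # floor to the bucket start
        for name, value in feats.items():
            buckets[start][name].append(value)
    return {
        start: {name: fmean(vals) for name, vals in feats.items()}
        for start, feats in sorted(buckets.items())
    }
```

Calling it with `interval_seconds=3600`, `4 * 3600`, `12 * 3600`, or `24 * 3600` would give the four frequencies mentioned above.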

2.2. Market Data

The project collected various market data by first forming a hypothesis, based on trading intuition, for why certain data should be a driver of Bitcoin price. We then conducted exploratory cross-correlation analysis between each feature (at time t) and Bitcoin’s log returns (at time t + shift offset) to check for linear relationships. The team also checked for non-linear correlation using hypothesis tests. We do not report any non-linear correlation results since none were significant. However, we note that our implementation of the non-linear correlation test had a high false negative rate. See the table below for features and correlation test results.
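The lagged linear check can be illustrated with a plain Pearson correlation between the feature at time t and the log return at time t + shift. This is a sketch with hypothetical function names; it assumes both series are equally long and aligned on the same timestamps.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / sqrt(vx * vy)


def lagged_correlation(feature, log_returns, shift):
    """Correlate feature[t] with log_returns[t + shift] for shift >= 0."""
    if shift < 0:
        raise ValueError("shift must be non-negative")
    if shift == 0:
        return pearson(feature, log_returns)
    return pearson(feature[:-shift], log_returns[shift:])
```

Scanning `shift` over a range of offsets and looking for values far from zero is one simple way to spot a feature that leads the return series.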

Market Data Features and Correlation Results

In the market data collection process, we note in particular the Python scripts used to collect the data features from CoinAPI. Scripts were written to generate CoinAPI keys and to collect and pre-process data. The first script generates CoinAPI keys and stores them for later use. A second script retrieves Bitcoin-to-USDT and Ethereum-to-USDT price data for the Binance and FTX exchanges, sourced from CoinAPI. We collected both Bitcoin and Ethereum data on both exchanges because we wanted to check for lead-lag relationships between the two coins’ prices on the two exchanges. Because each key can send only 100 requests every 24 hours, the script used a sliding window to ensure that each key was used to its capacity without exceeding the request limit. Data was collected and preprocessed at a rate of about 1 million entries per hour, equivalent to 1.9 years of one-minute data per hour.

Subsequently, the data was cleaned and logarithmic return values were computed. The collected features included “Time_period_start”, “Time_period_end”, “Asset_ID”, “Count”, “Open”, “High”, “Low”, “Close”, “Volume”, and “Opening Day”, with period = 1 minute. These were used to calculate additional features, including “Log_returns”, the natural log of the closing price divided by the opening price. Other preprocessing included transforming timestamps from UTC to UNIX format.
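The sliding-window key rotation mentioned above could look roughly like the following. The class and method names are hypothetical, and the 100-requests-per-24-hours quota is taken from the text rather than from CoinAPI's documentation, so treat this as a sketch of the idea only.

```python
from collections import deque


class KeyPool:
    """Rotate over API keys, allowing `limit` requests per key per `window` seconds."""

    def __init__(self, keys, limit=100, window=24 * 3600):
        self.limit = limit
        self.window = window
        # One queue of request timestamps per key, oldest first.
        self.history = {key: deque() for key in keys}

    def acquire(self, now):
        """Return a key with remaining quota at time `now` (in seconds), or None."""
        for key, stamps in self.history.items():
            # Drop requests that have slid out of the window.
            while stamps and now - stamps[0] >= self.window:
                stamps.popleft()
            if len(stamps) < self.limit:
                stamps.append(now)
                return key
        return None  # every key is exhausted; the caller should wait
```

Because old timestamps are discarded as the window slides, a key becomes usable again as soon as its oldest request is more than 24 hours old, rather than at a fixed reset time.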

Similar Python scripts were used to collect additional second-by-second Bitcoin limit order book and historical trades data. Intuitively, this data shows how many people are willing to buy or sell how much Bitcoin at what price. It gives insight into the supply and demand of Bitcoin, which fundamentally drive any price changes. An example limit order book (also known as a trading ladder) is shown in the following figure.

For a price in each row, the left column shows the aggregated volume that people on the exchange are willing to buy at the price given, and the right column shows the aggregated volume that people are willing to sell for the price given. In the previous figure, we call 2168.00 the level 1 ask price and 290 the level 1 ask volume. Similarly, we call 2167.75 the level 1 bid price and 477 the level 1 bid volume. From CoinAPI, we collected 2 levels deep of the book data for Bitcoin on Binance exchange (features include “asks_lvl1_price”, “asks_lvl1_size”, “asks_lvl2_price”, “asks_lvl2_size”, “bids_lvl1_price”, “bids_lvl1_size”, “bids_lvl2_price”, “bids_lvl2_size”). From these, we further engineered the following features:

“bid_ask_spread” = “asks_lvl1_price” - “bids_lvl1_price”
“bid_ask_strength” = “bids_lvl1_size” + “bids_lvl2_size” - “asks_lvl1_size” - “asks_lvl2_size”

In addition to the limit book which can be thought of as snapshots at the beginning of each second, we also collected the aggregated data for all the trades that were matched and executed over each second interval, getting the following features:

“total_buy_size”
“total_sell_size”
“buy_sell_strength” = “total_buy_size” - “total_sell_size”
“mean_trade_price”
“min_trade_price”
“max_trade_price”
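As an illustration of the engineered book and trade features listed above, here is a short sketch. The dictionary keys follow the feature names in the text; the trade tuple layout `(price, size, side)` and the use of a simple (not volume-weighted) mean for “mean_trade_price” are assumptions.

```python
def book_features(book):
    """Derive spread and strength from a two-level order book snapshot."""
    spread = book["asks_lvl1_price"] - book["bids_lvl1_price"]
    strength = (book["bids_lvl1_size"] + book["bids_lvl2_size"]
                - book["asks_lvl1_size"] - book["asks_lvl2_size"])
    return {"bid_ask_spread": spread, "bid_ask_strength": strength}


def trade_features(trades):
    """Aggregate the trades executed in one second.

    trades: list of (price, size, side) with side "buy" or "sell" (assumed layout).
    """
    buys = sum(size for _, size, side in trades if side == "buy")
    sells = sum(size for _, size, side in trades if side == "sell")
    prices = [price for price, _, _ in trades]
    return {
        "total_buy_size": buys,
        "total_sell_size": sells,
        "buy_sell_strength": buys - sells,
        # Simple mean of trade prices; None when no trade happened.
        "mean_trade_price": sum(prices) / len(prices) if prices else None,
        "min_trade_price": min(prices) if prices else None,
        "max_trade_price": max(prices) if prices else None,
    }
```

With the level 1 prices from the example ladder (ask 2168.00, bid 2167.75), `book_features` gives a spread of 0.25. A positive “bid_ask_strength” means more resting volume on the bid side than on the ask side.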

Overall we see that the Bitcoin block features and the book-and-trades features have significant correlations with lagged Bitcoin log-returns. Hence we proceeded to train various ML models on these features.

3. Fitting ML Models

Based on our choice of input features at time t, we attempt to predict the log-return of Bitcoin at time t+1, where each time interval could be one second, one minute, one hour, or one day depending on the dataset. This is a regression problem. All model hyperparameters are tuned using grid search unless otherwise stated. At this stage, we only evaluated the models based on directional accuracy and MSE. We will proceed to evaluate the promising models more thoroughly using more metrics in the following section.
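For reference, the two screening metrics can be computed as follows. This is a sketch with hypothetical function names; it assumes that a zero return counts as "not up", a convention the article does not specify.

```python
def mse(actual, predicted):
    """Mean squared error between actual and predicted log returns."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def directional_accuracy(actual, predicted):
    """Fraction of steps where the predicted and actual returns share a direction."""
    hits = sum((a > 0) == (p > 0) for a, p in zip(actual, predicted))
    return hits / len(actual)
```

A directional accuracy near 50% means the model is no better than a coin flip at calling the sign of the next return, even if its MSE looks small.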

3.1. Baseline Model — Simple Moving Average

We considered the simple moving average model with window size of 3 as a baseline model and achieved a directional accuracy of 50.6% and MSE of 2.548e-06.
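A simple moving average baseline with window size 3 can be written in a few lines. The sketch below assumes the forecast for step t is the mean of the three preceding log returns; the function name is hypothetical.

```python
def sma_forecast(returns, window=3):
    """Forecast returns[t] as the mean of the `window` values before it.

    The output is aligned with returns[window:].
    """
    return [
        sum(returns[t - window:t]) / window
        for t in range(window, len(returns))
    ]
```

The forecasts can then be compared against `returns[window:]` to obtain the baseline's directional accuracy and MSE.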

3.2. ARMA Model with Historical Returns

Univariate ARMA-class models were considered. Testing was done on the G-Research Kaggle Dataset on Bitcoin log-returns using a simple ARMA model with automatic selection of data and seasonal lags. The MSE on the testing dataset was 1.32e-06 and the out-of-sample directional accuracy was 51%. It should be noted that out-of-sample predictions only utilized the first data point of the test dataset for forecasting.

3.3. Support Vector Machine (SVM) with NLP Sentiments Data

One of our initial approaches to predicting the log return of Bitcoin prices was to apply support vector machines (SVM) to the average sentiment and mood scores per hour and per day in the tweets data we collected. There are five moods: happy, sad, angry, surprised, and fear. We split the dataset by the follower count of the Twitter user behind each tweet, using multiples of the average follower count as boundaries, which yields 18 matrices. Tweets fall into three categories: low (fewer than 2/3 of the average follower count), middle (between 2/3 and 3/2 of the average follower count, inclusive), and high (more than 3/2 of the average follower count). Using a Support Vector Regression (SVR) model from the scikit-learn library in Python, with the regularization and tolerance parameters tuned, we achieved a directional accuracy of 56.3% with a mean squared error of 0.00665 and a mean absolute error of 0.0522 on the daily data. However, SVM gave worse results when we shortened the time interval to one hour: the directional accuracy was 54.0%, but the MSE rose to 1.61 and the MAE to 0.807. This increase in error suggests that the SVM with a linear kernel does not capture the characteristics of the data due to its complexity.
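The follower-count split can be expressed as a small helper. This is a sketch under the boundaries stated above (2/3 and 3/2 of the average follower count, both inclusive for the middle category); the function name is hypothetical.

```python
def follower_bucket(followers, mean_followers):
    """Classify a tweet's author as "low", "middle", or "high" by follower count."""
    low_cut = mean_followers * 2 / 3
    high_cut = mean_followers * 3 / 2
    if followers < low_cut:
        return "low"
    if followers <= high_cut:
        return "middle"
    return "high"
```

Sentiment and mood scores would then be averaged separately within each bucket before being passed to the regression model.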

3.4. XGBoost Model with NLP Account, Tweet, Mood and Volume/Price Data

An XGBoost (XGB) model was trained on hourly aggregate means of account (follower, following), tweet (likes, replies, retweets), and mood (happy, sad, fear, angry, surprise) metrics of the scraped tweets, in addition to the Bitcoin volume and price data. This gave an MSE of 3.104e-05 and a directional accuracy of 53.1%.

3.5. LSTM with NLP Sentiments Data

The processed NLP data described above were used to train Keras LSTM models for each of the specified intervals. An LSTM model with 100 units followed by a dropout and dense layer was used, and hyperparameters were tuned using the Bayesian hyperparameter search available from Weights and Biases. The input variables were the 15 features described above, as well as the log return values for previous periods. A 2:1 training/testing split was used, allowing the model’s performance to be tested on ~1 year of unseen historical cryptocurrency data.

The test period of the best performing model trained on 24-hour intervals of averaged sentiment gave overall directional accuracy of 56% and MSE of 0.0013.

One notable trend that we consistently observed across our models was a strong correlation between longer sentiment intervals and improved trading performance. The worst performing model was trained on sentiment intervals of 1 hour, and the best performing model was trained on the 24-hour interval. This has several possible explanations: the effects of Twitter sentiment on the cryptocurrency market may be delayed, occurring well after Tweets are published. Alternatively, averaged sentiment metrics may be more statistically significant when sampled across larger time intervals, and therefore provide a better indication of the direction that the market will take.

3.6. Linear Regression with Price and Volume Data

A linear regression model is used to predict the Bitcoin log returns from the price and volume data. A Gaussian linear model makes the following assumptions: linearity, equal variance, normality, and independence. In practice, the log-return of a stock price approximately follows a Gaussian distribution. We assume that the same holds for cryptocurrency, so that normality and equal variance are satisfied. However, no information is available about linearity and independence, so a regression model has to be fitted to examine how well the linear model fits.

Before fitting the model, the dataset has to be cleaned. The main concern is multicollinearity, which occurs when variables are highly correlated with each other. In the training dataset, count and volume are correlated, so count is dropped. The final model uses volume, high, low, open, and close as predictors of log-return. The coefficients are statistically significant. However, the linear correlations between log-return and the covariates are very low (all below 0.01), which suggests that a linear model is not a good choice for predicting log-return. The MSE is 3.16044e-06 and the directional accuracy is 50.6%.

3.7. Single Variable LSTM with Price Data

A single variable LSTM model with 50 units, a dropout layer, and a dense layer was trained to predict the next minute Bitcoin prices based on the historical Bitcoin log returns where window size is 10. The model was trained using stochastic gradient descent and MSE loss. The model was tuned using grid search. The final test directional accuracy was 49.8% and MSE was 4.95e-07.

3.8. Multivariable LSTM with Book and Trades Data

A multivariable LSTM model with 100 units, a dropout layer, and a dense layer was trained on the book and trades dataset to predict the next second’s mean, min, and max trade price of Bitcoin by predicting the log returns of these values. Predicting not just the center but also the min/max trade prices allows us to predict a range of probable trade prices over the next second. This information could be useful for market makers to remove their quotes when the book is about to be eaten up or to become an aggressor and front-run upcoming trends. However, for the purpose of this project, we only evaluate the results for the mean price predictions to allow compatibility with the other datasets and models explored. The multivariable features were reframed from time-series data using a window size of 10. The final model was trained using the Adam optimizer (SGD yielded suboptimal results) with a decayed learning rate and MSE loss. The model was tuned using grid search. The final test directional accuracy was 66.6% and MSE was 1.45e-09.
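Reframing a multivariable time series into fixed-length windows, as described above, might look like this. It is a sketch only: the function name is hypothetical, and it assumes the target for each window is the value at the step immediately after the window.

```python
def make_windows(rows, targets, window=10):
    """Turn a time series into supervised (X, y) pairs.

    rows: list of feature vectors, one per time step.
    targets: list of target values, one per time step.
    Each X[i] holds `window` consecutive rows; y[i] is the target at the next step.
    """
    X, y = [], []
    for end in range(window, len(rows)):
        X.append(rows[end - window:end])
        y.append(targets[end])
    return X, y
```

The resulting `X` has the (samples, time steps, features) layout that recurrent layers such as an LSTM typically expect once converted to an array.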

3.9. HMM with Book and Trades Data

A multivariable Gaussian Hidden Markov Model with 25 hidden states was also fitted on the Bitcoin book and trades dataset to predict the next second’s mean, min, and max trade price of Bitcoin by predicting the log returns of these values. One notable data pre-processing step prior to fitting the model is the normalization of all features to a small enough range. For example, the level 1 ask price of the book is converted to its nominal difference from the mean trade price. This allows the HMM to better fit the data using a small finite number of states, since the model assumes the hidden states to be discrete. The final test directional accuracy is 57.4% and MSE is 1.89e-09.

4. Further Evaluation of LSTM with NLP Sentiments Data and LSTM with Book and Trades Data

Our best model fitted on NLP data was the LSTM model and the best model fitted on market data was the LSTM with book and trades data. Let’s evaluate these two models more thoroughly using more metrics than just directional accuracy and MSE. It is important to note the difference in data frequency between the two models. The LSTM fitted on NLP sentiments is predicting day-by-day log-returns whereas the LSTM fitted on book-and-trades dataset is predicting second-by-second log returns. Predicting longer intervals is a more difficult task so it is important to keep this in mind when directly comparing the performance of the two models.

4.1. Performance Evaluation Metrics

Evaluation metrics are essential for properly assessing the performance of each of our models. There are two main types of metrics we will use in this project: financial and statistical.

4.1.1. Financial Metrics

Financial metrics are highly useful in capturing the performance of the models in the context of risk and profitability in the market. We graph the cumulative portfolio return of a hypothetical $1 portfolio over time with reinvestment at each time step, where the entire portfolio takes on either a long or a short position based on our model’s predictions. This can be compared to the returns of the benchmark portfolio of holding or shorting $1 of Bitcoin over the same investment period. If we wanted to minimize the risk of wrong predictions, we could consider a cutoff value where our hypothetical portfolio would only take a long/short position if the magnitude of the predicted return is larger than the cutoff. This hypothetical portfolio return curve can also be inspected for any large drawdowns or fluctuations, ensuring stable returns over time. We also report the maximum drawdown of such a portfolio, its Sharpe ratio, and its Sortino ratio (the Sortino ratio only penalizes risk associated with negative returns and not positive ones). In general, a model with a Sharpe ratio greater than 1 is considered good.
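The hypothetical $1 portfolio and its risk metrics can be sketched as below. Several choices here are assumptions rather than the project's documented method: positions are fully long (+1), fully short (-1), or flat (0); the risk-free rate is zero; the ratios are per step and not annualized; and trading costs are ignored.

```python
from math import exp, sqrt
from statistics import fmean, pstdev


def backtest(predicted, actual, cutoff=0.0):
    """Simulate a $1 portfolio that goes long or short on each prediction.

    predicted, actual: log returns per step. Returns (equity_curve, step_returns).
    """
    equity, curve, step_returns = 1.0, [], []
    for p, a in zip(predicted, actual):
        position = 1 if p > cutoff else -1 if p < -cutoff else 0
        r = position * (exp(a) - 1.0)  # simple return earned this step
        equity *= 1.0 + r              # reinvest the whole portfolio
        step_returns.append(r)
        curve.append(equity)
    return curve, step_returns


def max_drawdown(curve, initial=1.0):
    """Largest peak-to-trough loss as a fraction of the peak."""
    peak, worst = initial, 0.0
    for value in curve:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def sharpe(step_returns):
    """Mean return divided by its standard deviation."""
    sd = pstdev(step_returns)
    return fmean(step_returns) / sd if sd > 0 else float("nan")


def sortino(step_returns):
    """Mean return divided by the downside deviation (negative returns only)."""
    downside = sqrt(fmean([min(r, 0.0) ** 2 for r in step_returns]))
    return fmean(step_returns) / downside if downside > 0 else float("nan")
```

Raising `cutoff` makes the portfolio sit out steps where the predicted move is small, trading fewer times in exchange for acting only on stronger signals.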

4.1.2. Statistical Metrics

In addition to financial metrics, we also considered statistical metrics including the predicted directional accuracy, MSE, confusion matrix, and F1 score. We also visualized a scatterplot of the actual vs predicted log returns, the distribution of wrong predictions, and some sample predictions.
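Viewing direction as a binary classification ("up" versus "not up"), the confusion matrix and F1 score can be computed as in this sketch. The function names are hypothetical, and treating "up" as the positive class is an assumption.

```python
def direction_confusion(actual, predicted):
    """Count true/false positives and negatives, with "up" as the positive class."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        actual_up, predicted_up = a > 0, p > 0
        if predicted_up and actual_up:
            counts["tp"] += 1
        elif predicted_up and not actual_up:
            counts["fp"] += 1
        elif not predicted_up and actual_up:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def f1_score(counts):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives."""
    denom = 2 * counts["tp"] + counts["fp"] + counts["fn"]
    return 2 * counts["tp"] / denom if denom else 0.0
```

Unlike directional accuracy, F1 ignores true negatives, so it highlights how well the model identifies upward moves specifically.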

4.2. Performance of LSTM Fitted on NLP Sentiments Data

For the LSTM trained on the NLP sentiments (24hr interval) data, the final test directional accuracy was 56.5% and MSE was 0.0013. The hypothetical $1 portfolio that trades with reinvestments using our predictions over the test period of ~400 days made an approximately 7% return, outperforming holding Bitcoin by ~6 times. The portfolio has a Sharpe ratio of 0.154, Sortino ratio of 0.163, and max drawdown of 15.38%. See the graphs below for a visualization of some log return predictions and the confusion matrix if we view it as a binary classification problem. The F1 score is 0.564.

4.3. Performance of LSTM Fitted on Book and Trades Data

For the LSTM trained on the book-and-trades data, the final test directional accuracy was 66.6% and MSE was 1.45e-09. The hypothetical $1 portfolio that trades with reinvestments using our predictions over the test period of ~3500 seconds made an approximately 5% return, outperforming holding Bitcoin by 47 times. The portfolio has a Sharpe ratio of 0.32, Sortino ratio of 0.48, and max drawdown of 0.02%. When we traded only on signals where the predicted return was above a cutoff, we were able to get a Sharpe ratio above 1.2 and a Sortino ratio above 2.8. See the graphs below for a visualization of some log return predictions and the confusion matrix if we view it as a binary classification problem. The F1 score is 0.625. Based on the distribution of wrong prediction percentages, we see that many of the wrong predictions happen when the actual movement of the Bitcoin price is close to 0. This could be because such small price movements are mostly noise.

Given the strong performance of both models, we wondered whether combining the two datasets would give better results than either individual model. The datasets have different frequencies: NLP data is hourly and book/trade data is in seconds. Due to time constraints of the project, we combined the two datasets by repeating the NLP sentiment data point for every book/trade data point. As a result, every data point in the merged dataset carries the latest NLP data that is strictly earlier than that data point, so look-ahead bias is avoided. The combined dataset was used to train a multivariable LSTM model. However, the result was worse than the LSTM model with only book-and-trades data. There are possible explanations for this: the NLP data points are naively repeated, and a time-encoded positional embedding may be more appropriate. Considering the difference in frequency between the two datasets, the information in an NLP data point may also be misleading for the book/trade data points that are “further” from it in time.
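The merge rule described above (attach the latest NLP point that is strictly earlier than each book/trade point) is an "as-of" join. A minimal sketch using binary search follows; the function name is hypothetical and it assumes `nlp_times` is sorted in ascending order.

```python
from bisect import bisect_left


def attach_latest_sentiment(book_times, nlp_times, nlp_values):
    """For each book/trade timestamp, pick the latest strictly earlier NLP value.

    Returns None where no NLP point precedes the timestamp.
    """
    merged = []
    for t in book_times:
        # bisect_left gives the number of NLP timestamps strictly less than t.
        i = bisect_left(nlp_times, t)
        merged.append(nlp_values[i - 1] if i > 0 else None)
    return merged
```

Using `bisect_left` rather than `bisect_right` is what enforces the strict inequality: an NLP point stamped at exactly the same time as a book/trade point is not used, which avoids look-ahead bias.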

5. Concluding Thoughts

5.1. Limitations

So far, we have shown that our model and trading strategy are promising in backtests: the book-and-trades LSTM portfolio made an approximately 5% hypothetical return over a test period of about an hour. However, there are several limitations that traders must consider when implementing this model and strategy in real time.

One significant limitation of high-frequency trading models is the requirement for real-time data. This data is essential to make accurate predictions, but it may not always be readily available, and some data sources may be behind paywalls. Obtaining and analyzing real-time data automatically can be a challenging and time-consuming process, which limits the accessibility of this type of trading.

Another critical limitation to consider is trading costs. High-frequency trading requires making many trades over a short period, which incurs additional trading costs. These costs can quickly add up, and if not accounted for, they can significantly reduce the profits generated by the strategy. Moreover, retail traders often pay a higher trading fee than institutional traders, which further increases the cost of trading.

Furthermore, recent changes in Twitter’s API policy have put restrictions on the data accessible through the platform. Twitter has become a key source of data for natural language processing (NLP), which is an essential part of high-frequency trading. The unavailability of data from Twitter limits the data sources available for analysis, which reduces the effectiveness of the strategy. This also raises questions about the speed at which the data will be available and its accuracy, as a key feature of the model is the immediate availability of the data.

In addition, high-frequency trading models require significant computational power, which may not be feasible for retail traders or smaller trading firms. The predictive power is made possible by complex neural networks that require a generous amount of computing power, which may not be available on demand. The need for low-latency connections further complicates the infrastructure requirements. Achieving low latency is essential to ensure that trades are executed quickly and at the right time and price. However, implementing the infrastructure required for low-latency connections can be expensive and challenging.

Overall, there are a number of limitations with this strategy but none are insurmountable. With the right strategy and with enough capital, we are confident that this model and strategy can be executed for a profit.

5.2. Future Steps

Future steps of the project could include attempts to merge the more promising models into one model or an ensemble of models. More work could also be done to merge datasets of different frequencies. The number of data features could also be reduced using dimensionality reduction techniques like PCA.

Check out more information about WSB2 and other ML projects on the UTMIST project page. You can also find a presentation of WSB2 (and other UTMIST projects) here.