Liquidity Pool Arbitrage
Description
Our Capstone project focuses on predicting arbitrage opportunities in time series liquidity pool transactions. Our tool enables cryptocurrency traders to make smarter decisions, optimize their strategies, and increase returns. By providing accurate predictions, traders can better navigate the volatile crypto market and maximize their profitability. We analyze transaction data from Ethereum-based liquidity pools using advanced machine learning models, including LGBM and XGBoost. Our approach involves predicting percent differences in currency prices, incorporating external factors like global market prices and transaction fees to enhance prediction accuracy. We seek support through access to more data sources, mentorship from industry experts, and partnerships with crypto trading platforms to refine and validate our tool. Your collaboration will help us bring this innovative solution to the market and revolutionize how traders engage with liquidity pools.
Dataset and Modeling
Arbitrage Equation:
Net Gain = A(1+ΔP)(1-T1)(1-T2) - A - G
- A is the amount invested
- A(1+ΔP) is the gross value obtained after the arbitrage, before fees
- T1 and T2 are the transaction fees of the two pools (constants set by the pool providers)
- G is the total gas fee (the computational cost of executing the transactions on the blockchain)
The type of arbitrage we are looking at buys a token at a lower price in one pool and sells it at a higher price in the other pool. To realize a net gain we first need an amount A that we are willing to invest. In theory we make this amount back and capture the difference in pricing between the two pools, so A(1+ΔP) is the gross value we obtain from the arbitrage opportunity. From this we subtract the transaction fees incurred for trading in the two pools (T1 and T2) as well as the gas fee G, which yields the net gain. To find the minimum amount we need to invest, we set the net gain equal to zero and solve for A.
Minimum Amount to Invest:
A = G/((1+ΔP)(1-T1)(1-T2) - 1)
Here we see that the gas fee and the percent change are the only variables that vary per transaction, and these are what we predict in our modeling.
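As a concrete illustration, here is a minimal Python sketch of these two formulas. The fee tiers and gas value below are placeholders, not measured data:

```python
def net_gain(A, delta_p, t1=0.0005, t2=0.003, gas=15.0):
    """Net gain from buying in pool 1 (fee t1) and selling in pool 2 (fee t2).

    A       -- amount invested (USD)
    delta_p -- fractional price difference between the pools
    gas     -- total gas fee for both transactions (USD, placeholder value)
    """
    return A * (1 + delta_p) * (1 - t1) * (1 - t2) - A - gas

def min_investment(delta_p, t1=0.0005, t2=0.003, gas=15.0):
    """Smallest A for which net_gain is zero."""
    edge = (1 + delta_p) * (1 - t1) * (1 - t2) - 1
    if edge <= 0:
        return float("inf")  # the fees eat the entire price difference
    return gas / edge

# Example: a 0.5% price gap between the two pools
A_min = min_investment(delta_p=0.005)
print(A_min, net_gain(A_min, delta_p=0.005))  # net gain is ~0 at A_min
```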
Our data was sourced from Uniswap V3 data hosted on The Graph. We collated the data by querying The Graph API for two specific WETH/USDC liquidity pools. The queries returned transactional data from each pool, which we processed and joined on the timestamps of the transactions. The pools therefore needed a relatively high volume of transactions to ensure we could find transactions that occurred close to one another. Once we had records of transactions that occurred close enough together, we could calculate the percent difference between the WETH prices in the two liquidity pools as well as the total gas fee of each transaction.
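For reference, a minimal sketch of the kind of query involved, assuming the public Uniswap V3 subgraph on The Graph's hosted service (the endpoint may have changed since, and the pool address is a placeholder):

```python
import requests

# Public Uniswap V3 subgraph on The Graph's hosted service (may have moved).
URL = "https://api.thegraph.com/subgraphs/name/uniswap/uniswap-v3"

QUERY = """
{
  swaps(first: 1000, orderBy: timestamp, orderDirection: asc,
        where: {pool: "0xPOOL_ADDRESS"}) {  # placeholder pool address
    timestamp
    amount0
    amount1
    transaction { gasUsed gasPrice }
  }
}
"""

resp = requests.post(URL, json={"query": QUERY}, timeout=30)
swaps = resp.json()["data"]["swaps"]
```

Each swap record yields a price (from the token amounts) and a gas fee (gasUsed × gasPrice); we then joined the two pools' records on nearest timestamps.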
In our modeling approach we experimented with a variety of ways to put our data together. Initially we sourced various data types that we thought would help predict large price differences between the two pools. However, through our EDA and modeling we quickly discovered that these extra features were clouding the model's judgement and creating a lot of noise. The features that helped model performance the most were the various lags, means, and moving averages of the target variable.
In our initial approach we simply slid the target variable back by 5 minutes; our random forest, linear regression, and XGBoost models used this setup. For our LGBM model we experimented with a different approach: first we took the mean over each one-minute interval, then lagged the mean, and then slid the target variable back by 5 minutes. This smoothed the series being predicted and helped the LGBM's performance. For our LSTM model we followed a windowing approach: starting from the target variable, we find the entry exactly 5 minutes prior, then take lags at a set frequency to build the entries used to predict the target. Stacking these windows yields the dataset. We experimented with different lag frequencies and with the lead time prior to the target variable, as sketched below.
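A condensed sketch of the two feature constructions (pandas; the column name, lag count, and window sizes are illustrative):

```python
import numpy as np
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """LGBM-style features: df is the joined transaction table indexed by a
    DatetimeIndex with a 'pct_diff' column. Take per-minute means, add lags
    and a rolling mean, and slide the target 5 minutes ahead."""
    minute = df["pct_diff"].resample("1min").mean()
    feats = pd.DataFrame({f"lag_{k}": minute.shift(k) for k in range(1, 11)})
    feats["rolling_mean_5"] = minute.shift(1).rolling(5).mean()
    feats["target"] = minute.shift(-5)  # value 5 minutes ahead
    return feats.dropna()

def make_windows(series: np.ndarray, window: int = 10, horizon: int = 5):
    """LSTM-style windowing: each sample is the last `window` lags, and the
    label is the value `horizon` steps ahead."""
    X, y = [], []
    for i in range(window, len(series) - horizon):
        X.append(series[i - window:i])
        y.append(series[i + horizon])
    return np.array(X), np.array(y)
```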
Evaluation
To predict the percent change, we explored various models, scoring them on RMSE (root mean squared error) and R² (goodness of fit).
| Type | RMSE (% change) | R² |
| --- | --- | --- |
| Random Forest | 0.00127 | 0.375 |
| XGBoost | 0.00134 | 0.421 |
| Linear (Baseline) | 0.00105 | 0.573 |
| LSTM | 0.000963 | 0.598 |
| LGBM | 0.000949 | 0.646 |
Our baseline was the linear regression model, which performed surprisingly well. We tuned the hyperparameters of our better-performing models to ensure we were getting the best performance from them. Among the tree-based models we explored (random forest, XGBoost, and LGBM), the LGBM was by far the strongest. This makes sense: it grows trees leaf-wise, splitting the leaf that most reduces the loss, which lets it predict more accurately than XGBoost, which grows trees level-wise, or a random forest. The LSTM was a close second; its windowed architecture works rather well on time series problems.
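To make the scoring concrete, here is a minimal sketch of fitting and scoring one model (LightGBM here; `feats` is the feature table from the earlier sketch, and the hyperparameters are placeholders rather than our tuned values):

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error, r2_score

def fit_and_score(feats):
    """Chronological 80/20 split so the model never sees the future."""
    split = int(len(feats) * 0.8)
    train, test = feats.iloc[:split], feats.iloc[split:]
    X_cols = [c for c in feats.columns if c != "target"]

    model = LGBMRegressor(n_estimators=500, learning_rate=0.05, num_leaves=31)
    model.fit(train[X_cols], train["target"])

    pred = model.predict(test[X_cols])
    rmse = float(np.sqrt(mean_squared_error(test["target"], pred)))
    return rmse, r2_score(test["target"], pred)
```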
Next we predicted the gas fees. The gas fee is the computational cost of doing a transaction on the blockchain. We observed sudden spikes in the actual gas throughout the day, possibly due to factors such as spikes in demand, miner behavior, and network congestion on the blockchain. When gas prices are relatively constant the models perform well, but because of the spikes we saw the R² suffer across the board.
| Type | RMSE (total gas price) | R² |
| --- | --- | --- |
| Linear (Baseline) | 48613 | -0.015 |
| LSTM | 145.17 | 0.112 |
| LGBM | 266.39 | 0.170 |
| XGBoost | 266.36 | 0.171 |
We kept linear regression as our baseline and retuned the hyperparameters of the other models to predict the gas fees. Our best-performing model was XGBoost, which outperformed the LGBM this time around. This is likely due to the LGBM overfitting the training data: we observed cases where it predicted consistently higher than the actual gas fee, and instances where it attempted to predict the spikes themselves.
Impact and Technical Challenges
Putting the predictions of the gas fees and the percentage change together, we can find the minimum amount we should invest. Using the actual percent change and gas fees at that time, we can then see what our returns would have been. Our strategy is to invest at every opportunity where the model predicts a positive minimum amount that is within a specified budget. From our test runs we observed the returns this strategy would have produced at different budgets. At a budget of $6,000 this strategy makes a return of about $8,000; on the flip side, with a high enough budget, around $10,000, we could potentially lose over $3 million. We saw this trend consistently: at higher investment amounts there is a potentially large downside.
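A simplified sketch of this backtest, under the assumption that we stake the full budget on each predicted opportunity and settle at the actual values (the sizing rule and fee constants here are illustrative, not our exact procedure):

```python
def backtest(pred, actual, budget, t1=0.0005, t2=0.003):
    """pred/actual: aligned iterables of (delta_p, gas) pairs per opportunity.
    Trade whenever the predicted minimum investment is positive and within
    budget; realized gain uses the actual percent change and gas fee."""
    total = 0.0
    for (p_dp, p_gas), (a_dp, a_gas) in zip(pred, actual):
        p_edge = (1 + p_dp) * (1 - t1) * (1 - t2) - 1
        if p_edge <= 0 or p_gas / p_edge > budget:
            continue  # no affordable predicted opportunity
        a_edge = (1 + a_dp) * (1 - t1) * (1 - t2) - 1
        total += budget * a_edge - a_gas  # realized net gain on this trade
    return total
```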
Our deep dive into the world of decentralized finance faced three major technical challenges.
Firstly, identifying and executing arbitrage opportunities for consistent profit is genuinely hard; we are dealing with an incredibly challenging arena. Cross-liquidity-pool arbitrage opportunities are scarce, as the market is saturated with professional institutions that have significant advantages in computing power, speed, and investment size. Moreover, slippage and gas prices are hard to predict accurately, making it difficult to gauge true profitability.
The second challenge revolves around the complexity of blockchain data. While it is all open and freely available, parsing raw blockchain data efficiently without expensive premium tools isn't easy. Many data startups in the space are working to solve this, but their premium tools are costly, putting retail traders at a disadvantage when trying to identify opportunities quickly.
Lastly, we must contend with Ethereum's 'dark forest' nature. All transactions made on the blockchain are visible in the mempool before confirmation, leaving them exposed. This means anyone can potentially front-run your swap, trade, or order, in our case stealing our arbitrage opportunity. MEV is essentially when clever traders (or, more often, bots and algorithms) make money by front-running you in the split second before your crypto trade goes through. The presence of these bots makes it hard to compete without institutional-grade hardware and data.