Could Machine Learning and NLP Have Predicted Oil’s Crash? The Answer is YES.


A guide on how to combine Machine Learning and NLP to successfully predict the COVID-19-led oil crash.

On April 20th, 2020, futures for WTI (crude oil’s U.S. benchmark) fell to negative values for the first time in history. In other words, producers had to pay traders to take the excess oil off their hands. Such a catastrophe was not a one-day thing. On the contrary, there was a build-up process taking place behind the scenes for months before the crash. The process itself will be discussed in another article, as it is not the subject of today’s analysis.

Black Gold can be described as many things. Its main characteristic is undeniably its volatility and unpredictable nature. It is important to understand my interpretation of these words: oil is a geopolitical game. Historical data support this statement by showing a geopolitical undercurrent each time major price fluctuations take place.

The techniques discussed in this article could also have been implemented by hand, without the use of Machine Learning.

Machine learning and Artificial Intelligence can be used to solve almost the entirety of humankind’s problems. Price prediction is no exception.

Countless machine learning techniques are tested daily to provide insight on how to perfect price-prediction models. The majority of the models currently in the market utilize technologies such as Artificial Neural Networks, Random Forest, and Logistic Regression to perform some sort of time-series technical analysis. A detailed explanation of how one should go about constructing such a prediction model can be found below.

Predicting Oil Prices With Machine Learning And Python- The complete guide on predicting the price of “black gold” with less than 0.3% error using Python and Machine Learning

In simple terms, data scientists feed their models historical data and teach them how to react accordingly. This creates a huge problem, though.

The problem is simple. The markets are a living organism, one so fragile and prone to third-party interference that it becomes almost impossible to predict its next move.

The problem, in other words, is the element of the unknown. Insider information is king. Hedge funds and governments constantly manipulate the markets and leave mathematical and machine learning models completely helpless.

Although there are periods that could be described as ‘relatively peaceful’, the unknown can always surface and change everything. As mentioned in the introduction, oil is a prime case study due to its price’s vulnerability to geopolitical games (e.g., Russia vs. the U.S.A.).

Many people have reached the incorrect conclusion that machine learning cannot be used to predict market outcomes, especially in unexpected conditions such as COVID-19. Contrary to the skeptics, I believe that oil’s market crash could have been easily predicted by performing some tweaks to my oil-predicting model found here.

Time-series machine learning analysis works. There is no question there. It is thus essential that any model can perform time-series technical analysis to some extent. Nevertheless, there are numerous instances where it fails.

The answer to whether it works is somewhere in between. Although it is successful, it cannot function by itself in a live market ecosystem. This is where the techniques introduced in this article come in.

For a model to be able to react rapidly to surprise changes or discoveries, I have concluded that it should follow the guidelines below:

  • The model should possess the ability to be interactive (human interference can be used as data inputs for the model).
  • In addition to time-series analysis, sentiment analysis on news articles related to the commodity should be done in 3-hour intervals.
  • Twitter sentiment analysis should be conducted daily.
  • The model should also take into consideration data imported from Google Trends.

Although the above cannot guarantee 100% accuracy when the ‘unknown’ comes into play, such situations will surely be better handled.

I believe that a model following the guidelines presented above could have partly predicted the recent oil-crash.

As I have already outlined, a situation such as this did not escalate in a day. On the contrary, it has been evolving for over a month now. In summary:

  • Newsfeed analysis would show that there is an increased supply of oil currently in the market.
  • Using Google Trends, the model would then see that the demand for oil has suddenly decreased.
  • Twitter could also be used as a reference point but it would not really help in this case.

Newsfeed Analysis

Something really weird has happened in the past months. A plethora of articles by large publications have surfaced, all talking about a sudden surge in oil production by OPEC.

OPEC members had agreed to limit production until March 31, 2020. On March 6, though, Russia announced it would no longer restrict production as of April 1. In response, OPEC announced it would increase production.

It now becomes evident why articles such as the ones listed below started surfacing all around the web:

Taking into consideration all of the above, the model, using NLP, would have summarized the news related to oil that month into key phrases such as “oil supply surge”, and “increased oil supply”.

The model would have also classified the month using sentiment analysis, in order to create a risk score.

*I will be calling this Data-Point 1
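
To make Data-Point 1 more concrete, here is a minimal sketch of how key phrases and a risk score might be derived from headlines. It is only an illustration under stated assumptions: the word list `RISK_WORDS`, the function names, and the sample headlines are all made up for this example, and a real model would rely on a trained NLP classifier rather than a hand-written lexicon.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon (an assumption, not a published resource):
# +1 for words hinting at more supply, -1 for words hinting at less supply.
RISK_WORDS = {"surge": 1, "glut": 1, "oversupply": 1, "increase": 1,
              "increased": 1, "cut": -1, "cuts": -1, "shortage": -1}

def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())

def top_bigrams(headlines, n=3):
    """Return the n most frequent two-word phrases across all headlines."""
    counts = Counter()
    for headline in headlines:
        tokens = tokenize(headline)
        counts.update(zip(tokens, tokens[1:]))
    return [" ".join(pair) for pair, _ in counts.most_common(n)]

def risk_score(headlines):
    """Mean lexicon score per headline; positive suggests excess supply."""
    scores = [sum(RISK_WORDS.get(t, 0) for t in tokenize(h)) for h in headlines]
    return sum(scores) / len(scores) if scores else 0.0

# Made-up headlines, for illustration only.
headlines = [
    "Oil supply surge expected as producers increase output",
    "Increased oil supply weighs on prices",
    "Oil supply surge deepens market glut",
]
print(top_bigrams(headlines, 2))        # ['oil supply', 'supply surge']
print(round(risk_score(headlines), 2))  # 1.67
```

Counting two-word phrases is a crude stand-in for summarization, but it shows how repeated coverage of the same theme could surface as a key phrase.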

Google Trends

Google Trends is an excellent tool that is unfortunately often neglected. Once one understands what Google Trends actually is, neglecting it simply does not make sense.

Google Trends is a preview of the search/buying patterns of the entire world.

It is hence obvious that much useful information can be derived from analyzing keywords and phrases related to oil. There is a plethora of such phrases and words, but simple examples would be:

“Where can I buy oil?”

“Can I buy oil from home?”

“Is oil expensive to buy?”

The reason these keywords/phrases are so important is that they show our model the quantity of oil demanded. These are all common Google searches made by people wanting to buy oil for themselves.

The figure would obviously not be the actual quantity of oil demanded, as the model would simply not have the data needed to compute such a value. Rather, such statistics would enable us to form an idea of the quantity demanded relative to previous weeks, months, or years.
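
The idea of demand "in relation to previous weeks" can be sketched as a simple ratio. Google Trends reports search interest on a 0 to 100 scale relative to the peak of the selected period, so only relative comparisons are meaningful. The function name `relative_demand`, the window sizes, and the sample values below are assumptions chosen for illustration; they are not real Google Trends data.

```python
from statistics import mean

def relative_demand(interest, baseline_weeks, recent_weeks):
    """Ratio of recent average search interest to the earlier baseline average.

    `interest` is a list of weekly values, oldest first. A result below 1.0
    means interest has fallen relative to the baseline window.
    """
    if len(interest) < baseline_weeks + recent_weeks:
        raise ValueError("not enough weekly observations")
    baseline = mean(interest[-(baseline_weeks + recent_weeks):-recent_weeks])
    recent = mean(interest[-recent_weeks:])
    if baseline == 0:
        raise ValueError("baseline interest is zero; ratio is undefined")
    return recent / baseline

# Made-up weekly interest values, oldest first.
weekly_interest = [80, 82, 78, 80, 40, 40]
print(relative_demand(weekly_interest, baseline_weeks=4, recent_weeks=2))  # 0.5
```

A ratio of 0.5 would mean that search interest in the last two weeks was half of what it was in the four weeks before.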

Let’s take a look at the Google Trends searches related to buying oil for the weeks leading up to the oil crash.

Google Trends for Oil Purchases March-April

It appears that something quite weird started to take place around mid-March. People have stopped buying oil.

When examining this phenomenon in a ceteris paribus environment, it makes no sense at all. There must have been a trigger event leading to such a decrease in oil’s quantity demanded. This event is COVID-19. The majority of the world has been forced to quarantine at home, without the luxury of commuting by car. Commercial flights have almost entirely been forbidden, as have the majority of their sea-voyaging counterparts. In other words, the necessity of oil in our everyday lives has decreased dramatically.

The model would have immediately picked up such an anomaly and concluded that the quantity of oil demanded had fallen sharply.

*I will be calling this Data-Point 2
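
One simple way a model might "pick up" such an anomaly is a z-score test against recent history. This is a sketch of one common technique, not necessarily the one a finished model would use; the function name `is_demand_anomaly`, the threshold of 2 standard deviations, and the sample numbers are assumptions for illustration.

```python
from statistics import mean, stdev

def is_demand_anomaly(history, latest, threshold=2.0):
    """Flag `latest` if it sits more than `threshold` sample standard
    deviations below the mean of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical observations")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Flat history: any value below the mean counts as a drop.
        return latest < mu
    z = (latest - mu) / sigma
    return z < -threshold

# Made-up search-interest values, for illustration only.
history = [78, 80, 82, 79, 81]
print(is_demand_anomaly(history, latest=50))  # True
print(is_demand_anomaly(history, latest=79))  # False
```

Only drops are flagged here, because the scenario discussed is a collapse in demand; a two-sided check would compare the absolute value of the z-score instead.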

Twitter Sentiment Analysis

Although Twitter activity might not be the most suitable indicator in this case, extracting the sentiment of tweets related to oil could provide additional insight for the model.

The best way to integrate Twitter would be to analyze all tweets related to oil and assign each a sentiment score. A negative score could have helped the model make a decision.

I recently came across two resources that are directly related to the contents of this analysis and may be helpful.

*I will be calling this Data-Point 3
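
As a rough illustration of Data-Point 3, the sketch below assigns each tweet a score between -1 and 1 and averages the results. The `POSITIVE` and `NEGATIVE` word sets, the function names, and the sample tweets are hypothetical; a production system would more likely use a trained sentiment model and tweets collected through Twitter's API, neither of which is shown here.

```python
import re

# Hypothetical word sets, chosen only for this example.
POSITIVE = {"rally", "recover", "rebound", "bullish", "gain"}
NEGATIVE = {"crash", "collapse", "glut", "bearish", "plunge"}

def tweet_score(text):
    """Score one tweet in [-1, 1]: (positive - negative) / matched words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

def mean_sentiment(tweets):
    """Average tweet score; 0.0 when there are no tweets."""
    scores = [tweet_score(t) for t in tweets]
    return sum(scores) / len(scores) if scores else 0.0

# Made-up tweets, for illustration only.
tweets = [
    "Oil prices crash as glut grows",
    "Hoping for a rebound after the plunge",
    "Nice weather today",
]
print(round(mean_sentiment(tweets), 2))  # -0.33
```

Tweets with no matched words score zero, so unrelated chatter dilutes the average rather than skewing it.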


By combining all three data points, the model would have come to the conclusion that there is an excess supply of oil in the market and thus the price would have to go down. Although it would not have shown the exact price of oil, it would certainly have indicated that the best move would be to short the market.
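
The combination step could be sketched as a small rule-based function. This is one possible design under loud assumptions: the `DataPoints` fields, the `demand_drop` threshold of 0.8, and the three output labels are invented for illustration, and a real model would likely learn these weights from data instead of hard-coding them.

```python
from dataclasses import dataclass

@dataclass
class DataPoints:
    news_risk: float        # Data-Point 1: > 0 means news points to excess supply
    demand_ratio: float     # Data-Point 2: recent search interest / baseline
    tweet_sentiment: float  # Data-Point 3: mean tweet sentiment in [-1, 1]

def trading_signal(dp, demand_drop=0.8):
    """Return 'short', 'watch', or 'hold' from the three data points.

    Assumed rule: rising supply together with falling demand implies excess
    supply, hence 'short'. Any single warning sign only yields 'watch'.
    """
    supply_up = dp.news_risk > 0
    demand_down = dp.demand_ratio < demand_drop
    if supply_up and demand_down:
        return "short"
    if supply_up or demand_down or dp.tweet_sentiment < 0:
        return "watch"
    return "hold"

# Illustrative values only, not measurements.
print(trading_signal(DataPoints(news_risk=1.5, demand_ratio=0.5,
                                tweet_sentiment=-0.3)))  # short
```

Twitter sentiment is deliberately given the least weight here, in line with the remark that it would not help much in this case.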

I encourage everyone to experiment with different machine learning techniques and try to implement the steps mentioned above, in order to make a functional, more accurate price predictor for oil, and any other commodity.

This is a theoretical and quite laconic approach to the topic in order to help people think. Once I have developed the model, another article will follow, explaining it part-by-part.

Do you want to learn more?

If you want to advance your knowledge and are interested in predicting more commodities’ prices I highly encourage you to read the articles listed below:

Filippos Dounis: I'm a 16-year-old student freelancer. I have been programming for seven years now and I specialize in Machine Learning. The majority of my research revolves around machine learning models in economics, finance, and medicine.
