Last fall (2019) I enrolled in the CDSP class to enrich my data science knowledge. Besides teaching a lot of theory about data mining, clustering, the semantic web, time series analysis, neural networks and other state-of-the-art techniques, what set CDSP apart for me was the practical exercises that go along with the theory. At the end of the course, that knowledge comes together in a final case study on a business problem of your own choosing.
I paired up with Mels, a trader in dairy products at Hoogwegt International, to investigate possible predictors for the price of butter.
In this blog I will explain the process we went through, and how applying what we learned during the course to a real-world problem not only enhanced the learning experience but also immediately resulted in a useful business case.
First a little bit about me. I have a background in Business Intelligence. This entails extracting and transforming data, creating business dashboards, interacting with databases, writing the odd Python script and other general data-wrangling. I do not have much knowledge about data science. Also I don’t know much about the in-depth workings of the stock market in general, or trading dairy products in particular. I know my milk comes from a cow and I get it from my local grocery store at a price that seems reasonable to me.
My team partner Mels, however, knows everything about cows and dairy products: how much milk cows produce at what age, how it should be stored, all the derivative products you can make from milk, which countries are net importers of these products and which are net exporters, what makes a good milk season, and how sunshine in New Zealand influences the price of my chocolate bar in the Netherlands.
Together we sat down and thought about what would be a good case to apply our newly acquired data knowledge to. Mels had an interesting idea. In trading, there is traditionally a strong correlation between stocks and prices. When prices are high, stocks are low because everyone wants to sell at those high prices; conversely, at low prices everyone stockpiles their product and waits for prices to rise again. In one particular market, however, the butter market, this correlation is not that evident: stocks and prices do not seem correlated at all.
Could we use our new Data Science knowledge to quantify the seemingly low correlation between price and stock in the butter market and find alternative predictors?
In the course we learned about a variety of data science techniques, but which one was the appropriate one to apply?
We decided, after some deliberation with our very helpful teachers, to apply four of them: general statistics to show correlation and significance, time series analysis to find trends and seasonality, and generalized linear models to find predictors for the butter price. Lastly, we would use the Akaike information criterion (AIC) for model selection, a different but established method for finding predictors.
The general statistics were the easiest part. Calculating the correlation between the butter price series and the stock series in R gave us a Pearson correlation coefficient of 0.19. Since this is nowhere near 0.70, a common rule of thumb for a strong correlation, we could confidently say that the two series are at best weakly correlated.
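For readers who want to try this kind of check themselves, here is a minimal sketch in plain Python. We used R in the course, so this is an illustration and not our actual script, and the price and stock values are invented. The second function computes the t-statistic that is commonly used to test whether a correlation differs from zero, because the size of a coefficient and its statistical significance are two separate questions.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    """t-statistic for testing r != 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical monthly values, invented for illustration only.
price = [160, 172, 165, 181, 190, 176, 201, 195, 188, 210, 205, 220]
stock = [48, 55, 41, 52, 60, 39, 57, 44, 62, 50, 46, 58]

r = pearson_r(price, stock)
print(f"r = {r:.2f}, t = {t_statistic(r, len(price)):.2f}")
```

The t-statistic depends on the number of observations as well as on r, which is why a small coefficient on a long series can still be significant while remaining weak.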
Next we applied time series decomposition to see if there were seasonal effects. We split the time series into three parts: trend effects, seasonal effects, and the remaining effects, grouped together as random effects. The results are shown in the figure below.
The increasing trend line indicates that the average butter price rose over the period shown, which points to an increasing general demand for butter.
What clearly stands out is the seasonal effect. Mels had a logical explanation for this from a business perspective. Before the festive holidays in December, prices go up since everybody wants butter for cooking and snacks. After the holidays, in January, there is much less demand, since New Year's resolutions rarely entail eating lots of butter. So everyone who has a surplus of butter dumps it at low prices to avoid storage costs, hence the post-December dump prices.
Taking a closer look at the y-axis, however, reveals that the random effects dominate. The trend shows a fairly steady increase from 160 to 220, and where the seasonal variation lies between approximately -10 and +20, the random variation lies between -30 and +60.
Effects other than season and trend are clearly affecting the price, so the question now is: can we find useful predictors using Mels's knowledge of the dairy business?
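The decomposition step can be sketched as follows. This is a minimal plain-Python version of a classical additive decomposition, assuming monthly data (a period of 12) and at least two full years of observations. The series is synthetic and the function name is my own; it is an illustration, not the R code we used.

```python
def decompose_additive(series, period=12):
    """Classical additive decomposition: series = trend + seasonal + random."""
    n = len(series)
    if period % 2 or n < 2 * period:
        raise ValueError("sketch assumes an even period and >= 2 full periods")
    half = period // 2
    trend = [None] * n
    # Centered moving average for an even period: the two end points of the
    # window get half weight so the window stays symmetric around t.
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    # Average the detrended values per position in the cycle (per month).
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    raw = [sum(b) / len(b) for b in buckets]
    offset = sum(raw) / period  # make the seasonal effects sum to zero
    seasonal = [raw[t % period] - offset for t in range(n)]
    # Whatever trend and season do not explain is the random component.
    random_part = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, random_part

# Synthetic three years of monthly prices: linear trend plus a fixed pattern.
pattern = [-8, -4, 0, 2, 3, 1, -2, -5, -3, 2, 6, 8]
series = [160 + 1.5 * t + pattern[t % 12] for t in range(36)]
trend, seasonal, random_part = decompose_additive(series)

observed = [r for r in random_part if r is not None]
print("seasonal range:", min(seasonal), max(seasonal))
print("random range:", min(observed), max(observed))
```

Comparing the ranges of the seasonal and random components, as in the last two lines, is the same check we did by reading the y-axes of the figure. In this noise-free synthetic series the random component is essentially zero; in our butter data it was the largest of the three.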
First, being good students, we made a point of splitting the data into a training set, on which we would train our algorithms, and a validation set, against which we would compare the algorithms' predictions.
We made the split such that we had six years of training data and one year of validation data. Predicting further into the future did not seem necessary, and we had to be careful not to make our training set too small, to prevent overfitting.
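For time series the split has to respect time order: the validation period comes after the training period, rather than being a random sample. A minimal sketch, assuming monthly observations (so seven years is 84 rows); the function name and the year/month labels are hypothetical:

```python
def chronological_split(rows, validation_size):
    """Keep time order: the last `validation_size` rows form the validation set."""
    if not 0 < validation_size < len(rows):
        raise ValueError("validation_size must be between 1 and len(rows) - 1")
    return rows[:-validation_size], rows[-validation_size:]

# Hypothetical seven years of monthly observations, labeled (year, month).
months = [(1 + i // 12, i % 12 + 1) for i in range(84)]
train, validation = chronological_split(months, validation_size=12)

print(len(train), len(validation))   # 72 12
print(train[-1], validation[0])      # (6, 12) (7, 1)
```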
Using Mels's business knowledge we selected 21 variables possibly related to the butter price. Since every business has its secrets I cannot fully disclose what each of them was, but suffice it to say that we selected some time variables and variations of stock indicators as well as some other possible predictors.
Fitting all 21 variables to the training data resulted in the left figure below. Using only the variables that were deemed significant, five in total, gave us the figure on the right.
The better fit, and thus lower root mean squared error (rmse), on the left does not necessarily make it the better model, since a low rmse can also be the result of overfitting: adding variables never makes the fit on the training data worse, and in the extreme case, with as many variables as data points, the fit is perfect, regardless of whether these variables are in any way logically related to the target variable.
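This effect is easy to demonstrate. The sketch below fits made-up "prices" on columns of pure random noise and reports the training rmse as more columns are added. Everything here is invented for illustration. To stay in plain Python it computes the least-squares residual by Gram-Schmidt orthogonalization, which gives the same training error as an ordinary least-squares fit (without intercept) as long as the columns are linearly independent.

```python
import math
import random

def train_rmse(columns, y):
    """Training rmse of a least-squares fit of y on the given columns."""
    basis = []
    residual = list(y)
    for col in columns:
        v = list(col)
        for q in basis:  # remove what earlier columns already explain
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        q = [a / norm for a in v]
        basis.append(q)
        # Subtract the part of y that this new direction can explain.
        dot = sum(a * b for a, b in zip(residual, q))
        residual = [a - dot * b for a, b in zip(residual, q)]
    return math.sqrt(sum(r * r for r in residual) / len(y))

random.seed(0)
n = 6
y = [random.gauss(200, 20) for _ in range(n)]  # made-up "prices"
# n predictor columns of pure noise, unrelated to y by construction.
noise = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]

results = [(p, train_rmse(noise[:p], y)) for p in range(1, n + 1)]
for p, error in results:
    print(p, round(error, 3))
```

The training error can only go down as columns are added, and with six noise columns for six data points it reaches zero up to rounding, even though none of the columns has anything to do with the target.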
The next step was testing a different technique in which, starting from zero, variables are added one at a time to see whether the fit improves. We learned this can be done using a generalized linear model with L1 regularization. The weight of one variable at a time is varied until the fit no longer improves; then a variable is added and the weights are varied again until the fit no longer improves. The resulting mean squared error is calculated each time. Below on the left you can see each coefficient varied one at a time, starting from zero, until all are added, and on the right the resulting mean squared error. One can see that adding more than the first 12 variables no longer decreases the mean squared error, so 12 variables give the optimum fit.
The resulting fit on our training data is plotted in the figure below. The rmse in this case is 21.
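To give an idea of what happens under the hood, here is a minimal sketch of L1 regularization for an ordinary linear model (often called the lasso), solved by updating one coefficient at a time. This is my own simplified illustration on a toy data set, assuming centered predictors and target; it is not the R routine we used in the course.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; anything within [-lam, lam] becomes 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso(X, y, lam, sweeps=200):
    """Coordinate descent for (1/2n) * ||y - X b||^2 + lam * ||b||_1.

    Assumes the columns of X and the target y are centered and that no
    column is all zeros.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = [float(v) for v in y]  # y - X @ beta, with beta = 0
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):  # vary one coefficient at a time
            # Fit of column j to the residual, with its own current
            # contribution added back in.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n
            rho += col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new
    return beta

# Toy data: two centered predictors; the target is driven mostly by the first.
X = [[1, 0], [-1, 0], [0, 1], [0, -1]]
y = [3.0, -3.0, 0.2, -0.2]

# From a strong penalty to none at all.
path = {lam: lasso(X, y, lam) for lam in (2.0, 0.5, 0.05, 0.0)}
for lam, beta in path.items():
    print(lam, beta)
```

With a strong penalty both coefficients stay at zero. As the penalty is lowered, the first predictor enters the model, and only at a much lower penalty does the weak second predictor follow. Choosing the penalty at which the mean squared error stops improving then decides how many variables end up in the model.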
Lastly we applied stepwise AIC to find an optimum model. This not only tries to find the best fit by adding one variable at a time, but also penalizes the use of too many variables, which increases the risk of overfitting. It thus strikes a balance between goodness of fit and model complexity. This resulted in 12 variables. The resulting fit gave an rmse of 12 and is shown below.
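The idea behind stepwise AIC can be sketched in a few lines. The version below is a forward-only selection for an ordinary linear model, using the Gaussian AIC up to an additive constant, n * ln(RSS / n) + 2k, where k counts the intercept and the slopes. The data, the variable names and the helper functions are hypothetical, and the residuals are computed with plain-Python Gram-Schmidt instead of a statistics library.

```python
import math

def rss(columns, y):
    """Residual sum of squares of a least-squares fit with intercept."""
    residual = list(y)
    basis = []
    for col in [[1.0] * len(y)] + columns:  # the column of ones is the intercept
        v = list(col)
        for q in basis:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        q = [a / norm for a in v]
        basis.append(q)
        dot = sum(a * b for a, b in zip(residual, q))
        residual = [a - dot * b for a, b in zip(residual, q)]
    return sum(r * r for r in residual)

def aic(columns, y):
    """Gaussian AIC up to an additive constant: n * ln(RSS / n) + 2 * k."""
    n = len(y)
    k = len(columns) + 1  # slopes plus intercept
    return n * math.log(rss(columns, y) / n) + 2 * k

def forward_stepwise(candidates, y):
    """Add the variable that lowers AIC the most; stop when none does."""
    chosen = []
    best = aic([], y)
    while True:
        trials = [
            (aic([candidates[c] for c in chosen + [name]], y), name)
            for name in candidates if name not in chosen
        ]
        if not trials or min(trials)[0] >= best:
            return chosen, best
        best, name = min(trials)
        chosen.append(name)

# Hypothetical data: y follows x1 plus a small wiggle; "noise" is unrelated.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
wiggle = [1, -1, -1, 1, 1, -1, -1, 1]
y = [2 * a + 0.1 * w for a, w in zip(x1, wiggle)]
candidates = {"x1": x1, "noise": [1, 1, -1, -1, 1, 1, -1, -1]}

chosen, best = forward_stepwise(candidates, y)
print(chosen, round(best, 2))
```

Here the selection keeps x1 and rejects the noise variable: adding it does not reduce the residuals enough to pay for the extra parameter, which is exactly the penalty at work.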
Remember we split the data into a training and a validation set? Up to now we had only used the training set. To validate the algorithms and see how well they predict the butter price, we compared their predictions with the validation set. The outcomes are shown below.
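In code, this comparison comes down to computing the rmse of each model's predictions against the held-out prices and picking the lowest. A minimal sketch with invented numbers and placeholder model names; these are not the results of our case study:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))

# Hypothetical validation prices and predictions, for illustration only.
validation_prices = [210, 212, 211, 213, 212, 214]
predictions = {
    "model A": [205, 220, 200, 225, 204, 223],
    "model B": [208, 211, 214, 212, 215, 213],
}

scores = {name: rmse(validation_prices, p) for name, p in predictions.items()}
best = min(scores, key=scores.get)
for name, score in scores.items():
    print(name, round(score, 2))
print("lowest validation rmse:", best)
```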
We now see that although the AIC model scored well on the training data, it performs worse on the validation data. The best model seems to be the generalized linear model with L1 regularization. An explanation for this could be that the pattern in the training set is very different from the one in the validation set. If you look at the split in the figure below, you see a flat price development instead of a volatile one just where we put the split.
Since this pattern did not occur in the training set, it is hard to predict.
Then again, the hope was that an algorithm might pick up a pattern that we could not see ourselves. And the predictions might happen to follow the same flat pattern, thus still resulting in a good fit.
Is any model good enough to trade on? Well, Mels does not think so. He concluded that none of the selected variables was good enough to predict the butter price and that the market was apparently driven by things other than the selected variables. He was, however, very enthusiastic, because he now had a useful and reliable tool to test whether a variable is a good predictor. He no longer has to rely on gut feeling alone: he can back it up with numbers and explain it to his colleagues.
In the Certified Data Science Course we learned a lot of techniques to extract information from data. We applied some of these to predict butter prices, which could be used by Mels, a dairy product trader, in his daily work. Although we did not find any clear predictors we did find a reliable and reproducible technique to check if a variable is a good predictor.
It is important to mention that neither Mels nor I had any extensive knowledge of data science prior to the course. Yet by applying the newly acquired knowledge directly in the course, and with the help of our teachers, we could investigate a real-world practical question where the business value is immediately evident. And we were not alone. Other teams worked at banks and used their final assignment to predict the chance that defined groups of people would be eligible for a mortgage, thus reducing the cost of the intake process. Another team predicted the occupancy of cells in police stations to improve the flow of short-stay inmates.
Of course we could further improve our techniques and evaluate other predictors, or maybe even use text mining of newspaper articles to predict the butter price. Now, thanks to the course, we can!
I would like to thank our teachers, Hugo Koopmans and Koen de Koning, for their great course and their patience in explaining all the techniques to us!