The Python of Wall Street, Part 4
Or: sorry, you need to train it again
Disclaimer:
This project is an eight-part project that I'll leverage as an opportunity to implement different technologies I want to explore.
You are welcome to re-use any part of this script, but I would not advise using it on the stock market with your own money. If you do, I am in no way responsible for whatever may result from it.
part 1: extracting data and computing trend indicators
part 2: creating an ETL pipeline for data quality and centralization
part 3: creating a classification model
part 4: automatic retraining of the model
part 5: create APIs to access the data
part 6: data visualization
part 7: create Docker containers for a micro-services architecture
part 8: process automation
What is the pipeline?
The whole pipeline, from data extraction from the API, through data processing, feature extraction and augmentation, to labelling the data and inserting it into the database before training, is as follows:
Why you need to retrain your model
The model you train only captures the patterns present in past data. Therefore, it is only representative of the environment at a given point in time.
Models will often require retraining at some point after they are released into production. Often the distributions and statistics of the underlying data, the ground truth, will change after deployment. These changes are usually from changing market and demographic conditions.
One can see on this graph the drift between the training set and the test (or production) set. One can also see the difference between the model and the true underlying function. This model was relevant before the environment evolved, but can no longer be used in production.
Retraining approaches
In order to detect drift, I implemented two approaches.
One is based on a statistical approach relying on tests for discrete data (binary data, i.e. 0 and 1 for the indicators and candlesticks) and continuous data (features whose values range from -infinity to +infinity).
The other is based on machine learning: a classifier that is supposed to discriminate between new data and former data.
Statistical tests:
A statistical test provides a mechanism for making quantitative decisions about a process or processes. The intent is to determine whether there is enough evidence to “reject” a conjecture or hypothesis about the process.
https://www.itl.nist.gov/div898/handbook/prc/section1/prc13.htm
Given a population and a new sample, what is the likelihood that this sample, given its mean and standard deviation, belongs to the population?
Given the score that the statistical test returns, does the sample belong to the population or not? Is it possible to reject the null hypothesis?
This can be written as follows:
- H0 (null hypothesis) => the sample distribution belongs to the population distribution; given the p-value, it is not possible to reject the null hypothesis.
- H1 (alternative hypothesis) => the sample distribution does not belong to the population distribution; given the test score, the null hypothesis is rejected.
Discrete data:
Data are discrete if they can only take certain values. In our case, some of the features can only be 0 or 1; the following features are categorical: rsi_indicator, stoch_indicator, ema_indicator, bollinger_indicator and candlesticks.
- Chi-square test: a chi-square test measures how the actual observed data compare to the expected output of the experiment. For instance, if you toss a coin 1000 times and want to check whether the coin is balanced, you could compare the actual number of heads and tails to the expected numbers, which would be 500 heads and 500 tails.
According to the resulting score of the test and a lookup in the chi-square table, you can then assess whether or not you can reject the null hypothesis, which in this case is that the distribution of the given feature is the same in the new dataset and the old one.
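Below is a minimal sketch of such a check using scipy, assuming the old and new values of a binary feature are available as pandas Series (the column, function and threshold names are illustrative, not the exact implementation):

```python
# Minimal sketch of a chi-square drift check on a binary feature (illustrative names).
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def chi_square_drift(old: pd.Series, new: pd.Series, alpha: float = 0.05) -> bool:
    """Return True if the null hypothesis (same distribution) is rejected."""
    # Contingency table: counts of 0s and 1s in the old and new datasets
    table = np.array([
        [(old == 0).sum(), (old == 1).sum()],
        [(new == 0).sum(), (new == 1).sum()],
    ])
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha

# e.g. chi_square_drift(old_df["rsi_indicator"], new_df["rsi_indicator"])
```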
Continuous data:
Data are continuous if they can take any value within a given range. In our case, some of the features lie in the -infinity to +infinity range.
In our case, the following features are continuous data: volume, numberOfTrades, RSI, var_ema, var_bollinger, var_stoch.
- T-test: a t-test is a type of inferential statistic used to determine if there is a significant difference between the means of two groups.
To perform a t-test, one needs three key values: the difference between the mean values of the two datasets (mean difference), the standard deviation of each group, and the number of data points in each group.
Then, a t-score is calculated, and this score can be used in conjunction with the t-table to assess whether or not this mean difference is likely to occur and whether the null hypothesis can be rejected.
In this case, for instance, if the t-value falls in one of the rejection regions (the red areas in the tails of the t-distribution), it would be deemed unlikely, and the null hypothesis would therefore be rejected.
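As a sketch, this check can be done with scipy's two-sample t-test, assuming the old and new values of a continuous feature are 1-D arrays (names are illustrative):

```python
# Minimal sketch of a two-sample t-test drift check on a continuous feature.
from scipy.stats import ttest_ind

def t_test_drift(old_values, new_values, alpha: float = 0.05) -> bool:
    """Return True if the means differ significantly (null hypothesis rejected)."""
    # Welch's t-test (equal_var=False) avoids assuming equal variances
    _, p_value = ttest_ind(old_values, new_values, equal_var=False)
    return p_value < alpha
```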
However, a drawback of the t-test is that it assumes a normal distribution of the data, which may not hold in many cases. Therefore, another test should be used.
Cumulative distribution function (CDF): the probability that a variable takes a value less than or equal to a given value, i.e. F(x) = P(X <= x).
- Kolmogorov-Smirnov test: this is a non-parametric test which removes the normality assumption and therefore generalizes better to real-life cases than the t-test. Its purpose is to compare the shapes of two distributions.
- The test measures the delta between the CDFs of the two populations. The score is equal to the maximum of those differences (it should be 0 if the distributions are equal). Then, once again, the probability of observing this delta value if the null hypothesis is true is assessed. Given this value, the decision to reject or not reject the null hypothesis is made.
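A minimal sketch with scipy's two-sample Kolmogorov-Smirnov test, under the same assumptions as above (illustrative names, not the exact implementation):

```python
# Minimal sketch of a Kolmogorov-Smirnov drift check on a continuous feature.
from scipy.stats import ks_2samp

def ks_drift(old_values, new_values, alpha: float = 0.05) -> bool:
    """Return True if the two distributions differ (null hypothesis rejected)."""
    statistic, p_value = ks_2samp(old_values, new_values)
    return p_value < alpha
```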
Classifier approach:
- Random forest:
Another approach, based on a random forest, can also be implemented. The objective here is to train a model on the new incoming data on one hand, and on a sub-sample of the already inserted data on the other hand, so that the resulting dataset is balanced.
Then, a model is trained to classify each record as belonging to the new or the old data.
If the new data are close to the old ones, the classification performance (given by the area under the curve) will be very poor, as the model is not able to clearly discriminate between the two classes. Otherwise, if the model actually performs well and is able to efficiently differentiate between the two classes of old and new data, it means that the distribution of the new data is different from the old one, and therefore, once the new data are inserted, a new model should be trained. A sketch of this check is given at the end of this section.
The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s; by analogy with a medical test, the better it is at distinguishing between patients with and without a disease.
The ROC curve is plotted with TPR against the FPR where TPR is on y-axis and FPR is on the x-axis.
TPR: True Positive Rate = TP / (TP + FN)
FPR: False Positive Rate = FP / (FP + TN)
- However, even if this method is quite efficient and powerful, it can become increasingly compute-heavy if the training dataset is large; moreover, training time may also be a problem.
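Below is a minimal sketch of this classifier-based check, assuming old_df and new_df are pandas DataFrames sharing the same feature columns (names and parameters are illustrative, not the exact implementation):

```python
# Minimal sketch of the classifier-based drift check (illustrative names).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def classifier_drift_auc(old_df: pd.DataFrame, new_df: pd.DataFrame) -> float:
    """AUC of a model trained to separate old data (label 0) from new data (label 1).

    An AUC close to 0.5 means the model cannot tell them apart (no drift);
    an AUC close to 1.0 means the distributions differ and retraining is needed.
    """
    # Sub-sample the old data so the two classes are balanced
    old_sample = old_df.sample(n=min(len(old_df), len(new_df)), random_state=42)
    X = pd.concat([old_sample, new_df], ignore_index=True)
    y = np.concatenate([np.zeros(len(old_sample)), np.ones(len(new_df))])

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42
    )
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```

With a score like this, one could for instance trigger retraining when the AUC exceeds a chosen threshold (e.g. 0.7); the exact threshold is a design choice, not something fixed by the method.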
Conclusion
Several methods exist to assess whether the environment and the extracted data have changed over time, and therefore whether the model needs retraining to compensate for drift. Given the time, accuracy and computing constraints, either statistical methods (a combination of the chi-square test and the Kolmogorov-Smirnov test) or machine-learning-based methods can be leveraged.
Next Steps
Now that the model factory is created and operational, it is interesting to add some data visualization and to access the data in an easier way. Therefore, an API for the master record will be implemented. Moreover, once micro-services are in place, the communication between containers will be done via APIs.