The Python of Wall Street, Part 3
Or how to train a model
Disclaimer:
This is an eight-part project that I'll leverage as an opportunity to implement different technologies I want to explore.
You are welcome to re-use any part of this script, but I would not advise using it on the stock market with your money. If you do, I am in no way responsible for whatever may result from it.
part 1: extracting data and computing trend indicators
part 2: creating an ETL pipeline for data quality and centralization
part 3: creating a classification model
part 4: automatic retraining of the model
part 5: creating APIs to access the data
part 6: data visualization
part 7: creating Docker containers for a microservices architecture
part 8: process automation
Part 1: Or was all this work worth it?
The end goal of all the previous steps was to create a systematic way to extract data, augment features and label the data, in order to get a dataset containing three classes:
- buy: 1
- hold: 0
- sell: -1
The pipeline makes it possible to filter the data and compute the extra features before actually training the model, resulting in a better segmentation and a lower load on the machine once it comes to training.
On the other hand, due to the labelling process, the dataset is quite imbalanced, with a lot of 0 labels and only a few 1s and -1s.
This will probably cause a problem during training and will be tackled later.
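A quick way to see how severe the imbalance is, before any training, is to look at the label distribution. A minimal sketch with pandas, where the file name and the label column are illustrative assumptions, not the pipeline's actual schema:

```python
import pandas as pd

# Hypothetical loading step: "labelled_dataset.csv" and the "label"
# column are assumptions, not the actual output of the ETL pipeline.
df = pd.read_csv("labelled_dataset.csv")

# With one buy (1) and one sell (-1) per day against many holds (0),
# the distribution should be heavily skewed towards 0.
print(df["label"].value_counts(normalize=True))
```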
Part 2: Final manipulations
In the previous steps, some of the extra features that were computed were based on moving averages. When computing them, the first values are therefore null, but they still needed to be inserted into the database with some value, so they were stored as -100.
Those rows can't be used by the model, so they are dropped before training.
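In pandas this clean-up is a one-liner. A minimal sketch, assuming -100 was used as the sentinel for the moving-average-based columns (the exact column set is an assumption):

```python
import pandas as pd

df = pd.read_csv("labelled_dataset.csv")  # same hypothetical file as above

# Drop every row where a moving-average-based feature still holds the
# -100 placeholder; those rows cannot be used for training.
sentinel_cols = ["var_ema", "var_bollinger", "var_stoch", "RSI"]
df = df[~(df[sentinel_cols] == -100).any(axis=1)]
```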
The resulting dataset contains:
- Candlestick data: binary, 0 or 1
- volume, numberOfTrades, RSI: continuous data, as provided by the API or computed
- var_ema, var_bollinger, var_stoch: the differences between the fast and slow moving averages, between the upper and lower Bollinger bands, and between %K and %D; this aggregated data is more representative of the general behaviour than of the specific values taken by the data, which may be too tied to the particular stock (a sketch of how these deltas can be derived follows the list)
- rsi_indicator, stoch_indicator, ema_indicator, bollinger_indicator: -1, 0 or 1, depending on the classic technical analysis rules
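To make the aggregation concrete, here is a sketch of how such deltas can be derived with pandas. The window sizes are common defaults, not necessarily the ones used in the earlier parts, and the stochastic is computed from the close only for brevity:

```python
import pandas as pd

def add_aggregated_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive relative indicators from a DataFrame with a 'close' column."""
    # var_ema: spread between a fast and a slow exponential moving average.
    fast = df["close"].ewm(span=12, adjust=False).mean()
    slow = df["close"].ewm(span=26, adjust=False).mean()
    df["var_ema"] = fast - slow

    # var_bollinger: width between the upper and lower Bollinger bands.
    std = df["close"].rolling(20).std()
    df["var_bollinger"] = 4 * std  # (ma + 2*std) - (ma - 2*std)

    # var_stoch: distance between %K and its smoothed version %D.
    low = df["close"].rolling(14).min()
    high = df["close"].rolling(14).max()
    k = 100 * (df["close"] - low) / (high - low)
    df["var_stoch"] = k - k.rolling(3).mean()
    return df
```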
Part 3: Training a model
Picking a model
The model I picked was a random forest classifier, for the following reasons:
- Good for classification tasks
- Handles missing values and maintains accuracy when data is missing
- Does not overfit easily
- Handles large datasets with high dimensionality
Fine tuning
For fine-tuning purposes, I used grid search, an exhaustive search over specified parameter values for an estimator, in order to find the best possible hyperparameters for this model.
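A minimal sketch of such a search with scikit-learn's GridSearchCV. The grid below is illustrative and simply includes the winning combination reported further down; X and y are assumed to hold the cleaned feature matrix and labels from the steps above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 8],
    # "auto" was valid in the scikit-learn version used here;
    # newer releases expect "sqrt" instead.
    "max_features": ["auto", "sqrt"],
    "n_estimators": [10, 50, 100],
}

# Exhaustive search over every combination, scored by cross-validation.
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)  # X, y: cleaned features and labels from the steps above
print(search.best_params_)
```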
Moreover, a more diverse dataset is needed. As of now, the processing and selection result in a very imbalanced dataset: since the goal is to predict the best moments to buy and sell on a given day, there is only one -1 and one 1 label per day, which is very low compared to the large number of 0 labels.
The resulting scoring is as follows:
{'criterion': 'gini', 'max_depth': 4, 'max_features': 'auto', 'n_estimators': 100}
saving model as: model.sav
Accuracy: 1.00 (+/- 0.00)
fit_time
[0.24494672 0.23546934 0.2274127 0.2270391 0.22433066]
score_time
[0.01685119 0.01598787 0.01579905 0.0159049 0.01657605]
test_precision_macro
[1. 1. 1. 1. 1.]
test_recall_macro
[1. 1. 1. 1. 1.]
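Output in this shape is what scikit-learn's cross_validate returns. For reference, a sketch of how these numbers and the model.sav file can be produced, though the exact call I used may differ:

```python
import pickle
from sklearn.model_selection import cross_validate

best_model = search.best_estimator_  # best candidate from the grid search

scores = cross_validate(
    best_model, X, y, cv=5,
    scoring=("precision_macro", "recall_macro"),
)
for metric, values in scores.items():  # fit_time, score_time, test_*
    print(metric)
    print(values)

# Persist the fitted model; a .sav file is just a pickle by convention.
with open("model.sav", "wb") as f:
    pickle.dump(best_model, f)
```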
Conclusion
With such a high recall and precision, the model is definitely overfitting on the 0 data points, which is an issue. On the other hand, the whole data and training pipeline is fast, efficient and robust, so adding manipulations to deal with the imbalanced dataset and get more relevant results is probably doable.
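One low-effort first manipulation would be to reweight the classes rather than resample the data. A sketch, reusing the hyperparameters found above; this is a suggestion, not what the current pipeline does:

```python
from sklearn.ensemble import RandomForestClassifier

# class_weight="balanced" reweights each class inversely to its frequency,
# so the rare 1 and -1 labels matter as much as the abundant 0s in training.
model = RandomForestClassifier(
    criterion="gini", max_depth=4, n_estimators=100,
    class_weight="balanced",
)
model.fit(X, y)
```

Oversampling the minority classes, for instance with the imbalanced-learn library, would be another option worth comparing.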
Next Steps
The incoming data is dynamic, with new data arriving daily, so over time the model may drift and its performance may drop when it is fed new data that does not follow the historical patterns it was trained on. An automated way to quantify whether the new input data is significantly different, and therefore whether the model is still relevant or needs to be retrained to better fit the situation, would be interesting.
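A simple starting point for that check could be a per-feature two-sample Kolmogorov-Smirnov test between the training data and the freshly ingested data. A sketch, with an arbitrary significance threshold:

```python
from scipy.stats import ks_2samp

def drifted_features(train_df, new_df, features, alpha=0.05):
    """Return the features whose distribution shifted between datasets."""
    drifted = []
    for col in features:
        _, p_value = ks_2samp(train_df[col], new_df[col])
        if p_value < alpha:  # the two samples likely differ
            drifted.append(col)
    return drifted

# Retraining could then be triggered when a large share of the features
# drift, e.g. when len(drifted) > len(features) // 2.
```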