# print(result) "Part 4 of NotADev"

## **Introducing Machine Learning**

Now that the data was enriched with technical indicators and lag features, it was time to build a predictive model to forecast stock movements.

💭

I’ve worked on predictive models for decades, mainly around services, customers or debt, mostly in the telecommunications industry. The stock market, and the crypto market even more so, aren’t easy eggs to crack, and I’m no Jim Simons, but I did want an element of prediction to support the technical indicators, so I could get a reasonable inkling of where the share price was heading and hopefully catch more wins than losses.

---

### **Building the Predictive Model**

The AI assistant suggested using **XGBoost**, a powerful and efficient gradient boosting algorithm that's well-suited for tabular data.

> ## **What is XGBoost in Machine Learning?**
>
> XGBoost, or eXtreme Gradient Boosting, is a machine learning algorithm under ensemble learning. It is trendy for supervised learning tasks, such as regression and classification. XGBoost builds a predictive model by combining the predictions of multiple individual models, often decision trees, in an iterative manner.
>
> The algorithm works by sequentially adding weak learners to the ensemble, with each new learner focusing on correcting the errors made by the existing ones. It uses a gradient descent optimization technique to minimize a predefined loss function during training.
>
> Key features of XGBoost Algorithm include its ability to handle complex relationships in data, regularization techniques to prevent overfitting and incorporation of parallel processing for efficient computation.
>
> source: [What is the XGBoost algorithm and how does it work? (analyticsvidhya.com)](https://www.analyticsvidhya.com/blog/2018/09/an-end-to-end-guide-to-understand-the-math-behind-xgboost/#:~:text=XGBoost%20builds%20a%20predictive%20model,made%20by%20the%20existing%20ones.)
```python
from xgboost import XGBClassifier

def prepare_data(data):
    # Define the target variable
    data['Future_Return'] = (data['Close'].shift(-1) - data['Close']) / data['Close']
    data['Target'] = (data['Future_Return'] > 0).astype(int)
    data.dropna(inplace=True)

    # Select features
    features = data.drop(['Target', 'Future_Return', 'Close'], axis=1).columns
    X = data[features]
    y = data['Target']
    return X, y
```
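To see what the target looks like, here is a minimal sketch on a made-up four-row price series (the prices are invented for illustration). It repeats only the target lines from `prepare_data`, not the feature selection.

```python
import pandas as pd

# Invented closing prices, purely for illustration
toy = pd.DataFrame({"Close": [100.0, 102.0, 101.0, 105.0]})

# Same target construction as prepare_data: next-day return, then up (1) or not (0)
toy["Future_Return"] = (toy["Close"].shift(-1) - toy["Close"]) / toy["Close"]
toy["Target"] = (toy["Future_Return"] > 0).astype(int)

# The last row has no "next day", so its Future_Return is NaN and it is dropped
toy = toy.dropna()

print(toy)
```

Each row's `Target` describes what happens on the following day, which is why the final row cannot be labelled and has to go.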

🛑

Challenge: When splitting the data into training and testing sets, the code initially used random shuffling, which is not appropriate for time series data because it breaks the temporal order and lets the model train on days that come after the ones it is tested on. Obviously, the AI only gave me the reason after I ran into errors.

💡

AI's Solution: The AI recommended using a time series split to preserve the sequence of data.
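A quick way to see what a time series split does is to print the fold indices. This is a small sketch using twelve placeholder rows rather than the real stock data; the names `X_demo`, `tscv_demo` and `folds` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve placeholder rows standing in for twelve consecutive trading days
X_demo = np.arange(12).reshape(-1, 1)

tscv_demo = TimeSeriesSplit(n_splits=3)
folds = list(tscv_demo.split(X_demo))

for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    # Every training index comes before every test index
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

The training window grows with each fold while the test window always sits after it, so the model is never scored on days earlier than the ones it learned from.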

```python
from sklearn.model_selection import TimeSeriesSplit

def train_model(X, y):
    tscv = TimeSeriesSplit(n_splits=5)
    model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

    # Each fold trains on an earlier window; the later window (X_test, y_test) is held out
    for train_index, test_index in tscv.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        model.fit(X_train, y_train)

    # fit() retrains from scratch, so the returned model comes from the last (largest) fold
    return model
```

---

ℹ

A little insight into train and test.

### Train and Test Data

When building a machine learning model, it’s crucial to evaluate its performance on unseen data. This is where the concepts of train and test data come into play.

1. **Training Data**:
    * **Purpose**: Used to train the model.
    * **Process**: The model learns patterns, relationships, and features from this data.
    * **Example**: If you have a dataset of house prices, the training data would include features like the number of rooms, location, and the corresponding house prices.
2. **Test Data**:
    * **Purpose**: Used to evaluate the model’s performance.
    * **Process**: After training, the model makes predictions on the test data, and these predictions are compared to the actual values to assess accuracy.
    * **Example**: Continuing with the house prices example, the test data would also include features like the number of rooms and location, but the model would predict the house prices, which are then compared to the actual prices.

---

### Using XGBoost with Train and Test Data

Here’s a quick example guide to using XGBoost with train and test data:

1. **Import Libraries**:

    ```python
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    ```

2. **Load and Prepare Data**:

    ```python
    # Example using a dataset
    from sklearn.datasets import load_breast_cancer
    data = load_breast_cancer()
    X = data.data
    y = data.target
    ```

3. **Split Data into Train and Test Sets** (a random split is fine here because this dataset is not a time series):

    ```python
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    ```

4. **Train the XGBoost Model**:

    ```python
    model = xgb.XGBClassifier()
    model.fit(X_train, y_train)
    ```

5. **Make Predictions on Test Data**:

    ```python
    y_pred = model.predict(X_test)
    ```

6. **Evaluate the Model**:

    ```python
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy * 100:.2f}%")
    ```

### Why Split Data?

* **Avoid Overfitting**: By evaluating the model on unseen data (test data), you can check that it generalises well and isn’t just memorising the training data.
* **Model Validation**: It helps in validating the model’s performance and tuning hyperparameters effectively.

### Conclusion

Using train and test data is essential for building robust machine learning models. XGBoost, with its powerful capabilities, handles this process efficiently, but only the score on the test data shows how well a model really performs.

Resources: [How to train XGBoost models in Python (youtube.com)](https://www.youtube.com/watch?v=aLOQD66Sj0g)

---

### **Handling Overfitting**

🛑

Challenge: The model performed exceptionally well on the training data but poorly on the test data, indicating overfitting.
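One way to spot this is to score the model on both splits and compare the numbers. The sketch below is not the bot's code: it uses pure random noise and an unconstrained scikit-learn decision tree as a stand-in for XGBoost (an assumption, to keep the example light), so there is nothing real to learn and any gap comes from memorisation.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Random features and random labels: there is no real signal to find
X_noise = rng.normal(size=(400, 5))
y_noise = rng.integers(0, 2, size=400)

# Ordered split: first 300 rows for training, last 100 rows for testing
X_tr, X_te = X_noise[:300], X_noise[300:]
y_tr, y_te = y_noise[:300], y_noise[300:]

# No depth limit, so the tree is free to memorise the training rows
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, tree.predict(X_tr))
test_acc = accuracy_score(y_te, tree.predict(X_te))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

A near-perfect training score next to a test score close to a coin flip is the overfitting signature described in the challenge.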

💡

AI's Solution: To combat overfitting, the AI suggested:

* **Hyperparameter Tuning:** Adjusting parameters like `max_depth`, `n_estimators`, and `learning_rate` to find a better combination.
* **Cross-Validation:** Using `TimeSeriesSplit` to perform cross-validation that respects the temporal order.
* **Regularisation:** Adding regularisation parameters like `reg_alpha` and `reg_lambda` to penalise complex models.

```python
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from xgboost import XGBClassifier

def train_model_with_cv(X, y):
    model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

    # Candidate values; the random search samples combinations from these lists
    param_grid = {
        'n_estimators': [50, 100, 150],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.05, 0.1],
        'subsample': [0.8, 1.0],
        'colsample_bytree': [0.8, 1.0],
        'reg_alpha': [0, 0.1, 0.5],
        'reg_lambda': [1, 1.5, 2]
    }

    # Time-ordered folds, so each candidate is scored on data later than its training window
    tscv = TimeSeriesSplit(n_splits=5)
    grid_search = RandomizedSearchCV(model, param_grid, cv=tscv, scoring='accuracy', n_iter=10)
    grid_search.fit(X, y)
    return grid_search.best_estimator_
```

This approach improved the model's generalisation to unseen data.

---

### **Handling Class Imbalance**

🛑

Challenge: The target variable was imbalanced, with more instances of one class over the other, which can bias the model.
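Before rebalancing anything, it helps to measure how lopsided the target actually is. A minimal sketch, using ten invented labels rather than the bot's real `Target` column:

```python
import pandas as pd

# Invented labels for illustration: 1 = price went up, 0 = it did not
y_demo = pd.Series([1, 1, 1, 0, 1, 1, 0, 1, 1, 1], name="Target")

# Absolute counts and relative shares per class
counts = y_demo.value_counts()
shares = y_demo.value_counts(normalize=True)

print(counts)
print(shares)
```

With a split like this, a model that always predicts 1 would already be right most of the time, which is why accuracy alone can look better than the model really is.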

💡

AI's Solution: The AI suggested using SMOTE (Synthetic Minority Over-sampling Technique) to balance the classes.
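SMOTE builds each synthetic point by interpolating between a minority sample and one of its nearby minority neighbours. Here is a rough numpy sketch of that single step. It is a simplification, not the `imblearn` implementation: the points are invented, and it always uses the single nearest neighbour where SMOTE picks at random among several.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented minority-class points in a two-feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [5.0, 5.0]])

# Pick one minority sample and find its nearest minority neighbour
i = 0
distances = np.linalg.norm(minority - minority[i], axis=1)
distances[i] = np.inf  # exclude the point itself
j = int(np.argmin(distances))

# The new point lies on the line segment between the two, at a random position
gap = rng.uniform(0.0, 1.0)
synthetic = minority[i] + gap * (minority[j] - minority[i])

print(f"sample {minority[i]}, neighbour {minority[j]}, synthetic {synthetic}")
```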

> ### What is SMOTE?
>
> SMOTE is a technique used to create synthetic samples for the minority class in a dataset. This helps balance the class distribution, which is crucial for training machine learning models effectively on imbalanced data.
>
> ### How Does SMOTE Work?
>
> 1. **Identify Minority Class Samples**: SMOTE starts by identifying the samples in the minority class.
>
> 2. **Generate Synthetic Samples**: It then generates new synthetic samples by interpolating between existing minority class samples. This is done by selecting two or more similar instances and creating a new instance that lies between them in the feature space.
>
> 3. **Add Synthetic Samples to Dataset**: These synthetic samples are added to the dataset, resulting in a more balanced class distribution.
>
> ### Benefits of SMOTE
>
> * **Improves Model Performance**: By balancing the dataset, models can learn better and perform more accurately on the minority class.
>
> * **Reduces Overfitting**: Unlike simple oversampling (which duplicates minority class samples), SMOTE reduces the risk of overfitting by creating new, unique samples.

```python
from imblearn.over_sampling import SMOTE

def balance_classes(X, y):
    smote = SMOTE(random_state=42)
    X_resampled, y_resampled = smote.fit_resample(X, y)
    return X_resampled, y_resampled
```

After balancing the classes, the model's performance improved significantly.

---

### **Encountered Error:**

While training the model, I ran into an error:

```bash
ValueError: could not convert string to float: '2024-03-07'.
```

💡

AI's Solution: The AI pointed out that non-numeric data (like date strings) were included in the feature set. To fix this, we ensured that only numeric columns were used.
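To find the offending columns before training, one option is to list everything that isn't numeric. A small sketch on an invented frame; the column names `Date`, `Close` and `RSI` are assumptions for illustration, not necessarily the bot's actual columns.

```python
import pandas as pd

# Invented frame: "Date" holds strings, the other columns are numeric
frame = pd.DataFrame({
    "Date": ["2024-03-06", "2024-03-07"],
    "Close": [101.5, 103.2],
    "RSI": [48.0, 55.3],
})

# Columns that cannot be converted to float as they are
non_numeric = frame.select_dtypes(exclude="number").columns.tolist()

# The numeric remainder that is safe to pass to the model
numeric_only = frame.select_dtypes(include="number")

print("non-numeric columns:", non_numeric)
print("kept columns:", numeric_only.columns.tolist())
```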

```python
import numpy as np

def select_numeric_features(data):
    # Keep only numeric columns; date strings and other text columns are dropped
    numeric_cols = data.select_dtypes(include=[np.number]).columns
    return data[numeric_cols]
```

By selecting only numeric features, we eliminated the error.

**Results so far**

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1727683026269/70442f0b-153f-48d4-9394-d8702909f568.png align="center")

---

Excellent, so we have a working bot with analysis, back-testing and predictions, but the accuracy is quite low, so this will be my next item to work on (or rather, for the AI to work on).

pxng0lin.