I'm working on a stock price prediction project. To begin, I thought I'd use a statistical model like SARIMAX because I want to add many features when fitting the model.
This is the plot I get:
import pandas as pd
import numpy as np
import io
import os
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Define data directory path
data_dir = '/content/drive/MyDrive/Parsed_Data/BarsDB/'

# List CSV files in the directory
file_list = [os.path.join(data_dir, f) for f in os.listdir(data_dir) if f.endswith('.csv')]

# Define features
features = ['open', 'high', 'low', 'volume', 'average', 'SMA_5min', 'EMA_5min',
            'BB_middle', 'BB_upper', 'BB_lower', 'MACD', 'MACD_Signal', 'MACD_Hist', 'RSI_14']

# Input symbol
train_symbol = input("Enter the symbol to train the model (e.g., AAPL): ").strip().upper()
print(f"Training SARIMAX model on symbol: {train_symbol}")

# Load training data
df = pd.DataFrame()
for file_path in file_list:
    try:
        temp_df = pd.read_csv(file_path, usecols=['Symbol', 'Timestamp', 'close'] + features)
        temp_df = temp_df[temp_df['Symbol'] == train_symbol].copy()
        if not temp_df.empty:
            df = pd.concat([df, temp_df], ignore_index=True)
    except Exception as e:
        print(f"Error loading {file_path}: {e}")

if df.empty:
    raise ValueError("No training data found.")

df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.sort_values('Timestamp')
df['Date'] = df['Timestamp'].dt.date
test_day = df['Date'].iloc[-1]
train_df = df[df['Date'] != test_day].copy()
test_df = df[df['Date'] == test_day].copy()

# Fit SARIMAX model on training data
endog = train_df['close']
exog = train_df[features]

# Drop rows with NaN or Inf
combined = pd.concat([endog, exog], axis=1)
combined = combined.replace([np.inf, -np.inf], np.nan).dropna()
endog_clean = combined['close']
exog_clean = combined[features]

model = SARIMAX(endog_clean, exog=exog_clean, order=(5, 1, 2),
                enforce_stationarity=False, enforce_invertibility=False)
model_fit = model.fit(disp=False)

# Forecast for the test day
exog_forecast = test_df[features]
forecast = model_fit.forecast(steps=len(test_df), exog=exog_forecast)

# Evaluation
actual = test_df['close'].values
timestamps = test_df['Timestamp'].values

# Compute direction accuracy
actual_directions = ['Up' if n > c else 'Down' for c, n in zip(actual[:-1], actual[1:])]
predicted_directions = ['Up' if n > c else 'Down' for c, n in zip(forecast[:-1], forecast[1:])]
direction_accuracy = (np.array(actual_directions) == np.array(predicted_directions)).mean() * 100

rmse = np.sqrt(mean_squared_error(actual, forecast))
mape = np.mean(np.abs((actual - forecast) / actual)) * 100
mse = mean_squared_error(actual, forecast)
r2 = r2_score(actual, forecast)
mae = mean_absolute_error(actual, forecast)
tolerance = 0.5
errors = np.abs(actual - forecast)
price_accuracy = (errors <= tolerance).mean() * 100

print(f"\nEvaluation Metrics for {train_symbol} on {test_day}:")
print(f"Direction Prediction Accuracy: {direction_accuracy:.2f}%")
print(f"Price Prediction Accuracy (within ${tolerance} tolerance): {price_accuracy:.2f}%")
print(f"RMSE: {rmse:.4f}")
print(f"MAPE: {mape:.2f}%")
print(f"MSE: {mse:.4f}")
print(f"R² Score: {r2:.4f}")
print(f"MAE: {mae:.4f}")

# Create DataFrame for visualization
predictions = pd.DataFrame({
    'Timestamp': timestamps,
    'Actual_Close': actual,
    'Predicted_Close': forecast
})

# Plot
plt.figure(figsize=(12, 6))
plt.plot(predictions['Timestamp'], predictions['Actual_Close'], label='Actual Closing Price', color='blue')
plt.plot(predictions['Timestamp'], predictions['Predicted_Close'], label='Predicted Closing Price', color='orange')
plt.title(f'Minute-by-Minute Close Prediction using SARIMAX for {train_symbol} on {test_day}')
plt.xlabel('Timestamp')
plt.ylabel('Close Price')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
The code above is the script I'm working with, but I think the results seem too good to be true. Feel free to check the code and tell me whether the model might be overfitting or whether the train and test data are interfering with each other.
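One thing worth checking here is that the script passes `test_df[features]` as `exog` for the forecast, and those features include the same-minute `open`, `high`, `low` and `average`. The `close` of a bar always lies between that bar's `low` and `high`, so the model may simply be reading the answer off the exogenous columns. The sketch below uses synthetic data to show the effect and one possible remedy (lagging the features by one bar). The helper names `lag_features` and `naive_baseline_mae` and the synthetic bars are illustrative assumptions, not part of the original script.

```python
import numpy as np
import pandas as pd

def lag_features(df, feature_cols, n_lags=1):
    """Shift feature columns so row t only holds values from bar t - n_lags."""
    out = df.copy()
    out[feature_cols] = out[feature_cols].shift(n_lags)
    # The first n_lags rows have no past values, so drop them.
    return out.dropna(subset=feature_cols)

def naive_baseline_mae(close):
    """MAE of predicting each close with the previous close (persistence)."""
    close = np.asarray(close, dtype=float)
    return float(np.mean(np.abs(close[1:] - close[:-1])))

# Synthetic one-day minute bars (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
close = 200 + np.cumsum(rng.normal(0, 0.1, size=390))
bars = pd.DataFrame({
    "close": close,
    "open": np.r_[close[0], close[:-1]],
    "high": close + 0.05,
    "low": close - 0.05,
})

# "Predicting" close from same-bar high/low needs no model at all.
same_bar_mid_mae = float(np.mean(np.abs((bars["high"] + bars["low"]) / 2 - bars["close"])))
baseline_mae = naive_baseline_mae(bars["close"])
print(f"Same-bar midpoint MAE: {same_bar_mid_mae:.4f}")
print(f"Previous-close baseline MAE: {baseline_mae:.4f}")

# Lagged features only contain information available before the bar closes.
lagged = lag_features(bars, ["open", "high", "low"])
print(lagged.head())
```

If the SARIMAX error on real data is far below the previous-close baseline only when same-bar features are used, and close to it once the features are lagged, that would point to leakage rather than overfitting. With multi-day data the shift should probably be applied per day so that values do not cross the overnight gap.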
This is the output that comes with the plot:
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Enter the symbol to train the model (e.g., AAPL): aapl
Training SARIMAX model on symbol: AAPL
/usr/local/lib/python3.11/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: An unsupported index was provided. As a result, forecasts cannot be generated. To use the model for forecasting, use one of the supported classes of index.
  self._init_dates(dates, freq)
/usr/local/lib/python3.11/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: An unsupported index was provided. As a result, forecasts cannot be generated. To use the model for forecasting, use one of the supported classes of index.
  self._init_dates(dates, freq)
/usr/local/lib/python3.11/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.11/dist-packages/statsmodels/tsa/base/tsa_model.py:837: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.
  return get_prediction_index(
/usr/local/lib/python3.11/dist-packages/statsmodels/tsa/base/tsa_model.py:837: FutureWarning: No supported index is available. In the next version, calling this method in a model without a supported index will result in an exception.
  return get_prediction_index(

Evaluation Metrics for AAPL on 2025-05-09:
Direction Prediction Accuracy: 80.98%
Price Prediction Accuracy (within $0.5 tolerance): 100.00%
RMSE: 0.0997
MAPE: 0.04%
MSE: 0.0099
R² Score: 0.9600
MAE: 0.0822
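The "unsupported index" warnings in this output may come from the integer index left behind after concatenating, sorting and dropping rows, which is no longer a contiguous 0..n-1 range. A minimal sketch of one possible fix, assuming a plain `RangeIndex` is acceptable to statsmodels (which I believe it is) and that the timestamps are kept in a separate column; the toy frame below is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the filtered, sorted bars.
raw = pd.DataFrame(
    {"close": [1.0, np.nan, 3.0, 4.0], "volume": [10, 20, 30, 40]},
    index=[7, 3, 9, 11],  # scrambled labels, as after sort_values
)

# Same cleaning step as in the script: the surviving labels have gaps.
clean = raw.replace([np.inf, -np.inf], np.nan).dropna()
print(type(clean.index).__name__, list(clean.index))

# Rebuild a contiguous 0..n-1 index before handing the data to the model.
realigned = clean.reset_index(drop=True)
print(type(realigned.index).__name__, list(realigned.index))
```

In the script this would mean calling `reset_index(drop=True)` on `combined` before splitting it into `endog_clean` and `exog_clean`. The `ConvergenceWarning` is a separate issue and would not be addressed by this change.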
submitted by /u/gnassov to r/learnmachinelearning