Time series analysis is a crucial skill for data scientists and analysts working with sequential data. Whether you're predicting stock prices, analyzing weather patterns, or forecasting sales, understanding how to handle time-dependent data is essential. In this blog post, we'll explore how to perform time series analysis using Scikit-learn in Python.
Before we begin, make sure you have the necessary libraries installed. We'll use NumPy, pandas, Scikit-learn, Matplotlib, and (later, for ARIMA) statsmodels.
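If any of them are missing, a single pip command covers everything used in this post:

```bash
pip install numpy pandas scikit-learn statsmodels matplotlib
```

With those in place, import what we'll need: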
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
```
Let's start by loading a sample time series dataset:
```python
# Load data
data = pd.read_csv('time_series_data.csv')
data['Date'] = pd.to_datetime(data['Date'])
data.set_index('Date', inplace=True)
```
Next, we'll create features from the time series:
```python
def create_features(df):
    # Derive calendar features from the DatetimeIndex
    df['dayofweek'] = df.index.dayofweek
    df['quarter'] = df.index.quarter
    df['month'] = df.index.month
    df['year'] = df.index.year
    df['dayofyear'] = df.index.dayofyear
    return df

data = create_features(data)
```
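Calendar features capture seasonality, but they ignore the series' own recent history. A common next step is to add lag and rolling-window features derived from the target. Here's a minimal, optional sketch assuming daily data and the `Sales` target used below; the 7-day lag and 28-day window are illustrative choices, not tuned values. If you adopt these, remember to add them to the `FEATURES` list in the next step:

```python
def add_lag_features(df, target='Sales'):
    # Value of the target one week earlier (assumes daily observations)
    df['lag_7'] = df[target].shift(7)
    # Trailing 28-day mean, shifted by one day so it uses only past values
    df['rolling_mean_28'] = df[target].shift(1).rolling(window=28).mean()
    return df

data = add_lag_features(data)
data = data.dropna()  # shift() leaves NaNs at the start of the series
```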
When working with time series, it's important to maintain the temporal order of the data. Scikit-learn's `train_test_split` shuffles rows by default, which would let the model train on the future and test on the past, so we split on a date cutoff instead:
```python
train = data.loc[data.index < '2022-01-01']
test = data.loc[data.index >= '2022-01-01']

FEATURES = ['dayofweek', 'quarter', 'month', 'year', 'dayofyear']
TARGET = 'Sales'

X_train = train[FEATURES]
y_train = train[TARGET]
X_test = test[FEATURES]
y_test = test[TARGET]
```
Let's start with a simple linear regression model:
```python
# Fit the scaler on the training set only to avoid leaking test statistics
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression()
model.fit(X_train_scaled, y_train)

train_predictions = model.predict(X_train_scaled)
test_predictions = model.predict(X_test_scaled)

train_mse = mean_squared_error(y_train, train_predictions)
test_mse = mean_squared_error(y_test, test_predictions)
print(f"Train MSE: {train_mse}")
print(f"Test MSE: {test_mse}")
```
Let's plot our predictions against the actual values:
```python
plt.figure(figsize=(12, 6))
plt.plot(train.index, y_train, label='Train Actual')
plt.plot(train.index, train_predictions, label='Train Predicted')
plt.plot(test.index, y_test, label='Test Actual')
plt.plot(test.index, test_predictions, label='Test Predicted')
plt.legend()
plt.title('Time Series Forecast')
plt.show()
```
While linear regression provides a good baseline, more advanced techniques can often improve our predictions. A random forest, for example, can capture non-linear seasonal effects (tree ensembles don't require feature scaling, but reusing the scaled matrices is harmless):
```python
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)

rf_train_predictions = rf_model.predict(X_train_scaled)
rf_test_predictions = rf_model.predict(X_test_scaled)

rf_train_mse = mean_squared_error(y_train, rf_train_predictions)
rf_test_mse = mean_squared_error(y_test, rf_test_predictions)
print(f"Random Forest Train MSE: {rf_train_mse}")
print(f"Random Forest Test MSE: {rf_test_mse}")
```
Scikit-learn doesn't include ARIMA models, but we can wrap statsmodels' implementation in a Scikit-learn-compatible estimator. Note that ARIMA models the target series itself, so the wrapper ignores the feature matrix and simply forecasts forward from the end of the training data:
```python
from statsmodels.tsa.arima.model import ARIMA
from sklearn.base import BaseEstimator, RegressorMixin

class ARIMAWrapper(BaseEstimator, RegressorMixin):
    def __init__(self, order=(1, 1, 1)):
        self.order = order

    def fit(self, X, y):
        # ARIMA models the target series directly; X is ignored
        self.model = ARIMA(y, order=self.order)
        self.result = self.model.fit()
        return self

    def predict(self, X):
        # Forecast one step per row of X, continuing past the training data
        return self.result.forecast(steps=len(X))

arima_model = ARIMAWrapper(order=(1, 1, 1))
arima_model.fit(X_train, y_train)
arima_predictions = arima_model.predict(X_test)

arima_mse = mean_squared_error(y_test, arima_predictions)
print(f"ARIMA Test MSE: {arima_mse}")
```
Understanding which features contribute most to our predictions can provide valuable insights:
```python
feature_importance = pd.DataFrame({
    'feature': FEATURES,
    'importance': rf_model.feature_importances_
})
feature_importance = feature_importance.sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.bar(feature_importance['feature'], feature_importance['importance'])
plt.title('Feature Importance')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.xticks(rotation=45)
plt.show()
```
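Keep in mind that impurity-based importances from tree ensembles can be biased toward features with many distinct values. As a cross-check, you could compute permutation importance on the held-out test set, which measures how much the score degrades when each feature is shuffled; here's a minimal sketch:

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure the drop in the model's score
perm = permutation_importance(rf_model, X_test_scaled, y_test,
                              n_repeats=10, random_state=42)
for feature, drop in sorted(zip(FEATURES, perm.importances_mean),
                            key=lambda pair: pair[1], reverse=True):
    print(f"{feature}: {drop:.4f}")
```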
Traditional k-fold cross-validation shuffles observations across folds, so a model could train on data from the future and validate on the past. `TimeSeriesSplit` avoids this by producing expanding training windows in which the validation fold always comes after the training fold:
```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
cv_scores = []

for train_index, val_index in tscv.split(X_train):
    X_train_cv, X_val_cv = X_train.iloc[train_index], X_train.iloc[val_index]
    y_train_cv, y_val_cv = y_train.iloc[train_index], y_train.iloc[val_index]

    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train_cv, y_train_cv)
    predictions = model.predict(X_val_cv)
    mse = mean_squared_error(y_val_cv, predictions)
    cv_scores.append(mse)

print(f"Cross-validation MSE scores: {cv_scores}")
print(f"Average MSE: {np.mean(cv_scores)}")
```
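If you prefer Scikit-learn's one-liner API, the same evaluation can be written with `cross_val_score` by passing the splitter as `cv`. Scores come back negated because Scikit-learn maximizes scoring functions:

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=42),
    X_train, y_train,
    cv=tscv,  # reuse the TimeSeriesSplit defined above
    scoring='neg_mean_squared_error',
)
print(f"Average MSE: {-scores.mean()}")
```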
In this blog post, we've explored various techniques for time series analysis using Scikit-learn in Python. We've covered data preprocessing, feature engineering, model building, and evaluation. By leveraging these tools and techniques, you can build powerful time series models for a wide range of applications.
Remember, time series analysis is a vast field, and there's always more to learn. Experiment with different models, feature engineering techniques, and preprocessing steps to find what works best for your specific time series problem.