Classification is a fundamental task in machine learning, and Scikit-learn offers a rich set of tools to tackle it. In this blog post, we'll explore various classification models and how to implement them using Scikit-learn.
First, let's import the necessary libraries:
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
```
Let's use the famous Iris dataset as an example:
```python
from sklearn.datasets import load_iris

# Load the data and split it into training and test sets
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize features: fit the scaler on the training set only,
# then apply the same transformation to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
Let's start with a simple yet powerful classifier, Logistic Regression:
```python
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train_scaled, y_train)
y_pred = lr_model.predict(X_test_scaled)

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```
Logistic Regression works well when the classes are roughly linearly separable, and its learned coefficients make the results easy to interpret.
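To see that interpretability in action, here's a minimal sketch that inspects the coefficients of the `lr_model` we just fitted. For multi-class problems, scikit-learn fits one coefficient vector per class:

```python
# One row of coefficients per class, one column per feature
for class_idx, class_name in enumerate(iris.target_names):
    print(f"Class '{class_name}':")
    for feature_name, coef in zip(iris.feature_names, lr_model.coef_[class_idx]):
        print(f"  {feature_name}: {coef:.3f}")
```

A large positive coefficient means that feature pushes predictions toward that class; a large negative one pushes away from it.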
Next, let's try a non-linear model, the Decision Tree:
```python
from sklearn.tree import DecisionTreeClassifier

# Tree-based models don't require feature scaling, so we use the raw data
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
y_pred = dt_model.predict(X_test)

print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```
Decision Trees can capture complex relationships in the data but may overfit if not properly tuned.
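A common way to rein in that overfitting is to cap the tree's depth or require a minimum number of samples per leaf. Here's a quick sketch of the idea; the parameter values below are illustrative, not tuned:

```python
# Constrain tree growth to reduce overfitting
pruned_dt = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pruned_dt.fit(X_train, y_train)

print("Pruned tree accuracy:", accuracy_score(y_test, pruned_dt.predict(X_test)))
print("Tree depth:", pruned_dt.get_depth())
```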
Let's upgrade to an ensemble method, the Random Forest:
```python
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```
Random Forests combine multiple Decision Trees to create a more robust and accurate classifier.
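Because each tree is trained on a bootstrap sample of the data, the rows a tree never saw ("out-of-bag" samples) give a free validation estimate. A minimal sketch using the `oob_score` option:

```python
# Each tree is scored on the rows left out of its bootstrap sample,
# giving a built-in validation estimate without a separate hold-out set
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(X_train, y_train)
print("Out-of-bag score:", rf_oob.oob_score_)
```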
Now, let's try another powerful non-linear classifier, the Support Vector Machine (SVM):
```python
from sklearn.svm import SVC

# SVMs are sensitive to feature scale, so we use the standardized data
svm_model = SVC(kernel='rbf', random_state=42)
svm_model.fit(X_train_scaled, y_train)
y_pred = svm_model.predict(X_test_scaled)

print("SVM Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```
SVMs are great for high-dimensional data and can handle complex decision boundaries.
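With the RBF kernel, the `gamma` parameter controls how flexible that boundary is: small values give a smoother boundary, large values a more wiggly one. A small sketch comparing a few illustrative (untuned) values:

```python
# gamma controls how far a single training point's influence reaches
for gamma in [0.01, 0.1, 1.0]:
    model = SVC(kernel='rbf', gamma=gamma, random_state=42)
    model.fit(X_train_scaled, y_train)
    acc = accuracy_score(y_test, model.predict(X_test_scaled))
    print(f"gamma={gamma}: test accuracy = {acc:.3f}")
```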
To find the best model and its optimal parameters, we can use GridSearchCV:
```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['rbf', 'linear']
}

# Exhaustively evaluate every parameter combination with 5-fold cross-validation
svm_grid = GridSearchCV(SVC(), param_grid, cv=5)
svm_grid.fit(X_train_scaled, y_train)

print("Best parameters:", svm_grid.best_params_)
print("Best cross-validation score:", svm_grid.best_score_)

y_pred = svm_grid.predict(X_test_scaled)
print("Optimized SVM Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```
This approach helps us find the best model configuration automatically.
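If you want more than just the winning combination, `GridSearchCV` stores the results of every fit in its `cv_results_` attribute, which loads neatly into a DataFrame. A minimal sketch:

```python
# One row per parameter combination, with the mean and standard
# deviation of the cross-validation score across folds
results = pd.DataFrame(svm_grid.cv_results_)
print(results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']])
```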
For models like Random Forests, we can easily check feature importance:
```python
importances = rf_model.feature_importances_
feature_names = iris.feature_names

for name, importance in zip(feature_names, importances):
    print(f"{name}: {importance}")
```
This insight can help us understand which features are most crucial for our classification task.
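On datasets with many features, the output is easier to read if you sort the importances first. A small sketch, reusing the `importances` and `feature_names` from above:

```python
# Sort features by importance, most important first
sorted_idx = np.argsort(importances)[::-1]
for idx in sorted_idx:
    print(f"{feature_names[idx]}: {importances[idx]:.3f}")
```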
To get a more robust estimate of our model's performance, we can use cross-validation:
```python
from sklearn.model_selection import cross_val_score

# Evaluate the Random Forest on 5 different train/validation splits
cv_scores = cross_val_score(rf_model, X, y, cv=5)
print("Cross-validation scores:", cv_scores)
print("Mean CV score:", cv_scores.mean())
```
This gives us a better idea of how our model might perform on unseen data.
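By default, `cross_val_score` uses stratified folds for classifiers, but you can make that explicit (and shuffle the rows first) with `StratifiedKFold`. A sketch:

```python
from sklearn.model_selection import StratifiedKFold

# Each fold preserves the overall class proportions;
# shuffle=True randomizes row order before splitting (seeded for reproducibility)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(rf_model, X, y, cv=skf)
print("Stratified CV scores:", cv_scores)
print("Mean:", cv_scores.mean(), "Std:", cv_scores.std())
```

Reporting the standard deviation alongside the mean gives a sense of how stable the score is across splits.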
By exploring these different classification models in Scikit-learn, you're well on your way to becoming proficient in applying machine learning techniques to real-world problems. Remember, the key is to experiment with different models, understand their strengths and weaknesses, and choose the one that best fits your specific dataset and problem.