
Mastering Feature Scaling and Transformation in Python with Scikit-learn

Generated by ProCodebase AI

15/11/2024



Introduction

When working with machine learning models, proper data preprocessing is crucial, and feature scaling and transformation are among the most important preprocessing steps. In this article, we'll explore various techniques for scaling and transforming features using Scikit-learn in Python.

Why is Feature Scaling Important?

Feature scaling is essential because many machine learning algorithms are sensitive to the scale of input features, particularly distance-based methods (such as k-nearest neighbors, k-means, and SVMs) and models trained with gradient descent. For example, consider a dataset with two features: age (ranging from 0 to 100) and income (ranging from 0 to 1,000,000). Without scaling, the income feature would dominate the age feature, potentially leading to biased results.
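
To make this concrete, here is a minimal sketch (using only NumPy) of how an unscaled income feature swamps a Euclidean distance calculation:

import numpy as np

# Two people: (age, income)
a = np.array([25, 50000])
b = np.array([60, 51000])

# A 35-year age gap vs. a 1,000 income gap -- yet income
# contributes almost all of the raw Euclidean distance
print(np.linalg.norm(a - b))  # ~1000.6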

StandardScaler

StandardScaler is one of the most commonly used scaling techniques. It standardizes each feature by removing the mean and scaling to unit variance: z = (x - mean) / standard deviation.

from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data
X = np.array([[1, 2000], [2, 3000], [3, 4000], [4, 5000]])

# Initialize and fit StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Original data:\n", X)
print("\nScaled data:\n", X_scaled)

Output:

Original data:
[[   1 2000]
 [   2 3000]
 [   3 4000]
 [   4 5000]]

Scaled data:
[[-1.34164079 -1.34164079]
 [-0.4472136  -0.4472136 ]
 [ 0.4472136   0.4472136 ]
 [ 1.34164079  1.34164079]]

As you can see, the scaled data has a mean of 0 and a standard deviation of 1 for each feature.
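
You can verify this directly with a quick check on the result above:

print("Means:", X_scaled.mean(axis=0))  # approximately [0. 0.]
print("Stds: ", X_scaled.std(axis=0))   # approximately [1. 1.]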

MinMaxScaler

MinMaxScaler scales features to a fixed range, typically between 0 and 1, which is useful when an algorithm expects bounded inputs. (Note that it does not generally preserve zero entries; for sparse data, MaxAbsScaler is usually the better choice.)

from sklearn.preprocessing import MinMaxScaler

# Initialize and fit MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print("Original data:\n", X)
print("\nScaled data:\n", X_scaled)

Output:

Original data:
[[   1 2000]
 [   2 3000]
 [   3 4000]
 [   4 5000]]

Scaled data:
[[0.         0.        ]
 [0.33333333 0.33333333]
 [0.66666667 0.66666667]
 [1.         1.        ]]
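
If you need a different target interval, MinMaxScaler also accepts a feature_range parameter; for example, scaling to [-1, 1]:

scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X)
print(X_scaled)
# [[-1.         -1.        ]
#  [-0.33333333 -0.33333333]
#  [ 0.33333333  0.33333333]
#  [ 1.          1.        ]]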

RobustScaler

RobustScaler is useful when your data contains outliers. It centers each feature on the median and scales by the interquartile range (IQR), statistics that are robust to outliers.

from sklearn.preprocessing import RobustScaler

# Sample data with outliers
X = np.array([[1, 2000], [2, 3000], [3, 4000], [4, 5000], [100, 100000]])

# Initialize and fit RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

print("Original data:\n", X)
print("\nScaled data:\n", X_scaled)

Output:

Original data:
[[    1  2000]
 [    2  3000]
 [    3  4000]
 [    4  5000]
 [  100 100000]]

Scaled data:
[[-0.66666667 -0.66666667]
 [-0.33333333 -0.33333333]
 [ 0.          0.        ]
 [ 0.33333333  0.33333333]
 [32.33333333 32.33333333]]

Notice how the outlier (100, 100000) is handled more gracefully by RobustScaler: the four inlier rows stay in a narrow, interpretable range ([-1, 0.5]), while the outlier is pushed far out without distorting the scaling of the rest. Compare this with StandardScaler on the same data, as the quick sketch below shows.
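
Here is that comparison, applying StandardScaler to the same outlier-laden X:

from sklearn.preprocessing import StandardScaler

# Same outlier-laden X as above
X_std = StandardScaler().fit_transform(X)
print(X_std)
# The outlier inflates the mean and standard deviation, so the
# four inliers all land near -0.5, barely distinguishable from
# one another, while RobustScaler kept them spread over [-1, 0.5]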

Handling Categorical Variables

For categorical variables, we often need to transform them into numerical values. One common technique is one-hot encoding.

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Sample categorical data
data = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'green']})

# Initialize and fit OneHotEncoder
# (sparse= was renamed to sparse_output= in newer scikit-learn releases)
encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(data)

# Create a new DataFrame with encoded values
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out())

print("Original data:\n", data)
print("\nEncoded data:\n", encoded_df)

Output:

Original data:
   color
0   red
1  blue
2  green
3   red
4  green

Encoded data:
   color_blue  color_green  color_red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         0.0          0.0        1.0
4         0.0          1.0        0.0
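
One practical note: if the encoder may see categories at prediction time that weren't present during fitting, pass handle_unknown='ignore' so unseen values are encoded as all zeros instead of raising an error:

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(data)

# 'yellow' was never seen during fit, so it encodes as all zeros
print(encoder.transform(pd.DataFrame({'color': ['red', 'yellow']})))
# [[0. 0. 1.]
#  [0. 0. 0.]]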

Combining Scalers in a Pipeline

In practice, you might need to apply different scaling techniques to different features. Scikit-learn's Pipeline and ColumnTransformer classes make this process seamless.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Sample data
data = pd.DataFrame({
    'age': [25, 30, 35, 40],
    'income': [50000, 60000, 70000, 80000],
    'city': ['New York', 'London', 'Paris', 'Tokyo']
})

# Define preprocessing steps
numeric_features = ['age', 'income']
categorical_features = ['city']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features)
    ])

# Create and fit the pipeline
pipeline = Pipeline([('preprocessor', preprocessor)])
transformed_data = pipeline.fit_transform(data)

print("Original data:\n", data)
print("\nTransformed data:\n", transformed_data)

Output:

Original data:
   age  income     city
0   25   50000  New York
1   30   60000    London
2   35   70000     Paris
3   40   80000     Tokyo

Transformed data:
[[-1.34164079 -1.34164079  1.          0.          0.        ]
 [-0.4472136  -0.4472136   0.          0.          0.        ]
 [ 0.4472136   0.4472136   0.          1.          0.        ]
 [ 1.34164079  1.34164079  0.          0.          1.        ]]

In this example, we've applied StandardScaler to the numeric features and OneHotEncoder to the categorical feature, all in a single pipeline. Because drop='first' drops the alphabetically first category (London), the last three columns correspond to New York, Paris, and Tokyo, and the London row is all zeros in those columns.
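
To keep track of which column is which, you can ask the fitted ColumnTransformer for its output feature names (get_feature_names_out is available in recent scikit-learn versions; Pipeline fits its steps in place, so preprocessor is already fitted here):

print(preprocessor.get_feature_names_out())
# ['num__age' 'num__income' 'cat__city_New York' 'cat__city_Paris'
#  'cat__city_Tokyo']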

Conclusion

Feature scaling and transformation are crucial steps in preparing your data for machine learning models. By using Scikit-learn's preprocessing tools, you can easily implement these techniques and improve your model's performance. Remember to choose the appropriate scaling method based on your data characteristics and the requirements of your chosen algorithm.

Popular Tags

python, scikit-learn, feature scaling
