Python SimpleImputer Module
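SimpleImputer is a class in the sklearn.impute module of the scikit-learn library. It fills in missing values during data preprocessing, before a model is trained. Each missing entry in a column is replaced using one of four strategies: mean, median, most_frequent, or constant. The mean and median strategies work only on numeric data, while most_frequent and constant also work on string or categorical columns.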
Syntax
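import numpy as np
from sklearn.impute import SimpleImputer
# np.nan (not the string 'NaN') marks the missing entries; it is also the default
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(data)                       # learn the fill value of each column
imputed_data = imputer.transform(data)  # return a NumPy array with the gaps filled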
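The missing_values argument in the syntax above does not have to be NaN. As a minimal sketch, assume a hypothetical column of readings in which -1 means "not recorded"; the array and the sentinel value are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical readings where -1.0 marks a value that was not recorded.
X = np.array([[10.0], [-1.0], [30.0], [20.0]])

# Treat -1 as the missing marker and fill it with the column median.
imputer = SimpleImputer(missing_values=-1, strategy="median")

# fit_transform is shorthand for calling fit and then transform on the same data.
X_filled = imputer.fit_transform(X)
print(X_filled.ravel())
```

The median is taken over the observed values only (10, 20 and 30), so the -1 entry is replaced with 20.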
Example
Let's consider a dataset which contains some missing values.
import pandas as pd
from sklearn.impute import SimpleImputer
data = pd.read_csv('data.csv')
print(data)
Output:
   Age   Salary
0   25  25000.0
1   27  30000.0
2   30      NaN
3   33  35000.0
4   35      NaN
5   40  42000.0
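Now we can use SimpleImputer to replace the missing values in each column with the mean of that column's remaining values.
import numpy as np
# np.nan is the missing-value marker; the string 'NaN' is rejected for numeric data
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(data)
imputed_data = imputer.transform(data)
print(imputed_data)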
Output:
[[2.5e+01 2.5e+04]
 [2.7e+01 3.0e+04]
 [3.0e+01 3.3e+04]
 [3.3e+01 3.5e+04]
 [3.5e+01 3.3e+04]
 [4.0e+01 4.2e+04]]
Explanation
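In the above example, we imported SimpleImputer from sklearn.impute and read a dataset with pandas. pandas represents the missing Salary entries as NaN, which is the value SimpleImputer treats as missing by default. The imputer replaced each NaN with the mean of the observed salaries, (25000 + 30000 + 35000 + 42000) / 4 = 33000. The Age column has no missing values, so it is unchanged. By default, transform returns a NumPy array rather than a DataFrame, which is why the column names are missing from the output.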
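The same example can be run without a CSV file. The sketch below builds the table inline (assuming data.csv holds exactly the values printed above) and wraps the result back into a DataFrame; the name imputed_df is introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same values as the example table, built inline so no CSV file is needed.
data = pd.DataFrame({
    "Age": [25, 27, 30, 33, 35, 40],
    "Salary": [25000, 30000, np.nan, 35000, np.nan, 42000],
})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(data)  # NumPy array of floats

# statistics_ holds the fill value learned for each column during fit.
print(imputer.statistics_)

# Wrap the array to restore the column names and the row index.
imputed_df = pd.DataFrame(imputed, columns=data.columns, index=data.index)
print(imputed_df)
```

Inspecting statistics_ is a quick way to confirm which value was used for each column before relying on the imputed data.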
Use
SimpleImputer is used to preprocess data before training machine learning models. Many scikit-learn estimators raise an error when the input contains NaN, so imputing the missing values lets you keep incomplete rows instead of dropping them.
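When the data is split for training and evaluation, a common pattern is to fit the imputer on the training rows only and reuse the learned fill values on new rows. The sketch below assumes a hypothetical split of the Age/Salary values; X_train and X_test are illustrative names, not part of the original example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical split: columns are Age and Salary.
X_train = np.array([
    [25.0, 25000.0],
    [27.0, 30000.0],
    [30.0, np.nan],
    [33.0, 35000.0],
])
X_test = np.array([
    [35.0, np.nan],
    [40.0, 42000.0],
])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)  # column means are computed from the training rows only

# The test rows are filled with the training means, not their own.
X_test_filled = imputer.transform(X_test)
print(X_test_filled)
```

Here the missing test salary is filled with 30000, the mean of the three observed training salaries, so no information from the test rows leaks into the preprocessing step.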
Important Points
- SimpleImputer is a part of the scikit-learn library.
- It is used to handle missing values in a dataset.
- Missing values are replaced according to the selected strategy: mean, median, most_frequent (the mode), or constant.
- By default, entries equal to np.nan are treated as missing; the missing_values parameter changes this.
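To make the strategies concrete, the sketch below applies each one to the same small column. The values and the fill_value of 0.0 are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column with a single missing entry at row index 3.
X = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

fills = {}
for strategy in ("mean", "median", "most_frequent", "constant"):
    # fill_value is only used by the "constant" strategy.
    kwargs = {"fill_value": 0.0} if strategy == "constant" else {}
    imputer = SimpleImputer(strategy=strategy, **kwargs)
    filled = imputer.fit_transform(X)
    fills[strategy] = float(filled[3, 0])  # value written into the gap
    print(strategy, fills[strategy])
```

The gap is filled with 3.75 for mean, 2.0 for median and most_frequent, and 0.0 for constant. The outlier 10 pulls the mean up but leaves the median untouched, which is why median is often preferred for skewed columns.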
Summary
SimpleImputer is a scikit-learn class that preprocesses data by replacing missing values according to a selected strategy: mean, median, most_frequent, or constant. By default it treats np.nan as the missing-value marker. It is primarily used to clean datasets before they are passed to machine learning models.