
Python SimpleImputer Module

SimpleImputer is a class in the sklearn.impute module of the scikit-learn library, used to preprocess data before training a model. It fills in missing values using one of four strategies: mean, median, most_frequent, or constant. The mean and median strategies apply only to numeric data, while most_frequent and constant also work with string or categorical data.

Syntax

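import numpy as np
from sklearn.impute import SimpleImputer

# missing_values must be the actual placeholder (np.nan), not the string 'NaN'
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(data)                       # learn the fill value for each column
imputed_data = imputer.transform(data)  # replace the missing entries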
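
To make the strategies concrete, here is a minimal sketch that applies each one to the same column. The array X and its values are made up for illustration, and fit_transform is used as a shorthand for calling fit and then transform on the same data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column sample; the fourth row is missing
X = np.array([[1.0], [2.0], [2.0], [np.nan], [7.0]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # Record the value that was written into the missing row
    results[strategy] = float(imputer.fit_transform(X)[3, 0])

# The constant strategy takes its replacement from fill_value
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
results["constant"] = float(constant_imputer.fit_transform(X)[3, 0])

print(results)
# {'mean': 3.0, 'median': 2.0, 'most_frequent': 2.0, 'constant': 0.0}
```

The known values are 1, 2, 2 and 7, so the mean is 3, while the median and the most frequent value are both 2.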

Example

Let's consider a dataset which contains some missing values.

import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.read_csv('data.csv')
print(data)

Output:

   Age   Salary
0   25  25000.0
1   27  30000.0
2   30      NaN
3   33  35000.0
4   35      NaN
5   40  42000.0

Now we can use SimpleImputer to replace each missing value with the mean of the remaining values in the same column.

import numpy as np

# pandas reads empty cells as np.nan, so that is the marker to impute
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(data)
imputed_data = imputer.transform(data)
print(imputed_data)

Output:

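[[2.5000e+01 2.5000e+04]
 [2.7000e+01 3.0000e+04]
 [3.0000e+01 3.3000e+04]
 [3.3000e+01 3.5000e+04]
 [3.5000e+01 3.3000e+04]
 [4.0000e+01 4.2000e+04]]

Both missing Salary entries are filled with 33000, the mean of the four known salaries: (25000 + 30000 + 35000 + 42000) / 4 = 33000. The exact number formatting may differ depending on the NumPy version.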

Explanation

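In the example above, we imported SimpleImputer from sklearn.impute and read a dataset with pandas. We then used SimpleImputer to replace each missing value with the mean of the non-missing values in the same column. The Age column has no missing values, so it is unchanged. Missing values appear as NaN in the DataFrame. By default, transform returns a NumPy array rather than a DataFrame, which is why the column names are absent from the output.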
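
Since data.csv is not included here, the following sketch rebuilds the same table inline so the example can be run as-is. The names imputed_array and imputed_df are illustrative, and wrapping the result back into a DataFrame is just one way to keep the column labels.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Inline copy of the example table, standing in for data.csv
data = pd.DataFrame({
    "Age": [25, 27, 30, 33, 35, 40],
    "Salary": [25000, 30000, np.nan, 35000, np.nan, 42000],
})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed_array = imputer.fit_transform(data)

# Wrap the NumPy result in a DataFrame to restore column names and index
imputed_df = pd.DataFrame(imputed_array, columns=data.columns, index=data.index)

# statistics_ holds the per-column fill values learned during fit
print(imputer.statistics_)
print(imputed_df)
```

The second entry of statistics_ is the Salary mean, 33000, which is the value written into rows 2 and 4.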

Use

SimpleImputer is used to preprocess data before training machine learning models. Many scikit-learn estimators do not accept NaN in their input, so filling the gaps lets you keep rows that would otherwise have to be dropped.
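
One common way to use it before training is to place the imputer in a scikit-learn pipeline, so the fill value is learned from the training data and then reused on new data. The sketch below assumes a toy regression problem; the arrays and the choice of LinearRegression are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: one feature with a missing entry, target y = 2 * x
X_train = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# New data to score, also containing a missing entry
X_new = np.array([[3.0], [np.nan]])

# The imputer runs first, then the regressor sees only complete data
model = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
model.fit(X_train, y_train)
predictions = model.predict(X_new)

# The mean is computed from X_train only and reused for X_new
learned_mean = model.named_steps["simpleimputer"].statistics_[0]
print(learned_mean)
print(predictions)
```

Here the training mean is 3, so the missing entry in X_new is treated as 3 when predicting.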

Important Points

  • SimpleImputer is a part of the scikit-learn library.
  • It is used to handle missing values in a dataset.
  • Missing values are replaced according to the selected strategy: mean, median, most_frequent (the mode), or constant.
  • By default, missing values are expected to be np.nan. The missing_values parameter can specify a different placeholder.
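
The last two points can be sketched as follows. The sample values, the use of -1 as a placeholder, and the "Unknown" fill string are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric column where -1 marks a missing score
scores = np.array([[10.0], [-1.0], [20.0], [30.0]])
numeric_imputer = SimpleImputer(missing_values=-1, strategy="median")
filled_scores = numeric_imputer.fit_transform(scores)

# Hypothetical string column; mean and median would not apply here
cities = np.array([["Pune"], [np.nan], ["Delhi"], ["Pune"]], dtype=object)

# most_frequent fills with the mode of the column
mode_imputer = SimpleImputer(strategy="most_frequent")
filled_mode = mode_imputer.fit_transform(cities)

# constant fills with whatever fill_value specifies
const_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
filled_const = const_imputer.fit_transform(cities)

print(filled_scores.ravel())
print(filled_mode.ravel())
print(filled_const.ravel())
```

The -1 entry is replaced by 20, the median of 10, 20 and 30, and the missing city becomes "Pune" under most_frequent and "Unknown" under constant.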

Summary

SimpleImputer, part of the scikit-learn library, replaces missing values in a dataset according to a selected strategy: mean, median, most_frequent (the mode), or constant. By default, missing values are represented as NaN. It is mainly used as a preprocessing step before training machine learning models.
