DataFrame.drop_duplicates() - Pandas DataFrame Basics
The drop_duplicates() method in Pandas removes duplicate rows from a DataFrame. A row counts as a duplicate when its values match those of another row in every column, or only in the columns specified with subset.
Syntax
The basic syntax of the drop_duplicates() method is:
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
subset - column name or sequence of column names, optional. Only consider certain columns for identifying duplicates.
keep - {'first', 'last', False}, default 'first'. Determines which duplicates (if any) to keep.
- 'first': Drop duplicates except for the first occurrence.
- 'last': Drop duplicates except for the last occurrence.
- False: Drop all duplicates.
inplace - bool, default False. Modify the DataFrame in place.
ignore_index - bool, default False. If True, the resulting axis will be labeled 0, 1, …, n - 1.
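The effect of the three keep options is easiest to see side by side. The following is a minimal sketch on a small, hypothetical DataFrame (the data and variable names are illustrative, not part of the example below); it prints the index labels that survive under each setting.

```python
import pandas as pd

# Hypothetical sample data: rows 0, 2 and 3 are identical.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Alice"],
    "Age": [23, 28, 23, 23],
})

first = df.drop_duplicates(keep="first")       # keep the first of each duplicate group
last = df.drop_duplicates(keep="last")         # keep the last of each duplicate group
unique_only = df.drop_duplicates(keep=False)   # drop every row that has a duplicate

print(first.index.tolist())        # [0, 1]
print(last.index.tolist())         # [1, 3]
print(unique_only.index.tolist())  # [1]
```

Note that keep=False removes all copies of a duplicated row, so only rows that were unique to begin with remain.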
Example
Consider the following example, in which the DataFrame data contains a duplicate row.
import pandas as pd

# Build a DataFrame in which row 3 repeats row 0 in every column.
data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob"],
    "Age": [23, 28, 26, 23, 29],
    "City": ["New York", "Los Angeles", "San Francisco", "New York", "Austin"]
})
print("Data with duplicates:")
print(data)

# Drop rows that match an earlier row in all columns (keep='first' is the default).
data = data.drop_duplicates()
print("Data after removing duplicates:")
print(data)
The output of this code will be:
Data with duplicates:
      Name  Age           City
0    Alice   23       New York
1      Bob   28    Los Angeles
2  Charlie   26  San Francisco
3    Alice   23       New York
4      Bob   29         Austin
Data after removing duplicates:
      Name  Age           City
0    Alice   23       New York
1      Bob   28    Los Angeles
2  Charlie   26  San Francisco
4      Bob   29         Austin
In this example, we first create a DataFrame data from a dictionary. Row 3 repeats row 0 in every column, so it is a duplicate. Rows 1 and 4 share the name Bob but differ in Age and City, so neither is a duplicate. We then call drop_duplicates(), which returns a new DataFrame without row 3.
Output
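The output of the example code is a DataFrame with four rows. The duplicate row is removed, and the remaining rows keep their original index labels (0, 1, 2, 4) because ignore_index defaults to False.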
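In the output above, the surviving rows are labeled 0, 1, 2 and 4, leaving a gap where the duplicate was. As a minimal sketch, reusing the same data, passing ignore_index=True relabels the result from 0 upward (the variable name deduped is only illustrative, and this parameter assumes a Pandas version whose signature matches the one shown in the Syntax section).

```python
import pandas as pd

data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob"],
    "Age": [23, 28, 26, 23, 29],
    "City": ["New York", "Los Angeles", "San Francisco", "New York", "Austin"],
})

# Remove the duplicate row and renumber the index 0, 1, ..., n - 1.
deduped = data.drop_duplicates(ignore_index=True)

print(deduped.index.tolist())  # [0, 1, 2, 3]
```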
Explanation
The drop_duplicates() method removes duplicate rows from the DataFrame and, unless inplace=True is passed, returns the result as a new DataFrame. By default, it considers all columns when identifying duplicates. The subset parameter restricts the comparison to certain columns. The keep parameter determines which duplicates (if any) to keep: by default, the first occurrence of a duplicated row is kept and the rest are removed.
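As a sketch of how subset and keep interact, the example data can be deduplicated on the Name column alone. With subset=["Name"], the two Bob rows now count as duplicates of each other even though their Age and City differ. The variable names below are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob"],
    "Age": [23, 28, 26, 23, 29],
    "City": ["New York", "Los Angeles", "San Francisco", "New York", "Austin"],
})

# Compare rows on "Name" only; keep the first row seen for each name.
by_name_first = data.drop_duplicates(subset=["Name"])

# Same comparison, but keep the last row seen for each name.
by_name_last = data.drop_duplicates(subset=["Name"], keep="last")

print(by_name_first.index.tolist())  # [0, 1, 2]
print(by_name_last.index.tolist())   # [2, 3, 4]
```

With keep="last", the Bob row that survives is the one from Austin (index 4) rather than the one from Los Angeles (index 1).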
Use
The drop_duplicates() method is typically used during data cleaning. It is especially useful when working with large datasets, where duplicate rows are hard to spot by eye and can skew counts, sums, and averages.
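To illustrate the kind of issue a duplicate row can cause, here is a small sketch with hypothetical order records (the column names order_id and amount are made up for this illustration). One record was loaded twice, which inflates the total until the duplicate is dropped.

```python
import pandas as pd

# Hypothetical order records; order 102 appears twice by mistake.
orders = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": [50.0, 20.0, 20.0, 30.0],
})

total_with_duplicates = orders["amount"].sum()

clean = orders.drop_duplicates()
total_clean = clean["amount"].sum()
rows_removed = len(orders) - len(clean)

print(total_with_duplicates)  # 120.0
print(total_clean)            # 100.0
print(rows_removed)           # 1
```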
Important Points
- The drop_duplicates() method is used to remove duplicate rows from a DataFrame.
- The subset parameter can be used to specify certain columns for identifying duplicates.
- The keep parameter determines which duplicates (if any) to keep.
Summary
The drop_duplicates() method removes duplicate rows from a Pandas DataFrame. Through the subset and keep parameters, it gives control over how duplicates are identified and which occurrences are retained, which makes it a common step when cleaning data for further analysis.