DataFrame.drop_duplicates() - Pandas DataFrame Basics

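The drop_duplicates() method in Pandas removes duplicate rows from a DataFrame. By default, two rows count as duplicates only if they match in every column; the subset parameter restricts the comparison to the columns you name. The method returns a new DataFrame unless inplace=True is passed.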

Syntax

The basic syntax of the drop_duplicates() method is:

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

subset - column name or sequence of column names, optional. Only consider certain columns for identifying duplicates.
keep - {'first', 'last', False}, default 'first'. Determines which duplicates (if any) to keep.
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop every row that has a duplicate, keeping none of them.
inplace - bool, default False. If True, modify the DataFrame in place and return None instead of a new DataFrame.
ignore_index - bool, default False. If True, the resulting axis will be labeled 0, 1, …, n - 1.
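
As a rough illustration of how subset and keep interact, the sketch below compares rows on the Name column only. The DataFrame and the variable names first_per_name and last_per_name are made up for this illustration.

```python
import pandas as pd

# Illustrative data: "Alice" and "Bob" each appear twice in the Name column.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob"],
    "Age": [23, 28, 26, 23, 29],
    "City": ["New York", "Los Angeles", "San Francisco", "New York", "Austin"],
})

# Compare rows on "Name" only and keep the first row seen for each name.
first_per_name = df.drop_duplicates(subset="Name")

# Same comparison, but keep the last row seen for each name.
last_per_name = df.drop_duplicates(subset=["Name"], keep="last")

print(first_per_name)
print(last_per_name)
```

With keep='first' the rows labeled 0, 1 and 2 remain; with keep='last' the rows labeled 2, 3 and 4 remain, still in their original order.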

Example

Consider the following example, in which the DataFrame data contains one row that exactly repeats an earlier row.

import pandas as pd

data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob"],
    "Age": [23, 28, 26, 23, 29],
    "City": ["New York", "Los Angeles", "San Francisco", "New York", "Austin"]
})

print("Data with duplicates:")
print(data)

# Compare all columns and keep the first occurrence of each repeated row
data = data.drop_duplicates()

print("Data after removing duplicates:")
print(data)

The output of this code will be:

Data with duplicates:
      Name  Age           City
0    Alice   23       New York
1      Bob   28    Los Angeles
2  Charlie   26  San Francisco
3    Alice   23       New York
4      Bob   29         Austin

Data after removing duplicates:
      Name  Age           City
0    Alice   23       New York
1      Bob   28    Los Angeles
2  Charlie   26  San Francisco
4      Bob   29         Austin

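In this example, we first create the DataFrame data from a dictionary. Row 3 repeats row 0 in every column, so drop_duplicates() removes it. Row 4 shares the name Bob with row 1 but differs in Age and City, so it is not a duplicate and is kept. The result is assigned back to data because drop_duplicates() returns a new DataFrame rather than changing the original.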

Output

The output of the example code is a DataFrame with four rows. The surviving rows keep their original index labels, so the index reads 0, 1, 2, 4 rather than 0 to 3.
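
If consecutive index labels are wanted, the ignore_index parameter from the syntax above can renumber the result. The following is a minimal sketch using the same data as the example; the names deduped and renumbered are chosen only for this illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob"],
    "Age": [23, 28, 26, 23, 29],
    "City": ["New York", "Los Angeles", "San Francisco", "New York", "Austin"],
})

# Default behaviour: surviving rows keep their original labels.
deduped = data.drop_duplicates()

# ignore_index=True relabels the result 0, 1, ..., n - 1.
renumbered = data.drop_duplicates(ignore_index=True)

print(list(deduped.index))     # [0, 1, 2, 4]
print(list(renumbered.index))  # [0, 1, 2, 3]
```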

Explanation

The drop_duplicates() method removes duplicate rows from the DataFrame. By default, it considers all columns for identifying duplicates. The subset parameter can be used to specify certain columns for identifying duplicates. The keep parameter determines which duplicates (if any) to keep. By default, it keeps the first occurrence of a duplicate row and removes the rest.
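
The two less common settings, keep=False and inplace=True, can be sketched as follows. The data matches the earlier example, and the names unique_only, working and result are hypothetical.

```python
import pandas as pd

data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob"],
    "Age": [23, 28, 26, 23, 29],
    "City": ["New York", "Los Angeles", "San Francisco", "New York", "Austin"],
})

# keep=False drops every row that has a duplicate, so both copies of
# the repeated Alice row (labels 0 and 3) are removed.
unique_only = data.drop_duplicates(keep=False)
print(unique_only)

# inplace=True modifies the DataFrame it is called on and returns None.
working = data.copy()  # work on a copy so that data itself is left intact
result = working.drop_duplicates(inplace=True)
print(result)        # None
print(len(working))  # 4
```

Because the in-place call returns None, writing working = working.drop_duplicates(inplace=True) would replace the DataFrame with None, which is a common mistake.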

Use

The drop_duplicates() method is used to remove duplicate rows from a DataFrame. This is especially useful when working with large datasets, where repeated rows can inflate counts and skew aggregates such as sums and means.
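
Before removing anything, it can help to check how many rows would be affected. One possible approach, sketched below, uses the companion method DataFrame.duplicated(), which this page does not otherwise cover; the variable names are hypothetical.

```python
import pandas as pd

data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob"],
    "Age": [23, 28, 26, 23, 29],
    "City": ["New York", "Los Angeles", "San Francisco", "New York", "Austin"],
})

# duplicated() returns a boolean Series that is True for each row
# drop_duplicates() would remove under the same keep setting.
n_duplicates = int(data.duplicated().sum())
print(n_duplicates)  # 1

# keep=False marks every member of a duplicate group, which is handy
# for inspecting the repeated rows side by side.
repeated_rows = data[data.duplicated(keep=False)]
print(repeated_rows)
```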

Important Points

drop_duplicates() returns a new DataFrame by default; the original is changed only when inplace=True.
The subset parameter restricts the comparison to the named columns.
The keep parameter accepts 'first', 'last', or False and determines which of the duplicate rows, if any, are kept.

Summary

The drop_duplicates() method is an important method in the Pandas library for removing duplicate rows from a DataFrame. It provides flexibility and control over which duplicates to keep and which to remove. This method is essential for cleaning up data and preparing it for further analysis or use.

