DataFrame.drop_duplicates() - Pandas DataFrame Basics

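The drop_duplicates() method in pandas removes duplicate rows from a DataFrame. A row counts as a duplicate when its values match those of another row in every column, or in only the columns you specify. By default, the method returns a new DataFrame and leaves the original unchanged.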

Syntax

The signature of the drop_duplicates() method is:

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
  • subset - column label or sequence of column labels, optional. Only these columns are compared when identifying duplicates; by default, all columns are used.
  • keep - {'first', 'last', False}, default 'first'. Determines which duplicates (if any) to keep.
    • first : Drop duplicates except for the first occurrence.
    • last : Drop duplicates except for the last occurrence.
    • False : Drop every row that has a duplicate, keeping none of them.
  • inplace - bool, default False. If True, modify the DataFrame in place and return None instead of a new DataFrame.
  • ignore_index - bool, default False. If True, the resulting axis will be labeled 0, 1, …, n - 1.
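
As a minimal sketch of how subset and keep interact, the snippet below uses a small made-up DataFrame (the name people and its values are illustrative only) and compares rows on the Name column alone.

```python
import pandas as pd

# Hypothetical sample data: "Alice" appears twice with different ages.
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie"],
    "Age": [23, 28, 24, 26],
})

# Compare rows on "Name" only and keep the first row for each name.
first_per_name = people.drop_duplicates(subset="Name")

# Same comparison, but keep the last row for each name.
last_per_name = people.drop_duplicates(subset=["Name"], keep="last")

# keep=False drops every row whose "Name" appears more than once.
unrepeated_names = people.drop_duplicates(subset="Name", keep=False)

print(first_per_name)
print(last_per_name)
print(unrepeated_names)
```

With keep='first' the Alice row aged 23 survives, with keep='last' the Alice row aged 24 survives, and with keep=False both Alice rows are removed.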

Example

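Consider the following example, in which the DataFrame data contains one duplicated row: row 3 repeats row 0. Rows 1 and 4 share the name "Bob" but differ in Age and City, so they are not duplicates.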

import pandas as pd

data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob"],
    "Age": [23, 28, 26, 23, 29],
    "City": ["New York", "Los Angeles", "San Francisco", "New York", "Austin"]
})

print("Data with duplicates:")
print(data)

data = data.drop_duplicates()

print("Data after removing duplicates:")
print(data)

The output of this code will be:

Data with duplicates:
      Name  Age           City
0    Alice   23       New York
1      Bob   28    Los Angeles
2  Charlie   26  San Francisco
3    Alice   23       New York
4      Bob   29         Austin

Data after removing duplicates:
      Name  Age           City
0    Alice   23       New York
1      Bob   28    Los Angeles
2  Charlie   26  San Francisco
4      Bob   29         Austin

In this example, we first create the DataFrame data from a dictionary. Row 3 is an exact copy of row 0, so drop_duplicates() removes it and returns a new DataFrame with four unique rows. Both "Bob" rows remain because they differ in Age and City. The surviving rows keep their original index labels (0, 1, 2, 4).
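
The result above is labeled 0, 1, 2, 4 because the surviving rows keep their original index labels. If a continuous index is preferred, one option is ignore_index=True, sketched here on the same data and assuming pandas 1.0 or later (the variable name deduped is illustrative).

```python
import pandas as pd

data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob"],
    "Age": [23, 28, 26, 23, 29],
    "City": ["New York", "Los Angeles", "San Francisco", "New York", "Austin"]
})

# Drop the duplicated row and relabel the remaining rows 0..n-1.
deduped = data.drop_duplicates(ignore_index=True)

print(deduped)
```

The same relabeling can be done afterwards with reset_index(drop=True).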

Output

The example prints the original five-row DataFrame, followed by the four-row DataFrame that remains after the duplicated row is removed.

Explanation

The drop_duplicates() method removes duplicate rows from the DataFrame. By default, it considers all columns for identifying duplicates. The subset parameter can be used to specify certain columns for identifying duplicates. The keep parameter determines which duplicates (if any) to keep. By default, it keeps the first occurrence of a duplicate row and removes the rest.
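
To preview which rows would be removed before dropping anything, the related DataFrame.duplicated() method returns a boolean mask and accepts the same subset and keep arguments. The sketch below reuses the example data; the names mask and manual are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob"],
    "Age": [23, 28, 26, 23, 29],
    "City": ["New York", "Los Angeles", "San Francisco", "New York", "Austin"]
})

# True marks the rows that drop_duplicates() would remove with default settings.
mask = data.duplicated()
print(mask)

# Keeping the unmarked rows gives the same result as drop_duplicates().
manual = data[~mask]
print(manual.equals(data.drop_duplicates()))
```

Only row 3 is marked True, since it is the second occurrence of the Alice row.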

Use

The drop_duplicates() method is used to remove duplicate rows from a DataFrame. Duplicate rows can inflate counts and skew sums or averages, so removing them is a common data-cleaning step, particularly in large datasets where duplicates are hard to spot by eye.
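
As a rough illustration of the effect on an aggregate, the sketch below uses a hypothetical orders table and a hypothetical helper, drop_and_report(), neither of which is part of pandas. It reports how many rows were removed and shows how the total changes.

```python
import pandas as pd

def drop_and_report(df, subset=None):
    """Hypothetical helper: drop duplicate rows and count how many were removed."""
    cleaned = df.drop_duplicates(subset=subset)
    removed = len(df) - len(cleaned)
    return cleaned, removed

# Hypothetical data: order 102 was recorded twice.
orders = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": [20.0, 35.5, 35.5, 12.0],
})

cleaned, removed = drop_and_report(orders)

print(f"Removed {removed} duplicate row(s)")
print("Total before:", orders["amount"].sum())
print("Total after:", cleaned["amount"].sum())
```

Here the repeated order adds 35.5 to the total until it is dropped.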

Important Points

  • The drop_duplicates() method removes duplicate rows and, unless inplace=True, returns a new DataFrame without changing the original.
  • The subset parameter can be used to specify certain columns for identifying duplicates.
  • The keep parameter determines which duplicates (if any) to keep.

Summary

The drop_duplicates() method removes duplicate rows from a DataFrame. The subset and keep parameters control which columns are compared and which occurrence, if any, is kept. This makes it a common step when cleaning data for further analysis.
