
DataFrame.sample() (Pandas DataFrame Basics)

Syntax

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
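
The axis parameter in the signature selects what is sampled: rows by default, or columns when axis=1. The following is a minimal sketch, assuming a small hand-made DataFrame (the names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Sarah"],
    "age": [21, 28, 25],
    "gender": ["Female", "Male", "Female"],
})

# axis=1 samples columns instead of rows; every row is kept
two_columns = df.sample(n=2, axis=1, random_state=0)

print(two_columns.shape)
```

The result keeps all three rows but only two of the three columns.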

Example

import pandas as pd

df = pd.read_csv('data.csv')

# Sample 5 random rows from the DataFrame
sample_df1 = df.sample(n=5)

# Sample a random 20% of the rows from the DataFrame
sample_df2 = df.sample(frac=0.2)

print(sample_df1)
print(sample_df2)

Output

          name  age  gender
2        Alice   21  Female
7          Bob   28    Male
11       Sarah   25  Female
5         John   35    Male
9   Elizabeth   31  Female

        name  age  gender
7        Bob   28    Male
0      Alice   24  Female
10     Sarah   28  Female
4       John   27    Male
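
Because the example does not pass random_state, the rows shown above will differ from run to run. Passing an integer seed as random_state makes the sample repeatable. A minimal sketch, assuming a small made-up DataFrame in place of data.csv:

```python
import pandas as pd

# Hypothetical stand-in for the contents of data.csv
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Sarah", "John", "Elizabeth", "Tom"],
    "age": [21, 28, 25, 35, 31, 40],
})

# The same seed selects the same rows on every call
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)

print(first.equals(second))  # True
```

The seed value 42 is arbitrary; any fixed integer works.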

Explanation

The sample() method in pandas returns a random sample of rows from a DataFrame (or of columns, when axis=1). The sample size is given either as an absolute number of rows (n) or as a fraction of the total number of rows (frac).

The n parameter specifies the number of rows to sample, while the frac parameter specifies the fraction of rows to sample. They cannot be used together: passing both raises a ValueError. If neither is given, sample() returns a single row.
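
The difference between n and frac can be sketched as follows, assuming a made-up 10-row DataFrame (the column name value is hypothetical):

```python
import pandas as pd

# Hypothetical 10-row DataFrame
df = pd.DataFrame({"value": range(10)})

by_count = df.sample(n=3)          # exactly 3 rows
by_fraction = df.sample(frac=0.2)  # 20% of 10 rows, i.e. 2 rows
single_row = df.sample()           # neither given: 1 row

# Passing n and frac together is rejected
both_rejected = False
try:
    df.sample(n=3, frac=0.2)
except ValueError:
    both_rejected = True

print(len(by_count), len(by_fraction), len(single_row), both_rejected)
```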

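The replace parameter controls whether sampling is done with replacement. With replace=False (the default), each row can appear at most once, so the sample cannot be larger than the DataFrame. With replace=True, the same row can be selected more than once.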
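
A short sketch of the effect of replace, assuming a made-up 3-row DataFrame. Drawing 6 rows from 3 is only possible with replacement, so the result necessarily repeats some rows:

```python
import pandas as pd

# Hypothetical 3-row DataFrame
df = pd.DataFrame({"value": [10, 20, 30]})

# With replacement, the sample may be larger than the DataFrame
with_replacement = df.sample(n=6, replace=True, random_state=1)

# Without replacement, asking for more rows than exist raises an error
oversample_rejected = False
try:
    df.sample(n=6)
except ValueError:
    oversample_rejected = True

print(len(with_replacement), oversample_rejected)
```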

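The weights parameter assigns a weight to each row, which sets the probability of that row being selected. It accepts a list or array with one value per row, a Series aligned on the index, or the name of a column in the DataFrame. Weights that do not sum to 1 are normalized, and missing weights are treated as zero. Without weights, every row is equally likely.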
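
As an illustration of weighted sampling, the sketch below assumes a made-up DataFrame with a hypothetical weight column. Rows with weight 0 can never be selected, and the weights (0, 0, 1, 3) are normalized so that John is drawn about three times as often as Sarah:

```python
import pandas as pd

# Hypothetical data; the "weight" column is an assumption for this example
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Sarah", "John"],
    "weight": [0, 0, 1, 3],
})

# Weights taken from a column name; sampling with replacement
# so that 100 draws can be made from 4 rows
picked = df.sample(n=100, replace=True, weights="weight", random_state=0)

# The same weights can also be passed as a plain list
picked_from_list = df.sample(n=1, weights=[0, 0, 1, 3], random_state=0)

print(sorted(picked["name"].unique()))
```

Only Sarah and John can appear in either result, since Alice and Bob have zero weight.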

Use

The sample() method is useful whenever a random subset of rows from a large DataFrame is needed, for example to inspect a manageable slice during data exploration and analysis, or to split data into training and test sets for machine learning models.
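
One common way to build a training and test split with sample() is to draw the training rows and then drop them from the original. This is a sketch rather than the only approach; it assumes a made-up DataFrame with a unique index and hypothetical feature and label columns:

```python
import pandas as pd

# Hypothetical dataset with a unique default index
df = pd.DataFrame({
    "feature": range(10),
    "label": [0, 1] * 5,
})

# 80% of the rows for training, fixed seed for repeatability
train = df.sample(frac=0.8, random_state=0)

# The remaining rows form the test set
test = df.drop(train.index)

print(len(train), len(test))  # 8 2
```

Dropping by index relies on the index labels being unique; with duplicate labels, drop() would remove more rows than intended.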

Important Points

  • sample() returns a random sample of rows from a DataFrame.
  • Use n for an absolute number of rows or frac for a fraction of the rows. The two cannot be combined.
  • Set replace=True to sample with replacement, which allows the same row to be selected more than once. The default is False.
  • Use weights to give each row its own selection probability. Without weights, every row is equally likely.

Summary

In conclusion, sample() is a useful tool for randomly sampling a subset of rows from a large DataFrame. The sample size can be given as a number of rows or as a fraction of the total, the replace parameter controls whether rows can be drawn more than once, and the weights parameter adjusts the probability of selecting each row. The method is particularly useful in data exploration and analysis, and for splitting data when training and testing machine learning models.
