DataFrame.sample() - ( Pandas DataFrame Basics )
Syntax
DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
Example
import pandas as pd
df = pd.read_csv('data.csv')
# Sample random 5 rows from the dataframe
sample_df1 = df.sample(n=5)
# Sample random 20% of rows from the dataframe
sample_df2 = df.sample(frac=0.2)
print(sample_df1)
print(sample_df2)
Output
         name  age  gender
2       Alice   21  Female
7         Bob   28    Male
11      Sarah   25  Female
5        John   35    Male
9   Elizabeth   31  Female

     name  age  gender
7     Bob   28    Male
0   Alice   24  Female
10  Sarah   28  Female
4    John   27    Male
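The rows shown above are only one possible result: each run draws a different sample unless a seed is supplied. The following is a minimal sketch of how the random_state parameter from the syntax line can make a sample repeatable. The small DataFrame is made up for illustration and stands in for data.csv.

```python
import pandas as pd

# Made-up data standing in for data.csv (hypothetical values).
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Sarah", "John", "Elizabeth", "Mark"],
    "age": [24, 28, 25, 35, 31, 40],
})

# The same integer seed yields the same rows on every call.
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)

print(first.equals(second))  # True
```

Omitting random_state restores the default behaviour, where repeated calls usually return different rows.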
Explanation
The sample() function in Pandas returns a random sample of rows from a DataFrame (or of columns, when axis=1). The sample size is given either as a number of rows (n) or as a fraction of the total number of rows (frac).
The n parameter specifies the number of rows to sample from the DataFrame, while the frac parameter specifies the fraction of rows to sample. They cannot be used together; passing both raises a ValueError. If neither is given, a single row is returned.
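As a rough illustration of the difference between n and frac, the sketch below samples from a made-up ten-row DataFrame (the column names and values are assumptions, not taken from data.csv) and shows what happens when both parameters are passed.

```python
import pandas as pd

# Hypothetical ten-row DataFrame used only for this illustration.
df = pd.DataFrame({"id": range(10), "score": [5, 7, 3, 9, 1, 8, 6, 2, 4, 0]})

by_count = df.sample(n=3)       # exactly 3 rows
by_frac = df.sample(frac=0.2)   # 20% of 10 rows, i.e. 2 rows

print(len(by_count))  # 3
print(len(by_frac))   # 2

# n and frac are mutually exclusive; pandas rejects the call.
try:
    df.sample(n=3, frac=0.2)
    both_allowed = True
except ValueError as exc:
    both_allowed = False
    print("ValueError:", exc)
```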
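The replace parameter can be set to True or False depending on whether we want to sample with replacement or not. With the default, replace=False, each row can appear at most once, so the sample cannot be larger than the DataFrame.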
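A short sketch of the effect of replace, assuming a made-up three-row DataFrame: with replacement the sample may be larger than the data and rows repeat, while without replacement the same request fails.

```python
import pandas as pd

# Hypothetical three-row DataFrame for illustration.
df = pd.DataFrame({"value": [10, 20, 30]})

# With replacement, a row can be drawn more than once,
# so asking for 6 rows from 3 is allowed.
with_repl = df.sample(n=6, replace=True, random_state=1)
print(len(with_repl))                   # 6
print(with_repl.index.has_duplicates)   # True

# Without replacement (the default), the sample cannot exceed the data.
try:
    df.sample(n=6)
    oversample_ok = True
except ValueError as exc:
    oversample_ok = False
    print("ValueError:", exc)
```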
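The weights parameter can be used to specify a weight for each row in the DataFrame, either as a list-like of values or as the name of a column. The weights influence the probability of selecting each row in the sample; weights that do not sum to 1 are normalized, and when no weights are given all rows are equally likely.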
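To make the effect of weights concrete, here is a minimal sketch using a made-up DataFrame whose "weight" column (a hypothetical name) supplies the weights. Rows with weight 0 can never be drawn, so only the two rows with positive weight can appear in the result.

```python
import pandas as pd

# Hypothetical data: rows "a" and "b" have zero weight.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "weight": [0, 0, 1, 3],
})

# weights may be a column name; values are normalized to sum to 1,
# so "d" is three times as likely as "c" on any single draw.
picked = df.sample(n=2, weights="weight", random_state=0)

print(sorted(picked["name"]))  # ['c', 'd']
```

Passing the same values directly, as in weights=[0, 0, 1, 3], should behave the same way.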
Use
The sample() function can be used in various scenarios where we need to randomly sample a subset of rows from a large DataFrame. This can be useful for data exploration and analysis, as well as for training and testing machine learning models.
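One common pattern for the machine learning case is a random train/test split. The sketch below is one possible approach rather than the only one; the DataFrame, the 80/20 ratio, and the seed are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical dataset with a feature column and a label column.
df = pd.DataFrame({"x": range(10), "y": [0, 1, 0, 1, 1, 0, 0, 1, 0, 1]})

# Draw 80% of the rows for training; the seed keeps the split repeatable.
train = df.sample(frac=0.8, random_state=42)

# The remaining rows, identified by index label, form the test set.
test = df.drop(train.index)

print(len(train), len(test))  # 8 2
```

This relies on the index labels being unique, which holds for the default integer index.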
Important Points
- The sample() function in Pandas is used to generate a random sample of rows from a DataFrame.
- The n parameter is used to specify the number of rows we want to sample from the DataFrame, while the frac parameter is used to specify the fraction of rows we want to sample. They cannot be used together.
- The replace parameter can be set to True or False depending on whether we want to sample with replacement or not.
- The weights parameter can be used to specify a list of weight values for each row in the DataFrame, which will influence the probability of selecting each row in the sample.
Summary
In conclusion, the sample() function in Pandas is a useful tool for randomly sampling a subset of rows from a large DataFrame. It can be used to randomly select a number of rows from the DataFrame based on a specified number or a fraction of the total number of rows. It can also be used to control whether the sample is selected with or without replacement, and to weight the probability of selecting each row in the sample. This function is particularly useful in data exploration and analysis, as well as for the training and testing of machine learning models.