How to Access Sample Datasets in Pandas


Often you may want to access sample datasets in pandas to play around with and practice different functions.

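Fortunately, you can build sample pandas datasets by using the functions in the built-in pd.util.testing module. Note that this module was deprecated in pandas 1.0 and removed in pandas 2.0, so the examples below require an older version of pandas.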

The following examples show how to use this feature.

Example 1: Create Pandas Dataset with All Numeric Columns

The following code shows how to create a pandas dataset with all numeric columns:

import pandas as pd

#create sample dataset
df1 = pd.util.testing.makeDataFrame()

#view dimensions of dataset
print(df1.shape)

(30, 4)

#view first five rows of dataset
print(df1.head())

                   A         B         C         D
s8tpz0W5mF -0.751223  0.956338 -0.441847  0.695612
CXQ9YhLhk8 -0.210881 -0.231347 -0.227672 -0.616171
KAbcor6sQK  0.727880  0.128638 -0.989993  1.094069
IH3bptMpdb -1.599723  1.570162 -0.221688  2.194936
gaR9ZxBTrH  0.025171 -0.446555  0.169873 -1.583553

By default, the makeDataFrame() function creates a pandas DataFrame with 30 rows and 4 columns in which all of the columns are numeric.
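
For pandas versions that no longer ship pd.util.testing, a similar dataset can be put together with NumPy. The following is only a sketch, not the code pandas itself uses: the helper name make_numeric_frame, its parameters, and the seed are assumptions made for illustration.

```python
import string

import numpy as np
import pandas as pd


def make_numeric_frame(n_rows=30, n_cols=4, seed=0):
    """Hypothetical helper: random normal values with random 10-character row labels."""
    rng = np.random.default_rng(seed)
    alphabet = list(string.ascii_letters + string.digits)
    # build one random 10-character label per row
    index = ["".join(rng.choice(alphabet, size=10)) for _ in range(n_rows)]
    # name the columns A, B, C, ...
    columns = list(string.ascii_uppercase[:n_cols])
    values = rng.standard_normal((n_rows, n_cols))
    return pd.DataFrame(values, index=index, columns=columns)


#create sample dataset
df1_alt = make_numeric_frame()

#view dimensions of dataset
print(df1_alt.shape)

#view first five rows of dataset
print(df1_alt.head())
```

Because the seed is fixed, the same values are produced on every run, which makes it easier to compare results while practicing.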

Example 2: Create Pandas Dataset with Mixed Columns

The following code shows how to create a pandas dataset with columns of mixed data types:

import pandas as pd

#create sample dataset
df2 = pd.util.testing.makeMixedDataFrame()

#view dimensions of dataset
print(df2.shape)

(5, 4)

#view first five rows of dataset
print(df2.head())

     A    B     C          D
0  0.0  0.0  foo1 2009-01-01
1  1.0  1.0  foo2 2009-01-02
2  2.0  0.0  foo3 2009-01-05
3  3.0  1.0  foo4 2009-01-06
4  4.0  0.0  foo5 2009-01-07

By default, the makeMixedDataFrame() function creates a pandas DataFrame with 5 rows and 4 columns in which the columns are a variety of data types.

We can use the following code to display the data type of each column:

#display data type of each column
print(df2.dtypes)

A           float64
B           float64
C            object
D    datetime64[ns]
dtype: object

From the output we can see:

  • Column A is numeric (float64)
  • Column B is numeric (float64)
  • Column C holds strings (stored with the object data type)
  • Column D holds dates (datetime64[ns])
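
The same mixed-type dataset can also be rebuilt by hand, which is useful if the testing functions are not available in your pandas version. The sketch below simply types in the values from the output shown above. The name df2_alt is an assumption made for illustration, and the data type reported for column C may differ depending on the pandas version.

```python
import pandas as pd

#create dataset by hand using the values shown in the output above
df2_alt = pd.DataFrame({
    "A": [0.0, 1.0, 2.0, 3.0, 4.0],
    "B": [0.0, 1.0, 0.0, 1.0, 0.0],
    "C": ["foo1", "foo2", "foo3", "foo4", "foo5"],
    "D": pd.bdate_range("2009-01-01", periods=5),  # business days only
})

#display data type of each column
print(df2_alt.dtypes)

#list only the numeric columns
print(df2_alt.select_dtypes(include="number").columns.tolist())
```

Using select_dtypes() is one way to practice splitting a mixed dataset into its numeric and non-numeric parts.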

Example 3: Create Pandas Dataset with Missing Values

The following code shows how to create a pandas dataset with some missing values in various columns:

import pandas as pd

#create sample dataset
df3 = pd.util.testing.makeMissingDataFrame()

#view dimensions of dataset
print(df3.shape)

(30, 4)

#view first five rows of dataset
print(df3.head())

                   A         B         C         D
YgAQaNaGfG  0.444376 -2.264920  1.117377 -0.087507
JoT4KxJeHd  1.913939  1.287006 -0.331315 -0.392949
tyrA2P6wz3       NaN  2.988521  0.399583  0.095831
1qvPc9DU1t  0.028716  1.311452 -0.237756 -0.150362
3aAXYtXjIO -1.069339  0.332067  0.204074       NaN

By default, the makeMissingDataFrame() function creates a pandas DataFrame with 30 rows and 4 columns in which there are some missing values (NaN) in various columns.

This function is particularly useful because it allows you to work with a dataset that has some missing values, which is common in real-world datasets.
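
As a rough illustration of how such a dataset can be practiced on, the sketch below builds a comparable DataFrame with NumPy and then counts, drops, and fills the missing values. It does not reproduce makeMissingDataFrame() itself: the choice of exactly 12 missing cells, the seed, and the name df3_alt are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

#create 30 rows and 4 columns of random values
values = rng.standard_normal((30, 4))

#replace 12 distinct, randomly chosen cells (10% of 120) with NaN
missing_positions = rng.choice(values.size, size=12, replace=False)
values.flat[missing_positions] = np.nan

df3_alt = pd.DataFrame(values, columns=list("ABCD"))

#count missing values in each column
print(df3_alt.isna().sum())

#drop every row that contains at least one missing value
print(df3_alt.dropna().shape)

#fill missing values with the mean of each column
df3_filled = df3_alt.fillna(df3_alt.mean())
print(df3_filled.isna().sum().sum())
```

Dropping rows and filling with the column mean are only two of several common ways to handle missing values; which one is appropriate depends on the data.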

Additional Resources

The following tutorials explain how to perform other common tasks in pandas:

How to Create Pandas DataFrame with Random Data
How to Randomly Sample Rows in Pandas
How to Shuffle Rows in a Pandas DataFrame
