One of the first steps of any data analysis project is **exploratory data analysis**.

This involves exploring a dataset in three ways:

**1. Summarizing** a dataset using descriptive statistics.

**2. Visualizing** a dataset using charts.

**3. Identifying** missing values.

By performing these three actions, you can understand how the values in a dataset are distributed and detect any problematic values before proceeding to a hypothesis test or statistical modeling.

The following step-by-step example shows how to perform exploratory data analysis for a dataset in Python.

**Step 1: Create the Data**

First, let’s create the following pandas DataFrame:

**import pandas as pd
import numpy as np
#create DataFrame (np.nan marks one missing value in the rebounds column)
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, np.nan, 9, 12]})
**

We can take a look at the first five rows of the DataFrame by using the **head()** function:

**#view first five rows of dataset
df.head()
team points assists rebounds
0 A 18 5 11.0
1 A 22 7 8.0
2 A 19 7 10.0
3 A 14 9 6.0
4 B 14 12 6.0
**
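In the output above, the rebounds values display as decimals (11.0, 8.0) while points and assists display as integers. The following is a minimal sketch of how to check why, assuming the same DataFrame as in Step 1: the `dtypes` attribute reports the type of each column, and a column that contains `np.nan` is stored as floating point.

```python
import numpy as np
import pandas as pd

# Same data as the example DataFrame in Step 1
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, np.nan, 9, 12]})

# One entry per column: the integer columns stay integer, while the
# column holding np.nan is stored as floating point
print(df.dtypes)

# info() prints the column types together with the non-null count per column
df.info()
```

The non-null counts printed by `info()` also give an early hint that one column has fewer observed values than the others.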

**Step 2: Summarize the Data**

We can use the **describe()** function to quickly summarize each numerical variable in the dataset:

**#summarize numerical variables
df.describe()
points assists rebounds
count 8.000000 8.00000 7.000000
mean 18.250000 7.75000 8.857143
std 5.365232 2.54951 2.340126
min 11.000000 4.00000 6.000000
25% 14.000000 6.50000 7.000000
50% 18.500000 8.00000 9.000000
75% 20.500000 9.00000 10.500000
max 28.000000 12.00000 12.000000**

For each of the numeric variables we can see the following information:

**count**: Total number of non-missing values

**mean**: The mean value

**std**: The standard deviation

**min**: The minimum value

**25%**: The value of the first quartile (25th percentile)

**50%**: The median value (50th percentile)

**75%**: The value of the third quartile (75th percentile)

**max**: The maximum value
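To make the meaning of each row concrete, the statistics for a single column can be computed one at a time. This is a small sketch for the points column only, assuming the same values as in Step 1; the `summary` dictionary is just an illustrative name.

```python
import pandas as pd

# Points values from the example DataFrame
points = pd.Series([18, 22, 19, 14, 14, 11, 20, 28], name='points')

# Each entry corresponds to one row of the describe() output
summary = {
    'count': points.count(),         # number of non-missing values
    'mean': points.mean(),
    'std': points.std(),             # sample standard deviation (ddof=1)
    'min': points.min(),
    '25%': points.quantile(0.25),    # first quartile
    '50%': points.median(),          # median
    '75%': points.quantile(0.75),    # third quartile
    'max': points.max(),
}

for name, value in summary.items():
    print(f'{name}: {value:.6f}')
```

Note that `std()` in pandas divides by n - 1 by default, so it reports the sample standard deviation rather than the population standard deviation.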

For a categorical variable in the dataset, such as **team**, we can use the **value_counts()** function to get a frequency count of each value:

**#display frequency counts for team variable
df['team'].value_counts()
A 4
B 4
Name: team, dtype: int64**

From the output we can see:

**A**: This value occurs 4 times.

**B**: This value occurs 4 times.
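Frequency counts are often easier to compare as proportions. As a brief sketch, assuming the same team values as in Step 1, the `normalize=True` argument of `value_counts()` returns the share of rows for each value instead of the raw count.

```python
import pandas as pd

# Team values from the example DataFrame
team = pd.Series(['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'], name='team')

# Raw counts per team
counts = team.value_counts()

# Share of rows per team (the values sum to 1)
proportions = team.value_counts(normalize=True)

print(counts)
print(proportions)
```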

We can use the **shape** attribute to get the dimensions of the DataFrame as a tuple of the number of rows and the number of columns:

**#display rows and columns
df.shape
(8, 4)
**

We can see that the DataFrame has **8** rows and **4** columns.

**Step 3: Visualize the Data**

We can also create charts to visualize the values in the dataset.

For example, we can use the pandas **hist()** function to create a histogram of the values for each numerical variable:

**#create histogram for each numerical variable
df.hist(grid=False, edgecolor='black')
**

The x-axis of each histogram shows the range of values for a variable, divided into bins, and the y-axis shows how many observations fall into each bin.
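To see the numbers behind one of these histograms, the bin edges and bin counts can be computed directly. The sketch below uses NumPy's `histogram()` function on the points column with 10 bins, which is the default number of bins for the pandas `hist()` function; it is meant as an illustration of the binning idea, not a description of how pandas draws the chart internally.

```python
import numpy as np
import pandas as pd

# Points values from the example DataFrame
points = pd.Series([18, 22, 19, 14, 14, 11, 20, 28], name='points')

# Split the range from min to max into 10 equal-width bins and
# count how many observations fall into each one
counts, edges = np.histogram(points, bins=10)

# Print each bin as an interval with its frequency
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left:.1f} to {right:.1f}: {count}')
```

With only eight observations, many bins will be empty, so passing a smaller `bins` value to `hist()` may give a more readable chart.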

We can also use the pandas **boxplot()** function to create a boxplot for each numerical variable:

**#create boxplot for each numerical variable
df.boxplot(grid=False)
**

We can also pass the **by** argument to the pandas **boxplot()** function to create a boxplot of one variable grouped by another variable, for example **df.boxplot(column='points', by='team')**.
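A grouped boxplot in pandas can be sketched as follows, assuming the same team and points values as in Step 1. The `column` and `by` arguments of `boxplot()` select the variable to plot and the grouping variable. The `matplotlib.use("Agg")` line is only there so the sketch runs without a display and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import pandas as pd

# Team and points values from the example DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28]})

# Draw one box per team using the values in the points column
ax = df.boxplot(column='points', by='team', grid=False)

# The line inside each box marks the group median, which can be
# checked numerically with groupby()
medians = df.groupby('team')['points'].median()
print(medians)
```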

We can also use the pandas **corr()** function to create a correlation matrix to view the correlation coefficient between each pairwise combination of numeric variables in the DataFrame:

**#create correlation matrix (numeric_only=True skips the non-numeric team column)
df.corr(numeric_only=True)
points assists rebounds
points 1.000000 -0.725841 0.767007
assists -0.725841 1.000000 -0.882046
rebounds 0.767007 -0.882046 1.000000
**
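A single entry of the correlation matrix can also be computed on its own. The sketch below, assuming the same data as in Step 1, uses the `corr()` function on one column against another and shows that rows with a missing value are left out of the calculation for that pair.

```python
import numpy as np
import pandas as pd

# Numeric columns from the example DataFrame
df = pd.DataFrame({'points': [18, 22, 19, 14, 14, 11, 20, 28],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, np.nan, 9, 12]})

# Pearson correlation between points and assists (uses all 8 rows)
r_points_assists = df['points'].corr(df['assists'])

# Pearson correlation between points and rebounds; the row with the
# missing rebounds value is excluded from this pair
r_points_rebounds = df['points'].corr(df['rebounds'])

# Number of complete rows available for the points/rebounds pair
n_used = df[['points', 'rebounds']].dropna().shape[0]

print(round(r_points_assists, 6))
print(round(r_points_rebounds, 6))
print(n_used)
```

Because of this pairwise handling, different cells of the correlation matrix can be based on different numbers of rows when a dataset has missing values.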

**Step 4: Identify Missing Values**

We can use the following code to count the total number of missing values in each column of the DataFrame:

**#count total missing values in each column
df.isnull().sum()
team 0
points 0
assists 0
rebounds 1
dtype: int64
**

From the output we can see that there is only one missing value in the **rebounds** column.

All other columns have no missing values.
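Once we know which column has missing values, it is often useful to see which rows are affected and what share of each column is missing. The following is a minimal sketch, assuming the same DataFrame as in Step 1; `missing_rows` and `pct_missing` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same data as the example DataFrame in Step 1
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, np.nan, 9, 12]})

# Keep only the rows that have a missing value in at least one column
missing_rows = df[df.isnull().any(axis=1)]
print(missing_rows)

# Percentage of missing values in each column
pct_missing = df.isnull().mean() * 100
print(pct_missing)
```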

We have now completed a basic exploratory data analysis and have a good understanding of how the values are distributed for each variable in this dataset.

**Additional Resources**

The following tutorials explain how to perform other common tasks in Python:

How to Create Frequency Tables in Python

How to Create Boxplot from Pandas DataFrame

How to Create a Histogram from Pandas DataFrame