How to Read HTML Tables with Pandas (Including Example)


You can use the pandas read_html() function to read HTML tables into a pandas DataFrame.

This function uses the following basic syntax:

df = pd.read_html('https://en.wikipedia.org/wiki/National_Basketball_Association')

The following example shows how to use this function to read in a table of NBA team names from this Wikipedia page.

Example: Read HTML Table with Pandas

Before using the read_html() function, you’ll likely have to install lxml:

pip install lxml

Note: If you’re using a Jupyter notebook, you need to restart the kernel after performing this installation.

Next, we can use the read_html() function to read every HTML table on this Wikipedia page:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from unicodedata import normalize

#read all HTML tables from specific URL
tabs = pd.read_html('https://en.wikipedia.org/wiki/National_Basketball_Association')

#display total number of tables read
len(tabs)

44

We can see that a total of 44 HTML tables were found on this page.

I know that the table I’m interested in has the word “Division” in it, so I can use the match argument to only retrieve HTML tables that contain this word:

#read HTML tables from specific URL with the word "Division" in them
tabs = pd.read_html('https://en.wikipedia.org/wiki/National_Basketball_Association',
                    match='Division')

#display total number of tables read
len(tabs)

1

I can then list the names of the columns of the table:

#define table
df = tabs[0]

#list all column names of table
list(df)

[('Division', 'Eastern Conference'),
 ('Team', 'Eastern Conference'),
 ('Location', 'Eastern Conference'),
 ('Arena', 'Eastern Conference'),
 ('Capacity', 'Eastern Conference'),
 ('Coordinates', 'Eastern Conference'),
 ('Founded', 'Eastern Conference'),
 ('Joined', 'Eastern Conference'),
 ('Unnamed: 8_level_0', 'Eastern Conference')]

I’m only interested in the first two columns, so I can filter the DataFrame to only contain these columns:

#filter DataFrame to only contain first two columns
df_final = df.iloc[:, 0:2]

#rename columns
df_final.columns = ['Division', 'Team']

#view first few rows of final DataFrame
print(df_final.head())

   Division                Team
0  Atlantic      Boston Celtics
1  Atlantic       Brooklyn Nets
2  Atlantic     New York Knicks
3  Atlantic  Philadelphia 76ers
4  Atlantic     Toronto Raptors

The final table contains only the ‘Division’ and ‘Team’ columns.

Additional Resources

The following tutorials explain how to read other types of files in pandas:

How to Read a Text File with Pandas
How to Read Excel Files with Pandas
How to Read CSV Files with Pandas

Leave a Reply

Your email address will not be published.