Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this issue exists on the latest version of pandas.

  • [ ] I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
from io import StringIO
import pandas.testing as pdt
from time import time

# Creating HTML-Table Strings, 10 Tables, with 15 columns and 300 rows each
dataframes = [pd.DataFrame(np.random.randint(0,100, (300, 15)), columns=[f"Col-{i}" for i in range(15)])]*10
html_strings = [df.to_html() for df in dataframes]

# Testing the runtime when all HTML Tables are combined into a single string
combined_html = "\n".join(html_strings)
start = time()
combined = pd.read_html(StringIO(combined_html))
combined_time = time()-start
print(f"Took {combined_time} Seconds to load the combined data")
# Took 23.394442081451416 Seconds to load the combined data

# Testing the runtime when each HTML Table is read in separately
start = time()
chunked = [pd.read_html(StringIO(string))[0] for string in html_strings]
chunked_time = time()-start
print(f"Took {chunked_time} Seconds to load the data in chunks")
# Took 0.5712969303131104 Seconds to load the data in chunks

print(f"Chunked was {round(combined_time/chunked_time, 2)}x faster than the combined data")
# Chunked was 40.95x faster than the combined data

# Checking if the resulting tables are the same
combined_df = pd.concat(combined).reset_index(drop=True)
chunked_df = pd.concat(chunked).reset_index(drop=True)
pdt.assert_frame_equal(combined_df, chunked_df) # No AssertionError => data is the same

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python           : 3.10.8.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19045
machine          : AMD64
processor        : AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : de_DE.cp1252

pandas           : 1.5.2
numpy            : 1.23.5
pytz             : 2022.6
dateutil         : 2.8.2
setuptools       : 65.6.3
pip              : 22.3
Cython           : None
pytest           : 7.2.0
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.9.1
html5lib         : None
pymysql          : None
psycopg2         : 2.9.5
jinja2           : 3.1.2
IPython          : 8.6.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : None
brotli           : None
fastparquet      : 0.8.3
fsspec           : 2022.11.0
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : 3.0.10
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : 1.4.44
tables           : None
tabulate         : 0.9.0
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None

Prior Performance

No response

Comment From: asishm

The issue seems to be either with lxml or with the XPath expression pandas is using (I'm not an XPath expert, but I'm leaning towards the former).

https://github.com/pandas-dev/pandas/blob/094b2c0a183a96411b7ee2bc26a98e6d82c8cbb9/pandas/io/html.py#L748

That XPath expression takes ~20 seconds on the combined HTML (the growth seems to be exponential in nature), compared to sub-second timings for the individual HTML strings.
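
Since reading each table on its own is fast, one possible workaround is to split the combined HTML into one string per table before parsing. The following is only a rough sketch: `split_html_tables` is a hypothetical helper (not part of pandas or lxml), and it assumes the tables are not nested and that every `<table>` has a closing `</table>` tag, which holds for `DataFrame.to_html()` output but not for arbitrary HTML.

```python
import re

# One complete <table ...>...</table> element (non-greedy, case-insensitive).
# Assumption: tables are not nested; a nested table would be cut off early.
_TABLE_RE = re.compile(r"<table\b.*?</table\s*>", re.IGNORECASE | re.DOTALL)


def split_html_tables(html: str) -> list[str]:
    """Return every <table> element found in `html` as its own string."""
    return [match.group(0) for match in _TABLE_RE.finditer(html)]


if __name__ == "__main__":
    sample = (
        "<table><tr><td>1</td></tr></table>\n"
        "<p>some text between the tables</p>\n"
        "<table><tr><td>2</td></tr></table>"
    )
    for chunk in split_html_tables(sample):
        print(chunk)
```

Each chunk could then be parsed as in the chunked variant of the reproducible example, e.g. `[pd.read_html(StringIO(chunk))[0] for chunk in split_html_tables(combined_html)]`. A regular expression is not a real HTML parser, so this is only reasonable for well-formed, machine-generated input.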