Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this issue exists on the latest version of pandas.

  • [X] I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import time
import pandas as pd

html_str = '''
<table>
  <tr>
    <th>Codes</th>
  </tr>
  <tr>
    <td>41651,65125,17328,02872,49459,79208,38630,24723,13276,29613,55978,68885,73452,ABCDE</td>
  </tr>
</table>
'''

print('\nNote: this takes over 500 seconds to complete loading!')

start_time = time.time()
df1 = pd.read_html(html_str)[0]
end_time = time.time()

print(f'\nData: {df1}')
print(f'\nDuration : {int(end_time-start_time)} seconds')

Installed Versions

INSTALLED VERSIONS
------------------
commit            : 478d340667831908b5b4bf09a2787a11a14560c9
python            : 3.10.10.final.0
python-bits       : 64
OS                : Windows
OS-release        : 10
Version           : 10.0.22624
machine           : AMD64
processor         : AMD64 Family 23 Model 104 Stepping 1, AuthenticAMD
byteorder         : little
LC_ALL            : None
LANG              : None
LOCALE            : English_India.1252
pandas            : 2.0.0
numpy             : 1.24.2
pytz              : 2022.7.1
dateutil          : 2.8.2
setuptools        : 65.5.1
pip               : 22.3.1
Cython            : None
pytest            : None
hypothesis        : None
sphinx            : None
blosc             : None
feather           : None
xlsxwriter        : 3.0.7
lxml.etree        : 4.9.2
html5lib          : 1.1
pymysql           : None
psycopg2          : None
jinja2            : None
IPython           : None
pandas_datareader : None
bs4               : 4.12.2
bottleneck        : None
brotli            : None
fastparquet       : None
fsspec            : None
gcsfs             : None
matplotlib        : None
numba             : None
numexpr           : None
odfpy             : None
openpyxl          : 3.0.10
pandas_gbq        : None
pyarrow           : None
pyreadstat        : None
pyxlsb            : None
s3fs              : None
scipy             : None
snappy            : None
sqlalchemy        : None
tables            : None
tabulate          : None
xarray            : None
xlrd              : None
zstandard         : None
tzdata            : 2023.3
qtpy              : None
pyqt5             : None

Prior Performance

No response

Comment From: VivaSrini

I also tried with converters and flavor options, but it still takes the same duration:

    df1 = pd.read_html(html_str, converters={0: str}, flavor='html5lib')[0]

Comment From: ErdiTk

After running snakeviz on your code, I see the following profile: [screenshot from 2023-04-13 22-19-07]

Then, by debugging the code, the bottleneck appears to be the call to:

self.num.search(x.strip())

on line 893 of the file python_parser.py. To be fair, I don't really know why the call takes so long, and it is only slow for the search over the following line of your text:

   <td>41651,65125,17328,02872,49459,79208,38630,24723,13276,29613,55978,68885,73452,ABCDE</td>

However, when I tried debugging it using the regex module instead of re, the search completed immediately. That might be a possible solution, but I think it would require an extensive update of the __init__ method of PythonParser.
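
To double-check that this single call accounts for the slowdown, here is a small standalone sketch (outside pandas) that runs the same search with the pattern self.num holds. I truncated the cell value to its first nine comma groups so it completes without a long wait; with all thirteen groups this one search alone takes minutes, which lines up with the duration reported above.

    import re
    import time

    # The same pattern that self.num holds in python_parser.py
    num = re.compile(r'^[\-\+]?([0-9]+,|[0-9])*(\.[0-9]*)?([0-9]?(E|e)\-?[0-9]+)?$')

    # Truncated cell value: first 9 of the 13 comma groups, plus the non-numeric tail
    cell = '41651,65125,17328,02872,49459,79208,38630,24723,13276,ABCDE'

    start = time.time()
    result = num.search(cell.strip())   # the call identified above
    print(result, f'{time.time() - start:.2f} s')   # None, after a noticeable pause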

Comment From: asishm

This is the regex being used, i.e. self.num = re.compile('^[\\-\\+]?([0-9]+,|[0-9])*(\\.[0-9]*)?([0-9]?(E|e)\\-?[0-9]+)?$')

Running it on regex101.com (https://regex101.com/r/Xup3Or/1) gives the following message:

Catastrophic backtracking has been detected and the execution of your expression has been halted. To find out more and what this is, please read the following article: Runaway Regular Expressions
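
A quick illustrative timing loop (just a sketch, not from regex101) makes the backtracking visible: the failing search slows down roughly exponentially with every extra comma-separated group, while an all-numeric value of the same shape matches essentially instantly because the engine never has to backtrack.

    import re
    import time

    num = re.compile(r'^[\-\+]?([0-9]+,|[0-9])*(\.[0-9]*)?([0-9]?(E|e)\-?[0-9]+)?$')

    # Failing inputs: the non-numeric tail forces the engine to try every way of
    # splitting the digit groups between the two alternatives before giving up.
    for groups in range(7, 11):
        bad = ','.join(['12345'] * groups) + ',ABCDE'
        start = time.time()
        num.search(bad)
        print(f'{groups} groups, no match: {time.time() - start:.2f} s')

    # Matching input: succeeds on the first attempt, so there is no blow-up.
    ok = ','.join(['12345'] * 13)
    start = time.time()
    num.search(ok)
    print(f'13 groups, match: {time.time() - start:.4f} s')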

Comment From: Sjlver

If I understand correctly, the intent of that regex is to look for numbers, with optional comma separators.

This should achieve the same effect with less backtracking:

self.num = re.compile(
  r'^[\-\+]?'                # Optional: Initial sign
  + r'([0-9]+(?:,[0-9]+)*)'  # Digits. Optionally followed by groups of digits with comma separator
  + r'(\.[0-9]*)?'           # Optional: period, followed by digits
  + r'((E|e)\-?[0-9]+)?$'    # Optional: exponent
)

Note that the original had an optional digit before the exponent. This would never match, as far as I can tell.
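
As a quick sanity check (just a sketch, not exhaustive): the rewritten pattern fails fast on the problematic cell and still matches a few typical comma-separated numbers.

    import re
    import time

    new_num = re.compile(
        r'^[\-\+]?'
        r'([0-9]+(?:,[0-9]+)*)'
        r'(\.[0-9]*)?'
        r'((E|e)\-?[0-9]+)?$'
    )

    cell = ('41651,65125,17328,02872,49459,79208,38630,24723,'
            '13276,29613,55978,68885,73452,ABCDE')
    start = time.time()
    print(new_num.search(cell), f'{time.time() - start:.4f} s')  # None, effectively instant

    # A few representative numeric strings that should still be recognised
    for s in ['42', '+1,234,567.89', '-0.5e-3', '1,000E6']:
        assert new_num.search(s), s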