Pandas version checks
-
[X] I have checked that this issue has not already been reported.
-
[X] I have confirmed this issue exists on the latest version of pandas.
-
[X] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
import time
import pandas as pd
html_str = '''
<table>
<tr>
<th>Codes</th>
</tr>
<tr>
<td>41651,65125,17328,02872,49459,79208,38630,24723,13276,29613,55978,68885,73452,ABCDE</td>
</tr>
</table>
'''
print(f'\nNote: This takes over 500 seconds to complete loading !!')
start_time = time.time()
df1 = pd.read_html(html_str)[0]
end_time = time.time()
print(f'\nData: {df1}')
print(f'\nDuration : {int(end_time-start_time)} seconds')
Installed Versions
Prior Performance
No response
Comment From: VivaSrini
I also tried with converters and flavor options, but it still takes the same duration:
df1 = pd.read_html(html_str, converters={0: str}, flavor='html5lib')[0]
Comment From: ErdiTk
After running snakeviz on your code I see the following:
Then by debugging the code bottleneck appears on the call of:
self.num.search(x.strip())
in line 893 of the file python_parser.py
. To be fair I don't really know why the call takes so long, and it's only for the search in the following line of your text:
<td>41651,65125,17328,02872,49459,79208,38630,24723,13276,29613,55978,68885,73452,ABCDE</td>
However I tried to debug it by using of the regex
module instead of re
and the search function was completed immediately. Might be a possible solution, but I think it requires an extensive update of the __init__
method of PythonParser
.
Comment From: asishm
This is the regex being used i.e. self.num = re.compile('^[\\-\\+]?([0-9]+,|[0-9])*(\\.[0-9]*)?([0-9]?(E|e)\\-?[0-9]+)?$')
running it on regex101.com (https://regex101.com/r/Xup3Or/1) gives the following message
Catastrophic backtracking has been detected and the execution of your expression has been halted. To find out more and what this is, please read the following article: Runaway Regular Expressions
Comment From: Sjlver
If I understand correctly, the intent of that regex is to look for numbers, with optional ,
separators.
This should achieve the same effect with less backtracking:
self.num = re.compile(
r'^[\-\+]?' # Optional: Initial sign
+ r'([0-9]+(?:,[0-9]+)*)' # Digits. Optionally followed by groups of digits with comma separator
+ r'(\.[0-9]*)?' # Optional: period, followed by digits
+ r'((E|e)\-?[0-9]+)?$' # Optional: exponent
)
Note that the original had an optional digit before the exponent. This would never match, as far as I can tell.