Hi,

# Your code here
import pandas as pd
df = pd.read_html('/root/index.html')

Problem description

pandas' read_html() method cannot read a table tag with 30k or more rows. It keeps using CPU and memory on the machine but never returns any result.

Output of pd.show_versions()

>>> pandas.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 32
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.8
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: None
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.12
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

Best regards, Ramin

Comment From: gfyoung

@raminfp : Thanks for reporting this! Unfortunately, it is hard for us to understand what went wrong at this point. Could you please:

1) Provide code that we can use to replicate the issue (including the file used)?
2) Provide the error that you get when you try to read the file?

Comment From: raminfp

OK, please give me your email address and I will send my test-case HTML. Here is the code:

def html_2_excel(self):
    try:
        print "Starting read html ...."
        # read_html hangs here and never returns for the large file
        df = pd.read_html('/root/index.html')
        print "Starting convert html to excel ...."
        # build the output .xlsx path from the original .html file name
        excel_xlsx_name = self.defualtpath + "\\" + self.pPath.replace(".html", ".xlsx")
        writer = pd.ExcelWriter(excel_xlsx_name)
        # read_html returns a list of DataFrames; write the second one
        df[1].to_excel(writer, 'Sheet1')
        print "Finished"
        writer.save()
        return writer.path
    except Exception:
        raise

Comment From: gfyoung

@raminfp : It would be preferable if you could either post the HTML file on GitHub or host it on a document-sharing platform for us to view. Also, if you could post the error that you are seeing when you run this code (remove the try-except portion), that would be great as well.

Comment From: raminfp

I don't get any error message. When I use 30 rows, read_html works, but when the table has 30k rows I get no result at all from read_html (no error, nothing). I think it ends up in an infinite loop, but I'm not sure.

Comment From: gfyoung

@raminfp : Given what you said, I would prefer that you not share the HTML file with me or anyone else. Could you try to create an HTML table of your own (without sensitive data) that you could use to replicate this error? Also, if it's an infinite loop, then at some point, you should get an error message from Python.
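
A minimal sketch of how such a synthetic test case could be built (the row count, column names, and file name here are illustrative assumptions, not part of the original report):

import pandas as pd

# build a dummy table with ~30k rows of non-sensitive data
n_rows = 30000
dummy = pd.DataFrame({'id': range(n_rows), 'value': ['x'] * n_rows})

# write it out as an HTML <table> and try to read it back with read_html
with open('synthetic_table.html', 'w') as f:
    f.write(dummy.to_html(index=False))

tables = pd.read_html('synthetic_table.html')
print(tables[0].shape)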

Comment From: raminfp

@gfyoung Any update?

Comment From: gfyoung

@raminfp : I don't see a table in the HTML code that you attached. The ones generated via JavaScript don't count because they would get rendered only when you open it in a browser.

Comment From: raminfp

@gfyoung : Oops, sorry, I have sent the valid HTML file again. I mixed it up with another file of the same name on my desktop. Sorry again.

Comment From: raminfp

@gfyoung : I tested again with a test case of 30k rows in the table and got this error:


>>> import pandas as pd
>>> df = pd.read_html('C:\\Users\\samin\\Desktop\\index.html')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\pandas\io\html.py", line 906, in read_html
    keep_default_na=keep_default_na)
  File "C:\Python27\lib\site-packages\pandas\io\html.py", line 748, in _parse
    ret.append(_data_to_frame(data=table, **kwargs))
  File "C:\Python27\lib\site-packages\pandas\io\html.py", line 640, in _data_to_frame
    _expand_elements(body)
  File "C:\Python27\lib\site-packages\pandas\io\html.py", line 623, in _expand_elements
    body[ind] += empty * (lens_max - length)
MemoryError

test_case.zip

Comment From: gfyoung

Hmm...interesting. Not entirely sure at the moment why you would be running out of memory so quickly, given that the table isn't that large. Not sure if it's because you have a table inside another table when you're reading this, though I'm not sure why that should break anything.

I presume you have sufficient RAM to load all this data into memory, so I don't think that would be an issue. Have you tried reading this file on another computer (I'm on my phone at the moment)?
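
For what it's worth, the traceback points at pandas' internal _expand_elements step, which pads every row with empty strings until it matches the longest row in the table. A simplified sketch of that padding behaviour (not pandas' actual code) illustrates the mechanism:

rows = [['a', 'b'], ['c'], ['d', 'e', 'f']]
lens_max = max(len(row) for row in rows)
for row in rows:
    # every short row is extended to the length of the longest row
    row += [''] * (lens_max - len(row))
print(rows)  # [['a', 'b', ''], ['c', '', ''], ['d', 'e', 'f']]

If a nested table made one row unusually long, every other row would be padded out to that length, which could explain the memory blow-up, though that is only a guess at this point.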

Comment From: raminfp

@gfyoung : Yes, I tested on my server with 8 GB of RAM and the Python interpreter was killed:


Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd

>>>
>>> df = pd.read_html("/root/index.html")
Killed
root@ubuntu:~# python
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-21-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.2.7
Cython: None
numpy: 1.13.1
scipy: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.11
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

Comment From: gfyoung

Okay, good to know. Hard to tell at this point where the error is occurring, and why so much memory is being eaten up when reading a relatively small table. This will take some time to investigate.

Comment From: jreback

Closing as a user issue.

Comment From: raminfp

@gfyoung please explain what happened with the test case for you?

Comment From: gfyoung

@raminfp : Sorry for not getting back to you about this!

As far as I can tell, I seem to be able to read this table without issues.

Comment From: raminfp

@gfyoung : Oh, interesting. Please tell me: what OS? What Python version? What code?

Comment From: gfyoung

@raminfp : Windows 10, Python 3.6.2, using the same code you provided. However, I do get the MemoryError on another machine with similar specifications.

Comment From: raminfp

However, I do get the MemoryError on another machine with similar specifications.

@gfyoung : Why was the issue closed?

Comment From: gfyoung

Judging from @jreback's response: although I think this issue should be investigated, it's hard for us to determine whether this is an issue with the code or rather with the machine on which the code was executed, given that I was able to read the file successfully in one instance.

From a practical standpoint, can you try dividing up the table so that you can read it in chunks? You can also try reading it in with open directly and see if you can parse the table (albeit somewhat manually) from a stream of data.
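
A minimal Python 3 sketch of that manual, stream-oriented approach using the standard-library html.parser (the file path and the class name are illustrative assumptions):

from html.parser import HTMLParser

class TableRowCollector(HTMLParser):
    """Collect <td>/<th> text row by row instead of loading everything via read_html."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th'):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ('td', 'th'):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

parser = TableRowCollector()
with open('index.html') as f:  # hypothetical path
    for chunk in iter(lambda: f.read(64 * 1024), ''):  # feed the file in 64 KB chunks
        parser.feed(chunk)
parser.close()

print(len(parser.rows))

The collected rows can then be converted to a DataFrame or written out in chunks, bypassing read_html entirely.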

Comment From: raminfp

I am waiting for your answer. CC: @jreback

From a practical standpoint, can you try dividing up the table so that you can read it in chunks? You can also try reading it in with open directly and see if you can parse the table (albeit somewhat manually) from a stream of data.

Good point @gfyoung, I will try parsing the HTML myself.

Thanks again,

Comment From: raminfp

@gfyoung : Any answer? Please add the label "Won't fix".

Thanks

Comment From: gfyoung

@raminfp : I don't think there's any need to.