Pandas pd.read_html dropping duplicate tables in HTML

Code Sample, a copy-pastable example if possible

import pandas as pd

# Read every table in the local file, using the first row as the header
tables = pd.read_html('D:\\myhtml.html', header=0)
print(tables)

Problem description

The HTML code is linked below. Note that simple table scraping works fine, but on this exact file it does not work as expected.

The html code

I have a local HTML file with multiple tables. Sometimes the content of two tables is exactly the same, including headers. While reading the file through pandas, I noticed that when two tables are exactly identical, the second one is dropped as if it were not there. When I change one value in the second table, pandas reads the second table as well and displays it.

How can I stop pandas from doing that so that it reads every table? Or is this a feature or a bug? The exact HTML file is attached as a link. It contains 4 tables, yet I get values for only 3. The 2 big tables have exactly the same data, and only the first one is returned.

Environment: Anaconda 3.6
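One way to check the "4 tables in the file, 3 returned" claim independently of pandas is to count the `<table>` start tags directly. The sketch below uses only the standard library; `TableCounter`, `count_table_tags`, and the inline `sample` document are hypothetical names and data for illustration, not part of the attached file.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts every <table> start tag, including nested tables."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook
        if tag == "table":
            self.count += 1


def count_table_tags(html_text):
    """Return the number of <table> start tags found in html_text."""
    counter = TableCounter()
    counter.feed(html_text)
    counter.close()
    return counter.count


# Tiny stand-in document (hypothetical): two identical tables back to back
sample = (
    "<html><body>"
    "<table><tr><th>A</th></tr><tr><td>1</td></tr></table>"
    "<table><tr><th>A</th></tr><tr><td>1</td></tr></table>"
    "</body></html>"
)
n_tables = count_table_tags(sample)
print(n_tables)
```

For a real file, the text can be read with `open(path, encoding=...)` and the result compared with `len(pd.read_html(path, header=0))`. The two numbers may legitimately differ for nested tables or tables without any text, so a mismatch is a hint rather than proof of a parsing bug.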


Expected Output

Ideally it should show 4 tables, but only 3 are shown.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.6.0
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None

Comment From: gfyoung

@radioactive9 : Thanks for reporting the issue! A couple of points:

1) I see that you are using an old version of pandas. Can you try upgrading (per our note in the issue) and see if that resolves the issue?

2) Might you be able to provide a smaller example that replicates this error? The HTML code contains a lot of additional tags that I don't think are necessary for surfacing the problem.

Comment From: radioactive9

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.20.3
pytest: 3.0.5
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
pandas_gbq: None
pandas_datareader: None

I updated the pandas version, and it still does the same thing.
A simple duplicated table works as expected, but for whatever reason this one does not. Believe me, if I change just 1 value in the content of the duplicate table, it detects all 4 tables individually without a fuss. Something is messing up the parsing. After removing the and I could not replicate the problem. But the specific HTML file linked above is still affected :( . Now I am really confused and don't know how to handle this, as that HTML file is the one I need to process :(

Comment From: gfyoung

@radioactive9 : No worries. It's not that I don't believe you. I just haven't been able to access a machine where I can replicate this. That's why I'm trying to figure out how we can simplify your example and zoom in on the actual issue.

Can you try removing the two tables that aren't problematic and see if the problem persists?

Comment From: radioactive9

@gfyoung : Thank you for understanding. I really appreciate that you are taking the effort to read through this and trying to help. As you suggested, I removed two tables and kept just the two duplicate tables, one after another. It detects only one.

import pandas as pd
tables = pd.read_html('D:\\Script\\Python\\HTMLTOCSV_Python\\myhtmlV3.html', header=0)
for i in range(len(tables)):
    valtable = tables[i]
    print(len(valtable))

Here is the output

> runfile('C:/Users/.spyder-py3/temp.py', wdir='C:/Users/.spyder-py3')
18

This suggests that only one table was read and the second table was ignored.

The html file after removing two tables which was never a problem

The HTML File With 2 problem Tables

Comment From: gfyoung

Sure thing. Good that we can start to strip down your example file to get something simpler to replicate. I'm going to shamelessly copy a table from online and ask if you can replicate this issue just by pasting the following table twice between <html> and <body> tags as follows:

<table style="width:100%">
  <tr>
    <th>Firstname</th>
    <th>Lastname</th> 
    <th>Age</th>
  </tr>
  <tr>
    <td>Jill</td>
    <td>Smith</td> 
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td> 
    <td>94</td>
  </tr>
</table>
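The suggested experiment can be scripted so that it does not depend on a file on disk. This is a minimal sketch, assuming pandas plus an HTML parser such as lxml are installed; `build_page` and `read_tables` are hypothetical helper names, and the page wrapper is only one plausible reading of "between <html> and <body> tags".

```python
from io import StringIO

# The sample table from the comment above, copied verbatim
TABLE = """<table style="width:100%">
  <tr>
    <th>Firstname</th>
    <th>Lastname</th>
    <th>Age</th>
  </tr>
  <tr>
    <td>Jill</td>
    <td>Smith</td>
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td>
    <td>94</td>
  </tr>
</table>"""


def build_page(table_html, copies=2):
    """Wrap `copies` identical copies of table_html in a bare HTML page."""
    body = "\n".join([table_html] * copies)
    return "<html>\n<body>\n" + body + "\n</body>\n</html>\n"


def read_tables(html_text):
    """Parse html_text with pandas and return the list of DataFrames."""
    # Imported lazily so the page builder works without pandas installed
    import pandas as pd
    return pd.read_html(StringIO(html_text), header=0)


page = build_page(TABLE, copies=2)
```

Calling `len(read_tables(page))` then shows how many of the two identical tables pandas returns. Raising `copies` or swapping in a table cut from the problem file gives a quick way to narrow down what triggers the missing table.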

Comment From: radioactive9

Out[6]:
[  Firstname Lastname  Age
 0      Jill    Smith   50
 1       Eve  Jackson   94,
   Firstname Lastname  Age
 0      Jill    Smith   50
 1       Eve  Jackson   94]

This one came out perfectly OK, which suggests the problem has something to do with my HTML code.

Honestly, I tried removing 3 rows from each of the duplicate tables in the problem file, and it works. Also, if I change even 1 value in any row of the first table, it works too.

But the exact example that I attached above as a link doesn't work. Is it working on your system? Please help.
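Since the behaviour seems to depend on two tables being exactly identical, it may help to confirm which tables in a file really are identical before bisecting further. The following is a rough standard-library sketch, not part of pandas: `TableTextCollector`, `table_fingerprints`, and `duplicate_groups` are hypothetical names, the `sample` markup is made up, and the code assumes well-formed tables that are not nested.

```python
import hashlib
from html.parser import HTMLParser


class TableTextCollector(HTMLParser):
    """Collects the cell text of each <table>; assumes tables are not nested."""

    def __init__(self):
        super().__init__()
        self.tables = []       # one list of cell strings per table
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag in ("td", "th") and self.tables:
            self._in_cell = True
            self.tables[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.tables[-1][-1] += data.strip()


def table_fingerprints(html_text):
    """Return one SHA-256 digest per table, computed over its cell text."""
    collector = TableTextCollector()
    collector.feed(html_text)
    collector.close()
    return [
        hashlib.sha256("\x1f".join(cells).encode("utf-8")).hexdigest()
        for cells in collector.tables
    ]


def duplicate_groups(html_text):
    """Return lists of table positions that share identical cell text."""
    groups = {}
    for index, digest in enumerate(table_fingerprints(html_text)):
        groups.setdefault(digest, []).append(index)
    return [positions for positions in groups.values() if len(positions) > 1]


# Hypothetical sample: tables 0 and 2 are identical, table 1 differs
sample = (
    "<table><tr><th>A</th></tr><tr><td>1</td></tr></table>"
    "<table><tr><th>B</th></tr><tr><td>2</td></tr></table>"
    "<table><tr><th>A</th></tr><tr><td>1</td></tr></table>"
)
dupes = duplicate_groups(sample)
print(dupes)
```

The fingerprint ignores attributes and markup inside cells, so two tables that differ only in styling are still reported as duplicates.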

Comment From: radioactive9

@gfyoung Did you get a chance to look into the link I shared above? Sorry for bumping the thread.

Comment From: gfyoung

@radioactive9 : Sorry! My computer has been acting up lately, so I haven't been able to take a look.

cc @TomAugspurger

Comment From: gitgithan

@radioactive9

Is this issue still occurring? I have tested your original big table and I can properly read all 4 tables. Below are my versions:

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.8.0
pip: 18.1
setuptools: 40.2.0
Cython: 0.28.5
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: 1.7.9
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 2.2.3
openpyxl: 2.5.6
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.0
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Comment From: WillAyd

Doesn't look like this was ever reproducible. @radioactive9 feel free to reopen if that's not the case
