Hi, guys! First of all, thanks for making this open source! =D Keep up the good work!
#### Script
import pandas as pd
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States')
print(dfs[0])
In the output, wherever a cell has a colspan, only the first spanned column receives the value. The remaining values in that row are shifted left by the colspan amount, which leaves "NaN" in the columns at the end of the row.
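The shift can be illustrated without pandas. The sketch below is a toy model that assumes the parser takes one value per cell and ignores colspan; it is not pandas' actual code path, and the row contents are hypothetical placeholders.

```python
import math

# Hypothetical header and one table row, each cell given as (text, colspan).
# "Phoenix" spans two columns (capital and largest city).
header = ["State", "Capital", "Largest city", "Other"]
row = [("Arizona", 1), ("Phoenix", 2), ("tail value", 1)]


def naive_row(cells, width):
    """Take one value per cell, ignore colspan, and pad the tail with NaN."""
    values = [text for text, _span in cells]
    return values + [math.nan] * (width - len(values))


observed = naive_row(row, len(header))
print(observed)
```

Here "tail value" lands under "Largest city" instead of "Other", and the last column is padded with NaN, which matches the misalignment described above.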
#### Expected Output
On this web page, the "capital" and "largest city" columns are merged with a colspan when both hold the same city. In general, I think we would expect the value to be duplicated in the following columns.
#### Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 15 Stepping 13, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: 0.2.1
Comment From: brianhuey
I believe @adailsonfilho means that there are colspan attributes on the columns labeled "Cities" and "Area in mi2 (km2)[B][16]" in the first header row. The solution would be that, for any cell with a colspan value of n, the cell text is repeated across the next n-1 columns of the same row. This issue applies to any td or th tag, regardless of whether it is located in thead or tbody (see the cell with the text "Phoenix" in the third row of the table referenced above).
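A minimal sketch of that idea, using only the standard library's `html.parser` rather than pandas internals. The class name `ColspanTableParser` and the sample HTML are hypothetical, made up for illustration; rowspan and nested tables are not handled.

```python
from html.parser import HTMLParser


class ColspanTableParser(HTMLParser):
    """Collect table rows, repeating each cell's text across its colspan."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row being built, or None outside <tr>
        self._span = 1     # colspan of the current cell
        self._text = None  # text chunks of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            span = dict(attrs).get("colspan") or "1"
            # Fall back to 1 for missing or non-numeric colspan values.
            self._span = int(span) if span.isdigit() else 1
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            text = "".join(self._text).strip()
            # One copy for the cell itself plus n-1 copies for the spanned columns.
            self._row.extend([text] * max(self._span, 1))
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._text = None


# Illustrative table: both a th and a td carry colspan="2".
html = """
<table>
  <tr><th>State</th><th colspan="2">Cities</th></tr>
  <tr><td>Arizona</td><td colspan="2">Phoenix</td></tr>
  <tr><td>Alabama</td><td>Montgomery</td><td>Birmingham</td></tr>
</table>
"""
parser = ColspanTableParser()
parser.feed(html)
rows = parser.rows
for r in rows:
    print(r)
```

Every row comes out with the same width, so header and body cells stay aligned whether the colspan sits on a th or a td.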
Comment From: chris-b1
closing in favor of #17054