Code Sample, a copy-pastable example if possible
import pandas as pd
import html5lib
import requests
from lxml import html
from lxml import etree
url = "https://en.wikipedia.org/wiki/Comparison_of_single-board_computers"
xpath = "//*[@id=\"mw-content-text\"]/table"
res = requests.get(url)
tree = html.fromstring(res.content)
tables = tree.xpath(xpath)
# for index, t in enumerate(tables):
# print("Table index [%s]" % index)
# print(t.xpath('descendant-or-self::th/text()'))
# print
headers = ['Name', 'PCIe',
'USB 2.0', 'USB 3.0', 'USB devices',
'Storage on-board', 'Storage flash slots', 'Storage SATA',
'Networking ETH', 'Networking WiFi',
'Communication bluetooth', 'Communication I2C', 'Communication SPI',
'Generic I/O GPIO', 'Generic I/O Analog',
'Other interfaces']
# The table has 2 unnecessary rows at the beginning and 2 at the end.
# The original header was too complex for the parser, so it is removed
# and replaced with `headers` below.
dta = pd.read_html(html.tostring(tables[2]), skiprows=2)[0][:-2]
dta.columns = headers
dta.sort_values(by=['Name'])[61:]
Problem description
Roughly from row index 69 onward, the values appear to be mixed up: they look as if they are shifted to the left. This could be caused either by the parsing library or by something specific to this Wikipedia table.
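One way to narrow this down is to count the cells in each table row and list any `rowspan`/`colspan` attributes, since a row with fewer cells than the header would produce exactly this kind of left shift if the parser ignores spans. The following is only a diagnostic sketch under that assumption. `RowShapeParser`, `row_shapes`, and the miniature `sample` table are hypothetical and are not the Wikipedia markup. It uses the standard-library `html.parser` so that it runs without extra dependencies, and it assumes a single table with no nested tables.

```python
from html.parser import HTMLParser


class RowShapeParser(HTMLParser):
    """Record, for every <tr>, how many cells it has and which spans it uses."""

    def __init__(self):
        super().__init__()
        # One dict per <tr>: {"cells": int, "spans": [(attribute, value), ...]}
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append({"cells": 0, "spans": []})
        elif tag in ("td", "th") and self.rows:
            row = self.rows[-1]
            row["cells"] += 1
            for name, value in attrs:
                # Only report spans that actually cover more than one cell
                if name in ("rowspan", "colspan") and value not in (None, "1"):
                    row["spans"].append((name, value))


def row_shapes(table_html):
    """Return the per-row cell counts and span attributes of one HTML table."""
    parser = RowShapeParser()
    parser.feed(table_html)
    return parser.rows


# Hypothetical miniature table (assumption): "Board B" inherits the USB 2.0
# cell from the row above, so its <tr> carries only 3 cells instead of 4.
sample = """
<table>
  <tr><th>Name</th><th>USB 2.0</th><th>USB 3.0</th><th>WiFi</th></tr>
  <tr><td>Board A</td><td rowspan="2">2</td><td>No</td><td>b/g/n</td></tr>
  <tr><td>Board B</td><td>No</td><td>b/g/n</td></tr>
</table>
"""

shapes = row_shapes(sample)
for index, row in enumerate(shapes):
    print(index, row["cells"], row["spans"])
```

To run it against the real table, the string returned by `html.tostring(tables[2]).decode()` could be passed to `row_shapes` in place of `sample`. Rows whose cell count differs from the header's are the ones to inspect.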
Expected output
For example, "Orange Pi Lite" should have USB 3.0 == "No". Instead, that column contains "b/g/n (RTL8189FTV)", a value that should be shifted 5 columns to the right.
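If the shift does turn out to come from `rowspan`/`colspan` cells (an assumption, not something confirmed in this report), one possible workaround is to expand the spans into a rectangular grid before building the DataFrame. The sketch below illustrates that idea on a hypothetical miniature table. `SpanGridParser` and `expand_spans` are made-up helper names, the code uses only the standard library, and it assumes a single table with no nested tables and numeric span values.

```python
from html.parser import HTMLParser


class SpanGridParser(HTMLParser):
    """Parse one flat <table> into rows of (text, rowspan, colspan) cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # [text parts, rowspan, colspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = [[], int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            self.rows[-1].append(("".join(parts).strip(), rowspan, colspan))
            self._cell = None


def expand_spans(rows):
    """Repeat spanned cell text so that every row gets one value per column."""
    grid = []
    pending = {}  # column index -> (text, number of further rows to fill)
    for cells in rows:
        out, queue, col = [], list(cells), 0
        while queue or any(c >= col for c in pending):
            if col in pending:
                # This column is covered by a rowspan from an earlier row
                text, left = pending[col]
                out.append(text)
                if left > 1:
                    pending[col] = (text, left - 1)
                else:
                    del pending[col]
                col += 1
            elif queue:
                text, rowspan, colspan = queue.pop(0)
                for _ in range(colspan):
                    out.append(text)
                    if rowspan > 1:
                        pending[col] = (text, rowspan - 1)
                    col += 1
            else:
                out.append("")  # gap before a carried-over cell
                col += 1
        grid.append(out)
    return grid


# Hypothetical miniature table (assumption), not the Wikipedia markup
sample = """
<table>
  <tr><th>Name</th><th>USB 2.0</th><th>USB 3.0</th><th>WiFi</th></tr>
  <tr><td>Board A</td><td rowspan="2">2</td><td>No</td><td>b/g/n</td></tr>
  <tr><td>Board B</td><td>No</td><td>b/g/n</td></tr>
</table>
"""

parser = SpanGridParser()
parser.feed(sample)
grid = expand_spans(parser.rows)
for row in grid:
    print(row)
```

The resulting `grid` could then be turned into a DataFrame with `pd.DataFrame(grid[1:], columns=grid[0])`. Without the expansion step, the "Board B" row would have only three values, and "No" would land under USB 2.0 instead of USB 3.0.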
Output of pd.show_versions()
html5lib installed using pip
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-28-generic
machine: x86_64
processor:
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 32.3.1
Cython: 0.23.5
numpy: 1.10.4
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.7.1
bs4: 4.5.1
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: 0.2.1
Comment From: jreback
This is not a pandas issue. You might have better luck on Stack Overflow.