related to #12292
Trying to get values from only the first column, which contains empty cells:
import pandas as pd
df = pd.DataFrame([["", 1, 100], [3, 2, 200], ["", 3, 300], ["", "", 400]])
df.to_excel("test_excel.xls", index=False, header=False)
fst_col = pd.read_excel("test_excel.xls", parse_cols=[0], header=None).values
fst_cols = pd.read_excel("test_excel.xls", parse_cols=[0,1], header=None).values
print(fst_col)
print("V.S.")
print(fst_cols)
...[Out]
[[3]]
V.S.
[[ nan 1.]
[ 3. 2.]
[ nan 3.]
[ nan nan]]
Different results when reading the same column. Shouldn't the output for the first column be the same in both cases, so that reading data generalizes? How can I get the full data (including empty values) from the first column using parse_cols=[0]?
Expected Output
[[ nan ]
[ 3. ]
[ nan ]
[ nan ]]
pandas: 0.18.1 xlrd: 0.9.4 python: 2.7.7.final.0
Comment From: jreback
easier to read example
In [7]: pd.read_excel('test_excel.xls',header=None)
Out[7]:
0 1 2
0 NaN 1.0 100
1 3.0 2.0 200
2 NaN 3.0 300
3 NaN NaN 400
In [8]: pd.read_excel?
In [9]: pd.read_excel('test_excel.xls',header=None,parse_cols=[0])
Out[9]:
0
0 3
In [10]: pd.read_excel('test_excel.xls',header=None,parse_cols=[0,1])
Out[10]:
0 1
0 NaN 1.0
1 3.0 2.0
2 NaN 3.0
3 NaN NaN
[9] does look buggy.
Comment From: jreback
@chris-b1 as you seem to be the latent excel expert!
also, IIRC we had an issue about deprecating parse_cols in favor of usecols?
Comment From: chris-b1
xref #4988 for usecols. This does look buggy.
Personally, I never use parse_cols - there isn't much of a performance benefit, so I just read the whole frame and then select.
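The "read the whole frame and then select" workaround can be sketched as follows. To keep the snippet runnable without an Excel engine it uses read_csv on an in-memory string with the same shape as the reporter's data (the column names here are made up):

```python
from io import StringIO

import pandas as pd

# Same shape as the reporter's data, but as CSV text so no Excel engine
# is needed. None of the lines is entirely blank, so nothing is skipped.
data = "a,b,c\n,1,100\n3,2,200\n,3,300\n,,400\n"

# Read everything, then select: all four rows of column "a" survive,
# with NaN where the cells were empty.
full = pd.read_csv(StringIO(data))
first = full[["a"]]
print(first)
```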
Comment From: stanleyguan
I'm taking a stab at this (from PyCon Sprint).
Comment From: jorisvandenbossche
From discussion on gitter: this is caused by the fact that the TextParser by default skips empty lines. This is also the default behaviour of read_csv, but there it can be controlled by skip_blank_lines:
In [10]: s = """a
...: 1
...:
...: 2
...: """
In [11]: pd.read_csv(StringIO(s))
Out[11]:
a
0 1
1 2
In [12]: pd.read_csv(StringIO(s), skip_blank_lines=False)
Out[12]:
a
0 1.0
1 NaN
2 2.0
So possible options to go forward here:
- regard the current behaviour of read_excel (the one reported here, with the difference between parsing a single or multiple columns if you have missing values) as the desired behaviour (mimicking read_csv), but then add skip_blank_lines as a new keyword arg to have the option to change this
- decide we want to fix this and have read_excel never skip blank lines. Then this option could just be added to TextParser to enable read_excel to specify the behaviour.
@jreback @chris-b1 any preferences?
Comment From: stanleyguan
@jorisvandenbossche Thanks for moving the conversation over.
Some more illustration with TextParser:
In [4]: data = [['foo', 3], ['bar', ''], ['', 2], ['foobar', '']]
In [5]: TextParser(data, header=None).read()
Out[5]:
0 1
0 foo 3.0
1 bar NaN
2 NaN 2.0
3 foobar NaN
In [6]: TextParser([[row[0]] for row in data], header=None).read()
Out[6]:
0
0 foo
1 bar
2 foobar
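For contrast, a sketch of the parse-first-then-select order, assuming the same data as in the illustration above: parsing the full rows before selecting keeps the row that is blank only in column 0.

```python
import pandas as pd
from pandas.io.parsers import TextParser

data = [['foo', 3], ['bar', ''], ['', 2], ['foobar', '']]

# Parse all columns first, then select column 0: the row that is empty
# only in that column survives as NaN instead of being dropped.
full = TextParser(data, header=None).read()
first = full[[0]]
print(first)
```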
Comment From: chris-b1
It seems like read_csv might do this in a different order - skipping lines only if the entire line is blank, then selecting columns - is that reasonable here?
pd.read_csv(StringIO("""a,b
,1
2,"""), usecols=['a'])
Out[49]:
a
0 NaN
1 2.0
Comment From: chris-b1
Actually, since TextParser now supports usecols, you probably could rip most of the column-specific logic out of read_excel and defer it to there, which would replicate read_csv.
from pandas.io.parsers import TextParser
TextParser([['a', 'b'], [1, ''], ['', 2]], usecols='a').read()
Out[54]:
a
0 1.0
1 NaN
Comment From: jorisvandenbossche
Yes, that looks like a good way to go, as the behaviour of read_csv (to only skip a line when the full line is missing, even when you select only some columns) certainly makes sense.
Comment From: stanleyguan
Deferring the logic to TextParser does mean that we will do unnecessary parsing for the columns that will be discarded. Is that acceptable?
Comment From: chris-b1
Yes, that's fine - because of the way excel data is structured (row-oriented XML) the whole file has to be parsed anyway. It might add a little overhead, but that's small relative to the overall reading time.
df = pd.DataFrame(np.random.randn(1000, 10), columns=list('qwertyuasd'))
df.to_excel('tmp.xlsx', index=False)
%timeit pd.read_excel('tmp.xlsx')
10 loops, best of 3: 111 ms per loop
%timeit pd.read_excel('tmp.xlsx', parse_cols='a')
10 loops, best of 3: 104 ms per loop
Comment From: gfyoung
Not sure when this was patched, but we now test parse_cols (now called usecols) extensively, and this is no longer a bug. 🎉
Closing.