related to #12292
Trying to get values from only the first column, which contains empty cells:
import pandas as pd
df = pd.DataFrame([["", 1, 100], [3, 2, 200], ["", 3, 300], ["", "", 400]])
df.to_excel("test_excel.xls", index=False, header=False)
fst_col = pd.read_excel("test_excel.xls", parse_cols=[0], header=None).values
fst_cols = pd.read_excel("test_excel.xls", parse_cols=[0,1], header=None).values
print(fst_col)
print("V.S.")
print(fst_cols)
...[Out]
[[3]]
V.S.
[[ nan 1.]
[ 3. 2.]
[ nan 3.]
[ nan nan]]
Different results when reading the same column. Shouldn't the output for the first column be the same in both cases, so that reading data generalizes? How can I get the full data (including empty values) from the first column using parse_cols=[0]?
Expected Output
[[ nan ]
[ 3. ]
[ nan ]
[ nan ]]
pandas: 0.18.1 xlrd: 0.9.4 python: 2.7.7.final.0
Comment From: jreback
easier to read example
In [7]: pd.read_excel('test_excel.xls',header=None)
Out[7]:
0 1 2
0 NaN 1.0 100
1 3.0 2.0 200
2 NaN 3.0 300
3 NaN NaN 400
In [8]: pd.read_excel?
In [9]: pd.read_excel('test_excel.xls',header=None,parse_cols=[0])
Out[9]:
0
0 3
In [10]: pd.read_excel('test_excel.xls',header=None,parse_cols=[0,1])
Out[10]:
0 1
0 NaN 1.0
1 3.0 2.0
2 NaN 3.0
3 NaN NaN
[9] does look buggy.
Comment From: jreback
@chris-b1 as you seem to be the latent excel expert!
also, IIRC we had an issue about deprecating parse_cols in favor of usecols?
Comment From: chris-b1
xref #4988 for usecols. This does look buggy.
Personally, I never use parse_cols - there isn't much of a performance benefit, so I just read the whole frame and then select.
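The "read the whole frame and then select" workaround can be sketched as follows. To keep the snippet runnable without an Excel engine it uses read_csv on an in-memory string with the same shape as the reporter's data (the column names here are made up):

```python
from io import StringIO

import pandas as pd

# Same shape as the reporter's data, but as CSV text so no Excel engine
# is needed. None of the lines is entirely blank, so nothing is skipped.
data = "a,b,c\n,1,100\n3,2,200\n,3,300\n,,400\n"

# Read everything, then select: all four rows of column "a" survive,
# with NaN where the cells were empty.
full = pd.read_csv(StringIO(data))
first = full[["a"]]
print(first)
```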
Comment From: stanleyguan
I'm taking a stab at this (from PyCon Sprint).
Comment From: jorisvandenbossche
From discussion on gitter: this is caused by the fact that the TextParser by default skips empty lines. This is also the default behaviour of read_csv, but there it can be controlled by skip_blank_lines:
In [10]: s = """a
...: 1
...:
...: 2
...: """
In [11]: pd.read_csv(StringIO(s))
Out[11]:
a
0 1
1 2
In [12]: pd.read_csv(StringIO(s), skip_blank_lines=False)
Out[12]:
a
0 1.0
1 NaN
2 2.0
So possible options to go forward here:
- regard the current behaviour of read_excel (the one reported here, with the difference between parsing a single or multiple columns if you have missing values) as the desired behaviour (mimicking read_csv), but then add skip_blank_lines as a new keyword arg to have the option to change this
- decide we want to fix this and have read_excel never skip blank lines. Then this option could just be added to TextParser to enable read_excel to specify the behaviour.
@jreback @chris-b1 any preferences?
Comment From: stanleyguan
@jorisvandenbossche Thanks for moving the conversation over.
Some more illustration with TextParser:
In [4]: data = [['foo', 3], ['bar', ''], ['', 2], ['foobar', '']]
In [5]: TextParser(data, header=None).read()
Out[5]:
0 1
0 foo 3.0
1 bar NaN
2 NaN 2.0
3 foobar NaN
In [6]: TextParser([[row[0]] for row in data], header=None).read()
Out[6]:
0
0 foo
1 bar
2 foobar
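For contrast, a sketch of the parse-first-then-select order, assuming the same data as in the illustration above: parsing the full rows before selecting keeps the row that is blank only in column 0.

```python
import pandas as pd
from pandas.io.parsers import TextParser

data = [['foo', 3], ['bar', ''], ['', 2], ['foobar', '']]

# Parse all columns first, then select column 0: the row that is empty
# only in that column survives as NaN instead of being dropped.
full = TextParser(data, header=None).read()
first = full[[0]]
print(first)
```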
Comment From: chris-b1
It seems like read_csv might do this in a different order - skipping lines only if the entire line is blank, then selecting columns - is that reasonable here?
pd.read_csv(StringIO("""a,b
,1
2,"""), usecols=['a'])
Out[49]:
a
0 NaN
1 2.0
Comment From: chris-b1
Actually, since TextParser now supports usecols, you probably could rip most of the column-specific logic out of read_excel and defer it to there, which would replicate read_csv.
from pandas.io.parsers import TextParser
TextParser([['a', 'b'], [1, ''], ['', 2]], usecols='a').read()
Out[54]:
a
0 1.0
1 NaN
Comment From: jorisvandenbossche
Yes, that looks like a good way to go, as the behaviour of read_csv (to only skip a line when the full line is missing, even when you select only some columns) certainly makes sense.
Comment From: stanleyguan
Deferring the logic to TextParser does mean that we will do unnecessary parsing for the columns that will be discarded. Is that acceptable?
Comment From: chris-b1
Yes, that's fine - because of the way excel data is structured (row-oriented XML) the whole file has to be parsed anyway. It might add a little overhead, but that's small relative to the overall reading time.
df = pd.DataFrame(np.random.randn(1000, 10), columns=list('qwertyuasd'))
df.to_excel('tmp.xlsx', index=False)
%timeit pd.read_excel('tmp.xlsx')
10 loops, best of 3: 111 ms per loop
%timeit pd.read_excel('tmp.xlsx', parse_cols='a')
10 loops, best of 3: 104 ms per loop
Comment From: gfyoung
Not sure when this was patched, but we now test parse_cols (now called usecols) extensively, and this is no longer a bug. 🎉
Closing.