Pandas UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb with read_excel function

import pandas as pd

data = pd.read_excel(open("2016-07-12_Zuwendungsbericht_2015_OpenData.xlsx"), sheetname="Zuwendungsbericht", encoding="utf-8")

I get the error below when I try to read this Excel file: http://transparenz.bremen.de/sixcms/media.php/13/2016-07-12_Zuwendungsbericht_2015_OpenData.xlsx

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 14: invalid start byte

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-57-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: de_DE.UTF-8
pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.11.3
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

Comment From: jorisvandenbossche

This works fine for me, even without specifying the encoding:

In [11]: pd.read_excel('http://transparenz.bremen.de/sixcms/media.php/13/2016-07-12_Zuwendungsbericht_2015_OpenData.xlsx', sheetname="Zuwendungsbericht").head()
Out[11]: 
              Ressort                                Zuwendungsempfänger  \
0                 NaN                                                NaN   
1  03 - Senatskanzlei  acompa. – Begleitungsgruppe für Flüchtlinge un...   
2  03 - Senatskanzlei  Aktion Kultur und Freizeit Huchting und Grolla...   
3  03 - Senatskanzlei  Aktion Kultur und Freizeit Huchting und Grolla...   
4  03 - Senatskanzlei  Aktive Menschen Bremen eingetragener Verein (A...   

  Haushalts-stelle(n)                                    Zuwendungszweck  \
0                 NaN                                                NaN   
1        3020.68400-2                      Begleitung von Behördengängen   
2        3020.68400-2      Sofortprogramm Flüchtlinge in den Stadtteilen   
3        3020.68400-2  Einkaufsfahrten, Begleitung, Dolmetscherdienst...   
4        3020.68400-2      Sofortprogramm Flüchtlinge in den Stadtteilen   

   Institutionelle Zuwendungen Bremens  Unnamed: 5 Unnamed: 6  \
0                               2014.0      2015.0  Veränd. %   
1                                  0.0         0.0        NaN   
2                                  0.0         0.0        NaN   
3                                  0.0         0.0        NaN   
4                                  0.0         0.0        NaN   

   Projekt-förderungen Bremens  Unnamed: 8 Unnamed: 9  \
0                       2014.0      2015.0  Veränd. %   
1                        300.0         0.0       -100   
2                          0.0       800.0        NaN   
3                        750.0         0.0       -100   
4                          0.0       500.0        NaN   

   institutionelle Förderung / Projektförderung Dritter  Unnamed: 11  \
0                                             2014.0          2015.0   
1                                                0.0             0.0   
2                                                0.0             0.0   
3                                                0.0             0.0   
4                                                0.0             0.0   

  Unnamed: 12 Finan\nzie-\nrungs\nart       Stadtteil  
0   Veränd. %                     NaN             NaN  
1           0                      FB        Neustadt  
2           0                      FB        Huchting  
3           0                      FB        Huchting  
4           0                       V  Woltmershausen

So you can leave out the open, and then it should work fine by default.

Comment From: jorisvandenbossche

@GiantCrocodile To clarify a bit: an xlsx file is a binary file, but open defaults to text mode, so it tries to decode the file's bytes as text (UTF-8 in your locale). That decoding fails as soon as read_excel reads from the handle. If you want to use open (which is not needed in this case, as pandas opens the file for you when you pass a path), you can do open(path, mode='rb').
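The difference between the two modes can be reproduced without pandas. The sketch below is only an illustration: FAKE_XLSX_BYTES is a made-up stand-in for a real workbook (it starts with the ZIP signature and has a 0xbb byte at position 14), and read_as_text and read_as_binary are hypothetical helper names.

```python
import os
import tempfile

# Made-up stand-in for a real .xlsx file: "PK\x03\x04" is the ZIP signature,
# followed by filler bytes and a 0xbb byte at position 14.
FAKE_XLSX_BYTES = b"PK\x03\x04" + bytes(10) + b"\xbb" + bytes(5)


def read_as_text(path):
    """Open in text mode and decode as UTF-8, like open(path) on a UTF-8 locale."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def read_as_binary(path):
    """Open in binary mode so the bytes come back undecoded."""
    with open(path, mode="rb") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fake.xlsx")
    with open(path, "wb") as handle:
        handle.write(FAKE_XLSX_BYTES)

    try:
        read_as_text(path)
        text_mode_failed = False
    except UnicodeDecodeError as exc:
        # 0xbb cannot start a UTF-8 sequence, so decoding fails.
        text_mode_failed = True
        print("text mode failed:", exc)

    # Binary mode performs no decoding, so the raw bytes are returned intact.
    binary_content = read_as_binary(path)
    print("binary mode read", len(binary_content), "bytes")
```

With a real workbook, a handle opened with mode='rb' (or simply the path) is what read_excel expects.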

Comment From: GiantCrocodile

Thanks @jorisvandenbossche for helping me! The approach I used was posted on Stack Overflow as a solution, but it is obviously wrong. With your suggestion it works now!

A quick question: if I want to specify an encoding, should I do it the way I did in my example? If so, I'm curious why the encoding parameter isn't mentioned in the docs at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html

Comment From: jorisvandenbossche

I am not familiar enough with Excel to know how its encoding works, but remember that it is not a 'normal' text file, so in most cases you won't need to specify it. Specifically for read_excel, the encoding parameter is not passed through to the actual reading of the Excel file; it is only passed on to the parsing step afterwards (kwds here: https://github.com/pandas-dev/pandas/blob/9947a99d04b87e7441f062d6b5eca281ef15deb7/pandas/io/excel.py#L513). But see e.g. http://xlrd.readthedocs.io/en/latest/unicode.html. xlrd is the library used for reading the Excel files. It seems to have an encoding_override keyword, but I think read_excel does not support it at the moment.
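As a rough illustration of why an explicit encoding is rarely needed for xlsx: the format is a ZIP package of XML parts, and each XML part declares its own encoding. The sketch below builds a tiny, heavily simplified stand-in for such a package in memory (a real workbook has more parts and uses XML namespaces; SHARED_STRINGS_XML is made up for this example) and lets the XML parser pick up the declared encoding.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for one XML part of an xlsx package.
SHARED_STRINGS_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<sst><si><t>Zuwendungsempfänger</t></si></sst>"
).encode("utf-8")

# Write the part into an in-memory ZIP archive, mimicking the xlsx container.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as package:
    package.writestr("xl/sharedStrings.xml", SHARED_STRINGS_XML)

# Read the part back as raw bytes; no text decoding happens at the ZIP level.
buffer.seek(0)
with zipfile.ZipFile(buffer) as package:
    raw_part = package.read("xl/sharedStrings.xml")

# The XML parser takes the encoding from the declaration inside the part.
root = ET.fromstring(raw_part)
first_string = root.find("si/t").text
print(first_string)
```

The caller never passes an encoding here; the information travels inside the file itself.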

Do you have an example where you need to specify the encoding?

Comment From: GiantCrocodile

I don't have an example where I need to specify the encoding. I've just started using pandas and I'm used to specifying the encoding, so I tried it here, but it doesn't matter to me as long as it works without it.