import pandas as pd
data = pd.read_excel(open("2016-07-12_Zuwendungsbericht_2015_OpenData.xlsx"), sheetname="Zuwendungsbericht", encoding="utf-8")
I'm getting this error when I try to read this excel file: http://transparenz.bremen.de/sixcms/media.php/13/2016-07-12_Zuwendungsbericht_2015_OpenData.xlsx .
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 14: invalid start byte
Comment From: jorisvandenbossche
This works fine for me, even without specifying the encoding:
In [11]: pd.read_excel('http://transparenz.bremen.de/sixcms/media.php/13/2016-07-12_Zuwendungsbericht_2015_OpenData.xlsx', sheetname="Zuwendungsbericht").head()
Out[11]:
Ressort Zuwendungsempfänger \
0 NaN NaN
1 03 - Senatskanzlei acompa. – Begleitungsgruppe für Flüchtlinge un...
2 03 - Senatskanzlei Aktion Kultur und Freizeit Huchting und Grolla...
3 03 - Senatskanzlei Aktion Kultur und Freizeit Huchting und Grolla...
4 03 - Senatskanzlei Aktive Menschen Bremen eingetragener Verein (A...
Haushalts-stelle(n) Zuwendungszweck \
0 NaN NaN
1 3020.68400-2 Begleitung von Behördengängen
2 3020.68400-2 Sofortprogramm Flüchtlinge in den Stadtteilen
3 3020.68400-2 Einkaufsfahrten, Begleitung, Dolmetscherdienst...
4 3020.68400-2 Sofortprogramm Flüchtlinge in den Stadtteilen
Institutionelle Zuwendungen Bremens Unnamed: 5 Unnamed: 6 \
0 2014.0 2015.0 Veränd. %
1 0.0 0.0 NaN
2 0.0 0.0 NaN
3 0.0 0.0 NaN
4 0.0 0.0 NaN
Projekt-förderungen Bremens Unnamed: 8 Unnamed: 9 \
0 2014.0 2015.0 Veränd. %
1 300.0 0.0 -100
2 0.0 800.0 NaN
3 750.0 0.0 -100
4 0.0 500.0 NaN
institutionelle Förderung / Projektförderung Dritter Unnamed: 11 \
0 2014.0 2015.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0
Unnamed: 12 Finan\nzie-\nrungs\nart Stadtteil
0 Veränd. % NaN NaN
1 0 FB Neustadt
2 0 FB Huchting
3 0 FB Huchting
4 0 V Woltmershausen
So you can leave out the open call, and then it should work fine by default.
Comment From: jorisvandenbossche
@GiantCrocodile To clarify a bit: an xlsx file is a binary file, while open will by default try to read it as a text file and pass this on to read_excel, hence this fails to read it. If you want to use open (which is not needed in this case, as pandas automatically opens the file for you), you can do open(path, mode='rb').
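The difference between text mode and binary mode can be sketched without pandas or the real workbook. This is only an illustration: the bytes below are made-up stand-ins (a ZIP-style signature followed by a 0xbb byte), not the contents of the file from the report, and the names payload, path, text_mode_error and raw are hypothetical.

```python
import os
import tempfile

# Stand-in bytes, NOT a real workbook: a ZIP-style signature, some padding,
# and a byte (0xbb) that cannot start a UTF-8 sequence.
payload = b"PK\x03\x04" + b"\x00" * 10 + b"\xbb\xff"

with tempfile.NamedTemporaryFile(suffix=".xlsx", delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Text mode: Python tries to decode the bytes as UTF-8 and fails.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        text_mode_error = None
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Binary mode: the bytes come back untouched, no decoding happens.
    with open(path, mode="rb") as f:
        raw = f.read()
finally:
    os.remove(path)

print(type(text_mode_error).__name__)
print(raw == payload)
```

A handle opened with mode='rb' in this way is the kind of object the comment above suggests passing on to read_excel, although passing the path directly is simpler.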
Comment From: GiantCrocodile
Thanks @jorisvandenbossche for helping me! What I did was posted on StackOverflow as a solution. Obviously it was wrong; with what you've suggested it works now!
A quick question: if I want to add an encoding specification, should I do it the way I did in my example? If so, I'm curious why the encoding parameter isn't mentioned in the docs at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html .
Comment From: jorisvandenbossche
I am not familiar enough with excel to know how its encoding works, but remember that it is not a 'normal' text file, so in most cases you won't need to specify it. Specifically for read_excel, the encoding parameter is not passed through to the actual reading of the excel file, but only for parsing afterwards (kwds here: https://github.com/pandas-dev/pandas/blob/9947a99d04b87e7441f062d6b5eca281ef15deb7/pandas/io/excel.py#L513).
But see e.g. http://xlrd.readthedocs.io/en/latest/unicode.html. xlrd is the library used for reading the excel files. It seems it has an encoding_override keyword, but I think this is not supported by read_excel at the moment.
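As background on why an encoding argument is rarely needed here, the following is a rough sketch that relies on a fact about the format, not on anything stated in this thread: an .xlsx file is a ZIP container holding XML parts, and each XML part declares its own encoding. The miniature archive below is a hypothetical stand-in built in memory, not a valid workbook, and it does not show how xlrd or pandas actually read files.

```python
import io
import zipfile

# Hypothetical miniature container, NOT a valid workbook: a single XML part
# whose prolog declares its own text encoding.
xml_part = '<?xml version="1.0" encoding="UTF-8"?><t>Zuwendungsempfänger</t>'

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/sharedStrings.xml", xml_part.encode("utf-8"))

# Reading goes bytes -> ZIP member -> XML text. The outer file is never
# treated as text, so it has no single encoding to specify.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    names = archive.namelist()
    decoded = archive.read("xl/sharedStrings.xml").decode("utf-8")

print(names)
print("Zuwendungsempfänger" in decoded)
```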
Do you have an example where you need to specify the encoding?
Comment From: GiantCrocodile
I don't have an example where I need to specify the encoding. I've just started using Pandas and I'm used to specifying the encoding, so I tried it here, but it doesn't matter to me as long as it works without specifying it.