Pandas 'column' not in index, but hell it is. Seems like a bug...

I have a dataframe called delivery and when I print(delivery.columns) I get the following:

Index(['Complemento_endereço', 'cnpj', 'Data_fundação', 'Número',
   'Razão_social', 'CEP', 'situacao_cadastral', 'situacao_especial', 'Rua',
   'Nome_Fantasia', 'last_revenue_normalized', 'last_revenue_year',
   'Telefone', 'email', 'Capital_Social', 'Cidade', 'Estado',
   'Razão_social', 'name_bairro', 'Natureza_Jurídica', 'CNAE', '#CNAE',
   'CNAEs_secundários', 'Pessoas', 'percent'],
  dtype='object')

Well, we can clearly see that there is a column 'Rua'.

Also, if I print(delivery.Rua) I get a proper result:

82671                         R JUDITE MELO DOS SANTOS
817797                                R DOS GUAJAJARAS
180081           AV MARCOS PENTEADO DE ULHOA RODRIGUES
149373                                 AL MARIA TEREZA
455511                               AV RANGEL PESTANA
...

Even if I write "if 'Rua' in delivery.columns: print('here I am')" it does print the 'here I am'. So 'Rua' is in fact there.

Well, in the immediate line after I have this code:

delivery=delivery.set_index('cnpj')[['Razão_social','Nome_Fantasia','Data_fundação','CEP','Estado','Cidade','Bairro','Rua','Número','Complemento_endereço','Telefone','email','Capital_Social', 'CNAE', '#CNAE', 'Natureza_Jurídica','Pessoas' ]]

And voilà, I get this weird error:

Traceback (most recent call last):
  File "/file.py", line 45, in <module>
    'Telefone','email','Capital_Social', 'CNAE', '#CNAE', 'Natureza_Jurídica','Pessoas' ]]
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/frame.py", line 1991, in __getitem__
    return self._getitem_array(key)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/frame.py", line 2035, in _getitem_array
    indexer = self.ix._convert_to_indexer(key, axis=1)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/indexing.py", line 1214, in _convert_to_indexer
    raise KeyError('%s not in index' % objarr[mask])
KeyError: "['Rua'] not in index"

Can someone help? I tried stackoverflow but no one could help. I'm starting to think I'm crazy and 'Rua' is an illusion of my troubled mind.

ADDITIONAL INFO

I'm using this code right before the error line:

delivery=pd.DataFrame()

for i in selection.index:
    sample=groups.get_group(selection['#CNAE'].loc[i]).sample(selection['samples'].loc[i])
    delivery=pd.concat((delivery,sample)).sort_values('Capital_Social',ascending=False)


print(delivery.columns)
print(delivery.Rua)
print(delivery.set_index('cnpj').columns)

delivery=delivery.set_index('cnpj')[['Razão_social','Nome_Fantasia','Data_fundação','CEP','Estado','Cidade','Bairro','Rua','Número','Complemento_endereço',
                                 'Telefone','email','Capital_Social', 'CNAE', '#CNAE', 'Natureza_Jurídica','Pessoas' ]]

EDIT

New weird stuff: I gave up and deleted 'Rua' from that last piece of code, hoping that it would work. To my surprise, I had the same problem, but now with the column 'Número'.

delivery=delivery.set_index('cnpj')[['Razão_social','Nome_Fantasia','Data_fundação','CEP','Estado','Cidade','Bairro','Número','Complemento_endereço',
                                                 'Telefone','email','Capital_Social', 'CNAE', '#CNAE', 'Natureza_Jurídica' ]]

KeyError: "['Número'] not in index"

EDIT 2

And then I gave up on 'Número' and took it out. Then the same problem happened with 'Complemento_endereço'. Then I deleted 'Complemento_endereço', and it happened to 'Telefone', and so on.

EDIT 3

If I do a pd.show_versions(), that's the output:

INSTALLED VERSIONS

commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Darwin
OS-release: 16.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 18.2
Cython: None
numpy: 1.11.0
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: None
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.3
pymysql: 0.7.11.None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
None

Comment From: gfyoung

@abutremutante : Thanks for reporting this! It does look really weird, but we can't replicate it at this point because we can't run your code. Could you provide a complete code sample for us?

Also, if you could provide the output of pd.show_versions in your initial issue box, that would be great.

Comment From: abutremutante

Hi there! Thanks for answering. I wouldn’t like to make it public on github. Could I send it by email to you guys?

Comment From: gfyoung

Preferably not, as anyone who wants to take up this issue would need to see the code. Can you try replicating with a different table (or DataFrame) that doesn't contain sensitive information?

Comment From: abutremutante

Tried here:

import pandas as pd
import FindCos.FindCos_Functions as find  # that's a file where I write some functions
import datetime
import pdb

target=find.get_full_basics(business='select * from sqltable;',test_mode=False)

CNAEs=['23.30-3-01','26','27','49.30-2-03','37.02-9-00','46.45','47.73','46.44-3-01']
hired_cos=200

# selecting items from CNAEs

selection=pd.DataFrame()
for i in CNAEs:
    x=target.loc[target['#CNAE'].str.startswith(i) == True]
    selection=pd.concat((selection,x))

# FILTERING

selection=selection.loc[selection['Capital_Social'] < 100000000].loc[selection['situacao_cadastral'] == 'ATIVA']\
    .loc[selection['situacao_especial'].isnull() == True].loc[selection['Natureza_Juridica'] != 'EMPRESA INDIVIDUAL DE RESP.LIMITADA (DE NATUREZA EMPRESARIA)']\
    .loc[selection['Natureza_Juridica'] != 'EMPRESARIO (INDIVIDUAL)']\
    .loc[selection['Estado'] != 'PA'].loc[selection['Estado'] != 'AM']\
    .loc[selection['Estado'] != 'RR'].loc[selection['Estado'] != 'AC'].loc[selection['Estado'] != 'RO'].loc[selection['Estado'] != 'AP']\
    .loc[selection['Estado'] != 'TO']

# DUPLICATION CONTROL

lista=['file.csv']
selection=find.exclude_business(selection,lista)

# CHECKING PROFILE

groups=selection.groupby('#CNAE')
selection['percent']=groups['#CNAE'].transform('size')/len(selection)
selection=selection[['#CNAE','percent']].drop_duplicates().sort_values('percent',ascending=False)
selection['samples']=round(((hired_cos*1.05)*selection['percent']))

delivery=pd.DataFrame()
for i in selection.index:
    sample=groups.get_group(selection['#CNAE'].loc[i]).sample(selection['samples'].loc[i])
    delivery=pd.concat((delivery,sample)).sort_values('Capital_Social',ascending=False)#.rename(columns={'Capital_Social':'Score_Tamanho'})

# MAKING SURE THAT RUA REALLY EXISTS

print(delivery.columns)
print(delivery.Rua)
print(delivery.set_index('cnpj').columns)
delivery=delivery.rename(columns={'Rua':'Rua'})
if 'Rua' in delivery.columns:
    print('here I am')

# PROBLEM LINE

delivery=delivery.set_index('cnpj')[['cnpj','Razao_social','Nome_Fantasia','Data_fundacao','CEP','Estado','Cidade','Bairro','Rua','Numero','Complemento_endereco','Telefone','email','Capital_Social','CNAE','#CNAE','Natureza_Juridica']]

Comment From: gfyoung

@abutremutante : Thanks, but unfortunately, this code is not replicable for us. We can't run import FindCos.FindCos_Functions. Try to just create DataFrame from scratch and replicate the issue.

Comment From: gfyoung

Also, if you could provide the output of pd.show_versions in your initial issue box, that would be great.

Comment From: abutremutante

@gfyoung: I added the pd.show_versions output to the initial issue box. Regarding the dataframe, it is pretty long. I made a CSV from its first 10 rows, like this:

target=find.get_full_basics(business='select * from sqltable limit 10;',test_mode=False)
target.to_csv('target10items.csv')

I'm attaching it here. target10items.csv.zip

Comment From: gfyoung

1) Can you replicate your issue using this smaller DataFrame?
2) I notice you're using a very old version of pandas (we're at 0.20.3 right now). Can you try upgrading and see if that resolves your issue?

Comment From: jschendel

'Bairro' is not in your output for print(delivery.columns) but is in the list you provide after set_index. It's a little suspicious that 'Bairro' appears immediately before 'Rua' in that list. Maybe there's an issue in the error message selecting the missing column?

Comment From: jschendel

Okay, I think the issue is that 'Bairro' is actually the missing key, but pandas 0.18.1 had a bug where the error message displays the wrong item as the missing key.

Using the following code

import pandas as pd
import numpy as np

cols = pd.Index(['Complemento_endereço', 'cnpj', 'Data_fundação', 'Número',
   'Razão_social', 'CEP', 'situacao_cadastral', 'situacao_especial', 'Rua',
   'Nome_Fantasia', 'last_revenue_normalized', 'last_revenue_year',
   'Telefone', 'email', 'Capital_Social', 'Cidade', 'Estado',
   'Razão_social', 'name_bairro', 'Natureza_Jurídica', 'CNAE', '#CNAE',
   'CNAEs_secundários', 'Pessoas', 'percent'],
  dtype='object')
delivery = pd.DataFrame(np.random.random(size=(5, len(cols))), columns=cols)

delivery = delivery.set_index('cnpj')[['Razão_social','Nome_Fantasia','Data_fundação','CEP','Estado','Cidade','Bairro','Rua','Número','Complemento_endereço','Telefone','email','Capital_Social', 'CNAE', '#CNAE', 'Natureza_Jurídica','Pessoas' ]]

On pandas 0.18.1, I get the following error:

KeyError: "['Rua'] not in index"

However, on pandas 0.20.3, I get the corrected error:

KeyError: "['Bairro'] not in index"
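
Since the message on old versions can name the wrong label, it may help to compute the missing labels yourself before indexing. The following is a minimal sketch using a shortened version of the column list from the report; the names `wanted`, `available` and `missing` are illustrative only and not part of pandas.

```python
import numpy as np
import pandas as pd

# A shortened version of the reported columns (note: 'name_bairro', not 'Bairro').
cols = ['Complemento_endereço', 'cnpj', 'Rua', 'Número', 'name_bairro', 'CEP']
delivery = pd.DataFrame(np.random.random(size=(3, len(cols))), columns=cols)

# Hypothetical selection list that asks for one label that does not exist.
wanted = ['CEP', 'Bairro', 'Rua', 'Número', 'Complemento_endereço']

# Compare against the columns that remain after set_index('cnpj').
available = delivery.set_index('cnpj').columns
missing = [c for c in wanted if c not in available]
print(missing)  # ['Bairro']
```

Checking membership label by label does not depend on how a given pandas version formats its KeyError.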

Comment From: abutremutante

You Nailed It @jschendel

Thanks a lot @gfyoung

Thank you so much.

Comment From: gfyoung

Closing, as it seems that your issue has been resolved.

Comment From: duocang

Hi, I do not see any real solution to the problem here. @gfyoung, why did you close this? I still have this problem. No complaint, just so tired of this error.

Comment From: TomAugspurger

@wangxuesong29 do you have a minimal example? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

Comment From: annkia

I've got the same problem as you. I've observed that the error occurs if I edit the .csv data in OpenOffice. When I instead downloaded the data from the Internet and edited it in the plain Notepad++ editor, it worked normally. I know this may not help in your case, but maybe you should change the text editor or program you use for .csv files.

Comment From: jacobitosuperstar

ERROR: pandas version 0.23.4. I have the same problem, using the same code as above.

After running the code I get:

'Bairro' not in index

CODE:

import pandas as pd
import numpy as np

cols = pd.Index(['Complemento_endereço', 'cnpj', 'Data_fundação', 'Número',
   'Razão_social', 'CEP', 'situacao_cadastral', 'situacao_especial', 'Rua',
   'Nome_Fantasia', 'last_revenue_normalized', 'last_revenue_year',
   'Telefone', 'email', 'Capital_Social', 'Cidade', 'Estado',
   'Razão_social', 'name_bairro', 'Natureza_Jurídica', 'CNAE', '#CNAE',
   'CNAEs_secundários', 'Pessoas', 'percent'],
  dtype='object')
delivery = pd.DataFrame(np.random.random(size=(5, len(cols))), columns=cols)

delivery = delivery.set_index('cnpj')[['Razão_social','Nome_Fantasia','Data_fundação','CEP','Estado','Cidade','Bairro','Rua','Número','Complemento_endereço','Telefone','email','Capital_Social', 'CNAE', '#CNAE', 'Natureza_Jurídica','Pessoas' ]]

Comment From: MrArca9

This is for those who landed here from searching Google to see what's wrong.

If you are working from a CSV or XLSX file, make 100% sure none of your column names has a space at the front or the end.

When importing a CSV I noticed there was an issue getting a column. When you export the df to a CSV and open it in Excel, it's impossible to see the trailing or leading whitespace. You have to open it with Notepad or Notepad++.

Again, this is for those who landed here from a Google search. Make sure all leading and trailing whitespace is removed from the column header names in your CSV, XLSX or any other dataframe file template you may be using.
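
A small sketch of the whitespace check described above, assuming a made-up CSV whose header has one leading and one trailing space. The data and column names are hypothetical; `Index.str.strip()` is the only pandas feature relied on for the fix.

```python
import io
import pandas as pd

# Hypothetical CSV: ' price' has a leading space, 'qty ' a trailing one.
raw = "id, price,qty \n1,9.5,3\n2,4.0,7\n"
df = pd.read_csv(io.StringIO(raw))

# Printing the labels as a list shows the quotes, so stray spaces become visible.
before = df.columns.tolist()
print(before)  # ['id', ' price', 'qty ']

# Strip leading and trailing whitespace from every column label.
df.columns = df.columns.str.strip()
print(df[['price', 'qty']])
```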

Comment From: Harbalnasser

I am also having this error. The columns are named correctly, but when I use seaborn with my CSV file I get the error (my column is out of index).

import seaborn as sns
import pandas as pd
Data = pd.read_csv('test.csv',delimiter=',') 
sns.lmplot(x='predLabel', y='trueLabel', data=Data)

the error message: KeyError: "['predLabel' 'trueLabel'] not in index"

Comment From: iamreechi

I also have the same issue. The columns are named correctly, but when I use seaborn with my CSV file I get the error (my column is out of index).

import seaborn as sns
import pandas as pd
df = pd.read_csv('lawma1.csv', index_col =[0, 1], delimiter=', ')
sns.lmplot(x='WEEK1', y='FLEET', data=df).savefig('law.png')

the error message: KeyError: "['FLEET'] not in index"
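
The thread never confirms the cause of this report. One possibility worth ruling out, offered here only as an assumption, is that `index_col` moved the wanted columns into the index, where column lookups no longer find them. The sketch below uses made-up data with the same column names.

```python
import io
import pandas as pd

# Hypothetical data; the real 'lawma1.csv' is not shown in the thread.
raw = "WEEK1,FLEET,TRIPS\n1,A,10\n2,B,12\n"

# index_col=[0, 1] turns the first two columns into a MultiIndex.
df = pd.read_csv(io.StringIO(raw), index_col=[0, 1])
print(df.columns.tolist())   # ['TRIPS']
print(list(df.index.names))  # ['WEEK1', 'FLEET']

# reset_index() moves the index levels back into ordinary columns.
flat = df.reset_index()
print(flat.columns.tolist())  # ['WEEK1', 'FLEET', 'TRIPS']
```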

Comment From: Warix3

I had this error because I had a dot "." at the end of a column name; it worked after I removed it.

podaci = pd.read_csv('data/fifa19a.csv', names=['id', 'ime', 'godine', 'ocjena', 'potencijal.', 'bodovi', 'stopalo', 'placa_tis_eur', 'cijena_mil_eur'])

It was like this in the "potencijal" column. It could be a bug.

Comment From: rythmrana2

I am facing the same problem. I even used print(X.columns) and it showed the label 'exposure_end', but when I used it in centroids_new=X.groupby(["clusters"]).mean()[["exposure_end","Duration"]] it shows the error 'exposure_end' not in index. Please help, I have been stuck here for the past two hours.

Comment From: rythmrana2

I found the solution to my problem. I was using the statement above, centroids_new=X.groupby(["clusters"]).mean()[["exposure_end","Duration"]]. I called x.mean(axis=1) above this statement and then used the statement centroids_new=X.groupby(["clusters"]).mean()[["exposure_end","Duration"]] without the mean, and it worked fine. I was not able to pass axis in the original statement because it wasn't working with groupby, so I had to do it in two steps. The main reason it was happening was that the axis wasn't set to 1.
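
The data behind this report is not shown, so the following is only a sketch under assumed, made-up values. If the goal is per-cluster means of two columns, one common approach is to select the columns on the groupby object first and aggregate afterwards, so the result contains exactly the requested columns.

```python
import pandas as pd

# Hypothetical data using the column names from the comment above.
X = pd.DataFrame({
    "clusters": [0, 0, 1, 1],
    "exposure_end": [1.0, 3.0, 10.0, 20.0],
    "Duration": [2.0, 4.0, 6.0, 8.0],
})

# Select the columns first, then take the mean within each cluster.
centroids_new = X.groupby("clusters")[["exposure_end", "Duration"]].mean()
print(centroids_new)
```

If a requested column is missing or misspelled, the KeyError is then raised at the selection step rather than after the aggregation.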

Comment From: Newton-33

I found a solution to the problem, works perfectly for me.

Check whether your CSV file is separated by ',' or ';'. In my case, my data was separated by ',' but I was using ';'.

So I simply added

Df = pd.read_csv(r'C:\Users\user\Desktop\data.csv', sep=",")
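
To illustrate why a wrong separator produces this error, here is a minimal sketch on made-up in-memory data. With the wrong `sep`, each line is read as a single field, so the frame has one column whose name is the whole header line. The `sep=None` with `engine="python"` form asks pandas to sniff the delimiter.

```python
import io
import pandas as pd

# Hypothetical comma-separated data.
raw = "a,b,c\n1,2,3\n4,5,6\n"

# Wrong separator: the header is read as one label, 'a,b,c'.
wrong = pd.read_csv(io.StringIO(raw), sep=";")
print(wrong.columns.tolist())  # ['a,b,c']

# Correct separator: three separate columns.
right = pd.read_csv(io.StringIO(raw), sep=",")
print(right.columns.tolist())  # ['a', 'b', 'c']

# Let pandas detect the delimiter (requires the python engine).
sniffed = pd.read_csv(io.StringIO(raw), sep=None, engine="python")
print(sniffed.columns.tolist())
```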

Comment From: Charis111

"column " and "column"

are two different things; the first has a trailing space. SO SIMPLY ADD A SPACE WHERE THE SPACE IS

Eg: df["column "] worked for me

Comment From: Chretien

In case anybody else stumbles upon this error, and you're ABSOLUTELY CERTAIN that there is no whitespace to be found in your columns, do a double check to ensure that the column you're attempting to plot is of a numeric dtype, and not an object.

I was scratching my head with this: dete_resignation['cease_date'].astype("float")

Since the dtype was an object before, I wanted it to be a float. I ran the code and came across this exact error. I followed some of the answers on here and still - the error prevailed. But it was so simple, that I disregarded it until I caught myself and updated my code to this:

dete_resignation['cease_date'] = dete_resignation['cease_date'].astype("float")

Voila! No more error. I felt silly, but experience is experience for a reason. Hoping this helps anyone with the same issue that I had!
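
The point about assignment can be sketched as follows, with made-up values in a frame named after the one in the comment. `astype` returns a new Series and leaves the original untouched, so the result has to be assigned back for the dtype to change.

```python
import pandas as pd
from pandas.api.types import is_float_dtype

# Hypothetical data: years stored as strings.
dete_resignation = pd.DataFrame({"cease_date": ["2012", "2013", "2014"]})

# The converted Series is discarded here, so the column is unchanged.
dete_resignation["cease_date"].astype("float")
before_is_float = is_float_dtype(dete_resignation["cease_date"])
print(before_is_float)  # False

# Assigning the result back actually changes the column's dtype.
dete_resignation["cease_date"] = dete_resignation["cease_date"].astype("float")
after_is_float = is_float_dtype(dete_resignation["cease_date"])
print(after_is_float)  # True
```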

Comment From: Lenticular

For those getting this error with seaborn relplot (or similar), this may well be because of https://github.com/mwaskom/seaborn/issues/2622, which is fixed in v0.11.2. Upgrading fixed my error.

Comment From: faddaful

For those still having this issue: remove any whitespace from the beginning or the end of your column names. This fixed it for me. Very simple.