Hi, I get a warning that confused me and that, in my opinion, is wrong. See the code below, which should reproduce the warning; further comments are inline.
INSTALLED VERSIONS
commit: None python: 2.7.9.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel byteorder: little LC_ALL: None LANG: None
pandas: 0.16.1 nose: 1.3.4 Cython: 0.22 numpy: 1.9.2 scipy: 0.15.1 statsmodels: 0.6.1 IPython: 3.0.0 sphinx: None patsy: 0.3.0 dateutil: 2.4.0 pytz: 2015.2 bottleneck: None tables: 3.1.1 numexpr: 2.4 matplotlib: 1.4.3 openpyxl: None xlrd: 0.9.3 xlwt: None xlsxwriter: 0.7.1 lxml: None bs4: 4.3.2 html5lib: None httplib2: None apiclient: None sqlalchemy: 0.9.9 pymysql: None psycopg2: None
import pandas as pd
import sqlite3 as sql
con = sql.connect("Temp_db.db")
df = pd.DataFrame(columns=[["FIRST", "FIRST", "SECOND", "SECOND"], ["ONE", "TWO", "ONE", "TWO"]]) # no space in any column
# following 'to_sql' produces warning:
# UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
# However, there are no spaces (see above)
df.to_sql("tmp_table", con=con, if_exists="replace")
# referring to page: http://pydoc.net/Python/pandas/0.14.0/pandas.io.sql/
import re
columns = list(map(str, df.columns))
pat = re.compile(r'\s+')
print columns
# OUTPUT: ["('FIRST', 'ONE')", "('FIRST', 'TWO')", "('SECOND', 'ONE')", "('SECOND', 'TWO')"]
# ('FIRST',X'ONE'): at position X of each stringified tuple there is a whitespace, which falsely triggers the warning:
print any(map(pat.search, columns)) #true
# quick and dirty fix
columns = [i.replace("', '", "','") for i in columns]
print columns
print any(map(pat.search, columns)) #false
Comment From: jorisvandenbossche
Thanks for the report!
However, I don't think it is a false warning in this case, as the MultiIndex tuples are converted to strings (e.g. "('FIRST', 'ONE')") when used as column names for the SQL table. So those names do contain spaces.
In general, MultiIndex columns are not really supported by to_sql (it works, but only because the tuples are converted to somewhat awkward strings).
Example (reading the table back into pandas, from your example above):
In [77]: df = pd.read_sql("SELECT * FROM tmp_table", con)
In [79]: df
Out[79]:
Empty DataFrame
Columns: [index, ('FIRST', 'ONE'), ('FIRST', 'TWO'), ('SECOND', 'ONE'), ('SECOND', 'TWO')]
Index: []
In [80]: df.columns
Out[80]:
Index([u'index', u'('FIRST', 'ONE')', u'('FIRST', 'TWO')',
u'('SECOND', 'ONE')', u'('SECOND', 'TWO')'],
dtype='object')
So when looking at the .columns attribute, you can see they are now actually strings and no longer tuples.
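To make explicit where the space comes from: str() on a tuple puts ", " between the elements, and that comma-space is what the \s+ check in pandas.io.sql matches. A minimal illustration (not the pandas internals verbatim):
import re

col = ('FIRST', 'ONE')
as_string = str(col)
print(as_string)                           # ('FIRST', 'ONE')  <- note the space after the comma
print(bool(re.search(r'\s+', as_string)))  # True, hence the warning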
Comment From: elDan101
Ok, I realised this now too, sorry for that. I still use a MultiIndex with SQL, so when I read from the database I needed to implement a "MultiIndex recovery" to obtain the same table.
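For reference, a rough sketch of what such a recovery can look like (an assumed approach, not necessarily the exact code I used), given that the stored column names are the stringified tuples shown above:
import ast
import sqlite3 as sql
import pandas as pd

con = sql.connect("Temp_db.db")
df = pd.read_sql("SELECT * FROM tmp_table", con, index_col="index")
# parse "('FIRST', 'ONE')" back into the tuple ('FIRST', 'ONE') and rebuild the MultiIndex
df.columns = pd.MultiIndex.from_tuples([ast.literal_eval(c) for c in df.columns])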
Comment From: jorisvandenbossche
@elDan101 Personally, I think the easiest approach is to join the MultiIndex levels with e.g. _ before writing to SQL; you can then easily split them back into a MultiIndex afterwards.
Something like
# before writing to SQL: flatten the MultiIndex into plain strings
df.columns = [a + '_' + b for a, b in df.columns]
# after reading back: split the names and rebuild the MultiIndex
df.columns = pd.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
(this only works, of course, if you don't have _ in the names themselves)
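Putting that together, a self-contained sketch of the round trip (reusing the table and connection names from the example above):
import sqlite3 as sql
import pandas as pd

con = sql.connect("Temp_db.db")
df = pd.DataFrame(columns=[["FIRST", "FIRST", "SECOND", "SECOND"],
                           ["ONE", "TWO", "ONE", "TWO"]])

# flatten the MultiIndex to plain strings before writing
df.columns = ['_'.join(col) for col in df.columns]
df.to_sql("tmp_table", con=con, if_exists="replace", index=False)

# read back and rebuild the MultiIndex by splitting on the separator
out = pd.read_sql("SELECT * FROM tmp_table", con)
out.columns = pd.MultiIndex.from_tuples([tuple(c.split('_')) for c in out.columns])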