Pandas Optimize data type - Nineya|java/go/python

Hi guys, i'm user of mysql we have an "function" PROCEDURE ANALYSE https://dev.mysql.com/doc/refman/5.5/en/procedure-analyse.html

it get all "dataframe" and show what's the best "dtype", could we do something like it in Pandas?

thanks!

Comment From: TomAugspurger

Could you show an example of this? You want to find the smallest dtype that can contain each column's values?

Comment From: jreback

you can do this.

In [5]: df = pd.DataFrame({'A': 1, 'B': np.iinfo(np.int32).max-1, 'C': np.iinfo(np.int32).max+1}, index=[0])
   ...: 

In [6]: df
Out[6]: 
   A           B           C
0  1  2147483646  2147483648

In [7]: df.dtypes
Out[7]: 
A    int64
B    int64
C    int64
dtype: object

In [8]: df.apply(lambda x: pd.to_numeric(x, downcast='integer')).dtypes
Out[8]: 
A     int8
B    int32
C    int64
dtype: object

Comment From: rspadim

from this example of mysql: https://dev.mysql.com/doc/refman/5.7/en/procedure-analyse.html

it's a CHAR(1) field mysql> select * from charac; +--------+ | charac | +--------+ | A | | B | | C | | D | | E | +--------+

but when you execute procedure analyse it report that you can use ENUM() to optimize searchs and data structure

mysql> select * from charac procedure analyse()\G ***** 1. row **** Field_name: world.charac.charac Min_value: A Max_value: E Min_length: 1 Max_length: 1 Empties_or_zeros: 0 Nulls: 0 Avg_value_or_avg_length: 1.0000 Std: NULL Optimal_fieldtype: ENUM('A','B','C','D','E') NOT NULL *<------------------------ 1 row in set (0.00 sec)

obvious procedure() have options when it should change a string to a enum (string -> categorical)

mysql> select * from charac procedure analyse(5,24)\G ***** 1. row **** Field_name: world.charac.charac Min_value: A Max_value: E Min_length: 1 Max_length: 1 Empties_or_zeros: 0 Nulls: 0 Avg_value_or_avg_length: 1.0000 Std: NULL Optimal_fieldtype: CHAR(1) NOT NULL *<------ now optimal fieldtype is char(1)