Hi guys, i'm user of mysql we have an "function" PROCEDURE ANALYSE https://dev.mysql.com/doc/refman/5.5/en/procedure-analyse.html
it get all "dataframe" and show what's the best "dtype", could we do something like it in Pandas?
thanks!
Comment From: TomAugspurger
Could you show an example of this? You want to find the smallest dtype that can contain each column's values?
Comment From: jreback
you can do this.
In [5]: df = pd.DataFrame({'A': 1, 'B': np.iinfo(np.int32).max-1, 'C': np.iinfo(np.int32).max+1}, index=[0])
...:
In [6]: df
Out[6]:
A B C
0 1 2147483646 2147483648
In [7]: df.dtypes
Out[7]:
A int64
B int64
C int64
dtype: object
In [8]: df.apply(lambda x: pd.to_numeric(x, downcast='integer')).dtypes
Out[8]:
A int8
B int32
C int64
dtype: object
Comment From: rspadim
from this example of mysql: https://dev.mysql.com/doc/refman/5.7/en/procedure-analyse.html
it's a CHAR(1) field mysql> select * from charac; +--------+ | charac | +--------+ | A | | B | | C | | D | | E | +--------+
but when you execute procedure analyse it report that you can use ENUM() to optimize searchs and data structure
mysql> select * from charac procedure analyse()\G ***** 1. row **** Field_name: world.charac.charac Min_value: A Max_value: E Min_length: 1 Max_length: 1 Empties_or_zeros: 0 Nulls: 0 Avg_value_or_avg_length: 1.0000 Std: NULL Optimal_fieldtype: ENUM('A','B','C','D','E') NOT NULL *<------------------------ 1 row in set (0.00 sec)
obvious procedure() have options when it should change a string to a enum (string -> categorical)
mysql> select * from charac procedure analyse(5,24)\G ***** 1. row **** Field_name: world.charac.charac Min_value: A Max_value: E Min_length: 1 Max_length: 1 Empties_or_zeros: 0 Nulls: 0 Avg_value_or_avg_length: 1.0000 Std: NULL Optimal_fieldtype: CHAR(1) NOT NULL *<------ now optimal fieldtype is char(1)