- I have two csv file below:
test1.csv
COL1,COL2
10124420001034504343,1
10124420001095178632,2
10124420000004627498,3
test2.csv
COL1,COL2
10124420001034504343,a
10124420001095178632,b
20124400001011907368,c
- why df1 and df3 are different? (merge(d1, d2) can't get the output I want)
In [1]: import pandas as pd
In [2]: df1 = pd.read_csv('./test1.csv')
In [3]: df2 = pd.read_csv('./test2.csv')
In [4]: df1.dtypes
Out[4]:
COL1 uint64
COL2 int64
dtype: object
In [5]: df2.dtypes
Out[5]:
COL1 object
COL2 object
dtype: object
In [6]: df1['COL1'] = df1['COL1'].astype('object')
In [7]: df1.dtypes
Out[7]:
COL1 object
COL2 int64
dtype: object
In [8]: merge1 = pd.merge(df1, df2, on='COL1')
In [9]: merge1.info()
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 3 columns):
COL1 0 non-null object
COL2_x 0 non-null int64
COL2_y 0 non-null object
dtypes: int64(1), object(2)
memory usage: 0.0+ bytes
In [10]: df3 = pd.read_csv('./test1.csv', dtype={'COL1': object})
In [11]: df3.dtypes
Out[11]:
COL1 object
COL2 int64
dtype: object
In [12]: merge2 = pd.merge(df3, df2, on='COL1')
In [13]: merge2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 3 columns):
COL1 2 non-null object
COL2_x 2 non-null int64
COL2_y 2 non-null object
dtypes: int64(1), object(2)
memory usage: 64.0+ bytes
Output of pd.show_versions
Comment From: gfyoung
Take a look at your elements in df1, df2, and df3 once you read them in.
df2 has object dtype because it cannot cast COL1 to an integer data-type (the last element causes overflow). Thus, it casts everything to string.
df3 specifies that COL1 be object, which is interpreted by read_csv to hold strings instead of integers. Thus, df2 and df3 have matching strings, causing you to join properly.
df1 reads COL1 as int because the integers can be held in uint64. When you can cast to object with .astype, it no longer means casting to string but rather just changing the data-type of the container holding these elements. If you extract the elements individually, they'll still be integers. Thus, when you try to join, you're joining integers or strings, causing it to fail.
A little weird, but the interpretation of object varies slightly depending on whether you're coming from CSV or operating within pandas.
Comment From: qiaobz
clear now. Interesting.