- I have two csv file below:
test1.csv
COL1,COL2
10124420001034504343,1
10124420001095178632,2
10124420000004627498,3
test2.csv
COL1,COL2
10124420001034504343,a
10124420001095178632,b
20124400001011907368,c
- why df1 and df3 are different? (merge(d1, d2) can't get the output I want)
In [1]: import pandas as pd
In [2]: df1 = pd.read_csv('./test1.csv')
In [3]: df2 = pd.read_csv('./test2.csv')
In [4]: df1.dtypes
Out[4]:
COL1 uint64
COL2 int64
dtype: object
In [5]: df2.dtypes
Out[5]:
COL1 object
COL2 object
dtype: object
In [6]: df1['COL1'] = df1['COL1'].astype('object')
In [7]: df1.dtypes
Out[7]:
COL1 object
COL2 int64
dtype: object
In [8]: merge1 = pd.merge(df1, df2, on='COL1')
In [9]: merge1.info()
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 3 columns):
COL1 0 non-null object
COL2_x 0 non-null int64
COL2_y 0 non-null object
dtypes: int64(1), object(2)
memory usage: 0.0+ bytes
In [10]: df3 = pd.read_csv('./test1.csv', dtype={'COL1': object})
In [11]: df3.dtypes
Out[11]:
COL1 object
COL2 int64
dtype: object
In [12]: merge2 = pd.merge(df3, df2, on='COL1')
In [13]: merge2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 3 columns):
COL1 2 non-null object
COL2_x 2 non-null int64
COL2_y 2 non-null object
dtypes: int64(1), object(2)
memory usage: 64.0+ bytes
Output of pd.show_versions
Comment From: gfyoung
Take a look at your elements in df1
, df2
, and df3
once you read them in.
df2
has object
dtype because it cannot cast COL1
to an integer data-type (the last element causes overflow). Thus, it casts everything to string.
df3
specifies that COL1
be object
, which is interpreted by read_csv
to hold strings instead of integers. Thus, df2
and df3
have matching strings, causing you to join properly.
df1
reads COL1
as int
because the integers can be held in uint64
. When you can cast to object
with .astype
, it no longer means casting to string
but rather just changing the data-type of the container holding these elements. If you extract the elements individually, they'll still be integers. Thus, when you try to join, you're joining integers or strings, causing it to fail.
A little weird, but the interpretation of object
varies slightly depending on whether you're coming from CSV or operating within pandas
.
Comment From: qiaobz
clear now. Interesting.