• I have the two CSV files below:

test1.csv

COL1,COL2
10124420001034504343,1
10124420001095178632,2
10124420000004627498,3

test2.csv

COL1,COL2
10124420001034504343,a
10124420001095178632,b
20124400001011907368,c
  • Why are df1 and df3 different? (merge(df1, df2) doesn't give the output I want.)
In [1]: import pandas as pd

In [2]: df1 = pd.read_csv('./test1.csv')

In [3]: df2 = pd.read_csv('./test2.csv')

In [4]: df1.dtypes
Out[4]:
COL1    uint64
COL2     int64
dtype: object

In [5]: df2.dtypes
Out[5]:
COL1    object
COL2    object
dtype: object

In [6]: df1['COL1'] = df1['COL1'].astype('object')

In [7]: df1.dtypes
Out[7]:
COL1    object
COL2     int64
dtype: object

In [8]: merge1 = pd.merge(df1, df2, on='COL1')

In [9]: merge1.info()
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 3 columns):
COL1      0 non-null object
COL2_x    0 non-null int64
COL2_y    0 non-null object
dtypes: int64(1), object(2)
memory usage: 0.0+ bytes
In [10]: df3 = pd.read_csv('./test1.csv', dtype={'COL1': object})

In [11]: df3.dtypes
Out[11]:
COL1    object
COL2     int64
dtype: object

In [12]: merge2 = pd.merge(df3, df2, on='COL1')

In [13]: merge2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 3 columns):
COL1      2 non-null object
COL2_x    2 non-null int64
COL2_y    2 non-null object
dtypes: int64(1), object(2)
memory usage: 64.0+ bytes

Output of pd.show_versions

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Windows
OS-release: 8.1
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.21.0
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.3
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Comment From: gfyoung

Take a look at your elements in df1, df2, and df3 once you read them in.

df2 has object dtype because COL1 cannot be cast to an integer data type (the last element overflows uint64). Thus, everything is kept as strings.
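The overflow claim can be checked directly; a minimal sketch using NumPy's integer limits:

```python
import numpy as np

# uint64 can hold values up to 18446744073709551615 (about 1.84e19).
uint64_max = np.iinfo(np.uint64).max
print(uint64_max)  # 18446744073709551615

# The last COL1 value in test2.csv is about 2.01e19, which overflows
# uint64, so read_csv falls back to object dtype (strings) for the
# whole column.
overflowing = 20124400001011907368
print(overflowing > uint64_max)  # True
```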

df3 specifies that COL1 be object, which read_csv interprets as holding strings instead of integers. Thus, df2 and df3 both have matching strings, so the join works properly.

df1 reads COL1 as an integer column because all of its values can be held in uint64. When you then cast to object with .astype, that does not mean casting to string; it merely changes the data type of the container holding these elements. If you extract the elements individually, they'll still be integers. Thus, when you try to join, you're joining integers against strings, so nothing matches and the merge comes back empty.

A little weird, but the interpretation of object varies slightly depending on whether you're coming from CSV or operating within pandas.
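One way to make the original approach work, then, is to convert the merge key to actual strings on both sides, e.g. with .astype(str) rather than .astype('object'). A sketch of that fix (again simulating the two files with StringIO):

```python
import pandas as pd
from io import StringIO

csv1 = ("COL1,COL2\n10124420001034504343,1\n"
        "10124420001095178632,2\n10124420000004627498,3\n")
csv2 = ("COL1,COL2\n10124420001034504343,a\n"
        "10124420001095178632,b\n20124400001011907368,c\n")

df1 = pd.read_csv(StringIO(csv1))
df2 = pd.read_csv(StringIO(csv2))  # COL1 already strings (overflow fallback)

# .astype(str) converts each element to its string form, so both merge
# keys now hold comparable string values.
df1['COL1'] = df1['COL1'].astype(str)

merged = pd.merge(df1, df2, on='COL1')
print(len(merged))  # 2 -- the two shared keys now match
```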

Comment From: qiaobz

Clear now. Interesting.