Code Sample
import pandas as pd
import numpy as np
left = pd.DataFrame(
    columns=['A'],
    index=pd.Index([], name='id', dtype=np.int64)
)
right = pd.DataFrame(
    [[5]],
    columns=['B'],
    index=pd.Index([42], name='id', dtype=np.int64)
)
res_idx_dtype = left.join(right).index.dtype
assert res_idx_dtype == np.int64
Problem description
When executing a left join on two dataframes, I would expect the index of the result to have the same dtype as the index of the left dataframe.
I ran into this issue while using dask. A second join on the resulting dataframe raised an error because the index had dtype object instead of np.int64.
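Until the behaviour changes, one possible workaround is to cast the index of the result back to the dtype of the left index. This is only a sketch: `join_keep_left_index_dtype` is a hypothetical helper name, not part of pandas, and it assumes a left join whose result index can be safely cast to the left index dtype.

```python
import numpy as np
import pandas as pd


def join_keep_left_index_dtype(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Left-join and restore the left index dtype on the result.

    Hypothetical helper for illustration, not a pandas API.
    """
    result = left.join(right)
    if result.index.dtype != left.index.dtype:
        # Only reached when the join changed the index dtype,
        # e.g. to object for an empty left frame on affected versions.
        result.index = result.index.astype(left.index.dtype)
    return result


left = pd.DataFrame(
    columns=['A'],
    index=pd.Index([], name='id', dtype=np.int64)
)
right = pd.DataFrame(
    [[5]],
    columns=['B'],
    index=pd.Index([42], name='id', dtype=np.int64)
)

fixed = join_keep_left_index_dtype(left, right)
print(fixed.index.dtype)
```

On versions where the join already preserves the dtype, the cast is skipped and the helper returns the plain join result.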
Expected Output
>>> left.join(right).index.dtype
dtype('int64')
Output of pd.show_versions()
Comment From: TomAugspurger
Looks like a bug. Is it just for empty dataframes?
Comment From: jreback
Try this on master / 0.24.2; IIRC this is fixed.
Comment From: michcio1234
It still happens for me with 0.24.2.
Comment From: michcio1234
@TomAugspurger Yes, it only happens for empty dataframes. If the left dataframe is not empty, the result is as expected.
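A small sketch of that comparison, reusing the frames from the report. The variable names `empty_left` and `nonempty_left` are introduced here for illustration, and the dtype printed for the empty case depends on the installed pandas version.

```python
import numpy as np
import pandas as pd

right = pd.DataFrame(
    [[5]],
    columns=['B'],
    index=pd.Index([42], name='id', dtype=np.int64)
)

# Same columns and index dtype; only the number of rows differs.
empty_left = pd.DataFrame(
    columns=['A'],
    index=pd.Index([], name='id', dtype=np.int64)
)
nonempty_left = pd.DataFrame(
    [[1]],
    columns=['A'],
    index=pd.Index([42], name='id', dtype=np.int64)
)

empty_dtype = empty_left.join(right).index.dtype
nonempty_dtype = nonempty_left.join(right).index.dtype

print("empty left:    ", empty_dtype)
print("non-empty left:", nonempty_dtype)
```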
Comment From: kayibal
Unfortunately not fixed in master either:
>>> import pandas as pd
>>> pd.__version__
'0.25.0.dev0+754.gbd09a5904'
>>> import numpy as np
>>>
>>> left = pd.DataFrame(
...     columns=['A'],
...     index=pd.Index([], name='id', dtype=np.int64)
... )
>>> right = pd.DataFrame(
...     [[5]],
...     columns=['B'],
...     index=pd.Index([42], name='id', dtype=np.int64)
... )
>>> res_idx_dtype = left.join(right).index.dtype
>>> assert res_idx_dtype == np.int64
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
Comment From: TomAugspurger
Are you interested in investigating where things go wrong?
Comment From: kayibal
Sure I can have a look!
Comment From: kayibal
From a quick look I think this line could be the culprit:
https://github.com/pandas-dev/pandas/blob/master/pandas/core/reshape/merge.py#L787
introduced by this commit: https://github.com/pandas-dev/pandas/commit/d16cce89521c84fcb9c7b7bb2e95629a6fe7acb7
I can try to remove it and run all the tests although it will take me a while to set up my env for pandas.
Comment From: TomAugspurger
Ah I think you've found it. It does seem pretty deliberate, but I'm hopeful that it can just be removed and things will be OK thanks to a recent refactor.
The commit you linked to is just a reorganization. The old filename was pandas/tools/merge.py.
Comment From: kayibal
I solved it like this, and so far all tests in reshape/merge/test_merge.py appear to pass:
if len(join_index) == 0:
    if self.how == 'right':
        join_index = join_index.astype(self.right.index.dtype)
    elif self.how == 'left':
        join_index = join_index.astype(self.left.index.dtype)
    else:
        join_index = join_index.astype(object)
Any idea on how to handle inner and outer?
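To try the idea outside of pandas internals, the same branching can be written as a standalone function. This is a sketch only: `cast_empty_join_index` and its parameters are hypothetical names, not the code in merge.py, and the inner and outer cases are deliberately left at object, as in the snippet above.

```python
import numpy as np
import pandas as pd


def cast_empty_join_index(join_index: pd.Index, how: str,
                          left_index: pd.Index, right_index: pd.Index) -> pd.Index:
    """Pick a dtype for an empty join index based on the join type.

    Hypothetical standalone version of the snippet above.
    """
    if len(join_index) != 0:
        # Non-empty results are returned unchanged.
        return join_index
    if how == 'right':
        return join_index.astype(right_index.dtype)
    if how == 'left':
        return join_index.astype(left_index.dtype)
    # 'inner' and 'outer' stay at object here; how to handle them is open.
    return join_index.astype(object)


empty = pd.Index([], dtype=object)
left_index = pd.Index([], dtype=np.int64)
right_index = pd.Index([42.0], dtype=np.float64)

left_result = cast_empty_join_index(empty, 'left', left_index, right_index)
right_result = cast_empty_join_index(empty, 'right', left_index, right_index)
inner_result = cast_empty_join_index(empty, 'inner', left_index, right_index)

print(left_result.dtype, right_result.dtype, inner_result.dtype)
```

The float64 right index is used only to make the left and right branches distinguishable in the output.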
Comment From: TomAugspurger
Not off the top of my head. It may be best to make a PR at this point if you're interested.
Comment From: lukemanley
Closing, as this is fixed on main and seems to be tested in a few places (e.g. test_join_empty).