Expected
Regardless of the arguments specified, the return of DataFrame.join
always has the left side's index values and name. In the case of how='right'
, there should be np.nan
in the index when the right side does not 'match' a row on the left.
In this case:
>>> pd.DataFrame([], columns=['gene'], index=pd.Index([], name='family')).join(s), how='right', on='gene')
gene gene_id
family
NaN NaN 1
Actual / reproduce
Right index is returned for empty dataframe:
>>> import pandas as pd
>>> s = pd.Series([1], name='gene_id', index=pd.Index([1], name='gene')
>>> pd.DataFrame([], columns=['gene'], index=pd.Index([], name='family')).join(s), how='right', on='gene')
gene gene_id
gene
1 NaN 1
Note that it doesn't happen with non-empty frames:
>>> pd.DataFrame([1], columns=['gene'], index=pd.Index([1], name='family')).join(s), how='right', on='gene')
gene gene_id
family
1 1 1
Version
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.2.5-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.17.0
nose: None
pip: 7.1.2
setuptools: 18.5
Cython: None
numpy: 1.10.1
scipy: 0.16.1
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: None
numexpr: 2.4.6
matplotlib: 1.5.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.0b1
pymysql: None
psycopg2: None
Comment From: jreback
Regardless of the arguments specified, the return of DataFrame.join always has the left side's index values and name. In the case of how='right', there should be np.nan in the index when the right side does not 'match' a row on the left.
where does this come from?
In this case there are no matches, yet you need an index. What would you have it do?
Comment From: timdiels
The docstring leaves it unspecified, or at least it's not very clear; so combined with a python interpreter, I assume it always returns the left side's index. Further, I interpreted how
as inner, left outer, right outer, outer join. Knowing which index will be returned in each case is very useful for scripts at least.
Now, back to this case. I have an algorithm whose inputs sometimes contains an empty DataFrame. This will cause a lot of NaN
s in outer joins and that's fine for my case. I don't have to check for it being empty anywhere, it just keeps working; except in the case I mentioned above. It breaks like so:
df = df.join(s, how='right', on='gene')
# later on
df.reset_index().groupby('gene_id')['family'] # <-- no 'family' column if input df was empty, yet worked just fine when input df wasn't empty
Comment From: jreback
ok, can you make a short tests which exercises all of these cases (e.g. 4 x how x (left empty, right non-empty + left non-empty, right empty + both empty + both non-empty), and show results? this can confirm what is existing behavior.
essentialy this is a nested loop over the cases (and which can directly be turned into a test)
Comment From: mroeschke
Looks like this needs more information of the expected behavior of multiple cases, so closing until there is more clarity what the intended behavior should be