Pandas Bug: Conversion to series with datetime items

Code Sample, a copy-pastable example if possible

# Setup
import pandas as pd
dates = pd.to_datetime(range(1,300), unit='D', origin=pd.Timestamp('2000-01-01'))
se = pd.Series(range(1,300),dates)
df = pd.DataFrame(se).reset_index()

# The problem:
>>> df
         index    0
0   2000-01-02    1
1   2000-01-03    2
2   2000-01-04    3
3   2000-01-05    4
4   2000-01-06    5
5   2000-01-07    6
..         ...  ...
294 2000-10-22  295
295 2000-10-23  296
296 2000-10-24  297
297 2000-10-25  298
298 2000-10-26  299

[299 rows x 2 columns]
>>> pd.Series(df['index'], index = df[0])
0
1     2000-01-03
2     2000-01-04
3     2000-01-05
4     2000-01-06
5     2000-01-07
         ...
295   2000-10-23
296   2000-10-24
297   2000-10-25
298   2000-10-26
299          NaT
Name: index, Length: 299, dtype: datetime64[ns]

Problem description

The commands above should not generate any NaT.

In a real-life case, I tried the same thing with 56355 rows and ended up with 53913 NaT values.

Expected Output

No NaT.

My current workaround is: df.set_index(0)

Output of `pd.show_versions()`

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.13.0
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.0.0
xlwt: None
xlsxwriter: 0.9.6
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Comment From: jreback

pd.Series(df['index'], index = df[0]) is effectively a .reindex(). You are asking this to align the passed values to the passed index.

In [15]: df['index'].reindex(df[0])
Out[15]: 
0
1     2000-01-03
2     2000-01-04
3     2000-01-05
4     2000-01-06
5     2000-01-07
6     2000-01-08
         ...    
294   2000-10-22
295   2000-10-23
296   2000-10-24
297   2000-10-25
298   2000-10-26
299          NaT
Name: index, Length: 299, dtype: datetime64[ns]

You could in theory do this:

In [16]: pd.Series(df['index'].values, df[0].index)
Out[16]: 
0     2000-01-02
1     2000-01-03
2     2000-01-04
3     2000-01-05
4     2000-01-06
5     2000-01-07
         ...    
293   2000-10-21
294   2000-10-22
295   2000-10-23
296   2000-10-24
297   2000-10-25
298   2000-10-26
Length: 299, dtype: datetime64[ns]

This is the correct idiom:

In [17]: df.set_index(0)
Out[17]: 
         index
0             
1   2000-01-02
2   2000-01-03
3   2000-01-04
4   2000-01-05
5   2000-01-06
6   2000-01-07
..         ...
294 2000-10-21
295 2000-10-22
296 2000-10-23
297 2000-10-24
298 2000-10-25
299 2000-10-26

[299 rows x 1 columns]
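
The behaviour described above can be checked with a short script. This is only a sketch built on the setup from the original report; the variable names (`aligned`, `reindexed`, `by_values`, `by_set_index`) are illustrative, and `by_values` uses the values of column 0 as labels, which is a variant of the `.values` approach shown above rather than an exact copy of it.

```python
import pandas as pd

# Same setup as in the original report.
dates = pd.to_datetime(range(1, 300), unit='D', origin=pd.Timestamp('2000-01-01'))
df = pd.DataFrame(pd.Series(range(1, 300), dates)).reset_index()

# Passing a Series as data plus an explicit index aligns on labels,
# which gives the same result as an explicit reindex.
aligned = pd.Series(df['index'], index=df[0])
reindexed = df['index'].reindex(df[0])
print(aligned.equals(reindexed))   # True
print(aligned.isna().sum())        # 1 (label 299 does not exist in df['index'])

# Passing plain arrays skips alignment: values are paired by position.
by_values = pd.Series(df['index'].to_numpy(), index=df[0].to_numpy())

# set_index pairs each row's column 0 with its date, also by position.
by_set_index = df.set_index(0)['index']
print(by_values.isna().sum(), by_set_index.isna().sum())   # 0 0
```

Both position-based variants map label 1 to 2000-01-02, whereas the aligned Series maps label 1 to 2000-01-03 because it looks up the row whose original label is 1.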

Comment From: catkfr

Thanks for the quick reply and for explaining the proper way to do things.

As is often the case, there are multiple ways to do this, and I had already found df.set_index(0).

But don't you think there is an underlying bug here? As mentioned, I stumbled upon this while working on a dataframe with 56355 rows and 50+ columns. If what I did was wrong, maybe an error would have helped. Here, I got no error, just almost 54000 NaT in a series that I wanted to use for other things.

Comment From: jreback

no there is 1 way to do this particular operation

you are fighting alignment by trying to manually construct a Series. it is correctly operating by actually aligning the passed Series to the index. it is unexpected since you are passing a Series as the index and expecting it to align in ITS index
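
Since alignment fills unmatched labels with missing values instead of raising, one way to notice the problem early is to check for labels that received no value. The toy data below is made up for illustration and is not taken from the report.

```python
import pandas as pd

values = pd.Series(['a', 'b', 'c'])   # default labels 0, 1, 2
labels = pd.Series([1, 2, 3])         # requested labels

# Alignment: label 1 -> 'b', label 2 -> 'c', label 3 has no match -> NaN.
result = pd.Series(values, index=labels)

# Labels that were requested but not found in values.index.
missing = result.index[result.isna()]
print(list(missing))   # [3]
```

Note that this check cannot distinguish labels that were missing from values that were already NaN or NaT in the source data.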

Comment From: catkfr

Ah ok: good to know

Actually the context was that I was building a dictionary from a database. One column was a reference and another was a date. My objective was to build a dictionary from ref to date.

The data had multiple rows per reference and date. I extracted these two columns, checked for duplicates then wanted to use to_dict() so needed to create a Series one way or the other.

Is there also 1 proper way to do this?
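
For the use case described here, one possible approach is sketched below. It is not confirmed anywhere in this thread as the recommended way; the column names `ref` and `date` and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the two columns extracted from the database.
records = pd.DataFrame({
    'ref': ['A', 'A', 'B', 'C', 'C'],
    'date': pd.to_datetime(['2000-01-02', '2000-01-02', '2000-01-03',
                            '2000-01-04', '2000-01-04']),
})

# Drop repeated (ref, date) rows, then make sure no ref maps to two dates.
pairs = records[['ref', 'date']].drop_duplicates()
if pairs['ref'].duplicated().any():
    raise ValueError('some references map to more than one date')

# set_index pairs ref and date row by row, so no label alignment is involved.
ref_to_date = pairs.set_index('ref')['date'].to_dict()
```

Here `ref_to_date` has one entry per reference, with `pd.Timestamp` values.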

Comment From: jreback

you are best off asking these kinds of questions on SO with a reproducible example

Comment From: catkfr

ok. Thanks for your help
