Pandas Inconsistency between read_csv(... parse_dates) and read_csv followed by astype('datetime')

Write time-zones out in csv file

Datetimes are written to csv in readable UTC. Using pd.read_csv('file.csv', parse_dates=['date']) expects UTC. astype('datetime64') expects readable dates (with no timezone specified) to be local (which also makes sense). But it leads to the following:

    import pandas as pd
    from StringIO import StringIO

    df = pd.DataFrame({'date' : [0]})
    df['date'] = df.date.astype('datetime64[s]')
    print df.date[0]

    s = StringIO()
    df.to_csv(s)
    s = s.getvalue()
    print 'csv:'
    print s


    with_parse_dates = pd.read_csv(StringIO(s), parse_dates=['date'])
    print 'parse_date:', with_parse_dates.date[0]


    with_astype = pd.read_csv(StringIO(s))
    with_astype.date = with_astype.date.astype('datetime64')
    print 'astype    :', with_astype.date[0]


    with_astypeZ = pd.read_csv(StringIO(s))
    with_astypeZ.date[0] += 'Z'
    with_astypeZ.date = with_astypeZ.date.astype('datetime64')
    print 'astype w/Z:', with_astypeZ.date[0]

output (for me):

1970-01-01 00:00:00
csv:
,date
0,1970-01-01 00:00:00

parse_date: 1970-01-01 00:00:00
astype    : 1970-01-01 08:00:00
astype w/Z: 1970-01-01 00:00:00

Would be easily fixed by indicating UTC in the csv output, possibly with just the 'Z' suffix.

Comment From: jreback

currently astype('datetime64') will not work (though in your example it is innocuous as it doesn't do anything with your datetime being 0). Datetimes are stored as datetime64[ns] and the only valid transforms are to object, or to extract values (e.g. hours,minutes, etc) from them. The astype to another time unit is really just a display issue.

(this is not an answer to your question exactly, but is conceptually part of the problem)

But you do make a valid point, how to store datetimes with timezones (which btw are stored as object type only) in a csv? I like the idea of storing the fully suffixed string with timezone.

Comment From: endolith

But you do make a valid point, how to store datetimes with timezones (which btw are stored as object type only) in a csv?

Isn't that why you add +00:00 to the end of a datestamp? Pandas doesn't recognize this, though.

Putting these formats in CSV file:

2015-05-01T04:13:00+00:00
2015-05-01 04:13:00+00:00
2015-05-01 04:13:00Z

are all parsed as dtype='datetime64[ns]', even though these represent UTC and so I would expect dtype='datetime64[ns, UTC]'.

Comment From: jorisvandenbossche

Isn't that why you add +00:00 to the end of a datestamp?

Indeed, that already happens with latest pandas by adding the offset.

Pandas doesn't recognize this, though.

Note that the offset will be taken into account, the result will only be a naive datetime by default with read_csv. But you can get timezone-aware datetimes with to_datetime:

In [129]: pd.to_datetime(['2015-05-01T04:13:00+00:00'], utc=True)
Out[129]: DatetimeIndex(['2015-05-01 04:13:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)

Note that the result will always be UTC, but with the offset taken into account (there is no way to know from the string repr which timezone it should be, so UTC is the only option)

Comment From: jorisvandenbossche

I am closing this: with latest pandas and numpy, all examples in the initial issue give the same expected result.

Comment From: endolith

@jorisvandenbossche

(there is no way to know from the string repr which timezone it should be, so UTC is the only option)

Well if it has a Z at the end then it is definitely UTC, right?

Comment From: jorisvandenbossche

And it also gives you UTC:

In [71]: pd.to_datetime(['2015-05-01T04:13:00Z'], utc=True)
Out[71]: DatetimeIndex(['2015-05-01 04:13:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)

but only if you specify to return timezone-aware datetimes.