Pandas datetools.parse interface

[ ] document in docstring / cookbook / timeseries.rst the usage of combining integer columns into YYYYMMDD and parsing to datetimes
[ ] clean up imports of _parse/parse from dateutil
[ ] to_period to create PeriodIndex

I started to make a PR #5885 to fix what I thought was a typo before realizing that this was intentional. It still doesn't make much sense to me though why I would want this return of datetime, _result, resolution. Maybe the whole approach could use a refactor. Otherwise what am I missing?

My typical use case for datetools.parse is something like

dates = map(lambda x : parse(' '.join(x)), zip(df.day, df.month, df.year))

A couple of questions.

1. Is there a better way to do this vectorized to-datetime operation? AFAICT pd.to_datetime doesn't actually use the 'advanced' parsing for quarterly and monthly dates.
2. Should this all be unified? Assuming I haven't missed it, should there be, e.g., a function pd.parse_dates that is a general parser for strings and also works on array-like input, deprecating datetools.parse, datetools.parse_time_string, and datetools.to_datetime? This function could also have a flag to return Period or Timestamp objects with frequency information instead of the current return of the parsed object and resolution.

As it stands, I'm having to do things like

dates = [x[0] for x in map(lambda x : parse(' '.join(x)), zip(df.day, df.month, df.year))]

Thoughts?

Comment From: jreback

If you already have integer day, month, and year columns,

just do this, and make sure that you specify format '%Y%m%d'; it's specially optimized to handle this (whether it's an integer or a string)

In [14]: i = pd.date_range('20000101',periods=10000)

In [15]: df = pd.DataFrame(dict(year = i.year, month = i.month, day = i.day))

In [17]: %timeit pd.to_datetime(df.year*10000+df.month*100+df.day,format='%Y%m%d')
100 loops, best of 3: 8.4 ms per loop

Parsing the actual string

In [18]: ds = df.apply(lambda x: "%04d%02d%02d" % (x.year,x.month,x.day),axis=1)

In [21]: %timeit pd.to_datetime(ds)
1 loops, best of 3: 519 ms per loop

pd.to_datetime IS the general purpose parser (and will fall back to datetools for dates it doesn't have optimized parsing code for); you can always pass a strptime format
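A self-contained sketch of the integer-column approach described in this comment, assuming a small hand-built frame (the column values are illustrative, not from the benchmark above):

```python
import pandas as pd

# Illustrative data: separate integer year/month/day columns.
df = pd.DataFrame({"year": [2000, 2000, 2001],
                   "month": [1, 12, 6],
                   "day": [1, 31, 15]})

# Pack the three columns into a single YYYYMMDD integer, e.g. 20001231.
yyyymmdd = df.year * 10000 + df.month * 100 + df.day

# Passing an explicit format avoids general-purpose string parsing.
dates = pd.to_datetime(yyyymmdd, format="%Y%m%d")
```

The arithmetic keeps everything vectorized, so no per-row lambda or string join is needed.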

Comment From: jseabold

A couple of comments. This is not completely obvious. I'll see about adding this and similar examples to the docstring. In general, more examples in docstrings with common patterns would be welcome.

to_datetime is not general enough because it doesn't subsume the abilities of pd.parse_time_string. It only handles things that can be put into the strftime format. For example, it's very common in economics to have dates in this format.

pd.to_datetime(["1980m1", "1980m2"])

It'd be nice to have a function that handles this and dates handled by dateutil.parse.

And this returns an array for some reason, which I find to be odd. I see the box keyword, but since your example returns a Series, not a DatetimeIndex as indicated, wouldn't it make sense for everything just to return a Series?

Shouldn't to_datetime have parse in the name? This is what I tab-complete on. I know I want to "parse" the dates, I guess I wanted to parse them "to_datetime", but the former seems the obvious name in the namespace (to me). pd.parse_to_datetime?

Comment From: jreback

not sure where the name came from.

I agree creating a date from columns is not so obvious (a little more obvious if you are reading it in with read_csv).

would appreciate the combining example as a docstring / doc example / cookbook entry - I thought about how to make it do this 'automatically' but don't want to change the API...if you think of some way, great! (could have a convenience function, but not big on that)

These will return a Series if you pass a Series/array. The boxing is internally used by the DatetimeIndex parser (which just calls this); you can use it if you want an Index instead. It also returns a scalar Timestamp if only 1 value is passed as a scalar.
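A short sketch of the return types described here, as I understand them in recent pandas releases. It relies only on the input type and does not use the box keyword mentioned in this thread:

```python
import pandas as pd

strings = ["2000-01-01", "2000-01-02"]

# Series in, Series out.
as_series = pd.to_datetime(pd.Series(strings))

# Plain list in, DatetimeIndex out.
as_index = pd.to_datetime(strings)

# Single scalar in, Timestamp out.
as_scalar = pd.to_datetime("2000-01-01")
```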

I could see the format argument taking a function for parsing. How would you have your example parsed? (prob as a PeriodIndex? maybe we should have a to_period, and in 0.13 a to_timedelta)

I don't think we can change to_datetime, but we could alias it to parse_to_datetime; no biggie on that

Comment From: jseabold

Also maybe import parse from dateutil as _parse to discourage its use. I had no idea it wasn't really intended to be part of the public API.

Comment From: jreback

I think the import is a bit 'confused', as it's in several places (some as _parse, some as parse). I believe the point was to allow it as a convenience in parsing dates with read_csv, but that has evolved to not be that necessary.

Comment From: jreback

ok...i'll create a todo list at the top of this PR then

Comment From: jseabold

My example would be parsed like

map(pd.datetools.parse_time_string, ["1980q1","1980q2"])

Though apparently 1980m12 is recognized as minute not month. I'd have m\d+ and q\d+ default to parsing as month and quarter, though I see that this has time_string in the title so probably not appropriate here. Odd that it tries to parse quarters here.

These are not unheard-of formats; they are commonly handled by econometric/statistical software.

[~/]
[7]: map(sm.tsa.datetools.date_parser, ["1998m1", "1998m2"])
[7]: [datetime.datetime(1998, 1, 31, 0, 0), datetime.datetime(1998, 2, 28, 0, 0)]

[~/]
[8]: map(sm.tsa.datetools.date_parser, ["1998QI", "1998QII"])
[8]: [datetime.datetime(1998, 3, 31, 0, 0), datetime.datetime(1998, 6, 30, 0, 0)]

[~/]
[9]: map(sm.tsa.datetools.date_parser, ["1998mX", "1998mXI"])
[9]: [datetime.datetime(1998, 10, 31, 0, 0), datetime.datetime(1998, 11, 30, 0, 0)]
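A minimal sketch of the month/quarter convention asked for in this comment, returning pandas Period objects. The helper name parse_econ_period and its regex are hypothetical (not part of pandas or statsmodels), and only Arabic numerals are handled, not the Roman-numeral forms shown above:

```python
import re

import pandas as pd

# Hypothetical pattern: four-digit year, 'm' or 'q', then a 1-2 digit number.
_PATTERN = re.compile(r"^(\d{4})([mq])(\d{1,2})$", re.IGNORECASE)


def parse_econ_period(text):
    """Hypothetical helper: '1980m12' -> monthly Period, '1980q1' -> quarterly Period."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized period string: {text!r}")
    year = int(match.group(1))
    kind = match.group(2).lower()
    number = int(match.group(3))
    if kind == "m":
        # 'm' always means month here, never minute.
        return pd.Period(year=year, month=number, freq="M")
    return pd.Period(year=year, quarter=number, freq="Q")


monthly = [parse_econ_period(s) for s in ["1980m1", "1980m12"]]
quarterly = [parse_econ_period(s) for s in ["1980q1", "1980q2"]]
```

Returning Period objects keeps the frequency attached to each value, which is the information the resolution return value was carrying.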

Comment From: jseabold

I don't much care if I get Periods or Timestamps, etc. As far as I'm concerned (as a user) they're pretty much the same. As a developer, I've written most software to be agnostic about what it gets as long as it can infer a frequency.

Comment From: jreback

mentioned in SO

Comment From: gfyoung

@jreback : Is this issue even relevant anymore? IINM, pd.datetools.parse doesn't exist.

Comment From: jreback

look at the imports; they are * imports, so yes

Comment From: gfyoung

@jreback : I don't understand what you mean by that.

Comment From: jreback

do a dir() on the namespace

Comment From: gfyoung

Again, do not follow. Just try the following:

>>> import pandas as pd
>>> pd.datetools.parse
...
AttributeError: module 'pandas.core.datetools' has no attribute 'parse'

Comment From: jreback

dir()

Comment From: gfyoung

Again, do not follow. Just try the following:

>>> import pandas as pd
>>> 'parse' in dir(pd)
False
>>> 'parse' in dir(pd.datetools)
False

I'm really not understanding what you're saying here.

Comment From: jreback

I guess it's gone

all of these issues are either fixed then or elsewhere (to_period needs a standalone issue)

Comment From: jreback

http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#whatsnew-0181-enhancements-assembling
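The linked enhancement lets to_datetime assemble datetimes directly from DataFrame columns. A minimal sketch, assuming columns named year, month, and day (the data is illustrative):

```python
import pandas as pd

# Illustrative frame with one column per date component.
df = pd.DataFrame({"year": [2015, 2016],
                   "month": [2, 3],
                   "day": [4, 5]})

# to_datetime recognizes the component column names and builds one datetime per row.
dates = pd.to_datetime(df[["year", "month", "day"]])
```

This covers the original use case without joining strings or building YYYYMMDD integers by hand.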

Comment From: jreback

@gfyoung if you would create an issue for to_period, that would be great. cc @sinhrks

Comment From: gfyoung

to_period for DatetimeIndex I presume?

Comment From: jreback

no it's like to_datetime but creates PeriodIndexes

Comment From: gfyoung

Got it, done: #14108
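For comparison with the proposed to_period, a minimal sketch of building a PeriodIndex with calls that already exist. It does not use or assume any top-level to_period function, and the inputs are illustrative:

```python
import pandas as pd

# From strings that pandas already understands as quarters.
quarters = pd.PeriodIndex(["1980Q1", "1980Q2"], freq="Q")

# From datetimes: parse first, then convert the DatetimeIndex to monthly periods.
months = pd.to_datetime(["1980-01-15", "1980-02-15"]).to_period("M")
```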