- [ ] document in docstring / cookbook / timeseries.rst usage of combining integer columns into YYYYMMDD and parsing to datetimes
- [ ] clean up imports of _parse/parse from dateutils
- [ ] to_period to create PeriodIndex
I started to make a PR #5885 to fix what I thought was a typo before realizing that this was intentional. It still doesn't make much sense to me though why I would want this return of datetime, _result, resolution. Maybe the whole approach could use a refactor. Otherwise what am I missing?
My typical use case for datetools.parse is something like
dates = map(lambda x : parse(' '.join(x)), zip(df.day, df.month, df.year))
A couple of questions.
1. Is there a better way to do this vectorized to datetime operation? AFAICT pd.to_datetime doesn't actually use the 'advanced' parsing for quarterly and monthly dates.
2. Should this all be unified? Assuming I haven't missed it, should there be, e.g., a function pd.parse_dates that is a general parser for strings and works on array-like input, deprecating datetools.parse, datetools.parse_time_string, and datetools.to_datetime? This function could also have a flag to return Period or Timestamp objects with frequency information instead of the current return of the parsed object and resolution. I ask because I'm having to do things like
dates = [x[0] for x in map(lambda x : parse(' '.join(x)), zip(df.day, df.month, df.year))]
Thoughts?
Comment From: jreback
If you already have integer day, month, year,
just do this, and make sure that you specify format '%Y%m%d'; it's specially optimized to handle this (whether it's an integer or a string)
In [14]: i = pd.date_range('20000101',periods=10000)
In [15]: df = pd.DataFrame(dict(year = i.year, month = i.month, day = i.day))
In [17]: %timeit pd.to_datetime(df.year*10000+df.month*100+df.day,format='%Y%m%d')
100 loops, best of 3: 8.4 ms per loop
Parsing the actual string
In [18]: ds = df.apply(lambda x: "%04d%02d%02d" % (x.year,x.month,x.day),axis=1)
In [21]: %timeit pd.to_datetime(ds)
1 loops, best of 3: 519 ms per loop
pd.to_datetime IS the general purpose parser (and will fall back to datetools for dates it doesn't have optimized parsing code for); you can always pass a strptime format
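As a self-contained version of the pattern above, here is a minimal sketch that combines integer year/month/day columns into a single YYYYMMDD integer and parses it with an explicit format. The column names and the small date range are just illustrative; the round-trip check at the end is my addition, not part of the original timing session.

```python
import pandas as pd

# Build a small frame of integer year/month/day columns (illustrative data).
i = pd.date_range("2000-01-01", periods=10)
df = pd.DataFrame({"year": i.year, "month": i.month, "day": i.day})

# Combine the columns into one YYYYMMDD integer per row, e.g. 20000101.
yyyymmdd = df["year"] * 10000 + df["month"] * 100 + df["day"]

# Parse with an explicit strptime format so nothing has to be inferred.
parsed = pd.to_datetime(yyyymmdd, format="%Y%m%d")

# Sanity check: parsing recovers the dates we started from.
round_trips = bool((parsed == pd.Series(i)).all())
```

Passing `format` explicitly is the important part: without it the integers would not be read as calendar dates.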
Comment From: jseabold
A couple of comments. This is not completely obvious. I'll see about adding this and similar examples to the docstring. In general, more examples in docstrings with common patterns would be welcome.
to_datetime is not general enough because it doesn't subsume the abilities of pd.parse_time_string. It only handles things that can be put into the strftime format. For example, it is very common to have dates in this format in economics.
pd.to_datetime(["1980m1", "1980m2"])
It'd be nice to have a function that handles this and dates handled by dateutil.parse.
And this returns an array for some reason, which I find to be odd. I see the box keyword, but since your example returns a Series, not a DatetimeIndex as indicated, wouldn't it make sense for everything just to return a Series?
Shouldn't to_datetime have parse in the name? This is what I tab-complete on. I know I want to "parse" the dates, I guess I wanted to parse them "to_datetime", but the former seems the obvious name in the namespace (to me). pd.parse_to_datetime?
Comment From: jreback
not sure where the name came from.
I agree creating a date from columns is not so obvious (a little more obvious if you are reading it in with read_csv).
would appreciate the combining example as a docstring / doc example / cookbook - I thought about how to make it 'automatically' do it but don't want to change the API...if you think of some way, great! (could have a convenience function, but not big on that)
These will return a Series if you pass a Series/array. The boxing is internally used by the DatetimeIndex parser (which just calls this); you can use it if you want an Index instead. It also returns a scalar Timestamp if only 1 value is passed as a scalar.
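To make the return types concrete, here is a small sketch of what `pd.to_datetime` gives back for different inputs. It reflects recent pandas releases as I understand them and deliberately avoids the `box` keyword discussed above; behavior in the versions current when this thread was written may differ.

```python
import pandas as pd

# A Series in gives a Series back.
from_series = pd.to_datetime(pd.Series(["2000-01-01", "2000-01-02"]))

# A plain list in gives a DatetimeIndex back.
from_list = pd.to_datetime(["2000-01-01", "2000-01-02"])

# A single scalar in gives a scalar Timestamp back.
from_scalar = pd.to_datetime("2000-01-01")

print(type(from_series).__name__, type(from_list).__name__, type(from_scalar).__name__)
```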
I could see the format argument taking a function for parsing. How would you have your example parsed? (prob as a PeriodIndex? maybe should have a to_period (and in 0.13 have to_timedelta))
I don't think can change to_datetime, but could alias to parse_to_datetime
no biggie on that
Comment From: jseabold
Also maybe import parse from dateutils as _parse to discourage its use. I had no idea it wasn't really intended to be part of the public API.
Comment From: jreback
I think the import is a bit 'confused' as it's in several places (some as _parse, some as parse). I believe the point was to allow it as a convenience in parsing dates with read_csv, but that has evolved to not be that necessary.
Comment From: jreback
ok...i'll create a todo list at the top of this PR then
Comment From: jseabold
My example would be parsed like
map(pd.datetools.parse_time_string, ["1980q1","1980q2"])
Though apparently 1980m12 is recognized as minute not month. I'd have m\d+ and q\d+ default to parsing as month and quarter, though I see that this has time_string in the title so probably not appropriate here. Odd that it tries to parse quarters here.
These are not unheard-of formats; they are commonly handled by econometric/statistical software.
[~/]
[7]: map(sm.tsa.datetools.date_parser, ["1998m1", "1998m2"])
[7]: [datetime.datetime(1998, 1, 31, 0, 0), datetime.datetime(1998, 2, 28, 0, 0)]
[~/]
[8]: map(sm.tsa.datetools.date_parser, ["1998QI", "1998QII"])
[8]: [datetime.datetime(1998, 3, 31, 0, 0), datetime.datetime(1998, 6, 30, 0, 0)]
[~/]
[9]: map(sm.tsa.datetools.date_parser, ["1998mX", "1998mXI"])
[9]: [datetime.datetime(1998, 10, 31, 0, 0), datetime.datetime(1998, 11, 30, 0, 0)]
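For illustration, the "m\d+ and q\d+ default to month and quarter" idea could be sketched roughly as follows. `parse_econ_period` is a hypothetical helper, not a pandas or statsmodels function; it only covers the Arabic-numeral forms like 1980m12 and 1980q1 (not the Roman-numeral variants shown above) and returns Period objects rather than end-of-period datetimes.

```python
import re

import pandas as pd

# Hypothetical pattern: four-digit year, 'm' or 'q', then a period number.
_ECON_PERIOD = re.compile(r"^(\d{4})([mMqQ])(\d{1,2})$")


def parse_econ_period(text):
    """Parse strings like '1980m12' or '1980q1' into a pandas Period."""
    match = _ECON_PERIOD.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized period string: {text!r}")
    year = match.group(1)
    kind = match.group(2).lower()
    number = int(match.group(3))
    if kind == "m":
        if not 1 <= number <= 12:
            raise ValueError(f"month out of range: {text!r}")
        return pd.Period(f"{year}-{number:02d}", freq="M")
    if not 1 <= number <= 4:
        raise ValueError(f"quarter out of range: {text!r}")
    return pd.Period(f"{year}Q{number}", freq="Q")


monthly = [parse_econ_period(s) for s in ["1980m1", "1980m12"]]
quarterly = [parse_econ_period(s) for s in ["1980q1", "1980q2"]]
```

Treating `m` as month here is a choice made for this data style; it is exactly the reading that conflicts with interpreting `m` as minutes in a general time-string parser.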
Comment From: jseabold
I don't much care if I get Periods or TimeStamps, etc. As far as I'm concerned (as a user) they're pretty much the same. As a developer, I've written most software to be agnostic about what it gets as long as it can infer a frequency.
Comment From: jreback
mentioned in SO
Comment From: gfyoung
@jreback : Is this issue even relevant anymore? IINM, pd.datetools.parse doesn't exist.
Comment From: jreback
look at the imports they are * so yes
Comment From: gfyoung
@jreback : I don't understand what you mean by that.
Comment From: jreback
do a dir() on the namespace
Comment From: gfyoung
Again, do not follow. Just try the following:
>>> import pandas as pd
>>> pd.datetools.parse
...
AttributeError: module 'pandas.core.datetools' has no attribute 'parse'
Comment From: jreback
dir()
Comment From: gfyoung
Again, do not follow. Just try the following:
>>> import pandas as pd
>>> 'parse' in dir(pd)
False
>>> 'parse' in dir(pd.datetools)
False
I'm really not understanding what you're saying here.
Comment From: jreback
I guess it's gone
all of these issues are either fixed then or elsewhere (to_period needs a standalone issue)
Comment From: jreback
http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#whatsnew-0181-enhancements-assembling
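A minimal sketch of the assembling feature that link refers to, as I understand it: since 0.18.1, `pd.to_datetime` accepts a DataFrame whose columns are named like `year`, `month`, `day`. The data below is made up for illustration.

```python
import pandas as pd

# Integer date components in separate, conventionally named columns.
df = pd.DataFrame({"year": [2015, 2016], "month": [2, 3], "day": [4, 5]})

# Assemble one datetime per row directly from the component columns.
dates = pd.to_datetime(df[["year", "month", "day"]])

print(dates)
```

This covers the original day/month/year use case without building strings or YYYYMMDD integers by hand.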
Comment From: jreback
@gfyoung if you would create an issue for to_period that would be great. cc @sinhrks
Comment From: gfyoung
to_period for DatetimeIndex I presume?
Comment From: jreback
no it's like to_datetime but creates PeriodIndexes
Comment From: gfyoung
Got it, done: #14108
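For context on what a standalone to_period might produce, here is a rough sketch using the existing `pd.PeriodIndex` constructor on strings pandas already understands. This is not the proposed function, just the closest existing route I am aware of.

```python
import pandas as pd

# Quarterly period strings in a form pandas parses natively.
quarters = pd.PeriodIndex(["1980Q1", "1980Q2"], freq="Q")

# Monthly period strings work the same way with a monthly frequency.
months = pd.PeriodIndex(["1980-01", "1980-02"], freq="M")

print(quarters)
print(months)
```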