Pandas frequency of time series read with read_csv

If I read a regular time series (with a fixed frequency) with read_csv, the resulting DataFrame has no frequency.


In [82]: lines = ['2012-07-27T13:%02d, %d' % (min, index) for index, min in enumerate(range(1, 59))]

In [83]: input = StringIO.StringIO('\n'.join(lines))

In [84]: df = pd.read_csv(input, parse_dates=True, index_col=0, names='A')

In [85]: df.index
Out[85]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-07-27 13:01:00, ..., 2012-07-27 13:58:00]
Length: 58, Freq: None, Timezone: None

What I would like is a way to set the frequency without resampling (or perhaps read_csv could guess the frequency).

Comment From: badgley

Alright going to try to knock this sucker out -- where should I be touching the code? Looks like TextParser already has a date_parser flag in it. But when I trace that down, it runs into some Cython code that feels intimidating. I suspect there already exists a way to specify the frequency of a timeseries index -- is it just a matter of passing some sort of parameter to that part of the code?

Comment From: dalejung

I don't think the parser can guess. In order to verify that it has a fixed freq, it would need to digest all the datetimes anyways. So it wouldn't really save you much since creating a fixed freq DatetimeIndex is quick.

Comment From: badgley

@dalejung thanks for the reply. I think the whole point of the ticket is -- even though making the DatetimeIndex is fast, it comes up so often that we might as well toss a flag in. Are you saying we should just close this down and not fix it?

Comment From: dalejung

@gbadge I was primarily referring to the ability to guess.

My point about the fixed freq is that I think the original issue is that he has to resample AFTER parse_dates, which is slow. Instead, he can just not parse dates and set the index to a separately created DatetimeIndex.

http://nbviewer.maxdrawdown.com/3422791/default.ipynb
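A minimal sketch of the approach described above, assuming the file is known to be regular with a one-minute step. The column names `time` and `A` and the variable names are made up for illustration, and the notebook linked above may do this differently:

```python
import io

import pandas as pd

# Rebuild the sample data from the original report: 58 rows, one per minute.
lines = ["2012-07-27T13:%02d,%d" % (minute, i)
         for i, minute in enumerate(range(1, 59))]

# Read the timestamps as plain strings (no parse_dates), which skips date parsing.
raw = pd.read_csv(io.StringIO("\n".join(lines)), names=["time", "A"])

# Build a fixed-frequency index from the first timestamp and the row count.
# Assumption: the data really is regular; this does NOT verify it.
index = pd.date_range(start=pd.Timestamp(raw["time"].iloc[0]),
                      periods=len(raw), freq="min")

df_fixed = raw.drop(columns="time")
df_fixed.index = index
```

Only the first timestamp is parsed here, so an irregular file would silently get wrong labels; this is only appropriate when the regularity is known beforehand.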

Comment From: bmu

@gbadge thanks for your help.

@dalejung I usually work with time series files that have regular or irregular time steps, but I don't know in advance whether they are regular or not.

Your approach is good for regular time series. Something like this could be implemented with an additional keyword 'freq' to read_csv.
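As a rough sketch of what such a keyword could look like from the user's side: `read_csv_with_freq` below is a hypothetical wrapper, not a pandas function, and read_csv itself has no `freq` argument. It relies on the `freq` parameter of the `pd.DatetimeIndex` constructor, which accepts an explicit frequency or the string "infer":

```python
import io

import pandas as pd


def read_csv_with_freq(source, freq="infer", **kwargs):
    """Hypothetical helper (not part of pandas): read a CSV whose first
    column holds timestamps and attach a frequency to the resulting index.

    freq="infer" lets pandas try to infer the frequency; an explicit
    frequency is checked against the parsed timestamps.
    """
    df = pd.read_csv(source, index_col=0, parse_dates=True, **kwargs)
    # Rebuilding the index with freq=... sets (and validates) the frequency.
    df.index = pd.DatetimeIndex(df.index, freq=freq)
    return df


lines = ["2012-07-27T13:%02d,%d" % (minute, i)
         for i, minute in enumerate(range(1, 59))]
csv_text = "\n".join(lines)

df_auto = read_csv_with_freq(io.StringIO(csv_text), names=["time", "A"])
df_explicit = read_csv_with_freq(io.StringIO(csv_text), freq="min",
                                 names=["time", "A"])
```

Passing a frequency that the timestamps do not conform to (for example "5min" for this one-minute data) should raise a ValueError rather than silently relabel the rows.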

Comment From: patricktokeeffe

This should not be implemented using a freq argument if the intent is to pass in some numeric value. If the read_csv function is called with parse_dates=True, then the date parsing function should attempt to determine the frequency on its own. It's nonsense to expect the frequency to be known and passed to the parsing function. If the user can specify a value for freq, they might as well reindex after loading.

Maybe I could understand adding a determine_freq boolean argument...

Better still (for the user), parse the dates and then try to determine the frequency anyway using pd.infer_freq. It took me quite a while to find this function; since it exists, I'm a little puzzled that the date parser doesn't use it by default.
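For reference, a small sketch of doing this by hand after loading, assuming a headerless two-column file (the column names `time` and `A` are made up here). pd.infer_freq returns None when it cannot find a regular frequency, so the result is guarded before use:

```python
import io

import pandas as pd

lines = ["2012-07-27T13:%02d,%d" % (minute, i)
         for i, minute in enumerate(range(1, 59))]
csv_text = "\n".join(lines)

df_inferred = pd.read_csv(io.StringIO(csv_text), names=["time", "A"],
                          index_col="time", parse_dates=True)

# Ask pandas to guess the frequency from the parsed timestamps.
freq = pd.infer_freq(df_inferred.index)
if freq is not None:
    # Regular series: conform the frame to that frequency, which also
    # records it on the index.
    df_inferred = df_inferred.asfreq(freq)

# An irregular series (one row removed) has no single frequency to infer.
df_gap = df_inferred.drop(df_inferred.index[10])
gap_freq = pd.infer_freq(df_gap.index)
```

The exact alias returned for a one-minute step depends on the pandas version, so it is better to compare offsets than to hard-code the string.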

Comment From: jreback

The dates are created, but unless you turn them into a DatetimeIndex, e.g. via pd.to_datetime(...., box=True), you won't have automatic frequency inference (and it's not completely free to do this, which is why it's done lazily).

Comment From: jreback

not an enhancement, use pd.infer_freq

Comment From: dacoex

[Edit: corrected code snippet]

So what is the best way to set the frequency directly from the csv file dates?

df = df.asfreq(pd.infer_freq(df.index))

And why can't infer_freq be part of the read_csv function?

Comment From: dacoex

similar issue: https://github.com/pydata/pandas/issues/5613