Consider the following DataFrame:
0F001tTt1Q 0F00dqgq6a 0F00ewG3TT
1911-01-01 00:00:00.000000 complete no_creatives complete
2016-07-11 04:14:55.281017 None None None
2016-07-11 04:14:55.836767 None None None
2016-07-11 04:14:56.379465 None None None
2016-07-11 04:14:57.078314 None None None
2016-07-11 04:15:48.984814 None None None
2016-07-11 04:27:35.625008 None None None
2016-07-11 04:28:17.095956 None None None
2016-07-11 04:29:35.750822 None None None
2016-07-11 04:30:31.932919 None None None
I want to resample it on the following dates with a forward fill:
dates = pd.date_range(date(2016, 7, 11), date(2016, 7, 18))
This doesn't completely do it:
> df.reindex(drange, method='ffill')
0F001tTt1Q 0F00dqgq6a 0F00ewG3TT
2016-07-11 complete no_creatives complete
2016-07-12 None None None
2016-07-13 None None None
2016-07-14 None None None
2016-07-15 None None None
2016-07-16 None None None
2016-07-17 None None None
2016-07-18 None None None
Oddly, I have to do a ffill
"twice" (below, in the same line):
>> df.reindex(drange, method='ffill').ffill()
0F001tTt1Q 0F00dqgq6a 0F00ewG3TT
2016-07-11 complete no_creatives complete
2016-07-12 complete no_creatives complete
2016-07-13 complete no_creatives complete
2016-07-14 complete no_creatives complete
2016-07-15 complete no_creatives complete
2016-07-16 complete no_creatives complete
2016-07-17 complete no_creatives complete
2016-07-18 complete no_creatives complete
Is that expected? Am I missing something?
Comment From: jreback
that's as expected, pls read the difference between reindex WHILE filling and a reindex after.
Comment From: shoyer
To be clear, ffill
in reindex only fills in labels that don't have an exact match, e.g., if any of drange
don't exactly match the index to df
in your example. The DataFrame values themselves are never inspected in reindex
.
Comment From: amelio-vazquez-reina
@Thanks @shoyer and @jreback
I'm still missing something. Let's say we have:
> df['0FyAlcmk6U']
1911-01-01 00:00:00.000000 3450.60
2016-07-15 13:02:47.662124 3700.60
2016-07-15 16:10:26.526653 3600.60
2016-07-16 13:17:31.411824 3660.60
2016-07-17 16:53:00.053440 3671.86
The reindex
operation seems to be losing data (i.e. a second fffill
here would thus be hopeless).
> df.reindex(drange, method='ffill')['0FyAlcmk6U']
2016-07-11 3450.60
2016-07-12 NaN
2016-07-13 NaN
2016-07-14 NaN
2016-07-15 NaN
2016-07-16 NaN
2016-07-17 NaN
2016-07-18 3671.86
with:
> drange
DatetimeIndex(['2016-07-11', '2016-07-12', '2016-07-13', '2016-07-14',
'2016-07-15', '2016-07-16', '2016-07-17', '2016-07-18'],
dtype='datetime64[ns]', freq='D')
What's the alternative here?
The goal is to sample the original dataframe with a pre-specified set of dates, doing a ffill
as needed.
Comment From: jreback
@amelio-vazquez-reina .reindex
picks out dates given a range. I suspect you either want .asfreq
or more probably, .resample
Comment From: amelio-vazquez-reina
Thanks @jreback That's what I thought, but .asfreq
and .resample
do not take a sequence of dates (or timestamps) right? They only take sampling frequencies, which is why I thought I would need to use .reindex
(or a combination of both types of methods).
Comment From: amelio-vazquez-reina
In the opening example, the first date is 1911. Even if I can deduce that I need a daily sampling frequency (or DateOffset
object), sampling with one point per day would be an overkill here, given that I am only interested in the last few days (e.g. 7 days).
Comment From: jreback
honestly I have no idea what u actually are trying to do
give the docs another read / the examples are very good
Comment From: amelio-vazquez-reina
Thanks @jreback I will certainly do that, but here's one more try (if you don't mind):
Super simple example with a single column:
1911-01-01 00:00:00.000000 something
2016-07-01 04:14:55.281017 None
2016-07-11 04:14:55.836767 None
2016-07-08 04:14:56.379465 None
2016-07-11 04:14:57.078314 None
2016-07-10 04:15:48.984814 something_else
2016-07-11 04:27:35.625008 None
2016-07-11 04:28:17.095956 None
2016-07-13 04:29:35.750822 None
2016-07-15 04:30:31.932919 None
I want to sample the above with forward fill so that I the right data on a specific set of days. That is, the above would just result in:
2016-07-01 something
2016-07-08 something
2016-07-13 something_else
2016-07-14 something_else
2016-07-15 something_else
Note that the first date starts in 1911 so doing a .asfreq
here to get daily data first (to then subsample) would be an overkill (note that I only want 5 data points).
Things could be worse if we are talking about forward fills at the second level I only need the last few seconds!
Does that make sense now?
Comment From: amelio-vazquez-reina
I renamed this to reflect the fact that the problem revolves around "sparse sampling" . Hopefully that clarifies things a bit :).
Comment From: shoyer
The simplest solution to your problem is probably to either use .dropna()
or .fillna()
to drop or fill NA values first, and then use .reindex(..., method='ffill')
afterwards to find the nearest values to your query. Pandas doesn't have a method for reindexing while skipping missing entries in DataFrame.values
.