Pandas Sparse sampling timestamps with filling

Consider the following DataFrame:

                           0F001tTt1Q    0F00dqgq6a 0F00ewG3TT
1911-01-01 00:00:00.000000   complete  no_creatives   complete
2016-07-11 04:14:55.281017       None          None       None
2016-07-11 04:14:55.836767       None          None       None
2016-07-11 04:14:56.379465       None          None       None
2016-07-11 04:14:57.078314       None          None       None
2016-07-11 04:15:48.984814       None          None       None
2016-07-11 04:27:35.625008       None          None       None
2016-07-11 04:28:17.095956       None          None       None
2016-07-11 04:29:35.750822       None          None       None
2016-07-11 04:30:31.932919       None          None       None

I want to resample it on the following dates with a forward fill:

dates = pd.date_range(date(2016, 7, 11), date(2016, 7, 18))

This doesn't completely do it:

> df.reindex(drange, method='ffill')

           0F001tTt1Q    0F00dqgq6a 0F00ewG3TT
2016-07-11   complete  no_creatives   complete
2016-07-12       None          None       None
2016-07-13       None          None       None
2016-07-14       None          None       None
2016-07-15       None          None       None
2016-07-16       None          None       None
2016-07-17       None          None       None
2016-07-18       None          None       None

Oddly, I have to do a ffill "twice" (below, in the same line):

>> df.reindex(drange, method='ffill').ffill()
           0F001tTt1Q    0F00dqgq6a 0F00ewG3TT
2016-07-11   complete  no_creatives   complete
2016-07-12   complete  no_creatives   complete
2016-07-13   complete  no_creatives   complete
2016-07-14   complete  no_creatives   complete
2016-07-15   complete  no_creatives   complete
2016-07-16   complete  no_creatives   complete
2016-07-17   complete  no_creatives   complete
2016-07-18   complete  no_creatives   complete

Is that expected? Am I missing something?

Comment From: jreback

that's as expected, pls read the difference between reindex WHILE filling and a reindex after.

Comment From: shoyer

To be clear, ffill in reindex only fills in labels that don't have an exact match, e.g., if any of drange don't exactly match the index to df in your example. The DataFrame values themselves are never inspected in reindex.

Comment From: amelio-vazquez-reina

@Thanks @shoyer and @jreback

I'm still missing something. Let's say we have:

> df['0FyAlcmk6U']
1911-01-01 00:00:00.000000    3450.60
2016-07-15 13:02:47.662124    3700.60
2016-07-15 16:10:26.526653    3600.60
2016-07-16 13:17:31.411824    3660.60
2016-07-17 16:53:00.053440    3671.86

The reindex operation seems to be losing data (i.e. a second fffill here would thus be hopeless).

> df.reindex(drange, method='ffill')['0FyAlcmk6U']
2016-07-11    3450.60
2016-07-12        NaN
2016-07-13        NaN
2016-07-14        NaN
2016-07-15        NaN
2016-07-16        NaN
2016-07-17        NaN
2016-07-18    3671.86

with:

> drange
DatetimeIndex(['2016-07-11', '2016-07-12', '2016-07-13', '2016-07-14',
               '2016-07-15', '2016-07-16', '2016-07-17', '2016-07-18'],
              dtype='datetime64[ns]', freq='D')

What's the alternative here?

The goal is to sample the original dataframe with a pre-specified set of dates, doing a ffill as needed.

Comment From: jreback

@amelio-vazquez-reina .reindex picks out dates given a range. I suspect you either want .asfreq or more probably, .resample

Comment From: amelio-vazquez-reina

Thanks @jreback That's what I thought, but .asfreq and .resample do not take a sequence of dates (or timestamps) right? They only take sampling frequencies, which is why I thought I would need to use .reindex (or a combination of both types of methods).

Comment From: amelio-vazquez-reina

In the opening example, the first date is 1911. Even if I can deduce that I need a daily sampling frequency (or DateOffset object), sampling with one point per day would be an overkill here, given that I am only interested in the last few days (e.g. 7 days).

Comment From: jreback

honestly I have no idea what u actually are trying to do

give the docs another read / the examples are very good

Comment From: amelio-vazquez-reina

Thanks @jreback I will certainly do that, but here's one more try (if you don't mind):

Super simple example with a single column:

1911-01-01 00:00:00.000000       something
2016-07-01 04:14:55.281017       None
2016-07-11 04:14:55.836767       None
2016-07-08 04:14:56.379465       None
2016-07-11 04:14:57.078314       None
2016-07-10 04:15:48.984814       something_else
2016-07-11 04:27:35.625008       None
2016-07-11 04:28:17.095956       None
2016-07-13 04:29:35.750822       None
2016-07-15 04:30:31.932919       None

I want to sample the above with forward fill so that I the right data on a specific set of days. That is, the above would just result in:

2016-07-01 something
2016-07-08 something
2016-07-13 something_else
2016-07-14 something_else
2016-07-15 something_else

Note that the first date starts in 1911 so doing a .asfreq here to get daily data first (to then subsample) would be an overkill (note that I only want 5 data points).

Things could be worse if we are talking about forward fills at the second level I only need the last few seconds!

Does that make sense now?

Comment From: amelio-vazquez-reina

I renamed this to reflect the fact that the problem revolves around "sparse sampling" . Hopefully that clarifies things a bit :).

Comment From: shoyer

The simplest solution to your problem is probably to either use .dropna() or .fillna() to drop or fill NA values first, and then use .reindex(..., method='ffill') afterwards to find the nearest values to your query. Pandas doesn't have a method for reindexing while skipping missing entries in DataFrame.values.