Code Sample, a copy-pastable example if possible

"""Bug pandas."""
from pandas import read_csv, to_datetime

try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO

# Ten samples at roughly one-second intervals; two fall within the :08 second.
dtf = read_csv(StringIO('Time,Value\n'
                        '2017-11-08 15:14:04.421,1\n'
                        '2017-11-08 15:14:05.528,2\n'
                        '2017-11-08 15:14:06.714,3\n'
                        '2017-11-08 15:14:07.113,4\n'
                        '2017-11-08 15:14:08.282,5\n'
                        '2017-11-08 15:14:08.681,6\n'
                        '2017-11-08 15:14:10.650,7\n'
                        '2017-11-08 15:14:11.826,8\n'
                        '2017-11-08 15:14:12.225,9\n'
                        '2017-11-08 15:14:13.416,10'))

# Parse the timestamps and use them as the index.
dtf.Time = to_datetime(dtf.Time)
dtf.set_index('Time', inplace=True)
print(dtf)

# Upsample to a regular 1-second grid, backfilling at most one step.
dtf2 = dtf.resample('1s').bfill(limit=1)
print(dtf2)

Problem description

Current output:

                      Value
Time                       
2017-11-08 15:14:04     1.0
2017-11-08 15:14:05     2.0
2017-11-08 15:14:06     3.0
2017-11-08 15:14:07     4.0
2017-11-08 15:14:08     5.0
2017-11-08 15:14:09     NaN
2017-11-08 15:14:10     7.0
2017-11-08 15:14:11     8.0
2017-11-08 15:14:12     9.0
2017-11-08 15:14:13    10.0

Main issue: the NaN at 15:14:09 is not expected.

Secondary, minor issue: bfill with limit changes the dtype of the column from int64 to float64, while bfill without limit leaves the dtype unchanged.
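
The dtype change can be verified directly; a minimal sketch against the dtf frame built above (dtypes as reported in this issue on pandas 0.22.0):

# bfill without limit keeps the original int64 dtype...
print(dtf.resample('1s').bfill().dtypes)         # Value    int64
# ...while bfill with limit upcasts to float64 (NaN requires a float).
print(dtf.resample('1s').bfill(limit=1).dtypes)  # Value    float64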

Expected Output

                      Value
Time                       
2017-11-08 15:14:04     1
2017-11-08 15:14:05     2
2017-11-08 15:14:06     3
2017-11-08 15:14:07     4
2017-11-08 15:14:08     5
2017-11-08 15:14:09     6
2017-11-08 15:14:10     7
2017-11-08 15:14:11     8
2017-11-08 15:14:12     9
2017-11-08 15:14:13    10

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 34.3.0
Cython: None
numpy: 1.14.0
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Comment From: jreback

How is this not as expected? You have duplicate values in the :08 second, so you must pick one in the absence of anything else.

This is all consistent:

In [20]: dtf.resample('1s').last()
Out[20]: 
                      Value
Time                       
2017-11-08 15:14:04     1.0
2017-11-08 15:14:05     2.0
2017-11-08 15:14:06     3.0
2017-11-08 15:14:07     4.0
2017-11-08 15:14:08     6.0
2017-11-08 15:14:09     NaN
2017-11-08 15:14:10     7.0
2017-11-08 15:14:11     8.0
2017-11-08 15:14:12     9.0
2017-11-08 15:14:13    10.0

In [21]: dtf.resample('1s').bfill()
Out[21]: 
                      Value
Time                       
2017-11-08 15:14:04       1
2017-11-08 15:14:05       2
2017-11-08 15:14:06       3
2017-11-08 15:14:07       4
2017-11-08 15:14:08       5
2017-11-08 15:14:09       7
2017-11-08 15:14:10       7
2017-11-08 15:14:11       8
2017-11-08 15:14:12       9
2017-11-08 15:14:13      10

In [22]: dtf.resample('1s').bfill(limit=1)
Out[22]: 
                      Value
Time                       
2017-11-08 15:14:04     1.0
2017-11-08 15:14:05     2.0
2017-11-08 15:14:06     3.0
2017-11-08 15:14:07     4.0
2017-11-08 15:14:08     5.0
2017-11-08 15:14:09     NaN
2017-11-08 15:14:10     7.0
2017-11-08 15:14:11     8.0
2017-11-08 15:14:12     9.0
2017-11-08 15:14:13    10.0

Comment From: jreback

To handle duplicate times you need some kind of aggregating operation (last, mean, etc.):

In [29]: dtf.resample('1s').last().bfill(limit=1)
Out[29]: 
                      Value
Time                       
2017-11-08 15:14:04     1.0
2017-11-08 15:14:05     2.0
2017-11-08 15:14:06     3.0
2017-11-08 15:14:07     4.0
2017-11-08 15:14:08     6.0
2017-11-08 15:14:09     7.0
2017-11-08 15:14:10     7.0
2017-11-08 15:14:11     8.0
2017-11-08 15:14:12     9.0
2017-11-08 15:14:13    10.0

Comment From: VelizarVESSELINOV

8.2 => 8, 8.6 => 9

@jreback Do you agree that rounding to the nearest second is the expected behavior?

Comment From: VelizarVESSELINOV

How is this not as expected?

@jreback Do you expect that changing the limit argument from its default will change the dtype of the output?

Comment From: jreback

there is no rounding. These are just IN the same interval.

Comment From: jreback

If you want to round you can do that, but that is not what resample does.
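
To see the binning concretely, an illustrative sketch (not from the thread): both sub-second timestamps fall under the same :08 bin label, with no rounding involved.

from pandas import to_datetime

# Flooring to whole seconds is how the 1s bins are labelled here:
# both 8.282 and 8.681 land in the :08 bin.
idx = to_datetime(['2017-11-08 15:14:08.282', '2017-11-08 15:14:08.681'])
print(idx.floor('s'))
# DatetimeIndex(['2017-11-08 15:14:08', '2017-11-08 15:14:08'], ...)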

Comment From: VelizarVESSELINOV

@jreback At least users like me expect precision from resampling: 8 + epsilon should be resampled to 8, not 9. It is fine if you have another point of view, and I respect your opinion. The tolerance is fine even if it produces imprecise results; people like me just need to avoid imprecise methods and packages.

Comment From: TomAugspurger

@VelizarVESSELINOV can you suggest any changes to http://pandas.pydata.org/pandas-docs/stable/timeseries.html#resampling to make it clearer? I'm wondering what made you think it was rounding instead of binning.

Comment From: jreback

@VelizarVESSELINOV you are missing the point. 8 is NOT resampled to 9, they are simply put in the same bin. There IS no rounding happening here. You have duplicated values per bin. Then backfill uses the last value. This is quite precise actually.

Comment From: VelizarVESSELINOV

Pandas BDFLs should be more open to domains other than macroeconomics, where resampling is high-level yearly/quarterly, very basic 'binning'.

There is another world of IoT, high-frequency time series data, and data fusion, where the precision of resampling is mission critical. When you analyze SpaceX data from sensors whose timestamps are not aligned, an epsilon of imprecision in one sensor can misalign your whole analysis, which is not a great thing.

Now I understand better why R and Matlab keep gaining importance: their scientific approach to what is statistically correct and precise. Personally, I hope pandas and other Python packages will manage to evolve to cover the sensor data fusion world, where I think pandas, today's leading data management tool, is lagging behind the competition and showing no interest in covering the basics of sensor data fusion.

Just take a look at the Matlab resample documentation (https://www.mathworks.com/help/signal/ref/resample.htm) and tell us what level of sophistication the pandas function covers.

OK with the binning concept, but then it is important to know where a bin starts and ends. In the sensor world, for a sample at :07 seconds the logical bin interval is :06.5 to :07.5, not :07 to :08 (of course there are exceptions).

After that, how you pick a value when there is more than one sample in a bin is a very important option; the most logical is nearest. But you can also take the average (or median, or mode) with different weighting factors (equal, triangular, etc.), as in the sketch below.
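
A nearest-second behavior like the one described can be approximated in pandas today; a minimal sketch (illustrative, not an official resample option) using DatetimeIndex.round plus groupby on the dtf frame above:

# Round each timestamp to the nearest second, then aggregate any
# duplicates; 15:14:08.282 maps to :08 and 15:14:08.681 maps to :09.
nearest = dtf.groupby(dtf.index.round('s')).mean()
print(nearest)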

Comment From: TomAugspurger

Pandas BDFLs should be more open to domains other than macroeconomics, where resampling is high-level yearly/quarterly, very basic 'binning'.

I'm very confused. This doesn't have anything to do with "openness to other domains". It has to do with our definition of resample. It's a groupby-like operation.

df.resample('1s') is like df.groupby(df.index.second). You wouldn't expect that groupby to round 8.7 to 9, would you?
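
The analogy can be made concrete with a sketch (illustrative; it groups on the floored timestamps rather than just .second so the date is kept):

# resample('1s') bins rows exactly like a groupby on the floored
# timestamp; resample additionally emits the empty :09 bin.
by_resample = dtf.resample('1s').last()
by_groupby = dtf.groupby(dtf.index.floor('s')).last()
print(by_resample.equals(by_groupby.reindex(by_resample.index)))  # expect True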

Comment From: VelizarVESSELINOV

If you use the nearest option, then yes, 8.7 should be resampled to 9. If you also define the bin intervals to be centered, 9 should cover samples from 8.5 to 9.5. Please read more about what the resample options do in the Matlab and R worlds, and look at the output QC plots for the resampling methods: Matlab resample.

I can see from your example that you are coming from the SQL GROUP BY world.