Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> import pandas as pd
>>> df = pd.DataFrame(
      {
          "Timestamp": [pd.Timestamp(i) for i in range(3)],
          "Food": ["apple", "apple", "banana"],
      }
  )

>>> dfg = df.groupby(pd.Grouper(freq="1D", key="Timestamp"))
>>> dfg.value_counts()

../../core/groupby/generic.py:1800: in value_counts
    result_series = cast(Series, gb.size())
../../core/groupby/groupby.py:2323: in size
    result = self.grouper.size()
../../core/groupby/ops.py:881: in size
    ids, _, ngroups = self.group_info
pandas/_libs/properties.pyx:36: in pandas._libs.properties.CachedProperty.__get__
    ???
../../core/groupby/ops.py:915: in group_info
    comp_ids, obs_group_ids = self._get_compressed_codes()
../../core/groupby/ops.py:941: in _get_compressed_codes
    group_index = get_group_index(self.codes, self.shape, sort=True, xnull=True)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

labels = [array([0]), array([0, 1, 2]), array([0, 0, 1])], shape = (1, 3, 2)
sort = True, xnull = True

    def get_group_index(
        labels, shape: Shape, sort: bool, xnull: bool
    ) -> npt.NDArray[np.int64]:
        """
        For the particular label_list, gets the offsets into the hypothetical list
        representing the totally ordered cartesian product of all possible label
        combinations, *as long as* this space fits within int64 bounds;
        otherwise, though group indices identify unique combinations of
        labels, they cannot be deconstructed.
        - If `sort`, rank of returned ids preserve lexical ranks of labels.
          i.e. returned id's can be used to do lexical sort on labels;
        - If `xnull` nulls (-1 labels) are passed through.

        Parameters
        ----------
        labels : sequence of arrays
            Integers identifying levels at each location
        shape : tuple[int, ...]
            Number of unique levels at each location
        sort : bool
            If the ranks of returned ids should match lexical ranks of labels
        xnull : bool
            If true nulls are excluded. i.e. -1 values in the labels are
            passed through.

        Returns
        -------
        An array of type int64 where two elements are equal if their corresponding
        labels are equal at all location.

        Notes
        -----
        The length of `labels` and `shape` must be identical.
        """

        def _int64_cut_off(shape) -> int:
            acc = 1
            for i, mul in enumerate(shape):
                acc *= int(mul)
                if not acc < lib.i8max:
                    return i
            return len(shape)

        def maybe_lift(lab, size) -> tuple[np.ndarray, int]:
            # promote nan values (assigned -1 label in lab array)
            # so that all output values are non-negative
            return (lab + 1, size + 1) if (lab == -1).any() else (lab, size)

        labels = [ensure_int64(x) for x in labels]
        lshape = list(shape)
        if not xnull:
            for i, (lab, size) in enumerate(zip(labels, shape)):
                lab, size = maybe_lift(lab, size)
                labels[i] = lab
                lshape[i] = size

        labels = list(labels)

        # Iteratively process all the labels in chunks sized so less
        # than lib.i8max unique int ids will be required for each chunk
        while True:
            # how many levels can be done without overflow:
            nlev = _int64_cut_off(lshape)

            # compute flat ids for the first `nlev` levels
            stride = np.prod(lshape[1:nlev], dtype="i8")
            out = stride * labels[0].astype("i8", subok=False, copy=False)

            for i in range(1, nlev):
                if lshape[i] == 0:
                    stride = np.int64(0)
                else:
                    stride //= lshape[i]
>               out += labels[i] * stride
E               ValueError: non-broadcastable output operand with shape (1,) doesn't match the broadcast shape (3,)

../../core/sorting.py:182: ValueError
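
The ValueError at the bottom is plain NumPy broadcasting: the first codes array has length 1 while the others have length 3 (see the labels in the locals above). A minimal standalone sketch of the failing arithmetic, using those values:

>>> import numpy as np
>>> labels = [np.array([0]), np.array([0, 1, 2]), np.array([0, 0, 1])]
>>> shape = (1, 3, 2)
>>> stride = np.prod(shape[1:], dtype="i8")  # 3 * 2 = 6
>>> out = stride * labels[0].astype("i8")    # shape (1,)
>>> stride //= shape[1]                      # now 2
>>> out += labels[1] * stride                # in-place add cannot broadcast (3,) into (1,)
Traceback (most recent call last):
  ...
ValueError: non-broadcastable output operand with shape (1,) doesn't match the broadcast shape (3,)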

Issue Description

DataFrameGroupBy.value_counts fails when grouping with a Grouper that has a freq, while the equivalent SeriesGroupBy call works. The SeriesGroupBy implementation is already covered by a test named test_series_groupby_value_counts_with_grouper.

Expected Behavior

In this case, the dataframe has only one column besides the grouping key, so it should return the same result as the SeriesGroupBy implementation:

>>> dfg["Food"].value_counts()
Timestamp   Food  
1970-01-01  apple     2
            banana    1
Name: Food, dtype: int64

This difference between Series and DataFrame behaviors comes from the fact that they currently have two very different implementations. The refactor of these two implementations into a single one might be done in https://github.com/pandas-dev/pandas/pull/46940.
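
Until the implementations are unified, a possible workaround (a sketch, not an official recommendation) is to go through the working SeriesGroupBy path, or to pass the column as an additional grouping key and count with size():

>>> dfg["Food"].value_counts()  # Series path, works
>>> df.groupby([pd.Grouper(freq="1D", key="Timestamp"), "Food"]).size()  # same counts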

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 997f84bd8fd99952c1ea464b7794c989ccdf402e
python           : 3.8.10.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.13.0-44-generic
Version          : #49~20.04.1-Ubuntu SMP Wed May 18 18:44:28 UTC 2022
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : fr_FR.UTF-8
LOCALE           : fr_FR.UTF-8

pandas           : 1.1.0.dev0+8026.g997f84bd8f.dirty
numpy            : 1.22.3
pytz             : 2022.1
dateutil         : 2.8.2
setuptools       : 57.5.0
pip              : 20.0.2
Cython           : 0.29.30
pytest           : 7.1.2
hypothesis       : 6.46.2
sphinx           : 4.5.0
blosc            : 1.10.6
feather          : None
xlsxwriter       : 3.0.3
lxml.etree       : 4.8.0
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.3.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : 1.3.4
brotli           : None
fastparquet      : 0.8.1
fsspec           : 2022.3.0
gcsfs            : 2022.3.0
matplotlib       : 3.5.2
numba            : None
numexpr          : 2.8.1
odfpy            : None
openpyxl         : 3.0.9
pandas_gbq       : None
pyarrow          : 8.0.0
pyreadstat       : 1.1.6
pyxlsb           : None
s3fs             : 2022.3.0
scipy            : 1.8.0
snappy           :
sqlalchemy       : 1.4.36
tables           : 3.7.0
tabulate         : 0.8.9
xarray           : 0.18.2
xlrd             : 2.0.1
xlwt             : 1.3.0
zstandard        : None

Comment From: simonjayhawkins

Thanks @LucasG0 for the report and investigation.

DataFrameGroupBy.value_counts was added in pandas 1.4 (#44267).

This difference between Series and DataFrame behaviors comes from the fact that they currently have two very different implementations. The refactor of these two implementations into a single one might be done in #46940.

It would be ideal to backport a fix to 1.4.x, but to do so we would need to restrict the changes to the bug fix only.

Comment From: mroeschke

The core issue arises here, when filtering to determine which columns should be included in the value counts:

https://github.com/pandas-dev/pandas/blob/e65a30e3ebdb7572a943d097882c241789569669/pandas/core/groupby/generic.py#L1802

Groupers with frequencies (and, I would suspect, resample as well) always set in_axis=False:

https://github.com/pandas-dev/pandas/blob/e65a30e3ebdb7572a943d097882c241789569669/pandas/core/groupby/ops.py#L1263

There appears to be no easy way to set in_axis=True, and doing so may have further ramifications downstream. So I think the more sensible fix is the one suggested above: combining the Series implementation with the DataFrame implementation. That is out of scope for a point release, so I am removing the milestone.
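
One way to observe this from the outside (these are internal attributes and subject to change, so treat this as an illustrative sketch only):

>>> gb = df.groupby(pd.Grouper(freq="1D", key="Timestamp"))
>>> [ping.in_axis for ping in gb.grouper.groupings]
[False]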

Comment From: rhshadrach

I now get the expected result on main. This was fixed by #50507.