Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import pyarrow as pa
# First case: just one column. It gives the error below
pd.DataFrame( { 'A': [ 'a1', pd.NA ] }, dtype = pd.ArrowDtype( pa.dictionary( pa.int32(), pa.utf8() ) ) ).value_counts( dropna = False )
# Second case: more than one column. It gives the wrong result below
pd.concat( [
pd.DataFrame( { 'A': [ 'a1', 'a2' ], 'B': [ 'b1', pd.NA ] }, dtype = pd.ArrowDtype( pa.string() ) ),
pd.DataFrame( { 'C': [ 'c1', 'c2' ], 'D': [ 'd1', pd.NA ] }, dtype = pd.ArrowDtype( pa.dictionary( pa.int32(), pa.utf8() ) ) )
], axis = 1 ).value_counts( dropna = False )
Issue Description
First case
It gives the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[6], line 1
----> 1 pd.DataFrame( { 'A': [ 'a1', pd.NA ] }, dtype = pd.ArrowDtype( pa.dictionary( pa.int32(), pa.utf8() ) ) ).value_counts( dropna = False )
File C:\Python\Lib\site-packages\pandas\core\frame.py:7519, in DataFrame.value_counts(self, subset, normalize, sort, ascending, dropna)
7517 # Force MultiIndex for a list_like subset with a single column
7518 if is_list_like(subset) and len(subset) == 1: # type: ignore[arg-type]
-> 7519 counts.index = MultiIndex.from_arrays(
7520 [counts.index], names=[counts.index.name]
7521 )
7523 return counts
File C:\Python\Lib\site-packages\pandas\core\indexes\multi.py:533, in MultiIndex.from_arrays(cls, arrays, sortorder, names)
530 if len(arrays[i]) != len(arrays[i - 1]):
531 raise ValueError("all arrays must be same length")
--> 533 codes, levels = factorize_from_iterables(arrays)
534 if names is lib.no_default:
535 names = [getattr(arr, "name", None) for arr in arrays]
File C:\Python\Lib\site-packages\pandas\core\arrays\categorical.py:3069, in factorize_from_iterables(iterables)
3065 if len(iterables) == 0:
3066 # For consistency, it should return two empty lists.
3067 return [], []
-> 3069 codes, categories = zip(*(factorize_from_iterable(it) for it in iterables))
3070 return list(codes), list(categories)
File C:\Python\Lib\site-packages\pandas\core\arrays\categorical.py:3069, in <genexpr>(.0)
3065 if len(iterables) == 0:
3066 # For consistency, it should return two empty lists.
3067 return [], []
-> 3069 codes, categories = zip(*(factorize_from_iterable(it) for it in iterables))
3070 return list(codes), list(categories)
File C:\Python\Lib\site-packages\pandas\core\arrays\categorical.py:3042, in factorize_from_iterable(values)
3037 codes = values.codes
3038 else:
3039 # The value of ordered is irrelevant since we don't use cat as such,
3040 # but only the resulting categories, the order of which is independent
3041 # from ordered. Set ordered to False as default. See GH #15457
-> 3042 cat = Categorical(values, ordered=False)
3043 categories = cat.categories
3044 codes = cat.codes
File C:\Python\Lib\site-packages\pandas\core\arrays\categorical.py:451, in Categorical.__init__(self, values, categories, ordered, dtype, fastpath, copy)
447 if dtype.categories is None:
448 if isinstance(values.dtype, ArrowDtype) and issubclass(
449 values.dtype.type, CategoricalDtypeType
450 ):
--> 451 arr = values._pa_array.combine_chunks()
452 categories = arr.dictionary.to_pandas(types_mapper=ArrowDtype)
453 codes = arr.indices.to_numpy()
AttributeError: 'Index' object has no attribute '_pa_array'
The same error is raised even when no pd.NA is present.
Second case
It gives the following result:
A   B     C   D
a1  b1    c1  d1    1
a2  <NA>  c2  d1    1
Name: count, dtype: int64
Note that in the second row, D is d1 rather than <NA>.
A more complete example is available in this JupyterLab notebook: value_counts() Bug.pdf
Expected Behavior
The expected behavior is analogous to the result obtained with the NumPy backend.
First case
A
a1      1
<NA>    1
Name: count, dtype: int64
Second case
A   B     C   D
a1  b1    c1  d1      1
a2  <NA>  c2  <NA>    1
Name: count, dtype: int64
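For comparison, here is a minimal sketch of the NumPy-backed behavior referred to above, using the default object dtype and two columns shaped like A and B in the second case. The variable names are illustrative, and None stands in for the missing value.

```python
import pandas as pd

# Same shape as columns A and B in the second case, but with the
# default object dtype (no pyarrow involved); None marks the missing value.
df = pd.DataFrame({"A": ["a1", "a2"], "B": ["b1", None]})

counts = df.value_counts(dropna=False)

# Both rows are counted, and the missing value stays missing in level B.
missing_in_b = counts.index.get_level_values("B").isna()
print(counts)
print(missing_in_b.sum())
```

With dropna=False, the row containing the missing value should be kept as its own group rather than being dropped or filled with another value.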
Installed Versions
Comment From: rhshadrach
Thanks for the report, confirmed on main. It appears that changing
https://github.com/pandas-dev/pandas/blob/9501650e22767f8502a1e3edecfaf17c5769f150/pandas/core/arrays/categorical.py#L450
to be
if isinstance(values, Index):
    arr = values._data._pa_array.combine_chunks()
else:
    arr = values._pa_array.combine_chunks()
resolves. Further investigations and PRs to fix are welcome!
Even with the above fix, we still do not see NA values because of a bug in groupby. I've opened https://github.com/pandas-dev/pandas/issues/60567 for this.
Comment From: hardeybisey
take
Comment From: NOBODIDI
take
Comment From: NOBODIDI
I used this approach, but wrote the check as:
if values.__class__.__name__ == 'Index':
so that Index does not need to be imported. This version fixed the issue, and I am open to feedback.
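One point that may be worth checking in review: a class-name comparison and an isinstance check behave differently for Index subclasses. A minimal sketch of the difference follows; the helper names are hypothetical, and whether a subclass can actually reach this code path is an assumption not verified here.

```python
import pandas as pd

plain = pd.Index(["a1", "a2"])
ranged = pd.RangeIndex(3)  # a subclass of Index


def is_index_by_name(values):
    # Class-name comparison: matches only the exact class name "Index".
    return values.__class__.__name__ == "Index"


def is_index_by_type(values):
    # isinstance check: also matches subclasses of Index.
    return isinstance(values, pd.Index)


print(is_index_by_name(plain), is_index_by_type(plain))
print(is_index_by_name(ranged), is_index_by_type(ranged))
```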
Comment From: hardeybisey
Hi @NOBODIDI, I noticed you’ve already submitted a PR for this issue. I had started working on it and was planning to submit mine by morning my time. In the future, it would be great if we could sync up to avoid overlap, for example by checking whether an issue is still being actively worked on, especially when it was recently assigned.
Comment From: yanweiSu
take
Comment From: ghost
What am I supposed to do now? Tomorrow I will go to VCBank to re-register my account and biometrics.
Comment From: chilin0525
The bug can still be reproduced on the current main branch. I will continue the work from https://github.com/pandas-dev/pandas/pull/60569/files to attempt a fix.
Comment From: chilin0525
take