One of the provided (BaseMethods) Extension tests is test_factorize_empty. I don't have a way of passing this test given how factorize works for ExtensionArray.
    def test_factorize_empty(self, data):
        codes, uniques = pd.factorize(data[:0])
        expected_codes = np.array([], dtype=np.intp)
        expected_uniques = type(data)._from_sequence([], dtype=data[:0].dtype)
        tm.assert_numpy_array_equal(codes, expected_codes)
        self.assert_extension_array_equal(uniques, expected_uniques)
Code Example
The relevant portion of my GenotypeArray class __init__ method, when an empty array of values is passed:
    def __init__(self,
                 values: Union[List[Genotype], 'GenotypeArray', np.ndarray],
                 dtype: Optional[GenotypeDtype] = None,
                 copy: bool = False):
        """Initialize assuming values is a GenotypeArray or a numpy array with the correct underlying shape"""
        # If the dtype is passed, ensure it is the correct type
        if GenotypeDtype.is_dtype(dtype):
            self._dtype = dtype
        elif dtype is None:
            self._dtype = None
        else:
            raise ValueError(f"The passed dtype '{dtype}' is not a GenotypeDtype")

        # Load the values
        # ---------------
        if isinstance(values, np.ndarray) and (values.dtype == GenotypeDtype._record_type):
            # Stored data format
            self._data = values
        elif len(values) == 0:
            # Return an empty Genotype Array
            if self._dtype is not None:
                self._data = np.array(values, dtype=GenotypeDtype._record_type)
            else:
                raise ValueError("Cannot create a Genotype Array with neither values nor a dtype")
Problem description
The ValueError is raised during this test. I don't see another way to handle this: a GenotypeArray has an associated Variant instance, stored on its GenotypeDtype. That Variant can be obtained from the values themselves (if values isn't empty) or from the dtype, but a GenotypeArray without it is meaningless.
The actual problem occurs in _reconstruct_data, defined in core/algorithms.py:
    def _reconstruct_data(values, dtype, original):
        """
        reverse of _ensure_data

        Parameters
        ----------
        values : ndarray
        dtype : pandas_dtype
        original : ndarray-like

        Returns
        -------
        Index for extension types, otherwise ndarray casted to dtype
        """
        if is_extension_array_dtype(dtype):
            values = dtype.construct_array_type()._from_sequence(values)
        elif is_bool_dtype(dtype):
            values = values.astype(dtype, copy=False)

            # we only support object dtypes bool Index
            if isinstance(original, ABCIndexClass):
                values = values.astype(object, copy=False)
        elif dtype is not None:
            values = values.astype(dtype, copy=False)

        return values
This would be fixed by passing the dtype to _from_sequence:
    if is_extension_array_dtype(dtype):
        values = dtype.construct_array_type()._from_sequence(values, dtype)
I can't think of any reason this would be a problem.
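To make the failure mode concrete without depending on pandas or on my package, here is a minimal pure-Python sketch. ToyDtype, ToyArray, reconstruct_without_dtype and reconstruct_with_dtype are all hypothetical names invented for this illustration; they only mimic the shape of the calls quoted above and are not the pandas implementation. The `variant` field stands in for the Variant instance carried by GenotypeDtype.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ToyDtype:
    """Hypothetical parametrized dtype; `variant` stands in for the Variant."""
    variant: str

    def construct_array_type(self):
        # Same role as the dtype method used in _reconstruct_data
        return ToyArray


class ToyArray:
    """Hypothetical array whose dtype cannot be inferred from an empty input."""

    def __init__(self, values: Sequence[str], dtype: Optional[ToyDtype] = None):
        if dtype is None:
            if len(values) == 0:
                raise ValueError("Cannot create an array with neither values nor a dtype")
            # Infer the parameter from the first value, e.g. "rs123:A/G" -> "rs123"
            dtype = ToyDtype(values[0].split(":")[0])
        self.dtype = dtype
        self._data = list(values)

    @classmethod
    def _from_sequence(cls, scalars, dtype=None):
        return cls(scalars, dtype=dtype)

    def __len__(self):
        return len(self._data)


def reconstruct_without_dtype(values, dtype):
    # Mirrors the current call: the dtype selects the class, then is dropped
    return dtype.construct_array_type()._from_sequence(values)


def reconstruct_with_dtype(values, dtype):
    # Mirrors the proposed call: the dtype is forwarded to _from_sequence
    return dtype.construct_array_type()._from_sequence(values, dtype=dtype)
```

In this sketch, reconstruct_with_dtype([], ToyDtype("rs123")) returns an empty array that still carries its dtype, while reconstruct_without_dtype([], ToyDtype("rs123")) raises the ValueError, which is the same situation test_factorize_empty runs into. With non-empty input both paths work, because the parameter can be inferred from the values.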
Expected Output
The test should pass, provided the factorize method on the ExtensionArray class works correctly.
Output of pd.show_versions()
Comment From: mroeschke
Based on the feedback in https://github.com/pandas-dev/pandas/pull/31253, it looks like a use case is needed to show where this change is needed, so closing.