Pandas 'test_factorize_empty' test failure with custom ExtensionDtype

One of the provided extension tests (in BaseMethods) is test_factorize_empty. I can't find a way to pass this test given how factorize works for an ExtensionArray.

    def test_factorize_empty(self, data):
        codes, uniques = pd.factorize(data[:0])
        expected_codes = np.array([], dtype=np.intp)
        expected_uniques = type(data)._from_sequence([], dtype=data[:0].dtype)

        tm.assert_numpy_array_equal(codes, expected_codes)
        self.assert_extension_array_equal(uniques, expected_uniques)

Code Example

The relevant portion of my GenotypeArray class __init__ method, when an empty array of values is passed.

    def __init__(self,
                 values: Union[List[Genotype], 'GenotypeArray', np.ndarray],
                 dtype: Optional[GenotypeDtype] = None,
                 copy: bool = False):
        """Initialize assuming values is a GenotypeArray or a numpy array with the correct underlying shape"""
        # If the dtype is passed, ensure it is the correct type
        if GenotypeDtype.is_dtype(dtype):
            self._dtype = dtype
        elif dtype is None:
            self._dtype = None
        else:
            raise ValueError(f"The passed dtype '{dtype}' is not a GenotypeDtype")

        # Load the values
        # ---------------
        if isinstance(values, np.ndarray) and (values.dtype == GenotypeDtype._record_type):
            # Stored data format
            self._data = values

        elif len(values) == 0:
            # Return an empty Genotype Array
            if self._dtype is not None:
                self._data = np.array(values, dtype=GenotypeDtype._record_type)
            else:
                raise ValueError("Cannot create a Genotype Array with neither values nor a dtype")

Problem description

The ValueError is raised during this test. I don't see another solution: a GenotypeArray has an associated Variant instance, which is held by its GenotypeDtype. The Variant can be obtained from an instance (if values isn't empty) or from the dtype, but a GenotypeArray without one is meaningless.
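To make the constraint concrete, here is a minimal sketch of the dtype resolution described above. ToyDtype, ToyGenotype and resolve_dtype are hypothetical stand-ins written for illustration only; they are not the real GenotypeDtype or Genotype classes, and the real classes likely carry more state than a variant name.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ToyDtype:
    """Hypothetical stand-in for GenotypeDtype: it carries the variant."""
    variant: str


@dataclass(frozen=True)
class ToyGenotype:
    """Hypothetical stand-in for a Genotype scalar."""
    variant: str
    alleles: tuple


def resolve_dtype(values: List[ToyGenotype],
                  dtype: Optional[ToyDtype] = None) -> ToyDtype:
    """Use the explicit dtype if given, else infer it from the first value."""
    if dtype is not None:
        return dtype
    if len(values) > 0:
        return ToyDtype(values[0].variant)
    # Empty values and no dtype: nothing identifies the variant.
    raise ValueError("Cannot determine a dtype with neither values nor a dtype")
```

With non-empty values or an explicit dtype the variant is recoverable; with an empty sequence and no dtype it is not, which is the situation the test triggers.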

The actual problem occurs in _reconstruct_data defined in core/algorithms.py:

    def _reconstruct_data(values, dtype, original):
        """
        reverse of _ensure_data

        Parameters
        ----------
        values : ndarray
        dtype : pandas_dtype
        original : ndarray-like

        Returns
        -------
        Index for extension types, otherwise ndarray casted to dtype
        """

        if is_extension_array_dtype(dtype):
            values = dtype.construct_array_type()._from_sequence(values)
        elif is_bool_dtype(dtype):
            values = values.astype(dtype, copy=False)

            # we only support object dtypes bool Index
            if isinstance(original, ABCIndexClass):
                values = values.astype(object, copy=False)
        elif dtype is not None:
            values = values.astype(dtype, copy=False)

        return values

This would be fixed by passing the dtype to _from_sequence:

    if is_extension_array_dtype(dtype):
        values = dtype.construct_array_type()._from_sequence(values, dtype)

I can't think of any reason this would be a problem.
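The difference between the two calls can be sketched without pandas. Everything below is a toy model under stated assumptions: ToyDtype, ToyArray, reconstruct_current and reconstruct_proposed are hypothetical names, and only the shape of the `_from_sequence(scalars, dtype=None)` call mirrors the ExtensionArray interface.

```python
from typing import Optional, Sequence


class ToyDtype:
    """Hypothetical parametrized dtype; stands in for GenotypeDtype."""

    def __init__(self, variant: str):
        self.variant = variant

    def construct_array_type(self):
        return ToyArray


class ToyArray:
    """Hypothetical array whose dtype cannot be inferred from empty input."""

    def __init__(self, values: Sequence, dtype: ToyDtype):
        self.values = list(values)
        self.dtype = dtype

    @classmethod
    def _from_sequence(cls, scalars: Sequence, dtype: Optional[ToyDtype] = None):
        if dtype is None:
            if len(scalars) == 0:
                raise ValueError("empty values and no dtype")
            # Scalars are (variant, alleles) tuples in this toy model.
            dtype = ToyDtype(scalars[0][0])
        return cls(scalars, dtype)


def reconstruct_current(values, dtype):
    """Mirrors the existing call: the dtype is not forwarded."""
    return dtype.construct_array_type()._from_sequence(values)


def reconstruct_proposed(values, dtype):
    """Mirrors the proposed call: the dtype is forwarded."""
    return dtype.construct_array_type()._from_sequence(values, dtype=dtype)
```

In this model, reconstruct_proposed([], dtype) returns an empty array that keeps the dtype, while reconstruct_current([], dtype) raises the ValueError, even though the caller had the dtype in hand.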

Expected Output

The test should pass, provided the factorize method on the ExtensionArray class works correctly.

Output of `pd.show_versions()`

    INSTALLED VERSIONS
    ------------------
    commit : None
    python : 3.6.9.final.0
    python-bits : 64
    OS : Linux
    OS-release : 4.4.0-18362-Microsoft
    machine : x86_64
    processor : x86_64
    byteorder : little
    LC_ALL : None
    LANG : C.UTF-8
    LOCALE : en_US.UTF-8
    pandas : 1.0.0rc0
    numpy : 1.17.2
    pytz : 2019.2
    dateutil : 2.8.0
    pip : 19.3.1
    setuptools : 41.2.0
    Cython : None
    pytest : 5.3.2
    hypothesis : None
    sphinx : 2.2.0
    blosc : None
    feather : None
    xlsxwriter : None
    lxml.etree : None
    html5lib : None
    pymysql : None
    psycopg2 : None
    jinja2 : 2.10.1
    IPython : 7.11.1
    pandas_datareader: None
    bs4 : None
    bottleneck : None
    fastparquet : None
    gcsfs : None
    lxml.etree : None
    matplotlib : 3.1.1
    numexpr : None
    odfpy : None
    openpyxl : None
    pandas_gbq : None
    pyarrow : None
    pytables : None
    pytest : 5.3.2
    s3fs : None
    scipy : 1.3.3
    sqlalchemy : None
    tables : None
    tabulate : None
    xarray : None
    xlrd : None
    xlwt : None
    xlsxwriter : None
    numba : None

Comment From: mroeschke

Based on the feedback in https://github.com/pandas-dev/pandas/pull/31253, it looks like a use case is needed to show where this change is required, so closing.
