Pandas API: Should factorize(categorical) return a Categorical for uniques?

Returning a categorical feels more natural to me

In [11]: pd.factorize(pd.Categorical(['a', 'a', 'c']))
Out[11]: (array([0, 0, 1]), array([0, 1]))

That's kind of what we do for a DatetimeIndex with TZ:

In [10]: pd.factorize(pd.Series(pd.DatetimeIndex(['2017', '2017'], tz='US/Eastern')))
Out[10]:
(array([0, 0]),
 DatetimeIndex(['2017-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None))

Comment From: jorisvandenbossche

Shouldn't it rather give the same as:

In [17]: pd.factorize(['a', 'a', 'c'])
Out[17]: (array([0, 0, 1]), array(['a', 'c'], dtype=object))

(the fact that it returns [0, 1] as the unique labels clearly is a bug I think, it seems to be factorizing the codes)

When factorizing a Categorical, I would expect to get back (codes, categories), not (codes, categorical)

Comment From: TomAugspurger

Yeah returning the factorized codes is certainly a bug.

I ask because we need to think about how to handle other 3rd party extension array types, and we’re currently inconsistent internally.

From: Joris Van den Bossche notifications@github.com Sent: Friday, February 16, 2018 1:57:30 AM To: pandas-dev/pandas Cc: Tom Augspurger; Author Subject: Re: [pandas-dev/pandas] API: Should factorize(categorical) return a Categorical for uniques? (#19721)

Shouldn't it rather give the same as:

In [17]: pd.factorize(['a', 'a', 'c']) Out[17]: (array([0, 0, 1]), array(['a', 'c'], dtype=object))

(the fact that it returns [0, 1] as the unique labels clearly is a bug I think, it seems to be factorizing the codes)

When factorizing a Categorical, I would expect to get back (codes, categories), not (codes, categorical)

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/pandas-dev/pandas/issues/19721#issuecomment-366167022, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ABQHIrPbUAM5lHs1b0b0kGieUhZegY8Eks5tVTTpgaJpZM4SHhOm.

Comment From: TomAugspurger

Some other weirdness

In [1]: import pandas as pd

In [2]: pd.factorize(pd.Categorical(['a', 'a', 'b']))
Out[2]: (array([0, 0, 1]), array([0, 1]))

In [3]: pd.factorize(pd.Index(pd.Categorical(['a', 'a', 'b'])))
Out[3]:
(array([0, 0, 1]),
 CategoricalIndex([nan, nan], categories=['a', 'b'], ordered=False, dtype='category'))

In [4]: pd.factorize(pd.Series(pd.Categorical(['a', 'a', 'b'])))
Out[4]: (array([0, 0, 1]), Int64Index([0, 1], dtype='int64'))

Out[3] should be CategoricalIndex(['a', 'b']) or Categorical(['a', 'b']), or ndarray(['a', 'b']),

Out[4] should be similar.

Comment From: jorisvandenbossche

In general, I think the type of the uniques should be the same array-like as the input (or in case of Series/Index, the array-like that is backing the Series). So eg for DatetimeIndex with tz or Series of such values, this should in the future give a DatetimeTZArray (and that it now gives a DatetimeIndex is then logical I think). For numerical Series/Index/array, it is a normal numpy array. So in the future for a certain extension type (Index with such values, Series with such values of Array class itself), would return an extension array of its type?

But, I would personally deviate from this rule for Categoricals, and in those cases return the array-like of which the categories are made up, instead of a Categorical itself.

Comment From: jreback

the return values now are all as expected by definition factorizing gives u a categorical return data ie codes and uniques (categories); further the categories are all array like

Comment From: jreback

hmm actually these do look a bit odd factorizing should be in the categories not the codes

Comment From: jorisvandenbossche

the return values now are all as expected by definition factorizing gives u a categorical return data ie codes and uniques (categories); further the categories are all array like

Jeff, did you look at the examples Tom showed? It is clearly not returning the " uniques (categories)" in case of categorical input.

And the general question is not only about the current behaviour, but what to do for the new extension arrays

Comment From: TomAugspurger

Cross posted :)

Let's limit it to just fixing

In [2]: pd.factorize(pd.Series(pd.Categorical(['a', 'a'])))
Out[2]: (array([0, 0]), Int64Index([0], dtype='int64'))

to return the unique categories.

Out[2]: (array([0, 0]), array(['a']))

Comment From: jorisvandenbossche

Yes, agreed the above should be fixed.

But, I think we still need to discuss what it should do for ExtensionArrays. Currently the docstring says that ndarray is returned for arrays, and Index for Index/Series, which is also what it does:

In [74]: for vals in [np.array([0, 1, 2]), np.array(['a', 'b', 'c']), pd.DatetimeIndex([1, 2, 3], tz='UTC'), pd.Categorical(['a', 'b', 'a'])]:
    ...:     for cont in [lambda x: x, pd.Series, pd.Index]:
    ...:         obj = cont(vals)
    ...:         codes, uniques = pd.factorize(obj)
    ...:         print(vals.dtype, type(obj).__name__, '  ->  ', type(uniques).__name__)
    ...:         
    ...:         
int64 ndarray   ->   ndarray
int64 Series   ->   Int64Index
int64 Int64Index   ->   Int64Index
<U1 ndarray   ->   ndarray
<U1 Series   ->   Index
<U1 Index   ->   Index
datetime64[ns, UTC] DatetimeIndex   ->   DatetimeIndex  # this deviation is normal since I used index as input (since there is no array-like yet)
datetime64[ns, UTC] Series   ->   DatetimeIndex
datetime64[ns, UTC] DatetimeIndex   ->   DatetimeIndex
category Categorical   ->   ndarray
category Series   ->   Int64Index
category CategoricalIndex   ->   CategoricalIndex

But, for a new ExtensionArray, I don't think we want to return an Index? As for now I think we will not yet support an extension array to be stored in an Index? Shouldn't it rather be a instance of the ExtensionArray itself (holding the unique values)?

Comment From: jorisvandenbossche

But, for a new ExtensionArray, I don't think we want to return an Index? As for now I think we will not yet support an extension array to be stored in an Index? Shouldn't it rather be a instance of the ExtensionArray itself (holding the unique values)?

Although, of course, in the idea that those unique values will typically be not that many in number (or at least much shorter than the input array), it is maybe not a problem to have a materialized ExtensionArray stored in Index. (on the other hand, I think array-likes would be a more logical return value here, and I think we only return Index because of the absence of arrays likes for all types up to now)

Comment From: TomAugspurger

As for now I think we will not yet support an extension array to be stored in an Index?

Correct, though there was interest in that on Twitter.

Shouldn't it rather be a instance of the ExtensionArray itself (holding the unique values)?

I see use-cases for either ndarray[object] or ExtensionArray, since they support different operations. Does this warrant a new keyword to factorize?

Comment From: TomAugspurger

I'm revisiting this now. I think that the rule should be that uniques has the same type as the input (except Series -> Index)

factorize(ndarray)[1] -> ndarray
factorize(ExtensionArray)[1] -> ExtensionArray
factorize(Serise/Indxe)[1] -> Index

In particular, we'd change pd.factorize(Categorical) to return a Categorical for uniques.

Consider the case of unobserved categories. I'd like to treat these two arrays differently

In [6]: c1 = pd.Categorical(['a', 'a', 'b'], categories=['a', 'b'])

In [7]: c2 = pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c'])

If we return an ndarray for uniques, these will be the same:

In [8]: pd.factorize(c1)[1]
Out[8]: array(['a', 'b'], dtype=object)

In [9]: pd.factorize(c2)[1]
Out[9]: array(['a', 'b'], dtype=object)

But if we return a Categorical, we can preserve the unobsesrved categories in uniques.dtype. That also would avoid the somewhat strange situation when moving to a CategoricalIndex

In [10]: pd.factorize(pd.CategoricalIndex(c1))[1]
Out[10]: CategoricalIndex(['a', 'b'], categories=['a', 'b'], ordered=False, dtype='category')

In [11]: pd.factorize(pd.CategoricalIndex(c2))[1]
Out[11]: CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c'], ordered=False, dtype='category')

It seems strange to be that the result dtype would change based on the container (to be clear, I'm OK with the output container changing from Categorical -> CategoricalIndex. But I'd like pd.factorize(cat)[1].dtype to match pd.factorize(cat_index)[1].dtype).)

Comment From: jorisvandenbossche

I still think that factorize(Series[ExtensionArray])[1] -> ExtensionArray might make more sense than returning an Index.

It seems strange to be that the result dtype would change based on the container (to be clear, I'm OK with the output container changing from Categorical -> CategoricalIndex. But I'd like pd.factorize(cat)[1].dtype to match pd.factorize(cat_index)[1].dtype).)

Agreed that the dtype should be the same. But here, I also find it strange why factorize(categorical) would return a Categorical, while factorize(Series[categorical]) would return CategoricalIndex. I mean, I don't see any justification or logic for this difference? (apart from historical reasons, which might be enough to keep it like that and do the same for ExtensionArrays ..)

Comment From: TomAugspurger

apart from historical reasons

100% agreed. If I were starting from scratch I would say factorize(series)[1] -> array (ndarray or ExtensionArray). I'm just not sure that changing it is worthwhile. And why stop there? Shouldn't factorize(index) return an array? Index and Series are both just containers, so why box one and not the other?

I'll think about if there's a deprecation path, and if we even want to go down this path.

Comment From: mroeschke

Looks to now indeed return a Categorical. I guess this behavior seems right and could use a test if it doesn't exist

In [2]: pd.factorize(pd.Categorical(['a', 'a', 'c']))
Out[2]:
(array([0, 0, 1]),
 ['a', 'c']
 Categories (2, object): ['a', 'c'])