First of all, I have to admit that I have some trouble finding a catchy title for this issue. Better suggestions are welcome...

There is an issue with crosstab when it is applied to a DataFrame that has been derived from a DataFrame with categorical columns.

Let's consider the following example df:

import pandas as pd
df = pd.DataFrame({'col0':list('abbabcabcab'), 'col1':[1,1,1,1,2,3,2,2,3,3,1],
                  'crit':[1,1,1,1,1,0,0,0,0,0,0]})

We apply a filter according to some criterion that happens to eliminate all occurrences of 'c' in 'col0' (and of '3' in 'col1'). Then we build a crosstab:

df_filtered = df[df.crit == 1]
pd.crosstab(df_filtered.col0, df_filtered.col1)

The result is as expected:

col1  1  2
col0
a     2  0
b     2  1

Now we try the same on categorized columns:

for col in df.columns:
    df[col] = df[col].astype('category')
df_filtered = df[df.crit == 1]
pd.crosstab(df_filtered.col0, df_filtered.col1)

In this case, the values 'c' and '3' are still listed, although they shouldn't be:

col1  1  2  3
col0
a     2  0  0
b     2  1  0
c     0  0  0

My guess is that the reason for this behavior can be found in the lookup table of the category, which still includes values that no longer occur in the data:

df_filtered.col0.cat.categories
Output: Index(['a', 'b', 'c'], dtype='object')
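To illustrate the point, here is a minimal sketch (not from the original post): boolean filtering removes rows but keeps the categorical dtype, and with it the full category list.

```python
import pandas as pd

# Filtering a categorical Series removes rows, not categories.
s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')
filtered = s[s != 'c']

print(filtered.tolist())        # ['a', 'b', 'a'] -- no 'c' left in the data
print(filtered.cat.categories)  # Index(['a', 'b', 'c'], dtype='object') -- 'c' is still a category
```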

So, either this lookup table has to be updated upon filtering, or pd.crosstab has to take it into account.

Comment From: gfyoung

@rumbin : I don't think there's anything wrong with the table output. It is correctly reflecting that there are no cross-tabbed factors with either 'c' or '3'. In addition, filtering out values does not change the dtype of the individual columns. That is why you still see the 'c' and '3'.

Comment From: gfyoung

I should also add that in your first example, the dtype is just int or object, so 'c' is not an inherent part of the dtype of any of the columns, whereas that is not the case in your second example.

Comment From: jreback

this issue is related to this: https://github.com/pydata/pandas/issues/10772

dropna=True is the default and I think it should be respected here. Though one could make a case that, since it's a categorical, we should include all of the original categories (as that's kind of the point).

Comment From: gfyoung

@jreback : How does the dropna arg relate to this issue though?

Comment From: jreback

if we accept the premise that we don't always want to represent all categorical results (IOW we treat them like regular results), then a filtered row should be dropped.

Comment From: gfyoung

@jreback : And by dropped, you mean not represented in the crosstab, even though the data is categorical?

Comment From: rumbin

In my eyes, from the user's point of view, the dtype representation should not influence the resulting crosstab. After filtering the original table, some values are gone, so they should not be represented in any crosstab that we perform on the filtered table, regardless of what the categories are.

In my particular case, I am using categorical dtypes just for the sake of reduced RAM consumption and improved speed. Without categories I would easily end up swapping to disk every now and then. So there are even columns of measurement values represented by categories. After creating some crosstabs on filtered tables, it took me quite a while to find out why there are columns/rows with all zeros. Looking up the respective column labels in the filtered tables leads to no results, so why are they listed in those crosstabs? This is definitely not what I as a user expected.

@gfyoung: I understand that I am sort of abusing the concept of categorical data here, if we consider categories to be something like, e.g., selling features, which typically consist of a handful of rarely changing members. In my case, categories can easily consist of several thousand members (e.g. serial numbers or part numbers), where the filtering operation is crucial for concentrating on just a small subset within these categories. But when the crosstab does not respect the previously applied filtering operation, this is sort of useless.

Conclusion: In my eyes, the resulting crosstab should by default not depend on the dtype, as an occasional user will certainly not be aware of the subtle differences.

Comment From: jreback

@rumbin ok with that approach. This is essentially a 2-d value_counts; the difference is that .value_counts has an index, here it does not.
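The 1-d analogue of the reported behavior can be seen directly (a small sketch, not from the thread): .value_counts() on a categorical Series also reports unused categories, just with a count of zero.

```python
import pandas as pd

# value_counts on a categorical lists every category,
# including ones with no occurrences in the data.
s = pd.Series(['a', 'b', 'a'], dtype='category').cat.add_categories(['c'])
counts = s.value_counts()
print(counts)  # 'c' appears with count 0
```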

want to do a PR?

Comment From: rumbin

I'm afraid the damage from any coding help of mine would be greater than any benefit, as I am fairly new to Python (1-2 months) and my further programming "background" is nothing more than bash...

Comment From: jreback

not a bad way to learn! Here's the contribution guide to get started

Comment From: jankatins

In my case, categories can easily consist of several thousand members (e.g. serial numbers or part numbers)

IMO this is again an argument to get pd.String() and not use Category...

Comment From: jankatins

Conclusion: In my eyes, the resulting crosstab should by default not depend on the dtype, as an occasional user will certainly not be aware of the subtle differences.

IMO "Categories" are more similar to "ints" than "strings": the whole range of possible values has a meaning, not only individual values (at least when there is some order between these values). In a histogram you wouldn't leave out some ranges just because they are empty, and similarly you shouldn't leave out empty categories.

Comment From: jorisvandenbossche

I agree with @janschulz

BTW, for this, there is the .cat.remove_unused_categories() method to explicitly get only the used categories.
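Applied to the data from the first post, the suggested workaround looks like this (a sketch; the variable names follow the original example):

```python
import pandas as pd

df = pd.DataFrame({'col0': list('abbabcabcab'),
                   'col1': [1, 1, 1, 1, 2, 3, 2, 2, 3, 3, 1],
                   'crit': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]})
for col in df.columns:
    df[col] = df[col].astype('category')

df_filtered = df[df.crit == 1]

# Explicitly drop the categories that no longer occur after filtering,
# so the crosstab only shows 'a'/'b' and 1/2.
col0 = df_filtered.col0.cat.remove_unused_categories()
col1 = df_filtered.col1.cat.remove_unused_categories()
tab = pd.crosstab(col0, col1)
print(tab)
```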

Comment From: gfyoung

@jorisvandenbossche : Didn't know that! Good to know for future reference!

Comment From: jreback

closing, but if @rumbin or @gfyoung (or anyone else!) wants to submit a documentation PR with an example (in the crosstab section, maybe cross-linking to categorical), that would be great.