Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

I have a dataset with a few categories that are coded as the presence or absence of an intervention. I am fitting a simple linear regression using OLS. Multiple interventions may be present at the same time, so I am adding an effect for each intervention. However, the dummy variable is not being encoded the way I want it to be, which makes the effects hard to interpret.

    import pandas as pd

    data = {
        "cat1": [0, 0, 0, 1, 1, 1, 1],
        "cat2": [0, 0, 0, 0, 0, 1, 1],
        "cat3": [1, 1, 0, 0, 1, 1, 1],
        "cat4": [0, 1, 0, 1, 1, 1, 1],
    }

    # Load the data into a DataFrame.
    dftest = pd.DataFrame(data)

    # Maps each column name to the unique values returned by factorize.
    label_map = {}

    for fact in dftest.columns:
        dftest[fact], label_map[fact] = pd.factorize(dftest[fact])
        print(label_map[fact])

This produces the following label mappings:

    Int64Index([0, 1], dtype='int64')
    Int64Index([0, 1], dtype='int64')
    Int64Index([1, 0], dtype='int64')
    Int64Index([0, 1], dtype='int64')

Issue Description

How do I ensure that 0 in the original mapping is always the dummy for all features? Can I specify which level in a factor should be the dummy?

Expected Behavior

    for fact in dftest.columns:
        dftest[fact], label_map[fact] = pd.factorize(dftest[fact])
        print(label_map[fact])

If this were fixed, printing the label mappings should produce the following output:

    Int64Index([0, 1], dtype='int64')
    Int64Index([0, 1], dtype='int64')
    Int64Index([0, 1], dtype='int64')
    Int64Index([0, 1], dtype='int64')

Installed Versions

    INSTALLED VERSIONS
    ------------------
    commit           : 66e3805b8cabe977f40c05259cc3fcf7ead5687d
    python           : 3.7.12.final.0
    python-bits      : 64
    OS               : Linux
    OS-release       : 4.19.0-22-cloud-amd64
    Version          : #1 SMP Debian 4.19.260-1 (2022-09-29)
    machine          : x86_64
    processor        :
    byteorder        : little
    LC_ALL           : None
    LANG             : C.UTF-8
    LOCALE           : en_US.UTF-8

    pandas           : 1.3.5
    numpy            : 1.21.6
    pytz             : 2022.2.1
    dateutil         : 2.8.2
    pip              : 22.2.2
    setuptools       : 59.8.0
    Cython           : None
    pytest           : None
    hypothesis       : None
    sphinx           : None
    blosc            : None
    feather          : None
    xlsxwriter       : None
    lxml.etree       : 4.9.1
    html5lib         : None
    pymysql          : None
    psycopg2         : None
    jinja2           : 3.1.2
    IPython          : 7.33.0
    pandas_datareader: None
    bs4              : 4.11.1
    bottleneck       : None
    fsspec           : 2022.8.2
    fastparquet      : None
    gcsfs            : 2022.8.2
    matplotlib       : 3.5.3
    numexpr          : None
    odfpy            : None
    openpyxl         : 3.0.10
    pandas_gbq       : 0.17.9
    pyarrow          : 9.0.0
    pyxlsb           : None
    s3fs             : None
    scipy            : 1.7.3
    sqlalchemy       : 1.4.41
    tables           : None
    tabulate         : None
    xarray           : None
    xlrd             : None
    xlwt             : None
    numba            : 0.55.2

Comment From: phofl

The order of appearance is preserved. Did you try sort=True in factorize?
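
A minimal sketch of that suggestion, assuming the same data as in the reproducible example. By default `pd.factorize` numbers values in order of first appearance, which is why `cat3` (which starts with 1) maps as `[1, 0]`. The names `sorted_codes` and `explicit_codes` are illustrative only and not part of the original report. The second half shows one possible way to pin the level order explicitly with `pd.Categorical`, in case sorting alone is not enough.

```python
import pandas as pd

data = {
    "cat1": [0, 0, 0, 1, 1, 1, 1],
    "cat2": [0, 0, 0, 0, 0, 1, 1],
    "cat3": [1, 1, 0, 0, 1, 1, 1],
    "cat4": [0, 1, 0, 1, 1, 1, 1],
}
dftest = pd.DataFrame(data)

label_map = {}      # column name -> unique values, in code order
sorted_codes = {}   # column name -> integer codes (illustrative name)

for fact in dftest.columns:
    # sort=True orders the uniques by value, so 0 always receives code 0
    # regardless of which value appears first in the column.
    codes, uniques = pd.factorize(dftest[fact], sort=True)
    sorted_codes[fact] = codes
    label_map[fact] = uniques
    print(fact, list(uniques))

# Alternative: state the level order explicitly. The first entry in
# `categories` receives code 0, the second code 1, and so on.
explicit = pd.Categorical(dftest["cat3"], categories=[0, 1])
explicit_codes = explicit.codes
```

With `sort=True`, every column reports its uniques as `[0, 1]`, so the codes for `cat3` match the original values instead of being flipped.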

Comment From: phofl

Closing for now. Please ping to reopen when you can address the comments.