Research

  • [X] I have searched the [pandas] tag on StackOverflow for similar questions.

  • [X] I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/73906430/python-pandas-1-3-5-to-1-4-0-breaking-changes-got-array-instead-of-string

Question about pandas

Hello!

I'm encountering a problem after upgrading pandas from 1.3.5 to 1.4.0. It still happens on the later releases 1.4.2 and 1.4.4.

Here is my code:

    print(df.T.to_dict().values())
    df = df.reset_index().groupby(['startTime']).agg({
        'startTime': np.unique,
        'endTimes': lambda field: list(field),
        'durationSplit': lambda field: list(field),
        'split': lambda field: list(field),
    })
    print(df.T.to_dict().values())

With version 1.3.5 it prints:

dict_values([{'startTime': '1970-01-01T10:30:00', 'endTimes': '1970-01-01T13:00:00', 'durationSplit': None, 'split': None}])
dict_values([{'startTime': '1970-01-01T10:30:00', 'endTimes': ['1970-01-01T13:00:00'], 'durationSplit': [None], 'split': [None]}])

With versions 1.4.0, 1.4.2 and 1.4.4 (and 1.5.0 too) it prints:

dict_values([{'startTime': '1970-01-01T10:30:00', 'endTimes': '1970-01-01T13:00:00', 'durationSplit': None, 'split': None}])
dict_values([{'startTime': array(['1970-01-01T10:30:00'], dtype=object), 'endTimes': ['1970-01-01T13:00:00'], 'durationSplit': [None], 'split': [None]}])

I cannot find any documented breaking change in pandas that covers this, nor anyone else reporting the same problem.

The only new thing I get is this warning:

FutureWarning: Dropping invalid columns in SeriesGroupBy.agg is deprecated. In a future version, a TypeError will be raised. Before calling .agg, select only columns which should be valid for the function.

Do you have more information, or can you explain what is going on? Or how could I do something similar in a different way? :')

Thank you in advance for your help!

Comment From: MarcoGorelli

can you make your example reproducible please, and also try on version 1.5.0?

Comment From: voldardard

Thank you for your quick answer

You can reproduce it with this little script:

#!/bin/python
import pandas as pd
import numpy as np
from collections import OrderedDict

print("Pandas version: {}".format(pd.__version__))
print("Numpy version: {}".format(np.__version__))

times=[OrderedDict([('startTime', '1970-01-01T10:30:00'), ('endTimes', '1970-01-01T13:00:00'), ('durationSplit', None), ('split', None)]), OrderedDict([('startTime', '1970-01-01T13:30:00'), ('endTimes', '1970-01-01T16:00:00'), ('durationSplit', None), ('split', None)])]
print(times)

data_frame = pd.DataFrame(data=times)
print(data_frame.T.to_dict().values())
data_frame = data_frame.reset_index().groupby(['startTime']).agg({
    'startTime': np.unique,
    'endTimes': lambda field: list(field),
    'durationSplit': lambda field: list(field),
    'split': lambda field: list(field),
})
print(data_frame.T.to_dict().values())

Here, with 1.3.5, I get:

Pandas version: 1.3.5
Numpy version: 1.22.4
[OrderedDict([('startTime', '1970-01-01T10:30:00'), ('endTimes', '1970-01-01T13:00:00'), ('durationSplit', None), ('split', None)]), OrderedDict([('startTime', '1970-01-01T13:30:00'), ('endTimes', '1970-01-01T16:00:00'), ('durationSplit', None), ('split', None)])]
dict_values([{'startTime': '1970-01-01T10:30:00', 'endTimes': '1970-01-01T13:00:00', 'durationSplit': None, 'split': None}, {'startTime': '1970-01-01T13:30:00', 'endTimes': '1970-01-01T16:00:00', 'durationSplit': None, 'split': None}])
dict_values([{'startTime': '1970-01-01T10:30:00', 'endTimes': ['1970-01-01T13:00:00'], 'durationSplit': [None], 'split': [None]}, {'startTime': '1970-01-01T13:30:00', 'endTimes': ['1970-01-01T16:00:00'], 'durationSplit': [None], 'split': [None]}])

And with 1.4.2 I get:

Pandas version: 1.4.2
Numpy version: 1.22.4
[OrderedDict([('startTime', '1970-01-01T10:30:00'), ('endTimes', '1970-01-01T13:00:00'), ('durationSplit', None), ('split', None)]), OrderedDict([('startTime', '1970-01-01T13:30:00'), ('endTimes', '1970-01-01T16:00:00'), ('durationSplit', None), ('split', None)])]
dict_values([{'startTime': '1970-01-01T10:30:00', 'endTimes': '1970-01-01T13:00:00', 'durationSplit': None, 'split': None}, {'startTime': '1970-01-01T13:30:00', 'endTimes': '1970-01-01T16:00:00', 'durationSplit': None, 'split': None}])
dict_values([{'startTime': array(['1970-01-01T10:30:00'], dtype=object), 'endTimes': ['1970-01-01T13:00:00'], 'durationSplit': [None], 'split': [None]}, {'startTime': array(['1970-01-01T13:30:00'], dtype=object), 'endTimes': ['1970-01-01T16:00:00'], 'durationSplit': [None], 'split': [None]}])

With 1.5.0 I get the same result as with 1.4.2.

Comment From: MarcoGorelli

Thanks @voldardard

From git bisect, looks like this was caused by https://github.com/pandas-dev/pandas/pull/44122

cc @jbrockmendel

https://www.kaggle.com/code/marcogorelli/pandas-regression-example?scriptVersionId=107651890

Comment From: jbrockmendel

When you use lambda field: list(field), shouldn't it be non-surprising to get a list back?

We got rid of the libreduction paths because they caused tons of edge cases and inconsistencies that were a nightmare to debug.
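To illustrate where the array in the 1.4.x output comes from, here is a minimal sketch that calls np.unique on a single group's values outside of any groupby. The Series below is a hypothetical stand-in for one group of the startTime column; it only shows what np.unique itself returns, not how pandas post-processed that value in older versions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one group's 'startTime' values.
group = pd.Series(["1970-01-01T10:30:00"], name="startTime")

# np.unique always returns an array, even when only one distinct value exists.
unique_values = np.unique(group)
print(type(unique_values).__name__, unique_values.shape)

# Getting the plain string back requires unwrapping the array explicitly.
first_value = unique_values[0]
print(first_value)
```

So the one-element array seen in 1.4.x matches what the aggregation function returns for each group.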

Comment From: voldardard

Yes, it is normal to get a list for endTimes, durationSplit and split. The problem is startTime: it was a string from pandas 0.x until version 1.4.0, and now it is a one-element array. Only one value is needed here, and we don't want a list.

It will require extra development to change each of our frontends so that they can read the value from a list, even when there is only one value.
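One possible workaround, sketched here under assumptions rather than as an officially recommended fix: since startTime is already the grouping key, it may not need to be aggregated at all. Passing as_index=False keeps the key as an ordinary column holding plain strings. Using the built-in list in place of the lambdas, and to_dict(orient="records") in place of .T.to_dict().values(), are choices made for this sketch and are not taken from the thread.

```python
import pandas as pd

times = [
    {"startTime": "1970-01-01T10:30:00", "endTimes": "1970-01-01T13:00:00",
     "durationSplit": None, "split": None},
    {"startTime": "1970-01-01T13:30:00", "endTimes": "1970-01-01T16:00:00",
     "durationSplit": None, "split": None},
]
data_frame = pd.DataFrame(data=times)

# Group on startTime and keep it as a regular column (as_index=False),
# so it stays a scalar string and needs no np.unique aggregation.
grouped = data_frame.groupby("startTime", as_index=False).agg({
    "endTimes": list,
    "durationSplit": list,
    "split": list,
})

# One dict per group, with startTime as a string and the rest as lists.
records = grouped.to_dict(orient="records")
print(records)
```

Note that the result is indexed by position rather than by startTime, which matters only if later code relies on that index. If the index has to stay as it was, aggregating startTime with "first" or unwrapping the array with a small lambda would be alternatives to try.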