Currently, DataFrame.astype(dtype_dict: dict) requires that the dict keys are a subset of the DataFrame's columns. This feels like an unnecessary restriction; in my opinion it would suffice, and be more intuitive, if it roughly performed:

for col in df.columns:
    if col in dtype_dict:
        df[col] = df[col].astype(dtype_dict[col])

The fact that this currently raises an error becomes annoying, for example when one needs to repair data types after they were destroyed by a stacking operation: one has to slice the dtype_dict dictionary down to the current column keys every single time!

Of course, a strong argument can be made that raising an error is a good idea to prevent users from erroneously believing the type-casting was performed when a key was mistyped.

Describe the solution you'd like

I propose to consider either of the following changes:

Option 1: If df.columns is a subset of the dtype_dict keys, do not raise an error if superfluous keys are present. In this case all columns are identified and there is a negligible chance that there is an error due to a typo.

Option 2: Extend the functionality of the already present errors='ignore' option to also ignore superfluous keys in the dtype_dict.
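To make the two options concrete, here is a rough sketch of the proposed semantics as a standalone helper (the function name and structure are purely illustrative, not pandas internals):

import pandas as pd

def astype_relaxed(df: pd.DataFrame, dtype_dict: dict, errors: str = "raise") -> pd.DataFrame:
    # Hypothetical helper sketching the proposal; not actual pandas code.
    extra_keys = set(dtype_dict) - set(df.columns)
    all_columns_covered = set(df.columns) <= set(dtype_dict)
    # Option 1: tolerate extra keys when every column is covered.
    # Option 2: tolerate extra keys when errors="ignore" is passed.
    if extra_keys and not all_columns_covered and errors != "ignore":
        raise KeyError(f"dtype keys not found in columns: {sorted(extra_keys)}")
    return df.astype({c: t for c, t in dtype_dict.items() if c in df.columns})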

Comment From: mroeschke

Thanks for the suggestion, but I would be -1 on this enhancement, as I think promoting the more explicit behavior is better, and because of your other point:

Of course, a strong argument can be made that raising an error is a good idea to prevent users from erroneously believing the type-casting was performed when a key was mistyped.

Comment From: attack68

I agree with @mroeschke. You can also get around it with a dict comprehension:

df.astype({c: d for c, d in _dtypes.items() if c in df.columns})

Comment From: mwaskom

A counterpoint would be that pd.read_csv accepts a dict for dtype= and does not error if keys in the dict do not match columns in the frame.
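For illustration, a minimal sketch of that behaviour (the key "c" in the dtype mapping has no counterpart in the data):

import io
import pandas as pd

buf = io.StringIO("a,b\n1,2\n3,4\n")
# "c" is not a column in the file, yet no error is raised:
df = pd.read_csv(buf, dtype={"a": "float64", "c": "float64"})
print(df.dtypes)  # a: float64, b: int64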

This also feels similar to df.rename which has errors="ignore" as default, allowing one to choose a stricter/safer interpretation if desired.
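For reference, a small sketch of the rename comparison:

import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})
df.rename(columns={"a": "x", "zzz": "y"})          # default errors="ignore": "zzz" is silently skipped
# df.rename(columns={"zzz": "y"}, errors="raise")  # stricter opt-in: would raise a KeyError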

So there is a consideration of API consistency, as well as of what the optimal API for this individual method is.


The use case here would be to have a single dictionary of column names -> dtypes for a project, something I've done in the past that feels nice and explicit/organized. At various points in the process of data manipulation/analysis, you may end up with dataframes that have only a subset of the columns but not naturally have an explicit representation of the subset of column names that you want.

This alternative suggestion is not sufficient, IMO:

You can also get around it with a dict comprehension: df.astype({c: d for c, d in _dtypes.items() if c in df.columns})

This does not work if you're writing a chain of operations, something like

(
    df
    .melt(...)  # Creates new columns from values in the column index; these may not be typed correctly
    .astype(...)  # At this point there is no variable giving access to the current set of column names
)
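(The closest in-chain equivalent I can think of is routing the frame through .pipe, sketched below with a hypothetical _dtypes mapping, but that only reintroduces the comprehension noise this issue is about:)

import pandas as pd

_dtypes = {"value": "float64"}  # hypothetical project-wide column -> dtype mapping
df = pd.DataFrame({"id": [1, 2], "value": ["1.5", "2.5"]})

out = (
    df
    .pipe(lambda d: d.astype({c: t for c, t in _dtypes.items() if c in d.columns}))
)
print(out.dtypes)  # value is now float64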

Comment From: randolf-scholz

@mroeschke

Thanks for the suggestion, but I would be -1 on this enhancement, as I think promoting the more explicit behavior is better, and because of your other point.

And that's why I am not arguing for changing the default behaviour, except in the case when one can be relatively sure that no typos happened (i.e., when all columns are present as keys).

@attack68 Of course one can get around it this way, but as I said earlier, I find this has several disadvantages, mainly that

  1. It becomes really annoying when you have to do it often; in my case this happens when constructing several tables with stack/unstack/melt operations, which have a tendency to destroy the data types.
  2. In my opinion, it hurts readability. There is no real useful extra information in the line
df.astype({c: d for c, d in dtypes_dict.items() if c in df.columns})

compared to

df.astype(dtypes_dict)

To me, the former really just feels like it adds useless noise to the code, whereas the latter is to the point. I would kindly ask you to reconsider, and to think carefully about what I am actually proposing, which is not to change the default behaviour except in cases where it is safe to do so.

Comment From: mwaskom

@randolf-scholz I think this is a good suggestion and it's something I've stumbled over in the past. I might offer a friendly suggestion which is that the tone of your request, e.g.

This is a completely unnecessary restriction, it would clearly suffice if it would roughly perform ...

could be decreasing enthusiasm for the idea. It reads (to me) as unnecessarily combative, and I think you might find more success by focusing on the value of your proposed enhancement rather than the (perceived) flaws of the existing API.

Comment From: mwaskom

Another consideration is that errors="ignore" would be ambiguous here in a way that it isn't with df.rename. What happens if you ask for an illegal conversion of a column that does exist (say, casting a non-numeric string column to float)?
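A minimal sketch of that ambiguity, using the errors="ignore" behaviour astype already has:

import pandas as pd

df = pd.DataFrame({"s": ["not a number"]})
# errors="ignore" already means "keep the original data if a cast fails":
out = df.astype({"s": "float64"}, errors="ignore")
print(out.dtypes)  # s is still object; the failed cast was swallowed, not reported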

Comment From: attack68

When I reconsider the proposal carefully I must raise two points for discussion:

I propose the following changes: 1. If df.columns is a subset of dtype_dict, do not raise an error if superfluous keys are present; in this case all columns are identified and there is a negligible chance that there is an error due to a typo. 2. Extend the functionality of the already present errors='ignore' option to also ignore superfluous keys in the dtype_dict.

There may be a reduced chance of an error due to a typo in case 1, but that does not necessarily imply that the overall chance of error in the programme falls below a threshold that makes it acceptable. It may or may not. The more complicated error-handling logic may (or may not) also have unwanted side effects on other programmes.

For 2. @mwaskom makes a similar observation to my own.

Were we discussing the issue from an original design perspective, I'm not sure what I would prefer; perhaps that is why the API is inconsistent in parts, since this is one of those subjective areas. However, having an established MO means, at least for me, that the bar must be suitably high and the argument suitably strong and proven before a change is warranted.

But pandas is open source; feel free to contribute a quick/draft PR if you feel it is convincing enough and see how it's taken.

Comment From: jbrockmendel

I agree with @mroeschke here, -1 on changing.