I think that we should deprecate Series.append and DataFrame.append. They're making an analogy to list.append, but it's a poor analogy since the behavior isn't (and can't be) in place. The data for the index and values needs to be copied to create the result.

These are also apparently popular methods. DataFrame.append is around the 10th most visited page in our API docs.

Unless I'm mistaken, users are always better off building up a list of values and passing them to the constructor, or building up a list of NDFrames followed by a single concat.
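For illustration, a minimal sketch of the two patterns (the row source here is invented):

import pandas as pd

# hypothetical row source, purely for illustration
rows = [{"a": i, "b": i ** 2} for i in range(1000)]

# discouraged: each iteration copies the whole frame, so this is quadratic
df_slow = pd.DataFrame(columns=["a", "b"])
for row in rows:
    df_slow = df_slow.append(row, ignore_index=True)

# preferred: build the records up front and construct (or concat) once
df_fast = pd.DataFrame(rows)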

Comment From: jreback

+1 from me (though I will usually be +1 on deprecating things generally)

yeah, agreed, these are a footgun

Comment From: erfannariman

+1, it's better to have one method, which is pandas.concat; it's also more flexible, since it takes a list of dataframes and has the option to concat over axis 0 / axis 1.

Comment From: shoyer

Strong +1 from me!

Just look at all the (bad) answers to this StackOverflow question: https://stackoverflow.com/questions/10715965/add-one-row-to-pandas-dataframe

Comment From: jreback

we should also deprecate expansion indexing as well (which is an implicit append)

Comment From: AlexKirko

+1 from me. There is really no reason to have this when we have concat available, especially because IIRC append works by calling concat, and I don't think append abstracts away enough to justify keeping it.

Comment From: achapkowski

How do you expand a dataframe by a single row without having to create a whole dataframe then?

Comment From: TomAugspurger

I'd recommend thinking about why you need to expand by a single row. Can those updates be batched before adding to the DataFrame?

If you know the label you want to set it at, then you can use .loc[key] = ... to expand the index without having to create an intermediate. Otherwise you'll need to create a DataFrame and use concat.
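For example, a minimal sketch of that label-based expansion (the frame and labels here are made up):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# assigning to a not-yet-existing label through .loc enlarges the frame,
# without building an intermediate one-row DataFrame
df.loc[2] = [5, 6]          # values in column order
df.loc["total"] = df.sum()  # a Series aligned on the columns also works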

Comment From: darindillon

Disagree. Appending a single row is useful functionality and very common. Yes, we understand it's inefficient; but as TomAugspurger himself said, this is the 10th most commonly referenced page in the help, so clearly lots of people have this use case of adding a single row to the end. We can tell ourselves we're removing the method to "encourage good design", but people still want this functionality, so they'll just use the workaround of creating a new DataFrame with a single row and concat'ing. That just requires the user to write even more code to still get the exact same performance hit, so how have we made anyone's life better?

Comment From: taylor-schneider

Not being able to add rows to a data structure makes no sense. It's one thing to not add the inplace argument, but to deprecate the feature is nuts.

Comment From: achapkowski

@TomAugspurger using df.loc[] requires me to know the length of the dataframe, and to write code like this:

df.loc[len(df)] = <new row>

This feels like overly complex syntax for an API that is meant to make data operations simple. Internally df.append or series.append could just do what is shown above, but don't dirty up the user interface.

Why not take a page from lists? The list append method is quick because it pre-allocates slots in advance. Modify the internals post DataFrame/Series creation to have 1000 empty hidden rows slotted and ready to take new information. If/when the slots are filled, the DF/Series would expand outside the view of the user.

Comment From: TomAugspurger

loc requires you to know the label you want to insert it at, not the length.

Why not take a page from lists? The list append method is quick because it pre-allocates slots in advance.

You could perhaps suggest that to NumPy. I don't think it would work in practice given the NumPy data model.

Comment From: achapkowski

Is Numpy deprecating the append method? If not, why deprecate it here?

Numpy doc: https://numpy.org/doc/stable/reference/generated/numpy.append.html

Comment From: MarcoGorelli

Shall we make this happen and get a deprecation warning in for 1.4 so these can be removed in 2.0? If there's no objections, I'll make a PR later (or anyone following along can, that's probably the fastest way to move the conversation forward)

Comment From: achapkowski

@MarcoGorelli my question still stands, why is this being done?

Comment From: darindillon

Yes, why are we doing this? It seems like we're removing a VERY popular feature (the 10th most visited help page according to the OP) just because that feature is slow. But if we remove the feature, people will still want this functionality so they'll just end up implementing it manually anyway, so how are we improving anything by removing this?

Comment From: jreback

there is a ton of discussion pls read in full

this has long been planned as inplace operations make the code base inordinately complex and offer very little benefit

Comment From: achapkowski

@jreback I don't see tons of discussion in this issue; please point me to the discussion so that I might be better informed. What I see is a community asking you not to do this.

Comment From: MarcoGorelli

There's a long discussion here on deprecating inplace: #16529

But if we remove the feature, people will still want this functionality so they'll just end up implementing it manually anyway, so how are we improving anything by removing this?

I'd argue that this is still an improvement, because then it would be clearer to users that this is a slow feature - with the status quo, people are likely to think it's analogous to list.append

What's your use-case for append? What does it do that you can't do without 1-2 lines of code which call concat? If you want to make a case for keeping it, please show a minimal example of where having append is a significant improvement

Comment From: neinkeinkaffee

take

Comment From: gesoos

Any chance we can get a note in the documentation on this?

Comment From: MarcoGorelli

@gesoos agreed, there should probably be a .. deprecated:: note in the docstring - do you want to open a PR for this?

Comment From: behrenhoff

I understand that appending can be inefficient, but we use it in non-performance-critical code, i.e. I don't care. Appending some artificial data (usually based on the data already in the DF) is a very common use-case for us. And I would like to mention why we are using append all over our code base instead of concat: concat loses all the attrs, while append works just fine.

import pandas
df1 = pandas.DataFrame({"a": [1]})
df2 = pandas.DataFrame({"a": [2]})
df1.attrs["metadata-xy"] = 42

print(df1.append(df2).attrs)  # keeps the attrs of df1
print(pandas.concat([df1, df2]).attrs)  # no attrs in result

Comment From: MarcoGorelli

Thanks @behrenhoff - maybe this is a case for concat preserving attrs then? Do you want to open a separate issue for that?

Comment From: achapkowski

I don't think you should deprecate append until there is parity with concat.

Comment From: ndevenish

Just had warnings for this appear all over our output. The first thing I did was go to https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html to find the reasons why, and whether the suggested replacement was strictly equivalent.

So this appears to be deprecated in code but not the documentation? I can imagine other people also finding this discrepant.

There's a long discussion here on deprecating inplace: #16529

The word "append" only shows on that page as a link to this issue, so one can be forgiven for not finding it when looking for discussion on this deprecation.

Comment From: MarcoGorelli

So this appears to be deprecated in code but not the documentation? I can imagine other people also finding this discrepant.

There's a PR open to add this to the docstring

Comment From: ndevenish

Awesome. Maybe it'd be worth considering adding "Check deprecations are documented" to the release validation process. Most (but not all) of the 1.4.0 deprecation list have them.

Comment From: wumpus

I wrote code that implements append() using a 2-level self-tuning set of accumulators that runs plenty fast without using too much memory. Of course it uses pd.concat() under the hood, and as you can see I ended up finding lots of small differences between the semantics of the deprecated df.append() and pd.concat().

I don't think it's a good idea to force all of your end users to write this kind of code.

https://github.com/wumpus/pandas-appender
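To illustrate the general idea (this is only a rough sketch of the batching pattern, not the package itself): buffer incoming rows in a plain Python list and only concat once per batch, which amortizes away the quadratic copying.

import pandas as pd

class BufferedAppender:
    # sketch only: rows accumulate in a list; concat runs once per batch
    def __init__(self, df, batch_size=1000):
        self.df = df
        self.buffer = []
        self.batch_size = batch_size

    def append(self, row_dict):
        self.buffer.append(row_dict)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.df = pd.concat([self.df, pd.DataFrame(self.buffer)],
                                ignore_index=True)
            self.buffer = []
        return self.df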

Comment From: dkamacharov19

Disagree. Appending a single row is useful functionality and very common. Yes, we understand it's inefficient; but as TomAugspurger himself said, this is the 10th most commonly referenced page in the help, so clearly lots of people have this use case of adding a single row to the end. We can tell ourselves we're removing the method to "encourage good design", but people still want this functionality, so they'll just use the workaround of creating a new DataFrame with a single row and concat'ing. That just requires the user to write even more code to still get the exact same performance hit, so how have we made anyone's life better?

As a pandas user and a novice coder, I don't understand why this comment is being overlooked yet has been upvoted the most. The rationale behind this decision seems arbitrary and appears to ignore a significant contingent of the population that might be using the pandas library. I count myself as a user and would urge you to consider your user base, which might not have efficiency in mind when using this function. I can assure you that when utilizing pandas and append, a significant portion of the population does not have computational efficiency in mind. If that were a primary concern, Python would likely not be the language of choice, let alone pandas or the append function. There does not appear to be a 1-to-1 replacement when using the concat function in place of append, and as another user has already commented, I don't believe it should be deprecated until that is addressed.

Comment From: TomAugspurger

The replacement is to build up a list or dictionary of records ahead of time, and then pass that to pandas.
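For example (the record shape here is invented):

import pandas as pd

records = []
for i in range(5):  # stand-in for whatever produces the rows
    records.append({"a": i, "b": i * 10})

df = pd.DataFrame.from_records(records)  # or simply pd.DataFrame(records)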

Comment From: MarcoGorelli

I don't believe it should be deprecated until that is addressed.

It's not been removed yet, for now there's just the warning.

I've opened #45824 for the attrs limitation mentioned above

If people notice other limitations, they can open issues about them, and by the time append is removed, they'll have been addressed

Comment From: dkamacharov19

I understand it is not getting deprecated right away, and I also understand there's a better way to do this. Again, my comment was completely ignored for the sake of making an argument that there are, as the commenter I quoted stated, ways to encourage "good design". Why deprecate a perfectly usable function that is clearly popular with your user base? Not here to debate code with you as I don't code for a living. Just wanted to share an alternative viewpoint to consider. Sometimes the most efficient way is not always the right approach, despite what you might believe will encourage better coding. Why break a function that has widespread usage? Seems counterintuitive to me and again rather arbitrary.

Comment From: seanbow

It took me way too long to figure out that I need to replace

df.append(series)

with

pd.concat([df, series.to_frame().T])

I agree that this functionality is very common and should be included, maybe by a name other than append. It's for a case where I really do need to append just one bar at a time and efficiency isn't very important.

edit: Ok, it's worse. I want to append a dict to the end of a dataframe as a row now and it's going to require some other hacky method when .append() was working just fine.
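For the record, the closest concat equivalent I've found for the dict case looks like this (a sketch with invented column names):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
row = {"a": 5, "b": 6}

# wrapping the dict in a one-element list gives a one-row frame
df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)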

Comment From: wumpus

I'd like to point out again that I have an efficient, scalable implementation of df.append:

https://github.com/wumpus/pandas-appender

Comment From: CassWindred

This is a very frustrating decision. It is extremely common to have to append individual rows to a DataFrame, or individual elements to a Series, and the "intended" way of doing this is much harder to read, requires more lines of code and is far less intuitive. I am not very experienced with pandas, and now each time I need to add something to a DataFrame I pretty much have to look this up every time, whereas append() is very obvious.

Yes, it may be slower, but for my use case the effect is negligible, and impossible to batch into a single call, at least not without making the code much slower and harder to read.

I've been using pandas for a few bits and pieces over the last couple of years, and 90% of the time I am using pandas entirely because it gives a bunch of tools and operations that make it much more painless to work with certain types of data, with easy and intuitive operations to do things that would otherwise take several lines of code with raw Python data structures. Very rarely does the efficiency and speed of these operations matter on any human-perceptible scale. I'm using pandas to make programming faster, not the program itself, and avoiding append() makes a very common operation an order of magnitude more painful.

Please reconsider this change, people need to append to DataFrames, and they won't stop doing so after the functionality is removed, they will just write more fragile and unintuitive code to do it instead.

Comment From: ChayanBansal

Is the DatetimeIndex.append method also going to be deprecated?

Comment From: TomAugspurger

Does anyone have any non-trivial examples that are worse off after this deprecation? I'm happy to help write docs on the transition. I have one semi-realistic example at https://tomaugspurger.github.io/modern-4-performance.html, where you have a directory of CSV files to concatenate together.

The "bad" way, using append:

files = glob.glob('weather/*.csv')
columns = ['station', 'date', 'tmpf', 'relh', 'sped', 'mslp',
           'p01i', 'vsby', 'gust_mph', 'skyc1', 'skyc2', 'skyc3']

# init empty DataFrame, like you might for a list
weather = pd.DataFrame(columns=columns)

for fp in files:
    city = pd.read_csv(fp, names=columns)
    weather = weather.append(city)

The "good" way, using concat

files = glob.glob('weather/*.csv')
weather_dfs = [pd.read_csv(fp, names=columns) for fp in files]
weather = pd.concat(weather_dfs)

That's dealing with DataFrame.append(another_dataframe). I gather that some / most of the difficulties expressed here are from workflows where you're appending a dictionary? Anyone able to share a workflow like that?

Comment From: wumpus

I run parameter surveys over millions of combinations, which are represented as a dataframe. The answers come back over 10s of hours, and are appended one by one to an output dataframe.

Comment From: TomAugspurger

@wumpus do you have a minimal example?

Comment From: wumpus

The usecase is hidden in the guts of a middleware package https://github.com/wumpus/paramsurvey -- if you look at the example in the README, you can see what the user sees (results returned as a dataframe)

My code that does df.append() efficiently and scalably is https://github.com/wumpus/pandas-appender

Comment From: seanbow

My use is in collecting financial data online, I get a dict every 5 minutes or so and append it to an existing data frame in memory to do analysis on. Waiting to collect more than one row at a time doesn't make any sense.

Comment From: ppatheron

I use append in a client application where it is very niche - this application is running on production and now I have to update the code to use concat?

I am not an expert Python Programmer, but this way of using append is really useful in my use case, as I need to add the contents of a dictionary, to another dictionary which has a list. But concat does not work for that! I index the dictionary with the list, and then append the contents of the other dictionary into that list.

What will happen now? When will this deprecation happen? So not cool :/

Comment From: MarcoGorelli

I get a dict every 5 minutes or so and append it to an existing data frame in memory to do analysis on

Isn't that relatively straightforward with concat though?

>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
>>> df
   a  b
0  1  4
1  2  5
2  3  6
>>> df.append({'a': 4, 'b': 7}, ignore_index=True)
   a  b
0  1  4
1  2  5
2  3  6
3  4  7
>>> pd.concat([df, pd.DataFrame({'a': 4, 'b': 7}, index=[3])])
   a  b
0  1  4
1  2  5
2  3  6
3  4  7

I need to add the contents of a dictionary, to another dictionary which has a list. But concat does not work for that!

Can you show a minimal reproducible example please?

When will this deprecation happen?

Version 2.0, I believe

So not cool :/

Please be constructive

Comment From: ppatheron

@MarcoGorelli Apologies, not trying to be rude but I am a bit stressed.

So I loop through a DataFrame which contains multiple rows of customer data, which needs to be appended to a JSON/Dictionary object:

ntwrk_src_data_load = {}

header_ntwrk = {
    "header": {
        "sender": "test",
        "receiver": "test",
        "model": "test",
        "messageVersion": "test",
        "messageId": "test",
        "type": "network",
        "creationDateAndTime": datetime.now().strftime('%Y-%m-%dT%H:%M:%S.%f')
    },
    "network": []
}

ntwrk_src_data_load.update(header_ntwrk)

for row in payload_1.itertuples(index=False):
    ntwrk_src_list_1 = {
        "creationDateTime": datetime.now().strftime('%Y-%m-%dT%H:%M:%S.%f'),
        "documentStatusCode": "test",
        "documentActionCode": "test",
        "lastUpdateDateTime": datetime.now().strftime('%Y-%m-%dT%H:%M:%S.%f'),
        "pickUpLocation": {
            "locationId": row[0]
        },
        "dropOffLocation": {
            "locationId": row[1]
        },
        "transportEquipmentTypeCode": {
            "value": row[2]
        },
        "freightCharacteristics": {
            "transitDuration": {
                "value": row[3]
            },
            "loadingDuration": {
                "value": row[4]
            }
        },
        "sourcingInformation": [
            {
                "sourcingMethod": "",
                "sourcingItem": {
                    "itemId": row[5].lstrip('0')
                },
                "sourcingDetails": {
                    "effectiveFromDate": row[6],
                    "effectiveUpToDate": row[7],
                    "priority": row[8],
                    "sourcingPercentage": row[9],
                    "majorThresholdShipQuantity": {
                        "value": row[10]
                    },
                    "minorThresholdShipQuantity": {
                        "value": row[11]
                    }
                }
            }
        ]
    }

    ntwrk_src_data_load['network'].append(ntwrk_src_list_1)

json_data_1 = json.dumps(ntwrk_src_data_load)

This adds all the contents of the rows which I require to my dictionary, which is then dumped as a JSON format.

I send this JSON file via an API to the client. How would I concat the looped rows into the specific list inside the dictionary as above?

Comment From: MarcoGorelli

I suspect you want something like ntwrk_src_data_load['network'] = pd.concat([ntwrk_src_data_load['network'], pd.DataFrame(ntwrk_src_list_1, index=[0])]) (replace 0 with whatever you want the index of the new row to be), but without a minimal reproducible example (please see here for how to write one) it's hard to say more

Comment From: ppatheron

I've tried what you mentioned, but receive a TypeError:

TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid

I'll try my best with the minimal code:

import pandas as pd
from hdbcli import dbapi
import numpy as np
import hana_ml.dataframe as dataframe
import time
import math
import logging
import threading
from datetime import datetime

connection = dbapi.connect(address='<<IP>>', port='<<PORT>>', user='<<USER>>', password='<<PASSWORD>>')
cursor = connection.cursor()

df = pd.read_sql('''SQL STATEMENT''', connection)  # <-- This brings in all the required fields from the DB connection above for a specific table

ntwrk_src_data_load_1 = {}

header_ntwrk = {
    "header": {
        "sender": "",
        "receiver": "",
        "model": "",
        "messageVersion": "",
        "messageId": "",
        "type": "network",
        "creationDateAndTime": date
    },
    "network": []
}

The "network" object is the list that I have to populate from the dictionary below.

The contents of the above df are then looped through, and each row is indexed into the JSON/dictionary structure that the customer requires.

ntwrk_src_data_load_1.update(header_ntwrk)

for row in payload_1.itertuples(index=False):
    ntwrk_src_list_1 = {
        "rows_to_be_populated": row[0]
    }

    ntwrk_src_data_load_1['network'].append(ntwrk_src_list_1)

json_data_1 = json.dumps(ntwrk_src_data_load_1)

the "ntwrk_src_list_1" is the object that returns multiple "lists" that has to be inserted into "ntwrk_src_data_load_1" object. So essentially, each row in the payload has it's own structure inside the dictionary/JSON file.

Comment From: MarcoGorelli

That's neither minimal nor reproducible, sorry - if you want support for your specific use-case, please read through and follow this https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports and then try again

Comment From: ppatheron

Let's try again:


import json
import pandas as pd

df = pd.DataFrame({'name' : ['Paul', 'Jessie'],
                  'surname': ['Lessor', 'Spolander'],
                  'address': ['61 Gravel Road', '2 Pointer Streer']})

payload_for_loading = {}

header_for_json = {
    "header": {
        "sender": "System",
        "date": "2022-03-14"
    },
    "client": []
}

This "client" list needs to be populated with the DataFrame created above.

payload_for_loading.update(header_for_json)

for row in df.itertuples(index=False):
    payload_dict = {
        "client_name": row[0],
        "client_surname": row[1],
        "client_address": row[2]
    }

    payload_for_loading['client'].append(payload_dict)
json_payload = json.dumps(payload_for_loading)

This code produces the result I require; however, how would the .append call be replaced with .concat?

Comment From: MarcoGorelli

Looks like payload_for_loading['client'] is a list? In which case, your code will continue working as usual

It's DataFrame.append that's being deprecated, not append of the Python built-in list

Comment From: ppatheron

Perfect - apologies for any confusion, and thank you so much for your assistance. I noticed that my logging is using the DataFrame.append, and not my payload code. I will still have to update my code to use concat but I've already tested that and it's working.

Comment From: wumpus

@MarcoGorelli I'm not sure what you think is constructive, but I've mentioned repeatedly that I have a scalable wrapper that preserves df.append semantics, while preventing everyone having to independently write the same code. Am I not being constructive?

Comment From: MarcoGorelli

Hey @wumpus ,

My "please be constructive" was in response to the comment "So not cool :/" and wasn't directed at you

Thanks for your input and for a link to your wrapper

Comment From: jreback

@wumpus that was in reference to another conversation (not you)

your wrapper is not incorporated to pandas / likely won't be in any event

Comment From: wumpus

I agree that I appear to be wasting my time, despite having a solution to the root problem. What am I doing wrong?

Comment From: jreback

@wumpus what u wrote might be fine for your use but it's not going to be possible to do this lazy type of evaluation in a reliable way in pandas itself

sure it could be done but would lead to a large amount of edge cases that would lead to a very brittle / complex soln

Comment From: wumpus

My code is fully lazy. I agree that there are probably edge cases -- easy to see because append() and concat() are wildly different.

Comment From: lazypandabear

I wish you would not deprecate Series/DataFrame.append. There are scenarios in my code that I could not handle using pd.concat. For example, I created a list of records that has missing series, and it requires me to do a groupby to be able to identify them. Then I created a list of those records, iterated over each of those, and appended the missing series using df.append. I cannot find a way to do this with pd.concat.

Comment From: MarcoGorelli

Usual response - please provide a minimal reproducible example

Comment From: rahilbhansali

My two cents on concat vs append (since I use it quite extensively in my algotrading platform):

  1. Append has been incredibly useful for me and I've used it in probably 12-15 places in my codebase. I use dataframes to load price data into memory for fast compute and at times need to append new rows (e.g. orders placed) to a dataframe. Given the size, I use Dataframes almost as a replacement for lists since its far more nimble.

  2. Append until now - allowed me to quickly and easily add a dictionary to an existing dataframe. Concat now requires me to create a dataframe with 1 or more rows and then concat it with my existing dataframe vs. simply just adding a dictionary to the existing dataframe.

Concat seems to be a vertical merge of two dataframes (extend rows) vs. merge, which horizontally merges two dataframes (i.e. extends columns based on common keys). If anything, concat intuitively does not suggest appending, so here's what I propose:

  1. Deprecate Append if it's indeed slower (but...)
  2. Allow concat to add dictionaries to the dataframe (along with support for arrays). Also instead of doing pd.concat - why can't we simply do df.concat([df1, df2]) which adds data from df1 and df2 to df?
  3. Rename Concat to Append - honestly append is a more intuitive word used across the board - I can choose to append from another df or rows directly.

As for an example, previously I used to use this:

self.df_balances = self.df_balances.append(trade_date_balance.to_dict(), ignore_index=True)

Now it's replaced with (a little annoying):

new_balances_row_df = pd.DataFrame(trade_date_balance.to_dict(), index=[0])
self.df_balances = pd.concat([self.df_balances, new_balances_row_df], ignore_index=True)

For context - df_balances is a dataframe I maintain to save daily balances during my backtesting engine runs which allows me to compute funds available for investing. As I loop through my backtesting dates, I keep inserting this into the dataframe at the end of the day so I can quickly access it later when needed. Eventually, I output the df into a csv so that I can manually verify there is no calculation or settlement error (from a funds perspective).

I do use .loc to make updates - however, it isn't intuitive, because you need to know the index or the label - which honestly doesn't matter when you append - and from my knowledge, I don't think .loc supports adding a dictionary.
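Edit: a sketch of the .loc route for the dict case (assuming the dict keys match the columns; treat it as unverified):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
row = {"a": 5, "b": 6}

# .loc enlarges on a new label; a Series aligns on column names,
# so the values land in the right columns rather than by position
df.loc[len(df)] = pd.Series(row)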

Comment From: erfannariman

  1. Allow concat to add dictionaries to the dataframe (along with support for arrays). Also instead of doing pd.concat - why can't we simply do df.concat([df1, df2]) which adds data from df1 and df2 to df?

I think allowing concat to add dictionaries is a fair point, since it is mentioned multiple times in this topic. Not sure about df.concat([df1, d2]), it's just as easy to use pd.concat([df, df1, df2]).

Comment From: rahilbhansali

@erfannariman - agreed - it's not hard. But merge also uses the same lingo - df.merge(df1); since concat is just a merger of rows from two dfs (in some sense), we might as well stick to the same writing style as merge?

Not a big one - but was a comment for consistency.

Comment From: marc-moreaux

I feel like code readability is so much better with append than concat. I understand that append is not in-place and that it is less efficient than concat.

Even so, append feels more pythonic to me than concat does.

I often use it with single-row dictionaries, Series or DataFrames, and I feel that my code is more readable this way... Would it make sense to get new appends like:

  - df.append_dict
  - df.append_serie
  - df.append_frame

Comment From: MarcoGorelli

Would it make sense to get new appends like:

-1 on adding even more methods to the API, and very confident that there'd be broad consensus on this among pandas devs

Examples of how to do these, though, would be good candidates for the docs Tom said he'd help write

I understand that append is not in-place and that it is less efficient than concat.

If you're just appending a single row, there shouldn't be much difference in efficiency. If you're appending multiple, then that's where append encourages inefficient code, which is why it's been deprecated. Here's an example from the awesome library ArviZ where the append deprecation "forced" them to write better code: https://github.com/arviz-devs/arviz/pull/1973/files

Comment From: achapkowski

@MarcoGorelli so the question is: is the deprecation being reconsidered?

Comment From: MarcoGorelli

No, what makes you think that?

There can be docs to help the transition (which you'd be welcome to help out with, see the contributing guide if you're interested)

Comment From: achapkowski

@MarcoGorelli clearly the community is saying this is bad. What will it take to stop this?

Comment From: MarcoGorelli

What will it take to stop this?

I'd suggest starting with a minimal reproducible example indicating why you think append needs to stay

Comment From: achapkowski

Does this meet the needs of a simple sample, @MarcoGorelli?

append is simple to understand: everyone knows list().append. pd.concat is more like list.extend. For pushing lots of data, extend is better on a list; for one row, append is fine. The dev team is pushing everyone toward the extend-like method.

What we have and should stay:

import pandas as pd
data = [{'simple' : 'example 1'}, {'simple' : 'example 2'}, {'simple' : 'example 3'}]
pd.DataFrame(data).append({'simple' : "example 4"}, ignore_index=True)

Now let's append with concat:

Example 1: Error

df = pd.DataFrame(data)
pd.concat([df, {'simple' : "example 4"}])
Traceback (most recent call last):
  Python Shell, prompt 18, line 1
    # Used internally for debug sandbox under external interpreter
  File "C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-dev\Lib\site-packages\pandas\core\reshape\concat.py", line 295, in concat
    sort=sort,
  File "C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-dev\Lib\site-packages\pandas\core\reshape\concat.py", line 370, in __init__
    raise TypeError(msg)
builtins.TypeError: cannot concatenate object of type '<class 'dict'>'; only Series and DataFrame objs are valid

Example 2: Error

df = pd.DataFrame(data)
df.concat([{'simple' : "example 4"}]) # method doesn't exist

Example 3: I need to create a whole new dataframe for 1 row

df = pd.DataFrame(data)
df1 = pd.DataFrame(data=[{'simple' : 'example 4'}])
pd.concat([df, df1]) # no error finally

Output:

pd.concat([df, df1])
      simple
0  example 1
1  example 2
2  example 3
0  example 4

A bit of a note on example 3: pd.concat is a method within pandas, not on the object, whereas append is right on the DataFrame. We have overhead for 1 row creating a dataframe. This seems like overkill. Plus now I have to reset my index with concat.

So if I were to break it down:

  1. append exists on the dataframe and is a common function used throughout the python ecosystem
  2. The pd.concat method doesn't exist on the DataFrame. That means a user has to search for the function.
  3. Adding a single row requires you to create a dataframe; users cannot just push a dictionary.
  4. Both methods have a place in the API; keep them both and instruct users when one is better than the other.
  5. pd.concat causes users to have to manage the indexes themselves; append automatically increments to the next index.

Comment From: MarcoGorelli

You can do

>>> pd.concat([pd.DataFrame(data), pd.DataFrame({'simple': 'example 4'}, index=[len(data)])])
      simple
0  example 1
1  example 2
2  example 3
3  example 4

which doesn't seem more complicated than using append

>>> pd.DataFrame(data).append({'simple' : "example 4"}, ignore_index=True)
<stdin>:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
      simple
0  example 1
1  example 2
2  example 3
3  example 4

append is simple to understand: everyone knows list().append

Yes, that's exactly the issue - to quote the original post: "They're making an analogy to list.append, but it's a poor analogy since the behavior isn't (and can't be) in place. The data for the index and values needs to be copied to create the result."

Comment From: phofl

We have overhead for 1 row creating a dataframe. This seems like overkill. Plus now I have to reset my index with concat.

  1. Creating a DataFrame is exactly what happens under the hood -> no overhead with concat
  2. You can simply set ignore_index=True for concat, no need to call reset_index

Comment From: achapkowski

I wasn't looking for solutions for my example... I knew this is what would happen...

Look, I know there are ways around this, but why not just make append do what concat does in the background? Keep both: you keep your functionality, and the community gets to keep a well-known function name.

It seems like a solid ask and compromise.

Comment From: MarcoGorelli

but why not just make append do what concat does in the background?

It already does:

https://github.com/pandas-dev/pandas/blob/3b163de02f666a2342e18468cba7d6c286f526bf/pandas/core/frame.py#L9253-L9310

The issue isn't for when you're appending a single row, but for when you're appending many (e.g. in a loop) - in that case, having append encourages bad and inefficient code

Example:

In [2]: import pandas as pd

In [3]: df = pd.DataFrame(range(10_000))
   ...: dfs = [df] * 100

In [4]: %%timeit
   ...: df_result = dfs[0]
   ...: for df in dfs[1:]:
   ...:    df_result = df_result.append(df)
   ...: 
1.39 s ± 51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %%timeit
   ...: df_result = pd.concat(dfs)
   ...: 
   ...: 
3.6 ms ± 76.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Comment From: achapkowski

Then just state the purpose of the method in the docs. You are shoehorning all use cases into one method, when two methods are fine.

Comment From: martin-martin

I haven't seen the impact on chaining style pandas code mentioned in the discussion above (maybe it's discussed elsewhere?), so here's what I'm wondering:

Writing Chaining Style pandas Code

Deprecating pandas.DataFrame.append() will remove a seemingly intuitive possibility to add a row (or rows) to a data frame while writing pandas code in a chained style:

fruits = pd.DataFrame(
    {
    "name": ["apple", "pear", "avocado"],
    "image": ["🍏", "🍐", "🥑"]
    }
)

veggies = pd.DataFrame(
    {
    "name": ["tomato", "carrot", "avocado"],
    "image": ["🍅", "🥕", "🥑"]
    }
)

both_fruit_and_vegetable = (
    fruits
    .append({"name": "tomato", "image": "🍅"}, ignore_index=True)  # Forgot the tomato is a fruit, too!
    .merge(veggies)
    # ... Add other chained operations
)

print(both_fruit_and_vegetable)

# OUTPUT:
#
#       name image
# 0  avocado     🥑
# 1   tomato     🍅

I'm not sure how often you'd want to add rows to a data frame like this, and I understand you could achieve the same using pandas.DataFrame.merge(), e.g. in this minimal example:

both_fruit_and_vegetable = (
    fruits
    .merge(pd.DataFrame({"name": ["tomato"], "image": ["🍅"]}), how="outer")
    .merge(veggies)
)

I'm also showing the merge functionality as an example because pandas has the instance-level method pandas.DataFrame.merge() as a wrapper for the lower-level pandas.merge().

I thought that this wrapper exists to make chained-style pandas possible for merge operations (and at least a few others think so too), but please correct me if I'm wrong.

Alternatives to Using .append() for Chaining Syntax

So I'm wondering whether there's a suggested alternative for adding a row to a data frame when writing chained style pandas code.

Is the solution to use pandas.DataFrame.merge() with appropriate parameters, or will non-SQL-wizards run into unexpected join behavior that's harder to wrap your head around than a seemingly more straightforward append/concat style concatenation?

Or could it be useful to add an instance-level pandas.DataFrame.concat() method that uses pandas.concat() internally, but opens up the opportunity to chain the operation to other operations using a familiar syntax?

Thanks for your thoughts and work!

Comment From: MarcoGorelli

First of all, that's a great example, thanks!

Though can't concat fit into the chain?

In [7]: both_fruit_and_vegetable = (
   ...:     pd.concat([fruits, pd.DataFrame({'name': ['tomato'], 'image': ["🍅"]})], ignore_index=True)
   ...:     .merge(veggies)
   ...:     # ... Add other chained operations
   ...: )

In [8]: both_fruit_and_vegetable
Out[8]: 
      name image
0  avocado     🥑
1   tomato     🍅

Comment From: martin-martin

Lol, thanks 😋

Your example works in this specific case, where .append() is the first thing I do. But it doesn't work when I'd want to concat somewhere lower down in the chain, e.g.:

both_fruit_and_vegetable = (
    fruits
    .merge(veggies)    
    .append(pd.DataFrame({'name': ['tomato'], 'image': ["🍅"]}), ignore_index=True)
 )

I can't chain pd.concat() onto a previous chain link, which is possible with df.append()

Comment From: shoyer

You can also use .pipe() for method chaining with arbitrary functions.

Comment From: MarcoGorelli

Sure but you can still fit concat into the chain:

both_fruit_and_vegetable = pd.concat(
    [fruits.merge(veggies), pd.DataFrame({"name": ["tomato"], "image": ["🍅"]})],
    ignore_index=True,
)

Or indeed, as suggested above:

fruits.merge(veggies).pipe(
    lambda df: pd.concat(
        [df, pd.DataFrame({"name": ["tomato"], "image": ["🍅"]})], ignore_index=True
    )
)

If you just need to append a single row, then such workarounds should be fine. If you need to append many rows inside a loop, then not having append will at least not encourage inefficient code

Comment From: dkamacharov19

Sure but you can still fit concat into the chain:

both_fruit_and_vegetable = pd.concat(
    [fruits.merge(veggies), pd.DataFrame({"name": ["tomato"], "image": ["🍅"]})],
    ignore_index=True,
)

Or indeed, as suggested above:

fruits.merge(veggies).pipe(
    lambda df: pd.concat(
        [df, pd.DataFrame({"name": ["tomato"], "image": ["🍅"]})], ignore_index=True
    )
)

If you just need to append a single row, then such workarounds should be fine. If you need to append many rows inside a loop, then not having append will at least not encourage inefficient code

Why is the goalpost constantly being moved here? You requested examples and they have been provided, as demonstrated here. And the answer is to use a workaround, why exactly? If append works as intended, shouldn't that be the goal? I have pretty much accepted that the powers that be are not going to listen to feedback, as you have convinced yourselves that a problem that doesn't need fixing, or arguably doesn't even exist, needs to be addressed. The solution to bad code is not to remove a tool that has been misused. I am simply trying to point out that your reasons are misguided, despite your good intentions.

Comment From: wumpus

pd.concat with a single row at a time is the performance problem.

And as a reminder, I have a demonstration of a high-performance append.

Comment From: MarcoGorelli

DataFrame.append makes an analogy to list.append, but it's a poor analogy, and it encourages inefficient code.

  • If you need to append multiple rows, then you should put them all into a list and then call pd.concat on them and you'll get a noticeable performance gain, especially for large-ish DataFrames
  • If you only need to append a single row, then one of the workarounds suggested above should be fine

The purpose of asking for minimal reproducible examples was to see if anyone had a use-case for which there wasn't a simple workaround.

You're all being listened to; I've read every post in this thread. The arguments for keeping append seem to be:

  - better legibility
  - performance isn't a concern
  - method chaining

None of these strike me as strong enough reasons to keep append:

  - the workarounds above are simple enough and also legible
  - plenty of people do care about pandas performance
  - method chaining is still entirely possible with pipe

And as a reminder, I have a demonstration of a high-performance append.

You've already advertised your package here 3 times, please stop

Comment From: wumpus

I was hoping to successfully talk to "the powers that be" about this change. Looking at the repo owners I see that you are the person I wanted to talk to! Glad I was able to get my code in front of you for a review.

Comment From: behrenhoff

Phew, so many new messages to this topic.

First of all, for me there are two points: it is such a common function that it breaks A LOT of code. This is really bad even if the append pattern is a bad one. Does it really hurt so much to keep it? It costs developers a lot of time to remove all the append calls. I love backward compatibility and I think breaking it for no good reason other than "we want to force developers to do it differently" is a very bad idea.

In our code base, we finally managed to remove all append calls, usually replacing the whole function with better code. When we started with pandas and didn't know how to work efficiently with it, a common pattern was using "manual groupbys", i.e. looping over df.some_column.unique(), applying the selection like df_group = df[df.some_column == value], doing the calculation on the group, and appending to a result. Very bad indeed. My whole point is that this doesn't improve at all when only replacing the append call with concat. Rewriting these loops with a list to collect the DFs and calling concat at the end gets around the deprecation but doesn't fix the whole style of the function. And sometimes it is even more difficult to fix the old code where an experienced pandas developer would only think "WTF". So fixing all these things is a lot of work for no good reason (the old WTF code was tested and working correctly).

@MarcoGorelli wrote:

Thanks @behrenhoff - maybe this is a case for concat preserving attrs then? Do you want to open a separate issue for that?

Actually, I am in favor of getting rid of as many attrs in our code base as possible. I don't like them at all; they were getting used all over the place, so testing became difficult (when every function expects 10 different attrs to exist, you are in hell and your functions become less reusable). Therefore we got rid of a lot of attrs. And concat discourages attrs. But yeah, that was another bit of work. So my code base is now free of attrs and free of append. Work done.

Look, I understand getting rid of some old functions is sometimes a good idea but I really really don't like removing such a common function.

None of these strike me as strong enough reasons to keep append:

the workarounds above are simple enough and also legible

Simple enough? That's only true if you don't fix the whole thing. If you just replace every append call with a concat call, you win absolutely nothing.

plenty of people do care about pandas performance

So? I don't understand this argument. df.groupby(col).apply is slow as well and not removed. Also: does append affect other functions? Is concat for two dfs faster than append? No? Only if you do multiple appends? But then you need to modify your algorithm (for example, collecting separate DFs in a list). Are there really cases where append is a problem? My point is: when you replace it with concat, it won't have an impact on performance unless you change the whole logic. I DO care about performance in pandas as well - but ONLY in the areas that affect me. Building/appending to a DF is not on the list at all. If you do care about the append aspect, use a better solution for that purpose. (A bit of whataboutism: a lot of groupby functions are slow as hell when there are many groups; that's where I care.)

I hesitated before clicking the reply button, since the deprecation is already in, so this post doesn't change anything - but I feel really strongly about "keeping compatibility". I want to be able to update pandas without worrying too much.

By the way: how does concat improve this code:

total_df = pd.DataFrame()
for file in glob("*.csv"):
    print(f"reading {file}")
    df = pd.read_csv(file)
    total_df = total_df.append(df).drop_duplicates()

Yes, it is easy to replace:

total_df = pd.DataFrame()
for file in glob("*.csv"):
    print(f"reading {file}")
    df = pd.read_csv(file)
    total_df = pd.concat([total_df, df]).drop_duplicates()

But the performance gain is 0.

Note that this doesn't work (too much RAM usage) - so you cannot blindly rewrite all df.append to use a list and concat at the end:

dfs = []
for file in glob("*.csv"):
    print(f"reading {file}")
    df = pd.read_csv(file)
    dfs.append(df)
total_df = pd.concat(dfs).drop_duplicates()

Note that append is orders of magnitude faster than read_csv in this example. No performance impact at all. Just work to remove the append calls. (and yes, our real code uses a slightly smarter algorithm)

Having seen the examples in this thread, I would not even argue that append is a strong code smell in all cases. It's a question of priorities - compatibility vs. trying to enforce a better style. Especially as a new pandas user, you want to append to your toy DF. This should - in my opinion - be an easy task. The append is only a performance problem if you do it over and over again, not in the general case where you only append one DF to another. That's a very big difference.

So at the end a TLDR:

  * compatibility IS important
  * blindly replacing every append with a concat doesn't help with performance
  * gaining performance is only possible if there are many calls to append and you do smart changes to your code (for example filling a list of DFs in a loop and calling concat on the list at the end)
  * append in a loop is a very strong code smell

Comment From: wumpus

There's a standard database algorithm to speed up appending single rows at a time to a database; that's what pandas-appender uses. That relieves pandas users from having to make smart changes.

In 2010 I had a 30 petabyte homegrown NoSQL database using this algorithm at my search engine startup.

Comment From: jreback

@behrenhoff

By the way: how does concat improve this code:

total_df = pd.DataFrame()
for file in glob("*.csv"):
    print(f"reading {file}")
    df = pd.read_csv(file)
    total_df = total_df.append(df).drop_duplicates()

Yes, it is easy to replace:

total_df = pd.DataFrame()
for file in glob("*.csv"):
    print(f"reading {file}")
    df = pd.read_csv(file)
    total_df = pd.concat([total_df, df]).drop_duplicates()

this is exactly the reason append is super problematic. We have an entire doc note, which I guess no one reads, that explains that you are doing an exponential copy here (no kidding you run out of RAM).

so you have proved the point of why append is a terrible idea - it's not about readability, but that it's easy to fall into traps that are non-obvious at first glance

Comment From: wumpus

If only there was a well-known algorithm which was not an exponential copy.

Comment From: behrenhoff

this is exactly the reason append is super problematic. We have an entire doc note, which I guess no one reads, that explains that you are doing an exponential copy here (no kidding you run out of RAM).

You did not read, or did not understand, what I was saying. The version with append is the one that WORKS; the one with concat at the end runs into memory issues (because there is the small drop_duplicates in the loop that fixes the problem and cannot be moved out).

And yes, you can be smarter, for example ((file1 + file2).drop_dups + (file3 + file4).drop_dups).drop_dups or similar - where + can be concat or append - doesn't matter. I was just proving the point that the suggested way "collect all DFs in a list and concat them all at the end" does not always work.

Comment From: MarcoGorelli

Thanks @behrenhoff , that's a nice example - though can't you still batch the concats? Say, read 10 files at a time, concat them, drop duplicates, repeat...
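Something like this sketch, say (the batch size and file pattern are invented):

import glob
import pandas as pd

files = sorted(glob.glob("*.csv"))
batch_size = 10

total_df = pd.DataFrame()
for i in range(0, len(files), batch_size):
    batch = [pd.read_csv(f) for f in files[i:i + batch_size]]
    # one concat and one dedup per batch keeps memory bounded
    total_df = pd.concat([total_df] + batch).drop_duplicates()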


This seems like a perfect summary of the issue anyway:

it's not about readability, but that it's easy to fall into traps that are non-obvious at first glance


At some point we should lock the issue, this is taking a lot of attention away from a lot of people, there's been off-topic comments, no compelling use-case for keeping DataFrame.append, and strong agreement among pandas devs (especially those who have been around the longest)

Comment From: behrenhoff

Say, read 10 files at a time, concat them, drop duplicates, repeat...

Yes, that would work. So would 1 million other solutions. In practice, I could even exploit more of the date ordering inside the files (all files here have a rather long overlapping history, but newer files can overwrite (fix) data in older files, so it is of course a drop_dups with a subset and keep=last). My point is: this is a non-issue because the operation is done once per 6 months or so; the daily operation just adds exactly one file. No point in optimizing this further as long as it works. That is the whole point I was trying to make. You force people to optimize / change code where the old code just works and there is no need to modify it. And the real gains in this example are not in append vs concat but in exploiting knowledge of the input files and reading them in a different order or in groups.

Note that I am not saying this is a use case that can only be done with append. I am saying that removing a common feature is unnecessary work imposed on many people, and that you don't get performance gains for free by only replacing append with concat (you need to do more).

Anyway, end of discussion for me. I already did the work and got rid of all my appends.

I just fear that many people will not upgrade if their code breaks. You are also making it harder for new users. append is a good and common English word; concat is not - at least I can't find it in a dictionary (there is concatenate, but it is a word that a lot fewer people know - this might not be a problem for native English speakers though). I would always search for "append", not for "concat", if I didn't know the proper function name.

Comment From: PolarNick239

Hi, minimal reproducer that was totally broken:

Before:

a = pd.DataFrame({"A": 1, "B": 2}, index=[0])
b = pd.DataFrame({"A": 3}, index=[0])
for rowIndex, row in b.iterrows():
    print(a.append(row))
# Output:
#    A    B
#0  1  2.0
#0  3  NaN

After:

a = pd.DataFrame({"A": 1, "B": 2}, index=[0])
b = pd.DataFrame({"A": 3}, index=[0])
for rowIndex, row in b.iterrows():
    print(pd.concat([a, row]))
# Output:
#     A    B    0
#0  1.0  2.0  NaN
#A  NaN  NaN  3.0

Also, please note that if you add a deprecation warning to such a popular method - one that is used widely and called many times per second - the message will be spammed a lot, leading to much bigger overhead than you have from the allocations and memory copying. So it is beneficial to print such a message only on the first call.
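In the meantime, the standard library's warnings machinery can limit the spam; a minimal sketch:

import warnings

# show each matching warning only the first time it is emitted,
# instead of on every call (stdlib mechanism, nothing pandas-specific)
warnings.filterwarnings("once", category=FutureWarning)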

Comment From: phofl

What are you trying to do? It would be way more efficient to call

pd.concat([a, b], ignore_index=True)

Edit: Or was it on purpose to put A into the Index instead as a column?

Comment From: PolarNick239

I know, this is just an illustration. I was iterating over rows and, if a row was OK, adding it to another table. I believe there is a much better way via masking and concatenation taking such masks into account, but I wanted to have code as simple as possible.

Comment From: phofl

Thanks for your response. It is important for us to see use cases that cannot be done more efficiently in another way. You are right, checking data can be done way more efficiently via masking and then concatenating the result.

Comment From: PolarNick239

How can I concat such row to another table a (with superset of row's column names) in such case?

Comment From: MarcoGorelli

with

pd.concat([a, row.to_frame().T], ignore_index=True)

Comment From: phofl

You can simply do:

a = pd.DataFrame({"A": 1, "B": 2}, index=[0])
b = pd.DataFrame({"A": [3, 4]})

result = pd.concat([a, b.loc[b["A"] > 3]], ignore_index=True)

Just change the greater 3 to a condition that suits your needs. This avoids the iterating over the rows step. If you have to iterate for some reason, you can use the example from @MarcoGorelli

Comment From: PolarNick239

Not all conditions and not every logic can be readable with such single-line expression.

For people who like me want to just get rid of warnings:

import pandas as pd
def pandas_append(df, row, ignore_index=False):
    if isinstance(row, pd.DataFrame):
        result = pd.concat([df, row], ignore_index=ignore_index)
    elif isinstance(row, pd.Series):
        result = pd.concat([df, row.to_frame().T], ignore_index=ignore_index)
    elif isinstance(row, dict):
        result = pd.concat([df, pd.DataFrame(row, index=[0], columns=df.columns)])
    else:
        raise RuntimeError("pandas_append: unsupported row type - {}".format(type(row)))
    return result
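Usage would look something like this (the frame is invented):

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df = pandas_append(df, {"a": 5, "b": 6})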

Comment From: wstomv

Here is a use case for DataFrame.append that I think makes sense, and for which it took me way too long to figure out how to replace it with pandas.concat. (Do note that I am not a seasoned pandas user.)

I have a data frame with numeric values, such as

df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])

and I append a single row with all the column sums

totals = df.sum()
totals.name = 'totals'
df_append = df.append(totals)

Simple enough. Here are the values of df, totals, and df_append

>>> df
   A  B
0  1  2
1  3  4

>>> totals
A    4
B    6
Name: totals, dtype: int64

>>> df_append
        A  B
0       1  2
1       3  4
totals  4  6

Now, using pd.concat naively:

df_concat_bad = pd.concat([df, totals])

which produces

>>> df_concat_bad
     A    B    0
0  1.0  2.0  NaN
1  3.0  4.0  NaN
A  NaN  NaN  4.0
B  NaN  NaN  6.0

Apparently, with df.append the Series object got interpreted as a row, but with pd.concat it got interpreted as a column. You cannot fix this with something like axis=1, because that would add the totals as a column.

Fortunately, in a comment above, the implementation of DataFrame.append is quoted, and from this one can glean the solution:

df_concat_good = pd.concat([df, totals.to_frame().T])

which yields the desired

>>> df_concat_good
        A  B
0       1  2
1       3  4
totals  4  6

I think users need to be aware of such subtleties. I also posted this on StackOverflow.

Comment From: MarcoGorelli

This was brought up in https://github.com/pandas-dev/pandas/issues/35407#issuecomment-1092892819 , and some other comments in this thread, and would/should be part of the transition docs (see https://github.com/pandas-dev/pandas/issues/46825)

Comment From: javiertognarelli

Worst idea I've seen. Why complicate something so easy? I think it's better to have more options/ways to do something than just one strict way. DataFrame.append() was very easy for newbies to add data to a dataframe.

Comment From: etale-cohomology

"[...] around the 10th most visited page in our API docs" and they go ahead and deprecate it.

Comment From: mcclaassen

This seems to be decided, but in the future I would argue against doing this sort of thing to improve users' code (and requesting proof of why they can't use pd.concat when they disagree). If it improves maintainability, or makes things easier for devs, go for it. But if something is popular and not "correct", let people do what they want to do. The only valid point I've seen here is for removing the 'inplace' argument; everything else resembles nannying.

Comment From: MarcoGorelli

Thanks all for your comments

This is becoming draining - some comments are off-topic, no new arguments are being presented, and some are not particularly respectful.

Locking for now then - if anyone has any new arguments and wants to make them in a respectful manner, no objections to opening a new issue

It's understandable that some people are unhappy with this decision and have to rewrite some code, but for newbies, getting them to write their code in a better way to begin with will be better for them in the long-run.

If the docs on how to use concat are unclear, pull requests are welcome