Problem description
Currently to append to a DataFrame, the following is the approach:
df = pd.DataFrame(np.random.rand(5,3), columns=list('abc'))
df = df.append(pd.DataFrame(np.random.rand(5,3), columns=list('abc')))
append is a DataFrame or Series method, and as such should be able to modify the DataFrame or Series in place. If in-place modification is not required, one may use concat or set the inplace kwarg to False. It will avoid an explicit assignment operation which is quite slow in Python, as we all know. Further, it will make the expected behavior similar to Python lists, and avoid questions such as these: 1, 2...
Additionally, at present append is a full subset of concat, and as such it need not exist at all. Given the vast number of functions to append a DataFrame or Series to another in Pandas, it makes sense that each has its merits and demerits. Gaining an inplace kwarg will clearly distinguish append from concat, and simplify code.
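To make the overlap with concat concrete, here is a minimal sketch of the same row-wise append written with pd.concat. The variable names are illustrative only, and the sketch deliberately avoids calling append so that it does not depend on which pandas version is installed.

```python
import numpy as np
import pandas as pd

first = pd.DataFrame(np.random.rand(5, 3), columns=list("abc"))
second = pd.DataFrame(np.random.rand(5, 3), columns=list("abc"))

# pd.concat returns a new DataFrame; `first` and `second` are left untouched.
combined = pd.concat([first, second])

# ignore_index=True produces a fresh 0..n-1 index instead of repeated labels.
renumbered = pd.concat([first, second], ignore_index=True)

print(combined.shape)    # (10, 3)
print(renumbered.shape)  # (10, 3)
```

Either way the result has to be bound to a name, which is exactly the assignment step the proposal wants to avoid.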
I understand that this issue was raised in #2801 a long time ago. However, the conversation there deviated from the simplification offered by the inplace kwarg to performance enhancement. I (and many like me) am looking for ease of use, and not so much for performance. Also, we expect the data to fit in memory (which is a limitation even with the current version of append).
Expected Code
df = pd.DataFrame(np.random.rand(5,3), columns=list('abc'))
df.append(pd.DataFrame(np.random.rand(5,3), columns=list('abc')), inplace=True)
Comment From: shoyer
I am opposed to this for the exact reasons discussed in #2801: it would mislead users who might expect a performance benefit.
Comment From: jreback
Virtually all pandas methods return a new object, the exception being the indexing operations. Using inplace is not idiomatic, quite unreadable, and not (more) performant at all.
Closing, though if someone thinks that we should add a signature like (...., inplace=False), and then raise a TypeError if inplace=True to give a nice error message, then we can reopen for that purpose.
In [2]: df = pd.DataFrame(np.random.rand(5,3), columns=list('abc'))
...: df.append(pd.DataFrame(np.random.rand(5,3), columns=list('abc')), inplace=True)
TypeError: append() got an unexpected keyword argument 'inplace'
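The signature suggested above could look roughly like the following sketch. Note that append_rows is a hypothetical standalone helper written for illustration, not a pandas function, and the wording of the error message is an assumption.

```python
import pandas as pd


def append_rows(df, other, inplace=False):
    """Hypothetical helper: return a new DataFrame with `other` appended."""
    if inplace:
        # Fail early with an explanatory message instead of the generic
        # "unexpected keyword argument" error.
        raise TypeError(
            "append_rows() cannot operate in place; "
            "assign the result instead: df = append_rows(df, other)"
        )
    return pd.concat([df, other])


df = pd.DataFrame({"a": [1, 2]})
extra = pd.DataFrame({"a": [3]})

df = append_rows(df, extra)  # supported: rebind the name to the result

try:
    append_rows(df, extra, inplace=True)
except TypeError as exc:
    message = str(exc)
```

The point of accepting the keyword at all is only to replace a confusing error with one that tells the user what to write instead.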
Comment From: remidebette
In the case of a namedtuple which contains a Series object, the inplace approach would be nice to have as a feature. This would not be related in any way to the performance but would be a way to expose data to users.
Indeed, namedtuple objects are by design a way of writing a library and exposing its data to users while allowing them to modify it only in place.
Trying to overwrite an attribute of a namedtuple intentionally raises AttributeError: can't set attribute, so that the user does not try to affect your library. But mutable attributes are allowed.
Consider the following dummy code:
from collections import namedtuple
from pandas import Series
# ----- Library part ------
sample_schema = {
    "name": str,
    "some_info": str,
    "content": Series
}
my_data_type = namedtuple("MyDataType", sample_schema.keys())
exposed_data = my_data_type(
    name="Library data",
    some_info="Modify the content as you want",
    content=Series({"a": 0})
)
# ----- User code part ------
series_to_be_appended = Series({"b": 0})
# This is forbidden
exposed_data.content = exposed_data.content.append(series_to_be_appended)
# This would be allowed but is not implemented in Series
exposed_data.content.append(series_to_be_appended, inplace=True)
The name and some_info attributes are strings and therefore immutable. A user would not (easily) be able to affect them. But here the content can be modified as long as it is not set to a new object altogether.
I would think inplace methods are nice to have on any mutable object in general.
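A minimal sketch of the distinction described above, assuming the library exposes a Series in a namedtuple field. Since append has no inplace option, the sketch swaps in label-based assignment through .loc, which enlarges the existing Series object rather than creating a new one. The names Exposed and exposed are made up for illustration.

```python
from collections import namedtuple

import pandas as pd

Exposed = namedtuple("Exposed", ["name", "content"])
exposed = Exposed(name="Library data", content=pd.Series({"a": 0}))

# Rebinding a field is rejected by the namedtuple itself.
try:
    exposed.content = pd.Series({"b": 0})
    rebind_allowed = True
except AttributeError:
    rebind_allowed = False

# Mutating the object the field refers to is allowed.
content = exposed.content
content.loc["b"] = 0  # assigning to a new label enlarges the same Series

print(rebind_allowed)               # False
print(list(exposed.content.index))  # ['a', 'b']
```

This only covers adding individual labels; it is not a general in-place replacement for appending a whole Series.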
Comment From: rtruxal
So the consensus among the maintainers is that it would be too confusing to have an append() method which actually appends?
I'd suggest removing the method from DataFrame entirely, or potentially renaming it. Someone familiar with pandas might find it confusing, but the opposite is currently true for those of us without your level of experience.
Comment From: paulstapor
Agreeing here.
Never got why Pandas affords an API having its own logic rather than sharing the one of Python itself. One can get used to the fact that most pandas methods return objects rather than modifying their objects, although it's counter-intuitive. (Pandas' standard behavior is imho counter-intuitive for everyone who uses more Python than Pandas, which should be most of the user base.) And one can get used to the fact that most Pandas methods behave as a user would expect when passing inplace=True as an argument.
Can still live with that. But not adding the possibility to specify inplace for append() and just defaulting it to False, which effectively keeps the method for all who want it but greatly helps those who need it, is something I cannot follow. Sorry.
Comment From: aitikgupta
Adding a usecase:
1. Have a lot of csv files, with few entries in each, many of which have additional columns.
2. Want a combined dataframe, which should include the additional columns. (Landed right on the pandas.DataFrame.append() docs.)
Columns in other that are not in the caller are added as new columns.
- The above line reassured me that I had landed in the right place.
combined_dataframe = pd.DataFrame()
for dataframe in list_of_dataframes_read_from_csvs:
    combined_dataframe.append(dataframe, inplace=True)
- This raised an error; I checked the docs, found no inplace for append(), and that led me to this issue.
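For this use case, one option is to collect the frames in a list and combine them once with pd.concat, which takes the union of the columns and fills the gaps with NaN. This is only a sketch: the small frames below are stand-ins for the data read from the CSV files.

```python
import pandas as pd

# Stand-ins for frames read from CSV files; the second one has an extra column.
list_of_dataframes = [
    pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0]}),
    pd.DataFrame({"id": [3], "value": [30.0], "note": ["extra"]}),
]

# One concat call over the whole list, rather than growing a frame in a loop.
combined_dataframe = pd.concat(list_of_dataframes, ignore_index=True, sort=False)

print(list(combined_dataframe.columns))  # ['id', 'value', 'note']
print(len(combined_dataframe))           # 3
```

Rows from frames that lack the note column get NaN in that column.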