Is your feature request related to a problem?
Dropping table when if_exists=‘replace’ will fail if the table has any objects, like a view, depending on it. Also, some databases like Oracle will end a transaction implicitly when a DLL statement like 'drop table' is issued.
Describe the solution you'd like
Better alternative would be to issue 'truncate table' command instead.
API breaking implications
Should not have any changes on API
Describe alternatives you've considered
As above or a new flag if_exists=‘truncate’
Comment From: ukarroum
One of the problems i can see with using truncate instead of drop is that this add the limitation the the user's dataframe need to have the same columns as the table (which may not always be the case, depending on the usecase).
Comment From: erfannariman
Yes I agree with @ukarroum , also the type of a column can change over time. But then again OP is suggesting to add a new method if_exists='truncate'
which is a bit confusing, because the title states otherwise.
I can see the benefit of truncate over drop, also since the first will be more efficient.
Comment From: tokorhon
There are pros and cons of truncate vs. drop. So maybe new method is the answer. Call it if_exists='delete' or something else. If truncate table is not supported by DBMS then this method should lead to 'delete from table' (which is slower than truncate).
Comment From: fiendish
Could it use truncate if the columns are the same and drop otherwise?
Comment From: mancunian1792
One of the problems i can see with using truncate instead of drop is that this add the limitation the the user's dataframe need to have the same columns as the table (which may not always be the case, depending on the usecase).
I would assume that if the structure of the table changes a lot, then its not a right use case for a relational table in the first place ? Although, i can understand the datatype changing over time would cause an issue. I feel truncate use case is more common and would benefit more people including me.
Comment From: igonro
This is a great idea. I think is better to have both options "replace" and "truncate", sometimes you will need "replace" when you want to delete and insert new columns. But "truncate" is a needed option when you need to keep some columns options in the SQL database. For example, with replace the autoincrement option in the "id" column will disappear.
TLDR:
- replace
: when your db table columns change
- truncate
: when you need to keep your db table columns
Comment From: igonro
Also @lam-juice in #8673 pointed out a behaviour that I was also having in my SQL database:
Is it possible to make pandas delete all the rows instead of dropping the table for if_exists='replace'? Having to drop the table results in deadlocks when there's a simultaneous SELECT going on, while a simple deletion would avoid such deadlocks.
So why not add truncate
option, it solves some problems (deadlocks, column information loss), it preserves the current behaviour of replace
, and I don't think it the implementation would be hard.
Comment From: kylemcmearty
+1 for @igonro 's solution.
I'm having issues with my table view object because replace
drops the table.
Having truncate
and replace
options would be great.
I'm gonna try out this solution for now and see if it works: https://stackoverflow.com/a/67235797
Comment From: igonro
Yes, @kmcmearty; that's an easy workaround (truncating manually and then doing the to_sql with append).
Comment From: gmcrocetti
I recently created a pull request adding the truncate
option.
Comment From: erfannariman
@pandas-dev/pandas-core if anyone or multiple persons have time, could you reply if you're +1 or -1 on this addition. I am +1 since I our team would use this quite often.
@gmcrocetti thank you for taking the time to create the PR. I think it's better that we get some replies from the core team before we spend time on the PR.
Comment From: gmcrocetti
@pandas-dev/pandas-core if anyone or multiple persons have time, could you reply if you're +1 or -1 on this addition. I am +1 since I our team would use this quite often.
@gmcrocetti thank you for taking the time to create the PR. I think it's better that we get some replies from the core team before we spend time on the PR.
Thanks a ton, much appreciated !
Comment From: rhshadrach
could you reply if you're +1 or -1 on this addition.
There are several proposals here, could you clarify which you're referring to? I think it's the addition of truncate
rather than the original request in the OP.
Comment From: gmcrocetti
Hello @rhshadrach , yes you got it right. It is the addition of truncate
indeed.
Comment From: WillAyd
if_exists="truncate"
reads a little strange, since it actually replaces (at least from what I understand with the proposal).
What's the precedent from other libraries for using truncate? I definitely understand that there are performance implications and that for some databases this can affect auto-increment behavior, but I'm also under the impression that the standard replace with drop was chosen with intention.
Is there a prior art for this suggestion in sqlalchemy? What databases support TRUNCATE within a transaction and which don't?
Comment From: shortywz
PostgreSQL and its derivatives support TRUNCATE within a transaction - as an end user looking forward to this feature, the drop implementation requires additional handling to restore permissions, and is not usable on tables with dependencies - TRUNCATE solves both of these problems.
Comment From: gmcrocetti
Hi @WillAyd o/.
The proposal is preserve the behavior of replace
as is today - no changes here. On the other hand adding a new option named truncate
that instead of replacing a table will wipe out the data of the table and "restart" it. I don't think performance is the only pain point. Most of times you don't want to (or can't) recreate the "structure" of the table: indexes, permissions, triggers, etc. All one want is to remove all data and restart from fresh.
AFAIK there's no public API available on sqlalchemy to truncate tables despite this functionally being part of the SQL standard since 2008. There might be a reason for that but I don't know.
Comment From: Dr-Irv
if_exists="truncate"
reads a little strange, since it actually replaces (at least from what I understand with the proposal).
I agree with you @WillAyd , but it does seem that TRUNCATE
is used in SQL to represent the operation of deleting all the rows and then adding new rows to "replace" the ones that are deleted.
I'm +1 on the proposal to add the if_exists="truncate"
option.
Comment From: WillAyd
I am not sure how relevant this is given it was written 15 years ago, but here is what I see upstream for SQLAlchemy:
https://github.com/sqlalchemy/sqlalchemy/issues/1474
The trap here is going to be assuming what works for some databases (i.e. postgres and MSSQL) is going to work across all databases. AFAIU, we have very little (if any) hard-coded statements in our implementation that are RDBMS-specific, and we don't have a very good CI setup to cover non-standard behavior.
So generally I'd say I'm -1 on this, unless it were to be implemented upstream in SQLAlchemy or the DBAPI first. In the interim, users can manage the atomicity of the behavior as needed without too much extra effort, i.e. if you wanted an atomic truncate in postgres I think you could do:
engine = ...
with engine.connect() as conn:
conn.execute("TRUNCATE table foo")
df.to_sql("foo", con=conn, if_exists="append")
Comment From: Dr-Irv
The trap here is going to be assuming what works for some databases (i.e. postgres and MSSQL) is going to work across all databases. AFAIU, we have very little (if any) hard-coded statements in our implementation that are RDBMS-specific, and we don't have a very good CI setup to cover non-standard behavior.
Thanks for this analysis. Given that TRUNCATE
is not supported for all DB's, and there is an alternate solution, I switch my vote to -1.
Comment From: gmcrocetti
Sorry @WillAyd , could you expand on what you meant by that ?
The trap here is going to be assuming what works for some databases (i.e. postgres and MSSQL) is going to work across all databases. AFAIU, we have very little (if any) hard-coded statements in our implementation that are RDBMS-specific, and we don't have a very good CI setup to cover non-standard behavior.
I'm not sure I followed but we don't need to add RDBMS-specific statements. The proposed implementation is using sqlalchemy behind the scenes (.execute(f"TRUNCATE TABLE {table}")
). If the specific database doesn't support it then an exception is raised.
If you're talking about the specifics of how each RDBMS implements the execution of TRUNCATE
then shouldn't pandas delegate this responsibility to the user - of course providing proper documentation ?
Comment From: WillAyd
I'm not sure I followed but we don't need to add RDBMS-specific statements. The proposed implementation is using sqlalchemy behind the scenes (
.execute(f"TRUNCATE TABLE {table}")
).
Although vaguely worded, I wasn't referring to the TRUNCATE
statement itself as much as any accompanying statements that we would have to implement to try and make this behavior work consistently across all RDBMS'es, if that is even possible (not sure it is)
If you're talking about the specifics of how each RDBMS implements the execution of
TRUNCATE
then shouldn't pandas delegate this responsibility to the user - of course providing proper documentation ?
Yes it should, which is why I think adding the truncate
argument as proposed has some flaws. The pandas API is supposed to be atomic and avoid any implementation-defined behavior. Unfortunately with truncate, an expression of the form:
df.to_sql("table", con, if_exists="truncate")
Will yield implementation-defined behavior in case of failure during append. For postgres, it would treat that as an atomic operation and rollback the truncate. But for a DB like Oracle my guess is it would commit the truncation of "table" and leave it without any records.
Since that is tucked away within our API, that puts the onus on us to either bridge those differences or convince people that certain parts of the API are good for some databases yet not for others.
Comment From: gmcrocetti
Got it, thanks for taking the time to reply.
Yeap, I do have to recognized you have a fair point. The proposed feature can lead to potential unexpected behaviors on databases where TRUNCATE
is a DDL instead of a DML (as in postgres and its derivatives). This burden shouldn't be added to pandas as it delegates to SQLalchemy.
I'm gonna wait for the voting to end but in case it is rejected do you believe we could add the outcome of this discussion somewhere into the documentation ? Or going one step further, would be possible to extend the behavior of this function via a third-party (away for good reasons of the core) ?
Comment From: WillAyd
Absolutely +1 to documenting this, either in the I/O tools guide, the to_sql docstrings, or both.
As far as a third-party package goes, are you thinking of an entirely new package just for this feature or are you asking if there's a way to integrate that behavior into our existing to_sql call from a third party? The latter does not exist, but I know I/O plugins has historically been an interest of @datapythonista
Comment From: nickolay
The pandas API is supposed to be atomic and avoid any implementation-defined behavior. Unfortunately with truncate, an expression of the form:
df.to_sql("table", con, if_exists="truncate")
Will yield implementation-defined behavior in case of failure during append.
Same holds for if_exists="replace"
, no? A DROP TABLE preceeding a failed CREATE also won't be rolled back in Oracle and this behavior is inconsistent across databases.
(As an aside: unlike truncate
, if_exists=replace fails to even load unicode data into MSSQL properly, https://github.com/pandas-dev/pandas/issues/35627 since forever, so pretending pandas' database abstractions are or even can be non-leaky is wishful thinking.)
Comment From: gmcrocetti
Nice analysis @nickolay . From what I understood this is indeed what happens today. pandas is not resilient to non-transactional DDL statements (Oracle and its derivations). The existing API is not atomic thus adding the truncate
option will not change existing behavior.
Comment From: tokorhon
Thanks to everyone for great views and opinions regarding this feature request! I may not understand every detail, but for me
- the name of the value of if_exists
is not relevant. It can be something else than truncate
since SQL truncate operation is not available in all databases
- important is that there should be another way to empty a table that drop + recreate. In databases not supporting truncate it can lead to delete from table
.
- as stated in the original post, drop + recreate does not work if there are objects depending of the table.
Comment From: WillAyd
Same holds for
if_exists="replace"
, no? A DROP TABLE preceeding a failed CREATE also won't be rolled back in Oracle and this behavior is inconsistent across databases.
That second link in particular is a good reason why we do not issue these statements directly from within pandas and rely on third parties like SQLAlchemy to abstract this as best as possible. It sounds like your statement may be true for Oracle versions < 11g Release 2 but possibly not thereafter.
Comment From: nickolay
I'm pretty sure what I said about DROP implicitly committing an active transaction applies to the current versions of Oracle as well. I also don't see any evidence of sqlalchemy or pandas using edition-based redefinition for if_exists="replace"
(and would be very surprised if it did so by default!)
But it's true that unlike DROP+CREATE, TRUNCATE TABLE
is an optional and underspecified feature of the SQL standard, so I can see where you're coming from. I was surprised to find that pandas' sqlalchemy backend seems to have 0 instances of raw SQL as of now, and can understand if the maintainers would like to keep it that way.
If that's the case, is @tokorhon's suggestion of providing a mode to run DELETE FROM table;
instead a viable alternative? It IS widely compatible across databases, works fine with transactions, and has an API in sqlalchemy. I agree performance is not the main concern here, as in many scenarios the sqlalchemy-based load process will be the bottleneck before the DELETE
.
Comment From: WillAyd
That's an interesting idea; I think delete may be more viable given it is provided by sqlalchemy
Maybe the keyword becomes if_exists="replace_rows"
to differentiate from the existing case?
Comment From: gmcrocetti
The usage of DELETE FROM
instead of TRUNCATE
for sure has some side effects. But being pragmatic they will achieve the very same result by the end of the day. I'd suggest using delete_rows
to replicate the semantics of what's going on behind the scenes.
Comment From: gmcrocetti
So just to summarize what we all discussed so far and members of the core team @pandas-dev/pandas can vote without reading the whole issue:
if_exists=truncate
1️⃣
The truncate
option would mimic SQL's standard TRUNCATE TABLE
behavior.
Pros
- Performance wise
Truncate
is pretty fast and has very few drawbacks when compared toDELETE FROM
- The table is not deleted, but data is wiped outn - truncate preserves index, permissions and etc.
Cons
- Not widely support by all databases and when supported the implementation varies a ton;
- In some databases running this command creates a implicit commit which can lead to non-atomic actions on pandas.
- SQLAlchemy does not provide an API for that therefore we need to add raw SQL into the codebase.
if_exists=delete_rows
(replace_rows) 2️⃣
don't get hung up on the name (delete_rows
). Anything that represents that rows are deleted but not the table is acceptable.
Pros
- SQLAlchemy provides an API.
- Supported by all SQL databases
Cons
- Not as performant as truncate
- Will activate triggers (in case they exist)
Feel free to vote with 1️⃣ or 2️⃣ reactions.
Comment From: gmcrocetti
Hello @WillAyd,
Have you had the time to think about this matter ? I tried to summarize what we discussed right here ☝️.
@pandas-dev/pandas-core second thoughts are welcome. Let's bring a conclusion to this issue :)
Comment From: WillAyd
I think it makes sense to go the DELETE route. I don't have a strong opinion on naming - maybe delete_replace?
It's a shame that what we have currently as "replace" is better thought of as "recreate" but there's too much history to try and change that
Comment From: torzsmokus
It's a shame that what we have currently as "replace" is better thought of as "recreate" but there's too much history to try and change that
Could be added as a more descriptive alias.
Comment From: torzsmokus
Could it be possible to add a third option that would actually replace (recreate) the table, but keep (save and re-add) the permissions (grants)? As of now, I have to recreate the permissions every time after a table is replaced (and truncate would not be an option to me since the columns may change)
Comment From: gmcrocetti
Could it be possible to add a third option that would actually replace (recreate) the table, but keep (save and re-add) the permissions (grants)? As of now, I have to recreate the permissions every time after a table is replaced (and truncate would not be an option to me since the columns may change)
Hello @torzsmokus ,
We're working to add delete_rows
. The idea is to preserve table's structure but wiping out all data on it (in SQL terms it runs a DELETE * FROM {table}
. Sounds good ?
Comment From: torzsmokus
hi @gmcrocetti , thanks for the swift reply. the development sounds great, but not for my use case – I am afraid you have missed the end of my comment:
(and truncate would not be an option to me since the columns may change)
Comment From: gmcrocetti
@torzsmokus changing columns requires pandas to emit an alter table
statement. This is not in the scope of this issue as this responsibility is delegated to the user as of now.
The scope of this issue is "As a user I do know the structure haven't changed and want to wipe out any previous existing data". I encourage you to create an issue of its own. 💪