Code Sample,
df_name.to_sql('table_name',
               schema='public',
               con=engine,
               index=False,
               if_exists='replace')
Problem description
I'm writing a 500,000-row dataframe to a postgres AWS database and it takes a very, very long time to push the data through.
It is a fairly large SQL server and my internet connection is excellent, so I've ruled those out as contributing to the problem.
In comparison, csv2sql or using cat and piping into psql on the command line is much quicker.
Comment From: jreback
see here: http://stackoverflow.com/questions/33816918/write-large-pandas-dataframes-to-sql-server-database
With SQL Server you need to import via CSV with a bulk upload for efficiency.
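A minimal sketch of that approach (not from this thread; the file path and connection string are placeholders, and the CSV must be readable by the SQL Server instance):

```python
# Hedged sketch: dump the frame to CSV, then bulk-load it server-side.
import pyodbc

df_name.to_csv(r"C:\temp\table_name.csv", index=False)

conn = pyodbc.connect(connection_string)  # placeholder ODBC connection string
cursor = conn.cursor()
cursor.execute(r"""
    BULK INSERT dbo.table_name
    FROM 'C:\temp\table_name.csv'
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')
""")
conn.commit()
```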
Comment From: jreback
you might find this useful: http://odo.pydata.org/en/latest/perf.html
Comment From: citynorman
ODO wouldn't work for me; it generated errors I wasn't able to fix, but d6tstack worked fine: https://github.com/d6t/d6tstack/blob/master/examples-sql.ipynb. You can preprocess with pandas and it uses postgres COPY FROM to make the import quick. Works well with RDS postgres.
Comment From: llautert
Add this code below engine = create_engine(connection_string):

from sqlalchemy import event

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True
        cursor.commit()
In my code, the to_sql function was taking 7 minutes to execute, and now it takes only 5 seconds ;)
Comment From: dean12
Thanks @llautert! That helped a lot!
# don't forget to import event
from sqlalchemy import event, create_engine

engine = create_engine(connection_string)

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True
        cursor.commit()
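For anyone wondering how this is wired up: the listener fires automatically whenever the engine executes a statement, so nothing changes in the to_sql call itself. A rough sketch, assuming an mssql+pyodbc engine:

```python
# The event hook runs inside to_sql's executemany; you never call it yourself.
df_name.to_sql('table_name',
               con=engine,
               index=False,
               if_exists='replace',
               chunksize=10000)  # chunk size is just an illustrative value
```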
Comment From: tim-sauchuk
I've attempted to run this fix, but ran into an error message:
AttributeError: 'psycopg2.extensions.cursor' object has no attribute 'fast_executemany'
Anyone know what's going on?
Comment From: bsaunders23
Hey @tim-sauchuk, I'm running into the same error as well, but I found a solution that's been working great. It involves a slight edit to the pandas.io.sql.py file (just delete the .pyc file from __pycache__ before importing again to make sure Python picks up the edited version):
https://github.com/pandas-dev/pandas/issues/8953
Comment From: scottcode
Issue #8953 that @bsaunders23 mentioned also shows a way to "monkey patch" (fix it at run time). I tried it, and a 20k dataset that took 10+ minutes to upload then took only 4 seconds.
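For reference, the run-time patch discussed in #8953 is roughly the following (a sketch, not the exact code from that issue): it swaps pandas' per-row executemany for a single multi-row INSERT per chunk.

```python
from pandas.io import sql

def _execute_insert(self, conn, keys, data_iter):
    # Build one multi-row INSERT ... VALUES statement instead of an executemany.
    data = [dict(zip(keys, row)) for row in data_iter]
    conn.execute(self.table.insert().values(data))

# Apply the patch before calling df.to_sql
sql.SQLTable._execute_insert = _execute_insert
```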
Comment From: vitalious
Does anyone know how I can implement this solution inside a class with a self.engine instance?
Comment From: DanielOverdevest
Does anyone know how I can implement this solution inside a class with a self.engine instance?
Works for me by referring to self.engine.
Example:
self.engine = sqlalchemy.create_engine(connectionString, echo=echo)
self.connection = self.engine.connect()
@event.listens_for(self.engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    print("Listen before_cursor_execute - executemany: %s" % str(executemany))
    if executemany:
        cursor.fast_executemany = True
        cursor.commit()
Comment From: xnejed07
Does not work for me. What pandas and sqlalchemy versions are you using?
Comment From: pedrovgp
I tried it running sqlalchemy 1.2.4-py35h14c3975_0 and 1.2.11-py35h7b6447c_0, but I am getting:
AttributeError: 'psycopg2.extensions.cursor' object has no attribute 'fast_executemany'
Comment From: davefanch87
@dean12 @llautert
What does the function call look like in this context? Or in other words, what are you using for the arguments to successfully upload the table?
Comment From: lllong33
see here: https://stackoverflow.com/questions/48006551/speeding-up-pandas-dataframe-to-sql-with-fast-executemany-of-pyodbc
Comment From: vaneseltine
I tried it running sqlalchemy 1.2.4-py35h14c3975_0 and 1.2.11-py35h7b6447c_0, but I am getting:
AttributeError: 'psycopg2.extensions.cursor' object has no attribute 'fast_executemany'
You are using psycopg2, which is a PostgreSQL driver. This issue and fix pertain to Microsoft SQL Server using the pyodbc driver.
Comment From: SincerelyUnique
What about adding the 'dtype' parameter?
Comment From: FukoH
Does anyone know how I can implement this solution inside a class with a self.engine instance?
Have you found out how?
Comment From: pankgeorg
I think the correct answer should be to use https://docs.sqlalchemy.org/en/13/dialects/postgresql.html#psycopg2-batch-mode-fast-execution if you're trying to save a pandas dataframe to a Postgres database.
Comment From: KamranMK
Newer versions of pandas include a method parameter for to_sql, which can be set to 'multi'. This makes the code run much faster.
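For example (a sketch; the chunk size is just an illustrative value):

```python
# method='multi' packs several rows into each INSERT statement;
# chunksize bounds how many rows go into one statement.
df_name.to_sql('table_name',
               con=engine,
               index=False,
               if_exists='replace',
               method='multi',
               chunksize=1000)
```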
Comment From: giuliobeseghi
fast_executemany can now be enabled in a single step (sqlalchemy >= 1.3.0):

engine = sqlalchemy.create_engine(connection_string, fast_executemany=True)

Maybe it's worth mentioning somewhere in the docs or with an example? It's a particular case not related to pandas, but it's a small addition that could drastically improve performance in many cases.
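A fuller sketch of that (the connection URL is a made-up example; the flag applies to the mssql+pyodbc dialect):

```python
import sqlalchemy

# fast_executemany=True is passed through to the pyodbc cursor
# by the mssql+pyodbc dialect (SQLAlchemy >= 1.3).
engine = sqlalchemy.create_engine(
    "mssql+pyodbc://user:password@my_dsn",  # hypothetical DSN / connection URL
    fast_executemany=True,
)
df_name.to_sql('table_name', con=engine, index=False, if_exists='replace')
```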
Comment From: json2d
Newer versions of pandas include a method parameter for to_sql, which can be set to 'multi'. This makes the code run much faster.

You'd think that setting the chunksize parameter would be enough to make to_sql batch its inserts, but nope.
Comment From: xhochy
An alternative for MS SQL users is to use turbodbc.Cursor.insertmanycolumns; I've explained that in the linked StackOverflow post: https://stackoverflow.com/a/62671681/1689261
Comment From: jpuerto-psc
For future readers on this, there are two options for using a batch mode with to_sql. The following are the two combinations:
create_engine(connection_string, executemany_mode='batch', executemany_batch_page_size=x)
or
create_engine(connection_string, executemany_mode='values', executemany_values_page_size=x)
Details on these arguments can be found here: https://docs.sqlalchemy.org/en/13/dialects/postgresql.html#psycopg2-fast-execution-helpers
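Put together, that looks roughly like this (SQLAlchemy 1.3-era arguments; the URL and page size are placeholder values):

```python
from sqlalchemy import create_engine

# 'values' mode rewrites executemany calls into multi-row VALUES statements
# for the psycopg2 dialect; the page size caps rows per statement.
engine = create_engine(
    "postgresql+psycopg2://user:password@host:5432/dbname",  # placeholder URL
    executemany_mode='values',
    executemany_values_page_size=10000,
)
df_name.to_sql('table_name', con=engine, index=False, if_exists='replace')
```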
Comment From: feluelle
For postgres users, I recommend setting method to a callable:

callable with signature (pd_table, conn, keys, data_iter): This can be used to implement a more performant insertion method based on specific backend dialect features.

and calling the function from the example code here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#insertion-method

Using COPY FROM is really a lot faster 🚀
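The callable from that docs page looks roughly like this (adapted from the pandas insertion-method example; it relies on psycopg2's copy_expert):

```python
import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    # Stream each chunk into an in-memory CSV buffer and hand it to COPY.
    dbapi_conn = conn.connection  # raw psycopg2 connection
    with dbapi_conn.cursor() as cur:
        s_buf = StringIO()
        writer = csv.writer(s_buf)
        writer.writerows(data_iter)
        s_buf.seek(0)

        columns = ', '.join('"{}"'.format(k) for k in keys)
        table_name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name

        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns), s_buf)

df_name.to_sql('table_name', engine, index=False, if_exists='replace', method=psql_insert_copy)
```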
Comment From: Jakobhenningjensen
Regarding the fast_executemany event-listener fix above: do you call receive_before_cursor_execute somewhere (since it needs the argument executemany to be True), or how is that used with df.to_sql()?
Comment From: mzw4
This is the only solution that worked for me for Postgres, thank you @feluelle! It took my query from 20 min to < 1 min.
I had tried method='multi' and various chunksizes, but it was slower than without it. I also tried executemany_mode='batch' in the sqlalchemy engine, but it threw an "Invalid value" error and I couldn't find a solution to that.